Estimating vRAM

Determining if you have enough GPU memory.
These are just estimates

The estimates in this post are back-of-the-napkin calculations. They are meant to give you a ballpark estimate of how much vRAM you need. The only way to know for sure is to run your model on your hardware and measure the memory usage. This is especially true for training.1

I’ll keep updating this post as I learn more and get feedback from others.

Background

My friend Zach Mueller came out with this handy calculator, which aims to answer the question “How much GPU memory do I need for a model?”. It takes the math that is stuck in many of our heads and puts it into a simple tool.

However, I’ve talked with Zach and others for a couple of weeks about the nuances of the calculator and want to share some additional information I think is helpful.

Training w/ LoRA

The key to estimating memory needed for training is to anchor off the # of trainable parameters. The general formula is:

(Estimate from calculator in GB + ( # of trainable params in billions * (dtype of trainable params in bits / 8) * 4)) * 1.2

Here is the rationale for each of the terms:

  • dtype of trainable params refers to the dtype of your unfrozen model parameters for LoRA, which is usually a different precision than the rest of the frozen model. It’s common to load the rest of the model with 8-bit or 4-bit quantization and keep the LoRA adapters at 16- or 32-bit precision. We divide the bit width by 8 to get the number of bytes per parameter. For example, if there are 1B trainable parameters and they are 16-bit, this adds 1 * (16 / 8) = 2GB to the estimate, which you then multiply by the other terms (see the example below).
  • The 4x is tied to the popular Adam optimizer (if you use a different one, YMMV): you need 2x for the optimizer state, 1x for the model, and 1x for the gradients.
  • The 1.2x is additional overhead for the forward pass while training the model. You can read more about this heuristic in this blog post from EleutherAI.
Note

The 1 added to the 20% is just the usual mathematical trick for increasing a quantity by a percentage: to increase a quantity by 20%, you multiply it by 1.20. I only mention this because this term confused some people!
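To make the formula concrete, here is a minimal Python sketch of the same calculation. The function and argument names are my own; the constants (4x for Adam, 1.2x overhead) come straight from the formula above.

```python
def estimate_training_vram_gb(
    calculator_estimate_gb: float,       # base estimate from the calculator, in GB
    trainable_params_billions: float,    # trainable (unfrozen/LoRA) params, in billions
    trainable_dtype_bits: int = 16,      # dtype of the trainable params (e.g. 16 for fp16/bf16)
    optimizer_multiplier: float = 4.0,   # Adam: 2x optimizer state + 1x model + 1x gradients
    overhead: float = 1.2,               # ~20% extra for the forward pass while training
) -> float:
    """Back-of-the-napkin vRAM estimate (in GB) for training with LoRA."""
    bytes_per_param = trainable_dtype_bits / 8
    trainable_gb = trainable_params_billions * bytes_per_param * optimizer_multiplier
    return (calculator_estimate_gb + trainable_gb) * overhead
```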

Example

For example, if you are using LoRA, 0.5B of your parameters are trainable in fp16, and the calculator says you need 14GB of vRAM, this is how you would calculate the total memory you need for training:

(14GB + ( 0.5 * (16 / 8) * 4) ) * 1.2 =

(14GB + 4GB) * 1.2 = 18GB * 1.2 = 21.6GB

Answer: ~21.6 GB of vRAM
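Plugging the example’s numbers into the sketch above gives the same answer:

```python
# 0.5B trainable LoRA params in fp16, with a 14GB base estimate from the calculator
print(estimate_training_vram_gb(14, 0.5, 16))  # -> 21.599999... ≈ 21.6 GB
```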

Inference

The calculator is great as-is for estimating the vRAM needed for inference. Even though there are other caveats to be aware of, it is a solid baseline to start from. For inference, you will also have to consider your batch size for continuous batching, which is use-case specific and depends on your throughput vs. latency requirements.
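As a rough sanity check on the calculator’s output, the dominant term for inference is usually the model weights themselves: parameter count times bytes per parameter. The sketch below is my own simplification and deliberately ignores the KV cache, activations, and framework overhead, which is exactly why the batch-size caveat above matters.

```python
def weights_only_vram_gb(params_billions: float, dtype_bits: int = 16) -> float:
    """Very rough lower bound for inference vRAM: just the model weights.
    Ignores the KV cache, activations, and framework overhead, which grow with batch size."""
    return params_billions * (dtype_bits / 8)

print(weights_only_vram_gb(7, 16))  # a 7B model in fp16 needs ~14 GB for weights alone
```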

Caveats

There are other optimizations to be aware of that can reduce the amount of memory you need:

  • Flash attention
  • Gradient checkpointing
  • Model compilation (ex: MLC)

Distributed training/inference can add some additional overhead, but that is a complex topic that I won’t cover here.

Footnotes

  1. See this thread as an example of how YMMV.