Connect via MCP →

Enter Calculation

Formula

Advertisement

Results

Estimated VRAM Required
16.8
GB of GPU memory
Raw weight size 14 GB
Bytes per parameter 2
Overhead factor 1.2×

What this calculator does

The LLM VRAM Requirement Calculator estimates how much GPU memory you need to load and run a large language model. It multiplies the model's parameter count (in billions) by the number of bytes each parameter occupies at your chosen precision, then applies an overhead factor to account for activations, the KV cache, and framework buffers.

How to use it

Enter the model size in billions of parameters (for example 7 for a 7B model, 70 for a 70B model). Pick the quantization: FP16/BF16 uses 2 bytes per weight, INT8 uses 1 byte, 4-bit uses 0.5 bytes, and 2-bit uses 0.25 bytes. The default overhead of 1.2 (a 20% buffer) is a sensible starting point for inference; raise it for long-context or batched workloads.

The formula explained

$$\text{VRAM (GB)} = \text{Params (B)} \times \text{Bytes/Param} \times \text{Overhead}$$ The first two terms give the raw size of the model weights in gigabytes. The overhead multiplier reserves extra memory that PyTorch, CUDA, and the attention KV cache consume at runtime, which the raw weight size alone ignores.

Three bars comparing VRAM for FP16, 8-bit, and 4-bit quantization
Lower precision (8-bit, 4-bit) roughly halves the bytes per parameter and the VRAM needed.
Stacked bar diagram showing VRAM split into weights, KV cache, and overhead
Total VRAM is dominated by model weights, plus extra for KV cache and overhead.

Worked example

A 7B model at 4-bit precision: \(7 \times 0.5 = 3.5\) GB of weights. With a 1.2 overhead factor: \(3.5 \times 1.2 = 4.2\) GB. That comfortably fits on an 8 GB consumer GPU. The same model in FP16 needs \(7 \times 2 \times 1.2 = 16.8\) GB, which requires a 24 GB card.

FAQ

Is this exact? No — it's an inference estimate. Actual usage varies with context length, batch size, and the serving framework. Use it for planning, not to the last megabyte.

Does this include training memory? No. Training needs far more (optimizer states, gradients), often 4× or more the inference figure.

What overhead should I use? 1.2 is fine for short-context inference; use 1.3–1.5 for long context or concurrent requests.

Last updated: