Results

Estimated VRAM required: 16.8 GB
Raw model weights: 14 GB
Overhead (KV cache, activations, etc.): 2.8 GB

What is the LLM VRAM Requirement Calculator?

This tool estimates how much GPU video memory (VRAM) you need to load and run a large language model (LLM) for inference. The amount of memory is driven primarily by the number of model parameters and the numeric precision used to store each weight. A safety/overhead factor accounts for the KV cache, activations, and CUDA context that consume memory beyond the raw weights.

How to use it

Enter the model size in billions of parameters (e.g. 7 for a 7B model, 70 for Llama-3 70B). Choose the precision: FP32 uses 4 bytes per weight, FP16/BF16 uses 2 bytes, INT8 uses 1 byte, and INT4 quantization uses 0.5 bytes. Finally, set the overhead factor: 1.2 (a 20% buffer) is a reasonable default for short-context inference; raise it for long contexts or batching.
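The precision choices above can be captured in a small lookup table. The following is a minimal sketch; the names `BYTES_PER_PARAM` and `raw_weights_gb` are illustrative and are not part of the calculator itself.

```python
# Bytes needed to store one weight at each precision, as listed above.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "BF16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}


def raw_weights_gb(params_billion: float, precision: str) -> float:
    """Size of the raw weights in decimal GB (hypothetical helper)."""
    key = precision.upper()
    if key not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision!r}")
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    # Billions of parameters times bytes per parameter gives GB directly.
    return params_billion * BYTES_PER_PARAM[key]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = raw_weights_gb(70, precision)
        print(f"70B in {precision}: {size:.0f} GB of weights")
```

This covers only the weights; the overhead factor is applied separately.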

The formula explained

$$\text{VRAM (GB)} = \text{Params (B)} \times \text{Bytes/Param} \times \text{Overhead}$$

Because 1 billion bytes is exactly 1 GB in decimal units (about 0.93 GiB), multiplying parameters in billions by bytes per parameter gives the weight size in gigabytes directly. The overhead factor then scales that figure up to cover runtime memory.
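As a rough sketch, the formula translates into a one-line function. The function names and the check that the overhead is at least 1.0 are assumptions made for illustration. The second helper converts decimal GB to GiB, which can matter because some tools report memory in binary units.

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """VRAM (GB) = Params (B) x Bytes/Param x Overhead."""
    if overhead < 1.0:
        # Assumption: the factor only ever adds headroom on top of the weights.
        raise ValueError("overhead should be at least 1.0")
    return params_billion * bytes_per_param * overhead


def gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to gibibytes (2**30 bytes)."""
    return gigabytes * 1e9 / 2**30


if __name__ == "__main__":
    vram = estimate_vram_gb(13, 1.0)  # a 13B model in INT8, default overhead
    print(f"{vram:.1f} GB = {gb_to_gib(vram):.1f} GiB")
```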

Figure: Bar comparison of bytes per parameter for FP32, FP16, INT8 and INT4 precisions. Lower-precision formats use fewer bytes per parameter, reducing VRAM.

Figure: Model parameter count multiplied by bytes per parameter and an overhead factor, producing total GPU VRAM. VRAM equals parameter count times bytes per parameter times the overhead factor.

Worked example

For a 7B model in FP16 with a 1.2 overhead factor:

$$7 \times 2 \times 1.2 = 16.8 \text{ GB}$$

That fits comfortably on a 24 GB card. The same model in INT4:

$$7 \times 0.5 \times 1.2 = 4.2 \text{ GB}$$

which runs easily on an 8 GB GPU.
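The same formula can be rearranged to ask the opposite question: how large a model fits on a given card? The sketch below assumes the card's full advertised capacity is available to the model, which is optimistic in practice; `max_params_billion` and `fits` are hypothetical helper names.

```python
def max_params_billion(vram_gb: float,
                       bytes_per_param: float,
                       overhead: float = 1.2) -> float:
    """Largest model (billions of parameters) the formula says will fit.

    Rearranged from VRAM = Params x Bytes/Param x Overhead.
    """
    return vram_gb / (bytes_per_param * overhead)


def fits(params_billion: float, bytes_per_param: float,
         vram_gb: float, overhead: float = 1.2) -> bool:
    """True if the estimated requirement is within the card's capacity."""
    return params_billion * bytes_per_param * overhead <= vram_gb


if __name__ == "__main__":
    # The two cases from the worked example.
    print(fits(7, 2.0, vram_gb=24))   # 7B FP16 on a 24 GB card
    print(fits(7, 0.5, vram_gb=8))    # 7B INT4 on an 8 GB card
    print(f"{max_params_billion(24, 2.0):.1f}B max in FP16 on 24 GB")
```

By this estimate a 24 GB card tops out at about 10B parameters in FP16 with the default 1.2 overhead.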

FAQ

Why is the actual usage higher than the raw weights? The KV cache grows with context length and batch size, and the framework reserves memory for activations and buffers — that is what the overhead factor approximates.
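To see how the KV cache scales, here is a back-of-the-envelope sketch. It assumes a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token; the layer count, head count, and head dimension in the usage example are illustrative values for a 7B-class model, not figures from this calculator. Check the configuration of the model you actually run.

```python
def kv_cache_gb(num_layers: int,
                num_kv_heads: int,
                head_dim: int,
                context_len: int,
                batch_size: int = 1,
                bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size in decimal GB.

    The leading 2 counts the separate key and value tensors in each layer.
    """
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * batch_size * bytes_per_value)
    return total_bytes / 1e9


if __name__ == "__main__":
    # Illustrative dimensions (assumed): 32 layers, 32 KV heads, head_dim 128.
    for context in (2048, 4096, 8192):
        size = kv_cache_gb(32, 32, 128, context_len=context)
        print(f"context {context}: {size:.2f} GB")
```

The cache grows linearly with both context length and batch size, which is why a fixed 1.2 factor stops being enough for long prompts or many concurrent requests.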

Does this include training? No. Training needs roughly 3–4× more memory for optimizer states and gradients; this estimate targets inference.

What overhead should I use? Use ~1.2 for short prompts, and 1.5–2.0+ for long contexts or heavy batching.
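A quick way to see the effect of this choice is to sweep the overhead factor for one model. This is a small sketch using the 7B FP16 model from the worked example; `sweep_overhead` is a hypothetical helper name.

```python
def sweep_overhead(params_billion: float,
                   bytes_per_param: float,
                   factors=(1.2, 1.5, 2.0)) -> dict:
    """Map each overhead factor to the estimated VRAM in GB (1 decimal)."""
    return {
        factor: round(params_billion * bytes_per_param * factor, 1)
        for factor in factors
    }


if __name__ == "__main__":
    for factor, vram in sweep_overhead(7, 2.0).items():
        print(f"overhead {factor}: {vram} GB")
```

At a factor of 2.0 the same 7B FP16 model is estimated at 28 GB, which no longer fits on a 24 GB card.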
