A Llama 3.1 70B model in fp16 has 141 GB of weights. An RTX 4090 has 24 GB of VRAM. The two facts together explain why most "can I run this?" questions on r/LocalLLaMA are answered with a math problem instead of a yes or no. This guide walks the math: where every byte of VRAM actually goes, why GQA and MLA matter, what quantization gives back, and the gap between inference, LoRA, and full fine-tuning.
What VRAM Has to Hold
When a model is loaded for inference, GPU memory has to fit at minimum:
- Weights — the parameters themselves
- KV cache — keys and values for every token already in context
- Activations — intermediate tensors during the forward pass
- Overhead — CUDA kernels, framework allocations, fragmentation
For training or fine-tuning, add:
- Gradients — same size as weights (in their training precision)
- Optimizer state — Adam needs two tensors per parameter (momentum + variance)
Skip any one of these and the estimate comes out low, sometimes by a factor of two or more. Most "how much VRAM does Llama 70B need?" answers on the internet ignore the KV cache, which is exactly the thing that explodes when context length grows.
The GPU VRAM Calculator keeps all six terms explicit and shows the formula behind each one — paste a model, pick a workload, see the breakdown.
Weights: The Easy Number
Weights = params × bytes_per_param. That's it.
| Precision | Bytes/param | Llama 8B | Llama 70B | DeepSeek V3 671B |
|---|---|---|---|---|
| fp32 | 4 | 32 GB | 282 GB | 2,684 GB |
| fp16 / bf16 | 2 | 16 GB | 141 GB | 1,342 GB |
| int8 | 1 | 8 GB | 70 GB | 671 GB |
| AWQ / GPTQ Q4 | 0.5 | 4 GB | 35 GB | 336 GB |
| GGUF Q4_K_M | ~0.56 | 4.5 GB | 40 GB | 376 GB |
| Q2_K | ~0.31 | 2.5 GB | 22 GB | 208 GB |
A 70B model that needs 141 GB in fp16 fits on a single H100 80 GB at int8 (70 GB of weights, leaving about 10 GB for everything else) and, just barely, on a single 24 GB consumer card at Q2_K (22 GB of weights), though quality degrades at 2-bit. Q4_K_M is the popular sweet spot for local use: roughly 0.56 bytes per parameter, with perplexity within ~3% of fp16 on most benchmarks.
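The table is just one multiplication per cell, which a short Python sketch makes explicit. The bytes-per-parameter values are copied from the table; the 70.6B parameter count for Llama 3.1 70B is an assumption chosen to match the 141 GB fp16 figure, and `weight_gb` is a hypothetical helper name.

```python
# Bytes per parameter for each precision, copied from the table above.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "q4": 0.5,
    "q4_k_m": 0.56,  # approximate, per the table
    "q2_k": 0.31,    # approximate, per the table
}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes): params x bytes_per_param."""
    return params_billion * BYTES_PER_PARAM[precision]


# Assumption: ~70.6B parameters for Llama 3.1 70B.
for precision in BYTES_PER_PARAM:
    print(f"{precision:>7}: {weight_gb(70.6, precision):6.1f} GB")
```

Swapping in a different parameter count reproduces the other columns.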
KV Cache: Where Long Context Hurts
The KV cache stores key and value vectors for every token in the prompt and every token generated so far. The formula:
kv_cache_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × batch × bytes_per_kv
The 2 × is for K and V. For Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128), the fp16 KV cache costs 320 KiB per token: about 2.7 GB at 8K context and about 43 GB at 128K, which is more than the Q4 weights of the same model.
Two architectural choices change this drastically.
Grouped-Query Attention (GQA) reduces n_kv_heads while keeping n_q_heads the same. Llama 3.1 70B has 64 query heads but only 8 KV heads, an 8× reduction in cache size versus a hypothetical non-GQA equivalent. Without GQA, that 128K-context cache would be about 344 GB.
Multi-head Latent Attention (MLA) is DeepSeek's trick: instead of caching K and V directly, cache a low-rank projection and reconstruct K/V on the fly. DeepSeek V3's KV cache is roughly 16× smaller than a same-size traditional model. It's the reason running DeepSeek V3 with long context is even tractable.
Bottom line: if you care about long context, the KV cache is usually the dominant cost, not the weights.
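The KV cache formula above translates directly into a few lines of Python. This is a minimal sketch: the Llama 3.1 70B values (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the published model config, and the 64-head variant is the hypothetical non-GQA equivalent.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_kv: int = 2) -> int:
    """KV cache size in bytes; the leading 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_kv


# Assumed config for Llama 3.1 70B (GQA: 8 KV heads).
LLAMA_70B_GQA = dict(n_layers=80, n_kv_heads=8, head_dim=128)
# Hypothetical non-GQA equivalent: one KV head per query head.
LLAMA_70B_MHA = dict(n_layers=80, n_kv_heads=64, head_dim=128)

for label, cfg in [("GQA", LLAMA_70B_GQA), ("no GQA", LLAMA_70B_MHA)]:
    for seq_len in (8_192, 131_072):
        gb = kv_cache_bytes(**cfg, seq_len=seq_len) / 1e9
        print(f"{label:>6} @ {seq_len:>7} tokens: {gb:7.1f} GB")
```

Raising `batch` scales the result linearly, which is why serving many concurrent long-context requests eats VRAM so quickly.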
Activations and Overhead
During the forward pass, intermediate activations for the current layer have to stay in memory. For inference this is small — typically 1–2 GB for a 70B model at modest batch size. For training it explodes because all activations are kept for the backward pass; gradient checkpointing trades compute for memory and can cut this by 5–10×.
CUDA context, framework allocator slack, and kernel workspaces add another 1–4 GB depending on framework (PyTorch is heavier than llama.cpp, vLLM is heavier than ollama). Plan for ~5 GB of headroom on top of the calculated requirement.
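Putting the four inference terms together gives a rough end-to-end estimate. The sketch below uses assumed defaults: 1.5 GB of activations (the midpoint of the 1–2 GB range above), 5 GB of overhead (the headroom suggested above), and an assumed Llama 3.1 70B config of 70.6B parameters, 80 layers, 8 KV heads, and head dimension 128. `InferenceBudget` and `estimate_inference` are hypothetical names, not part of any calculator.

```python
from dataclasses import dataclass


@dataclass
class InferenceBudget:
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    overhead_gb: float

    @property
    def total_gb(self) -> float:
        return (self.weights_gb + self.kv_cache_gb
                + self.activations_gb + self.overhead_gb)


def estimate_inference(params_billion: float, bytes_per_param: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       seq_len: int, batch: int = 1, bytes_per_kv: int = 2,
                       activations_gb: float = 1.5,  # assumed midpoint of 1-2 GB
                       overhead_gb: float = 5.0      # assumed headroom
                       ) -> InferenceBudget:
    """Sum the four inference terms; all results in GB (1e9 bytes)."""
    weights_gb = params_billion * bytes_per_param
    kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                * seq_len * batch * bytes_per_kv)
    return InferenceBudget(weights_gb, kv_bytes / 1e9, activations_gb, overhead_gb)


# Assumed Llama 3.1 70B config, fp16 weights, 4K context, batch 1.
budget = estimate_inference(70.6, 2.0, n_layers=80, n_kv_heads=8,
                            head_dim=128, seq_len=4096)
print(budget)
print(f"total: {budget.total_gb:.1f} GB")
```

Keeping each term as its own field makes it obvious which one grows when you change context length, batch size, or precision.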
Five Workloads, Five Budgets
For Llama 3.1 70B at 4K context, batch 1, with fp16 base weights unless the row says otherwise:
| Workload | Approx VRAM | Realistic hardware |
|---|---|---|
| Inference (fp16) | ~150 GB | 2× H100 80 GB or 2× A100 80 GB |
| Inference (Q4) | ~40 GB | 2× RTX 4090 24 GB, or a single 48 GB or 80 GB card |
| QLoRA fine-tune | ~48 GB | Single H100 80 GB or single A100 80 GB |
| LoRA fine-tune (fp16 base) | ~180 GB | 3–4× H100 80 GB |
| Full fine-tune | ~640 GB | 8× H100 SXM |
The QLoRA-vs-full-fine-tune gap is the biggest leverage point in the whole table. Full fine-tuning a 70B model needs an 8-GPU node. QLoRA fine-tunes the same model on a single H100 by quantizing the base to 4-bit, freezing it, and training only small low-rank adapters in fp16. Quality on most downstream tasks is within 1–2% of full fine-tuning for a ~13× memory reduction.
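The gap comes from which parameters carry gradients and optimizer state. The sketch below counts only parameter-related memory and leaves out activations and overhead. It assumes 2-byte weights, gradients, and Adam tensors by default (keeping Adam state in fp32 would raise the total), and the 0.2B adapter size is a hypothetical figure chosen for illustration.

```python
def training_state_gb(trainable_params_billion: float,
                      weight_bytes: float = 2.0,
                      grad_bytes: float = 2.0,
                      optimizer_bytes: float = 2.0) -> float:
    """Weights + gradients + two Adam tensors for the trainable parameters."""
    per_param = weight_bytes + grad_bytes + 2 * optimizer_bytes
    return trainable_params_billion * per_param


def frozen_base_gb(params_billion: float, bytes_per_param: float) -> float:
    """A frozen base model needs weights only: no gradients, no optimizer."""
    return params_billion * bytes_per_param


BASE = 70.6     # assumed parameter count, in billions
ADAPTERS = 0.2  # hypothetical LoRA adapter size, in billions

full = training_state_gb(BASE)
lora = frozen_base_gb(BASE, 2.0) + training_state_gb(ADAPTERS)
qlora = frozen_base_gb(BASE, 0.5) + training_state_gb(ADAPTERS)

print(f"full fine-tune: {full:6.1f} GB before activations and overhead")
print(f"LoRA (fp16):    {lora:6.1f} GB before activations and overhead")
print(f"QLoRA (4-bit):  {qlora:6.1f} GB before activations and overhead")
```

The adapter terms barely register; nearly all of the saving is the frozen base dropping its gradients and optimizer state, and then shrinking again at 4-bit.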
Multi-GPU Has Its Own Overhead
Tensor parallelism (splitting each layer across multiple GPUs) adds roughly 10–15% overhead per extra GPU for activation memory and inter-GPU communication. So 2× H100s don't give you a clean 160 GB of usable VRAM — they give you closer to 145 GB after parallelism overhead. The pattern is worth checking before you commit to a topology: sometimes a single GPU at slightly worse precision is faster and cheaper than two GPUs at higher precision.
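A quick way to sanity-check a topology is to discount the pooled VRAM before comparing it against the requirement. The sketch below applies a single flat overhead fraction to any multi-GPU setup, which is a simplifying assumption: the text only gives a rough range, and how the overhead scales past two GPUs is not specified. `usable_vram_gb` and `fits` are hypothetical helper names.

```python
def usable_vram_gb(n_gpus: int, vram_per_gpu_gb: float,
                   overhead_fraction: float = 0.10) -> float:
    """Pooled VRAM after a flat tensor-parallel discount (assumed model)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    total = n_gpus * vram_per_gpu_gb
    if n_gpus == 1:
        return total  # no parallelism, no discount
    return total * (1.0 - overhead_fraction)


def fits(required_gb: float, n_gpus: int, vram_per_gpu_gb: float,
         overhead_fraction: float = 0.10) -> bool:
    """True if the requirement fits in the discounted pool."""
    return required_gb <= usable_vram_gb(n_gpus, vram_per_gpu_gb, overhead_fraction)


# Compare the optimistic and pessimistic ends of the 10-15% range.
for overhead in (0.10, 0.15):
    print(f"2x 80 GB at {overhead:.0%} overhead: "
          f"{usable_vram_gb(2, 80, overhead):.0f} GB usable")
```

Running both ends of the range shows how much margin a given requirement has before committing to a topology.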
Pipeline parallelism (different layers on different GPUs) has lower memory overhead but worse latency for single-request inference because each GPU sits idle while waiting for the previous stage.
Cloud vs Buy
H100 80 GB SXM cost as of May 2026: roughly $2.49–$2.99/hr on RunPod, Lambda, and Vast.ai. At 24/7 that's around $22K–$26K per year. A bare H100 80 GB SXM is ~$30K, plus power, cooling, and 25%/year OpEx for a co-located server. If that OpEx is about $7.5K a year, owning saves $14K–$19K a year against round-the-clock rental, so break-even is roughly 19–25 months of full utilization.
If your duty cycle is below ~40%, cloud is cheaper essentially forever. Inference during business hours alone is about 40 of 168 hours a week, roughly 24%, so an occasional weekend fine-tune on top still leaves you under 40%. The LLM Cost Calculator models the hosted-API alternative for the same workload: if you're under ~10M tokens/day, paying per-token is often cheaper than provisioning any GPU at all.
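The break-even reasoning is easy to rerun with your own prices. This sketch assumes the 25%/year OpEx is a fraction of the purchase price and already covers power and cooling, that the cloud bill scales linearly with duty cycle, and that owned hardware costs the same whether busy or idle. `break_even_months` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365


def cloud_cost_per_year(rate_per_hour: float, duty_cycle: float = 1.0) -> float:
    """Yearly rental cost when the GPU is only paid for while in use."""
    return rate_per_hour * HOURS_PER_YEAR * duty_cycle


def break_even_months(purchase_price: float, opex_fraction_per_year: float,
                      rate_per_hour: float, duty_cycle: float = 1.0):
    """Months until owning beats renting, or None if it never does."""
    yearly_opex = purchase_price * opex_fraction_per_year
    yearly_saving = cloud_cost_per_year(rate_per_hour, duty_cycle) - yearly_opex
    if yearly_saving <= 0:
        return None  # renting stays cheaper indefinitely
    return 12 * purchase_price / yearly_saving


# Figures from the text: ~$30K purchase, 25%/year OpEx, $2.49-$2.99/hr.
for rate in (2.49, 2.99):
    for duty in (1.0, 0.4, 0.3):
        months = break_even_months(30_000, 0.25, rate, duty)
        label = "never" if months is None else f"{months:.0f} months"
        print(f"${rate}/hr at {duty:.0%} duty cycle: {label}")
```

Because the saving shrinks toward zero as utilization drops, the break-even horizon grows very quickly near the crossover point rather than linearly.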
The Honest Limitations
VRAM math is a model, not a guarantee. Real-world numbers drift from the calculator for several reasons:
- Framework overhead varies. vLLM and TensorRT-LLM reserve memory aggressively for the paged KV cache; llama.cpp is leaner. Same workload, different framework, ±15% on observed VRAM.
- Memory fragmentation. Long-running servers eventually fragment the allocator; a workload that fits at startup may OOM after hours of mixed batch sizes.
- Speculative decoding adds a draft model. Faster inference, but you're paying VRAM for two models.
- KV-cache offloading isn't modeled. llama.cpp can spill cache to system RAM; that lets you run beyond GPU VRAM at the cost of throughput, but the calculator assumes everything stays on GPU.
- Architectural quirks aren't always public. Some closed models or fine-tunes change head config or layer count from the base; the calculator's numbers come from the official model card and may be off for community variants.
Use the calculator to size hardware. Use a real load test to confirm.
Related Tools
- GPU VRAM Calculator — the source-of-truth tool for everything above
- LLM Cost Calculator — hosted-API pricing for the same workload
- RAG Cost Calculator — full RAG pipeline pricing (vector DB + embedding + LLM)
- Vector Database Cost Calculator — pgvector vs Pinecone, Weaviate, Qdrant, Milvus
TL;DR
VRAM = weights + KV cache + activations + (gradients + optimizer if training) + overhead. Llama 3.1 70B needs ~40 GB at Q4_K_M and squeezes onto a single 24 GB GPU only at Q2_K; at fp16 it needs ~150 GB. GQA cuts Llama 70B's KV cache 8×; MLA cuts DeepSeek V3's roughly 16×. QLoRA fine-tunes a 70B model on a single H100 at the cost of ~1–2% quality. The GPU VRAM Calculator keeps the full breakdown explicit instead of giving you a single misleading number.