GPU VRAM Calculator for LLMs

Estimate exactly how much GPU VRAM your LLM needs — for inference or fine-tuning. Compare 35+ open-weight models across 15 quantization levels, see which GPUs fit, and project cloud cost across 8 providers. Privacy-first, runs entirely in your browser.

What it does

Architecturally accurate KV cache (GQA + MLA)

KV cache is the most underestimated VRAM component for long-context inference. The calculator stores per-model n_kv_heads, head_dim, and n_layers, so a Llama 3 8B (8 KV heads via GQA) computes correctly against a Llama 2 7B (32 KV heads, no GQA) — a 4× difference. DeepSeek V3's MLA is modeled with its 16× compression vs standard attention.
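
Below is a minimal sketch of an architecture-aware KV-cache term in TypeScript (the engine language named under Technical details). The `ModelSpec` shape and spec literals are illustrative assumptions, not the calculator's actual schema:

```ts
// Architecture-aware KV-cache estimate (sketch). Spec values are illustrative;
// the ModelSpec shape is hypothetical, not the calculator's schema.
interface ModelSpec {
  nLayers: number;
  nKvHeads: number;
  headDim: number;
  kvCompression?: number; // e.g. ~16 for DeepSeek-style MLA
}

function kvCacheGiB(m: ModelSpec, seqLen: number, batch: number, bytesPerKv = 2): number {
  const bytes = 2 * m.nLayers * m.nKvHeads * m.headDim * seqLen * batch * bytesPerKv; // 2 = K and V
  return bytes / (m.kvCompression ?? 1) / 1024 ** 3;
}

const llama2_7b: ModelSpec = { nLayers: 32, nKvHeads: 32, headDim: 128 }; // no GQA
const llama3_8b: ModelSpec = { nLayers: 32, nKvHeads: 8, headDim: 128 };  // GQA

console.log(kvCacheGiB(llama2_7b, 8192, 1).toFixed(2)); // "4.00" GiB at 8K context
console.log(kvCacheGiB(llama3_8b, 8192, 1).toFixed(2)); // "1.00" GiB — 4× smaller
```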

Four workload types in one tool

Inference (weights + KV + activations + overhead), Full FT (+ gradients + Adam optimizer state + master fp32 weights), LoRA (frozen base + tiny fp32 adapter), QLoRA (4-bit base + adapter + dequant buffer). Each renders its own breakdown segments and math walkthrough.
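
The Technical details section below notes that every workload estimator returns `{ total_gb, breakdown, steps }`. A hypothetical sketch of that shape, and how an inference estimate might assemble it from precomputed component sizes:

```ts
// Shared result shape for all four workload estimators (sketch).
interface VramEstimate {
  total_gb: number;
  breakdown: { label: string; gb: number }[]; // feeds the segmented bar
  steps: string[];                            // feeds "Show the math"
}

// Illustrative inference estimator, assuming the component sizes are precomputed.
function inferenceEstimate(weightsGb: number, kvGb: number, actGb: number, overheadGb: number): VramEstimate {
  const breakdown = [
    { label: "Weights", gb: weightsGb },
    { label: "KV cache", gb: kvGb },
    { label: "Activations", gb: actGb },
    { label: "Overhead", gb: overheadGb },
  ];
  const total = breakdown.reduce((sum, c) => sum + c.gb, 0);
  return {
    total_gb: total * 1.05, // 5% safety margin, as in the hero number
    breakdown,
    steps: breakdown.map((c) => `${c.label}: ${c.gb.toFixed(1)} GB`),
  };
}
```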

15 precision options including all GGUF levels

fp32 / fp16 / fp8 / int8 / int4 / GPTQ / AWQ plus the full GGUF family (Q2_K through Q8_0) with the correct effective bits/weight including block-scale overhead. The most-asked "how much VRAM for Q4_K_M?" question gets an exact answer.
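
A sketch of the lookup this implies — the values are illustrative effective bytes/param consistent with the numbers quoted elsewhere on this page, not the calculator's exact table:

```ts
// Illustrative effective bytes per parameter, including GGUF block-scale
// overhead. Values track the numbers quoted on this page, not an official spec.
const bytesPerParam: Record<string, number> = {
  fp32: 4.0,
  fp16: 2.0,
  fp8: 1.0,
  int8: 1.0,
  q8_0: 1.06,   // ~8.5 bits/weight with block scales
  awq4: 0.5,
  gptq4: 0.5,
  q4_k_m: 0.58, // ~4.65 bits/weight
  q3_k_m: 0.49,
  q2_k: 0.36,
};

// Weight footprint in GB for a model with `paramsB` billion parameters.
function weightsGb(paramsB: number, fmt: string): number {
  return paramsB * bytesPerParam[fmt];
}

console.log(weightsGb(70.6, "fp16").toFixed(1));   // "141.2"
console.log(weightsGb(70.6, "q4_k_m").toFixed(1)); // "40.9"
```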

GPU compatibility matrix across 17 GPUs

Sortable table with green ✓ / amber ⚠ / red ✗ status for every NVIDIA consumer (4090, 5090, 3090, 4080 SUPER), workstation (L4, L40S, RTX 6000 Ada, RTX 6000 Pro Blackwell), datacenter (A100 40 / 80, H100 PCIe / SXM, H200, B200), and AMD datacenter (MI300X, MI325X, MI300A) GPU. Multi-GPU recommendations include tensor-parallel overhead.

Cloud cost projection across 8 providers

Pick a fitting GPU; see hourly / daily / weekly / monthly / yearly cost across RunPod, Lambda Labs, Vast.ai, CoreWeave, plus AWS / GCP / Azure hyperscaler pricing. On-demand and spot pricing both surfaced. The most-screenshotted feature on the page.

"What If?" delta table

Auto-generates 5–12 alternative configurations and ranks by savings — "Switch to AWQ 4-bit: −18.4 GB", "Reduce context to 4K: −2.1 GB", "Switch to QLoRA: −145 GB". Sorted biggest-savings first. Goldmine for distribution on Reddit, X, and Discord LLM threads.

Custom-model input for unreleased architectures

Selecting "Custom" lets you enter total params, layers, hidden dim, attention heads, KV heads (for GQA), vocab, and MoE configuration. Useful for in-house models, leaked weights, or any architecture not in the dropdown.

How to use the GPU VRAM Calculator for LLMs

  1. Pick a preset or your model

    Click one of the 8 presets ("Run Llama 3.3 70B locally", "QLoRA fine-tune Llama 3.3 70B", "Run DeepSeek R1 671B locally", etc.) for a sensible starting point. Or pick a model from the searchable dropdown — Llama 3.x, Mistral / Mixtral, Qwen 3, DeepSeek V3 / R1, Gemma 3, Phi 4, and more.

  2. Choose your use case

    Inference (running the model), LoRA (parameter-efficient fine-tune), QLoRA (4-bit base + adapters — the standard for fine-tuning large models on a single GPU), Full fine-tuning (every parameter trainable, ~7× the VRAM of inference), or Pre-training. Inputs change automatically.

  3. Pick a precision / quantization

    For inference: fp16 (full quality), AWQ / GPTQ 4-bit (production), GGUF Q4_K_M (most popular for local), or anything from Q8_0 down to Q2_K. For training: bf16 mixed precision is the modern default. Each option shows bytes/param and qualitative quality loss.

  4. Set context, batch, and framework

    Drag the context-length log slider (512 → 1M tokens) or click a quick pill. Set batch size (1 to 256). For inference, pick your framework (vLLM, llama.cpp, TensorRT-LLM, ExLlamaV2, Ollama, MLC-LLM). For training, pick your optimizer (AdamW / 8-bit AdamW / Lion / AdaFactor) and ZeRO / FSDP stage.

  5. Read the VRAM card

    The hero number at the top is your total VRAM in GB, with a 5% safety margin. Below it: a green list of GPUs that fit (with how many you need) and a red list that won't. Borderline fits (over 92% utilization) are flagged amber.

  6. Open "Show the math"

    Every estimate expands to show the exact formulas — model weights = params × bytes/param, KV cache = 2 × n_layers × n_kv_heads × head_dim × seq × batch × bytes, plus optimizer state, gradients, activations, and overhead. Auditable, reproducible, copy-friendly. (A worked example in code appears right after this list.)

  7. Use the "What If?" panel

    Live table ranking 5–12 alternative configurations by VRAM delta — switch precision, halve the context, enable gradient checkpointing, switch to QLoRA, etc. The single highest-leverage feature for "how do I make this fit?" questions.

  8. Project cloud cost or share the link

    The Cloud Cost Projector picks any fitting GPU configuration and shows hourly / monthly / yearly cost across 8 providers (RunPod, Lambda Labs, Vast.ai, CoreWeave, AWS, GCP, Azure). Copy the shareable link to send the exact configuration to a teammate or paste in a Reddit / X thread.
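
To make step 6 concrete, here is the walkthrough as runnable arithmetic for one configuration — Llama 3.3 70B at AWQ 4-bit, 8K context, batch 1 (the same breakdown the AWQ entry under Common errors uses). A sketch with assumed activation/overhead constants, not the calculator's code:

```ts
// Llama 3.3 70B, AWQ 4-bit, 8K context, batch 1 (illustrative constants).
const weights = (70.6e9 * 0.5) / 1e9;                     // 4-bit ≈ 0.5 bytes/param → ~35.3 GB
const kv = (2 * 80 * 8 * 128 * 8192 * 1 * 2) / 1024 ** 3; // 2(K,V) × layers × kv_heads × head_dim × seq × batch × bytes → ~2.5 GB
const activations = 3.0;                                  // assumed, with Flash Attention
const overhead = 1.5;                                     // assumed: CUDA context + framework

console.log((weights + kv + activations + overhead).toFixed(1)); // "42.3" GB
```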

When to use this

"Can I run Llama 70B on my hardware?"

Pick "Llama 3.3 70B" + "GGUF Q4_K_M" + 8K context. The calculator returns ~42 GB and shows the green list (1× RTX 6000 Ada / RTX 6000 Pro Blackwell / A100 80GB / H100 80GB / H200) and red list (4090, 5090, 3090, A6000 48GB borderline). Cloud projector shows H100 from $1.49/hr on Vast.ai spot. Settles every recurring "how much VRAM for 70B?" Reddit thread in two clicks.

Choosing between LoRA and QLoRA for fine-tuning

Switch a Llama 3.1 8B fine-tune from LoRA to QLoRA in the use-case toggle and watch the VRAM drop from ~28 GB to ~12 GB. For 70B, the gap is more dramatic: LoRA needs ~150 GB, while QLoRA fits in ~48 GB — a single 80 GB card. The "Show the math" expands to show 4-bit base + dequant buffer + tiny fp32 adapter.

Sizing a long-context inference deployment

Pick "Qwen 3 32B at 128K context" preset. KV cache balloons to ~16 GB (38% of total VRAM). The calculator surfaces an insight — "At 131,072 tokens of context, KV cache is the dominant cost. Cutting to 8K tokens would save ~15 GB." Tells you immediately whether you need an H100 or a B200.

Budgeting cloud GPU spend before signing up

Pick your model + workload, switch the projector to "1 month" duration. Exact dollar amount across RunPod, Lambda Labs, Vast.ai, CoreWeave, AWS, GCP, and Azure. Spot vs on-demand toggle. Engineers screenshot this row to justify infrastructure choices to managers.

MoE model sizing (DeepSeek V3, Mixtral)

DeepSeek V3 has 671B total params but only 37B active per token. The calculator surfaces an insight — "VRAM scales with total params (all experts loaded), not active params" — so you don't accidentally try to fit it on an H100. MLA compression cuts the KV cache 16× vs standard attention; the calculator factors this in.

Common errors & fixes

Calculator shows you fit on a single 4090 but you OOM at runtime
Two likely culprits. (1) Activation factor — real activation memory varies by implementation; bump it from 15% to 25% in Advanced and re-check. (2) KV cache grows with actual prompt length, not the slider — if you're serving real users who hit 8K-token prompts while the slider says 4K, the calculator underestimates. Set the slider to your worst-case context.
QLoRA fine-tune appears to fit on 24GB but bitsandbytes errors
NF4 dequantization workspace and the optimizer state can spike during the first backward pass. Try (a) gradient checkpointing on, (b) batch size 1, (c) 8-bit paged AdamW (bitsandbytes paged_adamw_8bit), (d) reduce LoRA rank from 64 to 16. The calculator surfaces these as "What If?" rows; pick the one that drops total below ~22 GB to leave headroom.
70B inference recommends an H100 but llama.cpp on a 24GB 4090 actually works fine
You're probably running a CPU/GPU split with --n-gpu-layers. The calculator assumes the entire model lives in VRAM. For partial offload, divide n_layers proportionally. llama.cpp's own output gives you the exact split that fits — start there, not from the calculator.
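
A rough way to estimate that split — a sketch that assumes weights divide evenly across layers (they don't exactly; embeddings and the LM head skew it), so treat the result as a starting point:

```ts
// Rough --n-gpu-layers estimate for partial offload (sketch; assumes
// per-layer weight size is uniform, which ignores embeddings / LM head).
function gpuLayerEstimate(weightsGb: number, nLayers: number, vramGb: number, reserveGb = 3): number {
  const perLayerGb = weightsGb / nLayers;
  return Math.max(0, Math.min(nLayers, Math.floor((vramGb - reserveGb) / perLayerGb)));
}

// Llama 3.3 70B at Q4_K_M (~41 GB) on a 24 GB 4090, reserving ~3 GB for KV + overhead:
console.log(gpuLayerEstimate(41, 80, 24)); // 40 → roughly 40 of 80 layers on GPU
```
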
KV cache is "only 2GB" in the calculator but 30GB in vLLM at runtime
vLLM pre-allocates a KV cache pool sized for many concurrent sequences (controlled by --gpu-memory-utilization). The calculator estimates the cache for one sequence at the configured batch size. To match vLLM's reservation, set batch size to your max concurrent sequences (typically 16–256 in production).
Multi-GPU recommendation says "8× H100" but DeepSpeed ZeRO-3 on 4× H100 works
The matrix's default recommendation assumes no ZeRO/FSDP partitioning. To model partitioned training, switch the use case to "Full FT" and set the ZeRO stage to ZeRO-3 or FSDP — per-GPU VRAM drops accordingly, and the matrix shows the smaller fit.
Estimate seems too high for AWQ — "Llama 70B AWQ should be 35GB, not 42GB"
You're comparing weights-only vs total VRAM. Weights at AWQ 4-bit = 70.6B × 0.5 ≈ 35.3 GB. Add KV cache (~2.5 GB at 8K context), activations (~3 GB with Flash Attention), and overhead (~1.5 GB) and you land near 42 GB. Open "Show the math" — every number is broken out.

Technical details

VRAM engine: Pure TypeScript functions per workload type. Inference, full FT, LoRA, and QLoRA each return { total_gb, breakdown, steps }, so the same data feeds the hero card, breakdown bar, and Show-the-math view.
Inference formula: VRAM = weights + KV cache + activations + overhead. Weights = params × bytes_per_param. KV = 2 × n_layers × n_kv_heads × head_dim × seq × batch × bytes_per_kv. Activations ≈ 5–30% of weights, divided by ~3 with Flash Attention.
Full FT formula: VRAM = weights (bf16) + gradients (bf16) + optimizer state (Adam family: 8 bytes/param + 4 bytes/param fp32 master) + activations + overhead. ZeRO-1/2/3 and FSDP partition optimizer state, gradients, and weights across DP ranks (modeled at 8-way by default).
LoRA / QLoRA formula: VRAM = base weights (frozen, full or 4-bit) + LoRA adapter (fp32) + adapter gradients + adapter optimizer state + activations + overhead. Adapter param count = layers × (per-target projections) × 2 × rank × hidden; the target group is selectable.
KV cache for GQA / MLA: GQA shrinks n_kv_heads by 4–8× vs n_attn_heads and feeds directly into the formula. DeepSeek's Multi-head Latent Attention (MLA) is modeled as a 16× compression on the standard formula.
GPU fit: Single-GPU first; if the model doesn't fit, walk up tensor-parallel configurations, applying a 12% per-extra-GPU activation/communication overhead. Borderline marker (amber) at >92% utilization. Up to 16-GPU configurations are considered.
Cloud pricing: Static JSON per GPU × provider. On-demand and spot prices tracked separately. Hyperscaler pricing (AWS p4/p5, GCP A3, Azure ND) is per-GPU (instance price ÷ N GPUs).
Privacy: No server processing. All math runs in your browser. URL state encodes inputs only — never identity. No analytics on dropdowns.
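
A sketch of the fit walk described in the GPU fit entry — single GPU first, then tensor-parallel sizes with the 12% per-extra-GPU overhead and the 92% amber threshold (type and function names are made up):

```ts
type FitStatus = "fits" | "borderline" | "no-fit";

// Walk tensor-parallel sizes 1..16, adding 12% overhead per extra GPU (sketch).
function fitOnGpu(totalGb: number, gpuVramGb: number, maxGpus = 16): { gpus: number; status: FitStatus } {
  for (let n = 1; n <= maxGpus; n++) {
    const effectiveGb = totalGb * (1 + 0.12 * (n - 1)); // activation/comm overhead
    const utilization = effectiveGb / n / gpuVramGb;
    if (utilization <= 0.92) return { gpus: n, status: "fits" };      // green
    if (utilization <= 1.0) return { gpus: n, status: "borderline" }; // amber
  }
  return { gpus: maxGpus, status: "no-fit" };                         // red
}

console.log(fitOnGpu(42.3, 48)); // 70B AWQ on an RTX 6000 Ada → { gpus: 1, status: "fits" }
console.log(fitOnGpu(42.3, 24)); // on a 24 GB 4090 → { gpus: 2, status: "borderline" }
```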

How GPU VRAM is calculated for LLMs

Every LLM workload reduces to four memory components, regardless of model or framework. (1) **Model weights** — the parameter tensors themselves, size = number of params × bytes per parameter. (2) **KV cache** — keys and values from all previous tokens, kept around so attention doesn't recompute them. Size = 2 (K and V) × number of layers × number of KV heads × head dim × sequence length × batch size × bytes per KV element. (3) **Activations** — intermediate tensors that have to be kept around for backpropagation (training) or attention (inference). Roughly 10–20% of weights for inference, much larger for training. (4) **Overhead** — CUDA context, framework state, dequantization buffers, workspace. Usually 1–2 GB.

For *training*, the list multiplies: each parameter also needs a gradient (same dtype as the weights) and optimizer state (Adam keeps two fp32 moment tensors per param, plus an fp32 master copy of the weights — 12 bytes per param, 6× the bf16 weights themselves). That's why full fine-tuning a 70B model needs 8× H100 with FSDP while quantized inference of the same model fits on a single H100.
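
As a quick check on that arithmetic (bf16 training with vanilla Adam, per the paragraph above):

```ts
// Bytes per parameter for bf16 full fine-tuning with vanilla Adam (sketch).
const bytesPerParamFT =
  2 + // bf16 weights
  2 + // bf16 gradients
  8 + // two fp32 Adam moments
  4;  // fp32 master weights

console.log(bytesPerParamFT);                  // 16 — vs 2 bytes/param for bf16 inference
console.log((8.03e9 * bytesPerParamFT) / 1e9); // ≈ 128 GB for Llama 3.1 8B, before activations or partitioning
```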

The trick the calculator above plays — and most others miss — is that the KV cache formula uses *per-model* values for n_layers, n_kv_heads, and head_dim. Llama 3 70B and Llama 2 70B both use GQA with 8 KV heads, but Llama 2 13B has 40 KV heads (no GQA) — meaning the KV cache for 13B at 8K context is actually ~2.5× larger than the KV cache for 70B at the same context. The worked numbers in the KV cache section below make this concrete.

How quantization reduces VRAM (with real numbers)

Quantization compresses model weights from fp16 (2 bytes/param) to lower-precision formats. The size savings are roughly linear with bits per weight, but quality loss is not — modern post-training quantization (AWQ, GPTQ, GGUF Q4_K_M) reaches near-lossless quality at 4 bits, while older RTN-style int8 had measurable quality drops.

**The practical landscape in 2026.** For local serving (llama.cpp, Ollama, LM Studio, MLX), GGUF Q4_K_M is the de-facto default — 0.58 bytes/param effective (including block scales), under 1 perplexity point worse than fp16, runs on consumer hardware. For production serving (vLLM, TensorRT-LLM), AWQ 4-bit is the standard — same quality as Q4_K_M, faster inference because vLLM's kernels are AWQ-optimized. For maximum quality at half the size, Q8_0 or fp8 (on H100/H200/B200) gets you there. For minimum size at the cost of quality, Q3_K_M and Q2_K still produce coherent text but degrade noticeably on harder benchmarks.

**Concrete numbers.** Llama 3.3 70B at fp16 = ~141 GB of weights (won't fit on any single GPU). At AWQ 4-bit = ~35 GB (fits on a single 48GB workstation card). At Q4_K_M = ~41 GB (slightly larger because GGUF block scales add overhead). At Q8_0 = ~74 GB (fits on a single 80GB card). At Q2_K = ~25 GB (fits on a 32GB 5090, but quality is meaningfully worse).

The calculator above shows the exact bytes/param it uses for every option in the precision dropdown. Check the math in the "Show calculation" section — every number is auditable.

VRAM for fine-tuning: full vs LoRA vs QLoRA

The gap is enormous and usually misunderstood. **Full fine-tuning** of Llama 3.1 8B with AdamW: ~85 GB of VRAM at batch 1 / seq 2K once optimizer state is partitioned — unpartitioned, the components sum far higher. Weights = 16 GB (bf16), gradients = 16 GB, optimizer state (8 bytes/param + 4 bytes fp32 master) = 96 GB before partitioning; with ZeRO-2 across 8 GPUs, optimizer state drops to 12 GB per rank. Realistic configuration: 4× A100 80GB or 2× H100 80GB.

**LoRA** of Llama 3.1 8B: ~28 GB of VRAM. Weights stay in fp16 (16 GB) but are frozen — no gradients, no optimizer state for them. The trainable adapter (rank 16, q/k/v/o targets) is ~17M params by the adapter formula in Technical details; its gradients and Adam state add well under 1 GB. Activations dominate — they still scale with the full forward pass. Realistic configuration: 1× RTX 6000 Ada (48 GB) or 1× A100 40GB.
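
The adapter really is tiny. Here is the adapter-size formula from the Technical details section as code — note it approximates every target projection as hidden × hidden, which overestimates the GQA k/v projections:

```ts
// LoRA adapter parameter count (sketch of the formula in Technical details;
// treats every target projection as hidden × hidden, an approximation).
function loraParams(layers: number, targets: number, rank: number, hidden: number): number {
  return layers * targets * 2 * rank * hidden; // 2 = the A and B low-rank factors
}

const p = loraParams(32, 4, 16, 4096);    // Llama 3.1 8B-ish: q/k/v/o targets, rank 16
console.log((p / 1e6).toFixed(1));        // "16.8" M trainable params
console.log(((p * 16) / 1e9).toFixed(2)); // "0.27" GB for fp32 adapter + grads + Adam moments
```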

**QLoRA** of Llama 3.1 8B: ~12 GB of VRAM. The base model is quantized to 4-bit NF4 (~4 GB), adapters and their state add ~1 GB, activations add ~5 GB. Realistic configuration: 1× RTX 4090 (24 GB) with comfortable headroom.

For 70B-class models the gap is more dramatic. Full FT of Llama 3.3 70B needs ~640 GB → 8× H100 SXM with FSDP. LoRA of Llama 3.3 70B needs ~150 GB → 2× H100 80GB. QLoRA of Llama 3.3 70B needs ~48 GB → fits on a single H100 80GB or RTX 6000 Pro Blackwell. This is why QLoRA single-handedly unblocked 70B fine-tuning for individual researchers and small teams in 2023, and why it remains the right default in 2026 unless you're specifically chasing the last 1–2% of quality that full FT delivers.

KV cache: the hidden VRAM cost of long context

For long-context inference, the KV cache is the single most underestimated VRAM line. The formula:

KV cache (bytes) = 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × batch × bytes_per_param

It scales linearly with sequence length and batch size, and both grow fast in production. Llama 3.3 70B serving 16 concurrent users at 32K context each: 16 × 32,768 ≈ 524K cached tokens, × 2 (K and V) × 80 layers × 8 KV heads × 128 head dim × 2 bytes ≈ 160 GB of KV cache *alone*, not counting weights or activations.
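
The same calculation in code (fp16 KV assumed, hence 2 bytes per element):

```ts
// KV cache for Llama 3.3 70B: 80 layers, 8 KV heads, head_dim 128, fp16 KV.
function kvGb70b(seqLen: number, batch: number): number {
  return (2 * 80 * 8 * 128 * seqLen * batch * 2) / 1024 ** 3;
}

console.log(kvGb70b(8192, 1).toFixed(1));   // "2.5" — one user at 8K context
console.log(kvGb70b(32768, 16).toFixed(0)); // "160" — 16 users at 32K each
```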

**Why GQA matters so much.** Llama 3 / 3.1 / 3.3 use Grouped-Query Attention with only 8 KV heads (vs 64 attention heads on the 70B). That's an 8× KV cache reduction vs a hypothetical "Llama 3 without GQA" model. For Llama 2 7B (no GQA, 32 KV heads), the KV cache for 8K context = 4 GB. For Llama 3 / 3.1 8B (GQA, 8 KV heads), the same calculation = 1 GB. Same param count, 4× smaller KV cache, identical inference quality.

**Multi-Query Attention and MLA take it further.** MQA models (Falcon 7B, StarCoder) use only 1 KV head — KV cache 32× smaller than a non-GQA 7B. DeepSeek V3 / R1's Multi-head Latent Attention compresses K and V into a 512-dim latent (vs 16,384-dim full K/V), giving ~16× compression on the KV cache. This is what makes a 671B MoE model survivable for inference.

**Paged attention (vLLM) doesn't reduce KV cache size — it eliminates fragmentation.** Without it, a serving engine has to over-allocate KV slots for the worst-case sequence length, wasting 60–80% of allocated KV memory on shorter prompts. Paged attention allocates KV in 16-token blocks on demand, so memory utilization approaches 100%. The calculator above uses the underlying formula — paged attention's win is throughput, not headline VRAM.

**Sliding-window attention** (Mistral, Gemma, some Phi models) caps the KV cache at the window size (4K for Mistral 7B v0.1, 4K for Gemma 2). This bounds long-context VRAM, but at the cost of forgetting tokens beyond the window — so it's great for chat, less great for document QA over long contexts.

Best GPU for running LLMs in 2026

The answer depends on three things: (a) what you're doing, (b) how much VRAM the workload needs, and (c) whether you're renting or buying.

**Hobbyist / local experimentation.** RTX 4090 (24 GB) for 7B–13B at fp16 or 30B–34B at Q4_K_M. RTX 5090 (32 GB) for 70B at Q3_K_M or 32B at Q5_K_M. RTX 3090 is still the price/performance king on the used market for 24GB. Apple Silicon Macs (M3 Max 128GB / M4 Max 128GB) are surprisingly competitive for batch-1 inference up to 70B — slow per token, but the model fits.

**Indie / startup serving.** RTX 6000 Ada (48 GB) or RTX 6000 Pro Blackwell (96 GB) for single-GPU fits up to 70B at AWQ 4-bit. L40S (48 GB) on cloud is similar, often cheaper per hour. Don't buy A100 40GB anymore — H100 / L40S / 6000 Ada are better dollar-per-throughput.

**Production serving.** H100 SXM (80 GB) is the workhorse. H200 SXM (141 GB) for long-context or fitting bigger models on a single GPU. B200 (180 GB) where available — ~2× the throughput of H100 for fp8 workloads, justifies the price premium for high-QPS production. AMD MI300X (192 GB) competitive on memory bandwidth + capacity, software story is improving but vLLM / TensorRT-LLM-equivalent maturity still lags by ~6 months.

**Frontier scale.** 8× H200 or 4× B200 nodes for serving 405B / DeepSeek V3 671B-class models. Hyperscaler pricing (AWS p5, GCP A3) makes single-instance economics painful — most teams at this scale rent from CoreWeave / Lambda / Foundry / Together / Anyscale who buy reserved capacity and resell at ~50% of list.

**Renting vs buying.** Break-even is ~3,000 hours/year of utilization. Below that, rent (Vast.ai for hobbyist, RunPod/Lambda for production). Above, buy and operate yourself if you have the team. For LoRA fine-tuning specifically, renting almost always wins because runs are short and bursty.

Renting vs buying GPUs for LLM workloads

A simple framework: **rent if your average daily utilization is below 30% AND your workload bursts are predictable**. Buy if you're saturating GPUs >60% of the time AND you have an on-call rotation and platform team.

Numeric break-even: an H100 SXM at $2.69/hr on RunPod ≈ $23,600/year if used 24/7. Buying an H100 SXM is ~$30,000 up front plus ~25%/year for power, cooling, networking, and datacenter. Amortizing the capex over three years, owning costs ~$17,500/year — break-even around 70% utilization, dropping toward 50% in later years as the capex amortizes.
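
One way to reproduce the ~70% figure, with the straight-line three-year amortization stated above as the key assumption:

```ts
// Rent-vs-buy break-even for one H100 SXM (illustrative; 3-year amortization assumed).
const rentPerHour = 2.69;
const hoursPerYear = 8760;
const capex = 30_000;
const opexRate = 0.25; // 25%/yr: power, cooling, networking, datacenter

const rentFullTimePerYear = rentPerHour * hoursPerYear;  // ≈ $23,600 at 100% utilization
const ownPerYear = capex / 3 + capex * opexRate;         // ≈ $17,500

console.log((ownPerYear / rentFullTimePerYear).toFixed(2)); // "0.74" → break-even near 70–75% utilization
```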

For most teams, **renting wins for the first 12–18 months**. You don't know your workload yet, your team isn't set up to operate on-call infrastructure, and the rental market gives you the option to upgrade to whatever's newest (H200 → B200 → next generation) without writing off owned hardware.

**The hidden costs of buying** are bigger than people expect. A single H100 server (8× H100 SXM) draws ~10kW continuous, needs 120+ amp service and dedicated cooling, and has a 6–12 month lead time for purchase. You'll need a colocation deal, network engineering for InfiniBand or 400GbE, a hardware failure replacement plan, and people who know how to triage GPU faults at 3am.

**The hidden costs of renting** are smaller but real. Spot instances get pre-empted at the worst possible moment in a long training run. On-demand pricing is 2–3× spot. Marketplace providers (Vast.ai, RunPod community) have variable supply — the cheap GPUs disappear when demand spikes (typically right before LLM benchmark releases). Egress fees on hyperscalers add up if you're shipping checkpoints out.

**The right strategy for 95% of teams.** Rent during research and prototyping. Reserve capacity (CoreWeave, Lambda Labs reserved, hyperscaler savings plans) once you have steady production traffic. Buy hardware only when (a) you're running the same workload >60% of the time for >2 years and (b) you have a platform engineer on staff.

Frequently Asked Questions

How much VRAM do I need to run Llama 3.1 70B?

Depends on precision. fp16 ≈ 142 GB (won't fit any single GPU), Q8_0 ≈ 75 GB (fits 1× H100 80GB), AWQ 4-bit ≈ 40 GB (fits 1× RTX 6000 Ada 48GB or 1× A100 80GB), Q4_K_M ≈ 42 GB (same), Q3_K_M ≈ 35 GB (fits 1× RTX 6000 Pro Blackwell 96GB easily), Q2_K ≈ 26 GB (fits 1× RTX 5090 32GB). Add 2–5 GB for KV cache at 8K context. The calculator above gives the exact number for your configuration.

Can I run a 70B model on a single RTX 4090?

Only with CPU offload. Even at Q2_K the weights alone are ~26 GB — slightly over the 4090's 24 GB — so you need llama.cpp's --n-gpu-layers to push a few layers to CPU RAM, plus short context to keep the KV cache small. For practical 70B serving on consumer hardware, you want a 4090 + 3090 (48 GB combined), a single RTX 5090 (32 GB at Q3_K_S), or an RTX 6000 Pro Blackwell (96 GB).

What's the difference between GGUF Q4_K_M and AWQ?

GGUF Q4_K_M is llama.cpp's 4-bit format optimized for CPU + GPU mixed inference, with K-block quantization and selective higher-precision blocks for important weights ("M" = medium). AWQ is "Activation-aware Weight Quantization", a different 4-bit scheme optimized for GPU inference with vLLM, TensorRT-LLM, and ExLlamaV2. Quality is roughly equivalent (both within 1 perplexity point of fp16); AWQ tends to be faster on GPU because the kernels are GPU-optimized; Q4_K_M is the default for local llama.cpp / Ollama / LM Studio because of cross-platform support.

How much VRAM do I need to fine-tune Llama 3.1 8B?

Full fine-tune ≈ 85 GB (needs 4× A100 80GB with ZeRO-2). LoRA ≈ 28 GB (1× RTX 6000 Ada 48GB or 1× A100 40GB). QLoRA ≈ 12 GB (fits comfortably on a single 24 GB consumer GPU). Numbers are at batch size 1, seq length 2048, AdamW optimizer, gradient checkpointing on. Larger batches and longer sequences scale activations roughly linearly.

Why does long context use so much VRAM?

Because the KV cache scales linearly with sequence length and batch size. Llama 3.1 8B at 8K context with batch 1 has a KV cache of ~1 GB. At 128K context, that grows to ~16 GB — equal to the model weights themselves. Add concurrent users (batch 16) and you're at 256 GB of KV cache alone. The formula is `2 × n_layers × n_kv_heads × head_dim × seq × batch × bytes_per_kv` — see the calculator's "Show the math" for your specific numbers.

Is QLoRA worth the quality tradeoff vs full LoRA or full FT?

For most fine-tunes, yes. QLoRA hits 95–99% of full-FT quality on standard fine-tuning benchmarks (Alpaca, OpenOrca, etc.) at 5–10% of the VRAM. The compromises are real but small: marginally noisier gradients (4-bit base introduces quantization error), slightly worse loss curves, requires QLoRA-friendly optimizers (paged 8-bit AdamW). For domain adaptation, instruction tuning, or chat fine-tuning — QLoRA is the right default. For fine-tuning that pushes the base model into genuinely new capabilities (multilingual, novel reasoning), full LoRA on bf16 base sometimes pulls ahead.

Can I run DeepSeek V3 671B locally?

Yes if you have ~400 GB of VRAM total — typically 2× MI300X (192 GB each), 4× H200 (141 GB each, with NVLink), or 8× H100 (80 GB each). At Q4_K_M ≈ 380 GB. The MLA architecture compresses KV cache 16× vs standard attention, so context length doesn't blow up the budget the way it would for a non-MLA 671B model. The "Run DeepSeek R1 671B locally" preset on the calculator above shows the math.

What's the cheapest GPU for running LLMs?

For models under 13B: a used RTX 3090 ($600–900) is unbeatable on cost-per-VRAM. For 30B–70B at Q4_K_M: RTX 5090 (32 GB, ~$2,000) or used dual 3090s (48 GB combined). For renting: Vast.ai marketplace spot instances start at $0.20/hr for a 3090, $0.40/hr for a 4090, $0.85/hr for an A100 80GB, and $1.49/hr for an H100. For frontier models (405B, 671B), MI300X on TensorWave at ~$3.99/hr per GPU is often cheaper than H100 because of the larger 192 GB capacity.

Does CPU offloading actually work?

Yes, but with a throughput hit. With llama.cpp's `--n-gpu-layers`, some layers stay on GPU and the rest in CPU RAM, giving you 3–10 tokens/sec for a 70B at Q4_K_M on a 24GB consumer card vs 30–60 tokens/sec fully on GPU. For training, DeepSpeed ZeRO-Infinity offloads optimizer state to CPU and even NVMe; the calculator models this with the "CPU offload optimizer state" toggle, which can roughly halve training VRAM at the cost of significantly slower step time.

How much VRAM for batch inference at scale?

KV cache scales linearly with batch size. Llama 3.1 70B at fp16, 4K context, batch 1 ≈ 142 GB weights + 1.3 GB KV. At batch 32 ≈ 142 GB weights + 41 GB KV — total ~190 GB, which needs 3× H100 80GB or 2× H200 141GB. At batch 256 ≈ 470 GB total — needs 4× H200 or 3× B200. vLLM with paged attention squeezes more sequences into the same KV pool by avoiding fragmentation. Set batch size to your real concurrent serving load on the calculator above.

RTX 4090 vs RTX 5090 for LLMs — which to buy?

For LLM serving / fine-tuning, the 5090 is the clear winner — 32 GB vs 24 GB VRAM (33% more), 1.79 TB/s vs 1.0 TB/s memory bandwidth (1.79× — bandwidth is what bounds inference token throughput on quantized models), GDDR7 vs GDDR6X, and Blackwell adds native fp4 on top of the fp8 both cards already support. The 5090 fits Llama 3.3 70B at Q3_K_S where the 4090 cannot. The price premium ($1,999 vs $1,599 MSRP) is justified for any LLM workload. For pure gaming or small-model dev, the 4090 is still excellent.

Can I split a model across multiple consumer GPUs?

Yes. Tensor parallelism (vLLM's `--tensor-parallel-size 2`, llama.cpp's automatic split, exllamav2's `gpu_split`) splits each layer across GPUs. Two RTX 3090s in tensor parallel = 48 GB usable VRAM and ~80% the throughput of a single A6000. NVLink helps but isn't required (3090 supports NVLink, 4090/5090 do not — communication falls back to PCIe and adds 5–15% overhead per extra GPU). The calculator factors a 12% per-extra-GPU overhead into multi-GPU recommendations to be honest about this.

What's the difference between fp8 and int8?

Both are 1 byte per parameter, but they're fundamentally different. fp8 (E4M3 or E5M2) is a floating-point format with exponent — it preserves a wide dynamic range, so it's near-lossless on most LLMs without recalibration. int8 is integer with a per-tensor or per-channel scale factor — it requires calibration data and typically loses 0.5–1.5 perplexity points without QAT. For inference, fp8 is the modern preference on H100/H200/B200 (which have native fp8 tensor cores); int8 is older and primarily lives on consumer cards (4090, 5090) via bitsandbytes.

How accurate is this calculator?

Within 10% of measured nvidia-smi values for the configurations we've tested. Sources of variance: (1) Activation memory varies 5–25% depending on attention implementation and CUDA version. (2) Framework overhead varies 0.5–4 GB. (3) Quantization libraries (bitsandbytes, AutoGPTQ, exllamav2) have different workspace allocations. (4) Speculative decoding, prefix caching, and other vLLM features add memory not modeled here. (5) MoE expert prefetching can spike VRAM beyond steady-state. Always test with actual workloads before committing budget to hardware. The 5% safety margin in the headline number gives some cushion.

Why does my actual VRAM usage differ from the estimate?

Most common reasons: (1) You're running a different precision than configured (e.g. an fp16 base with int4-quantized LoRA adapters). (2) Your sequence length in production is longer than what you set on the slider — KV cache scales linearly with it. (3) Your production batch size is bigger — same effect. (4) The framework reserves a KV cache pool larger than one sequence (vLLM's default). (5) You're using flash-attn-3, which saves more activation memory than flash-attn-2. (6) Real workloads include speculative-decoding scratch buffers, embedding lookups, and RoPE caches that the calculator doesn't enumerate but that typically fit within the 1.5 GB overhead. If your actual usage differs by more than 25%, please email devzonetools@gmail.com with your config — we'll add the case to the calibration set.

Does this calculator save my data anywhere?

No. All math runs in your browser. Your inputs are encoded into the URL query string so you can share configurations, but nothing is sent to a server. There's no signup, no telemetry on inputs, no analytics on dropdown selections. Same privacy model as the LLM Cost Calculator and RAG Cost Calculator.

Are closed-source models (Claude, GPT, Gemini) supported?

No — they're served via API, not self-hosted, so VRAM isn't a useful concept for them. To estimate API costs for closed models, use the LLM Cost Calculator instead (links from related tools). For open-weight equivalents (Llama 3.1 405B vs GPT-4-class quality, DeepSeek R1 vs o1-class reasoning, Qwen 3 235B vs frontier) — those are all in the dropdown.

Can I add a model not in the dropdown?

Yes — pick "Custom model" in the model selector, then enter total params, layers, hidden dim, attention heads, KV heads (for GQA), vocab size, and (if MoE) experts. The math works for any transformer-style architecture. For non-transformer models (Mamba, RWKV, hybrid) the formulas are different and not currently supported.

How do I share a specific configuration with someone?

Click "Copy shareable link" at the bottom of the outputs panel. The URL encodes every input as compact query parameters (e.g. `?m=llama-3.3-70b&q=gguf-q4_k_m&u=inference&s=8192`). Paste into Slack, Reddit, X — clicking the link recreates your exact configuration. Shareable links don't expire and don't track who clicked them.

Related Tools