RAG Cost Calculator
Estimate the full monthly cost of a RAG pipeline across four layers — Embedding, Storage, Retrieval, and Generation. Compare OpenAI, Anthropic, Pinecone, Weaviate, Qdrant, and 15+ other providers in your browser. No signup, no telemetry.
What it does
The four-layer cost model nobody else publishes
A clean Embedding · Storage · Retrieval · Generation breakdown with each layer's percentage of total. Shows you immediately whether to optimize the LLM, the vector DB, or the chunking strategy — instead of guessing.
17 embedding models, 10 vector DBs, 20+ LLMs
Mix-and-match any combination. OpenAI text-embedding-3-small + Pinecone Serverless + Claude Sonnet 4.6, or BGE-large self-hosted + pgvector + GPT-5 mini, or anything in between. The math runs the same across every combination.
"Show the math" for every layer
Each of the four layer cards expands to reveal the exact formula — chunking math (`numDocs × ceil((docTokens − overlap) / (chunkSize − overlap))`), embedding tokens × $/M, VDB storage × replicas, LLM input × output × per-million rates. Auditable, reproducible, copy-friendly.
12-month projection with growth and re-indexing baked in
Compound corpus growth into the next 12 months and visualize the cost ramp. Weekly or daily re-indexing shows up as bigger embedding spikes; growing storage compounds VDB costs.
"What If?" delta table
5–7 alternative configurations auto-generated and ranked by savings — "Switch LLM to Haiku 4.5: −$756/mo", "Reduce top-K from 5 to 3: −$496/mo". The single most shareable, highest-leverage view on the page.
Shared pricing source of truth
Reuses the same pricing JSON files that power the LLM Cost Calculator and Vector Database Cost Calculator. When we update Claude Opus 4.7 pricing, all three calculators reflect it instantly.
How to use RAG Cost Calculator
1. Pick a preset or set your corpus
Click one of the 6 preset chips ("Customer support chatbot", "Internal knowledge base", "E-commerce semantic search", etc.) for an instant realistic baseline. Or set the number of documents, average document size (in tokens, words, chars, or KB), chunk size, chunk overlap, re-indexing frequency, and growth rate from scratch.
2. Set your traffic
Drag the queries-per-day log slider (10 to 100M+) or click a quick pill (Hobby / Indie / Startup / Scale / Enterprise). Set the average query length, top-K retrieved chunks, and average answer length. The displayed queries-per-month figure updates instantly.
3. Pick your stack
Choose an embedding model (OpenAI, Cohere, Voyage, Google, Mistral, or one of 5 self-hosted options), a vector database (Pinecone Serverless / Pod, Weaviate, Qdrant Cloud, Zilliz, Chroma, MongoDB Atlas, or 3 self-hosted setups), an LLM for generation (Claude, GPT, Gemini, Llama, DeepSeek), and optionally a reranker (Cohere Rerank 3, Voyage, Jina).
4. Read the four-layer breakdown
The hero output shows monthly cost split across Embedding · Storage · Retrieval · Generation, each with its share of the total. Click "Show calculation" on any layer to see the exact formula — chunking math, per-token costs, vector-DB pricing model, and LLM input/output rates.
5. Use the "What If?" panel
A live table shows how your monthly cost would change if you swapped one input — cheaper LLM, smaller embedding, lower top-K, added caching, larger chunks, or a different vector database. Sorted by savings so the highest-leverage change is at the top.
6. Open Advanced settings to refine
Toggle quantization (float32 / int8 / binary), replication factor (1 / 2 / 3), the percentage of queries served from cache, region, and "Include DevOps overhead" for honest self-hosted comparisons.
7. Share or export your scenario
Copy the URL to share a fully reproducible configuration. Copy as Markdown for documentation or as CSV for procurement spreadsheets. URLs encode every input as compact single-letter keys.
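As an illustration of how that URL state can work, here is a minimal TypeScript sketch of round-tripping a configuration through compact query parameters. The single-letter keys and field names below are assumptions for the example, not the calculator's actual schema.

```typescript
// Minimal sketch of compact URL state, assuming illustrative single-letter keys
// (d, t, q, k, e, v, l); the calculator's real parameter schema is not documented here.
type RagConfig = {
  numDocs: number;        // d
  docTokens: number;      // t
  queriesPerDay: number;  // q
  topK: number;           // k
  embeddingModel: string; // e
  vectorDb: string;       // v
  llm: string;            // l
};

function encodeConfig(c: RagConfig): string {
  const p = new URLSearchParams({
    d: String(c.numDocs),
    t: String(c.docTokens),
    q: String(c.queriesPerDay),
    k: String(c.topK),
    e: c.embeddingModel,
    v: c.vectorDb,
    l: c.llm,
  });
  return `?${p.toString()}`;
}

function decodeConfig(qs: string): RagConfig {
  const p = new URLSearchParams(qs);
  return {
    numDocs: Number(p.get("d") ?? 0),
    docTokens: Number(p.get("t") ?? 0),
    queriesPerDay: Number(p.get("q") ?? 0),
    topK: Number(p.get("k") ?? 5),
    embeddingModel: p.get("e") ?? "text-embedding-3-small",
    vectorDb: p.get("v") ?? "pinecone-serverless",
    llm: p.get("l") ?? "gpt-5-mini",
  };
}

// A configuration round-trips through the URL with no server involved.
const url = encodeConfig({
  numDocs: 50_000, docTokens: 1_500, queriesPerDay: 100_000, topK: 5,
  embeddingModel: "text-embedding-3-small", vectorDb: "pinecone-serverless", llm: "gpt-5-mini",
});
console.log(url, decodeConfig(url));
```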
When to use this
Customer support chatbot scoping
50K docs at 1.5K tokens each, 100K queries/day, weekly re-indexing, GPT-5 mini for cost, 30% cache hit rate. The preset shows you Pinecone Serverless at the $50 minimum, ~$8 of embedding cost, and generation as the dominant line — exactly what to budget for.
Internal knowledge base for a 50-person team
5K docs at 4K tokens each, 500 queries/day, monthly re-indexing, Claude Sonnet 4.6, Qdrant Cloud. Total typically lands $30–60/mo, with generation as 70%+ of the bill. Useful for IT teams trying to budget against SaaS-style flat alternatives like Glean.
E-commerce semantic search at 1M SKUs
1M products at 200 tokens each, 1M queries/day, daily re-indexing, BGE-small self-hosted + Qdrant Cloud + GPT-5 mini, int8 quantization, 50% cache. Shows how aggressive caching plus int8 plus a cheap LLM keeps cost-per-query under $0.001 even at high QPS.
Legal research RAG over 500K cases
500K cases at 8K tokens each, 1K queries/day, monthly re-indexing, Voyage embed + Pinecone Pod + Claude Sonnet 4.6 + Cohere Rerank 3. Every input is the cost-driver: long answers, large top-K, and a reranker push generation + retrieval into the thousands per month.
Procurement & vendor selection
Switch the embedding model dropdown across all 17 options to see how `text-embedding-3-large` adds 2× storage cost over `text-embedding-3-small`, or how Voyage-3-lite at 512 dim cuts the storage line to a third versus 1536-dim defaults — without leaving the page.
Common errors & fixes
- Forgot to budget for re-indexing and your year-1 estimate is 30% short
- The Re-indexing Frequency dropdown is the lever most calculators ignore. Weekly re-indexing of a 100K doc corpus on text-embedding-3-small adds ~$16/mo of embedding cost; daily adds roughly 7× that, ~$110/mo. Set it honestly — it's usually one of your top-3 line items.
- Generation looks suspiciously expensive — culprit is top-K and chunk size
- Each query injects `top_K × chunk_size` context tokens into the LLM. With top_K=10 and chunk_size=1024 you're sending 10K tokens per query before the question. Drop to top_K=5 and chunk_size=512 and the generation line typically halves. Open the Generation card's "Show calculation" to see the exact prompt-token math.
- Self-hosted vector DB looks 10× cheaper, but you have no platform team
- Toggle "Include DevOps overhead" in Advanced settings. The default $500–$750/mo estimate covers patching, backups, monitoring, and incident response. Without an existing platform team, the real overhead is often higher — the "What If?" panel makes the trade-off explicit.
- Storage cost spikes when you switch from text-embedding-3-small (1536) to -3-large (3072)
- Storage scales linearly with dimensions. Doubling dimensions doubles VDB storage cost. The MTEB delta from 62.3 → 64.6 may not justify a 2× storage bill, especially at 10M+ vectors. Compare side-by-side with the dropdown.
Technical details
| Detail | Value |
| --- | --- |
| Pricing source | Vendor public pricing pages; verified May 3, 2026 across all 47 model entries |
| Cost engine | Pure functions per layer — chunking math, embedding cost, VDB pricing engine (8 pricing models), LLM input/output, optional reranker |
| Chunking formula | chunks_per_doc = ceil((doc_tokens − overlap) / (chunk_size − overlap)); total_chunks = num_docs × chunks_per_doc |
| Generation formula | (top_K × chunk_size + query_tokens + 200 system overhead) × queries_per_month × $/M input + answer_tokens × queries_per_month × $/M output |
| Caching model | A flat percentage applied to retrieval and generation — equivalent to assuming cache hits skip those layers entirely |
| Storage delegation | Storage and Retrieval delegate to the existing Vector Database Cost Calculator engine, so all 8 VDB pricing models (serverless, pod, per-dimension, cluster-hourly, compute-units, storage-plus-query, cluster-tier, self-hosted-VM) are honoured |
| Currency | USD only (USD listed prices used for non-USD vendors at 1:1) |
| Privacy | Zero server-side processing — all math runs in your browser; URL state encodes inputs, never identity |
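The chunking row above maps directly to code. A minimal sketch, using the same variable names as the formula; the example assumes zero overlap so it lines up with the ~3 chunks/doc figure in the customer-support walk-through further down:

```typescript
// chunks_per_doc = ceil((doc_tokens − overlap) / (chunk_size − overlap))
function chunksPerDoc(docTokens: number, chunkSize: number, overlap: number): number {
  return Math.ceil((docTokens - overlap) / (chunkSize - overlap));
}

// total_chunks = num_docs × chunks_per_doc
function totalChunks(numDocs: number, docTokens: number, chunkSize: number, overlap: number): number {
  return numDocs * chunksPerDoc(docTokens, chunkSize, overlap);
}

// 50K docs at 1,500 tokens, 512-token chunks, zero overlap: 3 chunks/doc, 150,000 chunks total.
console.log(totalChunks(50_000, 1_500, 512, 0));
```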
How RAG pricing actually works
Every RAG system has the same four cost layers, regardless of which models or databases you pick. Layer 1, Embedding, is what you pay to convert documents and queries into vectors — once for the initial corpus, then ongoing for re-indexing, growth, and per-query embeddings. Layer 2, Storage, is what you pay to keep those vectors in a database, billed as GB-month, dim-month, pod-hour, or cluster-tier depending on the provider. Layer 3, Retrieval, is the per-query cost of similarity search and (optionally) reranking. Layer 4, Generation, is the LLM cost of producing the final answer using the retrieved context.
Most teams blow their budgets because they only model layers 2 and 4 — the parts that have obvious calculators on vendor sites. Layer 1 looks negligible during prototyping (you embedded the docs once, what could it cost?) and balloons in production once you start re-indexing weekly. Layer 3 looks free on cluster-based providers and turns into a major line item the moment you switch to Pinecone Serverless at high QPS or add a reranker.
The single most important framing this calculator gives you: it's almost always Generation that dominates. For a typical RAG workload with 10K queries/day, top-K=5, and Claude Sonnet 4.6, the LLM line is 80–90% of total cost. That means every "how do I cut RAG costs?" question collapses to "should I use a smaller LLM?" — and the calculator above lets you A/B that in two clicks.
The four layers — deep dive
**Embedding (Layer 1).** Cost equals total tokens you embed multiplied by $/M from your provider. Total tokens = `num_docs × chunks_per_doc × chunk_size_tokens`. The amortized initial cost over 12 months is usually small (text-embedding-3-small is $0.02/M); the killer is re-indexing. Weekly re-indexing on a 100K-doc corpus = 4 × full corpus tokens × rate every month. Self-hosted models (BGE, Nomic, GTE) zero out this layer's API cost, but the GPU/CPU compute is real and not metered here.
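A minimal sketch of this layer's arithmetic, assuming text-embedding-3-small's $0.02/M rate and roughly 4.34 weekly re-indexes per month; corpus growth is omitted to keep it short:

```typescript
// Embedding layer: re-indexed corpus tokens plus per-query embedding tokens, priced per million.
const EMBED_RATE_PER_M = 0.02; // $/M tokens (text-embedding-3-small)

function embeddingMonthlyCost(opts: {
  totalChunks: number;
  chunkTokens: number;
  reindexesPerMonth: number; // ~4.34 weekly, ~30.4 daily, 1 monthly
  queriesPerMonth: number;
  queryTokens: number;
}): number {
  const corpusTokens = opts.totalChunks * opts.chunkTokens;
  const reindexTokens = corpusTokens * opts.reindexesPerMonth;
  const queryEmbedTokens = opts.queriesPerMonth * opts.queryTokens;
  return ((reindexTokens + queryEmbedTokens) / 1e6) * EMBED_RATE_PER_M;
}

// 150K chunks × 512 tokens, weekly re-indexing, 3M queries/mo at ~30 tokens each ≈ $8.47/mo.
console.log(embeddingMonthlyCost({
  totalChunks: 150_000, chunkTokens: 512,
  reindexesPerMonth: 4.34, queriesPerMonth: 3_000_000, queryTokens: 30,
}).toFixed(2));
```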
**Storage (Layer 2).** This is where the calculator delegates to our [Vector Database Cost Calculator](/tools/vector-database-cost-calculator) engine. Eight pricing models are supported: Pinecone Serverless (storage GB × $0.33 + RU/WU), Pinecone Pod (hourly × 730), Weaviate per-dim ($0.095 per 1M dimensions/mo), Qdrant Cloud cluster-hourly, Zilliz Compute Units, Chroma storage-plus-query, MongoDB Atlas tiers, and self-hosted EC2. Replication multiplies most of these. Quantization (int8, binary) cuts the storage portion 4× to 32×.
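As a sketch of just one of those eight pricing models (GB-priced serverless storage with a monthly minimum), the function below uses standard float32/int8/binary byte sizes and the Pinecone Serverless figures quoted in the worked example further down:

```typescript
// Storage layer for a GB-priced serverless index: bytes scale with dimensions,
// quantization, and replication; a monthly minimum applies at small sizes.
type Quantization = "float32" | "int8" | "binary";
const BYTES_PER_DIM: Record<Quantization, number> = { float32: 4, int8: 1, binary: 1 / 8 };

function storageMonthlyCost(opts: {
  vectors: number;
  dimensions: number;
  quantization: Quantization;
  replicas: number;
  pricePerGb: number;     // e.g. 0.33 for Pinecone Serverless
  monthlyMinimum: number; // e.g. 50
}): number {
  const bytes = opts.vectors * opts.dimensions * BYTES_PER_DIM[opts.quantization] * opts.replicas;
  return Math.max((bytes / 1e9) * opts.pricePerGb, opts.monthlyMinimum);
}

// 150K vectors × 1536 dims, float32, 2 replicas: raw storage is ~$0.60, so the $50 minimum applies.
console.log(storageMonthlyCost({
  vectors: 150_000, dimensions: 1536, quantization: "float32",
  replicas: 2, pricePerGb: 0.33, monthlyMinimum: 50,
}));
```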
**Retrieval (Layer 3).** Per-query reads against the VDB plus the optional reranker. Pod-based and cluster-based VDBs bundle reads into the cluster cost (so this layer is effectively free for them); usage-based VDBs (Pinecone Serverless, Chroma) have an explicit per-query line that scales linearly with queries-per-month. Cohere Rerank 3 is $2/1K calls — at 100K queries/day that's $6,000/mo for reranking alone, often the entire Retrieval layer.
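A sketch of the usage-based case, assuming the 2 RU/query default and $8.25 per million Read Units used in the worked example below, plus an optional per-1K reranker fee; cluster-based VDBs set the read portion to zero:

```typescript
// Retrieval layer: per-query read units (usage-based VDBs) plus an optional reranker.
function retrievalMonthlyCost(opts: {
  queriesPerMonth: number;
  readUnitsPerQuery: number;   // 0 for pod/cluster VDBs that bundle reads into the cluster price
  pricePerMReadUnits: number;  // e.g. 8.25 for Pinecone Serverless
  rerankerPer1k?: number;      // e.g. 2.00 for Cohere Rerank 3, 0.05 for Voyage Rerank 2
  cacheHitRate?: number;       // cache hits skip retrieval entirely
}): number {
  const miss = 1 - (opts.cacheHitRate ?? 0);
  const reads = ((opts.queriesPerMonth * opts.readUnitsPerQuery * miss) / 1e6) * opts.pricePerMReadUnits;
  const rerank = ((opts.rerankerPer1k ?? 0) * opts.queriesPerMonth * miss) / 1000;
  return reads + rerank;
}

// 3M queries/mo, 30% cache, no reranker ≈ $34.65; adding Cohere Rerank 3 at $2/1K adds ~$4,200/mo.
console.log(retrievalMonthlyCost({
  queriesPerMonth: 3_000_000, readUnitsPerQuery: 2, pricePerMReadUnits: 8.25, cacheHitRate: 0.3,
}).toFixed(2));
```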
**Generation (Layer 4).** The LLM call. Cost = `(top_K × chunk_size + query_tokens + system_prompt) × queries_per_month × $/M input` + `answer_tokens × queries_per_month × $/M output`. The single biggest controllable input here is the LLM choice. Claude Opus 4.7 is $15/$75 per M (input/output); Claude Haiku 4.5 is $0.80/$4 — a 19× input cost difference. For most RAG workloads where the LLM doesn't need frontier intelligence, picking the smaller model in the family is the highest-leverage cost cut available.
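The Layer-4 formula in code: a minimal sketch using the GPT-5 mini rates quoted on this page and the 200-token system-prompt overhead from the Technical details table.

```typescript
// Generation layer: prompt tokens = retrieved context + query + system prompt, priced per million.
function generationMonthlyCost(opts: {
  topK: number;
  chunkTokens: number;
  queryTokens: number;
  systemTokens: number; // ~200-token system overhead, per the Technical details table
  answerTokens: number;
  queriesPerMonth: number;
  inputPerM: number;    // $/M input tokens
  outputPerM: number;   // $/M output tokens
  cacheHitRate?: number;
}): number {
  const miss = 1 - (opts.cacheHitRate ?? 0);
  const promptTokens = opts.topK * opts.chunkTokens + opts.queryTokens + opts.systemTokens;
  const inputCost = ((promptTokens * opts.queriesPerMonth * miss) / 1e6) * opts.inputPerM;
  const outputCost = ((opts.answerTokens * opts.queriesPerMonth * miss) / 1e6) * opts.outputPerM;
  return inputCost + outputCost;
}

// top-K=5, 512-token chunks, 3M queries/mo, 30% cache, GPT-5 mini ($0.40/$1.60) ≈ $3,184/mo.
console.log(generationMonthlyCost({
  topK: 5, chunkTokens: 512, queryTokens: 30, systemTokens: 200, answerTokens: 250,
  queriesPerMonth: 3_000_000, inputPerM: 0.40, outputPerM: 1.60, cacheHitRate: 0.3,
}).toFixed(0));
```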
Customer support chatbot — worked example
**Workload:** 50,000 support docs at 1,500 tokens each, 100,000 queries/day, weekly re-indexing, 3% monthly growth, top-K=5, 250-token answers, 30% cache hit rate. **Stack:** OpenAI text-embedding-3-small + Pinecone Serverless + GPT-5 mini.
**Embedding:** 50K docs × ~3 chunks/doc = 150K chunks × 512 tokens = 76.8M tokens initial, $0.02/M = $1.54 one-time / $0.13 amortized. Weekly re-indexing: 4.34 × 76.8M × $0.02/M = $6.66/mo. Plus growth + query embeddings. **~$8/mo total.**
**Storage (Pinecone Serverless):** 150K vectors × 1536 dim × 4 bytes × 2 replicas = 1.84 GB × $0.33/GB = $0.60. Pinecone Serverless minimum: $50/mo. **$50/mo (minimum applies).**
**Retrieval:** Pinecone reads = 3M queries/mo × 2 RU = 6M RU × $8.25/M = $49.50, then × 0.7 (cache) = $34.65. No reranker. **~$35/mo.**
**Generation:** prompt = 5 × 512 + 30 + 200 = 2,790 tokens. 3M queries × 2,790 × 0.7 = 5.86B input tokens × $0.40/M = $2,344. Plus 3M × 250 × 0.7 = 525M output tokens × $1.60/M = $840. **~$3,184/mo.**
**Total: ~$3,277/mo, $39K/year, $0.0011 per query.** Generation is 97% of the bill, and GPT-5 mini is already one of the cheapest capable generators, so the remaining levers are top-K, chunk size, and cache: dropping top-K from 5 to 3 alone removes roughly $860/mo of prompt-token cost. Switching to int8 storage changes nothing because the Storage line is already at the $50 minimum. The "What If?" panel surfaces these trade-offs automatically.
Internal knowledge base — worked example
**Workload:** 5,000 internal docs at 4,000 tokens each, 500 queries/day, monthly re-indexing, top-K=5, 500-token answers. **Stack:** OpenAI text-embedding-3-small + Qdrant Cloud + Claude Sonnet 4.6.
**Embedding:** 5K docs × ~9 chunks/doc = 45K chunks × 768 tokens = 34.5M initial tokens. Amortized + monthly re-indexing + tiny growth ≈ $0.80/mo. **~$1/mo.**
**Storage (Qdrant Cloud, smallest tier):** 1 vCPU / 4 GB RAM at $0.05/hr × 730 hrs × 2 replicas = $73/mo. No per-query reads. **$73/mo.**
**Retrieval:** $0 (Qdrant bundles queries into the cluster cost). **$0/mo.**
**Generation:** prompt = 5 × 768 + 30 + 200 = 4,070 tokens. 15.2K queries/mo × 4,070 = 62M tokens × $3/M = $185. Plus 15.2K × 500 = 7.6M output × $15/M = $114. **~$300/mo.**
**Total: ~$374/mo, $4.5K/year, $0.025 per query.** A small knowledge base for 50 employees easily fits the budget. Generation dominates again at 80%; switching to Haiku 4.5 cuts the bill to ~$155/mo. For internal-use cases where the answer quality is good enough, Haiku is almost always the right pick.
How to reduce RAG costs by 50% or more
In rough order of impact:
**1. Switch to a smaller LLM in the same family.** This is almost always the highest-leverage change. Claude Opus 4.7 → Sonnet 4.6 cuts input cost 5× and output cost 5×. Sonnet → Haiku cuts another 4×. For most RAG generation tasks (extractive QA, summarization, factual lookup), the smaller model is indistinguishable in user experience.
**2. Add caching.** Even a 30% cache hit rate (typical for chatbots with FAQ-style traffic) cuts Retrieval + Generation by 30%. The Cache slider in Advanced settings shows the impact instantly. Build it with Redis, Vercel KV, or a simple in-memory LRU; a minimal sketch follows after this list.
**3. Reduce top-K.** Each retrieved chunk multiplies the prompt token count. Going from K=10 to K=3 cuts generation prompt tokens roughly 70%. Quality often doesn't suffer because the top-3 chunks usually contain the answer; the rest is noise the model ignores.
**4. Use int8 quantization.** Cuts VDB storage 75% with under 2% recall loss in most benchmarks. This matters most when storage dominates (large corpora, low query volume). Set in Advanced settings.
**5. Pick a smaller-dimension embedding model.** text-embedding-3-small (1536) → BGE-base (768) cuts storage 50% and many teams find recall is comparable. text-embedding-3-large (3072) → text-embedding-3-small (1536) is a free 50% cut for marginal recall loss.
**6. Drop re-indexing to monthly with incremental updates.** Weekly re-indexing of a static-ish corpus is wasteful. Most knowledge bases barely change week-to-week; re-index monthly and just upsert the diff weekly.
**7. Negotiate.** List prices are not what anyone pays at scale. Pinecone, Anthropic, OpenAI, and the major VDB vendors all offer 25–50% volume discounts above $5K/mo of spend.
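For lever 2, here is a minimal sketch of a normalized-query answer cache. It is an illustration only: a production cache would usually live in Redis or Vercel KV and match on embedding similarity rather than exact strings.

```typescript
// Exact-match query cache with LRU eviction. Every hit skips retrieval and generation
// entirely, which is exactly how the Cache slider models a hit.
class QueryCache {
  private store = new Map<string, string>();
  constructor(private maxEntries = 10_000) {}

  private key(query: string): string {
    return query.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(query: string): string | undefined {
    const k = this.key(query);
    const hit = this.store.get(k);
    if (hit !== undefined) {
      // Re-insert so frequently asked questions stay cached (LRU recency refresh).
      this.store.delete(k);
      this.store.set(k, hit);
    }
    return hit;
  }

  set(query: string, answer: string): void {
    const k = this.key(query);
    if (!this.store.has(k) && this.store.size >= this.maxEntries) {
      // Evict the least recently used entry (first key in the Map's insertion order).
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(k, answer);
  }
}

const cache = new QueryCache();
cache.set("How do I reset my password?", "Go to Settings, then Security, then Reset password.");
console.log(cache.get("how do i reset my  password?")); // hit: no embedding, retrieval, or LLM call
```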
Embedding model comparison for RAG
Quality (MTEB), cost, and dimensions are the three axes. Quality matters for retrieval relevance, cost matters for the embedding line, dimensions multiply your storage line.
**OpenAI text-embedding-3-small** is the default for a reason — $0.02/M, 1536 dim, MTEB 62.3. Cheap and good. Use unless you have a specific reason to switch.
**OpenAI text-embedding-3-large** is 6.5× more expensive ($0.13/M) and 2× the dimensions for a 2.3-point MTEB lift. Worth it for high-stakes search (legal, medical) where relevance matters more than cost. Storage cost penalty is real.
**Voyage-3** ($0.06/M, 1024 dim, MTEB 65.4) is the current price/quality leader. Quietly underrated. **Voyage-3-lite** at $0.02/M and 512 dim is the best storage-cost option with usable quality.
**Cohere embed-v3** ($0.10/M, 1024 dim, MTEB 64.5) is comparable to text-embedding-3-large at lower cost. **embed-multilingual-v3** is the right pick for non-English corpora.
**Self-hosted (BGE, GTE, Nomic, MiniLM)** zeros out the API cost but adds compute. BGE-large-en-v1.5 (1024 dim, MTEB 64.2) is the strongest open-source default. BGE-small-en-v1.5 (384 dim, MTEB 62.2) is what we use in the e-commerce preset because storage matters at 1M+ vectors. all-MiniLM-L6-v2 (384 dim, MTEB 56.5) is fine for prototypes but not production.
For most production RAG: start with text-embedding-3-small. Switch to Voyage-3 if you want a small quality bump. Switch to BGE-large self-hosted if your scale (>10M vectors) makes embedding cost meaningful and you have GPU infrastructure already.
Vector database comparison for RAG
For depth on this layer, see the dedicated [Vector Database Cost Calculator](/tools/vector-database-cost-calculator) and the [Vector Database Cost Comparison 2026](/blog/vector-database-cost-comparison-2026) blog post. Quick decision guide for RAG specifically:
**Prototype / under 1M vectors / low QPS:** Pinecone Serverless or Weaviate Cloud Standard — both are bound by minimum spend ($25–$50/mo) at this scale. Or free tiers from any provider.
**Steady-state production / 1M–50M vectors / >100K queries/day:** Qdrant Cloud is usually cheapest. Pinecone Pod-based becomes competitive at very high query volumes (~10K+ QPS) because pod pricing is flat per hour while Serverless Read Units keep scaling linearly with queries. MongoDB Atlas if you're already on Atlas.
**Large scale / 100M+ vectors:** Self-hosted Qdrant or Milvus on AWS r6g instances if you have a platform team. Pinecone Pod-based or Zilliz if not.
**Heavy filtered search (per-tenant, per-region):** Pinecone's filtered Read Units cost 5–10 RU vs. 1–2 for unfiltered, so a multi-tenant RAG app may quietly use 3–5× the Read Units the calculator shows. Qdrant's filter implementation is included in the cluster cost — no surprise scaling.
LLM comparison for RAG generation
For full pricing detail across 20+ models, see the dedicated [LLM Cost Calculator](/tools/llm-cost-calculator). For RAG specifically, three considerations matter: **price**, **context window**, and **how well the model uses retrieved context** (faithfulness).
**Price.** Claude Haiku 4.5 ($0.80/$4 per M) and GPT-5 mini ($0.40/$1.60) are the cheapest competent RAG generators. Most RAG workloads (extractive QA, summarization, fact-grounded answers) don't need anything more capable. Switching to either of these from a frontier model is the single biggest cost lever.
**Context window.** Most modern models handle 128K+ tokens, which is plenty for top-K up to ~50 with 1024-token chunks. Where it matters: Gemini 2.5 Flash and Gemini 3 Pro support 1M+ tokens, useful for very large top-K or document-heavy reasoning. The calculator warns you when prompt + answer exceeds the model's context.
**Faithfulness.** Frontier models (Opus 4.7, GPT-5, Gemini 3 Pro) are noticeably better at sticking to the retrieved context vs. confabulating. For high-stakes applications (medical, legal), the cost premium is justified. For consumer chatbots, the smaller models are fine — and a reranker improves faithfulness more than swapping to a bigger model would.
Hidden RAG costs nobody warns you about
The headline calculator above covers the four primary layers. Real production deployments incur costs the calculator does not include.
**Egress.** Pulling query results out to your application server costs $0.05–$0.09/GB on most clouds. Cross-region egress (your VDB in us-east, your app in eu-west) is significantly worse — co-locate them.
**Re-embedding when you change models.** Switching from text-embedding-3-small to text-embedding-3-large requires re-embedding every document — a one-time cost equal to the entire initial embedding. For a 10M-doc corpus that's tens of thousands of dollars on the embedding line, plus 100% Write Unit consumption against the entire VDB.
**Failed parses, retries, malformed inputs.** Real production traffic includes bot scrapers, garbage queries, and PDF parsing failures. Plan for ~5–10% overhead on the generation line just for retries and error paths.
**Evaluation runs.** Setting up RAGAS, TruLens, or custom eval pipelines means embedding + generation costs against a held-out test set every CI run. A 1,000-question eval suite run on every PR adds up.
**Observability tools.** LangSmith, Helicone, Phoenix — all have per-trace costs that scale with query volume. $0.001/trace × 100K queries/day = $3K/mo just for observability.
**Legal & compliance.** PII redaction passes, audit logging, encryption-at-rest premium tiers. Often invisible until your security review.
When to switch from managed to self-hosted
A reasonable framework: total managed cost > 3× the raw infrastructure cost AND your team has bandwidth to operate a stateful service. The 3× multiplier accounts for the loaded cost of an on-call rotation, monitoring, backups, and incident response.
At $500/mo of managed VDB cost, raw infra equivalent is ~$170/mo and self-hosted is rarely worth the operational investment. At $2,000/mo of managed cost, infra equivalent is ~$670/mo, and self-hosted starts making sense for teams that already run other stateful services. At $10,000/mo of managed cost, the infra cost is ~$3,300/mo and the savings clearly justify a dedicated platform team.
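That framework fits in a few lines. A sketch, with the page's $500–$750/mo DevOps estimate collapsed to an assumed $600/mo default:

```typescript
// The 3× rule of thumb: self-host only when the managed bill is more than 3× the raw
// infrastructure cost, the savings cover the operational overhead, and a team exists to run it.
function selfHostingWorthIt(opts: {
  managedMonthly: number;         // current managed VDB bill
  rawInfraMonthly: number;        // EC2/storage cost of an equivalent self-hosted cluster
  hasPlatformTeam: boolean;
  devopsOverheadMonthly?: number; // assumed $600/mo default (mid-range of the estimate above)
}): boolean {
  const overhead = opts.devopsOverheadMonthly ?? 600;
  const ratio = opts.managedMonthly / opts.rawInfraMonthly;
  const savings = opts.managedMonthly - (opts.rawInfraMonthly + overhead);
  return ratio > 3 && savings > 0 && opts.hasPlatformTeam;
}

console.log(selfHostingWorthIt({ managedMonthly: 500, rawInfraMonthly: 170, hasPlatformTeam: true }));      // false: below 3×, savings don't cover overhead
console.log(selfHostingWorthIt({ managedMonthly: 10_000, rawInfraMonthly: 3_300, hasPlatformTeam: true })); // true: ~$6,100/mo of net savings
```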
The other consideration is feature drift. Managed VDB providers ship new features (filtered search optimizations, hybrid search, sparse vector support, GPU acceleration) faster than you can keep up. The "save 60% by self-hosting" calculation looks worse if you also need to backport six months of upstream improvements every year.
For embedding models specifically, self-hosted (BGE, GTE, Nomic) only makes sense at very high embedding volumes — typically >100M tokens/month, which corresponds to corpora over ~1M documents with frequent re-indexing. Below that, OpenAI text-embedding-3-small at $0.02/M is too cheap to bother with self-hosting.
For LLMs, self-hosting Llama 3.3 70B on a vLLM cluster makes economic sense above ~10M tokens/day of generation; below that, the per-million pricing of GPT-5 mini and Haiku 4.5 wins on TCO.
Frequently Asked Questions
How much does it cost to run a RAG system?
- Wildly variable. A hobby RAG over a personal note collection (10K notes, 50 queries/day, BGE-small + pgvector + Haiku 4.5) costs ~$5–15/mo. A production customer-support chatbot (50K docs, 100K queries/day, text-embedding-3-small + Pinecone Serverless + GPT-5 mini) lands ~$1.5–3K/mo. An enterprise legal RAG (500K cases, 1K queries/day with reranking, Voyage + Pinecone Pod + Sonnet 4.6 + Cohere Rerank 3) hits $5–15K/mo. The calculator above runs the math for any of those scenarios — start with a preset.
What's the cheapest RAG stack in 2026?
- For a small workload (under 100K docs, under 1K queries/day): BGE-small-en-v1.5 (self-hosted embedding) + pgvector self-hosted + Claude Haiku 4.5 or GPT-5 mini. Total can be under $50/mo if you already have infra to run pgvector. For a larger workload where self-hosting isn't viable: text-embedding-3-small + Qdrant Cloud + Haiku 4.5. The calculator shows your specific number.
Is GPT-5 or Claude Opus 4.7 cheaper for RAG?
- GPT-5 ($2.50/$10 per M) is meaningfully cheaper than Claude Opus 4.7 ($15/$75 per M) — roughly 6× cheaper on input and 7.5× on output. But for most RAG workloads, neither is the right pick. GPT-5 mini ($0.40/$1.60) and Claude Haiku 4.5 ($0.80/$4) handle factual RAG generation well at a fraction of the price. Use the calculator to see the dollar delta on your workload.
How much does RAG cost per query?
- For most production RAG configurations, between $0.001 and $0.05 per query. The hero output of the calculator includes a "Cost / query" tile so you can see your number directly. Chatbot-style RAG with a small LLM lands at $0.001–0.005; deep-reasoning RAG with Opus 4.7 and a big top-K can reach $0.10–0.30 per query.
Should I use OpenAI or open-source embeddings for RAG?
- OpenAI text-embedding-3-small ($0.02/M, 1536 dim, MTEB 62.3) is the default — it's cheap, well-supported, and good. Open-source (BGE, Nomic, GTE) is worth it if (a) you process 100M+ tokens/month so embedding cost is material, or (b) you need data residency / on-prem and can't send text to OpenAI. For most teams, text-embedding-3-small is the right answer until you hit one of those constraints.
What's the cheapest vector database for RAG?
- It depends on scale. Under 1M vectors: free tiers of Pinecone, Weaviate, Qdrant, or Chroma. 1M–50M vectors at moderate QPS: Qdrant Cloud is typically the cheapest managed option. 50M+ vectors with a platform team: self-hosted pgvector or Qdrant on AWS. The dedicated [Vector Database Cost Calculator](/tools/vector-database-cost-calculator) compares all 10 providers side-by-side.
Does chunking strategy affect RAG cost?
- Yes — significantly. Smaller chunks mean more vectors (more storage), more chunks per query at the same top-K (more retrieval reads on usage-based providers), and more or fewer prompt tokens depending on how top-K interacts with chunk size. Larger chunks reduce vector count but reduce retrieval precision. A common sweet spot: 512–1024 token chunks with 50–100 token overlap. The calculator updates instantly when you adjust the chunk size slider.
How much does it cost to RAG over 1 million documents?
- For 1M documents at 2K tokens each (typical), with text-embedding-3-small + Pinecone Serverless + Sonnet 4.6 at 10K queries/day: roughly $200–400/mo on storage and embedding, $1.5–3K/mo on generation. Switching the LLM to Haiku 4.5 cuts the generation line by ~75%. Set the calculator to 1M docs and play with the dropdowns to see your number.
Why is generation usually 80%+ of my RAG bill?
- Because LLM input pricing is $0.40–$15/M tokens and you send a lot of tokens per query. With top-K=5 and chunk_size=512, every query injects 2,560 tokens of context plus the question and system prompt. At 10K queries/day on Sonnet 4.6, that's ~$2.5K/mo of input cost alone. Storage on most VDBs is $50–500/mo at the same scale — an order of magnitude smaller. The fix is almost always: smaller LLM, smaller top-K, or caching.
Can I run RAG for under $100/month?
- Yes, if (a) your corpus is small (under 100K docs), (b) your query volume is modest (under 1K/day), (c) you use a cheap LLM like Haiku 4.5 or GPT-5 mini, (d) you pick a free-tier VDB or pgvector, and (e) you use text-embedding-3-small or a self-hosted BGE. The "Personal AI assistant" preset above lands here. Going beyond ~10K queries/day on a frontier LLM blows the $100/mo budget no matter what.
What's the ROI of switching to a smaller LLM in RAG?
- Typically the highest single cost-saving move available. Claude Opus 4.7 → Sonnet 4.6 cuts the LLM bill 5×. Sonnet → Haiku cuts it another 4×. For factual RAG (extractive QA, summarization, knowledge-base search), the quality drop is small — often imperceptible to end users. The "What If?" panel in the calculator shows the exact dollar delta for your specific configuration. Test both and let the user feedback decide.
Does this calculator include reranking costs?
- Yes. The Reranker dropdown supports Cohere Rerank 3 ($2/1K queries), Voyage Rerank 2 ($0.05/1K), Jina Reranker (~$0.20/1K), and a self-hosted cross-encoder (compute-only). Reranking can dominate the Retrieval layer at high QPS — at 100K queries/day on Cohere Rerank 3 that's ~$6K/mo. Voyage is 40× cheaper for the same workload.
How accurate are these RAG cost estimates?
- Within 10% of vendor calculators in the scenarios we've tested, assuming you set the inputs realistically. Sources of variance: (1) RU/WU per query on Pinecone Serverless varies with filter complexity — we use 2 RU/query as a default. (2) Real query volume often has a long tail of expensive outliers. (3) Most providers offer enterprise discounts not reflected in list prices. Real-world bills typically run 10–25% higher than calculator estimates due to retries, egress, observability, and surprise capacity fees. Validate against vendor calculators before signing.
What about OpenAI Assistants / file_search built-in RAG cost?
- OpenAI Assistants v2 file_search bundles embedding, vector storage, and retrieval into a single bill. Storage is ~$0.10/GB/day (yes, per day — that's ~$3/GB/month, much higher than Pinecone's $0.33/GB/month). file_search retrieval is ~$2.50 per 1K calls. For most workloads this is more expensive than building your own RAG; the convenience may justify it for prototypes. This calculator does not currently model the bundled Assistants pricing — use the standalone-stack model and compare manually.
How do I monitor real RAG costs in production?
- Three places to watch: (1) your LLM provider dashboard (Anthropic Console, OpenAI Usage page, Vertex AI Billing) — usually the biggest line item. (2) Your vector DB provider dashboard (Pinecone Usage, Weaviate Console). (3) Your observability tool (LangSmith, Helicone, Phoenix) for per-trace cost attribution. Set monthly spend alerts at 80% of your projected budget. Re-run this calculator monthly with actual query volumes from your analytics to recalibrate.
Does this calculator save my data anywhere?
- No. All math runs in your browser. Your inputs are encoded into the URL so you can share configurations, but nothing is sent to a server. There's no signup, no telemetry on your inputs, no analytics on the dropdown selections. Same privacy model as the LLM Cost Calculator and Vector Database Cost Calculator.
Related Tools
LLM Cost Calculator
Compare token counts and API costs across GPT-5, Claude Opus 4.7, Gemini 3, and more — all in your browser.
Vector Database Cost Calculator
Compare monthly cost across Pinecone, Weaviate, Qdrant, Milvus, Chroma, MongoDB Atlas, and self-hosted pgvector. Quantization, replication, and DevOps overhead all baked in.
GPU VRAM Calculator for LLMs
Estimate GPU VRAM for running or fine-tuning open LLMs — Llama, Mistral, Qwen, DeepSeek, Gemma, Phi. Inference, LoRA, QLoRA, full FT, every quantization level, with the math shown.
JSON Formatter
Clean, minify, and validate JSON data structures.
Markdown Editor
Live preview markdown rendering engine.