🌏 Read the Chinese version of this article
💡 Still evaluating whether to build RAG? Read this first: Not Every Company Should Build RAG—Here’s How to Know
You’ve Been Assigned to Build RAG
Your boss says: “We need an internal knowledge base search. Use RAG.”
You start researching.
After reading tutorials, it seems straightforward:
- Split documents into chunks
- Convert to vectors using an embedding model
- Store in a vector database
- When users ask questions, find relevant chunks, feed them to the LLM
Three days later, you have a demo.
Boss is happy: “Great, let’s launch next month.”
Then your nightmare begins.
Why RAG Demos Always Work, But Production Always Fails
According to a 2024 survey, 42% of RAG project failures were caused by data processing issues, not model issues.
Here’s a more sobering number: over 1,200 RAG research papers were published in 2024 alone, indicating this field is still rapidly evolving with no “standard answers.”
Demos succeed because:
- You used clean test data
- You only tested happy paths
- You knew what questions would be asked, so answers were “conveniently” in the context
Production fails because:
- Real data is messy, outdated, and contradictory
- Users ask questions you never anticipated
- No one told users “this thing makes mistakes”
Here are the 5 pitfalls you’ll hit in your first month.
Pitfall 1: Thinking Embedding Is Everything
The Symptom
You spend days choosing an embedding model.
OpenAI’s text-embedding-3-large? Open-source bge-large? Cohere’s embed-v3?
You compare benchmarks and pick the “best” one.
Then discover: retrieval results are still terrible.
The Real Problem: Chunking Strategy
The embedding model accounts for only 30% of RAG effectiveness.
The other 70% is chunking strategy.
NVIDIA ran a large-scale benchmark in 2024 testing 7 chunking strategies. Conclusion: there’s no universal strategy—different document types need different approaches.
Chunking Options You Need to Know
Fixed-size chunking
# Simplest, but most error-prone
chunk_size = 512
overlap = 50
# Problem: splits concepts in half
# "Customers can request refunds within 30" | "days of purchase"
# User asks about "refund deadline", both chunks might be missed
Pros: Simple, fast, predictable.
Cons: Completely ignores semantic boundaries and cuts important information apart.
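For reference, a fixed-size splitter is only a few lines. A minimal sketch (character-based for brevity; a real pipeline would count tokens with the same tokenizer as your embedding model):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size windows, ignoring semantic boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks(open("refund-policy.md", encoding="utf-8").read())
```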
Semantic chunking
# Splits based on semantic similarity:
# when embedding similarity between adjacent sentences drops below a threshold, split there
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embedding_model,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = splitter.split_text(document_text)
Pros: Preserves complete concepts.
Cons: Expensive. Every split requires running embeddings, so costs explode with large document volumes.
Document-aware chunking
# Splits based on document structure
# Markdown by headers
# PDFs by pages or sections
# Code by functions
# This is usually the best starting point
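As an illustration, a header-aware splitter for Markdown can be written without any libraries (a sketch that splits at H1/H2 headings; real document-aware splitters also handle PDFs, code, and nested sections):

```python
import re

def split_markdown_by_headers(text: str) -> list[str]:
    """Split a Markdown document at top-level and second-level headings."""
    sections = re.split(r"\n(?=#{1,2} )", text)
    return [section.strip() for section in sections if section.strip()]
```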
Practical Recommendations
- First ask: What do your documents look like?
  - Technical docs? Split by headers
  - Legal contracts? Split by clauses
  - Chat logs? Split by conversation turns
  - Mixed types? Need classification and separate processing
- Chunk size rules of thumb
  - Factoid questions ("What's the refund deadline?"): 256-512 tokens
  - Analytical questions ("What are this product's pros and cons?"): 1024+ tokens
  - Uncertain? Start with 512, adjust based on retrieval results
- Overlap matters
  - Industry recommendation: 10-20% overlap
  - For 500-token chunks, use 50-100 tokens overlap
  - Too little overlap loses information; too much wastes storage
- Don't use just one strategy
  - Research papers: semantic chunking
  - Financial reports: page-level chunking
  - Code: function-level chunking
  - Hybrid strategies usually perform best (see the sketch after this list)
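A hybrid setup does not need to be clever: route each document type to its own splitter and fall back to something safe. A minimal sketch reusing the splitters from the earlier sketches (the `doc_type` labels and the page-level splitter are assumptions about your pipeline):

```python
def split_by_pages(text: str) -> list[str]:
    """Placeholder page-level splitter; assumes form feeds mark page breaks."""
    return [page.strip() for page in text.split("\f") if page.strip()]

SPLITTERS = {
    "technical_doc": split_markdown_by_headers,  # by headers
    "financial_report": split_by_pages,          # page-level
}

def chunk_document(text: str, doc_type: str) -> list[str]:
    # Route by document type; fixed-size chunking is the fallback for everything else.
    splitter = SPLITTERS.get(doc_type, fixed_size_chunks)
    return splitter(text)
```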
Pitfall 2: Skipping Data Quality and Going Straight to Production
The Symptom
You dump all company documents into the vector database.
Confluence, SharePoint, Google Drive, Slack conversations, emails…
Index everything at once.
Result: RAG confidently gives you outdated, incorrect, and contradictory answers.
Garbage In, Confident Garbage Out
Traditional systems’ problem is “can’t find it.”
RAG’s problem is “finds the wrong thing, but with confidence.”
This is more dangerous than not finding anything.
User asks: “What’s our refund policy?”
RAG answers: “According to the 2019 policy document, the refund period is 14 days.”
But it was changed to 30 days in 2023.
The old document is still in the index, and because it’s more detailed, its embedding similarity might be higher.
Data Processing You Need to Do
1. Data Inventory
Before writing any code, answer these questions:
- How many data sources?
- Update frequency for each source?
- Is there a single source of truth?
- Does the same information exist in multiple versions?
2. Deduplication
# Not just identical documents
# Also handle "nearly identical" versions
# Method 1: MinHash + LSH (fast approximate dedup)
from datasketch import MinHash, MinHashLSH
# Method 2: Embedding similarity (more accurate but slower)
# Documents with similarity > 0.95, keep only the newest version
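A near-duplicate pass with MinHash + LSH is short with the datasketch library. A sketch (the 0.8 threshold and word-level tokens are assumptions to tune on your own corpus):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

documents = {  # example corpus: {doc_id: text}
    "refund-policy-2019.md": "Customers can request refunds within 14 days of purchase.",
    "refund-policy-2023.md": "Customers can request refunds within 30 days of purchase.",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% estimated Jaccard similarity
near_duplicates = []
for doc_id, text in documents.items():
    m = minhash_of(text)
    if lsh.query(m):                 # a near-identical document is already indexed
        near_duplicates.append(doc_id)
    else:
        lsh.insert(doc_id, m)
```

Anything flagged here still needs a decision about which version to keep; keeping only the newest, as noted above, is a reasonable default.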
3. Version Control
# Every chunk needs metadata
chunk_metadata = {
    "content": "Refund period is 30 days...",
    "source": "refund-policy.md",
    "version": "2.3",
    "last_updated": "2024-06-15",
    "deprecated": False,
}

# Filter during retrieval
results = vector_db.query(
    query_embedding,
    filter={"deprecated": False},
)
4. Regular Cleanup
This isn’t a one-time job.
- Set up pipelines to regularly scan for outdated documents
- Establish ownership—every document needs a maintainer
- If no one is willing to maintain it, that document shouldn’t be indexed
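The metadata from step 3 makes that regular scan straightforward to start. A minimal sketch (the 180-day cutoff and the shape of `all_chunks` are assumptions; tune the cutoff to how fast your documents actually go stale):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=180)  # assumption: flag anything untouched for ~6 months

def stale_sources(all_chunks: list[dict]) -> set[str]:
    """Return the sources whose last_updated is older than the cutoff."""
    cutoff = datetime.now() - STALE_AFTER
    return {
        chunk["source"]
        for chunk in all_chunks
        if datetime.fromisoformat(chunk["last_updated"]) < cutoff
    }
```

Send that list to each document's owner; if nobody claims a document, deindex it.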
Enterprise RAG Challenges
According to statistics, enterprises use an average of 112 SaaS applications to store content.
This means:
- Data formats vary wildly (PDF, Word, Notion, Confluence, Slack…)
- Permission management is complex (not everyone should see everything)
- No unified metadata standard
If no one uses your company’s Confluence, RAG won’t make things better.
RAG is an amplifier, not a fixer.
Pitfall 3: Not Defining What “Success” Looks Like
The Symptom
Boss asks: “How’s RAG performing since launch?”
You say: “Uh… users are using it.”
Boss: “Is it better than before?”
You: “Uh… probably?”
Without defined success metrics, you’ll never know if you’re improving.
Demo Success ≠ Production Ready
Demo success criteria:
- “Looks like it works”
- “Answers seem reasonable”
- “Boss nodded”
Production success criteria:
- What’s the retrieval accuracy?
- What’s the answer faithfulness?
- What percentage of answers are wrong?
- User satisfaction?
- How much better than the existing solution?
Three Levels of RAG Evaluation
1. Retrieval Layer: Is it finding the right things?
# Precision@K: How many of the top K results are relevant?
# Recall@K: How many of all relevant documents appear in top K?
# MRR (Mean Reciprocal Rank): Ranking of the first correct answer
# Example: Evaluate retrieval
def evaluate_retrieval(queries, ground_truth, k=5):
    precision_scores = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = retriever.get_top_k(query, k)
        hits = len(set(retrieved) & set(relevant_docs))
        precision_scores.append(hits / k)
    return sum(precision_scores) / len(precision_scores)
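Precision@K says nothing about where the first relevant result lands, which is what MRR captures. A companion sketch in the same style, reusing the hypothetical `retriever.get_top_k` from above:

```python
def evaluate_mrr(queries, ground_truth, k=5):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result."""
    reciprocal_ranks = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = retriever.get_top_k(query, k)
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant_docs), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```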
2. Generation Layer: Is it answering correctly?
# Faithfulness: Is the answer based on context without hallucination?
# Answer Relevancy: Is the answer on-topic?
# Hallucination Rate: Percentage of hallucinations
# Using RAGAS framework for evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)
But beware: According to benchmarks, RAGAS Answer Relevancy has only 17% accuracy in detecting hallucinations.
Why? Modern LLM hallucinations aren’t “off-topic answers”—they’re “answers that sound right but have incorrect details.”
3. End-to-End Layer: Overall effectiveness
- User task completion rate
- User satisfaction surveys
- A/B testing against existing solutions
Establish a Baseline
Before going live, you need to know what “current state” is:
## Current State Baseline
### Method: Direct Confluence Search
- Average time to find answer: 15 minutes
- Correct answer rate: 60%
- User satisfaction: 3.2/5
### Target: RAG System
- Average time to find answer: < 5 minutes
- Correct answer rate: > 80%
- User satisfaction: > 4.0/5
Without a baseline, you’ll never know if RAG is better than “just using Google.”
Pitfall 4: Underestimating Maintenance Costs
The Symptom
RAG goes live.
First week goes great.
Second week, users complain: “Why can’t it answer questions about the new product?”
You discover: The new product documents aren’t in the index yet.
RAG Isn’t “Set It and Forget It”
RAG requires ongoing maintenance:
1. Data Updates
# Question: How do new documents get added?
# - Manual upload? (Will be forgotten)
# - Auto-sync? (Need to build pipelines)
# - Who validates data quality?
# Question: How are old documents handled?
# - Auto-mark as deprecated?
# - Manual review and deletion?
# - How long to keep historical versions?
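If you choose auto-sync, the core of the pipeline is "detect what changed, re-index only that." A sketch, assuming a local folder of Markdown files and a hypothetical `index_document` helper that chunks, embeds, and upserts:

```python
import hashlib
from pathlib import Path

seen_hashes: dict[str, str] = {}  # path -> content hash from the previous sync run

def sync_folder(folder: str) -> None:
    for path in Path(folder).rglob("*.md"):
        content = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen_hashes.get(str(path)) != digest:  # new or modified document
            index_document(str(path), content)    # hypothetical: chunk + embed + upsert
            seen_hashes[str(path)] = digest
    # Deleted or deprecated files also need to be removed from the index (not shown).
```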
2. Model Updates
# When you switch embedding models...
# All documents need re-embedding
# This could take hours to days
# Cost estimate (for 1 million chunks of ~1,000 tokens each):
# OpenAI text-embedding-3-large at $0.13 per 1M tokens: ~$130
# Self-hosted embedding server: GPU time + labor
# Bigger question: Does performance change after model switch?
# Need to re-evaluate all metrics
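The arithmetic behind that estimate is worth keeping as a helper so you can rerun it for your own corpus (a sketch; ~1,000 tokens per chunk is an assumption, and $0.13 per 1M tokens is OpenAI's published price for text-embedding-3-large at the time of writing):

```python
def reembedding_cost(num_chunks: int,
                     avg_tokens_per_chunk: int = 1000,      # assumption
                     usd_per_million_tokens: float = 0.13   # text-embedding-3-large
                     ) -> float:
    """Rough cost, in USD, of re-embedding the whole corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

print(reembedding_cost(1_000_000))  # -> 130.0
```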
3. Monitoring
Production RAG “silently degrades.”
Unlike traditional systems that throw errors, RAG will:
- Gradually decrease answer quality (but still answers)
- Gradually increase hallucination rate (but users might not notice)
- Gradually increase latency (but no timeouts)
You need to monitor three layers of metrics:
# System layer
- Latency P50, P95, P99
- Throughput
- Error rate
# RAG layer
- Retrieval precision (periodic sampling evaluation)
- Faithfulness score (automated LLM evaluation)
- Percentage of "I don't know" responses
# Business layer
- User satisfaction
- Task completion rate
- Cost per query
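For the system layer, percentile latency is cheap to compute from the request log you already have. A minimal pure-Python sketch (the example latencies are made up):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for a monitoring dashboard."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [820, 870, 940, 990, 1100, 1250, 3900]  # per-request latencies
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```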
4. Costs
Post-launch costs are often higher than development:
| Item | One-time Cost | Monthly Maintenance |
|---|---|---|
| Vector DB (Pinecone Pro) | – | $70+ |
| Embedding API | – | By usage |
| LLM API | – | By usage |
| Data Processing Pipeline | Dev time | Maintenance time |
| Monitoring System | Setup time | Alert handling time |
If there’s no budget for ongoing maintenance, don’t start this project.
Pitfall 5: Not Managing User Expectations
The Symptom
Users think RAG is “the company’s internal ChatGPT.”
Ask anything, get answers, and all answers are correct.
Then users make decisions based on RAG’s incorrect answers.
Things go wrong.
RAG Will Make Mistakes
This isn’t a bug; it’s a feature.
According to research, even the best RAG systems struggle to exceed 95% faithfulness.
That means roughly 1 in every 20 answers may contain incorrect information.
And when RAG is wrong, it doesn’t say “I’m not sure.” It confidently tells you the wrong answer.
This is epistemic uncertainty: the model doesn’t know what it doesn’t know.
What You Should Do
1. Set expectations before launch
## Usage Guidelines
This system uses AI technology to answer questions and may produce errors.
- For important decisions, always verify against source documents
- If you find incorrect answers, please report [link]
- This system is NOT suitable for: legal advice, financial decisions, medical recommendations
2. Design honest UI
// ❌ Wrong: Like Google, just show the answer
<Answer text={response} />
// ✅ Right: Show sources so users can verify
<Answer text={response} />
<Sources>
  {chunks.map(chunk => (
    <SourceCard
      key={chunk.url}
      title={chunk.source}
      link={chunk.url}
      relevance={chunk.score}
    />
  ))}
</Sources>
// ✅ Better: Show confidence score
<ConfidenceIndicator score={0.72} />
<Disclaimer>
  AI-generated answer. Recommend verifying with source documents.
</Disclaimer>
3. Design Fallbacks
# When confidence score is too low, don't force an answer
if confidence_score < 0.5:
    return ("I couldn't find enough information to answer this. You can try:\n"
            "1. Rephrasing your question\n"
            "2. Searching the knowledge base directly [link]\n"
            "3. Contacting [owner]")

# When retrieval results are empty
if len(retrieved_chunks) == 0:
    return "This question may be outside my knowledge scope."
4. Collect Feedback
# Every answer needs a feedback mechanism
# "Was this answer helpful?"
# "Was this answer correct?"
# Use this feedback to:
# - Find common error patterns
# - Identify knowledge base gaps
# - Adjust retrieval strategy
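The mechanics can stay simple: log every rating next to the query, the answer, and the retrieved chunks, so bad answers can be traced back to bad chunks later. A minimal sketch (the JSONL path and record fields are assumptions):

```python
import json
import time

def log_feedback(query: str, answer: str, chunk_ids: list[str],
                 helpful: bool, correct: bool | None = None,
                 path: str = "feedback.jsonl") -> None:
    record = {
        "ts": time.time(),
        "query": query,
        "answer": answer,
        "chunk_ids": chunk_ids,  # trace bad answers back to the chunks that caused them
        "helpful": helpful,      # "Was this answer helpful?"
        "correct": correct,      # None until someone verifies against the source
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```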
“I Don’t Know” Is Better Than “Making Stuff Up”
Good RAG systems recognize their own limits.
Testing approach:
## Hallucination Testing
1. Ask a question not in the knowledge base
- Expected: "I couldn't find relevant information"
- Failure: Makes up an answer
2. Ask with false premise ("I heard your refund period is 7 days?")
- Expected: "According to policy, refund period is 30 days, not 7 days"
- Failure: "Yes, refund period is 7 days"
3. Ask something requiring inference (answer not directly in documents)
- Expected: "I don't have enough information to conclude"
- Failure: Draws conclusions from incomplete information
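These checks are easy to turn into a small regression suite that runs before every index or model change. A sketch, assuming a hypothetical `rag_answer(question)` entry point and crude keyword checks (a stricter version would use an LLM judge):

```python
ADVERSARIAL_CASES = [
    # (question, phrases the answer should contain, phrases it must not contain)
    ("<a question your knowledge base does not cover>", ["couldn't find"], []),
    ("I heard your refund period is 7 days?", ["30 days"], ["refund period is 7 days"]),
]

def run_hallucination_suite() -> None:
    for question, must_have, must_not_have in ADVERSARIAL_CASES:
        answer = rag_answer(question)  # hypothetical: your end-to-end RAG pipeline
        assert all(p.lower() in answer.lower() for p in must_have), f"missing expected phrase: {question!r}"
        assert not any(p.lower() in answer.lower() for p in must_not_have), f"possible hallucination: {question!r}"
```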
Summary: First Month Checklist
Week 1: Data Inventory
- List all data sources
- Evaluate each source's quality and update frequency
- Identify duplicate and outdated content
- Decide who's responsible for data quality

Week 2: Chunking Strategy
- Choose chunking strategy based on document types
- Set chunk size and overlap
- Test retrieval effectiveness of different strategies
- Document your choices (you'll need to adjust later)

Week 3: Evaluation Framework
- Establish baseline metrics
- Prepare test dataset (with ground truth)
- Set up retrieval and generation evaluation metrics
- Define what "success" means

Week 4: Launch Prep
- Set up data update pipeline
- Set up monitoring system
- Write usage guidelines
- Design feedback mechanism
- Manage user expectations
Final Thoughts
RAG isn’t hard.
What’s hard is:
- Admitting your data isn’t as clean as you thought
- Admitting your system will make mistakes
- Admitting maintenance requires ongoing investment
These aren’t technical problems—they’re organizational problems.
If no one at your company uses Confluence,
If no one is willing to maintain the knowledge base,
If there's no budget for ongoing maintenance,
Don’t build RAG.
Fix the fundamentals first.
RAG is an amplifier. If your knowledge management is good, RAG makes it better. If your knowledge management is bad, RAG makes it worse—and more confidently wrong.
The choice is yours.