🌏 Read the Chinese version of this article
💡 Still evaluating whether to build RAG? Read this first: Not Every Company Should Build RAG—Here’s How to Know
You’ve Been Assigned to Build RAG
Your boss says: “We need an internal knowledge base search. Use RAG.”
You start researching.
After reading tutorials, it seems straightforward:
- Split documents into chunks
- Convert to vectors using an embedding model
- Store in a vector database
- When users ask questions, find relevant chunks, feed them to the LLM
Three days later, you have a demo.
Boss is happy: “Great, let’s launch next month.”
Then your nightmare begins.
Why RAG Demos Always Work, But Production Always Fails
According to a 2024 survey, 42% of RAG project failures were caused by data processing issues, not model issues.
Here’s a more sobering number: over 1,200 RAG research papers were published in 2024 alone, indicating this field is still rapidly evolving with no “standard answers.”
Demos succeed because:
- You used clean test data
- You only tested happy paths
- You knew what questions would be asked, so answers were “conveniently” in the context
Production fails because:
- Real data is messy, outdated, and contradictory
- Users ask questions you never anticipated
- No one told users “this thing makes mistakes”
Here are the 5 pitfalls you’ll hit in your first month.
Pitfall 1: Thinking Embedding Is Everything
The Symptom
You spend days choosing an embedding model.
OpenAI’s text-embedding-3-large? Open-source bge-large? Cohere’s embed-v3?
You compare benchmarks and pick the “best” one.
Then discover: retrieval results are still terrible.
The Real Problem: Chunking Strategy
The embedding model accounts for only 30% of RAG effectiveness.
The other 70% is chunking strategy.
NVIDIA ran a large-scale benchmark in 2024 testing 7 chunking strategies. Conclusion: there’s no universal strategy—different document types need different approaches.
Chunking Options You Need to Know
Fixed-size chunking
# Simplest, but most error-prone
chunk_size = 512
overlap = 50
# Problem: splits concepts in half
# "Customers can request refunds within 30" | "days of purchase"
# User asks about "refund deadline", both chunks might be missed
Pros: Simple, fast, predictable.
Cons: Completely ignores semantic boundaries and cuts important information apart.
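For reference, a fixed-size splitter is only a few lines. A minimal sketch (character-based for brevity; a real pipeline would count tokens with the same tokenizer as your embedding model):

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size windows, ignoring semantic boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks(open("refund-policy.md", encoding="utf-8").read())
```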
Semantic chunking
# Splits based on semantic similarity:
# when embedding similarity between adjacent sentences drops below a threshold, split there
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embedding_model,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = splitter.split_text(document_text)
Pros: Preserves complete concepts.
Cons: Expensive. Every split requires running embeddings, so costs explode with large document volumes.
Document-aware chunking
# Splits based on document structure
# Markdown by headers
# PDFs by pages or sections
# Code by functions
# This is usually the best starting point
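As an illustration, a header-aware splitter for Markdown can be written without any libraries (a sketch that splits at H1/H2 headings; real document-aware splitters also handle PDFs, code, and nested sections):

```python
import re

def split_markdown_by_headers(text: str) -> list[str]:
    """Split a Markdown document at top-level and second-level headings."""
    sections = re.split(r"\n(?=#{1,2} )", text)
    return [section.strip() for section in sections if section.strip()]
```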
Practical Recommendations
- First ask: What do your documents look like?
  - Technical docs? Split by headers
  - Legal contracts? Split by clauses
  - Chat logs? Split by conversation turns
  - Mixed types? Need classification and separate processing
- Chunk size rules of thumb
  - Factoid questions ("What's the refund deadline?"): 256-512 tokens
  - Analytical questions ("What are this product's pros and cons?"): 1024+ tokens
  - Uncertain? Start with 512, adjust based on retrieval results
- Overlap matters
  - Industry recommendation: 10-20% overlap
  - For 500-token chunks, use 50-100 tokens overlap
  - Too little overlap loses information; too much wastes storage
- Don't use just one strategy
  - Research papers: semantic chunking
  - Financial reports: page-level chunking
  - Code: function-level chunking
  - Hybrid strategies usually perform best (see the sketch after this list)
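A hybrid setup does not need to be clever: route each document type to its own splitter and fall back to something safe. A minimal sketch reusing the splitters from the earlier sketches (the `doc_type` labels and the page-level splitter are assumptions about your pipeline):

```python
def split_by_pages(text: str) -> list[str]:
    """Placeholder page-level splitter; assumes form feeds mark page breaks."""
    return [page.strip() for page in text.split("\f") if page.strip()]

SPLITTERS = {
    "technical_doc": split_markdown_by_headers,  # by headers
    "financial_report": split_by_pages,          # page-level
}

def chunk_document(text: str, doc_type: str) -> list[str]:
    # Route by document type; fixed-size chunking is the fallback for everything else.
    splitter = SPLITTERS.get(doc_type, fixed_size_chunks)
    return splitter(text)
```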
Pitfall 2: Skipping Data Quality and Going Straight to Production
The Symptom
You dump all company documents into the vector database.
Confluence, SharePoint, Google Drive, Slack conversations, emails…
Index everything at once.
Result: RAG confidently gives you outdated, incorrect, and contradictory answers.
Garbage In, Confident Garbage Out
Traditional systems’ problem is “can’t find it.”
RAG’s problem is “finds the wrong thing, but with confidence.”
This is more dangerous than not finding anything.
User asks: “What’s our refund policy?”
RAG answers: “According to the 2019 policy document, the refund period is 14 days.”
But it was changed to 30 days in 2023.
The old document is still in the index, and because it’s more detailed, its embedding similarity might be higher.
Data Processing You Need to Do
1. Data Inventory
Before writing any code, answer these questions:
- How many data sources?
- Update frequency for each source?
- Is there a single source of truth?
- Does the same information exist in multiple versions?
2. Deduplication
# Not just identical documents
# Also handle "nearly identical" versions
# Method 1: MinHash + LSH (fast approximate dedup)
from datasketch import MinHash, MinHashLSH
# Method 2: Embedding similarity (more accurate but slower)
# Documents with similarity > 0.95, keep only the newest version
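A near-duplicate pass with MinHash + LSH is short with the datasketch library. A sketch (the 0.8 threshold and word-level tokens are assumptions to tune on your own corpus):

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

documents = {  # example corpus: {doc_id: text}
    "refund-policy-2019.md": "Customers can request refunds within 14 days of purchase.",
    "refund-policy-2023.md": "Customers can request refunds within 30 days of purchase.",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% estimated Jaccard similarity
near_duplicates = []
for doc_id, text in documents.items():
    m = minhash_of(text)
    if lsh.query(m):                 # a near-identical document is already indexed
        near_duplicates.append(doc_id)
    else:
        lsh.insert(doc_id, m)
```

Anything flagged here still needs a decision about which version to keep; keeping only the newest, as noted above, is a reasonable default.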
3. Version Control
# Every chunk needs metadata
chunk_metadata = {
    "content": "Refund period is 30 days...",
    "source": "refund-policy.md",
    "version": "2.3",
    "last_updated": "2024-06-15",
    "deprecated": False,
}

# Filter during retrieval
results = vector_db.query(
    query_embedding,
    filter={"deprecated": False},
)
4. Regular Cleanup
This isn’t a one-time job.
- Set up pipelines to regularly scan for outdated documents
- Establish ownership—every document needs a maintainer
- If no one is willing to maintain it, that document shouldn’t be indexed
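The metadata from step 3 makes that regular scan straightforward to start. A minimal sketch (the 180-day cutoff and the shape of `all_chunks` are assumptions; tune the cutoff to how fast your documents actually go stale):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=180)  # assumption: flag anything untouched for ~6 months

def stale_sources(all_chunks: list[dict]) -> set[str]:
    """Return the sources whose last_updated is older than the cutoff."""
    cutoff = datetime.now() - STALE_AFTER
    return {
        chunk["source"]
        for chunk in all_chunks
        if datetime.fromisoformat(chunk["last_updated"]) < cutoff
    }
```

Send that list to each document's owner; if nobody claims a document, deindex it.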
Enterprise RAG Challenges
According to statistics, enterprises use an average of 112 SaaS applications to store content.
This means:
- Data formats vary wildly (PDF, Word, Notion, Confluence, Slack…)
- Permission management is complex (not everyone should see everything)
- No unified metadata standard
If no one uses your company’s Confluence, RAG won’t make things better.
RAG is an amplifier, not a fixer.
Pitfall 3: Not Defining What “Success” Looks Like
The Symptom
Boss asks: “How’s RAG performing since launch?”
You say: “Uh… users are using it.”
Boss: “Is it better than before?”
You: “Uh… probably?”
Without defined success metrics, you’ll never know if you’re improving.
Demo Success ≠ Production Ready
Demo success criteria:
- “Looks like it works”
- “Answers seem reasonable”
- “Boss nodded”
Production success criteria:
- What’s the retrieval accuracy?
- What’s the answer faithfulness?
- What percentage of answers are wrong?
- User satisfaction?
- How much better than the existing solution?
Three Levels of RAG Evaluation
1. Retrieval Layer: Is it finding the right things?
# Precision@K: How many of the top K results are relevant?
# Recall@K: How many of all relevant documents appear in top K?
# MRR (Mean Reciprocal Rank): Ranking of the first correct answer
# Example: Evaluate retrieval
def evaluate_retrieval(queries, ground_truth, k=5):
    precision_scores = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = retriever.get_top_k(query, k)
        hits = len(set(retrieved) & set(relevant_docs))
        precision_scores.append(hits / k)
    return sum(precision_scores) / len(precision_scores)
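Precision@K says nothing about where the first relevant result lands, which is what MRR captures. A companion sketch in the same style, reusing the hypothetical `retriever.get_top_k` from above:

```python
def evaluate_mrr(queries, ground_truth, k=5):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result."""
    reciprocal_ranks = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = retriever.get_top_k(query, k)
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant_docs), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```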
2. Generation Layer: Is it answering correctly?
# Faithfulness: Is the answer based on context without hallucination?
# Answer Relevancy: Is the answer on-topic?
# Hallucination Rate: Percentage of hallucinations
# Using RAGAS framework for evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)
But beware: According to benchmarks, RAGAS Answer Relevancy has only 17% accuracy in detecting hallucinations.
Why? Modern LLM hallucinations aren’t “off-topic answers”—they’re “answers that sound right but have incorrect details.”
3. End-to-End Layer: Overall effectiveness
- User task completion rate
- User satisfaction surveys
- A/B testing against existing solutions
Establish a Baseline
Before going live, you need to know what “current state” is:
## Current State Baseline
### Method: Direct Confluence Search
- Average time to find answer: 15 minutes
- Correct answer rate: 60%
- User satisfaction: 3.2/5
### Target: RAG System
- Average time to find answer: < 5 minutes
- Correct answer rate: > 80%
- User satisfaction: > 4.0/5
Without a baseline, you’ll never know if RAG is better than “just using Google.”
Pitfall 4: Underestimating Maintenance Costs
The Symptom
RAG goes live.
First week goes great.
Second week, users complain: “Why can’t it answer questions about the new product?”
You discover: The new product documents aren’t in the index yet.
RAG Isn’t “Set It and Forget It”
RAG requires ongoing maintenance:
1. Data Updates
# Question: How do new documents get added?
# - Manual upload? (Will be forgotten)
# - Auto-sync? (Need to build pipelines)
# - Who validates data quality?
# Question: How are old documents handled?
# - Auto-mark as deprecated?
# - Manual review and deletion?
# - How long to keep historical versions?
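If you choose auto-sync, the core of the pipeline is "detect what changed, re-index only that." A sketch, assuming a local folder of Markdown files and a hypothetical `index_document` helper that chunks, embeds, and upserts:

```python
import hashlib
from pathlib import Path

seen_hashes: dict[str, str] = {}  # path -> content hash from the previous sync run

def sync_folder(folder: str) -> None:
    for path in Path(folder).rglob("*.md"):
        content = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen_hashes.get(str(path)) != digest:  # new or modified document
            index_document(str(path), content)    # hypothetical: chunk + embed + upsert
            seen_hashes[str(path)] = digest
    # Deleted or deprecated files also need to be removed from the index (not shown).
```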
2. Model Updates
# When you switch embedding models...
# All documents need re-embedding
# This could take hours to days
# Cost estimate (for 1 million chunks of ~1,000 tokens each):
# OpenAI text-embedding-3-large at $0.13 per 1M tokens: ~$130
# Self-hosted embedding server: GPU time + labor
# Bigger question: Does performance change after model switch?
# Need to re-evaluate all metrics
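The arithmetic behind that estimate is worth keeping as a helper so you can rerun it for your own corpus (a sketch; ~1,000 tokens per chunk is an assumption, and $0.13 per 1M tokens is OpenAI's published price for text-embedding-3-large at the time of writing):

```python
def reembedding_cost(num_chunks: int,
                     avg_tokens_per_chunk: int = 1000,      # assumption
                     usd_per_million_tokens: float = 0.13   # text-embedding-3-large
                     ) -> float:
    """Rough cost, in USD, of re-embedding the whole corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

print(reembedding_cost(1_000_000))  # -> 130.0
```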
3. Monitoring
Production RAG “silently degrades.”
Unlike traditional systems that throw errors, RAG will:
- Gradually decrease answer quality (but still answers)
- Gradually increase hallucination rate (but users might not notice)
- Gradually increase latency (but no timeouts)
You need to monitor three layers of metrics:
# System layer
- Latency P50, P95, P99
- Throughput
- Error rate
# RAG layer
- Retrieval precision (periodic sampling evaluation)
- Faithfulness score (automated LLM evaluation)
- Percentage of "I don't know" responses
# Business layer
- User satisfaction
- Task completion rate
- Cost per query
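For the system layer, percentile latency is cheap to compute from the request log you already have. A minimal pure-Python sketch (the example latencies are made up):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for a monitoring dashboard."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [820, 870, 940, 990, 1100, 1250, 3900]  # per-request latencies
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```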
4. Costs
Post-launch costs are often higher than development:
| Item | One-time Cost | Monthly Maintenance |
|---|---|---|
| Vector DB (Pinecone Pro) | – | $70+ |
| Embedding API | – | By usage |
| LLM API | – | By usage |
| Data Processing Pipeline | Dev time | Maintenance time |
| Monitoring System | Setup time | Alert handling time |
If there’s no budget for ongoing maintenance, don’t start this project.
Pitfall 5: Not Managing User Expectations
The Symptom
Users think RAG is “the company’s internal ChatGPT.”
Ask anything, get answers, and all answers are correct.
Then users make decisions based on RAG’s incorrect answers.
Things go wrong.
RAG Will Make Mistakes
This isn’t a bug; it’s a feature.
According to research, even the best RAG systems struggle to exceed 95% faithfulness.
That means roughly 1 in every 20 answers may contain incorrect information.
And when RAG is wrong, it doesn’t say “I’m not sure.” It confidently tells you the wrong answer.
This is epistemic uncertainty: the model doesn’t know what it doesn’t know.
What You Should Do
1. Set expectations before launch
## Usage Guidelines
This system uses AI technology to answer questions and may produce errors.
- For important decisions, always verify against source documents
- If you find incorrect answers, please report [link]
- This system is NOT suitable for: legal advice, financial decisions, medical recommendations
2. Design honest UI
// ❌ Wrong: Like Google, just show the answer
<Answer text={response} />
// ✅ Right: Show sources so users can verify
<Answer text={response} />
<Sources>
  {chunks.map(chunk => (
    <SourceCard
      key={chunk.url}
      title={chunk.source}
      link={chunk.url}
      relevance={chunk.score}
    />
  ))}
</Sources>
// ✅ Better: Show confidence score
<ConfidenceIndicator score={0.72} />
<Disclaimer>
  AI-generated answer. Recommend verifying with source documents.
</Disclaimer>
3. Design Fallbacks
# When confidence score is too low, don't force an answer
if confidence_score < 0.5:
    return ("I couldn't find enough information to answer this. You can try:\n"
            "1. Rephrasing your question\n"
            "2. Searching the knowledge base directly [link]\n"
            "3. Contacting [owner]")

# When retrieval results are empty
if len(retrieved_chunks) == 0:
    return "This question may be outside my knowledge scope."
4. Collect Feedback
# Every answer needs a feedback mechanism
# "Was this answer helpful?"
# "Was this answer correct?"
# Use this feedback to:
# - Find common error patterns
# - Identify knowledge base gaps
# - Adjust retrieval strategy
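The mechanics can stay simple: log every rating next to the query, the answer, and the retrieved chunks, so bad answers can be traced back to bad chunks later. A minimal sketch (the JSONL path and record fields are assumptions):

```python
import json
import time

def log_feedback(query: str, answer: str, chunk_ids: list[str],
                 helpful: bool, correct: bool | None = None,
                 path: str = "feedback.jsonl") -> None:
    record = {
        "ts": time.time(),
        "query": query,
        "answer": answer,
        "chunk_ids": chunk_ids,  # trace bad answers back to the chunks that caused them
        "helpful": helpful,      # "Was this answer helpful?"
        "correct": correct,      # None until someone verifies against the source
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```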
“I Don’t Know” Is Better Than “Making Stuff Up”
Good RAG systems recognize their own limits.
Testing approach:
## Hallucination Testing
1. Ask a question not in the knowledge base
- Expected: "I couldn't find relevant information"
- Failure: Makes up an answer
2. Ask with false premise ("I heard your refund period is 7 days?")
- Expected: "According to policy, refund period is 30 days, not 7 days"
- Failure: "Yes, refund period is 7 days"
3. Ask something requiring inference (answer not directly in documents)
- Expected: "I don't have enough information to conclude"
- Failure: Draws conclusions from incomplete information
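These checks are easy to turn into a small regression suite that runs before every index or model change. A sketch, assuming a hypothetical `rag_answer(question)` entry point and crude keyword checks (a stricter version would use an LLM judge):

```python
ADVERSARIAL_CASES = [
    # (question, phrases the answer should contain, phrases it must not contain)
    ("<a question your knowledge base does not cover>", ["couldn't find"], []),
    ("I heard your refund period is 7 days?", ["30 days"], ["refund period is 7 days"]),
]

def run_hallucination_suite() -> None:
    for question, must_have, must_not_have in ADVERSARIAL_CASES:
        answer = rag_answer(question)  # hypothetical: your end-to-end RAG pipeline
        assert all(p.lower() in answer.lower() for p in must_have), f"missing expected phrase: {question!r}"
        assert not any(p.lower() in answer.lower() for p in must_not_have), f"possible hallucination: {question!r}"
```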
Summary: First Month Checklist
Week 1: Data Inventory
- List all data sources
- Evaluate each source's quality and update frequency
- Identify duplicate and outdated content
- Decide who's responsible for data quality

Week 2: Chunking Strategy
- Choose chunking strategy based on document types
- Set chunk size and overlap
- Test retrieval effectiveness of different strategies
- Document your choices (you'll need to adjust later)

Week 3: Evaluation Framework
- Establish baseline metrics
- Prepare test dataset (with ground truth)
- Set up retrieval and generation evaluation metrics
- Define what "success" means

Week 4: Launch Prep
- Set up data update pipeline
- Set up monitoring system
- Write usage guidelines
- Design feedback mechanism
- Manage user expectations
Final Thoughts
RAG isn’t hard.
What’s hard is:
- Admitting your data isn’t as clean as you thought
- Admitting your system will make mistakes
- Admitting maintenance requires ongoing investment
These aren’t technical problems—they’re organizational problems.
If no one at your company uses Confluence,
If no one is willing to maintain the knowledge base,
If there's no budget for ongoing maintenance,
Don’t build RAG.
Fix the fundamentals first.
RAG is an amplifier. If your knowledge management is good, RAG makes it better. If your knowledge management is bad, RAG makes it worse—and more confidently wrong.
The choice is yours.