200 Tests Passed, Production Still Broke
It was a Friday afternoon.
I had just started using Claude Code to write tests. The results were incredible. I told it “write comprehensive tests for this module,” and 10 minutes later, 200 tests were ready.
Ran CI. All green. Coverage jumped from 45% to 89%.
I thought: Perfect, we can finally deploy with confidence.
Friday evening deployment to staging, ready for Monday release.
Saturday morning, a QA colleague doing UAT found the issue.
She added an iPhone and a pair of pre-order AirPods to the cart. At checkout, the system should apply the 15% discount only to the iPhone, because pre-order items don’t participate in promotions, as stated in the campaign rules.
But the system discounted both items.
Good thing it was staging, not production. But if the QA colleague hadn’t tested this exact scenario, the bug would have gone live.
Saturday afternoon, I went back to look at those 200 tests.
Search “pre-order”: 0 results. Search “mixed”: 0 results. Search “promotion exclusion”: 0 results.
200 tests, not a single one covered the “regular items + pre-order items mixed checkout” scenario.
Why?
Because AI guesses “what’s worth testing” based on code structure. It saw the discount calculation function, so it tested various discount percentages. It saw the cart logic, so it tested add, remove, update quantity.
But it didn’t know “pre-order + regular items mixed” was the most complained-about scenario last month. It didn’t know this promotion rule was added last-minute by marketing, just one inconspicuous if statement in the code. It didn’t know this same bug had already happened three months ago; it’s just that nobody noticed it back then.
That day I learned something: The problem wasn’t that AI wrote bad tests. The problem was AI tested the wrong things.
Core Framework: What vs How
This experience made me rethink AI testing entirely.
Eventually I figured it out: AI testing needs to be split into two layers.
What (what to test) = Human decides
- Which features need testing?
- Which scenarios are most important?
- What’s the priority order?
- Should we add more coverage where we broke last time?
The answers to these questions aren’t in the code.
They’re in business logic. In customer complaint records. In the last incident’s postmortem. In your head.
AI can’t see any of this.
How (how to test) = AI executes
- How should the test code be written?
- What assertions to use?
- How to cover edge cases?
- How to prepare test data?
AI is great at these. Give it a clear target, and it can write more comprehensive tests than you.
This entire article is about one thing: how to separate What from How.
Why What Can’t Be Given to AI
You might think: AI is so powerful now, why can’t it decide what to test?
Let me explain with a few real scenarios.
AI Can’t See Your Incident History
Last month, our checkout flow had a race condition under high concurrency. Two requests came in simultaneously, inventory was deducted twice, but only one order was created.
This incident took us three days to fix. We wrote a 15-page postmortem.
Where is this information? In Notion. In the #incident channel on Slack. In the on-call engineer’s memory.
Not in the code.
So when I asked AI to “test the checkout flow,” it tested normal checkout, tested insufficient inventory, tested payment failure.
But it didn’t specifically test “two requests checking out simultaneously,” because it didn’t know we’d had an incident there.
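For the record, here is roughly what that missing test looks like. This is a minimal sketch in Vitest-style TypeScript; `checkout`, `seedProductWithStock`, and `getStock` are made-up helper names for illustration, not our real code, and `checkout` is assumed to return a response object instead of throwing.

```typescript
import { test, expect } from "vitest";
// Hypothetical helpers, named for illustration only. `checkout` is assumed to
// return a response object (with a `status` field) rather than throwing.
import { checkout, seedProductWithStock, getStock } from "./test-helpers";

test("two simultaneous checkouts only deduct inventory once", async () => {
  const productId = await seedProductWithStock(1); // exactly one unit in stock

  // Fire both requests at the same time to provoke the race condition.
  const [a, b] = await Promise.all([
    checkout({ productId, qty: 1 }),
    checkout({ productId, qty: 1 }),
  ]);

  // Exactly one should succeed; the other should be rejected for
  // insufficient stock. Inventory must end at 0, never at -1.
  const succeeded = [a, b].filter((r) => r.status === 200);
  expect(succeeded).toHaveLength(1);
  expect(await getStock(productId)).toBe(0);
});
```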
AI Doesn’t Know Which Features Customers Use Most
Our product has 50 APIs.
But looking at analytics, 80% of traffic concentrates on 5 APIs: login, homepage, search, add to cart, checkout.
If these 5 APIs break, users feel it immediately. The other 45 could break and nobody might notice for a week.
AI doesn’t know this distribution. It only judges by code complexity—“this function is complex, should test it more.”
But complex doesn’t mean important. That complex function might be a report export used once a month.
AI Can’t See Specs Outside the Code
Let me give a specific example.
Our checkout API has a parameter called apply_coupon.
AI diligently tested:
- apply_coupon = true, has coupon, discount applied ✓
- apply_coupon = false, no discount ✓
- apply_coupon = null, defaults to no discount ✓
- apply_coupon = "invalid", returns error ✓
Technically perfect. All four cases passed.
But it missed one thing: when apply_coupon = true but the user’s account has no available coupon, the system should display “You have no available coupons” instead of returning a 400 error.
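If I sketch that missing test myself, it’s only a few lines. A Vitest-style sketch; the endpoint, the `api` client, and `loginAsUserWithoutCoupons` are placeholder names, not our real helpers.

```typescript
import { test, expect } from "vitest";
// Hypothetical helpers, named for illustration only.
import { api, loginAsUserWithoutCoupons } from "./test-helpers";

test("apply_coupon=true with no available coupons shows a friendly prompt", async () => {
  const token = await loginAsUserWithoutCoupons();

  const res = await api.post("/checkout", {
    token,
    body: { items: [{ sku: "IPHONE-15", qty: 1 }], apply_coupon: true },
  });

  // Per the "Optimize Checkout Experience" requirement: a normal response
  // with a friendly prompt, not a 400 error.
  expect(res.status).toBe(200);
  expect(res.body.message).toBe("You have no available coupons");
});
```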
Why didn’t it test this?
Because this behavior was written in a requirements doc from three months ago, titled “Optimize Checkout Experience.” The PM wrote one line: “When user attempts to use a non-existent coupon, display a friendly prompt.”
This line isn’t in the code. Not in the API docs. Not anywhere AI can read.
This is what I mean: AI can see code, but can’t see business context.
Why How Is Perfect for AI
Now that we’ve covered AI’s limitations, let’s talk about what it’s good at.
“How to test”—AI does this better than humans.
API Testing: AI’s Sweet Spot
Ask yourself: If AI could only help you test one thing, what would you choose?
My answer is API testing.
Why? Because API characteristics make it perfect for AI:
Clear inputs and outputs. What the request looks like, what the response should look like—it’s all in the API documentation. AI doesn’t need to guess.
Edge cases have rules. Required fields, format validation, length limits—these rules are explicit. AI can automatically derive all edge cases.
Success and failure are clear. 200 means success, 400 means client error, 500 means server error. There’s no “sort of successful” gray area.
My experience: API tests handed to AI barely need modification.
I tell it: “Here’s our User API, test CRUD and permission validation.”
It produces:
- Normal CRUD flows
- Missing required fields
- Format errors (email format, phone format)
- Insufficient permissions (regular user trying to delete someone else’s data)
- Edge cases (256-character name, extremely long email)
These cases would take a human 2 hours to write. AI does it in 5 minutes. And it doesn’t miss anything, because it derives the cases systematically.
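To make that concrete, here is the shape of the batch AI hands back. A sketch only: the `api`, `asAdmin`, and `asRegularUser` helpers, the routes, and the 255-character name limit are assumptions for illustration.

```typescript
import { describe, test, expect } from "vitest";
// Hypothetical API client and auth wrappers, named for illustration only.
import { api, asAdmin, asRegularUser } from "./test-helpers";

describe("User API", () => {
  test("creates a user with valid data", async () => {
    const res = await asAdmin(api).post("/users", {
      body: { name: "Alice", email: "alice@example.com" },
    });
    expect(res.status).toBe(201);
  });

  test("rejects a missing required field", async () => {
    const res = await asAdmin(api).post("/users", { body: { name: "Alice" } });
    expect(res.status).toBe(400); // email is required
  });

  test("rejects an invalid email format", async () => {
    const res = await asAdmin(api).post("/users", {
      body: { name: "Alice", email: "not-an-email" },
    });
    expect(res.status).toBe(400);
  });

  test("forbids a regular user from deleting someone else's data", async () => {
    const res = await asRegularUser(api).delete("/users/some-other-user-id");
    expect(res.status).toBe(403);
  });

  test("rejects a 256-character name at the boundary", async () => {
    const res = await asAdmin(api).post("/users", {
      body: { name: "a".repeat(256), email: "long@example.com" },
    });
    expect(res.status).toBe(400); // assuming the spec caps names at 255 characters
  });
});
```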
Unit Tests: Need Context
Unit tests are slightly harder than API tests.
Because AI needs to understand “why this function exists.”
Example: We have a function called calculateDiscount().
AI sees this function and tests:
- Input 100, 10% discount, output 90 ✓
- Input 0, output 0 ✓
- Input negative, throw error ✓
Technically correct. But no business meaning.
Because the real complexity of this function is:
- VIP customers get an extra 5% discount
- During the anniversary sale, an additional 15% off storewide
- But VIP discount and anniversary discount don’t stack
- Except for Black Card members, who can stack
- But Black Card members who are employees can’t use employee pricing and the Black Card discount together
AI doesn’t know these rules.
So I tell it first:
“This function calculates order discounts. Rules are:
1. Base discount determined by promotionType
2. VIP members get an extra 5%
3. VIP discount and campaign discount don’t stack, take the higher one
4. Black Card members are an exception, they can stack
5. Employees can’t use employee pricing and member discounts together
Please write tests for these rules.”
With context, AI’s tests go from “technically correct” to “business meaningful.”
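Here is the kind of test that difference produces. The signature `calculateDiscount(amount, { tier, promotionType, isEmployee })` and the import path are my assumptions for illustration; the point is that each test now encodes a business rule, not just an input/output pair.

```typescript
import { describe, test, expect } from "vitest";
// Assumed signature for illustration: calculateDiscount(amount, options) returns the final price.
import { calculateDiscount } from "../src/pricing";

describe("calculateDiscount business rules", () => {
  test("VIP and campaign discounts don't stack: take the higher one", () => {
    const vipOnly = calculateDiscount(1000, { tier: "VIP", promotionType: "none", isEmployee: false });
    const campaignOnly = calculateDiscount(1000, { tier: "regular", promotionType: "anniversary", isEmployee: false });
    const both = calculateDiscount(1000, { tier: "VIP", promotionType: "anniversary", isEmployee: false });
    // Rule 3: apply whichever discount is higher (the lower final price), never both.
    expect(both).toBe(Math.min(vipOnly, campaignOnly));
  });

  test("Black Card members can stack member and campaign discounts", () => {
    const campaignOnly = calculateDiscount(1000, { tier: "regular", promotionType: "anniversary", isEmployee: false });
    const blackCard = calculateDiscount(1000, { tier: "blackCard", promotionType: "anniversary", isEmployee: false });
    // Rule 4: stacking means the Black Card price is strictly lower than the campaign alone.
    expect(blackCard).toBeLessThan(campaignOnly);
  });

  test("employees can't combine employee pricing with member discounts", () => {
    const employeeOnly = calculateDiscount(1000, { tier: "regular", promotionType: "none", isEmployee: true });
    const vipOnly = calculateDiscount(1000, { tier: "VIP", promotionType: "none", isEmployee: false });
    const employeeVip = calculateDiscount(1000, { tier: "VIP", promotionType: "none", isEmployee: true });
    // Rule 5: the two never combine, so the result can't beat the better of the two alone.
    expect(employeeVip).toBeGreaterThanOrEqual(Math.min(employeeOnly, vipOnly));
  });
});
```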
E2E Tests: Must Be Split
E2E testing is where it’s easiest to make mistakes.
First time I asked AI to write E2E tests, I said: “Test the entire shopping flow, from login to checkout.”
It produced a 500-line test file.
Problems:
- Login, browsing, add to cart, checkout, all mixed together
- One moment testing normal flow, next moment testing out of stock, then back to normal flow with different payment methods
- Just reading and understanding this test took me 30 minutes
Worse was maintenance. Two weeks later, the login flow changed. This 500-line test had 47 places that needed changing.
Later I learned: Split it.
- Login flow → login.spec.ts
- Product browsing → browse.spec.ts
- Cart operations → cart.spec.ts
- Checkout flow → checkout.spec.ts
Each file, ask AI to write separately. Each time, give it only one clear scope.
Result? Each test file is clear, and when maintaining, you only need to modify the corresponding file.
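For a sense of what “one clear scope” looks like, here is a cart.spec.ts sketch in Playwright style. The routes and data-testid selectors are placeholders, and it assumes a baseURL is set in the Playwright config.

```typescript
import { test, expect } from "@playwright/test";

// Focused scope: cart operations only. Login and checkout live in their own spec files.
test.describe("cart operations", () => {
  test.beforeEach(async ({ page }) => {
    await page.goto("/products/iphone-15"); // placeholder route, resolved against baseURL
    await page.getByTestId("add-to-cart").click();
    await page.goto("/cart");
  });

  test("adds an item to the cart", async ({ page }) => {
    await expect(page.getByTestId("cart-item")).toHaveCount(1);
  });

  test("updates the quantity", async ({ page }) => {
    await page.getByTestId("qty-input").fill("2");
    await expect(page.getByTestId("qty-input")).toHaveValue("2");
  });

  test("removes an item", async ({ page }) => {
    await page.getByTestId("remove-item").click();
    await expect(page.getByTestId("empty-cart-message")).toBeVisible();
  });
});
```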
More importantly: scenarios with very different logic must never be in the same batch.
I once asked AI to simultaneously write “normal checkout flow” and “various error handling.”
The tests came out like this:
1. Test normal checkout
2. Test expired credit card
3. Test normal checkout but different payment method
4. Test insufficient balance
5. Test normal checkout but different discount code
6. Test API timeout
Normal and abnormal interleaved. Completely unreadable.
Now I separate them:
- First ask AI to write all variations of the normal flow (different payment methods, different discount codes, different shipping methods)
- After confirming no issues, then ask it to write error handling (payment failure, out of stock, API errors)
Two batches, clear logic, easy to maintain.
How to Separate What from How
Knowing principles isn’t enough. You need to execute.
Step 1: Before Asking AI to Write Tests, List “Critical Scenarios”
This is human work.
Ask yourself:
- What are the 3 most important scenarios for this module?
- Where did we break last time?
- What features do customers use most?
- If I could only test 5 things, what would they be?
Write down the answers.
Like this:
Checkout Module Critical Scenarios:
1. Normal checkout (credit card) - most common payment, 70% of orders
2. Regular items + pre-order items mixed - 3 complaints last month
3. VIP discount calculation - complex rules, error-prone
4. Inventory deduction under high concurrency - where last incident happened
5. Session expiring mid-checkout - frequently reported UX issue
This is your “What.” Everything on this list must have test coverage.
Step 2: Split Work, Give AI One Clear Scope at a Time
Don’t say: “Write tests for the checkout module.”
Say:
“Write tests for the checkout API, scenario is ‘regular items + pre-order items mixed checkout.’
Pre-order items cannot participate in promotional discounts. Regular items can use coupons.
Please test these situations:
1. Only regular items, apply coupon
2. Only pre-order items, cannot apply coupon
3. Mixed items, only regular items get discount
4. Mixed items, coupon amount exceeds regular items total”
A clear scope means AI doesn’t need to guess, and the tests it produces will be much more precise.
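The output should then map one-to-one to those four situations. A sketch of what I expect back, Vitest-style; the `api` client, the fixture builders, and response fields like `regular_items_subtotal` are illustrative assumptions, not our real API.

```typescript
import { describe, test, expect } from "vitest";
// Hypothetical API client and cart fixture builders, named for illustration only.
import { api, regularItems, preorderItems } from "./test-helpers";

describe("checkout: regular + pre-order mixed cart", () => {
  test("1. only regular items: coupon applies", async () => {
    const res = await api.post("/checkout", {
      body: { items: regularItems(2), apply_coupon: true },
    });
    expect(res.body.discount).toBeGreaterThan(0);
  });

  test("2. only pre-order items: coupon cannot apply", async () => {
    const res = await api.post("/checkout", {
      body: { items: preorderItems(1), apply_coupon: true },
    });
    expect(res.body.discount).toBe(0);
  });

  test("3. mixed cart: only regular items get the discount", async () => {
    const res = await api.post("/checkout", {
      body: { items: [...regularItems(1), ...preorderItems(1)], apply_coupon: true },
    });
    expect(res.body.discount).toBeGreaterThan(0);
    // The discount must be computed from the regular items only.
    expect(res.body.discount).toBeLessThanOrEqual(res.body.regular_items_subtotal);
  });

  test("4. mixed cart: coupon larger than the regular-items subtotal is capped", async () => {
    const res = await api.post("/checkout", {
      body: {
        items: [...regularItems(1), ...preorderItems(1)],
        apply_coupon: true,
        coupon_code: "BIG50", // hypothetical coupon worth more than the regular items
      },
    });
    expect(res.body.discount).toBeLessThanOrEqual(res.body.regular_items_subtotal);
  });
});
```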
Step 3: Feedback Must Be Specific Enough That AI Doesn’t Guess
AI writing problematic tests is normal. The key is how you give feedback.
Bad feedback:
> “This test is wrong, rewrite it.”
AI asks: what’s wrong? It doesn’t know, so it can only guess, and there’s a high chance it guesses wrong.
Good feedback:
> “Line 23’s assertion is wrong. Expected discount amount should be 85, not 80.
> Because VIP discount is 15%, 100 × 0.85 = 85.
> Also please add one case: VIP + anniversary discount don’t stack, should take the higher one.”
Specific enough that AI doesn’t need to guess, and it can fix it in one go.
My experience: with vague feedback, AI goes back and forth 3-4 times before getting it right. With specific feedback, it’s usually correct by the second try.
How to Validate AI’s Validation
This is the question Tech Leads and QA Leads ask me most.
“AI says tests passed, can I trust it?”
Answer: Depends on what you validated.
Three Levels of Validation
Let me explain with an example.
Say AI wrote 50 tests for your checkout module, all passing.
Level 1: Did the Tests Run
Did CI run? All green? Any flaky tests?
This is the most basic. This level can be fully automated, no humans needed.
But “ran” doesn’t mean “tested correctly.”
Level 2: Did the Tests Test the Right Things
Open those 50 tests and look:
- Are the assertions correct? (The test asserts the amount is 100, but is 100 actually the right number?)
- Is it testing what this function “should do”? (Or just what it “does”?)
- Are edge cases covered? (Zero, negative, extremely large amounts)
This level needs someone “who understands code” to review.
My experience: 15 minutes reviewing AI-written tests can save 3 hours of debugging.
Level 3: Do Tests Cover Business-Critical Paths
Go back to your “critical scenarios list”:
- Is “regular + pre-order mixed” covered?
- Is “high concurrency inventory deduction” covered?
- Is “VIP discount calculation” covered?
Even if those 50 tests are all technically correct, if they don’t cover these critical scenarios, the testing has failed.
This level needs someone “who understands the business” to judge.
Actual Process
What I do now:
- Human defines critical scenarios list (What)
- Split work, ask AI to write in batches (control scope)
- Review AI’s tests (confirm How is correct)
- Cross-check against critical scenarios list (confirm What is covered)
- Run tests (AI executes)
- Final judgment: can we ship (human decides)
Steps 1 and 6 are always human decisions.
The steps in between are human-AI collaboration.
This Isn’t Just About Testing
At this point, I want to say something bigger.
This article is about AI testing, but the underlying problem is more universal:
What should AI do? What should you do yourself? Where’s the boundary?
Testing is just one example. The answer: What stays with humans, How goes to AI.
This framework applies to many areas:
Writing code:
- What = what feature to build, what problem to solve → human decides
- How = how to implement, how to write the code → AI can help

Writing documentation:
- What = who’s reading, what message to convey → human decides
- How = how to organize paragraphs, how to phrase things → AI can help

Making decisions:
- What = whether to do this, what’s the priority → human decides
- How = how to execute, what are the steps → AI can help
The core skill of the AI era isn’t knowing how to use AI.
It’s knowing when to use it and when not to.
That’s judgment.
And judgment can only be built by yourself.