How to Evaluate AI Projects: The Step Most Teams Skip
This article focuses on the technical side of project acceptance. Organizational politics and managing up are separate topics for another day.
The Question You Can’t Answer
Every team that adopts AI will eventually face this scenario:
Your boss asks: “We’ve invested three months in AI. What are the results?”
Your answer probably sounds like this:
“Well… people seem to like it… coding feels a bit faster… PM says it saves some time…”
Your boss nods, but you know he’s not satisfied. And you know you’re just rambling.
The problem isn’t that AI doesn’t work. The problem is you can’t prove it works.
And you can’t prove it because no one ever defined what “works” means.
Why AI Projects Are Particularly Hard to Evaluate
Traditional software projects have straightforward acceptance criteria:
- Feature complete? ✓
- Tests passing? ✓
- Users can access it? ✓
Done. Ship it.
AI projects are different. You’ve deployed Copilot, AI testing, a RAG knowledge base—then what?
- “People are using it” doesn’t mean “it’s effective”
- “Feels faster” doesn’t mean “is actually faster”
- “Everyone says it’s good” doesn’t mean “it’s worth the investment”
AI project outputs are fuzzy, gradual, and hard to isolate. This makes evaluation extremely difficult.
But difficult doesn’t mean optional.
The Harsh Reality: You Have No Baseline
Let me ask you a few questions:
Before adopting AI—
- How many working days did a medium-sized feature take from requirements to deployment?
- How many person-hours did a full QA cycle require?
- How long did it take a PM to write a PRD from first draft to final?
- How many bugs escaped to production each sprint?
If your answer is “roughly…” or “probably…” or “we didn’t really track that…”
Then you have no baseline.
No baseline means no comparison. No comparison means any claim of “faster” or “better” is just subjective feeling, not fact.
This isn’t AI’s problem. This is our industry’s long-standing bad habit: not measuring, not recording, going by gut feel.
AI just exposed the problem.
Why Does This Happen?
Because measuring is tedious, and the short-term benefits aren’t visible.
No one gets praised for “tracking development hours.” When projects are rushed, this is the first thing to get cut.
But the long-term cost is severe: you can never prove any improvement. Not just AI—any process improvement, tool adoption, or methodology change becomes impossible to evaluate.
All you can say is “it feels better.” And “feelings” are the least reliable thing.
Already Adopted AI with No Historical Data? Here’s What You Can Do
Let’s be pragmatic. What’s missing from the past can’t be recovered. But you still have options:
Option 1: Retrospective Estimation
Find 3-5 senior people on your team and ask them individually:
- “Before AI, how long did it typically take you to complete a similar-sized task?”
- “How about now?”
- “On a scale of 1-5, how helpful is AI to you?”
Record these answers. Not precise, but numbers are better than nothing.
This is called an “educated guess”—far better than pure intuition.
Option 2: Control Group Comparison
This works for larger organizations or teams still in pilot phase.
If some team members haven’t started using AI, or some projects weren’t included in the rollout—use them as a control group.
Same task size, same engineer level, similar complexity—compare AI users vs. non-users. What’s the difference?
This is the closest to scientific method. Small sample size, but the conclusions are more defensible.
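If you want to put a number on that comparison, a minimal sketch like the one below is enough. The cycle times, group sizes, and names here are invented; in practice you would pull them from your sprint history or tracker.

```python
from statistics import mean, median

# Hypothetical cycle times (working days) for similar-sized tasks,
# pulled from sprint history or your tracker.
ai_group = [3.5, 4.0, 2.5, 3.0, 4.5]       # engineers already using AI tools
control_group = [5.0, 4.5, 6.0, 5.5, 4.0]  # engineers not yet onboarded

def summarize(label, days):
    print(f"{label}: n={len(days)}, mean={mean(days):.1f}d, median={median(days):.1f}d")

summarize("AI group", ai_group)
summarize("Control group", control_group)

saved = mean(control_group) - mean(ai_group)
print(f"Average difference: {saved:.1f} working days per task "
      f"({saved / mean(control_group):.0%} of the control group's cycle time)")
```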
Option 3: Start Recording Now
No past data? You can still have future data.
Design a minimal tracking method (Excel is fine). Starting today, record for each task:
- Task description
- Start/end time
- Whether AI was used
- Subjective difficulty (1-5)
- Subjective satisfaction (1-5)
Don’t need minute-level precision. Don’t need every field filled. The point is “having records.”
Two weeks later, you’ll have data to analyze. One month later, you’ll start seeing trends.
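As a concrete starting point, here is a minimal sketch of what that tracking could look like if you prefer a script over a spreadsheet. The file name, field names, and example entry are all hypothetical; a shared Excel sheet with the same columns works just as well.

```python
import csv
from pathlib import Path

# Hypothetical log file; any shared spreadsheet works just as well.
LOG = Path("ai_task_log.csv")
FIELDS = ["task", "started", "finished", "used_ai",
          "difficulty_1to5", "satisfaction_1to5"]

def log_task(**record):
    """Append one task record, writing a header row on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(record)

# Example entry at the end of a task (all values are illustrative):
log_task(task="Add pagination to orders API",
         started="2024-05-06 09:30", finished="2024-05-06 15:00",
         used_ai=True, difficulty_1to5=3, satisfaction_1to5=4)
```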
Four Metrics You Can Start Tracking Today
No complex data systems required. These four metrics can be tracked with Google Forms + Excel:
1. Adoption Rate
The most basic question: is the team actually using it?
How to track:
- Ask weekly: “How many times did you use AI tools this past week?”
- Or check the tool’s backend usage logs
If adoption keeps declining, no matter how good the AI is, the rollout has failed. A tool no one uses has no value.
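Once you have the weekly answers (or an export of the usage logs), the adoption number takes a few lines to compute. The names and counts below are invented for illustration.

```python
# Hypothetical weekly check-in: how many times each person used AI tools this week.
weekly_usage = {"alice": 12, "bob": 0, "carol": 5, "dave": 1, "erin": 0}

active = sum(1 for count in weekly_usage.values() if count > 0)
adoption_rate = active / len(weekly_usage)
print(f"Adoption this week: {active}/{len(weekly_usage)} people ({adoption_rate:.0%})")
# Log this number every week; a steady decline is the signal to watch for.
```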
2. Stickiness
A deeper question than adoption: if this tool disappeared tomorrow, would the team protest?
Ask:
- “If the company decided to cancel AI tool licenses tomorrow, your reaction would be?”
-
- Strongly oppose—this tool is essential to me
-
- A bit disappointed, but okay
-
- Indifferent
-
- Actually, removing it wouldn’t matter
-
This is the most honest value test. If most people choose C or D, you know where the problem is.
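Tallying the answers is trivial once you collect them; the responses below are made up, but the week-over-week share of people who would miss the tool is the number worth watching.

```python
from collections import Counter

# Hypothetical answers to the question above:
# A = strongly oppose, B = a bit disappointed, C = indifferent, D = wouldn't matter
responses = ["A", "A", "B", "C", "A", "D", "B", "C", "C", "A"]

counts = Counter(responses)
total = len(responses)
for option in "ABCD":
    print(f"{option}: {counts.get(option, 0):2d} ({counts.get(option, 0) / total:.0%})")

would_miss_it = (counts.get("A", 0) + counts.get("B", 0)) / total
print(f"Would miss the tool (A or B): {would_miss_it:.0%}")
```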
3. Rework Rate
How much of AI’s output needs human correction?
How to track:
- Record frequency of “AI output → human correction”
- Or ask inversely: “What percentage of AI suggestions do you adopt as-is?”
If 80% of AI output needs rewriting, the time saved might not cover the time spent fixing it.
This metric also tracks trends: as the team gets better at prompting, rework rate should decrease. If it doesn’t, the problem isn’t prompting—it’s the tool itself.
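A quick way to watch that trend, assuming you record whether each AI suggestion was adopted as-is or needed correction. The weekly data below is illustrative only.

```python
from statistics import mean

# Hypothetical per-week records: for each AI suggestion, True means it was
# adopted as-is, False means it needed human correction.
weeks = {
    "week 1": [True, False, False, True, False, False],
    "week 2": [True, True, False, True, False, True],
    "week 3": [True, True, True, False, True, True],
}

for week, adopted_as_is in weeks.items():
    rework_rate = 1 - mean(adopted_as_is)  # booleans count as 1 and 0
    print(f"{week}: rework rate {rework_rate:.0%}")
# A rework rate that stays flat while the team's prompting improves points at the tool itself.
```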
4. Trust Level
Subjective but important: how much does the team trust AI output?
Ask weekly:
- “Do you trust the code/documents/tests AI produces? Rate 1-5”
Track how this score changes.
- Rising trust → Team is learning to use it better, or AI is becoming more stable
- Falling trust → Possibly burned by errors, or a gap between expectations and reality
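A minimal sketch for tracking that weekly score, with invented numbers standing in for your survey results.

```python
from statistics import mean

# Hypothetical weekly 1-5 answers to "Do you trust the output AI produces?"
trust_scores = {
    "week 1": [3, 2, 4, 3, 3],
    "week 2": [3, 3, 4, 4, 3],
    "week 3": [4, 3, 4, 4, 5],
}

previous = None
for week, scores in trust_scores.items():
    avg = mean(scores)
    if previous is None:
        print(f"{week}: average trust {avg:.1f}/5")
    else:
        direction = "rising" if avg > previous else "falling" if avg < previous else "flat"
        print(f"{week}: average trust {avg:.1f}/5 ({direction})")
    previous = avg
```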
The Real Purpose of Acceptance
Many people treat acceptance as a “closing report”—something to show the boss that money wasn’t wasted.
This is wrong.
The real purpose of acceptance is decision-making.
Based on acceptance results, you need to answer three questions:
Continue Investing?
If results are positive: high adoption, high satisfaction, clear ROI—then double down.
- Expand rollout scope
- Invest in better tools or more licenses
- Build best practices for company-wide benefit
Adjust Direction?
If results are mixed: effective in some scenarios, not in others—then focus.
- Identify where “AI truly adds value” and concentrate resources there
- Abandon applications that aren’t working
- Maybe switch tools, methods, or usage patterns
Cut Losses and Exit?
If results are clearly negative: no one uses it, low satisfaction, costs exceed benefits—then admit failure.
This isn’t shameful. Continuing down the wrong path is shameful.
Document the experience, analyze why it failed, avoid the same pitfalls next time.
Advice for Your Next AI Project
If you haven’t adopted yet, or are preparing to adopt another AI tool, here’s my advice:
Before Adoption: Spend One Week Building a Baseline
Doesn’t need to be perfect or complex. Excel is fine.
Record:
- For the next week, each task's type, duration, and output quality (subjective 1-5)
- This becomes your baseline
One week of data isn’t precise, but it’s enough to have a comparison point after adoption.
During Adoption: Define What “Success” Looks Like
Before starting, force yourself to answer:
- Under what circumstances would this adoption be considered “successful”?
- Be specific: save 20% time? Cut error rate in half? Or just “team wants to keep using it”?
- What data will you use to judge?
Write this down. Doesn’t need to be formal—a Slack message or email is enough.
The point is “having a clear standard,” not “we’ll see when we get there.”
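For example, if the standard you wrote down was "save 20% time," the one-month check can be this simple. The numbers below are placeholders for your own baseline and tracking data.

```python
# Hypothetical success criteria written down before rollout, checked at the one-month mark.
baseline_days_per_feature = 5.0   # from your one-week baseline
current_days_per_feature = 3.8    # from the tracking log after adoption
target_time_saved = 0.20          # the standard you wrote down: "save 20% time"

actual_time_saved = 1 - current_days_per_feature / baseline_days_per_feature
verdict = "met" if actual_time_saved >= target_time_saved else "not met"
print(f"Time saved: {actual_time_saved:.0%} (target {target_time_saved:.0%}) -> {verdict}")
```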
After Adoption: Set Acceptance Checkpoints
- 2 weeks: Quick review. Any obvious problems? How’s adoption?
- 1 month: Formal acceptance. Compare to baseline—did we meet expectations?
- 3 months: Long-term evaluation. Is this tool still generating value? Or has the excitement faded?
At each checkpoint, have a simple review: what went well, what didn’t, what’s next.
Final Thoughts
AI project acceptance is indeed harder than traditional projects. Outputs are fuzzy, benefits are gradual, there’s no clear “done” moment.
But this isn’t a reason to skip acceptance.
Precisely because AI’s benefits are hard to quantify, we need to deliberately quantify them.
Otherwise, you’ll forever be stuck between “feels helpful” and “seems okay,” unable to make any decisions.
Starting to measure matters more than measuring precisely.
Imperfect data beats no data.
Start today.
If you try this framework, or if you have a different approach to AI project acceptance, I’d love to hear about it in the comments.