Minimum AI Knowledge for Practical Use, Part 7: Evals and Verification
A practical explanation of evals, automatic checks, human review, regression prevention, and how to verify AI agent work.
In the previous post, I wrote about tool calling and agents.
That leaves the most important question:
"How do we trust the work AI did?"
This is where evals and verification become necessary.
If verification amounts to reading an AI result and saying "looks fine," the workflow is fragile. That is especially true for code changes, deploys, and anything published publicly.
Verification Starts After the AI Answer
An AI answer is not the end of the work.
For development tasks, the next steps matter:
AI proposal
-> review the diff
-> run tests
-> check local or preview behavior
-> inspect production impact
-> record the result
Trusting AI does not mean skipping verification.
It means breaking work into units that can be checked.
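As a rough illustration of "units that can be checked," the flow above can be modeled as an ordered list of named steps that stops at the first failure and records every result. This is only a sketch: the `run_verification` helper and the placeholder lambdas are assumptions for illustration, and real steps would call a test runner, a preview build, and so on.

```python
from typing import Callable

# Each step is a (name, check) pair; a check returns True when it passes.
Step = tuple[str, Callable[[], bool]]


def run_verification(steps: list[Step]) -> list[tuple[str, str]]:
    """Run steps in order, stop at the first failure, and record each result."""
    log: list[tuple[str, str]] = []
    for name, check in steps:
        ok = check()
        log.append((name, "ok" if ok else "failed"))
        if not ok:
            break  # later steps assume the earlier ones passed
    return log


# Placeholder checks (hypothetical): replace the lambdas with real commands.
steps: list[Step] = [
    ("review the diff", lambda: True),
    ("run tests", lambda: False),
    ("check preview behavior", lambda: True),
]

log = run_verification(steps)
print(log)
```

Because the run stops at the failed test step, the preview check never executes, and the log doubles as the "record the result" step.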
Evals Matter for Repeated Work
For a one-off task, human review may be enough.
For repeated tasks, we need criteria.
For example, repeated content drafting should check:
| Area | Question |
|---|---|
| Duplicate topic | Is this too close to an existing post? |
| Status | Is it still marked as a draft? |
| Language pair | Do KO and EN share the same slug? |
| Public exposure | Is the draft absent from public listings and indexes? |
| Quality | Is it too generic? |
| Report | Were preview links and verification results sent? |
If we rely only on memory, mistakes will happen.
Some checks should be automated. Some should remain human review.
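Two rows of the table above, status and language pair, are mechanical enough to script. The following is a minimal sketch, assuming each post is available as a dictionary with hypothetical `lang`, `slug`, and `status` fields; the actual shape depends on how your site stores posts.

```python
def check_drafts(posts: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: list[str] = []
    slugs_by_lang: dict[str, set[str]] = {}

    for post in posts:
        slugs_by_lang.setdefault(post["lang"], set()).add(post["slug"])
        if post["status"] != "draft":
            problems.append(
                f"{post['lang']}/{post['slug']}: status is {post['status']!r}"
            )

    # A slug that exists in only one language breaks the KO/EN pairing.
    ko = slugs_by_lang.get("ko", set())
    en = slugs_by_lang.get("en", set())
    for slug in sorted(ko ^ en):
        problems.append(f"{slug}: missing KO or EN counterpart")

    return problems


posts = [
    {"lang": "ko", "slug": "evals", "status": "draft"},
    {"lang": "en", "slug": "evals", "status": "draft"},
    {"lang": "en", "slug": "agents", "status": "published"},
]
problems = check_drafts(posts)
print(problems)
```

Questions such as "Is it too generic?" stay outside this function on purpose; they belong to human review.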
Automatic Checks vs Human Review
Not every quality check can be automated.
But many can.
| Automatic | Human Review |
|---|---|
| File exists | Is the writing persuasive? |
| Frontmatter format | Does the tone feel natural? |
| Slug uniqueness | Is it safe to publish? |
| Test result | Is the conclusion reasonable? |
| Public exposure check | Are examples appropriate? |
| URL 200/404 | Does the series flow well? |
The useful pattern is:
Let automation catch mechanical failures. Let humans judge meaning, tone, and risk.
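As one example from the automatic column, a frontmatter format check can be a few lines. This sketch assumes frontmatter is delimited by `---` lines and uses a hypothetical set of required keys; it does not parse YAML, it only confirms that the keys are present.

```python
# Assumed required fields; adjust to your own frontmatter schema.
REQUIRED_KEYS = {"title", "slug", "status"}


def frontmatter_problems(text: str) -> list[str]:
    """Return format problems found in a post's frontmatter block."""
    lines = [line.rstrip() for line in text.splitlines()]
    if not lines or lines[0] != "---":
        return ["missing opening '---'"]
    try:
        end = lines.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return ["missing closing '---'"]

    keys = {line.split(":", 1)[0].strip() for line in lines[1:end] if ":" in line}
    return [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - keys)]


good = "---\ntitle: Evals\nslug: evals\nstatus: draft\n---\nBody text"
bad = "---\ntitle: Evals\n---\nBody text"
print(frontmatter_problems(good))
print(frontmatter_problems(bad))
```

A check like this says nothing about whether the writing is persuasive, which is exactly why that question sits in the human review column.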
Start With a Small Eval Set
An eval does not need to be complex.
For this AI basics series, a simple eval could be:
1. Does the post connect naturally to the previous part?
2. Is there one clear core concept?
3. Does it include at least one table, code block, or concrete example?
4. Does the next-post preview match the actual next topic?
5. Does it avoid internal workflow traces?
For coding tasks:
1. Did it modify only relevant files?
2. Did it avoid unrelated reverts?
3. Did tests pass?
4. Did it explain failure cases?
5. Did it report remaining risk?
Small evals are easier to keep using.
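Some of these questions can be answered by a script. For "Did it modify only relevant files?", a minimal sketch might compare the changed paths against an allowlist of prefixes. The function name, the paths, and the prefixes below are illustrative assumptions; the changed-file list could come from something like `git diff --name-only`.

```python
def unrelated_changes(changed_files: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Return changed files that fall outside every allowed path prefix."""
    allowed = tuple(allowed_prefixes)  # str.startswith accepts a tuple of prefixes
    return [path for path in changed_files if not path.startswith(allowed)]


changed = ["content/posts/evals.md", "package.json"]
unexpected = unrelated_changes(changed, ["content/posts/"])
print(unexpected)
```

An empty result answers the first question with a yes. Anything in the list is worth a look before the change goes further.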
Good Evals Define Failure Conditions Too
An eval is incomplete if it only describes success.
"All links should work" is a useful criterion. But it is more useful when the next action is also clear.
link check passes -> PASS
404 found -> FIX
sensitive or unintended public exposure found -> STOP
This gives the agent a safer decision path after verification.
I like splitting results into three buckets:
| Result | Meaning | Next Action |
|---|---|---|
| PASS | The work meets the criteria | Continue |
| FIX | It can be corrected safely | Revise and verify again |
| STOP | A human should inspect the risk | Stop and report |
The most important bucket is STOP.
A failed test may be a normal FIX. But unintended exposure, unexpected deletion, or production data impact should not be pushed forward automatically.
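The three buckets can be encoded so that STOP always wins over FIX. The sketch below is one possible encoding, not a standard: the `Verdict` enum, the `decide` function, and the finding labels are hypothetical names chosen for illustration.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "continue"
    FIX = "revise and verify again"
    STOP = "stop and report"


# Hypothetical labels for findings that must never be auto-fixed.
STOP_FINDINGS = {"unintended_exposure", "unexpected_deletion", "production_data_impact"}


def decide(findings: set[str]) -> Verdict:
    """Map the findings from a verification run to a single verdict."""
    if findings & STOP_FINDINGS:
        return Verdict.STOP  # any high-risk finding overrides everything else
    if findings:
        return Verdict.FIX
    return Verdict.PASS


print(decide(set()))
print(decide({"broken_link"}))
print(decide({"broken_link", "unintended_exposure"}))
```

Checking the STOP set first means a run with both a broken link and an exposure finding is escalated to a human rather than quietly repaired.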
Regression Prevention
AI can generate changes quickly.
It can also break things quickly.
That makes regression checks important.
For a static content generator, a minimal check may be:
1. Does static page generation pass?
2. Are draft identifiers absent from public output?
3. Does a public post URL respond correctly?
4. Does a draft URL stay unavailable externally?
Expected results:
published post: 200
draft post: 404
RSS/sitemap/public data: no draft slug
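The status expectations above can be written down as data and compared against what the site actually returns. This is a sketch using only the Python standard library; the `example.com` URLs and the function names are placeholders, and the `fetch` parameter exists so the comparison can be exercised without a network.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises 4xx/5xx responses as HTTPError


def status_mismatches(
    expected: dict[str, int],
    fetch: Callable[[str], int] = http_status,
) -> list[str]:
    """Compare expected status codes with the observed ones."""
    mismatches = []
    for url, want in expected.items():
        got = fetch(url)
        if got != want:
            mismatches.append(f"{url}: expected {want}, got {got}")
    return mismatches


# Placeholder URLs: a published post should be 200, a draft should be 404.
expected = {
    "https://example.com/posts/evals": 200,
    "https://example.com/posts/draft-post": 404,
}

# Simulated responses for illustration: here the draft is wrongly public.
observed = {
    "https://example.com/posts/evals": 200,
    "https://example.com/posts/draft-post": 200,
}
mismatches = status_mismatches(expected, fetch=observed.__getitem__)
print(mismatches)
```

Calling `status_mismatches(expected)` without the `fetch` argument would issue real requests through `http_status`.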
The point is to turn expectations into commands.
The commands do not need to be complex at first.
They can start as simple checks:
confirm generation command succeeds
confirm the expected slug exists in public HTML
confirm sitemap only includes public posts
confirm public URLs return 200
search for words that should not be exposed
Once these checks exist, AI work is no longer judged by intuition alone. At least the non-negotiable failure cases can be caught mechanically.
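The last check in that list, searching for words that should not be exposed, might look like the following sketch. It assumes the generator writes its public output to a directory and that you maintain a list of forbidden terms such as draft slugs; the file names and the `draft-post` term are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def find_leaks(output_dir: str, forbidden: list[str]) -> list[tuple[str, str]]:
    """Return (relative file path, term) pairs for each forbidden term found."""
    hits = []
    for path in sorted(Path(output_dir).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for term in forbidden:
            if term in text:
                hits.append((path.relative_to(output_dir).as_posix(), term))
    return hits


# Demonstration with a throwaway directory standing in for the build output.
with tempfile.TemporaryDirectory() as out:
    Path(out, "index.html").write_text("<h1>Published post</h1>", encoding="utf-8")
    Path(out, "sitemap.xml").write_text("<loc>/posts/draft-post</loc>", encoding="utf-8")
    hits = find_leaks(out, ["draft-post"])

print(hits)
```

Any hit here is a candidate for the STOP bucket rather than an automatic fix, since it means draft content reached public output.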
Eval Criteria Should Be Versioned
Eval criteria evolve.
At first, checking that a file exists may be enough.
Later, criteria may include:
- Is the post long enough?
- Are series numbers continuous?
- Do KO and EN titles mean the same thing?
- Were current facts checked against primary sources?
- Are security and cost claims conservative?
These rules should live in docs or scripts.
Then future AI runs can follow the same standard.
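One way to keep criteria in a script is to store them as versioned data and record the version in every report. The sketch below inlines the JSON for brevity; in practice it would likely live in a file under version control. The check names and the `build_report` helper are hypothetical.

```python
import json

# Inlined for the example; assumed to live in a tracked file in a real repo.
CRITERIA = json.loads("""
{
  "version": 2,
  "checks": ["file_exists", "series_numbers_continuous"]
}
""")


def build_report(results: dict[str, bool], criteria: dict) -> dict:
    """Summarize check results against one specific version of the criteria."""
    missing = [name for name in criteria["checks"] if name not in results]
    passed = not missing and all(results[name] for name in criteria["checks"])
    return {
        "criteria_version": criteria["version"],
        "passed": passed,
        "missing_checks": missing,
    }


complete = build_report(
    {"file_exists": True, "series_numbers_continuous": True}, CRITERIA
)
partial = build_report({"file_exists": True}, CRITERIA)
print(complete)
print(partial)
```

Recording `criteria_version` makes it possible to tell whether an older run passed under the current rules or under an earlier, looser set.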
Summary
AI workflows need verification criteria.
Evals are not only big benchmark suites. They can be practical checklists and scripts.
1. Verification starts after the AI answer.
2. Repeated work needs eval criteria.
3. Automatic checks and human review should be separated.
4. Small checklists are valid evals.
5. Regression checks should become commands.
In the final post of this series, I will cover permission boundaries and security: how far we should let AI agents go.