2026.05.23 AI AI Basics en

Minimum AI Knowledge for Practical Use, Part 7: Evals and Verification

A practical explanation of evals, automatic checks, human review, regression prevention, and how to verify AI agent work.

#AI #Eval #Testing #Verification #AI Basics

Kouji Operations Notes


In the previous post, I wrote about tool calling and agents.

That leaves the most important question:

"How do we trust the work AI did?"

This is where evals and verification become necessary.

If we only read an AI result and say "looks fine," the workflow is fragile. That is especially true for code changes, deploys, and public publishing.

[Figure: Eval board separating automated checks from human review]

Verification Starts After the AI Answer

An AI answer is not the end of the work.

For development tasks, the next steps matter:

AI proposal
-> review the diff
-> run tests
-> check local or preview behavior
-> inspect production impact
-> record the result

Trusting AI does not mean skipping verification.

It means breaking work into units that can be checked.
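One way to picture "units that can be checked" is a list of named steps, each returning pass or fail, run in order and recorded. This is only a minimal sketch: `run_steps` and the step names are hypothetical, and the lambdas stand in for real commands such as a test runner.

```python
from typing import Callable

# Hypothetical shape: each step is a (name, check) pair; check returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_steps(steps: list[Step]) -> list[tuple[str, str]]:
    """Run steps in order, stop at the first failure, and return a record of results."""
    record = []
    for name, check in steps:
        ok = check()
        record.append((name, "ok" if ok else "failed"))
        if not ok:
            break  # later steps assume earlier ones passed, so do not continue
    return record


# Placeholder checks; in practice these would call real tools.
steps = [
    ("review the diff", lambda: True),
    ("run tests", lambda: False),
    ("check preview behavior", lambda: True),
]
for name, status in run_steps(steps):
    print(f"{name}: {status}")
```

The record is the part that matters. It turns "looks fine" into a list that someone else can read later.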

Evals Matter for Repeated Work

For a one-off task, human review may be enough.

For repeated tasks, we need criteria.

For example, repeated content drafting should check:

Area	Question
Duplicate topic	Is this too close to an existing post?
Status	Is it still a draft?
Language pair	Do KO and EN share the same slug?
Public exposure	Is the draft absent from public listings and indexes?
Quality	Is it too generic?
Report	Were preview links and verification results sent?

If we rely only on memory, mistakes will happen.

Some checks should be automated. Some should remain human review.
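Two rows of the table above, language pair and status, are mechanical enough to script. The sketch below assumes each post can be loaded into a small record with a slug, a language, and a status; the `Post` class, its field names, and the sample slugs are my own illustration, not a fixed format.

```python
from dataclasses import dataclass


@dataclass
class Post:
    slug: str
    lang: str    # "ko" or "en"
    status: str  # "draft" or "published"


def missing_language_pairs(posts: list[Post]) -> list[str]:
    """Return slugs that do not exist in both KO and EN."""
    langs_by_slug: dict[str, set[str]] = {}
    for post in posts:
        langs_by_slug.setdefault(post.slug, set()).add(post.lang)
    return sorted(s for s, langs in langs_by_slug.items() if langs != {"ko", "en"})


def not_draft(posts: list[Post]) -> list[str]:
    """Return 'slug (lang)' for posts whose status is not draft."""
    return sorted(f"{p.slug} ({p.lang})" for p in posts if p.status != "draft")


new_posts = [
    Post("evals-and-verification", "ko", "draft"),
    Post("evals-and-verification", "en", "draft"),
    Post("permission-boundaries", "en", "published"),
]
print(missing_language_pairs(new_posts))
print(not_draft(new_posts))
```

Rows such as "Is it too generic?" stay with a human reviewer.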

Automatic Checks vs Human Review

Not every quality check can be automated.

But many can.

Automatic	Human Review
File exists	Is the writing persuasive?
Frontmatter format	Does the tone feel natural?
Slug uniqueness	Is it safe to publish?
Test result	Is the conclusion reasonable?
Public exposure check	Are examples appropriate?
URL 200/404	Does the series flow well?

The useful pattern is:

Let automation catch mechanical failures. Let humans judge meaning, tone, and risk.
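As an illustration of the Automatic column, here is a rough sketch of two mechanical checks: frontmatter format and slug uniqueness. The required keys are an assumption for the example; a real site would use whatever fields its generator expects.

```python
from collections import Counter

# Assumed frontmatter fields, for illustration only.
REQUIRED_KEYS = {"title", "slug", "status"}


def frontmatter_errors(frontmatter: dict[str, str]) -> list[str]:
    """List required keys that are missing from a parsed frontmatter dict."""
    missing = sorted(REQUIRED_KEYS - set(frontmatter))
    return [f"missing key: {key}" for key in missing]


def duplicate_slugs(slugs: list[str]) -> list[str]:
    """Return slugs that appear more than once."""
    counts = Counter(slugs)
    return sorted(slug for slug, n in counts.items() if n > 1)


print(frontmatter_errors({"title": "Evals and Verification", "slug": "evals"}))
print(duplicate_slugs(["evals", "agents", "evals"]))
```

Neither function says anything about tone or persuasiveness. That is the point of the split.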

Start With a Small Eval Set

An eval does not need to be complex.

For this AI basics series, a simple eval could be:

1. Does the post connect naturally to the previous part?
2. Is there one clear core concept?
3. Does it include at least one table, code block, or concrete example?
4. Does the next-post preview match the actual next topic?
5. Does it avoid internal workflow traces?

For coding tasks:

1. Did it modify only relevant files?
2. Did it avoid unrelated reverts?
3. Did tests pass?
4. Did it explain failure cases?
5. Did it report remaining risk?

Small evals are easier to keep using.
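The first coding question, "Did it modify only relevant files?", can be checked with a few lines. This sketch assumes the list of changed paths is already available, for example from `git diff --name-only`, and that "relevant" can be expressed as path prefixes. The function name and the sample paths are hypothetical.

```python
def out_of_scope_files(changed: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Return changed paths that fall outside every allowed prefix."""
    allowed = tuple(allowed_prefixes)  # str.startswith accepts a tuple of prefixes
    return [path for path in changed if not path.startswith(allowed)]


changed = ["src/posts/evals.md", "src/posts/evals.en.md", "package.json"]
print(out_of_scope_files(changed, ["src/posts/"]))
```

An empty result answers the question. A non-empty result is a prompt to look at the diff, not proof of a mistake.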

Good Evals Define Failure Conditions Too

An eval is incomplete if it only describes success.

"All links should work" is a useful criterion. But it is more useful when the next action is also clear.

link check passes -> PASS
404 found -> FIX
sensitive or unintended public exposure found -> STOP

This gives the agent a safer decision path after verification.

I like splitting results into three buckets:

Result	Meaning	Next Action
PASS	The work meets the criteria	Continue
FIX	It can be corrected safely	Revise and verify again
STOP	A human should inspect the risk	Stop and report

The important bucket is STOP.

A failed test may be a normal FIX. But unintended exposure, unexpected deletion, or production data impact should not be pushed forward automatically.
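The three buckets map naturally onto a small decision function. The sketch below is one possible shape, not a prescribed one: the finding labels in `STOP_KINDS` are hypothetical names for the risks mentioned above, and anything else is treated as fixable.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FIX = "FIX"
    STOP = "STOP"


# Hypothetical finding labels for the risks that need a human.
STOP_KINDS = {"unintended_exposure", "unexpected_deletion", "production_data_impact"}


def decide(findings: list[str]) -> Verdict:
    """Map a list of finding labels to PASS, FIX, or STOP."""
    if any(kind in STOP_KINDS for kind in findings):
        return Verdict.STOP  # one risky finding outweighs everything else
    if findings:
        return Verdict.FIX
    return Verdict.PASS


print(decide([]).value)
print(decide(["broken_link"]).value)
print(decide(["broken_link", "unintended_exposure"]).value)
```

STOP is checked first on purpose, so a fixable finding can never mask a risky one.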

[Figure: A verification gate that separates AI work results into PASS, FIX, and STOP before publishing]

Regression Prevention

AI can generate changes quickly.

It can also break things quickly.

That makes regression checks important.

For a static content generator, a minimal check may be:

1. Does static page generation pass?
2. Are draft identifiers absent from public output?
3. Does a public post URL respond correctly?
4. Does a draft URL stay unavailable externally?

Expected results:

published post: 200
draft post: 404
RSS/sitemap/public data: no draft slug

The point is to turn expectations into commands.
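Here is a minimal sketch of the status expectations as code, using only the Python standard library. The URLs are placeholders on example.com, and the example run uses a fake lookup instead of real requests so the logic can be read without a live site.

```python
import urllib.error
import urllib.request


def fetch_status(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urlopen raises on 4xx/5xx, so read the code from the error


def status_mismatches(expected: dict[str, int], fetch=fetch_status) -> list[str]:
    """Compare expected status codes with what fetch returns for each URL."""
    problems = []
    for url, want in expected.items():
        got = fetch(url)
        if got != want:
            problems.append(f"{url}: expected {want}, got {got}")
    return problems


# Placeholder URLs; a real check would list the site's own pages.
EXPECTED = {
    "https://example.com/posts/published/": 200,
    "https://example.com/posts/draft/": 404,
}
# Fake observations for illustration: the draft is unexpectedly reachable.
observed = {
    "https://example.com/posts/published/": 200,
    "https://example.com/posts/draft/": 200,
}
print(status_mismatches(EXPECTED, fetch=observed.get))
```

Passing `fetch` as a parameter keeps the comparison testable offline. Calling `status_mismatches(EXPECTED)` without it would make real requests.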

The commands do not need to be complex at first.

They can start as simple checks:

confirm generation command succeeds
confirm the expected slug exists in public HTML
confirm sitemap only includes public posts
confirm public URLs return 200
search for words that should not be exposed

Once these checks exist, AI work is not judged only by intuition. At least the non-negotiable failure cases can be caught mechanically.
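The sitemap check from the list above could look roughly like this. It assumes a standard sitemap in the sitemaps.org 0.9 format and that a post's slug is the last path segment of its URL; both are assumptions about the site, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def leaked_slugs(urls: list[str], draft_slugs: set[str]) -> list[str]:
    """Return draft slugs that appear as the last path segment of a public URL."""
    public_slugs = {url.rstrip("/").rsplit("/", 1)[-1] for url in urls}
    return sorted(public_slugs & draft_slugs)


# Made-up sitemap for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/posts/agents/</loc></url>
  <url><loc>https://example.com/posts/evals-draft/</loc></url>
</urlset>"""

print(leaked_slugs(sitemap_urls(SAMPLE), {"evals-draft", "security-draft"}))
```

Matching on the whole last segment avoids flagging a public slug that merely starts with a draft's name.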

Eval Criteria Should Be Versioned

Eval criteria evolve.

At first, checking that a file exists may be enough.

Later, criteria may include:

- Is the post long enough?
- Are series numbers continuous?
- Do KO and EN titles mean the same thing?
- Were current facts checked against primary sources?
- Are security and cost claims conservative?

These rules should live in docs or scripts.

Then future AI runs can follow the same standard.
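One simple way to keep criteria in a script-readable form is a small versioned file. The sketch below inlines a hypothetical JSON document; the field names, rule ids, and the 300-word threshold are placeholders I chose for the example.

```python
import json

# Hypothetical criteria; in a repository this would live in a tracked file.
CRITERIA_JSON = """
{
  "version": 2,
  "rules": [
    {"id": "file-exists", "since": 1},
    {"id": "min-length", "since": 2, "min_words": 300}
  ]
}
"""


def load_criteria(text: str) -> dict:
    """Parse criteria and insist on a version and a rule list."""
    criteria = json.loads(text)
    if "version" not in criteria or "rules" not in criteria:
        raise ValueError("criteria need a version and a list of rules")
    return criteria


def long_enough(body: str, criteria: dict) -> bool:
    """Apply the min-length rule using a whitespace word count."""
    rule = next(r for r in criteria["rules"] if r["id"] == "min-length")
    return len(body.split()) >= rule["min_words"]


criteria = load_criteria(CRITERIA_JSON)
print(f"criteria v{criteria['version']}: long enough = {long_enough('too short', criteria)}")
```

Printing the version next to each result makes it clear which standard a past run was judged against.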

Summary

AI workflows need verification criteria.

Evals are not only big benchmark suites. They can be practical checklists and scripts.

1. Verification starts after the AI answer.
2. Repeated work needs eval criteria.
3. Automatic checks and human review should be separated.
4. Small checklists are valid evals.
5. Regression checks should become commands.

In the final post of this series, I will cover permission boundaries and security: how far we should let AI agents go.
