2026.05.23 AI AI Basics en

Minimum AI Knowledge for Practical Use, Part 7: Evals and Verification

A practical explanation of evals, automatic checks, human review, regression prevention, and how to verify AI agent work.

In the previous post, I wrote about tool calling and agents.

That leaves the most important question:

"How do we trust the work AI did?"

This is where evals and verification become necessary.

If we only read an AI result and say "looks fine," the workflow is fragile. That is especially true for code changes, deploys, and public publishing.

Figure: Eval board separating automated checks from human review

Verification Starts After the AI Answer

An AI answer is not the end of the work.

For development tasks, the next steps matter:

AI proposal
-> review the diff
-> run tests
-> check local or preview behavior
-> inspect production impact
-> record the result

Trusting AI does not mean skipping verification.

It means breaking work into units that can be checked.

Evals Matter for Repeated Work

For a one-off task, human review may be enough.

For repeated tasks, we need criteria.

For example, repeated content drafting should check:

| Area | Question |
| --- | --- |
| Duplicate topic | Is this too close to an existing post? |
| Status | Is it still a draft? |
| Language pair | Do KO and EN share the same slug? |
| Public exposure | Is the draft absent from public listings and indexes? |
| Quality | Is it too generic? |
| Report | Were preview links and verification results sent? |

If we rely only on memory, mistakes will happen.

Some checks should be automated. Others should remain with human reviewers.
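
The mechanical rows of a checklist like this can be expressed as a small function. The following is a minimal sketch, assuming each post is a dictionary with hypothetical `slug` and `status` fields. The field names and the `check_draft_pair` helper are illustrative and do not come from any specific tool.

```python
def check_draft_pair(ko, en, published_slugs):
    """Return a list of problems for a KO/EN draft pair (empty list means OK).

    Assumption: each post is a dict with hypothetical "slug" and "status" keys.
    """
    problems = []
    # Language pair: both versions should share one slug.
    if ko.get("slug") != en.get("slug"):
        problems.append("KO and EN slugs differ")
    for lang, post in (("KO", ko), ("EN", en)):
        # Status: a new draft must still be marked as draft.
        if post.get("status") != "draft":
            problems.append(f"{lang} post is not marked as draft")
        # Duplicate topic (mechanical part only): the slug must be new.
        if post.get("slug") in published_slugs:
            problems.append(f"{lang} slug already exists: {post.get('slug')}")
    return problems
```

Questions such as "Is it too generic?" stay outside a function like this and remain a matter for human review.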

Automatic Checks vs Human Review

Not every quality check can be automated.

But many can.

| Automatic | Human Review |
| --- | --- |
| File exists | Is the writing persuasive? |
| Frontmatter format | Does the tone feel natural? |
| Slug uniqueness | Is it safe to publish? |
| Test result | Is the conclusion reasonable? |
| Public exposure check | Are examples appropriate? |
| URL 200/404 | Does the series flow well? |

The useful pattern is:

Let automation catch mechanical failures. Let humans judge meaning, tone, and risk.

Start With a Small Eval Set

An eval does not need to be complex.

For this AI basics series, a simple eval could be:

1. Does the post connect naturally to the previous part?
2. Is there one clear core concept?
3. Does it include at least one table, code block, or concrete example?
4. Does the next-post preview match the actual next topic?
5. Does it avoid internal workflow traces?

For coding tasks:

1. Did it modify only relevant files?
2. Did it avoid unrelated reverts?
3. Did tests pass?
4. Did it explain failure cases?
5. Did it report remaining risk?
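
The first coding question, "Did it modify only relevant files?", is one that can be checked mechanically once a list of changed paths is available (for example, the output of `git diff --name-only`). Here is a minimal sketch; the `files_outside_scope` helper and the allowed prefixes are hypothetical and would need to match the actual repository layout.

```python
def files_outside_scope(changed_files, allowed_prefixes):
    """Return changed paths that do not start with any allowed prefix.

    Assumption: paths are repo-relative strings such as "content/posts/a.md".
    """
    prefixes = tuple(allowed_prefixes)  # str.startswith accepts a tuple
    return [path for path in changed_files if not path.startswith(prefixes)]
```

For example, `files_outside_scope(["content/posts/a.md", "package.json"], ["content/"])` flags `package.json`, which a reviewer can then accept or reject.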

Small evals are easier to keep using.

Good Evals Define Failure Conditions Too

An eval is incomplete if it only describes success.

"All links should work" is a useful criterion. But it is more useful when the next action is also clear.

link check passes -> PASS
404 found -> FIX
sensitive or unintended public exposure found -> STOP

This gives the agent a safer decision path after verification.

I like splitting results into three buckets:

| Result | Meaning | Next Action |
| --- | --- | --- |
| PASS | The work meets the criteria | Continue |
| FIX | It can be corrected safely | Revise and verify again |
| STOP | A human should inspect the risk | Stop and report |

The important bucket is STOP.

A failed test may be a normal FIX. But unintended exposure, unexpected deletion, or production data impact should not be pushed forward automatically.
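
The three buckets can be sketched as a small decision function. This is an illustration under stated assumptions: the finding labels in `STOP_FINDINGS` are hypothetical names for the risks mentioned above, and a real project would define its own.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FIX = "FIX"
    STOP = "STOP"


# Hypothetical labels for findings that must never be auto-corrected.
STOP_FINDINGS = {
    "unintended_exposure",
    "unexpected_deletion",
    "production_data_impact",
}


def decide(findings):
    """Map a list of finding labels to a single verdict."""
    # Any high-risk finding wins, no matter what else was found.
    if any(finding in STOP_FINDINGS for finding in findings):
        return Verdict.STOP
    # Anything else (a broken link, a failed test) is a normal fix.
    if findings:
        return Verdict.FIX
    return Verdict.PASS
```

Checking for STOP first is the design choice that matters here: a run with one failed test and one exposure should stop, not enter a fix-and-retry loop.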

Figure: A verification gate that separates AI work results into PASS, FIX, and STOP before publishing

Regression Prevention

AI can generate changes quickly.

It can also break things quickly.

That makes regression checks important.

For a static content generator, a minimal check may be:

1. Does static page generation pass?
2. Are draft identifiers absent from public output?
3. Does a public post URL respond correctly?
4. Does a draft URL stay unavailable externally?

Expected results:

published post: 200
draft post: 404
RSS/sitemap/public data: no draft slug
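
The status-code expectations above could be checked with a sketch like the following, which uses only the Python standard library. The URLs passed in are up to the caller, and the `check_statuses` helper is a hypothetical name, not part of any existing tool.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx codes.

    Network-level failures (DNS, refused connection) still raise URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is what we want to compare.
        return err.code


def check_statuses(expected, fetch=fetch_status):
    """Compare expected {url: status} pairs against what fetch returns.

    Returns a list of (url, expected_status, actual_status) mismatches.
    """
    failures = []
    for url, want in expected.items():
        got = fetch(url)
        if got != want:
            failures.append((url, want, got))
    return failures
```

Passing `fetch` as a parameter keeps the comparison logic testable without network access. An empty result means every published URL returned 200 and every draft URL returned 404, as expected.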

The point is to turn expectations into commands.

The commands do not need to be complex at first.

They can start as simple checks:

confirm generation command succeeds
confirm the expected slug exists in public HTML
confirm sitemap only includes public posts
confirm public URLs return 200
search for words that should not be exposed
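
The last check, searching the generated output for words that should not be exposed, might look like this minimal sketch. It assumes the site generator writes its public output to a directory and that draft slugs are known as plain strings; `find_draft_leaks` is a hypothetical helper name.

```python
from pathlib import Path


def find_draft_leaks(public_dir, draft_slugs):
    """Return (file_path, slug) pairs where a draft slug appears in public output."""
    leaks = []
    for path in sorted(Path(public_dir).rglob("*")):
        if not path.is_file():
            continue
        # Ignore undecodable bytes so binary assets do not crash the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for slug in draft_slugs:
            if slug in text:
                leaks.append((str(path), slug))
    return leaks
```

Because this scans every file, it covers HTML, RSS, and the sitemap in one pass. Any non-empty result belongs in the STOP bucket rather than FIX.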

Once these checks exist, AI work is not judged only by intuition. At least the non-negotiable failure cases can be caught mechanically.

Eval Criteria Should Be Versioned

Eval criteria evolve.

At first, checking that a file exists may be enough.

Later, criteria may include:

- Is the post long enough?
- Are series numbers continuous?
- Do KO and EN titles mean the same thing?
- Were current facts checked against primary sources?
- Are security and cost claims conservative?

These rules should live in docs or scripts.

Then future AI runs can follow the same standard.
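
One way to keep scripted rules versioned is to tag every result with the version of the rules that produced it. The sketch below is only an illustration: the version label, the 300-word threshold, and the `body` field are all assumptions, and only two of the criteria listed above are mechanical enough to script this way.

```python
EVAL_VERSION = "2"  # hypothetical label; bump it whenever a rule changes


def series_is_continuous(part_numbers):
    """True if part numbers are exactly 1, 2, ..., n with no gaps or repeats."""
    return sorted(part_numbers) == list(range(1, len(part_numbers) + 1))


def run_eval(post, part_numbers, min_words=300):
    """Run the scripted criteria and tag the result with the rule version.

    Assumption: post is a dict with a "body" string; min_words is arbitrary.
    """
    return {
        "eval_version": EVAL_VERSION,
        "long_enough": len(post["body"].split()) >= min_words,
        "series_continuous": series_is_continuous(part_numbers),
    }
```

Recording `eval_version` with each result makes it possible to tell later whether an old post passed under the current rules or an earlier, looser set.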

Summary

AI workflows need verification criteria.

Evals are not only big benchmark suites. They can be practical checklists and scripts.

1. Verification starts after the AI answer.
2. Repeated work needs eval criteria.
3. Automatic checks and human review should be separated.
4. Small checklists are valid evals.
5. Regression checks should become commands.

In the final post of this series, I will cover permission boundaries and security: how far we should let AI agents go.
