Sometimes old is gold.
One of the key aspects of agentic coding is to have a good validation harness in place. Now, I've really seen three levels that developers go through when it comes to validation harnesses:
First, many devs simply don't have a validation harness. You ask the agent to build whatever you want to build, and when it's done, you manually run the app and test it out. This, of course, is not very efficient, because the agent may make mistakes, and each time you have to be sitting there in front of the terminal validating it. Also, manual validation is usually very cursory and not to the required depth.
The next level is to connect some kind of validation. If you're building a web application, a popular choice is to connect Playwright MCP to your agentic tool. Now the agent has access to a browser.
After it implements the feature, the agent can open the browser and navigate the pages, take screenshots, read the DOM, and check whether the feature has been implemented properly or not. If it's not implemented properly, the agent will automatically go back and change the code. It's going to keep repeating the loop till it is satisfied that the feature has been implemented correctly. This is really nice because now you can just leave the agent to implement the feature and come back after five to ten minutes and verify the final implementation. However, there are a couple of important downsides:
- The agent will probably only verify that feature. It's not going to do a full regression, so if there is an error in some other feature because of this new implementation, you're not going to catch it.
- Agents always have some amount of randomness. Sometimes your feature may be failing, yet the agent still reports that everything is working great.
What I like to do instead is to ask the agent to generate automated tests just like I would do in the old days of test-driven development. These automated tests could be implemented using Playwright, but there are a couple of very important advantages:
- I can examine the automated tests and verify that all the conditions that need to be tested are actually covered.
- Running automated tests is completely deterministic: a test cannot claim everything is working while the feature is broken.
As an added bonus, the test cases double as regression tests and slot directly into your CI/CD pipeline. Overall, it's a win-win-win.
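To make this concrete, here is a minimal sketch of the kind of Playwright test an agent might generate for a login feature. The URL, labels, and messages are hypothetical placeholders, not taken from any real app:

```typescript
// Hypothetical Playwright test for a login feature.
// The route, form labels, and error text are illustrative placeholders.
import { test, expect } from '@playwright/test';

test('shows an error for invalid credentials', async ({ page }) => {
  await page.goto('http://localhost:3000/login'); // assumed local dev server

  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('wrong-password');
  await page.getByRole('button', { name: 'Sign in' }).click();

  // A deterministic assertion: it either passes or fails,
  // with no room for the agent to "judge" the result.
  await expect(page.getByText('Invalid credentials')).toBeVisible();
});
```

Because the assertion is explicit, running it with `npx playwright test` gives a hard pass/fail signal, and the same file serves as a regression check in CI.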
Running automated tests almost seems old-fashioned in this age of agents, MCP and tools, but sometimes old is gold.