Thought Leadership

How AI Models Became Agents: A Short History of Harness Engineering

May 19, 2026

5 minutes

In February 2026, an engineer named Mitchell Hashimoto, co-founder of HashiCorp, creator of Terraform, published a post about his journey adopting AI tools. Buried in it was a step he called "Engineer the Harness," with a definition so plain it almost passed for common sense: every time an AI agent makes a mistake, take the time to engineer a fix into its environment, so it can never make that mistake again.

Within a week, OpenAI published a longer field report describing how a small team had built a one-million-line application using AI agents - with zero hand-written code. Anthropic followed. Martin Fowler published an analysis. Almost overnight, harness engineering went from a personal blog post to working vocabulary across every major AI lab.

It stuck because it named something practitioners had been doing for years without a shared word for it: building the layer around the model that turns AI from a clever demo into something that can do operational work.

This piece tells the story of how that layer came to exist, built up, one capability at a time, in response to specific things that kept failing. Hashimoto's formula is the spine of the whole thing: Agent = Model + Harness. Everything that follows is the story of how the right-hand side got built.

Is started with a stateless model

When ChatGPT launched in late 2022, the world got its first widespread experience of a powerful language model. The model had no memory of its own. Not of past conversations, not even of earlier in the same conversation. Every time the user typed something new, the application bundled up everything said so far and sent the whole thing to the model as a single request. Every model has a limit on how much it can read at once, the context window, and when the bundle exceeded that limit, the conversation broke.

The software wrapped around the model - the application that called it, sent it messages, displayed responses - was the first version of what would later be called the harness.

Keeping the conversation alive

The first thing the harness had to learn was how to keep a long conversation from breaking.

Two approaches emerged. The simpler one: just delete the oldest messages to make room for new ones. This worked, but users would give an important instruction at the start of a chat and find the model acting an hour later as if they'd never said it.

The more sophisticated one: summarization. As the conversation approached the limit, the harness would send the older messages to a separate model and ask it to compress them. The summary replaced the raw history. The conversation could continue, but it was now carrying a summary, not the actual exchange. Nuance got lost. This is an approach that is still in use today.

This was capability one. The harness could now keep a single conversation going. But every new conversation still started from zero.

Making the AI feel like it knew you

The next problem was continuity across sessions. Users didn't want to re-explain who they were, what business they were in, and what format they preferred every time they opened the app.

The solution was straightforward. The model itself still had no memory. But the harness could store things on the model's behalf. When something useful came up - a business rule, a preference, a recurring data structure - the model could instruct the harness to save it. Next session, the harness would load that information back into context before the conversation began.

When this worked well, the shift was striking. The AI suddenly felt like it knew you. It wasn't actually remembering anything. The harness was remembering, and feeding the model just enough at the start of each session for it to behave like it remembered. Capability two.

Knowing what you're looking at

Around the same time, applications started embedding AI more deeply - Microsoft 365's Copilot, GitHub Copilot, and others - passing the model information about the context the user was working in: which file was open, which page they were on, what role they had.

This is what made the early "co-pilot" generation feel meaningfully more useful than a chat window. You didn't have to describe what you were looking at. The AI just saw it. A casual question like "summarize this for the team" - useless in a chat window without specifying what - became instantly answerable in a co-pilot, because the harness was already feeding the model the document, the audience, and the user's normal tone.

The model wasn't different. The harness had gotten smarter about what it fed in. Capability three.

Giving the model hands

By this point, the AI could remember a conversation, remember the user, and understand the user's working context. But it still couldn't do anything beyond talk.

The next capability changed that. The harness started exposing tools to the model - functions it could call. Search the web. Read this file. Query the database. The model decided when to call a tool, what to ask, and how to use what came back.

What made this powerful wasn't tool-calling itself, it was that the model could chain calls together, taking multi-turns. Ask a question, call one tool, read the result, realize it needs more, call another, then a third, then compose an answer drawing on everything it gathered. One user prompt, ten tool calls behind the scenes, one synthesized answer out.

This is what made AI start to feel less like autocomplete and more like research. Suddenly, "What's our exposure to this borrower? Pull recent moves and any covenant issues" could be a single question. The agent would query the loan database, search public filings, check the news, pull market data, and produce one coherent answer. Capability four.

‍

Making it do the same thing the same way every time

By 2025, AI systems could remember conversations, remember users, know what you were looking at, and act on real systems. But there was a stubborn problem left: they weren't consistent. Ask the same agent the same question twice and you'd often get two slightly different answers. Different format. Different ordering. Different method.

The solution to this was skills - standardized instructions, saved as files, that tell the agent exactly how to perform a specific operation. A reconciliation skill specifies the matching order, the rounding rule, the balance method. A reporting skill specifies the template, the data sources, the disclosure language. Once defined, the agent loads the skill every time that task comes up and follows it precisely.

This is where Hashimoto's discipline really lives. Every time the agent makes a mistake, you write a skill that makes that mistake structurally impossible to repeat. The harness gets smarter over time, not by training a new model, but by accumulating the lessons from every failure. Capability five.

‍

What you get when you add it all up

A model whose context is actively managed, with memory across sessions, awareness of its working environment, tools it can call across multiple turns, and skills that ensure consistency on repeated operations.

That's an agent. As the description makes clear, the model is a smaller part of the system than most people realize. The reasoning happens inside the model. Almost everything else happens in the harness around it. This is why frontier models from different labs are now hard to distinguish on most benchmarks, but the systems built on top of them vary dramatically. The model is increasingly a commodity. The harness is where the engineering happens.

Why the harness matters

The conversation around AI used to be about which model was best. Then it was about which prompts worked. Then about how to manage context. Now it's about how to engineer the harness - and that shift has happened because everything else has flattened out.

The model is no longer where the differentiation lives. Anyone can access frontier models through an API. What determines whether an AI system actually works in production, is the engineering around the model.

That's the layer Hashimoto's blog post finally gave a name to and it's what determines, more than anything else, whether AI becomes a system that actually changes how work gets done.

For how this plays out specifically in private credit operations, see our practitioner's guide on AI agents for post-close loan operations.

How AI Models Became Agents: A Short History of Harness Engineering

Is started with a stateless model

Keeping the conversation alive

Making the AI feel like it knew you

Knowing what you're looking at

Giving the model hands

Making it do the same thing the same way every time

What you get when you add it all up

Why the harness matters

Recommended articles

The Part of AI Nobody Talks About: Why the Model Is the Easy Part

Andalusian Credit Partners has chosen Hypercore to support their loan management operations.

What Loan Servicing Looks Like in 2030

How AI Models Became Agents: A Short History of Harness Engineering

Share on

Share on

Is started with a stateless model

Share on

Keeping the conversation alive

Share on

Making the AI feel like it knew you

Share on

Knowing what you're looking at

Share on

Giving the model hands

Share on

Making it do the same thing the same way every time

Share on

What you get when you add it all up

Share on

Why the harness matters

Share on

Share on

Share on

Share on

Recommended articles

The Part of AI Nobody Talks About: Why the Model Is the Easy Part

Andalusian Credit Partners has chosen Hypercore to support their loan management operations.

What Loan Servicing Looks Like in 2030