The Anatomy of an AI Agent Harness — A BlueAlly Field Guide

01 The problem isn't your model

You bought the best brain. The agent still stalls.

Teams pick the strongest model they can find. They write a careful prompt. They wire up a few tools. Then the agent stalls. It forgets what it was doing. It calls the wrong tool, reads the error, and calls it again. It loops without end, or it quits too early.

The model is not broken. The system around it is thin.

Here is the evidence, clean. One team at LangChain held the model fixed and changed only the system around it. On a hard agent benchmark, the same model climbed from the bottom third of the field to the top handful — its score rising from the low fifties to the mid sixties.² Same brain. Better body.

Top 30 → Top 5

Same model, harness rebuilt. Rank on Terminal-Bench 2.0.²

52.8 → 66.5

The score that move produced — no model change at all.²

80% → 100%

Vercel's task success after cutting 80% of its agent's tools.¹⁴

That is the whole argument for this guide. The model gets the headlines. The harness does the work. If you build agents, the harness is where most of your real decisions live — and most of your failures.

02 Start here

The words you'll need

No jargon you can't unpack. Seven terms carry the whole field. Learn these and the rest of this guide reads easily.

Plain English

Large language model (LLM): A program that predicts text. Give it words; it returns the words that should come next. That is the whole trick — and it is enough to do remarkable things.
Stateless: The model remembers nothing between requests. Each call starts cold. Whatever it should know, you must hand it again, every time.
Token: A chunk of text, a few characters long. Models read and write in tokens, and you pay by the token. Waste them and the bill grows.
Context window: The model's short-term memory: the text it can see at once, measured in tokens. It is finite. Fill it with noise and the answers get worse.
Tool calling: The model can't act on its own. It asks. "Run this search." "Send this email." Code outside the model does the work and hands the result back.
MCP: The Model Context Protocol — an open standard from Anthropic, introduced in late 2024, for connecting models to tools and data the same way every time. A common port for AI.⁴
Agent vs. chatbot: A chatbot answers. An agent acts. It loops: calls tools, reads results, decides what to do next, and keeps going until the job is done.

03 What it is

The harness is everything around the model

A raw model takes text and returns text. Nothing more. It cannot remember last week. It cannot run your code, read your database, or stop itself from looping. On its own, it is a brilliant mind in an empty room.

The harness is the room. It is the code that decides what the model sees, runs the tools the model asks for, remembers what happened, checks the work, and knows when to stop. It is the loop, the memory, the tools, the guardrails — the whole apparatus that turns a text predictor into something that gets work done.

If you're not the model, you're the harness.— Vivek Trivedy, LangChain¹

Say it plainly: you are probably not training the model. You are building the harness. That is where your craft goes, and that is what this guide is about.

04 A familiar shape

The computer in your pocket already explains it

If this feels new, it isn't — not really. We have built this shape before. A modern computer is a fast processor wrapped in layers that make it useful: memory, storage, drivers, an operating system. An agent has the same anatomy, part for part.

The Von Neumann mapping. The processor is the model. The operating system is the harness — the layer that turns raw compute into something you can actually use.

The researcher Beren Millidge put it well: in building agents this way, we have reinvented the von-Neumann architecture.³ The model is the CPU. The context window is RAM. A vector database is the hard disk. Tools are the device drivers. And the harness is the operating system — the part nobody sees, doing the work that makes the machine worth owning.

A raw model is a CPU with no operating system. Powerful, and useless, until you give it one.

05 Three kinds of engineering

Prompt, context, harness

The work of making models useful has grown up in three stages. Each is harder than the last, and each contains the one before it.

Prompt engineering

Writing the instruction well. Clear words, good examples, the right framing. It still matters — but one good prompt does not make a working agent.

Context engineering

Curating everything the model sees each turn, not just the instruction. Anthropic frames the goal as finding the smallest possible set of high-signal tokens — the least text that carries the most meaning.⁵

Harness engineering

Designing the whole running system: the loop, the tools, the memory, the checks, the limits. This is the discipline that decides whether an agent ships. It is the subject of the rest of this guide.

06 The anatomy

Twelve components, three layers

Open up any serious agent and you find the same parts. They fall into three layers around the model: the Runtime that drives each turn, the Capabilities that let it do and remember things, and the Governance & Scale layer that keeps it safe and lets it grow.

The harness, laid out. The model sits at the center, stateless and waiting. Everything around it is engineering you own.

Runtimethe engine of each turn

Orchestration Loop.

The engine. It calls the model, runs the tools, feeds back the results, and decides whether to go again.

Prompt Construction.

What the model sees this turn — instructions, history, tool results — assembled with care, every time.

Output Parsing.

Turning the model's text into something a program can use: a tool call, a final answer, a clean object.

Error Handling.

Tools fail. The harness catches the failure, explains it to the model, and lets it try another way.

Capabilitieswhat it can do and remember

Tools.

The agent's hands: search, code, a database, an API. Fewer, sharper tools beat a crowded toolbox.

Memory.

What survives across turns and sessions — notes, files, a store the agent can search when it needs to.

Context Management.

Keeping the window full of signal, not noise. Summarize, prune, retrieve only what the task needs.

State Management.

Tracking the work in progress: the goal, the steps already taken, and what is still left to do.

Governance & Scalekeeping it safe, letting it grow

Guardrails & Safety.

The limits. What the agent may touch, what it must refuse, and where a human signs off.

Verification Loops.

Checking the work before trusting it — tests, linters, a second model as judge. The checking pays for itself.

Subagent Orchestration.

Splitting a big job across focused agents, each with its own clean context, then bringing the work back together.

Observability & Tracing.

Seeing what happened — every call, every tool, every decision logged. Modern SDKs now turn this on by default.⁶

07 The engine

The loop in motion

At its heart, an agent is a loop. The mechanism is almost embarrassingly simple — a while-loop — but each step in it carries real infrastructure. Here is one full turn.

One turn, seven steps. The agent keeps circling until it returns an answer with no tool calls, or hits a limit you set.

Prompt Assembly.

Gather the instructions, the history, and the latest results. Build the message the model will see.

LLM Inference.

Send it. Get text back.

Classify Output.

Read the reply. Is it a request to call a tool, or is it the final answer?

Tool Execution.

If it called a tool, run it. As a rule of thumb, read-only calls can run together; calls that change things run one at a time, in order.

Result Packaging.

Format the result so the model can read it cleanly on the next turn.

Context Update.

Add the new turn to the running context. Prune if it is getting heavy.

Loop.

Go back to step one. Stop when the work is done, the budget is spent, or a guardrail fires.

The simplest honest version of this has a name — the Ralph loop, after Geoffrey Huntley: run the same prompt, in a fresh context, again and again, until the task is finished.¹⁶ Crude, and often enough. Most of the art is in knowing when to stop.

08 In the wild

How the real frameworks do it

Every serious framework builds the same core — a model wrapped in a harness. They differ on one question: where should the control live? In the model, or in your code? Here is how five of them answer.

Five harnesses, one pattern. The same anatomy, with different bets on how much logic to write yourself.

Claude Agent SDK

Anthropic's harness, renamed from the Claude Code SDK.⁴ A deliberately plain loop around a strong model: call the model, run the tool, repeat — driven by a simple async iterator. The bet: trust the model.

OpenAI Agents SDK

A code-first Python framework. A Runner runs the loop — synchronous, async, or streaming. Agents become tools for other agents, and control passes by handoff. Tracing is on by default.⁶

LangGraph

Control drawn as an explicit graph. You define the states and the transitions yourself; checkpoints let you pause, resume, and inspect. The bet: make the flow explicit.

CrewAI

Roles and tasks. You define a crew of agents and the work each one owns, run in sequence or in a hierarchy. Flows add precise control when a job needs it.

Microsoft Agent Framework

Reached version 1.0 — generally available — in April 2026, merging the lessons of AutoGen and Semantic Kernel, now in maintenance.⁹ It ships five orchestration patterns, including the Magentic pattern for open-ended work.

09 A working metaphor

Scaffolding, and why it shrinks

Picture a building going up. Around it stands the scaffolding — the rails, the lifts, the safety nets that let the crew reach the upper floors. The scaffolding does not become the building. When the work is done, it comes down.

The harness is the scaffolding. Necessary to build with today's models. Lighter with tomorrow's.

The harness works the same way. It is the structure that lets a model do real work today. And as models get stronger, they need less of it. The scaffolding thins.

This is not a theory. Manus, a startup building agents, rebuilt its harness four times in six months. Each rewrite removed complexity the newer models no longer needed.⁸ The lesson is steadying: build the harness your model needs now, and expect to take part of it down later. The model and the harness rise together.

10 The design space

Seven decisions that define your harness

There is no universal right answer here — only trade-offs you choose on purpose. Seven decisions cover most of the design space. Make them deliberately and you have a harness; drift into them and you have a mess.

Seven forks in the road. Each branch trades one good thing for another. There is no free choice.

How many agents?

One is simpler and cheaper. Many give isolation and specialism — paid for in coordination.

How should it reason?

ReAct — think, act, observe, repeat — is flexible but spends calls.¹⁰ Plan-and-execute drafts the whole plan first; running the steps in parallel can be up to 3.7× faster.¹¹

How much context?

Keep it rich for recall, or compact it hard to save tokens and dodge the quality drop that comes as the window fills.¹²¹³

How do you verify?

Computationally — tests, linters, types — which is exact. Or inferentially, with a model as judge, which is flexible. Either way, checking pays: one engineering lead measured roughly 2–3× the quality from verification.¹⁸

What may it touch?

Permissive is fast and risky. Restrictive is safe and slower. The real question is where a human signs off.

How many tools per step?

The full toolkit is flexible; a minimal, scoped set performs better. Vercel cut eighty percent of its tools and watched success climb from eighty to one hundred percent.¹⁴

How thick is the harness?

Thin trusts the model. Thick encodes the logic in your own code. More on that next.

11 The central bet

How thick should the harness be?

This is the decision under all the others. A thin harness hands the model the wheel: you supply tools, context, and permissions, and let the model decide what to do. A thick harness writes the logic itself — the routing, the planning, the multi-step strategy — and uses the model for the parts only it can do.

The thickness spectrum. Bet on the model, or bet on your own control. As models improve, the field keeps sliding left.

There is a temptation to over-build — to encode every step in code because code feels safe. Often it is the wrong instinct. Compaction frameworks now cut token use by roughly a quarter to a half while holding accuracy above ninety-five percent.¹⁵ The thoughtful move is to write the structure your current model genuinely needs, and no more.

The trend is steady and worth internalizing: as models get stronger, the right harness gets thinner. Build for today. Plan to remove.

12 The takeaway

The harness is the product

The model gets the headlines. The harness does the work. The same model can sit at the bottom of the field or the top, and the only thing that changed was the system around it.

So spend your care accordingly. Choose the model, yes. Then build the body with the same attention you spent choosing the brain — the loop, the tools, the memory, the checks, the limits. Build the harness your model needs today, and keep it light enough to thin out tomorrow.

If you're not the model, you're the harness.Build it on purpose.

§ Sources

Where this comes from

This guide synthesizes current, primary sources — engineering blogs from the labs and framework teams, and peer-reviewed papers. Every figure above is drawn from the work below. Where a claim rests on a single team's report, we've said so in the text.

Vivek Trivedy, "The Anatomy of an Agent Harness," LangChain Blog. langchain.com/blog/the-anatomy-of-an-agent-harness
LangChain, "Improving Deep Agents with Harness Engineering" (Terminal-Bench 2.0, model held fixed). langchain.com/blog/improving-deep-agents-with-harness-engineering
Beren Millidge, "Scaffolded LLMs as Natural Language Computers." beren.io/2023-04-11-Scaffolded-LLMs-natural-language-computers
Anthropic, "Building Agents with the Claude Agent SDK" (renamed from the Claude Code SDK). anthropic.com/engineering/building-agents-with-the-claude-agent-sdk
Anthropic, "Effective Context Engineering for AI Agents." anthropic.com/engineering/effective-context-engineering-for-ai-agents
Celia Chen, "Unlocking the Codex Harness," OpenAI. openai.com/index/unlocking-the-codex-harness
Ryan Lopopolo, "Harness Engineering," OpenAI. openai.com/index/harness-engineering
Yichao "Peak" Ji, "Context Engineering for AI Agents: Lessons from Building Manus." manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
Microsoft, "Microsoft Agent Framework, Version 1.0" (GA, April 2026). devblogs.microsoft.com/agent-framework/microsoft-agent-framework-version-1-0
Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," arXiv:2210.03629. arxiv.org/abs/2210.03629
Kim et al., "An LLM Compiler for Parallel Function Calling" (up to 3.7× faster), arXiv:2312.04511. arxiv.org/abs/2312.04511
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," arXiv:2307.03172. arxiv.org/abs/2307.03172
Chroma Research, "Context Rot." research.trychroma.com/context-rot
Vercel, "We Removed 80% of Our Agent's Tools" (success 80% → 100%). vercel.com/blog/we-removed-80-percent-of-our-agents-tools
ACON, "Agent Context Optimization" (26–54% fewer tokens, 95%+ accuracy), arXiv:2510.00615. arxiv.org/abs/2510.00615
Geoffrey Huntley, "The Ralph Loop." ghuntley.com/loop
Birgitta Böckeler, "Sensors and Guides for Coding Agents," martinfowler.com. martinfowler.com/articles/sensors-for-coding-agents.html
Boris Cherny, on verification quality (≈2–3×). x.com/bcherny/status/2007179861115511237

If you're not the model,you're the harness.