Long Live the Harness

In the early days of AI, we saw the rise of “GPT Wrappers.”

These were companies whose product resembled a thin layer on top of a model. People loved to mock them, saying all the value was in the model and everything around it was commoditized. “Why would I use your app when I can just use ChatGPT directly?”

Years later, we have a new name for “wrapper”: “harness.” OK, that’s a crude analogy and not exactly apples to apples... a harness is really the code that determines what information a model sees at each step, what to store, what to retrieve, and what context to present.

It’s the scaffolding around the model. But the spirit of the comparison is directionally right: there’s an enormous amount of value in what sits around the model, not just the model itself. And we now have data to prove it.
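To make that definition concrete, here’s a minimal sketch of a harness loop. Everything here (`Memory`, `call_model`, `run_harness`) is invented for illustration, not any real framework’s API:

```python
# A minimal sketch of a "harness": the code around the model that decides
# what context to assemble, what to persist, and what to feed back each step.
# All names here are hypothetical stand-ins.

class Memory:
    def __init__(self):
        self.entries = []

    def store(self, item):
        self.entries.append(item)

    def retrieve(self, query, k=3):
        # Toy relevance: most recent entries mentioning the query string.
        hits = [e for e in self.entries if query.lower() in e.lower()]
        return hits[-k:]

def call_model(prompt):
    # Stand-in for a real LLM API call.
    return f"model-output-for:{len(prompt)}-chars"

def run_harness(task, steps=3):
    memory = Memory()
    output = ""
    for step in range(steps):
        # The harness, not the model, decides what the model sees...
        context = memory.retrieve(task)
        prompt = f"Task: {task}\nRelevant notes: {context}\nStep {step}:"
        output = call_model(prompt)
        memory.store(output)  # ...and what gets persisted between steps.
    return output
```

Every design choice in that loop (what to retrieve, how to phrase the prompt, what to store) is a knob the model itself never touches.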

Stanford just released a study called Meta-Harness that showed something pretty remarkable. Changing the harness around a fixed model (same model, same weights, nothing different about the model itself) can produce a 6x performance gap on the same benchmark. 6x!

Same model. The only thing that changed was the code wrapping it. Their system, which uses an AI agent to automatically search for better harnesses, beat the best hand-engineered solutions by 7.7 points on text classification while using 4x fewer tokens. It hit #1 on an actively contested coding benchmark. And the harnesses it discovered for math problems transferred across five completely different models that were never seen during the search process.

The model race, while important, is not the whole game. For the last few years, the industry has been laser-focused on who has the best model. Trillions of dollars of value creation tied to “my model is smarter than your model.” And yes, model quality matters. But this research suggests that the orchestration layer around the model (how you manage context, what you retrieve, when you retrieve it, what you store, how you handle errors) is where a massive amount of real-world performance also lives.

For founders, this is pretty freeing.

You don’t need to train your own foundation model. [Side note - we ARE seeing a lot of leading application companies train their own models (either post-trained or pre-trained), which is contrary to what I just mentioned. I think that trend continues, and it’s something I’ll write about next week.]

You don’t even necessarily need to pick the single “best” model. What you need is to build the best harness for your domain. If a 6x performance gap comes from harness quality on a generic benchmark, imagine what a deeply tuned, domain-specific harness could do in your vertical.

There’s new value in the wrapper (harness)! To my earlier “side note” - I think we’ll see some of the most successful application companies start off by building a killer harness, then over time parlay that into amazing data collection, which is then parlayed into a post-trained model, and ultimately a pre-trained model of their own. BUT - I think the key first step for many will be a killer harness.

One of the findings from the paper that I think deserves more attention is about compression. The researchers tested what happens when you summarize the feedback given to the harness optimizer versus giving it the raw, uncompressed execution traces.

Summaries made things worse. The full execution traces (the raw prompts, tool calls, model outputs, state updates from every prior run) outperformed the compressed version by 15 points at median. Now, this probably doesn't shock anyone who's been building agents... we all know intuitively that you lose something when you over-abstract (i.e., compress).

But having it quantified at a 15 point delta is useful. It puts a number on the cost of being lazy with context. And it runs counter to a lot of the default patterns people are using today, where the first instinct is to summarize everything to save tokens and reduce costs. There's a real tradeoff there, and most teams are probably leaving performance on the table.
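Here’s a toy sketch of the tradeoff. The trace data and both helper functions are invented for illustration; the point is just that the cheap summary drops exactly the middle of the run:

```python
# Hypothetical sketch of the two feedback modes being compared:
# the raw execution trace vs a lossy summary of it.
import json

def raw_trace_feedback(trace):
    # Full fidelity: every prompt, tool call, and output survives.
    return json.dumps(trace, indent=2)

def summarized_feedback(trace):
    # Lossy: keep only the first and last events plus a count of the rest.
    kept = trace[:1] + trace[-1:]
    dropped = max(0, len(trace) - len(kept))
    return json.dumps({"kept": kept, "events_dropped": dropped})

# Made-up trace from a made-up math run.
trace = [
    {"step": 0, "prompt": "solve x+2=5", "tool": None, "output": "x=3"},
    {"step": 1, "prompt": "verify x=3", "tool": "calculator", "output": "ok"},
    {"step": 2, "prompt": "format answer", "tool": None, "output": "x = 3"},
]

raw = raw_trace_feedback(trace)
summary = summarized_feedback(trace)
# The summary is cheaper in tokens, but the tool call in the middle of the
# run is gone -- and that kind of detail is what the optimizer needed.
```

Token savings and feedback fidelity pull in opposite directions; the paper puts a number on that tension.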

There’s also something kind of wild happening here from a “meta” standpoint. The system Stanford built uses Claude Code as the agent that writes better harnesses. So you have AI... writing the orchestration code... for other AI systems. And it’s doing it better than humans. We hear a lot about recursively self-improving models lately. Is this a good implementation of that? It’s happening right now, in a practical engineering context, producing measurable results on real benchmarks. Agents improving agents. We’re going to see a lot more of this.
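The outer loop of that idea can be sketched in a few lines. This is a toy greedy hill-climb with invented stand-ins (`propose_variant`, `evaluate`), not the paper’s actual search, which uses a full coding agent to edit the harness source:

```python
# Toy sketch of "agents improving agents": propose a harness variant,
# score it on a benchmark, keep it only if it beats the incumbent.
import random

def propose_variant(base, rng):
    # Stand-in for an agent editing the harness; here it just nudges
    # one hypothetical knob (how many items to retrieve per step).
    return {"retrieval_k": max(1, base["retrieval_k"] + rng.choice([-1, 1]))}

def evaluate(harness):
    # Stand-in benchmark: pretend k=4 is the sweet spot for this domain.
    return 1.0 - abs(harness["retrieval_k"] - 4) * 0.1

def search_harness(iterations=20, seed=0):
    rng = random.Random(seed)
    best = {"retrieval_k": 1}
    best_score = evaluate(best)
    for _ in range(iterations):
        candidate = propose_variant(best, rng)
        score = evaluate(candidate)
        if score > best_score:  # greedy: only keep strict improvements
            best, best_score = candidate, score
    return best, best_score
```

Swap the one-knob mutation for an agent that rewrites arbitrary orchestration code, and the benchmark for a real one, and you have the shape of the paper’s system.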

So if the harness matters this much, who owns it? This week Anthropic launched Claude Managed Agents, which is Anthropic saying “we’ll own the harness for you.” They’re productizing the entire orchestration layer: sandboxed execution, context management, error recovery, permissions, long-running sessions.

According to their launch post, companies like Notion, Rakuten, Asana, and Sentry are already building on it. And at $0.08 per agent runtime hour plus model usage, they’re making it very easy to get started. [another side note - check out the pricing. We’re getting closer and closer to compute cycle runtime pricing! exact text from their blog: “Managed Agents is priced on consumption. Standard Claude Platform token rates apply, plus $0.08 per session-hour for active runtime.”]

Anthropic is turning into more and more of a platform. Provide the model AND the infrastructure around it. Make it so easy to build on your platform that switching costs compound over time. It’s SO smart. However, how will developers feel? Using Anthropic’s harness will certainly “lock you in” to using their model more and more. Which is what you’d expect! But as a developer, do you want the model itself to be fungible? Swappable for the latest and greatest?

But here’s the tension, and this is where it gets interesting for founders. The entire Stanford paper is about how harness optimization is domain-specific. The generic harness is good. The tuned one is 6x better. So if you’re building a deeply vertical AI product, say an agent that handles insurance claims or manages clinical trials or runs a supply chain, a managed, general-purpose harness from Anthropic is going to be fine out of the box. But fine isn’t the same as great. And in competitive markets, the gap between fine and great is the whole ballgame.

This maps directly to the build vs. buy question I wrote about a while back. For foundational infrastructure (sandboxing, auth, session management), buy it. Use Managed Agents or something like it. But for the actual orchestration intelligence (what context to surface, when to retrieve it, how to handle domain-specific edge cases), that’s where you build. That’s where your differentiation lives. The founders who understand this distinction are going to build much better products than the ones who treat the harness as an afterthought.
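One way to picture the split: a bought, generic runtime that accepts a context policy you build yourself. `ManagedRuntime` and its API are invented for illustration (this is not Anthropic’s actual SDK), as is the insurance example:

```python
# Hypothetical build-vs-buy split: bought infrastructure takes a
# domain-specific context policy as a plug-in.

class ManagedRuntime:
    """Stand-in for bought infra: sandboxing, sessions, retries, auth."""
    def __init__(self, context_policy):
        self.context_policy = context_policy

    def run(self, task, documents):
        context = self.context_policy(task, documents)
        prompt = f"{context}\n\nTask: {task}"
        return f"model-output:{len(prompt)}"  # stand-in for the model call

def insurance_claims_policy(task, documents):
    # The part you build: domain knowledge about which context matters.
    # For claims, policy terms and prior-claim history outrank everything.
    priority = ["policy_terms", "claim_history", "adjuster_notes"]
    ranked = sorted(
        documents,
        key=lambda d: priority.index(d["kind"])
        if d["kind"] in priority else len(priority),
    )
    return "\n".join(d["text"] for d in ranked[:2])

runtime = ManagedRuntime(context_policy=insurance_claims_policy)
```

The runtime is interchangeable; the policy, which encodes what a claims adjuster actually needs to see, is the part a competitor can’t buy off the shelf.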

Maybe wrappers weren’t such a joke after all!
