AI AgentsMetric Governance

Anthropic and OpenAI both said context is the bottleneck for data agents. Here's what they didn't say.

Kyle Hui·June 9, 2026

Anthropic and OpenAI recently published unusually candid accounts of how they make data agents accurate on their own internal data. Anthropic's "How Anthropic enables self-service data analytics with Claude" measures what moved their eval numbers. OpenAI's "Inside our in-house data agent" describes the context architecture behind an agent serving more than 3,500 internal users across 600+ petabytes. Different angles, same conclusion: the bottleneck is not SQL generation. It is context — how it is structured, routed, and maintained as the warehouse drifts underneath it.

We build a small product on exactly this premise, and on their platforms: it lives inside the agents analysts already use — Claude Code, Cursor, and Codex — connected over MCP. So this is not a rebuttal; both posts are right. But they share an assumption that leaves most working analysts out of the story, and once you build past the failure modes they describe, you meet new ones neither post covers. These are our field notes.

What both posts establish

Two of Anthropic's results deserve quoting at anyone who says "just give the agent more context." First, procedural knowledge dominated: without skills — files encoding how to work, which sources to consult, in what order — accuracy "didn't exceed 21%" on their evals; with them, "consistently above 95%." Second, the anti-RAG ablation: they gave the agent grep access to their entire SQL corpus — thousands of dashboard, transformation, and notebook files — and accuracy moved by less than a point, even though the right answer was present roughly 80% of the time. Their summary: "the information was there, the agent saw it, and it still didn't use it." The bottleneck was not access; it was structure — mapping a question to the right entity. They also put a number on drift: ~95% at launch fell to ~65% over a month, until maintenance became an engineering practice.

OpenAI supplies the architecture view: six layers of context — usage patterns, human annotations, code-derived enrichment, institutional knowledge, memory, runtime lookups — including a nightly job that reads pipeline code, on their lesson that meaning lives in code, and a memory that retains the non-obvious corrections, filters, and constraints that decide correctness. Two of their lessons generalize far beyond their scale: fewer, stronger tools beat a full toolbox — exposing the full tool set made the agent less reliable — and guiding the goal beats prescribing the path — highly prescriptive prompting degraded results. All of it built by roughly two engineers — this is now buildable.

The person both posts assume away

Look at what the two architectures presuppose: canonical datasets. A semantic layer as the first source of truth. Owners for every domain. A data team to keep skill files fresh and wire review hooks into CI. Even Anthropic's start-from-zero advice — "a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill" — is a data-team project. These are blueprints for organizations that already have a data organization.

Most analysts do not work at one. The person we build for is the individual analyst — RevOps, analytics engineering, the one data person at a company that never had two — who lives in Claude Code, Cursor, or Codex and works alone against a messy, ungoverned warehouse: no semantic layer, no canonical datasets, nobody to file a ticket with, and a recurring moment that eats afternoons: why don't these two numbers match? ClariLayer is an individual-analyst context layer (MCP-delivered): the thesis both labs validated, productized for one person with no platform team behind them. The difference changes the engineering: at lab scale, process fixes context problems — review hooks, owners, marketplaces. Alone, the layer itself has to do that work.

The failure we ran into

A framing note, because we would rather under-claim: these are single paired runs on a synthetic warehouse with a real coding agent — built to find failure modes, not to measure rates.

First we ran into the failure class behind Anthropic's ablation, at the opposite end of the scale. We gave an agent full warehouse access and served it context including the exact checked metric contract for the question — a definition reconciled against the warehouse, grain and filters spelled out. It wrote wrong SQL anyway. It followed a stale prose note served right beside it: it added bonus_eligible_flag = true, a filter the contract does not list, and wrapped the time column in DATE_TRUNC('month', ...) where the contract declares an integer revenue_month — that stale note is exactly what a hand-tended CLAUDE.md becomes by month three. The same agent, the same day, with our context disconnected, wrote SQL without the invented filter.

In that paired run, serving the right context made the output worse: the contract arrived undifferentiated, next to a wrong note, and the note won. Retrieval was not the problem; structure and routing were — consistent with Anthropic's finding, at single-analyst scale.

The fix was not more context. It was making the contract data instead of prose — grain, filters, expected columns as fields — plus intent-aware routing that puts the one checked contract in front of the agent with explicit guidance on how far to trust it. The agent then used it.

Three failure modes that appear after routing works

Then routing created new failure modes — the ones nobody is writing about yet.

Agents invent scope tags. Context entries carry a scope tag — which use-case they belong to — and agents make up reasonable-but-wrong ones: we watched one ask for sales_commission_reporting when the metric lived under commission-report. Filter scope exactly and the checked contract silently vanishes from recall; the agent goes back to confident improvisation. Our answer: a collision-safe widened second pass. When scoped recall finds no checked contract, it widens across scopes — and anything it surfaces from another scope arrives explicitly labeled with the requested and the matched scope, so the agent confirms rather than assumes. Never silently authoritative.

Refusing to guess can strand the agent. Widening surfaces the next problem: two checked metrics under different scopes can both plausibly answer the question. Refusing to auto-pick is correct — auto-picking the wrong contract is the precise failure a context layer exists to prevent. But our first version of "safe" was a dead-end: the agent saw no route, invented yet another wrong tag, and stalled. The fix: refuse to pick and name the candidates. The recall response carries a hint listing the candidate scope tags. In our re-test, the agent read the named tags, re-recalled with the right one, and wrote the correct SQL unaided. Safe does not have to mean stranded; refusal needs an exit ramp.

And the conflict neither lab addresses: trusted context disagreeing with trusted context. Anthropic orders source types by trust — semantic layer first, then lineage, then docs. OpenAI describes six layers with no published precedence rule. Neither says what the agent should do when two trusted sources collide. We hit this head-on: a checked metric contract that says, in effect, "use exactly these filters," against a standing procedural rule — "always exclude internal and test orgs." The agent silently dropped the rule. The contract's exactness beat it, and the output was authoritative-looking numbers that were wrong for the business. And the obvious patch is a trap: silently applying the rule would be worse — an agent that quietly edits the filters of a checked contract is the failure mode a trust system dies on.

So the agent does neither. The served guidance now says: if a procedural rule adds a filter the contract does not list, neither silently drop nor apply it — surface the conflict and propose a contract update for human review. In the re-test, the agent's own run notes read: "Per the returned guidance, I did not silently apply the procedural filter… I used the asserted contract's exact filters and grain, surfaced the conflict, and proposed a contract update for human review." And it filed the proposal. A human approves it; the contract is re-reconciled with the rule baked in. Governed change, not silent judgment.

The principles that fell out

Six things we now believe, each earned by a failure above:

Recall-first proactivity. The MCP connection carries standing instructions that tell the agent to recall context before writing SQL — durable setup, not a prompt you paste each session. Context that relies on the analyst remembering to ask for it usually does not get asked for.
Structured contracts over prose. A metric definition is data — grain, filters, expected columns — not a paragraph the agent may or may not absorb. The stale bonus_eligible_flag came from prose.
Reconciled, not blindly asserted. Saved context is checked against the warehouse — the agent runs the check with its own access and reports the result shape; ClariLayer never holds credentials or executes SQL server-side. Mismatches become caveats on the entry.
Honest labels over silent confidence. Cross-scope matches are labeled, ambiguity is named, caveats travel with the entry — the agent is told how much to trust what it is served. Silent confidence is how wrong numbers acquire authority.
Humans gate change. Agents propose updates to canon, never silently rewrite it. The conflict above ends with a human approving a proposal, not an agent's judgment call.
Routing is regression-tested. Every failure in this essay became a pinned case in an offline routing scorecard in CI. Routing regressions are bugs, and bugs get tests.

What we deliberately don't claim

One more receipt, of a different kind. ClariLayer's trust statuses are asserted and caveat. There is no verified — we built that stamp, then gated it off before launch. The heuristic that would have backed it — extracting declared signals from SQL for comparison — could not be made sound against the long tail of SQL dialects; every adversarial review round found another hole. A single false "verified" is the one failure a trust product cannot ship. So today a clean reconcile pass still reads asserted, a mismatch reads caveat, and verified returns only when a real SQL-AST parser replaces the heuristic. A trust product withholding its strongest trust mark is ironic; shipping it unsound would have been worse. Trust products earn trust by what they refuse to claim.

If you read those posts and have no data team

The labs are right, twice over: data-agent accuracy is a context problem — structure, routing, maintenance — not a SQL-generation problem. They have shown the destination for organizations with a data platform; we are building the version for the analyst with none of that — one person, a messy warehouse, an agent, and numbers that have to be right.

Stop re-explaining your data to your AI every session. ClariLayer connects to Claude Code, Cursor, or Codex over MCP; single-player is free, at clarilayer.com. If the lab posts left you wondering what this looks like without a data team — that is the product, and we would genuinely like to hear where it breaks on your warehouse.

Written by

Kyle Hui

Founder, ClariLayer

Building the context layer for business metrics in the AI era.