AI Agents · Metric Governance

Your AI Agent Used a Retired Metric Definition. Did It Tell You?

Kyle Hui
[Hero image: an empty boardroom at golden hour; a wall display shows an FY 2024 ARR slide of $2,617,940 stamped 'DEFINITION RETIRED · OCT 2025.']

A board-deck question lands in your AI analytics agent. The model writes SQL, pulls a number, and ships it into the deck under a row labeled "ARR." The number is plausible. The query ran. Nobody flagged anything. But the framing the user asked for — "compute it the way Marketing prefers" or "use the v2 churn formula we had last year" — pointed at a definition Finance retired in October. The agent didn't refuse. It didn't even mention it.

This is the failure mode that breaks AI for analytics — not hallucination in the dramatic sense, but the quieter problem of an agent confidently answering the wrong question. You can't catch it in QA because the SQL is syntactically correct and the number prints. You catch it in the board meeting, when someone asks where the number came from.

We measured how often this happens, with and without governance. On Drift questions, the comparison is almost binary.

What we ran

Across 9,000 single-turn SQL calls — 120 metric questions × three frontier LLMs (Claude Opus 4.7, GPT-5.4, Claude Sonnet 4.5) × five stability runs × five context configurations — we tested each configuration against the same deterministically-seeded synthetic SaaS warehouse:

  • A — bare schema (column names + types only)
  • B — well-documented warehouse (DDL + comments + table descriptions)
  • C — Cube semantic layer, expert-configured per the published adequacy checklist
  • D — dbt MetricFlow, expert-configured per the published adequacy checklist (8 staging models, 8 semantic models, 36 metrics, 4 saved queries)
  • E — ClariLayer's governed envelope

Every baseline answered the same questions. Every baseline ran under the same response contract — a structured {warnings, clarification_request, sql, rationale} shape, applied uniformly so the comparison stays apples-to-apples. The only thing that varied was the context block. ClariLayer's envelope adds query-conditioned governance directives — explicit rules the model can act on, not just descriptive metadata it has to infer over.
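For concreteness, here is the contract shape as we read it (the four fields are from the benchmark; the type annotations are illustrative):

```python
from typing import List, Optional, TypedDict

# The shared response contract every baseline answered under. Field names
# come from the benchmark write-up; the annotations are our reading of them.
class AgentResponse(TypedDict):
    warnings: List[str]                   # governance warnings to surface verbatim
    clarification_request: Optional[str]  # set when the agent should ask back instead of guessing
    sql: str                              # the query the agent would run
    rationale: str                        # why the agent chose this computation
```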

For Drift questions, a locked judge model classified each answer into one of four labels: canonical-with-rejection (the answer flags or refuses the deprecated request and returns the canonical definition), silent-canonical (uses the current definition without saying so), deprecated (computes the retired definition), or error. Only canonical-with-rejection counts as a pass.
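In scoring terms, the Drift rule is a one-liner (label strings here are illustrative stand-ins; the exact taxonomy is in the published judge prompt):

```python
# Drift scoring: only canonical-with-rejection passes. Label strings are
# illustrative; the exact ones live in the published judge prompt.
DRIFT_LABELS = ("canonical_with_rejection", "silent_canonical", "deprecated", "error")

def drift_pass(label: str) -> bool:
    """PASS iff the answer flagged or refused the retired framing and returned the canonical definition."""
    return label == "canonical_with_rejection"
```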

The full methodology, every limitation we know of, and the raw per-call data are in the white paper and the companion repo.

The headline

On Drift questions — where the user explicitly asks for a deprecated framing — ClariLayer's governed envelope was the only baseline that consistently refused to compute it.

| Baseline | Drift PASS rate (5 runs, n=360 each) |
|---|---:|
| A — bare schema | 0 / 360 |
| B — documented warehouse | 0 / 360 |
| C — Cube semantic layer | 1 / 360 |
| D — dbt MetricFlow | 0 / 360 |
| **E — ClariLayer governed envelope** | **297 / 360 (82.5%)** |
[Figure: stacked horizontal bar chart, 'Drift label distribution across baselines (n=360 each)'; each baseline's row sums to 360 across the four labels, and only the ClariLayer row shows a meaningful PASS segment.]

Figure 1: Drift label distribution across baselines (V2.1-RR, n=360 each).

Read that table top to bottom. Across 1,440 non-ClariLayer Drift calls, exactly one call passed. The rest split into two flavors of wrong: 1,436 produced the deprecated answer outright (A: 360/360, B: 360/360, C: 358/360, D: 358/360) and 3 produced the canonical answer silently, with no rejection or warning (C: 1/360, D: 2/360). The single PASS came from Cube, on a canonical-with-rejection. ClariLayer's E baseline produced 297 canonical-with-rejection (PASS), 30 silent-canonical (FAIL), 32 deprecated (FAIL), and 1 error across its 360 Drift calls — the totals reconcile to 360.

That's the qualitative finding. Among the tested baselines, ClariLayer is the only one that surfaces rejections at routine rates when the user asks for a definition the business retired. The non-ClariLayer baselines failed to flag the retirement in 1,439 of 1,440 Drift calls.

For readers who want a single number: on aggregate accuracy across all five categories, ClariLayer scored 47.17% (849 / 1,800) versus a high of 2.94% for the next-best alternative (documented warehouse) — a ~16× lift over the highest-scoring non-governed baseline, and ~71× over bare warehouse access. Across the five stability runs, the aggregate holds to σ = 0.36 pp (range 46.94–47.78%). The dominance is not a single-run artifact.

[Figure: vertical bar chart, 'Per-baseline aggregate PASS rate (V2.1-RR)': A 0.67%, B 2.94%, C 1.78%, D 1.39%, E 47.17%, with a callout reading '~71× vs A · ~16× vs B · ~27× vs C · ~34× vs D.']

Figure 2: Per-baseline aggregate PASS rate. ClariLayer scored 47.17% versus 2.94% for the next-best non-governed alternative.
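If you want to check the callout multipliers yourself, they reconcile when you back-derive pass counts from those rates over n = 1,800 calls per baseline (a sanity check on the published aggregates, not raw harness output):

```python
# Back-derived pass counts from the published PASS rates (n = 1,800 per
# baseline): 0.67% → 12, 2.94% → 53, 1.78% → 32, 1.39% → 25, 47.17% → 849.
passes = {"A": 12, "B": 53, "C": 32, "D": 25, "E": 849}
for name in "ABCD":
    print(f"E vs {name}: ~{passes['E'] / passes[name]:.0f}x")
# -> ~71x vs A, ~16x vs B, ~27x vs C, ~34x vs D — matching Figure 2's callout.
```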

Why this happens

[Figure: editorial still life titled 'THE TRUST CONTRACT' — a 'USER REQUEST' card stamped DEPRECATED, an 'ENVELOPE REJECTION' card, and a response card reading warnings: ["deprecated_framing"] and sql: <canonical version>.]

Figure 3: The trust contract — when a request matches a retired definition, the envelope surfaces the rejection text and computes the canonical version.

Documented schemas, semantic layers, and dbt configurations all describe how a number is computed. They are designed for the case where the right computation is the only computation worth doing. None of them is designed to encode the business fact that the v2 churn formula was retired in October, or that the marketing-preferred revenue framing isn't what the board approved.

ClariLayer's envelope ships that fact as a directive the model can act on. When a Drift question's framing matches a registered trigger pattern, the envelope surfaces a rejection_template the model copies into its warnings field verbatim, then computes the canonical version in the sql field. The model isn't being smarter. The envelope is being more legible.
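As a sketch of what such a directive could look like in the context block (the rejection_template and trigger-pattern mechanism are described above; every other field name and value here is illustrative, not ClariLayer's actual schema):

```python
# Illustrative Drift directive. Only the rejection_template / trigger-pattern
# mechanism is from the article; field names and values are placeholders.
drift_directive = {
    "metric": "churn_rate",
    "trigger_patterns": ["v2 churn formula", "last year's churn definition"],
    "retired": {"when": "October 2025", "by": "Finance"},
    "canonical": "churn_rate (current approved definition)",
    "rejection_template": (
        "The requested churn framing was retired by Finance in October 2025. "
        "Computing the current approved definition instead."
    ),
}
```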

That's the load-bearing claim. The model didn't suddenly learn to reason about deprecation history — it can't, on a single-turn call against an unfamiliar warehouse. What changed is that ClariLayer's envelope tells the model, in the prompt, that this framing was retired, when, by whom, and what to surface in its response if a user asks for it. Every other tested baseline leaves the model to infer that, and across the three frontier models in this benchmark, that inference almost never happened.

ClariLayer turns that business rule into instructions the agent can use. If a request matches a retired definition, the envelope tells the agent to warn the user, name the current definition, and compute the approved version. The same pattern covers version pins, approval state, and ambiguous scopes: give the agent the business rule and the expected response, instead of hoping it infers both from metadata.
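Put together with the response contract above, a governed Drift answer looks something like this (wording and SQL are placeholders, not harness output):

```python
# What the directive produces under the shared contract: the rejection text
# lands verbatim in `warnings`, the canonical computation lands in `sql`.
governed_answer = {
    "warnings": [
        "The requested churn framing was retired by Finance in October 2025. "
        "Computing the current approved definition instead."
    ],
    "clarification_request": None,
    "sql": "SELECT /* canonical churn_rate */ ...",  # placeholder query
    "rationale": "Request matched a retired definition; canonical version computed.",
}
```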

What we're not claiming

In this re-derivation benchmark, ClariLayer's governed envelope scored 47.17% versus 2.94% for the next-best non-governed baseline. That is not a claim that the product is "half-right." The Canonical Metric API is not an LLM re-derivation path: it returns the governed metric definition and governance context directly. The benchmark tests what happens when an LLM, mid-conversation, has to re-derive SQL from a context block on a question battery designed to stress governance. The ~16× lift versus documented warehouse is the right comparison frame.

We also want to surface what the data does and doesn't say.

Single-turn methodology under-counts two categories where the right behavior is multi-turn. Ambiguity questions ("what's our churn rate?" with three defensible scopes) and Versioning questions ("which definition applied to the Q1 board deck?") both have a correct product behavior of asking back — and a strict-pass scoring rubric counts asking-for-clarification as a fail. We disclose this candidly. ClariLayer scored 9.7% on Ambiguity and 10.3% on Versioning under this methodology — better than every alternative on Ambiguity (Cube and dbt MetricFlow tied at 4.2%; documented at 0.0%; bare at 1.1%), but well short of the >30 pp lift threshold the spec called for. The envelope already ships the fields a multi-turn agent would consume; we expect a multi-turn evaluation to lift both categories meaningfully. v2.2 will run that evaluation and report it.
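For example, the correct multi-turn behavior on an Ambiguity question is to populate the contract's clarification_request field rather than guess a scope — something like this (the three scope names are made up for the example):

```python
# Illustrative Ambiguity response: ask back instead of guessing. Field usage
# follows the shared contract; under the single-turn strict-pass rubric,
# this scores FAIL even though it is the right product behavior.
ambiguous_answer = {
    "warnings": [],
    "clarification_request": (
        "churn rate has three defensible scopes in this warehouse "
        "(logo, gross ARR, net ARR). Which one should this use?"
    ),
    "sql": "",  # withheld until the scope is confirmed
    "rationale": "Three defensible scopes; asking back rather than guessing.",
}
```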

Approval modestly regressed (52.5% → 45.8%) between v2.0 and V2.1-RR. We added an explicit approval_state.policy directive that prioritizes surfacing non-APPROVED state behavior — the right product choice — and we suspect it changes the tradeoff against the simple-fact-retrieval pattern v2.0 was tuned for. We have not isolated the cause by question subtype. We report the regression because the alternative is publishing only the numbers that flatter us, and we don't do that. The absolute lift over every alternative on Approval remains decisive (45.8% vs ≤0.6% for A/B/C/D).

The benchmark uses a synthetic warehouse, three models, and a vendor-authored question battery. All disclosed. We mitigate by publishing the questions, ground-truth SQL, harness, raw per-call JSONL, judge prompt, calibration set, and adequacy checklists openly. The next proof point is external design-partner runs against real warehouses on fresh seeds with the then-current frontier roster.

What this changes for AI agent builders

If you are shipping an analytics agent — whether it's a chat copilot, a workflow agent that auto-triggers actions on metric thresholds, or an internal tool that lets non-technical employees ask warehouse questions in plain English — the practical implication of this benchmark is direct.

The bottleneck on whether your agent answers the right question is not which frontier model you picked. The non-governed baselines we tested ran against three of the strongest production models available, expert-configured semantic layers, and well-commented schemas — and they refused the deprecated framing once in 1,440. That's not a model-capability gap. That's a context-layer gap.

It is also not a problem that gets fixed by adding guardrails on top of an ungoverned context block. The model isn't going wrong because of a safety failure; it's going wrong because nothing in its prompt told it which framing was retired. Guardrails on top of bad context produce the same wrong answer with a more confident-sounding disclaimer.

Action: before you let an analytics agent write to a dashboard, deck, or workflow, test it on deprecated-definition prompts from your own warehouse. If the prompt context does not include the current canonical definition, deprecated aliases, owner, approval state, and the exact warning or refusal text to return, treat the agent as ungoverned no matter how strong the model is. On our benchmark, that difference was 0–1 PASS in 1,440 non-governed Drift calls versus 297 PASS in 360 with ClariLayer's envelope.
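A minimal version of that smoke test (agent_call is a stand-in for however you invoke your agent; the prompts are examples, not our question battery):

```python
# Drift smoke test: ask your agent for retired framings from your own
# warehouse and check that it flags them, rather than silently answering.
DEPRECATED_PROMPTS = [
    "Compute churn the way we did last year (the v2 formula).",
    "Use the Marketing-preferred revenue framing for the board deck.",
]

def agent_call(prompt: str) -> dict:
    raise NotImplementedError("replace with your own agent invocation")

def flags_retirement(response: dict) -> bool:
    """Governed behavior: a warning or a clarification, never a silent answer."""
    return bool(response.get("warnings")) or bool(response.get("clarification_request"))

for prompt in DEPRECATED_PROMPTS:
    verdict = "PASS" if flags_retirement(agent_call(prompt)) else "UNGOVERNED"
    print(f"{verdict}: {prompt}")
```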

What's next

The honest gap in V2.1-RR is multi-turn. The result above is single-turn — the model writes one response and the harness scores it. Real analytics agents iterate: they ask one clarifying question, read the user's reply, refine, sometimes call a tool. The two categories we under-count under single-turn methodology — Ambiguity and Versioning — are precisely the categories where the right product behavior is multi-turn, and the envelope already ships the fields a multi-turn agent would consume.

v2.2 will run a multi-turn evaluation against the same warehouse and question battery, with the agent allowed one round of clarification, and report the comparison candidly. We expect Ambiguity and Versioning to lift meaningfully; Drift and Lookup are largely single-turn-complete and should hold. We'll know in the next benchmark cycle.

In parallel, we are recruiting private, invitation-only design partners to run this benchmark against their own warehouse and metric questions. If you are deploying analytics agents at a 200+ person company and want to compare your current setup against a governed envelope on your own metrics, the deliverable is a confidential benchmark report scored against the same rubric, with the methodology and harness we ran here.

Read more

  • Read the white paper: Trust Benchmark V2.1-RR white paper
  • Reproduce the run: companion repo — questions, ground-truth SQL, harness, judge prompt, adequacy checklists, and per-call JSONL across all five stability runs
  • Compare your setup in a private design-partner run: email kyle@clarilayer.com with subject line "Trust Benchmark — DP run" (invitation-only; for teams deploying analytics agents at 200+ person companies)
  • Learn what ClariLayer governs: clarilayer.com

Written by

Kyle Hui

Founder, ClariLayer

Building the context layer for business metrics in the AI era.