
The ClariLayer Trust Benchmark v1: A 2,136-Call Study of AI Accuracy

Kyle Hui
Editorial title spread for The ClariLayer Trust Benchmark v1: 2,136 model calls, 89 questions, 5x accuracy lift with governance, 91-99% error rate without.

Last quarter, a deprecated version of one of our test metrics returned a 139.49% churn rate for FY 2024. Mathematically impossible — you can't lose more customers than you started with. But the SQL ran, the number printed, and a model that read a documented schema would happily put it on a slide.

That number is from our own benchmark, on our own synthetic warehouse. The point is not that one query was wrong. The point is that across 89 real metric questions, 2,136 model calls, and three frontier LLMs writing SQL against the same data, the ungoverned configurations produced answers that were wrong 91-99% of the time — and wrong differently in every tool. Your CFO, your BI lead, and your growth team are looking at the same AI assistant and seeing different numbers for the same metric. That is not a model problem. It is a context problem.

We tested it. The data is unambiguous.

The setup

We built a deliberately messy SaaS warehouse — eight tables, ~5M rows, mixed timezones, naming inconsistency (customer_id, cust_id, account_id), 5% test-account contamination, currency mixing, the soft-delete inconsistency every analyst has cursed at. Real warehouses look like this; the academic ones don't.
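To make "deliberately messy" concrete, here is a rough sketch of what two of those tables might look like. The DDL below is illustrative: fct_invoices and every column name are assumptions, and only dim_users (which appears later in this post) is a table name taken from the benchmark itself.

```sql
-- Illustrative DDL (DuckDB syntax), not the benchmark's actual schema.
CREATE TABLE dim_users (
    customer_id     BIGINT,          -- this table calls the key customer_id...
    signup_ts       TIMESTAMP,       -- stored as naive UTC, no timezone attached
    is_test_account BOOLEAN,         -- ~5% of rows are internal test accounts
    deleted_at      TIMESTAMP        -- soft delete via a nullable timestamp here...
);

CREATE TABLE fct_invoices (
    cust_id         BIGINT,          -- ...but this table calls the same key cust_id
    amount          DECIMAL(18, 2),  -- multiple currencies mixed in one column
    currency        VARCHAR,
    invoiced_at     TIMESTAMP,
    is_deleted      BOOLEAN          -- ...and soft delete via a flag here
);
```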

We then asked three production-grade models — Claude Opus 4.7, GPT-5.4, and Claude Sonnet 4.5 — to write SQL for 89 natural-language business questions ("How many active users in March 2026?", "What was Q1 revenue?"). Each question had pre-registered ground-truth SQL. A model passes if its number matches the expected value to within 0.1%. The harness runs the SQL against DuckDB and grades it.
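As a concrete illustration of the grading rule, here is a minimal sketch of a 0.1% relative-tolerance check in DuckDB SQL. The column names and example values are invented for illustration; they are not the harness's actual output schema.

```sql
-- A call passes when the model's number is within 0.1% (relative) of ground truth.
-- Column names and values below are illustrative only.
WITH results(question_id, model_answer, ground_truth) AS (
    VALUES ('q-pass', 1000.5, 1000.0),   -- 0.05% off: inside the tolerance
           ('q-fail', 1002.0, 1000.0)    -- 0.20% off: outside the tolerance
)
SELECT
    question_id,
    abs(model_answer - ground_truth) / nullif(abs(ground_truth), 0) <= 0.001 AS passed
FROM results;
```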

Each model saw the same 89 questions across four context configurations:

  • A — Bare schema (column names + types). The naive "point an LLM at your warehouse" integration.
  • B — Documented schema (DDL + comments). A well-run dbt project.
  • C — Cube.dev semantic layer. The strongest non-ClariLayer prior art — Cube YAML defining cubes, dimensions, joins, and a single canonical measure per metric.
  • D — ClariLayer governed context. The structured Metric API output: governed SQL, version, owner, time logic, canonical filters, deprecated-variant reasons.

Single-turn, no retrieval, no tools. Every confound that wasn't context quality was deliberately stripped out. Reproducible from a public commit (run id v1-2026-04-27, total spend $58.58).
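For a feel of what Baseline D hands the model, here is a rough sketch of a governed entry for one metric, rendered as annotated SQL. The format, field names, and query are all illustrative assumptions: the actual Metric API returns structured data rather than a comment block, and the real governed definitions differ.

```sql
-- Hypothetical governed-context entry (illustrative only):
--   metric:   active_users          version: v3        owner: analytics-eng
--   time:     calendar month, naive UTC boundaries
--   filters:  exclude test accounts; count distinct customers, not events
--   history:  v2 deprecated (counted raw rows and included test accounts)
SELECT count(DISTINCT e.customer_id) AS active_users
FROM fct_events AS e                       -- fct_events is an invented table name
JOIN dim_users  AS u USING (customer_id)
WHERE u.is_test_account = FALSE
  AND e.event_ts >= TIMESTAMP '2026-03-01'
  AND e.event_ts <  TIMESTAMP '2026-04-01';
```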

What we found

Bar chart of per-model SQL accuracy across four context baselines. The bare-schema, documented-schema, and Cube semantic-layer bars sit near zero; the governed-context bar exceeds 35% for all three models.

Pooled across all three models, baseline accuracy:

| Baseline | Accuracy |
|---|---:|
| A — Bare schema | 1.3% |
| B — Documented schema | 8.6% |
| C — Cube semantic layer | 2.8% |
| **D — ClariLayer governed context** | **42.7%** |

Governance multiplies accuracy 5× over a documented schema and 33× over bare warehouse access. The 95% confidence intervals do not overlap. The H1 paired bootstrap on D-vs-B comes in at +34.1 percentage points, p < 0.001.

That is the headline. Below it is the part that matters more: what the wrong answers actually look like.

The wrong answers aren't random

When the ungoverned baselines fail, they don't fail at random. They fail in the same way an analyst would fail on day one of a new job: count the rows in dim_users without filtering test accounts, sum the obvious revenue column without applying the timezone and tier filters the business actually uses. On new_user-002 ("how many new users in March 2026?"), 12 of 14 ungoverned attempts converged on the same wrong answer — 14,230 — because that's the textbook formula applied to messy data.
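A sketch of the shape of that failure, reusing the illustrative column names from earlier (dim_users is from the benchmark; the columns are assumptions):

```sql
-- Day-one / textbook query: it runs, it returns a number, and it includes the
-- ~5% test-account contamination. This is the kind of answer the ungoverned
-- baselines converge on.
SELECT count(*) AS new_users
FROM dim_users
WHERE signup_ts >= '2026-03-01' AND signup_ts < '2026-04-01';

-- The variant the business actually uses: same question, plus the canonical
-- filter that governance carries. Column names are illustrative.
SELECT count(*) AS new_users
FROM dim_users
WHERE is_test_account = FALSE
  AND signup_ts >= TIMESTAMP '2026-03-01'
  AND signup_ts <  TIMESTAMP '2026-04-01';
```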

This is not hallucination in the usual sense. The model is confidently applying the textbook variant of the metric, because nothing in the context tells it your business doesn't use the textbook variant. Governance is the thing that tells it.

A more dramatic example, from our own dataset: the deprecated v2 version of our churn_rate metric counted any churn in a period over the start-of-period cohort without intersecting the two sets. On this warehouse, that variant returns the 139.49% we opened with — because the year's churns include customers who joined after the period started, so they're in the numerator but not the denominator. The governed v3 added cohort-matching to keep the metric mathematically bounded. Without governance, models reading raw schemas readily produce >100% churn rates.
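A sketch of the two variants, assuming an illustrative dim_customers table and calendar-year FY boundaries (both are assumptions; the benchmark's actual DDL and fiscal calendar may differ):

```sql
-- Deprecated v2: all churns in the period over the start-of-period cohort,
-- without intersecting the two sets. Customers who joined mid-period can land
-- in the numerator but never in the denominator, so the ratio can exceed 100%.
SELECT
    count(*) FILTER (WHERE churned_at >= '2024-01-01' AND churned_at < '2025-01-01')
      * 1.0
  / nullif(count(*) FILTER (WHERE started_at < '2024-01-01'
                              AND (churned_at IS NULL OR churned_at >= '2024-01-01')), 0)
  AS churn_rate_v2
FROM dim_customers;

-- Governed v3: cohort-matched. The numerator only counts churns from the same
-- start-of-period cohort that forms the denominator, so the metric stays <= 100%.
SELECT
    count(*) FILTER (WHERE started_at < '2024-01-01'
                       AND churned_at >= '2024-01-01' AND churned_at < '2025-01-01')
      * 1.0
  / nullif(count(*) FILTER (WHERE started_at < '2024-01-01'
                              AND (churned_at IS NULL OR churned_at >= '2024-01-01')), 0)
  AS churn_rate_v3
FROM dim_customers;
```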

Frontier reasoning, surprisingly, hurts

Editorial illustration: three frontier-model robots at a starting line. The most expensive one trips over its own reasoning while the cheaper ones cross the finish line.

The H2 finding is the one that surprised us. On the governed baseline (D), where every model is doing its best:

| Model | Accuracy on D |
|---|---:|
| Claude Opus 4.7 | 37.6% |
| GPT-5.4 | 44.4% |
| Claude Sonnet 4.5 | **46.1%** |

Opus 4.7 — the most expensive frontier model in the roster — comes in dead last. The paired-bootstrap delta is −8.4 pp, p = 0.002. Not a tie.

The mechanism is consistent across the metrics where Opus underperforms: when the governed context block notes a real fact (the warehouse stores naive UTC, but the business reports in America/Los_Angeles), Sonnet uses naive UTC dates for period boundaries and gets the right answer. Opus applies the timezone shift to the boundary literals, drops a thin slice of records from the early hours of March in UTC (still February in Pacific time), and lands 2% off — past the rubric's 0.1% tolerance. The frontier model's extra reasoning kicks in and reasons its way past the right answer.
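The boundary arithmetic is easy to see in SQL. A sketch, reusing the illustrative signup_ts column from earlier (the real questions span several metrics, not just signups):

```sql
-- Ground-truth boundaries: naive UTC month edges, matching how the warehouse
-- stores timestamps. This is the variant Sonnet tends to emit.
SELECT count(*) FROM dim_users
WHERE signup_ts >= TIMESTAMP '2026-03-01 00:00:00'
  AND signup_ts <  TIMESTAMP '2026-04-01 00:00:00';

-- Over-reasoned boundaries: shifting the literals to America/Los_Angeles
-- (PST, UTC-8 on March 1; PDT, UTC-7 on April 1). This silently excludes
-- signups from 00:00 to 08:00 UTC on March 1, enough to miss the 0.1% tolerance.
SELECT count(*) FROM dim_users
WHERE signup_ts >= TIMESTAMP '2026-03-01 08:00:00'
  AND signup_ts <  TIMESTAMP '2026-04-01 07:00:00';
```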

For BI leaders evaluating "should we upgrade the model tier or invest in metric governance?": Opus list pricing is roughly 5× Sonnet's. On these tasks, that 5× cost multiplier bought a drop in accuracy, not a gain. Spending the same budget on closing the context gap — the difference between Baseline B and Baseline D — bought +34 percentage points. The math is one-sided.

The auditable-residual finding

Here is the part that closes the loop on trust.

D doesn't get every question right. On ~57% of cells, the model still produces a wrong answer. The next question is: are those failures random, or are they predictable? If they are random, you cannot trust the system. If they are predictable, you can audit them, fix the governance once, and move on.

Heatmap of per-metric pass-rates across four context baselines. The governed-context column lights up across nearly every metric while the other three columns remain dark.

For each question, we measured the share of a baseline's wrong answers that converge on a single value. The result:

| Baseline | Share of failures clustering on one value |
|---|---:|
| A — bare schema | 57.9% |
| B — documented schema | 48.7% |
| C — Cube semantic layer | 82.2% |
| **D — ClariLayer governed** | **84.6%** |

Under governance, residual failures concentrate on the same wrong answer 85% of the time — a higher cohesion rate than any of the alternatives. When governance is in place and the model still gets it wrong, it lands on one specific, intuitively plausible variant that the governance can be tightened to rule out.
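For concreteness, here is roughly how a cohesion number like this can be computed from the run's graded results. The benchmark_results table, its columns, and the per-question-then-average aggregation are all assumptions about the harness output, shown only to make the definition precise.

```sql
-- Among each baseline's wrong answers to a question, what share land on the
-- single most common wrong value? Averaged per baseline across questions.
-- Table and column names are illustrative, not the harness's actual schema.
WITH wrong AS (
    SELECT baseline, question_id, model_answer
    FROM benchmark_results
    WHERE passed = FALSE
),
per_answer AS (
    SELECT baseline, question_id, model_answer, count(*) AS cnt
    FROM wrong
    GROUP BY baseline, question_id, model_answer
),
per_question AS (
    SELECT baseline, question_id, max(cnt) * 1.0 / sum(cnt) AS modal_share
    FROM per_answer
    GROUP BY baseline, question_id
)
SELECT baseline, avg(modal_share) AS failure_cohesion
FROM per_question
GROUP BY baseline
ORDER BY failure_cohesion DESC;
```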

That is the difference between AI you watch and AI you trust. The former is a slot machine — the wrong answers are random, and you can never tell which numbers in the deck are wrong. The latter is a defect process — the wrong answers cluster, you can investigate, and you fix the governance once, not 89 times.

What this means if you're shipping AI agents

If you are an AI transformation lead, BI lead, or CFO with three teams disagreeing about the same metric, the finding is direct: the bottleneck for AI accuracy on your business numbers is not which model you picked. It is whether the model has access to a governed envelope around the metrics it is being asked to compute. Frontier capability is not the missing ingredient — and on this benchmark, frontier reasoning made things measurably worse, not better.

This is also why the conventional response — "the AI is hallucinating, let's add more guardrails" — does not work. The model is not hallucinating. It is correctly applying the textbook variant of a metric your business does not use, because nothing in the context told it which variant your business uses. That is a context-layer problem. It is fixed by governance, not by guardrails on top of bad context.

For the full methodology, the per-metric breakdown, the limitations (we ship them honestly — three-model roster, a tier-column emitter bug, synthetic warehouse), and the reproducibility instructions: read the white paper at benchmark/publication/trust-benchmark-v1.md. Total cost to reproduce is about $60.

What we want next

Five empty conference chairs around a polished boardroom table, soft golden window light. Symbol of the five Q3 design-partner spots open for the Trust Benchmark program.

We are looking for 5 Design Partners to run this benchmark on their own warehouse. The deliverable is a confidential report comparing the partner's current AI integration — raw warehouse, semantic layer, or other — against ClariLayer governed context, with a metric-by-metric breakdown of where their AI currently fails and where governance closes the gap. The methodology, dataset, and harness are open-source. What we add is the governed Metric API plus the analyst lift to map your warehouse into it.

If you are deploying AI agents with warehouse access at a 200-plus-person company and you want to know what this matrix looks like for your business numbers, email kyle@clarilayer.com with subject line "Trust Benchmark — DP run". Five Q3 spots.

Written by

Kyle Hui

Founder, ClariLayer

Building the context layer for business metrics in the AI era.