Everyone in Contract AI Suddenly Has a “Benchmark.” Here’s How to Read One.
Four different things “benchmark” can mean in contracts AI — and the questions that separate useful evidence from a dressed-up press release.
A few weeks after I launched this newsletter — The Contract Signal — TermScout launched a “Contract Signals Report.”
I’ll take the overlap as validation. The framing works in part because contracts are becoming more than documents to store, negotiate, and search. Once you parse through the dense language, contracts start to look like signals: of risk, leverage, governance, market dynamics, opportunity, and operational behavior.
But the naming overlap is also a small symptom of a larger market pattern.
The volume of contract AI news keeps accelerating. Every week brings another product announcement, research release, workflow claim, or market report. And in that stream of news, one word keeps coming up: benchmark.
Crosby published a contract-redlining benchmark. Harvey has published benchmarks for contract understanding and legal agents. TermScout, LexisNexis, and others benchmark contract terms against market norms. CLM vendors benchmark cycle time and operational performance.
One word — benchmark — is taking on multiple meanings.
Sometimes “benchmark” means a test. Sometimes it means a dataset. Sometimes it means the results of a market survey, or a dashboard metric that summarizes those results so you can compare yourself against industry peers.
It has quietly become a word that sounds precise but is often used for marketing.
So when (or before) the next benchmark lands in your inbox, the first move is simple: ask what kind of benchmark you are actually looking at.
Benchmark of what? Against what? Scored by whom? Using whose data? And for whose benefit?
What the word actually means
Strip away the jargon and a benchmark is just a reference point. A fixed thing you measure against — a baseline, a starting line — so that any other number means something relative to that initial reference point.
You set the reference, then you compare: better than the benchmark, worse than the benchmark, by how much.
Contracts professionals already had a specific version of this long before AI showed up. In outsourcing and other long-term agreements, a benchmarking clause lets a customer bring in an independent third party to compare a supplier’s pricing and service levels against the wider market — and adjust the deal if the supplier has drifted off-market.
Note what made that mechanism trustworthy: an independent party, an agreed method, and a defined market to compare against.
In the AI era, the word has stretched to cover at least four genuinely different activities. They get announced with the same or similar vocabulary, which is exactly why they are hard to read.
The first evaluates models. The second evaluates contract positions (i.e. the values of given clauses or other data points). The third evaluates operational efficiency. The fourth evaluates AI or system improvement over time.
Separating them makes each new announcement much easier to read.
The four benchmarks hiding under one word
1. Performance benchmarks — “How good is the AI?”
This is the one generating the loudest headlines: take a task, have AI models do it, and score them against a human expert or a model answer.
One recent example is Crosby’s Redline Bench, which scored frontier models on realistic SaaS-contract redlining against attorney-authored “golden” responses. The reported results put the top model around 50% and the rest clustered below it — a narrow spread, with humans still better at finding new routes to resolution while models tended to anchor on their opening positions.
Conceptually, there is nothing wrong with this. Measuring model performance against domain experts in realistic conditions is a reasonable thing to do. To their credit, the better recent benchmarks in this category disclose a lot: the use cases, the scoring dimensions, the rubric, and in some cases the full benchmark, open for other labs to run.
That is valuable as it helps practitioners assess actual performance.
The trouble starts with how the numbers get used downstream.
“Model X scored 50.5%” gets repeated far from the methodology that produced it. Without that context, the number is almost impossible to interpret.
It may be directionally interesting. It may tell you something about frontier-model behavior in a simulated negotiation. It may help compare models under one defined test design.
But it does not automatically tell you which product to buy, which workflow to automate, or whether the same system will work inside your contract population.
That distinction matters.
A multi-turn negotiation benchmark is not the same thing as a data extraction benchmark. A redlining benchmark is not the same thing as a playbook-compliance benchmark. A test of SaaS agreement negotiation is not the same thing as a test of high-volume vendor onboarding, procurement triage, or post-signature obligation extraction.
And in many contract AI workflows, the advanced task depends on the foundational layer being right first.
Before a system can negotiate from a contract, it often has to identify the agreement, extract the relevant provisions, understand the parties, map the clause to the right playbook position, and preserve the context across the workflow. If those earlier steps are noisy, the later benchmarked task may inherit the errors. Sometimes the problem is not only model performance. It is error propagation: small mistakes at successive stages compound until the end result is a far cry from the benchmarked performance.
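A rough way to see the compounding, as a minimal sketch: if each stage is right with some probability and the stages fail independently, the chance that a contract gets through all of them cleanly is the product of those probabilities. The stage names follow the steps listed above; the accuracy figures are invented for illustration, and independence is an assumption that real pipelines may not satisfy.

```python
from math import prod

# Hypothetical per-stage accuracies, invented for illustration only.
STAGE_ACCURACY = {
    "identify_agreement": 0.98,
    "extract_provisions": 0.95,
    "understand_parties": 0.97,
    "map_to_playbook": 0.93,
    "preserve_context": 0.96,
}


def end_to_end_accuracy(stage_accuracy: dict[str, float]) -> float:
    """Probability that every stage is right, assuming independent failures."""
    return prod(stage_accuracy.values())


print(f"End-to-end: {end_to_end_accuracy(STAGE_ACCURACY):.3f}")
```

With these made-up figures, five stages that each look strong in isolation leave roughly four in five contracts error-free before the benchmarked task even starts.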
That is why the headline score is not enough.
Did the model miss obvious issues? Did it identify the right concern but propose a weaker redline? Did it produce something commercially reasonable but different from the attorney-authored answer? Was it penalized for not matching the exact path the human took? Would two senior attorneys have agreed on the same “perfect” answer?
A performance benchmark can be useful. But the headline score is not the benchmark. The benchmark is a combination of the task, the dataset, the scoring method, and the judgment calls underneath the score.
2. Norm benchmarks — “Is my contract typical?”
A completely different activity wears the same label.
Here, you assemble a large body of contract data, derive what is standard for a given type of deal, and then compare a new agreement against that norm or standard.
LexisNexis’s Market Standards, TermScout’s reports, and various “state of contracts” reports live in this category. The output is not a model score. It is a reference set.
For example: for vendor agreements of this type, in this industry, at this deal size, mutual indemnification is standard; one-way indemnification is aggressive; this limitation of liability formulation is common; that AI-use restriction is becoming more frequent.
This is genuinely useful and very doable.
Most of it reduces to structured data. A clause like indemnification can be reduced to a small set of positions — both parties indemnified, one party indemnified, neither party indemnified — and once you have coded enough contracts and broken down each clause type sufficiently, you can say how common each position is. Control for industry, deal size, template source, whether the agreement is customer paper or vendor paper, and other criteria and you have a real reference point for negotiation and risk.
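To make "reduces to structured data" concrete, here is a small sketch of the counting step. The records, the industry labels, and the position codes are all hypothetical; a real norm benchmark would have far more contracts and more control variables than the single one shown here.

```python
from collections import Counter, defaultdict

# Hypothetical coded contracts: (industry, indemnification position).
CODED_CONTRACTS = [
    ("software", "mutual"),
    ("software", "mutual"),
    ("software", "one_way"),
    ("software", "none"),
    ("logistics", "one_way"),
    ("logistics", "one_way"),
    ("logistics", "mutual"),
]


def position_shares(rows):
    """Return {industry: {position: share of that industry's contracts}}."""
    counts = defaultdict(Counter)
    for industry, position in rows:
        counts[industry][position] += 1
    return {
        industry: {pos: n / sum(c.values()) for pos, n in c.items()}
        for industry, c in counts.items()
    }


for industry, shares in position_shares(CODED_CONTRACTS).items():
    print(industry, shares)
```

The point of segmenting is visible even in this toy data: the "standard" position differs by industry, so a single pooled percentage would mislead both groups.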
But norm benchmarks have a problem that rarely makes the announcement: representativeness.
Much of this analysis is built on public filings, and public contracts are a biased sample by construction. A contract gets filed with the SEC because it crosses a materiality threshold. That means the corpus skews toward large, heavily negotiated agreements that may look very different from the thousands of ordinary commercial contracts a business actually runs on.
Even the filed agreements are frequently redacted at the commercially sensitive points you would most want to benchmark.
I used SEC contracts myself in last week’s article on creating a contract data extraction benchmark, and the same caveat applied there: they are a convenient, legitimate starting point. They are not a stand-in for your private contract population.
A norm built on a skewed sample is a useful hint, not ground truth.
3. Process benchmarks — “How fast is my operation?”
A third meaning is not directly about model quality or contract terms at all.
It is operational: cycle time, turnaround, throughput, percentage of contracts on standard templates, time-to-signature, number of negotiation rounds, fallback frequency, legal touch rate, approval bottlenecks.
CLM vendors increasingly sell “benchmarking” dashboards that compare your contracting operation against peers.
This is often the management-consulting meaning of benchmark, applied to legal ops. It may also be the most actionable version for many teams.
If your average NDA takes 12 days and comparable teams complete theirs in three, that is useful information to have. By the same token, it is useful to know that 80% of your low-risk agreements still require legal review while similar companies route most of theirs through self-service, or that your contracting cycle time spikes in one region, business unit, or contract type.
But it is worth separating this cleanly from the other meanings.
A vendor saying it can “benchmark your contracts” might mean it can compare clause language to market norms. It might mean it can compare your contracting process to peer operations. It might mean it can measure model performance. Those are different use cases with different products, solutions, and datasets.
Process benchmarks are valuable when the metric maps to a real operational decision.
They are less valuable when the metric becomes theater: a dashboard showing you are slower than “peers” without explaining who the peers are, what work was included, or whether the comparison reflects your risk profile, company stage, industry, deal size, or contracting model.
4. Baseline benchmarks — “Is my system getting better?”
The fourth meaning is the one practitioners building AI systems use most, and the one that almost never makes the press release.
Here, a benchmark is an internal baseline.
You fix a starting level of performance — an out-of-the-box model, a first prompt, a v1 pipeline, an existing review process — and then measure every later change against it.
Did the new prompt help? Did the more expensive model actually earn its cost? Did retrieval improve accuracy or just add latency? Did fine-tuning move the metric or just the vibes? Did the new clause taxonomy improve consistency? Did the latest model version break something that used to work?
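In its simplest form, that comparison might look like the sketch below: one fixed evaluation set, a baseline run, a candidate run, and a check for anything the change broke. The document IDs, labels, and outputs are hypothetical, and exact-match accuracy stands in for whatever metric a team actually uses.

```python
# Hypothetical gold labels and system outputs on one fixed evaluation set.
GOLD = {"doc1": "mutual", "doc2": "one_way", "doc3": "none", "doc4": "mutual"}
BASELINE = {"doc1": "mutual", "doc2": "mutual", "doc3": "none", "doc4": "one_way"}
CANDIDATE = {"doc1": "mutual", "doc2": "one_way", "doc3": "mutual", "doc4": "mutual"}


def accuracy(predictions, gold):
    """Share of documents where the prediction matches the gold label."""
    return sum(predictions[d] == label for d, label in gold.items()) / len(gold)


def regressions(baseline, candidate, gold):
    """Documents the baseline got right that the candidate now gets wrong."""
    return sorted(
        d for d, label in gold.items()
        if baseline[d] == label and candidate[d] != label
    )


print("baseline:", accuracy(BASELINE, GOLD))
print("candidate:", accuracy(CANDIDATE, GOLD))
print("regressions:", regressions(BASELINE, CANDIDATE, GOLD))
```

Here the candidate scores higher overall but still breaks one document the baseline handled, which is exactly the kind of detail an aggregate score hides.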
This is how serious AI and data teams actually work.
And it carries two critical ideas that the headline benchmarks tend to gloss over, even though both heavily affect how a benchmark should be assessed and whether it is actionable.
The first is generalization. A model trained or tuned on one contract population can fall apart on a new one because the drafting, file quality, document structures, and conventions differ.
The second is data drift. A system that scored well last quarter can quietly degrade as the documents flowing through it change.
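One simple drift signal, sketched under assumptions: compare the mix of document types between two periods using total variation distance (0 means an identical mix, 1 means no overlap). The type labels and counts are hypothetical, and a shift in mix is only a prompt to re-measure performance, not proof that it degraded.

```python
from collections import Counter

# Hypothetical document-type mix flowing through the system in two quarters.
LAST_QUARTER = ["nda"] * 60 + ["msa"] * 30 + ["dpa"] * 10
THIS_QUARTER = ["nda"] * 30 + ["msa"] * 30 + ["dpa"] * 40


def mix(docs):
    """Share of each document type in a list of type labels."""
    counts = Counter(docs)
    return {doc_type: n / len(docs) for doc_type, n in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences between two share dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


drift = total_variation(mix(LAST_QUARTER), mix(THIS_QUARTER))
print(f"Mix shift: {drift:.2f}")
```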
A single published score is a photograph: one snapshot in time, taken under a defined set of assumptions.
Truly effective benchmarking is more like a video: multiple points in time, multiple snapshots, multiple angles, and enough continuity to tell whether the system is improving, degrading or staying flat.
And the video is what tells you whether something works in production.
Four meanings. Four different questions. One word.
The moment you hear “benchmark,” your first move should be to figure out which one you are actually being sold.
How to read any benchmark
Sorting the type gets you halfway there. The rest is actually performing the analysis.
There are three layers that rarely get discussed and almost always decide whether a number means anything: the use cases, the scoring and the incentives.
The use cases
First, what was actually tested?
“Contract review” can mean redlining a standard NDA against a clear playbook. It can also mean a no-playbook judgment call on a bespoke, heavily negotiated agreement with limited commercial context.
Those are wildly different tasks.
Then, what specific scenarios sat behind those use cases?
Before you trust a performance number, you want to know the distribution of difficulty behind it. Was the benchmark mostly easy cases? Edge cases? Common issues? Rare but high-risk provisions? First-pass issue spotting? Full redlining? Negotiation strategy? Escalation judgment?
That is the difference between “the AI is good” and “the AI is good at the easy version.”
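A breakdown by difficulty makes that difference visible. The sketch below uses invented results and a two-tier "easy" versus "hard" split, which is my simplification; a published benchmark may slice difficulty differently or not at all.

```python
from collections import defaultdict

# Hypothetical results: (difficulty tier, whether the output was judged correct).
RESULTS = (
    [("easy", True)] * 45 + [("easy", False)] * 5
    + [("hard", True)] * 5 + [("hard", False)] * 15
)


def accuracy_by_tier(results):
    """Return {tier: share of outputs judged correct within that tier}."""
    tally = defaultdict(lambda: [0, 0])  # tier -> [correct, total]
    for tier, correct in results:
        tally[tier][0] += int(correct)
        tally[tier][1] += 1
    return {tier: c / n for tier, (c, n) in tally.items()}


def overall_accuracy(results):
    """Headline score across all tiers combined."""
    return sum(correct for _, correct in results) / len(results)


print(f"overall: {overall_accuracy(RESULTS):.2f}")
print(accuracy_by_tier(RESULTS))
```

In this toy data the headline lands above 70% while the hard tier sits at 25%, because easy cases dominate the test set.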
The scoring
How was “right” decided?
This is where legal benchmarks are genuinely hard, because you are scoring dense, qualitative prose, not simple multiple-choice.
Did they grade it as a simple classification problem — risk flagged correctly or not, one point each over the total possible? Or did they use a rubric that accounts for subjectivity, ambiguity, commercial reasonableness, drafting quality, and the context the model was given?
And what does 100% mean?
Does it mean the model matched the attorney answer exactly? If so, would two senior attorneys have agreed 100% of the time?
Often they would not.
Law is full of gray areas. That is why it needs lawyers. A scoring method that pretends the gray areas are black and white will produce a confident number that does not survive contact with reality.
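One way to put a number on "would two senior attorneys agree" is Cohen's kappa, a standard statistic that measures agreement between two raters beyond what chance alone would produce. This is a sketch of that general technique, not the method of any benchmark named here, and the pass/fail grades are hypothetical.

```python
from collections import Counter

# Hypothetical pass/fail grades from two attorneys on the same ten model outputs.
ATTORNEY_A = ["pass", "pass", "fail", "pass", "fail",
              "pass", "pass", "fail", "pass", "fail"]
ATTORNEY_B = ["pass", "fail", "fail", "pass", "pass",
              "pass", "pass", "fail", "fail", "fail"]


def cohens_kappa(a, b):
    """Agreement between two raters, corrected for chance agreement.

    Undefined (division by zero) if both raters always use one same label.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


print(f"kappa: {cohens_kappa(ATTORNEY_A, ATTORNEY_B):.2f}")
```

In this invented example the attorneys agree on 7 of 10 outputs, but half of that agreement is what chance would predict, so kappa comes out at 0.4. If the human graders agree only that well, a model score measured against one of them inherits the same uncertainty.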
In addition, watch for false precision.
A score reported to a tenth of a percent — 50.5, 45.1, 44.4 — signals rigor. But the decimal is only as meaningful as the rubric underneath it.
When the rubric is scoring subjective legal judgment, three-significant-figure precision is often decoration. The confident number is the easiest part to produce and the least informative.
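A back-of-the-envelope check on that decimal, under loud assumptions: treat the score as a pass rate over independent test items and use the textbook normal approximation for a 95% margin of error. The item count of 200 is hypothetical, not taken from any published benchmark, and rubric-based scores are not strictly pass/fail, so read this as a rough lower bound on the uncertainty rather than a real interval.

```python
from math import sqrt


def margin_of_error(score: float, n_items: int, z: float = 1.96) -> float:
    """Approximate 95% margin for a pass rate over n independent items."""
    return z * sqrt(score * (1 - score) / n_items)


N_ITEMS = 200  # hypothetical test-set size, for illustration only

for score in (0.505, 0.451, 0.444):
    moe = margin_of_error(score, N_ITEMS)
    print(f"{score:.1%} +/- {moe:.1%}")
```

Under that assumed size the margin is roughly seven points either way, wide enough that the three scores overlap, and that is before counting any disagreement over the rubric itself.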
The incentives
Who ran the benchmark or experiment, and what do they sell?
This is the layer that matters most and gets discussed least.
Charlie Munger is often quoted as saying: “Show me the incentive and I will show you the outcome.”
A private sector benchmark is almost never published for philanthropy.
A law firm has an interest in a benchmark that shows legal judgment remains hard to automate. A vendor has an interest in a benchmark that makes its workflow look measurable and differentiated. A model lab has an interest in a benchmark that defines progress around the capabilities its model performs well on.
None of that means the benchmark is dishonest.
It means the benchmark has a point of view.
Wherever judgment enters — which tasks to include, how to score gray areas, what to call 100%, which dataset to use, which comparison group to define — the benchmark will reflect the author’s theory of the problem.
Even a scrupulously transparent benchmark is shaped by what its author chose to measure in the first place.
Transparency lets you check the math. It does not remove the incentive that selected the problem.
How to spot a benchmark worth trusting
None of this is a reason to dismiss benchmarks.
They are one of the better things to happen to a field that ran on vibes and demos for years. But they need to be read like evidence, not like slogans.
To summarize, a benchmark you can actually lean on tends to disclose:
Scope — the tasks tested, how hard they were, and whether the headline average hides meaningful variation.
Method — how “correct” was scored, who judged, and what a perfect score would mean.
Data provenance — where the corpus came from and how representative it is of the contracts you care about.
Authorship and incentive — who built it, who funded it, and what they sell.
Reproducibility — whether anyone else can run it and get the same answer.
Business relevance — whether the metric maps to a decision you actually need to make or problem you need to solve.
Use that as a checklist.
A benchmark that discloses its scope, method, and data but is published by a party with an obvious stake in how it is used is not worthless. It just needs to be read with the stake in view.
A benchmark with a strong headline but no corpus, no rubric, no task design, no scoring explanation, and no discussion of limitations is not reliable. It is a claim dressed up as a benchmark, often for marketing purposes.
The through-line
Last week I made the case that contract AI looks magical on one document and gets hard across a population — that the demo and the workflow are different problems.
A benchmark is the same story one level up.
The score is the demo. The methodology is the workflow.
A number that travels without its context is a promise without the receipts.
That is why independent reading of this market matters.
Independent does not mean neutral or detached. I have a point of view, shaped by a decade in enterprise contracts, taxonomy design, structured data and extraction workflows, AI, and legaltech product work.
It means something narrower and more important: I do not have a platform, model, law firm service, or benchmark to sell inside the analysis.
When the company publishing the benchmark also benefits from the way the benchmark is framed, the right response is not cynicism. It is professional skepticism.
What was tested? Against what? By whom? On whose data? And who benefits from the answer?
More signals soon.
If you found this useful, subscribe to follow along, forward it to someone who is about to make a tooling decision off a benchmark headline, and tell me which contract AI terms or claims you want me to unpack next.


