Not All Contract AI Is Built for Extraction Across Portfolios
How to choose the right contract AI tool
Drafting, reviewing, marking up, and summarizing an individual contract (or a small number of them) is a very different set of use cases from extracting structured data from a large volume of contracts.
In today’s market, many tools are presented as being able to do all of these things equally well.
They cannot.
That is why the starting point for choosing a tool should not be the tool itself, but the use case. Focusing on the use case will help you discern the signal from the marketing/sales noise.
For Portfolio Data Extraction: Focus on Three Key Criteria
If the use case is extracting structured data across a large contract portfolio, the outcome and evaluation criteria become much clearer.
The outcome you are seeking is an AI tool that can produce reliable, structured data for a large number of agreements without excessive manual work.
For this, three criteria matter most: speed/ease of getting data, accuracy, and scalability.
1. Speed/Ease of Getting Data
Some AI tools require users to write prompts, refine prompts, label provisions, or train models to extract data. These tools are often described as “co-pilots,” where the AI assists a human who is still doing a meaningful portion of the work.
If extracting data across an entire contract population depends heavily on human input, that is a significant burden and slows things down. If every field requires human instruction, prompting, review, and iteration, the process does not scale and the cost savings from automation start to disappear.
2. Accuracy
Accuracy is not just a technical metric in this context — it determines whether the data can actually be used.
For contract data extraction, accuracy is closely tied to confidence and traceability. Users need to know not just what was extracted, but how confident the system is in that data and where the data came from. Without that, it is difficult to rely on the extracted data for decision-making.
When extracted data is unreliable and there is no indication of confidence, all of it is suspect and must be checked by people, which is slow, expensive, and defeats the purpose of trying to automate extraction in the first place.
3. Scalability
Scalability is what separates tools that work well on a handful of contracts from tools that work across an entire enterprise portfolio. The extraction process must be fast enough and cost-effective enough to run across that entire population, not just a small sample.
Approaches that may work well on individual documents can become slow or expensive when applied at scale. When evaluating tools, it is important to understand what happens when you attempt to load 50,000, 100,000 or 800,000 documents into the system.
Why Many AI Tools Struggle with Extraction
In the past few years, many vendors have added AI features to analyze contracts and extract information. These “contract AI” tools tend to fall into one of a few categories:

- General-purpose Large Language Model (LLM) based legal AI tools
- Pre-execution contract review and redlining tools
- Contract lifecycle management (CLM) systems
- Stand-alone contract analysis tools
Tools in the first three categories are completely dependent on LLMs for extraction, often combined with vector-based Retrieval-Augmented Generation (RAG). This is also true of newer stand-alone contract analysis tools that have come to market. While these tools can play an important role in the contract lifecycle, they were not developed for the extraction use case.
LLMs are designed to generate language on a probabilistic basis. They are very good at answering questions and producing readable outputs. But portfolio-level data extraction is not primarily a language generation problem. It is a data extraction and normalization problem where consistency, repeatability, and traceability matter.
When using tools that are completely dependent on LLMs for extraction, the outputs can look very convincing, but it is not clear which results are correct and which are not. As noted above, this forces people to check the data, which is slow, expensive, and defeats the purpose of trying to automate extraction in the first place.
In addition, running LLMs across a portfolio of long, complex agreements is inefficient and expensive, particularly if each extraction requires multiple model calls or large context windows.
Vector-based RAG has been proffered as a way to solve these problems: it helps the LLM find the most relevant parts of a document before generating an answer, which improves performance for some use cases. It does not, by itself, solve the problem of consistent, field-level extraction at scale.
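To make the mechanics concrete, here is a minimal Python sketch of the vector-RAG pattern. The `embed` function is a toy word-hashing stand-in for a real embedding model, and the assembled prompt would be sent to an LLM; both are illustrative assumptions, not any particular vendor’s implementation:

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for an embedding model: hash words into a fixed-size
    vector. A real system would call a trained embedding model."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % 64] += 1.0
    return vec

def build_rag_prompt(question: str, chunks: list[str], top_k: int = 3) -> str:
    """Retrieve the document chunks most similar to the question, then
    assemble the prompt an LLM would answer from."""
    q = embed(question)

    def score(chunk: str) -> float:
        c = embed(chunk)
        denom = np.linalg.norm(q) * np.linalg.norm(c) or 1.0
        return float(q @ c) / denom  # cosine similarity

    context = "\n---\n".join(sorted(chunks, key=score, reverse=True)[:top_k])
    return f"Answer using only this contract text:\n{context}\n\nQuestion: {question}"
```

Note that this pattern retrieves and answers one question at a time, and the answer is still free-form LLM text. That is why RAG helps the model find relevant passages but does not, on its own, yield consistent, normalized field values across a portfolio.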
In contrast, tools that are designed specifically for structured extraction at scale tend to use a combination of methods (analytical models, rule-based systems, and validation mechanisms alongside LLMs) to ensure the data extracted is accurate enough to be used in real business decisions.
What a Fit-for-Purpose Extraction System Looks Like
To avoid burdening its users, a fit-for-purpose system should be closer to an “autopilot” than a “co-pilot.” Documents should be loaded into the system, and the system should extract data automatically using a predefined contract data model that reflects legal and commercial subject matter expertise. Users should be able to review and validate results, but they should not have to teach the system how to extract the data in the first place.
When it comes to accuracy, it is most effective to use multiple methods to extract the same data and then compare the results. For example, an analytical or deterministic model can extract data using structured methods, while an LLM analyzes the same text independently. Where the two methods agree, the result can be treated as high confidence. Where they disagree, those items can be flagged for human review, improving the overall accuracy and focusing human effort only where it is actually needed.
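In code, that cross-validation step can be as simple as comparing the two methods’ outputs. The sketch below assumes a single field (governing law), a one-regex “analytical” extractor, and a caller-supplied `llm_extract` function wrapping an LLM call; real systems use far richer models on both sides:

```python
import re

def rule_based_governing_law(text: str) -> str | None:
    """Deterministic extractor: one common governing-law pattern.
    A production analytical model would be far more sophisticated."""
    m = re.search(
        r"governed by the laws? of (?:the State of )?([A-Z][A-Za-z ]+?)[,.]", text
    )
    return m.group(1).strip() if m else None

def cross_validate(text: str, llm_extract) -> dict:
    """Extract the same field two independent ways and compare."""
    rule_result = rule_based_governing_law(text)
    llm_result = llm_extract(text)  # independent LLM reading of the same text
    agree = (
        rule_result is not None
        and llm_result is not None
        and rule_result.lower() == llm_result.strip().lower()
    )
    return {
        "value": rule_result if agree else None,
        "confidence": "high" if agree else "low",
        "needs_human_review": not agree,  # only disagreements reach a person
        "candidates": {"rule": rule_result, "llm": llm_result},
    }
```

With a pattern like this, reviewers only see the items where `needs_human_review` is true, which is what concentrates human effort on genuine exceptions rather than the whole dataset.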
Each extracted data point should also be linked back to its location in the source document, so users can quickly verify if needed. Confidence scores and clear document references are what make extracted data usable.
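A record carrying this traceability might look something like the following sketch; the field names and score scale are hypothetical, but the shape (value plus confidence plus a pointer into the source text) is the point:

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    """One extracted data point plus the metadata that makes it verifiable."""
    document_id: str   # which contract it came from
    field_name: str    # e.g. "governing_law"
    value: str
    confidence: float  # 0.0-1.0, e.g. derived from method agreement
    page: int          # page in the source document
    char_start: int    # character span in the source text, so a UI can
    char_end: int      # highlight the exact clause when a user clicks

example = ExtractedField(
    document_id="msa-2021-0042",
    field_name="governing_law",
    value="New York",
    confidence=0.97,
    page=14,
    char_start=48210,
    char_end=48268,
)
```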
Finally, the system has to be able to run across an entire contract population quickly and cost-effectively. That requires an architecture that is not dependent on running LLMs across entire documents for every extraction task. Scalable systems tend to use LLMs selectively, where they add the most value, and rely on other analytical methods for the bulk of the extraction work. This reduces cost, avoids context window limitations, and allows the system to process large volumes of contracts in a reasonable timeframe.
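One way to picture that selective architecture is a per-field router: cheap analytical extractors handle everything they can, and only unresolved fields escalate to an LLM, which then sees a short snippet rather than the full agreement. The keyword locator and extractor registry below are simplified assumptions for illustration:

```python
# Cheap keyword locators narrow the text an LLM ever needs to see.
FIELD_KEYWORDS = {"governing_law": "governed by", "auto_renewal": "automatically renew"}

def candidate_snippet(field: str, text: str, window: int = 400) -> str:
    """Return a small window around the likely clause, not the whole document."""
    idx = text.lower().find(FIELD_KEYWORDS.get(field, ""))
    start = max(0, idx - window)
    return text[start: idx + window] if idx >= 0 else text[: 2 * window]

def extract_field(field: str, text: str, analytical_extractors: dict, llm_extract):
    """Route each field to the cheapest method that can resolve it."""
    extractor = analytical_extractors.get(field)
    if extractor is not None:
        value = extractor(text)
        if value is not None:
            return value, "analytical"  # resolved with no LLM call at all
    # Escalate only this field, with a targeted snippet, keeping token
    # cost low and avoiding context-window limits on long agreements.
    return llm_extract(field, candidate_snippet(field, text)), "llm"
```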
When these design choices are combined — speed with low user burden, high accuracy with confidence scoring, and an architecture designed for scale — the result is a fit-for-purpose system that can analyze thousands of agreements.
This is the problem Catylex was designed to solve: extracting structured, decision-grade data from large contract portfolios in a way that is accurate, scalable, and does not create additional work for already busy legal and contract teams.
A Simple Way to Evaluate Contract AI Tools
If you are evaluating AI tools for contract data extraction, it is easy to get pulled into product demos and feature comparisons. Posing a few practical questions up front can cut through the noise:
1. How much work do users have to do to get usable data? How long does it take to get access to the data? Do users have to write prompts, label clauses, train models, or continually refine instructions? Or can documents be loaded and processed automatically, with users focusing only on reviewing exceptions and edge cases?
2. How do you measure and show accuracy? Is accuracy measured in a rigorous way, and can the vendor explain how it is calculated? More importantly, does the system show confidence levels or otherwise indicate which results are likely to need human review?
3. Can you see exactly where each data point came from in the contract? Extracted data is only useful if it is traceable. Users should be able to click on a data field and immediately see the source text in the document.
4. What happens when you run this across 10,000, 50,000 or 800,000 contracts? Many tools work well on a small set of documents. The real test is speed, cost, and workflow when applied to an entire contract population.
5. What does the human review process look like? No system is perfect. The important question is whether human effort is focused on a small number of exceptions, or whether teams end up re-checking large portions of the data.
These questions quickly reveal the difference between tools that are designed for reviewing individual contracts and tools that are designed to extract data across an entire portfolio.
AI can make contract data far more accessible and useful than it has been historically. But the outcomes depend heavily on choosing tools that are designed for the specific problem you are trying to solve. For large-scale contract data extraction, speed, accuracy, and scalability are still the criteria that matter most.