Data-Centric AI is Exciting News for Legal
Data-centric AI is good news for legal teams, especially those of you who've spent frustrating hours trying to get value from underperforming AI tools. So, what is data-centric AI, and what does it mean for law departments and legal operations?
In simple terms, data-centric AI is a new approach to making AI work when you don't have truckloads of training data. The term was coined by AI guru Andrew Ng, who founded Google Brain, amongst other things. He says it may be the most important AI development since deep learning: it will take AI from big-data consumer-web applications to smaller-scale, industry-specific applications, where data quality is more important than data quantity.
Legal is exactly the sort of industry that is starting to benefit from data-centric AI. Take contract AI as an example: there simply are no very large data sets of well-labeled contracts. Many players have created smallish data sets and AI models, and most of them suffer from inconsistent labeling. It won't matter how many data scientists and code tweaks you throw at the problem. Unless you tackle data quality, your results will disappoint.
Some contract AI players responded to this problem by allowing (in practice, requiring) customers to create their own training data. This may seem like a good idea, and indeed may yield limited results, with machines learning to identify clauses very similar to the examples on which they are trained. But it just kicks the can of data quality down the road. Unless the customer spends a very large amount of time and money to create a consistent, varied data set, performance will never be great.
To illustrate the importance (and difficulty) of consistent labeling, consider the following example:
I want to label examples of termination rights, so I gather clause language where someone has the right to terminate, and I label the events that trigger the right:
- Supplier may terminate for breach;
- Vendor shall have the right to terminate for material default;
- Either party has the right to terminate if the other fails to perform;
- Consultant is entitled to terminate for non-payment;
- Supplier may only terminate for non-payment;
- And so on...
Some of the labels for these examples are easy. They would all get the "termination right" label, for example. But beyond that, things quickly get tricky, and without serious discipline, inconsistency is likely to creep in:

- Do I label everything "for cause", in the sense that it is not "for convenience"?
- Do I have a separate label "for breach"?
- Do I label "non-payment" also as "breach" because non-payment is a form of breach, or do I omit that label where a specific type of breach (non-payment) is already labeled?
- Does everyone doing the labeling know that the last example needs special treatment due to the word "only"? You would not want that example labeled as "breach", because the right applies only to non-payment (and yet non-payment is a form of breach).
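One way to keep decisions like these consistent is to write the agreed labeling policy down as executable rules rather than leaving it to each annotator's judgment. The sketch below is purely illustrative (it is not Catylex's actual pipeline, and real clause language is far too varied for simple keyword matching); it just shows how one plausible policy for the termination examples above, including the "only" special case, can be made explicit and repeatable:

```python
def label_clause(text: str) -> set[str]:
    """Apply an explicit, shared labeling policy to a termination clause.

    Policy (hypothetical, agreed up front so every labeler is consistent):
      - every example gets "termination right"
      - breach / default / failure to perform      -> "for breach"
      - non-payment                                -> "non-payment" AND
        "for breach" (since non-payment is a form of breach)
      - EXCEPT when the right is limited by "only": then "non-payment"
        but NOT "for breach", because the right does not extend to
        other kinds of breach
    """
    t = text.lower()
    labels = {"termination right"}
    breach_terms = ("breach", "default", "fails to perform")

    if "non-payment" in t:
        labels.add("non-payment")
        if "only" not in t:          # "may only terminate for non-payment"
            labels.add("for breach")  # non-payment implies breach otherwise
    elif any(term in t for term in breach_terms):
        labels.add("for breach")

    return labels
```

The point is not the (deliberately naive) keyword matching but that the edge cases live in one reviewable place, so two annotators can't silently resolve the "only" question differently.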
Catylex, it turns out, has been part of the data-centric AI movement from the beginning. We just didn't have a label for it until recently (pun intended). Our approach is to apply the deep knowledge of legal domain experts to create high quality training data with consistent labeling throughout. We've done the hard work of figuring out the thousands of ways lawyers and business people can express themselves contractually. We've invested in the tools and processes to ensure these thousands of concepts are consistently identified and labeled. And the good news is, it works.
Training contract AI is time consuming and subject matter dependent, and there are many more ways to get it wrong than to get it right. Don't let anyone kick that can into your lap. Talk to Catylex instead.