

Demystifying the Alphabet Soup of Legal AI

By Catherine Casey posted 04-15-2021 11:39


What is legal AI?

There is quite a bit of confusion surrounding legal AI and, frankly, AI in general. For the purposes of this article, let’s use the original definition proffered in 1956 by the man who coined the term AI, John McCarthy: "the science and engineering of making intelligent machines." A more elaborate definition characterizes AI as “a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation.”[1]

Understanding the machine learning at play in legal AI

In the practice of law, and in ediscovery in particular, the deployment of AI has generally fallen into the category of machine learning. As the name implies, machine learning is a subset of AI characterized by the use of algorithms and statistical methods to enable machines to improve or learn through experience. Legal AI has focused on supervised and semi-supervised types of machine learning, with more recent iterations delving into reinforcement models.

Broadly speaking, most legal AI use cases today rely on algorithms trained with human input to identify similar or dissimilar categories of documents to reduce the amount of time it takes to surface key concepts or evidence. Understanding the capabilities and limitations of your specific legal AI is important in framing expectations and workflows to maximize its effectiveness.

Supervised machine learning

The first type of AI to hit the legal (ediscovery) stage was a version of supervised machine learning that was trademarked with the unfortunate name “Predictive Coding™”. (If you have somehow escaped Monique da Silva Moore, et al., v. Publicis Groupe SA & MSL Group (S.D.N.Y. Feb. 24, 2012), it is worth the read, if only to compare it to the more advanced approaches we have today.)

Now commonly categorized as technology assisted review (TAR) 1.0, this version of machine learning entailed a small group of experts coding a subset of data to “train” an algorithm about what to select as relevant, nonrelevant, or related to certain issues. This was repeated for several “rounds” until the suggestions made by the algorithm met a certain statistical threshold for precision (the share of documents the algorithm flags as relevant that actually are relevant) and recall (the share of all truly relevant documents that the algorithm finds). The algorithm would then run against the full data universe and the results would be pushed out to a larger group of reviewers who could confirm or reject the coding suggestions. Ultimately, allowing the algorithm to auto-code at certain levels of precision and recall was statistically validated.
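To make the two validation measures concrete, here is a minimal sketch in Python of how precision and recall could be computed on a validation sample. The function name, the ten-document sample, and its coding decisions are all hypothetical and chosen only for illustration; real TAR platforms use their own sampling and validation protocols.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two parallel lists of booleans.

    predicted[i] is the algorithm's relevance call for document i;
    actual[i] is the expert reviewer's call for the same document.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)       # flagged and relevant
    false_pos = sum(1 for p, a in pairs if p and not a)  # flagged, not relevant
    false_neg = sum(1 for p, a in pairs if not p and a)  # relevant but missed

    flagged = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical validation sample of ten documents.
predicted = [True, True, True, True, False, False, False, False, True, False]
actual = [True, True, True, False, True, False, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up sample the algorithm flags five documents, three of which the expert also coded relevant, and it misses one relevant document. A training round would stop only once both figures clear whatever thresholds the parties have agreed on.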

Semi-supervised machine learning

As comfort with and adoption of TAR increased, there was a move closer to what is known as semi-supervised learning, although arguably no one fully took that leap. This process begins in the same manner as supervised machine learning, with iterative rounds of training on a seed set until an algorithm stabilizes and demonstrates minimum requisite levels of precision and recall. But rather than pushing this data out to review teams to accept or reject the computer-generated coding predictions, in this model the coding decisions made by the system would receive no human validation. For obvious reasons, attorneys, who still incorrectly believe that eyes on every document is the gold standard, did not immediately embrace this model.

Unsupervised machine learning

Unsupervised machine learning takes the inverse approach to the prior two categories. Rather than starting with human input, this model starts with algorithmic autocategorization and then pushes the resulting data sets to humans for further analysis or review. In ediscovery generally, unsupervised learning algorithms are used for concept clustering, near-duplicate detection, and concept search.
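As a rough illustration of grouping without any human coding, the sketch below groups near-duplicate documents by comparing overlapping three-word sequences ("shingles") with Jaccard similarity. This is one common textbook technique, not a description of how any particular ediscovery product works; the function names, the 0.5 threshold, and the sample documents are assumptions made for the example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_groups(docs, threshold=0.5):
    """Greedy grouping: a document joins the first group whose
    representative (first member) is at least `threshold` similar."""
    groups = []  # list of (representative shingles, member ids)
    for doc_id, text in docs.items():
        doc_shingles = shingles(text)
        for representative, members in groups:
            if jaccard(doc_shingles, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((doc_shingles, [doc_id]))
    return [members for _, members in groups]


# Hypothetical documents; no one has coded any of them.
docs = {
    "A": "please review the attached merger agreement before friday",
    "B": "please review the attached merger agreement before monday",
    "C": "lunch menu for the office party next week",
}
groups = near_duplicate_groups(docs)
print(groups)
```

Documents A and B differ by a single word and land in the same group, while C stands alone. The algorithm does the sorting first, and a reviewer then looks at the resulting groups.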

At times, both unsupervised machine learning and supervised machine learning are used on the same data set but for different purposes. Investigators in particular rely on various unsupervised machine learning tools to quickly understand what key concepts are contained in a data set or to facilitate preliminary investigation, while TAR 1.0 or TAR 2.0 models (discussed in more detail below) are leveraged to prioritize review and accelerate comprehensive analysis of all data in a given data set.

Reinforcement machine learning

The most recent addition to the ediscovery lexicon, reinforcement learning, is often described as TAR 2.0. In this machine learning model, rather than having a discrete window of time to train the model, human input continually refines and informs the algorithm.

Because the model keeps learning and making better predictions about likely responsive material, it allows the review team to greatly accelerate their review speeds and surfaces likely relevant material in a fraction of the time it takes for linear review or TAR 1.0 models. Additionally, the challenges of new data or refinement of scope are much better addressed in this approach because inputs throughout the lifecycle of the case refine what coding suggestions the algorithm presents to the review team.
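The continuous feedback loop described above can be sketched in a few lines of Python. This is a deliberately simplified toy, assuming a bare word-weight scoring scheme rather than the algorithm of any actual TAR 2.0 product: every coding decision immediately adjusts the weights, and the remaining documents are re-ranked before the next one is served. The function names, documents, and reviewer decisions are hypothetical.

```python
from collections import defaultdict


def score(text, weights):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights[word] for word in set(text.lower().split()))


def continuous_review(docs, reviewer):
    """Serve documents one at a time, re-ranking after every decision.

    `reviewer` is a callable taking a document id and returning True
    (relevant) or False (not relevant). Returns the review order.
    """
    weights = defaultdict(float)
    remaining = dict(docs)
    order = []
    while remaining:
        # Pick the highest-scoring unreviewed document (ties: first listed).
        next_id = max(remaining, key=lambda d: score(remaining[d], weights))
        text = remaining.pop(next_id)
        relevant = reviewer(next_id)
        order.append((next_id, relevant))
        # Feed the decision straight back into the model.
        delta = 1.0 if relevant else -1.0
        for word in set(text.lower().split()):
            weights[word] += delta
    return order


# Hypothetical documents and the decisions a reviewer would make.
docs = {
    "d1": "office lunch menu",
    "d2": "pricing agreement with competitor",
    "d3": "lunch order for office",
    "d4": "competitor pricing call notes",
    "d5": "draft pricing agreement",
}
decisions = {"d1": False, "d2": True, "d3": False, "d4": True, "d5": True}

order = continuous_review(docs, decisions.get)
print([doc_id for doc_id, _ in order])
```

A linear review would reach d3 third. Here the loop pushes it to the end, because the reviewer's early decisions on d1 and d2 already tell the model that pricing-related documents should come first.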


Each of the above models has discrete use cases as a standalone solution and/or as part of an integrated solution that incorporates multiple machine learning models. Savvy legal practitioners can benefit from the speed to insight and reduction of cost associated with leveraging some or all of these solutions. In the next installment we will dig into the perceived challenges in using various machine learning-based legal AI and best practices to get the most out of your AI-powered workflow.



