In the early technology-assisted review (TAR 1.0) era, many thought training with randomly selected documents was important to the success of a TAR review. The fear was that attorneys would bias the TAR algorithm if they selected documents for initial training. When challenged, most relied on the old shibboleth: “You don’t know what you don’t know.”
Frankly, the “don’t know” point made sense at a time when legal professionals were just realizing that their carefully crafted keyword searches were failing because their targets used different terms. If lawyer keywords couldn’t be trusted, why trust lawyer-selected training seeds? And random selection was the cornerstone of the TAR 1.0 protocol called simple passive learning, under which the computer passively (randomly) selected all the training documents.
Then came the research. In 2014, Grossman and Cormack published their landmark study comparing different TAR protocols. It showed that random selection was the least efficient way to find relevant documents, including those you didn’t know about originally. Our research, and that of several others, consistently told the same tale.

Our goal in this post is not to dissect the arguments on either side of the random sampling debate. Rather, we want to focus on contextual diversity, an algorithm we created to help find those “you don’t know what you don’t know” documents. We also want to have a bit of fun and show you how Zipf’s law (we hadn’t heard of it either) supports our approach. It will help you understand why integrating contextual diversity with a continuous active learning (CAL) system is more efficient and effective than random sampling for ensuring topical coverage and avoiding bias.
What Is Contextual Diversity?
In a TAR 2.0 CAL system, we continuously use all the judgments of the review teams to make the algorithm smarter (which means you find relevant documents faster). In large part, we feed highly ranked documents to the review team and use their judgments to train the system. However, our continuous learning approach also throws other options into the mix to (1) further improve performance, (2) combat potential bias and (3) ensure complete topical coverage. One of these options that addresses all three concerns is Catalyst’s contextual diversity algorithm.
Contextual diversity focuses on documents that are highly different from the ones already seen by human reviewers. Because our system ranks all of the documents on a continual basis, we know a lot about the documents, both those the review team has seen and those it has not yet seen. The contextual diversity algorithm identifies documents based on how significant they are and how different they are from the ones already seen, and then selects the training documents most representative of those unseen topics for human review.
It’s important to note that the algorithm doesn’t know what those topics mean or how to rank them. But it can see that these topics need human judgments on them and then selects the most representative documents it can find for the reviewers. This accomplishes two things:
- It is constantly selecting training documents that will provide the algorithm with the most information possible from one attorney document view, and
- It is constantly putting the next biggest “unknown unknown” it can find in front of attorneys so they can judge for themselves whether it is relevant or important to their case.

We feed in enough of the contextual diversity documents to ensure that the review team gets a balanced view of the document population, regardless of how any initial seed documents were selected. But we also want the review team focused on highly relevant documents, not only because finding them is their ultimate goal, but also because those documents are highly effective at further training the TAR system. We therefore want to make the contextual diversity portion of the review as efficient as possible. How we optimize that mix is a trade secret, but the concepts behind contextual diversity and active modeling of the entire document population are explained below.
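To make the mechanics concrete, here is a minimal sketch of how a CAL loop might blend top-ranked documents with contextual diversity picks. The function name `next_review_batch` and the `diversity_share` ratio are our own illustrative assumptions; the actual mix Catalyst uses is, as noted, a trade secret.

```python
import random

def next_review_batch(ranked, diversity, batch_size=10, diversity_share=0.2):
    """Blend top-ranked documents with contextual diversity picks.

    `diversity_share` is purely illustrative, not Catalyst's actual
    (trade-secret) mix.
    """
    n_div = max(1, int(batch_size * diversity_share))
    batch = diversity[:n_div] + ranked[:batch_size - n_div]
    random.shuffle(batch)  # so reviewers can't tell which picks are which
    return batch

# Example: 2 diversity picks folded into a batch of 10
print(next_review_batch([f"ranked_{i}" for i in range(50)],
                        [f"diverse_{i}" for i in range(50)]))
```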
Contextual Diversity: Explicitly Modeling the Unknown
In the following example, assume you started the training with contract documents found either through keyword search or witness interviews. You might see terms like the ones above the blue dotted line showing up in the documents. Documents 10 and 11 have human judgments on them (indicated in red and green), so the TAR system can assign weights to the contract terms (indicated in dark blue). But what if there are other documents in the collection, like those shown below the dotted line, that have highly technical terms but few or none of the contract terms? Maybe they just arrived in a rolling collection. Or maybe they were there all along but no one knew to look for them. How would you find them based on your initial terms? That’s the essence of the bias argument.
With contextual diversity, we analyze all of the documents. Again, we’re not solving the strong artificial intelligence problem here, but the machine can still plainly see that there is a pocket of different, unjudged documents there. It can also see that one document in particular, 1781, is the most representative of all those documents, being at the center of the web of connections among the unjudged terms and unjudged documents.
Our contextual diversity engine would therefore select that one for review, not only because it gives the best “bang for the buck” for a single human judgment, but also because it gives the attorneys the most representative and efficient look into that topic that the machine can find.
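For the curious, here is one generic way to approximate that selection with off-the-shelf tools: TF-IDF vectors and cosine similarity, scoring each unjudged document by how far it sits from everything judged and how central it is among its unjudged peers. This is a simplified stand-in for the concept, not Catalyst’s actual (proprietary) algorithm:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_diversity_seed(judged, unjudged):
    """Return the unjudged doc least like anything judged yet most
    central among its unjudged peers (cf. document 1781 above)."""
    vec = TfidfVectorizer().fit(judged + unjudged)
    J, U = vec.transform(judged), vec.transform(unjudged)
    novelty = 1 - cosine_similarity(U, J).max(axis=1)   # distance from judged docs
    centrality = cosine_similarity(U, U).mean(axis=1)   # hub of the unseen pocket
    return unjudged[int(np.argmax(novelty * centrality))]

judged = ["contract signature term payment breach",
          "agreement payment term renewal"]
unjudged = ["voltage capacitor circuit schematic",
            "capacitor circuit firmware voltage sensor",
            "firmware sensor calibration"]
print(pick_diversity_seed(judged, unjudged))  # picks the central technical doc
```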
So Who Is This Fellow Named Zipf?
Zipf’s law was named after the famed American linguist George Kingsley Zipf, who died in 1950. The law refers to the fact that many types of data, including city populations and a host of other things studied in the physical and social sciences, seem to follow a Zipfian distribution, which is part of a larger family of power law probability distributions. (You can read all about Zipf’s law on Wikipedia, which is where we pulled this description.)
Why does this matter? Bear with us, you will see the fun of this in just a minute.
It turns out that the frequency of words and many other features in a body of text tends to follow a Zipfian power law distribution. For example, you can expect the most frequent word in a large population to be roughly twice as frequent as the second most common word, three times as frequent as the third most common word, and so on down the line. Studies of Wikipedia itself have found that the most common word, “the,” is twice as frequent as the next, “of,” with the third most frequent word being “and.” You can see how the frequency drops here:
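You can verify this yourself on any large text file. The sketch below is a minimal check, assuming the file path arrives on the command line and that a crude tokenizer is good enough: under Zipf’s law, rank × frequency should stay roughly flat across the top words.

```python
from collections import Counter
import re
import sys

# Count word frequencies in any large text file supplied on the command line
text = open(sys.argv[1], encoding="utf-8").read().lower()
counts = Counter(re.findall(r"[a-z']+", text))

# Under Zipf's law, frequency ~ C / rank, so rank * frequency stays roughly flat
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>2}  {word:<12} {freq:>8}  rank*freq = {rank * freq:>8}")
```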
Topical Coverage and Zipf’s Law
Here’s something that may sound familiar: Ever seen a document population where documents about one topic were pretty common, those about another topic were somewhat less common, and so forth down to a bunch of small, random stuff? We can model the distribution of subtopics in a document collection using Zipf’s law, too. And doing so makes it easier to see why active modeling and contextual diversity are both more efficient and more effective than random sampling.
Here is a model of our document collection, broken out by subtopics. The subtopics are shown as bubbles, scaled so that their areas follow a Zipfian distribution. The biggest bubble represents the most prevalent subtopic, while the smaller bubbles reflect increasingly less frequent subtopics in the documents.
Now to be nitpicky, this is an oversimplification. Subtopics are not always discrete, boundaries are not precise, and the modeling is much too complex to show accurately in two dimensions. But this approximation makes it easier to see the main points.
So let’s start by taking a random sample across the documents, both to start training a TAR engine and to see what stories the collection can tell us. We’ll assume that the documents are distributed randomly in this population, so we can draw a grid across the model to represent a simple random sample. The red dots represent each of the 80 sample documents. The portion of the grid outside the circle is ignored.
We can now represent our topical coverage by shading the circles covered by the random sample. You can see that a number of the randomly sampled documents hit the same topical circles. In fact, over a third (32 out of 80) fall in the largest subtopic. A full dozen are in the next largest. Others hit some of the smaller circles, which is a good thing, and we can see that we’ve colored a good proportion of our model yellow with this sample.

So in this case, a random sample gives fairly decent results without requiring any analysis or modeling of the entire document population. But it’s not great. And with respect to topical coverage, it’s not exactly unbiased, either. The biggest topics have a ton of representation, a few tiny ones are now represented by a full 1/80 of the sample each, and many larger ones were completely missed.
So a random sample has some built-in topical bias that varies randomly; a different random sample might be biased in different directions. Sure, it gives you some rough statistics on what is more or less common in the collection, but both attorneys and TAR engines usually care more about what is in the collection than about how frequently it appears.
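A quick simulation makes the point concrete. The sketch below builds a toy collection whose 40 subtopic sizes follow a 1/rank curve and draws a simple random sample of 80 documents; the specific numbers are illustrative and will not match the bubble diagram exactly.

```python
import random
from collections import Counter

random.seed(42)

# Toy collection: 40 subtopics whose sizes follow a 1/rank (Zipfian) curve
sizes = [round(4000 / rank) for rank in range(1, 41)]
docs = [topic for topic, size in enumerate(sizes) for _ in range(size)]

# Draw a simple random sample of 80 documents and tally the subtopics it hits
hits = Counter(random.sample(docs, 80))

print(f"subtopics covered: {len(hits)} of {len(sizes)}")
print(f"sample documents in the largest subtopic: {hits[0]}")
print(f"subtopics missed entirely: {len(sizes) - len(hits)}")
```

Even though the sample is perfectly random, most of its budget typically lands in the few big subtopics, while a third or so of the small ones go unseen.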
What if we actually can perform analysis and modeling of the entire document population? Can we do better than a random sample? Yes, as it turns out, and by quite a bit.
Let’s attack the problem again by putting attorney eyes on 80 documents, the exact same effort as before, but this time selecting the documents using a contextual diversity process. Remember: our mission is to find representative documents from as many topical groupings as possible to train the TAR engine most effectively, to avoid any bias that might arise from judgmental sampling, and to help the attorneys quickly learn everything they need to from the collection. Here is the topical coverage achieved using contextual diversity for the same size review set of 80 documents:

Now look at how much of the collection is colored yellow. By actively modeling the whole collection, the TAR engine with contextual diversity uses everything it can see in the collection to give reviewing attorneys the most representative document it can find from each subtopic. By using its knowledge of the documents to systematically work through the subtopics, it avoids massively oversampling the larger ones and relying on random samples to eventually hit all the smaller ones (which, given the nature of random samples, would need to be very large to have a decent chance of hitting all the small stuff). It achieves much broader coverage for the exact same effort.
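Continuing the toy simulation, the sketch below compares the subtopic coverage of a random sample of 80 documents against a selection that greedily covers uncovered subtopics first. A real contextual diversity engine works from document features rather than topic labels; using the labels here simply makes the coverage gap easy to see.

```python
import random

random.seed(42)

# Same toy collection as before: 40 subtopics on a 1/rank curve
sizes = [round(4000 / rank) for rank in range(1, 41)]
docs = [topic for topic, size in enumerate(sizes) for _ in range(size)]

# Baseline: how many subtopics does a random sample of 80 documents touch?
random_coverage = len(set(random.sample(docs, 80)))

# Diversity-style selection: greedily pick one document from each subtopic
# not yet covered, then spend any leftover budget on random picks
covered = set()
for _ in range(80):
    uncovered = [t for t in range(len(sizes)) if t not in covered]
    covered.add(uncovered[0] if uncovered else random.choice(docs))

print(f"random sample covered {random_coverage} of {len(sizes)} subtopics")
print(f"diversity-style selection covered {len(covered)} of {len(sizes)} subtopics")
```

On a typical run, the random sample touches around 25 of the 40 subtopics, while the greedy selection covers all of them with the same 80-document budget.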
Below is a comparison of the two different approaches to selecting a sample of 80 documents. The subtopics colored yellow were covered by both. Orange indicates those that were found using contextual diversity but missed by the random sample of the same size. Dark blue shows those smaller topics that the random sample hit but contextual diversity did not reach in the first 80 seed documents.

Finally, here is a side-by-side comparison of the topical coverage achieved for the same amount of review effort:

Now imagine that the attorneys started with some judgmental seeds taken from only one or two topics. You can see how contextual diversity would help balance the training set and keep the TAR engine from running too far down only one or two paths at the beginning of the review, by methodically giving attorneys new, alternative topics to evaluate.
When subtopics roughly follow a Zipfian distribution, it is easy to see how simple random sampling tends to produce inferior results compared to an active learning approach like contextual diversity. In fact, systematic modeling of the collection and algorithmic selection of training documents beat random sampling even when every topic is exactly the same size, though for reasons we will not go into here.
The goal in e-discovery review is to find relevant documents as quickly and efficiently as possible while also helping attorneys learn everything they need to know to litigate the case effectively. With contextual diversity, George Zipf is in our corner.
To learn more about TAR 2.0 based on continuous active learning, download a free copy of TAR for Smart People, 3rd Edition. To learn about Insight Predict, Catalyst’s TAR 2.0 engine, click here.