Once you have processed your documents (extraction plus deduplication) and search terms have been applied, you must take at least two things into account when estimating the cost of review:
(1) page count, and (2) file size.
Most everyone asks for file count and I think this is a big mistake (unless you are performing a native file review where page counts are NOT available).
Page counts tell you how fast the review will go: divide the total page count by the team's collective review rate (the number of contract reviewers multiplied by the rate each one sustains), say, 25000 pages per hour. File counts don't help here because assignments containing an equal number of files can have very different page counts. For example, 500 files in one assignment may go quickly for a reviewer, while another batch of 500 files for the SAME reviewer may take much longer if the page count is doubled.
Page counts will also give you cost of TIFF'ing and blowbacks (if this is the form of production or *gasp* review).
File size will give you the cost of electronic production since this is typically charged on a per gigabyte basis.
So when you are gathering statistics for estimating the cost of review, make sure you ask for (1) page count, and (2) file size.
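To make the arithmetic concrete, here is a minimal Python sketch of the kind of estimate I have in mind. Every number in it (reviewer count, pages per reviewer-hour, hourly rate, per-page TIFF price, per-gigabyte production price) is a made-up placeholder rather than anybody's real pricing, and the function name is hypothetical. Plug in your own figures.

```python
def estimate_review(page_count, size_gb, reviewers, pages_per_reviewer_hour,
                    hourly_rate, tiff_per_page=0.0, production_per_gb=0.0):
    """Rough estimate driven by page count and file size (all rates are assumptions)."""
    # Team throughput: number of reviewers times the rate each one sustains.
    team_pages_per_hour = reviewers * pages_per_reviewer_hour
    # Wall-clock duration of the review with the whole team working in parallel.
    elapsed_hours = page_count / team_pages_per_hour
    # Billable reviewer-hours do not depend on how many people share the work.
    reviewer_hours = page_count / pages_per_reviewer_hour
    return {
        "elapsed_hours": elapsed_hours,
        "review_cost": reviewer_hours * hourly_rate,
        "tiff_cost": page_count * tiff_per_page,        # priced per page
        "production_cost": size_gb * production_per_gb,  # priced per gigabyte
    }


# Placeholder inputs: 1,000,000 pages, 40 GB, 20 reviewers at 50 pages/hour each.
estimate = estimate_review(
    page_count=1_000_000, size_gb=40, reviewers=20,
    pages_per_reviewer_hour=50, hourly_rate=60.0,
    tiff_per_page=0.02, production_per_gb=100.0,
)
for item, value in estimate.items():
    print(f"{item}: {value:,.2f}")
```

Notice that file count never appears as an input. Page count drives the review and TIFF lines, and file size drives the production line.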
Damn, it feels good to be a Banker! [video]
This is hilarious! FYI, I've been in consulting for 10 years. The banker-side of the rap is a bit hard to make out, so you'll have to listen to it twice (volume turned up even) to catch all the lyrics.
E-Discovery is Low Tech
A lot of the work we do is low tech in nature. It's funny because the college graduates we hire have been raised in a Web 2.0 world and are shocked to find themselves spending their time copying and counting files, tying out exceptions, recovering passwords, converting files from one format to another, etc. What's worse is that we'll hire experienced technologists who end up doing similar low grade work. The lucky ones get to run SQL queries. Whoopee! That's considered advanced.
One soon realizes that work in our industry is low tech because it has to be: the volume of data that we process is gargantuan. We have to keep things simple so that we can easily detect mistakes. For all the progress that we've made in document analytics, these advanced techniques tend to fall by the wayside once something faulty is detected in the processing pipeline. Then everyone gets back to basics. Suddenly, our energy is focused on why a piece of metadata is missing, why our file counts are off, or why multilingual characters aren't displaying. You only have to experience the tirade of a partner or senior attorney once to realize that data integrity is paramount. Everything else is fluff. Mistakes are bound to happen in our business, and it doesn't matter whether one is due to a software bug or human error. A single processing mistake can be reason enough to convince a review team to fall back to reviewing everything linearly, one document at a time.
It makes me think of companies that make things like dandruff shampoo. There's very little room for innovation, but they're providing a product for the masses. I'm sure the chemical formula of dandruff shampoo can be pretty complex, but the expectations are simple. Try to invent a cherry flavored version of dandruff shampoo, for example, and you run the risk of introducing a widespread allergic reaction. Do that, and the trust is broken.
Electronic Discovery is a shampoo industry. Meanwhile, the semantic web, social media, spatial GIS and other high tech trends are passing us by, all the while generating more and more data for us to handle and process.
Project Managers, Practitioners, and Professionals
There are three archetypes: project managers, practitioners, and professionals. A good project team will be staffed with all three. There's the gal who keeps the project on track, on budget, and within scope; the geek (with a faint, detectable glow of a halo around his head) who can deliver a soliloquy on the history of Bates numbers and can lecture at exhaustive length on recall and precision; and last (but not least) the partner who, between negotiating the big deals, instills ethical and professional behavior in the team. A bad team, mind you, can still be staffed with all three archetypes. The difference is that a good team knows that they need each and every one of these role players. They rely on a mixture of everyone's talents. A bad team will have individuals who have an overinflated view of their own contributions. They downplay the relative worth of everyone else's role on the project and have a heroic view of themselves. Whenever you're embarking on a new project, get to know the role players. If you notice bickering, infighting, or grandstanding, this is a huge red flag. This, even more so than the value of the technology that's being employed, will give you some indication of the project's ultimate chance for success.
Recall and Precision
There's a great law.com article by H. Christopher Boehning and Daniel J. Toal that discusses traditional keyword and Boolean search methods versus new alternative methods. Though the authors don't mention it specifically, their article discusses the theory of "recall" and "precision". The ability to search a corpus of documents and bring back all of the relevant material in a result set is called "recall". The ability to keep false positives out of a result set, so that most of what comes back is relevant, is called "precision". Therefore, if you craft an overly broad search you may increase your recall, but lower your precision. This scenario usually results in a larger number of false positive documents to sort through in your review. If you have very few false positives in your result set, it allows you to identify relevant documents one after another with fairly high frequency, but the snapshot of material may be a very thin slice of the overall relevant material (high precision, low recall). In other words, there may be a lot more juicy stuff out there to review. The trick, and this is the holy grail of search, is this: how do you corral all of the good stuff without having any bad stuff mixed in?
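For the geeks, the two definitions above boil down to a couple of ratios. Here is a minimal Python sketch; the document IDs and the two example searches are invented for illustration, and in real life you rarely know the full set of relevant documents up front.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a result set against the known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant  # relevant documents the search actually found
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical corpus in which four documents are truly relevant.
relevant_docs = {"DOC-001", "DOC-002", "DOC-003", "DOC-004"}

# An overly broad search: finds everything relevant, plus four false positives.
broad_search = relevant_docs | {"DOC-005", "DOC-006", "DOC-007", "DOC-008"}

# A very narrow search: no false positives, but only a thin slice of the good stuff.
narrow_search = {"DOC-001", "DOC-002"}

print(recall_and_precision(broad_search, relevant_docs))   # (1.0, 0.5)
print(recall_and_precision(narrow_search, relevant_docs))  # (0.5, 1.0)
```

The broad search is the high recall, low precision case, and the narrow search is the reverse.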
It really depends on your review goals. The fallacy with most search efforts is a desire to only get low doc counts with the most relevant material possible. In this case, the emphasis for your review is on precision (maybe because cost is your primary driving constraint). If relevant material is rampant within the corpus, however, you will want to increase your recall in order to get at the full scope of your issue. You may tolerate a good number of false positives in order to be as thorough as possible (maybe completeness is your primary driving constraint). You'll want to decide quickly whether recall or precision is the ultimate goal of your review. Of course you'll want both, but after the review has started you'll want to shift your focus to one or the other depending on the incremental results of your review. You'll know quickly (after a day or two) if your review assignments are yielding the desired level of precision. In order to test your level of recall, you'll want to draw a random sample from the documents that were excluded from review (make sure the sample is large enough to be statistically meaningful). Once you perform a QC review on this sample set, you'll know whether your search terms were sufficient in capturing enough relevant material.
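The sampling step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sample is drawn uniformly at random, the QC calls on the sample are correct, and the margin of error uses the textbook normal approximation at roughly 95% confidence. The function names and all the counts are hypothetical.

```python
import math
import random


def draw_sample(excluded_ids, sample_size, seed=0):
    """Pick a reproducible random sample of the documents excluded from review."""
    rng = random.Random(seed)
    return rng.sample(sorted(excluded_ids), sample_size)


def estimate_recall(relevant_found, excluded_total, sample_size, relevant_in_sample):
    """Project the QC results on the sample back onto the whole excluded population."""
    # Share of sampled excluded documents that turned out to be relevant.
    miss_rate = relevant_in_sample / sample_size
    # Normal-approximation margin of error on that share (about 95% confidence).
    margin = 1.96 * math.sqrt(miss_rate * (1 - miss_rate) / sample_size)
    estimated_missed = miss_rate * excluded_total
    total_relevant = relevant_found + estimated_missed
    recall = relevant_found / total_relevant if total_relevant else 0.0
    return {
        "miss_rate": miss_rate,
        "margin_of_error": margin,
        "estimated_missed": estimated_missed,
        "estimated_recall": recall,
    }


# Placeholder numbers: 9,000 relevant documents found in review, 100,000 documents
# excluded by the search terms, 1,000 of those sampled, 10 of the sample relevant.
sample = draw_sample(range(100_000), 1_000)
result = estimate_recall(relevant_found=9_000, excluded_total=100_000,
                         sample_size=1_000, relevant_in_sample=10)
print(result)
```

With these made-up numbers, about 1,000 relevant documents are estimated to be sitting in the excluded pile, which puts recall at roughly 90%. A bigger sample tightens the margin of error.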
As you all know, this kind of iterative work is commonplace in our business. Unless you have a real sense of the percentage of relevant material to begin with, there's absolutely no way of knowing whether your search results have achieved the highest level of recall and precision until you roll up your sleeves and just dig into it. If you're trusting the artificial intelligence of a system to do this "auto-magically" for you, either by concept grouping or "learning" or some other newfangled algorithm, then you are putting quite a bit of faith into the technology. Remember that most of this new technology is a carefully guarded trade secret belonging to the software vendor. In order to prove anything to the court, however, you have to be able to lift the hood and explain the goings-on underneath. The only defensible position that one can take these days, at least until there's a technology winner that is universally accepted by the court, is to present your search terms with hit counts and corresponding review calls. Keywords and Boolean searches are still the state-of-the-art today.
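A search term report of that kind is about as low tech as it gets. Here is a minimal Python sketch, assuming each document is a small record holding its extracted text and the reviewer's call. The field names, sample documents, and terms are all invented, and the matching is plain whole-word, case-insensitive keyword matching rather than full Boolean syntax.

```python
import re


def hit_report(documents, terms):
    """Count, per search term, the documents that hit and how many were called responsive."""
    rows = []
    for term in terms:
        # Whole-word, case-insensitive match on the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits = [doc for doc in documents if pattern.search(doc["text"])]
        responsive = sum(1 for doc in hits if doc["responsive"])
        rows.append({"term": term, "hits": len(hits), "responsive": responsive})
    return rows


# Hypothetical documents with the review call already recorded.
docs = [
    {"text": "Backdating the option grant was discussed.", "responsive": True},
    {"text": "Lunch menu for Friday.", "responsive": False},
    {"text": "Please review the grant schedule.", "responsive": False},
]

report = hit_report(docs, ["grant", "backdating"])
for row in report:
    print(f'{row["term"]:<12} hits={row["hits"]} responsive={row["responsive"]}')
```

Anyone can read a table like this and see exactly which terms pulled in which documents, which is the whole point of being able to lift the hood.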
This Blog is dedicated to the men & women working directly in the trenches on EDD projects - junior attorneys, paralegals, project managers, document reviewers, data processors, and staff consultants alike, who put in countless stressful (and often thankless) hours doing what seems to be the impossible.
About Me
Name: Jerry Bui
Location: Los Angeles, California, United States
Jerry leads large scale electronic discovery projects and investigations for government agencies and the country's top law firms. His background is in multi-tiered software architecture, network security, data modeling/warehousing and document analytics. He has been involved in major front-page corporate cases, some of which involve hot-button matters such as Anti-money Laundering, Antitrust, and Options Back-dating.
Disclaimer: Opinions and claims contained herein are those of the author only and are not representative of Jerry's employer, its partners, or any of its member firms.