The Sinister Six…Challenges of Working with Large Data Sets

November 20, 2020



Nick Schreiner
Nick Schreiner

Collectively, we have sent an average of 306.4 billion emails each day in 2020. Add to that 23 billion text messages and other messaging apps, and you get roughly 41 million messages sent every minute[1]. Not surprisingly, there have been at least one or two articles written about expanding data volumes and the corresponding impact on discovery. I’ve also seen the occasional post discussing how the methods by which we communicate are changing and how “apps that weren’t built with discovery in mind” are now complicating our daily lives. I figured there is room for at least one more big data post. Here I’ll outline some of the specific challenges we’ll continue to face in our “new normal,” all while teasing what I’m sure will be a much more interesting post that gets into the solutions that will address these challenges.

Without further delay, here are six challenges we face when working with large data sets and some insights into how we can address these through data re-use, AI, and big data analytics:

  1. Sensitive PII / SHI - The combination of expanding data volumes, data sources, and increasing regulation covering the transmission and production of sensitive personally identifiable information (PII) and sensitive health information (SHI) presents several unique challenges. Organizations must be able to quickly respond to Data Subject Access Requests (DSARs), which require that they be able to efficiently locate and identify data sources that contain this information. When responding to regulatory activity or producing in the course of litigation, the redaction of this content is often required. For example, DOJ second requests require the redaction of non-responsive sensitive PII and/or SHI prior to production. For years, we have relied on solutions based on Regular Expressions (RegEx) to identify this content. While useful, these solutions provide somewhat limited accuracy. With improvements in AI and big data analytics come new approaches to identifying sensitive content, both at the source and further downstream during the discovery process. These improvements will establish a foundation for increased accuracy, as well as the potential for proactively identifying sensitive information as opposed to looking for it reactively.
  2. Proprietary Information - As our society becomes more technologically enabled, we’re experiencing a proliferation of solutions that impact every part of our life. It seems everything nowadays is collecting data in some fashion with the promise of improving some quality of life aspect. This, combined with the expanding ways in which we communicate means that proprietary information, like source code, may be transmitted in a multitude of ways. Further, proprietary formulas, client contacts, customer lists, and other categories of trade secrets must be closely safeguarded. Just as we have to be vigilant in protecting sensitive personal and health information from inadvertent discloser, organizations need to protect their proprietary information as well. Some of the same techniques we’re going to see leveraged to combat the inadvertent disclosure of sensitive personal and health information can be leveraged to identify source code within document populations and ensure that it is handled and secured appropriately.
  3. Privilege - Every discovery effort is first aimed at identifying information relevant to the matter at hand, and second to ensure that no privileged information is inadvertently produced. That is… not new information. As we’ve seen the rise in predictive analytics, and, for those that have adopted it, a substantial rise in efficiency and positive impact on discovery costs, the identification of privileged content has remained largely an effort centered on search terms and manual review. This has started to change in recent years as solutions become available that promise a similar output to TAR-based responsiveness workflows. The challenge with privilege is that the identification process relies more heavily on “who” is communicating than “what” is being communicated. The primary TAR solutions on the market are text-based classification engines that focus on the substantive portion of conversations (i.e. the “what” portion of the above statement). Improvments in big data analytics mean we can evaluate document properties beyond text to ensure the “who” component is weighted appropriately in the predictive engine. This, combined with the potential for data re-use supported through big data solutions, promises to substantially increase our ability to accurately identify privileged, and not privileged, content.
  4. Responsiveness - Predictive coding and continuous active learning are going to be major innovations in the electronic discovery industry…would have been a catchy lead-in five years ago. They’re here, they have been here, and adoption continues to increase, yet it’s still not at the point where it should be, in my opinion. TAR-based solutions are amazing for their capacity to streamline review and to materially impact the manual effort required to parse data sets. Traditionally, however, existing solutions leverage a single algorithm that evaluates only the text of documents. Additionally, for the most part, we re-create the wheel on every matter. We create a new classifier, review documents, train the algorithm, rinse, and repeat. Inherent in this process is the requirement that we evaluate a broad data set - so even items that have a slim to no chance of being relevant are included as part of the process. But there’s more we can be doing on that front. Increases in AI and big data capabilities mean that we have access to more tools than we did five years ago. These solutions are foundational for enabling a world in which we continue to leverage learning from previous matters on each new future matter. Because we now have the ability to evaluate a document comprehensively, we can predict with high accuracy populations that should be subject to TAR-based workflows and those that should simply be sampled and set aside.
  5. Key Docs - Variations of the following phrase have been uttered time and again by numerous people (most often those paying discovery bills or allocating resources to the cause), “I’m going to spend a huge amount of time and money to parse through millions of documents to find the 10-20 that I need to make my case.” They’re not wrong. The challenge here is that what is deemed “key” or “hot” in one matter for an organization may not be similar to that which falls into the same category on another. Current TAR-based solutions that focus exclusively on text lay the foundation for honing in on key documents across engagements involving similar subject matter. Big data solutions, on the other hand, offer the capacity to learn over time and to develop classifiers, based on more than just text, that can be repurposed at the organizational and, potentially, industry level.
  6. Risk - Whether related to sensitive, proprietary, or privileged information, every discovery effort utilizes risk-mitigation strategies in some capacity. This, quite obviously, extends to source data with increasing emphasis on comprehensive records management, data loss prevention, and threat management strategies. Improvements in our ability to accurately identify and classify these categories during discovery can have a positive impact on left-side EDRM functional areas as well. Organizations are not only challenged with identifying this content through the course of discovery, but also in understanding where it resides at the source and ensuring that they have appropriate mechanisms to identify, collect and secure it. Advances in AI and big data analytics will enable more comprehensive discovery programs that leverage the identification of these data types downstream to improve upstream processes.

As I alluded to above, these big data challenges can be addressed with the use of AI, analytics, data reuse, and more. Now that I have summarized some of the challenges many of you are already tasked with dealing with on a day-to-day basis, you can learn more about actual solutions to these challenges. Check out my colleague’s write up on how AI and analytics can help you gain a holistic view of your data.

To discuss this topic more or to ask questions, feel free to reach out to me at

[1] Metrics courtesy of Statista

About the Author

Nick Schreiner

Nick Schreiner has 14 years of experience in the legal technology industry spanning product management, operations, delivery, and sales support. He has designed and managed end-to-end managed services and eDiscovery service delivery models. He managed and supported end-to-end eDiscovery projects~ designed best practice solutions, consulted on technology workflows and implementations.