
This is a multi-part introduction to Fact Crashing™: The acceleration of dispute resolution through the prioritization of data-based evidence (ACTION data).

There are 9 Principles of Fact Crashing™. As of Part III in the series, I have discussed:

Principle 1: Data is Evidence and is Discoverable

Principle 2: Data Should be Addressed Early

Let us continue.

For this installment, I will answer the question: What is ACTION data?

Traditionally in discovery, parties deal with a wide variety of evidence, but arguably only a few categories that matter – physical evidence, demonstrative evidence, documentary evidence, and testimonial evidence.

Within this paradigm, electronically stored information (ESI) is considered documentary evidence. Within the world of ESI, there are two categories of data: structured data and unstructured data. Most of us are most familiar with unstructured data. This is free-form data with highly malleable content, such as emails, MS Word documents, MS Excel spreadsheets, MS PowerPoint files, audio files, photos, and movies: anything where the content does not follow a particular layout, format, or formula. Hence the term “unstructured.” The opposite of unstructured data is fielded data, or data that is structured. Typically thought of as databases, these can also be spreadsheet pages, load files, or log files. Structured data can also refer to items with metadata, including emails, Microsoft Office documents, and operating system entries.

I have found that much of the structured data we deal with is tracking, tracing, and transactional. This is the data that records your Amazon™ purchase: what you bought, when you bought it, how many units, payment information, and shipping information. All of this is structured (fielded) data. When you receive the product and post a review, you can type whatever you like. You are not typically limited by content, only by length. Your review is unstructured data.
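The contrast can be seen in a minimal sketch (the field names and values below are hypothetical, purely for illustration):

```python
# A hypothetical purchase record: structured (fielded) data.
purchase = {
    "order_id": "112-0000001",
    "item": "USB-C cable",
    "quantity": 2,
    "unit_price": 9.99,
    "ordered_at": "2024-03-01T14:22:05Z",
    "ship_to_zip": "60601",
}

# The buyer's review: unstructured (free-form) data.
review = "Great cable! Charges fast and the braided cord feels sturdy."

# Fielded data supports direct computation, filtering, and comparison;
# the free-form review does not, until someone (or something) reads it.
total = round(purchase["quantity"] * purchase["unit_price"], 2)
print(total)  # 19.98
```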

It also turns out that data scientists consider all data structured. In fact, three axes of structure can be used to describe any data:

  • How structured is the content?
  • How structured is the storage?
  • How structured is the retrieval?

All of our data systems can be measured along these three axes. So, there is no unstructured data; there is only less structured data. This always reminds me of the movie “The Matrix,” when the young boy tells Neo, “The truth is, there is no spoon.” By this analogy, we could consider email “semi-structured data” because it combines fielded data and free-form data. Likewise for text messages, instant messaging, pictures, audio files, etc. All data tends to carry some degree of consistent, highly defined, highly structured data that traces its creation, storage, usage, retrieval, and even disposal.

So, the terms structured data and unstructured data may be misnomers when all data has varying degrees of structure.

Some other descriptors may be helpful:

Human vs. Instrumental

Subjective vs. Objective

Entered vs. Recorded

Manual vs. Automated

Vague vs. Precise

Analog vs. Quantum

Variable vs. Fixed

Free-form vs. Defined

Content vs. Context

Words vs. Numbers

Ideas vs. Metrics

Abstract vs. Concrete

Linguistic vs. Mathematical

Other characteristics are also essential to recognize. Chief among these is the cost curve of structured data versus unstructured data. This deserves its own chapter, but for now, suffice it to say that the curves are very different. Unstructured data has a linear cost curve: the more data you have, the more it costs. Even with advanced technology, we can change the slope of the line, but it remains a line.

Structured data, on the other hand, has a curve reflecting accelerated economies of scale, almost approaching fixed costs. This is an upfront investment with little to no marginal costs for additional records per data source. Moreover, there is a compound value for each additional data source, even while the incremental cost is relatively fixed. Each new data source represents linear costs but exponential value.

For all of these reasons, I suggest the paradigm of ACTION data.
ACTION data is focused on the transaction, the activity, the metadata, and system data associated with that activity. This data is Ambient, Contextual, Transactional, Instrumental, Operational, and Navigable. It could also be considered: Attributable, Codified, Tallied, Integered, Objective, and Necessary. In other words, this is not what someone said they did (that would be the content of their message, voice mail, photo). Instead, it is what they actually did.

What is the opposite of ACTION data? That would be VERBAL data:

Unlike ACTION data, VERBAL data is Variable, Emotional, Reflective, BLOB, Artistic, and Linguistic. Optional descriptors are Varied, Emotive, Relatable, Basic, Actionable (Articulated and Ambiguous, as well), and Language. It’s helpful to think of the data world in two groups: ACTION data and VERBAL data, especially when we recognize that many data files contain elements of both. Fact Crashing™ is focused on ACTION data. And, when VERBAL data is available, we seek ways to turn it into ACTION data instead.

How do you turn VERBAL data into ACTION data?

By adding metadata. This is done by extracting metadata from VERBAL data or by creating metadata from the VERBAL content. We extract metadata when we pull fielded data from email headers and then use those fields to filter those emails. We also do it when drawing file properties from MS Office documents and then using those fields to filter those files. We create metadata when we analyze the content of free-form files and fields to extract information from them. Do they contain personally identifiable information (PII)? Do they reference issues A, B, or C? Do they relate to concepts 1, 2, or 3? Can they be categorized as an invoice, contract, calendar entry, SPAM? Is the content responsive? Is it privileged? When we identify, then record these characteristics, we are creating metadata.
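Both operations can be sketched with Python's standard library `email` parser (the message below is hypothetical, and the "mentions an invoice" flag stands in for whatever content-derived characteristic you care about):

```python
from email import message_from_string

# A hypothetical raw email: header fields plus free-form body.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 04 Mar 2024 09:15:00 -0500
Subject: Q1 invoice attached

Hi Bob, the invoice is attached. Thanks!
"""

msg = message_from_string(raw)

# EXTRACTING metadata: pull the fielded values already present in the headers.
extracted = {h: msg[h] for h in ("From", "To", "Date", "Subject")}

# CREATING metadata: derive a new characteristic from the free-form content.
body = msg.get_payload()
created = {"mentions_invoice": "invoice" in body.lower()}

print(extracted["Subject"])         # Q1 invoice attached
print(created["mentions_invoice"])  # True
```

Either way, the result is fielded data that can be filtered, sorted, and counted, whereas the original body text could only be read.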

In fact, whenever we analyze unstructured data, our computers actually rely on structured characteristics, even when those characteristics are transient or are behind the scenes. For example: When you like a particular song on Pandora®, that system can recommend similar songs to you. How does it do this? Pandora does not listen to the music but instead relies on pre-recorded metadata. In Pandora, every song is categorized by Ph.D. musicologists using 400 different fielded characteristics found in the Music Genome Project. These cover the qualities of melody, rhythm, harmony, form, composition, and lyrics.

Per Julia Layton at How Stuff Works:

“Pandora relies on a Music Genome that consists of 400 musical attributes covering the qualities of melody, harmony, rhythm, form, composition and lyrics. It’s a project that began in January 2000 and took 30 experts in music theory five years to complete. The Genome is based on an intricate analysis by actual humans (about 20 to 30 minutes per four-minute song) of the music of 10,000 artists from the past 100 years. The analysis of new music continues every day since Pandora’s online launch in August 2005. As of May 2006, the Genome’s music library contains 400,000 analyzed songs from 20,000 contemporary artists. You won’t find Latin or classical yet: Pandora is in the process of developing a specialized Latin music Genome and is still deep in thought about how to approach the world of classical composition.”

Pandora and the Music Genome Project have translated something inherently VERBAL into something that is now ACTION. ACTION data can be measured, compared, and ranked. VERBAL data, in its original format, can only be read by humans. This is costly and time-consuming. In the early days of computer-assisted discovery, when we were mainly working with paper documents and scanning them to create repositories of TIFF images, we would send extensive collections of documents to lower-cost review teams. Their sole job was to “code” the documents. This bibliographical coding populated fields such as Author, Subject, Date, Category, etc. These fields, in turn, became the basis of retrieval. The VERBAL data (the scanned images) was augmented with ACTION data.
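To see why fielded attributes make measuring, comparing, and ranking possible, consider a toy sketch loosely in the spirit of the Music Genome Project (the songs, attribute names, and scores below are entirely made up):

```python
import math

# Hypothetical songs scored on a few fielded attributes (0.0 to 1.0),
# e.g. tempo, vocal intensity, acoustic-ness.
songs = {
    "Song A": [0.9, 0.2, 0.7],
    "Song B": [0.8, 0.3, 0.6],
    "Song C": [0.1, 0.9, 0.2],
}

def distance(a, b):
    """Euclidean distance between attribute vectors: smaller = more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank the remaining songs by similarity to the one the listener liked.
liked = "Song A"
ranked = sorted(
    (s for s in songs if s != liked),
    key=lambda s: distance(songs[liked], songs[s]),
)
print(ranked)  # ['Song B', 'Song C']
```

None of this requires "listening" to the music; the computation runs entirely on the fielded (ACTION) data.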

But can’t we just deal with the free-form text? Yes, but there are limitations.

In the seminal information retrieval work of Blair and Maron (1985), the attorneys were using the IBM STAIRS system. This was the first documented system that allowed Boolean searching (AND, OR, NOT, etc.) and proximity searching (adjacent to, same paragraph, etc.). Since then, we have steadily improved full-text searching with additional advanced functions, natural language processing, language translators, and advanced information retrieval techniques for clustering, support vector machines, concept searching, etc.
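The core idea of Boolean retrieval can be sketched in a few lines (the documents and query terms below are hypothetical; real systems like STAIRS added indexing, stemming, and proximity operators on top of this):

```python
# A tiny corpus of hypothetical documents, keyed by document ID.
docs = {
    1: "the contract was signed on friday",
    2: "payment received for the invoice",
    3: "the invoice and contract are attached",
}

def terms(text):
    """Naive tokenization: split on whitespace."""
    return set(text.split())

# Boolean query: contract AND invoice (both terms must appear).
hits_and = [d for d, t in docs.items() if {"contract", "invoice"} <= terms(t)]

# Boolean query: contract OR invoice (either term may appear).
hits_or = [d for d, t in docs.items() if {"contract", "invoice"} & terms(t)]

print(hits_and)  # [3]
print(hits_or)   # [1, 2, 3]
```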

Blair and Maron strongly criticized the full-text retrieval system as ineffective for identifying the correct documents. Since then, we’ve matured many technologies to help with the handling of free-form data. These are usually lumped under some form of artificial intelligence. They are much better but still limited.

Would it be helpful to know if your emails contain Personally Identifiable Information (PII)? Then you can use an algorithm to search, identify, and then record that aspect. In doing so, you would extract some ACTION data from your VERBAL data. Would it be helpful to analyze a large group of communications for sentiment? If so, you can use a sentiment analysis algorithm to analyze, assign, and record that aspect. You just extracted some ACTION data from your VERBAL data.
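As a toy illustration of the PII case, a single regular expression can flag one narrow pattern and record the result as a new field (real PII detection uses far richer pattern sets, validation, and context; the strings below are fabricated examples):

```python
import re

# Flag strings that look like US Social Security numbers (NNN-NN-NNNN).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def tag_pii(text):
    """Create metadata: record whether free-form text appears to contain PII."""
    return {"text": text, "has_pii": bool(SSN_RE.search(text))}

records = [
    tag_pii("Please update my file, SSN 123-45-6789."),
    tag_pii("See you at the meeting tomorrow."),
]

print([r["has_pii"] for r in records])  # [True, False]
```

The free-form text is unchanged; the `has_pii` flag is the newly created ACTION data that can now be filtered and counted across a collection.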

So yes, we can handle VERBAL data. But it’s labor-intensive, expensive, subjective, and can have widely varying degrees of precision and recall.

For all of these reasons, it’s appropriate to prioritize other sources ahead of VERBAL data. It is pertinent to prioritize ACTION data and use it to augment VERBAL data. Recognizing this is the essence of Fact Crashing™.

In the next few installments, we will look at identifying, qualifying, and prioritizing ACTION data. Once you adjust to this perspective, then, just as Neo saw the Matrix as bits and bytes, you may start seeing ACTION data everywhere.

Continue to Part V of our Fact Crashing™ series >>

iDiscovery Solutions is a strategic consulting, technology, and expert services firm – providing customized eDiscovery solutions from digital forensics to expert testimony for law firms and corporations across the United States and Europe.