The Value of Data Efficiency

January 22, 2014 Author: Omer Trajman

Taking a step back to look at how the Big Data space is evolving, we seem to be at the beginning of a multi-decade journey. Big Data, as a phenomenon, describes our increasing ability to gather information about everything. It has become clear over the past few years that where there is a will, there is someone digitizing, collecting, measuring and analyzing. Yet our new power to collect more about “everything that happens” is widening the data literacy gap. Just because we can collect more doesn’t mean we necessarily know more.

Businesses can now collect information about every single movement in the supply chain, or each record that is created by the machines in the data center. They can even apply sophisticated analysis and visualization to all this data, draw astounding new insights and take actions to improve the business. Yet the most difficult step in the process of turning data into insight or action is simply getting data from where it’s created to the people who can analyze it.

The ability to measure and improve upon the process of getting data to the right people is what we call data efficiency. As much as the buzzwords of the day are “data science” and “machine learning,” the hardest part about Big Data is identifying, ingesting, cleansing, transforming and accessing data. In order to use data, an individual must be able to identify it and gain access to it. This sounds simple in principle, yet digging into the details reveals several layers of complexity.

Users in different functional roles have different ways of describing the data they need. A marketer may ask for all data pertaining to a customer website visit. They expect to have a graphical interface that is prepopulated with customer visits, categorized by referring site, time, geographic source and demographics. The source data is a collection of activity traces in the form of web server, application server and database server log files stored on a widely distributed network of servers. Some of the log files are captured by a CDN provider and some by third-party service providers. Collating these files and translating the user’s request into a filter is clearly a much more daunting task than it first appeared to be.
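To make the marketer example concrete, here is a minimal sketch of what "collating these files and translating the user's request into a filter" might look like for the simplest possible case. Everything in it is an assumption for illustration: the log lines are invented, the pattern assumes an Apache-style combined access log format, and the function names (`parse_line`, `collate`, `visits_by_referrer`) are hypothetical. Real sources from a CDN or third-party provider would rarely share one format, which is part of why the task is harder than it looks.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed format: an Apache-style "combined" access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def collate(*sources):
    """Merge several iterables of raw log lines into one stream of records."""
    for source in sources:
        for line in source:
            record = parse_line(line)
            if record is not None:  # silently skip lines we cannot parse
                yield record


def visits_by_referrer(records, path_prefix="/"):
    """Translate 'visits by referring site' into a filter plus a group-by."""
    counts = Counter()
    for record in records:
        if record["status"] == "200" and record["path"].startswith(path_prefix):
            host = urlparse(record["referrer"]).netloc or "(direct)"
            counts[host] += 1
    return counts


# Invented sample data standing in for two differently located log sources.
web_server_log = [
    '203.0.113.5 - - [22/Jan/2014:10:00:00 +0000] "GET /products/1 HTTP/1.1" '
    '200 512 "http://search.example/q" "Mozilla/5.0"',
    "malformed line",
]
cdn_log = [
    '198.51.100.7 - - [22/Jan/2014:10:01:00 +0000] "GET /products/2 HTTP/1.1" '
    '200 300 "-" "Mozilla/5.0"',
    '198.51.100.8 - - [22/Jan/2014:10:02:00 +0000] "GET /admin HTTP/1.1" '
    '403 0 "http://search.example/q" "curl/7.0"',
]

counts = visits_by_referrer(collate(web_server_log, cdn_log), path_prefix="/products")
for referrer, visits in counts.most_common():
    print(referrer, visits)
```

Even this toy version has to decide what counts as a visit, how to treat unparseable lines and how to normalize referrers. None of those decisions appear in the marketer's original request.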

While Big Data technologies - whether Hadoop, MPP databases, BI tools, analytics packages or ETL software - are very powerful, they are by their nature extremely data inefficient. Out of the box, they don’t help users get access to data. The individual pieces of software must be matched to one another to form a stack that includes data collection, storage and access suited to a defined set of user requirements. This stack is then configured to meet the particular needs of the organization. The entire customization process may take months, and in the end users may still need days of manual intervention to access data.

This problem was not created by Big Data; most data management systems are data inefficient, going back decades. Yet our increased ability to measure and collect has exacerbated the data efficiency problem. Historically, new companies have created software to address the critical data access needs of particular constituents. Omniture (and Offermatica) created services to help marketers access information about website usage. ArcSight built systems to help security teams access information about suspicious events. Splunk has built an impressive business helping IT teams access system log data for troubleshooting and triage.

The software that these companies built could have been constructed by individual businesses using existing tools (DBMS, BI, ETL). Yet ArcSight, Omniture, Splunk and their direct competitors gained widespread market adoption largely because they offered out-of-the-box, data-efficient solutions. With the emergence of Big Data, organizations are again at a crossroads. As we’ve noted, the cost of software is rapidly dropping on a per-byte basis. Organizations can attempt to build on top of Big Data technologies and solve the growing data efficiency problems themselves, or wait for vendors to construct solutions for them. Ultimately the ability of any organization to capitalize on Big Data will depend on how efficiently user requests are translated into data access and at what price points those solutions are made available.