Big Data (De) Duplication

April 27, 2014 Author: Omer Trajman

Big Data has huge potential to give us deeper insight into our business, to help us understand our customers, our supply chain and all our devices, and to radically change how we operate, but all these advances come at a cost. In order to analyze data that spans our rapidly growing operational systems, we first need to collect it, which involves copying each piece of data into a single unified system. While there are still many challenges in collecting and managing a wide variety of data at scale, it is the prospect of creating additional copies of an ever-growing set of data that gives Big Data newcomers particular pause.

Consider machine event data as an example. Every server and application in a data center generates events about its activities, typically storing these events in log files local to each server. Web servers write every request they receive, application servers catalog their actions and database servers maintain transaction and execution logs. Though strewn across a data center, the events in each of these logs are useful for many operational applications such as troubleshooting system problems, understanding overall data center performance and capacity planning. Machine events are also routinely used to identify security threats and to look for systemic risk, abuse and fraud.

We are already making many copies of our machine event data in order to feed this valuable data into each of these operational applications -- for triage, clickstream analytics, capacity planning and numerous other uses. With so many copies of machine event data floating around, it’s natural to be wary of creating another copy in the hopes of some as-yet-unknown Big Data discovery. Further, even if our Big Data experiment is successful, it is not clear that it will be valuable enough to justify keeping another copy around. Diving into the details, however, we find that looking at just the number of copies of data obscures the key advantages of a Big Data solution. By not taking advantage of Big Data, we’re potentially missing out on the opportunity to reduce duplicates of data while at the same time getting more value from our operational applications.

Most systems retain between 30 and 90 days of machine event data. For simplicity, we’ll assume we keep 30 days on each source system (web, application and database servers) and 90 days within each of six applications: system triage, security, fraud investigation, performance management, alerting and monitoring. This means the average organization is able to look at only 90 days of data while consuming 30 + 90 x 6 = 570 days’ worth of space, not including backups. If we instead stored a single copy of all of our machine event data and made it available to each of our applications, we could keep over a year and a half’s worth of data for the same price we’re paying today. Consider the possible improvements in planning, security and fraud investigations, and triage of systemic problems if each of these applications had access to six times as much data.
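The arithmetic above can be checked with a short sketch. The retention figures are the ones assumed in this post; the variable and function names are illustrative only and do not come from any real system.

```python
# Retention assumptions from the post; all names here are hypothetical.
SOURCE_RETENTION_DAYS = 30   # the post's arithmetic counts the source-side copy once
APP_RETENTION_DAYS = 90      # each application keeps its own 90-day copy
APPLICATIONS = [
    "system triage", "security", "fraud investigation",
    "performance management", "alerting", "monitoring",
]


def duplicated_footprint_days(source_days, app_days, app_count):
    """Days' worth of storage consumed when every application keeps its own copy."""
    return source_days + app_days * app_count


footprint_days = duplicated_footprint_days(
    SOURCE_RETENTION_DAYS, APP_RETENTION_DAYS, len(APPLICATIONS)
)

# With a single shared copy, the same storage holds this much history.
single_copy_years = footprint_days / 365
history_gain = footprint_days / APP_RETENTION_DAYS

print(f"Storage consumed today: {footprint_days} days' worth")
print(f"History with one shared copy: {single_copy_years:.2f} years")
print(f"Gain over a 90-day window: {history_gain:.1f}x")
```

Changing the number of applications or the retention windows shows how quickly the duplicated footprint grows relative to the history any one application can actually see.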

If we further factor in that Big Data storage costs on average about a third as much as the storage used for conventional applications, then the data potentially available increases to more than four and a half years (570 x 3 = 1,710 days) at the same nominal cost. Big Data not only delivers on the promise of deeper insight and operational improvements, it does so at the same or lower cost than existing solutions by reducing duplicated data and consolidating access and control over a wide variety of data sets required for critical business operations.