In the last few posts we’ve been reviewing the features and benefits of an emerging class of tools called IT Operations Analytics (ITOA). In the next few installments, we share our experience building such a system, starting with the data collection tier, and the challenge of aggregating and processing machine data at scale.
Let’s start with the requirements of the system, taking into account the fact that data collected by it will be used for mission critical purposes, such as auditing, compliance, and investigations. This means that certain guarantees are required:
- no single point of failure
- horizontal scalability of all components
- no artificial limitations on data retention
- guaranteed, non-duplicate delivery of event data
- continuous operation (all upgrades, etc. performed online)
Given the stringent nature of these requirements, we chose to build our system using a messaging architecture with the Apache Kafka message bus at its core, as described in the diagram below. We’ve implemented the quality of service guarantees in the Kafka producers that capture events from different sources and pump them into the bus, as well as in the consumers that pick up the events from the bus and process them.
The event producers transform every event into a standardized Apache Avro schema with a basic set attributes (timestamp, host, location, etc.) and key-value pairs that can be populated with different attributes for each event type, alongside the complete unaltered event body. Certain event types have standard parsers and each user has the ability to define additional custom event types as needed. In either case the event type is populated by the consumer, which determines the downstream processing of the event. The user can also create their own custom producers for proprietary data sources.
On the other side of the bus, the consumers do most of the work associated with de-duplication, transformation, and processing of the events according to the requirements of downstream destinations, parsing the raw bytes of the event and populating additional attributes. Consumers are heavily parallelized and take care of the mechanics of reading from the Kafka bus including offset management, reading across topics, and distributing the downstream work. A set of consumers implement analytic algorithms and push an aggregated feedback to the bus for further processing. Other consumers read the raw and aggregated data and forward it to engines that index and store the data for different uses. Data is indexed in SolrCloud for free text and faceted search; HDFS for deep storage graphs and metrics; and Spark for on the wire processing. Users can plug in custom consumers for specialized transformation, aggregation, and processing or forwarding to other downstream systems such as proprietary SIEM applications or alerting tools and dashboards.
We’ll get into the visualization and analytics aspects of the system in future blogs. But in the mean time if you’d like to hear more about event processing at scale, our CTO, Eric Sammer will be presenting a session on topic at Strata+Hadoop World in San Jose on Thursday; and as always,please feel free to email us at firstname.lastname@example.org.
Eric Sammer (CTO, ScalingData)
11:30am–12:10pm Thursday, 02/19/2015
Hadoop in Action, Location: 210 A/E