
Rocana Ops: Event-based monitoring of IT infrastructure

    
August 27, 2015 Author: Eric Sammer

Rocana Ops is designed to collect, store, and analyze two kinds of machine-generated data at scale: event data and device metrics. Here we’ll dive into how our product handles the former. By event data, we mean not just log events in all their various forms, but also higher-order activities such as authentication events, continuous deployment events, or HTTP page view events.

We defined a single way of describing all of these events so that users can understand the relationships between these systems and build a more complete mental model of the impact they have on one another. This approach also makes it easier to incorporate new kinds or sources of events into a rich, already functional event data collection environment.

[Figure: Rocana architecture diagram]

Anyone can look at an analytics report of HTTP page load events, for instance, and see that average load time jumps from 23 milliseconds to 87 milliseconds at a certain point – and that’s always interesting. But knowing why that change happened is far more useful. There are tens, if not hundreds, of reasons why such a change might occur. In fact, if you told a seasoned site reliability engineer about this kind of problem, one of the first questions they’d ask is, “What has changed?” If, in addition to the load-time metric, they learned that a new version of the web application’s code base was deployed to the affected application servers by a development team, they might be closer to understanding the problem.

The key is in being able to discover and draw these connections simply, quickly, and reliably. Being able to explore different combinations of event types together in a single view reveals a lot more than any one system on its own ever could. With production systems generating tens of terabytes every day, humans must rely on machines to surface these correlations.
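
To make the "What has changed?" question concrete, here is a minimal sketch of looking for deployment events shortly before a page-load anomaly. It is an illustration only, not how Rocana Ops computes correlations: the event type IDs, the `deployments_before` helper, the 15-minute window, and the sample data are all hypothetical.

```python
# Hypothetical event type IDs, chosen for illustration only.
PAGE_VIEW = 200
DEPLOYMENT = 300

FIFTEEN_MINUTES_MS = 15 * 60 * 1000


def deployments_before(events, anomaly_ts_ms, window_ms=FIFTEEN_MINUTES_MS):
    """Return deployment events that occurred within window_ms before the anomaly."""
    start = anomaly_ts_ms - window_ms
    return [
        e for e in events
        if e["event_type_id"] == DEPLOYMENT and start <= e["ts"] <= anomaly_ts_ms
    ]


# Sample data (made up): page load time jumps after a deployment on the same host.
events = [
    {"event_type_id": PAGE_VIEW, "ts": 1_000_000, "host": "web01",
     "attributes": {"load_time_ms": "23"}},
    {"event_type_id": DEPLOYMENT, "ts": 1_300_000, "host": "web01",
     "attributes": {"version": "2.0"}},
    {"event_type_id": PAGE_VIEW, "ts": 1_600_000, "host": "web01",
     "attributes": {"load_time_ms": "87"}},
]

suspects = deployments_before(events, anomaly_ts_ms=1_600_000)
```

Because both kinds of events share the same fields, one filter over one collection is enough to put the deployment next to the slowdown.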

Rocana Ops uses an extensible event data model to represent all data within the system. This schema provides a simple and flexible way of representing time-oriented discrete events that occur within the data center. Every event has the following fields:

  • id – A unique event ID.
  • ts – A Unix epoch-style timestamp, in milliseconds, indicating when the event was generated by its source system.
  • event_type_id – A numeric event type identifier, which also tells the system the format to expect for the body and attributes.
  • location – A name representing the physical location where the event originated.
  • host – The hostname or IP address of the source host.
  • service – The service or process which generated the event.
  • body – An opaque binary blob capable of transporting event type-specific data.
  • attributes – A set of event type-specific key/value attribute pairs.
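
One way to picture these fields is as a small record type. The sketch below is an assumption-laden illustration, not the actual Rocana Ops schema definition: the `Event` class, its Python types, and the `timestamp()` helper are hypothetical, and it assumes `ts` holds milliseconds since the Unix epoch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class Event:
    """Illustrative container mirroring the eight event fields listed above."""
    id: str
    ts: int                      # assumed: milliseconds since the Unix epoch
    event_type_id: int
    location: str
    host: str
    service: str
    body: bytes = b""            # opaque, event type-specific payload
    attributes: Dict[str, str] = field(default_factory=dict)

    def timestamp(self) -> datetime:
        """Convert ts to a timezone-aware UTC datetime (hypothetical helper)."""
        return datetime.fromtimestamp(self.ts / 1000, tz=timezone.utc)


event = Event(
    id="example-id",             # placeholder, not a real event ID
    ts=1436576671000,
    event_type_id=100,
    location="aws/us-west-2a",
    host="example01.rocana.com",
    service="dhclient",
)
```

Keeping `body` as raw bytes and `attributes` as string pairs reflects the split described in the list: the body is opaque, while attributes carry the event type-specific keys and values.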

While most of these fields are self-explanatory, the event type, body, and attributes deserve special attention. The event body is a binary blob typically used to contain the original raw input received by the system. For syslog messages, this is the unparsed RFC 3164 or 5424 syslog message. When the system is configured to tail log files on disk, this is a UTF-8 encoded string of a single record. No matter the source, an event body can hold any data related to an event.

While the body is interesting, most users don’t think of syslog data, for example, as a string. Instead, they think of the timestamp, facility, severity, application name, process ID, and message. Rocana Ops extracts this key information from the body, and stores the resultant data in the attributes of the event. Both the raw and the extracted data are usually retained. Once extracted, the attributes may be used to query or visualize the event data.
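
As a rough sketch of this extraction step, the snippet below pulls attributes out of an RFC 3164-style message. It is not Rocana's parser: the regular expression and the `extract_attributes` function are hypothetical, and the sample message is made up. The one fixed rule it relies on is from the syslog RFCs: the leading priority value equals facility × 8 + severity.

```python
import re

# Simplified RFC 3164 pattern (an assumption; real-world messages vary widely).
SYSLOG_3164 = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) "
    r"(?P<process>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:? "
    r"(?P<message>.*)$"
)


def extract_attributes(body: str) -> dict:
    """Return syslog attributes as strings, or an empty dict if the body doesn't match."""
    match = SYSLOG_3164.match(body)
    if match is None:
        return {}
    pri = int(match.group("pri"))
    attributes = {
        "syslog_facility": str(pri // 8),   # PRI = facility * 8 + severity
        "syslog_severity": str(pri % 8),
        "syslog_hostname": match.group("hostname"),
        "syslog_process": match.group("process"),
        "syslog_message": match.group("message"),
    }
    if match.group("pid"):
        attributes["syslog_pid"] = match.group("pid")
    return attributes


# Made-up sample: PRI 30 decodes to facility 3 (daemon), severity 6 (informational).
sample = "<30>Jul 10 18:04:31 example01.rocana.com dhclient[865] DHCPACK from 10.10.1.1"
attrs = extract_attributes(sample)
```

The raw body stays untouched; the extracted values live alongside it, so a query can use `syslog_severity` without reparsing the string.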

The event type is determined at the event source, and the system uses it to decide how each event should be processed and displayed. It also lets users filter by event type when monitoring or drilling down during an investigation.
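
Here is a small sketch of dispatching and filtering on the event type. The handler registry, the `handles` decorator, and the `render` and `filter_by_type` functions are hypothetical names for illustration, not part of Rocana Ops; the type ID 100 is taken from the syslog example in this post, and 999 is made up.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping event_type_id to a display function.
HANDLERS: Dict[int, Callable[[dict], str]] = {}


def handles(event_type_id: int):
    """Register a display function for one event type."""
    def register(fn):
        HANDLERS[event_type_id] = fn
        return fn
    return register


@handles(100)  # 100 is the syslog type ID used in this post's example
def render_syslog(event: dict) -> str:
    message = event["attributes"].get("syslog_message", "")
    return f'{event["host"]} {event["service"]}: {message}'


def render(event: dict) -> str:
    """Use the type-specific handler if one exists, else a generic fallback."""
    handler = HANDLERS.get(event["event_type_id"])
    if handler is None:
        return f'{event["host"]} [type {event["event_type_id"]}]'
    return handler(event)


def filter_by_type(events: List[dict], event_type_id: int) -> List[dict]:
    """Keep only the events of the given type."""
    return [e for e in events if e["event_type_id"] == event_type_id]


events = [
    {"event_type_id": 100, "host": "example01", "service": "dhclient",
     "attributes": {"syslog_message": "DHCPACK from 10.10.1.1"}},
    {"event_type_id": 999, "host": "example02", "service": "unknown",
     "attributes": {}},
]
rendered = [render(e) for e in events]
syslog_only = filter_by_type(events, 100)
```

An event with an unregistered type still renders through the fallback, which matches the idea that new event types should not require changes before they can be collected.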

Example: A syslog (RFC 3164) event.

{
 id: "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
 event_type_id: 100,
 ts: 1436576671000,
 location: "aws/us-west-2a",
 host: "example01.rocana.com",
 service: "dhclient",
 body: "<30>Jul 10 18:04:31 example01.rocana.com dhclient[865] DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)",
 attributes: {
   syslog_timestamp: "1436576671000",
   syslog_process: "dhclient",
   syslog_pid: "865",
   syslog_facility: "3",
   syslog_severity: "6",
   syslog_hostname: "example01",
   syslog_message: "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
 }
}


The flexibility of this event schema empowers IT operators to monitor new types of events through simple configurations at the source host, without having to teach the Rocana Ops platform how to incorporate the data. Coupled with the Big Data capacity of the platform, this model eliminates the disincentive to collect events as soon as new sources are added. This was a key consideration in developing our product: we want operators to never again have to put off event data collection until they get around to programming the system to understand what is being collected. Rocana Ops collects trillions of events a day without sacrificing their distinctness or the robustness of their contents.

The event data model is just one of several groundbreaking aspects of our product’s design. For a more complete overview, download the white paper, “Rocana Ops Architecture”.

 
