Your data center is like a concert hall. Each server, each application, each container, each rack is a musician taking part in a giant symphony. A few servers might play the same song on different instruments. Some applications might play the same instrument and produce a similar melody. In today's multi-tenant, multi-purpose, multi-billion-dollar data center, however, you end up with a permutation of musicians, songs, instruments, and volumes that is a deafening cacophony when heard all at once. How do you tune in to an individual song? How do you quickly identify when one musician is playing the wrong notes? Or when a musician suddenly starts playing three times as fast, or stops playing altogether?
If you record each musician individually to their own “track,” you isolate each performance and can pause, rewind, zoom in, and really examine each moment alongside how others were performing in the same period. Our real-time ingestion framework records everything from a single stream and breaks it down into thousands of individual tracks based on the characteristics of the data. Tracks can be analyzed while streaming or recorded for later analysis. We need tools that show us these individual tracks in combinations that let us not only visualize the massive amount of data quickly but also understand and navigate it, soloing musicians and groups as needed to see how each performance played out through the day. Because our orchestra is so large, we turn to aggregations, faceting, and filtering to reduce the “track list” to a manageable size and let the patterns we care about emerge.
Let's have a quick look at the volume of events recorded from our data center and see how we can use some visualizations to help understand the data. To begin, we'll look at a plot of overall event volume over a 3-day segment. We can see there was a massive increase in total event volume around 6 PM on August 9th, marked on the screenshot below with a red arrow.
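To make the idea concrete, here is a minimal sketch of how raw timestamped events might be bucketed into a volume-over-time series like the one plotted above. The event lines and the `hourly_volume` helper are hypothetical, standing in for whatever format and tooling your pipeline actually uses.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events, each prefixed with an ISO-8601 timestamp.
events = [
    "2015-08-09T17:58:02 db01 mysql started",
    "2015-08-09T18:01:13 db01 mysql error",
    "2015-08-09T18:02:44 web01 sshd login accepted",
    "2015-08-09T18:15:09 db01 mysql slow query",
]

def hourly_volume(events):
    """Count events per hour bucket to produce a volume-over-time series."""
    counts = Counter()
    for line in events:
        ts = datetime.fromisoformat(line.split()[0])
        # Truncate the timestamp to the start of its hour.
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))

for bucket, volume in hourly_volume(events).items():
    print(bucket, volume)
```

A sudden jump in one bucket's count relative to its neighbors is exactly the kind of spike the red arrow calls out.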
Being able to visualize event volume over time helps us see trends and areas that might be interesting to us, but the key to actually unlocking the data is making it easy to quickly navigate and interact with the landscape. One way we make interacting with this visualization easy is by providing a small, fixed context visualization in the bottom gutter that lets us highlight regions and zoom in to an area of interest. In the screenshot below we've zoomed in to where the increase in events was first observed.
With a click we can decompose this single stream into a group of streams, one for each event producer. In the audio recording metaphor, this is equivalent to viewing individual recorded tracks side by side instead of combined into a single sound. Let's split our events into individual “tracks,” each consisting of the event volume from a specific host in our data center. In the screenshot below we've grouped by host, and when we mouse over the visualization we can see grouped values at any point in time, broken out by the top four hosts. Looking at the visualization now, we can see the event volume increase is related to a few hosts: the red arrow shows that the top four hosts are each responsible for more events than all the other hosts combined.
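Under the hood, this kind of grouping is just counting events per key and ranking the keys. Here is a hedged sketch, with hypothetical `(host, service)` pairs and a made-up `top_tracks` helper; the same function pivots by host or by service depending on which field you group on.

```python
from collections import Counter

# Hypothetical stream: (host, service) pairs extracted from each event.
events = [
    ("db01", "mysql"), ("db01", "mysql"), ("db02", "mysql"),
    ("web01", "nginx"), ("db01", "mysql"), ("db02", "mysql"),
    ("web02", "nginx"), ("db01", "mysql"),
]

def top_tracks(events, key_index, n=4):
    """Split one stream into per-key 'tracks' and rank the top n by volume.

    Returns the top n (key, count) pairs plus the combined count of the rest,
    which is what lets us say 'the top hosts outweigh everyone else'.
    """
    counts = Counter(event[key_index] for event in events)
    top = counts.most_common(n)
    rest = sum(counts.values()) - sum(count for _, count in top)
    return top, rest

top, rest = top_tracks(events, key_index=0, n=2)  # group by host
print(top, rest)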
Pivoting the data in another direction, we group event volumes by the service producing the events. In the screenshot below we see that the large majority of the volume increase is due to mysql. We've identified a probable cause for the increase simply by interacting with our data.
With another click we can switch our plot from a stacked area to a line visualization, since it can make more sense to see each event volume individually rather than stacked on top of one another. In the screenshot below we can very clearly see a huge increase in noise coming out of mysql while the other volume lines remain fairly constant.
Another tool in our arsenal for deeper exploration is plotting individual search queries against one another. This lets us apply different sets of filters to already-filtered data, separating a single stream into multiple substreams, similar to how a graphic equalizer might show bass, mid-range, and treble response on three different plots, letting you see how the different components of a single sound source combine into one track. In the screenshot below we've compared all of the events from mysql against the mysql events containing the word ‘error’ and those containing the word ‘query.’
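Conceptually, each of these plots is just a further filter applied to the already-filtered mysql stream. A minimal sketch, with hypothetical event lines and a made-up `substream` helper:

```python
# Hypothetical mysql event lines; in practice these would come from
# a search query against the full event stream.
mysql_events = [
    "mysql: slow query on orders",
    "mysql: error connecting to replica",
    "mysql: query cache miss",
    "mysql: started",
    "mysql: error disk full",
]

def substream(events, term):
    """Filter an already-filtered stream into a substream matching a term."""
    return [event for event in events if term in event]

print(len(mysql_events))                      # all mysql events
print(len(substream(mysql_events, "error")))  # the 'error' substream
print(len(substream(mysql_events, "query")))  # the 'query' substream
```

Plotting each substream's volume over time on its own axis gives the equalizer-style comparison described above.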
Using just a few visualizations and tools we've faceted, split, and filtered the data into smaller and smaller forms, all the while drilling deeper into a particular movement in our data center. From here we can jump to a deep view of historical metrics collected from mysql for the same time period, or drill down further toward the start of the increase.
Maybe we can move back a week in time and see if this kind of behavior is expected? That's fodder for another blog post!
Or has it already been flagged as anomalous, answering that question for us? We cover how we detect anomalies in this blog post.
At Rocana we deal with massive amounts of data from a huge number of systems, and we believe visualizing the data alone isn't enough. A picture is worth a thousand words, but a picture you can change and manipulate tells a thousand different stories and holds answers to questions you learn how to ask only while you're exploring. We continue to push ourselves to uncover new ways to interact with the data so you can find the off-key, out-of-tune, and off-beat notes hidden within the wall of sound.