
Old-school IT Ops at Next-Gen Scale

August 17, 2015 Author: Brian Dominick

Remember the first server cluster you ever administered? Mine was in 1999; it housed a couple of load-balanced web application servers, a web server, two mail servers, and a database server. Problems were easy to identify: either I’d get a notice from a monitoring tool or else my phone would ring because a user or ten had complained. Either way, I was the guy for that cluster. I typically knew which server to SSH into based on what the problem was. I knew where the logs were, and I knew which logs to review first, because I just knew the damn cluster.

Even if the problem was complicated — say, poor performance of an outgoing mass mailing — I could assume it was either the SMTP server used to spool the mailings, the web application used to perform the mailings, or the database from which we drew the content and recipient info for the mailings. Worst-case scenarios were digestible.

If you worked in an IT department ten years ago, even a large one, you might have had a little domain like mine, but that's practically unheard of today, even at mid-sized operations. The "tribal knowledge" that my fellow sysadmins and I acquired let us manage these small, compartmentalized clusters effectively. But with today's dynamic workloads bouncing among virtualized services and containers, constantly scaling up and down to meet demand, the old paradigm just doesn't work. Instead, IT operators spend hours on group conference calls sorting through mounds of data, trying to find the source of issues. Often the problem is never found; servers and services are simply restarted with fingers crossed.

What if you could recreate the kind of intimacy that put your whole domain within reach, and return to the days when finding the root cause of even a complex problem took just a few steps? What if, in fact, this could be achieved even more simply than with the old conventional methods?

In the white paper "How to Investigate An Infrastructure Performance Problem", Rocana's Director of Engineering, Joey Echeverria, discusses how Rocana Ops puts the power of old-school IT root cause analysis within the modern IT operator's reach, using advanced analytics and visualizations to guide us through what would otherwise be a morass of infrastructure and data. Joey walks us from discovering the problem to nailing the root cause, all without leaving our web browser. The days of complex applications running on simple IT environments are behind us, but Rocana is putting control over those environments back at the IT operator's fingertips.