How to Avoid Front Page Outages

    
August 8, 2016 Author: Omer Trajman

Airlines have had a rough year. At Southwest, a router failure a few weeks ago grounded all flights. Today a single power outage took down Delta's entire operations system, and the airline is still recovering. JetBlue has had to revert to manual check-ins because of a systems issue, and last year United flights were delayed worldwide by a failed network router.

As we add speed and wring inefficiencies out of our infrastructure, we rely ever more on complex computer systems to keep things running. When everything works as designed, these systems can greatly increase customer satisfaction and profits. When they fail in unexpected ways, the result is front page news, a social media firestorm, and millions of dollars lost.

Yet as we’ve learned from other companies, outages are often very predictable, and cascading outages in particular are almost always avoidable. Resiliency in complex computer systems comes in part from design, but what most companies are missing is broad visibility into all of their systems. From conversations I’ve had with hundreds of IT professionals, most organizations monitor only 5-10% of their systems, and they actively watch even fewer: just those believed to be mission critical.

In a world powered by complex IT infrastructure, there is rarely such a thing as “non-mission critical.” Increasingly, all systems are interconnected so that we can give customers better omni-channel experiences, cross-sell and up-sell personalized services, and run a more efficient business operation. When everything depends on everything else, there is no choice but to monitor all systems and achieve total visibility.

The way to make the front page is to keep depending on silos of limited-scale IT monitoring and then act surprised when some interconnected system causes a cascade of outages, leaving customers frustrated and staff scrambling to get everything back online. The way to stay off the front page is to monitor everything, all the time. That may be easier said than done, but it’s not impossible.

Departments outside of IT have already achieved total visibility. Those that invested in scalable data management infrastructure, big data technologies, and modern analytics have gained a complete view. Marketing has a 360-degree view of its customers, finance knows where every penny comes in and is spent, and logistics is able to track equipment location down to the millisecond. As we become ever more dependent on IT to operate day-to-day business, we need to help IT adopt a similar approach.

At Rocana, we’re proponents of IT building its own event data warehouse. This is step one toward total visibility, and it requires a scalable, open system with built-in analytics that can make sense of all IT data. This is the exact strategy that businesses used when they first started to digitize paper processes. The next stage of digital transformation is that those digital processes are becoming automated. Customers get an alert, check in to their flight, buy upgrades, and get their boarding pass using only IT systems. When any part of those systems goes unmonitored, customers get frustrated and people lose money.
