
What’s Your Spookiest IT Ops Horror Story?

    
October 27, 2016 | Author: Rocanauts

If you’ve been working in the industry long enough, or are an avid tech tinkerer and all-around computer geek, you’ve likely lived through an awful IT-related horror story. In the spooky spirit of Halloween, we asked some of our Rocanauts to share their scariest IT Ops horror stories from across their careers and personal lives.

Think you can one-up these? Comment below or respond to our post on LinkedIn or Facebook with YOUR spookiest IT Ops horror story – the best one will win a free Context is Everything t-shirt!

Michael Guenther, Customer Success Engineer (@SourestKraut)
My wife had an undead Microsoft Exchange server that rose from the grave to wreak havoc. She had taken a physical Exchange server, virtualized it, and shut the old physical box down. They ran everything off the identical virtual copy and just left the old box in the rack because there was no hurry to get rid of it. A few months later, everyone’s Outlook clients kept reporting they were offline, then back online, then offline again. They double-checked the virtual server and found nothing wrong with it. Networking looked okay as well. Eventually they figured out that, for some crazy reason, the old physical box had powered itself back on, and the virtual and the old physical server were wrestling over the same IP address. Fortunately, it happened in the wee hours of the morning, so their Exchange data stores didn’t get too jumbled. They went in and cut the power and network cables to the old physical box just to be sure it wouldn’t come back to life and haunt them again. The moral of the story: always unplug decommissioned servers!
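For the curious, here’s a rough Python sketch of the kind of ARP check that exposes two machines answering for the same IP. It assumes the third-party scapy library and root privileges, and the address below is a placeholder, not anything from Michael’s story:

```python
# Rough sketch: broadcast an ARP "who-has" and collect every MAC that answers.
# Requires scapy and root privileges; the IP below is a placeholder.
from scapy.all import ARP, Ether, srp

def macs_claiming(ip):
    """Return the set of MAC addresses answering ARP requests for `ip`."""
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip),
        timeout=2,
        multi=True,      # keep every reply, not just the first one
        verbose=False,
    )
    return {reply.hwsrc for _, reply in answered}

if __name__ == "__main__":
    macs = macs_claiming("10.0.0.25")   # placeholder address
    if len(macs) > 1:
        print("Duplicate IP! MACs answering:", macs)
    else:
        print("Only one machine (or none) answered:", macs)
```

Run it against a decommissioned server’s old address; if more than one MAC comes back, something in the rack has risen from the dead.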

Zachary Joyner, Tools and Automation Engineer
I once worked on a team responsible for a real-time message processing system for medical records that fed several downstream products, including a drug interaction and disease prevention application. One night we noticed our queues had been backing up for about an hour and that our CPUs were spending lots of time waiting on disk even though our disk usage was constant. If the queues fill up, they can't take on new data, which means data loss and nothing reaching the downstream products. We contacted our sister operations team to investigate, and they contacted four additional operations teams. We spent two weeks proving our system was not the problem. After they finally accepted that we had an infrastructure problem, we spent months trying different approaches from the different teams. After many weekly meetings of 20 people and multiple planned outages, the final 'fix' was to move to a completely different storage system and endure one more planned outage. All because everyone was looking at different siloed dashboards and no one had a complete picture of what was going on.
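A divergence like that (iowait climbing while disk throughput stays flat) is easy to spot once everyone is looking at the same numbers. Here’s a minimal Python sketch using the psutil library; it’s not the monitoring stack from Zach’s story, just the shape of the check:

```python
# Minimal sketch: print CPU iowait next to disk throughput so the two can be
# compared side by side. Uses the third-party psutil library (iowait is Linux-only).
import psutil

def watch_iowait(interval=5):
    """Print iowait% and aggregate disk throughput every `interval` seconds."""
    prev = psutil.disk_io_counters()
    while True:
        cpu = psutil.cpu_times_percent(interval)   # blocks for one sampling window
        cur = psutil.disk_io_counters()
        mb_per_s = (cur.read_bytes + cur.write_bytes
                    - prev.read_bytes - prev.write_bytes) / interval / 1e6
        print(f"iowait {cpu.iowait:5.1f}%   disk {mb_per_s:8.2f} MB/s")
        prev = cur

if __name__ == "__main__":
    watch_iowait()
```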

Alan Gardner, Platform Engineer (@alanctgardner)
My team once stood up a client’s on-prem, 50-node Hadoop cluster. The next day, we logged in and all the CPUs were pegged at 100%. We connected to one of the boxes and found someone mining Bitcoin on every node in the cluster! Talk about Trick or Treat…!
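If you ever need to hunt down whatever is pegging your CPUs, here’s a quick Python sketch using the psutil library. It simply lists the hungriest processes on a box; it isn’t the tooling Alan’s team used:

```python
# Quick sketch: sample per-process CPU usage for about a second and report the
# top consumers. Requires the third-party psutil library.
import psutil

def top_cpu_processes(n=5, interval=1.0):
    """Sample CPU usage for ~interval seconds and return the n busiest processes."""
    procs = list(psutil.process_iter(["pid", "name", "username"]))
    for p in procs:
        try:
            p.cpu_percent(None)          # first call primes the per-process counters
        except psutil.NoSuchProcess:
            pass
    psutil.cpu_percent(interval)         # block for one sampling window
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.info["pid"],
                          p.info["name"], p.info["username"]))
        except psutil.NoSuchProcess:
            pass                         # process exited mid-sample; skip it
    usage.sort(reverse=True)
    return usage[:n]

if __name__ == "__main__":
    for cpu, pid, name, user in top_cpu_processes():
        print(f"{cpu:6.1f}%  {pid:>6}  {user or '?':<12} {name}")
```

A rogue miner tends to jump straight to the top of a list like this.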

Mark Tozzi, Platform Engineer (@not_napoleon)
About eight years ago I had an issue where FTP was broken. I was working on an ETL pipeline, and an early step was to FTP a bunch of log files from our other data centers to the data warehouse servers (which is a horror story on its own). The transfer would periodically fail at the FTP step, but not always, and we were "sure" FTP wasn't the issue since it's so widely used. Every time it came up we'd spend a few days digging in, fail to reproduce it, and give up, only to repeat the cycle a few weeks later. Eventually someone tracked down that it only happened between a couple of rarely used servers, and then discovered that MTU (maximum transmission unit) automatic negotiation between a specific version of Windows and a specific version of BSD (both dreadfully outdated) did not succeed. Upgrading to something relatively modern fixed it. If we'd been able to see early on that it always happened between the same two servers, we probably would have found the issue sooner, but even so, it was really obscure.
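For anyone who wants to poke at path MTU on their own gear, here’s a rough, Linux-only Python sketch that asks the kernel what it currently believes the path MTU to a host is. It obviously wouldn’t have run on the Windows or BSD boxes in question, and the hostname is a placeholder:

```python
# Rough Linux-only sketch: set the Don't Fragment bit on a UDP socket, send a
# probe, then read back the kernel's cached path MTU for that destination.
import socket
import time

def kernel_path_mtu(host, port=9, probe_size=1400):
    """Return the kernel's current idea of the path MTU to `host`."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Forbid fragmentation so oversized datagrams fail instead of being split up.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER, socket.IP_PMTUDISC_DO)
    s.connect((host, port))
    try:
        s.send(b"x" * probe_size)    # nudge the kernel into doing PMTU discovery
    except OSError:
        pass                         # EMSGSIZE: probe already exceeds the cached MTU
    time.sleep(1)                    # give any ICMP "fragmentation needed" time to land
    mtu = s.getsockopt(socket.IPPROTO_IP, socket.IP_MTU)
    s.close()
    return mtu

if __name__ == "__main__":
    # "warehouse.example.com" is a stand-in, not a host from the story.
    print(kernel_path_mtu("warehouse.example.com"))
```

Comparing that number from both ends of a flaky link is a quick way to see whether the two sides disagree about packet sizes.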