The American View: Who, me? Yes. Disastrously so.
22 January 2019
Failure often teaches us as much as success does, especially when learning new technologies. Business Reporter's resident U.S. 'blogger argues that every IT shop needs some space to safely conduct research and development in order to help its people grow.
Technical news site The Register has a recurring segment called ‘Who, Me?’ where readers anonymously share their biggest IT support foul-ups and own-goals. I’d originally planned to share this story with the Who, Me? team, but realized after typing up a draft that it just wouldn’t do to share this story anonymously. I botched this project, and deserve the kicking that comes along with it.
For context: I spent the aughts working in American civil service, on the defence side of things (without being too precise). Starting in 2002, I took over my organisation’s IT support department as its senior supervisor. It was a director-equivalent gig, since I had two senior managers reporting to me, and each senior manager had four shop managers reporting to them. We had around 50 people working in IT in our heyday, and we served a user community of around 1,300. So … equivalent to a medium-sized business.
Life in our department was hectic. Our annual support budget was inadequate to meet operational requirements, so we were forced to constantly beg for additional funds to maintain and upgrade our equipment. Standard procedure was that the newest kit always went into production immediately to maximize the amount of vendor technical support we could draw on during hard times. Any displaced kit that still worked was assigned to internal R&D projects while worn-out kit got demilitarised and scrapped. There’s nothing novel about this approach, save that we weren’t officially chartered to do any sort of internal R&D. Most outfits like ours simply waited patiently for Someone Above™ to bless them with new hardware … or they did without.
That wouldn’t do. The refresh cycle for server room gear was too darned long, and often got undermined by poor and slow procurement processes. As an example, we were once sent a giant tape backup unit to support all of our production servers that (a) arrived broken, (b) kept breaking, and (c) went out of warranty before it ever managed to successfully pass a backup-and-restore test. We needed some sort of local R&D; we were forever hunting for better ways to meet our customers’ needs on the cheap.
We had storage containers full of obsolete equipment. We’d make sure that we wrung every last drop of utility out of our gear before we finally de-mil’d it and sent it off to be scrapped.
Along those lines, we’d been early adopters of Apple’s ‘enterprise grade’ Xserves as an alternative to conventional (and more expensive) small servers for menial tasks. We couldn’t use them in production at first because our parent organisation didn’t have a Certificate to Operate (CTO) for Mac OS X, Apple’s operating system. This had been a chicken-and-egg nightmare for our parent organisation for years. No one could buy an OS X machine because they hadn’t been ‘approved’ for use, and no one would approve them for use because (as we were constantly lectured) ‘no one has them so we don’t need to request a CTO for them.’ Meanwhile, my photo and video staff hounded me non-stop that they had to have their Macs or else Everyone Would Die™ and they’d be unable to provide a quality product to executive management. It was … frustrating.
So, after several circular and non-productive arguments with our national IT staff, I tasked my R&D team to find a legal loophole that would let us get the gear that our multimedia team needed to conduct the tests required to submit Mac OS X for a legal and official CTO. It took the team several years, but they finally got it sorted – by the book, too. We managed to get CTOs for Mac OS X 10.2, 10.3, and 10.4 (including both workstation and server versions of the OS). From that day forward, every IT shop in the organisation nationwide could use it. There was much rejoicing. Many celebratory beverages were purchased for my team at conventions by harried peers. Yea, us.
I say all this to set up this Christmastime scene. We’d just had a spectacular fiscal year closeout; we’d gotten ourselves enough money to buy our first-ever dedicated SAN. That freed up the experimental Xserve RAID storage units that we’d previously employed as Direct-Attached Storage for our file servers. We were also able to convert our early-generation Xserve test-units into encoders for our live LAN video initiative. Cheap test gear had solved two problems; now it would solve others. Waste not, etc.
In accordance with our SOP, the oldest gear got de-racked and transferred to the R&D program to address customer complaints. I figured that the old, first-generation Xserve RAID storage behemoth might do for a problem we’d been having in our video editing lab. My techs were constantly complaining that they didn’t have enough storage to work on large projects. My budget manager was complaining that we couldn’t afford to keep the video team fed with increasingly high-capacity hard drives.
To say nothing of all the other expensive toys that they ‘had’ to have in order to produce content.
That in mind, I picked us up a copy of Apple’s new Xsan clustered file system software at the end of the year, and was keen to learn how it worked. I’d read about how the software allowed you to share one storage unit between multiple hosts, and figured it was much cheaper and more efficient than what we’d been doing. The problem was, I wasn’t a storage engineer. None of us in the IT team were. Back then, we had no idea how real SANs were supposed to function. Government operation, remember? Always a bit behind in required skills. That’s where I made my first mistake: I set our top sysadmins to work standing up the new production SAN while I tried to figure out Apple’s software-based SAN solution.
I read the Xsan manual and, if I’m honest, I didn’t understand the core concepts. I was intrigued by the prospect of being able to share one massive pile of RAIDed disks between our two video workstations. So, over one quiet holiday weekend, I connected the video editing workstations into our oldest Xserve RAID box. I installed Xsan on one of the computers, tested writing content back and forth, realized that I couldn’t figure out how to properly configure the management settings, and pushed the project to the back-burner. Figured I’d come back to it the next weekend.
You can see where this is going, right? I didn’t connect the old storage unit to surplus video editing workstations in an isolated lab. I had wired the test storage box up to our two production video workstations. Admittedly, these PCs were rarely used during the month; because of the holiday vacation schedule, I thought I’d have 3-4 weeks of that room being idle before I needed to brief anyone about the risk of storing data on the prototype. I hadn’t counted on my eager-beaver video team supervisor perceiving the operational possibilities of HUGE SHARED STORAGE. She leaped on it. I’d warned her that it was still an experiment in-progress ... but apparently I wasn’t clear enough.
A few workdays later, my office phone rang. My video supervisor was furious. She said that she couldn’t access any of the video file projects that she’d saved to the new SAN. Apprehensive, I dashed downstairs to the studio and tried to troubleshoot the problem. The drives were visible to the workstation and there was data on the mounted volume, but none of the files were readable. I called Apple tech support and explained our situation. The poor engineer gently explained that I’d configured everything dead-wrong. The two attached workstations had been fighting over the storage, each pretending to be the management unit, until the two feuding machines had completely fouled the data. Worse, the content copied to the Xserve RAID would be unrecoverable.
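The failure mode the Apple engineer described – two hosts each believing it was the management unit – can be illustrated in miniature. This toy Python sketch is an analogy of my own, not Xsan’s actual internals: when two ‘hosts’ each track free blocks independently instead of deferring to a single metadata controller, they hand out the same blocks and silently clobber each other’s data.

```python
# Toy model of uncoordinated shared storage: two "hosts" each believe
# they control block allocation, so both hand out the same blocks.
# (Illustrative analogy only -- not how Xsan actually works.)

DISK_BLOCKS = 8

class NaiveHost:
    """A host that allocates blocks with no shared metadata controller."""
    def __init__(self, name):
        self.name = name
        self.next_free = 0  # each host tracks free space independently

    def write(self, disk, data):
        block = self.next_free           # both hosts think block 0 is free...
        disk[block] = (self.name, data)  # ...so they overwrite each other
        self.next_free += 1
        return block

disk = [None] * DISK_BLOCKS
host_a, host_b = NaiveHost("A"), NaiveHost("B")

host_a.write(disk, "project-1")  # host A writes its file to block 0
host_b.write(disk, "broll-7")    # host B also writes to block 0: A's data is gone
print(disk[0])                   # ('B', 'broll-7') -- A's file silently clobbered
```

A real SAN file system avoids this by electing exactly one metadata controller and forcing every other host to ask it for block assignments; with both of my workstations playing controller, there was no single source of truth and the allocation maps diverged until nothing was readable.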
‘EVERYTHING IS FINE! JUST HANG ON A MOMENT. NOTHING TO WORRY ABOUT.'
Frustrated, I broke the bad news to my video supervisor and advised her to shut down the workstations, disconnect from the Xserve RAID, and go back to working on her project the old-fashioned way. She insisted that I had to fix the problem. I reminded her that it was an R&D project; we’d get it sorted eventually, but it wasn’t ready for prime-time yet.
That’s when the other shoe dropped. My video supervisor admitted that she’d been so frustrated at not having enough free space on her workstations that she’d copied all of her current content over to the proto-SAN and had then deleted her original files to make room on her workstations. She hadn’t thought to make backups; she’d just assumed that everything would work perfectly and had ‘burned the ships’ (so to speak).
I was horrified. My video supervisor was angry and frustrated. Things got worse when she explained that she’d used all of that yummy liberated hard drive space on her workstations to load new applications (thereby overwriting all of her deleted project files). We couldn’t get anything back from the workstations or from the SAN. Five terabytes of irreplaceable video content had disappeared forever, thanks to her well-intended enthusiasm and an unjustified assumption that the gear being tested would work perfectly the first time. Oi.
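The cheap safeguard that would have saved those five terabytes is dull and ancient: never delete an original until you have verified the copy. As a minimal sketch (the function name and paths here are my own invention, not anything we ran at the time), a ‘move’ can refuse to free the source until checksums match:

```python
import hashlib
import os
import shutil

def safe_move(src, dst):
    """Copy src to dst, verify the copy byte-for-byte, then delete src.

    If the verification fails, the original is left untouched and the
    bad copy is discarded. Hypothetical helper for illustration.
    """
    shutil.copy2(src, dst)

    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    if digest(src) != digest(dst):
        os.remove(dst)  # bad copy: keep the original, discard the copy
        raise IOError("checksum mismatch copying %s -> %s" % (src, dst))

    os.remove(src)      # copy verified: now it is safe to reclaim the space
```

Had my supervisor’s workflow included even this much ceremony, the corrupted proto-SAN would have cost us nothing but a weekend of re-copying.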
Starting that day, we locked all of our R&D kit in the main data centre. R&D kit wasn’t allowed to touch any data that we couldn’t afford to lose. We undoubtedly should have made that our policy from the beginning, but … none of us had thought of it. We also hadn’t thought of marking or tagging our R&D gear to make it more visible. Really, we were kit-bashing junk in our spare time for proof-of-concept attempts or to jury-rig temporary solutions while we waited (often for years) for funding to come through for real equipment. That was no excuse, though. In retrospect, we should have taken more care to separate our ‘mad scientist’ activities from the everyday fleet.
I wouldn’t go so far as to advise locking your R&D gear in a vault that’s monitored by armed sentries (unless you’re a defence contractor). For most of us, an isolated locked room with an isolated network ought to be sufficient.
It was an embarrassing screw-up then, and it’s still embarrassing now. The thing to remember, though, is that accidents are always going to happen, especially when you’re trying to learn a new skill or work with new equipment. Rather than covering up your mistakes, it’s best to ruefully come clean about what you did wrong so that others can learn from your errors. That, at least, was our policy. ’Fess up and teach others.
That’s my recommendation: don’t forgo R&D for fear of what might go wrong. Just segregate your experiments and tests from production. The only way that people are going to grow in IT is to invest in new skills. That takes hands-on learning. You need labs and space and time for people to try things and fail. Just … keep it contained. Otherwise, you’ll wind up publishing your own Who, Me? tale on The Register someday.
 The ‘Xserve’ was Apple’s take on a 1U (1.75” tall) rack-mounted server. They were inexpensive and small machines that did a yeoman’s job providing basic network services (at least, when compared to some of the big Compaq servers that we were accustomed to buying).
 A lightning strike had fried our expensive Video Distribution Network and we could never get funding to fix it. Since powerful people had to have their live TV, we cobbled together a solution using old parts that allowed any customer on the LAN to pull up a browser window and watch TV on a ten second delay from our satellite feed.
 Xserve RAID units were mass storage peripherals for servers that packed a whopping 14 hard drives for about £10k – a pittance compared to the EMC storage units that other organisations were buying for ten times that.
 If you understand what ‘LUN masking’ means, congratulations. I didn’t at the time. The Xsan manual assumed that you’d already mastered this concept and didn’t need to have it explained.
POC is Keil Hubert, firstname.lastname@example.org
Follow him on Twitter at @keilhubert.
Keil Hubert is the head of Security Training and Awareness for OCC, the world’s largest equity derivatives clearing organization, headquartered in Chicago, Illinois. Prior to joining OCC, Keil was a U.S. Army medical IT officer, a U.S.A.F. Cyberspace Operations officer, a small businessman, an author, and several different variations of commercial sector IT consultant.
Keil deconstructed a cybersecurity breach in his presentation at TEISS 2014, and has served as Business Reporter’s resident U.S. ‘blogger since 2012. His books on applied leadership, business culture, and talent management are available on Amazon.com. Keil is based out of Dallas, Texas.