Drilling Down

By Steve Goldman, Director of Crisis Courses, MIT

Disaster Recovery (DR) plans and responders must look at the hidden or next-level ramifications of IT disaster response. In addition to the technical problem at hand, what else can happen? Who else becomes impacted? Responders must Drill Down during DR exercises to find out.

Many years ago, I was an engineer working in the Corporate Communications Department at an electric utility with a fleet of nuclear power plants. I was the technical/media spokesperson for the company. One day I received a call from the Emergency Director during a drill at one of our nuclear plants. He said that the facility was undergoing a simulated and federally evaluated emergency; per the response procedure, he was notifying me. I was new to the position, and this was before nuclear plant emergency planning was comprehensively mandated, so I said, “OK, what do I do?” He replied, “I don’t know. My procedure says in an emergency we are to notify the Corporate Communications staff. That’s you. I have fulfilled my procedural responsibility at 9:08 AM this morning. Gotta go!” Click.

Clearly a nuclear plant emergency would have garnered national and even international news media coverage, even back then. And bluntly we in the Corporate Communications Department were not ready for that. But that was not the concern of the plant management or the federal evaluators at that time. The concern was “Did they follow their procedures?” Yes, they did, and they successfully passed the exercise. In a real event, the plant might have resolved the technical issues, but the communications response would have been the bigger disaster. The plant, the federal evaluators, and I should have drilled down to determine the ramifications of the next level of response.

I like IT people (well, most of you), but as a group in disaster response, sometimes they get focused on the technical issue. They often do not look at non-technical, hidden, or next-level ramifications of IT disaster response. In addition to the technical problem at hand, what else can happen? Who else becomes impacted? A ransomware shutout, the loss of Cloud services, a data center failure – all of these impact IT, and IT should have procedures to deal with them. But do not forget that the rest of the organization will also be impacted – often more severely than IT itself.

This means that the individual business/operating units must have “IT Network Down” procedures. But has IT looked at these responses to an IT problem? IT DR drill/exercise developers must follow the ramifications of their simulated disasters: they must “drill down” to determine the ramifications of the next level of response. Testing this next level should be part of Disaster Recovery drills. IT cannot focus only on its technical response. IT permeates all of the organization; the true (and sometimes hidden) loss of IT services must be determined and addressed. It may not be IT’s responsibility to fix the problem, but IT should identify it.

In addition to my work at MIT, I provide consulting services. I absolutely enjoy developing, conducting, and evaluating drills and exercises. When developing the scenario, I deep dive into the IT DR and cyber security procedures. I also look at the departments impacted by the simulated loss and how they will respond; with this I know what to expect or what needs to be addressed during the simulation. Some interesting findings:

  • In one company’s ransomware procedure, IT put itself in charge of all crisis communications including setting up a new contracted Customer Call Center. Imagine the surprise when Corporate Communications learned of this!
  • In another organization’s Legal Counsel’s procedure, it was stated that Lawyers need 48 hours to decide on issues!
  • In a ransomware procedure, it was left to the CIO’s discretion whether to inform the CEO or Senior Management about a ransomware attack.

Right.

I present two examples of drilling down while I was consulting for (1) hospitals and (2) a financial institution. These events actually happened.

Lessons Learned 1: Forms? What’s a form?

I conducted a series of exercises at individual hospitals and their overall management system. One scenario was an IT ransomware threat that eventually shut down the hospital IT network. IT/DR responders merrily plugged away at restoring the network and systems. And IT did good technical work, recovering within their Recovery Time Objective of 16 hours. But in previous drills, no one looked at the ramifications of a network loss on the rest of the hospital. I did; it’s what I do: drill down.

We then simulated the loss of the IT network where the real action is – on the hospital floors. The doctors, nurses, and staff worked conscientiously to care for patients without the IT systems and equipment upon which they had become totally dependent. Nowadays, patients are literally hooked up to the IT network. Staff normally use hospital-issued network-connected laptop computers, tablets, and cell phones to do their jobs. Paper, pen, and a clipboard are rare. So, when the network crashed, all these technology tools were rendered useless.

Now, the hospitals did have emergency downtime procedures to handle most IT events. However, the people and procedures were never tested and certainly never were part of an IT drill. Recovery procedures expected staff to use backup forms to carry out their responsibilities. Staff were instructed to go to the “Forms Cabinet”, but many staff did not know where or even what that was! Many of the needed forms were not there. The Baby Boomers were familiar with forms and how to use them; the Millennials and Gen-Zs were not! The latter could not fill out the paper forms because they had not been trained to do this: all their data input experience was on a tablet or laptop! Paper and pen were considered oh so last century. Some were not even able to write cursive – all they could do was block-print slowly! It was a valuable lesson that has since been addressed through training and mini-drills.

Lessons Learned 2: Data? Who needs Data?

A financial organization hired me to develop and conduct an IT Disaster Recovery Plan drill, to be run parallel with the annual Business Continuity Plan Exercise. The IT department had a Recovery Time Objective (RTO) of returning all critical applications in 24 hours. In every prior DRP drill, they claimed they met the RTO. However, I designed the combined BCP/DRP exercise so that actual users (representatives from all departments) were to go to the backup data center, log in to their applications, and demonstrate that each department could perform its critical functions. IT had never involved users before; again, I drilled down. On exercise day, we simulated the destruction of the data center on a Wednesday afternoon. The IT DRP responders said that they would work all night and recover all applications by 8:00 AM Thursday. And that they did. When the actual department users logged in on Thursday morning at the backup data center, all their applications were indeed ready and available for use. Most excellent!

However – and this turned out to be a rather large however – the users could not access their data! When the users asked IT about how to retrieve their applications’ data, IT said (and I am not making this up) “You want the data drive recovered? No one told us that users would want the data drive. We were told to recover applications and we did. What’s the problem?” After a rather interesting user-IT “exchange of opinions,” IT determined that it would take 1 to 2 weeks to recover the users’ data! So much for the 24-hour RTO.

In this example, the IT DR responders clearly did not understand the purpose of the DRP. It took drilling down in a realistic test of the DR Plan to demonstrate this. I am pleased to report that within a month of this exercise, the IT staff revised their DRP to recover all data within the 24-hour RTO. They then conducted a drill with application users to prove they could do it, and they did. Bravo!
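The gap in this example can be put as a simple check. The sketch below is only an illustration, not the institution's actual tooling: the names `RecoveryResult`, `meets_rto`, and `failures` are hypothetical, the 18-hour figure for the overnight application recovery is an assumption (the exercise only says Wednesday afternoon to 8:00 AM Thursday), and one week is used as the lower bound of the "1 to 2 weeks" quoted for the data. The idea is that an RTO counts as met only when every component users need is back within the objective and real users have confirmed it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryResult:
    component: str           # e.g. "applications" or "user data"
    elapsed: timedelta       # time from simulated disaster to recovery
    verified_by_users: bool  # did actual users confirm they could work?


def meets_rto(results, rto):
    """True only if every component was recovered within the RTO
    and confirmed usable by the people who depend on it."""
    return all(r.elapsed <= rto and r.verified_by_users for r in results)


def failures(results, rto):
    """Names of components that missed the RTO or were never user-verified."""
    return [r.component for r in results
            if r.elapsed > rto or not r.verified_by_users]


RTO = timedelta(hours=24)  # from the exercise: 24-hour RTO

results = [
    # Assumed: roughly 18 hours of overnight work, verified by users at login.
    RecoveryResult("applications", timedelta(hours=18), True),
    # From the exercise: 1 to 2 weeks to recover data; lower bound used here.
    RecoveryResult("user data", timedelta(weeks=1), False),
]

print(meets_rto(results, RTO))  # False
print(failures(results, RTO))   # ['user data']
```

Checking applications alone would have reported success, which is what the prior drills did. Adding the data and the user verification is the drill-down.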

What’s the Point of Drills and Exercises Anyway?

There is no sense in conducting a DR/Cyber exercise and compiling a large “to do” list if nothing is improved. Imagine if your responders identify the same improvement items drill after drill. Not only will they lose confidence in you and your DR program, but they will also be unprepared when a real crisis hits. When exercise items are identified, it is your fiduciary responsibility to address them. Management may decide not to implement an exercise recommendation, but that decision needs to be made and then communicated to your stakeholders.

In other words, fix what needs to be fixed! Develop an Exercise Findings Action Plan, assign and track responsibilities, and improve, improve, improve. You must show your responders, your management, your regulators, and your auditors that you are making progress and will improve your DR Plans. Otherwise, the lessons learned will be lost and you will have wasted everyone’s time.
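One way to keep such an action plan honest is to track each finding with an owner and a status. The following is a minimal sketch under stated assumptions, not a prescribed format: the `Finding` fields, the `Status` values, the owner names, and the third sample finding are all hypothetical. It flags items that are still open, items management declined without telling stakeholders, and items that keep reappearing drill after drill.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    FIXED = "fixed"
    DECLINED = "declined"  # management decided not to implement


@dataclass
class Finding:
    description: str
    owner: str                  # who is responsible for the fix
    status: Status = Status.OPEN
    communicated: bool = False  # was a DECLINED decision shared with stakeholders?
    seen_in_drills: int = 1     # number of drills in which this item was raised


def needs_attention(findings):
    """Items still open, or declined without informing stakeholders."""
    return [f for f in findings
            if f.status is Status.OPEN
            or (f.status is Status.DECLINED and not f.communicated)]


def repeat_findings(findings):
    """Open items that have been raised in more than one drill."""
    return [f for f in findings
            if f.status is Status.OPEN and f.seen_in_drills > 1]


# Sample data; owners and the third item are hypothetical.
findings = [
    Finding("Backup forms missing from the Forms Cabinet",
            owner="Nursing operations", seen_in_drills=2),
    Finding("User data not recovered within the RTO",
            owner="IT DR lead", status=Status.FIXED),
    Finding("Recommendation declined by management",
            owner="Senior management", status=Status.DECLINED),
]

for f in needs_attention(findings):
    print(f.owner, "->", f.description)
```

A declined recommendation is a legitimate outcome here. It stays on the list only until the decision has been communicated.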

Summary

Drilling Down applies to all business units, not just IT. However, IT services run throughout the entire organization. Companies can operate without parts or all of, for example, Legal, Marketing, Finance, Human Resources, and even areas of Operations. But most organizations cannot conduct their business/responsibilities without IT. Drills and exercises provide a golden (and often, only) opportunity for IT DR responders to Drill Down to determine the ramifications of the next level of response. Testing this next level should be part of Disaster Recovery drills. Make it so.

Shameless plug

Attend the “Crisis Management & Business Resiliency” Course at MIT. It’s sold out for this July, but we offer it online this October. Otherwise, plan to spend a glorious week at MIT in July 2024. Many IT/DR/Cyber managers attend. For details and registration, go to: http://professional.mit.edu/cm

Author

Dr. Steven B. Goldman is the Director of Crisis Courses at the Massachusetts Institute of Technology. He is an internationally recognized expert and consultant in Business Resiliency, Crisis Management, Risk/Crisis Communications, Crisis Leadership, and Realistic Drills and Exercises. His background is comprehensive yet unique in that he has been a professional engineer, corporate spokesperson, manager of media relations, business continuity planner, crisis responder, consultant, a Fortune 500 Company’s Global Business Continuity Program Manager, and University Professor. Dr. Goldman has developed, conducted, and evaluated drills and exercises ranging from two-hour tabletops to massive three-day full-scale exercises involving hundreds of responders, multiple organizations, and all levels of government. You may contact him at goldmans@mit.edu.

Parts of this article were obtained from previous papers written by Dr. Goldman.
