How the CrowdStrike, Microsoft outage turned IT techs into heroes

admin November 21, 2023

July 25, 2024

It was 3 a.m. Friday when Tyson Morris got a wake-up call that would send him into crisis mode for days. Atlanta’s trains and buses were expected to be running in two hours, but all systems were down, showing the dreaded “blue screen of death.”

“It’s the one phone call a chief information officer never wants to get,” said Morris, CIO for the Metropolitan Atlanta Rapid Transit Authority. “I jumped out of bed, and my wife was wondering what was going on. She thought someone had died.”

Morris sprang into action to mobilize his team of 130 for an all-hands-on-deck operation. Was it a hack? Had an employee gone rogue and brought down their operations? For hours, no one knew.

The outage, caused by a faulty update from security software firm CrowdStrike, was the kind of event IT staff train for but hope never happens. The incident brought down an estimated 8.5 million Windows devices around the globe, paralyzing operations at hospitals, airlines, 911 call centers and more. Insurers estimate the outage cost companies more than $1 billion in revenue, with Fortune 500 companies potentially losing more than $5 billion.

While the outage made it difficult to impossible for many to work, IT technicians were toiling overtime — some spending the night at the office, feverishly trying to get systems back up and running through the weekend. It also revealed vulnerabilities that companies can use as lessons for the next big outage.

“It was a heightened sense of stress that I haven’t experienced,” said Morris, who’s been in the industry for more than two decades. “Every second counts.”

The event shined a bright light on the importance of IT workers, said Eric Grenier, an analyst who covers endpoint security for market research firm Gartner. CrowdStrike sent out a fix to users, but it had to be applied manually to each system. Later, CrowdStrike released an automated repair. The only other outage of comparable scale that Grenier recalls was the buggy McAfee update in 2010.

“The fact that we’re seeing reports of hundreds of thousands of devices that were remediated over the weekend, that’s huge,” Grenier said. IT workers were “the superheroes of this.”

On the ground, it was a mad dash. Kyle Haas, a systems engineer for IT consulting firm Mirazon in Louisville, spent Friday driving across the city to help clients get back online. During the car rides and in between clients, he shot off emails and took phone calls to help others. For nine hours straight, Haas was in overdrive.

“I skipped my coffee that morning,” he said, adding that he woke up to panicked emails and messages from clients who didn’t know what was happening. “It was touch as many things as you can. Fix it all.”

Haas said his team of about 40 people spent 12 hours ensuring all their clients were back up and running. Though the day was intense and stressful, he said he was grateful that the issue was purely due to a bad update, and the fix was relatively easy. That meant he wouldn’t have to fight off bad actors or try to recover lost data, which are common in ransomware attacks or system failures.

His big save of the day? Helping one of the water companies that was an hour away from having to go into manual override, which would have prevented it from testing water quality.

One TikTok user, who goes by plumsoju and said he was part of the IT team at his company, showed what his day was like by unmuting his computer. Inbound messages from colleagues were dinging continuously — something he said had been happening for hours. He compared the experience to the viral meme of a dog drinking coffee while the house is on fire saying, “this is fine.” The TikTok creator did not respond to a request for comment.

For Morris, the event was a big shock. He had been CIO of the transit agency for only three months. Fortunately, the IT department had a preexisting emergency plan, which included a phone tree and dedicated channels for communication. But that didn’t mean it was easy. Morris, who was on a family trip in Tennessee, drove down to Atlanta to help. Meanwhile, the team was working around-the-clock, with some members pulling 18-hour shifts and sleeping at the office.

By 9 a.m. Friday, buses and trains were rolling again, and by Monday morning every last laptop had been fixed.

“We were getting positive feedback. … A lot of thank-you’s came in,” Morris said. “That continued to help boost morale.”

On the West Coast, signs of the outage started to appear late the night before, giving IT workers a head start at identifying the problem. Jerry Leever, IT director at accounting, tax and advisory firm GHJ in Los Angeles, said he received an email from the company’s outsourced IT members at 10:30 p.m. Pacific time, which was quickly followed by server system detector alerts.

Leever was brushing his teeth and checking his email before bed when he saw the message. His stomach dropped.

“I had a moment of worry and then a moment of understanding that we are trained to handle this situation,” Leever said. “You don’t have a lot of time to stay in the panic because you have to get things online as soon as possible.”

By 3 a.m. Pacific, Leever and his teammates had the servers up and running. They had an automated email set to send at 5 a.m., informing their 200-plus colleagues about what happened and how to fix the issue. They also had a 6 a.m. call set up for colleagues who needed IT to guide them step-by-step. By about 10:30 a.m. Pacific, everyone was back online, a feat Leever credits to their communication plan and early warnings.

All the IT people who spoke with The Washington Post admitted there were lessons that came from the CrowdStrike outage. It helped magnify the importance of having an up-to-date business continuity plan that emphasizes communication procedures, which can get complicated if systems are down. And it left some leaders questioning whether they have enough contingencies in place so that operations can continue when something goes down.

The outage also left some questioning whether they should spread their work across more providers so that a problem with one vendor doesn’t take down the entire operation. Some organizations are evaluating whether they are staffed properly for emergencies or need outsourced help on standby. And it highlighted the importance of storing key data, like recovery codes for encrypted systems, in more than one place in case a server goes down.

For Leever, who characterized this outage as the worst incident he’s dealt with, the end of the day Friday couldn’t come soon enough. He headed straight to his favorite restaurant bar for a burger and an Aperol spritz.

“Just hug your IT folks,” he said. “It helps when folks are understanding and gracious in times of crisis.”
