In the end, everything turned out all right, although those five hours were an interesting experience.
What happened?
Delta Air Lines, one of the biggest airlines in the world, operates a fleet of more than 800 aircraft that make about 6,000 flights daily and transport more than 100 million people per year. The company has 13 hubs, and one of them is in Atlanta, home to the busiest airport in the world. All in all, Delta has a huge hub system.
On August 8, at 2:30 a.m. Eastern time, the servers in Delta Air Lines’ Atlanta data center shut down after a power outage. Duplicate systems either failed to come online or were overwhelmed, and as a result, the entire computer system stopped working.
According to John Kraft, a spokesperson for Georgia Power (the company that provides electricity to the data center), a switchgear was responsible for the outage. A switchgear works just like a circuit breaker in a home — it helps control and switch power flows.
Industry analyst and former airline executive Robert Mann said to Reuters that “the company was probably running a routine test of its backup power supplies when the switchgear failed and locked Delta out of its reserve generators.”
As a result, about 1,000 flights were canceled on August 8, and more were canceled in the days following, as Delta scrambled to recover. The vast majority of other flights were delayed.
Aside from the impact on passengers, this incident affected dozens of airports and increased the workload of flight operations officers all over the world.
Of course, the carefully orchestrated system of connecting flights between Delta Air Lines and its partners also suffered. Many passengers had to stay overnight in cities where they had expected to have a brief layover.
Delta offered those passengers the option of voluntarily giving up their planned flight and staying in the city of departure for one day at the company’s expense. Many agreed. Delta also promised $200 in travel vouchers to all customers whose flights were canceled or delayed by more than three hours.
How I spent that day
The afternoon of August 8th was the worst part of the day. It was the peak of the failures, flight cancellations, and delays. I spent this time at Los Angeles International Airport, the fifth busiest airport in the world for passenger traffic (it serves about 70 million passengers per year). It’s not a calm place even when everything is OK.
Strange things happened here and there. My flight to Seattle disappeared from the arrivals and departures board. Delta’s mobile app provided the only way to find out when and where boarding started.
However, the assigned gate changed every five minutes. I figured the people in charge of gate assignments must have been shuffling hundreds of flights around dozens of gates, like a giant game of Tetris at the highest difficulty level.
After running around the terminal for half an hour, I finally reached the gate that was actually assigned to my flight. That gate assignment existed only as a verbal arrangement: airline employees stayed in constant touch using good old phones.
It was two more hours before I got to the plane, and then another two hours before the plane actually took off.
During the five-hour delay, I could not help but appreciate the professional behavior of the airline’s employees. They did their best to mitigate the consequences of the outage. With great patience, they explained to passengers what was going on and apologized for the inconvenience — always polite, friendly, and calm. No employees behaved rudely or ignored questions, although they still had a lot of other work to do.
Conclusions
When a computer system fails, we have to switch to old-fashioned methods that not only feel radically inefficient but are also barely usable nowadays, because they were not designed to cope with modern loads. All we can do in that case is hope that the electronic systems will come back online as soon as possible.
So what can be done? Should we abandon electronic systems if they are occasionally unreliable? Of course not. No company can be competitive without modern technologies. Cars and trucks are dangerous, but no one is suggesting transporting goods and people by horse instead.
What we really need is maximum attention to security and reliability. The electronics we rely on to manage critical infrastructure should be properly protected from incidents and attacks. This one example is an important reminder of the enormous impact any serious failure in an industrial computer system can have on the lives of hundreds of thousands of people, not to mention the financial losses.