On Dec. 21, 2022, just as peak holiday season travel was getting underway, Southwest Airlines went through a cascading series of failures in its scheduling, initially triggered by severe winter weather in the Denver area. But the problems spread through the airline's network, and over the course of the next 10 days the crisis ended up stranding more than 2 million passengers and causing losses of $750 million for the company.
How did a localized weather system end up triggering such a widespread failure? Researchers at MIT have examined this widely reported failure as an example of cases where systems that work smoothly most of the time suddenly break down and cause a domino effect of failures. They have now developed a computational method for using the combination of sparse data about a rare failure event, along with much more extensive data on normal operations, to work backward and try to pinpoint the root causes of the failure, and hopefully to find ways of adjusting the systems to prevent such failures in the future.
The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore April 24-28, by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.
“The motivation behind this work is that it's really frustrating when we have to interact with these complicated systems, where it's really hard to understand what's going on behind the scenes that's creating these issues or failures that we observe,” says Dawson.
The new work builds on earlier research from Fan's lab, where they looked at problems involving hypothetical failure prediction, she says, such as with groups of robots working together on a task, or complex systems such as the power grid, in search of ways to predict how such systems may fail. “The goal of this project,” Fan says, “was really to turn that into a diagnostic tool that we could use on real-world systems.”
The idea was to provide a way for someone to “give us data from a time when this real-world system had a problem or a failure,” Dawson says, “and we can try to diagnose the root causes and provide a little bit of a look behind the curtain at this complexity.”
The intent is for the methods they developed “to work for a pretty general class of cyber-physical problems,” he says. These are problems in which “you have an automated decision-making component interacting with the messiness of the real world,” he explains. Tools exist for testing software systems that operate on their own, but the complexity arises when that software has to interact with physical entities going about their activities in a real physical setting, whether it be the scheduling of aircraft, the movements of autonomous vehicles, the interactions of a team of robots, or the control of the inputs and outputs on an electric grid. In systems like that, what often happens, he says, is that “the software might make a decision that looks OK at first, but then it has all these domino, knock-on effects that make things messier and far more uncertain.”
One key difference, though, is that in systems like teams of robots, unlike the scheduling of airplanes, “we have access to a model in the robotics world,” says Fan, who is a principal investigator in MIT's Laboratory for Information and Decision Systems (LIDS). “We do have a good understanding of the physics behind robotics, and we do have ways of creating a model” that represents their activities with reasonable accuracy. But airline scheduling involves processes and systems that are proprietary business information, so the researchers had to find ways to infer what was behind the decisions, using only the relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each plane.
“We have all this flight data, but there is this whole scheduling system behind it, and we don't know how the system works,” Fan says. And the amount of data relating to the actual failure is only a few days' worth, compared to years of data on normal flight operations.
The impact of the weather events in Denver during the week of Southwest's scheduling crisis showed up clearly in the flight data, just from the longer-than-normal turnaround times between landing and takeoff at the Denver airport. But the way that impact cascaded through the system was less obvious and required more analysis.
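That turnaround-time signal can, in principle, be extracted directly from public arrival and departure records. Below is a minimal sketch of the idea in Python, with invented records and an assumed “normal” turnaround threshold; it is only an illustration, not the team's actual data pipeline.

```python
# Toy illustration: flag longer-than-normal turnaround times using only
# public-style arrival/departure records. The records, airport, and the
# "normal" threshold are all invented for this sketch.
from datetime import datetime

# Each record: (tail_number, airport, arrival_time, next_departure_time)
records = [
    ("N101", "DEN", "2022-12-21 08:10", "2022-12-21 09:05"),
    ("N102", "DEN", "2022-12-21 09:30", "2022-12-21 13:40"),  # unusually long gap
    ("N103", "DEN", "2022-12-21 10:15", "2022-12-21 11:00"),
]

NORMAL_TURNAROUND_MIN = 60  # assumed typical turnaround, in minutes

def minutes_between(t0: str, t1: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds() / 60

for tail, airport, arrived, departed in records:
    turnaround = minutes_between(arrived, departed)
    if turnaround > NORMAL_TURNAROUND_MIN:
        print(f"{tail} at {airport}: {turnaround:.0f}-minute turnaround (longer than normal)")
```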
The key turned out to have to do with the concept of reserve aircraft. Airlines typically keep some planes in reserve at various airports, so that if problems are found with one plane that is scheduled for a flight, another plane can be quickly substituted. Southwest flies only a single type of plane, so they are all interchangeable, which makes such substitutions easier. But most airlines operate on a hub-and-spoke system, with a few designated hub airports where most of those reserve aircraft may be kept, whereas Southwest does not use hubs, so its reserve planes are scattered more widely throughout its network. And the way those planes were deployed turned out to play a major role in the unfolding crisis.
“The challenge is that there's no public data available in terms of where the aircraft are stationed throughout the Southwest network,” Dawson says. “What we're able to find out using our method is, by looking at the public data on arrivals, departures, and delays, we can use our method to back out the hidden parameters of those aircraft reserves, to explain the observations that we were seeing.”
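In spirit, “backing out” a hidden quantity means searching for the value that best explains what was observed. The toy example below does this for a single airport's reserve count, using a made-up forward model and a simple least-squares fit over a small grid; the real system is vastly more complex, and this sketch is not the authors' method.

```python
# Toy illustration (not the authors' model): back out a hidden reserve-aircraft
# count at one airport from observed delays. The forward model and all numbers
# are made up purely to show the "work backward from observations" idea.
import numpy as np

def simulate_delays(reserve_count: int, disruptions: list[int]) -> np.ndarray:
    """Toy forward model: each disruption consumes one reserve plane if one is
    available; once reserves run out, delays cascade and grow much larger."""
    reserves = reserve_count
    delays = []
    for severity in disruptions:
        if reserves > 0:
            reserves -= 1
            delays.append(severity * 10)        # absorbed by a spare plane
        else:
            delays.append(severity * 60 + 120)  # no spare: knock-on delay
    return np.array(delays, dtype=float)

# "Observed" delays, generated here from a hidden true reserve count of 3.
disruptions = [1, 2, 1, 3, 2, 2, 1]
observed = simulate_delays(3, disruptions)

# Run the model "backwards": find the hidden reserve count whose simulated
# delays best match the observations (least squares over a small grid).
candidates = range(8)
best = min(candidates,
           key=lambda r: np.sum((simulate_delays(r, disruptions) - observed) ** 2))
print(f"Most likely hidden reserve count: {best}")
```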
What they found was that the way the reserves were deployed was a “leading indicator” of the problems that cascaded into a nationwide crisis. Some parts of the network that were affected directly by the weather were able to recover quickly and get back on schedule. “But when we looked at other areas of the network, we saw that these reserves were just not available, and things kept getting worse.”
For example, the data showed that Denver's reserves dropped rapidly because of the weather delays, but “it also allowed us to trace this failure from Denver to Las Vegas,” he says. While there was no severe weather there, “our method was still showing us a steady decline in the number of aircraft that were able to serve flights out of Las Vegas.”
He says that “what we found was that there are these circulations of aircraft within the Southwest network, where an aircraft might start the day in California, then fly to Denver, and then end the day in Las Vegas.” What happened in the case of this storm was that the cycle got interrupted. As a result, “this one storm in Denver breaks the cycle, and suddenly the reserves in Las Vegas, which are not affected by the weather, start to deteriorate.”
In the end, Southwest was forced to take a drastic measure to resolve the problem: They had to do a “hard reset” of their entire system, canceling all flights and flying empty aircraft around the country to rebalance their reserves.
Working with experts in air transportation systems, the researchers developed a model of how the scheduling system is supposed to work. Then, “what our method does is, we essentially run the model backwards.” Looking at the observed outcomes, the model allows them to work backward to see what kinds of initial conditions could have produced those outcomes.
While the data on the actual failures were sparse, the extensive data on typical operations helped in teaching the computational model “what is feasible, what is possible, what's the realm of physical possibility here,” Dawson says. “That gives us the domain knowledge to then say, in this extreme event, given the space of what's possible, what's the most likely explanation” for the failure.
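One way to read “running the model backwards” is as a form of Bayesian inference: normal-operations data define a prior over the hidden state, the scheduling model defines a likelihood for the observed delays, and the diagnosis is the most probable hidden state given both. The sketch below shows that idea with a toy one-parameter model and a maximum a posteriori estimate; the paper's actual method is a far more sophisticated learned approach, and every number and function here is invented for illustration.

```python
# Simplified sketch: combine a prior fitted to normal operations with sparse
# failure observations to find the most likely hidden state (a MAP estimate).
# Not the paper's method; every quantity below is invented.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hidden quantity x: available spare aircraft (treated as continuous here).
# Toy forward model: fewer spares -> longer expected average delay (minutes).
def expected_delay(x: float) -> float:
    return 30.0 + 200.0 / (1.0 + x)

# Years of normal operations tell us what x usually looks like (the prior).
normal_ops_x = np.random.default_rng(1).normal(5.0, 1.0, size=5000)
prior = norm(loc=normal_ops_x.mean(), scale=normal_ops_x.std())

# Only a few days of data from the failure itself: unusually long delays.
failure_delays = np.array([180.0, 210.0, 195.0])
noise_sigma = 20.0

def negative_log_posterior(x: float) -> float:
    # Posterior is proportional to prior(x) * likelihood(observed delays | x).
    log_prior = prior.logpdf(x)
    log_likelihood = norm(expected_delay(x), noise_sigma).logpdf(failure_delays).sum()
    return -(log_prior + log_likelihood)

result = minimize_scalar(negative_log_posterior, bounds=(0.0, 10.0), method="bounded")
print(f"Typical reserve level in normal operations: {prior.mean():.1f}")
print(f"Most likely reserve level during the failure: {result.x:.2f}")
```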
That could lead to a real-time monitoring system, he says, in which data on normal operations are constantly compared to the current data to determine what the trend looks like. “Are we trending toward normal, or are we trending toward extreme events?” Seeing signs of impending problems could allow for preemptive measures, such as repositioning reserve aircraft in advance to areas of anticipated problems.
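As a rough illustration of that kind of monitoring, the snippet below compares one day's delays against a baseline built from normal operations and flags when the statistics drift toward an extreme. The data, the z-score statistic, and the threshold are all assumptions made for this sketch, not part of the researchers' system.

```python
# Minimal sketch of real-time monitoring: compare today's delays against a
# baseline built from normal operations and flag drift toward an extreme.
# The data, the z-score statistic, and the threshold are assumptions.
import numpy as np

rng = np.random.default_rng(42)
normal_ops_delays = rng.normal(12.0, 5.0, size=10_000)  # historical delays, minutes
baseline_mean = normal_ops_delays.mean()
baseline_std = normal_ops_delays.std()

def trend_score(todays_delays: np.ndarray) -> float:
    """Z-score of today's mean delay relative to the normal-operations baseline."""
    standard_error = baseline_std / np.sqrt(len(todays_delays))
    return (np.mean(todays_delays) - baseline_mean) / standard_error

quiet_day = rng.normal(12.0, 5.0, size=200)
stormy_day = rng.normal(35.0, 15.0, size=200)

for label, day in [("quiet day", quiet_day), ("stormy day", stormy_day)]:
    score = trend_score(day)
    status = "trending toward an extreme event" if score > 3.0 else "within normal range"
    print(f"{label}: z = {score:.1f} -> {status}")
```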
Work on developing such systems is ongoing in her lab, Fan says. In the meantime, they have produced an open-source tool for analyzing failure events, called CalNF, which is available for anyone to use. Meanwhile, Dawson, who earned his doctorate last year, is working as a postdoc to apply the methods developed in this work to understanding failures in power networks.
The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the U.S. Air Force Office of Scientific Research, and the MIT-DSTA program.
