Fault-Tolerance Research by Kale, NCSA’s Mendes, and Illinois CS Alum Chakravorty Honored as One of Most Influential HPC Papers
5/16/2019 10:29:45 AM
Thirteen years ago, Professor Laxmikant “Sanjay” Kale and two members of his research group proposed a way to build proactive fault tolerance into high-performance computing that relied on the ability to predict when processors are about to fail.
Now the paper in which that solution was proposed has been chosen as one of the seven most impactful papers of the past quarter century by the IEEE International Conference on High Performance Computing, Data, and Analytics. A committee chose the paper as part of the 25th anniversary of the conference.
The 2006 paper, “Proactive Fault Tolerance in MPI Applications via Task Migration,” was authored by Kale, his then-student Sayantan Chakravorty, and research group member Celso Mendes. Chakravorty was the primary author, and the paper was part of his PhD thesis. It proposed a way of anticipating the imminent failure of supercomputer nodes and redistributing their work so that computing does not have to stop.
“The idea is, typically this would happen on a 100-node cluster – it could happen on a 10,000-node supercomputer, too – but one of the nodes is showing signs of stress” such as overheating, Kale said. “But then the question becomes, what do you do if you know this node is going to go down?”
Kale is now the Paul and Cynthia Saylor Professor at Illinois CS. Chakravorty (MS ’05, PhD ’08) is a software development engineer for Amazon Web Services, and Mendes (PhD ’97) is a senior research programmer at NCSA.
The answer relied on Charm++, the parallel programming system developed by Kale’s research group. Charm++ allows the computation to be expressed as a large number of objects interacting with each other, Kale said. That fine-grained approach provided a detailed look at the activity happening within any given node and, given an early warning that a node was about to fail, let Charm++ quickly spread that node’s objects across the remaining nodes and rebalance the computational load.
“The signature strength of the Charm system is that it can automatically rebalance load by migrating objects around,” Kale said. “So we take those objects, spread them to the (other) processors, rebalance the load, and now we are running pretty much at the same strength except for the loss of that one computer – if you are running 100, and you lost one, 99 percent speed is what you can run.”
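The arithmetic in Kale’s example can be illustrated with a short sketch. This is not Charm++ code and not the paper’s algorithm; it is a simple greedy heuristic written in Python, and the function name `evacuate`, the unit-sized objects, and the choice of 99 objects per processor are all assumptions made for illustration. It moves every object off a processor that is expected to fail and places each one on whichever surviving processor currently has the least load.

```python
import heapq


def evacuate(assignment, failing):
    """Move every object off `failing` onto the least-loaded survivors.

    assignment: dict mapping processor id -> list of object loads.
    Returns a new assignment that no longer contains `failing`.
    (Hypothetical helper for illustration; not part of Charm++.)
    """
    survivors = {p: list(objs) for p, objs in assignment.items() if p != failing}
    if not survivors:
        raise ValueError("no surviving processors to migrate to")

    # Min-heap of (current total load, processor id) over the survivors.
    heap = [(sum(objs), p) for p, objs in survivors.items()]
    heapq.heapify(heap)

    # Greedy placement: heaviest objects first, each onto the lightest processor.
    for load in sorted(assignment.get(failing, []), reverse=True):
        total, p = heapq.heappop(heap)
        survivors[p].append(load)
        heapq.heappush(heap, (total + load, p))
    return survivors


# Assumed setup: 100 processors, each holding 99 objects of unit load.
before = {p: [1] * 99 for p in range(100)}
after = evacuate(before, failing=42)

# The slowest (most loaded) processor determines the time per step.
max_before = max(sum(objs) for objs in before.values())
max_after = max(sum(objs) for objs in after.values())
relative_speed = max_before / max_after
print(f"relative speed after losing one of 100 processors: {relative_speed:.2f}")
```

Under these assumptions each of the 99 surviving processors picks up exactly one extra object, so the heaviest processor goes from 99 to 100 units of work and the run proceeds at 99 percent of its original speed, matching the figure Kale cites. A real runtime would also weigh communication patterns and migration cost, which this sketch ignores.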
This then-new capability of using runtime adaptation to support fault tolerance was the reason the paper was accepted at the conference in the first place, and it is now the reason for the paper’s inclusion in the select group of papers with the strongest long-term impact.
Ultimately, Kale said, the paper’s influence was shown in later research.
“A lot of fault tolerance research references this,” he said, noting that the paper has been cited 126 times.