Illinois and VMware's Collaborative Research Develops Practical Reliability Testing Techniques for Cluster-Management Controllers
8/22/2022 8:59:39 AM
What Illinois Computer Science professor Tianyin Xu appreciates most about conducting computer systems research is that the results can make a direct impact.
That motivator, more than anything else, is how he tries to connect with his students, Illinois CS colleagues, and industry collaborators. There is an energy to the thought that by working together, the team can build practical systems that can make a difference beyond research prototypes.
Xu said that, in his time at Illinois CS, few projects have exemplified this better than his recent effort with three students, two faculty colleagues, and VMware Research.
The resulting research paper, “Automatic Reliability Testing for Cluster Management Controllers,” was recently presented by CS PhD student Xudong Sun at the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), a premier computer systems research conference.
Beyond the research, the project has attracted considerable attention from industry practitioners and endorsement inside VMware. VMware CTO Kit Colbert said in a recent tweet that the Sieve project is one example of why he loves the company, where great academic and industry collaboration leads to even greater developments.
The work also earned Xu’s research group a $250,000 research gift from VMware to continue the research. His recent $229,000 research grant from the Illinois-IBM Discovery Accelerator Institute is also related.
This recognition for the entire research team is something that Xu credits to the work ethic of all involved. The team includes Xu, Sun, MS student Wenqing Luo, PhD student Tyler Gu, incoming Illinois CS faculty Aishwarya Ganesan and Ramnatthan Alagappan (both were at VMware), and Lalith Suresh and Michael Gasch from VMware Research.
“We have a dream team. What I enjoy most is that everyone in the team really wants to do something useful and impactful beyond a publication only,” Xu said. “Since we are all convinced that controller reliability is an important and emergent problem, the team started building a usable system in a very early stage of the project. For example, the team set up continuous testing and integration and did code review.”
The genesis of the collaboration goes back to 2019, when Xu attended the ACM Symposium on Operating Systems Principles (SOSP), where he was introduced to Suresh from VMware Research by CS alumnus Faria Kalim (MS ’17, PhD ’20), who had interned with Suresh. Sun later interned with Suresh as well, in the summers of 2020 and 2022.
Xu and Suresh immediately connected over a shared interest in reliably managing large-scale datacenter infrastructures that operate thousands of machines, as well as the critical cloud systems and applications running on top of them. Their collaboration naturally combines Xu’s expertise in system configuration and operation with Suresh’s expertise in cluster management systems.
“I remember I talked to Lalith for only half an hour, and we already had great chemistry in terms of research interests and tastes. We quickly decided to work together. Later, Lalith invited Aishwarya and Ram to join the project (they both were at VMware Research at the time). I was very excited, because I knew Aishwarya and Ram for years – since we were graduate students,” Xu said.
With the goal of improving the correctness and reliability of large-scale system operations, the team produced Sieve – the first automatic and generalizable reliability-testing technique for cluster-management controllers.
“Industry is honest, and seeing the reception of this work is gratifying for all of us involved,” Xu said.
In his undergraduate studies, Sun focused on building high-performance and resource-efficient systems.
“The most frustrating moments came when my system ran into failures, due to bugs or configuration issues,” Sun said. “This is how I started to seek understanding for the fundamentals of correct and reliable systems, ultimately deciding to join Illinois CS to work with professor Tianyin Xu on system reliability.”
Sun’s research focuses on preventing production failures in large-scale cloud systems.
“Building Kubernetes controllers to automate system operations has become a common practice in modern cloud infrastructures,” Sun said. “There are many severe failures caused by buggy controllers, which motivated us to build tools to systematically test them and prevent failure-inducing bugs.”
Working alongside Xu, Sun has turned his advances on Sieve into research papers at prestigious academic conferences and workshops, including OSDI and HotOS. He has also presented at flagship industry conferences such as KubeCon + CloudNativeCon. Companies have even sent him customized swag to thank him for the serious defects Sieve detected in their software products.
The goals that brought him to Illinois CS, centered on advancing system reliability, have begun to be realized.
“I am most proud of how Sieve finds serious bugs in many popular and mature Kubernetes controllers,” Sun said. “These controllers manage critical production systems. It is exciting to see that Sieve can make a direct impact by improving their reliability.”
Besides the Sieve project, Sun is also working on detecting configuration issues in cloud systems using software testing techniques, and on designing new programming support for developing provably correct controllers with formal verification techniques.