Robotics, machine learning systems, and other modern computer and software systems are increasingly built from reusable infrastructure components. While developers have access to powerful tools, they also face complex challenges, such as configuring system components and infrastructure to perform optimally. Consequently, software and hardware for specific systems and tasks must be carefully selected and configured.
But system configuration is a challenging task, and when an incident occurs because of a misconfiguration, identifying the root cause is notoriously difficult, and the resulting performance faults can be misleading. These incidents have severe monetary, time and environmental repercussions. For example, Facebook blamed a faulty configuration change for a nearly six-hour outage last October that affected 3.5 billion users.
“What if a malfunction as a result of misconfiguration costs billions of dollars, for example, an important space mission similar to what happened at a global scale to internet companies such as Facebook and Google?”
-Pooyan Jamshidi, Computer Science and Engineering
Software performance is critical for most modern software systems to achieve optimal capacity and functionality while limiting operating costs and energy consumption. Computer Science and Engineering Assistant Professor Pooyan Jamshidi is working to establish an alternative method for testing and debugging complex, highly configurable machine learning systems.
Jamshidi was awarded $1.2 million over four years from the National Science Foundation for his research, “Causal Performance Debugging for Highly Configurable Systems,” in collaboration with Christian Kästner of Carnegie Mellon University and Baishakhi Ray from Columbia University. Kästner was Jamshidi’s postdoctoral advisor at Carnegie Mellon, while Jamshidi and Ray have been collaborating on causal inference for configurable systems research since early 2020.
“Our goal is to positively impact a variety of industrial sectors dependent on these highly configurable systems. The research is also intended to provide significant energy savings and reduced carbon emissions, especially for big data and machine learning systems operating at a massive scale, such as Google and Facebook,” Jamshidi says.
Jamshidi and his collaborators intend to develop foundations and tools for a causal approach to performance modeling and performance debugging. Instead of only analyzing correlations, they will use a new concept, causal performance models, to intervene on configuration options and observe how system performance responds across multiple objectives. Jamshidi also believes that a causal model could shrink the configuration space, so that a search can find near-optimal configurations more intelligently.
“The causal models enable inference and reasoning for numerous tasks, including debugging performance faults and misconfigurations,” Jamshidi says. “For example, some events might be correlated, but it doesn't mean that one causes another. There might be a case where one cause affects both and as a result these two events are correlated.”
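The distinction Jamshidi draws can be illustrated with a small simulation. The Python sketch below is not from the project. It assumes a hypothetical system in which a hidden workload size drives both a `threads` setting (chosen by an imagined autoscaler) and latency, while `threads` itself has no effect on latency by construction. The names `run_system`, `pearson` and `mean_latency_under`, and all numbers, are illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def run_system(rng, threads=None):
    """Simulate one run of a hypothetical system.

    Passing `threads` overrides the autoscaler, i.e. it is an intervention
    on that option. Leaving it as None yields purely observational data.
    """
    workload = rng.uniform(1.0, 10.0)  # hidden common cause
    if threads is None:
        # Assumed autoscaler: the thread count follows the workload size.
        threads = round(workload) + rng.choice([0, 1])
    # By construction, latency depends only on workload, never on threads.
    latency = 5.0 * workload + rng.gauss(0.0, 1.0)
    return threads, latency


def mean_latency_under(rng, threads, n=5000):
    """Average latency when the thread count is forced to a fixed value."""
    return sum(run_system(rng, threads=threads)[1] for _ in range(n)) / n


rng = random.Random(0)

# Observational data: threads and latency share a cause, so they correlate.
observed = [run_system(rng) for _ in range(5000)]
obs_corr = pearson([t for t, _ in observed], [lat for _, lat in observed])

# Interventional data: force threads to two very different values.
effect = mean_latency_under(rng, 10) - mean_latency_under(rng, 2)

print(f"observed correlation between threads and latency: {obs_corr:.2f}")
print(f"change in mean latency when forcing threads 2 -> 10: {effect:.2f}")
```

In the observational data, thread count and latency move together strongly, yet forcing the thread count to different values leaves average latency essentially unchanged. That gap is what an intervention can expose and a correlation alone cannot.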
Jamshidi, his collaborators and Ph.D. students in his lab, AISys, will also develop three innovations. First, they will design and refine a causal modeling approach for the performance of systems composed of multiple configurable components. Second, the team plans to develop and evaluate user-facing tool support, based on causal models, to help users select well-performing configurations for their specific tasks and hardware and resolve misconfiguration faults. Finally, a developer-facing tool will be created to foster code-level debugging and documentation.
“I like the idea of trying to identify internal proxies that can explain performance and get us away from expensively building a new performance model for each use case to one that hopefully generalizes across many workloads,” Kästner says.
Jamshidi states that his research is relevant for optimizing performance and energy since every company today uses some sort of machine learning system. Because these systems have numerous uses, his research applies to a significant number of companies that either use machine learning systems or develop them to provide services, including robotic systems. A developer can write publicly accessible software, such as software for a machine learning system that controls robots performing autonomous tasks.
According to Jamshidi, optimizing performance for computer systems is a complex problem where multiple objectives are involved, and many internal and external parameters might affect the performance of the system.
“We have done many empirical studies about configurable systems. The studies showed that finding the optimal configuration may not only improve execution time, but massively improve energy consumption of the system,” Jamshidi says. “A machine learning system has many computations for producing outcomes, which consume a good amount of energy on the hardware. Some of the hardware might be small and connected to a battery, so we don’t want the robot’s battery to die before finishing a mission or to have to charge the battery every hour. This research will have an impact on the amount of energy that a system would consume.”
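The trade-off between objectives such as execution time and energy can be made concrete with a minimal Python sketch. It is not taken from the project. It assumes five made-up configurations with hypothetical time and energy measurements and keeps only those that no other configuration beats on both objectives at once, which is a standard Pareto-front filter. The names `Measurement`, `dominates` and `pareto_front` are illustrative.

```python
from typing import List, NamedTuple


class Measurement(NamedTuple):
    config: str     # hypothetical configuration label
    seconds: float  # measured execution time (lower is better)
    joules: float   # measured energy use (lower is better)


def dominates(a: Measurement, b: Measurement) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.seconds <= b.seconds and a.joules <= b.joules
    strictly_better = a.seconds < b.seconds or a.joules < b.joules
    return no_worse and strictly_better


def pareto_front(ms: List[Measurement]) -> List[Measurement]:
    """Keep the measurements that no other measurement dominates."""
    return [m for m in ms if not any(dominates(other, m) for other in ms)]


# Made-up numbers, for illustration only.
measurements = [
    Measurement("A", 10.0, 50.0),
    Measurement("B", 12.0, 30.0),
    Measurement("C", 11.0, 55.0),  # slower and hungrier than A
    Measurement("D", 20.0, 28.0),
    Measurement("E", 21.0, 31.0),  # slower and hungrier than B and D
]

front = pareto_front(measurements)
print([m.config for m in front])
```

The configurations left on the front each represent a different balance of speed and energy. Choosing among them depends on the task, for instance whether a battery-powered robot must finish quickly or last longer.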
One of the graduate students at AISys, in collaboration with researchers from IBM, Columbia University and Purdue University, developed a method known as Unicorn, which learns a causal performance model by reasoning about the domain, capturing interactions, and tracing system-level performance events. Applied to highly configurable systems, it explained how those interactions affected the variation in performance objectives. Existing methods relied on statistical techniques to determine root causes, which could be misleading.
“We showed that with causal analysis we were able to find these root causes more clearly, precisely and accurately. This would help the developer and user find the root causes of the issue and fix it in a short amount of time and with less trial and error, which is considered causal debugging,” Jamshidi says.
The Unicorn method can also automatically determine correct settings for configuration options without any human intervention. Data is extracted from the system, and the causal model is then used to reason about the problem, determine root causes, and automatically fix the issue. Jamshidi plans for his current research to build on this previous work and address other challenges, such as optimizing system performance to complete tasks efficiently.
Since he is still in the early stages of his research, Jamshidi hopes to find an industry partner to complete at least two scenarios to evaluate his approaches and methods.
“We don’t want to limit ourselves to a small system in the lab,” Jamshidi says. “We want to see whether our methods that we develop really work in a real-world industry setting.”