PARALLEL DATA LAB

Problem Analysis: Extended Overview

We are exploring methodologies and algorithms for automating the analysis of failures and performance degradations in large-scale systems. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved and the likely root causes, diagnosing performance problems, and providing supporting evidence for any conclusions. Combining statistical tools with appropriate instrumentation, we hope to dramatically reduce the difficulty of analyzing performance and reliability problems in deployed storage systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing.

Automating problem analysis is crucial to achieving cost-effective systems at the scales needed for tomorrow’s high-end computing. The number of hardware and software components in such systems will make problems common rather than anomalous, so it must be possible to quickly move from problem to fix with little to no system downtime for analysis.

Further, the complexity of such distributed software systems makes by-hand analysis increasingly untenable. A more nuanced, but perhaps more pressing, concern is that implementors of scalable applications (e.g., parallel storage) are increasingly unable to test in representative high-end computing environments; they simply cannot afford to replicate the necessary system scale. As a result, scale-related problems must be analyzed in the field to allow improvements to be made, introducing delays and reducing productivity for customers and users. Clearance issues for systems deployed to support highly sensitive activities must also be taken into consideration. Current designs and tools fall far short of what is needed.

We are currently developing techniques for understanding the trade-offs associated with instrumentation, together with algorithms for hands-off problem analysis, including:

  • Continuous performance and anomaly tracing: Analysis requires information, making detailed system instrumentation a fundamental building block. Information is needed regarding the behavior, timing, and resource usage of each software and hardware component in a system, as well as the intercommunication among them. The primary trade-off is between collection overheads (storage and performance) and obtaining enough information for an effective analysis. A minimal sketch of such a trace record, with sampling as the overhead knob, appears after this list.

  • Blame assignment: Identifying which hardware and/or software components caused a particular problem is perhaps the most important operational question: which component needs to be fixed, removed, or replaced. Applying machine learning techniques to appropriate instrumentation data promises some automated assistance. Primary questions center on the accuracy of such approaches for different levels of instrumentation detail and for different problem difficulties (e.g., a single localized component failure vs. a cascading performance problem). A toy example of deviation-based blame assignment also follows this list.

  • Performance diagnosis: Identifying the I/O request sequences that cause a system to exhibit disappointing performance, and the internal control and data flow paths involved, is also extremely important. Combining machine learning (for pattern recognition) with queueing theory (for bottleneck analysis) will help to zero in on such details. Again, questions of accuracy and instrumentation requirements are central. A queueing-based bottleneck sketch appears at the end of this list as well.

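To make the tracing trade-off concrete, the following Python sketch shows one way a per-request trace record might look, with a sampling rate serving as the knob that trades collection overhead against analysis detail. The record fields, component names, and sampling policy are illustrative assumptions, not a description of any particular tracing infrastructure.

    import random
    from dataclasses import dataclass

    # Hypothetical per-request trace record capturing the behavior, timing, and
    # resource usage of one component; field names are illustrative only.
    @dataclass
    class TraceRecord:
        component: str      # e.g., "metadata-server" or "osd-3"
        operation: str      # e.g., "read", "write", "lookup"
        start_ns: int       # request arrival time at this component
        latency_ns: int     # time spent in this component
        bytes_moved: int    # resource-usage proxy

    class Tracer:
        """Collects trace records; the sampling rate trades overhead for detail."""
        def __init__(self, sample_rate=0.01):
            self.sample_rate = sample_rate   # trace ~1% of requests in full detail
            self.records = []

        def record(self, component, operation, start_ns, latency_ns, bytes_moved):
            # Dropping most records bounds storage and performance overheads,
            # at the cost of a coarser picture for later analysis.
            if random.random() < self.sample_rate:
                self.records.append(
                    TraceRecord(component, operation, start_ns, latency_ns, bytes_moved))
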
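The blame-assignment idea can be sketched with nothing more than per-component deviation scores: compare each component's current metrics against its own healthy history and blame the worst outlier. This is a toy stand-in for the machine learning techniques mentioned above; the metric names, components, and numbers are invented for illustration.

    import statistics

    def zscore(value, history):
        # How many standard deviations the current value sits from the
        # component's healthy baseline.
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1e-9
        return abs(value - mean) / stdev

    def assign_blame(current, baselines):
        # current:   {component: {metric: value}}
        # baselines: {component: {metric: [historical values]}}
        scores = {comp: sum(zscore(v, baselines[comp][m]) for m, v in metrics.items())
                  for comp, metrics in current.items()}
        return max(scores, key=scores.get), scores

    # Hypothetical snapshot: "osd-2" shows inflated latency and queue length.
    baselines = {"osd-1": {"latency_ms": [5, 6, 5, 7], "queue_len": [2, 3, 2, 2]},
                 "osd-2": {"latency_ms": [5, 5, 6, 6], "queue_len": [2, 2, 3, 2]}}
    current   = {"osd-1": {"latency_ms": 6,  "queue_len": 3},
                 "osd-2": {"latency_ms": 40, "queue_len": 12}}

    suspect, scores = assign_blame(current, baselines)
    print("suspect component:", suspect)   # -> osd-2
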
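For the queueing-theory half of performance diagnosis, a rough sketch is to estimate per-stage utilization from measured throughput and per-request service demand (the utilization law, U = X * D), then flag the stage whose utilization approaches 1 as the likely bottleneck. The stage names, demands, and throughput below are assumptions made up for illustration.

    def utilization(throughput_per_s, service_demand_s):
        # Utilization law: U = X * D.
        return throughput_per_s * service_demand_s

    def mm1_response_time(service_demand_s, u):
        # Open M/M/1 approximation: R = D / (1 - U); grows without bound as U -> 1.
        return float("inf") if u >= 1.0 else service_demand_s / (1.0 - u)

    stages = {                      # per-request service demand (seconds) at each stage
        "client-cache": 0.0002,
        "network":      0.0010,
        "disk-array":   0.0080,
    }
    throughput = 110.0              # observed requests per second

    for stage, demand in stages.items():
        u = utilization(throughput, demand)
        r_ms = mm1_response_time(demand, u) * 1000.0
        print(f"{stage:12s}  U = {u:4.2f}  est. R = {r_ms:6.2f} ms")

    # Here "disk-array" sits at U = 0.88 and dominates response time, marking it
    # as the likely bottleneck; pattern recognition over the traces would then
    # narrow down which request sequences push it there.
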
Two interrelated research challenges are evident. First, statistical tools will play a crucial role in accurate problem diagnosis and analysis schemes. The difficulty will be to understand which ones work most effectively in various situations. Second, the impact of instrumentation detail on the effectiveness of those tools must be well-understood to justify the associated instrumentation costs. Both efforts will require extensive experimentation and deep understanding of real case studies.