PARALLEL DATA LAB 

PDL Abstract

So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102. April 2014.

Raja R. Sambasivan†, Rodrigo Fonseca^, Ilari Shafer*, Gregory R. Ganger†

†Carnegie Mellon University
^Brown University
*Microsoft

http://www.pdl.cmu.edu/

End-to-end tracing captures the workflow of causally-related activity (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for management tasks like diagnosis and resource accounting. Drawing upon our experiences building and using end-to-end tracing infrastructures, this paper distills the key design axes that dictate trace utility for important use cases. Developing tracing infrastructures without explicitly understanding these axes and choices for them will likely result in infrastructures that are not useful for their intended purposes. In addition to identifying the design axes, this paper identifies good design choices for various tracing use cases, contrasts them with choices made by previous tracing implementations, and shows where prior implementations fall short. It also identifies remaining challenges on the path to making tracing an integral part of distributed system design.

KEYWORDS: Cloud computing, Distributed systems, Design, End-to-end tracing

FULL TR: pdf