PDL Abstract

Principled Workflow-centric Tracing of Distributed Systems

ACM Symposium on Cloud Computing 2016 (SoCC ’16), October 5-7, 2016, Santa Clara, CA, USA.

Raja R. Sambasivan*, Ilari Shafer^, Jonathan Mace‡, Benjamin H. Sigelman†, Rodrigo Fonseca‡, Gregory R. Ganger*

* Carnegie Mellon University,
^ Microsoft,
‡ Brown University,
† LightStep


Workflow-centric tracing captures the workflow of causally-related events (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior. Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis. Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential. To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks. Our design space and the design choices we suggest are based on our experiences developing several previous workflow-centric tracing infrastructures.