ATLAS: Analysis of Traces from Los Alamos Supercomputers

Contact: George Amvrosiadis, Greg Ganger, Chuck Cranor

At the Los Alamos National Laboratory (LANL), computing clusters are monitored closely to ensure their availability. Terabytes of data are collected every day on each cluster’s operation from several sources: job scheduler logs, sensor data, and file system logs, among others. Project Atlas aims to analyze and model the operation of LANL clusters, and use these models to develop techniques that improve clusters’ operational efficiency.

Trace Analysis

One of the current targets of Atlas is the characterization of diversity in cluster workloads and its impact on research results. Did you know that the most popular cluster workload trace was released by Google in 2012 and has been used in more than 450 publications already? Due to the scarcity of other data sources, however, researchers have started to overfit their work to the Google trace’s characteristics. We have been able to demonstrate this overfitting by introducing four new traces from two private clusters of a hedge fund firm (HedgeFund) and two LANL clusters. Our analysis shows that the HedgeFund workloads, consisting of data analytics jobs expected to be more closely related to the Google workload, display more similarity to LANL’s HPC cluster workloads (see Figure 1 for examples). Overall, our analysis results show that the new traces differ from the Google trace in substantive ways, suggesting that additional traces should be considered when evaluating the generality of new research. Through project Atlas, we plan to aid the community in moving forward by publicly releasing these and additional future traces. Furthermore, we plan to continue challenging workload assumptions used widely in the literature through our analysis of new datasets from LANL and other organizations.

Figure 1: CDFs of job size and duration across the Google, LANL, and HedgeFund traces.
Note how the Google workload stands out.

Workload Modeling

LANL jobs terminate with three possible statuses; a job can be successful, cancelled, or timed out. Job cancellations are triggered by a user or as the result of a failure in software or hardware. Job timeouts occur when the user-provided time limit is reached, which results in the job getting killed, and they occur often because users are motivated to issue jobs with small time limits, as those are generally prioritized by the scheduler. Our analysis of LANL job logs from the Mustang and Trinity clusters shows that a large number of CPU hours are allocated to jobs that are eventually cancelled or time out. The majority of this CPU time is very unlikely to be wasted. Our goal is to use cluster traces from a variety of data sources to build models that accurately predict job outcomes ahead of time. Such information could be leveraged by the scheduler, or even the users when deciding the frequency with which job state should be persisted to disk, in order to recover from a job termination and continue computing without repeating lost work.



George Amvrosiadis
Greg Ganger
Chuck Cranor
Garth Gibson


Jun Woo Park
Michael Kuchnik
Bryan Hooi


Los Alamos National Laboratory
: Elisabeth Baseman, Nathan DeBardeleben

Software Engineering Institute: Scott McMillan



We thank the members and companies of the PDL Consortium: Alibaba Group, Broadcom, Ltd., Dell EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi Ltd., Intel Corporation, IBM, Micron, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Salesforce, Samsung Information Systems America, Seagate Technology, Toshiba, Two Sigma, Veritas and Western Digital for their interest, insights, feedback, and support.




© 2018. Last updated 14 February, 2018