PARALLEL DATA LAB 

PDL Abstract

Scheduling Threads for Constructive Cache Sharing on CMPs

SPAA'07, June 9-11, 2007, San Diego, California, USA.

Shimin Chen** Phillip B. Gibbons** Michael Kozuch** Vasileios Liaskovitis* Anastassia Ailamaki* Guy E. Blelloch* Babak Falsafi* Limor Fix** Nikos Hardavellas* Todd C. Mowry*;** Chris Wilkerson†

*Carnegie Mellon University
**Intel Research Pittsburgh
†Intel Microprocessor Research Lab

http://www.pdl.cmu.edu/

In chip multiprocessors (CMPs), limiting the number of off- chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for construc tive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

KEYWORDS: Chip Multiprocessors, Scheduling Algorithms, Constructive Cache Sharing, Parallel Depth First, Work Stealing, Thread Granularity, Working Set Profiling

FULL PAPER: pdf