ABSTRACT

Brief Announcement: Parallel Depth First vs. Work Stealing Schedulers on CMP Architectures

Vasileios Liaskovitis*, Shimin Chen**, Phillip B. Gibbons**, Anastassia Ailamaki*, Guy E. Blelloch*, Babak Falsafi*, Limor Fix**, Nikos Hardavellas*, Michael Kozuch**, Todd C. Mowry*,**, Chris Wilkerson†

*Carnegie Mellon University

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many [...]

Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher [...]

CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP [...]

Summary of findings. We studied a variety of benchmark programs and report the following findings.

For several application classes, PDF enables significant constructive sharing between threads, leading to better utilization of the on-chip caches and less off-chip traffic than WS. In particular, bandwidth-limited irregular programs and parallel divide-and-conquer programs achieve a relative speedup of 1.3-1.6X over WS, with a 13-41% reduction in off-chip traffic.

[Figure: for each schedule, the number of L2 misses (i.e., the off-chip traffic) is shown on the left and the speed-up over running on one core is shown on the right, for 1 to 32 cores.]

Note that reducing the off-chip traffic has the additional benefit of reducing power consumption. Moreover, PDF's smaller working sets provide opportunities to power down segments of the cache without increasing the running time. Furthermore, when multiple programs are active concurrently, the PDF version is less of a cache hog, and its smaller working set is more likely to remain in the cache across context switches.

For several other application classes, PDF and WS have roughly the same execution times, either because there is only limited data reuse to exploit or because the programs are not limited by off-chip bandwidth. In the latter case, the constructive sharing that PDF enables still provides the power and multiprogramming benefits discussed above.

Finally, most parallel benchmarks to date, written for SMPs, use such coarse-grained threading that they cannot exploit the constructive cache behavior inherent in PDF. We find that mechanisms for threading applications at a fine grain are crucial to achieving good performance on CMPs.
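
The contrast between the two schedulers can be illustrated with a small simulation. The sketch below is not the authors' simulator or benchmark code; it is a toy model built on the way these schedulers are commonly described: PDF gives priority to the ready tasks that come earliest in the sequential (depth-first) order, while WS gives each core its own deque, from which the owner takes its newest task and a thief takes the oldest. Everything else is an assumption made for illustration: the function names (children, run_pdf, run_ws, mean_leaf_spread), the balanced divide-and-conquer task tree over leaf indices, unit-time tasks, the absence of join tasks, a deterministic scan for steal victims (real work stealers typically pick victims at random), and the "leaf spread" metric, which is used only as a rough proxy for how far apart in the data the cores are working at the same moment.

```python
import heapq
from collections import deque


def children(task):
    """Split a range task (lo, hi) into two halves; a one-element range is a leaf."""
    lo, hi = task
    if hi - lo == 1:
        return []
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]


def mean_leaf_spread(steps):
    """Average (max - min) leaf index over the steps in which at least one leaf ran."""
    spreads = []
    for tasks in steps:
        leaves = [lo for lo, hi in tasks if hi - lo == 1]
        if leaves:
            spreads.append(max(leaves) - min(leaves))
    return sum(spreads) / len(spreads)


def run_pdf(n_leaves, cores):
    """Toy PDF: each step runs the ready tasks that are earliest in sequential order."""
    # Ready tasks cover disjoint ranges, so ordering by (lo, hi) matches sequential order.
    ready = [(0, n_leaves)]
    steps = []
    while ready:
        batch = [heapq.heappop(ready) for _ in range(min(cores, len(ready)))]
        steps.append(batch)
        for task in batch:
            for child in children(task):
                heapq.heappush(ready, child)
    return steps


def run_ws(n_leaves, cores):
    """Toy WS: per-core deques; owners pop the newest task, thieves take the oldest."""
    deques = [deque() for _ in range(cores)]
    deques[0].append((0, n_leaves))
    steps = []
    while any(deques):
        # Phase 1: every core acquires at most one task.
        acquired = []
        for c in range(cores):
            task = None
            if deques[c]:
                task = deques[c].pop()  # own work: newest task
            else:
                # Assumption: deterministic victim scan instead of random choice.
                for k in range(1, cores):
                    victim = deques[(c + k) % cores]
                    if victim:
                        task = victim.popleft()  # steal: oldest task
                        break
            acquired.append(task)
        # Phase 2: run the acquired tasks and enqueue their children locally.
        batch = []
        for c, task in enumerate(acquired):
            if task is None:
                continue
            batch.append(task)
            # Push the right half first so the left half is popped next.
            for child in reversed(children(task)):
                deques[c].append(child)
        steps.append(batch)
    return steps


pdf_spread = mean_leaf_spread(run_pdf(64, 4))
ws_spread = mean_leaf_spread(run_ws(64, 4))
print("PDF mean leaf spread:", pdf_spread)
print("WS mean leaf spread:", ws_spread)
```

In this toy configuration (64 leaves, 4 cores) the PDF model keeps the cores on adjacent leaves, while the WS model settles each core into its own quarter of the tree, so concurrently running leaves are far apart. That is only an analogy for the constructive sharing discussed in the summary of findings; the sketch models no cache and says nothing about the measured speedups or traffic reductions.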