PARALLEL DATA LAB 

PDL Abstract

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

ASPLOS 2018. The 23rd ACM International Conference on Architectural Support for Programming Languages and Operating Systems, March 24th – March 28th, Williamsburg, VA.

Amirali Boroumand1, Saugata Ghose1, Youngsok Kim2, Rachata Ausavarungnirun1, Eric Shiu3, Rahul Thakur3,
Daehyun Kim4,3, Aki Kuusela3, Allan Knies3, Parthasarathy Ranganathan3, Onur Mutlu5,1

1 Carnegie Mellon University
2 Dept. of ECE, Seoul National University
3 Google
4 Samsung Research
5 ETH Zürich

http://www.pdl.cmu.edu

We are experiencing an explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices.

In this work, we comprehensively analyze the energy and performance impact of data movement for several widely used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google’s machine learning framework; (3) video playback; and (4) video capture. Video playback and capture are both used in many video services, such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads by performing part of the computation close to memory. Each workload contains simple primitives and functions that account for a significant amount of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power budgets of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).
