Publications by Year: 2020

2020
Srivatsan Krishnan, Maximilian Lam, Sharad Chitlangia, Zishen Wan, Gabriel Barth-Maron, Aleksandra Faust, and Vijay Janapa Reddi. 11/28/2020. “QuaRL: Quantization for Sustainable Reinforcement Learning”. Publisher's Version. Abstract:
Deep reinforcement learning continues to show tremendous potential in achieving task-level autonomy; however, its computational and energy demands remain prohibitively high. In this paper, we tackle this problem by applying quantization to reinforcement learning. To that end, we introduce a novel Reinforcement Learning (RL) training paradigm, ActorQ, to speed up actor-learner distributed RL training. ActorQ leverages 8-bit quantized actors to speed up data collection without affecting learning convergence. Our quantized distributed RL training system, ActorQ, demonstrates end-to-end speedups of 1.5×–2.5× and faster convergence over full-precision training on a range of tasks (DeepMind Control Suite) and different RL algorithms (D4PG, DQN). Furthermore, we compare the carbon emissions (kg of CO2) of ActorQ versus standard reinforcement learning on various tasks. Across various settings, we show that ActorQ enables more environmentally friendly reinforcement learning, achieving 2.8× less carbon emission and energy compared to training RL agents in full precision. Finally, we demonstrate empirically that aggressively quantized RL policies (down to 4/5 bits) enable significant speedups on quantization-friendly (i.e., supporting native quantization) resource-constrained edge devices, without degrading accuracy. We believe that this is the first of many future works on enabling computationally and energy-efficient, sustainable reinforcement learning. The source code for QuaRL is available here: https://github.com/harvard-edge/QuaRL.
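At the core of ActorQ is running the actor's policy network at reduced precision for experience collection while the learner trains in full precision. Below is a minimal sketch, assuming a toy NumPy MLP policy and a simple symmetric int8 scheme; the helper names are illustrative and this is not the QuaRL/ActorQ code (see the linked repository for that).

```python
# Minimal illustrative sketch, NOT the QuaRL/ActorQ implementation:
# post-training symmetric int8 quantization of a small MLP policy, with
# dequantize-on-use inference as a stand-in for a quantized actor.
import numpy as np

def quantize_int8(w):
    """Uniform symmetric quantization of a weight tensor to int8."""
    scale = np.max(np.abs(w)) / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def quantized_forward(obs, layers):
    """Run a small ReLU MLP policy using its int8 weights."""
    h = obs
    for q, scale, b in layers:
        h = np.maximum(dequantize(q, scale) @ h + b, 0.0)
    return h

# Example: quantize a random 2-layer policy and evaluate one observation.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(64, 8)).astype(np.float32), np.zeros(64, np.float32)
w2, b2 = rng.normal(size=(4, 64)).astype(np.float32), np.zeros(4, np.float32)
layers = [(*quantize_int8(w1), b1), (*quantize_int8(w2), b2)]
action_scores = quantized_forward(rng.normal(size=8).astype(np.float32), layers)
```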
Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, and David Brooks. 10/27/2020. “Cross-Stack Workload Characterization of Deep Recommendation Systems.” In 2020 IEEE International Symposium on Workload Characterization (IISWC). Publisher's Version. Abstract:

Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches, ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can yield up to a 15x speedup. To better understand the bottlenecks for further optimization, we look at both the software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.

Brandon Reagen, Wooseok Choi, Yeongil Ko, Vincent Lee, Gu-Yeon Wei, Hsien-Hsin S. Lee, and David Brooks. 10/8/2020. “Cheetah: Optimizations and Methods for Privacy-Preserving Inference via Homomorphic Encryption”. Publisher's Version. Abstract:
As the application of deep learning continues to grow, so does the amount of data used to make predictions. While traditionally, big-data deep learning was constrained by computing performance and off-chip memory bandwidth, a new constraint has emerged: privacy. One solution is homomorphic encryption (HE). Applying HE to the client-cloud model allows cloud services to perform inference directly on the client's encrypted data. While HE can meet privacy constraints, it introduces enormous computational challenges and remains impractically slow in current systems.
This paper introduces Cheetah, a set of algorithmic and hardware optimizations for HE DNN inference to achieve plaintext DNN inference speeds. Cheetah proposes HE-parameter tuning and operator scheduling optimizations, which together deliver a 79x speedup over the state-of-the-art. However, this still falls short of plaintext inference speeds by almost four orders of magnitude. To bridge the remaining performance gap, Cheetah further proposes an accelerator architecture that, when combined with the algorithmic optimizations, approaches plaintext DNN inference speeds. We evaluate several common neural network models (e.g., ResNet50, VGG16, and AlexNet) and show that plaintext-level HE inference for each is feasible with a custom accelerator consuming 30 W and 545 mm².
Srivatsan Krishnan, Zishen Wan, Kshitij Bhardwaj, Paul Whatmough, Aleksandra Faust, Gu-Yeon Wei, David Brooks, and Vijay Janapa Reddi. 9/16/2020. “The Sky Is Not the Limit: A Visual Performance Model for Cyber-Physical Co-Design in Autonomous Machines.” IEEE Computer Architecture Letters, 19, 1, Pp. 38–42. Publisher's Version. Abstract:
We introduce the “Formula-1” (F-1) roofline model to understand the role of computing in aerial autonomous machines. The model provides insights by exploiting the fundamental relationships between various components in an aerial robot, such as sensor framerate, compute performance, and body dynamics (physics). F-1 serves as a tool that can aid computer and cyber-physical system architects to understand the optimal design (or selection) of various components in the development of autonomous machines.
Glenn Ko, Yuji Chai, Marco Donato, Paul Whatmough, Thierry Tambe, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 8/18/2020. “A Scalable Bayesian Inference Accelerator for Unsupervised Learning.” In IEEE Hot Chips 31 Symposium, Palo Alto, CA, USA. Publisher's Version. Abstract:
This article consists only of a collection of slides from the author's conference presentation.
Kshitij Bhardwaj, Marton Havasi, Yuan Yao, David Brooks, José Miguel Hernández-Lobato, and Gu-Yeon Wei. 8/10/2020. “A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs.” In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Pp. 145–150. Publisher's Version. Abstract:

Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence model choices to integrate an accelerator with the memory hierarchy: no coherence, coherent with the last-level cache (LLC), and private-cache-based full coherence. However, there has been very limited research on finding which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: find the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find cost-aware coherence interfaces for SoCs modeled with the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to the already-supported no coherence and full coherence. For a heterogeneous SoC targeting applications with varying amounts of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly-used coherence interfaces.
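To make the search concrete, the sketch below enumerates candidate per-accelerator coherence assignments against a hypothetical simulate() cost standing in for a gem5-Aladdin run. The paper drives this search with Bayesian optimization; a plain random search is shown here purely as an illustrative stand-in, and all names are assumptions.

```python
# Illustrative only: random-search stand-in for the paper's Bayesian
# optimization over per-accelerator coherence interfaces. simulate() is a
# hypothetical placeholder for a gem5-Aladdin power/performance evaluation.
import random

COHERENCE_MODELS = ["non_coherent", "llc_coherent", "fully_coherent"]

def simulate(assignment):
    """Fake cost (e.g., an energy-delay-product-like number) for a tuple of
    per-accelerator coherence choices; deterministic per assignment in a run."""
    rng = random.Random(hash(assignment) & 0xFFFFFFFF)
    return sum(rng.uniform(0.5, 1.5) for _ in assignment)

def search(num_accels=4, budget=50):
    best, best_cost = None, float("inf")
    for _ in range(budget):            # a BO loop would propose points here
        cand = tuple(random.choice(COHERENCE_MODELS) for _ in range(num_accels))
        cost = simulate(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

print(search())
```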

Paul Whatmough, Marco Donato, Glenn Ko, David Brooks, and Gu-Yeon Wei. 8/1/2020. “CHIPKIT: An agile, reusable open-source framework for rapid test chip development.” IEEE Micro, 40, 4, Pp. 32–40. Publisher's Version. Abstract:
The current trend toward domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time-consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem that provides basic IO, an on-chip programmable host, memory, and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool called VGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei. 7/20/2020. “Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference.” In Design Automation Conference (DAC 2020), San Francisco, CA, USA. Publisher's Version. Abstract:
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision as their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point inspired number format, AdaptivFloat, that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at low bit precision (≤ 8-bit) across a diverse set of state-of-the-art neural networks exhibiting narrow to wide weight distribution. Notably, at 4-bit weight precision, only a 2.1 degradation in BLEU score is observed on the AdaptivFloat-quantized Transformer network compared to total accuracy loss when encoded in the above-mentioned prominent datatypes. Furthermore, experimental results on a deep neural network (DNN) processing element (PE), exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that is 0.9× and 1.14×, respectively, that of an equivalent bit width NVDLA-like integer-based PE.
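A rough sense of the encoding can be given with a simplified, per-tensor sketch: pick an exponent bias from the tensor's maximum magnitude so a tiny float format covers its range, then round each value to the nearest representable number. This is an illustrative approximation under stated assumptions, not the authors' AdaptivFloat definition or hardware datapath.

```python
# Simplified AdaptivFloat-style quantizer (illustrative approximation only):
# choose the exponent bias per tensor from max|w| so the format's dynamic
# range covers the values, then round mantissas to man_bits fractional bits.
import numpy as np

def adaptive_float_quantize(w, exp_bits=3, man_bits=2):
    w = np.asarray(w, dtype=np.float64)
    max_exp_field = 2 ** exp_bits - 1
    max_abs = np.max(np.abs(w)) + 1e-30
    bias = int(np.floor(np.log2(max_abs))) - max_exp_field  # shift range up to max|w|
    sign, mag = np.sign(w), np.abs(w)
    # Per-value exponent, clipped to the representable exponent field.
    e = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** bias))), bias, bias + max_exp_field)
    mant = np.round(mag / 2.0 ** e * 2 ** man_bits) / 2 ** man_bits
    q = sign * mant * 2.0 ** e
    q[mag < 2.0 ** bias / 2] = 0.0   # magnitudes below the smallest step flush to zero
    return q

print(adaptive_float_quantize(np.array([0.003, -0.12, 0.9, -2.5])))
```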
Antonino Tumeo, Marco Minutoli, Giovanni Castellana, Joseph Manzano, Vinay Amatya, David Brooks, and Gu-Yeon Wei. 7/20/2020. “Software Defined Accelerators From Learning Tools Environment.” In 2020 57th ACM/IEEE Design Automation Conference (DAC), Pp. 1–6. IEEE. Publisher's Version. Abstract:
Next-generation systems, such as edge devices, will need to provide efficient processing of machine learning (ML) algorithms along several metrics, including energy, performance, area, and latency. However, the quickly evolving field of ML makes it extremely difficult to generate accelerators able to support a wide variety of algorithms. At the same time, designing accelerators in hardware description languages (HDLs) by hand is hard and time-consuming, and does not allow quick exploration of the design space. In this paper, we present the Software Defined Accelerators From Learning Tools Environment (SODALITE), an automated open-source high-level ML-framework-to-Verilog compiler targeting ML Application-Specific Integrated Circuit (ASIC) chiplets. The SODALITE approach will implement optimal designs by seamlessly combining custom components generated through high-level synthesis (HLS) with templated and fully tunable Intellectual Properties (IPs) and macros, integrated in an extendable resource library. Through a closed-loop design space exploration engine, developers will be able to quickly explore their hardware designs along different dimensions.
Glenn Ko, Yuji Chai, Marco Donato, Paul Whatmough, Thierry Tambe, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 6/16/2020. “A 3mm² Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm.” In IEEE Symposium on VLSI Circuits (VLSI). Publisher's Version. Abstract:
This paper describes a 16nm programmable accelerator for unsupervised probabilistic machine perception tasks that performs Bayesian inference on probabilistic models mapped onto a 2D Markov Random Field, using MCMC. Exploiting two degrees of parallelism, it performs Gibbs sampling inference up to 1380× faster with 1965× less energy than an Arm Cortex-A53 on the same SoC, and 1.5× faster with 6.3× less energy than an embedded FPGA in the same technology. At 0.8V, it runs at 450MHz, producing 44.6 MSamples/s at 0.88 nJ/sample.
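For readers unfamiliar with the workload, the sketch below is a plain software reference for chromatic (checkerboard) Gibbs sampling on a 2D Ising-style MRF, the kind of update the chip parallelizes in hardware; it is illustrative only and does not reflect the accelerator's mapping or arithmetic.

```python
# Illustrative software reference: checkerboard Gibbs sampling on a 2D
# Ising-style MRF. Sites of one "color" are conditionally independent, which
# is the parallelism a hardware sampler can exploit.
import numpy as np

def gibbs_sweep(spins, coupling=0.5, rng=None):
    """One full sweep over a +/-1 spin grid with 4-neighbor coupling."""
    rng = rng if rng is not None else np.random.default_rng()
    H, W = spins.shape
    for parity in (0, 1):                          # two checkerboard colors
        for i in range(H):
            for j in range(W):
                if (i + j) % 2 != parity:
                    continue
                nb = (spins[(i - 1) % H, j] + spins[(i + 1) % H, j] +
                      spins[i, (j - 1) % W] + spins[i, (j + 1) % W])
                p_up = 1.0 / (1.0 + np.exp(-2.0 * coupling * nb))
                spins[i, j] = 1 if rng.random() < p_up else -1
    return spins

spins = np.random.default_rng(0).choice([-1, 1], size=(16, 16))
for _ in range(100):
    gibbs_sweep(spins)
```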
Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, and Xiaodong Wang. 5/30/2020. “RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing.” In The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020). Publisher's Version. Abstract:
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
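The operator at the heart of this bottleneck is the pooled embedding lookup: an irregular gather from large tables followed by a cheap reduction. A minimal NumPy sketch of that access pattern is below; the shapes and names are illustrative and do not describe RecNMP's interface.

```python
# Minimal sketch of the memory-bound sparse embedding operator (gather plus
# pooled reduction) that dominates recommendation inference. Illustrative only.
import numpy as np

def embedding_gather_sum(table, indices, lengths):
    """SparseLengthsSum-style pooling: gather rows per query and reduce."""
    out, offset = [], 0
    for n in lengths:                                  # one pooled output per query
        rows = table[indices[offset:offset + n]]       # irregular gather (memory-bound)
        out.append(rows.sum(axis=0))                   # cheap reduction (compute-light)
        offset += n
    return np.stack(out)

rng = np.random.default_rng(0)
table = rng.normal(size=(100_000, 64)).astype(np.float32)     # ~25 MB embedding table
indices = rng.integers(0, table.shape[0], size=400)
lengths = [80, 120, 100, 100]                                 # 4 queries, 400 lookups total
pooled = embedding_gather_sum(table, indices, lengths)        # shape (4, 64)
```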
Peter Mattson, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, David Patterson, Guenther Schmuelling, Hanlin Tang, Gu-Yeon Wei, and Carole-Jean Wu. 3/1/2020. “MLPerf: An industry standard benchmark suite for machine learning performance.” IEEE Micro, 40, 2, Pp. 8–16. Publisher's Version. Abstract:
In this article, we describe the design choices behind MLPerf, a machine learning performance benchmark that has become an industry standard. The first two rounds of the MLPerf Training benchmark helped drive improvements to software-stack performance and scalability, showing a 1.3× speedup in the top 16-chip results despite higher quality targets and a 5.5× increase in system scale. The first round of MLPerf Inference received over 500 benchmark results from 14 different organizations, showing growing adoption.
Maximilian Lam, Zachary Yedidia, Colby Banbury, and Vijay Janapa Reddi. 2/26/2020. “Precision Batching: Bitserial Decomposition for Efficient Neural Network Inference on GPUs”. Publisher's Version. Abstract:
We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without the need for retraining/recalibration, but also 1) gives traditional hardware platforms the ability to realize inference speedups at a finer granularity of quantization (e.g., 1–16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by 1.5x–2x at the same error tolerance.
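A minimal sketch of the bitserial idea, under simplifying assumptions (unsigned weight quantization, NumPy instead of GPU kernels): split the quantized weight matrix into bitlayers, run a {0,1} matvec per bitlayer against full-precision activations, and recombine the partial results with powers of two. This is illustrative only, not the paper's kernels.

```python
# Illustrative bitserial decomposition (not the paper's GPU implementation):
# an n-bit quantized weight matrix is split into bitlayers, each bitlayer does
# a {0,1} matvec against full-precision activations, and partial results are
# recombined with powers of two.
import numpy as np

def quantize_unsigned(w, bits):
    scale = w.max() / (2 ** bits - 1) + 1e-12
    return np.round(np.clip(w, 0, None) / scale).astype(np.int64), scale

def bitserial_matvec(wq, scale, x, bits):
    acc = np.zeros(wq.shape[0], dtype=np.float64)
    for b in range(bits):                      # one pass per bitlayer
        bitplane = (wq >> b) & 1               # {0,1} matrix for bit b
        acc += (bitplane @ x) * (1 << b)       # 1-bit matvec, weighted by 2^b
    return acc * scale

rng = np.random.default_rng(0)
w = rng.uniform(0, 1, size=(32, 64))
x = rng.normal(size=64)
wq, scale = quantize_unsigned(w, bits=4)
approx = bitserial_matvec(wq, scale, x, bits=4)
exact = (wq * scale) @ x                       # dense quantized matvec
assert np.allclose(approx, exact)              # bitserial recombination matches
```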
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. 2020. “The Architectural Implications of Facebook's DNN-based Personalized Recommendation.” In The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2020). Publisher's Version. Abstract:
The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.
Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, Carole-Jean Wu, and David Brooks. 2020. “DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference.” In The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020). Publisher's Version. Abstract:
Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant share of the compute demand on cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account the characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
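As a rough illustration of latency-bounded query scheduling (not DeepRecSched itself), the sketch below routes large queries to an accelerator and splits small ones into per-core CPU batches; the threshold, batch sizes, and routing rule are all hypothetical.

```python
# Hypothetical illustration only (not DeepRecSched): size-based routing of
# recommendation inference queries between CPUs and an accelerator.
def schedule(query_sizes, gpu_threshold=64, cpu_batch=16):
    """query_sizes: list of per-query batch sizes. Returns (cpu, gpu) batches."""
    cpu_batches, gpu_batches = [], []
    for q in query_sizes:
        if q >= gpu_threshold:            # large queries amortize accelerator overhead
            gpu_batches.append(q)
        else:                             # small queries: split across CPU cores
            while q > 0:
                cpu_batches.append(min(q, cpu_batch))
                q -= cpu_batch
    return cpu_batches, gpu_batches

print(schedule([8, 200, 40, 96, 12]))
```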
Yu Emma Wang, Gu-Yeon Wei, and David Brooks. 2020. “A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms.” In Third Conference on Machine Learning and Systems (MLSys). Publisher's Version. Abstract:
Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware and software specialization to improve performance. To systematically compare deep learning systems, we introduce a methodology comprised of a set of analysis techniques and parameterized end-to-end models for fully connected, convolutional, and recurrent neural networks. This methodology can be applied to analyze various hardware and software systems, and is intended to complement traditional methods. We demonstrate its utility by comparing two generations of specialized platforms (Google's Cloud TPU v2/v3), three heterogeneous platforms (Google TPU, Nvidia GPU, and Intel CPU), and specialized software stacks (TensorFlow and CUDA).