Publications by Type: Conference Paper

2023
Matthew Adiletta, Jesmin Jahan Tithi, Emmanouil-Ioannis Farsarakis, Gerasimos Gerogiannis, Robert Adolf, Robert Benke, Sidharth Kashyap, Samuel Hsia, Kartik Lakhotia, Fabrizio Petrini, Gu-Yeon Wei, and David Brooks. 4/24/2023. “Characterizing the Scalability of Graph Convolutional Networks on Intel® PIUMA.” In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Raleigh, North Carolina.

Large-scale Graph Convolutional Network (GCN) inference on traditional CPU/GPU systems is challenging due to a large memory footprint, sparse computational patterns, and irregular memory accesses with poor locality. Intel's Programmable Integrated Unified Memory Architecture (PIUMA) is designed to address these challenges for graph analytics. In this paper, a detailed characterization of GCNs is presented using the Open-Graph Benchmark (OGB) datasets to determine the viability of PIUMA as a potential solution to GCN scalability.

First, the extent of sparse matrix-dense matrix multiplication (SpMM) as a performance driver for GCN on CPU and GPU is explored, offering a methodology for predicting GCN behavior as a function of dataset characteristics. Second, an SpMM kernel optimized for PIUMA is described and investigated for sensitivity to system parameters including memory bandwidth, latency, and thread count. SpMM scalability on PIUMA is demonstrated, while the scalability limitations of a Xeon-optimized SpMM implementation are discussed. Finally, GCN performance is compared on PIUMA versus a Xeon CPU system and an Ampere GPU system, showing impressive results on PIUMA for large-scale datasets.
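To make the SpMM bottleneck concrete, the sketch below shows a single GCN layer in Python; it is purely illustrative (toy sizes, random data), not the PIUMA- or Xeon-optimized kernel studied in the paper. The sparse-times-dense product dominates runtime on large, sparse graphs.

```python
import numpy as np
import scipy.sparse as sp

# Toy GCN layer: H_next = A_hat @ (H @ W), where A_hat is the sparse,
# normalized adjacency matrix. The dense GEMM (H @ W) is compute-bound;
# the SpMM (A_hat @ ...) is memory-bound with irregular accesses.
n_nodes, in_dim, out_dim = 10_000, 128, 64            # illustrative sizes
A_hat = sp.random(n_nodes, n_nodes, density=1e-4, format="csr")
H = np.random.randn(n_nodes, in_dim).astype(np.float32)
W = np.random.randn(in_dim, out_dim).astype(np.float32)

H_next = A_hat @ (H @ W)                               # dense GEMM, then SpMM
```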

Samuel Hsia, Udit Gupta, Bilge Acun, Newsha Ardalani, Pan Zhong, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. 3/25/2023. “MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation.” In ASPLOS. Vancouver, Canada.

Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of content. The reliance on a fixed embedding representation for these tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec -- a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms.

On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.

2022
Yu-Shun Hsiao, Siva Kumar Sastry Hari, Michał Filipiuk, Timothy Tsai, Michael B. Sullivan, Vijay Janapa Reddi, Vasu Singh, and Stephen W. Keckler. 7/10/2022. “Zhuyi: Perception Processing Rate Estimation for Safety in Autonomous Vehicles.” In ACM/IEEE Design Automation Conference (DAC). San Francisco, CA, USA.
L. Pentecost, A. Hankin, M. Donato, M. Hempstead, G.-Y. Wei, and D. Brooks. 4/2/2022. “NVMExplorer: A Framework for Cross-Stack Comparisons of Embedded Non-Volatile Memories.” In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). Seoul, South Korea.
Repeated off-chip memory accesses to DRAM drive up operating power for data-intensive applications, and SRAM technology scaling and leakage power limits the efficiency of embedded memories. Future on-chip storage will need higher density and energy efficiency, and the actively expanding field of emerging, embeddable non-volatile memory (eNVM) technologies is providing many potential candidates to satisfy this need. Each technology proposal presents distinct trade-offs in terms of density, read, write, and reliability characteristics, and we present a comprehensive framework for navigating and quantifying these design trade-offs alongside realistic system constraints and application-level impacts. This work evaluates eNVM-based storage for a range of application and system contexts including machine learning on the edge, graph analytics, and general purpose cache hierarchy, in addition to describing a freely available set of tools for application experts, system designers, and device experts to better understand, compare, and quantify the next generation of embedded memory solutions.
Tianyu Jia, En-Yu Yang, Yu-Shun Hsiao, Jonathan Cruz, David Brooks, Gu-Yeon Wei, and Vijay Janapa Reddi. 3/14/2022. “OMU: A Probabilistic 3D Occupancy Mapping Accelerator for Real-time OctoMap at the Edge.” In Design, Automation, and Test in Europe (DATE).
2021
Zishen Wan, Aqeel Anwar, Yu-Shun Hsiao, Tianyu Jia, Vijay Janapa Reddi, and Arijit Raychowdhury. 11/9/2021. “Analyzing and Improving Fault Tolerance of Learning-Based Navigation Systems.” In 58th ACM/IEEE Design Automation Conference (DAC).
Learning-based navigation systems are widely used in autonomous applications such as robotics, unmanned vehicles, and drones. Specialized hardware accelerators have been proposed to provide high performance and energy efficiency for such navigational tasks. However, transient and permanent faults are increasing in hardware systems and can catastrophically violate task safety. Meanwhile, traditional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the resilience of navigation systems with respect to algorithms, fault models, and data types from both reinforcement learning (RL) training and inference. We further propose two efficient fault mitigation techniques that achieve a 2x success rate and a 39% quality-of-flight improvement in learning-based navigation systems.
2020
Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, and David Brooks. 10/27/2020. “Cross-Stack Workload Characterization of Deep Recommendation Systems.” In 2020 IEEE International Symposium on Workload Characterization (IISWC).

Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches - ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can provide up to a 15x speedup. To better understand the bottlenecks for further optimization, we look at both software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.

Glenn Ko, Yuji Chai, Marco Donato, Paul Whatmough, Thierry Tambe, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 8/18/2020. “A Scalable Bayesian Inference Accelerator for Unsupervised Learning.” In IEEE Hot Chips 31 Symposium. Palo Alto, CA, USA.
This article consists only of a collection of slides from the author's conference presentation.
Kshitij Bhardwaj, Marton Havasi, Yuan Yao, David Brooks, José Miguel Hernández-Lobato, and Gu-Yeon Wei. 8/10/2020. “A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs.” In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Pp. 145–150.

Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence model choices to integrate an accelerator with the memory hierarchy: no coherence, coherent with the last-level cache (LLC), and private-cache-based full coherence. However, there has been very limited research on finding which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: finding the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find the cost-aware coherence interfaces for SoCs that are modeled using the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to the already-supported no coherence and full coherence. For a heterogeneous SoC targeting applications with varying amounts of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly-used coherence interfaces.

Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei. 7/20/2020. “Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference.” In Design Automation Conference (DAC 2020). San Francisco, CA, USA.
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision as their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point inspired number format, AdaptivFloat, that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at low bit precision (≤ 8-bit) across a diverse set of state-of-the-art neural networks, exhibiting narrow to wide weight distribution. Notably, at 4-bit weight precision, only a 2.1 degradation in BLEU score is observed on the AdaptivFloat-quantized Transformer network compared to total accuracy loss when encoded in the above-mentioned prominent datatypes. Furthermore, experimental results on a deep neural network (DNN) processing element (PE), exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that is 0.9× and 1.14×, respectively, that of an equivalent bit width NVDLA-like integer-based PE.
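As a rough illustration of the idea (not the paper's exact algorithm or its hardware datapath), the following sketch quantizes a tensor to a low-bit float format whose exponent window is shifted per layer to cover that layer's largest magnitude; the bit-field split and the function name are assumptions made for illustration only.

```python
import numpy as np

def adaptivfloat_like_quantize(w, n_bits=8, n_exp=3):
    """Simplified per-layer adaptive float quantization (illustrative only):
    shift the exponent window so the layer's max |w| is representable,
    clip exponents into that window, then round the mantissa."""
    n_mant = n_bits - 1 - n_exp                       # 1 bit reserved for sign
    max_exp = np.floor(np.log2(np.max(np.abs(w)) + 1e-12))
    exp_lo = max_exp - (2 ** n_exp - 1)               # per-layer exponent window
    sign, mag = np.sign(w), np.abs(w) + 1e-12
    exp = np.clip(np.floor(np.log2(mag)), exp_lo, max_exp)
    mant = np.round(mag / 2.0 ** exp * 2 ** n_mant) / 2 ** n_mant
    return sign * np.minimum(mant, 2 - 2.0 ** -n_mant) * 2.0 ** exp
```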
Antonino Tumeo, Marco Minutoli, Giovanni Castellana, Joseph Manzano, Vinay Amatya, David Brooks, and Gu-Yeon Wei. 7/20/2020. “Software Defined Accelerators From Learning Tools Environment.” In 2020 57th ACM/IEEE Design Automation Conference (DAC), Pp. 1–6. IEEE.
Next generation systems, such as edge devices, will need to provide efficient processing of machine learning (ML) algorithms along several metrics, including energy, performance, area, and latency. However, the quickly evolving field of ML makes it extremely difficult to generate accelerators able to support a wide variety of algorithms. At the same time, designing accelerators in hardware description languages (HDLs) by hand is hard and time consuming, and does not allow quick exploration of the design space. In this paper we present the Software Defined Accelerators From Learning Tools Environment (SODALITE), an automated open source high-level ML framework-to-Verilog compiler targeting ML Application-Specific Integrated Circuit (ASIC) chiplets. The SODALITE approach will implement optimal designs by seamlessly combining custom components generated through high-level synthesis (HLS) with templated and fully tunable Intellectual Properties (IPs) and macros, integrated in an extendable resource library. Through a closed loop design space exploration engine, developers will be able to quickly explore their hardware designs along different dimensions.
Glenn Ko, Yuji Chai, Marco Donato, Paul Whatmough, Thierry Tambe, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 6/16/2020. “A 3mm2 Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm.” In IEEE Symposium on VLSI Circuits (VLSI).
This paper describes a 16nm programmable accelerator for unsupervised probabilistic machine perception tasks that performs Bayesian inference on probabilistic models mapped onto a 2D Markov Random Field, using MCMC. Exploiting two degrees of parallelism, it performs Gibbs sampling inference up to 1380× faster with 1965× less energy than an Arm Cortex-A53 on the same SoC, and 1.5× faster with 6.3× less energy than an embedded FPGA in the same technology. At 0.8V, it runs at 450MHz, producing 44.6 MSamples/s at 0.88 nJ/sample.
Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, and Xiaodong Wang. 5/30/2020. “RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing.” In The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020).
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. 2020. “The Architectural Implications of Facebook's DNN-based Personalized Recommendation.” In The 26th IEEE International Symposium on High-Performance Computer Architecture.
The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.
Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, Carole-Jean Wu, and David Brooks. 2020. “DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference.” In The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020).
Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
Yu Wang, Gu-Yeon Wei, and David Brooks. 2020. “A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms.” In Third Conference on Machine Learning and Systems (MLSys).
Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware and software specialization to improve performance. To systematically compare deep learning systems, we introduce a methodology comprised of a set of analysis techniques and parameterized end-to-end models for fully connected, convolutional, and recurrent neural networks. This methodology can be applied to analyze various hardware and software systems, and is intended to complement traditional methods. We demonstrate its utility by comparing two generations of specialized platforms (Google's Cloud TPU v2/v3), three heterogeneous platforms (Google TPU, Nvidia GPU, and Intel CPU), and specialized software stacks (TensorFlow and CUDA).
2019
Lillian Pentecost, Marco Donato, Brandon Reagen, Udit Gupta, Siming Ma, Gu-Yeon Wei, and David Brooks. 10/1/2019. “MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation.” In MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Pp. 769–781.
Deeply embedded applications require low-power, low-cost hardware that fits within stringent area constraints. Deep learning has many potential uses in these domains, but introduces significant inefficiencies stemming from off-chip DRAM accesses of model weights. Ideally, models would fit entirely on-chip. However, even with compression, memory requirements for state-of-the-art models make on-chip inference impractical. Due to increased density, emerging eNVMs are one promising solution. We present MaxNVM, a principled co-design of sparse encodings, protective logic, and fault-prone MLC eNVM technologies (i.e., RRAM and CTT) to enable highly-efficient DNN inference. We find bit reduction techniques (e.g., clustering and sparse compression) increase weight vulnerability to faults. This limits the capabilities of MLC eNVM. To circumvent this limitation, we improve storage density (i.e., bits-per-cell) with minimal overhead using protective logic. Tradeoffs between density and reliability result in a rich design space. We show that by balancing these techniques, the weights of large networks are able to reasonably fit on-chip. Compared to a naive, single-level-cell eNVM solution, our highly-optimized MLC memory systems reduce weight area by up to 29×. We compare our technique against NVDLA, a state-of-the-art industry-grade CNN accelerator, and demonstrate up to 3.2× reduced power and up to 3.5× reduced energy per ResNet50 inference.
Brian Plancher, Camelia Brumar, Iulian Brumar, Lillian Pentecost, Saketh Rama, and David Brooks. 9/24/2019. “Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM.” In IEEE High Performance Extreme Computing Conference (HPEC). Waltham, MA, USA.
Computational efficiency is a critical constraint for a variety of cutting-edge real-time applications. In this work, we identify an opportunity to speed up the end-to-end runtime of two such compute-bound applications by incorporating approximate linear algebra techniques. In particular, we apply approximate matrix multiplication to artificial Neural Networks (NNs) for image classification and to the robotics problem of Distributed Simultaneous Localization and Mapping (DSLAM). Expanding upon recent sampling-based Monte Carlo approximation strategies for matrix multiplication, we develop updated theoretical bounds and an adaptive error prediction strategy. We then apply these techniques in the context of NNs and DSLAM, increasing the speed of both applications by 15-20% while maintaining a 97% classification accuracy for NNs running on the MNIST dataset and keeping the average robot position error under 1 meter (vs 0.32 meters for the exact solution). However, both applications experience variance in their results. This suggests that Monte Carlo matrix multiplication may be an effective technique to reduce the memory and computational burden of certain algorithms when used carefully, but more research is needed before these techniques can be widely used in practice.
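For context, here is a minimal sketch of the classic sampling-based (Monte Carlo) approximate matrix multiplication that this line of work builds on: column/row pairs are sampled with probability proportional to their norm product and rescaled so the estimate is unbiased. The function name and sample count below are illustrative, not the paper's implementation.

```python
import numpy as np

def approx_matmul(A, B, c, rng=None):
    """Estimate A @ B from c sampled column-of-A / row-of-B outer products."""
    rng = rng or np.random.default_rng()
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()                   # norm-proportional sampling probabilities
    idx = rng.choice(A.shape[1], size=c, p=p) # sample c inner-dimension indices
    scale = 1.0 / (c * p[idx])                # rescale so the estimate is unbiased
    return (A[:, idx] * scale) @ B[idx, :]

# Example: approximate a 200x300 by 300x100 product from 64 sampled pairs.
A, B = np.random.randn(200, 300), np.random.randn(300, 100)
C_approx = approx_matmul(A, B, c=64)
```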
Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander Rush, Gu-Yeon Wei, and David Brooks. 8/23/2019. “MASR: A Modular Accelerator for Sparse RNNs.” In International Conference on Parallel Architectures and Compilation Techniques.
Recurrent neural networks (RNNs) are becoming the de facto solution for speech recognition. RNNs exploit long-term temporal relationships in data by applying repeated, learned transformations. Unlike fully-connected (FC) layers with single vector matrix operations, RNN layers consist of hundreds of such operations chained over time. This poses challenges unique to RNNs that are not found in convolutional neural networks (CNNs) or FC models, namely large dynamic activations. In this paper we present MASR, a principled and modular architecture that accelerates bidirectional RNNs for on-chip ASR. MASR is designed to exploit sparsity in both dynamic activations and static weights. The architecture is enhanced by a series of dynamic activation optimizations that enable compact storage, ensure no energy is wasted computing null operations, and maintain high MAC utilization for highly parallel accelerator designs. In comparison to current state-of-the-art sparse neural network accelerators (e.g., EIE), MASR provides 2x area, 3x energy, and 1.6x performance benefits. The modular nature of MASR enables designs that efficiently scale from resource-constrained low-power IoT applications to large-scale, highly parallel datacenter deployments.
Glenn Ko, Yuji Chai, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 4/28/2019. “FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs.” In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Pp. 334–334.
Many consider compatibility with existing accelerators, mainly GPUs, to be one of the key components of the success of deep learning. While GPUs are great at handling the linear algebra kernels commonly found in deep learning, they are not the optimal architecture for handling unsupervised learning methods such as Bayesian models and inference. As a step towards a better understanding of architectures for probabilistic models, we study Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, with a focus on parallelism that converges to the target distribution and on parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We design an architecture well suited to solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for a sound source separation application, FlexGibbs configured on the FPGA fabric of a Xilinx Zynq CPU-FPGA SoC achieved a Gibbs sampling inference speedup of 1048x and a 99.85% reduction in energy over running it on the ARM Cortex-A53.
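To illustrate the chromatic scheduling idea (a simplified software analogue, not the FlexGibbs hardware), the sketch below 2-colors a grid MRF like a checkerboard so that all sites of one color are conditionally independent given the other color and can be resampled in parallel; the Ising-style model and parameter names are assumptions made for illustration.

```python
import numpy as np

def chromatic_gibbs_sweep(state, beta=0.5, rng=None):
    """One Gibbs sweep over a 2D Ising-style MRF with +/-1 spins, using a
    checkerboard 2-coloring: same-color sites share no edges, so each color
    class can be resampled in parallel without breaking correctness."""
    rng = rng or np.random.default_rng()
    h, w = state.shape
    color = np.indices((h, w)).sum(axis=0) % 2
    for c in (0, 1):
        # Sum of the four neighbors (wrap-around boundaries).
        nbr = (np.roll(state, 1, 0) + np.roll(state, -1, 0) +
               np.roll(state, 1, 1) + np.roll(state, -1, 1))
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nbr))   # P(spin = +1 | neighbors)
        draw = np.where(rng.random((h, w)) < p_up, 1, -1)
        state = np.where(color == c, draw, state)
    return state

state = np.where(np.random.rand(32, 32) < 0.5, 1, -1)    # random initial spins
state = chromatic_gibbs_sweep(state)
```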
