Publications by Year: 2022

Matthew Adiletta, David Brooks, and Gu-Yeon Wei. 11/28/2022. “Architectural Implications of Embedding Dimension During GCN on CPU and GPU.” Cambridge: Harvard University.
Graph Neural Networks (GNNs) are a class of neural networks designed to extract information from the graphical structure of data. Graph Convolutional Networks (GCNs), a widely used type of GNN for transductive graph learning problems, apply convolution to learn information from graphs. GCN is a challenging algorithm from an architecture perspective due to its inherent sparsity, low data reuse, and massive memory capacity requirements. Traditional neural algorithms exploit the high compute capacity of GPUs to achieve high performance for both inference and training. This work examines whether a GPU is the right architectural choice for GCN inference, characterizing GCN on both CPU and GPU to better understand the implications of graph size, embedding dimension, and sampling on performance.
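To illustrate the variable the paper studies: embedding dimension sets the shape of the two kernels that dominate a GCN layer, sparse neighbor aggregation and dense projection. Below is a minimal PyTorch sketch (not the paper's benchmark code; graph sizes and the crude single-shot timing are illustrative assumptions) that times both kernels across embedding dimensions on CPU and, if present, GPU.

```python
import time
import torch

def gcn_layer(adj, h, w):
    # H' = A_hat @ H @ W: sparse aggregation, then dense projection.
    return torch.sparse.mm(adj, h) @ w

# Synthetic graph; sizes are arbitrary placeholders.
num_nodes, num_edges = 10_000, 200_000
idx = torch.randint(0, num_nodes, (2, num_edges))
adj = torch.sparse_coo_tensor(idx, torch.ones(num_edges),
                              (num_nodes, num_nodes)).coalesce()

for device in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
    a = adj.to(device)
    for dim in [16, 64, 256, 1024]:  # the embedding dimension under study
        h = torch.randn(num_nodes, dim, device=device)
        w = torch.randn(dim, dim, device=device)
        t0 = time.perf_counter()
        gcn_layer(a, h, w)
        if device == "cuda":
            torch.cuda.synchronize()
        print(f"{device} dim={dim}: {time.perf_counter() - t0:.4f}s")
```

Small embedding dimensions leave the GPU underutilized while the sparse gather dominates, which is the kind of CPU-versus-GPU crossover the characterization explores.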
Yu-Shun Hsiao, Siva Kumar Sastry Hari, Michał Filipiuk, Timothy Tsai, Michael B. Sullivan, Vijay Janapa Reddi, Vasu Singh, and Stephen W. Keckler. 7/10/2022. “Zhuyi: Perception Processing Rate Estimation for Safety in Autonomous Vehicles.” In ACM/IEEE Design Automation Conference (DAC). San Francisco, CA, USA.
Abdulrahman Mahmoud, Thierry Tambe, Tarek Aloui, David Brooks, and Gu-Yeon Wei. 6/27/2022. “GoldenEye: A Platform for Evaluating Emerging Numerical Data Formats in DNN Accelerators.” In 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). Baltimore, Maryland, USA.
This paper presents GoldenEye, a functional simulator with fault injection capabilities for common and emerging numerical formats, implemented for the PyTorch deep learning framework. GoldenEye provides a unified framework for numerical format evaluation of DNNs, including traditional number systems such as fixed and floating point, as well as recent DNN-inspired formats such as block floating point and AdaptivFloat. Additionally, GoldenEye enables single- and multi-bit flips at various logical and functional points during a value’s lifetime for resiliency analysis, including for the first time attention to numerical values’ hardware metadata. This paper describes GoldenEye’s technical design and implementation which make it an easy-to-use, extensible, versatile, and fast tool for dependability research and future DNN accelerator design. We showcase its utility with three case studies: a unifying platform for number system comparison and evaluation, a design-space exploration heuristic for data type selection, and fast DNN reliability analysis for different error models. GoldenEye is open-sourced and available at: https://github.com/ma3mool/goldeneye.
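To make the bit-flip mechanism concrete: below is a minimal sketch, not GoldenEye's actual API, of injecting a single-bit fault into an IEEE-754 float32 weight in PyTorch by reinterpreting its storage as integers.

```python
import torch

def flip_bit(tensor, index, bit):
    # Reinterpret float32 storage as int32 and XOR one bit in place.
    flat = tensor.view(-1).view(torch.int32)
    flat[index] ^= (1 << bit)

w = torch.randn(4, 4)
print("before:", w[0, 0].item())
flip_bit(w, 0, 30)  # a high exponent bit: a small value becomes huge, silently
print("after: ", w[0, 0].item())
```

Which bit is hit matters enormously: exponent flips like the one above tend to cause silent data corruption, while mantissa flips are often masked, which is why format-aware injection across a value's lifetime is useful.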
L. Pentecost, A. Hankin, M. Donato, M. Hempstead, G.-Y. Wei, and D. Brooks. 4/2/2022. “NVMExplorer: A Framework for Cross-Stack Comparisons of Embedded Non-Volatile Memories.” In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). Seoul, South Korea.
Repeated off-chip memory accesses to DRAM drive up operating power for data-intensive applications, and SRAM technology scaling and leakage power limit the efficiency of embedded memories. Future on-chip storage will need higher density and energy efficiency, and the actively expanding field of emerging, embeddable non-volatile memory (eNVM) technologies is providing many potential candidates to satisfy this need. Each technology proposal presents distinct trade-offs in terms of density, read, write, and reliability characteristics, and we present a comprehensive framework for navigating and quantifying these design trade-offs alongside realistic system constraints and application-level impacts. This work evaluates eNVM-based storage for a range of application and system contexts including machine learning on the edge, graph analytics, and general purpose cache hierarchy, in addition to describing a freely available set of tools for application experts, system designers, and device experts to better understand, compare, and quantify the next generation of embedded memory solutions.
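As a flavor of the cross-stack comparison the framework automates, here is an illustrative sketch, not NVMExplorer itself, that combines per-access device costs with an application's read/write mix to rank candidate technologies. The device numbers are placeholders, not measured values.

```python
# Placeholder device characteristics (illustrative only).
candidates = {
    "SRAM":    {"read_pj": 10.0, "write_pj": 10.0,  "leak_mw": 5.0},
    "STT-RAM": {"read_pj": 15.0, "write_pj": 80.0,  "leak_mw": 0.5},
    "RRAM":    {"read_pj": 12.0, "write_pj": 150.0, "leak_mw": 0.1},
}

def total_energy_mj(tech, reads, writes, runtime_s):
    c = candidates[tech]
    dynamic = (reads * c["read_pj"] + writes * c["write_pj"]) * 1e-9  # pJ -> mJ
    static = c["leak_mw"] * runtime_s                                  # mW*s = mJ
    return dynamic + static

# A read-heavy workload (e.g., ML inference weights) favors low-leakage eNVMs;
# a write-heavy workload shifts the ranking back toward SRAM.
for tech in candidates:
    print(tech, f"{total_energy_mj(tech, reads=1e9, writes=1e6, runtime_s=1.0):.1f} mJ")
```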
Zishen Wan, Aqeel Anwar, Abdulrahman Mahmoud, Tianyu Jia, Yu-Shun Hsiao, Vijay Janapa Reddi, and Arijit Raychowdhury. 3/14/2022. “FRL-FI: Transient Fault Analysis for Federated Reinforcement Learning-Based Navigation Systems.” In 2022 Design, Automation and Test in Europe Conference (DATE).
Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are becoming more common in hardware as technology nodes continue to scale, and they can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.
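For intuition on what a low-overhead detector in this space can look like, here is a minimal PyTorch sketch of range checking on layer activations; this is a common lightweight technique, not necessarily the exact mechanism FRL-FI proposes.

```python
import torch

class RangeGuard(torch.nn.Module):
    """Wraps a layer and flags activations outside profiled fault-free bounds."""
    def __init__(self, layer, lo, hi):
        super().__init__()
        self.layer, self.lo, self.hi = layer, lo, hi

    def forward(self, x):
        y = self.layer(x)
        if (y < self.lo).any() or (y > self.hi).any():
            # Recovery policy is application-specific: clamp, recompute,
            # or fall back to the last consensus policy from the federation.
            y = y.clamp(self.lo, self.hi)
        return y

# Bounds would be profiled from fault-free runs; these are placeholders.
guarded = RangeGuard(torch.nn.Linear(8, 4), lo=-10.0, hi=10.0)
print(guarded(torch.randn(2, 8)).shape)
```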
Tianyu Jia, En-Yu Yang, Yu-Shun Hsiao, Jonathan Cruz, David Brooks, Gu-Yeon Wei, and Vijay Janapa Reddi. 3/14/2022. “OMU: A Probabilistic 3D Occupancy Mapping Accelerator for Real-time OctoMap at the Edge.” In Design, Automation and Test in Europe Conference (DATE).
Maximilian Lam, Michael Mitzenmacher, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. 3/5/2022. “Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference.”
Multiparty computation approaches to private neural network inference require significant communication between server and client, incur tremendous runtime penalties, and impose massive storage overheads. The primary source of these expenses is garbled circuits operations for nonlinear activation functions (typically ReLU), which require on the order of kilobytes of data transfer for each individual operation and tens of kilobytes of preprocessing storage per operation per inference. We propose a replacement for garbled circuits: Tabula, an algorithm to securely and efficiently perform single operand nonlinear functions for private neural network inference. Tabula performs a one-time client initialization procedure with the help of a trusted third party (or via fully homomorphic encryption), operates over smaller finite fields whose elements are representable with less than 16 bits, and employs a lookup table which stores the encrypted results of nonlinear operations over secretly shared values. We show Tabula is secure under a semi-honest threat model, allowing it to be used as a replacement for garbled circuits operations. Our results show that for private neural network inference, Tabula reduces communication by a factor of more than 50×, enables speedups over 10×, and reduces storage costs from O(n) to O(1).
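To make the lookup-table idea concrete, here is a heavily simplified toy sketch: parties holding additive secret shares of x reveal only a masked value x + r and read preprocessed tables to obtain shares of ReLU(x). Both parties run in one process here for illustration, and the field size and table layout are illustrative rather than the paper's exact protocol.

```python
import random

P = 2**16 - 15  # a small prime field; Tabula uses elements under 16 bits

def relu_field(v):
    # Treat the upper half of the field as negative values.
    return v if v < P // 2 else 0

# One-time preprocessing by a trusted dealer (or via FHE, per the paper):
r = random.randrange(P)
r0 = random.randrange(P); r1 = (r - r0) % P          # shares of the mask
t0 = [random.randrange(P) for _ in range(P)]         # party 0's table
t1 = [(relu_field((m - r) % P) - t0[m]) % P for m in range(P)]  # party 1's

# Online phase: x is additively shared as x0 + x1 (mod P).
x = 12345
x0 = random.randrange(P); x1 = (x - x0) % P
m = (x0 + r0 + x1 + r1) % P      # parties reveal only the masked value x + r
y0, y1 = t0[m], t1[m]            # each party reads its own table, locally
assert (y0 + y1) % P == relu_field(x)
```

Because the online phase is just one reveal and one table read, there are no per-ReLU garbled-circuit transfers, which is where the communication savings come from.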
Thierry Tambe, David Brooks, and Gu-Yeon Wei. 3/1/2022. “Learnings from a HLS-based High-Productivity Digital VLSI Flow.” In Workshop on Languages, Tools, and Techniques for Accelerator Design (LATTE'22).
The twilight of Dennard scaling has activated a global trend towards application-based hardware specialization. This trend is currently accelerating due to the surging democratization and deployment of machine learning on mobile and IoT compute platforms. At the same time, the growing complexity of specialized system-on-chips (SoCs) is levying a more laborious tax on ASIC companies’ design and verification efforts. High-level synthesis (HLS) is emerging as a foremost agile VLSI development methodology gaining increasing adoption in the hardware design community. However, concerns over Quality of Results (QoR) remain a key factor inhibiting more mainstream adoption of HLS. Obtaining optimal PPA outcomes can sometimes be an elusive or challenging task and strongly correlates with the syntactic approach of the high-level source code. In this paper, we aim to share the proven HLS practices we employed to raise the level of confidence in the post-silicon functional and performance expectations from our accelerator designs. In doing so, we recount some of the main challenges we encountered in our HLS-based hardware-software co-design journey and offer a few recommendations cultivated from our learnings. Finally, we posit on where the research opportunities to further improve design QoR and HLS user experience lie.
Bo-Yuan Huang, Steven Lyubomirsky, Yi Li, Mike He, Thierry Tambe, Gus Henry Smith, Akash Gaonkar, Vishal Canumalla, Gu-Yeon Wei, Aarti Gupta, Zachary Tatlock, and Sharad Malik. 3/1/2022. “Specialized Accelerators and Compiler Flows: Replacing Accelerator APIs with a Formal Software/Hardware Interface.”
Specialized accelerators are increasingly used to meet the power-performance goals of emerging applications such as machine learning, image processing, and graph analysis. Existing accelerator programming methodologies using APIs have several limitations: (1) The application code lacks portability to other platforms and compiler frameworks; (2) the lack of integration of accelerator code in the compiler limits useful optimizations such as instruction selection and operator fusion; and (3) the opacity of the accelerator function semantics limits the ability to check the final code for correctness. The root of these limitations is the lack of a formal software/hardware interface specification for accelerators.
In this paper, we use the recently developed Instruction-Level Abstraction (ILA) for accelerators to serve this purpose, similar to how the Instruction Set Architecture (ISA) has been used as the software/hardware interface for processors. We propose a compiler flow termed D2A using the ILA and present a prototype that demonstrates this flow for deep learning (DL) applications. This prototype compiles programs from high-level domain-specific languages, e.g., PyTorch and MxNet, to multiple target accelerators with no target-specific extensions to the application or compiler, thus demonstrating application portability. It includes compiler optimizations through instruction selection using equality saturation-based flexible matching. Finally, we show checking the correctness of the resulting code through formal verification of individual matched operations and fully automated simulation-based validation of complete applications. The evaluation of the prototype compiler is based on six different DL applications and three different accelerators. Overall, this methodology lays the foundation for integrating accelerators in compiler flows using a formal software/hardware interface.
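As a rough intuition for instruction selection against an accelerator's instruction-level semantics, here is a toy Python sketch that rewrites operator-graph patterns into accelerator instructions; the "ISA," pattern, and names are invented for illustration and are far simpler than D2A's equality-saturation-based matching.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    args: tuple

fresh = (f"t{i}" for i in itertools.count())  # fresh destination names

def select(node):
    # Leaves are tensor names; they lower to themselves with no code.
    if not isinstance(node, Op):
        return node, []
    # Largest pattern first: relu(add(matmul(x, y), z)) -> one fused instruction.
    if (node.name == "relu" and isinstance(node.args[0], Op)
            and node.args[0].name == "add"
            and isinstance(node.args[0].args[0], Op)
            and node.args[0].args[0].name == "matmul"):
        mm, bias = node.args[0].args[0], node.args[0].args[1]
        dst = next(fresh)
        return dst, [f"ACC.gemm_bias_relu({mm.args[0]}, {mm.args[1]}, {bias}) -> {dst}"]
    # Fallback: lower children, then emit one instruction for this op.
    vals, code = [], []
    for a in node.args:
        v, c = select(a)
        vals.append(v)
        code += c
    dst = next(fresh)
    code.append(f"ACC.{node.name}({', '.join(map(str, vals))}) -> {dst}")
    return dst, code

# Application graph relu(matmul(A, B) + C), as a framework importer might produce.
graph = Op("relu", (Op("add", (Op("matmul", ("A", "B")), "C")),))
_, program = select(graph)
print("\n".join(program))   # -> ACC.gemm_bias_relu(A, B, C) -> t0
```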