Publications by Author: David Brooks

2023
Matthew Adiletta, Jesmin Jahan Tithi, Emmanouil-Ioannis Farsarakis, Gerasimos Gerogiannis, Robert Adolf, Robert Benke, Sidharth Kashyap, Samuel Hsia, Kartik Lakhotia, Fabrizio Petrini, Gu-Yeon Wei, and David Brooks. 4/24/2023. “Characterizing the Scalability of Graph Convolutional Networks on Intel® PIUMA.” In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Raleigh, North Carolina.

Large-scale Graph Convolutional Network (GCN) inference on traditional CPU/GPU systems is challenging due to a large memory footprint, sparse computational patterns, and irregular memory accesses with poor locality. Intel's Programmable Integrated Unified Memory Architecture (PIUMA) is designed to address these challenges for graph analytics. In this paper, a detailed characterization of GCNs is presented using the Open-Graph Benchmark (OGB) datasets to determine the viability of PIUMA as a potential solution to GCN scalability.

First, the extent of sparse matrix-dense matrix multiplication (SpMM) as a performance driver for GCN on CPU and GPU is explored, offering a methodology for predicting GCN behavior as a function of dataset characteristics. Second, an SpMM kernel optimized for PIUMA is described and investigated for sensitivity to system parameters including memory bandwidth, latency, and thread count. SpMM scalability on PIUMA is demonstrated, while the scalability limitations of a Xeon-optimized SpMM implementation are discussed. Finally, GCN performance is compared on PIUMA versus a Xeon CPU system and Ampere GPU system, showing impressive results on PIUMA for large-scale datasets.

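For readers who want a concrete picture of the kernel being characterized, the sketch below shows the SpMM at the heart of GCN aggregation: a sparse, row-normalized adjacency matrix multiplying a dense node-feature matrix. The SciPy implementation, random graph, and sizes are illustrative assumptions only, not the paper's PIUMA kernel or the OGB datasets.

```python
# Illustrative SpMM for GCN aggregation (assumed sizes and a random graph,
# not the paper's PIUMA kernel or OGB datasets).
import numpy as np
import scipy.sparse as sp

num_nodes, embedding_dim = 10_000, 128
features = np.random.rand(num_nodes, embedding_dim).astype(np.float32)

# Random sparse adjacency with ~10 neighbors per node, symmetrized.
adj = sp.random(num_nodes, num_nodes, density=10 / num_nodes, format="csr")
adj = adj + adj.T

# Row-normalize so each node averages its neighbors' features.
deg = np.asarray(adj.sum(axis=1)).ravel()
deg[deg == 0] = 1.0
adj = sp.diags(1.0 / deg) @ adj

# The SpMM: irregular, low-locality reads of `features` driven by graph structure.
aggregated = adj @ features
print(aggregated.shape)  # (10000, 128)
```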
Samuel Hsia, Udit Gupta, Bilge Acun, Newsha Ardalani, Pan Zhong, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. 3/25/2023. “MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation.” In ASPLOS. Vancouver, Canada.

Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of content. The reliance on a fixed embedding representation of embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec -- a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms.

On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to a 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.

2022
Matthew Adiletta, David Brooks, and Gu-Yeon Wei. 11/28/2022. Architectural Implications of Embedding Dimension During GCN on CPU and GPU. Cambridge: Harvard University.
Graph Neural Networks (GNNs) are a class of neural networks designed to extract information from the graphical structure of data. Graph Convolutional Networks (GCNs) are a widely used type of GNN for transductive graph learning problems which apply convolution to learn information from graphs. GCN is a challenging algorithm from an architecture perspective due to inherent sparsity, low data reuse, and massive memory capacity requirements. Traditional neural algorithms exploit the high compute capacity of GPUs to achieve high performance for both inference and training. The architectural decision to use a GPU for GCN inference is a question explored in this work. GCN on both CPU and GPU was characterized in order to better understand the implications of graph size, embedding dimension, and sampling on performance.
Abdulrahman Mahmoud, Thierry Tambe, Tarek Aloui, David Brooks, and Gu-Yeon Wei. 6/27/2022. “GoldenEye: A Platform for Evaluating Emerging Numerical Data Formats in DNN Accelerators.” 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). Baltimore, Maryland, USA.
This paper presents GoldenEye, a functional simulator with fault injection capabilities for common and emerging numerical formats, implemented for the PyTorch deep learning framework. GoldenEye provides a unified framework for numerical format evaluation of DNNs, including traditional number systems such as fixed and floating point, as well as recent DNN-inspired formats such as block floating point and AdaptivFloat. Additionally, GoldenEye enables single- and multi-bit flips at various logical and functional points during a value’s lifetime for resiliency analysis, including for the first time attention to numerical values’ hardware metadata. This paper describes GoldenEye’s technical design and implementation which make it an easy-to-use, extensible, versatile, and fast tool for dependability research and future DNN accelerator design. We showcase its utility with three case studies: a unifying platform for number system comparison and evaluation, a design-space exploration heuristic for data type selection, and fast DNN reliability analysis for different error models. GoldenEye is open-sourced and available at: https://github.com/ma3mool/goldeneye.
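As a generic illustration of the kind of bit-level fault injection described above, the sketch below flips one bit of a single float32 element in a PyTorch tensor through its IEEE-754 encoding; this is not GoldenEye's API, and the helper names and indices are invented for the example.

```python
# Generic single-bit-flip injection into a float32 tensor element
# (illustrative only; not GoldenEye's interface).
import struct
import torch

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 value via its IEEE-754 bit pattern."""
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))[0]

def inject_fault(tensor: torch.Tensor, index: int, bit: int) -> torch.Tensor:
    """Return a copy of `tensor` with one bit flipped in one flattened element."""
    flat = tensor.flatten().clone()
    flat[index] = flip_bit(float(flat[index]), bit)
    return flat.view_as(tensor)

weights = torch.randn(4, 4)
corrupted = inject_fault(weights, index=5, bit=30)   # flip a high exponent bit
print((weights - corrupted).abs().max())             # one flip, large deviation
```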
Tianyu Jia, En-Yu Yang, Yu-Shun Hsiao, Jonathan Cruz, David Brooks, Gu-Yeon Wei, and Vijay Janapa Reddi. 3/14/2022. “OMU: A Probabilistic 3D Occupancy Mapping Accelerator for Real-time OctoMap at the Edge.” In Design, Automation, and Test in Europe (DATE).
Maximilian Lam, Michael Mitzenmacher, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. 3/5/2022. “Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference.”
Multiparty computation approaches to private neural network inference require significant communication between server and client, incur tremendous runtime penalties, and cost massive storage overheads. The primary source of these expenses is garbled circuits operations for nonlinear activation functions (typically ReLU), which require on the order of kilobytes of data transfer for each individual operation and tens of kilobytes of preprocessing storage per operation per inference. We propose a replacement for garbled circuits: Tabula, an algorithm to securely and efficiently perform single operand nonlinear functions for private neural network inference. Tabula performs a one time client initialization procedure with the help of a trusted third party (or via using fully homomorphic encryption), operates over smaller finite fields whose elements are representable with less than 16 bits, and employs a lookup table which stores the encrypted results of nonlinear operations over secretly shared values. We show Tabula is secure under a semi-honest threat model, allowing it to be used as a replacement for garbled circuits operations. Our results show that for private neural network inference, Tabula eliminates communication by a factor of more than 50×, enables speedups over 10×, and reduces storage costs from O(n) to O(1).
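To make the lookup-table idea concrete, here is a deliberately simplified NumPy sketch: a trusted dealer secret-shares a table of ReLU outputs indexed by a masked input, so the online phase needs only one reveal and one table lookup. The field size, masking scheme, and dealer setup are simplifying assumptions for illustration and do not reproduce the actual Tabula protocol.

```python
# Simplified illustration of the lookup-table idea (not the actual Tabula
# protocol): a one-time setup secret-shares a table of f indexed by a masked
# input, so the online phase needs only one reveal and one table lookup.
import numpy as np

P = 1 << 16                                   # small finite field, <16-bit elements
rng = np.random.default_rng(0)

def relu_mod(v):                              # ReLU on a field element viewed as signed
    signed = v - P if v >= P // 2 else v
    return max(signed, 0) % P

# One-time setup (trusted dealer): mask s, shares of s, and shares of the table.
s = int(rng.integers(P))
s0 = int(rng.integers(P)); s1 = (s - s0) % P
table = np.array([relu_mod((v - s) % P) for v in range(P)], dtype=np.int64)
t0 = rng.integers(P, size=P, dtype=np.int64); t1 = (table - t0) % P

# Online phase on a secret-shared input x = x0 + x1 (mod P).
x = (-7) % P
x0 = int(rng.integers(P)); x1 = (x - x0) % P
d0, d1 = (x0 + s0) % P, (x1 + s1) % P         # each party publishes its masked share
v = (d0 + d1) % P                             # revealed index; the mask s hides x
y0, y1 = t0[v], t1[v]                         # each party looks up its table share
assert (y0 + y1) % P == relu_mod(x)           # reconstruction yields ReLU(x) = 0
```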
Thierry Tambe, David Brooks, and Gu-Yeon Wei. 3/1/2022. “Learnings from a HLS-based High-Productivity Digital VLSI Flow.” In Workshop on Languages, Tools, and Techniques for Accelerator Design (LATTE'22).
The twilight of Dennard scaling has activated a global trend towards application-based hardware specialization. This trend is currently accelerating due to the surging democratization and deployment of machine learning on mobile and IoT compute platforms. At the same time, the growing complexity of specialized system-on-chips (SoCs) is levying a more laborious tax on ASIC companies’ design and verification efforts. High-level synthesis (HLS) is emerging as a foremost agile VLSI development methodology gaining increasing adoption in the hardware design community. However, concerns over Quality of Results (QoR) remain a key factor inhibiting more mainstream adoption of HLS. Obtaining optimal PPA outcomes can sometimes be an elusive or challenging task and strongly correlates with the syntactic approach of the high-level source code. In this paper, we aim to share the proven HLS practices we employed to raise the level of confidence in the post-silicon functional and performance expectations from our accelerator designs. In doing so, we recount some of the main challenges we encountered in our HLS-based hardware-software co-design journey and offer a few recommendations cultivated from our learnings. Finally, we posit on where the research opportunities to further improve design QoR and HLS user experience lie.
2021
Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. 10/17/2021. “EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference.” IEEE/ACM International Symposium on Microarchitecture (MICRO 2021).
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system generates up to 7x, 2.5x, and 53x lower energy compared to the conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
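A minimal sketch of the entropy-based early-exit decision described above (illustrative only, not EdgeBERT's implementation; the toy layers, exit heads, and threshold are assumptions): an intermediate classifier's output entropy determines whether the remaining layers run at all, which creates the latency slack that DVFS can then exploit.

```python
# Entropy-based early exit, sketched with toy stand-ins for transformer layers.
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

def early_exit_forward(layers, exit_heads, x, threshold=0.2):
    """Run layers in order; stop at the first exit head that is confident enough."""
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        logits = head(x.mean(dim=1))           # pooled sentence representation
        if entropy(logits).item() < threshold:
            return logits                       # confident: skip the remaining layers
    return logits                               # fell through: full-depth result

layers = [torch.nn.Linear(16, 16) for _ in range(4)]       # toy "transformer layers"
exit_heads = [torch.nn.Linear(16, 3) for _ in range(4)]    # toy per-layer classifiers
sentence = torch.randn(1, 8, 16)                            # (batch, tokens, hidden)
print(early_exit_forward(layers, exit_heads, sentence))
```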
Maximilian Lam, Gu-Yeon Wei, David Brooks, Vijay Janapa Reddi, and Michael Mitzenmacher. 6/2021. “Gradient Disaggregation: Breaking Privacy in Federated Learning by Reconstructing the User Participant Matrix.”
We show that aggregated model updates in federated learning may be insecure. An untrusted central server may disaggregate user updates from sums of updates across participants given repeated observations, enabling the server to recover privileged information about individual users' private training data via traditional gradient inference attacks. Our method revolves around reconstructing participant information (e.g., which rounds of training users participated in) from aggregated model updates by leveraging summary information from device analytics commonly used to monitor, debug, and manage federated learning systems. Our attack is parallelizable and we successfully disaggregate user updates on settings with up to thousands of participants. We quantitatively and qualitatively demonstrate significant improvements in the capability of various inference attacks on the disaggregated updates. Our attack enables the attribution of learned properties to individual users, violating anonymity, and shows that a determined central server may undermine the secure aggregation protocol to break individual users' data privacy in federated learning.
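The linear-algebraic core can be sketched in a few lines (a toy setup with fixed per-user updates and a known participant matrix; the real attack must first reconstruct that matrix and handle updates that vary across rounds): once the round-by-user participation matrix P is recovered, individual updates follow from the observed aggregates by least squares.

```python
# Toy sketch of the linear-algebraic core of gradient disaggregation
# (not the full attack): with P recovered, solve A = P @ U for U.
import numpy as np

rng = np.random.default_rng(0)
num_rounds, num_users, dim = 200, 50, 16

user_updates = rng.normal(size=(num_users, dim))               # fixed per-user updates (simplification)
P = (rng.random((num_rounds, num_users)) < 0.3).astype(float)  # who joined each round
aggregates = P @ user_updates                                   # what the server observes

# With P recovered (e.g., from device analytics), disaggregate by least squares.
recovered, *_ = np.linalg.lstsq(P, aggregates, rcond=None)
print(np.allclose(recovered, user_updates, atol=1e-6))          # True: updates exposed
```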
Yu-Shun Hsiao, Zishen Wan, Tianyu Jia, Radhika Ghosal, Arijit Raychowdhury, David Brooks, Gu-Yeon Wei, and Vijay Janapa Reddi. 5/27/2021. “MAVFI: An End-to-End Fault Analysis Framework with Anomaly Detection and Recovery for Micro Aerial Vehicles.”
Reliability and safety are critical in autonomous machine services, such as autonomous vehicles and aerial drones. In this paper, we first present an open-source Micro Aerial Vehicle (MAV) reliability analysis framework, MAVFI, to characterize the impact of transient faults on end-to-end flight metrics, e.g., flight time and success rate. Based on our framework, we observe that end-to-end fault tolerance analysis is essential for characterizing system reliability. We demonstrate that the planning and control stages are more vulnerable to transient faults than the visual perception stage in the common "Perception-Planning-Control (PPC)" compute pipeline. Furthermore, to improve the reliability of the MAV system, we propose two low-overhead anomaly-based transient fault detection and recovery schemes based on Gaussian statistical models and autoencoder neural networks. We validate our anomaly fault protection schemes in a variety of simulated photo-realistic environments on both an Intel i9 CPU and the ARM Cortex-A57 on an Nvidia TX2 platform. We demonstrate that the autoencoder-based scheme can improve system reliability by 100%, recovering failure cases with less than 0.0062% computational overhead in best-case scenarios. In addition, the MAVFI framework can be used for other ROS-based cyber-physical applications and is open-sourced at this https URL.
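As a rough illustration of the Gaussian-statistics variant of the detector (the profiling data, threshold, and k-sigma rule here are assumptions for the example, not MAVFI's actual models): outputs that fall far outside the distribution observed during fault-free runs are flagged, and a detection would trigger recomputation or recovery.

```python
# k-sigma anomaly check against statistics profiled from clean runs
# (illustrative stand-in for a Gaussian-model fault detector).
import numpy as np

rng = np.random.default_rng(0)
clean_outputs = rng.normal(loc=1.0, scale=0.1, size=5000)   # fault-free profiling runs
mean, std = clean_outputs.mean(), clean_outputs.std()

def is_anomalous(value: float, k: float = 4.0) -> bool:
    """Flag values more than k standard deviations from the clean-run mean."""
    return abs(value - mean) > k * std

print(is_anomalous(1.02))   # False: within the normal range
print(is_anomalous(3.50))   # True: likely corrupted by a transient fault
```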
Udit Gupta, Samuel Hsia, Jeff Zhang, Mark Wilkening, Javin Pombra, Hsien-Hsin S. Lee, Gu-Yeon Wei, Carole-Jean Wu, and David Brooks. 5/22/2021. “RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance.” MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Pp. 870–884.
Deep learning recommendation systems must provide high quality, personalized content under strict tail-latency targets and high system loads. This paper presents RecPipe, a system to jointly optimize recommendation quality and inference performance. Central to RecPipe is decomposing recommendation models into multi-stage pipelines to maintain quality while reducing compute complexity and exposing distinct parallelism opportunities. RecPipe implements an inference scheduler to map multi-stage recommendation engines onto commodity, heterogeneous platforms (e.g., CPUs, GPUs). While the hardware-aware scheduling improves ranking efficiency, the commodity platforms suffer from many limitations requiring specialized hardware. Thus, we design RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail-latency, and system throughput. RPAccel is designed specifically to exploit the distinct design space opened via RecPipe. In particular, RPAccel processes queries in sub-batches to pipeline recommendation stages, implements dual static and dynamic embedding caches, a set of top-k filtering units, and a reconfigurable systolic array. Compared to prior-art and at iso-quality, we demonstrate that RPAccel improves latency and throughput by 3x and 6x.
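The multi-stage idea can be illustrated with a toy two-stage ranking sketch (the stand-in models, sizes, and top-k value are invented for the example, not RecPipe's models): a cheap front-end scores every candidate, and only the top-k survivors reach a heavier back-end model, which is what reduces compute while preserving quality.

```python
# Toy two-stage recommendation pipeline with top-k filtering between stages.
import numpy as np

rng = np.random.default_rng(0)
num_items, k = 100_000, 512
item_features = rng.normal(size=(num_items, 32))
w_light = rng.normal(size=32)                     # cheap linear front-end scorer
w_heavy = rng.normal(size=(32, 16))               # heavier back-end model weights

def light_model(x):
    return x @ w_light

def heavy_model(x):
    return np.tanh(x @ w_heavy).sum(axis=1)

stage1_scores = light_model(item_features)             # score all candidates cheaply
survivors = np.argpartition(stage1_scores, -k)[-k:]    # top-k filtering
stage2_scores = heavy_model(item_features[survivors])  # heavy model on ~0.5% of items
top10 = survivors[np.argsort(stage2_scores)[::-1][:10]]
print(top10)
```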
Thierry Tambe, En-Yu Yang, Glenn G. Ko, Yuji Chai, Coleman Hooper, Marco Donato, Paul N. Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. 2/13/2021. “A 25mm2 SoC for IoT Devices with 18ms Noise Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET.” International Solid-State Circuits Conference (ISSCC'21).
Automatic speech recognition (ASR) using deep learning is essential for user interfaces on IoT devices. However, previously published ASR chips [4-7] do not consider realistic operating conditions, which are typically noisy and may include more than one speaker. Furthermore, several of these works have implemented only small-vocabulary tasks, such as keyword-spotting (KWS), where context-blind deep neural network (DNN) algorithms are adequate. However, for large-vocabulary tasks (e.g., >100k words), the more complex bidirectional RNNs with an attention mechanism [1] provide context learning in long sequences, which improve ASR accuracy by up to 62% on the 200k-word LibriSpeech dataset, compared to a simpler unidirectional RNN (Fig. 9.8.1). Attention-based networks emphasize the most relevant parts of the source sequence during each decoding time step. In doing so, the encoder sequence is treated as a soft-addressable memory whose positions are weighted based on the state of the decoder RNN. Bidirectional RNNs learn past and future temporal information by concatenating forward and backward time steps.
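For readers unfamiliar with the attention step summarized above, the NumPy sketch below shows the soft addressing of the encoder memory by the decoder state (illustrative dimensions and a dot-product score; the chip's actual network and scoring function are not reproduced here).

```python
# Soft addressing of the encoder "memory" by the current decoder state.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(50, 256))   # one vector per input time step
decoder_state = rng.normal(size=256)          # current decoder RNN state

scores = encoder_states @ decoder_state       # dot-product relevance scores
weights = softmax(scores)                     # soft addressing over positions
context = weights @ encoder_states            # weighted sum fed to the decoder
print(context.shape)                          # (256,)
```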
Mark Wilkening, Udit Gupta, Samuel Hsia, Caroline Trippel, Carole-Jean Wu, David Brooks, and Gu-Yeon Wei. 1/29/2021. “RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference.” ASPLOS 2021: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Pp. 717–729.
Neural personalized recommendation models are used across a wide variety of datacenter applications including search, social media, and entertainment. State-of-the-art models comprise large embedding tables that have billions of parameters requiring large memory capacities. Unfortunately, large and fast DRAM-based memories levy high infrastructure costs. Conventional SSD-based storage solutions offer an order of magnitude larger capacity, but have worse read latency and bandwidth, degrading inference performance. RecSSD is a near data processing based SSD memory system customized for neural recommendation inference that reduces end-to-end model inference latency by 2× compared to using COTS SSDs across eight industry-representative models.
2020
Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, and David Brooks. 10/27/2020. “Cross-Stack Workload Characterization of Deep Recommendation Systems.” In 2020 IEEE International Symposium on Workload Characterization (IISWC).

Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches - ranging from near memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can give us up to 15x speedup. To better understand the bottlenecks for further optimization, we look at both software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.

Brandon Reagen, Wooseok Choi, Yeongil Ko, Vincent Lee, Hsien-Hsin S. Lee, Gu-Yeon Wei, and David Brooks. 10/8/2020. “Cheetah: Optimizations and Methods for Privacy-Preserving Inference via Homomorphic Encryption.”
As the application of deep learning continues to grow, so does the amount of data used to make predictions. While traditionally, big-data deep learning was constrained by computing performance and off-chip memory bandwidth, a new constraint has emerged: privacy. One solution is homomorphic encryption (HE). Applying HE to the client-cloud model allows cloud services to perform inference directly on the client's encrypted data. While HE can meet privacy constraints, it introduces enormous computational challenges and remains impractically slow in current systems.
This paper introduces Cheetah, a set of algorithmic and hardware optimizations for HE DNN inference to achieve plaintext DNN inference speeds. Cheetah proposes HE-parameter tuning optimization and operator scheduling optimizations, which together deliver 79x speedup over the state-of-the-art. However, this still falls short of plaintext inference speeds by almost four orders of magnitude. To bridge the remaining performance gap, Cheetah further proposes an accelerator architecture that, when combined with the algorithmic optimizations, approaches plaintext DNN inference speeds. We evaluate several common neural network models (e.g., ResNet50, VGG16, and AlexNet) and show that plaintext-level HE inference for each is feasible with a custom accelerator consuming 30W and 545mm^2.
Srivatsan Krishnan, Zishen Wan, Kshitij Bhardwaj, Paul Whatmough, Aleksandra Faust, Gu-Yeon Wei, David Brooks, and Vijay Janapa Reddi. 9/16/2020. “The Sky Is Not the Limit: A Visual Performance Model for Cyber-Physical Co-Design in Autonomous Machines.” IEEE Computer Architecture Letters, 19, 1, Pp. 38–42.
We introduce the “Formula-1” (F-1) roofline model to understand the role of computing in aerial autonomous machines. The model provides insights by exploiting the fundamental relationships between various components in an aerial robot, such as sensor framerate, compute performance, and body dynamics (physics). F-1 serves as a tool that can aid computer and cyber-physical system architects to understand the optimal design (or selection) of various components in the development of autonomous machines.
Kshitij Bhardwaj, Marton Havasi, Yuan Yao, David Brooks, José Miguel Hernández-Lobato, and Gu-Yeon Wei. 8/10/2020. “A comprehensive methodology to determine optimal coherence interfaces for many-accelerator SoCs.” In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Pp. 145–150.

Modern systems-on-chip (SoCs) include not only general-purpose CPUs but also specialized hardware accelerators. Typically, there are three coherence model choices to integrate an accelerator with the memory hierarchy: no coherence, coherent with the last-level cache (LLC), and private cache based full coherence. However, there has been very limited research on finding which coherence models are optimal for the accelerators of a complex many-accelerator SoC. This paper focuses on determining a cost-aware coherence interface for an SoC and its target application: find the best coherence models for the accelerators that optimize their power and performance, considering both workload characteristics and system-level contention. A novel comprehensive methodology is proposed that uses Bayesian optimization to efficiently find the cost-aware coherence interfaces for SoCs that are modeled using the gem5-Aladdin architectural simulator. For a complete analysis, gem5-Aladdin is extended to support LLC coherence in addition to already-supported no coherence and full coherence. For a heterogeneous SoC targeting applications with varying amount of accelerator-level parallelism, the proposed framework rapidly finds cost-aware coherence interfaces that show significant performance and power benefits over the other commonly-used coherence interfaces.

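A hedged sketch of the search loop described above: Bayesian optimization over per-accelerator coherence choices, with a stand-in cost function in place of a gem5-Aladdin simulation. The scikit-optimize library, the toy latency/power weights, and the single-scalar objective are assumptions for illustration, not the paper's actual framework or objective.

```python
# Bayesian optimization over discrete coherence choices (illustrative only).
from skopt import gp_minimize
from skopt.space import Categorical

COHERENCE_MODELS = ["non_coherent", "llc_coherent", "fully_coherent"]
NUM_ACCELERATORS = 4

def simulate_cost(choices):
    """Stand-in for a simulator run returning a combined power/performance cost."""
    # Toy model: full coherence is fast but power-hungry, no coherence the opposite.
    latency = sum({"non_coherent": 3.0, "llc_coherent": 2.0, "fully_coherent": 1.5}[c]
                  for c in choices)
    power = sum({"non_coherent": 1.0, "llc_coherent": 1.5, "fully_coherent": 2.5}[c]
                for c in choices)
    return latency + 0.5 * power              # single scalar cost to minimize

space = [Categorical(COHERENCE_MODELS, name=f"acc{i}") for i in range(NUM_ACCELERATORS)]
result = gp_minimize(simulate_cost, space, n_calls=25, random_state=0)
print(result.x, result.fun)                   # best per-accelerator coherence choices
```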
Paul Whatmough, Marco Donato, Glenn Ko, David Brooks, and Gu-Yeon Wei. 8/1/2020. “CHIPKIT: An agile, reusable open-source framework for rapid test chip development.” IEEE Micro, 40, 4, Pp. 32–40.
The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem which provides basic IO, an on-chip programmable host, memory and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool called VGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. 7/20/2020. “Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference.” In Design Automation Conference (DAC 2020). San Francisco, CA, USA.
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision as their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point inspired number format, AdaptivFloat, that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at low bit precision (≤ 8-bit) across a diverse set of state-of-the-art neural networks exhibiting narrow to wide weight distributions. Notably, at 4-bit weight precision, only a 2.1 degradation in BLEU score is observed on the AdaptivFloat-quantized Transformer network compared to total accuracy loss when encoded in the above-mentioned prominent datatypes. Furthermore, experimental results on a deep neural network (DNN) processing element (PE), exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that is 0.9× and 1.14×, respectively, that of an equivalent bit width NVDLA-like integer-based PE.
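The sketch below conveys the gist of a tensor-adaptive floating-point encoding (a simplified NumPy approximation with an assumed bit-width split and rounding rule, not the paper's exact AdaptivFloat algorithm or its hardware datapath): the exponent range is re-biased per tensor from the largest magnitude, so the few available exponent values cover where the layer's values actually live.

```python
# Simplified tensor-adaptive float quantizer (illustrative; not the exact
# AdaptivFloat algorithm). One sign bit, n_mantissa fraction bits, and the
# remaining bits form an exponent whose bias is chosen from max|x|.
import numpy as np

def adaptive_float_quantize(x, n_bits=8, n_mantissa=3):
    n_exp = n_bits - 1 - n_mantissa
    max_exp = np.floor(np.log2(np.abs(x).max() + 1e-30))
    bias = max_exp - (2 ** n_exp - 1)                 # shift the representable exponent range
    sign, mag = np.sign(x), np.abs(x)
    exp = np.clip(np.floor(np.log2(mag + 1e-30)), bias, max_exp)
    mant = np.round(mag / 2.0 ** exp * 2 ** n_mantissa) / 2 ** n_mantissa
    mant = np.minimum(mant, 2 - 2.0 ** -n_mantissa)   # clip to the format's largest mantissa
    return sign * mant * 2.0 ** exp

weights = np.random.randn(1000) * 0.05                # narrow weight distribution
quantized = adaptive_float_quantize(weights, n_bits=8, n_mantissa=3)
print(np.abs(weights - quantized).max())              # worst-case quantization error
```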
Antonino Tumeo, Marco Minutoli, Giovanni Castellana, Joseph Manzano, Vinay Amatya, David Brooks, and Gu-Yeon Wei. 7/20/2020. “Software Defined Accelerators From Learning Tools Environment.” In 2020 57th ACM/IEEE Design Automation Conference (DAC), Pp. 1–6. IEEE.
Next generation systems, such as edge devices, will need to provide efficient processing of machine learning (ML) algorithms along several metrics, including energy, performance, area, and latency. However, the quickly evolving field of ML makes it extremely difficult to generate accelerators able to support a wide variety of algorithms. At the same time, designing accelerators in hardware description languages (HDLs) by hand is hard and time consuming, and does not allow quick exploration of the design space. In this paper we present the Software Defined Accelerators From Learning Tools Environment (SODALITE), an automated open source high-level ML framework-to-verilog compiler targeting ML Application-Specific Integrated Circuits (ASICs) chiplets. The SODALITE approach will implement optimal designs by seamlessly combining custom components generated through high-level synthesis (HLS) with templated and fully tunable Intellectual Properties (IPs) and macros, integrated in an extendable resource library. Through a closed loop design space exploration engine, developers will be able to quickly explore their hardware designs along different dimensions.
