Publications by Author: Marco Donato

2021
Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. 10/17/2021. “EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference.” IEEE/ACM International Symposium on Microarchitecture (MICRO 2021).
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system generates up to 7x, 2.5x, and 53x lower energy compared to the conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
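To make the interplay between early exit and DVFS concrete, here is a minimal Python sketch of entropy-based exit prediction feeding a frequency choice. It is an illustrative sketch only, not EdgeBERT's implementation: the per-layer exit classifiers, the entropy threshold, the cycles-per-layer cost, and the frequency table are all assumed placeholders.

```python
# Illustrative sketch (not the EdgeBERT implementation): entropy-based early exit
# combined with a per-sentence DVFS choice. `layers` and `exit_heads` are assumed
# to be callables supplied by the user; thresholds and the frequency table are
# hypothetical placeholders.
import numpy as np

def entropy(logits):
    """Shannon entropy of the softmax distribution over class logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def early_exit_inference(layers, exit_heads, x, entropy_threshold=0.3):
    """Run encoder layers until an exit head is confident enough (low entropy).
    Returns the prediction and the number of layers actually executed."""
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        h = layer(h)
        logits = head(h)
        if entropy(logits) < entropy_threshold:
            return int(np.argmax(logits)), i
    return int(np.argmax(logits)), len(layers)

def pick_dvfs_point(layers_to_run, target_latency_ms, cycles_per_layer=1e6,
                    freq_table_mhz=(50, 100, 200, 400)):
    """Choose the lowest clock frequency (hence lowest voltage and energy) that
    still finishes the predicted number of layers within the latency target."""
    for f in freq_table_mhz:
        latency_ms = layers_to_run * cycles_per_layer / (f * 1e3)
        if latency_ms <= target_latency_ms:
            return f
    return freq_table_mhz[-1]
```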
Thierry Tambe, En-Yu Yang, Glenn G. Ko, Yuji Chai, Coleman Hooper, Marco Donato, Paul N. Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. 2/13/2021. “A 25mm2 SoC for IoT Devices with 18ms Noise Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET.” International Solid-State Circuits Conference (ISSCC'21).
Automatic speech recognition (ASR) using deep learning is essential for user interfaces on IoT devices. However, previously published ASR chips [4-7] do not consider realistic operating conditions, which are typically noisy and may include more than one speaker. Furthermore, several of these works have implemented only small-vocabulary tasks, such as keyword-spotting (KWS), where context-blind deep neural network (DNN) algorithms are adequate. However, for large-vocabulary tasks (e.g., >100k words), the more complex bidirectional RNNs with an attention mechanism [1] provide context learning in long sequences, which improve ASR accuracy by up to 62% on the 200kwords LibriSpeech dataset, compared to a simpler unidirectional RNN (Fig. 9.8.1). Attention-based networks emphasize the most relevant parts of the source sequence during each decoding time step. In doing so, the encoder sequence is treated as a soft-addressable memory whose positions are weighted based on the state of the decoder RNN. Bidirectional RNNs learn past and future temporal information by concatenating forward and backward time steps.
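As a concrete illustration of the "soft-addressable memory" view of attention mentioned above, the short sketch below computes one decoding step's context vector with simple dot-product scoring; the scoring function, shapes, and the toy query are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of one attention step: the encoder outputs act as a
# soft-addressable memory whose positions are weighted by the decoder state.
import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (T, d) for a T-step input sequence.
    Returns the context vector and the attention weights over the T positions."""
    scores = encoder_states @ decoder_state           # (T,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over time steps
    context = weights @ encoder_states                # (d,) weighted sum
    return context, weights

# Toy usage: random encoder states, with the last state standing in as the query.
enc = np.random.default_rng(0).standard_normal((50, 128))
context, weights = attend(enc[-1], enc)
```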
2020
Glenn Ko, Yuji Chai, Marco Donato, Paul Whatmough, Thierry Tambe, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 8/18/2020. “A Scalable Bayesian Inference Accelerator for Unsupervised Learning.” In IEEE Hot Chips 32 Symposium. Palo Alto, CA, USA.
This article consists only of a collection of slides from the author's conference presentation.
Paul Whatmough, Marco Donato, Glenn Ko, David Brooks, and Gu-Yeon Wei. 8/1/2020. “CHIPKIT: An agile, reusable open-source framework for rapid test chip development.” IEEE Micro, 40, 4, Pp. 32-40.
The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem which provides basic IO, an on-chip programmable host, memory and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool called VGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.
Glenn Ko, Yuji Chai, Marco Donato, Paul Whatmough, Thierry Tambe, Rob Rutenbar, David Brooks, and Gu-Yeon Wei. 6/16/2020. “A 3mm2 Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm.” In IEEE Symposium on VLSI Circuits (VLSI).
This paper describes a 16nm programmable accelerator for unsupervised probabilistic machine perception tasks that performs Bayesian inference on probabilistic models mapped onto a 2D Markov Random Field, using MCMC. Exploiting two degrees of parallelism, it performs Gibbs sampling inference up to 1380× faster with 1965× less energy than an Arm Cortex-A53 on the same SoC, and 1.5× faster with 6.3× less energy than an embedded FPGA in the same technology. At 0.8V, it runs at 450MHz, producing 44.6 MSamples/s at 0.88 nJ/sample.
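For readers unfamiliar with the underlying algorithm, the sketch below runs Gibbs sampling on a binary 2D Markov Random Field in plain Python, using the common checkerboard (two-color) update so that non-adjacent sites could be sampled in parallel. The Ising-style potentials, coupling strength, and grid size are illustrative assumptions, not the accelerator's programmed model.

```python
# Illustrative Gibbs sampler for a binary 2D Markov Random Field; a software
# sketch of the kind of inference the accelerator performs in hardware.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(grid, coupling=0.8, field=0.0):
    """One full sweep over an Ising-style grid with states in {-1, +1}."""
    H, W = grid.shape
    for color in (0, 1):                      # update one checkerboard color at a time
        for i in range(H):
            for j in range(W):
                if (i + j) % 2 != color:
                    continue
                neighbors = 0.0
                if i > 0:     neighbors += grid[i - 1, j]
                if i < H - 1: neighbors += grid[i + 1, j]
                if j > 0:     neighbors += grid[i, j - 1]
                if j < W - 1: neighbors += grid[i, j + 1]
                # Conditional P(x_ij = +1 | neighbors) for an Ising-style model.
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (coupling * neighbors + field)))
                grid[i, j] = 1 if rng.random() < p_plus else -1
    return grid

grid = rng.choice([-1, 1], size=(32, 32))
for _ in range(100):                          # a short MCMC run on a toy 32x32 field
    gibbs_sweep(grid)
```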
2019
Marco Donato, Lillian Pentecost, David Brooks, and Gu-Yeon Wei. 10/4/2019. “MEMTI: Optimizing On-Chip Nonvolatile Storage for Visual Multitask Inference at the Edge.” IEEE Micro, 39, 6.
The combination of specialized hardware and embedded nonvolatile memories (eNVM) holds promise for energy-efficient deep neural network (DNN) inference at the edge. However, integrating DNN hardware accelerators with eNVMs still presents several challenges. Multilevel programming is desirable for achieving maximal storage density on chip, but the stochastic nature of eNVM writes makes them prone to errors and further increases the write energy and latency. In this article, we present MEMTI, a memory architecture that leverages a multitask learning technique for maximal reuse of DNN parameters across multiple visual tasks. We show that by retraining and updating only 10% of all DNN parameters, we can achieve efficient model adaptation across a variety of visual inference tasks. The system performance is evaluated by integrating the memory with the open-source NVIDIA Deep Learning Accelerator (NVDLA) architecture.
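The parameter-sharing idea can be pictured with a tiny PyTorch sketch: a frozen shared backbone (the portion that could sit in dense eNVM) plus a small task-specific head that is retrained per task. The architecture, sizes, and the resulting parameter fraction below are placeholders; the paper reports retraining roughly 10% of all DNN parameters.

```python
# Hedged sketch of multitask parameter reuse: shared weights stay fixed,
# only a small task-specific head is optimized. The model is a placeholder,
# not the network used in the paper.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
task_head = nn.Linear(64, 10)    # task-specific parameters, retrained per task

for p in backbone.parameters():
    p.requires_grad = False      # shared weights: written once, reused across tasks

optimizer = torch.optim.SGD(task_head.parameters(), lr=1e-2)  # only the head trains

shared = sum(p.numel() for p in backbone.parameters())
task = sum(p.numel() for p in task_head.parameters())
print(f"task-specific fraction of parameters: {task / (shared + task):.1%}")
```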
Lillian Pentecost, Marco Donato, Brandon Reagen, Udit Gupta, Siming Ma, Gu-Yeon Wei, and David Brooks. 10/1/2019. “MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation.” In MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Pp. 769–781.
Deeply embedded applications require low-power, low-cost hardware that fits within stringent area constraints. Deep learning has many potential uses in these domains, but introduces significant inefficiencies stemming from off-chip DRAM accesses of model weights. Ideally, models would fit entirely on-chip. However, even with compression, memory requirements for state-of-the-art models make on-chip inference impractical. Due to increased density, emerging eNVMs are one promising solution. We present MaxNVM, a principled co-design of sparse encodings, protective logic, and fault-prone MLC eNVM technologies (i.e., RRAM and CTT) to enable highly-efficient DNN inference. We find bit reduction techniques (e.g., clustering and sparse compression) increase weight vulnerability to faults. This limits the capabilities of MLC eNVM. To circumvent this limitation, we improve storage density (i.e., bits-per-cell) with minimal overhead using protective logic. Tradeoffs between density and reliability result in a rich design space. We show that by balancing these techniques, the weights of large networks are able to reasonably fit on-chip. Compared to a naive, single-level-cell eNVM solution, our highly-optimized MLC memory systems reduce weight area by up to 29×. We compare our technique against NVDLA, a state-of-the-art industry-grade CNN accelerator, and demonstrate up to 3.2× reduced power and up to 3.5× reduced energy per ResNet50 inference.
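A back-of-the-envelope sketch of the density/reliability tradeoff the abstract describes: packing more bits per cell cuts the number of cells, but the higher fault rate demands more protective logic. The cell configurations and ECC overheads below are hypothetical placeholders chosen only to show the shape of the tradeoff, not values from the paper.

```python
# Hypothetical numbers only: how bits-per-cell and protective-logic overhead
# interact when sizing an on-chip eNVM weight store.
def cells_needed(num_weight_bits, bits_per_cell, ecc_overhead):
    """Storage cells for a weight array, including protective-logic parity bits."""
    data_cells = num_weight_bits / bits_per_cell
    return data_cells * (1.0 + ecc_overhead)

weight_bits = 25e6 * 8              # e.g., a 25M-parameter model at 8 bits/weight
configs = {
    # bits/cell : assumed ECC overhead needed to tolerate the higher fault rate
    1: 0.00,                        # SLC: reliable, no protection assumed
    2: 0.10,
    3: 0.25,                        # MLC: denser, but needs stronger protection
}
for bpc, ecc in configs.items():
    print(bpc, "bits/cell ->", f"{cells_needed(weight_bits, bpc, ecc):.2e} cells")
```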
Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, and David Brooks. 8/23/2019. “MASR: A Modular Accelerator for Sparse RNNs.” In International Conference on Parallel Architectures and Compilation Techniques (PACT).
Recurrent neural networks (RNNs) are becoming the de facto solution for speech recognition. RNNs exploit long-term temporal relationships in data by applying repeated, learned transformations. Unlike fully-connected (FC) layers with single vector matrix operations, RNN layers consist of hundreds of such operations chained over time. This poses challenges unique to RNNs that are not found in convolutional neural networks (CNNs) or FC models, namely large dynamic activations. In this paper we present MASR, a principled and modular architecture that accelerates bidirectional RNNs for on-chip ASR. MASR is designed to exploit sparsity in both dynamic activations and static weights. The architecture is enhanced by a series of dynamic activation optimizations that enable compact storage, ensure no energy is wasted computing null operations, and maintain high MAC utilization for highly parallel accelerator designs. In comparison to current state-of-the-art sparse neural network accelerators (e.g., EIE), MASR provides 2x area, 3x energy, and 1.6x performance benefits. The modular nature of MASR enables designs that efficiently scale from resource-constrained low-power IoT applications to large-scale, highly parallel datacenter deployments.
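The numpy sketch below shows the kind of null-operation skipping the abstract refers to: only nonzero activations are consumed, and each one touches only the nonzero weights in its matrix column. The per-column index/value encoding is an illustrative choice, not necessarily MASR's on-chip format.

```python
# Sparse vector-matrix product that never issues a multiply against a zero
# activation or a zero weight (illustrative software analogue of the idea).
import numpy as np

def sparse_matvec(weight_cols, activations, out_dim):
    """weight_cols[j] is a (row_indices, values) pair for column j of W."""
    y = np.zeros(out_dim)
    for j, a in enumerate(activations):
        if a == 0.0:
            continue                       # skip null activation entirely
        rows, vals = weight_cols[j]
        y[rows] += a * vals                # only nonzero weights contribute
    return y

# Build a compressed per-column representation of a random sparse weight matrix.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16)) * (rng.random((8, 16)) < 0.2)   # ~80% zero weights
x = rng.standard_normal(16) * (rng.random(16) < 0.3)             # ~70% zero activations
cols = []
for j in range(W.shape[1]):
    rows = np.nonzero(W[:, j])[0]
    cols.append((rows, W[rows, j]))
assert np.allclose(sparse_matvec(cols, x, W.shape[0]), W @ x)
```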
Siming Ma, Marco Donato, Sae Kyu Lee, David Brooks, and Gu-Yeon Wei. 7/22/2019. “Fully-CMOS Multi-Level Embedded Non-Volatile Memory Devices With Reliable Long-Term Retention for Efficient Storage of Neural Network Weights.” IEEE Electron Device Letters, 40, 9, Pp. 1403–1406.
We present a fully CMOS-compatible multilevel non-volatile memory technology, without any special process cost. It is especially suitable for storing the weights of artificial neural networks on chip with low cost, high density, and high power-efficiency. We use hot carrier injection to program the single-transistor cells, and we conduct charge pumping experiments which identify interfacial traps, rather than bulk oxide traps, as the dominant factor in producing stable I-V shifts. We also derive a new physics-based experimentally verified logarithmic model to explain the rate of interfacial trap generation in large I-V shift regimes where the conventional power-law no longer applies. We fabricate two chips, one using TSMC's 16 nm FinFET and the other in 28 nm planar, and show the FinFET cells are more favorable for non-volatile memory due to their better channel control. We store multiple levels in each FinFET cell using a “program and check” strategy which sets memory cells' currents with standard deviations less than 2 μA across a shifting range of over 100 μA. We demonstrate 8-level FinFET cells with extrapolated 10-year charge loss within 10% at 125 °C.
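The "program and check" strategy amounts to a verify-after-pulse loop. The sketch below shows the control flow only, with a toy cell model standing in for the real device physics: the pulse step size, read noise, and starting current are invented for illustration (the paper's measured spread is under 2 μA over a >100 μA range).

```python
# Illustrative program-and-verify loop: apply short programming pulses and
# re-read the cell until its current enters the target band. Programming via
# hot carrier injection only lowers the read current, so the loop is one-way.
import random

def program_and_check(read_cell, apply_pulse, target_uA, tol_uA=2.0, max_pulses=100):
    """Drive a cell's read current down toward target_uA, verifying after each pulse."""
    for _ in range(max_pulses):
        current = read_cell()
        if current <= target_uA + tol_uA:   # reached the target band; stop pulsing
            return current
        apply_pulse()                       # one more programming pulse
    return read_cell()

# Toy cell model for demonstration only (real device behavior is far richer).
class ToyCell:
    def __init__(self):
        self.current = 110.0                # unprogrammed read current, in uA
    def read(self):
        return self.current + random.gauss(0, 0.3)     # small read noise
    def pulse(self):
        self.current -= random.uniform(1.0, 2.0)       # each pulse shifts the current

cell = ToyCell()
print(program_and_check(cell.read, cell.pulse, target_uA=60.0))
```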
Paul Whatmough, Sae Kyu Lee, Marco Donato, Hsea Hsueh, Sam Xi, Udit Gupta, Lillian Pentecost, Glenn Ko, David Brooks, and Gu-Yeon Wei. 6/2019. “A 16nm 25mm2 SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators.” Symposium on VLSI Circuits.
This paper presents a 25mm^2 SoC in 16nm FinFET technology targeting flexible acceleration of compute intensive kernels in DNN, DSP and security algorithms. The SoC includes an always-on sub-system, a dual-core Arm A53 CPU cluster, an embedded FPGA array, and a quad-core cache-coherent accelerator cluster. Measurement results demonstrate the following observations: 1) moving DSP/cryptography kernels from A53 to eFPGA increases energy efficiency between 5.5× and 28.9×, 2) the use of cache coherency for datapath accelerators increases throughput by 2.94×, and 3) accelerator flexibility-efficiency (GOPS/W) range spans from 3.1× (A53+SIMD), to 16.5× (eFPGA), to 54.5× (CCA) compared to the dual-core CPU baseline on comparable tasks. The energy per inference on MobileNet-128 CNN shows a peak improvement of 47.6×.
2018
Paul Whatmough, Sae Kyu Lee, Sam Xi, Udit Gupta, Lillian Pentecost, Marco Donato, Hsea Hsueh, David Brooks, and Gu-Yeon Wei. 10/2018. “SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices.” Hot Chips 30: A Symposium on High Performance Chips.
Emerging Internet of Things (IoT) devices necessitate system-on-chips (SoCs) that can scale from ultralow power always-on (AON) operation, all the way up to less frequent high-performance tasks at high energy efficiency. Specialized accelerators are essential to help meet these needs at both ends of the scale, but maintaining workload flexibility remains an important goal. This article presents a 25-mm² SoC in 16-nm FinFET technology which demonstrates targeted, flexible acceleration of key compute-intensive kernels spanning machine learning (ML), DSP, and cryptography. The SMIV SoC includes a dedicated AON sub-system, a dual-core Arm Cortex-A53 CPU cluster, an SoC-attached embedded field-programmable gate array (eFPGA) array, and a quad-core cache-coherent accelerator (CCA) cluster. Measurement results demonstrate: 1) 1236x power envelope, from 1.1 mW (only AON cluster), up to 1.36 W (whole SoC at maximum throughput); 2) 5.5-28.9x energy efficiency gain from offloading compute kernels from A53 to eFPGA; 3) 2.94x latency improvement using coherent memory access (CCA cluster); and 4) 55x MobileNetV1 energy per inference improvement on CCA compared to the CPU baseline. The overall flexibility-efficiency range on SMIV spans measured energy efficiencies of 1x (dual-core A53), 3.1x (A53 with SIMD), 16.5x (eFPGA), 54.9x (CCA), and 256x (AON) at a peak efficiency of 4.8 TOPS/W.
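A quick arithmetic sanity check of the headline power-envelope figure, using only the numbers quoted in the abstract:

```python
# Power envelope quoted above: always-on cluster alone vs. whole SoC at peak.
aon_only_w = 1.1e-3      # 1.1 mW
full_soc_w = 1.36        # 1.36 W
print(f"power envelope: {full_soc_w / aon_only_w:.0f}x")    # -> ~1236x

# Flexibility-efficiency points from the abstract, normalized to the dual-core
# A53 baseline (1x); the abstract quotes a peak efficiency of 4.8 TOPS/W.
relative_efficiency = {"A53+SIMD": 3.1, "eFPGA": 16.5, "CCA": 54.9, "AON": 256}
```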
Marco Donato, Brandon Reagen, Lillian Pentecost, Udit Gupta, David Brooks, and Gu-Yeon Wei. 6/28/2018. “On-Chip Deep Neural Network Storage with Multi-Level eNVM.” In DAC '18: Proceedings of the 55th Annual Design Automation Conference, Pp. 1–6. San Francisco, CA, USA.
One of the biggest performance bottlenecks of today's neural network (NN) accelerators is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm2, well within the area allocated to SRAM in modern NN accelerators [6].
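As an illustration of the clustering step mentioned in the abstract, the sketch below quantizes a weight matrix to an 8-entry codebook with plain 1-D k-means, so each weight reduces to a 3-bit index that could occupy a single 8-level cell. The layer size, cluster count, and initialization are illustrative choices; the paper's full pipeline additionally involves pruning and fault-aware co-design.

```python
# Sketch of weight clustering for multi-level storage: each weight becomes a
# short cluster index plus a small shared table of centroid values.
import numpy as np

def cluster_weights(weights, n_clusters=8, iters=25):
    """Return (indices, centroids) so weights are approximated by centroids[indices]."""
    w = weights.ravel()
    centroids = np.quantile(w, np.linspace(0, 1, n_clusters))   # spread initial centers
    for _ in range(iters):                                      # plain 1-D Lloyd's k-means
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return idx.reshape(weights.shape), centroids

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)) * 0.05        # stand-in for a trained layer
idx, codebook = cluster_weights(weights)
# 8 clusters -> 3-bit indices, i.e., one 8-level cell per stored weight.
print("index bits per weight:", int(np.ceil(np.log2(len(codebook)))))
```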