Publications by Author: David Brooks

2019
Paul Whatmough, Sae Kyu Lee, Marco Donato, Hsea-Ching Hsueh, Sam Xi, Udit Gupta, Lillian Pentecost, Glenn Ko, David Brooks, and Gu-Yeon Wei. 6/2019. “A 16nm 25mm² SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators.” Symposium on VLSI Circuits.
This paper presents a 25mm² SoC in 16nm FinFET technology targeting flexible acceleration of compute-intensive kernels in DNN, DSP and security algorithms. The SoC includes an always-on sub-system, a dual-core Arm A53 CPU cluster, an embedded FPGA array, and a quad-core cache-coherent accelerator cluster. Measurement results demonstrate the following observations: 1) moving DSP/cryptography kernels from A53 to eFPGA increases energy efficiency by 5.5×-28.9×, 2) the use of cache coherency for datapath accelerators increases throughput by 2.94×, and 3) the accelerator flexibility-efficiency (GOPS/W) range spans from 3.1× (A53+SIMD), to 16.5× (eFPGA), to 54.5× (CCA) compared to the dual-core CPU baseline on comparable tasks. The energy per inference on MobileNet-128 CNN shows a peak improvement of 47.6×.
Glenn Ko, Yuji Chai, Rob A. Rutenbar, David Brooks, and Gu-Yeon Wei. 4/28/2019. “FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs.” In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Pp. 334–334.
Many consider one of the key components of deep learning's success to be its compatibility with existing accelerators, mainly GPUs. While GPUs are great at handling the linear algebra kernels commonly found in deep learning, they are not the optimal architecture for unsupervised learning methods such as Bayesian models and inference. As a step toward better understanding architectures for probabilistic models, we study Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, with a focus on parallelism that converges to the target distribution and on parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimal for solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for a sound source separation application, FlexGibbs configured on the FPGA fabric of a Xilinx Zynq CPU-FPGA SoC achieved a Gibbs sampling inference speedup of 1048x and a 99.85% reduction in energy over running it on the ARM Cortex-A53.
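The chromatic trick is easy to see in code. Below is a minimal toy sketch (ours, not the FlexGibbs design) of two-color parallel Gibbs sampling on an Ising-grid MRF with wrap-around neighbors; all same-color sites are conditionally independent given the other color, so each color class can be updated in one parallel step:

```python
# Toy chromatic parallel Gibbs sampling on an Ising-grid MRF.
# Checkerboard (2-color) scheduling: same-color sites are conditionally
# independent given the other color, so each color class is sampled in
# parallel -- the property a hardware array of samplers can exploit.
import numpy as np

def chromatic_gibbs_ising(h, w, beta=0.4, sweeps=100, rng=None):
    rng = rng or np.random.default_rng(0)
    spins = rng.choice([-1, 1], size=(h, w))
    color = np.add.outer(np.arange(h), np.arange(w)) % 2  # 0/1 checkerboard
    for _ in range(sweeps):
        for c in (0, 1):  # one parallel update per color class
            nbr = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
                   np.roll(spins, 1, 1) + np.roll(spins, -1, 1))  # torus edges
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nbr))  # P(s=+1 | neighbors)
            mask = color == c
            draws = rng.random((h, w)) < p_up
            spins[mask] = np.where(draws[mask], 1, -1)
    return spins

print(chromatic_gibbs_ising(16, 16).mean())
```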
Yu Emma Wang, Yuhao Zhu, Glenn Ko, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 3/24/2019. “Demystifying Bayesian Inference Workloads.” International Symposium on Performance Analysis of Systems and Software (ISPASS).
The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially in deep learning. Deep learning has been the pillar algorithm that has led the advancement of learning patterns from a vast amount of labeled data, or supervised learning. However, for unsupervised learning, Bayesian methods often work better than deep learning. Bayesian modeling and inference work well with unlabeled or limited data, can leverage informative priors, and offer interpretable models. Despite being an important branch of machine learning, Bayesian inference has generally been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires deep understanding of the workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian inference, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques increase Bayesian inference performance by 5.8× on average over naive assignment and execution of the workloads.
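To make the computation-elision idea concrete, here is a hedged toy sketch (our illustration; BayesSuite's actual diagnostic and threshold are not reproduced) that samples in chunks and stops once additional samples no longer move the posterior-mean estimate:

```python
# Convergence-driven computation elision for MCMC: sample in chunks and
# stop once extra samples stop moving the posterior-mean estimate, eliding
# work that no longer improves inference quality. Diagnostic and tolerance
# are illustrative choices, not the paper's exact rule.
import numpy as np

def sample_posterior_mean(step, x0, chunk=1000, max_chunks=50, tol=1e-3):
    rng = np.random.default_rng(1)
    x, samples, prev_mean = x0, [], None
    for _ in range(max_chunks):
        for _ in range(chunk):
            x = step(x, rng)
            samples.append(x)
        mean = np.mean(samples)
        if prev_mean is not None and abs(mean - prev_mean) < tol:
            break  # elide remaining chunks: the estimate has stabilized
        prev_mean = mean
    return mean, len(samples)

# Toy target N(3, 1) via random-walk Metropolis.
def mh_step(x, rng):
    prop = x + rng.normal(0.0, 1.0)
    if np.log(rng.random()) < 0.5 * ((x - 3) ** 2 - (prop - 3) ** 2):
        return prop
    return x

print(sample_posterior_mean(mh_step, 0.0))
```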
Yu Wang, Victor Lee, Gu-Yeon Wei, and David Brooks. 1/1/2019. “Predicting New Workload or CPU Performance by Analyzing Public Datasets.” ACM Transactions on Architecture and Code Optimization (TACO), 15, 4, Pp. 53:1–53:21.
The marketplace for general-purpose microprocessors offers hundreds of functionally similar models, differing by traits like frequency, core count, cache size, memory bandwidth, and power consumption. Their performance depends not only on microarchitecture, but also on the nature of the workloads being executed. Given a set of intended workloads, the consumer needs both performance and price information to make rational buying decisions. Many benchmark suites have been developed to measure processor performance, and their results for large collections of CPUs are often publicly available. However, repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites designed to cover a broad spectrum of workload types can be misleading. To address these problems, we have developed a deep neural network (DNN) model, and we have used it to learn the relationship between the specifications of Intel CPUs and their performance on the SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also cross-predict the two benchmark suites and compare their performance scores. The results quantify the self-similarity of these suites for the first time in the literature. This work should discourage consumers from basing purchasing decisions exclusively on Geekbench 3, and it should encourage academics to evaluate research using more diverse workloads than the SPEC CPU suites alone.
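The modeling setup is straightforward to sketch. The toy example below (synthetic data and illustrative feature names, not the paper's Intel/SPEC/Geekbench datasets) trains a small DNN regressor from CPU specifications to a benchmark score and queries it for an unseen part:

```python
# Illustrative spec -> benchmark-score regression with a small DNN.
# The features and the synthetic "score" are placeholders for the kind of
# specification/performance data the paper mines from public repositories.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# [frequency_GHz, cores, cache_MB, mem_BW_GBps, TDP_W]
X = rng.uniform([1, 2, 2, 10, 15], [4, 32, 64, 100, 200], size=(500, 5))
y = 10 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1, 500)  # toy score

scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     random_state=0).fit(scaler.transform(X), y)

new_cpu = [[3.2, 16, 24, 76.8, 125]]  # hypothetical unreleased part
print(model.predict(scaler.transform(new_cpu)))
```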
Gu-Yeon Wei, Tao Tong, David Brooks, and Sae Kyu Lee. 2019. “Capacitor reconfiguration of a single-input, multi-output, switched-capacitor converter”.
Gu-Yeon Wei, Tao Tong, David Brooks, and Sae Kyu Lee. 2019. “Device and method for hybrid feedback control of a switch-capacitor multi-unit voltage regulator.” United States of America.

A device and method for hybrid feedback control of a switch-capacitor multi-unit voltage regulator are presented. A multi-unit switched-capacitor (SC) core includes a plurality of SC converter units, each unit with a capacitor and a plurality of switches controllable by a plurality of switching signals. Power switch drivers provide a switching signal to each SC converter unit. A secondary proactive loop circuit includes a feedback control circuit configured to control one or more of the plurality of switches. A comparator is configured to compare the regulator output voltage with a reference voltage and provide a comparator trigger signal. Ripple reduction logic is configured to receive the comparator trigger signal and provide an SC unit allocation signal. A multiplexer is configured to receive a first clock signal, a second clock signal, and the SC unit allocation signal and provide a signal to the power switch drivers.
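As a rough behavioral illustration (not circuit-accurate, with invented constants throughout), the allocation idea can be simulated in a few lines: a comparator flags output droop, and the ripple-reduction logic fires only as many SC units as the deficit requires, which keeps output ripple small compared with switching all units at once:

```python
# Behavioral sketch of comparator-triggered SC unit allocation.
# Each fired unit delivers one fixed charge packet; firing only as many
# units as the measured deficit requires bounds the output ripple.
# All constants are illustrative, not derived from the patent.

def simulate(vref=1.0, n_units=8, ticks=200):
    vout, trace = 0.95, []
    for _ in range(ticks):
        vout -= 0.004  # load current discharges the output each tick
        if vout < vref:                                   # comparator trigger
            deficit = vref - vout
            units = min(n_units, max(1, int(deficit / 0.002)))  # allocation
            vout += units * 0.002    # each fired unit adds one charge packet
        trace.append(vout)
    return trace

trace = simulate()
print(f"ripple ~ {max(trace[50:]) - min(trace[50:]):.4f} V")
```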

2018
Brandon Reagen, Udit Gupta, Robert Adolf, Michael Mitzenmacher, Alexander Rush, Gu-Yeon Wei, and David Brooks. 11/13/2018. “Weightless: Lossy Weight Encoding For Deep Neural Network Compression.” In International Conference on Machine Learning, Pp. 4324–4333.
The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. By leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.
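The Bloomier construction at the heart of the scheme is compact enough to sketch. The toy Python below (our simplification; parameters such as k=3 hash slots and 4-bit values are illustrative) builds an XOR table by peeling and recovers each stored weight-cluster index from k table slots:

```python
# Simplified Bloomier-filter sketch of lossy weight encoding: each key
# (a nonzero weight index) maps to k distinct table slots whose XOR yields
# its quantized value. Keys outside the encoded set alias to arbitrary
# values -- the "random errors" that retraining can absorb.
import random

def make_slots(m, k, seed=0):
    def slots(key):  # k distinct slots, derived deterministically from key
        return random.Random(hash((key, seed))).sample(range(m), k)
    return slots

def build(pairs, m, k=3, seed=0):
    slots = make_slots(m, k, seed)
    remaining, order = dict(pairs), []
    while remaining:                      # peel keys owning a unique slot
        owners = {}
        for key in remaining:
            for s in slots(key):
                owners.setdefault(s, []).append(key)
        singles = [(s, ks[0]) for s, ks in owners.items() if len(ks) == 1]
        if not singles:
            raise RuntimeError("peeling failed; grow m or change seed")
        for s, key in singles:
            if key in remaining:
                order.append((key, s, remaining.pop(key)))
    table = [0] * m
    for key, s, val in reversed(order):   # satisfy equations in reverse order
        acc = val
        for t in slots(key):
            if t != s:
                acc ^= table[t]
        table[s] = acc
    return table, slots

def query(table, slots, key):
    acc = 0
    for t in slots(key):
        acc ^= table[t]
    return acc

weights = {i: (i * 7) % 16 for i in range(0, 300, 3)}  # index -> 4-bit cluster id
table, slots = build(weights, m=4 * len(weights))
assert all(query(table, slots, key) == v for key, v in weights.items())
```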
Sae Kyu Lee, Paul Whatmough, Niamh Mulholland, Patrick Hansen, David Brooks, and Gu-Yeon Wei. 10/18/2018. “A wide dynamic range sparse FC-DNN processor with multi-cycle banked SRAM read and adaptive clocking in 16nm FinFET.” ESSCIRC 2018 – IEEE 44th European Solid State Circuits Conference.
Always-on classifiers for sensor data require a very wide operating range to support a variety of real-time workloads and must operate robustly at low supply voltages. We present a 16nm always-on wake-up controller with a fully-connected (FC) Deep Neural Network (DNN) accelerator that operates from 0.4-1V. Calibration-free automatic voltage/frequency tuning is provided by tracking small non-zero Razor timing-error rates, and a novel timing-error-driven, sync-free fast adaptive clocking scheme provides resilience to on-chip supply voltage noise. The model access burden of neural networks is relaxed using a multi-cycle SRAM read, which allows memory voltage to be reduced at iso-throughput. The wide operating range allows for high performance at 1.36GHz, low power consumption down to 750μW, and state-of-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense, or 1.81 TOPS/W sparse.
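The calibration-free tuning loop has a simple software analogy. This behavioral sketch (our invention; the error-rate model and step sizes are made up) lowers the supply until the observed timing-error rate sits in a small non-zero band, in the spirit of the tracking discipline the abstract describes:

```python
# Behavioral sketch of voltage tuning that tracks a small non-zero
# timing-error rate: shave margin while errors are negligible, back off
# when they grow. The silicon error-rate curve here is a toy model.

def error_rate(vdd):  # toy model: errors explode below ~0.55 V
    return max(0.0, min(1.0, 10 ** (-(vdd - 0.50) * 40)))

def tune(vdd=0.90, target=(1e-4, 1e-3), steps=40):
    for _ in range(steps):
        rate = error_rate(vdd)
        if rate < target[0]:
            vdd -= 0.01   # too safe: shave voltage margin
        elif rate > target[1]:
            vdd += 0.01   # too risky: back off
    return vdd

print(f"settled at ~{tune():.2f} V")
```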
Rafael Garibotti, Brandon Reagen, Yakun Shao, Gu-Yeon Wei, and David Brooks. 10/2018. “Assisting High-Level Synthesis Improve SpMV Benchmark Through Dynamic Dependence Analysis.” IEEE Transactions on Circuits and Systems II: Express Briefs, 65, 10, Pp. 1440–1444.
Recent advances in high-level synthesis (HLS) have enabled an automatic means of generating register-transfer level designs from high-level specifications without compromising performance. HLS provides substantial improvements to productivity and is a promising solution for designing future heterogeneous chips consisting of dozens of unique IP blocks (i.e., hardware accelerators). Despite their impressive capabilities, HLS tools today are commonly used to target a small subset of workloads, i.e., ones with inordinately regular control flow and memory access patterns. The challenge of achieving high-quality hardware for irregular workloads stems from HLS relying on static analysis. Static analysis is overly conservative when dealing with non-uniform memory accesses and imbalanced workloads, and when identifying the most appropriate parallelization strategy. In this brief, we propose the use of dynamic analysis to generate higher quality designs using commercial HLS tools. Our evaluations show that with dynamic dependence analysis, HLS designs achieve a 3.3× performance improvement for the sparse matrix-vector multiply benchmark.
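A minimal version of trace-based dependence checking for CSR SpMV (our toy, not the brief's tool) records each row-iteration's reads and writes during one execution and reports whether any two iterations conflict, which is evidence a static analyzer cannot derive from the indirect indexing alone:

```python
# Dynamic (trace-based) dependence analysis for CSR SpMV: run once, log
# each row-iteration's read/write sets, and check for cross-iteration
# conflicts. Static analysis must assume y[row] and x[col[j]] may alias;
# the trace proves the row loop is safely parallelizable.

def spmv_trace(row_ptr, col, val, x):
    y = [0.0] * (len(row_ptr) - 1)
    writes, reads = [], []
    for r in range(len(y)):
        w, rd = {("y", r)}, set()
        for j in range(row_ptr[r], row_ptr[r + 1]):
            rd.add(("x", col[j]))
            y[r] += val[j] * x[col[j]]
        writes.append(w)
        reads.append(rd)
    n = len(y)
    conflict = any(writes[i] & (writes[j] | reads[j])
                   for i in range(n) for j in range(n) if i != j)
    return y, not conflict

# 3x3 toy matrix in CSR form
y, parallel_ok = spmv_trace([0, 2, 3, 4], [0, 2, 1, 0],
                            [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
print(y, "rows independent:", parallel_ok)
```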
Paul Whatmough, Sae Kyu Lee, Sam Xi, Udit Gupta, Lillian Pentecost, Marco Donato, Hsea-Ching Hsueh, David Brooks, and Gu-Yeon Wei. 10/2018. “SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices.” Hot Chips 30: A Symposium on High Performance Chips.
Emerging Internet of Things (IoT) devices necessitate system-on-chips (SoCs) that can scale from ultralow power always-on (AON) operation, all the way up to less frequent high-performance tasks at high energy efficiency. Specialized accelerators are essential to help meet these needs at both ends of the scale, but maintaining workload flexibility remains an important goal. This article presents a 25-mm² SoC in 16-nm FinFET technology which demonstrates targeted, flexible acceleration of key compute-intensive kernels spanning machine learning (ML), DSP, and cryptography. The SMIV SoC includes a dedicated AON sub-system, a dual-core Arm Cortex-A53 CPU cluster, an SoC-attached embedded field-programmable gate array (eFPGA) array, and a quad-core cache-coherent accelerator (CCA) cluster. Measurement results demonstrate: 1) 1236x power envelope, from 1.1 mW (only AON cluster), up to 1.36 W (whole SoC at maximum throughput); 2) 5.5-28.9x energy efficiency gain from offloading compute kernels from A53 to eFPGA; 3) 2.94x latency improvement using coherent memory access (CCA cluster); and 4) 55x MobileNetV1 energy per inference improvement on CCA compared to the CPU baseline. The overall flexibility-efficiency range on SMIV spans measured energy efficiencies of 1x (dual-core A53), 3.1x (A53 with SIMD), 16.5x (eFPGA), 54.9x (CCA), and 256x (AON) at a peak efficiency of 4.8 TOPS/W.
Paul Whatmough, Sae Kyu Lee, David Brooks, and Gu-Yeon Wei. 9/2018. “DNN ENGINE: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications.” IEEE Journal of Solid-State Circuits (JSSC), 53, 9.
This paper presents a 28-nm system-on-chip (SoC) for Internet of Things (IoT) applications with a programmable accelerator design that implements a powerful fully connected deep neural network (DNN) classifier. To reach the required low energy consumption, we exploit the key properties of neural network algorithms: parallelism, data reuse, small/sparse data, and noise tolerance. We map the algorithm to a very large scale integration (VLSI) architecture based around a single-instruction, multiple-data (SIMD) datapath with hardware support to exploit data sparsity by completely eliding unnecessary computation and data movement. This approach exploits sparsity without compromising the parallel computation. We also exploit the inherent algorithmic noise-tolerance of neural networks by introducing circuit-level timing violation detection to allow worst-case voltage guard-bands to be minimized. The resulting intermittent timing violations may result in logic errors, which conventionally need to be corrected. However, in lieu of explicit error correction, we cope with these errors by relying on the noise tolerance of neural networks. The measured test chip achieves high classification accuracy (98.36% for the MNIST test set) while tolerating aggregate timing violation rates above 10⁻¹. The accelerator achieves a minimum energy of 0.36 μJ/inference at 667 MHz; maximum throughput at 1.2 GHz and 0.57 μJ/inference; or a 10% margined operating point at 1 GHz and 0.58 μJ/inference.
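The data-sparsity elision is easy to illustrate in software (a toy sketch, not the SoC's SIMD datapath): zero activations contribute nothing to a fully connected layer, so their multiply-accumulates and weight fetches can be skipped outright:

```python
# Zero-skipping sparse fully connected layer: visit only non-zero
# activations, eliding the MACs and weight-row fetches for zeros entirely
# rather than computing and discarding them.
import numpy as np

def sparse_fc(x, W, b):
    y = b.astype(np.float64)
    for i in np.flatnonzero(x):   # only non-zero activations survive
        y += x[i] * W[i]          # one weight row per surviving input
    return y

rng = np.random.default_rng(0)
x = rng.random(256) * (rng.random(256) > 0.9)   # ~90% of activations are zero
W, b = rng.random((256, 64)), rng.random(64)
assert np.allclose(sparse_fc(x, W, b), x @ W + b)
```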
Marco Donato, Brandon Reagen, Lillian Pentecost, Udit Gupta, David Brooks, and Gu-Yeon Wei. 6/28/2018. “On-chip deep neural network storage with multi-level eNVM.” In DAC '18: Proceedings of the 55th Annual Design Automation Conference, Pp. 1–6. San Francisco, CA, USA.

One of the biggest performance bottlenecks of today's neural network (NN) accelerators is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators [6].
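A toy fault model makes the co-design concern concrete (our illustration; level counts and fault rates are invented): in a multi-level cell, a fault nudges a stored weight-cluster level to an adjacent one, so the encoding must tolerate such shifts:

```python
# Toy multi-level-cell fault model: a faulty cell drifts to an adjacent
# level, so a stored weight-cluster index shifts by one step. Level count
# and fault probability are illustrative, not the paper's device data.
import numpy as np

def mlc_store(levels, n_levels=4, p_fault=1e-3, rng=None):
    rng = rng or np.random.default_rng(0)
    drift = rng.choice([-1, 1], size=levels.shape)   # adjacent-level error
    faulty = rng.random(levels.shape) < p_fault
    return np.clip(levels + faulty * drift, 0, n_levels - 1)

levels = np.random.default_rng(1).integers(0, 4, size=100_000)
stored = mlc_store(levels)
print(f"disturbed fraction: {(stored != levels).mean():.4%}")
```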

Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, David Brooks, and Gu-Yeon Wei. 6/24/2018. “Ares: a framework for quantifying the resilience of deep neural networks.” In Design Automation Conference (DAC), Pp. 1–6.
As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a lightweight, DNN-specific fault injection framework validated to within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.
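The core mechanics of weight fault injection are simple to sketch (a hedged toy, not Ares's actual interface or models): flip random bits of a float32 weight tensor at a chosen rate, then re-evaluate the model's accuracy:

```python
# Static weight fault injection in the spirit of Ares: flip random bits
# of a float32 weight tensor at a given per-bit fault rate, then the
# perturbed weights can be loaded back into a model for accuracy evaluation.
import numpy as np

def inject_bit_flips(weights, fault_rate, rng=None):
    rng = rng or np.random.default_rng(0)
    bits = weights.astype(np.float32).view(np.uint32).copy()
    n_faults = rng.binomial(bits.size * 32, fault_rate)
    idx = rng.integers(0, bits.size, n_faults)     # which 32-bit word
    pos = rng.integers(0, 32, n_faults)            # which bit within it
    np.bitwise_xor.at(bits, idx, (1 << pos).astype(np.uint32))
    return bits.view(np.float32)

w = np.random.default_rng(2).normal(size=10_000).astype(np.float32)
w_faulty = inject_bit_flips(w, fault_rate=1e-5)
print("weights changed:", int((w != w_faulty).sum()))
```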
Mario Lok, Elizabeth Farrell Helbling, Xuan Zhang, Robert Wood, David Brooks, and Gu-Yeon Wei. 4/2018. “A Low Mass Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots.” IEEE Transactions on Power Electronics, 33, 4, Pp. 3180–3191.
This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive stage circuits with a novel high-side gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power-saving techniques: dynamic common-mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V, 100 Hz drive signal while tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW.
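As a back-of-envelope sanity check on that figure (ours, not from the paper): hard-switching a capacitive load without energy recovery dissipates roughly C·V²·f per cycle, so four such loads give

```latex
P \approx 4\,CV^{2}f
  = 4 \times 15\,\mathrm{nF} \times (200\,\mathrm{V})^{2} \times 100\,\mathrm{Hz}
  = 240\,\mathrm{mW},
```

the same order as the measured 290 mW average; the recovery techniques above cut the actuator-drive component of that bound, while converter losses and the high-side drivers add overhead.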
Yuhao Zhu, Gu-Yeon Wei, and David Brooks. 2/15/2018. “Cloud no longer a silver bullet, edge to the rescue.” arXiv preprint arXiv:1802.05943.
This paper takes the position that, while cognitive computing today relies heavily on the cloud, we will soon see a paradigm shift where cognitive computing primarily happens on network edges. The shift toward edge devices is fundamentally propelled by technological constraints in data centers and wireless network infrastructures, as well as by practical considerations such as privacy and safety. The remainder of this paper lays out our view of how these constraints will impact future cognitive computing. Bringing cognitive computing to edge devices opens up several new opportunities and challenges, some of which demand new solutions and some of which require us to revisit entrenched techniques in light of new technologies. We close the paper with a call to action for future research.
2017
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy Jones, Gu-Yeon Wei, and David Brooks. 12/2017. “Automatically accelerating non-numerical programs by architecture-compiler co-design.” Communications of the ACM, 60, 12, Pp. 88–97.
Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.
Sreela Kodali, Patrick Hansen, Niamh Mulholland, Paul Whatmough, David Brooks, and Gu-Yeon Wei. 11/5/2017. “Applications of Deep Neural Networks for Ultra Low Power IoT.” In International Conference on Computer Design.
IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator on 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, a FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower.
Ramon Bertran, Pradip Bose, David Brooks, Jeff Burns, Alper Buyuktosunoglu, Nandhini Chandramoorthy, Eric Cheng, Martin Cochet, Schuyler Eldridge, Daniel Friedman, and others. 11/5/2017. “Very low voltage (VLV) design.” In 2017 IEEE International Conference on Computer Design (ICCD), Pp. 601–604. Boston, MA, USA: IEEE.
This paper is a tutorial-style introduction to a special session on: Effective Voltage Scaling in the Late CMOS Era. It covers the fundamental challenges and associated solution strategies in pursuing very low voltage (VLV) designs. We discuss the performance and system reliability constraints that are key impediments to VLV. The associated trade-offs across power, performance and reliability are helpful in inferring the optimal operational voltage-frequency point. This work was performed under the auspices of an ongoing DARPA program (named PERFECT) that is focused on maximizing system-level energy efficiency.
Paul Whatmough, Sae Kyu Lee, Gu-Yeon Wei, and David Brooks. 10/29/2017. “Sub-uJ Deep Neural Networks for Embedded Applications.” In IEEE 51st Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, CA, USA.
To intelligently process sensor data on internet of things (IoT) devices, we require powerful classifiers that can operate at sub-uJ energy levels. Previous work has focused on spiking neural network (SNN) algorithms, which are well suited to VLSI implementation due to the single-bit connections between neurons in the network. In contrast, deep neural networks (DNNs) are not as well suited to hardware implementation, because the compute and storage demands are high. In this paper, we demonstrate that there are a variety of optimizations that can be applied to DNNs to reduce the energy consumption such that they outperform SNNs in terms of energy and accuracy. Six optimizations are surveyed and applied to a SIMD accelerator architecture. The accelerator is implemented in a 28nm SoC test chip. Measurement results demonstrate ~10X aggregate improvement in energy efficiency, with a minimum energy of 0.36uJ/inference at 667MHz clock frequency. Compared to previously published spiking neural network accelerators, we demonstrate an improvement in energy efficiency of more than an order of magnitude, across a wide energy-accuracy trade-off range.
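Two of the kinds of optimizations the paper surveys, magnitude pruning and uniform quantization, can be sketched post-training in a few lines (illustrative threshold and bit-width; the paper's six optimizations and their mapping onto the SIMD accelerator are not reproduced here):

```python
# Post-training magnitude pruning and uniform symmetric quantization,
# two representative DNN energy optimizations. The 50% sparsity target
# and 8-bit width are illustrative choices.
import numpy as np

def prune(w, sparsity=0.5):
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)   # drop small-magnitude weights

def quantize(w, bits=8):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale            # uniform symmetric quantizer

w = np.random.default_rng(3).normal(0, 0.1, size=(256, 64))
w_opt = quantize(prune(w), bits=8)
print(f"zeros: {(w_opt == 0).mean():.0%}, "
      f"max err: {np.abs(w - w_opt).max():.4f}")
```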
