Publications by Type: Conference Paper

2018
Brandon Reagen, Udit Gupta, Robert Adolf, Michael Mitzenmacher, Alexander Rush, Gu-Yeon Wei, and David Brooks. 11/13/2018. “Weightless: Lossy Weight Encoding For Deep Neural Network Compression.” In International Conference on Machine Learning, Pp. 4324–4333.
The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.
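A minimal sketch of the core idea in Python (illustrative only, not the paper's implementation): map weights to a small codebook, randomly corrupt a fraction of the stored entries to mimic the errors a Bloomier filter can introduce, and then fine-tune around them. The uniform codebook and all names here are assumptions.

```python
import numpy as np

def lossy_encode(weights, n_clusters=16, error_rate=0.01, seed=0):
    """Toy stand-in for Weightless-style encoding of a flat weight vector:
    quantize to a small codebook, then corrupt a fraction of the stored
    indices to mimic a Bloomier filter's random errors."""
    rng = np.random.default_rng(seed)
    # Uniform codebook for simplicity; the paper clusters trained weights.
    codebook = np.linspace(weights.min(), weights.max(), n_clusters)
    idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
    # Random index corruptions model the filter's false positives.
    errs = rng.random(idx.shape) < error_rate
    idx[errs] = rng.integers(0, n_clusters, errs.sum())
    return codebook[idx]

w = np.random.default_rng(1).normal(size=10_000).astype(np.float32)
w_hat = lossy_encode(w)
print(f"mean |w - w_hat|: {np.abs(w - w_hat).mean():.4f}")
# Weightless then fine-tunes the network with these perturbed weights so
# accuracy recovers despite the encoding errors.
```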
Marco Donato, Brandon Reagen, Lillian Pentecost, Udit Gupta, David Brooks, and Gu-Yeon Wei. 6/28/2018. “On-chip deep neural network storage with multi-level eNVM.” In DAC '18: Proceedings of the 55th Annual Design Automation Conference, Pp. 1–6. San Francisco, CA, USA.
One of the biggest performance bottlenecks of today's neural network (NN) accelerators is off-chip memory accesses [11]. In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators [6].
Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, David Brooks, and Gu-Yeon Wei. 6/24/2018. “Ares: a framework for quantifying the resilience of deep neural networks.” In Design Automation Conference, 17: Pp. 1–6.
As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.
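The core experiment is easy to sketch. Below is a toy bit-flip injector in numpy, not the Ares framework itself; it assumes static faults in float32 weights, and a full study would re-run inference at each fault rate, per layer, to find where accuracy collapses.

```python
import numpy as np

def inject_faults(weights, fault_rate, rng):
    """Flip each bit of a float32 weight tensor independently with
    probability fault_rate -- a toy version of the static weight
    faults a framework like Ares models."""
    faulty = weights.copy()
    bits = faulty.view(np.uint32)        # reinterpret the same buffer
    mask = np.zeros_like(bits)
    for b in range(32):
        mask |= (rng.random(bits.shape) < fault_rate).astype(np.uint32) << b
    bits ^= mask                         # in-place bit flips
    return faulty

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
for rate in (1e-7, 1e-5, 1e-3):
    w_f = inject_faults(w, rate, rng)
    # Flips in exponent bits can blow values up, which is why accuracy
    # falls off a cliff past a model-dependent fault rate.
    print(f"rate={rate:.0e}  max|dw|={np.abs(w_f - w).max():.3g}")
```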
2017
Sreela Kodali, Patrick Hansen, Niamh Mulholland, Paul Whatmough, David Brooks, and Gu-Yeon Wei. 11/5/2017. “Applications of Deep Neural Networks for Ultra Low Power IoT.” In International Conference on Computer Design.
IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator on 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, a FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower.
Ramon Bertran, Pradip Bose, David Brooks, Jeff Burns, Alper Buyuktosunoglu, Nandhini Chandramoorthy, Eric Cheng, Martin Cochet, Schuyler Eldridge, Daniel Friedman, and others. 11/5/2017. “Very low voltage (VLV) design.” In 2017 IEEE International Conference on Computer Design (ICCD), Pp. 601–604. Boston, MA, USA: IEEE.
This paper is a tutorial-style introduction to a special session on: Effective Voltage Scaling in the Late CMOS Era. It covers the fundamental challenges and associated solution strategies in pursuing very low voltage (VLV) designs. We discuss the performance and system reliability constraints that are key impediments to VLV. The associated trade-offs across power, performance and reliability are helpful in inferring the optimal operational voltage-frequency point. This work was performed under the auspices of an ongoing DARPA program (named PERFECT) that is focused on maximizing system-level energy efficiency.
Paul Whatmough, Sae Kyu Lee, Gu-Yeon Wei, and David Brooks. 10/29/2017. “Sub-uJ Deep Neural Networks for Embedded Applications.” In IEEE 51st Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, CA, USA.
To intelligently process sensor data on internet of things (IoT) devices, we require powerful classifiers that can operate at sub-uJ energy levels. Previous work has focused on spiking neural network (SNN) algorithms, which are well suited to VLSI implementation due to the single-bit connections between neurons in the network. In contrast, deep neural networks (DNNs) are not as well suited to hardware implementation, because the compute and storage demands are high. In this paper, we demonstrate that there are a variety of optimizations that can be applied to DNNs to reduce the energy consumption such that they outperform SNNs in terms of energy and accuracy. Six optimizations are surveyed and applied to a SIMD accelerator architecture. The accelerator is implemented in a 28nm SoC test chip. Measurement results demonstrate ~10X aggregate improvement in energy efficiency, with a minimum energy of 0.36uJ/inference at 667MHz clock frequency. Compared to previously published spiking neural network accelerators, we demonstrate an improvement in energy efficiency of more than an order of magnitude, across a wide energy-accuracy trade-off range.
Brandon Reagen, Yakun Shao, Sam Xi, Gu-Yeon Wei, and David Brooks. 8/6/2017. “Methods and infrastructure in the era of accelerator-centric architectures.” In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Pp. 902–905. Boston, MA, USA: IEEE.
Computer architecture today is anything but business as usual, and what is bad for business is often great for science. As Moore's Law continues to march forward despite the end of Dennard scaling, sustaining performance gains with each processor generation has become a significant challenge and requires creative solutions. Namely, the way to continue scaling performance in light of power issues is through hardware specialization. Hardware accelerators promise not only orders-of-magnitude performance improvements over general-purpose processors, but similar energy efficiency gains as well. However, accelerators are equal parts problem solver and problem creator. The major problem is designing and integrating accelerators into a complex environment within stringent SoC design cycles. Given that each accelerator has a rich design space and convoluted implications and interactions with the memory system, better mechanisms for studying this new breed of SoC are needed. To usher in the new era of computer architecture, we have built Aladdin: a high-level accelerator simulator enabling rapid accelerator design. Aladdin was recently extended to operate in conjunction with gem5 to study memory system interactions. In this paper we recount the operation and utilities of Aladdin and gem5-Aladdin, concluding with a case study of how Aladdin can be used to optimize DNN accelerators.
Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, Patrick Hansen, Sreela Kodali, and David Brooks. 8/2017. “DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses.” In Hot Chips 29: A Symposium on High Performance Chips.
Brandon Reagen, José Hernández-Lobato, Robert Adolf, Michael Gelbart, Paul Whatmough, Gu-Yeon Wei, and David Brooks. 7/24/2017. “A Case for Efficient Accelerator Design Space Exploration via Bayesian Optimization.” In International Symposium on Low Power Electronics and Design. Taipei, Taiwan.
In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.
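As a sketch of what such an optimizer searches for, the snippet below computes the Pareto frontier of hypothetical (error, energy) design evaluations; multi-objective Bayesian optimization aims to recover this frontier with far fewer expensive evaluations than exhaustive sampling. The random design points are placeholders.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated rows of an (n, 2) array of
    (prediction error, energy) pairs, both minimized."""
    pts = np.asarray(points)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some point is <= in all objectives
        # and strictly < in at least one.
        dominated = ((pts <= p).all(axis=1) & (pts < p).any(axis=1)).any()
        if not dominated:
            keep.append(i)
    return pts[keep]

rng = np.random.default_rng(0)
designs = rng.random((200, 2))   # placeholder (error, energy) evaluations
front = pareto_front(designs)
print(f"{len(front)} Pareto-optimal designs out of {len(designs)}")
```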
An Zou, Jingwen Leng, Yazhou Zu, Tao Tong, Vijay Reddi, David Brooks, Gu-Yeon Wei, and Xuan Zhang. 6/18/2017. “Ivory: Early-stage design space exploration tool for integrated voltage regulators.” In Proceedings of the 54th Annual Design Automation Conference 2017, Pp. 1–6. Austin, TX.
Despite being employed in burgeoning efforts to improve power delivery efficiency, integrated voltage regulators (IVRs) have yet to be evaluated in a rigorous, systematic, or quantitative manner. To fulfill this need, we present Ivory, a high-level design space exploration tool capable of providing accurate conversion efficiency, static performance characteristics, and dynamic transient responses of an IVR-enabled power delivery subsystem (PDS), enabling rapid trade-off exploration at an early design stage, approximately 1000× faster than SPICE simulation. We demonstrate and validate Ivory with a wide spectrum of IVR topologies. In addition, we present a case study using Ivory to reveal the optimal PDS configurations, with underlying power breakdowns and area overheads, for the GPU manycore architecture, which has yet to embrace IVRs.
Rafael Garibotti, Brandon Reagen, Yakun Shao, Gu-Yeon Wei, and David Brooks. 5/28/2017. “Using dynamic dependence analysis to improve the quality of high-level synthesis designs.” In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Pp. 1–4. Baltimore, MD, USA: IEEE.
High-Level Synthesis (HLS) tools that compile algorithms written in high-level languages into register-transfer level implementations can significantly improve design productivity and lower engineering cost. However, HLS-generated designs still lag handwritten implementations in a number of areas, particularly in the efficient allocation of hardware resources. In this work, we propose the use of dynamic dependence analysis to generate higher quality designs using existing HLS tools. We focus on resource sharing for compute-intensive workloads, a major limitation of relying only on static analysis. We demonstrate that with dynamic dependence analysis, the synthesized designs can achieve an order of magnitude resource reduction without performance loss over the state-of-the-art HLS solutions.
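A rough sketch of the dynamic-dependence idea, not the paper's toolchain: profile the addresses candidate operations actually touch, and flag pairs whose observed read/write sets never conflict as resource-sharing candidates that purely static analysis would conservatively reject. All operation names and addresses below are hypothetical.

```python
from collections import defaultdict

# Observed read/write address sets per candidate operation.
trace = defaultdict(lambda: {"r": set(), "w": set()})

def record(op, addr, is_write):
    trace[op]["w" if is_write else "r"].add(addr)

def may_share(op_a, op_b):
    """Operations with no observed RAW/WAR/WAW conflict in the profile
    are candidates for sharing one hardware resource."""
    a, b = trace[op_a], trace[op_b]
    return not ((a["w"] & (b["r"] | b["w"])) or (b["w"] & a["r"]))

# Hypothetical profiling run: two loop bodies touch disjoint arrays.
for i in range(8):
    record("op_a", 0x1000 + 4 * i, is_write=True)
    record("op_b", 0x2000 + 4 * i, is_write=False)
print(may_share("op_a", "op_b"))   # True -> sharing candidate
```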
Svilen Kanev, Sam Xi, Gu-Yeon Wei, and David Brooks. 4/2017. “Mallacc: Accelerating Memory Allocation.” In International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2nd ed., 5: Pp. 33–45.
Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
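For illustration, size-class computation, the first of the three operations, looks roughly like the following in a software allocator; the class table here is made up, not tcmalloc's actual spacing. Mallacc's contribution is caching this mapping (plus free-list head retrieval and sampling counters) in a tiny in-core structure.

```python
import bisect

# Made-up size-class table; real allocators like tcmalloc use a denser,
# carefully tuned spacing.
SIZE_CLASSES = [8, 16, 32, 48, 64, 96, 128, 192, 256, 512, 1024]

def size_class(request_bytes: int) -> int:
    """Round a request up to the smallest size class that fits."""
    i = bisect.bisect_left(SIZE_CLASSES, request_bytes)
    if i == len(SIZE_CLASSES):
        raise ValueError("large request: handled by the page allocator")
    return SIZE_CLASSES[i]

assert size_class(1) == 8
assert size_class(33) == 48
```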
Paul Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 2/9/2017. “A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications.” In International Solid-State Circuits Conference. San Francisco, CA, USA.
This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via time-borrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10⁻¹. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
Simon Chaput, David Brooks, and Gu-Yeon Wei. 2/2/2017. “21.5 A 3-to-5V input 100Vpp output 57.7mW 0.42% THD+N highly integrated piezoelectric actuator driver.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 360–361. San Francisco, CA, USA: IEEE.
Piezoelectric actuators are used in a growing range of applications, e.g., haptic feedback systems, cooling fans, and microrobots. However, to fully realize their potential, these actuators require drivers able to efficiently generate high-voltage (>100Vpp), low-frequency (<300Hz) analog waveforms from a low-voltage source (3-to-5V) with small form factor. Certain applications, such as piezoelectric (PZT) cooling fans, further demand low-distortion waveforms (THD+N < 1%) to minimize sound emission from the actuator. Existing solutions for small PZT drivers typically rely on designs comprising a power converter to step up a low voltage followed by a high-voltage amplifier [1,2,3]. Although envelope tracking can help reduce amplifier power [3], none of these designs can recover the energy stored on the actuator to maximize efficiency. And while a differential bidirectional flyback converter [4] can recover energy, it requires four inductors, thereby incurring a large size penalty. This paper introduces a single-inductor, highly integrated, bidirectional, high-voltage actuator driver that achieves 12.6× lower power and 2.1× lower THD+N at a similar size to the currently available state-of-the-art solution [1]. Measured results from an IC prototype demonstrate 200Hz sinusoidal waveforms up to 100Vpp with 0.42% THD+N from a 3.6V source while dissipating 57.7mW to drive a 150nF capacitor. Beyond PZT actuators, the IC can also drive any type of capacitive load, e.g., electrostatic and electroactive polymer actuators.
2016
Yakun Shao, Sam Xi, Vijayalakshmi Srinivasan, Gu-Yeon Wei, and David Brooks. 10/15/2016. “Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin.” In International Symposium on Microarchitecture (MICRO). Taipei, Taiwan.
Increasing demand for power-efficient, high- performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.
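A toy illustration of the co-design point, with made-up numbers: adding a fixed system-level delay (data movement, coherence) to each accelerator design can shift which design actually minimizes energy-delay product.

```python
# Made-up design points: name -> (accelerator delay in s, energy in J).
designs = {
    "wide":   (1.0e-6, 9.0e-6),
    "medium": (2.0e-6, 4.0e-6),
    "narrow": (4.0e-6, 2.5e-6),
}
SYSTEM_DELAY = 1.5e-6   # hypothetical data-movement + coherence cost

def best_edp(extra_delay):
    """Design minimizing energy-delay product given added system delay."""
    return min(designs,
               key=lambda d: (designs[d][0] + extra_delay) * designs[d][1])

print("optimum in isolation:", best_edp(0.0))           # -> medium
print("system-aware optimum:", best_edp(SYSTEM_DELAY))  # -> narrow
```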
Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 8/23/2016. “Fathom: Reference Workloads for Modern Deep Learning Methods.” In IEEE International Symposium on Workload Characterization.
Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 6/18/2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In International Symposium on Computer Architecture (ISCA). Seoul, Korea (South).
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation saves an additional 2.7× by lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
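The headline number is consistent with composing the three stage savings multiplicatively:

```latex
\underbrace{1.5\times}_{\text{datatype opt.}}
\;\cdot\;
\underbrace{2.0\times}_{\text{predication + pruning}}
\;\cdot\;
\underbrace{2.7\times}_{\text{SRAM voltage scaling}}
\;=\; 8.1\times \text{ average power reduction}
```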
José Hernández-Lobato, Michael A Gelbart, Brandon Reagen, Robert Adolf, Daniel Hernández-Lobato, Paul N Whatmough, David Brooks, Gu-Yeon Wei, and Ryan P Adams. 2016. “Designing neural network hardware accelerators with decoupled objective evaluations.” In NIPS workshop on Bayesian Optimization, Pp. 10.
Software-based implementations of deep neural network predictions consume large amounts of energy, limiting their deployment in power-constrained environments. Hardware acceleration is a promising alternative. However, it is challenging to efficiently design accelerators that have both low prediction error and low energy consumption. Bayesian optimization can be used to accelerate the design problem. However, most of the existing techniques collect data in a coupled way by always evaluating the two objectives (energy and error) jointly at the same input, which is inefficient. Instead, in this work we consider a decoupled approach in which, at each iteration, we choose which objective to evaluate next and at which input. We show that considering decoupled evaluations produces better solutions when computational resources are limited. Our results also indicate that evaluating the prediction error is more important than evaluating the energy consumption.