Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, David Brooks, and Gu-Yeon Wei. 6/24/2018. “Ares: a framework for quantifying the resilience of deep neural networks.” In Design Automation Conference, 17: Pp. 1-6.
As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a lightweight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.
Marco Donato, Brandon Reagen, Lillian Pentecost, Udit Gupta, David Brooks, and Gu-Yeon Wei. 6/24/2018. “On-Chip Deep Neural Network Storage with Multi-Level eNVM.” In Design Automation Conference (DAC).
One of the biggest performance bottlenecks of today’s neural network (NN) accelerators is off-chip memory accesses. In this paper, we propose a method to use multi-level, embedded non-volatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm², well within the area allocated to SRAM in modern NN accelerators.
Mario Lok, Elizabeth Farrell Helbling, Xuan Zhang, Robert Wood, David Brooks, and Gu-Yeon Wei. 4/2018. “A Low Mass Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots.” IEEE Transactions on Power Electronics, 33, 4, Pp. 3180-3191.
This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive-stage circuits with a novel high-side gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power-saving techniques: dynamic common-mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V, 100 Hz drive signal while tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW.
Yuhao Zhu, Gu-Yeon Wei, and David Brooks. 2/15/2018. “Cloud no longer a silver bullet, edge to the rescue.” arXiv preprint arXiv:1802.05943.
This paper takes the position that, while cognitive computing today relies heavily on the cloud, we will soon see a paradigm shift where cognitive computing primarily happens on network edges. The shift toward edge devices is fundamentally propelled both by technological constraints in data centers and wireless network infrastructures and by practical considerations such as privacy and safety. The remainder of this paper lays out our view of how these constraints will impact future cognitive computing. Bringing cognitive computing to edge devices opens up several new opportunities and challenges, some of which demand new solutions and some of which require us to revisit entrenched techniques in light of new technologies. We close the paper with a call to action for future research.
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy Jones, Gu-Yeon Wei, and David Brooks. 12/2017. “Automatically accelerating non-numerical programs by architecture-compiler co-design.” Communications of the ACM, 60, 12, Pp. 88–97.
Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average performance speedup of 6.85× for six SPEC CINT2000 benchmarks.
Sreela Kodali, Patrick Hansen, Niamh Mulholland, Paul Whatmough, David Brooks, and Gu-Yeon Wei. 11/5/2017. “Applications of Deep Neural Networks for Ultra Low Power IoT.” In International Conference on Computer Design.
IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator in 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, an FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower.
Ramon Bertran, Pradip Bose, David Brooks, Jeff Burns, Alper Buyuktosunoglu, Nandhini Chandramoorthy, Eric Cheng, Martin Cochet, Schuyler Eldridge, Daniel Friedman, and others. 11/5/2017. “Very low voltage (VLV) design.” In 2017 IEEE International Conference on Computer Design (ICCD), Pp. 601–604. Boston, MA, USA: IEEE.
This paper is a tutorial-style introduction to a special session on Effective Voltage Scaling in the Late CMOS Era. It covers the fundamental challenges and associated solution strategies in pursuing very low voltage (VLV) designs. We discuss the performance and system reliability constraints that are key impediments to VLV. The associated trade-offs across power, performance, and reliability are helpful in inferring the optimal operational voltage-frequency point. This work was performed under the auspices of an ongoing DARPA program (named PERFECT) that is focused on maximizing system-level energy efficiency.
Paul Whatmough, Sae Kyu Lee, Gu-Yeon Wei, and David Brooks. 10/29/2017. “Sub-uJ Deep Neural Networks for Embedded Applications.” In IEEE 51st Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, CA, USA.
To intelligently process sensor data on internet of things (IoT) devices, we require powerful classifiers that can operate at sub-uJ energy levels. Previous work has focused on spiking neural network (SNN) algorithms, which are well suited to VLSI implementation due to the single-bit connections between neurons in the network. In contrast, deep neural networks (DNNs) are not as well suited to hardware implementation, because the compute and storage demands are high. In this paper, we demonstrate that there are a variety of optimizations that can be applied to DNNs to reduce the energy consumption such that they outperform SNNs in terms of energy and accuracy. Six optimizations are surveyed and applied to a SIMD accelerator architecture. The accelerator is implemented in a 28nm SoC test chip. Measurement results demonstrate ~10X aggregate improvement in energy efficiency, with a minimum energy of 0.36uJ/inference at 667MHz clock frequency. Compared to previously published spiking neural network accelerators, we demonstrate an improvement in energy efficiency of more than an order of magnitude, across a wide energy-accuracy trade-off range.
Simon Chaput, David Brooks, and Gu-Yeon Wei. 10/5/2017. “An area-efficient 8-bit single-ended ADC with extended input voltage range.” IEEE Transactions on Circuits and Systems II: Express Briefs, 65, 11, Pp. 1549–1553.
This brief presents an 8-bit successive approximation register analog-to-digital converter (ADC) implemented within a system-on-chip (SoC) for autonomous flapping-wing microrobots. The ADC implements hybrid split-capacitor sub-digital-to-analog-converter (DAC) techniques to achieve a 35.72% improvement in capacitor-bank energy-area product. The device also implements an extended single-ended input voltage range, allowing a direct connection to sensors while maintaining low-power operation. This technique allows a 51.7% DAC switching-energy reduction compared to the state of the art. The SoC, fabricated in 40-nm CMOS, includes four parallel 0.001 mm², 1 MS/s ADC cores multiplexed across 13 input ports. It enables a 0 to 1.8-V input range while operating off a 0.9-V supply. At 1 MS/s, the ADC achieves a signal-to-noise and distortion ratio of 45.6 dB for a 1.6-Vpp input signal and consumes 10.4 μW.
Brandon Reagen, Yakun Shao, Sam Xi, Gu-Yeon Wei, and David Brooks. 8/6/2017. “Methods and infrastructure in the era of accelerator-centric architectures.” In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Pp. 902–905. Boston, MA, USA: IEEE.
Computer architecture today is anything but business as usual, and what is bad for business is often great for science. As Moore's Law continues to unwaveringly march forward despite the end of Dennard scaling, continued performance gains with each processor generation have become a significant challenge and require creative solutions. Namely, the way to continue scaling performance in light of power issues is through hardware specialization. Hardware accelerators promise not only orders-of-magnitude performance improvements over general-purpose processors but also similar energy-efficiency gains. However, accelerators are equal parts problem solver and problem creator. The major problem is designing and integrating accelerators into a complex environment within stringent SoC design cycles. Given that each accelerator has a rich design space and convoluted implications for, and interactions with, the memory system, better mechanisms for studying this new breed of SoC are needed. To usher in the new era of computer architecture, we have built Aladdin: a high-level accelerator simulator enabling rapid accelerator design. Aladdin was recently extended to operate in conjunction with gem5 to study memory system interactions. In this paper we recount the operation and utilities of Aladdin and gem5-Aladdin, concluding with a case study of how Aladdin can be used to optimize DNN accelerators.
Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks. 8/2017. Deep Learning for Computer Architects. Morgan & Claypool Publishers.
Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the powerful deep learning techniques of the last decade. Next we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in making machine learning a practical solution, this portion of the book recounts a variety of optimizations proposed recently to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.
Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, Patrick Hansen, Sreela Kodali, and David Brooks. 8/2017. “DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses.” In Hot Chips 29: A Symposium on High Performance Chips.
Brandon Reagen, José Hernández-Lobato, Robert Adolf, Michael Gelbart, Paul Whatmough, Gu-Yeon Wei, and David Brooks. 7/24/2017. “A Case for Efficient Accelerator Design Space Exploration via Bayesian Optimization.” In International Symposium on Low Power Electronics and Design. Taipei, Taiwan.
In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.
An Zou, Jingwen Leng, Yazhou Zu, Tao Tong, Vijay Reddi, David Brooks, Gu-Yeon Wei, and Xuan Zhang. 6/18/2017. “Ivory: Early-stage design space exploration tool for integrated voltage regulators.” In Proceedings of the 54th Annual Design Automation Conference 2017, Pp. 1–6. Austin, TX.
Despite being employed in burgeoning efforts to improve power delivery efficiency, integrated voltage regulators (IVRs) have yet to be evaluated in a rigorous, systematic, or quantitative manner. To fulfill this need, we present Ivory, a high-level design space exploration tool capable of providing accurate conversion efficiency, static performance characteristics, and dynamic transient responses of an IVR-enabled power delivery subsystem (PDS), enabling rapid trade-off exploration at an early design stage, approximately 1000× faster than SPICE simulation. We demonstrate and validate Ivory with a wide spectrum of IVR topologies. In addition, we present a case study using Ivory to reveal the optimal PDS configurations, with underlying power breakdowns and area overheads, for a GPU manycore architecture, which has yet to embrace IVRs.
Xuan Zhang, Mario Lok, Tao Tong, Sae Kyu Lee, Brandon Reagen, Pierre Duhamel, Robert Wood, David Brooks, and Gu-Yeon Wei. 6/12/2017. “A Fully Integrated Battery-Powered System-on-Chip in 40-nm CMOS for Closed-Loop Control of Insect-Scale Pico-Aerial Vehicle.” IEEE Journal of Solid-State Circuits, 52, 9, Pp. 2374-2387.
We demonstrate a fully integrated system-on-chip (SoC) optimized for insect-scale flapping-wing pico-aerial vehicles. The SoC is able to meet the stringent weight, power, and real-time performance demands of autonomous flight for a bee-sized robot. The entire integrated system with embedded voltage regulation, data conversion, clock generation, as well as both general-purpose and accelerated computing units, weighs less than 3 mg after die thinning. It is self-contained and can be powered directly from a lithium battery. Measured results show open-loop wing flapping controlled by the SoC, improved energy efficiency through the use of hardware acceleration, and supply resilience through the use of adaptive clocking.
Rafael Garibotti, Brandon Reagen, Yakun Shao, Gu-Yeon Wei, and David Brooks. 5/28/2017. “Using dynamic dependence analysis to improve the quality of high-level synthesis designs.” In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Pp. 1–4. Baltimore, MD, USA: IEEE.
High-Level Synthesis (HLS) tools that compile algorithms written in high-level languages into register-transfer level implementations can significantly improve design productivity and lower engineering cost. However, HLS-generated designs still lag handwritten implementations in a number of areas, particularly in the efficient allocation of hardware resources. In this work, we propose the use of dynamic dependence analysis to generate higher quality designs using existing HLS tools. We focus on resource sharing for compute-intensive workloads, a major limitation of relying only on static analysis. We demonstrate that with dynamic dependence analysis, the synthesized designs can achieve an order of magnitude resource reduction without performance loss over the state-of-the-art HLS solutions.
Sae Kyu Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu-Yeon Wei. 4/1/2017. “A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter.” IEEE Transactions on VLSI, 25, 4, Pp. 1271-1284.
This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but introduces internal voltage noise. To address this noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency.
Svilen Kanev, Sam Xi, Gu-Yeon Wei, and David Brooks. 4/2017. “Mallacc: Accelerating Memory Allocation.” In International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pp. 33-45.
Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
Yuhao Zhu and Vijay Reddi. 3/2/2017. “Cognitive computing safety: The new horizon for reliability.” IEEE Micro, 37, Pp. 15–21.
This column includes two invited position papers about the challenges and opportunities in cognitive architectures.
Paul Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 2/9/2017. “A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications.” In International Solid-State Circuits Conference. San Francisco, CA, USA.
This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via time borrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10⁻¹. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.