Publications

2017
Paul Whatmough, Sae Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu Wei. 2/9/2017. “A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications.” In International Solid-State Circuits Conference. San Francisco, CA, USA.
This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via time-borrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10⁻¹. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
 
Paul Whatmough, Sae Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu Wei. 2/5/2017. “14.3 A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 242–243. IEEE.
This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via time-borrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10⁻¹. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
 
Simon Chaput, David Brooks, and Gu Wei. 2/2/2017. “21.5 A 3-to-5V input 100V pp output 57.7 mW 0.42% THD+N highly integrated piezoelectric actuator driver.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 360–361. San Francisco, CA, USA: IEEE.
Piezoelectric actuators are used in a growing range of applications, e.g., haptic feedback systems, cooling fans, and microrobots. However, to fully realize their potential, these actuators require drivers able to efficiently generate high-voltage (>100V pp), low-frequency (<300Hz) analog waveforms from a low-voltage source (3-to-5V) with small form factor. Certain applications, such as piezoelectric (PZT) cooling fans, further demand low-distortion waveforms (THD+N < 1%) to minimize sound emission from the actuator. Existing solutions for small PZT drivers typically rely on designs comprising a power converter to step up a low voltage followed by a high-voltage amplifier [1,2,3]. Although envelope tracking can help reduce amplifier power [3], none of these designs can recover the energy stored on the actuator to maximize efficiency. And while a differential bidirectional flyback converter [4] can recover energy, it requires four inductors, thereby incurring a large size penalty. This paper introduces a single-inductor, highly integrated, bidirectional, high-voltage actuator driver that achieves 12.6× lower power and 2.1× lower THD+N at a similar size to the currently available state-of-the-art solution [1]. Measured results from an IC prototype demonstrate 200Hz sinusoidal waveforms up to 100V pp with 0.42% THD+N from a 3.6V source while dissipating 57.7mW to drive a 150nF capacitor. Beyond PZT actuators, the IC can also drive any type of capacitive load, e.g., electrostatic and electroactive polymer actuators.
Robert Adolf, Saketh Rama, Brandon Reagen, Gu Wei, and David Brooks. 2017. “The Design and Evolution of Deep Learning Workloads.” IEEE MICRO, 37, 1, Pp. 18–21.
Gu Wei and David Brooks. 2017. “An SoC Platform Architecture for Mini Autonomous Drones.” National Science Foundation Award Abstract # 1551044.
2016
Yakun Shao, Sam Xi, Vijayalakshmi Srinivasan, Gu Wei, and David Brooks. 10/15/2016. “Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin.” In International Symposium on Microarchitecture (MICRO). Taipei, Taiwan.
Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.
Robert Adolf, Saketh Rama, Brandon Reagen, Gu Wei, and David Brooks. 8/23/2016. “Fathom: Reference Workloads for Modern Deep Learning Methods.” In IEEE International Symposium on Workload Characterization.
Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
Tao Tong, Sae Lee, Xuan Zhang, David Brooks, and Gu Wei. 7/19/2016. “A Fully Integrated Reconfigurable Switched-Capacitor DC-DC Converter With Four Stacked Output Channels for Voltage Stacking Applications.” IEEE Journal of Solid-State Circuits, 51, 9, Pp. 2142–2152.
This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV.
Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Hernández-Lobato, Gu Wei, and David Brooks. 6/18/2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In International Symposium on Computer Architecture (ISCA). Seoul, Korea (South).
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
Jafferis T, Mario Lok, Winey Nastasia, Gu Wei, and Wood J. 4/13/2016. “Multilayer laminated piezoelectric bending actuators: design and manufacturing for optimum power density and efficiency.” Smart Materials and Structures, 25, 5, Pp. 055033.

In previous work we presented design and manufacturing rules for optimizing the energy density of piezoelectric bimorph actuators through the use of laser-induced melting, insulating edge coating, and features for rigid ground attachments to maximize force output, as well as a pre-stacked technique to enable mass customization. Here we adapt these techniques to bending actuators with four active layers, which utilize thinner material layers. This allows the use of lower operating voltages, which is important for overall power usage optimization, as typical small-scale power supplies are low-voltage and the efficiency of boost-converter and drive circuitry increases with decreasing output voltage. We show that this optimization results in a 24%–47% reduction in the weight of the required power supply (depending on the type of drive circuit used). We also present scaling arguments to determine when multi-layer actuators are preferable to thinner actuators, and show that our techniques are capable of scaling down to sub-mg weight actuators.

Michael Karpelson, Wood J, and Gu Wei. 2/9/2016. “System and method for efficient drive of capacitive actuators with voltage amplification”.
A circuit for driving a plurality of capacitive actuators, the circuit having a low-voltage side, a high voltage side and a flyback transformer between the two. The low-voltage side comprises first and second pairs of low-side switches connected in series across an input voltage. The flyback transformer has a primary winding connected to the two pairs of switches. The high-voltage side has a pair of switches connected between the secondary winding of the flyback transformer and a ground and a plurality of capacitive loads and bidirectional switches to connect the loads to the secondary winding of the flyback transformer and a ground.
José Lobato, Michael A Gelbart, Brandon Reagen, Robert Adolf, Daniel Hernández-Lobato, Paul N Whatmough, David Brooks, Gu-Yeon Wei, and Ryan P Adams. 2016. “Designing neural network hardware accelerators with decoupled objective evaluations.” In NIPS workshop on Bayesian Optimization, Pp. 10.
Software-based implementations of deep neural network predictions consume large amounts of energy, limiting their deployment in power-constrained environments. Hardware acceleration is a promising alternative. However, it is challenging to efficiently design accelerators that have both low prediction error and low energy consumption. Bayesian optimization can be used to accelerate the design problem. However, most of the existing techniques collect data in a coupled way by always evaluating the two objectives (energy and error) jointly at the same input, which is inefficient. Instead, in this work we consider a decoupled approach in which, at each iteration, we choose which objective to evaluate next and at which input. We show that considering decoupled evaluations produces better solutions when computational resources are limited. Our results also indicate that evaluating the prediction error is more important than evaluating the energy consumption.
Gu Wei, David Brooks, Simone Campanoni, Kevin Brownell, and Svilen Kanev. 2016. “Methods and apparatus for parallel processing”.
2015
Brandon Reagen, Robert Adolf, Gu Wei, and David Brooks. 10/26/2015. “The MachSuite Benchmark.” In Boston Area Architecture Workshop (BARC). Raleigh, NC, USA.
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection. To improve standardization within the accelerator research community, we present MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures. MachSuite spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers, including support for high-level synthesis. We illustrate these aspects by characterizing each benchmark along five different dimensions, highlighting trends and salient features.
MachSuite: Benchmarks for accelerator design and customized architectures
Mario Lok, Xuan Zhang, Elizabeth Helbling, Robert Wood, David Brooks, and Gu Wei. 9/28/2015. “A Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots.” In IEEE Custom Integrated Circuits Conference (CICC).
This paper describes a power electronics unit (PEU) for an insect-scale flapping-wing robot. Three power saving techniques used in the actuator driver of the PEU — envelope tracking, dynamic common mode, and charge sharing — reduce power consumption while retaining weight benefits of an inductor-less linear driver. A pair of actuator driver ICs energize four 15nF capacitor loads, which represent the piezoelectric actuators of a flapping-wing robot. The PEU consumes 290mW, which translates to 37% lower power compared to a design without these power saving techniques.
Hayun Chung, Toprak Deniz, Alexander Rylyakov, John Bulzacchelli, Daniel Friedman, and Gu Wei. 8/2015. “A 7.5 GS/s flash ADC and a 10.24 GS/s time-interleaved ADC for backplane receivers in 65 nm CMOS.” Analog Integrated Circuits and Signal Processing, 85, 2, Pp. 299–310.
This paper presents a 7.5 GS/s, 4.5 bit flash analog-to-digital converter (ADC) for high-speed backplane communication. A two-stage track-and-hold (T/H) structure enables high input bandwidth and low power consumption at the same time. A sampling clock duty cycle control technique, which allocates more tracking time to the bandwidth-limited second T/H stage, facilitates high sampling rates. A digital offset correction scheme compensates both random and systematic offsets due to process variation and T/H amplifier gain nonlinearity, simultaneously. Two test-chip prototypes were fabricated in a 65 nm CMOS process. Experimental results of a standalone ADC chip demonstrate 3.8 effective number of bits (ENOB) at 7.5 GS/s. The figure-of-merit (FOM) of the standalone ADC is 0.49 pJ/conversion-step. The second test chip combines two ADCs together in order to demonstrate a time-interleaved ADC (TI-ADC) for use in high-speed backplane receivers. The TI-ADC operates at 10.24 GS/s while achieving 3.5 ENOB and 0.65 pJ/conversion-step FOM.
Paul Whatmough, George Smart, Shidhartha Das, Yiannis Andreopoulos, and David Bull. 6/17/2015. “A 0.6V All-Digital Body-Coupled Wakeup Transceiver for IoT Applications.” In IEEE Symposium on VLSI Circuits (VLSIC). Kyoto, Japan.
A body-coupled symmetric wakeup transceiver is proposed for always-on device discovery in IoT applications requiring security and low-power consumption. The wakeup transceiver (WTRx) is implemented in 65nm CMOS, using digital logic cells and operates at 0.6V. A directly-modulated open-loop DCO generates an OOK-modulated 10MHz carrier, with a frequency-locked loop for intermittent calibration. A passive receiver incorporates a digital IO cell as hysteretic comparator, with a two-phase correlator bank. A novel MAC scheme allows for duty-cycling in both transmitter and receiver. Measured power consumption is 3.54μW, with sensitivity of 88mV and maximum wakeup latency of 150ms.
Xuan Zhang, Mario Lok, Tao Tong, Simon Chaput, Sae Lee, Brandon Reagen, Hyunkwang Lee, David Brooks, and Gu Wei. 6/17/2015. “A Multi-Chip System Optimized for Insect-Scale Flapping-Wing Robots.” In IEEE Symposium on VLSI Circuits (VLSIC).
We demonstrate a battery-powered multi-chip system optimized for insect-scale flapping wing robots that meets the tight weight limit and real-time performance demands of autonomous flight. Measured results show open-loop wing flapping driven by a power electronics unit and energy efficiency improvements via hardware acceleration.
Brandon Reagen, Xuan Zhang, David Brooks, and Gu Wei. 6/14/2015. “From PDF to GDS: Designing the RoboBee SoC.” WARP 2015 6th Workshop on Architectural Research Prototyping (Co-Located with the 42nd International Symposium on Computer Architecture).
Developing the RoboBee was a multi-discipline, 5-year project funded by a National Science Foundation Expeditions in Computing award with the goal of achieving autonomous flight with a bee-sized micro-robot [2]. The intent of the research was to help re-stabilize the declining bee population, whose decline researchers have shown could have devastating effects on the Earth's ecosystem. Bees are remarkably efficient: their skeletons weigh almost nothing, requiring minimal lift to take off and sustain flight, and their brains are small, pre-programmed with a minimal set of instincts necessary for the colony's survival. Their capabilities under such stringent weight and compute limitations make them a prime target for pushing what modern robotics and computer systems can do. The weight and power limits require that a custom System-on-Chip (SoC) be built. Conventional off-chip voltage regulators are heavy and bulky, and thus cannot fit within the weight and form factor of the robotic bee. Commercial off-the-shelf (COTS) microcontrollers consume too much power to perform the required computation for autonomous flight. The solution is to pack as much IP as possible onto a single die. SoCs have been the trend across semiconductor companies over the past decade, from mobile and embedded to server-grade solutions. In this paper we recount our experiences designing such a chip. We highlight the major challenges faced when designing for such a unique form factor, how designs and specifications were set by each collaborating lab, the difficulties of integrating a plethora of IP consisting of in-house digital and analog blocks, and the design flows we used. We also discuss how invaluable HLS was in reducing the engineering burden, focusing design efforts at higher levels of abstraction, and achieving an overall successful tape-out.
Svilen Kanev, Juan Darago, Kim Hazelwood, Tipp Moseley, Gu Wei, and David Brooks. 6/13/2015. “Profiling a Warehouse-Scale Computer.” In International Symposium on Computer Architecture (ISCA).
With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
