Publications

2017
Sae Kyu Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu Wei. 4/1/2017. “A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter.” IEEE Transactions on VLSI, 25, 4, Pp. 1271-1284. Publisher's VersionAbstract
This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity to combat internal voltage noise. To address the corresponding issue of internal voltage noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency.
A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter
Svilen Kanev, Sam Xi, Gu Wei, and David Brooks. 4/2017. “Mallacc: Accelerating Memory Allocation.” In International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2nd ed., 5: Pp. 33-45. Publisher's VersionAbstract
Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm 2 of silicon area, less than 0.006% of a typical high-performance processor core.
Mallacc: Accelerating Memory Allocation
Yuhao Zhu and Vijay Reddi. 3/2/2017. “Cognitive computing safety: The new horizon for reliability.” IEEE Micro, 37, Pp. 15–21. Publisher's VersionAbstract
This column includes two invited position papers about the challenges and opportunities in cognitive architectures.
Cognitive computing safety: The new horizon for reliability
Paul Whatmough, Sae Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu Wei. 2/9/2017. “A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications.” In International Solid-State Circuits Conference. San Francisco, CA, USA. Publisher's VersionAbstract
This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via timeborrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10-1. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
 
A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications
Whatmough N, Sae Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu Wei. 2/5/2017. “14.3 A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with> 0.1 timing error rate tolerance for IoT applications.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 242–243. IEEE. Publisher's VersionAbstract
This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via timeborrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10-1. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
 
14.3 A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with> 0.1 timing error rate tolerance for IoT applications
Simon Chaput, David Brooks, and Gu Wei. 2/2/2017. “21.5 A 3-to-5V input 100V pp output 57.7 mW 0.42% THD+ N highly integrated piezoelectric actuator driver.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 360–361. San Francisco, CA, USA: IEEE. Publisher's VersionAbstract
Piezoelectric actuators are used in a growing range of applications, e.g., haptic feedback systems, cooling fans, and microrobots. However, to fully realize their potential, these actuators require drivers able to efficiently generate high-voltage (>100V pp ) low frequency (<;300Hz) analog waveforms from a low-voltage source (3-to-5V) with small form factor. Certain applications, such as piezoelectric (PZT) cooling fans, further demand low distortion waveforms (THD+N <; 1%) to minimize sound emission from the actuator. Existing solutions for small PZT drivers typically rely on designs comprising a power converter to step up a low voltage followed by a high-voltage amplifier [1,2,3]. Although envelope tracking can help reduce amplifier power [3], none of these designs can recover the energy stored on the actuator to maximize efficiency. And while a differential bidirectional flyback converter [4] can recover energy, it requires four inductors, thereby incurring large size penalty. This paper introduces a single-inductor, highly integrated, bidirectional, high-voltage actuator driver that achieves 12.6× lower power and 2.1× lower THD+N at a similar size to the currently available state-of-the art solution [1]. Measured results from an IC prototype demonstrate 200Hz sinusoidal waveforms up to 100V pp with 0.42% THD+N from a 3.6V source while dissipating 57.7mW to drive a 150nF capacitor. Beyond PZT actuators, the IC can also drive any type of capacitive load, e.g., electrostatic and electroactive polymer actuators.
21.5 A 3-to-5V input 100V pp output 57.7 mW 0.42% THD+ N highly integrated piezoelectric actuator driver
Robert Adolf, Saketh Rama, Brandon Reagen, Gu Wei, and David Brooks. 2017. “The Design and Evolution of Deep Learning Workloads.” IEEE MICRO, 37, 1, Pp. 18–21. The Design and Evolution of Deep Learning Workloads
Gu Wei and David Brooks. 2017. “An SoC Platform Architecture for Mini Autonomous Drones.” National Science Foundation Award Abstract # 1551044. Publisher's Version An SoC Platform Architecture for Mini Autonomous Drones
2016
Yakun Shao, Sam Xi, Vijayalakshmi Srinivasan, Gu Wei, and David Brooks. 10/15/2016. “Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin.” In International Symposium on Microarchitecture (MICRO). Taipei, Taiwan. Publisher's VersionAbstract
Increasing demand for power-efficient, high- performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.
Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
Robert Adolf, Saketh Rama, Brandon Reagen, Gu Wei, and David Brooks. 8/23/2016. “Fathom: Reference Workloads for Modern Deep Learning Methods.” In IEEE International Symposium on Workload Characterization. Publisher's VersionAbstract
Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
Fathom: Reference Workloads for Modern Deep Learning Methods
Tao Tong, Sae Lee, Xuan Zhang, David Brooks, and Gu Wei. 7/19/2016. “A Fully Integrated Reconfigurable Switched-Capacitor DC-DC Converter With Four Stacked Output Channels for Voltage Stacking Applications.” IEEE Journal of Solid-State Circuits, 51, 9, Pp. 2142–2152. Publisher's VersionAbstract
This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV.
A Fully Integrated Reconfigurable Switched-Capacitor DC-DC Converter With Four Stacked Output Channels for Voltage Stacking Applications
Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Hernández-Lobato, Gu Wei, and David Brooks. 6/18/2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In International Symposium on Computer Architecture (ISCA). Seoul, Korea (South). Publisher's VersionAbstract
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
Jafferis T, Mario Lok, Winey Nastasia, Gu Wei, and Wood J. 4/13/2016. “Multilayer laminated piezoelectric bending actuators: design and manufacturing for optimum power density and efficiency.” Smart Materials and Structures, 25, 5, Pp. 055033. Publisher's VersionAbstract

In previous work we presented design and manufacturing rules for optimizing the energy density of piezoelectric bimorph actuators through the use of laser-induced melting, insulating edge coating, and features for rigid ground attachments to maximize force output, as well as a pre-stacked technique to enable mass customization. Here we adapt these techniques to bending actuators with four active layers, which utilize thinner material layers. This allows the use of lower operating voltages, which is important for overall power usage optimization, as typical small-scale power supplies are low-voltage and the efficiency of boost-converter and drive circuitry increases with decreasing output voltage. We show that this optimization results in a 24%–47% reduction in the weight of the required power supply (depending on the type of drive circuit used). We also present scaling arguments to determine when multi-layer actuator are preferable to thinner actuators, and show that our techniques are capable of scaling down to sub-mg weight actuators.

Multilayer laminated piezoelectric bending actuators: design and manufacturing for optimum power density and efficiency
Michael Karpelson, Wood J, and Gu Wei. 2/9/2016. “System and method for efficient drive of capacitive actuators with voltage amplification”.Abstract
A circuit for driving a plurality of capacitive actuators, the circuit having a low-voltage side, a high voltage side and a flyback transformer between the two. The low-voltage side comprises first and second pairs of low-side switches connected in series across an input voltage. The flyback transformer has a primary winding connected to the two pairs of switches. The high-voltage side has a pair of switches connected between the secondary winding of the flyback transformer and a ground and a plurality of capacitive loads and bidirectional switches to connect the loads to the secondary winding of the flyback transformer and a ground.
System and method for efficient drive of capacitive actuators with voltage amplification
José Lobato, Michael A Gelbart, Brandon Reagen, Robert Adolf, Daniel Hernández-Lobato, Paul N Whatmough, David Brooks, Gu-Yeon Wei, and Ryan P Adams. 2016. “Designing neural network hardware accelerators with decoupled objective evaluations.” In NIPS workshop on Bayesian Optimization, Pp. 10. Publisher's VersionAbstract
Software-based implementations of deep neural network predictions consume large amounts of energy, limiting their deployment in power-constrained environments. Hardware acceleration is a promising alternative. However, it is challenging to efficiently design accelerators that have both low prediction error and low energy consumption. Bayesian optimization can be used to accelerate the design problem. However, most of the existing techniques collect data in a coupled way by always evaluating the two objectives (energy and error) jointly at the same input, which is inefficient. Instead, in this work we consider a decoupled approach in which, at each iteration, we choose which objective to evaluate next and at which input. We show that considering decoupled evaluations produces better solutions when computational resources are limited. Our results also indicate that evaluating the prediction error is more important than evaluating the energy consumption.
Designing neural network hardware accelerators with decoupled objective evaluations
Gu Wei, David Brooks, Simone Campanoni, Kevin Brownell, and Svilen Kanev. 2016. “Methods and apparatus for parallel processing”.
2015
Brandon Reagen, Robert Adolf, Gu Wei, and David Brooks. 10/26/2015. “The MachSuite Benchmark.” In Boston Area Architecture Workshop (BARC). Raleigh, NC, USA. Publisher's VersionAbstract
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection. To improve standardization within the accelerator research community, we present MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures. MachSuite spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers, including support for high-level synthesis. We illustrate these aspects by characterizing each benchmark along five different dimensions, highlighting trends and salient features.
MachSuite: Benchmarks for accelerator design and customized architectures
Mario Lok, Xuan Zhang, Elizabeth Helblinh, Robert Wood, David Brooks, and Gu Wei. 9/28/2015. “A Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots.” In IEEE Custom Integrated Circuits Conference (CICC). Publisher's VersionAbstract
This paper describes a power electronics unit (PEU) for an insect-scale flapping-wing robot. Three power saving techniques used in the actuator driver of the PEU — envelope tracking, dynamic common mode, and charge sharing — reduce power consumption while retaining weight benefits of an inductor-less linear driver. A pair of actuator driver ICs energize four 15nF capacitor loads, which represent the piezoelectric actuators of a flapping-wing robot. The PEU consumes 290mW, which translates to 37% lower power compared to a design without these power saving techniques.
A Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots
Hayun Chung, Toprak Deniz, Alexander Rylyakov, John Bulzacchelli, Daniel Friedman, and Gu Wei. 8/2015. “A 7.5 GS/s flash ADC and a 10.24 GS/s time-interleaved ADC for backplane receivers in 65 nm CMOS.” Analog Integrated Circuits and Signal Processing, 85, 2, Pp. 299–310. Publisher's VersionAbstract
This paper presents a 7.5 GS/s, 4.5 bit flash analog-to-digital converter (ADC) for high-speed backplane communication. A two-stage track-and-hold (T/H) structure enables high input bandwidth and low power consumption at the same time. A sampling clock duty cycle control technique, which allocates more tracking time to the bandwidth-limited second T/H stage, facilitates high sampling rates. A digital offset correction scheme compensates both random and systematic offsets due to process variation and T/H amplifier gain nonlinearity, simultaneously. Two test-chip prototypes were fabricated in a 65 nm CMOS process. Experimental results of a standalone ADC chip demonstrate 3.8 effective number of bits (ENOB) at 7.5 GS/s. The figure-of-merit (FOM) of the standalone ADC is 0.49 pJ/conversion-step. The second test chip combines two ADCs together in order to demonstrate a time-interleaved ADC (TI-ADC) for use in high-speed backplane receivers. The TI-ADC operates at 10.24 GS/s while achieving 3.5 ENOB and 0.65 pJ/conversion-step FOM.
A 7.5 GS/s flash ADC and a 10.24 GS/s time-interleaved ADC for backplane receivers in 65 nm CMOS
Paul Whatmough, George Smart, Shidhartha Das, Yiannis Andreopoulos, and David Bull. 6/17/2015. “A 0.6V All-Digital Body-Coupled Wakeup Transceiver for IoT Applications.” In IEEE Symposium on VLSI Circuits (VLSIC). Kyoto, Japan. Publisher's VersionAbstract
A body-coupled symmetric wakeup transceiver is proposed for always-on device discovery in IoT applications requiring security and low-power consumption. The wakeup transceiver (WTRx) is implemented in 65nm CMOS, using digital logic cells and operates at 0.6V. A directly-modulated open-loop DCO generates an OOK-modulated 10MHz carrier, with a frequency-locked loop for intermittent calibration. A passive receiver incorporates a digital IO cell as hysteretic comparator, with a two-phase correlator bank. A novel MAC scheme allows for duty-cycling in both transmitter and receiver. Measured power consumption is 3.54μW, with sensitivity of 88mV and maximum wakeup latency of 150ms.
A 0.6V All-Digital Body-Coupled Wakeup Transceiver for IoT Applications

Pages