Publications by Type: Journal Article

2017
Yuhao Zhu and Vijay Reddi. 3/2/2017. “Cognitive computing safety: The new horizon for reliability.” IEEE Micro, 37, Pp. 15–21.
This column includes two invited position papers about the challenges and opportunities in cognitive architectures.
Robert Adolf, Saketh Rama, Brandon Reagen, Gu Wei, and David Brooks. 2017. “The Design and Evolution of Deep Learning Workloads.” IEEE Micro, 37, 1, Pp. 18–21.
2016
Tao Tong, Sae Lee, Xuan Zhang, David Brooks, and Gu Wei. 7/19/2016. “A Fully Integrated Reconfigurable Switched-Capacitor DC-DC Converter With Four Stacked Output Channels for Voltage Stacking Applications.” IEEE Journal of Solid-State Circuits, 51, 9, Pp. 2142–2152.
This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV.
Jafferis T, Mario Lok, Winey Nastasia, Gu Wei, and Wood J. 4/13/2016. “Multilayer laminated piezoelectric bending actuators: design and manufacturing for optimum power density and efficiency.” Smart Materials and Structures, 25, 5, Pp. 055033.

In previous work we presented design and manufacturing rules for optimizing the energy density of piezoelectric bimorph actuators through the use of laser-induced melting, insulating edge coating, and features for rigid ground attachments to maximize force output, as well as a pre-stacked technique to enable mass customization. Here we adapt these techniques to bending actuators with four active layers, which utilize thinner material layers. This allows the use of lower operating voltages, which is important for overall power usage optimization, as typical small-scale power supplies are low-voltage and the efficiency of boost-converter and drive circuitry increases with decreasing output voltage. We show that this optimization results in a 24%–47% reduction in the weight of the required power supply (depending on the type of drive circuit used). We also present scaling arguments to determine when multi-layer actuators are preferable to thinner actuators, and show that our techniques are capable of scaling down to sub-mg weight actuators.

2015
Hayun Chung, Toprak Deniz, Alexander Rylyakov, John Bulzacchelli, Daniel Friedman, and Gu Wei. 8/2015. “A 7.5 GS/s flash ADC and a 10.24 GS/s time-interleaved ADC for backplane receivers in 65 nm CMOS.” Analog Integrated Circuits and Signal Processing, 85, 2, Pp. 299–310.
This paper presents a 7.5 GS/s, 4.5 bit flash analog-to-digital converter (ADC) for high-speed backplane communication. A two-stage track-and-hold (T/H) structure enables high input bandwidth and low power consumption at the same time. A sampling clock duty cycle control technique, which allocates more tracking time to the bandwidth-limited second T/H stage, facilitates high sampling rates. A digital offset correction scheme compensates both random and systematic offsets due to process variation and T/H amplifier gain nonlinearity, simultaneously. Two test-chip prototypes were fabricated in a 65 nm CMOS process. Experimental results of a standalone ADC chip demonstrate 3.8 effective number of bits (ENOB) at 7.5 GS/s. The figure-of-merit (FOM) of the standalone ADC is 0.49 pJ/conversion-step. The second test chip combines two ADCs together in order to demonstrate a time-interleaved ADC (TI-ADC) for use in high-speed backplane receivers. The TI-ADC operates at 10.24 GS/s while achieving 3.5 ENOB and 0.65 pJ/conversion-step FOM.
Yakun Shao, Brandon Reagen, Gu Wei, and David Brooks. 5/13/2015. “The Aladdin approach to accelerator design and modeling.” IEEE Micro, 35, 3, Pp. 58–70.
Hardware specialization, in the form of datapath and control circuitry customized to particular algorithms or applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerators relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but also are slow and tedious to use, making large design space exploration infeasible. To overcome this problem, the authors developed Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrated its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9, 4.9, and 6.6 percent with respect to RTL implementations. Integrated with architecture-level general-purpose core and memory hierarchy simulators, Aladdin provides researchers with a fast but accurate way to model the power and performance of accelerators in an SoC environment.
2013
Hayun Chung and Gu Wei. 8/16/2013. “ADC-based backplane receiver design-space exploration.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22, 7, Pp. 1539–1547.
Demand for higher throughput backplane communications, coupled with a desire for design portability and flexibility, has led to high-speed backplane receivers that use front-end analog-to-digital converters (ADCs) and digital equalization. Unfortunately, power and complexity of such receivers can be high and require careful design. This paper presents a parameterized ADC-based backplane receiver model that facilitates design-space exploration to optimize the tradeoffs between power and performance: an accurate behavioral model of front-end ADCs is presented for performance estimation, and detailed power models for the digital equalizer (EQ) blocks are developed for power estimation. Model-based simulations suggest that comparator offset correction resolution is the most critical ADC design parameter when overall receiver performance is concerned. Further receiver design-space exploration reveals that a Pareto optimal frontier exists, which can be used as a guideline to set the initial receiver configurations depending on given power and performance constraints.
Robert Wood, Nagpal Radhika, and Gu Wei. 3/2013. “Flight of the robobees.” Scientific American, 308, 3, Pp. 60–65.
Not too long ago a mysterious affliction called colony collapse disorder (CCD) began to wipe out honeybee hives. These bees are responsible for most commercial pollination in the U.S., and their loss provoked fears that agriculture might begin to suffer as well. In 2009 the three of us, along with colleagues at Harvard University and Northeastern University, began to seriously consider what it would take to create a robotic bee colony. We wondered if mechanical bees could replicate not just an individual’s behavior but the unique behavior that emerges out of interactions among thousands of bees. We have now created the first RoboBees—flying bee-size robots—and are working on methods to make thousands of them cooperate like a real hive.
Svilen Kanev, Timothy Jones, Gu Wei, David Brooks, and Vijay Reddi. 2013. “Measuring code optimization impact on voltage noise.” Workshop in Silicon Errors – System Effects (SELSE).
In this paper, we characterize the impact of compiler optimizations on voltage noise. While intuition may suggest that the better processor utilization ensured by optimizing compilers results in a small amount of voltage variation, our measurements on an Intel Core2 Duo processor show the opposite – the majority of SPEC 2006 benchmarks exhibit more voltage droops when aggressively optimized. We show that this increase in noise could be sufficient for a net performance decrease in a typical-case, resilient design.
2012
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/26/2012. “HELIX: Making the extraction of thread-level parallelism mainstream.” IEEE Micro, 32, 4, Pp. 8–18.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/26/2012. “Making the Extraction of Thread-Level Parallelism Mainstream.” IEEE Micro.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still do not parallelize code automatically. Promising parallelization approaches have either required manual programmer assistance, depended on special hardware features, or risked slowing down programs they should have speeded up. HELIX is one such approach that automatically parallelizes general-purpose programs without requiring any special hardware. In this paper we show that in practice HELIX always avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers. We also show experimentally that HELIX outperforms the most similar historical technique that has been implemented in production compilers.
Michael Karpelson, Gu Wei, and J Wood. 4/2012. “Driving high voltage piezoelectric actuators in microrobotic applications.” Sensors and Actuators A: Physical, 176, Pp. 78–89.

Piezoelectric actuators have been used successfully to enable locomotion in aerial and ambulatory microrobotic platforms. However, the use of piezoelectric actuators presents two major challenges for power electronic design: generating high-voltage drive signals in systems typically powered by low-voltage energy sources, and recovering unused energy from the actuators. Due to these challenges, conventional drive circuits become too bulky or inefficient in low mass applications. This work describes electrical characteristics and drive requirements of low mass piezoelectric actuators, the design and optimization of suitable drive circuit topologies, aspects of the physical instantiation of these topologies, including the fabrication of extremely lightweight magnetic components, and a custom, ultra low power integrated circuit that implements control functionality for the drive circuits. The principles and building blocks presented here enable efficient high-voltage drive circuits that can satisfy the stringent weight and power requirements of microrobotic applications.

H Chung and Gu Wei. 1/5/2012. “Simulated-annealing-based adaptive equaliser for on-die variation compensation.” Electronics Letters, 48, 1, Pp. 18–19.

Fully exploiting the flexibility of lookup-table-based equalisers, it is proposed to compensate for on-die variation effects within a transmit-side equaliser. To efficiently deal with the nonlinear nature of circuit non-idealities, the proposed equaliser utilises simulated annealing for adaptation.

Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The accelerator store: A shared memory framework for accelerator-based systems.” ACM Transactions on Architecture and Code Optimization, 8, 4, Pp. 1–22.

This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.

Wonyoung Kim, David Brooks, and Gu Wei. 1/2012. “A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS.” IEEE Journal of Solid-State Circuits, 47, 1, Pp. 206–219.
On-chip DC-DC converters have the potential to offer fine-grain power management in modern chip-multiprocessors. This paper presents a fully integrated 3-level DC-DC converter, a hybrid of buck and switched-capacitor converters, implemented in 130 nm CMOS technology. The 3-level converter enables smaller inductors (1 nH) than a buck, while generating a wide range of output voltages compared to a 1/2 mode switched-capacitor converter. The test-chip prototype delivers up to 0.85 A load current while generating output voltages from 0.4 to 1.4 V from a 2.4 V input supply. It achieves 77% peak efficiency at power density of 0.1 W/mm² and 63% efficiency at maximum power density of 0.3 W/mm². The converter scales output voltage from 0.4 V to 1.4 V (or vice-versa) within 20 ns at a constant 450 mA load current. A shunt regulator reduces peak-to-peak voltage noise from 0.27 V to 0.19 V under pseudo-randomly fluctuating load currents. Using simulations across a wide range of design parameters, the paper compares conversion efficiencies of the 3-level, buck and switched-capacitor converters.
Michael Lyons, Gu Wei, and David Brooks. 1/2012. “Shrink-Fit: A Framework for Flexible Accelerator Sizing.” IEEE Computer Architecture Letters, 12, 1, Pp. 17–20.
RTL design complexity discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to “shrink-fit” accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16x with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.
2011
Vijay Reddi and David Brooks. 10/2011. “Resilient Architectures via Collaborative Design: Maximizing Commodity Processor Performance in the Presence of Variations.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30, 10, Pp. 1429–1445.
Unintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuit-level challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI/dt problem. The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems.
David Brooks. 9/2011. “CPUs, GPUs, and Hybrid Computing.” IEEE Micro, Pp. 4–6.
This introduction to the special issue discusses advances and challenges in the field of hybrid CPU/GPU computing.
