Publications

2013
Tao Tong, Xuan Zhang, Wonyoung Kim, David Brooks, and Gu Wei. 9/22/2013. “A Fully Integrated Battery-Connected Switched-Capacitor 4:1 Voltage Regulator with 70% Peak Efficiency Using Bottom-Plate Charge Recycling.” In IEEE Custom Integrated Circuits Conference (CICC).
This work presents a switched-capacitor (SC) DC-DC voltage regulator that converts a 3.7V battery voltage down to ~0.8V in order to power the ‘brain’ SoC of a flapping-wing microrobotic bee. A cascade of two 2:1 SC converters offers high efficiency for a 4:1 conversion ratio. A charge recycling technique reduces the flying capacitor's bottom-plate parasitic loss by 50%, and overall conversion efficiency reaches 70%. The output droop is less than 10% of the nominal output voltage for a worst-case 47mA load step.
Xuan Zhang, Tao Tong, David Brooks, and Gu Wei. 9/22/2013. “Supply-Noise Resilient Adaptive Clocking for Battery-Powered Aerial Microrobotic System-on-Chip in 40nm CMOS.” In IEEE Custom Integrated Circuits Conference (CICC).
A battery-powered aerial microrobotic System-on-Chip (SoC) has stringent weight and power budgets, which require fully integrated solutions for both clock generation and voltage regulation. Supply-noise resilience is important yet challenging for such SoC systems due to a non-constant battery discharge profile and load current variability. This paper proposes an adaptive-frequency clocking scheme that can tolerate supply noise and improve performance when implemented with an integrated voltage regulator (IVR). Measurements from a ‘brain’ SoC, implemented in 40nm CMOS, demonstrate 2× performance improvement with adaptive-frequency clocking over conventional fixed-frequency clocking. Combining adaptive-frequency clocking with open-loop IVR extends error-free operation to a wider battery voltage range (2.8 to 3.8V) with higher average performance.
Mario Lok, David Brooks, Robert Wood, and Gu Wei. 9/15/2013. “Design and analysis of an integrated driver for piezoelectric actuators.” In IEEE Energy Conversion Congress and Exposition.
Small-scale, highly maneuverable, flapping-wing robotic insects have a wide range of applications, including exploration, environmental monitoring, search and rescue, and surveillance. For these small-scale robots, a piezoelectric cantilever actuator driven by a high voltage drive signal is a preferred actuation mechanism. The generation of this drive signal via light and efficient power electronics is critical given the limited weight budget for the flapping-wing robot. Previous work demonstrated actuator drive circuitry using discrete power transistors and passive elements. This paper presents a new design that integrates all the power FETs into a single monolithic IC, reducing the weight of the power electronics to fit within the weight budget. This design adds the capability of driving multiple outputs to accommodate recent electromechanical design advances for flying robots.
Xuan Zhang, Tao Tong, Svilen Kanev, Sae Lee, Gu Wei, and David Brooks. 9/4/2013. “Characterizing and Evaluating Voltage Noise in Multi-Core Near-Threshold Processors.” In International Symposium on Low Power Electronics and Design (ISLPED).
Lowering the supply voltage to improve energy efficiency leads to higher load current and elevated supply sensitivity. In this paper, we provide the first quantitative analysis of voltage noise in multi-core near-threshold processors in a future 10nm technology across SPEC CPU2006 benchmarks. Our results reveal a larger guardband requirement and significant energy efficiency loss due to power delivery nonidealities at near threshold, and highlight the importance of accurate voltage noise characterization for design exploration of energy-centric computing systems using near-threshold cores.
Yakun Shao and David Brooks. 9/4/2013. “Energy Characterization and Instruction-Level Energy Model of Intel's Xeon Phi Processor.” In International Symposium on Low Power Electronics and Design (ISLPED).
Intel’s Xeon Phi is the first commercial many-core/multi-thread x86-based processor. Xeon Phi belongs to a new breed of high performance computing processors that seek high compute density as well as energy efficiency. However, no high-level energy model is available for Xeon Phi software developers to quickly evaluate and optimize energy efficiency. This work demonstrates an instruction-level energy model for the Xeon Phi processor to facilitate the development of energy-efficient software. In order to construct this model, we first characterize the energy consumption of the processor, identifying how energy per instruction scales with the number of cores, the number of active threads per core, and instruction types. Based on the energy characterization, we construct an instruction-level energy model and validate the accuracy of the model between 1% and 5% for real world benchmarks. We show that the energy model can be used to identify software inefficiencies for these benchmarks and find that Linpack code can be optimized to increase energy efficiency by as much as 10%.
Brandon Reagen, Yakun Shao, Gu Wei, and David Brooks. 9/4/2013. “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware.” In International Symposium on Low Power Electronics and Design (ISLPED).
As the traditional performance gains of technology scaling diminish, one of the most promising directions is building special purpose fixed function hardware blocks, commonly referred to as accelerators. Accelerators have become prevalent in industrial SoC designs for their low power, high performance potential. In this work we explore thousands of implementations of classical software workloads in hardware. This thorough, detailed design space search of hardware accelerators gives architects a quantitative way to reason about the differences in implementations. The exploration presented in this work shows that the space is full of poor design choices. By thoroughly analyzing each benchmark, we show which provide the most performance when implemented in hardware given a fixed power budget and explain which design techniques work best for each workload.
Hayun Chung and Gu Wei. 8/16/2013. “ADC-based backplane receiver design-space exploration.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22, 7, Pp. 1539–1547.
Demand for higher throughput backplane communications, coupled with a desire for design portability and flexibility, has led to high-speed backplane receivers that use front-end analog-to-digital converters (ADCs) and digital equalization. Unfortunately, power and complexity of such receivers can be high and require careful design. This paper presents a parameterized ADC-based backplane receiver model that facilitates design-space exploration to optimize the tradeoffs between power and performance: an accurate behavioral model of front-end ADCs is presented for performance estimation, and detailed power models for the digital equalizer (EQ) blocks are developed for power estimation. Model-based simulations suggest that comparator offset correction resolution is the most critical ADC design parameter where overall receiver performance is concerned. Further receiver design-space exploration reveals that a Pareto optimal frontier exists, which can be used as a guideline to set the initial receiver configurations depending on the given power and performance constraints.
Yakun Shao and David Brooks. 4/21/2013. “ISA-Independent Workload Characterization and its Implications for Specialized Architectures.” In International Symposium on Performance Analysis of Systems and Software (ISPASS).
Specialized architectures will become increasingly important as the computing industry demands more energy-efficient designs. The application-centric design style for these architectures is heavily dependent on workload characterization of intrinsic program characteristics, but at the same time these architectures are likely to be decoupled from legacy ISAs. In this work, we perform ISA-independent workload characterization for a variety of important intrinsic program characteristics relating to computation, memory, and control flow. The analysis is performed using a JIT compiler that emits ISA-independent instructions. We compare this analysis with an x86 trace and find that several of the analyses are highly sensitive to the ISA. We conclude that designers of specialized architectures must adopt ISA-independent workload characterization approaches.
Robert Wood, Radhika Nagpal, and Gu Wei. 3/2013. “Flight of the robobees.” Scientific American, 308, 3, Pp. 60–65.
Not too long ago a mysterious affliction called colony collapse disorder (CCD) began to wipe out honeybee hives. These bees are responsible for most commercial pollination in the U.S., and their loss provoked fears that agriculture might begin to suffer as well. In 2009 the three of us, along with colleagues at Harvard University and Northeastern University, began to seriously consider what it would take to create a robotic bee colony. We wondered if mechanical bees could replicate not just an individual’s behavior but the unique behavior that emerges out of interactions among thousands of bees. We have now created the first RoboBees—flying bee-size robots—and are working on methods to make thousands of them cooperate like a real hive.
Svilen Kanev, Timothy Jones, Gu Wei, David Brooks, and Vijay Reddi. 2013. “Measuring code optimization impact on voltage noise.” Workshop on Silicon Errors in Logic – System Effects (SELSE).
In this paper, we characterize the impact of compiler optimizations on voltage noise. While intuition may suggest that the better processor utilization ensured by optimizing compilers results in a small amount of voltage variation, our measurements on an Intel Core2 Duo processor show the opposite – the majority of SPEC 2006 benchmarks exhibit more voltage droops when aggressively optimized. We show that this increase in noise could be sufficient for a net performance decrease in a typical-case, resilient design.
2012
Sae Lee, David Brooks, and Gu Wei. 7/2012. “Evaluation of voltage stacking for near-threshold multicore computing.” In ISLPED '12: Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, Pp. 373–378.
This paper evaluates voltage stacking in the context of near-threshold multicore computing. Key attributes of voltage stacking are investigated using results from a test-chip prototype built in 150nm FDSOI CMOS. By "stacking" logic blocks on top of each other, voltage stacking reduces the chip current draw and simplifies off-chip power delivery, but within-die voltage noise due to inter-layer current mismatch is an issue. Results show that, unlike conventional power delivery schemes, supply rail impedance in voltage-stacked systems depends on aggregate power consumption, leading to better noise immunity for high-power (low-impedance) operation for many-core processors.
Svilen Kanev, Gu Wei, and David Brooks. 7/2012. “XIOSim: power-performance modeling of mobile x86 cores.” In International Symposium on Low Power Electronics and Design (ISLPED). ACM.
Simulation is one of the main vehicles of computer architecture research. In this paper, we present XIOSim, a highly detailed microarchitectural simulator targeted at mobile x86 microprocessors. The simulator execution model that we propose is a blend between traditional user-level simulation and full-system simulation. Our current implementation features detailed power and performance core models which allow microarchitectural exploration. Using a novel validation methodology, we show that XIOSim’s performance models manage to stay well within 10% of real hardware for the whole SPEC CPU2006 suite. Furthermore, we validate power models against measured data to show a deviation of less than 5% in terms of average power consumption.
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/26/2012. “HELIX: Making the extraction of thread-level parallelism mainstream.” IEEE Micro, 32, 4, Pp. 8–18.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. HELIX automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/26/2012. “Making the Extraction of Thread-Level Parallelism Mainstream.” IEEE Micro.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still do not parallelize code automatically. Promising parallelization approaches have either required manual programmer assistance, depended on special hardware features, or risked slowing down programs they should have speeded up. HELIX is one such approach that automatically parallelizes general-purpose programs without requiring any special hardware. In this paper we show that in practice HELIX always avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers. We also show experimentally that HELIX outperforms the most similar historical technique that has been implemented in production compilers.
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/3/2012. “The HELIX project: overview and directions.” In Design Automation Conference (DAC). San Francisco, CA, USA: ACM.
Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.
Michael Karpelson, Gu Wei, and J Wood. 4/2012. “Driving high voltage piezoelectric actuators in microrobotic applications.” Sensors and Actuators A: Physical, 176, Pp. 78–89.
Piezoelectric actuators have been used successfully to enable locomotion in aerial and ambulatory microrobotic platforms. However, the use of piezoelectric actuators presents two major challenges for power electronic design: generating high-voltage drive signals in systems typically powered by low-voltage energy sources, and recovering unused energy from the actuators. Due to these challenges, conventional drive circuits become too bulky or inefficient in low mass applications. This work describes electrical characteristics and drive requirements of low mass piezoelectric actuators, the design and optimization of suitable drive circuit topologies, aspects of the physical instantiation of these topologies, including the fabrication of extremely lightweight magnetic components, and a custom, ultra low power integrated circuit that implements control functionality for the drive circuits. The principles and building blocks presented here enable efficient high-voltage drive circuits that can satisfy the stringent weight and power requirements of microrobotic applications.
Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Reddi, Gu Wei, and David Brooks. 3/31/2012. “HELIX: Automatic parallelization of irregular programs for chip multiprocessing.” In International Symposium on Code Generation and Optimization (CGO). ACM.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.
H Chung and Gu Wei. 1/5/2012. “Simulated-annealing-based adaptive equaliser for on-die variation compensation.” Electronics Letters, 48, 1, Pp. 18–19.
Fully exploiting the flexibility of lookup-table-based equalisers, it is proposed to compensate for on-die variation effects within a transmit-side equaliser. To efficiently deal with the nonlinear nature of circuit non-idealities, the proposed equaliser utilises simulated annealing for adaptation.
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The accelerator store: A shared memory framework for accelerator-based systems.” ACM Transactions on Architecture and Code Optimization, 8, 4, Pp. 1–22.
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
