Publications by Type: Conference Paper

2011
Wonyoung Kim, David Brooks, and Gu-Yeon Wei. 2/20/2011. “A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation.” In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Pp. 268–270. IEEE.
In recent years, chip multiprocessor architectures have emerged to scale performance while staying within tight power constraints. This trend motivates per core/block dynamic voltage and frequency scaling (DVFS) with fast voltage transition. Given the high cost and bulk of off-chip DC/DC converters to implement multiple on-chip power domains, there has been a surge of interest in on-chip converters. This paper presents the design and experimental results of a fully integrated 3-level DC/DC converter that merges characteristics of both inductor-based buck and switched-capacitor (SC) converters. While off-chip buck converters show high conversion efficiency, their on-chip counterparts suffer from loss due to low quality inductors. With the help of flying capacitors, the 3-level converter requires smaller inductors than the buck converter, reducing loss and on-die area overhead. Compared to SC converters that need more complex structures to regulate higher than half the input voltage, 3-level converters can efficiently regulate the output voltage across a wide range of levels and load currents. Measured results from a 130nm CMOS test-chip prototype demonstrate nanosecond-scale voltage transition times and peak conversion efficiency of 77%.
Krishna Rangan, Michael Powell, Gu-Yeon Wei, and David Brooks. 2/12/2011. “Achieving uniform performance and maximizing throughput in the presence of heterogeneity.” In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Pp. 3–14. IEEE.
Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations - the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single-frequency to use as the chip sales frequency. We examine several scheduling algorithms implemented below the O/S in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation induced heterogeneity from the end-user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, providing a single uniform performance level to present for the chip.
David Brooks. 1/25/2011. “The alarms project: a hardware/software approach to addressing parameter variations.” In Design Automation Conference (ASP-DAC), 2011 16th Asia and South Pacific, Pp. 291–291. IEEE.
Parameter variations (process, voltage, and temperature) threaten continued performance scaling of power-constrained computer systems. As designers seek to contain the power consumption of microprocessors through reductions in supply voltage and power-saving techniques such as clock-gating, these systems suffer increasingly large power supply fluctuations due to the finite impedance of the power supply network. These supply fluctuations, referred to as voltage emergencies, must be managed to guarantee correctness. Traditional approaches to address this problem incur high-cost or compromise power/performance efficiency. Our research seeks ways to handle these alarm conditions through a combined hardware/software approach, motivated by root cause analysis of voltage emergencies revealing that many of these events are heavily linked to both program control flow and microarchitectural events (cache misses and pipeline flushes). This talk will discuss three aspects of the project: (1) a fail-safe mechanism that provides hardware guaranteed correctness; (2) a voltage emergency predictor that leverages control flow and microarchitectural event information to predict voltage emergencies up to 16 cycles in advance; and (3) a proof-of-concept dynamic compiler implementation that demonstrates that dynamic code transformations can be used to eliminate voltage emergencies from the instruction stream with minimal impact on performance [1–9].
2010
Vijay Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael Smith, Gu-Yeon Wei, and David Brooks. 12/4/2010. “Voltage smoothing: Characterizing and mitigating voltage noise in production processors via software-guided thread scheduling.” In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE.
Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4%, far from the processor’s 14% worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today’s processors, such resilient designs could yield 15% to 20% performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.
Michael Karpelson, P. Whitney, Gu-Yeon Wei, and J. Wood. 10/18/2010. “Energetics of flapping-wing robotic insects: towards autonomous hovering flight.” In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Pp. 1630–1637. IEEE.
Flapping-wing mechanisms inspired by biological insects have the potential to enable a new class of small, highly maneuverable aerial robots with hovering capabilities. In order for such devices to operate without an external power source, it is necessary to address a complex system design challenge: the integration of all of the required components on board the robot. This paper discusses the flight energetics of flapping-wing robotic insects with the goal of selecting design parameters that enable power autonomy and maximize flight time. The subsystems of the robot are analyzed both from a broad perspective and using a detailed set of models for a piezoelectrically driven two-wing design. The models are used to perform a system-level optimization for the maximum flight time permitted by current technology, compare the resulting robot configurations to biological insects across several key metrics, and discuss the effect of performance gains in various subsystems of the robot.
2009
Meeta Gupta, Jude Rivers, Pradip Bose, Gu-Yeon Wei, and David Brooks. 12/12/2009. “Tribeca: design for PVT variations with local recovery and fine-grained adaptation.” In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Pp. 435–446. New York, NY, USA: IEEE.

With continued advances in CMOS technology, parameter variations are emerging as a major design challenge. Irregularities during the fabrication of a microprocessor and variations of voltage and temperature during its operation widen worst-case timing margins of the design - degrading performance significantly. Because runtime variations like supply voltage droops and temperature fluctuations depend on the activity signature of the processor's workload, there are several opportunities to improve performance by dynamically adapting margins. This paper explores the power-performance efficiency gains that result from designing for typical conditions while dynamically tuning frequency and voltage to accommodate the runtime behavior of workloads. Such a design depends on a fail-safe mechanism that allows it to protect against margin violations during adaptation; we evaluate several such mechanisms, and we propose a local recovery scheme that exploits spatial variation among the units of the processor. While a processor designed for worst-case conditions might only be capable of a frequency that is 75% of an ideal processor with no parameter variations, we show that a fine-grained global frequency tuning mechanism improves power-performance efficiency (BIPS³/W) by 40% while operating at 91% of an ideal processor's frequency. Moreover, a per-unit voltage tuning mechanism aims to reduce the effect of within-die spatial variations to provide a 55% increase in power-performance efficiency. The benefits reported are clearly substantial in light of the <1% area overhead relative to existing global recovery mechanisms.

Kristen Lovin, Benjamin Lee, Xiaoyao Liang, David Brooks, and Gu-Yeon Wei. 10/4/2009. “Empirical performance models for 3T1D memories.” In ICCD'09: Proceedings of the 2009 IEEE international conference on Computer design, Pp. 398–403. IEEE.
Process variation poses a threat to the performance and reliability of the 6T SRAM cell. Research has turned to new memory cell designs, such as the 3T1D DRAM cell, as potential replacement designs. If designers are to consider 3T1D memory architectures, performance models are needed to better understand memory cell behavior. We propose a decoupled approach for collecting Monte Carlo HSPICE data, reducing simulation times by simulating memory array components separately based on their contribution to the worst-case critical path. We use this Monte Carlo data to train regression models, which accurately predict retention and access times of a 3T1D memory array with a median error of 7.39%.
Hayun Chung and Gu-Yeon Wei. 9/13/2009. “Design-space exploration of backplane receivers with high-speed ADCs and digital equalization.” In 2009 IEEE Custom Integrated Circuits Conference, Pp. 555–558. IEEE.
High-speed backplane receivers based on front-end ADCs with digital equalization facilitate design reuse, portability, and flexibility to reconfigure itself and accommodate different channel environments. However, power and complexity of such receivers can be high and require thorough high-level exploration to optimize design tradeoffs. This paper presents a backplane receiver model consisting of a simple, accurate, experimentally-verified, and parameterized high-speed flash ADC and a configurable digital equalizer for design-space exploration. Simulations demonstrate tradeoffs between ADC and equalizer bit resolution while maintaining constant receiver performance.
Michael Lyons and David Brooks. 8/2009. “The design of a bloom filter hardware accelerator for ultra low power systems.” In ISLPED '09: Proceedings of the 2009 ACM/IEEE international symposium on Low power electronics and design (ISLPED), Pp. 371–376. ACM.
Battery-powered embedded systems require low energy usage to extend system lifetime. These systems must power many components for long periods of time and are particularly sensitive to energy use. Recent techniques for reducing energy consumption in wireless sensor networks, such as aggregation, require additional computation to reduce energy intensive radio transmissions. Larger demands on the processor will require more computational energy, but traditional energy reduction approaches, such as multi-core scaling with reduced frequency and voltage may prove heavy handed and ineffective for motes (sensor network nodes). Alternatively, application-specific hardware design (ASHD) architectures can reduce computational energy consumption by processing operations common to specific applications more efficiently than a general purpose processor. By the nature of their deeply embedded operation, motes support a limited set of applications, and thus the conventional general purpose computing paradigm may not be well-suited to mote operation. This paper examines the design considerations of a hardware accelerator for compressed Bloom filters, a data structure for efficiently storing set membership. We evaluate our ASHD design for three representative wireless sensor network applications and demonstrate that ASHD design reduces network latency by 59% and computational energy by 98%, showing the need for architecting processors for ASHD accelerators.
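For readers unfamiliar with the data structure named in this abstract, the following is a minimal software sketch of a plain (uncompressed) Bloom filter. It is not the paper's hardware design and does not model compression; the class name `BloomFilter`, the default sizes, and the use of SHA-256 as the hash family are assumptions chosen for illustration only.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: set membership with possible false positives,
    but no false negatives. Illustrative only; parameters are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position associated with the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # The item may be present only if all of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add("node-17")
print("node-17" in bf)  # an added item is always reported as present
```

A query for an item that was never added can still return true if other insertions happen to set all of its bits, which is the space-for-accuracy tradeoff that makes the structure attractive on memory-limited motes.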
Vijay Reddi, Meeta Gupta, Michael Smith, Gu-Yeon Wei, David Brooks, and Simone Campanoni. 7/26/2009. “Software-assisted hardware reliability: abstracting circuit-level challenges to the software stack.” In 2009 46th ACM/IEEE Design Automation Conference, Pp. 788–793. San Francisco, CA: IEEE.
Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose a hardware-software collaborative approach to enable aggressive operating margins: a checkpoint-recovery mechanism corrects margin violations, while a run-time software layer reschedules the program's instruction stream to prevent recurring margin crossings at the same program location. The run-time layer removes 60% of these events with minimal overhead, thereby significantly improving overall performance.
Hayun Chung, Alexander Rylyakov, Toprak Deniz, John Bulzacchelli, Gu-Yeon Wei, and Daniel Friedman. 6/16/2009. “A 7.5-GS/s 3.8-ENOB 52-mW flash ADC with clock duty cycle control in 65nm CMOS.” In 2009 Symposium on VLSI Circuits, Pp. 268–269. Kyoto, Japan: IEEE.

A 7.5-GS/s 4.5-bit analog-to-digital converter (ADC) in 65nm CMOS is presented. A two-stage track-and-hold (TAH) with clock duty cycle control reduces bandwidth requirements on the slow TAH output to enable high sampling rates with low power consumption. The 7.5-GS/s flash ADC consumes 52 mW and occupies 0.01 mm². Clock duty cycle control improves ENOB from 3.5 to 3.8 with an input sinusoid at the Nyquist frequency.

Krishna Rangan, Gu-Yeon Wei, and David Brooks. 6/2009. “Thread motion: fine-grained power management for multi-core systems.” In ACM SIGARCH Computer Architecture News, vol. 37, no. 3, Pp. 302–313. ACM.

Dynamic voltage and frequency scaling (DVFS) is a commonly-used power-management scheme that dynamically adjusts power and performance to the time-varying needs of running programs. Unfortunately, conventional DVFS, relying on off-chip regulators, faces limitations in terms of temporal granularity and high costs when considered for future multi-core systems. To overcome these challenges, this paper presents thread motion (TM), a fine-grained power-management scheme for chip multiprocessors (CMPs). Instead of incurring the high cost of changing the voltage and frequency of different cores, TM enables rapid movement of threads to adapt the time-varying computing needs of running applications to a mixture of cores with fixed but different power/performance levels. Results show that for the same power budget, two voltage/frequency levels are sufficient to provide performance gains commensurate to idealized scenarios using per-core voltage control. Thread motion extends workload-based power management into the nanosecond realm and, for a given power budget, provides up to 20% better performance than coarse-grained DVFS.

Michael Karpelson, Gu-Yeon Wei, and J. Wood. 5/12/2009. “Milligram-scale high-voltage power electronics for piezoelectric microrobots.” In 2009 IEEE international conference on robotics and automation, Pp. 2217–2224. IEEE.

Compact yet powerful actuators are vital in many robotic applications, particularly small-scale autonomous systems such as bio-inspired microrobots. In recent years, a number of actuation methods have been proposed or applied in a microrobotic context, including piezoelectric [1], electrostatic [2], and dielectric elastomer actuators [3]. These actuation methods have the potential to achieve high efficiencies and power densities in very small geometries. Piezoelectric actuators in particular have shown promise in applications with very stringent weight and power density requirements, such as the Harvard Microrobotic Fly (HMF)—a flapping-wing robotic insect capable of liftoff with external power [4].

In order to produce mechanical output, the actuation methods mentioned above rely on the presence of electric charge on various electrodes in order to either generate high electric fields, as in the case of piezoelectric actuators, or high electrostatic forces, as in the case of electrostatic and dielectric elastomer actuators. Moreover, the geometries of such actuators inherently produce significant electrical capacitance, and therefore high operating voltages are usually necessary to accumulate a sufficient amount of charge on the actuator electrodes, ranging from tens to thousands of volts. For example, the piezoelectric actuators used in the HMF require drive voltages in the range of 200–300V. There are two major challenges in the design of power electronics capable of driving capacitive actuators: generating high voltages from low-voltage sources and recovering unused energy from the actuator.

Most compact energy sources suitable for microrobotic applications, such as lithium batteries, supercapacitors [5], solar cells [6], and fuel cells [7], generate output voltages below 5V. Connecting many such cells in series to obtain high voltage is generally not practical because the packaging overhead causes a significant reduction in energy density. Consequently, the generation of high voltages for HMF actuators requires voltage conversion circuits with step-up ratios ranging from 50 to 100. While there are a number of circuit topologies with high step-up ratios, many of them cannot be easily miniaturized and/or suffer from poor efficiency at the low output power levels common in microrobotic applications. Careful selection and optimization of the conversion circuit is necessary to ensure that heavy, inefficient electronics do not compromise system performance.

In addition to the voltage step-up functionality, the power electronics circuitry must generate a time-varying signal on the input electrodes of the actuator. The second challenge stems from the fact that, depending on the properties of the actuator, the nature of the mechanical load, and the characteristics of the drive signal, only a small fraction of the electrical energy stored in the actuator is converted into useful mechanical output [8]. In order to maximize overall system efficiency, it is highly desirable to both generate an appropriate drive signal and recover as much of the unused energy as possible, which imposes additional requirements on the drive circuitry.

This paper describes promising power electronics circuits that can generate the high, time-varying voltages necessary for the operation of piezoelectric actuators, while meeting the stringent weight requirements of microrobotic systems and maximizing system efficiency. Although the analysis focuses on piezoelectric actuators, many of the concepts described here can easily be adapted to other high-voltage capacitive actuators, such as electrostatic comb drives or dielectric elastomer actuators. This work reviews the electrical properties and drive requirements of piezoelectric actuators (Section II), and presents power electronics circuits applicable to various types and configurations of piezoelectric actuators (Sections III and IV). Experimental realizations of the drive circuits are described (Section V), including applications to milligram-scale microrobots, such as flapping-wing robotic insects.

Meeta Gupta, Vijay Reddi, Glenn Holloway, Gu-Yeon Wei, and David Brooks. 4/20/2009. “An event-guided approach to reducing voltage noise in processors.” In Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE'09., Pp. 160–165. Nice, France: IEEE.
Kevin Brownell, Ali Khan, David Brooks, and Gu-Yeon Wei. 3/16/2009. “Place and route considerations for voltage interpolated designs.” In Quality of Electronic Design, 2009. ISQED 2009. Quality Electronic Design, Pp. 594–600. IEEE.

Voltage interpolation is a promising post fabrication technique for combating the effects of process variations. The benefits of voltage interpolation are well understood. Its implementation in a VLSI-CAD flow has been considered through the synthesis stage. In this paper we study the implications of place and route on voltage interpolation. We evaluate multiple placement strategies, and conclude that a hybridization of forced placement and cluster boxing techniques results in minimum overhead.

Vijay Reddi, Meeta Gupta, Glenn Holloway, Gu-Yeon Wei, Michael Smith, and David Brooks. 2/14/2009. “Voltage emergency prediction: Using signatures to reduce operating margins.” In 2009 IEEE 15th International Symposium on High Performance Computer Architecture, Pp. 18–29. Raleigh, NC, USA: IEEE.
Inductive noise forces microprocessor designers to sacrifice performance in order to ensure correct and reliable operation of their designs. The possibility of wide fluctuations in supply voltage means that timing margins throughout the processor must be set pessimistically to protect against worst-case droops and surges. While sensor-based reactive schemes have been proposed to deal with voltage noise, inherent sensor delays limit their effectiveness. Instead, this paper describes a voltage emergency predictor that learns the signatures of voltage emergencies (the combinations of control flow and microarchitectural events leading up to them) and uses these signatures to prevent recurrence of the corresponding emergencies. In simulations of a representative superscalar microprocessor in which fluctuations beyond 4% of nominal voltage are treated as emergencies (an aggressive configuration), these signatures can pinpoint the likelihood of an emergency some 16 cycles ahead of time with 90% accuracy. This lead time allows machines to operate with much tighter voltage margins (4% instead of 13%) and up to 13.5% higher performance, which closely approaches the 14.2% performance improvement possible with an ideal oracle-based predictor.
Xiaoyao Liang, Benjamin Lee, Gu-Yeon Wei, and David Brooks. 2009. “Design and test strategies for microarchitectural post-fabrication tuning.” In Computer Design, 2009. ICCD 2009. IEEE International Conference on, Pp. 84–90. IEEE.
Process variations are a major hurdle for continued technology scaling. Both systematic and random variations will affect the critical delay of fabricated chips, causing a wide frequency and power distribution. Tuning techniques adapt the microarchitecture to mitigate the impact of variations at post-fabrication testing time. This paper proposes a new post-fabrication testing framework that accounts for testing costs. This framework uses on-chip canary circuits to capture systematic variation while using statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. These techniques comprise an integrated framework that identifies the most energy efficient post-fabrication tuning configuration for each chip.
Lukasz Strozek and David Brooks. 2009. “Efficient architectures through application clustering and heterogeneity.” In CASES '06: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 190–200. Citeseer.

Customizing architectures for particular applications is a promising approach to yield highly energy-efficient designs for embedded systems. This work explores the benefits of architectural customization for a class of embedded architectures typically used in energy-constrained application domains such as sensor node and multimedia processing. We implement a process flow that analyzes runtime profiles of applications and combines this information with a model for our architectural design space providing a robust customization engine built upon a fully automated method for determining an efficient architecture (together with appropriate application transformations). By profiling embedded benchmarks from a variety of sensor and multimedia applications, the paper shows the relative energy savings resulting from various architectural optimizations and identifies the number of architectures that achieves near-optimal savings for a group of applications. This paper proposes the use of heterogeneous chip-multiprocessors as a cost-effective approach to capitalize on the potential energy savings provided by application customization while executing a range of applications efficiently.
