Publications

2012
Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Reddi, Gu Wei, and David Brooks. 3/31/2012. “HELIX: Automatic parallelization of irregular programs for chip multiprocessing.” In International Symposium on Code Generation and Optimization (CGO). ACM.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.
H Chung and Gu Wei. 1/5/2012. “Simulated-annealing-based adaptive equaliser for on-die variation compensation.” Electronics Letters, 48, 1, Pp. 18–19.
Fully exploiting the flexibility of lookup-table-based equalisers, it is proposed to compensate for on-die variation effects within a transmit-side equaliser. To efficiently deal with the nonlinear nature of circuit non-idealities, the proposed equaliser utilises simulated annealing for adaptation.

Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The Accelerator Store: a shared memory framework for accelerator-based systems.” Transactions on Architecture and Code Optimization (TACO).
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The accelerator store: A shared memory framework for accelerator-based systems.” ACM Transactions on Architecture and Code Optimization, 8, 4, Pp. 1–22.
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.

Amanda Tseng and David Brooks. 1/2012. “Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors.” In Workshop on Energy Efficient Design, ISCA, 600: Pp. 800.
Despite increased core counts that provide significant throughput performance gains, single thread performance is still an important metric in today’s processor designs. Due to chip power constraints, architects must carefully allocate power budgets to additional cores or increased single thread performance. To study this tradeoff between different performance metrics, we construct an analytical model that computes single thread and throughput performance under a given power budget for both symmetric and asymmetric multicore architectures. We also consider multi-task workloads, where optimal designs might include more than one large core in the heterogeneous architecture. Our analytical model considers the optimal number and complexity of cores in a processor and quantifies the benefits of asymmetric designs when trading latency and throughput. We show that a diverse set of core designs can be optimal in different scenarios.
Wonyoung Kim, David Brooks, and Gu Wei. 1/2012. “A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS.” Solid-State Circuits, IEEE Journal of, 47, 1, Pp. 206–219.
On-chip DC-DC converters have the potential to offer fine-grain power management in modern chip-multiprocessors. This paper presents a fully integrated 3-level DC-DC converter, a hybrid of buck and switched-capacitor converters, implemented in 130 nm CMOS technology. The 3-level converter enables smaller inductors (1 nH) than a buck, while generating a wide range of output voltages compared to a 1/2 mode switched-capacitor converter. The test-chip prototype delivers up to 0.85 A load current while generating output voltages from 0.4 to 1.4 V from a 2.4 V input supply. It achieves 77% peak efficiency at power density of 0.1 W/mm² and 63% efficiency at maximum power density of 0.3 W/mm². The converter scales output voltage from 0.4 V to 1.4 V (or vice-versa) within 20 ns at a constant 450 mA load current. A shunt regulator reduces peak-to-peak voltage noise from 0.27 V to 0.19 V under pseudo-randomly fluctuating load currents. Using simulations across a wide range of design parameters, the paper compares conversion efficiencies of the 3-level, buck and switched-capacitor converters.
Michael Lyons, Gu Wei, and David Brooks. 1/2012. “Shrink-Fit: A Framework for Flexible Accelerator Sizing.” IEEE Computer Architecture Letters, 12, 1, Pp. 17–20.
RTL design complexity discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to “shrink-fit” accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16x with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.
2011
Javier Lira, Carlos Molina, David Brooks, and Antonio Gonzalez. 12/18/2011. “Implementing a hybrid SRAM/eDRAM NUCA architecture.” In High Performance Computing (HiPC), 2011 18th International Conference on, Pp. 1–10. Bengaluru, India: IEEE.
Advances in technology allowed for integrating DRAM-like structures into the chip, called embedded DRAM (eDRAM). This technology has already been successfully implemented in some GPUs and other graphic-intensive SoC, like game consoles. The most recent processor from IBM®, POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies, speed of SRAM and high density of eDRAM. We demonstrate, that due to the high locality found in emerging applications, a high percentage of data that enters to the on-chip last-level cache are not accessed again before they are evicted. Based on that observation, we propose a placement scheme where re-accessed data blocks are stored in fast, but costly in terms of area and power, SRAM banks, while eDRAM banks store data blocks that just arrive to the NUCA cache or were demoted from a SRAM bank. We show that a well-balanced SRAM/eDRAM NUCA cache can achieve similar performance results than using a NUCA cache composed of only SRAM banks, but reduces area by 15% and power consumed by 10%. Furthermore, we also explore several alternatives to exploit the area reduction we gain by using the hybrid architecture, resulting in an overall performance improvement of 4%.
Vijay Reddi and David Brooks. 10/2011. “Resilient Architectures via Collaborative Design: Maximizing Commodity Processor Performance in the Presence of Variations.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30, 10, Pp. 1429–1445.
Unintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuit-level challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI/dt problem. The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems.
Pierre Duhamel, Judson Porter, Benjamin Finio, Geoffrey Barrows, David Brooks, Gu Wei, and Robert Wood. 9/25/2011. “Hardware in the loop for optical flow sensing in a robotic bee.” In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, Pp. 1099–1106. San Francisco, CA, USA: IEEE.
The design of autonomous robots involves the development of many complex, interdependent components, including the mechanical body and its associated actuators, sensors, and algorithms to handle sensor processing, control, and high-level task planning. For the design of a robotic bee (RoboBee) it is necessary to optimize across the design space for minimum weight and power consumption to increase flight time; however, the design space of a single component is large, the interconnectedness and tradeoffs across components must be considered, and interdisciplinary collaborations cause different component design timelines. In this work, we show how the development of a hardware in the loop (HWIL) system for a flapping wing microrobot can simplify and accelerate evaluation of a large number of design choices. Specifically, we explore the design space of the visual system including sensor hardware and associated optical flow processing. We demonstrate the utility of the HWIL system in exposing trends on system performance for optical flow algorithm, field of view, sensor resolution, and frame rate.
Ankur Agrawal, Kumar Hanumolu, and Gu Wei. 9/19/2011. “Area efficient phase calibration of a 1.6 GHz multiphase DLL.” In 2011 IEEE Custom Integrated Circuits Conference (CICC), Pp. 1–4. IEEE.
This paper describes a digital calibration scheme that corrects for phase spacing errors in a multiphase clock generating delay-locked loop (DLL). The calibration scheme employs sub-sampling using a frequency-offset clock with respect to the DLL reference clock, to measure phase-offsets. The phase-correction circuit uses one digital-to-analog converter across eight variable-delay buffers to reduce the area consumption by 62%. The test-chip, designed in a 130 nm CMOS process, demonstrates a 8-phase 1.6 GHz DLL with a worst-case phase error of 450 fs.
David Brooks. 9/2011. “CPUs, GPUs, and Hybrid Computing.” IEEE Micro, Pp. 4–6.
This introduction to the special issue discusses advances and challenges in the field of hybrid CPU/GPU computing.
Hayun Chung and Gu Wei. 8/7/2011. “Design considerations for ADC-based backplane receivers.” In 2011 IEEE 54th International Midwest Symposium on Circuits and Systems (MWSCAS), Pp. 1–4. IEEE.
High-speed ADC-based backplane receivers often suffer from high power consumption and complexity and require careful designs. This paper discusses circuit- and system-level design considerations for such receivers. A low-power, high-speed front-end ADC circuit and a high-level design-space exploration of ADC-based receivers are presented.
Mark Hempstead, David Brooks, and Gu Wei. 7/2011. “An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS.” Emerging and Selected Topics in Circuits and Systems, IEEE Journal on, 1, 2, Pp. 193–202.
Networks of ultra-low-power nodes capable of sensing, computation, and wireless communication have applications in medicine, science, industrial automation, and security. Reducing power consumption requires the development of system-on-chip implementations that must provide both energy efficiency and adequate performance to meet the demands of the long deployment lifetimes and bursts of computation that characterize wireless sensor network (WSN) applications. Therefore, this work argues that designers should evaluate the design in terms of average power for an entire workload, including active and idle periods, not just the metric of energy-per-instruction.
Michael Karpelson, Robert J Wood, and Gu Wei. 6/15/2011. “Low power control IC for efficient high-voltage piezoelectric driving in a flying robotic insect.” In 2011 Symposium on VLSI Circuits-Digest of Technical Papers, Pp. 178–179. IEEE.
A dual-channel, low power control IC for driving high voltage piezoelectric actuators in a flapping-wing robotic insect is presented. The IC controls milligram-scale power electronics that meet the stringent weight and power requirements of aerial microrobots. Designed in a 0.13µm CMOS process, the IC implements an efficient control algorithm to drive piezoelectric actuators with high temporal resolution while consuming <100µW during normal operation at 1.0V. Keywords: low power, SOC, high voltage, piezoelectric actuator, and microrobotics.
Peter Bailis, Vijay Reddi, Sanjay Gandhi, David Brooks, and Margo Seltzer. 6/9/2011. “Dimetrodon: processor-level preventive thermal management via idle cycle injection.” In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, Pp. 89–94. San Diego, CA, USA: IEEE.
Processor-level dynamic thermal management techniques have long targeted worst-case thermal margins. We examine the thermal-performance trade-offs in average-case, preventive thermal management by actively degrading application performance to achieve long-term thermal control. We propose Dimetrodon, the use of idle cycle injection, a flexible, per-thread technique, as a preventive thermal management mechanism and demonstrate its efficiency compared to hardware techniques in a commodity operating system on real hardware under throughput and latency-sensitive real-world workloads. Compared to inflexible hardware techniques, Dimetrodon achieves favorable trade-offs for temperature reductions up to 30% due to rapid heat dissipation during short idle intervals.
Michael Karpelson, Whitney P, Gu Wei, and Wood J. 3/6/2011. “Design and fabrication of ultralight high-voltage power circuits for flapping-wing robotic insects.” In 2011 Twenty-Sixth Annual IEEE Applied Power Electronics Conference and Exposition (APEC), Pp. 2070–2077. IEEE.
Flapping-wing robotic insects are small, highly maneuverable flying robots inspired by biological insects and useful for a wide range of tasks, including exploration, environmental monitoring, search and rescue, and surveillance. Recently, robotic insects driven by piezoelectric actuators have achieved the important goal of taking off with external power; however, fully autonomous operation requires an ultralight power supply capable of generating high-voltage drive signals from low-voltage energy sources. This paper describes high-voltage switching circuit topologies and control methods suitable for driving piezoelectric actuators in flapping-wing robotic insects and discusses the physical implementation of these topologies, including the fabrication of custom magnetic components by laser micromachining and other weight minimization techniques. The performance of laser micromachined magnetics and custom-wound commercial magnetics is compared through the experimental realization of a tapped inductor boost converter capable of stepping up a 3.7V Li-poly cell input to 200V. The potential of laser micromachined magnetics is further shown by implementing a similar converter weighing 20mg (not including control functionality) and capable of up to 70mW output at 200V and up to 100mW at 100V.
Kevin Brownell, Ali Khan, Gu Wei, and David Brooks. 3/1/2011. “Automating design of voltage interpolation to address process variations.” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 99, Pp. 1–14.
Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. Successful implementation of this technique requires examination of the design tradeoffs between circuit tuning range and static power overheads within the synthesis flow of the design process, in addition to the implications of place and route. Results from the exploration of the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks show that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. A design using voltage interpolation can match the nominal delay target with a 16% power cost, or for the same power budget, incur only a 13% delay overhead after variations.

Wonyoung Kim, David Brooks, and Gu Wei. 2/20/2011. “A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation.” In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Pp. 268–270. IEEE.
In recent years, chip multiprocessor architectures have emerged to scale performance while staying within tight power constraints. This trend motivates per core/block dynamic voltage and frequency scaling (DVFS) with fast voltage transition. Given the high cost and bulk of off-chip DC/DC converters to implement multiple on-chip power domains, there has been a surge of interest in on-chip converters. This paper presents the design and experimental results of a fully integrated 3-level DC/DC converter that merges characteristics of both inductor-based buck and switched-capacitor (SC) converters. While off-chip buck converters show high conversion efficiency, their on-chip counterparts suffer from loss due to low quality inductors. With the help of flying capacitors, the 3-level converter requires smaller inductors than the buck converter, reducing loss and on-die area overhead. Compared to SC converters that need more complex structures to regulate higher than half the input voltage, 3-level converters can efficiently regulate the output voltage across a wide range of levels and load currents. Measured results from a 130nm CMOS test-chip prototype demonstrate nanosecond-scale voltage transition times and peak conversion efficiency of 77%.
Krishna Rangan, Michael Powell, Gu Wei, and David Brooks. 2/12/2011. “Achieving uniform performance and maximizing throughput in the presence of heterogeneity.” In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Pp. 3–14. IEEE.
Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations - the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single-frequency to use as the chip sales frequency. We examine several scheduling algorithms implemented below the O/S in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation induced heterogeneity from the end-user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, providing a single uniform performance level to present for the chip.