Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The Accelerator Store: a shared memory framework for accelerator-based systems.” Transactions on Architecture and Code Optimization (TACO).
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
Amanda Tseng and David Brooks. 1/2012. “Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors.” In Workshop on Energy Efficient Design, ISCA, 600: Pp. 800.
Despite increased core counts that provide significant throughput performance gains, single thread performance is still an important metric in today’s processor designs. Due to chip power constraints, architects must carefully allocate power budgets to additional cores or increased single thread performance. To study this tradeoff between different performance metrics, we construct an analytical model that computes single thread and throughput performance under a given power budget for both symmetric and asymmetric multicore architectures. We also consider multi-task workloads, where optimal designs might include more than one large core in the heterogeneous architecture. Our analytical model considers the optimal number and complexity of cores in a processor and quantifies the benefits of asymmetric designs when trading latency and throughput. We show that a diverse set of core designs can be optimal in different scenarios.
Wonyoung Kim, David Brooks, and Gu Wei. 1/2012. “A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS.” Solid-State Circuits, IEEE Journal of, 47, 1, Pp. 206–219.
On-chip DC-DC converters have the potential to offer fine-grain power management in modern chip-multiprocessors. This paper presents a fully integrated 3-level DC-DC converter, a hybrid of buck and switched-capacitor converters, implemented in 130 nm CMOS technology. The 3-level converter enables smaller inductors (1 nH) than a buck, while generating a wide range of output voltages compared to a 1/2 mode switched-capacitor converter. The test-chip prototype delivers up to 0.85 A load current while generating output voltages from 0.4 to 1.4 V from a 2.4 V input supply. It achieves 77% peak efficiency at power density of 0.1 W/mm² and 63% efficiency at maximum power density of 0.3 W/mm². The converter scales output voltage from 0.4 V to 1.4 V (or vice-versa) within 20 ns at a constant 450 mA load current. A shunt regulator reduces peak-to-peak voltage noise from 0.27 V to 0.19 V under pseudo-randomly fluctuating load currents. Using simulations across a wide range of design parameters, the paper compares conversion efficiencies of the 3-level, buck and switched-capacitor converters.
Michael Lyons, Gu Wei, and David Brooks. 1/2012. “Shrink-Fit: A Framework for Flexible Accelerator Sizing.” IEEE Computer Architecture Letters, 12, 1, Pp. 17–20.
RTL design complexity discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to “shrink-fit” accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16x with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.
Javier Lira, Carlos Molina, David Brooks, and Antonio Gonzalez. 12/18/2011. “Implementing a hybrid SRAM/eDRAM NUCA architecture.” In High Performance Computing (HiPC), 2011 18th International Conference on, Pp. 1–10. Bengaluru, India: IEEE.
Advances in technology have allowed DRAM-like structures, called embedded DRAM (eDRAM), to be integrated into the chip. This technology has already been successfully implemented in some GPUs and other graphics-intensive SoCs, such as game consoles. The most recent processor from IBM®, POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is not accessed again before it is evicted. Based on that observation, we propose a placement scheme where re-accessed data blocks are stored in SRAM banks, which are fast but costly in terms of area and power, while eDRAM banks store data blocks that have just arrived in the NUCA cache or were demoted from an SRAM bank. We show that a well-balanced SRAM/eDRAM NUCA cache can achieve performance similar to that of a NUCA cache composed of only SRAM banks, while reducing area by 15% and power consumption by 10%. Furthermore, we explore several alternatives to exploit the area reduction gained by using the hybrid architecture, resulting in an overall performance improvement of 4%.
Vijay Reddi and David Brooks. 10/2011. “Resilient Architectures via Collaborative Design: Maximizing Commodity Processor Performance in the Presence of Variations.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30, 10, Pp. 1429–1445.
Unintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuit-level challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI/dt problem. The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems.
Pierre Duhamel, Judson Porter, Benjamin Finio, Geoffrey Barrows, David Brooks, Gu Wei, and Robert Wood. 9/25/2011. “Hardware in the loop for optical flow sensing in a robotic bee.” In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, Pp. 1099–1106. San Francisco, CA, USA: IEEE.
The design of autonomous robots involves the development of many complex, interdependent components, including the mechanical body and its associated actuators, sensors, and algorithms to handle sensor processing, control, and high-level task planning. For the design of a robotic bee (RoboBee) it is necessary to optimize across the design space for minimum weight and power consumption to increase flight time; however, the design space of a single component is large, the interconnectedness and tradeoffs across components must be considered, and interdisciplinary collaborations cause different component design timelines. In this work, we show how the development of a hardware in the loop (HWIL) system for a flapping wing microrobot can simplify and accelerate evaluation of a large number of design choices. Specifically, we explore the design space of the visual system including sensor hardware and associated optical flow processing. We demonstrate the utility of the HWIL system in exposing trends on system performance for optical flow algorithm, field of view, sensor resolution, and frame rate.
Ankur Agrawal, Kumar Hanumolu, and Gu Wei. 9/19/2011. “Area efficient phase calibration of a 1.6 GHz multiphase DLL.” In 2011 IEEE Custom Integrated Circuits Conference (CICC), Pp. 1–4. IEEE.
This paper describes a digital calibration scheme that corrects for phase spacing errors in a multiphase clock generating delay-locked loop (DLL). The calibration scheme employs sub-sampling using a frequency-offset clock with respect to the DLL reference clock, to measure phase-offsets. The phase-correction circuit uses one digital-to-analog converter across eight variable-delay buffers to reduce the area consumption by 62%. The test-chip, designed in a 130 nm CMOS process, demonstrates an 8-phase 1.6 GHz DLL with a worst-case phase error of 450 fs.
David Brooks. 9/2011. “CPUs, GPUs, and Hybrid Computing.” IEEE Micro, Pp. 4–6.
This introduction to the special issue discusses advances and challenges in the field of hybrid CPU/GPU computing.
Hayun Chung and Gu Wei. 8/7/2011. “Design considerations for ADC-based backplane receivers.” In 2011 IEEE 54th International Midwest Symposium on Circuits and Systems (MWSCAS), Pp. 1–4. IEEE.
High-speed ADC-based backplane receivers often suffer from high power consumption and complexity and require careful designs. This paper discusses circuit- and system-level design considerations for such receivers. A low-power, high-speed front-end ADC circuit and a high-level design-space exploration of ADC-based receivers are presented.
Mark Hempstead, David Brooks, and Gu Wei. 7/2011. “An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS.” Emerging and Selected Topics in Circuits and Systems, IEEE Journal on, 1, 2, Pp. 193–202.
Networks of ultra-low-power nodes capable of sensing, computation, and wireless communication have applications in medicine, science, industrial automation, and security. Reducing power consumption requires the development of system-on-chip implementations that must provide both energy efficiency and adequate performance to meet the demands of the long deployment lifetimes and bursts of computation that characterize wireless sensor network (WSN) applications. Therefore, this work argues that designers should evaluate the design in terms of average power for an entire workload, including active and idle periods, not just the metric of energy-per-instruction.
Michael Karpelson, Robert J Wood, and Gu Wei. 6/15/2011. “Low power control IC for efficient high-voltage piezoelectric driving in a flying robotic insect.” In 2011 Symposium on VLSI Circuits-Digest of Technical Papers, Pp. 178–179. IEEE.
A dual-channel, low power control IC for driving high voltage piezoelectric actuators in a flapping-wing robotic insect is presented. The IC controls milligram-scale power electronics that meet the stringent weight and power requirements of aerial microrobots. Designed in a 0.13µm CMOS process, the IC implements an efficient control algorithm to drive piezoelectric actuators with high temporal resolution while consuming <100µW during normal operation at 1.0V. Keywords: low power, SOC, high voltage, piezoelectric actuator, and microrobotics.
Peter Bailis, Vijay Reddi, Sanjay Gandhi, David Brooks, and Margo Seltzer. 6/9/2011. “Dimetrodon: processor-level preventive thermal management via idle cycle injection.” In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, Pp. 89–94. San Diego, CA, USA: IEEE.
Processor-level dynamic thermal management techniques have long targeted worst-case thermal margins. We examine the thermal-performance trade-offs in average-case, preventive thermal management by actively degrading application performance to achieve long-term thermal control. We propose Dimetrodon, the use of idle cycle injection, a flexible, per-thread technique, as a preventive thermal management mechanism and demonstrate its efficiency compared to hardware techniques in a commodity operating system on real hardware under throughput and latency-sensitive real-world workloads. Compared to inflexible hardware techniques, Dimetrodon achieves favorable trade-offs for temperature reductions up to 30% due to rapid heat dissipation during short idle intervals.
Michael Karpelson, Whitney P, Gu Wei, and Wood J. 3/6/2011. “Design and fabrication of ultralight high-voltage power circuits for flapping-wing robotic insects.” In 2011 Twenty-Sixth Annual IEEE Applied Power Electronics Conference and Exposition (APEC), Pp. 2070–2077. IEEE.
Flapping-wing robotic insects are small, highly maneuverable flying robots inspired by biological insects and useful for a wide range of tasks, including exploration, environmental monitoring, search and rescue, and surveillance. Recently, robotic insects driven by piezoelectric actuators have achieved the important goal of taking off with external power; however, fully autonomous operation requires an ultralight power supply capable of generating high-voltage drive signals from low-voltage energy sources. This paper describes high-voltage switching circuit topologies and control methods suitable for driving piezoelectric actuators in flapping-wing robotic insects and discusses the physical implementation of these topologies, including the fabrication of custom magnetic components by laser micromachining and other weight minimization techniques. The performance of laser micromachined magnetics and custom-wound commercial magnetics is compared through the experimental realization of a tapped inductor boost converter capable of stepping up a 3.7V Li-poly cell input to 200V. The potential of laser micromachined magnetics is further shown by implementing a similar converter weighing 20mg (not including control functionality) and capable of up to 70mW output at 200V and up to 100mW at 100V.
Kevin Brownell, Ali Khan, Gu Wei, and David Brooks. 3/1/2011. “Automating design of voltage interpolation to address process variations.” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 99, Pp. 1–14.

Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. Successful implementation of this technique requires examination of the design tradeoffs between circuit tuning range and static power overheads within the synthesis flow of the design process, in addition to the implications of place and route. Results from the exploration of the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks show that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. A design using voltage interpolation can match the nominal delay target with a 16% power cost, or for the same power budget, incur only a 13% delay overhead after variations.

Wonyoung Kim, David Brooks, and Gu Wei. 2/20/2011. “A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation.” In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Pp. 268–270. IEEE.
In recent years, chip multiprocessor architectures have emerged to scale performance while staying within tight power constraints. This trend motivates per core/block dynamic voltage and frequency scaling (DVFS) with fast voltage transition. Given the high cost and bulk of off-chip DC/DC converters to implement multiple on-chip power domains, there has been a surge of interest in on-chip converters. This paper presents the design and experimental results of a fully integrated 3-level DC/DC converter that merges characteristics of both inductor-based buck and switched-capacitor (SC) converters. While off-chip buck converters show high conversion efficiency, their on-chip counterparts suffer from loss due to low quality inductors. With the help of flying capacitors, the 3-level converter requires smaller inductors than the buck converter, reducing loss and on-die area overhead. Compared to SC converters that need more complex structures to regulate higher than half the input voltage, 3-level converters can efficiently regulate the output voltage across a wide range of levels and load currents. Measured results from a 130nm CMOS test-chip prototype demonstrate nanosecond-scale voltage transition times and peak conversion efficiency of 77%.
Krishna Rangan, Michael Powell, Gu Wei, and David Brooks. 2/12/2011. “Achieving uniform performance and maximizing throughput in the presence of heterogeneity.” In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Pp. 3–14. IEEE.
Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations - the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single-frequency to use as the chip sales frequency. We examine several scheduling algorithms implemented below the O/S in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation induced heterogeneity from the end-user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, providing a single uniform performance level to present for the chip.
David Brooks. 1/25/2011. “The alarms project: a hardware/software approach to addressing parameter variations.” In Design Automation Conference (ASP-DAC), 2011 16th Asia and South Pacific, Pp. 291–291. IEEE.
Parameter variations (process, voltage, and temperature) threaten continued performance scaling of power-constrained computer systems. As designers seek to contain the power consumption of microprocessors through reductions in supply voltage and power-saving techniques such as clock-gating, these systems suffer increasingly large power supply fluctuations due to the finite impedance of the power supply network. These supply fluctuations, referred to as voltage emergencies, must be managed to guarantee correctness. Traditional approaches to address this problem incur high-cost or compromise power/performance efficiency. Our research seeks ways to handle these alarm conditions through a combined hardware/software approach, motivated by root cause analysis of voltage emergencies revealing that many of these events are heavily linked to both program control flow and microarchitectural events (cache misses and pipeline flushes). This talk will discuss three aspects of the project: (1) a fail-safe mechanism that provides hardware guaranteed correctness; (2) a voltage emergency predictor that leverages control flow and microarchitectural event information to predict voltage emergencies up to 16 cycles in advance; and (3) a proof-of-concept dynamic compiler implementation that demonstrates that dynamic code transformations can be used to eliminate voltage emergencies from the instruction stream with minimal impact on performance [1–9].
Vijay Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael Smith, Gu Wei, and David Brooks. 1/2011. “Voltage Noise in Production Processors.” IEEE Micro, 31, 1.
Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems.
Vijay Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael Smith, Gu Wei, and David Brooks. 12/4/2010. “Voltage smoothing: Characterizing and mitigating voltage noise in production processors via software-guided thread scheduling.” In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE.
Parameter variations have become a dominant challenge in microprocessor design. Voltage variation is especially daunting because it happens so rapidly. We measure and characterize voltage variation in a running Intel Core2 Duo processor. By sensing on-die voltage as the processor runs single-threaded, multi-threaded, and multi-program workloads, we determine the average supply voltage swing of the processor to be only 4%, far from the processor’s 14% worst-case operating voltage margin. While such large margins guarantee correctness, they penalize performance and power efficiency. We investigate and quantify the benefits of designing a processor for typical-case (rather than worst-case) voltage swings, assuming that a fail-safe mechanism protects it from infrequently occurring large voltage fluctuations. With today’s processors, such resilient designs could yield 15% to 20% performance improvements. But we also show that in future systems, these gains could be lost as increasing voltage swings intensify the frequency of fail-safe recoveries. After characterizing microarchitectural activity that leads to voltage swings within multi-core systems, we show that a voltage-noise-aware thread scheduler in software can co-schedule phases of different programs to mitigate error recovery overheads in future resilient processor designs.