2012
Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Reddi, Gu Wei, and David Brooks. 3/31/2012. “
HELIX: Automatic parallelization of irregular programs for chip multiprocessing.” In International Symposium on Code Generation and Optimization (CGO). ACM.
Publisher's VersionAbstractWe describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.
HELIX: Automatic parallelization of irregular programs for chip multiprocessing H Chung and Gu Wei. 1/5/2012. “
Simulated-annealing-based adaptive equaliser for on-die variation compensation.” Electronics letters, 48, 1, Pp. 18–19.
Publisher's VersionAbstract
Fully exploiting the flexibility of lookup-table-based equalisers, it is proposed to compensate for on-die variation effects within a transmit-side equaliser. To efficiently deal with the nonlinear nature of circuit non-idealities, the proposed equaliser utilises simulated annealing for adaptation.
Simulated-annealing-based adaptive equaliser for on-die variation compensation Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “
The Accelerator Store: a shared memory framework for accelerator-based systems.” Transactions on Architecture and Code Optimization (TACO).
Publisher's VersionAbstractThis paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
The accelerator store: A shared memory framework for accelerator-based systems Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “
The accelerator store: A shared memory framework for accelerator-based systems.” ACM Transactions on Architecture. and Code Optimization, 8, 4, Pp. 1-22.
Publisher's VersionAbstract
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%--8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
The accelerator store: A shared memory framework for accelerator-based systems Amanda Tseng and David Brooks. 1/2012. “
Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors.” In Workshop on Energy Efficient Design, ISCA, 600: Pp. 800.
AbstractDespite increased core counts that provide significant throughput performance gains, single thread performance is still an important metric in today’s processor designs. Due to chip power constraints, architects must carefully allocate power budgets to additional cores or increased single thread performance. To study this tradeoff between different performance metrics, we construct an analytical model that computes single thread and throughput performance under a given power budget for both symmetric and asymmetric multicore architectures. We also consider multi-task workloads, where optimal designs might include more than one large core in the heterogeneous architecture. Our analytical model considers the optimal number and complexity of cores in a processor and quantifies the benefits of asymmetric designs when trading latency and throughput. We show that a diverse set of core designs can be optimal in different scenarios.
Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors Wonyoung Kim, David Brooks, and Gu Wei. 1/2012. “
A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS.” Solid-State Circuits, IEEE Journal of, 47, 1, Pp. 206–219.
Publisher's VersionAbstractOn-chip DC-DC converters have the potential to offer fine-grain power management in modern chip-multiprocessors. This paper presents a fully integrated 3-level DC-DC converter, a hybrid of buck and switched-capacitor converters, implemented in 130 nm CMOS technology. The 3-level converter enables smaller inductors (1 nH) than a buck, while generating a wide range of output voltages compared to a 1/2 mode switched-capacitor converter. The test-chip prototype delivers up to 0.85 A load current while generating output voltages from 0.4 to 1.4 V from a 2.4 V input supply. It achieves 77% peak efficiency at power density of 0.1 W/mm 2 and 63% efficiency at maximum power density of 0.3 W/mm 2 . The converter scales output voltage from 0.4 V to 1.4 V (or vice-versa) within 20 ns at a constant 450 mA load current. A shunt regulator reduces peak-to-peak voltage noise from 0.27 V to 0.19 V under pseudo-randomly fluctuating load currents. Using simulations across a wide range of design parameters, the paper compares conversion efficiencies of the 3-level, buck and switched-capacitor converters.
A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS Michael Lyons, Gu Wei, and David Brooks. 1/2012. “
Shrink-Fit: A Framework for Flexible Accelerator Sizing.” IEEE Computer Architecture Letters, 12, 1, Pp. 17 - 20.
Publisher's VersionAbstractRTL design complexity discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to “shrink-fit” accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16x with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.
Shrink-Fit: A Framework for Flexible Accelerator Sizing 2011
Javier Lira, Carlos Molina, David Brooks, and Antonio Gonzalez. 12/18/2011. “
Implementing a hybrid SRAM/eDRAM NUCA architecture.” In High Performance Computing (HiPC), 2011 18th International Conference on, Pp. 1–10. Bengaluru, India: IEEE.
Publisher's VersionAbstractAdvances in technology allowed for integrating DRAM-like structures into the chip, called embedded DRAM (eDRAM). This technology has already been successfully implemented in some GPUs and other graphic-intensive SoC, like game consoles. The most recent processor from IBM ® , POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies, speed of SRAM and high density of eDRAM. We demonstrate, that due to the high locality found in emerging applications, a high percentage of data that enters to the on-chip last-level cache are not accessed again before they are evicted. Based on that observation, we propose a placement scheme where re-accessed data blocks are stored in fast, but costly in terms of area and power, SRAM banks, while eDRAM banks store data blo cks that just arrive to the NUCA cache or were demoted from a SRAM bank. We show that a well-balanced SRAM / eDRAM NUCA cache can achieve similar performance results than using a NUCA cache composed of only SRAM banks, but reduces area by 15% and power consumed by 10%. Furthermore, we also explore several alternatives to exploit the area reduction we gain by using the hybrid architecture, resulting in an overall performance improvement of 4%.
Implementing a hybrid SRAM/eDRAM NUCA architecture Vijay Reddi and David Brooks. 10/2011. “
Resilient Architectures via Collaborative Design: Maximizing Commodity Processor Performance in the Presence of Variations.” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30, 10, Pp. 1429–1445.
Publisher's VersionAbstractUnintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuit-level challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI / dt problem. The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems.
Resilient Architectures via Collaborative Design: Maximizing Commodity Processor Performance in the Presence of Variations Pierre Duhamel, Judson Porter, Benjamin Finio, Geoffrey Barrows, David Brooks, Gu Wei, and Robert Wood. 9/25/2011. “
Hardware in the loop for optical flow sensing in a robotic bee.” In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, Pp. 1099–1106. San Francisco, CA, USA: IEEE.
Publisher's VersionAbstractThe design of autonomous robots involves the
development of many complex, interdependent components,
including the mechanical body and its associated actuators,
sensors, and algorithms to handle sensor processing, control,
and high-level task planning. For the design of a robotic
bee (RoboBee) it is necessary to optimize across the design
space for minimum weight and power consumption to increase
flight time; however, the design space of a single component is
large, the interconnectedness and tradeoffs across components
must be considered, and interdisciplinary collaborations cause
different component design timelines.
In this work, we show how the development of a hardware
in the loop (HWIL) system for a flapping wing microrobot can
simplify and accelerate evaluation of a large number of design
choices. Specifically, we explore the design space of the visual
system including sensor hardware and associated optical flow
processing. We demonstrate the utility of the HWIL system
in exposing trends on system performance for optical flow
algorithm, field of view, sensor resolution, and frame rate.
Hardware in the loop for optical flow sensing in a robotic bee Ankur Agrawal, Kumar Hanumolu, and Gu Wei. 9/19/2011. “
Area efficient phase calibration of a 1.6 GHz multiphase DLL.” In 2011 IEEE Custom Integrated Circuits Conference (CICC), Pp. 1–4. IEEE.
Publisher's VersionAbstractThis paper describes a digital calibration scheme that corrects for phase spacing errors in a multiphase clock generating delay-locked loop (DLL). The calibration scheme employs sub-sampling using a frequency-offset clock with respect to the DLL reference clock, to measure phase-offsets. The phase-correction circuit uses one digital-to-analog converter across eight variable-delay buffers to reduce the area consumption by 62%. The test-chip, designed in a 130 nm CMOS process, demonstrates a 8-phase 1.6 GHz DLL with a worst-case phase error of 450 fs.
Area efficient phase calibration of a 1.6 GHz multiphase DLL David Brooks. 9/2011. “
CPUs, GPUs, and Hybrid Computing.” IEEE Micro, Pp. 4–6.
Publisher's VersionAbstractThis introduction to the special issue discusses advances and challenges in the field of hybrid CPU/GPU computing.
CPUs, GPUs, and Hybrid Computing Hayun Chung and Gu Wei. 8/7/2011. “
Design considerations for ADC-based backplane receivers.” In 2011 IEEE 54th International Midwest Symposium on Circuits and Systems (MWSCAS), Pp. 1–4. IEEE.
Publisher's VersionAbstractHigh-speed ADC-based backplane receivers often suffer from high power consumption and complexity and require careful designs. This paper discusses circuit- and system-level design considerations for such receivers. A low-power, high-speed front-end ADC circuit and a high-level design-space exploration of ADC-based receivers are presented.
Design considerations for ADC-based backplane receivers Mark Hempstead, David Brooks, and Gu Wei. 7/2011. “
An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS.” Emerging and Selected Topics in Circuits and Systems, IEEE Journal on, 1, 2, Pp. 193–202.
Publisher's VersionAbstractNetworks of ultra-low-power nodes capable of sensing, computation, and wireless communication have applications in medicine, science, industrial automation, and security. Reducing power consumption requires the development of system-on-chip implementations that must provide both energy efficiency and adequate performance to meet the demands of the long deployment lifetimes and bursts of computation that characterize wireless sensor network (WSN) applications. Therefore, this work argues that designers should evaluate the design in terms of average power for an entire workload, including active and idle periods, not just the metric of energy-per-instruction.
An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS Michael Karpelson, Robert J Wood, and Gu Wei. 6/15/2011. “
Low power control IC for efficient high-voltage piezoelectric driving in a flying robotic insect.” In 2011 Symposium on VLSI Circuits-Digest of Technical Papers, Pp. 178–179. IEEE.
AbstractA dual-channel, low power control IC for driving high
voltage piezoelectric actuators in a flapping-wing robotic insect is presented. The IC controls milligram-scale power electronics that meet the stringent weight and power requirements
of aerial microrobots. Designed in a 0.13µm CMOS process,
the IC implements an efficient control algorithm to drive piezoelectric actuators with high temporal resolution while consuming <100µW during normal operation at 1.0V.
Keywords: low power, SOC, high voltage, piezoelectric
actuator, and microrobotics.
Low power control IC for efficient high-voltage piezoelectric driving in a flying robotic insect Peter Bailis, Vijay Reddi, Sanjay Gandhi, David Brooks, and Margo Seltzer. 6/9/2011. “
Dimetrodon: processor-level preventive thermal management via idle cycle injection.” In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, Pp. 89–94. San Diego, CA, USA: IEEE.
AbstractProcessor-level dynamic thermal management techniques have long targeted worst-case thermal margins. We examine the thermal-performance trade-offs in average-case, preventive thermal management by actively degrading application performance to achieve long-term thermal control. We propose Dimetrodon, the use of idle cycle injection, a flexible, per-thread technique, as a preventive thermal management mechanism and demonstrate its efficiency compared to hardware techniques in a commodity operating system on real hardware under throughput and latency-sensitive real-world workloads. Compared to inflexible hardware techniques, Dimetrodon achieves favorable trade-offs for temperature reductions up to 30% due to rapid heat dissipation during short idle intervals.
Dimetrodon: processor-level preventive thermal management via idle cycle injection Michael Karpelson, Whitney P, Gu Wei, and Wood J. 3/6/2011. “
Design and fabrication of ultralight high-voltage power circuits for flapping-wing robotic insects.” In 2011 Twenty-Sixth Annual IEEE Applied Power Electronics Conference and Exposition (APEC), Pp. 2070–2077. IEEE.
Publisher's VersionAbstractFlapping-wing robotic insects are small, highly maneuverable flying robots inspired by biological insects and useful for a wide range of tasks, including exploration, environmental monitoring, search and rescue, and surveillance. Recently, robotic insects driven by piezoelectric actuators have achieved the important goal of taking off with external power; however, fully autonomous operation requires an ultralight power supply capable of generating high-voltage drive signals from low-voltage energy sources. This paper describes high-voltage switching circuit topologies and control methods suitable for driving piezoelectric actuators in flapping-wing robotic insects and discusses the physical implementation of these topologies, including the fabrication of custom magnetic components by laser micromachining and other weight minimization techniques. The performance of laser micromachined magnetics and custom-wound commercial magnetics is compared through the experimental realization of a tapped inductor boost converter capable of stepping up a 3.7V Li-poly cell input to 200V. The potential of laser micromachined magnetics is further shown by implementing a similar converter weighing 20mg (not including control functionality) and capable of up to 70mW output at 200V and up to 100mW at 100V.
Design and fabrication of ultralight high-voltage power circuits for flapping-wing robotic insects Kevin Brownell, Ali Khan, Gu Wei, and David Brooks. 3/1/2011. “
Automating design of voltage interpolation to address process variations.” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 99, Pp. 1–14.
Publisher's VersionAbstract
Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. Successful implementation of this technique requires examination of the design tradeoffs between circuit tuning range and static power overheads within the synthesis flow of the design process, in addition to the implications of place and route. Results from the exploration of the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks show that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. A design using voltage interpolation can match the nominal delay target with a 16% power cost, or for the same power budget, incur only a 13% delay overhead after variations.
Automating design of voltage interpolation to address process variations Wonyoung Kim, David Brooks, and Gu Wei. 2/20/2011. “
A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation.” In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Pp. 268–270. IEEE.
Publisher's VersionAbstractIn recent years, chip multiprocessor architectures have emerged to scale performance while staying within tight power constraints. This trend motivates per core/block dynamic voltage and frequency scaling (DVFS) with fast voltage transition. Given the high cost and bulk of off-chip DC/DC converters to implement multiple on-chip power domains, there has been a surge of interest in on-chip converters. This paper presents the design and experimental results of a fully integrated 3-level DC/DC converter that merges characteristics of both inductor-based buck and switched-capacitor (SC) converters. While off-chip buck converters show high conversion efficiency, their on-chip counterparts suffer from loss due to low quality inductors. With the help of flying capacitors, the 3-level converter requires smaller inductors than the buck converter, reducing loss and on-die area overhead. Compared to SC converters that need more com plex structures to regulate higher than half the input voltage, 3-level converters can efficiently regulate the output voltage across a wide range of levels and load currents. Measured results from a 130nm CMOS test-chip prototype demon strate nanosecond-scale voltage transition times and peak conversion efficiency of 77%.
A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation Krishna Rangan, Michael Powell, Gu Wei, and David Brooks. 2/12/2011. “
Achieving uniform performance and maximizing throughput in the presence of heterogeneity.” In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Pp. 3–14. IEEE.
Publisher's VersionAbstractContinued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations - the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single-frequency to use as the chip sales frequency. We examine several scheduling algorithms implemented below the O/S in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation induced heterogeneity from the end-user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, providing a single uniform performance level to present for the chip.
Achieving uniform performance and maximizing throughput in the presence of heterogeneity