Publications by Type: Conference Paper

2013
Xuan Zhang, Tao Tong, David Brooks, and Gu Wei. 9/22/2013. “Supply-Noise Resilient Adaptive Clocking for Battery-Powered Aerial Microrobotic System-on-Chip in 40nm CMOS.” In IEEE Custom Integrated Circuits Conference (CICC).
A battery-powered aerial microrobotic System-on-Chip (SoC) has stringent weight and power budgets, which require fully-integrated solutions for both clock generation and voltage regulation. Supply-noise resilience is important yet challenging for such SoC systems due to a non-constant battery discharge profile and load current variability. This paper proposes an adaptive-frequency clocking scheme that can tolerate supply noise and improve performance when implemented with an integrated voltage regulator (IVR). Measurements from a 'brain' SoC, implemented in 40nm CMOS, demonstrate 2× performance improvement with adaptive-frequency clocking over conventional fixed-frequency clocking. Combining adaptive-frequency clocking with open-loop IVR extends error-free operation to a wider battery voltage range (2.8 to 3.8V) with higher average performance.
Mario Lok, David Brooks, Robert Wood, and Gu Wei. 9/15/2013. “Design and analysis of an integrated driver for piezoelectric actuators.” In IEEE Energy Conversion Congress and Exposition.
Small-scale, highly maneuverable, flapping-wing robotic insects have a wide range of applications, including exploration, environmental monitoring, search and rescue, and surveillance. For these small-scale robots, a piezoelectric cantilever actuator driven by a high voltage drive signal is a preferred actuation mechanism. The generation of this drive signal via light and efficient power electronics is critical given the limited weight budget for the flapping-wing robot. Previous work demonstrated actuator drive circuitry using discrete power transistors and passive elements. This paper presents a new design that integrates all the power FETs into a single monolithic IC, reducing the weight of the power electronics to fit within the weight budget. This design adds the capability of driving multiple outputs to accommodate recent electromechanical design advances for flying robots.
Xuan Zhang, Tao Tong, Svilen Kanev, Sae Lee, Gu Wei, and David Brooks. 9/4/2013. “Characterizing and Evaluating Voltage Noise in Multi-Core Near-Threshold Processors.” In International Symposium on Low Power Electronics and Design (ISLPED).
Lowering the supply voltage to improve energy efficiency leads to higher load current and elevated supply sensitivity. In this paper, we provide the first quantitative analysis of voltage noise in multi-core near-threshold processors in a future 10nm technology across SPEC CPU2006 benchmarks. Our results reveal larger guardband requirement and significant energy efficiency loss due to power delivery nonidealities at near threshold, and highlight the importance of accurate voltage noise characterization for design exploration of energy-centric computing systems using near-threshold cores.
Yakun Shao and David Brooks. 9/4/2013. “Energy Characterization and Instruction-Level Energy Model of Intel's Xeon Phi Processor.” In International Symposium on Low Power Electronics and Design (ISLPED).
Intel’s Xeon Phi is the first commercial many-core/multi-thread x86-based processor. Xeon Phi belongs to a new breed of high performance computing processors that seek high compute density as well as energy efficiency. However, no high-level energy model is available for Xeon Phi software developers to quickly evaluate and optimize energy efficiency. This work demonstrates an instruction-level energy model for the Xeon Phi processor to facilitate the development of energy-efficient software. In order to construct this model, we first characterize the energy consumption of the processor, identifying how energy per instruction scales with the number of cores, the number of active threads per core, and instruction types. Based on the energy characterization, we construct an instruction-level energy model and validate the accuracy of the model between 1% and 5% for real world benchmarks. We show that the energy model can be used to identify software inefficiencies for these benchmarks and find that Linpack code can be optimized to increase energy efficiency by as much as 10%.
Brandon Reagen, Yakun Shao, Gu Wei, and David Brooks. 9/4/2013. “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware.” In International Symposium on Low Power Electronics and Design (ISLPED).
As the traditional performance gains of technology scaling diminish, one of the most promising directions is building special purpose fixed function hardware blocks, commonly referred to as accelerators. Accelerators have become prevalent in industrial SoC designs for their low power, high performance potential. In this work we explore thousands of implementations of classical software workloads in hardware. This thorough, detailed design space search of hardware accelerators gives architects a quantitative way to reason about the differences in implementations. The exploration presented in this work shows that the space is full of poor design choices. By thoroughly analyzing each benchmark, we show which provide the most performance when implemented in hardware given a fixed power budget and explain which design techniques work best for each workload.
Yakun Shao and David Brooks. 4/21/2013. “ISA-Independent Workload Characterization and its Implications for Specialized Architectures.” In International Symposium on Performance Analysis of Systems and Software (ISPASS).
Specialized architectures will become increasingly important as the computing industry demands more energy-efficient designs. The application-centric design style for these architectures is heavily dependent on workload characterization of intrinsic program characteristics, but at the same time these architectures are likely to be decoupled from legacy ISAs. In this work, we perform ISA-independent workload characterization for a variety of important intrinsic program characteristics relating to computation, memory, and control flow. The analysis is performed using a JIT compiler that emits ISA-independent instructions. We compare this analysis with an x86 trace and find that several of the analyses are highly sensitive to the ISA. We conclude that designers of specialized architectures must adopt ISA-independent workload characterization approaches.
2012
Sae Lee, David Brooks, and Gu Wei. 7/2012. “Evaluation of voltage stacking for near-threshold multicore computing.” In ISLPED '12: Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design, Pp. 373–378.
This paper evaluates voltage stacking in the context of near-threshold multicore computing. Key attributes of voltage stacking are investigated using results from a test-chip prototype built in 150nm FDSOI CMOS. By "stacking" logic blocks on top of each other, voltage stacking reduces the chip current draw and simplifies off-chip power delivery, but within-die voltage noise due to inter-layer current mismatch is an issue. Results show that, unlike conventional power delivery schemes, supply rail impedance in voltage stacked systems depends on aggregate power consumption, leading to better noise immunity for high power (low impedance) operation for many-core processors.
Svilen Kanev, Gu Wei, and David Brooks. 7/2012. “XIOSim: power-performance modeling of mobile x86 cores.” In International Symposium on Low Power Electronics and Design (ISLPED). ACM.
Simulation is one of the main vehicles of computer architecture research. In this paper, we present XIOSim, a highly detailed microarchitectural simulator targeted at mobile x86 microprocessors. The simulator execution model that we propose is a blend between traditional user-level simulation and full-system simulation. Our current implementation features detailed power and performance core models which allow microarchitectural exploration. Using a novel validation methodology, we show that XIOSim’s performance models manage to stay well within 10% of real hardware for the whole SPEC CPU2006 suite. Furthermore, we validate power models against measured data to show a deviation of less than 5% in terms of average power consumption.
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/3/2012. “The HELIX project: overview and directions.” In Design Automation Conference (DAC). San Francisco, CA, USA: ACM.
Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.
Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Reddi, Gu Wei, and David Brooks. 3/31/2012. “HELIX: Automatic parallelization of irregular programs for chip multiprocessing.” In International Symposium on Code Generation and Optimization (CGO). ACM.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.
Amanda Tseng and David Brooks. 1/2012. “Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors.” In Workshop on Energy Efficient Design, ISCA, 600: Pp. 800.
Despite increased core counts that provide significant throughput performance gains, single thread performance is still an important metric in today’s processor designs. Due to chip power constraints, architects must carefully allocate power budgets to additional cores or increased single thread performance. To study this tradeoff between different performance metrics, we construct an analytical model that computes single thread and throughput performance under a given power budget for both symmetric and asymmetric multicore architectures. We also consider multi-task workloads, where optimal designs might include more than one large core in the heterogeneous architecture. Our analytical model considers the optimal number and complexity of cores in a processor and quantifies the benefits of asymmetric designs when trading latency and throughput. We show that a diverse set of core designs can be optimal in different scenarios.
2011
Javier Lira, Carlos Molina, David Brooks, and Antonio Gonzalez. 12/18/2011. “Implementing a hybrid SRAM/eDRAM NUCA architecture.” In High Performance Computing (HiPC), 2011 18th International Conference on, Pp. 1–10. Bengaluru, India: IEEE.
Advances in technology have allowed DRAM-like structures, called embedded DRAM (eDRAM), to be integrated into the chip. This technology has already been successfully implemented in some GPUs and other graphics-intensive SoCs, such as game consoles. The most recent processor from IBM®, POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is not accessed again before it is evicted. Based on that observation, we propose a placement scheme where re-accessed data blocks are stored in SRAM banks, which are fast but costly in terms of area and power, while eDRAM banks store data blocks that have just arrived in the NUCA cache or were demoted from an SRAM bank. We show that a well-balanced SRAM/eDRAM NUCA cache can achieve performance similar to that of a NUCA cache composed of only SRAM banks, while reducing area by 15% and power consumption by 10%. Furthermore, we explore several alternatives to exploit the area reduction gained by using the hybrid architecture, resulting in an overall performance improvement of 4%.
Pierre Duhamel, Judson Porter, Benjamin Finio, Geoffrey Barrows, David Brooks, Gu Wei, and Robert Wood. 9/25/2011. “Hardware in the loop for optical flow sensing in a robotic bee.” In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, Pp. 1099–1106. San Francisco, CA, USA: IEEE.
The design of autonomous robots involves the development of many complex, interdependent components, including the mechanical body and its associated actuators, sensors, and algorithms to handle sensor processing, control, and high-level task planning. For the design of a robotic bee (RoboBee) it is necessary to optimize across the design space for minimum weight and power consumption to increase flight time; however, the design space of a single component is large, the interconnectedness and tradeoffs across components must be considered, and interdisciplinary collaborations cause different component design timelines. In this work, we show how the development of a hardware in the loop (HWIL) system for a flapping wing microrobot can simplify and accelerate evaluation of a large number of design choices. Specifically, we explore the design space of the visual system including sensor hardware and associated optical flow processing. We demonstrate the utility of the HWIL system in exposing trends on system performance for optical flow algorithm, field of view, sensor resolution, and frame rate.
Ankur Agrawal, Kumar Hanumolu, and Gu Wei. 9/19/2011. “Area efficient phase calibration of a 1.6 GHz multiphase DLL.” In 2011 IEEE Custom Integrated Circuits Conference (CICC), Pp. 1–4. IEEE.
This paper describes a digital calibration scheme that corrects for phase spacing errors in a multiphase clock generating delay-locked loop (DLL). The calibration scheme employs sub-sampling using a frequency-offset clock with respect to the DLL reference clock, to measure phase-offsets. The phase-correction circuit uses one digital-to-analog converter across eight variable-delay buffers to reduce the area consumption by 62%. The test-chip, designed in a 130 nm CMOS process, demonstrates an 8-phase 1.6 GHz DLL with a worst-case phase error of 450 fs.
Hayun Chung and Gu Wei. 8/7/2011. “Design considerations for ADC-based backplane receivers.” In 2011 IEEE 54th International Midwest Symposium on Circuits and Systems (MWSCAS), Pp. 1–4. IEEE.
High-speed ADC-based backplane receivers often suffer from high power consumption and complexity and require careful designs. This paper discusses circuit- and system-level design considerations for such receivers. A low-power, high-speed front-end ADC circuit and a high-level design-space exploration of ADC-based receivers are presented.
Michael Karpelson, Robert J Wood, and Gu Wei. 6/15/2011. “Low power control IC for efficient high-voltage piezoelectric driving in a flying robotic insect.” In 2011 Symposium on VLSI Circuits-Digest of Technical Papers, Pp. 178–179. IEEE.
A dual-channel, low power control IC for driving high voltage piezoelectric actuators in a flapping-wing robotic insect is presented. The IC controls milligram-scale power electronics that meet the stringent weight and power requirements of aerial microrobots. Designed in a 0.13µm CMOS process, the IC implements an efficient control algorithm to drive piezoelectric actuators with high temporal resolution while consuming <100µW during normal operation at 1.0V. Keywords: low power, SOC, high voltage, piezoelectric actuator, and microrobotics.
Peter Bailis, Vijay Reddi, Sanjay Gandhi, David Brooks, and Margo Seltzer. 6/9/2011. “Dimetrodon: processor-level preventive thermal management via idle cycle injection.” In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, Pp. 89–94. San Diego, CA, USA: IEEE.
Processor-level dynamic thermal management techniques have long targeted worst-case thermal margins. We examine the thermal-performance trade-offs in average-case, preventive thermal management by actively degrading application performance to achieve long-term thermal control. We propose Dimetrodon, the use of idle cycle injection, a flexible, per-thread technique, as a preventive thermal management mechanism and demonstrate its efficiency compared to hardware techniques in a commodity operating system on real hardware under throughput and latency-sensitive real-world workloads. Compared to inflexible hardware techniques, Dimetrodon achieves favorable trade-offs for temperature reductions up to 30% due to rapid heat dissipation during short idle intervals.
Michael Karpelson, Whitney P, Gu Wei, and Wood J. 3/6/2011. “Design and fabrication of ultralight high-voltage power circuits for flapping-wing robotic insects.” In 2011 Twenty-Sixth Annual IEEE Applied Power Electronics Conference and Exposition (APEC), Pp. 2070–2077. IEEE.
Flapping-wing robotic insects are small, highly maneuverable flying robots inspired by biological insects and useful for a wide range of tasks, including exploration, environmental monitoring, search and rescue, and surveillance. Recently, robotic insects driven by piezoelectric actuators have achieved the important goal of taking off with external power; however, fully autonomous operation requires an ultralight power supply capable of generating high-voltage drive signals from low-voltage energy sources. This paper describes high-voltage switching circuit topologies and control methods suitable for driving piezoelectric actuators in flapping-wing robotic insects and discusses the physical implementation of these topologies, including the fabrication of custom magnetic components by laser micromachining and other weight minimization techniques. The performance of laser micromachined magnetics and custom-wound commercial magnetics is compared through the experimental realization of a tapped inductor boost converter capable of stepping up a 3.7V Li-poly cell input to 200V. The potential of laser micromachined magnetics is further shown by implementing a similar converter weighing 20mg (not including control functionality) and capable of up to 70mW output at 200V and up to 100mW at 100V.
Wonyoung Kim, David Brooks, and Gu Wei. 2/20/2011. “A fully-integrated 3-level DC/DC converter for nanosecond-scale DVS with fast shunt regulation.” In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Pp. 268–270. IEEE.
In recent years, chip multiprocessor architectures have emerged to scale performance while staying within tight power constraints. This trend motivates per core/block dynamic voltage and frequency scaling (DVFS) with fast voltage transition. Given the high cost and bulk of off-chip DC/DC converters to implement multiple on-chip power domains, there has been a surge of interest in on-chip converters. This paper presents the design and experimental results of a fully integrated 3-level DC/DC converter that merges characteristics of both inductor-based buck and switched-capacitor (SC) converters. While off-chip buck converters show high conversion efficiency, their on-chip counterparts suffer from loss due to low quality inductors. With the help of flying capacitors, the 3-level converter requires smaller inductors than the buck converter, reducing loss and on-die area overhead. Compared to SC converters that need more complex structures to regulate higher than half the input voltage, 3-level converters can efficiently regulate the output voltage across a wide range of levels and load currents. Measured results from a 130nm CMOS test-chip prototype demonstrate nanosecond-scale voltage transition times and peak conversion efficiency of 77%.
Krishna Rangan, Michael Powell, Gu Wei, and David Brooks. 2/12/2011. “Achieving uniform performance and maximizing throughput in the presence of heterogeneity.” In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Pp. 3–14. IEEE.
Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations, the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency, sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single frequency to use as the chip sales frequency. We examine several scheduling algorithms implemented below the O/S in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation induced heterogeneity from the end-user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, providing a single uniform performance level to present for the chip.