Publications by Year: 2012

2012
Sae Lee, David Brooks, and Gu Wei. 7/2012. “Evaluation of voltage stacking for near-threshold multicore computing.” In ISLPED '12: Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design, Pp. 373–378https. Publisher's VersionAbstract

This paper evaluates voltage stacking in the context of near-threshold multicore computing. Key attributes of voltage stacking are investigated using results from a test-chip prototype built in 150nm FDSOI CMOS. By "stacking" logic blocks on top of each other, voltage stacking reduces the chip current draw and simplifies off-chip power delivery but within-die voltage noise due to inter-layer current mismatch is an issue. Results show that unlike conventional power delivery schemes, supply rail impedance in voltage stacked systems depend on aggregate power consumption, leading to better noise immunity for high power (low impedance) operation for many-core processors.

Evaluation of voltage stacking for near-threshold multicore computing
Svilen Kanev, Gu Wei, and David Brooks. 7/2012. “XIOSim: power-performance modeling of mobile x86 cores.” In International symposium on Low power Electronics and Design (ISLPED). ACM. Publisher's VersionAbstract
Simulation is one of the main vehicles of computer architecture research. In this paper, we present XIOSim –- a highly detailed microarchitectural simulator targeted at mobile x86 microprocessors. The simulator execution model that we propose is a blend between traditional user-level simulation and full-system simulation. Our current implementation features detailed power and performance core models which allow microarchitectural exploration. Using a novel validation methodology, we show that XIOSim’s performance models manage to stay well within 10% of real hardware for the whole SPEC CPU2006 suite. Furthermore, we validate power models against measured data to show a deviation of less than 5% in terms of average power consumption.
XIOSim: power-performance modeling of mobile x86 cores
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/26/2012. “HELIX: Making the extraction of thread-level parallelism mainstream.” IEEE Micro, 32, 4, Pp. 8–18. Publisher's VersionAbstract
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.
HELIX: Making the extraction of thread-level parallelism mainstream
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/26/2012. “Making the Extraction of Thread-Level Parallelism Mainstream.” IEEE Micro. Publisher's VersionAbstract
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still do not parallelize code automatically. Promising parallelization approaches have either required manual programmer assistance, depended on special hardware features, or risked slowing down programs they should have speeded up. HELIX is one such approach that automatically parallelizes general-purpose programs without requiring any special hardware. In this paper we show that in practice HELIX always avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers. We also show experimentally that HELIX outperforms the most similar historical technique that has been implemented in production compilers.
HELIX: Making the extraction of thread-level parallelism mainstream
Simone Campanoni, Timothy Jones, Glenn Holloway, Gu Wei, and David Brooks. 6/3/2012. “The HELIX project: overview and directions.” In Design Automation Conference (DAC). San Francisco, CA, USA: ACM. Publisher's VersionAbstract
Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.
The HELIX project: overview and directions
Michael Karpelson, Gu Wei, and J Wood. 4/2012. “Driving high voltage piezoelectric actuators in microrobotic applications.” Sensors and actuators A: Physical, 176, Pp. 78–89. Publisher's VersionAbstract

Piezoelectric actuators have been used successfully to enable locomotion in aerial and ambulatory microrobotic platforms. However, the use of piezoelectric actuators presents two major challenges for power electronic design: generating high-voltage drive signals in systems typically powered by low-voltage energy sources, and recovering unused energy from the actuators. Due to these challenges, conventional drive circuits become too bulky or inefficient in low mass applications. This work describes electrical characteristics and drive requirements of low mass piezoelectric actuators, the design and optimization of suitable drive circuit topologies, aspects of the physical instantiation of these topologies, including the fabrication of extremely lightweight magnetic components, and a custom, ultra low power integrated circuit that implements control functionality for the drive circuits. The principles and building blocks presented here enable efficient high-voltage drive circuits that can satisfy the stringent weight and power requirements of microrobotic applications.

Driving high voltage piezoelectric actuators in microrobotic applications
Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Reddi, Gu Wei, and David Brooks. 3/31/2012. “HELIX: Automatic parallelization of irregular programs for chip multiprocessing.” In International Symposium on Code Generation and Optimization (CGO). ACM. Publisher's VersionAbstract
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.
HELIX: Automatic parallelization of irregular programs for chip multiprocessing
H Chung and Gu Wei. 1/5/2012. “Simulated-annealing-based adaptive equaliser for on-die variation compensation.” Electronics letters, 48, 1, Pp. 18–19. Publisher's VersionAbstract

Fully exploiting the flexibility of lookup-table-based equalisers, it is proposed to compensate for on-die variation effects within a transmit-side equaliser. To efficiently deal with the nonlinear nature of circuit non-idealities, the proposed equaliser utilises simulated annealing for adaptation.

Simulated-annealing-based adaptive equaliser for on-die variation compensation
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The accelerator store: A shared memory framework for accelerator-based systems.” ACM Transactions on Architecture. and Code Optimization, 8, 4, Pp. 1-22. Publisher's VersionAbstract

This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%--8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.

The accelerator store: A shared memory framework for accelerator-based systems
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The Accelerator Store: a shared memory framework for accelerator-based systems.” Transactions on Architecture and Code Optimization (TACO). Publisher's VersionAbstract
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
The accelerator store: A shared memory framework for accelerator-based systems
Amanda Tseng and David Brooks. 1/2012. “Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors.” In Workshop on Energy Efficient Design, ISCA, 600: Pp. 800.Abstract
Despite increased core counts that provide significant throughput performance gains, single thread performance is still an important metric in today’s processor designs. Due to chip power constraints, architects must carefully allocate power budgets to additional cores or increased single thread performance. To study this tradeoff between different performance metrics, we construct an analytical model that computes single thread and throughput performance under a given power budget for both symmetric and asymmetric multicore architectures. We also consider multi-task workloads, where optimal designs might include more than one large core in the heterogeneous architecture. Our analytical model considers the optimal number and complexity of cores in a processor and quantifies the benefits of asymmetric designs when trading latency and throughput. We show that a diverse set of core designs can be optimal in different scenarios.
Analytical Latency-Throughput Model of Future Power Constrained Multicore Processors
Wonyoung Kim, David Brooks, and Gu Wei. 1/2012. “A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS.” Solid-State Circuits, IEEE Journal of, 47, 1, Pp. 206–219. Publisher's VersionAbstract
On-chip DC-DC converters have the potential to offer fine-grain power management in modern chip-multiprocessors. This paper presents a fully integrated 3-level DC-DC converter, a hybrid of buck and switched-capacitor converters, implemented in 130 nm CMOS technology. The 3-level converter enables smaller inductors (1 nH) than a buck, while generating a wide range of output voltages compared to a 1/2 mode switched-capacitor converter. The test-chip prototype delivers up to 0.85 A load current while generating output voltages from 0.4 to 1.4 V from a 2.4 V input supply. It achieves 77% peak efficiency at power density of 0.1 W/mm 2 and 63% efficiency at maximum power density of 0.3 W/mm 2 . The converter scales output voltage from 0.4 V to 1.4 V (or vice-versa) within 20 ns at a constant 450 mA load current. A shunt regulator reduces peak-to-peak voltage noise from 0.27 V to 0.19 V under pseudo-randomly fluctuating load currents. Using simulations across a wide range of design parameters, the paper compares conversion efficiencies of the 3-level, buck and switched-capacitor converters.
A fully-integrated 3-level DC-DC converter for nanosecond-scale DVFS
Michael Lyons, Gu Wei, and David Brooks. 1/2012. “Shrink-Fit: A Framework for Flexible Accelerator Sizing.” IEEE Computer Architecture Letters, 12, 1, Pp. 17 - 20. Publisher's VersionAbstract
RTL design complexity discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to “shrink-fit” accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16x with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.
Shrink-Fit: A Framework for Flexible Accelerator Sizing