This paper evaluates voltage stacking in the context of near-threshold multicore computing. Key attributes of voltage stacking are investigated using results from a test-chip prototype built in 150nm FDSOI CMOS. By "stacking" logic blocks on top of each other, voltage stacking reduces the chip current draw and simplifies off-chip power delivery, but within-die voltage noise due to inter-layer current mismatch is an issue. Results show that, unlike in conventional power delivery schemes, supply rail impedance in voltage-stacked systems depends on aggregate power consumption, leading to better noise immunity for high-power (low-impedance) operation in many-core processors.
Simulation is one of the main vehicles of computer architecture research. In this paper, we present XIOSim, a highly detailed microarchitectural simulator targeted at mobile x86 microprocessors. The simulator execution model that we propose is a blend between traditional user-level simulation and full-system simulation. Our current implementation features detailed power and performance core models which allow microarchitectural exploration. Using a novel validation methodology, we show that XIOSim’s performance models manage to stay well within 10% of real hardware for the whole SPEC CPU2006 suite. Furthermore, we validate power models against measured data to show a deviation of less than 5% in terms of average power consumption.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still don't parallelize code automatically. Helix automatically parallelizes general-purpose programs without requiring any special hardware; avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers; and outperforms the most similar historical technique that has been implemented in production compilers.
Improving system performance increasingly depends on exploiting microprocessor parallelism, yet mainstream compilers still do not parallelize code automatically. Promising parallelization approaches have either required manual programmer assistance, depended on special hardware features, or risked slowing down programs they should have sped up. HELIX is one such approach that automatically parallelizes general-purpose programs without requiring any special hardware. In this paper we show that in practice HELIX always avoids slowing down compiled programs, making it a suitable candidate for mainstream compilers. We also show experimentally that HELIX outperforms the most similar historical technique that has been implemented in production compilers.
Parallelism has become the primary way to maximize processor performance and power efficiency. But because creating parallel programs by hand is difficult and prone to error, there is an urgent need for automatic ways of transforming conventional programs to exploit modern multicore systems. The HELIX compiler transformation is one such technique that has proven effective at parallelizing individual sequential programs automatically for a real six-core processor. We describe that transformation in the context of the broader HELIX research project, which aims to optimize the throughput of a multicore processor by coordinated changes in its architecture, its compiler, and its operating system. The goal is to make automatic parallelization mainstream in multiprogramming settings through adaptive algorithms for extracting and tuning thread-level parallelism.
Piezoelectric actuators have been used successfully to enable locomotion in aerial and ambulatory microrobotic platforms. However, the use of piezoelectric actuators presents two major challenges for power electronic design: generating high-voltage drive signals in systems typically powered by low-voltage energy sources, and recovering unused energy from the actuators. Due to these challenges, conventional drive circuits become too bulky or inefficient in low mass applications. This work describes electrical characteristics and drive requirements of low mass piezoelectric actuators, the design and optimization of suitable drive circuit topologies, aspects of the physical instantiation of these topologies, including the fabrication of extremely lightweight magnetic components, and a custom, ultra low power integrated circuit that implements control functionality for the drive circuits. The principles and building blocks presented here enable efficient high-voltage drive circuits that can satisfy the stringent weight and power requirements of microrobotic applications.
We describe and evaluate HELIX, a new technique for automatic loop parallelization that assigns successive iterations of a loop to separate threads. We show that the inter-thread communication costs forced by loop-carried data dependences can be mitigated by code optimization, by using an effective heuristic for selecting loops to parallelize, and by using helper threads to prefetch synchronization signals. We have implemented HELIX as part of an optimizing compiler framework that automatically selects and parallelizes loops from general sequential programs. The framework uses an analytical model of loop speedups, combined with profile data, to choose loops to parallelize. On a six-core Intel Core i7-980X, HELIX achieves speedups averaging 2.25, with a maximum of 4.12, for thirteen C benchmarks from SPEC CPU2000.
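The idea of assigning successive loop iterations to separate threads while honoring loop-carried dependences can be illustrated with a minimal sketch. This is not the HELIX implementation: the names `run_cyclic`, `demo_step`, and `square` are hypothetical, the signalling uses ordinary Python `threading.Event` objects instead of compiler-inserted synchronization, and the optimizations the abstract mentions (loop selection, helper threads that prefetch signals) are omitted. It only shows how the independent part of each iteration can overlap across threads while the dependent segment runs in original loop order.

```python
import threading


def run_cyclic(n_iters, n_threads, parallel_part, sequential_part, state):
    """Run iteration i on thread i % n_threads, keeping the sequential
    segment of each iteration in original loop order."""
    # ready[i] is set once every iteration before i has finished its
    # sequential segment; ready[0] is set up front.
    ready = [threading.Event() for _ in range(n_iters + 1)]
    ready[0].set()

    def worker(tid):
        for i in range(tid, n_iters, n_threads):
            local = parallel_part(i)          # no loop-carried dependence
            ready[i].wait()                   # wait for predecessor's signal
            sequential_part(state, i, local)  # loop-carried update
            ready[i + 1].set()                # signal the successor

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


def demo_step(state, i, local):
    # Order-sensitive update: a different order gives a different result.
    state["acc"] = state["acc"] * 2 + local
    state["order"].append(i)


def square(i):
    return i * i


if __name__ == "__main__":
    result = run_cyclic(10, 3, square, demo_step, {"acc": 0, "order": []})
    print(result["order"])
```

Because of CPython's global interpreter lock, this sketch demonstrates only the ordering discipline, not a speedup; the wait/signal pair around the sequential segment is the communication cost that the abstract says must be mitigated.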
Fully exploiting the flexibility of lookup-table-based equalisers, it is proposed to compensate for on-die variation effects within a transmit-side equaliser. To efficiently deal with the nonlinear nature of circuit non-idealities, the proposed equaliser utilises simulated annealing for adaptation.
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
Despite increased core counts that provide significant throughput performance gains, single thread performance is still an important metric in today’s processor designs. Due to chip power constraints, architects must carefully allocate power budgets to additional cores or increased single thread performance. To study this tradeoff between different performance metrics, we construct an analytical model that computes single thread and throughput performance under a given power budget for both symmetric and asymmetric multicore architectures. We also consider multi-task workloads, where optimal designs might include more than one large core in the heterogeneous architecture. Our analytical model considers the optimal number and complexity of cores in a processor and quantifies the benefits of asymmetric designs when trading latency and throughput. We show that a diverse set of core designs can be optimal in different scenarios.
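The tradeoff between single-thread and throughput performance under a fixed power budget can be sketched with a toy model. The model below is an assumption for illustration, not the one constructed in the paper: it borrows Pollack's rule (a core built from `r` resource units delivers performance proportional to `sqrt(r)`) and assumes power scales linearly with `r`, in the style of Hill and Marty's multicore model. The function names and the budget of 64 units are hypothetical.

```python
import math


def core_perf(r):
    # Assumed Pollack's rule: performance grows as the square root of
    # the resources (and, by assumption here, power) spent on one core.
    return math.sqrt(r)


def symmetric(budget, r):
    """All cores identical, each costing r units of the power budget.
    Returns (single_thread_perf, throughput)."""
    cores = budget // r
    return core_perf(r), cores * core_perf(r)


def asymmetric(budget, r):
    """One large core of r units plus unit-sized small cores filling
    the remaining budget. Returns (single_thread_perf, throughput)."""
    small_cores = budget - r
    return core_perf(r), core_perf(r) + small_cores * core_perf(1)


if __name__ == "__main__":
    for r in (1, 4, 16):
        print(r, symmetric(64, r), asymmetric(64, r))
```

Under these assumptions, growing the large core raises single-thread performance in both designs, but the asymmetric design gives up far less throughput to do so, which is the kind of latency-versus-throughput comparison the abstract describes.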
On-chip DC-DC converters have the potential to offer fine-grain power management in modern chip-multiprocessors. This paper presents a fully integrated 3-level DC-DC converter, a hybrid of buck and switched-capacitor converters, implemented in 130 nm CMOS technology. The 3-level converter enables smaller inductors (1 nH) than a buck, while generating a wide range of output voltages compared to a 1/2 mode switched-capacitor converter. The test-chip prototype delivers up to 0.85 A load current while generating output voltages from 0.4 to 1.4 V from a 2.4 V input supply. It achieves 77% peak efficiency at power density of 0.1 W/mm² and 63% efficiency at maximum power density of 0.3 W/mm². The converter scales output voltage from 0.4 V to 1.4 V (or vice-versa) within 20 ns at a constant 450 mA load current. A shunt regulator reduces peak-to-peak voltage noise from 0.27 V to 0.19 V under pseudo-randomly fluctuating load currents. Using simulations across a wide range of design parameters, the paper compares conversion efficiencies of the 3-level, buck and switched-capacitor converters.
RTL design complexity has discouraged adoption of reconfigurable logic in general purpose systems, impeding opportunities for performance and energy improvements. Recent improvements to HLS compilers simplify RTL design and are easing this barrier. A new challenge will emerge: managing reconfigurable resources between multiple applications with custom hardware designs. In this paper, we propose a method to “shrink-fit” accelerators within widely varying fabric budgets. Shrink-fit automatically shrinks existing accelerator designs within small fabric budgets and grows designs to increase performance when larger budgets are available. Our method takes advantage of current accelerator design techniques and introduces a novel architectural approach based on fine-grained virtualization. We evaluate shrink-fit using a synthesized implementation of an IDCT for decoding JPEGs and show the IDCT accelerator can shrink by a factor of 16 with minimal performance and area overheads. Using shrink-fit, application designers can achieve the benefits of hardware acceleration with single RTL designs on FPGAs large and small.