Publications by Year: 2014

Brandon Reagen, Robert Adolf, Sophia Shao, Gu Wei, and David Brooks. 10/26/2014. “MachSuite: Benchmarks for Accelerator Design and Customized Architectures.” In IEEE International Symposium on Workload Characterization (IISWC).
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection among projects and research groups. To provide standardization within the accelerator research community, we present MachSuite, a benchmark suite for high-level synthesis tools and accelerator-centric architectures. MachSuite is the compilation of carefully selected workloads to cover a diverse application space and algorithm choices. All the benchmarks in MachSuite are implemented to be well suited for high-level synthesis. A thorough characterization further demonstrates the diverse behaviors among benchmarks, representative of different customization challenges. MachSuite enables commensurability across research projects while mitigating the burden of accelerator implementation and workload selection.
Svilen Kanev, Kim Hazelwood, Gu Wei, and David Brooks. 10/26/2014. “Tradeoffs between Power Management and Tail Latency in Warehouse-Scale Applications.” In International Symposium on Workload Characterization (IISWC), Pp. 31-40. IEEE.
The growth in datacenter computing has increased the importance of energy-efficiency in servers. Techniques to reduce power have brought server designs close to achieving energy-proportional computing. However, they stress the inherent tradeoff between aggressive power management and quality of service (QoS) – the dominant metric of performance in datacenters. In this paper, we characterize this tradeoff for 15 benchmarks representing workloads from Google’s datacenters. We show that 9 of these benchmarks often toggle their cores between short bursts of activity and sleep. In doing so, they stress sleep selection algorithms and can cause tail latency degradation or missed potential for power savings of up to 10% on important workloads like web search. However, improving sleep selection alone is not sufficient for large efficiency gains on current server hardware. To guide the direction needed for such large gains, we profile datacenter applications for susceptibility to dynamic voltage and frequency scaling (DVFS). We find the largest potential in DVFS which is cognizant of latency/power tradeoffs on a workload-per-workload basis.
Michael Lyons, Gu Wei, and David Brooks. 10/19/2014. “Multi-accelerator system development with the ShrinkFit acceleration framework.” In 2014 IEEE 32nd International Conference on Computer Design (ICCD), Pp. 75–82. IEEE.
This paper introduces the ShrinkFit accelerator framework, which simplifies the design of systems combining multiple accelerators. A single ShrinkFit system design can be deployed to FPGAs large and small, without time-consuming architectural parameter surveys. We describe four ShrinkFit accelerators implemented for an FPGA-based robotic bee brain prototype and demonstrate the flexibility of ShrinkFit with low performance overheads (under 10% on average) and low resource overheads (0-8% for accelerators and under 2% for hard logic blocks).
Michael Lyons, Gu Wei, and David Brooks. 10/2014. “Multi-accelerator system development with the ShrinkFit acceleration framework.” In International Conference on Computer Design. Seoul, Korea (South).
Yakun Shao, Brandon Reagen, Gu Wei, and David Brooks. 6/14/2014. “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures.” In International Symposium on Computer Architecture (ISCA).
Hardware specialization, in the form of accelerators that provide custom datapath and control for specific algorithms and applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerator analysis relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but are also slow and tedious to use, making large design space exploration infeasible. To overcome this problem, we present Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrate its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9%, 4.9%, and 6.6% with respect to RTL implementations. Integrated with architecture-level core and memory hierarchy simulators, Aladdin provides researchers an approach to model the power and performance of accelerators in an SoC environment.
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy Jones, Gu Wei, and David Brooks. 6/14/2014. “HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs.” In International Symposium on Computer Architecture (ISCA), 3rd ed., 42: Pp. 217–228.
Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.
Lin Shiung and Gu Wei. 6/10/2014. “Voltage regulator integrated with semiconductor chip.” United States of America US8749021B2 (U.S. Patent).
The present invention reveals a semiconductor chip structure and its application circuit network, wherein the switching voltage regulator or converter is integrated with a semiconductor chip by chip fabrication methods, so that the semiconductor chip has the ability to regulate voltage within a specific voltage range. Therefore, when many electrical devices of different working voltages are placed on a Printed Circuit Board (PCB), only a certain number of semiconductor chips need to be constructed. Originally, in order to account for the different demands in voltage, power supply units of different output voltages, or a variety of voltage regulators need to be added. However, using the built-in voltage regulator or converter, the voltage range can be immediately adjusted to that which is needed. This improvement allows for easier control of electrical devices of different working voltages and decreases response time of electrical devices.
Pradip Bose, David Brooks, Subhasish Mitra, Karthick Rajamani, Mircea Stan, Kevin Skadron, and Gu Wei. 4/1/2014. “Cross-Layer Modeling Framework for Energy-Efficient Resilience”.

We describe a novel cross-layer, resilience-focused integrated modeling framework. This is targeted to help define ultra energy-efficient embedded systems in the post-14nm CMOS design era, without compromising system-level resilience. The targeted application domain is represented by the suite of applications and kernels announced as part of the ongoing PERFECT program sponsored by DARPA MTO.

Xuan Zhang, Tao Tong, David Brooks, and Gu Wei. 3/31/2014. “Evaluating Adaptive Clocking for Supply-Noise Resilience in Battery-Powered Aerial Microrobotic System-on-Chip.” In IEEE Transactions on Circuits and Systems (TCAS). Vol. PP.
A battery-powered aerial microrobotic System-on-Chip (SoC) has stringent weight and power budgets, which requires fully integrated solutions for both clock generation and voltage regulation. Supply-noise resilience is important yet challenging for such SoC systems due to a non-constant battery discharge profile and load current variability. This paper proposes an adaptive-frequency clocking scheme that can tolerate supply noise and improve performance when implemented with an integrated voltage regulator (IVR). Measurements from a 'brain' SoC, implemented in 40 nm CMOS, demonstrate 2× performance improvement with adaptive-frequency clocking over conventional fixed-frequency clocking. Combining adaptive-frequency clocking with open-loop IVR extends error-free operation to a wider battery voltage range (2.8 to 3.8 V) with higher average performance.
Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu Wei, and David Brooks. 2014. “Breaking Cyclic-Multithreading Parallelization with XML Parsing.” In International Workshop on Parallelism in Mobile Platforms (PRISM).
HELIX-RC, a modern re-evaluation of the cyclic-multithreading (CMT) compiler technique, extracts threads from sequential code automatically. As a CMT approach, HELIX-RC gains performance by running iterations of the same loop on different cores in a multicore. It successfully boosts performance for several SPEC CINT benchmarks previously considered unparallelizable. However, this paper shows there are workloads with different characteristics, which even idealized CMT cannot parallelize. We identify how to overcome an inherent limitation of CMT for these workloads. CMT techniques only run iterations of a single loop in parallel at any given time. We propose exploiting parallelism not only within a single loop, but also among multiple loops. We call this execution model Multiple CMT (MCMT), and show that it is crucial for auto-parallelizing a broader class of workloads. To highlight the need for MCMT, we target a workload that is naturally hard for CMT – parsing XML-structured data. We show that even idealized CMT fails on XML parsing. Instead, MCMT extracts speedups up to 3.9x on 4 cores.