Publications by Year: 2015

Brandon Reagen, Robert Adolf, Gu Wei, and David Brooks. 10/26/2015. “The MachSuite Benchmark.” In Boston Area Architecture Workshop (BARC). Raleigh, NC, USA.
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection. To improve standardization within the accelerator research community, we present MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures. MachSuite spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers, including support for high-level synthesis. We illustrate these aspects by characterizing each benchmark along five different dimensions, highlighting trends and salient features.
MachSuite: Benchmarks for accelerator design and customized architectures
Mario Lok, Xuan Zhang, Elizabeth Helbling, Robert Wood, David Brooks, and Gu Wei. 9/28/2015. “A Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots.” In IEEE Custom Integrated Circuits Conference (CICC).
This paper describes a power electronics unit (PEU) for an insect-scale flapping-wing robot. Three power saving techniques used in the actuator driver of the PEU — envelope tracking, dynamic common mode, and charge sharing — reduce power consumption while retaining weight benefits of an inductor-less linear driver. A pair of actuator driver ICs energize four 15nF capacitor loads, which represent the piezoelectric actuators of a flapping-wing robot. The PEU consumes 290mW, which translates to 37% lower power compared to a design without these power saving techniques.
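The 37% figure in the abstract above can be turned into an implied reference power with one line of arithmetic. The sketch below assumes the saving is expressed relative to the power of the design without the three techniques; the abstract does not state the baseline, and the function name is hypothetical.

```python
def baseline_power_mw(measured_mw: float, reduction_fraction: float) -> float:
    """Power of the reference design implied by a measured power and a fractional saving.

    Assumes measured = baseline * (1 - reduction_fraction).
    """
    if not 0.0 <= reduction_fraction < 1.0:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return measured_mw / (1.0 - reduction_fraction)


# Figures taken from the abstract: 290 mW measured, 37% lower than the reference.
baseline = baseline_power_mw(290.0, 0.37)
saved = baseline - 290.0
print(f"implied baseline: {baseline:.0f} mW, implied saving: {saved:.0f} mW")
```

The result is a back-of-envelope estimate, not a value reported by the authors.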
Hayun Chung, Toprak Deniz, Alexander Rylyakov, John Bulzacchelli, Daniel Friedman, and Gu Wei. 8/2015. “A 7.5 GS/s flash ADC and a 10.24 GS/s time-interleaved ADC for backplane receivers in 65 nm CMOS.” Analog Integrated Circuits and Signal Processing, 85, 2, Pp. 299–310.
This paper presents a 7.5 GS/s, 4.5 bit flash analog-to-digital converter (ADC) for high-speed backplane communication. A two-stage track-and-hold (T/H) structure enables high input bandwidth and low power consumption at the same time. A sampling clock duty cycle control technique, which allocates more tracking time to the bandwidth-limited second T/H stage, facilitates high sampling rates. A digital offset correction scheme compensates both random and systematic offsets due to process variation and T/H amplifier gain nonlinearity, simultaneously. Two test-chip prototypes were fabricated in a 65 nm CMOS process. Experimental results of a standalone ADC chip demonstrate 3.8 effective number of bits (ENOB) at 7.5 GS/s. The figure-of-merit (FOM) of the standalone ADC is 0.49 pJ/conversion-step. The second test chip combines two ADCs together in order to demonstrate a time-interleaved ADC (TI-ADC) for use in high-speed backplane receivers. The TI-ADC operates at 10.24 GS/s while achieving 3.5 ENOB and 0.65 pJ/conversion-step FOM.
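The figure-of-merit values quoted in the ADC abstract above relate power, resolution, and sampling rate. The sketch below assumes the conventional Walden definition, FOM = P / (2^ENOB × f_s); the abstract does not say which definition was used, so both the formula and the implied power numbers it prints are illustrative rather than reported results. The function names are hypothetical.

```python
def walden_fom_pj(power_w: float, enob: float, sample_rate_hz: float) -> float:
    """Walden figure of merit in pJ per conversion step."""
    return power_w / (2.0 ** enob * sample_rate_hz) * 1e12


def implied_power_mw(fom_pj: float, enob: float, sample_rate_hz: float) -> float:
    """Invert the Walden FOM to get the power (in mW) it implies."""
    return fom_pj * 1e-12 * 2.0 ** enob * sample_rate_hz * 1e3


# FOM, ENOB, and sampling rate are taken from the abstract.
standalone_mw = implied_power_mw(0.49, 3.8, 7.5e9)
interleaved_mw = implied_power_mw(0.65, 3.5, 10.24e9)
print(f"standalone ADC: about {standalone_mw:.0f} mW (implied)")
print(f"time-interleaved ADC: about {interleaved_mw:.0f} mW (implied)")
```

A lower FOM means less energy spent per effective conversion step, which is why the standalone converter compares favorably with the interleaved one despite its lower sampling rate.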
Paul Whatmough, George Smart, Shidhartha Das, Yiannis Andreopoulos, and David Bull. 6/17/2015. “A 0.6V All-Digital Body-Coupled Wakeup Transceiver for IoT Applications.” In IEEE Symposium on VLSI Circuits (VLSIC). Kyoto, Japan.
A body-coupled symmetric wakeup transceiver is proposed for always-on device discovery in IoT applications requiring security and low power consumption. The wakeup transceiver (WTRx) is implemented in 65nm CMOS using digital logic cells and operates at 0.6V. A directly-modulated open-loop DCO generates an OOK-modulated 10MHz carrier, with a frequency-locked loop for intermittent calibration. A passive receiver incorporates a digital IO cell as a hysteretic comparator, with a two-phase correlator bank. A novel MAC scheme allows for duty-cycling in both transmitter and receiver. Measured power consumption is 3.54μW, with sensitivity of 88mV and maximum wakeup latency of 150ms.
Xuan Zhang, Mario Lok, Tao Tong, Simon Chaput, Sae Lee, Brandon Reagen, Hyunkwang Lee, David Brooks, and Gu Wei. 6/17/2015. “A Multi-Chip System Optimized for Insect-Scale Flapping-Wing Robots.” In IEEE Symposium on VLSI Circuits (VLSIC).
We demonstrate a battery-powered multi-chip system optimized for insect-scale flapping-wing robots that meets the tight weight limit and real-time performance demands of autonomous flight. Measured results show open-loop wing flapping driven by a power electronics unit and energy efficiency improvements via hardware acceleration.
Brandon Reagen, Xuan Zhang, David Brooks, and Gu Wei. 6/14/2015. “From PDF to GDS: Designing the RoboBee SoC.” WARP 2015 6th Workshop on Architectural Research Prototyping (Co-Located with the 42nd International Symposium on Computer Architecture).
Developing the RoboBee was a multi-discipline, 5-year project funded by a National Science Foundation Expeditions in Computing award, with the goal of achieving autonomous flight with a bee-sized micro-robot [2]. The intent of the research was to help re-stabilize the declining bee population, a decline that researchers have shown could have devastating effects on the Earth's ecosystem. Bees are remarkably efficient: their skeletons weigh almost nothing, requiring minimal lift to take off and sustain flight, and their brains are small, pre-programmed with a minimal set of instincts necessary for the colony's survival. Their capabilities under such stringent weight and compute limitations make them a prime target for pushing what modern robotics and computer systems can do. The weight and power limits require that a custom System-on-Chip (SoC) be built. Conventional off-chip voltage regulators are heavy and bulky, and thus cannot fit within the weight and form factor of the robotic bee. Commercial off-the-shelf (COTS) microcontrollers consume too much power to perform the computation required for autonomous flight. The solution is to pack as much IP as possible onto a single die. SoCs have been the trend across semiconductor companies over the past decade, from mobile and embedded to server-grade solutions. In this paper we recount our experiences designing such a chip. We highlight the major challenges faced when designing for such a unique form factor, how designs and specifications were set by each collaborating lab, the difficulties of integrating a plethora of IP consisting of in-house digital and analog blocks, and the design flows we used. We also discuss how invaluable HLS was in reducing the engineering burden, focusing design efforts at higher levels of abstraction, and achieving an overall successful tape-out.
Svilen Kanev, Juan Darago, Kim Hazelwood, Tipp Moseley, Gu Wei, and David Brooks. 6/13/2015. “Profiling a Warehouse-Scale Computer.” In International Symposium on Computer Architecture (ISCA).
With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
Sae Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu Wei. 6/2015. “A 16-Core Voltage-Stacked System with an Integrated Switched-Capacitor DC-DC Converter.” In IEEE Symposium on VLSI Circuits (VLSIC), 99: Pp. 1-14. Kyoto, Japan.
A 16-core voltage-stacked IC integrated with a switched-capacitor DC-DC converter demonstrates efficient power delivery. To overcome inter-layer voltage noise issues, the test chip implements and evaluates the benefits of self-timed clocking and clock-phase interleaving. The integrated converter offers minimum voltage guarantees and further reduces voltage noise.
Yakun Shao, Brandon Reagen, Gu Wei, and David Brooks. 5/13/2015. “The Aladdin approach to accelerator design and modeling.” IEEE Micro, 35, 3, Pp. 58–70.
Hardware specialization, in the form of datapath and control circuitry customized to particular algorithms or applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerators relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but also are slow and tedious to use, making large design space exploration infeasible. To overcome this problem, the authors developed Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrated its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9, 4.9, and 6.6 percent with respect to RTL implementations. Integrated with architecture-level general-purpose core and memory hierarchy simulators, Aladdin provides researchers with a fast but accurate way to model the power and performance of accelerators in an SoC environment.
Simone Campanoni, Glenn Holloway, Gu Wei, and David Brooks. 2/7/2015. “HELIX-UP: Relaxing Program Semantics to Unleash Parallelization.” In International Symposium on Code Generation and Optimization (CGO), Pp. 235–245. San Francisco, CA, USA.
Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.
Sam Xi, Hans Jacobson, Pradip Bose, Gu Wei, and David Brooks. 2/7/2015. “Quantifying Sources of Error in McPAT and Potential Impacts on Architectural Studies.” In International Symposium on High Performance Computer Architecture (HPCA).
Architectural power modeling tools are widely used by the computer architecture community for rapid evaluations of high-level design choices and design space explorations. Currently, McPAT is the de facto power model, but the literature does not yet contain a careful examination of its modeling accuracy. In addition, the issue of how greatly power modeling error can affect architectural-level studies has not been quantified before. In this work, we present the first rigorous assessment of McPAT’s core power and area models with a detailed, validated power modeling toolchain used in current industrial practice. We find that McPAT’s predictions can have significant error because some of the models are either incomplete, too high-level, or assume implementations of structures that differ from that of the core at hand. We demonstrate that large errors are possible when using McPAT’s dynamic power estimates in the context of voltage noise and thermal hotspots, but for steady-state properties, accurately modeling leakage power is more important. Based on our analysis, we are able to provide guidelines for creating accurate McPAT models, even without access to detailed industrial power modeling tools. We conclude that in spite of its accuracy gaps, McPAT is still a very useful tool for many architectural studies, and its limitations can often be adequately addressed for a given research study of interest.
Vijay Reddi, Meeta Gupta, Glenn Holloway, Gu Wei, Michael Smith, and David Brooks. 2015. “Adaptive event-guided system and method for avoiding voltage emergencies.”
In a preferred embodiment, the present invention is a system for avoiding voltage emergencies. The system comprises a microprocessor, an actuator for throttling the microprocessor, a voltage emergency detector and a voltage emergency predictor. The voltage emergency detector may comprise, for example, a checkpoint recovery mechanism or a sensor. The voltage emergency predictor of a preferred embodiment comprises means for tracking control flow instructions and microarchitectural events, means for storing voltage emergency signatures that cause voltage emergencies, means for comparing current control flow and microarchitectural events with stored voltage emergency signatures to predict voltage emergencies, and means for actuating said actuator to throttle said microprocessor to avoid predicted voltage emergencies.
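The predictor described in the abstract above (track recent events, store the signatures that preceded an emergency, compare the current events against the stored signatures, throttle on a match) can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the patented design: the class name, the fixed window length, the event names, and the exact-match comparison are all hypothetical.

```python
from collections import deque


class EmergencyPredictor:
    """Toy model of a signature-based voltage-emergency predictor."""

    def __init__(self, window: int = 4):
        # Sliding window of the most recent control-flow / microarchitectural events.
        self.history = deque(maxlen=window)
        # Event sequences that were observed right before a detected emergency.
        self.signatures = set()

    def observe(self, event: str) -> bool:
        """Record one event; return True if the processor should be throttled."""
        self.history.append(event)
        return tuple(self.history) in self.signatures

    def record_emergency(self) -> None:
        """Called when the detector reports an emergency: learn the current window."""
        self.signatures.add(tuple(self.history))


# Hypothetical event trace that ends in a detected emergency.
trace = ["branch_taken", "l2_miss", "pipeline_flush"]
predictor = EmergencyPredictor(window=3)
for event in trace:
    predictor.observe(event)
predictor.record_emergency()

# After recovery, the same sequence recurs; the last event completes the signature.
predictor.history.clear()
decisions = [predictor.observe(event) for event in trace]
print(decisions)
```

In this model the detector (for example a sensor or a checkpoint-recovery mechanism, as the abstract mentions) is what calls `record_emergency`, and a `True` return from `observe` stands in for actuating the throttle.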
Brandon Reagen, Gu Wei, and David Brooks. 2015. “How Hardware Accelerators Trade-Off Pipelining and Parallelism to Maximize Efficiency.” In Boston Area Architecture Workshop (BARC).
Yakun Shao, Sam Xi, Viji Srinivasan, Gu Wei, and David Brooks. 2015. “Toward Cache-Friendly Hardware Accelerators.” In HPCA Sensors and Cloud Architectures Workshop (SCAW).
Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as servers and desktops. Despite their energy efficiency, fixed-function accelerators lack programmability, especially compared with general-purpose processors. Today’s accelerators rely on software-managed scratchpad memory and Direct Memory Access (DMA) to provide fixed-latency memory access and data transfer, which leads to significant chip resource and software engineering costs. On the other hand, hardware-managed caches with support for virtual memory and cache coherence are well-known to ease programmability in general-purpose processors, but these features are not commonly supported in today’s fixed-function accelerators. As a first step toward cache-friendly accelerator design, this paper discusses limitations of scratchpad-based memories in today’s accelerators, identifies challenges to support hardware-managed caches, and explores opportunities to ease the cache integration.
Khalid Al-Hawaj, Simone Campanoni, Gu Wei, and David Brooks. 2015. “Unified Cache: A Case for Low-Latency Communication.” In 3rd International Workshop on Parallelism in Mobile Platforms (PRISM). Portland, OR, USA.
Increasing computational demand on mobile devices calls for energy-friendly solutions for accelerating single programs. In the multicore era, thread-level parallelism (TLP) can accelerate single-threaded programs without requiring power-hungry cores. HELIX-RC, a recently proposed co-design between the HELIX parallelizing compiler and its target architecture, shows that substantial TLP can be extracted from loops with small bodies by optimizing core-to-core communication. Previously, the effectiveness of the HELIX-RC approach has been demonstrated through simulation. In this paper, we evaluate a HELIX-RC-like solution on a real platform. We have developed a simplified version of the HELIX-RC architecture that we call unified cache, and we have implemented it on an FPGA board. Our design augments a multicore platform with a simplified ring cache, the architectural component of the HELIX-RC co-design. With the aid of microbenchmarks, our FPGA prototype confirms the HELIX-RC findings. After describing both the ring cache and the parallel code generated by the HELIX compiler, we sketch the design of the unified cache and we evaluate its implementation on a Xilinx VC707 FPGA board.