Publications

2008
Benjamin Lee and David Brooks. 3/2008. “Efficiency trends and limits from comprehensive microarchitectural adaptivity.” In ACM SIGARCH Computer Architecture News, 3rd ed., 43: Pp. 36–47. ACM.

Increasing demand for power-efficient, high-performance computing requires tuning applications and/or the underlying hardware to improve the mapping between workload heterogeneity and computational resources. To assess the potential benefits of hardware tuning, we propose a framework that leverages synergistic interactions between recent advances in (a) sampling, (b) predictive modeling, and (c) optimization heuristics. This framework enables qualitatively new capabilities in analyzing the performance and power characteristics of adaptive microarchitectures. For the first time, we are able to simultaneously consider high temporal and comprehensive spatial adaptivity. In particular, we optimize efficiency for many, short adaptive intervals and identify the best configuration of 15 parameters, which define a space of 240B points.

With frequent sub-application reconfiguration and a fully reconfigurable hardware substrate, adaptive microarchitectures achieve bips^3/w efficiency gains of up to 5.3x (median 2.4x) relative to their static counterparts already optimized for a given application. This 5.3x efficiency gain is derived from a 1.6x performance gain and 0.8x power reduction. Although several applications achieve a significant fraction of their potential efficiency with as few as three adaptive parameters, the three most significant parameters differ across applications. These differences motivate a hardware substrate capable of comprehensive adaptivity to meet these diverse application requirements.

Kevin Brownell, Gu Wei, and David Brooks. 3/2008. “Evaluation of voltage interpolation to address process variations.” In Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design, Pp. 529–536. IEEE Press.

Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. The paper discusses design tradeoffs between circuit tuning range and static power overheads that can be performed within the synthesis flow of the design process. The paper explores the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks and shows that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. The analysis shows that the scheme can match the nominal delay target with a 10% power cost, or for the same power budget, incur only a 9% delay overhead after variations.

Meeta Gupta, Krishna Rangan, Michael Smith, Gu Wei, and David Brooks. 2/16/2008. “DeCoR: A delayed commit and rollback mechanism for handling inductive noise in processors.” In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Pp. 381–392. IEEE.
Increases in peak current draw and reductions in the operating voltage of processors stress the importance of dealing with voltage fluctuations in processors. Noise-margin violations lead to undesired effects, like timing violations, which may result in incorrect execution of applications. Several recent architectural solutions for inductive noise have been proposed that, unfortunately, have a strong correlation to the underlying power-delivery package model and require a feedback loop that is largely constrained by the voltage/current sensor characteristics. The resulting solutions are not robust across a wide range of microprocessor designs and packaging technologies. This paper proposes a Delayed-commit and rollback scheme (DeCoR) that guarantees correctness, insensitive to the package model or the responsiveness of the voltage sensors. In particular, our approach recovers from, rather than attempting to avoid, voltage emergencies. This approach incurs a small performance penalty when compared to an ideal machine that does not have voltage emergencies. We show that explicit checkpoint-recovery schemes, intended to handle infrequent events, e.g., radiation-induced soft errors, suffer from large performance overheads for frequently-occurring voltage emergencies. DeCoR requires very few modifications to modern processor designs, as it leverages the existing store queue and reorder buffers. Unlike conventional designs that conservatively protect all components of the processor from inductive noise with overly-large timing margins, our approach only requires conservative protection of the architected register state and cache write paths.
Wonyoung Kim, Meeta Gupta, Gu Wei, and David Brooks. 2/16/2008. “System level analysis of fast, per-core DVFS using on-chip switching regulators.” In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Pp. 123–134. Salt Lake City, UT, USA: IEEE.
Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur on the order of tens of microseconds. In addition, the recent trend towards chip-multiprocessors (CMP) executing multi-threaded workloads with heterogeneous behavior motivates the need for per-core DVFS control mechanisms. Voltage regulators that are integrated onto the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and per-core voltage control. We show that these characteristics provide significant energy-saving opportunities compared to traditional off-chip regulators. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics, which are significantly impacted by the system-level application of the regulator. In this paper, we describe and model these costs, and perform a comprehensive analysis of a CMP system with on-chip integrated regulators. We conclude that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.
Xiaoyao Liang, David Brooks, and Gu Wei. 2/3/2008. “A process-variation-tolerant floating-point unit with voltage interpolation and variable latency.” In 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, Pp. 404–623. San Francisco, CA, USA: IEEE.
This paper explores two fine-grained, post-fabrication circuit-tuning techniques to combat process variation for pipelined logic components: voltage interpolation and variable latency. These techniques are applied to a single-precision floating-point unit (FPU) designed using a standard CAD synthesis flow in a 0.13 µm CMOS logic process with 8 metal layers. Measured results from fabricated chips show that both techniques provide wide frequency tuning range to deal with frequency fluctuations arising from process variations with minimal power overhead, and in some configurations, power savings.
Hanumolu Kumar, Kratyuk Volodymyr, Gu Wei, and Moon Un-Ku. 2/2008. “A sub-picosecond resolution 0.5–1.5 GHz digital-to-phase converter.” IEEE Journal of Solid-State Circuits, 43, 2, Pp. 414–424.
A digital-to-phase converter (DPC) is an essential building block in applications such as source-synchronous interfaces and digital phase modulators. The resolution of DPCs using analog phase interpolators is severely affected by the operating frequency and rise times of the interpolator inputs. In this paper, we present a new DPC architecture that achieves high resolution independent of both the operating frequency and the rise time. The 8 phases generated by a phase-locked loop are dithered using a delta-sigma modulator to shape the truncation error to high frequency and are subsequently filtered using a delay-locked loop phase filter. The test chip, fabricated in a 0.13 µm CMOS process, operates from 0.5–1.5 GHz and achieves a differential nonlinearity of less than ±0.1 ps and an integral nonlinearity of ±12 ps. The total power consumption while operating at 1 GHz is 15 mW.
Hanumolu Kumar, Gu Wei, and Moon Ku. 2/2008. “A wide-tracking range clock and data recovery circuit.” IEEE Journal of Solid-State Circuits, 43, 2, Pp. 425–439.
A hybrid analog-digital quarter-rate clock and data recovery circuit (CDR) that achieves a wide-tracking range and excellent frequency and phase tracking resolution is presented in this paper. A split-tuned analog phase-locked loop (PLL) provides eight equally spaced phases needed for quarter-rate data recovery and the digital CDR loop adjusts the phase of the PLL output clocks in a precise manner to facilitate plesiochronous clocking. The CDR employs a second-order digital loop filter and combines delta-sigma modulation with the analog PLL to achieve sub-picosecond phase resolution and better than 2 ppm frequency resolution. A test chip fabricated in a 0.18 µm CMOS process achieves BER < 10^-12 and consumes 14 mW power while operating at 2 Gb/s. The tracking range is greater than ±5000 ppm and ±2500 ppm at 10 kHz and 20 kHz modulation frequencies, respectively, making this CDR suitable for systems employing spread-spectrum clocking.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 1/2008. “Replacing 6T SRAMs with 3T1D DRAMs in the L1 data cache to combat process variability.” Micro, IEEE, 28, 1, Pp. 60–68.
With continued technology scaling, process variations will be especially detrimental to six-transistor static memory structures (6T SRAMs). A memory architecture using three-transistor, one-diode DRAM (3T1D) cells in the L1 data cache tolerates wide process variations with little performance degradation, making it a promising choice for on-chip cache structures for next-generation microprocessors.
Michael Lyons and David Brooks. 2008. “Application-Specific Hardware Design for Wireless Sensor Network Energy and Delay Reduction.” Workshop on Optimizations for DSP and Embedded Systems (ODES).
Battery-powered embedded systems, such as wireless sensor network (WSN) motes, require low energy usage to extend system lifetime. WSN motes must power sensors, a processor, and a radio for wireless communication over long periods of time, and are therefore particularly sensitive to energy use. Recent techniques for reducing WSN energy consumption, such as aggregation, require additional computation to reduce the cost of sending data by minimizing radio data transmissions. Larger demands on the processor will require more computational energy, but traditional energy reduction approaches, such as multi-core scaling with reduced frequency and voltage, may prove heavy-handed and ineffective for motes. Instead, application-specific hardware design (ASHD) architectures can reduce computational energy consumption by processing operations common to specific applications more efficiently than a general-purpose processor. By the nature of their deeply embedded operation, motes support a limited set of applications, and thus the conventional general-purpose computing paradigm may not be well-suited to mote operation. Both simple and complex operations can improve performance and use orders of magnitude less energy with application-specific hardware. This paper examines the design considerations of a hardware accelerator for compressed Bloom filters, a data structure for efficiently storing set membership. Additionally, we evaluate our ASHD design for three representative wireless sensor network applications: monitoring network-wide mote status, object tracking, and on-mote duplicate packet filtering. We demonstrate that ASHD design reduces network latency by 59% and computational energy by 98%, and show the need for architecting processors for ASHD accelerators.
Xiaoyao Liang, Benjamin Lee, Gu Wei, and David Brooks. 2008. “Design and Test Strategies for Microarchitectural PostFabrication.”

Process variations are a major hurdle for continued technology scaling. Both systematic and random variations will affect the critical delay of fabricated chips, causing a wide frequency and power distribution. Tuning techniques adapt the microarchitecture to mitigate the impact of variations at post-fabrication testing time. This paper proposes a new post-fabrication testing framework that accounts for testing costs. This framework uses on-chip canary circuits to capture systematic variation while using statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. These techniques comprise an integrated framework that identifies the most energy efficient post-fabrication tuning configuration for each chip.

2007
Ratnayake NS, Haratsch F, and Gu Wei. 12/2007. “A Bit-Node Centric Architecture for Low-Density Parity-Check Decoders.” In IEEE GLOBECOM 2007-IEEE Global Telecommunications Conference, Pp. 265–270. IEEE.
A bit-node centric decoder architecture for low-density parity-check codes is proposed. This architecture performs the optimum sum-product algorithm. A bit node processing unit computes the bit-to-check node messages sequentially, while the computation of the check-to-bit node messages is broken up into several steps. A stand-alone decoder architecture and a decoder architecture for a concatenated detector-decoder system are presented. The proposed stand-alone decoder architecture requires significantly less memory compared to other known serial architectures. The hardware requirements are reduced even further for the concatenated detector-decoder system.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 12/2007. “Process variation tolerant 3T1D-based cache architectures.” In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, Pp. 15–26. Chicago, IL, USA: IEEE Computer Society.
Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tolerate large process variations with little or even no impact on performance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new memory architectures a promising choice for on-chip variation-tolerant cache structures required for next-generation microprocessors.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 12/2007. “Process Variation Tolerant Register Files Based On Dynamic Memories.” Workshop on Architectural Support for Gigascale Integration, held with Int’l Symposium on Computer Architecture (ISCA-34).
Transistor gate length and threshold voltage variability due to process variations will greatly impact the stability, leakage power, and performance of future microprocessors. These variations are especially detrimental to continued scaling of 6T SRAM (6-transistor static memory) structures. This paper proposes replacing traditional SRAM-based cells in multiported register files with cells based on 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells, which can absorb the effects of device physical variations into a single parameter: the data retention time. By leveraging the transient data in the processor and dependency slack in the pipeline, retention time variation can be hidden in the existing processor architecture. Thus the proposed register file can effectively tolerate very large process variation with little or even no impact on performance, address stability concerns, and reduce power consumption when compared with ideal SRAM-based designs. Detailed circuit and architectural simulations and analysis verify a 1% normalized performance loss even under very large process variations, and 22% average power savings.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 12/2007. “Process variation tolerant register files based on dynamic memories.” In Workshop on Architectural Support for Gigascale Integration (ASGI-07) in conjunction with ISCA.
Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes.
Xiaoyao Liang, Kerem Turgay, and David Brooks. 11/4/2007. “Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques.” In Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design, Pp. 824–830. IEEE Press.
The need to perform power analysis in the early stages of the design process has become critical as power has become a major design constraint. Embedded and high-performance microprocessors incorporate large on-chip cache and similar SRAM-based or CAM-based structures, and these components can consume a significant fraction of the total chip power. Thus an accurate power modeling method for such structures is important in early architecture design studies. We present a unified architecture-level power modeling methodology for array structures that is highly accurate, parameterizable, and technology-scalable. We demonstrate the applicability of the model to different memory structures (SRAMs and CAMs) and include leakage variability in advanced technologies. The power modeling approach is validated against HSPICE power simulation results, and we show power estimation accuracy within 5% of detailed circuit simulations.
Burnham R, Gu Wei, Yang Ken, and Hindi Haitham. 9/16/2007. “A comprehensive phase-transfer model for delay-locked loops.” In 2007 IEEE Custom Integrated Circuits Conference, Pp. 627–630. IEEE.
This paper presents a comprehensive model for analyzing the behavior of an analog delay-locked loop (DLL). Unlike previous models, the proposed version includes both constant and variable phase-offset terms, making it possible to calculate jitter transfer characteristics, stability, and static phase errors from a single unified model. The topology more closely approximates the underlying architecture of the DLL, resulting in improved accuracy and enabling better tradeoffs between bandwidth, stability, and power.
Hanumolu Kumar, Gu Wei, Moon Ku, and Kartikeya Mayaram. 9/16/2007. “Digitally-enhanced phase-locking circuits.” In 2007 IEEE Custom Integrated Circuits Conference, Pp. 361–368. IEEE.
In this paper, we present an overview of digital techniques that can overcome the drawbacks of analog phase-locked loops (PLLs) implemented in deep-submicron CMOS processes. The design of key building blocks of digital PLLs, such as the time-to-digital converter and digital-to-frequency converters, is discussed in detail. The implementation and measured results of two digital PLL architectures, (1) based on a digitally controlled oscillator and (2) based on a digital phase accumulator, are presented. The experimental results demonstrate the feasibility of using digital PLLs in digital systems requiring high-performance PLLs.
Ratnayake NS, F Haratsch, and Gu Wei. 9/2007. “Serial Sum-Product Architecture for Low-Density Parity-Check Codes.” In 2007 16th International Conference on Computer Communications and Networks, Pp. 154–158. IEEE.
A serial sum-product architecture for low-density parity-check (LDPC) codes is presented. In the proposed architecture, a standard bit node processing unit computes the bit to check node messages sequentially, while the check node computations are broken up into several steps and computed on the fly. This bit node centric architecture requires considerably less memory compared to other serial architectures, including the check node centric architecture.
Meeta Gupta, Krishna Rangan, Michael Smith, Gu Wei, and David Brooks. 8/27/2007. “Towards a software approach to mitigate voltage emergencies.” In Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, Pp. 123–128. IEEE.
Increases in peak current draw and reductions in the operating voltages of processors continue to amplify the importance of dealing with voltage fluctuations in processors. One approach suggested has been to not only react to these fluctuations but also attempt to eliminate future occurrences of these fluctuations by dynamically modifying the executing program. This paper investigates the potential of a very simple dynamic scheme to appreciably reduce the number of run-time voltage emergencies. It shows that we can map many of the voltage emergencies in the execution of the SPEC benchmarks on an aggressive superscalar design to a few static loops, categorize the microarchitectural cause of the emergencies in each important loop through simple observations and a simple priority function, and finally apply straightforward software optimization strategies to mitigate up to 70% of the future voltage swings.
Helal M, Straayer Z, Gu Wei, and Perrott H. 6/14/2007. “A low jitter 1.6 GHz multiplying DLL utilizing a scrambling time-to-digital converter and digital correlation.” In 2007 IEEE Symposium on VLSI Circuits, Pp. 166–167. IEEE.
This paper presents a 1.6 GHz multiplying delay-locked loop (MDLL) that leverages time-to-digital conversion and a digital correlation technique to achieve low deterministic jitter while still maintaining low random jitter. A proposed time-to-digital converter consists of a ring oscillator that is gated on and off to accurately measure time and scramble the measurement's residual error. Using a 50 MHz reference, the prototype system has measured reference spurs less than -59 dBc and an overall measured jitter of 1.41 ps.
