Publications by Year: 2007

2007
Ratnayake NS, Haratsch F, and Gu Wei. 12/2007. “A Bit-Node Centric Architecture for Low-Density Parity-Check Decoders.” In IEEE GLOBECOM 2007-IEEE Global Telecommunications Conference, Pp. 265–270. IEEE.
A bit-node centric decoder architecture for low-density parity-check codes is proposed. This architecture performs the optimum sum-product algorithm. A bit node processing unit computes the bit-to-check node messages sequentially, while the computation of the check-to-bit node messages is broken up into several steps. A stand-alone decoder architecture, and a decoder architecture for a concatenated detector-decoder system are presented. The proposed stand-alone decoder architecture requires significantly less memory compared to other known serial architectures. The hardware requirements are reduced even further for the concatenated detector-decoder system.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 12/2007. “Process variation tolerant 3T1D-based cache architectures.” In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, Pp. 15–26. Chicago, IL, USA: IEEE Computer Society.
Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tolerate large process variations with little or even no impact on performance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new memory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 12/2007. “Process Variation Tolerant Register Files Based On Dynamic Memories.” Workshop on Architectural Support for Gigascale Integration, held with Int’l Symposium on Computer Architecture (ISCA-34).
Transistor gate length and threshold voltage variability due to process variations will greatly impact the stability, leakage power, and performance of future microprocessors. These variations are especially detrimental to continued scaling of 6T SRAM (6-transistor static memory) structures. This paper proposes replacing traditional SRAM-based cells in multiported register files with cells based on 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells, which can absorb the effects of device physical variations into a single parameter: the data retention time. By leveraging the transient data in the processor and dependency slack in the pipeline, retention time variation can be hidden into the existing processor architecture. Thus the proposed register file can effectively tolerate very large process variation with little or even no impact on performance, addresses stability concerns, and reduces power consumption, when compared with ideal SRAM-based designs. Detailed circuit and architectural simulations and analysis verify a 1% normalized performance loss even under very large process variations, and 22% average power savings.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 12/2007. “Process variation tolerant register files based on dynamic memories.” In Workshop on Architectural Support for Gigascale Integration (ASGI-07) in conjunction with ISCA.
Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes.
Xiaoyao Liang, Kerem Turgay, and David Brooks. 11/4/2007. “Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques.” In Proceedings of the 2007 IEEE/ACM international conference on Computer-aided design, Pp. 824–830. IEEE Press.
The need to perform power analysis in the early stages of the design process has become critical as power has become a major design constraint. Embedded and high-performance microprocessors incorporate large on-chip cache and similar SRAM-based or CAM-based structures, and these components can consume a significant fraction of the total chip power. Thus an accurate power modeling method for such structures is important in early architecture design studies. We present a unified architecture-level power modeling methodology for array structures which is highly-accurate, parameterizable, and technology scalable. We demonstrate the applicability of the model to different memory structures (SRAMs and CAMs) and include leakage-variability in advanced technologies. The power modeling approach is validated against HSPICE power simulation results, and we show power estimation accuracy within 5% of detailed circuit simulations.
Burnham R, Gu Wei, Yang Ken, and Hindi Haitham. 9/16/2007. “A comprehensive phase-transfer model for delay-locked loops.” In 2007 IEEE Custom Integrated Circuits Conference, Pp. 627–630. IEEE.
This paper presents a comprehensive model for analyzing the behavior of an analog delay-locked loop (DLL). Unlike previous models, the proposed version includes both constant and variable phase-offset terms, making it possible to calculate jitter transfer characteristics, stability, and static phase errors from a single unified model. The topology more closely approximates the underlying architecture of the DLL, resulting in improved accuracy and enabling better tradeoffs between bandwidth, stability, and power.
Hanumolu Kumar, Gu Wei, Moon Ku, and Kartikeya Mayaram. 9/16/2007. “Digitally-enhanced phase-locking circuits.” In 2007 IEEE Custom Integrated Circuits Conference, Pp. 361–368. IEEE.
In this paper, we present an overview of digital techniques that can overcome the drawbacks of analog phase-locked loops (PLLs) implemented in deep-submicron CMOS processes. The design of key building blocks of digital PLLs such as the time-to-digital converter and digital-to-frequency converters are discussed in detail. The implementation and measured results of two digital PLL architectures, (1) based on a digitally controlled oscillator and (2) based on a digital phase accumulator, are presented. The experimental results demonstrate the feasibility of using digital PLLs in digital systems requiring high-performance PLLs.
Ratnayake NS, F Haratsch, and Gu Wei. 9/2007. “Serial Sum-Product Architecture for Low-Density Parity-Check Codes.” In 2007 16th International Conference on Computer Communications and Networks, Pp. 154–158. IEEE.
A serial sum-product architecture for low-density parity-check (LDPC) codes is presented. In the proposed architecture, a standard bit node processing unit computes the bit to check node messages sequentially, while the check node computations are broken up into several steps and computed on the fly. This bit node centric architecture requires considerably less memory compared to other serial architectures, including the check node centric architecture.
Meeta Gupta, Krishna Rangan, Michael Smith, Gu Wei, and David Brooks. 8/27/2007. “Towards a software approach to mitigate voltage emergencies.” In Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, Pp. 123–128. IEEE.
Increases in peak current draw and reductions in the operating voltages of processors continue to amplify the importance of dealing with voltage fluctuations in processors. One approach suggested has been to not only react to these fluctuations but also attempt to eliminate future occurrences of these fluctuations by dynamically modifying the executing program. This paper investigates the potential of a very simple dynamic scheme to appreciably reduce the number of run-time voltage emergencies. It shows that we can map many of the voltage emergencies in the execution of the SPEC benchmarks on an aggressive superscalar design to a few static loops, categorize the microarchitectural cause of the emergencies in each important loop through simple observations and a simple priority function, and finally apply straightforward software optimization strategies to mitigate up to 70% of the future voltage swings.
Helal M, Straayer Z, Gu Wei, and Perrott H. 6/14/2007. “A low jitter 1.6 GHz multiplying DLL utilizing a scrambling time-to-digital converter and digital correlation.” In 2007 IEEE Symposium on VLSI Circuits, Pp. 166–167. IEEE.
This paper presents a 1.6 GHz multiplying delay-locked loop (MDLL) that leverages time-to-digital conversion and a digital correlation technique to achieve low deterministic jitter while still maintaining low random jitter. A proposed time-to-digital converter consists of a ring oscillator that is gated on and off to accurately measure time and scramble the measurement's residual error. Using a 50 MHz reference, the prototype system has measured reference spurs less than -59 dBc and an overall measured jitter of 1.41 ps.
David Brooks, Robert Dick, Russ Joseph, and Li Shang. 5/2007. “Power, thermal, and reliability modeling in nanometer-scale microprocessors.” Micro, IEEE, 27, 3, Pp. 49–62.
System integration and performance requirements are dramatically increasing the power consumptions and power densities of high-performance microprocessors. High power consumption introduces challenges to various aspects of microprocessor and computer system design. It increases the cost of cooling and packaging design, reduces system reliability, complicates power supply circuitry design, and reduces battery life. Researchers have recently dedicated intensive effort to power-related design problems. Modeling is the essential first step toward design optimization. In this article, the power, thermal and reliability modeling problems are explained and recent advances in their accurate and efficient analysis are surveyed.
Benjamin Lee and David Brooks. 5/2007. “Spatial Sampling and Regression Strategies.” Micro, IEEE, 27, 3, Pp. 74–93.
This new simulation paradigm for microarchitectural design evaluation and optimization counters growing simulation costs stemming from the exponentially increasing size of design spaces. The authors demonstrate how to obtain a more comprehensive understanding of the design space by selectively simulating a modest number of designs from that space and then more effectively leveraging the simulation data using techniques in statistical inference.
Meeta Gupta, Jarod Oatley, Russ Joseph, Gu Wei, and David Brooks. 4/16/2007. “Understanding voltage variations in chip multiprocessors using a distributed power-delivery network.” In Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE'07, Pp. 1–6. Nice, France: IEEE.
Recent efforts to address microprocessor power dissipation through aggressive supply voltage scaling and power management require that designers be increasingly cognizant of power supply variations. These variations, primarily due to fast changes in supply current, can be attributed to architectural gating events that reduce power dissipation. In order to study this problem, the authors propose a fine-grain, parameterizable model for power-delivery networks that allows system designers to study localized, on-chip supply fluctuations in high-performance microprocessors. Using this model, the authors analyze voltage variations in the context of next-generation chip-multiprocessor (CMP) architectures using both real applications and synthetic current traces. They find that the activity of distinct cores in CMPs presents several new design challenges when considering power supply noise, and they describe potentially problematic activity sequences that are unique to CMP architectures.
Benjamin Lee, David Brooks, Bronis Supinski, Martin Schulz, Karan Singh, and Sally McKee. 3/2007. “Methods of inference and learning for performance modeling of parallel applications.” In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, Pp. 249–258. ACM.

Increasing system and algorithmic complexity combined with a growing number of tunable application parameters pose significant challenges for analytical performance modeling. We propose a series of robust techniques to address these challenges. In particular, we apply statistical techniques such as clustering, association, and correlation analysis, to understand the application parameter space better. We construct and compare two classes of effective predictive models: piecewise polynomial regression and artificial neural networks. We compare these techniques with theoretical analyses and experimental results. Overall, both regression and neural networks are accurate with median error rates ranging from 2.2 to 10.5 percent. The comparable accuracy of these models suggests differentiating features will arise from ease of use, transparency, and computational efficiency.

Benjamin Lee and David Brooks. 2/10/2007. “Illustrative design space studies with microarchitectural regression models.” In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, Pp. 340–351. Phoenix, Arizona, USA: IEEE.
We apply a scalable approach for practical, comprehensive design space evaluation and optimization. This approach combines design space sampling and statistical inference to identify trends from a sparse simulation of the space. The computational efficiency of sampling and inference enables new capabilities in design space exploration. We illustrate these capabilities using performance and power models for three studies of a 260,000 point design space: (1) Pareto frontier analysis, (2) pipeline depth analysis, and (3) multiprocessor heterogeneity analysis. For each study, we provide an assessment of predictive error and sensitivity of observed trends to such error. We construct Pareto frontiers and find predictions for Pareto optima are no less accurate than those for the broader design space. We reproduce and enhance prior pipeline depth studies, demonstrating constrained sensitivity studies may not generalize when many other design parameters are held at constant values. Lastly, we identify efficient heterogeneous core designs by clustering per benchmark optimal architectures. Collectively, these studies motivate the application of techniques in statistical inference for more effective use of modern simulator infrastructure.
Wonyoung Kim, Meeta Gupta, Gu Wei, and David Brooks. 2007. “Enabling on-chip switching regulators for multi-core processors using current staggering.” Proceedings of the Workshop on Architectural Support for Gigascale Integration.

Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in portable systems, but DVFS effectiveness suffers from the fact that voltage transitions occur on the order of tens of microseconds. Voltage regulators that are integrated on the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and improved power delivery. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics. In this paper, we discuss architectural support for on-chip regulator designs. Specifically, we show that in a chip-multiprocessor system, current staggering can be employed by restricting the simultaneous enabling/disabling of cores due to clock gating. We discuss tradeoffs between current staggering and regulator circuit design parameters, and we show that regulation efficiency of greater than 80% is possible for a variety of multi-threaded applications.

Benjamin Lee and David Brooks. 2007. “Statistical inference for efficient microarchitectural analysis.” SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, Pp. 130–es.

Microarchitectural design exploration is often inefficient and ad hoc due to computational costs of simulators. Trends toward multi-core, multi-threading lead to diversity in viable core designs, thereby requiring comprehensive design exploration while exponentially increasing design space size. Similarly, application performance topology is a function of input parameters, but models to optimize performance and/or predict scalability are increasingly difficult to derive analytically due to system complexity. We collect measurements sampled sparsely, uniformly at random from the space of interest and formulate non-linear regression models. We demonstrate the broad effectiveness of regression for predicting (1) the power and performance of a microarchitectural design space with median error rates of 5.5 to 7.5 percent using 1K samples from a 1B point space and (2) the performance of parallel applications, Semicoarsening Multigrid and High-Performance Linpack, with median error rates of 2.5 to 5.0 percent using 500 samples from more than 3K points.

Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu Wei, and David Brooks. 2007. “Ultra low power system for sensor network applications.” 32nd International Symposium on Computer Architecture (ISCA'05).
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.