Publications by Year: 2008

Benjamin Lee, Jamison Collins, Hong Wang, and David Brooks. 11/8/2008. “CPR: Composable performance regression for scalable multiprocessor models.” In 2008 41st IEEE/ACM International Symposium on Microarchitecture, Pp. 270–281. IEEE.
Uniprocessor simulators track resource utilization cycle by cycle to estimate performance. Multiprocessor simulators, however, must account for synchronization events that increase the cost of every cycle simulated and shared resource contention that increases the total number of cycles simulated. These effects cause multiprocessor simulation times to scale superlinearly with the number of cores. Composable performance regression (CPR) fundamentally addresses these intractable multiprocessor simulation times, estimating multiprocessor performance with a combination of uniprocessor, contention, and penalty models. The uniprocessor model predicts baseline performance of each core while the contention models predict interfering accesses from other cores. Uniprocessor and contention model outputs are composed by a penalty model to produce the final multiprocessor performance estimate. Trained with a production quality simulator, CPR is accurate, with median errors of 6.63 and 4.83 percent for dual- and quad-core multiprocessors, respectively. Furthermore, composable regression is scalable, requiring 0.33x the simulations required by prior regression strategies.
Benjamin Lee and David Brooks. 10/24/2008. “Roughness of microarchitectural design topologies and its implications for optimization.” In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Pp. 240–251. IEEE.
Recent advances in statistical inference and machine learning close the divide between simulation and classical optimization, thereby enabling more rigorous and robust microarchitectural studies. To most effectively utilize these now computationally tractable techniques, we characterize design topology roughness and leverage this characterization to guide our usage of analysis and optimization methods. In particular, we compute roughness metrics that require high-order derivatives and multi-dimensional integrals of design metrics, such as performance and power. These roughness metrics exhibit noteworthy correlations (1) against regression model error, (2) against non-linearities and non-monotonicities of contour maps, and (3) against the effectiveness of optimization heuristics such as gradient ascent. Thus, this work quantifies the implications of design topology roughness for commonly used methods and practices in microarchitectural analysis. 
Chung Hayun, Liu Andrew, and Gu-Yeon Wei. 9/21/2008. “A 12.5-Gbps, 7-bit transmit DAC with 4-tap LUT-based equalization in 0.13 μm CMOS.” In 2008 IEEE Custom Integrated Circuits Conference, Pp. 563–566. IEEE.
This paper presents a 12.5-Gbps transmitter that uses a lookup table (LUT)-based equalizer to compensate for within-die imperfections. An equalization technique with 2x sampling is proposed to accommodate timing offsets in the multiphase clocks used for 8:1 serialization. LUT code remapping is also demonstrated to compensate for mismatch effects that introduce nonlinearity in the transmit DAC. Experimental results of a 7-bit resolution transmitter with 4-tap equalization, implemented in 0.13 μm CMOS, show the LUT-based equalizer can significantly improve the signal integrity of an otherwise closed eye for data transmitted at 12.5 Gbps.
Ankur Agrawal, Pavan Kumar Hanumolu, and Gu-Yeon Wei. 9/21/2008. “A 8 x 5 Gb/s source-synchronous receiver with clock generator phase error correction.” In 2008 IEEE Custom Integrated Circuits Conference, Pp. 459–462. IEEE.
This paper describes the design and implementation of an 8×5 Gb/s source-synchronous receiver in a 0.13 μm CMOS technology. The receiver employs a cascaded-DLL architecture that avoids filtering of the jitter on the received clock to enhance jitter tolerance bandwidth. A technique is proposed to correct phase spacing mismatch in DLLs that reduces the error standard deviations by more than 40% and improves receiver timing margins.
Ruwan Ratnayake, Aleksandar Kavcic, and Gu-Yeon Wei. 9/16/2008. “A high-throughput maximum a posteriori probability detector.” IEEE Journal of Solid-State Circuits, 43, 8, Pp. 1846–1858.
This paper presents a maximum a posteriori probability (MAP) detector, based on a forward-only algorithm, that can achieve high throughputs. The MAP algorithm is optimal in terms of bit error rate (BER) performance and, with Turbo decoding, can approach performance close to the channel capacity limit. The proposed detector utilizes a deep-pipelined architecture implemented in skew-tolerant domino, and experimentally measured results verify the detector can achieve throughputs greater than 750 MHz while consuming 2.4 W. The detector is implemented in a 0.13 μm CMOS technology and has a die area of 9.9 mm².
Xuning Chen, Gu-Yeon Wei, and Peh Shiuan. 8/11/2008. “Design of low-power short-distance opto-electronic transceiver front-ends with scalable supply voltages and frequencies.” In Proceedings of the 2008 international symposium on Low Power Electronics & Design, Pp. 277–282.
The need for low-power I/Os is widely recognized, as I/Os take up a significant portion of total chip power. In recent years, researchers have pointed to the potential system-level power savings that can be realized if dynamic voltage scalable I/Os are available. However, substantial challenges remain in building such links. This paper presents the design and implementation details of opto-electronic transceiver front-end blocks where supply voltage can scale from 1.2V to 0.6V with almost linearly scalable bandwidth from 8Gb/s to 4Gb/s, and power consumption from 36mW to 5mW in a 130nm CMOS process. To the best of our knowledge, this is the first circuit demonstration of voltage-scalable optical links. It demonstrates the feasibility of dynamic voltage scalable optical I/Os.
Gu-Yeon Wei, David Brooks, Ali Khan, and Xiaoyao Liang. 8/11/2008. “Instruction-driven clock scheduling with glitch mitigation.” In Proceeding of the 13th international symposium on Low power electronics and design (ISLPED '08), Pp. 357–362. ACM.
Instruction-driven clock scheduling is a mechanism that minimizes clock power in deeply-pipelined datapaths. Analysis of realistic processor workloads shows a preponderance of bubbles persist through pipelines like the floating point unit. Clock scheduling ostensibly adapts pipeline depth with respect to bubbles in the instruction stream without performance loss. Unfortunately, shallower pipelines (i.e. longer pipe stages) are prone to larger amounts of glitches propagating through logic, increasing dynamic power. Experimentally measured results from a 130 nm FPU test chip with flexible clocking capabilities show a super-linear increase in glitch-induced dynamic power for shallower pipelines. While higher glitch power can severely diminish the power savings offered by clock scheduling, judicious clocking of intermediate stages offers glitch mitigation to recover power savings for worst-case scenarios. Detailed analysis of clock scheduling applied to a FPU in a POWER4-like processor running realistic workloads shows an average net power savings of 15% compared to an aggressively clock-gated design.
Xiaoyao Liang, Gu-Yeon Wei, and David Brooks. 6/21/2008. “Revival: A variation-tolerant architecture using voltage interpolation and variable latency.” In Computer Architecture, 2008. ISCA'08. 35th International Symposium on, Pp. 191–202. IEEE.
Process variations are poised to significantly degrade performance benefits sought by moving to the next nanoscale technology node. Parameter fluctuations in devices can introduce large variations in peak operation among chips, among cores on a single chip, and among microarchitectural blocks within one core. Hence, it will be difficult to rely only on traditional frequency binning to efficiently cover the large variations that are expected. Furthermore, multiple voltage/frequency domains introduce significant hardware overhead and alone cannot address the full extent of delay variations expected in future multi-core systems. In this paper, we present ReVIVaL, which combines two fine-grained post-fabrication tuning techniques: voltage interpolation (VI) and variable latency (VL). We show that the frequency variation between chips, between cores on one chip, and between functional units within cores can be reduced to a very small range. The effectiveness of these techniques is further verified through experiments on test chips fabricated in a 130 nm CMOS process. Detailed architectural simulations of multi-core processors demonstrate that significant performance and power advantages are possible by combining variable latency with voltage interpolation.
Michael Karpelson, Gu-Yeon Wei, and Wood J. 5/19/2008. “A review of actuation and power electronics options for flapping-wing robotic insects.” In 2008 IEEE international conference on robotics and automation, Pp. 779–786. IEEE.
Flapping-wing robotic insects require actuators with high power densities at centimeter to micrometer scales. Due to the low weight budget, the selection and design of the actuation mechanism needs to be considered in parallel with the design of the power electronics required to drive it. This paper explores the design space of flapping-wing microrobots weighing 1g and under by determining mechanical requirements for the actuation mechanism, analyzing potential actuation technologies, and discussing the design and realization of the required power electronics. Promising combinations of actuators and power circuits are identified and used to estimate microrobot performance.
Mark Hempstead, Gu-Yeon Wei, and David Brooks. 5/18/2008. “System design considerations for sensor network applications.” In 2008 IEEE International Symposium on Circuits and Systems (ISCAS), Pp. 2566–2569. Seattle, WA: IEEE.
Systems research in the emerging space of wireless sensor networks has exploded. Researchers have deployed nodes composed of a wireless radio, MEMS sensors and low power computation for applications from medical sensing to volcanic monitoring. We must consider several requirements - including the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements - when designing devices for wireless sensor networks. An untethered, fully-integrated node that operates off of energy scavenged from the ambient environment is the ultimate goal. We take an application-driven approach to the design of a wireless sensor network node. Our approach addresses the event-driven nature that is characteristic of many sensor network workloads. We have completed a detailed architectural analysis of this space using a full-system simulator and RTL model. From this analysis, we chose to implement a design that best achieves the power goals and performance requirements of wireless sensor network applications.
Helal M, Z Straayer, Gu-Yeon Wei, and Perroth H. 4/2008. “A highly digital MDLL-based clock multiplier that leverages a self-scrambling time-to-digital converter to achieve subpicosecond jitter performance.” IEEE Journal of Solid-State Circuits, 43, 4, Pp. 855–863.
This paper presents a mostly digital multiplying delay-locked loop (MDLL) architecture that leverages a new time-to-digital converter (TDC) and a correlated double-sampling technique to achieve subpicosecond jitter performance. The key benefit of the proposed structure is that it provides a highly digital technique to reduce deterministic jitter in the MDLL output with low sensitivity to mismatch and offset in the associated tuning circuits. The TDC structure, which is based on a gated ring oscillator (GRO), is expected to benefit other PLL/DLL applications as well because it scrambles and first-order noise shapes its associated quantization noise. Measured results are presented of a custom MDLL prototype that multiplies a 50 MHz reference frequency to 1.6 GHz with 928 fs rms jitter performance. The prototype consists of two 0.13 μm integrated circuits, which have a combined active area of 0.06 mm² and a combined core power of 5.1 mW, in addition to an FPGA board, a discrete DAC, and a simple RC filter.
Simone Campanoni, Giovanni Agosta, and Stefano Reghizzi. 4/2008. “A parallel dynamic compiler for CIL bytecode.” In ACM Sigplan Notices, 4th ed., 43: Pp. 11-20. ACM.

Multi-core technology is being employed in most recent high-performance architectures. Such architectures need specifically designed multi-threaded software to exploit the full potential of their hardware parallelism.

At the same time, object code virtualization technologies are achieving a growing popularity, as they allow higher levels of software portability and reuse.

Thus, a virtual execution environment running on a multi-core processor has to run complex, high-level applications and exploit the underlying parallel hardware as much as possible. We propose an approach that leverages CMP features to expose a novel pipeline synchronization model for the internal threads of the dynamic compiler.

Thanks to the compilation-latency masking effect of the pipeline organization, our dynamic compiler, ILDJIT, is able to achieve significant speedups (26% on average) with respect to the baseline when the underlying hardware exposes at least two cores.

Mark Hempstead, Michael Lyons, David Brooks, and Gu-Yeon Wei. 4/2008. “Survey of hardware systems for wireless sensor networks.” Journal of Low Power Electronics, 4, 1, Pp. 11–20.
Wireless sensor networks have been gaining interest as a platform that changes how we interact with the physical world. Applications in medicine, military, inventory management, structural and environmental monitoring, and the like can benefit from low-power wireless nodes that communicate data collected via a variety of sensors. Current deployments of wireless sensor networks (WSN) rely on off-the-shelf commodity-based microcontrollers, but the unoptimized energy consumption of these systems can limit the effective lifetimes. Ideally, researchers would like to deeply embed wireless sensor network nodes in the physical world, relying on energy scavenged from the ambient environment. This paper provides a survey of ultra low power processors specifically designed for WSN applications that have begun to emerge from research labs, which require detailed understanding of tradeoffs between application space, architecture, and circuit techniques to implement these low-power systems.
Benjamin Lee and David Brooks. 3/2008. “Efficiency trends and limits from comprehensive microarchitectural adaptivity.” In ACM SIGARCH Computer Architecture News, 3rd ed., 43: Pp. 36–47. ACM.

Increasing demand for power-efficient, high-performance computing requires tuning applications and/or the underlying hardware to improve the mapping between workload heterogeneity and computational resources. To assess the potential benefits of hardware tuning, we propose a framework that leverages synergistic interactions between recent advances in (a) sampling, (b) predictive modeling, and (c) optimization heuristics. This framework enables qualitatively new capabilities in analyzing the performance and power characteristics of adaptive microarchitectures. For the first time, we are able to simultaneously consider high temporal and comprehensive spatial adaptivity. In particular, we optimize efficiency for many short adaptive intervals and identify the best configuration of 15 parameters, which define a space of 240B points.

With frequent sub-application reconfiguration and a fully reconfigurable hardware substrate, adaptive microarchitectures achieve bips³/w efficiency gains of up to 5.3x (median 2.4x) relative to their static counterparts already optimized for a given application. This 5.3x efficiency gain is derived from a 1.6x performance gain and 0.8x power reduction. Although several applications achieve a significant fraction of their potential efficiency with as few as three adaptive parameters, the three most significant parameters differ across applications. These differences motivate a hardware substrate capable of comprehensive adaptivity to meet these diverse application requirements.

Kevin Brownell, Gu-Yeon Wei, and David Brooks. 3/2008. “Evaluation of voltage interpolation to address process variations.” In Proceedings of the 2008 IEEE/ACM International Conference on Computer-Aided Design, Pp. 529–536. IEEE Press.

Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. The paper discusses design tradeoffs between circuit tuning range and static power overheads that can be performed within the synthesis flow of the design process. The paper explores the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks and shows that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. The analysis shows that the scheme can match the nominal delay target with a 10% power cost, or for the same power budget, incur only a 9% delay overhead after variations.

Meeta Gupta, Krishna Rangan, Michael Smith, Gu-Yeon Wei, and David Brooks. 2/16/2008. “DeCoR: A delayed commit and rollback mechanism for handling inductive noise in processors.” In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Pp. 381–392. IEEE.
Increases in peak current draw and reductions in the operating voltage of processors stress the importance of dealing with voltage fluctuations in processors. Noise-margin violations lead to undesired effects, like timing violations, which may result in incorrect execution of applications. Several recent architectural solutions for inductive noise have been proposed that, unfortunately, have a strong correlation to the underlying power-delivery package model and require a feedback loop that is largely constrained by the voltage/current sensor characteristics. The resulting solutions are not robust across a wide range of microprocessor designs and packaging technologies. This paper proposes a Delayed-commit and rollback scheme (DeCoR) that guarantees correctness, insensitive to the package model or the responsiveness of the voltage sensors. In particular, our approach recovers from, rather than attempting to avoid, voltage emergencies. This approach incurs a small performance penalty when compared to an ideal machine that does not have voltage emergencies. We show that explicit checkpoint-recovery schemes, intended to handle infrequent events, e.g., radiation-induced soft errors, suffer from large performance overheads for frequently-occurring voltage emergencies. DeCoR requires very few modifications to modern processor designs, as it leverages the existing store queue and reorder buffers. Unlike conventional designs that conservatively protect all components of the processor from inductive noise with overly-large timing margins, our approach only requires conservative protection of the architected register state and cache write paths.
Wonyoung Kim, Meeta Gupta, Gu-Yeon Wei, and David Brooks. 2/16/2008. “System level analysis of fast, per-core DVFS using on-chip switching regulators.” In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Pp. 123–134. Salt Lake City, UT, USA: IEEE.
Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur on the order of tens of microseconds. In addition, the recent trend towards chip-multiprocessors (CMP) executing multi-threaded workloads with heterogeneous behavior motivates the need for per-core DVFS control mechanisms. Voltage regulators that are integrated onto the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and per-core voltage control. We show that these characteristics provide significant energy-saving opportunities compared to traditional off-chip regulators. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics, which are significantly impacted by the system-level application of the regulator. In this paper, we describe and model these costs, and perform a comprehensive analysis of a CMP system with on-chip integrated regulators. We conclude that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.
Xiaoyao Liang, David Brooks, and Gu-Yeon Wei. 2/3/2008. “A process-variation-tolerant floating-point unit with voltage interpolation and variable latency.” In 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, Pp. 404–623. San Francisco, CA, USA: IEEE.
This paper explores two fine-grained, post-fabrication circuit-tuning techniques to combat process variation for pipelined logic components: voltage interpolation and variable latency. These techniques are applied to a single-precision floating-point unit (FPU) designed using a standard CAD synthesis flow in a 0.13 μm CMOS logic process with 8 metal layers. Measured results from fabricated chips show that both techniques provide wide frequency tuning range to deal with frequency fluctuations arising from process variations with minimal power overhead, and in some configurations, power savings.
Hanumolu Kumar, Kratyuk Volodymyr, Gu-Yeon Wei, and Moon Un-Ku. 2/2008. “A sub-picosecond resolution 0.5–1.5 GHz digital-to-phase converter.” IEEE Journal of Solid-State Circuits, 43, 2, Pp. 414–424.
A digital-to-phase converter (DPC) is an essential building block in applications such as source-synchronous interfaces and digital phase modulators. The resolution of DPCs using analog phase interpolators is severely affected by the operating frequency and rise times of the interpolator inputs. In this paper, we present a new DPC architecture that achieves high resolution independent of both the operating frequency and the rise time. The 8 phases generated by a phase-locked loop are dithered using a delta-sigma modulator to shape the truncation error to high frequency and are subsequently filtered using a delay-locked loop phase filter. The test chip, fabricated in a 0.13 μm CMOS process, operates from 0.5–1.5 GHz and achieves a differential nonlinearity of less than ±0.1 ps and an integral nonlinearity of ±12 ps. The total power consumption while operating at 1 GHz is 15 mW.
Hanumolu Kumar, Gu-Yeon Wei, and Moon Ku. 2/2008. “A wide-tracking range clock and data recovery circuit.” IEEE Journal of Solid-State Circuits, 43, 2, Pp. 425–439.
A hybrid analog-digital quarter-rate clock and data recovery circuit (CDR) that achieves a wide-tracking range and excellent frequency and phase tracking resolution is presented in this paper. A split-tuned analog phase-locked loop (PLL) provides eight equally spaced phases needed for quarter-rate data recovery and the digital CDR loop adjusts the phase of the PLL output clocks in a precise manner to facilitate plesiochronous clocking. The CDR employs a second-order digital loop filter and combines delta-sigma modulation with the analog PLL to achieve sub-picosecond phase resolution and better than 2 ppm frequency resolution. A test chip fabricated in a 0.18 μm CMOS process achieves BER < 10⁻¹² and consumes 14 mW power while operating at 2 Gb/s. The tracking range is greater than ±5000 ppm and ±2500 ppm at 10 kHz and 20 kHz modulation frequencies, respectively, making this CDR suitable for systems employing spread-spectrum clocking.