Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron, and Pradip Bose. 8/11/2004. “Understanding the energy efficiency of simultaneous multithreading.” In Proceedings of the 2004 international symposium on Low power electronics and design, Pp. 44–49. Newport Beach, CA, USA: ACM.
Simultaneous multithreading (SMT) has proven to be an effective method of increasing the performance of microprocessors by extracting additional instruction-level parallelism from multiple threads. In current microprocessor designs, power-efficiency is of critical importance, and we present modeling extensions to an architectural simulator to allow us to study the power-performance efficiency of SMT. After a thorough design space exploration we find that SMT can provide a performance speedup of nearly 20% for a wide range of applications with a power overhead of roughly 24%. Thus, SMT can provide a substantial benefit for energy-efficiency metrics such as ED². We also explore the underlying reasons for the power uplift, analyze the impact of leakage-sensitive process technologies, and discuss our model validation strategy.
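As a rough illustration of the energy-delay-squared claim in the abstract above (not taken from the paper itself), the sketch below plugs the quoted figures into the usual definition of the metric, energy × delay². It assumes that the 20% speedup and the 24% power overhead apply to the same workload and that energy is average power multiplied by execution time; the function name `ed2_ratio` is hypothetical.

```python
def ed2_ratio(speedup: float, power_ratio: float) -> float:
    """Return the energy * delay^2 product relative to a baseline (lower is better).

    Assumption: energy = average power * delay, so the metric reduces to
    power * delay^3 when expressed as ratios against the baseline.
    """
    delay_ratio = 1.0 / speedup                # relative execution time
    energy_ratio = power_ratio * delay_ratio   # relative energy per run
    return energy_ratio * delay_ratio ** 2


if __name__ == "__main__":
    # Figures quoted in the abstract: ~20% speedup, ~24% power overhead.
    ratio = ed2_ratio(speedup=1.20, power_ratio=1.24)
    print(f"relative energy-delay-squared: {ratio:.3f}")
```

Under these assumptions the ratio comes out at roughly 0.72, which is consistent with the abstract's statement that SMT helps such metrics even though total power rises.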
Kim Hazelwood and David Brooks. 8/2004. “Eliminating voltage emergencies via microarchitectural voltage control feedback and dynamic optimization.” In Proceedings of the 2004 international symposium on Low power electronics and design, Pp. 326–331. ACM.

Microprocessor designers use techniques such as clock gating to reduce power dissipation. An unfortunate side-effect of these techniques is the processor current fluctuations that stress the power-delivery network. Recent research has focused on hardware-only mechanisms to detect and eliminate these fluctuations. While the solutions have been effective at avoiding operating-range violations, they have done so at a performance penalty to the executing program.

Compilers are well equipped to rearrange instructions such that current fluctuations are less dramatic, with minimal performance implications. Furthermore, a dynamic optimizer can eliminate the problem at run time, avoiding the difficult task of statically predicting voltage emergencies.

This paper proposes complementing existing hardware solutions with additional run-time software to address problematic code sequences that cause recurring voltage swings. Our proposal extends existing hardware techniques to additionally provide feedback to a dynamic optimizer, which can provide a long-term solution, often without impacting the performance of the executing application.

We found that recurring voltage fluctuations do exist in the SPEC2000 benchmarks, and that given very little information from the hardware, a dynamic optimizer can locate and correct many of the recurring voltage emergencies.

David Brooks and Joerg Henkel. 8/2004. “High level power modeling and analysis.” International Symposium on Low Power Electronics and Design: Proceedings of the 2004 international symposium on Low power electronics and design.
Victor Zyuban, David Brooks, Viji Srinivasan, Michael Gschwind, Pradip Bose, Philip Strenski, and Philip Emma. 8/2004. “Integrated analysis of power and performance for pipelined microprocessors.” Computers, IEEE Transactions on, 53, 8, Pp. 1004–1016.
Choosing the pipeline depth of a microprocessor is one of the most critical design decisions that an architect must make in the concept phase of a microprocessor design. To be successful in today’s cost/performance marketplace, modern CPU designs must effectively balance both performance and power dissipation. The choice of pipeline depth and target clock frequency has a critical impact on both of these metrics. In this paper, we describe an optimization methodology based on both analytical models and detailed simulations for power and performance as a function of pipeline depth. Our results for a set of SPEC2000 applications show that, when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of our energy models. Finally, we discuss the potential risks in design quality for overly aggressive or conservative choices of pipeline depth.
Ruwan Ratnayake, Gu-Yeon Wei, and Aleksandar Kavcic. 6/2004. “Pipelined parallel architecture for high throughput MAP detectors.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 2: Pp. II–505. IEEE.
A maximum a posteriori probability (MAP) detector based on a forward only algorithm with high throughput is considered in this paper. MAP gives the optimal performance and, with Turbo decoding, can achieve performance close to the channel capacity limits. Deep pipelined architecture for the forward only method is presented and compared with the other throughput-increasing methods. Simulation results based on the iterative MAP-LDPC (low-density parity check) system are shown. Hardware implementation issues that exploit the regularities of the structure are also discussed.
Hanumolu Kumar, Casper Bryan, Mooney Randy, Gu Wei, and Moon Ku. 5/26/2004. “Jitter in high-speed serial and parallel links.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 4: Pp. IV–425. Vancouver, BC, Canada: IEEE.
Jitter degrades the performance of both high-speed serial and parallel I/O links by limiting the maximum achievable data-rates. We present analytical expressions to evaluate the effect of jitter on the performance of high-speed links. These expressions enable simple calculation of worst-case voltage and timing margins in the presence of jitter. This analysis is also extended to equalized links. Finally, we show that the limited bandwidth of the channel can amplify high frequency jitter and present means to counteract jitter amplification.
Yong Cheol Bae and Gu-Yeon Wei. 5/23/2004. “A mixed PLL/DLL architecture for low jitter clock generation.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 4: Pp. IV–788. Vancouver, BC, Canada: IEEE.
This paper presents a mixed PLL/DLL architecture for low-jitter clock generation that merges phase-locked loop (PLL) and delay-locked loop (DLL) characteristics. It relies on an interpolator to configure the loop to operate more like a PLL or more like a DLL depending on the interpolator's settings. The ability to vary interpolator settings enables wide range control of the clock generator's loop bandwidth. Therefore, the loop bandwidth can readily be adjusted to accommodate different noise conditions. A discrete-time Z-domain analysis is provided to illustrate the noise filtering characteristics of the loop in the presence of various noise sources and highlight the potential advantages of the mixed PLL/DLL architecture. Simulation results verify stable operation of the loop designed for a 0.18 µm CMOS process.
David Brooks, Pradip Bose, and Margaret Martonosi. 3/2004. “Power-performance simulation: design and validation strategies.” ACM SIGMETRICS Performance Evaluation Review, 31, 4, Pp. 13–18.

Microprocessor research and development increasingly relies on detailed simulations to make design choices. As such, the structure, speed, and accuracy of microarchitectural simulators is of critical importance to the field. This paper describes our experiences in building two simulators, using related but distinct approaches. One of the most important attributes of a simulator is its ability to accurately convey design trends as different aspects of the microarchitecture are varied. In this work, we break down accuracy, a broad term, into two sub-types: relative and absolute accuracy. We then discuss typical abstraction errors in power-performance simulators and show when they do (or do not) affect the design rule choices a user of those simulators might make. By performing this validation study using the Wattch and PowerTimer simulators, the work addresses validation issues both broadly and in the specific case of a fairly widely-used simulator.

Fulford RF, Gu Wei, and Matt Welsh. 2/2004. “A portable, low-power, wireless two-lead EKG system.” In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1: Pp. 2141–2144. IEEE.

Sensor devices ("motes") which integrate an embedded microprocessor, low-power radio and a limited amount of storage have the potential to significantly enhance the provision of emergency medical care. Wearable vital sign sensors can wirelessly monitor patient condition, alerting healthcare providers to changes in status while simultaneously delivering data to a backend archival system for longer-term storage. As part of the CodeBlue initiative at Harvard University, we previously developed a mote-based pulse oximetry module which gathers data from a noninvasive finger sensor and transmits it wirelessly to a base station. To expand the capabilities of the mote for healthcare applications we now introduce EKG on Mica2, the first custom-designed electrocardiograph sensor board to interface with this platform. We additionally present VitalEKG, a collection of software components which allow the capture and wireless transmission of heart activity traces. We present preliminary test results which validate our approach and suggest the feasibility of future enhancements.

Yingmin Li, K Skadron, Z Hu, and David Brooks. 2004. “Evaluating the thermal efficiency of SMT and CMP architectures.” IBM TJ Watson Conference on Interaction between Architecture, Circuits, and Compilers.
Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to evaluate the thermal efficiency for a Power4/Power5-like core. Our results show that although SMT and CMP exhibit similar peak operating temperatures, the mechanisms by which they heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures such as the register file, due to increased utilization. On the other hand, CMP heating is mainly caused by the global impact of increased energy output, due to the extra energy of an added core. Because of this difference in heat-up mechanism, we found that the best thermal management technique is also different for SMT and CMP. Finally, we show that CMP and SMT will scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core's higher temperature and the exponential temperature-dependence of subthreshold leakage.
David Brooks. 2004. “Integrated Architectural Level Power-Performance Modeling Toolkit”.
We are currently developing a robust, integrated infrastructure for studying power-performance issues across a range of systems. By leveraging a common ISA and shared simulation infrastructure, we will be able to perform apples-to-apples comparisons between processors intended for specific design spaces. For example, recently there has been significant attention brought to the idea of reusing microprocessor cores in multiple design spaces. In particular, there has been much interest in exploring the possibility of using multiple low-power, embedded processors in blade systems or SMP-on-a-chip designs for server workloads. There has also been interest in taking server-class microprocessors and bringing them into use in lower-end systems. For example, the processor core of the original POWER4 microprocessor has recently been introduced as the PowerPC970 -- a 64-bit microprocessor for use in blade servers and desktop (and potentially laptop) systems. We utilize the MET/Turandot toolkit originally developed at IBM TJ Watson Research Center as the underlying PowerPC microarchitecture performance simulator [3]. Turandot is flexible enough to model a broad range of microarchitectures and has undergone extensive validation [3]. In addition, Turandot has been augmented with power models to explore power-performance tradeoffs in an internal IBM tool called PowerTimer [4]. Turandot is freely available to the research community through licensing arrangements with IBM, and we are currently working with IBM to develop an external, public release of PowerTimer.
Gu Wei, Stonick T, Weinlader Dan, Sonntag Jeff, and Searles Shawn. 12/13/2003. “A 500MHz MP/DLL clock generator for a 5Gb/s backplane transceiver in 0.25 µm CMOS.” In 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC., Pp. 464–465. IEEE.
Low-jitter clock generation is a critical component for enabling robust high-speed operation of 5Gb/s backplane transceivers. The implementation of a 500MHz clock synthesizer that operates either as a multiplying phase-locked loop (MPLL) or a multiplying delay-locked loop (MDLL) is described. The choice depends on the noise characteristics of the input clock source. This MP/DLL design is implemented in a 0.25 µm CMOS process and operates with a 2.5V supply.
David Brooks, Pradip Bose, Vijayalakshmi Srinivasan, Michael Gschwind, Philip Emma, and Michael Rosenfield. 9/2003. “New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors.” IBM Journal of Research and Development, 47, 5.6, Pp. 653–670.
The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. The key component of the toolset is a parameterized set of energy functions that can be used in conjunction with any given cycle-accurate microarchitectural simulator. The energy functions model the power consumption of primitive and hierarchically composed building blocks which are used in microarchitecture-level performance models. Examples of structures modeled are pipeline stage latches, queues, buffers and component read/write multiplexers, local clock buffers, register files, and cache array macros. The energy functions can be derived using purely analytical equations that are driven by organizational, circuit, and technology parameters or behavioral equations that are derived from empirical, circuit-level simulation experiments. After describing the modeling methodology, we present analysis results in the context of a current-generation superscalar processor simulator to illustrate the use and effectiveness of such early-stage models. In addition to average power and performance tradeoff analysis, PowerTimer is useful in assessing the typical and worst-case power (or current) swings that occur between successive cycle windows in a given workload execution. Such a characterization of workloads at the early stage of microarchitecture definition helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.
Stonick T, Gu Wei, Sonntag L, and Weinlader K. 5/15/2003. “An adaptive PAM-4 5-Gb/s backplane transceiver in 0.25-µm CMOS.” IEEE Journal of Solid-State Circuits, 38, 3, Pp. 436–443.
This paper describes a novel backplane transceiver, which uses PAM-4 (pulse amplitude modulated four level) signalling and continuously adaptive transmit based equalization to move 5 Gcb/s (channel bits per second) across typical FR-4 backplanes for total distances of up to 50 inches through two sets of backplane connectors. The paper focuses on the implementation of the equalizer and the adaptation algorithms, and includes measured results. The 17 mm² device is implemented in a 0.25 µm CMOS process, operates on 2.5 V and 3.3 V supplies and consumes 1.2 W.
P Bose, David Brooks, A Buyuktosunoglu, P. Cook, K Das, P Emma, M Gschwind, H Jacobson, T Karkhanis, and P Kudva. 4/1/2003. “Early-stage definition of LPX: A low power issue-execute processor.” Power-Aware Computer Systems, Pp. 89–92.

We present the high-level microarchitecture of LPX: a low-power issue-execute processor prototype that is being designed by a joint industry-academia research team. LPX implements a very small subset of a RISC architecture, with a primary focus on a vector (SIMD) multimedia extension. The objective of this project is to validate some key new ideas in power-aware microarchitecture techniques, supported by recent advances in circuit design and clocking.

Russ Joseph, David Brooks, and Margaret Martonosi. 2/12/2003. “Control techniques to eliminate voltage emergencies in high performance processors.” In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on, Pp. 79–90. Anaheim, CA, USA: IEEE.
Increasing focus on power dissipation issues in current microprocessors has led to a host of proposals for clock gating and other power-saving techniques. While generally effective at reducing average power, many of these techniques have the undesired side-effect of increasing both the variability of power dissipation and the variability of current drawn by the processor. This increase in current variability, often referred to as the dI/dt problem, can cause supply voltage fluctuations. Such voltage fluctuations lead to unreliable circuits if not addressed, and increasingly expensive chip packaging techniques are needed to mitigate them. This paper proposes and evaluates a methodology for augmenting packaging techniques for dI/dt with microarchitectural control mechanisms. We discuss the resonant frequencies most relevant to current microprocessor packages, produce and evaluate a "dI/dt stressmark" that exercises the system at its resonant frequency, and characterize the behavior of more mainstream applications. Based on these results plus evaluations of the impact of controller error and delay, our microarchitectural control proposals offer bounds on supply voltage fluctuations, with nearly negligible impact on performance and energy. With the ITRS roadmap predicting aggressive drops in supply voltage and power supply impedances in coming chip generations, novel voltage control techniques will be required to stay on track. Our microarchitectural dI/dt controllers represent a step in this direction.
Hanumolu Kumar, Bryan Casper, Mooney Randy, Gu Wei, and Moon Ku. 2003. “Analysis of PLL clock jitter in high-speed serial links.” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 50, 11, Pp. 879–886.
Jaeha Kim, A Horowitz, and Gu Wei. 2003. “Design of CMOS adaptive-bandwidth PLL/DLLs: A general approach.” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 50, 11, Pp. 860–869.
Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip Strenski, and Philip Emma. 11/18/2002. “Optimizing pipelines for power and performance.” In Microarchitecture, 2002. (MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on, Pp. 333–344. IEEE.
During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
M Martonosi, David Brooks, and V Tiwari. 2/2002. “Architecture-level power modeling with Wattch.” Computer, 35, 2, Pp. 64–64.