Publications

2002
Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and Peter Cook. 1/2002. “Power-efficient issue queue design.” In Power aware computing, Pp. 35–58. Kluwer Academic Publishers.

Increasing levels of power dissipation threaten to limit the performance gains of future high-end, out-of-order issue microprocessors. Therefore, it is imperative that designers devise techniques that significantly reduce the power dissipation of the key hardware structures on the chip without unduly compromising performance. Such a key structure in out-of-order designs is the issue queue. Although crucial in achieving high performance, the issue queues are often a major contributor to the overall power consumption of the chip, potentially affecting both thermal issues related to hot spots and energy issues related to battery life. In this chapter, we present two techniques that significantly reduce issue queue power while maintaining high performance operation. First, we evaluate the power savings achieved by implementing a CAM/RAM structure for the issue queue as an alternative to the more power-hungry latch-based issue queue used in many designs. We then present the microarchitecture and circuit design of an adaptive issue queue that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. We compare two different dynamic adaptation algorithms that use issue queue utilization and parallelism metrics in order to size the issue queue on-the-fly during execution. Together, these two techniques provide over a 70% average reduction in issue queue power dissipation for a collection of the SPEC CPU2000 integer benchmarks, with only a 3% overall performance degradation.

Gu Wei, Mark Horowitz, and Jaeha Kim. 2002. “Energy-efficient design of high-speed links.” In Power Aware Design Methodologies, 8: Pp. 201–239. Springer, Boston, MA.

Techniques for reducing power consumption and bandwidth limitations of inter-chip communication have been getting more attention to improve the performance of modern digital systems. This chapter begins with a brief overview of high-speed link design and describes some of the power vs. performance trade-offs associated with various design choices. The chapter then investigates various techniques that a designer may employ to reduce power consumption. Three examples of link designs and link building blocks found in the literature present energy-efficient implementations of these techniques.

Pradip Bose, David Brooks, Viji Srinivasan, and Philip Emma. 2002. Power-Performance and Power Swing Characterization in Adaptive Microarchitectures. Technical Paper Archive - Research Reports. IBM Research.
In this paper, we present an analysis of some of the fundamental power-performance tradeoffs in processors that employ adaptive techniques to vary sizes, bandwidths, clock-gating modes and clock frequencies. Initial expectations are set using simple analytical reasoning models. Later, simulation-based data is presented in the context of a simple, low-power superscalar processor prototype (called LPX) that is currently under development as a test vehicle. There are three fundamental issues that we attempt to address in this paper: (a) Does dynamic adaptation - in clocking or microarchitectural resources - help extend the power-performance efficiency range of wider-issue superscalars? (b) What factors of power and power-density reductions are within practical reach in future adaptive processors? (c) Does the presence of dynamic adaptation modes cause unacceptably large, worst-case power (or current) swings in affected sub-units?
2001
David Brooks, Margaret Martonosi, John Wellman, and Pradip Bose. 12/2001. “Power-performance modeling and tradeoff analysis for a high end microprocessor.” Power-Aware Computer Systems, Pp. 126–136.
We describe a new power-performance modeling toolkit, developed to aid in the evaluation and definition of future power-efficient PowerPC™ processors. The base performance models in use in this project are: (a) a fast but cycle-accurate, parameterized research simulator and (b) a slower, pre-RTL reference model that models a specific high-end machine in full, latch-accurate detail. Energy characterizations are derived from real, circuit-level power simulation data. These are then combined to form higher-level energy models that are driven by microarchitecture-level parameters of interest. The overall methodology allows us to conduct power-performance tradeoff studies in defining the follow-on design points within a given product family. We present a few experimental results to illustrate the kinds of tradeoffs one can study using this tool.
David Brooks and Margaret Martonosi (advisor). 11/2001. “Design and modeling of power-efficient computer architectures.” Princeton University ProQuest Dissertations Publishing.

Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and compiler writers, in addition to circuit designers. Most traditional power analysis tools achieve high accuracy by calculating power estimates for designs only after the circuit design, layout, and floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities.

This thesis presents a methodology for estimating power dissipation at a much earlier stage in the design cycle and at a much higher level. Wattch and PowerTimer are two working examples of the use of this methodology. Both tools provide a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level. These tools are 1000X or more faster than existing layout-level power tools, and yet maintain accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. These tools can allow architects to explore and cull the design space early on and open up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.

This thesis also considers several applications of architectural-level power modeling to propose specific architectural-level power and temperature saving optimizations—value-based clock gating and dynamic thermal management. Value-based clock gating is a technique which exploits the dynamic bitwidth requirements of typical applications to save power within arithmetic units and the memory hierarchy. We have demonstrated that this technique can save roughly 50% of the power in the integer execution units. With dynamic thermal management, temperature sensors and throttling techniques are combined to adaptively slow down the CPU for extended periods of particularly high-power code sequences. This allows the CPU package and power delivery system to be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

The techniques presented in this thesis represent some of the first work in the area of high-performance, low-power processor design at the architectural level. We hope that this work, and the other research in the area of low-power architectural modeling and design, will help future generations of processor architectures to meet the many new challenges in this area.

Gu Wei. 11/2001. “Energy-efficient I/O interface design with adaptive power-supply regulation.”
This paper presents a low-power high-speed CMOS signaling interface that operates off of an adaptively regulated supply. A feedback loop adjusts the supply voltage on a chain of inverters until the delay through the chain is equal to half of the input period. This voltage is then distributed to the I/O subsystem through an efficient switching power-supply regulator. Dynamically scaling the supply with respect to frequency leads to a simple and robust design consisting mostly of digital CMOS gates, while enabling maximum energy efficiency. The interface utilizes high-impedance drivers for operation across a wide range of voltages and frequencies, a dual-loop delay-locked loop for accurate timing recovery, and an input receiver whose bandwidth tracks with the I/O frequency to filter out high-frequency noise. Test chips fabricated in a 0.35-μm CMOS technology achieve transfer rates of 0.2-1.0 Gb/s/pin with a regulated supply ranging from 1.3-3.2 V.
Russ Joseph, David Brooks, and Margaret Martonosi. 8/2001. “Live, runtime power measurements as a foundation for evaluating power/performance tradeoffs.” Workshop on Complexity Effective Design (WCED), held in conjunction with ISCA, 28.
Of the many ways one could gauge the complexity-effectiveness of a design or design element, one candidate approach is to consider a design's power/performance tradeoffs. This paper describes our early-stage results in a broad effort to evaluate the power-performance tradeoffs of a range of benchmarks and microarchitectures. In particular, this paper presents power data collected on-the-fly on real x86 machines as they execute carefully-constructed microbenchmarks. The microbenchmarks exercise aspects of the system such as data cache and branch predictor. They are parametrically-variable to consider how load dependence, cache miss rate, branch mispredict rate, and branch distance all impact the power and performance of a CPU. For example, from these experiments, we learn that CPU performance increases essentially monotonically with cache hit rate, while CPU power encounters a maximum at roughly 80-90% cache hit rates. Likewise, we show results demonstrating that performance-neutral issues such as bit populations in the data cache values can display interesting power trends. While the experimental results are preliminary, we feel that the techniques described in this paper will offer a useful foundation for a broad range of power/performance tradeoffs.
Alper Buyuktosunoglu, Stanley Schuster, David Brooks, Pradip Bose, Peter Cook, and David Albonesi. 6/11/2001. “An adaptive issue queue for reduced power at high performance.” Power-Aware Computer Systems, Pp. 25–39.

Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).

Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and Peter Cook. 3/2001. “A circuit level implementation of an adaptive issue queue for power-aware microprocessors.” In Proceedings of the 11th Great Lakes symposium on VLSI, Pp. 73–78. ACM.
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
David Brooks and Margaret Martonosi. 1/19/2001. “Dynamic thermal management for high-performance microprocessors.” In High-Performance Computer Architecture, 1/19/2001. HPCA. The Seventh International Symposium on, Pp. 171–182. IEEE.

With the increasing clock rate and transistor count of today's microprocessors, power dissipation is becoming a critical component of system design complexity. Thermal and power-delivery issues are becoming especially critical for high-performance computing systems. In this work, we investigate dynamic thermal management as a technique to control CPU power dissipation. With the increasing usage of clock gating techniques, the average power dissipation typically seen by common applications is becoming much less than the chip's rated maximum power dissipation. However, system designers still must design thermal heat sinks to withstand the worst-case scenario. We define and investigate the major components of any dynamic thermal management scheme. Specifically, we explore the tradeoffs between several mechanisms for responding to periods of thermal trauma and we consider the effects of hardware and software implementations. With appropriate dynamic thermal management, the CPU can be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

Pradip Bose, Margaret Martonosi, and David Brooks. 2001. “Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions.” Tutorial, ACM SIGMETRICS.
2000
Stefanos Sidiropoulos, Dean Liu, Jaeha Kim, Gu Wei, and Mark Horowitz. 2000. “Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers.” In 2000 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No. 00CH37103), Pp. 124–127. IEEE.
David Brooks and Margaret Martonosi. 2000. “Adaptive thermal management for high-performance microprocessors.” In Workshop on Complexity Effective Design. Citeseer.
David Brooks, Pradip Bose, Stanley Schuster, Hans Jacobson, Prabhakar Kudva, Alper Buyuktosunoglu, J Wellman, Victor Zyuban, Manish Gupta, and Peter Cook. 2000. “Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors.” Micro, IEEE, 20, 6, Pp. 26–44.
David Brooks and Margaret Martonosi. 2000. “Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance.” ACM Transactions on Computer Systems (TOCS), 18, 2, Pp. 89–126.
Gu Wei, Jaeha Kim, Dean Liu, Stefanos Sidiropoulos, and Mark A Horowitz. 2000. “A variable-frequency parallel I/O interface with adaptive power-supply regulation.” IEEE Journal of Solid-State Circuits, 35, 11, Pp. 1600–1610.
Margaret Martonosi, Vivek Tiwari, and David Brooks. 2000. “Wattch: A framework for architectural-level power analysis and optimizations.” ISCA, Pp. 83.
1999
Gu Wei and Mark Horowitz. 4/1999. “A fully digital, energy-efficient, adaptive power-supply regulator.” IEEE Journal of Solid-State Circuits, 34, 4, Pp. 520–528.

A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-μm process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.

David Brooks and Margaret Martonosi. 1/9/1999. “Dynamically exploiting narrow width operands to improve processor power and performance.” In High-Performance Computer Architecture, 1/9/1999. Proceedings. Fifth International Symposium On, Pp. 13–22. Orlando, FL, USA: IEEE.

In general-purpose microprocessors, recent trends have pushed towards 64 bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64 bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities less than the full word size). In particular, most major processor families have introduced "multimedia" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU. This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes hardware mechanisms that dynamically recognize and capitalize on these "narrow-bitwidth" instances. Both optimizations require little additional hardware, and neither requires compiler support. The first, power-oriented, optimization reduces processor power consumption by using aggressive clock gating to turn off portions of integer arithmetic units that will be unnecessary for narrow bitwidth operations. This optimization results in an over 50% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. The second optimization improves performance by merging together narrow integer operations and allowing them to share a single functional unit. Conceptually akin to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.

David Brooks and Margaret Martonosi. 1/1999. “Implementing application-specific cache-coherence protocols in configurable hardware.” In Network-Based Parallel Computing. Communication, Architecture, and Applications, Pp. 181–195. Springer.

Streamlining communication is key to achieving good performance in shared-memory parallel programs. While full hardware support for cache coherence generally offers the best performance, not all parallel machines provide it. Instead, software layers using Shared Virtual Memory (SVM) can be built to enforce coherence at a higher level. In prior work, researchers have studied application-specific cache coherence protocols implemented either in SVM systems or as handlers run by programmable protocol processors. Since the protocols are specialized to the needs of a single application, they can be particularly helpful in reducing the long latencies and processing overhead that sometimes degrade performance in SVM systems. This paper studies implementing application-specific protocols in hardware, but not via an instruction-based protocol processor as is typical. Instead, we consider configurable implementations based on Field-Programmable Gate Arrays (FPGAs). This approach can be faster than software-based techniques and less expensive than some hardware-based techniques. We study one application, appbt, in detail, including a VHDL-level design of the configurable protocol design. We sketch out approaches for other applications as well. Implementing protocol operations in configurable hardware improves communication performance by roughly 11X for a 32-node system. While overall speedups are a more modest 12%, our method is promising because of its flexibility and because it offers a new way of harnessing configurable hardware at the network interface, where it already exists or could be easily added to current systems.
