Publications

1999
Gu-Yeon Wei and Mark Horowitz. 4/1999. “A fully digital, energy-efficient, adaptive power-supply regulator.” IEEE Journal of Solid-State Circuits, 34, 4, Pp. 520–528.

A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-µm process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.
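
A rough way to picture the control loop described above: a digital controller compares a ring oscillator's frequency against the desired frequency and steps the supply voltage down while there is slack and up when there is not. The Python sketch below assumes an alpha-power-law delay model, made-up device parameters, and a simple up/down controller; `ring_osc_freq` and `regulate` are hypothetical names, and none of this is the circuit reported in the paper.

```python
# Hypothetical device model (assumed parameters, not taken from the paper).
V_TH = 0.7    # threshold voltage in volts
ALPHA = 1.5   # alpha-power-law exponent
K = 1.0e8     # fitting constant in Hz


def ring_osc_freq(vdd: float) -> float:
    """Modeled ring-oscillator frequency in Hz at supply voltage vdd.

    Gate delay ~ V / (V - Vth)^alpha, so frequency ~ (V - Vth)^alpha / V.
    """
    if vdd <= V_TH:
        return 0.0
    return K * (vdd - V_TH) ** ALPHA / vdd


def regulate(f_target: float, v_start: float = 3.3, v_step: float = 0.01,
             v_min: float = 0.8, v_max: float = 5.0,
             iterations: int = 1000) -> float:
    """Up/down digital control of the supply voltage.

    Lowers vdd while the ring oscillator still meets f_target and raises it
    when it falls short. Returns the most recent voltage that met the target.
    """
    vdd = v_start
    v_ok = v_max  # fallback if the target is never met during the run
    for _ in range(iterations):
        if ring_osc_freq(vdd) >= f_target:
            v_ok = vdd                        # known to meet timing
            vdd = max(v_min, vdd - v_step)    # slack available: lower the supply
        else:
            vdd = min(v_max, vdd + v_step)    # too slow: raise the supply
    return v_ok
```

With these assumed parameters, `regulate(1.0e8)` settles near 2.59 V, the lowest 10 mV step at which the modeled oscillator still reaches 100 MHz. Returning the last passing voltage, rather than wherever the loop happens to stop, avoids reporting the step just below the target that a bang-bang controller keeps dithering into.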

David Brooks and Margaret Martonosi. 1/9/1999. “Dynamically exploiting narrow width operands to improve processor power and performance.” In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, Pp. 13–22. Orlando, FL, USA: IEEE.

In general-purpose microprocessors, recent trends have pushed towards 64-bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64-bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities less than the full word size). In particular, most major processor families have introduced "multimedia" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU. This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes hardware mechanisms that dynamically recognize and capitalize on these "narrow-bitwidth" instances. Both optimizations require little additional hardware, and neither requires compiler support. The first, power-oriented, optimization reduces processor power consumption by using aggressive clock gating to turn off portions of integer arithmetic units that will be unnecessary for narrow bitwidth operations. This optimization results in an over 50% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. The second optimization improves performance by merging together narrow integer operations and allowing them to share a single functional unit. Conceptually akin to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.
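
The narrow-operand test in this abstract can be made concrete. The Python sketch below treats a 64-bit value as narrow when it fits in 16 bits as a two's-complement number, that is, when its upper bits are a pure sign extension, and counts how often both operands of an operation qualify. The helper names and the sign-extension rule are assumptions for illustration; the paper's detection logic may differ (for example, in how it treats negative values), and overflow of a narrow result into more than 16 bits is not modeled here.

```python
MASK64 = (1 << 64) - 1  # 64-bit register width


def is_narrow(word: int, width: int = 16) -> bool:
    """Return True if a 64-bit two's-complement value fits in `width` bits.

    That holds exactly when bits 63..width-1 are all zeros (small non-negative
    value) or all ones (small negative value), i.e. a zero/ones detection
    on the upper bits, as a hardware check might do.
    """
    upper = (word & MASK64) >> (width - 1)  # bits 63 down to width-1
    all_ones = (1 << (65 - width)) - 1      # 65 - width bits set
    return upper == 0 or upper == all_ones


def narrow_op(a: int, b: int, width: int = 16) -> bool:
    """An operation is a candidate for the gated (narrow) datapath
    only if both source operands are narrow."""
    return is_narrow(a, width) and is_narrow(b, width)


def narrow_fraction(operand_pairs) -> float:
    """Fraction of operations in a trace whose operands are both narrow."""
    pairs = list(operand_pairs)
    if not pairs:
        return 0.0
    return sum(narrow_op(a, b) for a, b in pairs) / len(pairs)
```

For example, `narrow_fraction([(1, 2), (-5, 300), (1 << 20, 3), (0x7FFF, -0x8000)])` returns 0.75, since only the third pair has an operand that needs more than 16 bits.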

David Brooks and Margaret Martonosi. 1/1999. “Implementing application-specific cache-coherence protocols in configurable hardware.” In Network-Based Parallel Computing. Communication, Architecture, and Applications, Pp. 181–195. Springer.

Streamlining communication is key to achieving good performance in shared-memory parallel programs. While full hardware support for cache coherence generally offers the best performance, not all parallel machines provide it. Instead, software layers using Shared Virtual Memory (SVM) can be built to enforce coherence at a higher level. In prior work, researchers have studied application-specific cache coherence protocols implemented either in SVM systems or as handlers run by programmable protocol processors. Since the protocols are specialized to the needs of a single application, they can be particularly helpful in reducing the long latencies and processing overhead that sometimes degrade performance in SVM systems. This paper studies implementing application-specific protocols in hardware, but not via an instruction-based protocol processor as is typical. Instead, we consider configurable implementations based on Field-Programmable Gate Arrays (FPGAs). This approach can be faster than software-based techniques and less expensive than some hardware-based techniques. We study one application, appbt, in detail, including a VHDL-level design of the configurable protocol hardware. We sketch out approaches for other applications as well. Implementing protocol operations in configurable hardware improves communication performance by roughly 11X for a 32-node system. While overall speedups are a more modest 12%, our method is promising because of its flexibility and because it offers a new way of harnessing configurable hardware at the network interface, where it already exists or could be easily added to current systems.
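
One back-of-the-envelope way to see why an 11X gain in protocol operations yields only a 12% overall speedup is Amdahl's law. Assuming, as a simplification not taken from the paper, that the 11X speedup applies uniformly to a fraction f of the original run time and that nothing else changes:

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{11}} = 1.12
\quad\Longrightarrow\quad
1 - \frac{10}{11}\, f = \frac{1}{1.12}
\quad\Longrightarrow\quad
f = \frac{11}{10}\left(1 - \frac{1}{1.12}\right) \approx 0.118
```

Under that assumption, communication would have accounted for only roughly 12% of the original run time, so even a large local improvement moves the total modestly.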

1998
Christina Leung, David Brooks, Margaret Martonosi, and Douglas Clark. 1998. “Power-Aware Architecture Studies: Ongoing Work at Princeton.” Power-Driven Microarchitecture Workshop.

Power dissipation limits have emerged as a major constraint in the design of microprocessors. At the low end of the performance spectrum, namely in the world of handheld and portable devices or systems, power has always dominated over performance (execution time) as the primary design issue. Battery life and system cost constraints drive the design team to consider power over performance in such a scenario. Increasingly, however, power is also a key design issue in the workstation and server markets (see Gowan et al. [1]). In this high-end arena the increasing microarchitectural complexities, clock frequencies, and die sizes push the chip-level (and hence the system-level) power consumption to such levels that traditionally air-cooled multiprocessor server boxes may soon need budgets for liquid-cooling or refrigeration hardware. This need is likely to cause a break point, with a step upward, in the ever-decreasing price-performance ratio curve. As such, a design team that considers power consumption and dissipation limits early in the design cycle and can thereby adopt an inherently lower-power microarchitectural line will have a definite edge over competing teams. Thus far, most of the work done in the area of high-level power estimation has been focused on the register-transfer-level (RTL) description in the processor design flow. Only recently have we seen a surge of interest in estimating power at the microarchitecture definition stage, and specific work on power-efficient microarchitecture design has been reported [2–8]. Here, we describe the approach of using energy-enabled performance simulators in early design. We examine some of the emerging paradigms in processor design and comment on their inherent power-performance characteristics.

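
The energy-enabled simulator approach mentioned here is commonly built as activity-based accounting: the performance simulator counts accesses to each microarchitectural unit, and a per-access energy estimate turns those counts into energy and average power. The Python sketch below assumes that style of model; the unit names, the picojoule values, and the `EnergyCounter` class are hypothetical placeholders rather than the Princeton tools or their data.

```python
from collections import Counter

# Hypothetical per-access energies in picojoules (placeholders, not measured data).
ENERGY_PER_ACCESS_PJ = {
    "icache": 20.0,
    "regfile": 5.0,
    "alu": 8.0,
    "dcache": 25.0,
}


class EnergyCounter:
    """Accumulates unit activity reported by a cycle-level performance simulator."""

    def __init__(self, energy_table):
        self.energy_table = energy_table
        self.accesses = Counter()  # unit name -> number of accesses
        self.cycles = 0

    def record(self, unit, count=1):
        """Called by the simulator whenever `unit` is accessed."""
        self.accesses[unit] += count

    def tick(self):
        """Called once per simulated cycle."""
        self.cycles += 1

    def total_energy_pj(self):
        """Sum of (accesses x energy per access) over all units."""
        return sum(self.energy_table[unit] * n for unit, n in self.accesses.items())

    def average_power_w(self, clock_hz):
        """Average energy per cycle (converted pJ -> J) times cycles per second."""
        if self.cycles == 0:
            return 0.0
        return self.total_energy_pj() * 1e-12 / self.cycles * clock_hz


# Toy trace: units touched by each instruction, one instruction per cycle (hypothetical).
trace = [("icache", "regfile", "alu"), ("icache", "regfile", "dcache")]
meter = EnergyCounter(ENERGY_PER_ACCESS_PJ)
for units in trace:
    meter.tick()
    for unit in units:
        meter.record(unit)
```

For this two-instruction toy trace the counter reports 83 pJ in total, or about 41.5 mW at an assumed 1 GHz clock. Keeping the energy table separate from the counting code is what lets such a model be attached to a performance simulator early in design, before any RTL exists.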
1997
David Harris, Ron Ho, Gu-Yeon Wei, and Mark Horowitz. 1997. “The fanout-of-4 inverter delay metric.” Unpublished manuscript: http://odin.ac.hmc.edu/harris/research/FO4.pdf.

Digital circuit delays vary with feature size, process corner, operating voltage, and junction temperature. Delays are steadily decreasing with advances in process technology, so comparing results reported in nanoseconds between process generations is difficult. This paper proposes using the delay of a fanout-of-4 inverter (FO4) to normalize for process and operating-condition variations and quantifies how well this normalization works. A novel application of this correlation is a power-reduction technique: power supply and operating frequency can be regulated on the fly to minimize power when a chip is performing non-critical operations, while still allowing full-speed operation when necessary. Proposed implementations [1,2,3] rely on a good correlation between ring-oscillator frequency and critical-path latency. The tracking of chip delays with FO4 delay determines the necessary extra margin for functionality over process and environmental variation.

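
Normalizing to FO4 is a division: a delay measured under some process corner, voltage, and temperature is divided by the FO4 delay measured under the same conditions. The Python sketch below does that and, as one possible reading of the margin remark, computes how much a critical path's FO4 count drifts across corners. The corner names, delay values, and function names are hypothetical.

```python
def to_fo4(delay_ps: float, fo4_ps: float) -> float:
    """Express a delay in FO4 units, using the FO4 delay measured under the
    same process corner, supply voltage, and temperature."""
    if fo4_ps <= 0:
        raise ValueError("FO4 delay must be positive")
    return delay_ps / fo4_ps


def fo4_spread(corners: dict) -> float:
    """Relative spread of a path's FO4 count across corners:
    (largest count / smallest count) - 1.

    If a replica such as a ring oscillator tracked FO4 delay exactly and were
    matched to the path at the corner with the smallest count, this is the
    extra timing margin the path would need at the worst corner.
    """
    counts = [to_fo4(path_ps, fo4_ps) for path_ps, fo4_ps in corners.values()]
    return max(counts) / min(counts) - 1.0


# Hypothetical measurements: corner -> (critical-path delay in ps, FO4 delay in ps).
corners = {
    "fast": (1800.0, 90.0),      # 20 FO4
    "typical": (2520.0, 120.0),  # 21 FO4
    "slow": (3520.0, 160.0),     # 22 FO4
}
```

For these made-up numbers the path stays between 20 and 22 FO4 even though its absolute delay nearly doubles, and `fo4_spread(corners)` comes out at about 0.10, i.e. roughly 10% margin.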
1996
Gu-Yeon Wei and Mark Horowitz. 8/12/1996. “A low power switching power supply for self-clocked systems.” In Proceedings of the 1996 International Symposium on Low Power Electronics and Design, Pp. 313–317. Monterey, CA, USA: IEEE.

This paper presents a digital power supply controller for variable-frequency, variable-voltage circuits. Using a ring oscillator to predict circuit performance, the controller sets the regulated voltage to the minimum required to operate at a reference frequency, which maximizes energy efficiency. Our initial test silicon, implemented with a fixed-frequency controller, is analyzed and reveals that the controller's power consumption is a major limitation for such a design. To make the controller power dissipation scale with the CV²f power of the load, we introduce a new architecture with variable-frequency control, which allows the controller's supply and frequency to scale along with the load device.
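
The CV²f term is why scaling the controller's own supply and frequency with the load pays off: dynamic power falls with the square of the voltage and linearly with frequency. A minimal Python sketch, with made-up capacitance, voltage, and frequency values, makes the arithmetic explicit:

```python
def dynamic_power(c_farads: float, vdd_volts: float, f_hz: float) -> float:
    """Switching power of a CMOS load, P = C * V^2 * f."""
    return c_farads * vdd_volts ** 2 * f_hz


# Hypothetical load: 1 nF of switched capacitance at 2 V and 100 MHz.
nominal = dynamic_power(1e-9, 2.0, 100e6)
# Same load with both supply voltage and clock frequency halved.
scaled = dynamic_power(1e-9, 1.0, 50e6)
ratio = scaled / nominal  # (1/2)^2 * (1/2) = 1/8
```

Halving both voltage and frequency leaves one eighth of the original switching power (0.4 W drops to 0.05 W for these assumed values). A controller whose own power stayed fixed while the load scaled down this way would take up a growing share of the total, which is the limitation the abstract reports for the fixed-frequency design.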
