Publications by Year: 2004

2004
Mark Hempstead, Matt Welsh, and David Brooks. 11/16/2004. “TinyBench: The case for a standardized benchmark suite for TinyOS based wireless sensor network devices.” In Local Computer Networks, 2004. 29th Annual IEEE International Conference on, Pp. 585–586. Tampa, FL, USA: IEEE.
The growing wireless sensor network research community lacks a standard method for evaluating hardware platforms. Traditional benchmark suites do not sufficiently address the needs of sensor network designers. This work provides motivation for a benchmark suite and details an approach for benchmarking TinyOS compatible hardware. To aid the development of future hardware architectures, we propose the creation of a standard single node benchmark suite, based on both real applications and "stressmarks." We present sample benchmark results and call for further work in this area.
Yau Chin, John Sheu, and David Brooks. 10/11/2004. “Evaluating techniques for exploiting instruction slack.” In Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings. IEEE International Conference on, Pp. 375–378. San Jose, CA, USA: IEEE.
In many workloads, 25% to 50% of instructions have slack allowing them to be delayed without impacting performance. To exploit this slack, processors may implement more power-efficient, longer latency pipelines or provide dynamically scaled pipelines using multiple clock domains. Issuing instructions with slack to slower pipelines can result in substantial power savings, with minimal performance loss. Considering both dynamic and static power dissipation, we found that by using longer latency pipelines the power of functional unit pipelines decreases by 20% to 55% with a performance impact of 0% to 3% for SPEC2000 and MediaBench workloads. Dynamic scaling reduces the performance loss in intense multimedia workloads by up to 2%, but achieves lower power savings.
Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron, and Pradip Bose. 8/11/2004. “Understanding the energy efficiency of simultaneous multithreading.” In Proceedings of the 2004 international symposium on Low power electronics and design, Pp. 44–49. Newport Beach, CA, USA: ACM.
Simultaneous multithreading (SMT) has proven to be an effective method of increasing the performance of microprocessors by extracting additional instruction-level parallelism from multiple threads. In current microprocessor designs, power-efficiency is of critical importance, and we present modeling extensions to an architectural simulator to allow us to study the power-performance efficiency of SMT. After a thorough design space exploration we find that SMT can provide a performance speedup of nearly 20% for a wide range of applications with a power overhead of roughly 24%. Thus, SMT can provide a substantial benefit for energy-efficiency metrics such as ED². We also explore the underlying reasons for the power uplift, analyze the impact of leakage-sensitive process technologies, and discuss our model validation strategy.
Kim Hazelwood and David Brooks. 8/2004. “Eliminating voltage emergencies via microarchitectural voltage control feedback and dynamic optimization.” In Proceedings of the 2004 international symposium on Low power electronics and design, Pp. 326–331. ACM.

Microprocessor designers use techniques such as clock gating to reduce power dissipation. An unfortunate side-effect of these techniques is the processor current fluctuations that stress the power-delivery network. Recent research has focused on hardware-only mechanisms to detect and eliminate these fluctuations. While the solutions have been effective at avoiding operating-range violations, they have done so at a performance penalty to the executing program.

Compilers are well equipped to rearrange instructions such that current fluctuations are less dramatic, with minimal performance implications. Furthermore, a dynamic optimizer can eliminate the problem at run time, avoiding the difficult task of statically predicting voltage emergencies.

This paper proposes complementing existing hardware solutions with additional run-time software to address problematic code sequences that cause recurring voltage swings. Our proposal extends existing hardware techniques to additionally provide feedback to a dynamic optimizer, which can provide a long-term solution, often without impacting the performance of the executing application.

We found that recurring voltage fluctuations do exist in the SPEC2000 benchmarks, and that given very little information from the hardware, a dynamic optimizer can locate and correct many of the recurring voltage emergencies.

David Brooks and Joerg Henkel. 8/2004. “High level power modeling and analysis.” International Symposium on Low Power Electronics and Design: Proceedings of the 2004 international symposium on Low power electronics and design.
Victor Zyuban, David Brooks, Viji Srinivasan, Michael Gschwind, Pradip Bose, Philip Strenski, and Philip Emma. 8/2004. “Integrated analysis of power and performance for pipelined microprocessors.” Computers, IEEE Transactions on, 53, 8, Pp. 1004–1016.
Choosing the pipeline depth of a microprocessor is one of the most critical design decisions that an architect must make in the concept phase of a microprocessor design. To be successful in today’s cost/performance marketplace, modern CPU designs must effectively balance both performance and power dissipation. The choice of pipeline depth and target clock frequency has a critical impact on both of these metrics. In this paper, we describe an optimization methodology based on both analytical models and detailed simulations for power and performance as a function of pipeline depth. Our results for a set of SPEC2000 applications show that, when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of our energy models. Finally, we discuss the potential risks in design quality for overly aggressive or conservative choices of pipeline depth.
Ruwan Ratnayake, Gu-Yeon Wei, and Aleksandar Kavcic. 6/2004. “Pipelined parallel architecture for high throughput MAP detectors.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 2: Pp. II–505. IEEE.
A maximum a posteriori probability (MAP) detector based on a forward only algorithm with high throughput is considered in this paper. MAP gives the optimal performance and, with Turbo decoding, can achieve performance close to the channel capacity limits. Deep pipelined architecture for the forward only method is presented and compared with the other throughput-increasing methods. Simulation results based on the iterative MAP-LDPC (low-density parity check) system are shown. Hardware implementation issues that exploit the regularities of the structure are also discussed.
Pavan Kumar Hanumolu, Bryan Casper, Randy Mooney, Gu-Yeon Wei, and Un-Ku Moon. 5/26/2004. “Jitter in high-speed serial and parallel links.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 4: Pp. IV–425. Vancouver, BC, Canada: IEEE.
Jitter degrades the performance of both high-speed serial and parallel I/O links by limiting the maximum achievable data-rates. We present analytical expressions to evaluate the effect of jitter on the performance of high-speed links. These expressions enable simple calculation of worst-case voltage and timing margins in the presence of jitter. This analysis is also extended to equalized links. Finally, we show that the limited bandwidth of the channel can amplify high frequency jitter and present means to counteract jitter amplification.
Yong Cheol Bae and Gu-Yeon Wei. 5/23/2004. “A mixed PLL/DLL architecture for low jitter clock generation.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 4: Pp. IV–788. Vancouver, BC, Canada: IEEE.
This paper presents a mixed PLL/DLL architecture for low-jitter clock generation that merges phase-locked loop (PLL) and delay-locked loop (DLL) characteristics. It relies on an interpolator to configure the loop to operate more like a PLL or more like a DLL depending on the interpolator's settings. The ability to vary interpolator settings enables wide range control of the clock generator's loop bandwidth. Therefore, the loop bandwidth can readily be adjusted to accommodate different noise conditions. A discrete-time Z-domain analysis is provided to illustrate the noise filtering characteristics of the loop in the presence of various noise sources and highlight the potential advantages of the mixed PLL/DLL architecture. Simulation results verify stable operation of the loop designed for a 0.18 μm CMOS process.
David Brooks, Pradip Bose, and Margaret Martonosi. 3/2004. “Power-performance simulation: design and validation strategies.” ACM SIGMETRICS Performance Evaluation Review, 31, 4, Pp. 13–18.

Microprocessor research and development increasingly relies on detailed simulations to make design choices. As such, the structure, speed, and accuracy of microarchitectural simulators are of critical importance to the field. This paper describes our experiences in building two simulators, using related but distinct approaches. One of the most important attributes of a simulator is its ability to accurately convey design trends as different aspects of the microarchitecture are varied. In this work, we break down accuracy (a broad term) into two sub-types: relative and absolute accuracy. We then discuss typical abstraction errors in power-performance simulators and show when they do (or do not) affect the design rule choices a user of those simulators might make. By performing this validation study using the Wattch and PowerTimer simulators, the work addresses validation issues both broadly and in the specific case of a fairly widely-used simulator.

Thaddeus R. F. Fulford-Jones, Gu-Yeon Wei, and Matt Welsh. 2/2004. “A portable, low-power, wireless two-lead EKG system.” In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1: Pp. 2141–2144. IEEE.

Sensor devices ("motes") which integrate an embedded microprocessor, low-power radio and a limited amount of storage have the potential to significantly enhance the provision of emergency medical care. Wearable vital sign sensors can wirelessly monitor patient condition, alerting healthcare providers to changes in status while simultaneously delivering data to a backend archival system for longer-term storage. As part of the CodeBlue initiative at Harvard University, we previously developed a mote-based pulse oximetry module which gathers data from a noninvasive finger sensor and transmits it wirelessly to a base station. To expand the capabilities of the mote for healthcare applications we now introduce EKG on Mica2, the first custom-designed electrocardiograph sensor board to interface with this platform. We additionally present VitalEKG, a collection of software components which allow the capture and wireless transmission of heart activity traces. We present preliminary test results which validate our approach and suggest the feasibility of future enhancements.

Yingmin Li, Kevin Skadron, Zhigang Hu, and David Brooks. 2004. “Evaluating the thermal efficiency of SMT and CMP architectures.” IBM TJ Watson Conference on Interaction between Architecture, Circuits, and Compilers.
Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to evaluate the thermal efficiency for a Power4/Power5-like core. Our results show that although SMT and CMP exhibit similar peak operating temperatures, the mechanisms by which they heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures such as the register file, due to increased utilization. On the other hand, CMP heating is mainly caused by the global impact of increased energy output, due to the extra energy of an added core. Because of this difference in heat-up mechanism, we found that the best thermal management technique is also different for SMT and CMP. Finally, we show that CMP and SMT will scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core's higher temperature and the exponential temperature-dependence of subthreshold leakage.
David Brooks. 2004. “Integrated Architectural Level Power-Performance Modeling Toolkit.”
We are currently developing a robust, integrated infrastructure for studying power-performance issues across a range of systems. By leveraging a common ISA and shared simulation infrastructure, we will be able to perform apples-to-apples comparisons between processors intended for specific design spaces. For example, recently there has been significant attention brought to the idea of reusing microprocessor cores in multiple design spaces. In particular, there has been much interest in exploring the possibility of using multiple low-power, embedded processors in blade systems or SMP-on-a-chip designs for server workloads. There has also been interest in taking server-class microprocessors and bringing them into use in lower-end systems. For example, the processor core of the original POWER4 microprocessor has recently been introduced as the PowerPC 970, a 64-bit microprocessor for use in blade servers and desktop (and potentially laptop) systems. We utilize the MET/Turandot toolkit originally developed at IBM TJ Watson Research Center as the underlying PowerPC microarchitecture performance simulator [3]. Turandot is flexible enough to model a broad range of microarchitectures and has undergone extensive validation [3]. In addition, Turandot has been augmented with power models to explore power-performance tradeoffs in an internal IBM tool called PowerTimer [4]. Turandot is freely available to the research community through licensing arrangements with IBM, and we are currently working with IBM to develop an external, public release of PowerTimer.