Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to evaluate the thermal eff iciency for a Power4/Power5-like core. Our results show that although SMT and CMP exhibit similar peak operating temperatures, the mechanism by which they heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures such as the register file, due to increased utilization. On the other hand, CMP heating is mainly caused by the global impact of increased energy output, due to the extra energy of an added core. Because of this difference in heat up machanism, we found that the best thermal management technique is also different for SMT and CMP. Finally, we show that CMP and SMT will scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core's higher temperature and the exponential temperature-dependence of subthreshold leakage.
The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. The key component of the toolset is a parameterized set of energy functions that can be used in conjunction with any given cycle-accurate microarchitectural simulator. The energy functions model the power consumption of primitive and hierarchically composed building blocks which are used in microarchitecture-level performance models. Examples of structures modeled are pipeline stage latches, queues, buffers and component read/write multiplexers, local clock buffers, register files, and cache array macros. The energy functions can be derived using purely analytical equations that are driven by organizational, circuit, and technology parameters or behavioral equations that are derived from empirical, circuit-level simulation experiments. After describing the modeling methodology, we present analysis results in the context of a current-generation superscalar processor simulator to illustrate the use and effectiveness of such early-stage models. In addition to average power and performance tradeoff analysis, PowerTimer is useful in assessing the typical and worst-case power (or current) swings that occur between successive cycle windows in a given workload execution. Such a characterization of workloads at the early stage of microarchitecture definition helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.
This paper describes a novel backplane transceiver, which uses PAM-4 (pulse amplitude modulated four level) signalling and continuously adaptive transmit based equalization to move 5 Gcb/s (channel bits per second) across typical FR-4 backplanes for total distances of up to 50 inches through two sets of backplane connectors. The paper focuses on the implementation of the equalizer and the adaptation algorithms, and includes measured results. The 17 mm/sup 2/ device is implemented in a 0.25 /spl mu/m CMOS process, operates on 2.5 V and 3.3 V supplies and consumes 1.2 W.
We present the high-level microarchitecture of LPX: a low-power issue-execute processor prototype that is being designed by a joint industry-academia research team. LPX implements a very small subset of a RISC architecture, with a primary focus on a vector (SIMD) multimedia extension. The objective of this project is to validate some key new ideas in power-aware microarchitecture techniques, supported by recent advances in circuit design and clocking.
We describe a new power-performance modeling toolkit, developed to aid in the evaluation and definition of future power-efficient, PowerPC TM processors. The base performance models in use in this project are: (a) a fast but cycle-accurate, parameterized research simulator and (b) a slower, pre-RTL reference model that models a specific high-end machine in full, latchaccurate detail. Energy characterizations are derived from real, circuit-level power simulation data. These are then combined to form higher-level energy models that are driven by microarchitecture-level parameters of interest. The overall methodology allows us to conduct power-performance tradeoff studies in defining the follow-on design points within a given product family. We present a few experimental results to illustrate the kinds of tradeoffs one can study using this tool.
Of the many ways one could gauge the complexity-effectiveness of a design or design element, one candidate approach is to consider a design's power/performance tradeoffs. This paper describes our early-stage results in a broad effort to evaluate the power-performance tradeoffs of a range of benchmarks and microarchitectures. In particular, this paper presents power data collected on-the-fly on real x86 machines as they execute carefully-constructed microbenchmarks. The microbenchmarks exercise aspects of the system such as data cache and branch predictor. They are parametrically-variable to consider how load dependence, cache miss rate, branch mispredict rate, and branch distance all impact the power and performance of a CPU. For example, from these experiments, we learn that CPU performance increases essentially monotonically with cache hit rate, while CPU power encounters a maximum at roughly 80-90% cache hit rates. Likewise, we show results demonstrating that performance-neutral issues such as bit populations in the data cache values can display interesting power trends. While the experimental results are preliminary, we feel that the techniques described in this paper will o er a useful foundation for a broad range of power/performance tradeoffs.
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-/spl mu/m process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.
Power dissipation limits have emerged as a major constraint in the design of microprocessors. At the low end of the performance spectrum, namely in the world of handheld and portable devices or systems, power has always dominated over performance (execution time) as the primary design issue. Battery life and system cost constraints drive the design team to consider power over performance in such a scenario. Increasingly, however, power is also a key design issue in the workstation and server markets (see Gowan et al.)1 In this high-end arena the increasing microarchitectural complexities, clock frequencies, and die sizes push the chiplevel—and hence the system-level—power consumption to such levels that traditionally air-cooled multiprocessor server boxes may soon need budgets for liquid-cooling or refrigeration hardware. This need is likely to cause a break point—with a step upward—in the ever-decreasing price-performance ratio curve. As such, a design team that considers power consumption and dissipation limits early in the design cycle and can thereby adopt an inherently lower power microarchitectural line will have a definite edge over competing teams. Thus far, most of the work done in the area of high-level power estimation has been focused at the register-transfer-level (RTL) description in the processor design flow. Only recently have we seen a surge of interest in estimating power at the microarchitecture definition stage, and specific work on power-efficient microarchitecture design has been reported.2-8 Here, we describe the approach of using energy-enabled performance simulators in early design. We examine some of the emerging paradigms in processor design and comment on their inherent power-performance characteristics.
Digital circuit delays vary with feature size, process corner, operating voltage, and junction temperature. Delays are steadily decreasing with advances in process technology, so comparing results reported in nanoseconds between process generations is difficult. This paper proposes using the delay of a fanout-of-4 inverter (FO4) to normalize process and operating condition variations and quantifies how well this normalization works. A novel application of this correlation is a power-reduction technique. Power supply and operating frequency can be regulated on the fly to minimize power while a chip is performing non-critical operations while allowing full-speed operation when necessary. Proposed implementations [1,2,3] rely on a good correlation between ring-oscillator frequency and critical path latency. The tracking of chip delays with FO4 delay determines the necessary extra margin for functionality over process and environmental variation.