Advances in technology allowed for integrating DRAM-like structures into the chip, called embedded DRAM (eDRAM). This technology has already been successfully implemented in some GPUs and other graphic-intensive SoC, like game consoles. The most recent processor from IBM ® , POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies, speed of SRAM and high density of eDRAM. We demonstrate, that due to the high locality found in emerging applications, a high percentage of data that enters to the on-chip last-level cache are not accessed again before they are evicted. Based on that observation, we propose a placement scheme where re-accessed data blocks are stored in fast, but costly in terms of area and power, SRAM banks, while eDRAM banks store data blo cks that just arrive to the NUCA cache or were demoted from a SRAM bank. We show that a well-balanced SRAM / eDRAM NUCA cache can achieve similar performance results than using a NUCA cache composed of only SRAM banks, but reduces area by 15% and power consumed by 10%. Furthermore, we also explore several alternatives to exploit the area reduction we gain by using the hybrid architecture, resulting in an overall performance improvement of 4%.
Unintended variations in circuit lithography and undesirable fluctuations in circuit operating parameters such as supply voltage and temperature are threatening the continuation of technology scaling that microprocessor evolution relies on. Although circuit-level solutions for some variation problems may be possible, they are prohibitively expensive and impractical for commodity processors, on which not only the consumer market but also an increasing segment of the business market now depends. Solutions at the microarchitecture level and even the software level, on the other hand, overcome some of these circuit-level challenges without significantly raising costs or lowering performance. Using examples drawn from our Alarms Project and related work, we illustrate how collaborative design that encompasses circuits, architecture, and chip-resident software leads to a cost-effective solution for inductive voltage noise, sometimes called the dI / dt problem. The strategy that we use for assuring correctness while preserving performance can be extended to other variation problems.
The design of autonomous robots involves the
development of many complex, interdependent components,
including the mechanical body and its associated actuators,
sensors, and algorithms to handle sensor processing, control,
and high-level task planning. For the design of a robotic
bee (RoboBee) it is necessary to optimize across the design
space for minimum weight and power consumption to increase
flight time; however, the design space of a single component is
large, the interconnectedness and tradeoffs across components
must be considered, and interdisciplinary collaborations cause
different component design timelines.
In this work, we show how the development of a hardware
in the loop (HWIL) system for a flapping wing microrobot can
simplify and accelerate evaluation of a large number of design
choices. Specifically, we explore the design space of the visual
system including sensor hardware and associated optical flow
processing. We demonstrate the utility of the HWIL system
in exposing trends on system performance for optical flow
algorithm, field of view, sensor resolution, and frame rate.
This paper describes a digital calibration scheme that corrects for phase spacing errors in a multiphase clock generating delay-locked loop (DLL). The calibration scheme employs sub-sampling using a frequency-offset clock with respect to the DLL reference clock, to measure phase-offsets. The phase-correction circuit uses one digital-to-analog converter across eight variable-delay buffers to reduce the area consumption by 62%. The test-chip, designed in a 130 nm CMOS process, demonstrates a 8-phase 1.6 GHz DLL with a worst-case phase error of 450 fs.
High-speed ADC-based backplane receivers often suffer from high power consumption and complexity and require careful designs. This paper discusses circuit- and system-level design considerations for such receivers. A low-power, high-speed front-end ADC circuit and a high-level design-space exploration of ADC-based receivers are presented.
Networks of ultra-low-power nodes capable of sensing, computation, and wireless communication have applications in medicine, science, industrial automation, and security. Reducing power consumption requires the development of system-on-chip implementations that must provide both energy efficiency and adequate performance to meet the demands of the long deployment lifetimes and bursts of computation that characterize wireless sensor network (WSN) applications. Therefore, this work argues that designers should evaluate the design in terms of average power for an entire workload, including active and idle periods, not just the metric of energy-per-instruction.
A dual-channel, low power control IC for driving high
voltage piezoelectric actuators in a flapping-wing robotic insect is presented. The IC controls milligram-scale power electronics that meet the stringent weight and power requirements
of aerial microrobots. Designed in a 0.13µm CMOS process,
the IC implements an efficient control algorithm to drive piezoelectric actuators with high temporal resolution while consuming <100µW during normal operation at 1.0V.
Keywords: low power, SOC, high voltage, piezoelectric
actuator, and microrobotics.
Processor-level dynamic thermal management techniques have long targeted worst-case thermal margins. We examine the thermal-performance trade-offs in average-case, preventive thermal management by actively degrading application performance to achieve long-term thermal control. We propose Dimetrodon, the use of idle cycle injection, a flexible, per-thread technique, as a preventive thermal management mechanism and demonstrate its efficiency compared to hardware techniques in a commodity operating system on real hardware under throughput and latency-sensitive real-world workloads. Compared to inflexible hardware techniques, Dimetrodon achieves favorable trade-offs for temperature reductions up to 30% due to rapid heat dissipation during short idle intervals.
Flapping-wing robotic insects are small, highly maneuverable flying robots inspired by biological insects and useful for a wide range of tasks, including exploration, environmental monitoring, search and rescue, and surveillance. Recently, robotic insects driven by piezoelectric actuators have achieved the important goal of taking off with external power; however, fully autonomous operation requires an ultralight power supply capable of generating high-voltage drive signals from low-voltage energy sources. This paper describes high-voltage switching circuit topologies and control methods suitable for driving piezoelectric actuators in flapping-wing robotic insects and discusses the physical implementation of these topologies, including the fabrication of custom magnetic components by laser micromachining and other weight minimization techniques. The performance of laser micromachined magnetics and custom-wound commercial magnetics is compared through the experimental realization of a tapped inductor boost converter capable of stepping up a 3.7V Li-poly cell input to 200V. The potential of laser micromachined magnetics is further shown by implementing a similar converter weighing 20mg (not including control functionality) and capable of up to 70mW output at 200V and up to 100mW at 100V.
Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. Successful implementation of this technique requires examination of the design tradeoffs between circuit tuning range and static power overheads within the synthesis flow of the design process, in addition to the implications of place and route. Results from the exploration of the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks show that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. A design using voltage interpolation can match the nominal delay target with a 16% power cost, or for the same power budget, incur only a 13% delay overhead after variations.
In recent years, chip multiprocessor architectures have emerged to scale performance while staying within tight power constraints. This trend motivates per core/block dynamic voltage and frequency scaling (DVFS) with fast voltage transition. Given the high cost and bulk of off-chip DC/DC converters to implement multiple on-chip power domains, there has been a surge of interest in on-chip converters. This paper presents the design and experimental results of a fully integrated 3-level DC/DC converter that merges characteristics of both inductor-based buck and switched-capacitor (SC) converters. While off-chip buck converters show high conversion efficiency, their on-chip counterparts suffer from loss due to low quality inductors. With the help of flying capacitors, the 3-level converter requires smaller inductors than the buck converter, reducing loss and on-die area overhead. Compared to SC converters that need more com plex structures to regulate higher than half the input voltage, 3-level converters can efficiently regulate the output voltage across a wide range of levels and load currents. Measured results from a 130nm CMOS test-chip prototype demon strate nanosecond-scale voltage transition times and peak conversion efficiency of 77%.
Continued scaling of process technologies is critical to sustaining improvements in processor frequencies and performance. However, shrinking process technologies exacerbates process variations - the deviation of process parameters from their target specifications. In the context of multi-core CMPs, which are implemented to feature homogeneous cores, within-die process variations result in substantially different core frequencies. Exposing such process-variation induced heterogeneity interferes with the norm of marketing chips at a single frequency. Further, application performance is undesirably dictated by the frequency of the core it is running on. To work around these challenges, a single uniform frequency, dictated by the slowest core, is currently chosen as the chip frequency sacrificing the increased performance capabilities of cores that could operate at higher frequencies. In this paper, we propose choosing the mean frequency across all cores, in lieu of the minimum frequency, as the single-frequency to use as the chip sales frequency. We examine several scheduling algorithms implemented below the O/S in hardware/firmware that guarantee minimum application performance near that of the average frequency, by masking process-variation induced heterogeneity from the end-user. We show that our Throughput-Driven Fairness (TDF) scheduling policy improves throughput by an average of 12% compared to a naive fairness scheme (round-robin) for frequency-sensitive applications. At the same time, TDF allows 98% of chips to maintain minimum performance at or above 90% of that expected at the mean frequency, providing a single uniform performance level to present for the chip.
Parameter variations (process, voltage, and temperature) threaten continued performance scaling of power-constrained computer systems. As designers seek to contain the power consumption of microprocessors through reductions in supply voltage and power-saving techniques such as clock-gating, these systems suffer increasingly large power supply fluctuations due to the finite impedance of the power supply network. These supply fluctuations, referred to as voltage emergencies, must be managed to guarantee correctness. Traditional approaches to address this problem incur high-cost or compromise power/performance efficiency. Our research seeks ways to handle these alarm conditions through a combined hardware/software approach, motivated by root cause analysis of voltage emergencies revealing that many of these events are heavily linked to both program control flow and microarchitectural events (cache misses and pipeline flushes). This talk will discuss three aspects of the project: (1) a fail-safe mechanism that provides hardware guaranteed correctness; (2) a voltage emergency predictor that leverages control flow and microarchitectural event information to predict voltage emergencies up to 16 cycles in advance; and (3) a proof-of-concept dynamic compiler implementation that demonstrates that dynamic code transformations can be used to eliminate voltage emergencies from the instruction stream with minimal impact on performance [1–9].
Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems.