
Computational Sprinting

http://www.cis.upenn.edu/acg/sprinting/

Although transistor density continues to increase, voltage scaling has stalled, so power density rises with each technology generation. Particularly in mobile devices, which have limited cooling options, these trends lead to a utilization wall in which sustained chip performance is limited primarily by power rather than area. However, many mobile applications do not demand sustained performance; rather, they comprise short bursts of computation in response to sporadic user activity. To improve responsiveness for such applications, we propose computational sprinting--briefly exceeding sustainable thermal limits during these bursts. A sprinting system activates reserve cores (parallel sprinting) and/or boosts frequency and voltage (frequency sprinting) to power levels that far exceed the system's sustainable cooling capability, relying on thermal capacitance to briefly buffer the heat.
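
The role of thermal capacitance can be made concrete with a back-of-the-envelope model (a sketch only; the power, capacitance, and headroom values below are illustrative, not measurements from the project):

```python
# Back-of-the-envelope model of sprint duration: heat in excess of what the
# cooling system can remove accumulates in the chip's thermal mass, raising
# die temperature until a thermal limit is reached.

def sprint_duration(p_sprint_w, p_sustain_w, c_thermal_j_per_k, delta_t_k):
    """Seconds a chip can sprint before exhausting its thermal headroom.

    p_sprint_w        -- power drawn while sprinting (W)
    p_sustain_w       -- power the cooling solution can remove steadily (W)
    c_thermal_j_per_k -- lumped thermal capacitance of die + package (J/K)
    delta_t_k         -- allowed temperature rise before throttling (K)
    """
    excess_w = p_sprint_w - p_sustain_w
    if excess_w <= 0:
        return float("inf")  # sprint power is sustainable indefinitely
    return c_thermal_j_per_k * delta_t_k / excess_w

# Illustrative numbers: a chip with 1 W of sustainable cooling sprinting at
# 16 W, with ~1 J/K of thermal capacitance and 30 K of temperature headroom.
print(sprint_duration(16.0, 1.0, 1.0, 30.0))  # 2.0 (seconds)
```

Once a sprint ends, the system must idle (or run at reduced power) long enough for the buffered heat to drain before it can sprint again.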

This project is a collaboration between UM CSE, UM ECE, and U. Pennsylvania. It is supported by NSF under grant CCF-1161505.

Designing Efficient Data Centers

Power Management for Online Data Intensive Services.

Architects and circuit designers have made enormous strides in managing the energy efficiency and peak power demands of processors and other silicon systems. Sophisticated power management features and modes are now myriad across system components, from DRAM to processors to disks. And yet, despite these advances, typical data centers today suffer embarrassing energy inefficiencies: it is not unusual for less than 20% of a data center's multi-megawatt total power draw to flow to computer systems actively performing useful work. Managing power and energy is challenging because individual systems and entire facilities are conservatively provisioned for rare utilization peaks, which leads to energy waste in underutilized systems and over-provisioning of physical infrastructure. Power management is particularly challenging for Online Data Intensive (OLDI) services--workloads like social networking, web search, ad serving, and machine translation that perform significant computing over massive data sets for each user request but require responsiveness on sub-second time scales. These inefficiencies lead to worldwide energy waste measured in billions of dollars and tens of millions of metric tons of CO2. In this project, my team is pursuing ways to make OLDI systems more energy- and capital-infrastructure efficient, while maintaining tight response times.

This project entails collaborations with Google, HP Labs, and ARM, and currently involves three graduate student researchers. The work has been partially supported by NSF under grants CNS-0834403 and CCF-0811320.

Implications of Byte-addressable Non-volatile Memories on Online Transaction Processing.

Within the next few years, one of several candidate byte-addressable non-volatile RAM technologies will likely reach volume manufacturing, and potentially supplant Flash (or even DRAM) in the memory/storage hierarchy. Though numerous studies in the architecture community have considered the density and energy advantages of these NVRAMs, and sought to address their unique challenges (e.g., limited write endurance and read-write performance asymmetries), few studies have sought to exploit the durability of NVRAM. OLTP systems have long relied on hard disks as the nonvolatile medium for durable and recoverable transactions. However, the better latency, lower energy, and most of all the byte-addressability of emerging NVRAMs raise new opportunities that call for a rethink of OLTP storage management, potentially enabling drastic simplifications of other aspects of the database (e.g., concurrency control). This project investigates how best to exploit low-latency byte-addressable non-volatile memories for transaction processing.
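
To illustrate the opportunity, the following sketch shows an NVRAM-style commit in place of the disk-era pattern of serializing a log record and waiting on fsync. The `persist()` primitive is hypothetical (standing in for a cache-line write-back plus fence), and the sketch deliberately ignores concurrency control and undo:

```python
# A sketch of transaction commit on byte-addressable NVM. `nvram` and
# `committed` stand in for persistent memory regions; `persist()` models a
# flush + fence. Updates become durable in place, then a single commit
# marker is persisted -- no block-device write, no log replay on recovery.

nvram = {}          # stand-in for an NVM-resident data region
committed = set()   # stand-in for NVM-resident commit records

def persist(*keys):
    """Model of a flush+fence: ensure prior stores to `keys` are durable."""
    pass  # on real NVM: write back each touched cache line, then fence

def commit(txn_id, writes):
    # 1. Apply each update to its NVM-resident home location.
    for key, value in writes.items():
        nvram[key] = value
    persist(*writes)          # make the data itself durable
    # 2. Persist a single commit marker; the transaction is durable the
    #    moment this store reaches NVM.
    committed.add(txn_id)
    persist(txn_id)

commit(1, {"acct_a": 90, "acct_b": 110})
print(nvram["acct_a"], 1 in committed)  # 90 True
```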

The project is a collaboration between Oracle and UM, involving two Oracle researchers and one graduate student.

Beyond Solid State Disks: Saving Energy with Flash in Enterprise Systems

Energy efficiency is rapidly becoming a key constraint in the design of enterprise systems. By 2011, yearly data center energy consumption in the United States is projected to grow to over 100 billion kWh at a cost of $7.4 billion. As much as 40% of this energy is consumed by DRAM and disks. Portable consumer devices, where battery life has long been a key concern, instead use faster and more energy-efficient Flash storage. To exploit Flash's energy, latency, and bandwidth advantages in the enterprise market, storage vendors have recently announced high-capacity Flash solid-state disks (SSDs). However, because they are accessed through archaic block-device interfaces designed for legacy rotating disks, SSDs fail to fully exploit the low latency and high bandwidth Flash can provide. Furthermore, replacing conventional disks with SSDs does not address the growing power consumption of servers' DRAM. In this project, we propose further opportunities to save energy with Flash in enterprise systems. Instead of placing Flash behind traditional I/O interfaces, we integrate Flash with the server's memory system, making it directly accessible within the processors' physical address space.
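
A minimal model of this organization, with a small DRAM buffer caching hot pages in front of flash-backed memory (the page granularity, capacities, and LRU write-back policy here are illustrative assumptions, not the project's actual design):

```python
# Sketch: flash mapped into the physical address space, fronted by a small
# DRAM cache of hot pages. Reads hit in DRAM when possible; evicted pages
# are written back to flash.

from collections import OrderedDict

class FlashBackedMemory:
    def __init__(self, dram_pages):
        self.flash = {}            # page number -> contents (flash tier)
        self.dram = OrderedDict()  # LRU-ordered cache of hot pages
        self.dram_pages = dram_pages

    def read(self, page):
        if page in self.dram:              # DRAM hit: fast path
            self.dram.move_to_end(page)
            return self.dram[page]
        data = self.flash.get(page, 0)     # miss: fetch from flash
        self._fill(page, data)
        return data

    def write(self, page, data):
        self._fill(page, data)

    def _fill(self, page, data):
        self.dram[page] = data
        self.dram.move_to_end(page)
        if len(self.dram) > self.dram_pages:
            victim, contents = self.dram.popitem(last=False)
            self.flash[victim] = contents  # write back the evicted page

mem = FlashBackedMemory(dram_pages=2)
mem.write(0, "a"); mem.write(1, "b"); mem.write(2, "c")  # evicts page 0
print(mem.read(0))  # a  (fetched back from flash)
```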

This project is supported by the National Science Foundation under grant CNS-0834403, and gifts from HP Labs.

Multiprocessor Memory System Design

Polymorphic Multicore Cache Architecture

The semiconductor industry has hit a wall: chip-level power and cooling constraints have slowed the march of clock frequency, forcing industry to instead bet on multicore to provide energy-efficient performance scalability. Although the multicore trend poses daunting challenges for application developers, it also creates new opportunities unavailable in traditional multi-chip multiprocessors: the drastic change in the relative costs of on-chip communication and computation enables application designs with tightly-coupled threads and frequent sharing that would prove latency- and bandwidth-prohibitive in traditional multiprocessors. Unfortunately, current multicore memory systems are inflexible and poorly suited to support coordinated execution, as they provide no direct means for core-to-core communication or for optimizing data placement on chip. Moreover, intra-chip access patterns vary drastically across applications; there is no one-size-fits-all static cache architecture. We are designing the Polymorphic Multicore Cache Architecture (PMCA)--a modular on-chip cache design in which software configures primitive hardware mechanisms to provide a cache architecture suited to a specific workload.
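
One primitive such an architecture might expose is a software-selectable block-placement policy. The sketch below is illustrative only--the policy names and the interface are assumptions, not PMCA's actual mechanisms:

```python
# Sketch of software-configurable block placement: the same block maps
# either to the requesting tile's local bank (private-like placement,
# minimizing latency) or is address-interleaved across all banks
# (shared-like placement, maximizing aggregate capacity).

BLOCK_BYTES = 64

def home_bank(addr, requesting_tile, n_banks, policy):
    block = addr // BLOCK_BYTES
    if policy == "private":
        return requesting_tile      # keep data near the requester
    elif policy == "interleaved":
        return block % n_banks      # stripe blocks across all banks
    raise ValueError(policy)

# The same block lands in different banks depending on the configured policy.
print(home_bank(0x1040, requesting_tile=3, n_banks=8, policy="private"))      # 3
print(home_bank(0x1040, requesting_tile=3, n_banks=8, policy="interleaved"))  # 1
```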

Spatio-Temporal Memory Streaming (STeMS)

http://www.ece.cmu.edu/~stems

While advances in semiconductor fabrication have enabled phenomenal increases in processor speeds, advances in DRAM fabrication have primarily increased density, providing only modest improvements in access latency. Conventional processors bridge the processor/memory performance gap with an on-chip cache hierarchy where each level provides progressively slower access to a larger subset of data. However, as the processor/memory performance gap grows and on-chip storage capacity increases, the simple heuristic policies of conventional caches (e.g., LRU replacement) are becoming less effective at preventing processor stalls due to off-chip accesses. The performance penalty of ineffective cache management is particularly acute in commercial server applications, where frequent traversals of linked-data structures result in long chains of dependent off-chip misses. Instead of accessing data individually upon processor request, we propose Spatio-Temporal Memory Streaming (STeMS), a memory system design where data are managed in the form of spatio-temporal memory streams--data groups whose accesses are correlated in space or time. STeMS dynamically constructs streams from memory access sequences that exhibit a repetitive layout in memory (spatial correlation) or that recur over the course of program execution (temporal correlation). By fetching stream elements in parallel using recorded history, STeMS increases memory level parallelism for both independent and dependent access sequences. By throttling stream transfer to stay ahead of processor requests, STeMS hides main memory access latency while improving utilization of pin bandwidth and on-chip storage.
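
The temporal half of the idea can be sketched in a few lines: log the global miss-address sequence, and when a miss repeats an address seen earlier, replay the addresses that followed it last time as prefetches (a simplified model, not the STeMS hardware design):

```python
# Simplified model of temporal streaming: a log of past miss addresses plus
# an index from address to its last position in the log. A repeated miss
# locates a stream head and replays the addresses that followed it.

class TemporalStreamer:
    def __init__(self, depth=4):
        self.history = []   # log of past miss addresses
        self.index = {}     # address -> position of its last occurrence
        self.depth = depth  # how far ahead to stream

    def on_miss(self, addr):
        """Record a miss; return addresses to prefetch, if any."""
        prefetches = []
        if addr in self.index:  # stream head found in recorded history
            pos = self.index[addr]
            prefetches = self.history[pos + 1 : pos + 1 + self.depth]
        self.index[addr] = len(self.history)
        self.history.append(addr)
        return prefetches

s = TemporalStreamer(depth=3)
for a in [0x10, 0x24, 0x38, 0x4C, 0x60]:  # first traversal: cold misses
    s.on_miss(a)
print([hex(a) for a in s.on_miss(0x10)])  # ['0x24', '0x38', '0x4c']
```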

Performance Evaluation Methodology

BigHouse

http://www.eecs.umich.edu/BigHouse/

Detailed microarchitectural simulation is far too slow to model systems at data center scale. BigHouse is a simulation infrastructure for data center systems that raises the level of abstraction: rather than modeling individual instructions, it models servers and clusters with stochastic queuing simulation. Workloads are characterized by empirical distributions of request interarrival and service times measured on real systems, and synthetic request streams drawn from these distributions drive a discrete-event simulation of the data center. BigHouse applies statistical sampling techniques to determine when performance and power estimates have converged, reporting results (such as tail response-latency distributions) with quantified confidence, and parallelizes simulation across cores and machines to produce results in minutes rather than days.
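
A minimal single-server sketch of stochastic queuing simulation, the abstraction BigHouse builds on (the exponential distributions and parameters here are illustrative; BigHouse itself uses measured empirical distributions and simulates multi-server clusters):

```python
# Single first-come-first-served server fed by random interarrival and
# service times; we record each request's total latency (queueing + service)
# and then read off median and tail latency from the sorted distribution.

import random

def simulate(n_requests, arrival_rate, service_rate, seed=42):
    rng = random.Random(seed)
    clock = 0.0    # arrival time of the current request
    free_at = 0.0  # time at which the server next becomes idle
    latencies = []
    for _ in range(n_requests):
        clock += rng.expovariate(arrival_rate)  # next arrival
        start = max(clock, free_at)             # wait if the server is busy
        free_at = start + rng.expovariate(service_rate)
        latencies.append(free_at - clock)       # queueing + service time
    return latencies

lat = sorted(simulate(n_requests=100_000, arrival_rate=0.5, service_rate=1.0))
print(round(lat[len(lat) // 2], 2))         # median latency
print(round(lat[int(0.99 * len(lat))], 2))  # 99th-percentile tail latency
```

At 50% utilization the tail latency is already several times the median, which is why OLDI provisioning is driven by percentiles rather than means.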

SimFlex

http://www.ece.cmu.edu/~simflex

Computer architects have long relied on software simulation to measure dynamic performance metrics (e.g., CPI) of a proposed design. Unfortunately, detailed software simulators have become four or more orders of magnitude slower than their hardware counterparts, rendering hardware measurement methodologies impracticable for simulation studies of large-scale commercial server systems. The SimFlex project is advancing the state of the art in simulation tools and measurement methodology to enable fast, accurate, and flexible simulation of large-scale systems. SimFlex combines component-based software design, full-system simulation, statistical sampling, and simulation state checkpointing to enable rapid system evaluation with commercially-relevant benchmark applications, such as online transaction processing databases, while validating performance estimates with statistical measures of confidence. The SimFlex project has recently released two new computer architecture simulation tools to the academic/industrial community. TurboSMARTSim integrates rigorous statistical sampling methodology with live-points, a per-benchmark library of minimal reusable machine state, to accelerate microarchitecture simulation turnaround by 250x over previous simulation sampling approaches while maintaining high accuracy and confidence in estimates. Flexus is a family of component-based C++ computer architecture simulators that enable full-system timing-accurate simulation of uni- and multiprocessor systems running unmodified commercial applications and operating systems.
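
The statistical core of sampling-based simulation is small: estimate a metric from sampled measurement units and report a confidence interval, stopping once the interval is tight enough. A sketch (illustrative only; SMARTS-style sampling additionally handles cache/branch-predictor warming and measurement bias):

```python
# Estimate mean CPI from sampled measurement units with a confidence
# interval, so a simulator can stop sampling once the estimate is tight.

import math

def mean_with_ci(samples, z=1.96):
    """Sample mean with a ~95% normal-approximation confidence interval."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# Hypothetical CPI measurements from eight sampled execution windows.
cpi_samples = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
mean, hw = mean_with_ci(cpi_samples)
print(f"CPI = {mean:.2f} +/- {hw:.2f}")  # CPI = 2.00 +/- 0.09
```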