Memory-System Architecture Studies

UM faculty: Ed Davidson, Trevor Mudge, Steve Reinhardt, Gary Tyson

    Given the relative rates of performance improvement in processor vs. memory technologies, it is clear that the memory system will play an ever-increasing role in determining overall system performance.  Indeed, there is a strong perception that much current work in improving core processor performance will be moot if corresponding advances in memory systems are not found.  The situation is exacerbated by the fact that most academic architecture research is based on benchmarks such as SPEC95 that have markedly different cache behavior from the commercial workloads that dominate the usage of IBM's customers (see, for example, Maynard et al.'s paper in ASPLOS-VI).

    Our assumption in this area is that data cache performance is a major, and likely the primary, performance limiter for commercial applications on future systems. Our preliminary experiments will analyze commercial workloads provided by IBM (e.g., via reference traces) to verify this assumption and quantify the extent to which it holds.  To understand the importance of these studies, we must compare the system-wide performance impact of data caches with that of other factors such as branch prediction and instruction caches.

    Given our assumption, we will employ a two-pronged approach to improving data cache performance.  First, multi-lateral primary cache organizations have been shown to be effective at keeping important data close to the processor, delivering this data with low latency and high bandwidth.  We will continue our investigations into multi-lateral cache structures, focusing on their applicability to large commercial workloads. Second, increasing memory densities enable the use of very large secondary or tertiary cache structures (tens to hundreds of megabytes).  We will study the application of intelligent, software-based algorithms for prefetching and replacement in these large caches.

    Our earlier work has shown that multi-lateral cache designs perform as well as or better than larger single-structure caches while requiring less die area. In addition to reducing miss ratios, a multi-lateral L1 cache increases available data bandwidth relative to traditional, single-structure caches.  We propose to continue the development of the multi-lateral cache paradigm, concentrating on the design of the detection unit (the decision-making process for cache line placement) and the applicability of prefetching into one of the cache files.

    The detection unit in our previous studies gathers information about the characteristics of the incoming cache line; this behavior contrasts with conventional cache allocation policies which utilize replacement strategies based on the activity of the resident lines (e.g., LRU).  Optimal replacement uses information on both displaced and referenced lines to achieve the best decision. Approximations to optimal replacement (either the multi-lateral detection unit or an LRU replacement scheme) perform well, but do not overlap when correct placement is made.  We will explore new designs for the detection unit that include more information about the previous characteristics of the resident cache lines to guide replacement.  This work will then be extended to incorporate prefetch operations centering on compiler insertion of prefetch instructions and hardware support for line placement within the multi-lateral cache structure.
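    As a rough illustration of the gap between a history-based policy and optimal replacement, the sketch below counts misses on a reference trace under LRU and under Belady's MIN policy, which evicts the resident line whose next reference lies farthest in the future. It is not the detection unit described above. It assumes a single fully associative cache, a trace given as a list of line addresses, and hypothetical function names (lru_misses, min_misses) chosen for this example only.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses in a fully associative cache with LRU replacement."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses


def min_misses(trace, capacity):
    """Count misses under Belady's MIN (needs the whole trace in advance)."""
    never = float("inf")
    # next_use[i] is the index of the next reference to trace[i], or infinity.
    next_use = [never] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], never)
        last_seen[trace[i]] = i

    cache = {}  # resident line -> index of its next reference
    misses = 0
    for i, line in enumerate(trace):
        if line not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the resident line that is reused farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[line] = next_use[i]
    return misses


# Hypothetical toy trace: three lines referenced cyclically, two-line cache.
toy_trace = [0, 1, 2] * 3
lru_count = lru_misses(toy_trace, 2)
min_count = min_misses(toy_trace, 2)
```

    On this cyclic toy trace LRU misses on every reference, because each line is evicted just before it is needed again, while MIN keeps some lines resident across reuses. MIN is not implementable in hardware since it requires future knowledge; it serves only as a bound against which approximations can be compared.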

    While multi-lateral primary caches focus on delivering critical data with low latency, secondary and tertiary caches must work to hide large main-memory access latencies while delivering high bandwidth to the primary caches.  As main-memory access latencies grow to hundreds of instruction execution times, even infrequent misses from secondary and tertiary caches reduce performance significantly. This large speed gap provides both the motivation to control these caches more intelligently and the opportunity to do so using software. We are studying intelligent, software-based algorithms that integrate prefetching and replacement control for large secondary and tertiary caches in the range of tens to hundreds of megabytes. Our initial designs integrate key ideas from virtual memory to increase flexibility without needlessly sacrificing performance.  Key questions that we plan to answer as part of this proposed work include the following:

    Again, traces reflecting the memory behavior of large commercial applications will be a key IBM contribution to this study. The emphasis on large memory structures means that very long traces will be required for our measurements to have significance. Stone and Puzak have shown that the required trace length grows quadratically with the size of the cache being studied. In recognition of this, we also plan to spend some effort on the analysis of the effects of trace length and the management of long traces for our simulation environment. Work on compression being performed as part of a DARPA grant may feed into some novel approaches to this problem.
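    To give a feel for what quadratic growth implies, the sketch below scales a trace length from one cache size to another under the relationship stated above. The function name and the baseline figures (one million references judged adequate for a 1 MB cache) are hypothetical placeholders chosen for illustration, not measured values.

```python
def scaled_trace_length(base_refs, base_cache_bytes, target_cache_bytes):
    """Scale a trace length assuming it grows with the square of cache size.

    base_refs is the number of references assumed adequate for a cache of
    base_cache_bytes; the return value is the corresponding length for a
    cache of target_cache_bytes under the quadratic assumption.
    """
    ratio = target_cache_bytes / base_cache_bytes
    return base_refs * ratio ** 2


MB = 1 << 20
# Hypothetical baseline: 1,000,000 references for a 1 MB cache.
refs_for_16mb = scaled_trace_length(1_000_000, 1 * MB, 16 * MB)
```

    Under these assumed numbers, a 16-fold increase in cache size calls for a 256-fold longer trace, which is why trace management and compression become concerns for caches in the range of tens to hundreds of megabytes.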

    Unfortunately, trace-driven simulations are limited in their ability to predict system performance, especially with dependence-driven out-of-order processor architectures. We are also investigating the possibility of using the SimOS system simulator, in conjunction with a detailed processor simulator, to perform execution-driven simulation of samples of large commercial workloads.  Given the computational demands of such execution-driven simulations, we will reserve them for in-depth analysis of the most interesting configurations, as identified by our trace-driven simulations.