L1 data cache and memory reference pattern analysis for an IBM-supplied PowerPC trace
Branch prediction results for some IBM-supplied S/390 traces
L1 data cache and memory reference pattern analysis for an IBM-supplied PowerPC trace
Researchers: Trevor Mudge, Gary Tyson, Jeff Ringenberg, Nomik Eden
The first study performed on this trace was to obtain its L1 data cache MP1000I (Misses Per 1000 Instructions) for a variety of basic cache configurations. The results indicated that its performance was poor when compared to a CINT95 reference set and only slightly better than a CFP95 reference set (see graph).
These poor results led us to the next study that attempted to increase the trace's L1 data cache performance by running it through several different caches, namely a victim cache, a cache using next-sequential prefetch (zero prefetch buffers), and a cache that combined the features of the victim and the prefetch. The results were much better than the original cache, especially for small, direct-mapped implementations (see graphs).
Once we had obtained results for these common L1 data cache configurations, we decided to examine the trace's memory reference patterns to see if we could use the patterns to create a cache structure that would increase the trace's performance further. We looked at how the trace referenced memory over time, how the trace "jumped around" inside of the cache blocks, and finally how each of the cache blocks acted while active in the cache. These are described in more detail below:
The first study of how the trace referenced memory over time showed that the trace was accessing memory in many different locations at many different times. In other words, the reference pattern was highly fragmented thus making it difficult to predict what address would next be accessed. There were, however, a few working sets that were visible as solid (or very nearly solid) lines indicating frequent accesses to that particular set, but these were still dominated by the high frequency of non-sequential accesses (see graphs). These results show that a cache structure that keeps track of previously accessed addresses in a sort of "trace" may be able to effectively prefetch sets of data that would otherwise never be prefetched together by using the trace as a sort of history. This idea is currently being implememented in the form of a new cache structure that we call a "logical prefetch cache".The second study of how the trace "jumped around" in the cache blocks measured the difference between subsequent block offsets when a block was accessed more than once. For example, if a cache block was first accessed at offset one and then later accessed at offset seven, the "Block Position Difference between Subsequent References" would be equal to six. Conversly, if the opposite were true and offset seven was accessed followed by offset one being accessed, the "Block Position Difference between Subsequent References" would be equal to negative six. The results of the study showed that there was a nearly even distribution of the next access being before or after the previous block reference (see graph). This may indicate that some form of previous-line/next-line prefetch based on a block's reference history could be used to effectively prefetch useful data into the block instead of using a static heuristic such as next-line prefetch. This idea is being considered as a possible cache structure that could potentially increase the trace's performance.
The final study of how each of the cache blocks acted while active in the cache dynamically grouped each block into one of four groups based on the block's cache tour. A cache tour begins when a block is brought into the cache and ends when it is replaced. The four groups that a block can be a member of are as follows:
The results of the study show that for small cache sizes of 8 - 32KB, there is a large number of NTNS blocks that should have never been placed into the cache (see graphs). It should be noted that as the cache size increases, the number of cache tours decreases because less tours need to be ended by a block being replaced. The results also indicate that if a cache structure had knowledge of a block's group (whether it is TS, TNS, etc), then it could put the block, or a part of the block, in a cache structure that would be beneficial to the overall cache's performance. For instance, TS blocks would be placed in the cache as usual since cache's are most effective at preserving spatial and temporal locality. TNS blocks would allow only the accessed address to be placed into a cache with one word blocks since the rest of the block would be usless. NTS blocks would be brought into a buffer but not cached since they have no temporal locality. Finally, NTNS blocks would just access the address and not bring the block or the address into any part of the cache since doing so would be useless. If a cache structure had these "oracle-like" abilities, then it could cache data much more intelligently, and thus use the main cache much more efficiently. Implementing a cache structure with these attributes is currently being considered.Temporal / Spatial (TS) - A block is in this category if at least one address in the block was accessed more than once and more than one address was accessed in the block. Temporal / Not-Spatial (TNS) - A block is in this category if an address in the block was accessed more than once and only one address was accessed in the block (note that this is the same address that was accessed more than once). Not-Temporal / Spatial (NTS) - A block is in this category if no address is accessed more than once in the block and more than one address in the block is accessed. Not-Temporal / Not-Spatial (NTNS) - A block is in this catagory if no address is accessed more than once and no more than one address is accessed. This essentially means that the block should have never been brought into the cache.
Branch prediction results for some IBM-supplied S/390 traces
Researchers: Trevor Mudge, Jeff Ringenberg, Nomik Eden
The first study that was performed on these traces was to run them on two different global history branch prediction schemes (YAGS - Yet Another Global Scheme and Bi-Mode). The results indicated that the traces were doing quite poor with YAGS outperforming Bi-Mode for larger predictor sizes and Bi-Mode outperforming YAGS for smaller predictor sizes. This was expected based on the design of the two schemes (see graph).
Our second study was two-fold. First we compared the traces' results with a reference set that was obtained from branch address/taken information traces of the CINT95 benchmarks. The results indicated that the S/390 traces were doing significantly poorer than the CINT95 benchmarks, especially for small predictor sizes. Second, we introduced two new schemes into the simulation environment, Gshare and a new version of YAGS, and removed the Bi-Mode scheme. The results showed that the traces still did poorly when compared to the CINT95 benchmarks, but that the new version of YAGS was generating better results than all previous schemes for the S/390 traces (see graph). We are currently experimenting with different configurations of our new version of YAGS to increase its performance on the traces.