Caching used to be so simple. You used caching on uniprocessor, single-user systems to free yourself from the prison of slow I/O, whether that I/O was from the system bus to the processor or from the hard drive to the bus. As multi-CPU systems, multi-user systems, and the Web became commonplace, simply placing fast RAM between system components with varying I/O throughputs was no longer enough. System designers had to contend with making sure that the data available to the processor was the real data, and so it was that the almost intractable problems of cache coherency emerged.
Cache Coherency Basics
Over the last few years, cache coherency theory and practice have become more critical than ever. With processors far outpacing the bus in speed, the rapid rise of Symmetric Multiprocessing (SMP) operating systems like Windows 2000, Windows Server 2003, Solaris, and Linux, and the growing popularity of clustering systems such as Scyld Beowulf and the Multicomputer Operating System for UniX (MOSIX), memory and disk access have become the bottleneck in high-performance computing.
So it is that now, more than ever, system designers must deal with the fundamental problem of cache coherency: delivering and writing current data without error. Cache coherency theory and practice offer many options for handling these problems, but none of them are perfect.
The core problem with reading data is deciding when to kick some data out of the cache and what should be thrown out. The most common solution is to use a least recently used (LRU) algorithm to decide which data to exile. This isn’t foolproof, though: data that is used heavily, but hasn’t been touched lately, can still end up being thrown out.
More polished caching software and firmware use a least frequently used (LFU) algorithm. At the price of memory and processing overhead, these schemes ensure that frequently accessed data will stay in the cache even if it hasn’t been called on recently.
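To make the two policies concrete, here’s a minimal sketch in Python. The class names and capacities are mine, purely for illustration; real caches implement this in far more compact form. The LRU cache exiles whatever was touched longest ago, while the LFU cache pays for a per-key counter so that frequently used data stays put.

```python
from collections import OrderedDict, Counter

class LRUCache:
    """Evict whatever was touched least recently."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # exile the least recently used entry
        self.data[key] = value

class LFUCache:
    """Evict whatever has the fewest lifetime accesses, at the cost
    of a per-key counter and the work of maintaining it."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.hits = Counter()

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss
        self.hits[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda k: self.hits[k])
            del self.data[victim]          # evict the least frequently used entry
            del self.hits[victim]
        self.data[key] = value
        self.hits[key] += 1
```

The trade-off shows up directly in the code: the LFU version has to maintain and scan its hit counters, which is exactly the memory and processing overhead mentioned above.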
Cache Writing
Another, probably even more important, issue is how a cache contends with data writing. Most caches will check, using a variety of algorithms, to see whether a write request will force the cache to waste time overwriting identical data that has already been written. At the same time, anything that optimizes the data-writing process, minimizing the mechanical, and therefore most time-consuming, element of writing data to disk, is clearly advantageous to system performance.
Write-through designs automatically write data to disk as soon as possible, which reduces the probability of data corruption. Unfortunately, write-through caches are not the most processor-efficient means of handling data writes. By preempting clock cycles and occupying bus bandwidth at times when they’re needed for ongoing processes, write-through caches slow down the system.
Because of this, write-through caching tends to be used only in the most mission-critical systems, such as Online Transaction Processing (OLTP), on servers running extremely fast processors, drives, and data buses.
The alternative approach, deferred writes, delays disk writes until the system is free from other activities. This used to be, and still is, a favorite method in non-mission-critical mainframe environments. But its speed advantage has overcome data-integrity concerns, and so end-user operating systems, like Windows XP, delay disk writes by default.
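As a rough illustration of the difference, consider the following Python sketch. The backing_store dictionary stands in for the disk, and the class and method names are invented for the example; the point is simply that the write-through cache pays the slow write on every call, while the deferred-write cache answers immediately and settles up later.

```python
class WriteThroughCache:
    """Every write goes straight to the backing store, so the store
    is never stale, at the price of paying the slow write every time."""
    def __init__(self, backing_store):
        self.store = backing_store   # stands in for the disk
        self.lines = {}

    def write(self, key, value):
        self.lines[key] = value
        self.store[key] = value      # immediate write: safe but slow

class DeferredWriteCache:
    """Writes land only in the cache and are marked dirty; the backing
    store catches up later, when the system is otherwise idle."""
    def __init__(self, backing_store):
        self.store = backing_store
        self.lines = {}
        self.dirty = set()

    def write(self, key, value):
        self.lines[key] = value
        self.dirty.add(key)          # fast, but the store is now stale

    def flush(self):
        """Write back the dirty lines, e.g. when the system is quiet."""
        for key in self.dirty:
            self.store[key] = self.lines[key]
        self.dirty.clear()

disk = {}
cache = DeferredWriteCache(disk)
cache.write("record-1", "new value")   # the disk still holds the old value here
cache.flush()                          # now the disk is consistent again
```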
With the rise of non-stop systems, though, delayed write is making a comeback even in the OLTP world. Oracle’s Oracle 9i DBMS, for example, uses delayed write even in its clustering software, Real Application Clusters. In this case, the underlying system is presumed to be good enough at automatically preventing data-writing conflicts that delayed-write caching’s speed advantages outweigh data-integrity concerns.
Eventually, as data is changed, the cache copy will be updated. One of the troubles designers must avoid here is wasting time updating cache copies that won’t be needed again.
On the flip side, if you’re using a write-invalidate system, in which the most recent change invalidates any other existing copies of the data, and your co-worker Esther makes other changes, then ideally the cache coherency protocol will validate and update her copy from your cached version of the data.
Sounds easy, but it’s not. For example, if you and Esther are simultaneously updating a database, your changes invalidate her copy, and her changes invalidate yours. Then both caches start registering misses because neither copy can be updated until it has been re-validated from the other. Missed reads and re-validations can quickly begin to clog up data communications. In the worst case, you end up with a virtual traffic jam on the bus or network, in what’s called the ping-pong effect.
Avoiding this usually requires data locking similar to that deployed in most DBMSs. Coordinating cache data locking and DBMS locking, as you might imagine, can be a real challenge.
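To see how quickly the ping-pong traffic piles up, here’s a deliberately simplified Python model of write-invalidate with two caches sharing a single block. The states, counters, and message tallies are illustrative only; no real protocol is this crude.

```python
from enum import Enum

class State(Enum):
    VALID = 1
    INVALID = 2

class ToyCache:
    def __init__(self):
        self.state, self.value = State.INVALID, None
        self.peers = []                      # other caches that may hold this block

    def write(self, value, traffic):
        self.value, self.state = value, State.VALID
        for peer in self.peers:              # write-invalidate: kill the other copies
            if peer.state is State.VALID:
                peer.state = State.INVALID
                traffic["invalidations"] += 1

    def read(self, traffic):
        if self.state is State.INVALID:      # miss: re-validate from a valid copy
            traffic["revalidations"] += 1
            source = next(p for p in self.peers if p.state is State.VALID)
            self.value, self.state = source.value, State.VALID
        return self.value

# Two writers taking turns on the same block: the ping-pong effect in miniature.
traffic = {"invalidations": 0, "revalidations": 0}
you, esther = ToyCache(), ToyCache()
you.peers, esther.peers = [esther], [you]

for i in range(100):
    you.write(i, traffic)
    esther.read(traffic)                     # her copy was just invalidated
    esther.write(i + 1, traffic)
    you.read(traffic)                        # now yours is stale
print(traffic)                               # hundreds of coherence messages for one block
```

Every trip around the loop generates invalidation and re-validation messages for the same block, which on a real bus or network is bandwidth stolen from useful work.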
Another fundamental problem is how to determine the optimal cache size for a system. No master key or algorithm exists to solve this one. You might think that bigger is always better, but you’d be wrong. As cache size increases, cache misses at first decrease, though not in a linear fashion. The ratio of misses to hits shrinks until the cache reaches its data-pollution point, which is determined by the size of the cache and its component memory blocks. Beyond that point, misses begin to increase again. Using smaller memory blocks, or pages, delays data pollution in large memory caches, but it won’t stop it.
Multiprocessor Cache Concerns
Multiprocessor systems must also contend with three other issues: memory, communications, and bandwidth contention.
At first glance it might seem that simply enabling processors to share a common cache would be a grand idea. The advantage of this approach is that applications that can make good use of shared data and code will need less memory and will execute more efficiently. The fatal disadvantage is that it only works well when each processor has direct, immediate access to the cache, a situation that exists only in well-designed SMP systems.
Even then, memory contention, which arises when multiple processors call on a single memory location, can cause serious performance problems.
Another problem is bandwidth contention, which can occur whether the processors are located on a network or on a common bus. The larger a multiprocessor system or cluster becomes, the more time it requires to carry out data communications, and the more likely it is that there will be a data traffic jam on the bus.
So it is that the most popular approach is to furnish every system processor with its own cache. Private caches avoid most memory-contention problems. However, this approach increases communications and bandwidth contention because coherency must be maintained among all the caches. And, needless to say, you then face the complexities of maintaining coherency across multiple caches.
Maintaining Coherency with Multiple Caches
So, how do you go about maintaining coherency with multiple caches? There are two main approaches: snoopy protocols, which constantly monitor bus transactions, and directory-based protocols, which keep a map of data memory locations.
In both cases, each cache controller looks for cache-consistency commands in the data stream. When it finds one, the controller processes it to see whether it applies to its own cache.
The disadvantage of these bus approaches is that buses have limited bandwidth. In particular, with the snoopy approach, the more often a controller “snoops” on the bus, the less bandwidth is available for the bus’ main job of ferrying information back and forth. In addition, the broadcast nature of cache-consistency commands eats up even more valuable bandwidth. Thus, you typically find snoopy protocols used in situations with very high bus speeds, such as Massively Parallel Processing (MPP) and Non-Uniform Memory Access (NUMA) systems.
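In rough terms, the snooping idea looks something like the following Python sketch. The bus object, the command name, and the method signatures are all invented for illustration; a real controller does this in hardware on every bus transaction.

```python
class Bus:
    """Broadcast medium: every transaction is visible to every snooper."""
    def __init__(self):
        self.snoopers = []

    def broadcast(self, command, address, origin):
        for cache in self.snoopers:
            if cache is not origin:
                cache.snoop(command, address)

class SnoopyCache:
    def __init__(self, bus):
        self.lines = {}              # address -> cached value
        self.bus = bus
        bus.snoopers.append(self)

    def write(self, address, value):
        self.lines[address] = value
        # Tell every other cache that its copy, if any, is now stale.
        self.bus.broadcast("invalidate", address, origin=self)

    def snoop(self, command, address):
        # Each controller checks whether the command applies to its own cache.
        if command == "invalidate" and address in self.lines:
            del self.lines[address]
```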
Directory-based protocols can track what’s in the cache either in a centralized location or with a distributed directory. Even though this requires additional hardware resources to act as a multi-system memory manager, the directory approach typically scales much better than snooping, and so it lends itself to most multiprocessor systems, including loosely coupled clusters and even Web caching.
More precisely, a cache directory maps every cached data block’s location. Each directory record holds a flag bit and pointers to the location of every copy of a data block. The flag bit, usually called the dirty bit, indicates whether a particular cache can write to a particular data block.
Many different directory plans exist. From a structural point of view, there are three kinds: full map, limited, and linked list. These almost always have one element in common: control structures that ensure that only a single cache can write to a data block at a given time. When that happens, the writing cache is said to have ownership of the data block. Each directory type also provides a way of transmitting a data-change notification through main memory and the caches.
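Reduced to its essentials, a generic directory record might look something like the sketch below. The field and method names are mine, and real implementations pack this information into a handful of bits per block.

```python
class DirectoryEntry:
    """One record per data block: which caches hold a copy, and whether
    a single cache currently owns the block for writing (the dirty bit)."""
    def __init__(self):
        self.sharers = set()    # pointers to every cache holding a copy
        self.dirty = False      # set while exactly one cache may write
        self.owner = None

    def record_read(self, cache_id):
        self.sharers.add(cache_id)

    def grant_ownership(self, cache_id):
        # Only one cache may write at a time; all other copies must be
        # invalidated before ownership is handed over.
        stale = self.sharers - {cache_id}
        self.sharers = {cache_id}
        self.dirty = True
        self.owner = cache_id
        return stale            # the caches whose copies must be invalidated
```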
Full-Map Directory
Full-map directories have the singular advantage of being able to store information on every block in memory. Each entry in this style of directory has one pointer for every processor on the system, plus two status bits and the dirty bit. The bit that corresponds to a processor indicates whether a particular block is present in that processor’s cache. The status bits, which are duplicated in the cache, show whether the block can be written to and whether it is a valid block.
As you might predict, full-map directories have problems. Because the directory is almost always centralized, processors tend to compete for it, which can lead to a performance bottleneck. Another concern is that searching and updating the directory can put an undue burden on both bus and caching performance.
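Here’s an illustrative sketch of a full-map entry along the lines just described. The processor count and block range are arbitrary, and the closing line is only there to show why the memory overhead grows with both processor count and memory size.

```python
class FullMapEntry:
    """Full-map record: one presence bit per processor, plus the two
    status bits and the dirty bit."""
    def __init__(self, num_processors):
        self.present = [False] * num_processors  # one bit per processor's cache
        self.valid = True       # status bit: is the block valid?
        self.writable = False   # status bit: may the block be written?
        self.dirty = False

    def note_copy(self, cpu):
        self.present[cpu] = True

    def sharers(self):
        return [cpu for cpu, bit in enumerate(self.present) if bit]

# A full map needs one such record for every block in main memory
# (4096 blocks here, but real memories are vastly larger).
directory = {block: FullMapEntry(num_processors=32) for block in range(4096)}
```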
Limited Directory
Limited directories avoid the memory troubles of full-map protocols. By requiring that only a limited number of pointers be maintained, they keep memory overhead within bounds. Otherwise, the structure resembles that of a full-map directory. When the number of copies of a particular memory block exceeds the number of pointers assigned to track them, the newest pointer replaces an earlier one in a process called eviction, which uses such methods as first-in/last-out to determine which pointer is discarded.
The problem here is that while a limited directory works well statistically, additional processing overhead is needed to avoid dumping data blocks along with their primary storage-directory information. Processor thrashing may also occur if the number of pointers is smaller than the number of processors referencing a limited set of data blocks.
Another problem with limited directories is that on scalable point-to-point interconnection networks, it’s very difficult to ensure that the point-to-point broadcast will reach every applicable cache. While thriftier with system resources than other directory or snoopy schemes, limited directories should be deployed cautiously.
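The pointer-eviction behavior behind that thrashing can be sketched in a few lines of Python. The four-pointer limit and the last-in eviction choice here are assumptions made for illustration, not part of any standard.

```python
class LimitedDirectoryEntry:
    """Keeps at most max_pointers sharer pointers per block. When a new
    sharer arrives and every slot is full, an existing pointer is evicted
    and that cache's copy has to be invalidated."""
    def __init__(self, max_pointers=4):
        self.max_pointers = max_pointers
        self.pointers = []      # sharer caches, in arrival order

    def add_sharer(self, cache_id):
        if cache_id in self.pointers:
            return None         # already tracked, nothing to evict
        evicted = None
        if len(self.pointers) >= self.max_pointers:
            # Discard the most recently added earlier pointer (first-in/last-out);
            # the evicted cache's copy is invalidated even if it is still in use,
            # which is where pointer thrashing comes from.
            evicted = self.pointers.pop()
        self.pointers.append(cache_id)
        return evicted          # the cache, if any, whose copy must be invalidated
```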
Linked-List Directory
The most successful directory scheme is the IEEE Scalable Coherent Interface (SCI), which is based on a doubly linked list. While singly linked list approaches are doable, SCI has become the predominant directory approach, in part because double-linking enables the system to send invalidation messages up and down the chain. The result is faster, more flexible adjustment to cache changes.
With SCI, when a processor calls for a data block, it doesn’t receive a copy from main memory. Instead, it receives a pointer to the most recently added cache copy. The requesting cache then asks for the data block from the cache owning the freshest version of the data. This cache replies by setting a pointer to the requesting cache and transmitting the data block. When multiple data-block requests are received, they are handled in first-in/first-out order.
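A much-simplified model of the SCI sharing list looks something like the following; the class names are mine, and the single invalidation walk glosses over the real protocol’s details.

```python
class SharingNode:
    """One cache's entry in a block's doubly linked sharing list."""
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.forward = None     # toward older copies (the tail of the list)
        self.backward = None    # toward newer copies (the head of the list)
        self.valid = True

class MemoryDirectory:
    """Main memory keeps only a pointer to the head of each block's list,
    which is the most recently added cache copy."""
    def __init__(self):
        self.head = {}          # block id -> head SharingNode

    def request_block(self, block, cache_id):
        node = SharingNode(cache_id)
        old_head = self.head.get(block)
        if old_head is not None:
            # The requester fetches the data from the current head, which
            # then links back to the new copy.
            node.forward = old_head
            old_head.backward = node
        self.head[block] = node  # the new copy becomes the head of the chain
        return node

    def invalidate(self, block):
        # An invalidation message walks the chain from the head downward.
        node = self.head.pop(block, None)
        while node is not None:
            node.valid = False
            node = node.forward
```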
Of course, hybrid cache coherency systems that use both snooping and directory methods are not only possible, they’re commonly implemented. In 2003, there is no golden rule to determine which approach, or which mix, is ideal for general computing cases.
Cashing in on Cache
As you can tell, cache coherency is difficult to do well. But without cache coherency methods ensuring that there is a consistent global view of cached data, data corruption and lousy system performance are not far away. Like it or lump it, successful high-performance computing, or, with today’s 3GHz-and-faster processors, even single-CPU computing, requires successful cache coherency implementations. It’s as simple, and as complex, as that.