Practical Technology

for practical people.

Cache 101: What your Mother should have told you


The easiest way to give any computer more oomph is to add a cache. In the eternal race of CPU vs. memory access speeds, CPUs win hands down. Modern CPUs need information faster than memory can provide it and much faster than hard disks allow. The way to provide data-starved processors with their minimum daily requirement of bytes is to dedicate a high-speed chunk of memory to storing information for the CPU: a cache.

Normally, we think of caches as speeding up disk accesses by storing high-demand information in RAM, where it can be snagged in nanoseconds instead of milliseconds. While this is important, caches are equally vital for supplying the CPU with data faster than it can be fetched from main memory.

When processors clunked along at a measly 10MHz, memory caching wasn’t that important; everything in the system was slow. With the coming of faster processors, like the 80386, those days are gone. Now CPUs are forced to wait a clock cycle or more for data, hence the phrase “wait state.”

Let’s look at a real-world example. A 25MHz 386 has a clock cycle time of 40ns. That’s the least amount of time the CPU needs for its fastest instruction. Usually, the CPU takes two cycles to read or write to memory. To keep up with the CPU running flat out, you would need memory with 80ns access times, which, as it happens, is perhaps the most common speed of fast dynamic RAM (DRAM) around.

Unfortunately, it isn’t that simple. DRAM has to be recharged every time it’s used, and the recharge time is typically 90 percent of the access time. Get the picture? An 80ns DRAM chip has a cycle time of 152ns. That’s already too slow, but DRAM has to pay another performance penalty: DRAM chips must be refreshed frequently or they lose their stored data. Typically, this puts about an additional 8 to 10 percent bite on performance. Ouch.
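
If you like to see the arithmetic spelled out, here it is as a scrap of C. The figures are the ones above; the 90 percent recharge and the refresh tax are the rough rules of thumb just mentioned:

    /* Back-of-the-envelope memory timing, using the figures above. */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz   = 25.0;                  /* 25MHz 386 */
        double cycle_ns    = 1000.0 / clock_mhz;    /* 40ns clock cycle */
        double access_ns   = 2.0 * cycle_ns;        /* two cycles per memory access: 80ns */

        double dram_ns     = 80.0;                  /* rated DRAM access time */
        double recharge_ns = 0.9 * dram_ns;         /* recharge: roughly 90% of access time */
        double dram_cycle  = dram_ns + recharge_ns; /* effective cycle time: 152ns */

        printf("CPU wants data every %.0fns; DRAM can deliver every %.0fns\n",
               access_ns, dram_cycle);
        printf("...and refresh steals another 8 to 10 percent on top of that.\n");
        return 0;
    }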

There are ways around this problem. Interleaved memory separates memory into banks. In motherboards with two banks, even-numbered memory locations are assigned to one bank while their odd-numbered brethren are assigned to the other bank. When data is read sequentially, this reduces wait states since data is picked up first from one bank and then the other. This gives each bank enough time to take care of its housekeeping chores before being called on again for access.

Data isn’t always accessed in such a straightforward manner, however. When it’s not, and the CPU needs data kept in two different locations in the same bank, the CPU will again waste clock cycles waiting for memory.
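
Here’s a quick C sketch of how two-way interleaving doles out addresses. Using the low bit of the word address to pick the bank is an assumption for illustration; real motherboards key the banks off their own addressing scheme:

    /* Two-way interleaving sketch: even words go to bank 0, odd words to bank 1.
       Sequential reads ping-pong between banks, giving each bank time to recover;
       two back-to-back reads that land in the same bank do not. */
    #include <stdio.h>

    static int bank_of(unsigned long word_addr)
    {
        return (int)(word_addr & 1);    /* low bit of the word address picks the bank */
    }

    int main(void)
    {
        unsigned long sequential[] = { 100, 101, 102, 103 };  /* alternating banks: no stall */
        unsigned long same_bank[]  = { 100, 102 };            /* both even: bank 0 twice */
        int i;

        for (i = 0; i < 4; i++)
            printf("word %lu -> bank %d\n", sequential[i], bank_of(sequential[i]));
        printf("words %lu and %lu both land in bank %d: wait states again\n",
               same_bank[0], same_bank[1], bank_of(same_bank[0]));
        return 0;
    }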

PAGING COMMANDER DATA

Paging is another attempt to overcome memory speed hurdles. In paging, memory is addressed not by absolute location, but by memory sectors, or “pages.” Data addresses in memory aren’t shifted dynamically; instead, the memory address is static, thus this scheme’s other name, “static column.”

As long as the data can be found in a single page, typically 2K in size, the system doesn’t waste time. Once the CPU asks for data outside the current page, all the usual problems of slow memory reappear.
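
In C, the page check boils down to a division and a comparison. The 2K page size is the figure cited above; the addresses are made up for illustration:

    /* Paged ("static column") access sketch: as long as successive addresses
       fall in the same 2K page, the access is fast; crossing a page boundary
       brings the full DRAM delay back. */
    #include <stdio.h>

    #define PAGE_SIZE 2048UL

    int main(void)
    {
        unsigned long addresses[] = { 0x1000, 0x1040, 0x17FF, 0x1800 };  /* last one leaves the page */
        unsigned long last_page = (unsigned long)-1;    /* no current page yet */
        int i;

        for (i = 0; i < 4; i++) {
            unsigned long page = addresses[i] / PAGE_SIZE;
            printf("0x%04lX: %s\n", addresses[i],
                   page == last_page ? "same page, fast" : "new page, slow");
            last_page = page;
        }
        return 0;
    }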

While both interleaving and paging help, neither is enough to avoid wasteful wait states completely. In fact, they’re often combined to wring the most benefit from each approach, but caching is the solution most effective at eliminating all but a fraction of performance-draining wait states.

Cache success is usually measured by the percentage of memory accesses that find the needed byte of information in the cache. Every time the CPU gets the required data, the cache is said to have a hit. Cache efficiency is judged like a baseball player’s batting average: the number of hits compared to the number of times the CPU asks for data gives the cache its “hit ratio.” An average hovering around .350 would suit Wade Boggs or George Brett, but for a cache it stinks. Successful caches should have a hit ratio above 90 percent, or in baseball terms, a cache should bat above .900. Ted Williams, eat your heart out.
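
The batting average itself is nothing more than a division, as this throwaway C snippet shows. The access and hit counts are hypothetical:

    /* The "batting average" in code: hits divided by total accesses. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long accesses = 100000;    /* times the CPU asked for data (hypothetical run) */
        unsigned long hits     = 93500;     /* times the cache had it */
        double hit_ratio = (double)hits / (double)accesses;

        printf("Hit ratio: %.1f%% (batting %.3f, in baseball terms)\n",
               hit_ratio * 100.0, hit_ratio);
        return 0;
    }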

It takes more than a high average to make a good ballplayer or cache, though. An all-star cache needs a good balance of four factors. These are: 1) hit ratio, 2) hit speed, 3) miss speed, and that all-important one, 4) cost.

Hit speed refers to how fast the cache can come back with data if it has the information in storage. Faster is always better here. The Rickey Hendersons of caching use 20ns SRAM. Unfortunately, SRAM, like star base stealers, doesn’t come cheap. No one ever said making an MVP cache was easy.

Miss speed is how long it takes a cache to report that the requested information isn’t within its confines. Slow miss-reporting can put any cache on the disabled list.

Cost can never be overlooked. Designing the best possible memory system is really quite simple: Just use the fastest SRAM you can find, get a good caching algorithm, and make it serve as both primary and secondary storage. The rub is that you’ll feel like you’re paying Darryl Strawberry’s salary out of your budget. It simply isn’t feasible to build this kind of system–the cost is too high.

CACHE TRADE-OFFS

For both developers and end users, cache size is often the first consideration. Arguments have raged for years over how big a cache should be. The answer, and you’re going to hate this, is that it depends. There isn’t even a good rule of thumb. “Bigger is better” (the most often quoted guideline) is only true up to a point before you run into diminishing returns.

A cache’s hit ratio depends on a delicate balance of elements. At least as important as cache size is the size of data elements within the cache. These elements are called “data lines.”

In a simple cache, when the CPU calls for data, the cache controller–or memory management unit (MMU)–searches a tag list. The tag list can be thought of as an address list. Each tag is attached, by either pointers or a linked list, to the appropriate data line. Longer data lines increase cache efficiency more than large cache size. A 64-byte data-line design, for example, should be almost an order of magnitude more efficient than a 4-byte design in a 64K cache.
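
To make the tag-and-data-line idea concrete, here’s a bare-bones C sketch of a single cache entry. The 16-byte line width is an assumption for illustration, not any particular controller’s layout:

    /* A minimal picture of a cache entry: a tag (which block of memory the
       line came from) bolted onto the data line itself. */
    #include <stdio.h>

    #define LINE_SIZE 16

    struct cache_line {
        int           valid;            /* does this line hold anything yet? */
        unsigned long tag;              /* which memory block it came from */
        unsigned char data[LINE_SIZE];  /* the cached bytes themselves */
    };

    int main(void)
    {
        struct cache_line line = { 1, 0x1F3A0 / LINE_SIZE, { 0 } };
        unsigned long address = 0x1F3A8;    /* falls inside the cached block */

        /* A hit means the tag matches the block the address falls in. */
        printf("address 0x%05lX: %s\n", address,
               (line.valid && line.tag == address / LINE_SIZE) ? "hit" : "miss");
        return 0;
    }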

Again, though, there’s no easy way of determining the right mix of data-line length and cache size. The type of application and how these applications access memory are as important as any design considerations in finding the best possible cache. Compromise is a way of life in everything.

CACHES AND STILL MORE CACHES

Caches are classified by how they address data storage. The quick and dirty way is direct mapping. In direct mapping, the cache’s data lines have a one-to-one correspondence with the data addresses in memory or storage.

While direct mapping is popular, it has a weakness. Since the cache is usually smaller than memory, and always smaller than mass storage, each direct address line in the map corresponds to more than one section of memory. When the same cache line is used to refer to more than one part of memory or storage, performance drops through the floor. Adding insult to injury, this occurs even if there’s still free room in the cache.
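
The arithmetic behind direct mapping, and the collision it invites, fits in a few lines of C. The 64K cache and 16-byte lines are illustrative figures, not a specific product:

    /* Direct mapping in miniature: the line an address lands in is fixed by
       arithmetic, so two addresses exactly one cache-size apart collide on
       the same line even when every other line is empty. */
    #include <stdio.h>

    #define CACHE_SIZE 65536UL
    #define LINE_SIZE  16UL
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

    static unsigned long line_index(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_LINES;
    }

    int main(void)
    {
        unsigned long a = 0x12340;
        unsigned long b = a + CACHE_SIZE;   /* exactly one cache-size away */

        printf("0x%lX -> line %lu\n", a, line_index(a));
        printf("0x%lX -> line %lu (same line: they evict each other)\n",
               b, line_index(b));
        return 0;
    }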

Consequently, designers have looked at fully associative caches. In these designs, cache data lines are assigned dynamically. While these caches have great hit ratios, their hits may not come back as fast as a direct-mapped system’s. The reason is that although an associative cache can adapt to your CPU’s memory demands, it takes slightly longer to find the required data.

For example, in a direct-mapped cache, the cache controller always knows where to go for a particular chunk of information. That information may not be in the cache, but the controller doesn’t need to waste time looking any further. An associative cache program can’t make that assumption, so it must search through the entire cache in poor implementations, or a lookup table in better schemes, before reporting a miss.
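
In C, the brute-force version of that associative lookup is just a loop over every tag. The handful of lines and the made-up tags below are purely for illustration:

    /* Fully associative lookup, the slow way: any line can hold any block,
       so finding out whether a block is cached means checking every tag
       (a real controller would use a lookup table or comparator hardware). */
    #include <stdio.h>

    #define NUM_LINES 8
    #define LINE_SIZE 16UL

    static unsigned long tags[NUM_LINES] = { 3, 17, 901, 42, 7, 7000, 12, 55 };

    static int lookup(unsigned long addr)
    {
        unsigned long block = addr / LINE_SIZE;
        int i;

        for (i = 0; i < NUM_LINES; i++)     /* every tag gets examined... */
            if (tags[i] == block)
                return 1;                   /* ...before we can call it a hit */
        return 0;                           /* ...or a miss */
    }

    int main(void)
    {
        printf("block 42:  %s\n", lookup(42 * LINE_SIZE)  ? "hit" : "miss");
        printf("block 999: %s\n", lookup(999 * LINE_SIZE) ? "hit" : "miss");
        return 0;
    }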

Set-associative caches attempt to get the best of both worlds. The best known examples of this kind of caching are the Intel 82385 cache controller chip and the 80486’s internal 8K cache. The Motorola 68040’s internal cache is even more sophisticated. It breaks the cache into one area for data and another for instructions for faster performance.

The 486 uses a four-way set-associative design. The 8K cache is treated as four 2K blocks. Each block is made up of 128 16-byte-wide data lines. Within the block, the data lines are bundled together in 32 four-line sets. These sets correspond to absolute memory addresses the same way a direct-mapping cache would. At the same time, the data lines within a set can be dynamically assigned to any data within the set’s fixed range.
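
A rough C sketch of a four-way set-associative lookup looks like this. The 16-byte lines and 128 sets loosely echo the 486’s figures, but the layout is an illustration, not Intel’s implementation:

    /* Four-way set-associative lookup in miniature: the set is picked by
       direct-mapping arithmetic, then only the four lines in that set are
       searched. (A real cache also keeps a valid bit per line.) */
    #include <stdio.h>

    #define LINE_SIZE 16UL
    #define NUM_SETS  128
    #define WAYS      4

    static unsigned long tags[NUM_SETS][WAYS];  /* zeroed: an empty cache */

    static int lookup(unsigned long addr)
    {
        unsigned long block = addr / LINE_SIZE;
        unsigned long set   = block % NUM_SETS; /* fixed, like direct mapping */
        int way;

        for (way = 0; way < WAYS; way++)        /* flexible, like associative */
            if (tags[set][way] == block)
                return 1;
        return 0;
    }

    int main(void)
    {
        tags[(0x8000 / LINE_SIZE) % NUM_SETS][2] = 0x8000 / LINE_SIZE;  /* pretend this block was cached */

        printf("0x8000: %s\n", lookup(0x8000) ? "hit" : "miss");
        printf("0x9000: %s\n", lookup(0x9000) ? "hit" : "miss");
        return 0;
    }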

The resulting cache has the quickness of direct mapping combined with the flexibility and high hit rates of fully associative caching. It should come as no surprise to learn that set-associative caching designs can be found in most cache controllers.

ALGORITHM DU JOUR

There are other important performance issues. One of the most critical is when to fetch data from memory or storage. Some algorithms only call for data when it has been requested and hasn’t been found in the cache. Other algorithms try to predict what information will be called for next and request data before the CPU calls for it.

The idea of pre-fetching data is an attractive one, but it’s not easy to make it work effectively. After all, if the cache guesses wrong, time is wasted not only extracting data but dealing with the excess baggage of unwanted data. Worse yet, you may have gotten rid of needed data to make room for useless information.

Data-line length must be considered even more carefully than usual in pre-fetch designs. Longer data lines, for instance, can lessen a cache’s efficiency by wasting space when unneeded data is loaded.

Pre-fetch systems usually rely on the fact that if the CPU just called for one block of data, the odds are that the next block asked for will be the following sequential block. For instance, many disk-caching programs read an entire track of data. The software is betting that the data is unfragmented on the disk and that the CPU will soon need the information stored in the next sector. With a sufficiently large cache (larger is better here), pre-fetching can drastically improve the cache’s hit ratio.
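
In skeleton form, one-block lookahead amounts to the bit of C below. The tiny 16-slot cache and the helper routines are stand-ins for illustration, not any caching product’s actual logic:

    /* One-block lookahead: on a miss, fetch the block that was asked for
       and the next sequential one, betting it will be wanted soon. */
    #include <stdio.h>

    #define SLOTS 16
    static unsigned long cached[SLOTS];     /* blocks currently held (zero = empty) */

    static int in_cache(unsigned long block)
    {
        int i;
        for (i = 0; i < SLOTS; i++)
            if (cached[i] == block)
                return 1;
        return 0;
    }

    static void fetch(unsigned long block)
    {
        static int next = 0;
        cached[next] = block;               /* naive placement, no replacement policy */
        next = (next + 1) % SLOTS;
        printf("  fetched block %lu\n", block);
    }

    static void access_block(unsigned long block)
    {
        if (in_cache(block)) {
            printf("block %lu: hit\n", block);
            return;
        }
        printf("block %lu: miss\n", block);
        fetch(block);
        fetch(block + 1);                   /* the gamble: pre-fetch the next block too */
    }

    int main(void)
    {
        access_block(512);  /* miss: fetches 512 and 513 */
        access_block(513);  /* hit, thanks to the lookahead */
        access_block(700);  /* jump somewhere else: miss again */
        return 0;
    }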

It sounds great, but this one-block lookahead approach can increase traffic on both the bus and in memory. If you’re running a multiprocessor system, like some Compaq Systempros, or already pushing the bus’s bandwidth to the limit with multimedia applications, this method will start degrading overall system performance.

A more sophisticated approach is to pre-fetch only after a miss. There’s no such thing as a free lunch, however, and the check here is an increase in I/O traffic and a decrease in cache hits compared to always pre-fetching. In short, you don’t put as much of a burden on the bus or memory, but you don’t get as many hits either.

Even more sophisticated caches pre-fetch data not only on misses but when there’s a hit on a pre-fetched data line. In other words, the cache anticipates that when the CPU asks for new data, it’s likely to want other data associated with its new inquiries.

Another consideration is whether both the cache and memory should be searched concurrently. Most caches are “look through” designs. In this case, main memory isn’t searched until a miss has been reported. Some designs, called “look aside,” search both the cache and memory simultaneously.

There are several pitfalls to be wary of in the look-aside approach. The cache controller must be able to manage both searches and resolve any conflicts. This is perfectly possible, but the controller must be more powerful than its look-through brother and the cache’s code must be more tightly written. In layman’s terms, this means more things that can go wrong.

All caches have to deal with the problem of throwing out the trash, i.e., stale data. A cache only has so much room, and when it’s full, room must be made for new data.

This problem can be attacked in three ways. The first has the advantage of simplicity. The cache simply throws data out at random. It works, it’s easy to build, and it’s a lousy way to keep up your hit ratio in large caches. Curiously enough, studies show that in certain kinds of small caches, random replacement works better than other approaches. Random replacement can’t be dismissed out of hand.

The next method can also be put together by almost any tinkering programmer. This approach uses the tried and true first-in, first-out (FIFO) concept. Data moves through the cache in an orderly fashion. The oldest information is thrown out as new information comes in.

Developers are never content simply to use easy methods. Programmers have been working for years on implementations of least recently used (LRU) algorithms. In these schemes, the data that’s thrown out of the cache is the data which has gone the longest without being used. When properly implemented, LRU is the most successful approach for most caches.

That’s easier said than done. LRU has two strikes against it. First, every data line must have counter bits. After all, you need some way of telling which data line was accessed last. Second, the cache controller or program has to burn time deciding which data has gone the longest unused.
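
Here’s roughly what an LRU scheme looks like in C, counter bits and all. Eight lines keep the sketch small; a real cache controller does this in silicon, not in a loop:

    /* LRU replacement: every line carries a "last used" stamp, and when room
       is needed the line with the oldest stamp is the one thrown out. */
    #include <stdio.h>

    #define LINES 8

    static unsigned long tags[LINES];       /* which block each line holds */
    static unsigned long last_used[LINES];  /* the counter bits */
    static int           valid[LINES];
    static unsigned long now;               /* ticks up on every access */

    static void access_block(unsigned long block)
    {
        int i, victim = 0;

        now++;
        for (i = 0; i < LINES; i++)
            if (valid[i] && tags[i] == block) {
                last_used[i] = now;         /* hit: refresh its stamp */
                printf("block %lu: hit (line %d)\n", block, i);
                return;
            }

        for (i = 0; i < LINES; i++)         /* miss: find an empty line or the LRU victim */
            if (!valid[i]) { victim = i; break; }
            else if (last_used[i] < last_used[victim]) victim = i;

        printf("block %lu: miss, filling line %d\n", block, victim);
        tags[victim] = block;
        last_used[victim] = now;
        valid[victim] = 1;
    }

    int main(void)
    {
        unsigned long i;

        for (i = 1; i <= 9; i++)    /* nine blocks into eight lines: block 1 gets evicted */
            access_block(i);
        access_block(1);            /* miss: it had gone the longest without being used */
        return 0;
    }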

TO WRITE OR NOT TO WRITE

The most heated issue in cache design is when to write data. The belts-and-suspenders crowd believes in “write through.” In this approach, there’s no question about it–when you make a change to data in the cache, the change is immediately passed along to memory and/or the mass-storage device. Of course, while this is happening, the processor is spinning its wheels waiting to get back to work.

That’s why some designers take a different tack. Their approach is “posted write,” also called “write back.” In this method, changes are posted to the cache at top speed, but the cache controller waits until it is idle to write the change to memory.

In posted write, there are two methods of handling data writes. In “always copy-back,” data is written to memory eventually even if the only change is that a line of data has been moved out of the cache. The other approach, “flagged copy-back,” calls for changes to be written only if the cached information has been changed.
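
Reduced to a dirty bit, the difference between write through and flagged copy-back looks like the C below. It’s a sketch of the idea, not any controller’s actual logic:

    /* Write through vs. flagged copy-back: write through sends the change to
       memory immediately; flagged copy-back just marks the line dirty and
       memory catches up only when the line is finally flushed. */
    #include <stdio.h>

    struct line {
        unsigned long block;
        unsigned char data[16];
        int dirty;
    };

    static void write_through(struct line *l, int offset, unsigned char value)
    {
        l->data[offset] = value;
        printf("write through: block %lu updated in memory right now (CPU waits)\n", l->block);
    }

    static void copy_back_write(struct line *l, int offset, unsigned char value)
    {
        l->data[offset] = value;
        l->dirty = 1;       /* note the change; memory stays stale for the moment */
        printf("copy-back: block %lu marked dirty, memory not touched yet\n", l->block);
    }

    static void flush(struct line *l)
    {
        if (l->dirty) {     /* "flagged": write back only if something changed */
            printf("flush: writing block %lu back to memory\n", l->block);
            l->dirty = 0;
        } else {
            printf("flush: block %lu is clean, nothing to write\n", l->block);
        }
    }

    int main(void)
    {
        struct line a = { 7, {0}, 0 };
        struct line b = { 9, {0}, 0 };

        write_through(&a, 0, 0x42);     /* memory is never stale, but the CPU waited */
        copy_back_write(&b, 0, 0x42);   /* fast for the CPU... */
        flush(&b);                      /* ...memory catches up later */
        return 0;
    }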

No matter which way you go with posted write, there are some unavoidable problems. The one that makes users sweat is that out-of-date data can be held in memory for up to seconds at a time before the cache updates the stale information. If your system happens to get hit by the mother of all power surges, you can kiss that information goodbye.

What’s that you say? You have an uninterruptible power supply and don’t need to worry about a blackout? Well, there’s another dimension to this problem that doesn’t get mentioned often. Other devices in or attached to your computer also access memory through direct-memory access (DMA), which can bypass not only the CPU but the cache and its controller as well.

As a result, your fax card or modem may pick up wrong data. This can foul things up completely. Or a peripheral can change data in main memory, thus once more throwing a monkey wrench into your system’s memory management.

There are ways to tackle this problem. The i82385, as well as the 68040, uses what’s known as “bus snooping.” By watching all memory transfers across the bus, both CPU- and peripheral-initiated, DMA changes can be recorded properly in the cache.

One must wonder, though, if the benefits of posted writes are worth the trouble. At the outside, only 15 percent of your data transactions are writes. The rest are reads. It hardly seems worth exposing your data to any danger to speed up such a small percentage of data transactions.

CACHE COHERENCY

Some trouble with keeping data current is unavoidable in modern systems. As chips that feature onboard caches like the 486 and the Motorola 68040 become more common, this problem will pop up more often. Now computers frequently have chip-, controller-, and software-based caches. This is fine, but how do you get three different caches, memory, and storage all to agree on the state of any given byte of accessed data? In other words, how do you keep data coherent? There are no easy answers.

For today’s solutions, most software and hardware rely on bus snooping with write-through techniques. There are other approaches, but they belong more to the realm of theory than practice. This will change since the ever-increasing speeds of processors demand higher-performing caches.

With many cache-design elements, trade-offs are difficult to make and hard to optimize. It’s no wonder that there’s such a profusion of caching software and hardware. Nonetheless, several practical lessons can be drawn from cache theory.

The first lesson is that every computer needs a cache. Even if the memory can keep up with the CPU, the hard and floppy drives won’t be able to. There’s no better way to give your PC a kick in the pants than to add a cache.

A cache doesn’t need to be big. While trial and error is the only way you’ll discover what cache size works best for you, more than 2MB of cache will be overkill on most systems for most applications. Like so many other things in caching, even this vague rule is not written in stone. Multiuser systems or file servers probably can use all the cache you can stuff into them.

Cache design is more important than size. The only way you’ll be able to judge that is from head-to-head comparisons of caches. Cache designers don’t give out their recipes, and it wouldn’t be much help even if they did. There are so many cache-design considerations that the only way you can tell what works for you is to give each cache a try.

The one design issue that users can sink their teeth into is memory-update policies. When you need the fastest possible cache, copy-back designs should be what you look at first. Everyone else would be better off with write-through designs.

One issue that I haven’t touched on is hardware caches vs. software caches. For memory/CPU consideration, hardware caches are the only answer. When it comes to caching data from disks, however, it’s another story entirely.

Hardware caches, such as those found on caching controllers, tend to be faster than software-based caches and particularly shine at writing data. Yet they cost three to four times the price of a caching program. For most users, software caching paired with additional motherboard RAM is a clear win. Network and multiuser operating systems will be better off with hardware caches.

Good caches may be difficult to make, but their value is indisputable. It’s clear that the future will continue to bring caching improvements as developers continue to wrestle with the difficult decisions required to make successful caches. As users, we will all benefit as caching technology matures and becomes more tightly integrated with system design.
First published in Ziff-Davis’ Computer Shopper
