Practical Technology

for practical people.

October 29, 1992
by sjvn01

Gophering the Internet

Let’s face it, getting the most out of the Internet isn’t easy. Even archie, described in my last visit, only helps with one specific area of net use: finding and ftping files. Probably more than a few of you have been saying, “The net is neat, but why does it have to be so hard?” Well, these days, with the right software, it doesn’t have to be so hard. There are two user-friendly programs that make using the Internet’s resources easier than ever before: gopher and wais.

Before this dynamic duo showed up, some of the net’s most valuable resources were only available to a lucky few in the know. The most important of these resources are the online databases. These systems provide public access to everything from library catalogs to technical documentation collections. Unfortunately few people knew how to access these databases. Now, with gopher at your side, you can liberate this information for your own uses.

Gopher

Gopher and wais may sound like Ren and Stimpy, but they’re anything but cartoons. Unlike the other tools I’ve been looking at, ftp and archie, gopher is a general purpose information tool. Gopher builds on the foundation of ftp, archie and other information sources to erect an easy-to-use, menu-driven interface to the net’s file and informational resources.

Gopher was ‘born’ at the University of Minnesota, home of the Golden Gophers. Its name is a bad pun on the university’s sports team name and the program’s purpose: it will ‘go-fer’ the data.

Unlike archie, which relies on a centralized archie database of ftpable files, gopher doesn’t rely on any particular data collection. To use the analogy of a library, archie is like a card catalog dedicated to publicly available files. Gopher, on the other hand, is like a librarian. Gopher doesn’t know where a particular item is, but it does know where to find out where information is hiding.

The best thing about gopher is that you don’t have to have a clue about where some file or bit of information is. IP addresses, file formats, domain names: forget ’em. With gopher you don’t need to know Unix esoterica. Gopher does the dirty work; all you have to do is pose the questions.

Getting a Gopher

To get started with gopher, you need a gopher client on your system. If you don’t have one, you can telnet your way to a site with a publicly accessible gopher client the same way you can access archie.

You shouldn’t have to do this, however. Gopher client programs are free and come in makes and models for almost every architecture and operating system under the sun. While the Unix character-based interface is the most common front-end to gopher, you can also get a HyperCard-style gopher for the Macintosh and a DOS character-based sub-species for PCs. You can get the right one for your system by using archie to find a nearby site with ftpable gopher files. If all else fails, you can always find the gopher clients at the site boombox.micro.umn.edu in the pub/gopher directory. Always look for a closer site first, however; every net expert knows about boombox, and that site can be very busy.
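
If you do end up fetching a client by hand, the anonymous ftp session looks roughly like this; the package name is a placeholder, since the actual client files change from release to release:

ftp boombox.micro.umn.edu
Name: anonymous
Password: your_email_address
ftp> cd pub/gopher
ftp> ls
ftp> binary
ftp> get gopher_client_package
ftp> quit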

Once you’ve installed the program, usually a simple task, although you will need system administrator privileges on Unix systems, you type gopher at your command prompt. You’re then presented with a set of menus. You then select the choice that looks like the best path to your informational destination. You could, for instance, decide that you want to find a library with a copy of the newest Tom Clancy thriller.

In the pre-gopher days, you’d do this by telneting to every computerized library catalog you could think of. Right, like everyone knows the IP addresses or domain names for automated card catalogs. Even after you found your library and its access point, you then faced the problem of how to log into the system. Some just let you right in, some require a user id of ‘guest,’ others ‘anonymous,’ and so on. And, if you made it that far, you’d have to figure out that system’s own particular idiosyncrasies. It’s no wonder that until recently Internet information access has been a black art practiced only by net gurus.

Go-Fer It.

With gopher, however, the gopher system takes care of all this. Your gopher client starts looking for the information, first checking for local resources, usually a gopher server, or telneting to a preset gopher service.

Gopher clients come with a pre-coded gopher server they look to for information, but this can, and should, be changed to access the closest available gopher server. The server, presented with your request, then tries to figure out where to find the information. All you know, sitting at your desk, is that a few seconds after you start your inquiry, gopher has presented you with menu choices that take you closer to your destination.

These choices come in two forms: resources and directories. A directory, marked with a ‘/’ at the end of its menu item, indicates that choosing this item will lead you to a sub-menu. Resources are, like the name indicates, actual sources of information.
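
Here, for illustration only, is roughly what a gopher menu might look like; the entries are invented, and the only thing to notice is the trailing ‘/’ on the directory items:

1. About this gopher server
2. Campus information/
3. Libraries/
4. Other gopher and information servers/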

From the menus, you proceed to narrow down your choices until you reach an appropriate resource. In our search for our Clancy high-tech shoot-’em-up, for example, we can probably live without looking into card catalogs for libraries thousands of miles away.

Eventually, you end up with what you and gopher agree is probably a system or program that can supply you with the information or file you need. At this point, you and gopher go-fer it.

When you access a resource, gopher takes over the job of logging in to the remote computer and service. Gopher also shields you from the local system. No matter what you’re logged into, you use gopher’s search interface, not the remote system’s.

This has one great advantage: you never have to learn the ins and outs of a database you may use only once. There are, however, two mirror-image problems with gopher’s approach. The first is that while gopher can perform fairly complicated searches, you may not know whether the software gopher is talking to can handle them.

Archie, for example, can only search on a single word. Or, you could try searching for, say, ‘386DX and Unix.’ Some systems take that to mean you want to know about books or articles which contain both the words ‘386DX’ and ‘Unix.’ Others assume you really want the phrase ‘386DX and Unix.’ With gopher standing in the middle, all you’ll know is that your searches are going wrong, not why.

The flip side to this is that gopher defaults to lowest-common-denominator searching. The resource you’re accessing may be capable of very precise searches, but you’ll be limited to gopher’s search capabilities. Gopher, for instance, can’t tell the difference between lowercase and uppercase. This may not be a big deal if you’re only occasionally on a data hunt, but big-time information hunters will want to throw gopher out the window. At least they’ll feel that way until they recall how much work hunting for information without gopher is.

Another thing you should keep in mind with gopher is that sometimes it may dig up an information resource that you can’t access. The most common examples of this are the news services. The UPI news feed, for example, is available on many academic sites, but it’s inaccessible from most commercial sites no matter what the gopher menu says.

Another interesting point is that not all gopher servers are the same. Some servers may be much stronger in certain areas than they are in others. That’s because gopher servers tend to be best connected to local resources. The original gopher server at the University of Minnesota, to no surprise, is filled with information resources from that school. One of the neater things about gopher, however, is that you’re not limited to a single server. You can use gopher to hunt for other gopher servers that might give you access to information that’s more your speed.

Flaws and all, while you’ll never mistake gopher for a powerful single-purpose online information-retrieval engine like Ziffnet’s Computer Library, gopher does have its good points. Because it brings the net’s almost limitless information resources within your reach, gopher is an invaluable tool for any Internet explorer. And gopher’s not the only tool making the Internet a better place for information hunters. Next time around, I’ll take a look at wais and, still making the transition from experiment to essential, the Web.

A version of this story was published in Computer Shopper in 1993.

September 26, 1992
by sjvn01

Hunting Data on the ftp Line with Archie

Is there anyone alive who doesn’t thrill to the idea of exploration? While we can’t be Indiana Jones, it is possible to search for temples of software treasure in the Internet. To computer fans, a new program or an interesting data file can be as exciting as any golden idol. And, besides, we never ever have to run into snakes!

In our last go-around, I explained the basics of using ftp (file-transfer protocol) to find and download files from Internet sites. Thanks to sites that allow anonymous ftp logins, it’s possible for anyone with net access to download files… if you know where to find them.

While ftp enables you to search around in a public directory for files, if the files you’re looking for aren’t there, hunting for them won’t do you much good. Worse still, looking for a specific file can be like looking for a needle in a haystack. The virtual universe of cyberpunk science-fiction is already with us in that the Internet really can be a confusing and confounding maze. Fortunately, for all would-be network explorers, there’s a faithful guide to lead you through the deepest, darkest network nodes: archie. Archie is a database program that does the hard work of finding and indexing files throughout the Internet. Instead of tracking down elusive files on your own, you can contact an archie site and use archie to find your elusive quarry.

Archie sites are scattered across the world. [See Table 1] While you can use any archie server, it’s fair play to use the one for your geographical area. Not only should this take up less online time, but by spreading out the load, the service is more likely to avoid being overburdened.

To use archie, you have three choices. The most popular one is to use telnet to contact the closest Internet site with a copy of archie. To perform a search, you’d go through a process like this:

telnet archie.sura.net

When you have a connection, you then log in as “archie.” You will not be asked for a password. The next thing you’ll see is a prompt like:

archie:

From here on out you’ll be talking to the archie program rather than the operating system the way you normally do with telnet.

The first command you should enter is:

archie: show

This displays all of archie’s settings. For the most part you can ignore these, but two you may care about are pager and search.

Pager tells archie whether you want it to dump its results to the screen or whether you want it to wait for either a carriage return or space bar after filling the screen. If you’re unable to scroll your screen back to catch something that just disappeared off the top of your display, set the pager on with:

archie: set pager

Search defines how archie goes about looking for your entry. Different archies have different search defaults. For example, some default to case-sensitive searches, while others couldn’t give a hoot as to whether your search string is in all caps or all lowercase.

If archie comes back with a search value of “exact,” then you’ll know that it’s currently set to find only exact matches, capitalization included. That is to say, a search for CHess would not find Chess. File names being what they are, that probably isn’t what you want.

You can set this parameter to your liking with the command:

archie: set search X

X can have one of four values. Exact we’ve already seen; your other choices are regex, sub, and subcase. When you use regex, the search string works like a Unix regular expression. In short, wildcards like * and ? are allowed. I could search for Ch* and get all references to Chess or Chessmate. Unfortunately, I’d also miss CHESS.

If you search using sub, your expedition will find any program names that contain your search phrase regardless of case. For instance, a search with “CH” would now uncover Chess and CHESS. It’s also going to dig up files with words like patches or PATCHES in the title. Sub is very powerful but it can deluge you with false hits, i.e. files that contain your search string but have nothing to do with what you’re actually looking for.

Subcase works like sub but is case sensitive. Regardless of which method or methods you use, the most important thing you can do before beginning an archie hunt is to decide how you want archie to look for a file. You’ll not only save your own time, you’ll save everyone’s time.

Now, you’re finally ready to start a search. Searching is done by simply entering:

archie: prog search_string

Archie then proceeds to give you a running score on how many hits it’s found and the percentage of the database that has been searched. After that, archie reports on the files that meet your specifications.

This is done by presenting you with a listing describing the files’ location. This list comes in a multi-line format. At the top of each record is the site name for the system with the file. This is followed by the file’s directory and finally the file’s name along with some size and date information.
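
The host, directory, and file below are invented, but a typical archie report has this general shape:

Host ftp.somewhere.edu

    Location: /pub/games
        FILE -rw-r--r--   54321  Jan 12 1992  chess.tar.Z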

Without telnet access, you may still be able to use archie with a local archie client. In this case, you’d just run archie from your system’s command line. This is the best way to use archie since it saves wear and tear on the network.

You’re not out of the hunt even without telnet or a local archie. Archie can be used by mail. To do this you send a message to an archie site addressed to archie. For example:

archie@archie.rutgers.edu

In the text of the message you enter archie commands. A sample search message could look like

path you@your.mail.address
prog what_ever_it_is_you’re_looking_for
quit

That’s it. The first line tells the system where to send its findings. Archie can read your From line and would normally send its results there, but it’s safer to use the path command. From lines can be misleading.

The prog line gives archie its search orders. The default for mail searches is regex, but you can reset it using the normal commands. Quit does what it sounds like. If there was nothing else in your message, you wouldn’t need it, but since many people have .signature files, which are automatically added to the end of their messages, it’s a good idea to use quit.

Once you have the file name and the location, you can use anonymous ftp to retrieve the file. Suppose, however, you don’t have ftp access? Sometimes you can use mail to obtain files even without full Internet access.

There are three ways of doing this. Some servers are set up to respond specifically to mail requests for files. Others, especially on BITnet, have listserv servers that use a different method for file requests, but like direct mail requesters, are limited to local files.

In a mail request, you get a file by sending a message to the site with your request in the subject line. Such a request might look like:

To: mail-server@some_machine.somewhere.edu
Subject: send /directory/file_name

And that’s all there is to it. You put absolutely nothing in the text part of the message.

For a BITnet system, your message would run:

To: listserv@an_ibm_mainframe.somewhere.com

In the message body, you’d write:

get file_name file_type

Since BITnet runs on IBM VM systems, file type is mandatory.

More generally useful are ftpmail application gateways. The most popular one is ftpmail@decwrl.dec.com. Just to make things annoying, it uses ftp syntax rather than anything that looks like the first two methods. A sample of this:

To: ftpmail@decwrl.dec.com
Subject: File Hunt
connect system.somewhere.com
chdir pub/file_location
get file
quit

The Subject line is actually irrelevant, but it can help you remember what that large, mysterious-looking letter in your mailbox is. The important part is that you enter ftp commands (be careful of typos!), one command per line, in the text.

Presuming that everything went right, you can expect to see your file in anywhere from a few hours to a few days. Speaking of getting it right, you can always send the single command ‘help’ to any of the mail servers or to archie to get more help.

A version of this story was published in Feb. 1993 in Computer Shopper

August 1, 1992
by sjvn01

Getting the Data Through: Unix & Interoperability

Historical note: It was in this article that I first referred to Linux in print.–sjvn

There I was buried in a copy of Andrew Tanenbaum’s “Modern Operating Systems” when a friend woke me up with an unusual request. He wanted to move data from his Tandy Color Computer II disks to MS-DOS-format disks.

For those of you who don’t remember the “CoCo,” it was a popular home PC in the days when CP/M systems walked the earth. Its disks, I might add, are totally incompatible with Unix or anything else like a modern operating system.

After about an hour or so of searching, I found an MS-DOS program that could read CoCo disks. As I’ve said before, I may not be a computing wizard, but I do know how to find information in a hurry.

Getting Along

In the process, though, I started thinking about interoperability, a subject near and dear to my heart. Now, interoperability is a nine-dollar word which refers to programs and data that can get along with each other. Unix, which runs on everything from an XT (boy, is it slow) to a supercomputer, is the ideal operating system for interoperability. At the lowest level, interoperability simply means the ability to transfer data from one system or program to another.

Even at this very basic stage, however, there are almost endless complications. Take, for instance, an ordinary ASCII file. Now, you may think that an ASCII file is an ASCII file is an ASCII file. Even leaving extended ASCII out of this discussion, there are still critical differences in how operating systems treat ASCII.

Let’s get down to specifics. In Unix, lines of text always end in a line feed, ASCII character 10, or the character you get when you press Ctrl-J. In MS-DOS, the default for ASCII files is for lines to end in a carriage return, ASCII character 13 or Ctrl-M, followed by a line feed.

You might think, “So what?” but the actual consequences can make you want to rip your hair out. For example, another friend uses an MS-DOS computer to read Usenet news groups on a Unix system. Her problem is that when she tries to print interesting messages, her printer treats the Unix-based messages as one incredibly long line. The output is, shall we say, ugly.
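
If you’d like to see the difference for yourself, dump a few lines of each kind of file with od; the file names here are only placeholders:

od -c unix_file | head -2    # Unix lines end with \n alone
od -c dos_file | head -2     # DOS lines end with \r followed by \n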

Fortunately, Unix has an abundance of tools that can handle her problem. The easiest one to implement makes use of Unix’s pattern-handling language: nawk.

Was It Something I Sed?

Nawk is a newer version of awk, which just goes to show that Unix programmers can’t resist bad puns. The awk family is the most sophisticated of the standard Unix utilities.

In Unix terms, nawk is a filter program. You put raw, unprocessed data in one end of the filter, and get processed information out the other end. Other examples of filters are sed and grep.

Sed, Unix’s stream editor, would seem to be the most logical choice for this job, but it has one major flaw. Getting sed to accept carriage returns in a command argument is like trying to move a mountain.

For doing things like changing “shopper” to “Shopper” throughout a manuscript with one command, sed can’t be beat. We have to look farther afield, though, when it comes to dealing with line endings.
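
For the record, that sort of global substitution is a sed one-liner; the file names are placeholders:

sed 's/shopper/Shopper/g' manuscript > manuscript.new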

Usually, filters are used in processes where the data is on its way from one program to another. The filter merely processes the data and moves it down the line to the next step. Nawk excels at this when dealing with database data, but nawk also lends itself well to file-format translations.

Practice

Let’s return to the simple case of moving Unix files to DOS format. The nawk program, utodos, looks like this:

BEGIN {FS = "\n"; RS = "\n"; ORS = "\r\n"}

{print $1}

This program is invoked by the following command:

nawk -f utodos input_file > output_file

Now we’ll break the program down. In the first line, we’re redefining several default variables that nawk uses. FS (Field Separator) and RS (Record Separator) are set to the newline character. ORS (Output Record Separator) is, in its turn, set to a carriage return followed by a newline.

The second line tells nawk to output each line, which, thanks to FS and RS, is treated as a single field record, with the new ORS. In other words, the file is transformed so that everything that Unix sees as a line will also now be seen in MS-DOS as a line of text.

As we’ve seen in previous columns concerning shells, dollar signs and numbers are used as variables. In the awk family, $1 is always the first field. Since each line is treated as one large field, this last command obligingly prints out each line.

When I invoke the command, I use the “-f” flag to inform nawk that it will find its marching orders in the utodos file. After that, the input file name follows, and “>” redirects the output from the default output, normally the console, to a file.
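
Going the other direction, from MS-DOS to Unix, takes an even shorter filter. This is only a sketch with placeholder file names, and some older tr implementations want the octal form '\015' in place of '\r':

tr -d '\r' < dos_file > unix_file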

Other Transportation Problems

Other data-transportation problems are even easier. Say, for example, that I want to move a dBASE IV database from where it lives on my Gateway 2000 running Interactive System V Release 3.2 Unix to my Dell running MS-DOS 5.0. This is no trouble since the database’s structure is the same on both systems. My only problem is actually moving the file from one machine to the other.

One way I could do that would be to back up the database to a tape that some MS-DOS utilities I have could read. Last time around, I talked about using Unix’s own tools to make a home-grown backup program. That program can be used to solve this problem.

There are ways, however, that it could have been written that would have made it unusable. There are almost as many flavors of Unix as there are of ice cream. While a Bourne shell program will work on nearly all of them, the results may differ radically from system to system. For instance, the heart of my backup program is the line:

find . -type f -print | cpio -o > /dev/tape

Readers who know their Unix commands well will remember that the find command has a -cpio flag of its own. In other words, instead of having find locate the appropriate files and pipe them to cpio to be shipped off to tape, I could have find do all the work. The command line would look like:

find . -type f -cpio /dev/tape

So why didn’t I do it that way? The main reason is portability. Many versions of find’s -cpio flag produce non-ASCII headers. These headers make perfect sense to their creating program, but are arrant nonsense to everyone else’s software. For these reasons, using cpio the program, not the flag, is key for interoperability.
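
For completeness, reading such an archive back in on the other end is just as simple, assuming the same tape device name:

cpio -it < /dev/tape    # list the archive's contents
cpio -i < /dev/tape     # extract everything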

Final Notes

I’ve barely scratched the surface of interoperability, but I’ll be coming back to it. In a world where sharing information becomes more important every day, interoperability is vital.

All this talk about Unix, however, doesn’t do you any blessed good if you don’t have Unix. New low-cost (would you believe free?) Unix-based operating systems are on their way.

One system, named Linux, will cost you only download fees, and not even that if you’re on the Internet. In its current version, this 80×86 Unix-clone is still only for the adventurous. If you want to give it a try, check out information on this system in the net news group comp.os.linux. Many Linux files are also available on the Programmer’s Corner BBS.

June 1, 1992
by sjvn01

Cache 101: What your Mother should have told you

The easiest way to give any computer more oomph is to add a cache. In the eternal race of CPU vs. memory access speeds, CPUs win hands down. Modern CPUs need information faster than memory can provide it and much faster than hard disks allow. The way to provide data-starved processors with their minimum daily requirement of bytes is to dedicate a high-speed chunk of memory to storing information for the CPU: a cache.

Normally, we think of caches as speeding up disk accesses by storing high-demand information in RAM, where it can be snagged in nanoseconds instead of milliseconds. While this is important, caches are equally vital for supplying the CPU with data faster than it can be accessed from main memory.

When processors clunked along at a measly 10MHz, memory caching wasn’t that important; everything in the system was slow. With the coming of faster processors, like the 80386, those days are gone. CPUs were forced to wait a clock cycle or more for data, hence the phrase “wait state.”

Let’s look at a real-world example. A 25MHz 386 has a clock cycle time of 40ns. That’s the least amount of time the CPU needs for its fastest instruction. Usually, the CPU takes two cycles to read or write to memory. To keep up with the CPU when running flat out, you would need memory with 80ns access times, which, as it happens, is perhaps the most common fast dynamic RAM (DRAM) around.

Unfortunately, it isn’t that simple. DRAM has to be recharged every time it’s used, and the recharge time is typically 90 percent of the access time. Get the picture? An 80ns DRAM chip has a cycle time of 152ns. That’s already too slow, but DRAM has to pay another performance penalty: DRAM chips must be refreshed frequently or they lose their stored data. Typically, this puts about an additional 8 to 10 percent bite on performance. Ouch.

There are ways around this problem. Interleaved memory separates memory into banks. In motherboards with two banks, even-numbered memory locations are assigned to one bank while their odd-numbered brethren are assigned to the other bank. When data is read sequentially, this reduces wait states since data is picked up first from one bank and then the other. This gives each bank enough time to take care of its housekeeping chores before being called on again for access.

Data isn’t always accessed in such a straightforward manner, however. When it’s not, and the CPU needs data kept in two different locations in the same bank, the CPU will again waste clock cycles waiting for memory.

PAGING COMMANDER DATA

Paging is another attempt to overcome memory speed hurdles. In paging, memory is addressed not by absolute location, but by memory sectors or “pages.” Data addresses in memory aren’t shifted dynamically; instead, the memory address is static, thus this scheme’s other name, “static column.”

As long as the data can be found in a single page, typically 2K in size, the system doesn’t waste time. Once the CPU asks for data outside the current page, all the usual problems of slow memory reappear.

While both interleaving and paging help, neither is enough to avoid wasteful wait states completely. In fact, they’re often combined to wring the most benefit from each approach, but caching is the solution most effective at eliminating all but a fraction of performance-draining wait states.

Cache success is usually measured by the percent of memory accesses that find a needed byte of information in the cache. Every time the CPU gets the required data, the cache is said to have a hit. Cache efficiency is judged like a baseball player’s batting average. The number of hits compared to the number of times the CPU asks for data gives the cache its “hit ratio.” Unlike Wade Boggs or George Brett, an average hovering around .350 stinks for a cache. Successful caches should have a hit ratio above 90 percent, or in baseball terms, a cache should bat above .900. Ted Williams, eat your heart out.

It takes more than a high average to make a good ballplayer or cache, though. An all-star cache needs a good balance of four factors. These are: 1) hit ratio, 2) hit speed, 3) miss speed, and that all-important one, 4) cost.

Hit speed refers to how fast the cache can come back with data if it has the information in storage. Faster is always better here. The Rickey Hendersons of caching use 20ns SRAM. Unfortunately, SRAM, like star base stealers, doesn’t come cheap. No one ever said making an MVP cache was easy.

Miss speed is how long it takes a cache to report that the requested information isn’t within its confines. Slow miss-reporting can put any cache on the disabled list.

Cost can never be overlooked. Designing the best possible memory system is really quite simple: Just use the fastest SRAM you can find, get a good caching algorithm, and make it serve as both primary and secondary storage. The rub is that you’ll feel like you’re paying Darryl Strawberry’s salary out of your budget. It simply isn’t feasible to build this kind of system–the cost is too high.

CACHE TRADE-OFFS

For both developers and end users, cache size is often the first consideration. Arguments have raged for years over how big a cache should be. The answer, and you’re going to hate this, is that it depends. There isn’t even a good rule of thumb. Bigger is better (the most often quoted guideline) is only true up to a point before you run into diminishing returns.

A cache’s hit ratio depends on a delicate balance of elements. At least as important as cache size is the size of data elements within the cache. These elements are called “data lines.”

In a simple cache, when the CPU calls for data, the cache controller–or memory management unit (MMU)–searches a tag list. The tag list can be thought of as an address list. Each tag is attached, by either pointers or a linked list, to the appropriate data line. Longer data lines increase cache efficiency more than large cache size. A 64-byte data-line design, for example, should be almost an order of magnitude more efficient than a 4-byte design in a 64K cache.

Again, though, there’s no easy way of determining the right mix of data-line length and cache size. The type of application and how these applications access memory are as important as any design considerations in finding the best possible cache. Compromise is a way of life in everything.

CACHES AND STILL MORE CACHES

Caches are classified by how they address data storage. The quick and dirty way is direct mapping. In direct mapping, the cache’s data lines have a one-to-one correspondence with the data addresses in memory or storage.

While direct mapping is popular, it has a weakness. Since the cache is usually smaller than memory, and always smaller than mass storage, each direct address line in the map corresponds to more than one section of memory. When the same cache line is used to refer to more than one part of memory or storage, performance drops through the floor. Adding insult to injury, this occurs even if there’s still free room in the cache.
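
If you’d like to see the bookkeeping in action, here’s a toy sketch written in nawk, the same tool I’ve used in earlier columns. It mimics a tiny direct-mapped cache, reading one memory address per line from a file (the file name, addresses.txt, is made up), and reports the hit ratio. It illustrates the idea of direct mapping, not how the hardware actually does it. Save it as cachesim:

BEGIN { lines = 8; linesize = 16 }           # 8 cache lines of 16 bytes each

{
    line = int($1 / linesize) % lines        # the one cache line this address is allowed to use
    tag  = int($1 / (linesize * lines))      # which chunk of memory that line would hold
    if (filled[line] && stored[line] == tag)
        hits++                               # the right chunk was already there
    else {
        stored[line] = tag; filled[line] = 1 # replace whatever was there before
        misses++
    }
}

END { if (hits + misses) printf "hits %d, misses %d, hit ratio %.2f\n", hits, misses, hits / (hits + misses) }

Run it with:

nawk -f cachesim addresses.txt

Feed it addresses that collide on the same line and you’ll watch the hit ratio drop, which is exactly the weakness described above.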

Consequently, designers have looked at fully associative caches. In these designs, cache data lines are assigned dynamically. While these caches have great hit ratios, they may not be as fast on a hit as a direct-mapped system. The reason is that although an associative cache can adapt to your CPU’s memory demands, it takes slightly longer to find the required data.

For example, in a direct-mapped cache, the cache controller always knows where to go for a particular chunk of information. That information may not be in the cache, but the controller doesn’t need to waste time looking any further. An associative cache program can’t make that assumption, so it must search through the entire cache in poor implementations, or a lookup table in better schemes, before reporting a miss.

Set-associative caches attempt to get the best of both worlds. The best known examples of this kind of caching are the Intel 82385 cache controller chip and the 80486’s internal 8K cache. The Motorola 68040’s internal cache is even more sophisticated. It breaks the cache into one area for data and another for instructions for faster performance.

The 486 uses a four-way set-associative design. The 8K cache is treated as four 2K blocks. Each block is made up of 128 16-byte data lines. Within a block, the data lines are bundled together in 32 four-line sets. These sets correspond to absolute memory addresses the same way a direct-mapped cache would. At the same time, the data lines within a set can be dynamically set to any data within the set’s fixed range.

The resulting cache has the quickness of direct mapping combined with the flexibility and high hit rates of fully associative caching. It should come as no surprise to learn that set-associative caching designs can be found in most cache controllers.

ALGORITHM DU JOUR

There are other important performance issues. One of the most critical is when to fetch data from memory or storage. Some algorithms only call for data when it has been requested and hasn’t been found in the cache. Other algorithms try to predict what information will be called for next and request data before the CPU calls for it.

The idea of pre-fetching data is an attractive one, but it’s not easy to make it work effectively. After all, if the cache guesses wrong, time is wasted not only extracting data but dealing with the excess baggage of unwanted data. Worse yet, you may have gotten rid of needed data to make room for useless information.

Data-line length must be considered even more carefully than usual in pre-fetch designs. Longer data lines, for instance, can lessen a cache’s efficiency by wasting space when unneeded data is loaded.

Pre-fetch systems usually rely on the fact that if the CPU just called for one block of data, the odds are that the next block asked for will be the following sequential block. For instance, many disk-caching programs read an entire track of data. The software is betting that the data is unfragmented on the disk and that the CPU will soon need the information stored in the next sector. With a sufficiently large cache (larger is better here), pre-fetching can drastically improve the cache’s hit ratio.

It sounds great, but this one-block lookahead approach can increase traffic on both the bus and in memory. If you’re running a multiprocessor system, like some Compaq Systempros, or already pushing the bus’s bandwidth to the limit with multimedia applications, this method will start degrading overall system performance.

A more sophisticated approach is to pre-fetch only after a miss. There’s no such thing as a free lunch, however, and the check here is an increase in I/O traffic and a decrease in cache hits compared to always pre-fetching. In short, you don’t put as much of a burden on the bus or memory, but you don’t get as many hits either.

Even more sophisticated caches pre-fetch data not only on misses but when there’s a hit on a pre-fetched data line. In other words, the cache anticipates that when the CPU asks for new data, it’s likely to want other data associated with its new inquiries.

Another consideration is whether both the cache and memory should be searched concurrently. Most caches are “look through” designs. In this case, main memory isn’t searched until a miss has been reported. Some designs, called “look aside,” search both the cache and memory simultaneously.

There are several pitfalls to be wary of in the look-aside approach. The cache controller must be able to manage both searches and resolve any conflicts. This is perfectly possible, but the controller must be more powerful than its look-through brother and the cache’s code must be more tightly written. In layman’s terms, this means more things that can go wrong.

All caches have to deal with the problem of throwing out the trash, i.e., stale data. A cache only has so much room, and when it’s full, room must be made for new data.

This problem can be attacked in three ways. The first has the advantage of simplicity. The cache simply throws data out at random. It works, it’s easy to build, and it’s a lousy way to keep up your hit ratio in large caches. Curiously enough, studies show that in certain kinds of small caches, random replacement works better than other approaches. Random replacement can’t be dismissed out of hand.

The next method can also be put together by almost any tinkering programmer. This approach uses the tried and true first-in, first-out (FIFO) concept. Data moves through the cache in an orderly fashion. The oldest information is thrown out as new information comes in.

Developers are never content simply to use easy methods. Programmers have been working for years on implementations of least recently used (LRU) algorithms. In these schemes, the data that’s thrown out of the cache is the data which has gone the longest without being used. When properly implemented, LRU is the most successful approach for most caches.

That’s easier said than done. LRU has two strikes against it. First, every data line must have counter bits. After all, you need some way of telling which data line was accessed last. Second, the cache controller or program has to burn time deciding what’s the least-used data.
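
To see where that time goes, here’s another toy nawk sketch, this time a tiny fully associative cache that evicts the least recently used line. Notice that it has to stamp every cached line with a counter and then scan all of them to pick a victim; that scan is the burned time just mentioned. Again, the file name is a placeholder and this is an illustration, not a real controller. Save it as lrusim:

BEGIN { size = 4 }                           # a four-line, fully associative toy cache

{
    tag = int($1 / 16)                       # 16-byte lines again
    if (tag in when)
        hits++                               # already cached
    else {
        misses++
        if (count == size) {                 # cache full: hunt for the least recently used tag
            victim = ""
            for (t in when)
                if (victim == "" || when[t] < when[victim])
                    victim = t
            delete when[victim]
            count--
        }
        count++
    }
    when[tag] = NR                           # the input record number doubles as a "last used" stamp
}

END { printf "hits %d, misses %d\n", hits, misses }

Run it the same way: nawk -f lrusim addresses.txt.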

TO WRITE OR NOT TO WRITE

The most heated issue in cache design is when to write data. The belts-and-suspenders crowd believes in “write through.” In this approach, there’s no question about it–when you make a change to data in the cache, the change is immediately passed along to memory and/or the mass-storage device. Of course, while this is happening, the processor is spinning its wheels waiting to get back to work.

That’s why some designers take a different tack. Their approach is “posted write,” also called “write back.” In this method, changes are posted to the cache at top speed, but the cache controller waits until it is idle to write the change to memory.

In posted write, there are two methods of handling data writes. In “always copy-back,” data is written to memory eventually even if the only change is that a line of data has been moved out of the cache. The other approach, “flagged copy-back,” calls for changes to be written only if the cached information has been changed.

No matter which way you go with posted write, there are some unavoidable problems. The one that makes users sweat is that out-of-date data can be held in memory for up to seconds at a time before the cache updates the stale information. If your system happens to get hit by the mother of all power surges, you can kiss that information goodbye.

What’s that you say? You have an uninterruptible power supply and don’t need to worry about a blackout? Well, there’s another dimension to this problem that doesn’t get mentioned often. Other devices in or attached to your computer also access memory through direct-memory access (DMA). This can bypass not only the CPU but the cache and its controller as well.

As a result, your FAX card or modem may pick up wrong data. This can foul things up completely. Or a peripheral can change data in main memory, thus once more throwing a monkey wrench into your system’s memory management.

There are ways to tackle this problem. The i82385, as well as the 68040, uses what’s known as “bus snooping.” By watching all memory transfers across the bus, both CPU- and peripheral-initiated, DMA changes can be recorded properly in the cache.

One must wonder, though, if the benefits of posted writes are worth the trouble. At the outside, only 15 percent of your data transactions are writes. The rest are reads. It hardly seems worth exposing your data to any danger to speed up such a small percentage of data transactions.

CACHE COHERENCY

Some trouble with keeping data current is unavoidable in modern systems. As chips that feature onboard caches like the 486 and the Motorola 68040 become more common, this problem will pop up more often. Now computers frequently have chip-, controller-, and software-based caches. This is fine, but how do you get three different caches, memory, and storage all to agree on the state of any given byte of accessed data? In other words, how do you keep data coherent? There are no easy answers.

For today’s solutions, most software and hardware rely on bus snooping with write-through techniques. There are other approaches, but they belong more to the realm of theory than practice. This will change since the ever-increasing speeds of processors demand higher-performing caches.

With many cache-design elements, trade-offs are difficult to make and hard to optimize. It’s no wonder that there’s such a profusion of caching software and hardware. Nonetheless, several practical lessons can be drawn from cache theory.

The first lesson is that every computer needs a cache. Even if the memory can keep up with the CPU, the hard and floppy drives won’t be able to. There’s no better way to give your PC a kick in the pants than to add a cache.

A cache doesn’t need to be big. While trial and error is the only way you’ll discover what cache size works best for you, more than 2Mb of cache will be overkill on most systems for most applications. Like so many other things in caching, even this vague rule is not written in stone. Multiuser systems or file servers probably can use all the cache you can stuff into them.

Cache design is more important than size. The only way you’ll be able to judge that is from head-to-head comparisons of caches. Cache designers don’t give out their recipes, and it wouldn’t be much help even if they did. There are so many cache-design considerations that the only way you can tell what works for you is to give each cache a try.

The one design issue that users can sink their teeth into is memory-update policies. When you need the fastest possible cache, copy-back designs should be what you look at first. Everyone else would be better off with write-through designs.

One issue that I haven’t touched on is hardware caches vs. software caches. For memory/CPU consideration, hardware caches are the only answer. When it comes to caching data from disks, however, it’s another story entirely.

Hardware caches, such as those found on caching controllers, tend to be faster than software-based caches and particularly shine at writing data. Yet they cost three to four times the price of a caching program. For most users, software caching paired with additional motherboard RAM is a clear win. Network and multiuser operating systems will be better off with hardware caches.

Good caches may be difficult to make, but their value is indisputable. It’s clear that the future will continue to bring caching improvements as developers continue to wrestle with the difficult decisions required to make successful caches. As users, we will all benefit as caching technology matures and becomes more tightly integrated with system design.

First published in Ziff-Davis’ Computer Shopper

January 22, 1992
by sjvn01

Windows for Workgroups: Hauling the Data on the Microsoft Line

Once more into the breach! Microsoft, after years of playing second fiddle to Novell in the LAN market, now enters the networking battlefield with Windows for Workgroups (WinWork). WinWork adds peer-to-peer networking functionality to Windows 3.1.

This bold combination of today’s most popular graphical user interface (GUI) and LAN capability bids to tilt the network playing field in Microsoft’s favor. Or at least that’s what Microsoft hopes. The reality is somewhat different.


September 1, 1990
by sjvn01

Relational Databases: The Real Story

Did anyone else notice a few months ago when every food product in the land was advertised as “Light?” It didn’t matter a darn if there was any truth to their claims. Light was the new hype phrase and so everyone used it.

Breakfast cereals, margarines, and even ice creams were all advertised as being light. They got away with it because while you may think you know what light means, for advertising purposes it’s an ambiguous term. Food advertisers aren’t the only ones to play fast and loose with language; computer companies do the same thing.

Instead of “light,” buzzwords like “turbo” and “object-oriented” bombard us. One term with a precise definition has been bandied about so often that even computer veterans have lost track of its meaning. That phrase is relational database management system (RDBMS).

Companies have been using this phrase for years. There’s only one problem: few programs come even close to being relational database managers.

Thanks to their efforts, even database professionals believe that an RDBMS is any program that can access more than one database at a time. Nothing could be further from the truth. Just because a program allows you to view two different tables with its procedural language doesn’t make it relational.

The sad thing about this is that there’s no real excuse for this confusion. The relational database model has been around since 1970. In that year, Edgar Codd introduced it to the world in his classic paper, “A Relational Model of Data for Large Shared Data Banks,” in Communications of the ACM. Since then, he and his cohorts have been beating the drum for the RDBMS both within and without the walls of IBM. To make certain that everyone got the idea, Codd summed up the definition of an RDBMS in twelve rules.

I’ll list all twelve in a moment.
In a way it’s funny that so many programs try to wrap themselves in the RDBMS mantle. During its first years of existence, nobody except Codd and company supported the RDBMS model. RDBMS, based firmly on mathematical theory and predicate logic, was disliked for its break with traditional database thought. It was reviled as being too theoretical. Many said that it would never be practical. Whoops!

The prevailing database model of the late 60s and early 70s was the hierarchical one. In this still popular system, data is stored in a tree structure. Every data element is defined as being a member of a group in this arrangement. These groups are themselves data elements and can be members of a higher group.

Under this scheme, your zip code is a data element that is part of another data element, your address. This could then be part of another data element, a mailing list for instance. These structures can grow quite complicated and require pointers and indexes to properly track information.

RDBMSs take a much simpler view of data. In a relational database, two dimensional arrays of rows and columns hold all information. This seems too easy and to some extent, that criticism is valid. Not every kind of information can be easily accessed under a RDBMS. Imaging information, for instance, doesn’t easily fit into a RDBMS’ neat patterns of data. Still, a RDBMS works fine for most textual or numeric data. The structure may be simple, but the data retrieval and manipulation powers inherent in it are unparalleled.

In part, this is because the RDBMS’s theoretical underpinning allows for much easier coding. For example, finding records between two points, say all names between Vau and Vo, is a breeze in a relational system. The same search in conventional databases requires many record-by-record comparisons.
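
To get a feel for what stating the search as a single set-level condition looks like, here is a loose sketch in awk against a hypothetical one-name-per-line file, names.txt. Awk still walks the file record by record underneath, of course; the point is only that the whole between-two-points selection is written once, as a condition, rather than as a hand-coded loop of comparisons:

awk '$1 >= "Vau" && $1 < "Vp" { print }' names.txt    # everything from Vau up through the Vo names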

Even the most complicated data relationships can be reduced to two-dimensional table formats in a process called data normalization. This format makes changing and displaying information far easier than under a hierarchical system. The fundamental improvement of a RDBMS is that data can be added, deleted, or changed throughout an entire database by treating it as a single set. Ordinary database managers require record by record updates that drastically slow down performance.

The RDBMS concept sounds easy, but its full implementation is far more difficult in practice than in theory. The complete RDBMS definition, contained within Codd’s 12 rules, has proven difficult to achieve. Any database manager that claims to be relational must conform to the complete dozen. So far [1990] none have.

There are a few which have come close. IBM’s mainframe-based DB2, Ingres, Oracle, and R:Base 3.0 all attempt to meet the demands of the RDBMS model. At one time, Codd stated that a relational database manager needed to meet only seven of his rules. By that standard, some have been successful. In order to understand why this has proven so hard to do, a review of some of these laws is in order.

First, here are the laws:

Rule 1: The Information Rule
All data should be presented to the user in table form.

Rule 2: Guaranteed Access Rule
All data should be accessible without ambiguity. This can be accomplished through a combination of the table name, primary key, and column name.

Rule 3: Systematic Treatment of Null Values
A field should be allowed to remain empty. This involves the support of a null value, which is distinct from an empty string or a number with a value of zero.

Rule 4: Dynamic On-Line Catalog Based on the Relational Model
A relational database must provide access to its structure through the same tools that are used to access the data.

Rule 5: Comprehensive Data Sublanguage Rule
The database must support at least one clearly defined language that includes functionality for data definition, data manipulation, data integrity, and database transaction control.

Rule 6: View Updating Rule
Data can be presented to the user in different logical combinations, called views. Each view should support the same full range of data manipulation that direct access to a table has available.

Rule 7: High-level Insert, Update, and Delete
Data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables.

Rule 8: Physical Data Independence
The user is isolated from the physical method of storing and retrieving information from the database.

Rule 9: Logical Data Independence
How a user views data should not change when the database’s logical structure (table structure) changes.

Rule 10: Integrity Independence
The database language (like SQL) should support constraints on user input that maintain database integrity.
- No component of a primary key can have a null value. (See rule 3.)
- If a foreign key is defined in one table, any value in it must exist as a primary key in another table.

Rule 11: Distribution Independence
A user should be totally unaware of whether or not the database is distributed.

Rule 12: Nonsubversion Rule
There should be no way to modify the database structure other than through the multiple row database language (such as SQL).

Now, let’s dive in. The fundamental rule of RDBMS is that all information must be manageable entirely through relational means. This means that on the logical level everything in a relational database must be represented by values in tables. Here is where many database managers fall short. It is all too tempting to take a short cut in data representation or manipulation for the sake of short term efficiency. Unfortunately, this easy road leads quickly away from the basic model.

Other rules reinforce this concept. In a true RDBMS, the database description, or catalog, must be contained in tables and be controlled by the data manipulation language.

Perhaps the most troublesome rule here is number three, which deals with nulls. In a RDBMS, nulls represent missing or inapplicable data. They are a vital part of the relational concept. This isn’t the same thing as empty or blank fields or the concept of zero.

If that sounds confusing, you’re right, it is. English isn’t the appropriate language for discussing the issue. To make it clearer, here’s the best definition I know for the concept. Null represents information which isn’t known at the time of the record’s creation or modification. For instance, a hospital keeps birth records. Sometimes happy new parents don’t have a first name for their little darlings. Now, there are several ways this could be handled in the hospital files. The baby could be given a false name, “John Doe” or “Smith’s daughter,” and the record changed later to the true name. This will work, but it makes both record updating and reporting more complicated. Under a RDBMS, the missing information is represented by the null value.

The problem with this is that even within the relational framework there is no agreement on how to manipulate record sets containing null values. Debate rages on what to do with nulls. Since it’s at the heart of the theory, it can’t just be disregarded. For example, SQL, the Structured Query Language and the most popular relational language, identifies primary record keys by the combination of their unique identity and by being “not null.” How do you handle a situation where the primary record key includes a null value? That’s a good question with many ways to answer it.

The waters have been further muddied by Codd’s suggestions that two types of nulls should be recognized. One would represent missing information and the other inappropriate data. There’s no sign that a definite answer on this thorny question will be quickly forthcoming. [January 29th, 2008, we’re still no closer to a universal answer.]

Another rule that has proved difficult to implement is that a RDBMS must have distribution independence. In other words, the data manager must be able to cope with distributed databases. A true RDBMS shouldn’t care if the view on your screen is made up of a table from your PC and another from a VAX across the country.

As any database manager watcher knows, this is one area where progress is rapidly being made. Database servers and clients are the hottest development in the field. The RDBMS model is largely responsible for this. Its simple design and data integrity rules have made it the cornerstone of most such schemes. [In 2008, this issue has long been solved. How to handle replicating data in case of a disconnection or data corruption remains a practical problem, but the core design problem has long been solved.]

You might think that you could combine the best features of a RDBMS with other systems. Many software companies have thought just that. They’ve tried to put a coat of relational paint on top of other systems. Codd addresses this with his final rule. The gist of it is that if a relational system has an ordinary single-record-at-a-time language, that language can’t be used to detour around the system’s relational rules. Sour grapes over the spread of programs that did just that wasn’t the reason for this final commandment. The relational model is fundamentally different from other ways of viewing and manipulating data. A true hybrid system could never produce the full gains promised by relational theory.

In some circles, talking about database theory is like talking about religion or politics: you’re sure to have an argument. I’m in favor of the relational model, but I’ll be the first to admit that it has some problems. There are still grey areas, like nulls, that need a clearer definition. There are practical problems as well. RDBMSs require comparatively large amounts of RAM and disk storage. To put it more bluntly, they tend to be resource hogs. It should also be considered that for many purposes there’s nothing wrong with hierarchical systems.

Most flat file database managers work quite well within their theoretical constraints. In point of fact, most database programs, including such popular favorites as dBase, FoxPro, Clipper, Superbase, and Paradox owe more to this model than they do to the relational one. Still, as time goes by the theoretically superior relational model will be successfully implemented on more platforms. Then, and only then, will we have products that can honestly be presented as being relational.

Until that time, the best that can be said of most programs is that they include some relational features.

Today, in 2008, most serious DBMSs (MySQL, Oracle, PostgreSQL), while not perfectly relational, are based in large part on Codd’s rules.

A version of this article first appeared in Byte Magazine in 1990.