Practical Technology

for practical people.

Getting the Data Through: Unix & Interoperability

Historical note: It was in this article that I first referred to Linux in print. -sjvn

There I was buried in a copy of Andrew Tanenbaum’s “Modern Operating Systems” when a friend woke me up with an unusual request. He wanted to move data from his Tandy Color Computer II disks to MS-DOS-format disks.

For those of you who don’t remember the “CoCo,” it was a popular home PC in the days when CP/M systems walked the earth. Its disks, I might add, are totally incompatible with Unix or anything else resembling a modern operating system.

After about an hour or so of searching, I found an MS-DOS program that could read CoCo disks. As I’ve said before, I may not be a computing wizard, but I do know how to find information in a hurry.

Getting Along

In the process, though, I started thinking about interoperability, a subject near and dear to my heart. Now, interoperability is a nine-dollar word which refers to programs and data that can get along with each other. Unix, which runs on everything from an XT (boy, is it slow) to a supercomputer, is the ideal operating system for interoperability. At the lowest level, interoperability simply means the ability to transfer data from one system or program to another.

Even at this very basic stage, however, there are almost endless complications. Take, for instance, an ordinary ASCII file. Now, you may think that an ASCII file is an ASCII file is an ASCII file. Even leaving extended ASCII out of this discussion, there are still critical differences in how operating systems treat ASCII.

Let’s get down to specifics. In Unix, lines of text always end in a line feed, ASCII character 10, or the character you get when you press Ctrl-J. In MS-DOS, the default for ASCII files is for lines to end in a carriage return, ASCII character 13 or Ctrl-M, followed by a line feed.
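To see the difference for yourself, a small sketch along these lines dumps the bytes of the same two lines of text in each convention. It assumes a POSIX shell with printf, od, and wc available; the function names are made up for the illustration.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names): emit the same two lines of
# text with Unix-style and MS-DOS-style line endings.
unix_text() { printf 'one\ntwo\n'; }      # each line ends in LF (ASCII 10)
dos_text()  { printf 'one\r\ntwo\r\n'; }  # each line ends in CR LF (ASCII 13, 10)

# Dump each version character by character; od -c shows LF as \n and CR as \r.
unix_text | od -c
dos_text  | od -c

# The DOS version is two bytes longer: one extra CR per line.
unix_text | wc -c
dos_text  | wc -c
```

The text looks identical in both dumps; only the bytes at the end of each line differ.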

You might think, “So what?” but the actual consequences can make you want to rip your hair out. For example, another friend uses an MS-DOS computer to read Usenet news groups on a Unix system. Her problem is that when she tries to print interesting messages, her printer treats the Unix-based messages as one incredibly long line. The output is, shall we say, ugly.

Fortunately, Unix has an abundance of tools that can handle her problem. The easiest one to implement makes use of Unix’s pattern-handling language: nawk.

Was It Something I Sed?

Nawk is a newer version of awk, which just goes to show that Unix programmers can’t resist bad puns. The awk family is the most sophisticated of the standard Unix utilities.

In Unix terms, nawk is a filter program. You put raw, unprocessed data in one end of the filter, and get processed information out the other end. Other examples of filters are sed and grep.

Sed, Unix’s stream editor, would seem to be the most logical choice for this job, but it has one major flaw. Getting sed to accept carriage returns in a command argument is like trying to move a mountain.

For doing things like changing “shopper” or “shopper” to “Shopper” throughout a manuscript with one command, sed can’t be beat. We have to look farther afield, though, when it comes to dealing with line endings.

Usually, filters are used in processes where the data is on its way from one program to another. The filter merely processes the data and moves it down the line to the next step. Nawk excels at this when dealing with database data, but nawk also lends itself well to file-format translations.
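As a rough illustration of filters working in a pipeline, the sketch below passes a few lines of text through grep and then sed. The sample text and the function name are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample data standing in for a manuscript.
manuscript() {
    printf '%s\n' 'the shopper column' 'an unrelated line' 'shopper to shopper'
}

# grep keeps only the lines that mention "shopper"; sed then capitalizes
# every occurrence on the lines that survive. Each filter reads standard
# input and writes standard output, so they chain together with pipes.
manuscript | grep 'shopper' | sed 's/shopper/Shopper/g'
```

Neither filter knows or cares what comes before or after it in the pipeline, which is what makes them easy to combine.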

Practice

Let’s return to the simple case of moving Unix files to DOS format. The nawk program, utodos, looks like this:

BEGIN { FS = "\n"; RS = "\n"; ORS = "\r\n" }

{print $1}

This program is invoked by the following command:

nawk -f utodos input_file > output_file

Now we’ll break the program down. In the first line, we’re redefining several default variables that nawk uses. FS (Field Separator) and RS (Record Separator) are set to the newline character. ORS (Output Record Separator) is, in its turn, set to a carriage return followed by a newline.

The second line tells nawk to output each line, which, thanks to FS and RS, is treated as a single field record, with the new ORS. In other words, the file is transformed so that everything that Unix sees as a line will also now be seen in MS-DOS as a line of text.

As we’ve seen in previous columns concerning shells, dollar signs and numbers are used as variables. In the awk family, $1 is always the first field. Since each line is treated as one large field, this last command obligingly prints out each line.

When I invoke the command, I use the “-f” flag to inform nawk that it will find its marching orders in the utodos file. After that, the input file name follows and “>” redirects the output from the default output, normally the console, to a file.
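The same idea can be tried without a separate program file. This is a sketch, not the column's exact script: it assumes a POSIX awk (many systems install a descendant of nawk simply as awk), prints $0 rather than $1, and leaves FS and RS at their defaults, since the default record separator is already the newline. The function name unix_to_dos is made up here.

```shell
#!/bin/sh
# Hypothetical wrapper: convert Unix line endings (LF) to MS-DOS line
# endings (CR LF). Reads the named files, or standard input if none are
# given, and writes the converted text to standard output.
unix_to_dos() {
    # ORS is appended after every record that print writes, so each
    # input line comes out terminated by a carriage return and a newline.
    awk 'BEGIN { ORS = "\r\n" } { print $0 }' "$@"
}
```

It would be used much like the original command, for example: unix_to_dos input_file > output_file. Running it twice on the same file adds a second carriage return to every line, so it should only be applied to files known to be in Unix format.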

Other Transportation Problems

Other data-transportation problems are even easier. Say, for example, that I want to move a dBASE IV database from where it lives on my Gateway 2000 running Interactive System V Release 3.2 Unix to my Dell running MS-DOS 5.0. This is no trouble since the databases’ structures are the same. My only problem is actually moving the file from one format to another.

One way I could do that would be to back up the database to a tape that some MS-DOS utilities I have could read. Last time around, I talked about using Unix’s own tools to make a home-grown backup program. That program can be used to solve this problem.

There are ways, however, that it could have been written that would have made it unusable. There are almost as many flavors of Unix as there are of ice cream. While a Bourne shell program will work on nearly all of them, the results may differ radically from system to system. For instance, the heart of my backup program is the line:

find . -type f -print | cpio -o > /dev/tape

Readers who know their Unix commands well will remember that the find command has a -cpio flag of its own. In other words, instead of having find locate the appropriate files and pipe them to cpio to be shipped off to tape, I could have find do all the work. The command line would look like:

find . -type f -cpio /dev/tape

So why didn’t I do it that way? The main reason is portability. Many versions of find’s -cpio flag produce non-ASCII headers. These headers make perfect sense to their creating program, but are arrant nonsense to everyone else’s software. For this reason, using cpio the program, not the flag, is key for interoperability.
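The pipe-to-cpio approach can be tried safely by writing to an ordinary file instead of a tape device. The sketch below makes several assumptions: the function names are hypothetical, the archive goes to a file rather than /dev/tape, and it adds the -c option, which is not in the command above and which asks for ASCII headers on the common cpio implementations. Check your system's manual page before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: archive every regular file under directory $1
# into the cpio archive file $2. The archive should live outside $1,
# or find will pick it up as well.
archive_tree() {
    # find lists the files, one per line; cpio -o reads that list and
    # writes the archive to standard output. -c (an assumption here)
    # requests portable ASCII headers.
    ( cd "$1" && find . -type f -print | cpio -o -c 2>/dev/null ) > "$2"
}

# Hypothetical helper: list the file names stored in archive $1.
list_archive() {
    # -i reads an archive from standard input; -t prints a table of
    # contents instead of extracting anything.
    cpio -it 2>/dev/null < "$1"
}
```

Listing the archive on the receiving system before extracting anything is a cheap way to confirm that its cpio can read the headers.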

Final Notes

I’ve barely scratched the surface of interoperability, but I’ll be coming back to it. In a world where sharing information becomes more important every day, interoperability is vital.

All this talk about Unix, however, doesn’t do you any blessed good if you don’t have Unix. New low-cost (would you believe free?) Unix-based operating systems are on their way.

One system, named Linux, will cost you only download fees and not even that if you’re on the Internet. In its current version, this 80x86 Unix clone is still only for the adventurous. If you want to give it a try, check out information on this system in the net news group comp.sys.linux. Many Linux files are also available on the Programmer’s Corner BBS.
