Getting the Data Through: Unix & Interoperability

August 1, 1992 by sjvn01 | 0 comments

Historical note: It was in this article that I first referred to Linux in print.–sjvn

There I was buried in a copy of Andrew Tanenbaum’s “Modern Operating Systems” when a friend woke me up with an unusual request. He wanted to move data from his Tandy Color Computer II disks to MS-DOS-format disks.

For those of you who don’t remember the “CoCo,” it was a popular home PC in the days when CP/M systems walked the earth. Its disks, I might add, are totally incompatible with Unix or anything else like a modern operating system.

After about an hour or so of searching, I found an MS-DOS program that could read CoCo disks. As I’ve said before, I may not be a computing wizard, but I do know how to find information in a hurry.

Getting Along

In the process, though, I started thinking about interoperability, a subject near and dear to my heart. Now, interoperability is a nine-dollar word which refers to programs and data that can get along with each other. Unix, which runs on everything from an XT (boy, is it slow) to a supercomputer, is the ideal operating system for interoperability. At the lowest level, interoperability simply means the ability to transfer data from one system or program to another.

Even at this very basic stage, however, there are almost endless complications. Take, for instance, an ordinary ASCII file. Now, you may think that an ASCII file is an ASCII file is an ASCII file. Even leaving extended ASCII out of this discussion, there are still critical differences in how operating systems treat ASCII.

Let’s get down to specifics. In Unix, lines of text always end in a line feed, ASCII character 13, or the character you get when you press Ctrl-J. In MS-DOS, the default for ASCII files is for lines to end in a carriage return, ASCII character 10 or Ctrl-M, followed by a line feed.

You might think, “So what?” but the actual consequences can make you want to rip your hair out. For example, another friend uses an MS-DOS computer to read Usenet news groups on a Unix system. Her problem is that when she tries to print interesting messages, her printer treats the Unix-based messages as one incredibly long line. The output is, shall we say, ugly.

Fortunately, Unix has an abundance of tools that can handle her problem. The easiest one to implement makes use of Unix’s pattern-handling language: nawk.

Was It Something I Sed?

Nawk is a newer version of awk, which just goes to show that Unix programmers can’t resist bad puns. The awk family is the most sophisticated of the standard Unix utilities.

In Unix terms, nawk is a filter program. You put raw, unprocessed data in one end of the filter, and get processed information out the other end. Other examples of filters are sed and grep.

Sed, Unix’s stream editor, would seem to be the most logical choice for this job, but it has one major flaw. Getting sed to accept carriage returns in a command argument is like trying to move a mountain.

For doing things like changing “shopper” or “shopper” to “Shopper” throughout a manuscript with one command, sed can’t be beat. We have to look farther afield, though, when it comes to dealing with line endings.

Usually, filters are used in processes where the data is on its way from one program to another. The filter merely processes the data and moves it down the line to the next step. Nawk excels at this when dealing with database data, but nawk also lends itself well to file-format translations.

Practice

Let’s return to the simple case of moving Unix files to DOS format. The nawk program, utodos, looks like this:

Begin {FS = “unkeyable n”; RS = “unkeyable n” ORS = “unkeyable r unkeyable n”}

{print $1}

This program is invoked by the following command:

nawk -f utodos input_file output_file

Now we’ll break the program down. In the first line, we’re redefining several default variables that nawk uses. FS (Field Separator) and RS (Record Separator) are set for the value of the new line character. ORS (Output Record Separator) is, in its turn, set to a carriage return followed by a new line.

The second line tells nawk to output each line, which, thanks to FS and RS, is treated as a single field record, with the new ORS. In other words, the file is transformed so that everything that Unix sees as a line will also now be seen in MS-DOS as a line of text.

As we’ve seen in previous columns concerning shells, dollar signs and numbers are used as variables. In the awk family, $1 is always the first field. Since each line is treated as one large field, this last command obligingly prints out each line.

When I invoke the command, I use the “-f” flag to inform nawk that it will find its marching orders in the otodos file. After that, the input file name follows and “>” redirects the output from the default output, normally the console, to a file.

Practical Technology

for practical people.

Getting the Data Through: Unix & Interoperability

Leave a Reply Cancel reply