Using Random Access I/O in C and Java

This document describes how to do random access I/O on files: first using POSIX system calls from C, then using the C standard I/O library, and finally using Java's RandomAccessFile class.

Logical model

There are persistent objects stored in the file system. These objects are called files. A file is a sequence of bytes.

This model applies to UNIX, MacOS (even old MacOS), and Windows, although it did not apply to early versions of MS-DOS. It also doesn't really apply to mainframe systems, where a file is regarded as a sequence of records, not bytes, or for that matter to the AS/400 environment, where file system objects are, as a rule, objects, not files. However, it is the model presumed by the C and C++ standards and by Java, also by C#, Smalltalk, and XML.

The file system provides some way of locating files given file names. For present purposes, we don't care how file names are encoded, other than to note that they are represented inside a program by some kind of string. A file object might have many names or none.

There is access control information associated with each file, determining which users may do what. The actual set of permissions varies from operating system to operating system: some operating systems have distinguished (Read, Write, Execute, Append, Delete), while Unix only distinguishes (Read, Write, Execute). Our only reason for caring here is that when we create a new file we may need to specify access control.

Inside a program, there are programming objects called channels. A channel is a connection between a program and a file. A channel has access rights (which may be more restrictive than the file system would permit, indeed, should normally be more restrictive if you only want to read the file) and a current position. The current position is a byte number, where 0 <= position <= file size.

Depending on the access rights, you may read a chunk of bytes from a file through a channel, write a chunk of bytes to a file through a channel, truncate a file through a channel, or move the current position of a channel.

Physical model

This physical model applies only to discs, and is oversimplified even then.

A file is held on a single disc.

A disc is a collection of surfaces. For example, a double-sided disc has two surfaces. A disc pack might have ten or more surfaces. All the surfaces are flat annuli of the same size.

Each surface is divided into tracks. A track is a thin annulus of the disc surface. All the surfaces are divided in the same way.

A disc drive has a head for each surface. All these heads are connected to the same mechanical apparatus called an arm for moving them in and out, so that when the head for surface 0 is in position to read from track 12 of surface 0, the head for surface 1 is in position to read from track 12 of surface 1.

The set of tracks at the same distance from the centre on each surface is called a cylinder, because it is the intersection of a cylinder with the disc (or disc pack). The important thing about a cylinder is that all the information in a single cylinder can be accessed without moving the arm; if the file system can allocate all of a file's space within a single cylinder then the file can be accessed quicker than if its information is scattered over many cylinders.

Each track is divided into a number of sectors. Usually, each track has the same number of sectors, and the sectors are all the same size. The original Macintoshes squeezed more data out of a floppy by spinning it slower when the arm was near the outside tracks than when it was near the inside tracks; that meant a constant bit rate wherever the arm was, but more sectors on the outside than the inside. It also meant you could play tunes on the disc drive. Modern Macintoshes use the same kind of disc drives as everyone else.

A disc drive can read or write sectors. It cannot read or write anything smaller than a sector.

We can refer to any sector in a collection of discs by using an address (disc#, track#, surface#, sector#). The disc# says which disc to talk to. The track# says where we want that disc's arm to move to (if it isn't already there). The surface# says which of the heads attached to that arm should be used. The sector# says which sector is to be read or written. To read or write a sector, the drive waits for the disc to spin around so that the sector is under the head, then it starts reading or writing. Accessing multiple consecutive sectors in the same track is fast because we don't have to wait for the disc to spin to the right place for any sector except the first.

A sector may or may not be the same thing as a block. "Sector" is a physical concept, about the physical layout of a disc. "Block" is a logical concept, about the units that the operating system uses in dealing with a disc. Blocks cannot be smaller than sectors, but from the previous paragraph, we see that there is an advantage to using blocks that are several sectors long: it takes less time to read an N-sector block than it does to read N sectors.

Some operating systems let the user ask that a file be "contiguous", that is, that it occupy the smallest number of tracks possible and that its sectors be consecutive where possible.

The operating system hides a lot

In UNIX, Windows, and MacOS, you can read any number of bytes from anywhere in a file, whether this starts on a block (or sector) boundary or not, whether the size is a multiple of the block (or sector) size or not. You can also write any number of bytes to anywhere in a file. In each case, the operating system actually reads entire blocks from the disc or writes entire blocks to the disc, so if your requests are not block-aligned you may be wasting some of the disc's bandwidth. In C it is easy to find out the natural block size for a disc file; if you don't want to use system-dependent code, 8192 bytes is a good default size.

C using POSIX system calls

In UNIX, the data structures for channels are held in operating system memory. User programs deal with things called file descriptors, which are small integers. Historically, 0 is standard input, 1 is standard output, and 2 is standard error. You obtain a file descriptor by calling an OS function; for all the other operations described here you pass the file descriptor back as the first argument to an OS function. Windows has basically the same idea, although Windows calls these numbers HANDLEs.

Creating a file

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int creat(char const *path, mode_t mode);

One of the UNIX inventors was asked what he would change about UNIX if he had to do it again. His answer was "I'd spell 'creat' with an 'e'."

The path argument is a string giving the file name.

The mode argument is an integer giving the access rights for the new file. You make it by OR-ing together
S_IRUSR  0400  the owner can read
S_IWUSR  0200  the owner can write
S_IXUSR  0100  the owner can execute
S_IRWXU  0700  the owner can do anything

S_IRGRP  0040  file's group can read
S_IWGRP  0020  file's group can write
S_IXGRP  0010  file's group can execute
S_IRWXG  0070  file's group can do anything

S_IROTH  0004  others can read
S_IWOTH  0002  others can write
S_IXOTH  0001  others can execute
S_IRWXO  0007  others can do anything

Your program will have write access to the file through this channel even if the mode argument says otherwise; how else could the file get any contents?

The file's owner is the user id of the process that created it.

The file's group might be the group id of the directory that contained it when created, or it might be the group id of the process that created it. Read the manuals.

Since you are going to create an index that Andrew Trotman needs to be able to read, you want

    int fd;

    fd = creat("my-index", S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);

or possibly

    fd = creat("my-index", 0644);

The mode may be overridden by your umask.

If the path mentions a directory that does not exist, or if it names a directory, or if it names a file that you do not have permission to write, or if it names a file on a read-only file system, or if anything else goes wrong, creat will return -1 and set the errno variable to say what went wrong. Typically, you will just do

    #include <stdlib.h>
    #include <stdio.h>
    
    char const *index_file_name = "my-index";
    fd = creat(index_file_name, 0644);
    if (fd < 0) {
        perror(index_file_name);
        exit(EXIT_FAILURE);
    }

If all goes well and the file didn't exist, it will be created. At this point there is nothing in it.

If all goes well and the file did exist, it will be truncated. At this point there is now nothing in it.

Opening an existing file

     #include <sys/types.h>
     #include <sys/stat.h>
     #include <fcntl.h>

     int open(char const *path, int oflag, /* mode_t mode */...);

The path and mode arguments are the same as for creat().

The oflag argument says what exactly you want. It's made by OR-ing together various flags. The mode argument is only used if the oflag argument includes the O_CREAT option; if it doesn't, you don't need to pass mode at all. The options include
O_RDONLY  open for reading only
O_WRONLY  open for writing only
O_RDWR    open for reading and writing
O_CREAT   create the file if it doesn't exist
O_EXCL    if O_CREAT and O_EXCL are both set, fail if the file does exist
O_TRUNC   if the file exists, truncate it

You do not have to say that you want random access; if the file is a plain disc file, you will be able to access it randomly. If it's a terminal, or a pipe, you won't.

To open the index for reading, just do

    fd = open(index_file_name, O_RDONLY, 0);
    if (fd < 0) {
        perror(index_file_name);
        exit(EXIT_FAILURE);
    }

If you want to update an existing index, but fail if the file doesn't exist, do

    fd = open(index_file_name, O_RDWR, 0);
    if (fd < 0) { ... as before ... }

If you want to create the index if it doesn't exist, or update it if it does, do

    fd = open(index_file_name, O_RDWR|O_CREAT, 0644);
    if (fd < 0) { ... }

If you've heard about the O_APPEND flag, forget it. It doesn't do what you think it does.

Closing a file

To close a file,

    if (close(fd) < 0) {
        perror("closing a file");
        exit(EXIT_FAILURE);
    }

Yes, closing a file can fail. No, there isn't really anything you can do about it. Worse, due to operating system buffering, all that a successful close() means is that "all outstanding operations on this file have been successfully queued for later processing"; buffered data might not be flushed to disc until minutes later. There are things you can do about this, and you can only do them by working at this level, not at the C standard library or Java level.
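
If you really need the data to be on the disc before you carry on, POSIX provides fsync(), which does not return until the file's buffered data has been handed to the device. A minimal sketch, assuming fd is a file descriptor we are about to close:

    #include <unistd.h>

    /* Force this file's dirty buffers out to the device, then close it.
       fsync() can itself fail (disc full, I/O error), so check it too. */
    if (fsync(fd) < 0) {
        perror("fsync");
        exit(EXIT_FAILURE);
    }
    if (close(fd) < 0) {
        perror("close");
        exit(EXIT_FAILURE);
    }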

Reading through a channel

When you work at this level, you don't read or write characters, you read or write bytes. This doesn't make any difference for UNIX using an 8-bit character set, but it does make a difference for Unicode, and it does make a difference for Windows using the corresponding Windows system calls (CR+LF will not be translated to LF). Nor do you deal with single bytes; you read or write entire buffers. A buffer is any C data object; you can do "binary" I/O if you want to. But it may be better to use an array of plain chars or of unsigned chars.

To read, you have to specify which channel you want to read through, what buffer you want the results put in, and how many bytes you want.

    #include <unistd.h>

    ssize_t read(int fildes, void *buf, size_t nbyte);

    #define BUFFER_SIZE 8192
    unsigned char buffer[BUFFER_SIZE];
    ssize_t n;
    
    n = read(fd, buffer, sizeof buffer);

Of course it's not that simple. First, the operation could fail. If that happens, read() returns -1 and sets errno to say what went wrong. For example, you might have passed bad arguments. Or the file might be on a remote machine (that's the case for me right now, writing this file) and the remote machine might crash. But practically any system call might fail, so you have to check all of your system calls anyway.

The other problem is that you don't necessarily get the number of bytes you asked for. You might get fewer. In particular, if you have reached the end of the file, you will certainly get 0 as the answer, and that is the normal, indeed the only, way that UNIX indicates end-of-file. If you really desperately need all the bytes you asked for, you had better write a little function something like this:

    ssize_t read_fully(int fd, unsigned char *buf, size_t nb) {
        ssize_t n, r;
        
        r = 0;
        while (nb != 0) {
            n = read(fd, buf, nb);
            if (n < 0) return n;        /* error */
            if (n == 0) return r;       /* end of file */
            r += n;                     /* count the bytes received */
            nb -= n;                    /* 0 < n <= nb */
            buf += n;
        }
        return r;
    }

This function will return a short block only at the end of the file.

There is one more problem, which is that read() might return early because it was interrupted. You really don't want to know about that. That's meant for "slow" devices like terminals and sockets; "fast" devices like discs commonly aren't affected. A small change to read_fully() will deal with it, in the cases where it can be dealt with, but you don't need to worry.

Reading will be most efficient when each read asks for a full block starting on a block boundary and when the buffer is suitably aligned. "Suitably" might mean "on an even address" (for Windows) or it might mean "on a page boundary" (for some versions of UNIX).
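
One way to get a suitably aligned buffer on POSIX systems is posix_memalign(). A minimal sketch, assuming a 4096-byte page and the 8192-byte default block size mentioned earlier:

    #include <stdlib.h>

    void *buffer;

    /* Ask for an 8192-byte buffer starting on a 4096-byte (page) boundary.
       posix_memalign() returns 0 on success and an error number on failure;
       it does not set errno. */
    if (posix_memalign(&buffer, 4096, 8192) != 0) {
        fprintf(stderr, "could not allocate an aligned buffer\n");
        exit(EXIT_FAILURE);
    }
    n = read(fd, buffer, 8192);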

Finding out what the block size is

Once you have created or opened your file, you can ask the operating system about its properties.

    #include <sys/types.h>
    #include <sys/stat.h>

    int fstat(int fildes, struct stat *buf);

The struct stat includes the following fields:
off_t  st_size     file size in bytes
long   st_blksize  preferred block size in bytes

For plain disc files, st_size is the number of bytes in the file. For a pipe, it is the number of bytes available for reading at the moment (if you ask about the "read" end of the pipe), not the total number that will ever be available. For most other kinds of files it isn't defined.

The st_blksize field tells us what block size should be used.

    struct stat statbuf;
    long block_size;
    
    block_size = fstat(fd, &statbuf) < 0 ? 0 : statbuf.st_blksize;
    if (block_size == 0) block_size = 1024; /* or some other default */

Provided you pass sensible arguments, fstat() should not fail. If you choose a block size that is too big or too small, the program won't go wrong, it just won't go as fast as it might have.
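
Putting the pieces together, a read loop that works in whole blocks of the preferred size might look like this sketch; read_fully() is the helper defined earlier, and process_block() is a made-up name standing in for whatever you do with the data:

    unsigned char *buffer = malloc(block_size);
    ssize_t got;

    if (buffer == NULL) abort();
    for (;;) {
        got = read_fully(fd, buffer, block_size);
        if (got < 0) { perror("reading"); exit(EXIT_FAILURE); }
        if (got == 0) break;                  /* end of file */
        process_block(buffer, got);           /* hypothetical */
    }
    free(buffer);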

Writing through a channel

Writing is just like reading, except the other way around.

    #include <unistd.h>

    ssize_t write(int fildes, const void *buf, size_t nbyte);

You provide the channel to write through, the buffer where the data can be found, and the number of bytes you want written.

As with read(), write() can fail, and it may choose to write only part of the data. You may need

    ssize_t write_fully(int fd, unsigned char *buf, size_t nb) {
        ssize_t n, r;
        
        r = 0;
        while (nb != 0) {
            n = write(fd, buf, nb);
            if (n < 0) return n;        /* error */
            if (n == 0) return r;       /* no progress; give up */
            r += n;                     /* count the bytes written */
            nb -= n;                    /* 0 < n <= nb */
            buf += n;
        }
        return r;
    }

It is advisable to check for errors from calls to read(). It is absolutely essential to check for errors from calls to write(), because if you ignore an error, you, or your customers, will lose data. In particular, no matter how big they make discs, they will fill up, and when you're writing to them is just when it's likely to happen. It's even possible for the file system index structures on a disc to fill up when there are plenty of blocks available. It's certainly possible for output to a file on a remote computer to fail. I'm shouting a bit here because I have lost data myself, because an idiot software company forgot to check for failed write()s in their program. Important data. Several weeks of work, in fact. For our entire company.

Reading and writing other stuff

Suppose we have an array of doubles:

    #define ARR_MAX 100
    double arr[ARR_MAX];
    int narr; /* number of elements in arr[] */

Can we write that out so that we can read it back, without converting it to text? Yes.

    int read_completely(int fd, void *buf, size_t nb) {
        ssize_t n = read_fully(fd, buf, nb);
        if (n < 0) return -1;
        if (n != nb) { errno = EDOM; return -1; }
        return 0;
    }
    
    int write_completely(int fd, void *buf, size_t nb) {
        ssize_t n = write_fully(fd, buf, nb);
        if (n < 0) return -1;
        if (n != nb) { errno = EDOM; return -1; }
        return 0;
    }
    
    if (read_completely(ifd,  &arr[0], narr*sizeof arr[0]) < 0) abort();
    if (write_completely(ofd, &arr[0], narr*sizeof arr[0]) < 0) abort();

What if the reader won't know exactly how long the array is?

    if (read_completely(ifd, &narr, sizeof narr) < 0
     || read_completely(ifd, &arr[0], narr*sizeof arr[0]) < 0
    ) abort();
    
    if (write_completely(ofd, &narr, sizeof narr) < 0
     || write_completely(ofd, &arr[0], narr*sizeof arr[0]) < 0
    ) abort();

There are two issues here. First, writing data on a little-endian machine (PC Linux, say) and reading it on a big-endian machine (SPARC Solaris, say, or pretty much anything except a PC) is obviously not going to work. And writing pointers and reading them back probably isn't going to work in any realistic situation. Second, making lots of calls to the operating system to read or write little globs of data is slow.
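
If a file does have to be read on machines of either byte order, one common approach is to pick a byte order for the file and pack each number into bytes yourself. A minimal sketch for 32-bit values, assuming the file format is big-endian:

    #include <stdint.h>

    /* Pack a 32-bit value into big-endian bytes, whatever this CPU's order is. */
    void pack_u32(unsigned char b[4], uint32_t x) {
        b[0] = (x >> 24) & 0xFF;
        b[1] = (x >> 16) & 0xFF;
        b[2] = (x >>  8) & 0xFF;
        b[3] =  x        & 0xFF;
    }

    /* The matching unpacking function. */
    uint32_t unpack_u32(unsigned char const b[4]) {
        return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
             | ((uint32_t)b[2] <<  8) |  (uint32_t)b[3];
    }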

Modern versions of UNIX have scatter/gather I/O, where you can provide an entire array of (buffer pointer, buffer size) pairs in one call to readv() or writev(), but it's still better to map things onto blocks and transfer only whole blocks.
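
For the record, a writev() call that sends the element count and the array from the example above in a single system call might look like this sketch:

    #include <sys/uio.h>

    struct iovec iov[2];

    iov[0].iov_base = &narr;
    iov[0].iov_len  = sizeof narr;
    iov[1].iov_base = &arr[0];
    iov[1].iov_len  = narr * sizeof arr[0];
    /* Like write(), writev() may fail or write less than asked;
       a real program should check for both. */
    if (writev(ofd, iov, 2) < 0) { perror("writev"); exit(EXIT_FAILURE); }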

Seeking

Since UNIX version 7, there has been one function that is used both to find out "what is the current position of this channel" and to change it.

    #include <sys/types.h>
    #include <unistd.h>
    /* #define SEEK_SET 0 */
    /* #define SEEK_CUR 1 */
    /* #define SEEK_END 2 */

    off_t lseek(int fildes, off_t offset, int whence);

One type needs some explanation here. File offsets might fit into int, or they might need long, or they might even need long long. The off_t type is whatever integer type is suitable for representing file offsets. You should use it in your own code for every file offset. This ensures that your code will port easily to a 64-bit file system.

whence == SEEK_SET
The channel's current position is set to offset. Use this to move to an absolute position.
whence == SEEK_CUR
The channel's current position is incremented by offset, which may be negative. Use this for relative moves.
whence == SEEK_END
The channel's current position is set to end of file plus offset, which may be negative. The main use of this is to move to the end of file, when offset==0.

If the seek succeeds, lseek() returns the channel's current position (in bytes from the beginning of the file). This is a non-negative integer; it could be zero. If something goes wrong, lseek() leaves the position unchanged, sets errno, and returns (off_t)(-1).

The most likely reason for failure is that the file isn't seekable.

To find the current position

    off_t where;
    
    where = lseek(fd, 0L, SEEK_CUR);
    if (where == (off_t)(-1)) abort();

To rewind to the beginning of the file

    if (lseek(fd, 0L, SEEK_SET) == (off_t)(-1)) abort();

To seek to the end of the file

    if (lseek(fd, 0L, SEEK_END) == (off_t)(-1)) abort();

To return to a previously remembered place

    if (lseek(fd, where, SEEK_SET) == (off_t)(-1)) abort();
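
Putting seeking and reading together: to fetch a fixed-size record at a known byte offset, which is the usual pattern for an index file, seek there and then read. A minimal sketch, where struct record is a hypothetical record type and offset is an off_t value remembered earlier:

    struct record rec;          /* hypothetical fixed-size record */

    if (lseek(fd, offset, SEEK_SET) == (off_t)(-1)) abort();
    if (read_fully(fd, (unsigned char *)&rec, sizeof rec) != sizeof rec) abort();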

C using <stdio.h> functions

The C standard I/O library was originally designed so that C programs could be ported between UNIX and mainframes (IBM and other brands). It does several things:

User programs deal with things called streams, or "FILE* pointers". These point to objects in the program's address space containing a bunch of stuff that needn't concern us. The point is that this stuff is in user space, so it's quick to get at, and it is completely lost if the program crashes.

There are three predeclared and initialised streams: stdin, stdout, and stderr. As a rule, error messages must be written to stderr, and should be written nowhere else.

You get other streams by calling fopen(), fdopen(), or popen(). Here we are only concerned with fopen().

Creating a file

    #include <stdio.h>

    FILE *fopen(char const *path, char const *mode);

The mode argument is a string having nothing whatsoever to do with the mode argument of open(). The stdio library gives you no way of setting the file permissions; you always get the default. What is that default? I wish you the very best of luck trying to find out from the documentation. In fact it is 0666: everyone can read and write the file. This is modified by your "umask", just like open() and creat().

The mode argument is in fact the analogue of oflag:

"r"    Open for (text) reading only.
"rb"   Open for (binary) reading only.
       In either case, fail if file doesn't exist.
"w"    Open for (text) writing only.
"wb"   Open for (binary) writing only.
       Create or truncate the file.
"a"    Open for (text) appending only.
"ab"   Open for (binary) appending only.
       Create if missing, do not truncate.
"r+"   Open for (text) reading and writing.
"rb+"  Open for (binary) reading and writing.
       In either case, fail if file doesn't exist.
"w+"   Open for (text) writing and reading.
"wb+"  Open for (binary) writing and reading.
       Create or truncate the file.
"a+"   Open for (text) appending and reading.
"ab+"  Open for (binary) appending and reading.
       Create if missing, do not truncate.

A mode containing both "+" and "b" may have them in either order. Contrary to some C textbooks, "t" is not legal in an ISO C mode string.

In file systems where the end of line convention is a single \n character (whether \n is mapped to LF, to CR, to NEL, or to some other control character), there is no distinction between text and binary mode. In Windows, where the end of line convention is CR+LF, text channels convert between CR+LF outside and LF inside, while binary channels don't.

The append modes are a little tricky. In some programming languages, opening a file for appending is the same as opening it for output and seeking to the end instead of truncating it. At least in UNIX C, it means that every write through this channel will append at the end of the file, whatever the channel's current position is. So seeking "works" for reading but not for appending. This is very useful for adding to log files, because if more than one process has a log file open in append mode, each record they write will go out intact at whatever the end of the file is at the time of writing. Or it would be useful if you knew when the writes were going to happen. If you are going to exploit this, you must ensure that (a) the buffer is big enough so that there will be room for the entire record (see the 'setvbuf' function to find out how to do this) and (b) you call fflush as soon as the record you want appended is finished. The text of the C standard was, in one draft,

Opening a file with append mode ('a' as the first character in the mode argument) causes all subsequent writes to the file to be forced to the then current end-of-file, regardless of intervening calls to the fseek function. In some implementations, opening a binary file with append mode ('b' as the second or third character in the ... mode ...) may initially position the file position indicator for the stream beyond the last data written, because of null character padding.

There's worse: in append mode, you cannot know where the record was actually put. It will go where the end-of-file is at the time of writing, not where the channel's position is before writing. So you can't ask about the position before the call. Between the time when the record is written and the time when you ask where the end of file is, other processes may have written to the file as well, so the end of file may now be megabytes after the record you just wrote. So you can't ask about the position after the call either.

What I'm saying is "if you want to remember where things are written, don't use append mode".
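
If you do want to use append mode for a log file, the recipe from the earlier paragraph (a buffer big enough for a whole record, then fflush) looks roughly like this sketch; the buffer size, the file name, and ndocs are all made up:

    static char log_buffer[8192];
    FILE *log = fopen("my-log", "a");

    if (log == 0) { perror("my-log"); exit(EXIT_FAILURE); }
    /* (a) make sure the stdio buffer can hold a whole record */
    if (setvbuf(log, log_buffer, _IOFBF, sizeof log_buffer) != 0) abort();
    ...
    fprintf(log, "indexed %d documents\n", ndocs);
    /* (b) flush as soon as the record is complete, so it goes out in one piece */
    if (fflush(log) != 0) { perror("my-log"); exit(EXIT_FAILURE); }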

To create your index file:

    FILE *fp;

    fp = fopen("my-index", "w");

If fopen() fails, it returns a NULL pointer. The C standard does not require it to set errno; the description of fopen() says nothing about errno. The UNIX documentation, however, does say that it sets errno on failure. So you should write

    FILE *fp;
    char const *index_file_name = "my-index";

    fp = fopen(index_file_name, "w");
    if (fp == 0) {
        perror(index_file_name);
        exit(EXIT_FAILURE);
    }

If all goes well and the file didn't exist, it will be created. At this point there is nothing in it.

If all goes well and the file did exist, it will be truncated. At this point there is now nothing in it.

There's one really nasty gotcha with the stdio library. It does buffering. The characters you write through the stream are stored in a byte array in your address space, until it fills up, and then passed to the operating system. That's good. But the buffer is not allocated until the first time you try to read or write through the channel. This is to give you a chance to provide your own buffer:

    fp = fopen(...);
    if (fp == 0) abort();
    if (setvbuf(fp, my_big_buff, _IOFBF, sizeof my_big_buff) != 0) abort();

What this means is that if you don't provide your own buffer, the first read or write through the stream may discover to its horror that memory has run out when it tries to allocate a buffer for you. Depending on the implementation, it might go ahead without a buffer, just very slowly, or the read or write attempt might fail mysteriously, even though the file opened properly.

A buffer allocated by the stdio library itself will have room for BUFSIZ bytes (this is defined in stdio.h). On the machine where I typed this, BUFSIZ was 1024, even though the preferred block size for disc files is 8192. Providing your own buffer can make your program go faster. But you don't have to.

For building your index file, you probably want mode "wb", not that it makes any difference on UNIX.

Opening an existing file

    fp = fopen(index_file_name, "rb");
    if (fp == 0) {
        perror(index_file_name);
        exit(EXIT_FAILURE);
    }

You do not have to say that you want random access; if the file is a plain disc file, you will be able to access it randomly. If it's a terminal, or a pipe, you won't.

If you want to update an existing index, but fail if the file doesn't exist, do

    fp = fopen(index_file_name, "rb+");
    if (fp == 0) { ... as before ... }

If you want to create the index if it doesn't exist, or update it if it does, do

    fp = fopen(index_file_name, "rb+");
    if (fp == 0) fp = fopen(index_file_name, "wb+");
    if (fp == 0) { ... }

Closing a file

To close a file,

    if (fclose(fp) < 0) {
        perror("closing a file");
        exit(EXIT_FAILURE);
    }

Yes, closing a file can fail. See the description of close() above.

Reading through a channel

There are lots of ways to read characters, which you should already know about. The normal way to read one character is

    int c;

    c = getc(fp);

You should not use fgetc() unless you have some very special reason to do so. You should not make the result any type other than int.

When you read one "character" from a binary-mode stream, you get one byte. All the other input operations can be built from this. For efficiency, they normally aren't. Use functions that transfer big globs of data whenever they are appropriate.

When you read one "character" from a text-mode stream, you get CR+LF translation under Windows; you do not not get that translation under UNIX or MacOS even when the file you are reading is in a remotely mounted Windows file system.

An important difference between getc() and read() is the way they report end of file. The getc() macro returns EOF (a negative value) at end of file. It also returns EOF when an error occurs. You have to use the feof() or ferror() function to tell the difference. So to read all the characters in a stream, do

    int c;
    
    while ((c = getc(fp)) >= 0) {
        process character c
    }
    if (feof(fp)) {
        we read all the data
    } else {
        we stopped because of an error
    }

Reading a block of characters

    #include <stdio.h>
    size_t fread(void *buf, size_t size, size_t nitems, FILE *fp);

The buf argument says where to put the data. It is normally the address of the first element of some kind of array. The size argument says how big each element of the array is. The nitems argument says how many elements we want. The fp argument says what channel to use. This is rather confusing, because the fp argument goes first in many functions.

We could almost define the function like this:

    size_t fread(void *buf, size_t size, size_t nitems, FILE *fp) {
        unsigned char *p = buf;
        size_t nb = size * nitems;
        size_t r;

        for (r = 0; r != nb; r++) {
            int c = getc(fp);
            if (c < 0) break;    /* end of file or error: return a short count */
            *p++ = c;
        }
        return r / size;
    }

The real code can get at the internals of the FILE object and is likely to be much more efficient.

We can see five important things here.

First, the result is normally the number of array elements read; if the stream is at end-of-file when the call starts, the result is 0.

Second, you get everything you asked for, if that's possible. Unlike read(), fread() isn't allowed to stop early just because it feels like it.

Third, if the things you are reading are not characters, the stream had better be in binary mode, otherwise CR+LF conversion may garble your data on Windows.

Fourth, both error and end of file look exactly the same. We have to check feof() or ferror() to tell them apart, just like a call to getc().

Fifth, the number of bytes read from the stream is not necessarily equal to the number of elements successfully read times the size of each element. Imagine, for example,

    double d;
    
    if (fread(&d, sizeof d, 1, fp) != 1) abort();

when sizeof d == 8 but there are only 4 bytes left in the file. The fread() function will read the 4 bytes. As the C standard says, "If a partial element is read, its value is indeterminate." So d is undefined, but the result should be 0, not 1, so the check above catches it. Do not rely on "dead reckoning" to keep track of the stream position; whenever you want to know the current position, ask the stream. To be honest, this bothers me enough that I use my own function. In the important case where the element size is 1, there can be no partial elements, so reading blocks of bytes is OK.
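
Such a wrapper needn't be clever; a minimal sketch (the name fread_exact is made up) that treats anything other than "I got exactly what I asked for" as failure:

    /* Read exactly nitems elements of the given size, or report failure.
       Returns 0 on success, -1 on end of file or error; the caller can use
       feof()/ferror() on fp to find out which. */
    int fread_exact(void *buf, size_t size, size_t nitems, FILE *fp) {
        return fread(buf, size, nitems, fp) == nitems ? 0 : -1;
    }

    double d;

    if (fread_exact(&d, sizeof d, 1, fp) < 0) abort();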

Writing through a channel

Writing is just like reading, except the other way around.

    #include <stdio.h>

    size_t fwrite(void const *buf, size_t size, size_t nitems, FILE *fp);

We could almost define the function like this:

    size_t fwrite(void const *buf, size_t size, size_t nitems, FILE *fp) {
        unsigned char const *p = buf;
        size_t nb = size * nitems;
        size_t r;
        
        for (r = 0; r != nb; r++) {
            int c = *p++;
            if (putc(c, fp) < 0) break;
        }
        return r / size;
    }

The real code can get at the internals of the FILE object and is likely to be much more efficient.

If the result of fwrite() is not equal to nitems, there must have been a write error. An element will be partially written only if there was a write error.

It is advisable to check for errors from calls to fread(). It is absolutely essential to check for errors from calls to fwrite(), because if you ignore an error, you, or your customers, will lose data.
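
The check itself is short. A minimal sketch for writing nb bytes from buffer through the stream ofp, remembering that with stdio an error may not surface until the buffered data is actually passed to the operating system:

    if (fwrite(buffer, 1, nb, ofp) != nb) {
        perror(index_file_name);
        exit(EXIT_FAILURE);
    }
    /* Because of buffering, also check fflush() or fclose(). */
    if (fclose(ofp) < 0) {
        perror(index_file_name);
        exit(EXIT_FAILURE);
    }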

Reading and writing other stuff

The fread() and fwrite() functions don't care what the buf argument points to. However, if it isn't an array of characters, the fp stream had better be in binary mode, and you had better not expect your file to be usable on any machine other than the one where you wrote it. I've even known binary files written using one C compiler to be unreadable on the same machine if a different C compiler was used, due to different padding between structure members.

Because the data are buffered in your address space, something like

    #define ARR_MAX 100
    double arr[ARR_MAX];
    int narr; /* number of elements in arr[] */
    ...
    if (fread(&narr, sizeof narr, 1, ifp) != 1
     || fread(&arr[0], sizeof arr[0], narr, ifp) != narr
    ) abort();
    ...
    if (fwrite(&narr, sizeof narr, 1, ofp) != 1
     || fwrite(&arr[0], sizeof arr[0], narr, ofp) != narr
    ) abort();

is perfectly fine. Because of this there is no need for any scatter/gather versions of fread() and fwrite().

Seeking

There are actually four functions related to file position in the C standard I/O library.

    long ftell(FILE *fp);

This returns the current position. For a binary stream, the value is the number of bytes from the beginning of the file: 0 means at the beginning. For a text stream, it's some sort of magic cookie that you can use to get back to the same place: all you can do with it is give it back to fseek.

If something goes wrong, ftell() sets errno and returns -1.

These days we have cheap 2 Terabyte discs. What happens if you have a file that's 5GB in size, you're at the end, you call ftell(), and LONG_MAX is 2147483647L? It doesn't work, that's what. If you are ever in that situation, find out how to compile in "LP64" mode, and do it.

    int fgetpos(FILE *fp, fpos_t *pos);

In order to handle that kind of situation portably, ISO C has a special "file position" type fpos_t and a function fgetpos() that gets it. This is an abstract data type. On some machines it might be a numeric type, but you cannot rely on it. All you can do with one of these is pass it to fsetpos(); there is no function to compare fpos_t values, which is a pity.

    int fseek(FILE *fp, long offset, int whence);

This is just like the lseek() function, except that it doesn't use the off_t type for the offset and it doesn't return the new position.

For a text stream, the offset could be 0 (meaning beginning of file), or a value returned by ftell().

For a binary stream, it's a number of bytes.

A call to fseek() may involve undoing a call to ungetc() and may involve sending buffered output data to the operating system.

    int fsetpos(FILE *fp, fpos_t const *pos);

This moves the stream's position to a previously remembered position returned by fgetpos().

There are two more functions that are part of POSIX but not part of ISO C. The POSIX interface defines an integral type off_t for file sizes and positions. On a system with 32-bit file sizes, off_t will be 32 bits. On a system with 64-bit file sizes, off_t will be 64 bits. In both cases, this has little to do with the size of long. If you are using the native UNIX lseek() function, off_t is the type you must use. The two stdio-on-POSIX functions are

    off_t ftello(FILE *stream);
    int   fseeko(FILE *stream, off_t offset, int whence);

and they are exactly like ftell() and fseek() except for using off_t instead of long as the "file position" type. If you know that your program will be compiled with a POSIX interface, you should probably use these functions.

To find the current position

    long where;
    
    where = ftell(fp);
    if (where == -1) abort();

To rewind to the beginning of the file

    if (fseek(fp, 0L, SEEK_SET)) abort();

To seek to the end of the file

    if (fseek(fp, 0L, SEEK_END)) abort();

To return to a previously remembered place

    if (fseek(fp, where, SEEK_SET)) abort();

Your assignment

The unprocessed document collection is about 0.5GB.

The dictionary will be under 1MB.

The density of words in the documents is such that the uncompressed postings will be less than 20% bigger than the document collection; that is, not more than 0.6GB and possibly rather less.

This means that even without compressing your postings, you will be able to use fseek() and ftell() safely in a 32-bit environment.

But there's another snag.

It didn't have to be this way. There was a design botch in the original stdio implementation. Nobody ever fixed it, and the ISO C committee decided that we could live with the problem, and didn't demand that implementors fix it, even though it could have been fixed quite easily.

The cause of the problem is irrelevant.

The consequence is that if you use a "+" mode (both reading and writing are allowed through the same stream), then mixing reads and writes is not going to work unless you are very careful. Basically, a read-write stream is in one of three states at any time: reading (the last operation was a read), writing (the last operation was a write), or neutral (just after the file was opened, or after a seek or flush). You may only read in the reading or neutral states, and only write in the writing or neutral states.

The workaround is that if you might have been reading through a stream and now want to write, you must call fseek(), and if you might have been writing through a stream and now want to read, you must call fseek().

You must call fseek() in this situation even if it's fseek(fp, 0L, SEEK_CUR) so that no actual movement is desired or done. If you see a fseek(fp, 0L, SEEK_CUR) in a program, it is extremely dangerous to remove it.
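
For example, to overwrite a record in place through a "rb+" stream you might read it, seek back, and write it, with the mandatory fseek() between the read and the write. A sketch, where struct record, its count field, and offset are all made up:

    struct record rec;          /* hypothetical fixed-size record */

    if (fseek(fp, offset, SEEK_SET) != 0) abort();
    if (fread(&rec, sizeof rec, 1, fp) != 1) abort();
    rec.count += 1;             /* change it somehow */
    /* switching from reading to writing: seek back to the start of the record */
    if (fseek(fp, offset, SEEK_SET) != 0) abort();
    if (fwrite(&rec, sizeof rec, 1, fp) != 1) abort();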

Random Access using RandomAccessFile in Java

Java provides random access through the class RandomAccessFile. This is a misnomer: objects of this class represent channels, not files.

These channels provide data input and data output as well as seeking, so we have to start with two interfaces.

DataOutput lets you write Java's primitive types out, plus String.

public interface DataOutput {
    public void write(int b) throws IOException;
        // Write the bottom 8 bits of b.

    public void write(byte[] b) throws IOException;
        // Write all the bytes of b[].

    public void write(byte[] b, int off, int len) throws IOException;
        // Write b[off] ... b[off+len-1].    

    public void writeBoolean(boolean b) throws IOException;
        // Write 1 if b is true, 0 if b is false.

    public void writeShort(int s) throws IOException;
        // Write bottom 16 bits in big-endian order.

    public void writeChar(int c) throws IOException;
        // Same as writeShort(), really.
        // Beware: Unicode codes are 21 bits, but
        // Java chars are only 16 bits.  Oh DEAR.

    public void writeInt(int i) throws IOException;
        // Write all 4 bytes of i in big-endian order.

    public void writeLong(long l) throws IOException;
        // Write all 8 bytes of l in big-endian order.

    public void writeFloat(float f) throws IOException;
        // writeInt(Float.floatToIntBits(f));

    public void writeDouble(double d) throws IOException;
        // writeLong(Double.doubleToLongBits(d));

    public void writeBytes(String s) throws IOException;
        // Writes the bottom byte of each character of s in turn.
        // Beware: only useful for Latin-1 characters.

    public void writeChars(String s) throws IOException;
        // applies writeChar to each character of s in turn.

    public void writeUTF(String s) throws IOException;
        // converts s to a sequence of bytes using a nonstandard
        // version of UTF-8; writes the number of bytes (not characters)
        // using writeShort(), then writes the bytes.
}

The DataInput interface lets you read any of Java's primitive data types, plus String. The String methods have a couple of gotchas: readLine() requires Latin-1 text, not Unicode in UTF-8, and readUTF() uses a nonstandard version of UTF-8 coding. All DataInput is really good for is reading stuff written using DataOutput.

All the methods may throw IOException; EOFException will be thrown if you hit end of file early, other IOExceptions can occur for other reasons.

Trap for young players: the equivalent of DataOutput.write is not DataInput.read but DataInput.readFully.

public interface DataInput {
    public int skipBytes(int n) throws IOException;
        // Skip n bytes if possible; stop early at EOF.
        // Return number of bytes actually skipped.
        // Does NOT throw EOFException.

    public void readFully(byte[] b) throws IOException;
        // reads b[0] ... b[b.length - 1] in turn.

    public void readFully(byte[] b, int off, int len) throws IOException;
        // reads b[off] ... b[off+len-1] in turn.

    boolean readBoolean() throws IOException;
        // Read one byte, return that byte != 0.

    byte readByte() throws IOException;
        // Read one byte (-128..+127) and return it.

    public int readUnsignedByte() throws IOException;
        // Read one byte (0..255) and return it.

    public short readShort() throws IOException;
        // Read two bytes (big-endian) and return -32768..+32767.

    public int readUnsignedShort() throws IOException;
        // Read two bytes (big-endian) and return 0..65535.

    public char readChar() throws IOException;
        // Read two bytes as big-endian Unicode character.
        // Beware: tens of thousands of Unicode characters have
        // codes that are too big to fit into 16 bits!

    public int readInt() throws IOException;
        // Read four bytes (big-endian) and return signed int.

    // There is no readUnsignedInt

    public long readLong() throws IOException;
        // Read eight bytes (big-endian) and return signed 64-bit.

    public float readFloat() throws IOException;
        // Float.intBitsToFloat(readInt()).

    public double readDouble() throws IOException;
        // Double.longBitsToDouble(readLong());

    public String readLine() throws IOException;
        // Read single-byte characters until EOF or end of line;
        // return those characters in a string, or null at EOF.
        // Beware: this can only handle ISO Latin 1 text.

    public String readUTF() throws IOException;
        // Read byte length as by readUnsignedShort().
        // Read that many bytes.
        // Interpret them as UTF-8 encoding of a String.
        // Beware: Java "UTF-8" isn't standard UTF-8.
}

Now that they're out of the way, we can describe RandomAccessFile. The model described in the official documentation is precisely our logical model.

class RandomAccessFile implements DataInput, DataOutput {
    public RandomAccessFile(File file, String mode)
        throws FileNotFoundException;
        // file is a Java file name object which names the file.
        // mode is "r", "rw", "rws", or "rwd".  There is no "w" mode.
        // All you need is "r" or "rw".

    public RandomAccessFile(String name, String mode)
        throws FileNotFoundException;
        // Takes a string for the file name.
        // Mode "r" means read only, "rw" means "read and write".

    public void close() throws IOException;
        // Don't forget to close each file.

    public FileDescriptor getFD() throws IOException;
        // Look up FileDescriptor if you really want to know.

    public FileChannel getChannel() throws IOException;
        // Look up FileChannel if you really want to know.

    public int read() throws IOException;
        // Same as readUnsignedByte(), except it returns -1
        // at end of file instead of throwing EOFException.

    public int read(byte[] b) throws IOException;
        // Same as readFully(b) except that it stops early at EOF;
        // returns number of bytes read or -1 if none.

    public int read(byte[] b, int off, int len) throws IOException;
        // Same as readFully(b, off, len) except that it stops
        // early at EOF; returns number of bytes read or -1 if none.

    public long length() throws IOException;
        // Returns size of file (in bytes).

    public long getFilePointer() throws IOException;
        // Returns current position in file (in bytes).

    public void seek(long pos) throws IOException;
        // Sets the current position to pos (>= 0).

    public void setLength(long len) throws IOException;
        // len < length(): truncates file
        // len > length(): appends enough undefined bytes
}

The class as a whole is fairly straightforward to use, except that some of the methods are, um, "quirky". Forget about subclassing it; lots of the methods are final, but not all. If you try to figure out why, you'll go blind. (:-)

Java 1.4 introduced FileChannels. They are about as close as Java can get to POSIX I/O, and if you were really trying to make an efficient IR system in Java, you might prefer these to RandomAccessFiles. But you'd start from a RandomAccessFile in any case, and RandomAccessFile has been around since Java 1.0, so more Java systems can handle it.

Java uses 64-bit long integers for positions, so it doesn't have a problem with large files. RandomAccessFiles are defined to provide binary access, so end of line conversion is not a problem. The main complication is having to remember to do something about IOExceptions whenever you use a RandomAccessFile.

One complication is the names.
What you expect    What you get
getPosition        getFilePointer
setPosition        seek
getLength          length
setLength          setLength

Vile as the Java naming conventions are, following them would have been better than following no conventions...

To create a file

    String index_file_name = "my-index";
    RandomAccessFile random_stream =
        new RandomAccessFile(index_file_name, "rw");
    random_stream.setLength(0);

To open an existing file for reading

    File index_file_object = new File(index_file_name);
    if (!index_file_object.canRead()) throw Some_Exception_Or_Other;
    RandomAccessFile random_stream =
        new RandomAccessFile(index_file_object, "r");

To open an existing file for reading and writing

    RandomAccessFile random_stream =
        new RandomAccessFile(index_file_name, "rw");

To find the current position

    long where;
    
    where = random_stream.getFilePointer();

To rewind to the beginning of the file

    random_stream.seek(0);

To seek to the end of the file

    random_stream.seek(random_stream.length());

To return to a previously remembered place

    random_stream.seek(where);

Some observations on speed

In the lecture I talked about the memory hierarchy. Some numbers may be of interest.

So a fast program keeps the amount of memory it uses small, doesn't touch disc if it can avoid it, and if it does touch the disc, reads as much as it can sequentially, or at least uses system calls to let the operating system know well in advance what it is going to read.
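
On POSIX systems, one way to give the operating system that advance notice is posix_fadvise(). A minimal sketch, assuming we are about to read the whole file sequentially from the start:

    #include <fcntl.h>

    /* Tell the OS we intend to read the whole file in order, so it can start
       reading ahead.  This is purely advisory; it is safe to ignore failure. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);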