
████████╗██╗  ██╗███████╗    ██████╗  ██████╗  ██████╗ ███╗   ███╗
╚══██╔══╝██║  ██║██╔════╝    ██╔══██╗██╔═══██╗██╔═══██╗████╗ ████║
   ██║   ███████║█████╗      ██║  ██║██║   ██║██║   ██║██╔████╔██║
   ██║   ██╔══██║██╔══╝      ██║  ██║██║   ██║██║   ██║██║╚██╔╝██║
   ██║   ██║  ██║███████╗    ██████╔╝╚██████╔╝╚██████╔╝██║ ╚═╝ ██║
   ╚═╝   ╚═╝  ╚═╝╚══════╝    ╚═════╝  ╚═════╝  ╚═════╝ ╚═╝     ╚═╝
    

Advanced Programming in the UNIX Environment (Chapter 3)

by Fare9

3 - File I/O

Makefile for this chapter: Makefile

Introduction

We start our discussion of the UNIX System by describing the functions available for file I/O: open a file, read a file, write a file, and so on. Most file I/O on a UNIX system can be performed using only five functions: open, read, write, lseek and close. We then examine the effect of various buffer sizes on the read and write functions. The functions described in this chapter are often referred to as unbuffered I/O, in contrast to the standard I/O routines, which we describe in Chapter 5. The term unbuffered means that each read or write invokes a system call in the kernel. These unbuffered I/O functions are not part of ISO C, but they are part of POSIX.1 and the Single UNIX Specification. Whenever we describe the sharing of resources among multiple processes, the concept of an atomic operation becomes important. We examine this concept with regard to file I/O and the arguments to the open function. This leads to a discussion of how files are shared among multiple processes and which kernel data structures are involved. After describing these features, we describe the dup, fcntl, sync, fsync and ioctl functions.

File Descriptors

To the kernel, all open files are referred to by file descriptors. A file descriptor is a non-negative integer. When we open an existing file or create a new one, the kernel returns a file descriptor to the process. When we want to read or write a file, we identify the file with the file descriptor returned by open or creat, passing it as an argument to read or write. By convention, UNIX shells associate file descriptor 0 with the standard input of a process, 1 with standard output, and 2 with standard error. This convention is used by the shells and many applications; it is not a feature of the UNIX kernel. Nevertheless, many applications would break if these associations weren’t followed. The values are standardized by POSIX.1, but the magic numbers 0, 1 and 2 should be replaced in POSIX-compliant applications with the symbolic constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO to improve readability. These constants are defined in the <unistd.h> header. File descriptors range from 0 through OPEN_MAX - 1. Early historical implementations of UNIX had an upper limit of 19, allowing a maximum of 20 open files per process, but many systems later increased this limit to 63.

open and openat Functions

A file is opened or created by calling either of these functions:

    #include <fcntl.h>
    int open (const char *path, int oflag, ... /* mode_t mode */ );
    int openat (int fd, const char *path, int oflag, ... /* mode_t mode */ );
        Both return: file descriptor if OK, -1 on error.

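We show the last argument as ..., which is the ISO C way to specify that the number and types of the remaining arguments may vary. For these functions, the last argument is used only when a new file is being created. We show this argument as a comment in the prototype. The path parameter is the name of the file to open or create. The function has a multitude of options, specified by the oflag argument. This argument is formed by ORing together one or more constants from the <fcntl.h> header. Five of them select the access mode: O_RDONLY (open for reading only), O_WRONLY (open for writing only), O_RDWR (open for reading and writing), O_EXEC (open for execute only) and O_SEARCH (open for search only, which applies to directories).

One and only one of these five constants must be specified. The remaining constants, such as O_APPEND, O_CREAT, O_EXCL, O_TRUNC, O_NONBLOCK and O_SYNC, are optional.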

(In earlier releases of System V, the O_NDELAY (no delay) flag was introduced. This option is similar to the O_NONBLOCK (nonblocking) option, but an ambiguity was introduced in the return value from a read operation. The no-delay option causes a read operation to return 0 if there is no data to be read from a pipe, FIFO or device, but this conflicts with a return value of 0 indicating an end of file. SVR4-based systems still support the no-delay option, but new applications should use the nonblocking option instead.)

The following two flags, O_DSYNC and O_RSYNC, are also optional. They are part of the synchronized input and output option of the Single UNIX Specification.

The file descriptor returned by open and openat is guaranteed to be the lowest-numbered unused descriptor. This fact is used by some applications to open a new file on standard input, standard output, or standard error. For example, an application might close standard output (normally file descriptor 1) and then open another file, knowing that it will be opened on file descriptor 1. We’ll see a better way to guarantee that a file is open on a given descriptor with the dup2 function (which is better suited to this than dup).
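The lowest-numbered-descriptor guarantee can be observed with a small sketch like the following. The helper name lowest_fd_is_reused and the scratch path passed to it are hypothetical, and the check assumes a single-threaded process (another thread could grab the descriptor in between).

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens a scratch file, closes it, and opens it again.
 * Returns 1 if the second open got the same descriptor number
 * (the lowest unused one), 0 if not, -1 on error. */
int lowest_fd_is_reused(const char *path)
{
    int first = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (first < 0)
        return -1;
    if (close(first) < 0)
        return -1;

    /* "first" is now the lowest unused descriptor again */
    int second = open(path, O_WRONLY);
    if (second < 0)
        return -1;

    int same = (second == first);
    close(second);
    unlink(path);       /* remove the scratch file */
    return same;
}
```

The same reasoning explains why closing descriptor 1 and then calling open places the new file on standard output.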

The fd parameter distinguishes the openat function from the open function. There are three possibilities:

  1. The path parameter specifies an absolute pathname. In this case, the fd parameter is ignored and the openat function behaves like the open function.
  2. The path parameter specifies a relative pathname and the fd parameter is a file descriptor that specifies the starting location in the file system where the relative pathname is to be evaluated. The fd parameter is obtained by opening the directory where the relative pathname is to be evaluated.
  3. The path parameter specifies a relative pathname and the fd parameter has the special value AT_FDCWD. In this case, the pathname is evaluated starting in the current working directory and the openat function behaves like the open function.
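The second case might look like the sketch below: we open the directory first and then resolve a relative name against that descriptor. The helper name open_in_dir and the 0600 mode are assumptions made for illustration only.

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens (creating it if needed) the file "name" relative to the
 * directory "dir", without touching the current working directory.
 * Returns the new file descriptor, or -1 on error. */
int open_in_dir(const char *dir, const char *name)
{
    /* descriptor for the starting location of the relative lookup */
    int dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;

    int fd = openat(dirfd, name, O_RDWR | O_CREAT, 0600);

    close(dirfd);       /* the directory descriptor is no longer needed */
    return fd;
}
```

Passing AT_FDCWD instead of dirfd would give the third case, which behaves like a plain open.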

The openat function is one of a class of functions added to the latest version of POSIX.1 to address two problems. First, it gives threads a way to use relative pathnames to open files in directories other than the current working directory. All threads in a process share the same current working directory, which makes it difficult for multiple threads in the same process to work in different directories at the same time. Second, it provides a way to avoid time-of-check-to-time-of-use (TOCTTOU) errors.

The basic idea behind TOCTTOU errors is that a program is vulnerable if it makes two file-based function calls where the second call depends on the result of the first call. Because the two calls are not atomic, the file can change between the two calls, thereby invalidating the results of the first call and leading to a program error. TOCTTOU errors in the file system namespace generally deal with attempts to subvert file system permissions by tricking a privileged program into either reducing permissions on a privileged file or modifying a privileged file to open up a security hole. Wei and Pu [2005] discuss TOCTTOU weaknesses in the UNIX file system interface.

Filename and Pathname Truncation

What happens if NAME_MAX is 14 and we try to create a new file in the current directory with a filename containing 15 characters? Traditionally, early releases of System V allowed this to happen, silently truncating the filename beyond the 14th character. BSD-derived systems, in contrast, returned an error status, with errno set to ENAMETOOLONG. Silently truncating the filename presents a problem that affects more than simply the creation of new files. If NAME_MAX is 14 and a file exists whose name is exactly 14 characters, any function that accepts a pathname argument, such as open or stat, has no way to determine what the original name of the file was, as the original name might have been truncated.

With POSIX.1, the constant _POSIX_NO_TRUNC determines whether long filenames and long components of pathnames are truncated or an error is returned. This value can vary based on the type of the file system, and we can use fpathconf or pathconf to query a directory to see which behaviour is supported.

If _POSIX_NO_TRUNC is in effect, an error status is returned and errno is set to ENAMETOOLONG if any filename component of the pathname exceeds NAME_MAX. (Most modern file systems support a maximum of 255 characters for filenames, so this is usually not a problem.)
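Querying a directory for these two values could be sketched as follows. The helper names are hypothetical; the sketch assumes the usual pathconf convention that a return of -1 with errno left unchanged means "no limit" or "option not supported" rather than an error.

```c
#include <errno.h>
#include <unistd.h>

/* Returns NAME_MAX for the file system holding "dir",
 * or -1 if it is indeterminate or an error occurred. */
long name_max_for(const char *dir)
{
    return pathconf(dir, _PC_NAME_MAX);
}

/* Returns 1 if over-long names are rejected with ENAMETOOLONG
 * (_POSIX_NO_TRUNC in effect), 0 if they may be silently
 * truncated, -1 on error. */
int long_names_are_errors(const char *dir)
{
    errno = 0;
    long value = pathconf(dir, _PC_NO_TRUNC);
    if (value == -1)
        return errno == 0 ? 0 : -1;   /* unchanged errno: option not in effect */
    return value != 0;
}
```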

creat Function

A new file can also be created by calling the creat function:

    #include <fcntl.h>
    int creat (const char *path, mode_t mode);
        Returns: file descriptor opened for write-only if OK, -1 on error.

This function is equivalent to:

    open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);
    

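Historically, in early versions of the UNIX System, the second argument to open could only be 0, 1 or 2, and there was no way to open a file that didn’t already exist, so a separate creat call was needed to create new files. Now that open provides the O_CREAT and O_TRUNC options, a separate creat function is no longer needed.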

We’ll look at the mode argument later, when we describe a file’s access permissions in detail. One deficiency of creat is that the file is opened only for writing. Before the new version of open was provided, if we were creating a temporary file that we wanted to write and then read back, we had to call creat, close and then open. A better way is to use the open function like this:

    open(path, O_RDWR | O_CREAT | O_TRUNC, mode);
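As a rough illustration of why the read-write open is convenient, the sketch below creates a file, writes to it and reads the data back through the same descriptor. The helper name write_then_read_back, the test string and the 0600 mode are assumptions for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Creates "path" for reading and writing, writes four bytes,
 * rewinds, and reads them back through the same descriptor.
 * Returns 1 if the data matches, 0 if not, -1 on error. */
int write_then_read_back(const char *path)
{
    char buf[4];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, "abcd", 4) != 4 ||
        lseek(fd, 0, SEEK_SET) == -1 ||     /* back to the beginning */
        read(fd, buf, 4) != 4) {
        close(fd);
        return -1;
    }

    close(fd);
    unlink(path);       /* remove the scratch file */
    return memcmp(buf, "abcd", 4) == 0;
}
```

With creat, the read would fail because the descriptor is write-only.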

close Function

An open file is closed by calling the close function:

    #include <unistd.h>
    int close (int fd);
        Returns 0 if OK, -1 on error

Closing a file also releases any record locks that the process may have on the file. We’ll discuss this further later. When a process terminates, all of its open files are closed automatically by the kernel. Many programs take advantage of this fact and don’t explicitly close open files.

lseek Function

Every open file has an associated “current file offset”, normally a non-negative integer that measures the number of bytes from the beginning of the file (we’ll see some exceptions to the “non-negative” qualifier). Read and write operations normally start at the current file offset and cause the offset to be incremented by the number of bytes read or written. By default, this offset is initialized to 0 when a file is opened, unless the O_APPEND option is specified. An open file’s offset can be set explicitly by calling lseek:

    #include <unistd.h>
    off_t lseek (int fd, off_t offset, int whence);
        Returns: new file offset if OK, -1 on error

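The interpretation of the offset depends on the value of the whence argument:

  • If whence is SEEK_SET, the file’s offset is set to offset bytes from the beginning of the file.
  • If whence is SEEK_CUR, the file’s offset is set to its current value plus the offset. The offset can be positive or negative.
  • If whence is SEEK_END, the file’s offset is set to the size of the file plus the offset. The offset can be positive or negative.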

Because successful call to lseek returns new file offset, we can seek zero bytes from current position to determine current offset:

    off_t   currpos;
    currpos = lseek(fd, 0, SEEK_CUR);

This technique can also be used to determine whether a file is capable of seeking. If the file descriptor refers to a pipe, FIFO, or socket, lseek returns -1 and sets errno to ESPIPE. (The three constants SEEK_SET, SEEK_CUR and SEEK_END were introduced with System V. Prior to this, whence was specified as 0 (absolute), 1 (relative to the current offset), or 2 (relative to the end of file). The character l in the name lseek means “long integer”. Before the introduction of the off_t data type, the offset argument and the return value were long integers. lseek itself was introduced with Version 7, when long integers were added to C; before that, the functions seek and tell were used.)

Example: test_lseek.c tests whether the standard input is capable of seeking. We can run the program with different kinds of input:

    $ ./test_lseek < /etc/passwd # input is a file (seek OK)
    $ cat < /etc/passwd | ./test_lseek # input is a pipe (cannot seek)
    $ ./test_lseek < /var/spool/cron/FIFO # input here is a FIFO (cannot seek)
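The core of such a test can be sketched in a few lines. This is not the contents of test_lseek.c itself, only an assumed minimal equivalent; the helper name is_seekable is hypothetical.

```c
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the descriptor supports seeking, 0 otherwise.
 * Seeking zero bytes from the current position changes nothing,
 * so the call is safe to use as a probe. */
int is_seekable(int fd)
{
    /* compare with -1 specifically, not with "< 0" */
    return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
}
```

A program in the spirit of the example would call is_seekable(STDIN_FILENO) and print "seek OK" or "cannot seek" accordingly.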

Normally, a file’s current offset must be a non-negative integer. It is possible, however, that certain devices could allow negative offsets. For regular files, the offset must be non-negative. Because negative offsets are possible, we should test whether the return value of lseek is equal to -1, rather than testing whether it is less than 0.

lseek only records the current file offset within the kernel; it does not cause any I/O to take place. This offset is then used by the next read or write operation.

File’s offset can be greater than file’s current size, in which case next write to file will extend the file. This is referred to as creating a hole in a file and is allowed. Any bytes in a file that have not been written are read back as 0.

A hole in a file isn’t required to have storage backing it on disk. Depending on file system implementation, when you write after seeking past end of a file, new disk blocks might be allocated to store the data, but there’s no need to allocate disk blocks for data between old end of file and location where you start writing.

Example: file_hole.c example that creates a file with a hole in it. We can see with “ls -l file.hole” the size of the file, and then with “od -c file.hole” the content of the file. The flag -c of “od” command tells to print the contents as characters. Unwritten bytes in the middle are read back as zero. The seven-digit number at the beginning of each line is the byte offset in octal.
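A program of this kind could be sketched as below. It is not the contents of file_hole.c; the helper name, the two ten-byte strings and the seek target of 16384 are assumptions chosen for the illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates "path" with ten bytes at the start, a hole, and ten
 * more bytes starting at offset 16384.
 * Returns the resulting file size, or -1 on error. */
off_t make_hole_file(const char *path)
{
    off_t size = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, "abcdefghij", 10) == 10 &&    /* offset now 10 */
        lseek(fd, 16384, SEEK_SET) != -1 &&     /* offset now 16384 */
        write(fd, "ABCDEFGHIJ", 10) == 10)      /* offset now 16394 */
        size = lseek(fd, 0, SEEK_END);          /* report the file size */

    close(fd);
    return size;
}
```

The reported size counts the hole, even though the file system may not have allocated disk blocks for it.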

Because the offset that lseek uses is represented by an off_t, implementations are allowed to support whatever size is appropriate on their particular platform. Most platforms today provide two sets of interfaces to manipulate file offsets: one set that uses 32-bit file offsets and another set that uses 64-bit file offsets.

The Single UNIX Specification provides a way for applications to determine which environments are supported through the sysconf function. Figure 3.3 (page 70) lists the option names together with a description of the data size model each one denotes. The c99 compiler requires that we use the getconf(1) command to map the desired data size model to the flags necessary to compile and link our programs. Different flags and libraries might be needed, depending on the environments supported by each platform. Applications can set the _FILE_OFFSET_BITS constant to 64 to enable 64-bit offsets. Doing so changes the definition of off_t to be a 64-bit signed integer. Setting _FILE_OFFSET_BITS to 32 enables 32-bit file offsets. This technique is not guaranteed to be portable.

As Figure 3.4 (page 70) shows, the size of off_t on different platforms can depend on whether _FILE_OFFSET_BITS is set. Even though you might enable 64-bit file offsets, the ability to create a file larger than 2 GB (2^31 - 1 bytes) depends on the underlying file system type.

read Function

Data is read from an open file with the read function:

    #include <unistd.h>
    ssize_t read(int fd, void *buf, size_t nbytes);
        Returns: number of bytes read, 0 if end of file, -1 on error

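If the read is successful, the number of bytes read is returned. If the end of file is encountered, 0 is returned. There are several cases in which the number of bytes actually read is less than the amount requested: when reading from a regular file, if the end of file is reached before the requested number of bytes has been read; when reading from a terminal device, where normally up to one line is read at a time; when reading from a network, where buffering can cause less than the requested amount to be returned; when reading from a pipe or FIFO that contains fewer bytes than requested; and when the read is interrupted by a signal after a partial amount of data has been read.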

The read operation starts at the file’s current offset. Before a successful return, the offset is incremented by the number of bytes actually read. POSIX.1 changed the prototype for this function in several ways; the classic definition is:

    int read (int fd, char *buf, unsigned nbytes);

write Function

Data is written to an open file with the write function:

    #include <unistd.h>
    ssize_t write (int fd, const void *buf, size_t nbytes);
        Returns: number of bytes written if OK, -1 on error

The return value is usually equal to the nbytes argument; otherwise, an error has occurred. A common cause for a write error is either filling up a disk or exceeding the file size limit for a given process.

For regular file, write operation starts at file’s current offset. If O_APPEND option was specified, file’s offset is set to current end of file before each write operation. After successful write, file’s offset is incremented by number of bytes actually written.

I/O Efficiency

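Example: copy_file.c copies a file using only the read and write functions. The following caveats apply to the program: it reads from standard input and writes to standard output, assuming that these have been set up by the shell before the program is executed; it doesn’t close the input or output file, relying on the kernel to close all open descriptors when the process terminates; and it works for both text files and binary files, since there is no difference between the two to the UNIX kernel.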

Some tests were done with BUFFSIZE, program was run with different values of BUFFSIZE to check the User CPU (in seconds), System CPU (in seconds), clock time (in seconds) and number of loops.
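The copy loop being timed can be sketched as follows. This is an assumed minimal equivalent, not the contents of copy_file.c; the helper name copy_fd is hypothetical, and BUFFSIZE is fixed at 4096 here whereas the tests vary it.

```c
#include <sys/types.h>
#include <unistd.h>

#define BUFFSIZE 4096   /* the value varied in the timing tests */

/* Copies everything from descriptor "in" to descriptor "out"
 * using only read and write. Returns 0 on success, -1 on error. */
int copy_fd(int in, int out)
{
    char buf[BUFFSIZE];
    ssize_t n;

    /* each iteration is one read system call and one write system call */
    while ((n = read(in, buf, BUFFSIZE)) > 0)
        if (write(out, buf, (size_t)n) != n)
            return -1;          /* short write or write error */

    return n < 0 ? -1 : 0;      /* n == 0 means end of file */
}
```

Calling copy_fd(STDIN_FILENO, STDOUT_FILENO) gives the behaviour described above; a smaller BUFFSIZE means more loop iterations and therefore more system calls for the same file.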

The file was read using the previous program, with standard output redirected to /dev/null. The file system used was the Linux ext4 file system with 4,096-byte blocks. This accounts for the minimum in the timing measurements occurring around a BUFFSIZE of 4,096. Increasing the buffer size beyond this limit has little positive effect.

Most file systems support some kind of read-ahead to improve performance. When sequential reads are detected, the system tries to read in more data than the application requests, assuming that the application will read it shortly. The effect of read-ahead can be seen in the timing tables, where the times already start to improve with a buffer as small as 32 bytes. We’ll return to this example later to see the effect of synchronous writes, and later still we will compare these unbuffered I/O times with the standard I/O library.

File Sharing

The UNIX System supports the sharing of open files among different processes. Before describing the dup function, we need to describe this sharing. To do this, we’ll examine the data structures used by the kernel for all I/O. The kernel uses three data structures to represent an open file, and the relationships among them determine the effect one process has on another with regard to file sharing.

  1. Every process has entry in process table. Within each process table entry is a table of open file descriptors, which we can think of as a vector, with one entry per descriptor. Associated with each file descriptor are:
    • file descriptor flags (close-on-exec)
    • pointer to a file table entry
  2. Kernel maintains file table for all open files. Each file table entry contains:
    • file status flags for file, such as read, write, append, sync, and nonblocking.
    • current file offset.
    • pointer to v-node table entry for the file
  3. Each open file (or device) has a v-node structure that contains information about the type of file and pointers to functions that operate on the file. For most files, the v-node also contains the i-node for the file. This information is read from disk when the file is opened, so that all the pertinent information about the file is readily available. For example, the i-node contains the owner of the file, the size of the file, and pointers to where the actual data blocks for the file are located on disk. (Linux has no v-node. Instead, a generic i-node structure is used. Although the implementations differ, the v-node is conceptually the same as a generic i-node. Both point to an i-node structure specific to the file system.) We’re ignoring some implementation details that don’t affect our discussion. For example, the table of open file descriptors can be stored in the user area (a separate per-process structure that can be paged out) instead of the process table. Also, these tables can be implemented in numerous ways and need not be arrays; one alternative implementation is a linked list of structures. The concepts remain the same.

Figure 3.7 of page 75 shows pictorial arrangement of these three tables for single process that has two different files open: one file is open on standard input (file descriptor 0), and other is open on standard output (file descriptor 1).

The arrangement of these three tables has existed since the early versions of the UNIX System. This arrangement is critical to the way files are shared among processes. We’ll return to these tables when we describe additional ways that files are shared.

(v-node was invented to provide support for multiple file system types on single computer system. This was done by Peter Weinberger and Bill Joy. Sun called this Virtual File System and called the file system-independent portion of the i-node the v-node. v-node propagated through various vendor implementations as support for Sun’s Network File System (NFS) was added. First release from Berkeley to provide v-node was 4.3BSD Reno release, when NFS was added. In SVR4, v-node replaced file system-independent i-node of SVR3. Solaris is derived from SVR4 and uses v-nodes. Instead of splitting data structures into v-node and i-node, Linux uses file system-independent i-node and a file system-dependent i-node.)

If two independent processes have the same file open, we could have the arrangement shown in Figure 3.8 (page 76), where each process table entry points to a different file table entry in the kernel, but both file table entries point to the same v-node table entry, and through it to the same i-node.

In this example, the first process has the file open on descriptor 3 and the other process has the same file open on descriptor 4. As we said, each process that opens the file gets its own file table entry, but only a single v-node table entry is required for the file. One reason each process gets its own file table entry is so that each process has its own current offset for the file. Given these data structures, we now need to be more specific about what happens with certain operations that we’ve already described.

It is possible for more than one file descriptor entry to point to the same file table entry, as we’ll see with the dup function. This also happens after a fork, when the parent and the child share the same file table entry for each open descriptor.

Note the difference in scope between the file descriptor flags and the file status flags. The former apply only to a single descriptor in a single process, whereas the latter apply to all descriptors in any process that point to the given file table entry. When we describe the fcntl function in Section 3.14, we’ll see how to fetch and modify both the file descriptor flags and the file status flags.

Everything described in this section works fine for multiple processes that are reading the same file. Each process has its own file table entry with its own current file offset. Unexpected results can arise, however, when multiple processes write to the same file. To see how to avoid some surprises, we need to understand the concept of atomic operations.

Atomic Operations

Appending to a File: consider single process that wants to append to end of a file. Older versions of UNIX System didn’t support O_APPEND option, so program was like this:

    if (lseek (fd, 0L, 2) < 0)  /* position to EOF */
        err_sys("lseek error");
    if (write (fd, buf, 100) != 100)    /* and write */
        err_sys("write error");

This works fine for a single process, but problems arise if multiple processes use this technique to append to the same file (for example, if multiple instances of the same program are appending messages to a log file).

Assume two independent processes, A and B, are appending to the same file. Each has opened the file but without the O_APPEND flag. This gives us the same picture as Figure 3.8. Each process has its own file table entry, but they share a single v-node table entry. Assume process A does the lseek, which sets the current offset for the file for process A to byte offset 1,500 (the current end of file). Then the kernel switches processes, and B continues running. Process B then does the lseek, which sets the current offset for the file for process B to byte offset 1,500 also (the current end of file). Then B calls write, which increments B’s current file offset for the file to 1,600. Because the file’s size has been extended, the kernel also updates the current file size in the v-node to 1,600. Then the kernel switches processes and A resumes. When A calls write, the data is written starting at the current file offset for A, which is byte offset 1,500. This overwrites the data that B wrote to the file (without O_APPEND, the current file offset is not reset to the current file size in the i-node before the write).

The problem is that our logical operation of “position to the end of file and write” requires two separate function calls. The solution is to have the positioning to the current end of file and the write be an atomic operation with regard to other processes. Any operation that requires more than one function call cannot be atomic, as the kernel might temporarily suspend the process between the two function calls.

The UNIX System provides atomic way to do this operation if we set the O_APPEND flag when file is opened. This causes kernel to position the file to its current end of file before each write. We no longer have to call lseek before each write.
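The effect of O_APPEND on the file offset can be checked with a sketch such as this one. The helper name append_ignores_seek and the test data are assumptions; the sketch runs in a single process, so it shows only the positioning behaviour, not the race between two processes.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes "12345" to a new file, reopens it with O_APPEND, seeks
 * to offset 0 and writes "ab". With O_APPEND the write still lands
 * at the end of the file. Returns 1 if the file is 7 bytes long
 * and still starts with '1', 0 if not, -1 on error. */
int append_ignores_seek(const char *path)
{
    char first = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "12345", 5) != 5) {
        close(fd);
        return -1;
    }
    close(fd);

    fd = open(path, O_RDWR | O_APPEND);
    if (fd < 0)
        return -1;

    lseek(fd, 0, SEEK_SET);             /* try to position at the start */
    if (write(fd, "ab", 2) != 2) {      /* kernel moves to EOF first */
        close(fd);
        return -1;
    }

    off_t size = lseek(fd, 0, SEEK_END);
    lseek(fd, 0, SEEK_SET);
    if (read(fd, &first, 1) != 1)       /* reads still honour the offset */
        size = -1;

    close(fd);
    unlink(path);
    return size == 7 && first == '1';
}
```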

pread and pwrite Functions

The Single UNIX Specification includes two functions that allow applications to seek and perform I/O atomically: pread and pwrite.

    #include <unistd.h>

    ssize_t pread(int fd, void *buf, size_t nbytes, off_t offset);
        Returns: number of bytes read, 0 if end of file, -1 on error

    ssize_t pwrite(int fd, const void *buf, size_t nbytes, off_t offset);
        Returns: number of bytes written if OK, -1 on error.

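Calling pread is equivalent to calling lseek followed by a call to read, with the following exceptions: there is no way to interrupt the two operations that occur when we call pread, and the current file offset is not updated.

Calling pwrite is equivalent to calling lseek followed by a call to write, with similar exceptions.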

(The reason is that, instead of using the current file offset stored in the file table entry, these functions tell the operating system explicitly: “please go to the specified offset and read or write there”. The current file offset is therefore not involved. Thanks to jalopezg from UC3M-ARCOS for the explanation of the table and functions.)
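That pread leaves the current file offset alone can be verified with a sketch like the following; the helper name pread_keeps_offset and the test data are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes ten digits to a new file, positions the offset at 2,
 * then reads three bytes at absolute offset 5 with pread.
 * Returns 1 if pread returned "567" and the current offset
 * is still 2, 0 if not, -1 on error. */
int pread_keeps_offset(const char *path)
{
    char buf[4] = {0};
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int ok = write(fd, "0123456789", 10) == 10
          && lseek(fd, 2, SEEK_SET) == 2        /* current offset: 2 */
          && pread(fd, buf, 3, 5) == 3          /* explicit offset: 5 */
          && memcmp(buf, "567", 3) == 0
          && lseek(fd, 0, SEEK_CUR) == 2;       /* offset unchanged */

    close(fd);
    unlink(path);
    return ok;
}
```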

Creating a File

We saw another example of an atomic operation when we described the O_CREAT and O_EXCL options for the open function. When both of these options are specified, the open will fail if the file already exists. We also said that the check for the existence of the file and the creation of the file are performed as an atomic operation. If we didn’t have this atomic operation, we might try:

    if (( fd = open(path, O_WRONLY) ) < 0) {
        if (errno == ENOENT) {
            if (( fd = creat(path, mode)) < 0 )
                err_sys("creat error");
        } else {
            err_sys("open error");
        }
    }

The problem occurs if the file is created by another process between the open and the creat. If that happens and the other process writes something to the file, that data is erased when this creat is executed. Combining the test for existence and the creation into a single atomic operation avoids this problem.
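The atomic alternative is a single open call with both flags. In this sketch the helper name create_exclusive is hypothetical; the flags and the EEXIST error are standard.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Creates "path" only if it does not already exist. The existence
 * check and the creation happen in one system call, so no other
 * process can slip in between them.
 * Returns the new descriptor, or -1 with errno set to EEXIST
 * when the file was already there. */
int create_exclusive(const char *path, mode_t mode)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
}
```

A caller that gets -1 with errno equal to EEXIST knows that someone else created the file first, and can decide what to do without having destroyed its contents.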

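In general, the term atomic operation refers to an operation that might be composed of multiple steps. If the operation is performed atomically, either all the steps are performed (on success) or none are performed (on failure). It must not be possible for only a subset of the steps to be performed. We’ll see atomic operations again with the link function.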

dup and dup2 Functions

An existing file descriptor is duplicated by either of these functions:

    #include <unistd.h>

    int dup (int fd);
    int dup2 (int fd, int fd2);
        Both return: new file descriptor if OK, -1 on error

The new file descriptor returned by dup is guaranteed to be the lowest-numbered available file descriptor. With dup2, we specify the value of the new descriptor with the fd2 argument (for example, we can pass a socket as the first argument and, in turn, standard input, standard output and standard error as the second argument). If fd2 is already open, it is first closed. If fd equals fd2, then dup2 returns fd2 without closing it. Otherwise, the FD_CLOEXEC file descriptor flag is cleared for fd2, so that fd2 is left open if the process calls exec.

The new file descriptor returned as the value of these functions shares the same file table entry (the one in the kernel) as the fd argument.

In Figure 3.9 (page 80), we assume that when it starts, the process executes:

    newfd = dup(1);

So, if no other file was opened, the next available descriptor is probably 3 (0, 1 and 2 are opened by the shell). Because both descriptors point to the same file table entry, they share the same file status flags, current file offset and v-node pointer. Each descriptor has its own set of file descriptor flags; the close-on-exec flag for the new descriptor is always cleared by the dup functions.
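The shared file table entry can be observed through the shared offset, as in this sketch; the helper name dup_shares_offset and the scratch file are assumptions for the example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens a new file, duplicates the descriptor, writes four bytes
 * through the original and checks the offset seen through the copy.
 * Returns 1 if the copy's offset moved to 4, 0 if not, -1 on error. */
int dup_shares_offset(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int newfd = dup(fd);        /* same file table entry as fd */
    if (newfd < 0) {
        close(fd);
        return -1;
    }

    int ok = write(fd, "abcd", 4) == 4
          && lseek(newfd, 0, SEEK_CUR) == 4;    /* offset is shared */

    close(newfd);
    close(fd);
    unlink(path);
    return ok;
}
```

Two independent open calls on the same file would not behave this way, because each would get its own file table entry and its own offset.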

Another way to duplicate a descriptor is with the fcntl function, which can emulate dup and dup2:

    dup(fd)         =   fcntl(fd, F_DUPFD, 0)

    dup2(fd, fd2)   =   close(fd2)
                        fcntl(fd, F_DUPFD, fd2)

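The last case is not exactly the same, because dup2 is an atomic operation, whereas the alternative form involves two function calls. This can be a problem if a signal handler runs between the close and the fcntl, or if a different thread changes the file descriptors in the meantime.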

There are also some errno differences between dup2 and fcntl.

sync, fsync, and fdatasync Functions

Traditional implementations of the UNIX System have a buffer cache or page cache in the kernel through which most disk I/O passes. When we write data to a file, the data is normally copied by the kernel into one of its buffers and queued for writing to disk at some later time. This is called “delayed write”. The kernel eventually writes all the delayed-write blocks to disk, normally when it needs to reuse the buffer for some other disk block. The sync, fsync and fdatasync functions are provided to ensure consistency of the file system on disk with the contents of the buffer cache.

    #include <unistd.h>

    int fsync(int fd);
    int fdatasync(int fd);
        Returns: 0 if OK, -1 on error
    void sync(void);

sync function simply queues all modified block buffers for writing and returns; it does not wait for disk writes to take place.

The sync function is normally called periodically (usually every 30 seconds) from a system daemon, often called update. This guarantees regular flushing of the kernel’s block buffers. The sync(1) command also calls the sync function.

The fsync function refers only to a single file, specified by the file descriptor fd, and waits for the disk writes to complete before returning. This function is used when an application, such as a database, needs to be sure that the modified blocks have been written to the disk.

The fdatasync function is similar to fsync, but it affects only the data portions of a file. With fsync, the file’s attributes are also updated synchronously.
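A typical use of fsync might be sketched as follows; the helper name write_durably and the 0600 mode are hypothetical, and what "on disk" really means still depends on the file system and hardware.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes "len" bytes to a new file and does not return success
 * until fsync reports that the data has been written out.
 * Returns 0 on success, -1 on error. */
int write_durably(const char *path, const void *data, size_t len)
{
    int rc = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len)   /* only queues the data */
        rc = -1;
    else if (fsync(fd) < 0)                     /* waits for the disk write */
        rc = -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Replacing fsync with fdatasync would skip the synchronous update of the file’s attributes.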

fcntl Function

The fcntl function can change the properties of a file that is already open.

    #include <fcntl.h>
    int fcntl(int fd, int cmd, ... /* int arg */ );
        Return: depends on cmd if OK, -1 on error

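In the examples in this section, the third argument is always an integer, corresponding to the comment in the function prototype just shown. When we describe record locking in Section 14.3, the third argument becomes a pointer to a structure. The fcntl function is used for five different purposes: duplicating an existing descriptor (cmd = F_DUPFD or F_DUPFD_CLOEXEC), getting or setting file descriptor flags (F_GETFD or F_SETFD), getting or setting file status flags (F_GETFL or F_SETFL), getting or setting asynchronous I/O ownership (F_GETOWN or F_SETOWN), and getting or setting record locks (F_GETLK, F_SETLK, or F_SETLKW).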

Let’s describe the first 8 of 11 cmd values.

The return value from fcntl depends on the command. All commands return -1 on error or some other value if OK. The following four commands have special return values: F_DUPFD, F_GETFD, F_GETFL, and F_GETOWN. The first returns the new file descriptor, the next two return the corresponding flags, and the final one returns a positive process ID or a negative process group ID.

Example: check_file_descriptor_flags.c takes a descriptor number as its command-line argument and uses fcntl to print the file flags for that descriptor. Some sample runs:

    $ ./program 0 < /dev/tty # opens /dev/tty as stdin
    read only
    $ ./program 1 > temp.foo # opens temp.foo as stdout
    $ cat temp.foo
    write only
    $ ./program 2 2>>temp.foo # opens temp.foo as stderr appending data
    write only, append
    $ ./program 5 5<>temp.foo
    read write

The clause 5<>temp.foo opens file temp.foo for reading and writing on descriptor 5.
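The access-mode part of such a check can be sketched as below. This is an assumed equivalent, not the contents of check_file_descriptor_flags.c; the helper name access_mode_name is hypothetical. The access-mode values are not separate bits, so they have to be extracted with the O_ACCMODE mask before being compared.

```c
#include <fcntl.h>
#include <stddef.h>

/* Returns a description of the access mode of an open descriptor,
 * or NULL if fcntl fails (for example, if fd is not open). */
const char *access_mode_name(int fd)
{
    int val = fcntl(fd, F_GETFL, 0);
    if (val < 0)
        return NULL;

    switch (val & O_ACCMODE) {      /* mask, then compare */
    case O_RDONLY:
        return "read only";
    case O_WRONLY:
        return "write only";
    case O_RDWR:
        return "read write";
    default:
        return "unknown access mode";
    }
}
```

Other status flags such as O_APPEND or O_NONBLOCK are ordinary bits and can be tested directly with val & O_APPEND.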

Example: When we modify either the file descriptor flags or the file status flags, we must be careful to fetch the existing flag value, modify it as desired, and then set the new flag value. We can’t simply issue an F_SETFD or an F_SETFL command, as this could turn off flag bits that were previously set. Next is a function that sets one or more of the file status flags for a descriptor.

    #include "apue.h"
    #include <fcntl.h>

    void
    set_fl (int fd, int flags) /* flags are file status flags to turn on */
    {
        int val;

        if ((val = fcntl(fd, F_GETFL, 0)) < 0)
            err_sys("fcntl F_GETFL error");

        val |= flags;   /* turn on flags */

        if (fcntl(fd, F_SETFL, val) < 0)
            err_sys("fcntl F_SETFL error");
    }

If we change the middle statement so that it clears the bits instead of setting them, we get a function named clr_fl:

    #include "apue.h"
    #include <fcntl.h>

    void
    clr_fl (int fd, int flags)
    {
        int val;

        if ((val = fcntl(fd, F_GETFL, 0)) < 0)
            err_sys("fcntl F_GETFL error");

        val &= ~flags;      /* turn flags off */

        if (fcntl(fd, F_SETFL, val) < 0)
            err_sys("fcntl F_SETFL error");
    }

The statement ANDs val with the one’s complement of flags, which turns off in val exactly the bits that are set in flags and leaves the others untouched.

Adding line:

    set_fl(STDOUT_FILENO, O_SYNC);

to the beginning of the program shown in Figure 3.5, we turn on the synchronous-write flag. This causes each write to wait for the data to be written to disk before returning. Normally in the UNIX System, a write only queues the data for writing; the actual disk write operation can take place sometime later. A database system is a likely candidate for using O_SYNC, so that it knows on return from a write that the data is actually on the disk, in case of an abnormal system failure.

We expect the O_SYNC flag to increase the system and clock times when the program runs.

ioctl Function

ioctl is the alpha and the omega of I/O operations: anything that couldn’t be expressed using one of the other functions in this chapter usually ended up being specified with an ioctl. Terminal I/O was the biggest user of this function (we’ll see in Chapter 18 that many of these operations have been replaced with separate functions).

    #include <unistd.h>     /* System V */
    #include <sys/ioctl.h>      /* BSD and Linux */

    int ioctl(int fd, int request, ... );
        Returns: -1 on error, something else if OK

ioctl was included in Single UNIX Specification only as extension for dealing with STREAMS devices, but it was moved to obsolescent status in SUSv4. UNIX System implementations use ioctl for many miscellaneous device operations.

The prototype shown corresponds to POSIX.1. FreeBSD 8.0 and Mac OS X 10.6.8 declare the second argument as an unsigned long. This detail doesn’t matter, since the second argument is always a #defined name from a header.

For ISO C prototype, an ellipsis is used for remaining arguments. Normally, however, there is only one more argument, usually a pointer to a variable or a structure.

In this prototype, we show only the headers required for the function itself. Normally, additional device-specific headers are required. For example, the ioctl commands for terminal I/O, beyond the basic operations specified by POSIX.1, all require the <termios.h> header (instead of <sys/termios.h>, which is deprecated).

Each device driver can define its own set of ioctl commands. The system, however, provides generic ioctl commands for different classes of devices. Here are some examples of ioctl categories, with the constant names, the header and the number of ioctls in each:

    disk labels         DIOxxx      <sys/disklabel.h>   4
    file I/O            FIOxxx      <sys/filio.h>       14
    mag tape I/O        MTIOxxx     <sys/mtio.h>        11
    socket I/O          SIOxxx      <sys/sockio.h>      73
    terminal I/O        TIOxxx      <sys/ttycom.h>      43

The mag tape operations allow us to write end-of-file marks on a tape, rewind a tape, space forward… None of these operations is easily expressed in terms of other functions (like read, write, lseek, etc), so easiest way to handle these has always been to access their operations using ioctl. We use ioctl in chapter 18 to fetch and set size of terminal’s window, and in chapter 19 when we access advanced features of pseudo terminals.
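As a taste of the terminal use mentioned above, fetching the window size could be sketched like this. The helper name get_window_size is hypothetical; the sketch assumes a system such as Linux or a BSD where <sys/ioctl.h> provides struct winsize and the TIOCGWINSZ request.

```c
#include <sys/ioctl.h>

/* Fetches the window size of the terminal referred to by fd.
 * Returns 0 and fills in rows/cols on success, -1 on error
 * (for example, when fd does not refer to a terminal). */
int get_window_size(int fd, unsigned short *rows, unsigned short *cols)
{
    struct winsize ws;

    /* the third argument is a pointer to a structure, as is usual */
    if (ioctl(fd, TIOCGWINSZ, &ws) < 0)
        return -1;

    *rows = ws.ws_row;
    *cols = ws.ws_col;
    return 0;
}
```

Called on a descriptor that is not a terminal, such as a pipe or a regular file, the ioctl fails and the function returns -1.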

/dev/fd

Newer systems provide directory named /dev/fd, whose entries are files named 0, 1, 2 and so on. Opening file /dev/fd/n is equivalent to duplicating descriptor n, assuming that descriptor n is open. Example:

    fd = open("/dev/fd/0", mode);

Most systems ignore the specified mode, whereas others require that it be a subset of the mode used when the referenced file (standard input in this case) was originally opened. Because the previous open is equivalent to:

    fd = dup(0);

the descriptors 0 and fd share the same file table entry. If descriptor 0 was opened read-only, we can only read on fd. Even if the system ignores the open mode and the call:

    fd = open("/dev/fd/0", O_RDWR);

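succeeds, we still can’t write to fd. (The Linux implementation of /dev/fd is an exception: it maps file descriptors into symbolic links pointing to the underlying physical files, so opening /dev/fd/0 really opens the file associated with standard input, and the mode of the new descriptor is unrelated to the mode of descriptor 0.)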

We can also call creat with a /dev/fd pathname argument, as well as specify O_CREAT in a call to open. Beware of doing this on Linux: because the Linux implementation uses symbolic links to the real files, using creat on a /dev/fd file will result in the underlying file being truncated.

Some systems provide the pathnames /dev/stdin, /dev/stdout and /dev/stderr. These are equivalent to /dev/fd/0, /dev/fd/1 and /dev/fd/2, respectively.

The main use of the /dev/fd files is from the shell. They allow programs that use pathname arguments to handle standard input and standard output in the same manner as other pathnames. For example, the cat program specifically looks for an input filename of - and uses it to mean standard input:

    $ filter file2 | cat file1 - file3 | lpr

First, cat reads file1, then its standard input (the output of the filter program on file2), and then file3. If /dev/fd is supported, the special handling of - can be removed from cat, and we can enter:

    $ filter file2 | cat file1 /dev/fd/0 file3 | lpr