Linux System Programming by Robert Love [Chapters 1-4]

Chapter 1 - Introduction
- processes inherit the UID and GID
- permission octal values: r=4, w=2, x=1. Order is user, group, everyone else
- functions normally just return a -1 to indicate an error
- more details to be found in
extern int errno
in<errno.h>
- to print
void perror(const char *str)
in<stdio.h>
- example:
if (close (fd) == −1)
perror ("close");
Chapter 2 - File I/O
- Opening files:
fd = open(<path>, flags)
- create:
fd = open (file, O_WRONLY | O_CREAT | O_TRUNC, 0664);
- is identical to
fd = creat (filename, 0644);
- fd of -1 indicates error
- reading:
#include <unistd.h>
ssize_t read (int fd, void *buf, size_t len);
- A call to read() can result in many possibilitie, to read all the bytes
ssize_t ret;
while (len != 0 && (ret = read (fd, buf, len)) != 0) {
if (ret == −1) {
if (errno == EINTR)
continue;
perror ("read");
break;
}
len -= ret;
buf += ret;
}
- nonblocking reads can be achieved by passing O_NONBLOCK to
open
- if a read is blocked, it will return -1 and errno will be set to EAGAIN
- write:
#include <unistd.h>
ssize_t write (int fd, const void *buf, size_t count);
- call fsync(int ft) to sync writes to storage
- fdatasync(int fd) does the same thing, without updated metadata, and is faster
- use the O_SYNC flag to always sync
- closing:
#include <unistd.h>
int close(int fd)`
- seeking:
#include <sys/types.h>
#include <unistd.h>
off_t lseek (int fd, off_t pos, int origin);
- you CAN seek past the end of a file, it will be padded with zeros
- position reads and writes avoid the race condtion associated with seeking and then reading
#define _XOPEN_SOURCE 500
#include <unistd.h>
ssize_t pread (int fd, void *buf, size_t count, off_t pos);
ssize_t pwrite (int fd, const void *buf, size_t count, off_t pos);
Multiplexed IO
poll()
is easier to use thanselect()
- both wait on a set of open file descriptors and return when any are available
- poll example
#include <stdio.h>
#include <poll.h>
#include <unistd.h>
int main(int argc, char **argv)
{
struct pollfd fds[2];
fds[0].fd = STDIN_FILENO;
fds[0].events = POLLIN;
fds[1].fd = STDOUT_FILENO;
fds[1].events = POLLOUT;
int err = poll(fds, (nfds_t) 2, 0);
if (err == -1) {
perror("poll");
return -1;
}
if (fds[0].revents & POLLIN) {
printf("STDIN ready\n");
}
if (fds[1].revents & POLLOUT) {
printf("STDOUT ready\n");
}
return 0;
}
- The VFS provides a unified blocks and inodes based interface to all filesystems
- The page cache holds retrieved info including readaheads
Chapter 3 - Buffered I/O
- Usually more efficient to read in multiples of 4096 or 8192 bytes because of block alignment
- User space buffered IO can increase performance even more
- Write to a buffer, which is written in a single operation
- The read requests are served from the buffer
- The end result is fewer system calls for larger amounts of data, all aligned on block boundaries.
- Provided by
stdio
- StardardI/O routines act on file pointers, not fds
FILE
type defined instdio.h
FILE * fopen (const char *path, const char *mode);
FILE * fdopen (int fd, const char *mode);
- modes: r, w, r+ (read+write), w+ (read, write, truncates), a+(rw in append mode)
- Closing the stream will close the file descriptor as well.
int fclose (FILE *stream);
int fcloseall (void); // Linux specific
- reading
// read a char
int fgetc (FILE *stream);
// put it back
int ungetc (int c, FILE *stream);
// read a line
// reads one char less than size and puts a at the end
// will stop and newline and also put a
char * fgets (char *str, int size, FILE *stream);
// Read binary data
// reads `nr` elements, each of size `size`
// returns less than nr if there's an error
// it is impossible to know which of the two conditions occurred without using ferror() and feof()
size_t fread (void *buf, size_t size, size_t nr, FILE *stream);
- writing
// write a char
// return EOF in case of error
int fputc (int c, FILE *stream);
// write a string
int fputs (const char *str, FILE *stream);
// binary
// A return value less than nr denotes error.
size_t fwrite (void *buf, size_t size, size_t nr, FILE *stream);
- It’s important to bear in mind that because of differences in variable sizes, align‐ ment, and so on, binary data written with one application may not be readable by other applications.
- Example program
# include <stdio.h>
int main(int arc, char **argv) {
struct pirate {
char name[100];
unsigned long booty;
unsigned int beard_len;
} p, blackbeard = {"Mayank", 100, 50};
FILE * file = fopen("/tmp/pirate", "w");
if (!file) {
perror("fopen");
return 1;
}
if(!fwrite(&blackbeard, sizeof (struct pirate), 1, file)){
perror("fwrite");
return 1;
}
if(fclose(file)) {
perror("fclose");
return 1;
}
file = fopen("/tmp/pirate", "r");
if (!file) {
perror("fopen");
return 1;
}
if(!fread(&p, sizeof(struct pirate), 1, file)) {
perror("fread"); return 1;
}
if(fclose(file)) {
perror("fclose");
return 1;
}
printf("%s, %lu, %u", p.name, p.booty, p.beard_len);
}
- other
int fseek (FILE *stream, long offset, int whence);
// fsetpos, rewind, ftell, fgetpos for seeking
// fflush flushes the data to the kernel (but does not sync)
fileno(*stream) gets the fd
- errors
ferror(FILE *stream) returns non-zero if error is set
feof() returns nonzero if EOF is set
clearerr() clears the error
Threading
- Standard io functions are thread safe
- For multipl functions, use explicit locks
flockfile(*stream) locks a file (blocking) by incrementing lock count
funlockfile(*stream) decrements lock count
ftrylockfile(*stream) is non-blocking, returning nonzero if cannot lock
Chapter 4: Advanced File I/O
Scatter/Gather I/O
- The idea is to write several buffers to a file or to read several buffers from a file
- The buffers are read and written in serial order, but very efficiently by the kernel
#include <sys/uio.h>
struct iovec {
void *iov_base;
size_t iov_len;
};
ssize_t readv (int fd, const struct iovec *iov, int count);
ssize_t writev (int fd, const struct iovec *iov, int count);
epoll
- easier to use than
poll
andselect
, but linux specific - decouples creating a listener, adding fds and waiting on it
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>
int main() {
int epfd = epoll_create1(0);
struct epoll_event events[2];
events[0].events = EPOLLIN;
events[0].data.fd = STDIN_FILENO;
events[1].events = EPOLLOUT;
events[1].data.fd = STDOUT_FILENO;
if(epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &events[0]) == -1) {
perror("epoll_ctl");
return 1;
}
if(epoll_ctl(epfd, EPOLL_CTL_ADD, STDOUT_FILENO, &events[1]) == -1) {
perror("epoll_ctl");
return 1;
}
struct epoll_event out_events[2];
int events_ready = epoll_wait(epfd, out_events, 2, 0);
if(events_ready == -1) {
perror("epoll_wait");
return 1;
}
printf("Events Ready: %d\n", events_ready);
for(int i = 0; i < events_ready; i++) {
printf("Event fd: %i: %i\n", out_events[i].events, out_events[i].data.fd);
}
return 0;
}
MMAP
- mmap maps a file into memory
#include <sys/mman.h>
void * mmap (void *addr, size_t len, int prot, int flags, int fd, off_t offset);
prot
is protection mode -PROT_READ
,PROT_WRITE
,PROT_EXEC
- it should match the open mode of the file
flags:
- MAX_FIXED - addr is a requirement, not optional (discouraged)
- MAP_PRIVATE - file is mapped copy-on-write
- MAP_SHARED - shared with other processes that map this file
- one of the previous two must be specified, but not both
- mapping increments the file's reference counter
addr
andoffset
must be aligned on a page boundary- to get page size
long page_size = sysconf(_SC_PAGESIZE); // <unistd.h>
// or
int getpagesize(void);
- to remove mapping:
int munmap(void *arrd, size_t len);
- will remove all mappings in the given range
- example
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
if (argc != 2) {
printf("Must suppy filename\n"); return 1;
}
int fd = open(argv[1], O_RDONLY);
if (fd == -1) {
perror("open"); return 1;
}
struct stat sb;
if (fstat(fd, &sb) == -1) {
perror("fstat"); return 1;
}
if (!S_ISREG(sb.st_mode)) {
printf("Not a regular file\n"); return 1;
}
char * p;
if ((p = mmap(0, sb.st_size, PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED) {
perror("mmap"); return 1;
}
for(int i = 0; i < sb.st_size; i++) {
putchar(*(p + i));
}
putchar('\n');
if(munmap(p, sb.st_size) == -1) {
perror("munmap"); return 1;
}
return 0;
}
advantages of mmap:
- avoids the extra copy that happens to a user space buffer when using
read
andwrite
- there's no system call overhead to read from memory
- shared mode lets processes share file
- there's no need for lseek
- avoids the extra copy that happens to a user space buffer when using
mremap
to change size of mapping- glib often uses
mremap
to implementrealloc
mprotect
to change protection of a mappingmsync
to sync changes back to disk - should always be called because kernel does not know that memory has been modified, unlikewrite
's dirty buffersmadvise
to give the kernel a hint about usage of map - usually related to amount of readahead (more for sequential, none for random)posix_fadvise
is very similar, but for normal I/O (not mmap)
Asynchronous I/O
The book here is very brief. The following notes are from the aio(7) man page
The POSIX AIO interface consists of the following functions:
#include <aiocb.h>
aio_read(3) Enqueue a read request. This is the asynchronous analog of read(2).
aio_write(3) Enqueue a write request. This is the asynchronous analog of write(2).
aio_fsync(3) Enqueue a sync request for the I/O operations on a file descriptor. This is the asynchronous analog of fsync(2) and fdatasync(2).
aio_error(3) Obtain the error status of an enqueued I/O request.
aio_return(3) Obtain the return status of a completed I/O request.
aio_suspend(3) Suspend the caller until one or more of a specified set of I/O requests completes.
aio_cancel(3) Attempt to cancel outstanding I/O requests on a specified file descriptor.
lio_listio(3) Enqueue multiple I/O requests using a single function call.
struct aiocb {
int aio_fildes; /* File descriptor */
off_t aio_offset; /* File offset */
volatile void *aio_buf; /* Location of buffer */
size_t aio_nbytes; /* Length of transfer */
int aio_reqprio; /* Request priority */
struct sigevent aio_sigevent; /* Notification method */
int aio_lio_opcode; /* Operation to be performed;
};
- You can either poll the status of the request (with
aio_error
) or have the completion fire a signal - You can ask the request to signal via
aiocbp->aio_sigevent.sigev_notify = SIGEV_SIGNAL
- You can also ask it to spawn a thread
- If
aio_error
is not EINPROGRESS, you can useaio_return
to get the return value (of the correspondingread()
operation for example)
I/O Schedulers and I/O performance
Scheduler is set in
/sys/block/[device]/queue/scheduler
The scheduler would try to sequence reads and writes on disk efficiently
It performs merges (combining reads in adjecent blocks) and sorting
Since writes are streamed (async), they can starve the reads (
writes-starving-reads
)The Deadline scheduler
- keeps a sorted list of operations (sorted by physical block)
- keeps a FIFO queue of read operations and a FIFO queue of write operations
- read ops have a deadline of 500ms and write ops 6s
- it goes by sorted list, unless a deadline happens, which get priority
My current machine with an NMVe SSD disc uses mq-deadline (multiqueue)
Optimizing IO
There are many optimizations that can be done on the application side:
- Minimize I/O operations (coalescing)
- Perform block aligned I/O
- Use buffering
- vectored I/O, positional I/O and async I/O
We can also sort I/O requests before sending them to the kernel
- Sectors are mapped to physical blocks so block storage can be addressed by blocks (logical block addressing, LBA) instead of cylinders, heads and sectors
- Filesystems are purely software, and talk in terms of logical blocks
- Logical block size is a multiple of physical block size
Sorting writes in the application:
- One way to do that is pathname -> poor approximation
- Another is inode number -> breaks with more fragmentation
Or you can try to sort by physical block:
- Each file is broken into logical blocks
- You can get the physical block from logical block:
ret = ioctl (fd, FIBMAP, &block);
- The logical blocks are zero-indexed and file-relative
- First determine number of blocks in a file via
fstat
- Then, for each logical block, find the physical block via
ioctl
- Since files tend to be contiguous and we will mostly want to order I/O for a given file, just use the 0th block to sort:
int logical_block = 0; ioctl(fd, FIBMAP, &logical_block);