Thursday, December 7

 

hw5 due Sunday—questions?

Tues: last class: teacher evals, review.

 

Reading since midterm

Mem Man: Ch 4 to pg. 219, skip 4.4.7, read pp. 222-232, skip 233 (errors), read 234-247, 257-261, UNIX: 710-719, Win 811-823

I/O: Ch 5 to pg. 305, Sec. 5.5, pp 327-332, UNIX 723-732, Win 824-830

Filesys: Ch 6 to pg. 414, 445-447, UNIX: 732-745, Win 830-841.

 

Goals of i/o software, pg. 283.

 

device independence, helped by uniform naming, also known as the filesystem “namespace”, a syntax of string names for files and also devices represented by “special files” like /dev/lp.

 

synchronous vs. asynchronous—we’ve covered this before:

but “output done” can have two meanings, at least for user output using an OS: the write call can return as soon as the kernel has accepted the data (copied it out of the user buffer), or only once the data has actually reached the device.

Memory-mapped I/O (not to be confused with memory-mapped files!!)

 

We are familiar with I/O instructions of the x86, not memory-mapped i/o.  For example,

 

inb %dx, %al   with 0x3f8 in dx

 

This instruction reads one byte from i/o port 0x3f8, the COM1 UART’s receiver register, into CPU register AL, the low 8 bits of EAX.

 

Compare this to the memory-reading instruction

 

movb 0x3f8, %al   

 

This instruction reads one byte from memory location 0x3f8 into CPU register AL.

 

The x86 architecture specifies two spaces, a 32-bit memory space and a 16-bit i/o port space, and it is always clear from the instructions which one is being used. 

 

When the reads or writes go over the bus, the memory address or i/o port number ride on the address bus, which of course has to be big enough for the 32-bit memory addresses, and is half idle in the case of an i/o port number.  In addition to the address lines, there is a control line called M/IO# which says which kind of address it is.

 

Thus it is easy for devices to know that an i/o port # is current on the address bus and they should check further to see if it’s asking for them.

 

On the other hand we’re using 33 lines somewhat inefficiently—we could carve off a little bit of the memory 32-bit space for i/o addresses and drop the M/IO# line.  That’s the idea of memory-mapped i/o.

 

Aside: Memory-mapped i/o was invented by the genius Gordon Bell, who masterminded the PDP11.  He also engineered the Unibus backplane, with a standardized bus protocol that allowed new boards to be plugged in.  Then UNIX came along with its plug-in device drivers, at no charge, just sign this nondisclosure.  The result was dynamite for University researchers and others who wanted to use computers for measurement and control as well as general computing.  Our department’s first computer was a PDP11 with 256K of memory and 10M of disk, bought in 1978 for about $30K, soon upgraded with a giant 80M disk.  We taught several classes at once on this system, using version 6 and then version 7 of the original UNIX sequence.

 

In memory-mapped i/o, a certain region of memory addresses is turned into i/o register addresses.  Under UNIX or other sealed-up OSs, this region is available only to the kernel, except possibly for the video memory.  I drew a picture of kernel address space with the memory-mapped region in the high addresses.  Then a mov instruction to 0xffffffe4, for example, would be putting data in a device register.

 

Tan. ex.: “test port_4” in assembler, where port_4 is some particular address in the hardware register region.  One nice thing about memory-mapped i/o is that you can use C directly for accessing device registers.

 

volatile char *port_4_addr = (volatile char *) PORT_4_ADDR;  /* pointer to a particular addr; volatile so each test re-reads the device register */

while (*port_4_addr)

    ;                       /* poll register until it becomes zero */

 

With memory-mapped i/o, you can write a whole device driver in C, except for that little assembler envelope around the interrupt handler.  Of course we put little routines around inb and outb, so we can write mostly in C in spite of the need to use special i/o instructions.
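
A minimal sketch of such inb/outb wrapper routines, using GCC-style inline assembly (the standard idiom, not necessarily the exact code in our hw):

static inline unsigned char inb(unsigned short port)
{
    unsigned char data;
    __asm__ volatile ("inb %w1, %b0" : "=a"(data) : "Nd"(port));   /* port in DX, byte into AL */
    return data;
}

static inline void outb(unsigned short port, unsigned char data)
{
    __asm__ volatile ("outb %b0, %w1" : : "a"(data), "Nd"(port));  /* byte from AL out to port in DX */
}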

 

The x86 actually does memory-mapped i/o as well as i/o instructions.  The PC system architecture reserves the upper part of the first 1M of memory space for this.  The video memory is in this area.

 

DMA—read coverage in Tan.  Used for network interfaces as well as disk-like devices.

 

pg 284: Programmed i/o: polling device status, no interrupts.

Note polling loop on pg. 285 is using memory-mapped i/o—be sure you know the x86 equivalent.  Also move the semicolon for the empty while body to its own line!
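
For example, the x86 equivalent of that polling loop, using an inb wrapper like the one sketched above (PORT_4 here is a made-up i/o port number, not a memory address):

while (inb(PORT_4))

    ;                       /* poll until the status register reads zero */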

 

pg. 286: Interrupt-driven i/o: covered in detail already.  Note again memory-mapped i/o in example.  Don’t try to interpret this code exactly, as it is not exactly right.

 

pg. 287: I/O using DMA: a whole block of data gets moved to or from a device and memory, with just one interrupt.  As with simpler devices, the data gets copied from the user to a kernel data area before being sent to the device, and is received first in kernel memory, then copied back to user.  The kernel blocks the user process as needed in read or write.  The interrupt handler calls wakeup.

 

pg. 288:  Layers pic.  Note that “interrupt handlers” are actually parts of device drivers, not a separate kind of software.  Each device driver has an upper half and a lower half, where the upper half runs in system calls read or write, and the lower half is the device’s interrupt handler.   Between them is a shared data buffer, itself in kernel data.

 

pg. 288 Steps in interrupt handler

 

  1. save regs.
  2. set up context—not necessary for trusted interrupt handlers that simply borrow the environment of the current process.
  3. set up stack—also not necessary for trusted int. handlers—borrow stack.
  4. Ack the PIC.
  5. Copy regs to proc. table entry—not necessary unless doing a process switch; this step really belongs in the process-switch list below.
  6. Run ISR, the interrupt handler of the device driver.
  7. Check for another process to run; do process switch if indicated (preemption).
  8. part of process switch
  9. part of process switch
  10. part of process switch
  11. Restore regs saved in 1., do iret.  (added)

 

Steps of process switch—should be separated from the interrupt handler description, since the process switch is called from there and from many other places.

 

pg. 289: Device Drivers—one for each kind of device.

 

pg 290.  Note that “user process” is a poor label for the top part of this picture.  A user process has a user execution environment and a kernel execution environment, with its own kernel stack.  So change that label to “user-mode execution.”

 

 

At Tan., pg 292, Device-indep. i/o—implementation

 

At syscall level with UNIX, we see write(fd, buffer, nbytes)  where the first argument can correspond to many different devices or files, specified at open time.  fd = open(“/dev/lp”....)  for example.
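
At user level, the device independence looks like this (a minimal sketch; filenames illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/lp", O_WRONLY);      /* a device special file...         */
    /* int fd = open("out.txt", O_WRONLY);      ...or an ordinary file works too */
    if (fd >= 0) {
        write(fd, "hello\n", 6);             /* same call either way             */
        close(fd);
    }
    return 0;
}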

 

Similarly, in the hw, we have write(dev, buffer, nbytes), where dev can correspond to different devices: 0 and 1 for serial devices, and we can add device 2 for LPR.  Then we have in ioconf.c:

 

struct device devtab[] = {

    {0, ttyinit, ttyread, ttywrite, ttycontrol, 0x3f8, (int)&ttytab[0]},  /* TTY0 */

    {1, ttyinit, ttyread, ttywrite, ttycontrol, 0x2f8, (int)&ttytab[1]},  /* TTY1 */

    {2, lpinit,  lpread,  lpwrite,  lpcontrol,  ... }                     /* LPR  */

};

 

This is the core data structure of a device-independent i/o system, the device switch table.  Classic UNIX has “cdevsw” for character devices and “bdevsw” for block devices, which similarly tabulate function pointers into the device drivers.  It allows a new device driver to be added without changing the kernel source except for this one file, plus the new device driver code itself.

 

The user program calls read or write with a buffer in user data to supply or receive the data.

 

The device-independent i/o layer contains the “sysread” and “syswrite” or equivalent functions called from the system call dispatch code.  In these functions, the appropriate device driver function is looked up in the device switch table and called, causing execution to arrive in the right device driver.
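
A sketch of that dispatch (the struct fields here are guesses in the style of ioconf.c, not the actual hw declarations):

struct device {
    int    dvnum;                                  /* device number */
    int  (*dvinit)(struct device *);
    int  (*dvread)(struct device *, char *, int);
    int  (*dvwrite)(struct device *, char *, int);
    int  (*dvcontrol)(struct device *, int, int);
    int    dvcsr;                                  /* base i/o port, e.g. 0x3f8 */
    int    dvdata;                                 /* per-device data, e.g. (int)&ttytab[0] */
};

extern struct device devtab[];

int syswrite(int dev, char *buffer, int nbytes)
{
    struct device *dvp = &devtab[dev];             /* look up the row for this device...  */
    return dvp->dvwrite(dvp, buffer, nbytes);      /* ...and land in ttywrite or lpwrite  */
}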

 

To be more exact, execution arrives in the upper half of the device driver, in “ttyread” or “ttywrite” for a serial device, or “lpread” or “lpwrite” for the LPR device.  There it enqueues or dequeues data between the user and kernel data areas, and calls sleep if appropriate, waiting for the actual i/o to be reported by interrupts.  In some cases, the upper half has to “start” the i/o in the device.

 

The interrupt handler handles the actual i/o with the hardware device, and calls wakeup when needed to get the upper half going again.

 

The hardware is at the bottom.

 

I drew a big picture to show all this, and how we could use (kernel) semaphores to do some of the coordination.
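
For example, output coordination could be sketched with a counting semaphore that counts free slots in the kernel output queue (all names here are hypothetical):

/* upper half: runs in write() on behalf of the user process */
void tty_putc(struct tty *tp, char c)
{
    wait(tp->osem);                    /* block the process if the queue is full */
    enqueue(tp->oqueue, c);
    start_output(tp);                  /* make sure TX interrupts are enabled */
}

/* lower half: the UART's transmit interrupt handler */
void tty_tx_interrupt(struct tty *tp)
{
    if (!queue_empty(tp->oqueue)) {
        outb(tp->base, dequeue(tp->oqueue));   /* next byte to the device */
        signal(tp->osem);              /* wakeup: one more free slot */
    }
}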

 

In the UNIX case, the device number used to look up the right row in the device switch table is called the major device number.  There is also a minor device number that discriminates among multiple devices of the same kind.  We can see these two numbers by doing “ls -l” on the special file:

 

ls -l /dev/lp

crw-rw-rw- 1 root   5, 1 <date> /dev/lp

 

Here the major device number is 5, so cdevsw[5].d_write will be accessed by a write to this device, and will be “lpwrite” or some similar name in the lp device driver.
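
In outline (a sketch only; the exact argument conventions of d_write varied across UNIX versions):

/* how a write system call on a character special file reaches the driver */
void cdev_write(dev_t dev)
{
    int maj = major(dev);              /* 5 for /dev/lp above: selects the cdevsw row */
    (*cdevsw[maj].d_write)(dev);       /* lands in lpwrite, which uses minor(dev) */
}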

 

pg 294 Buffering.  There are so many ways described here it’s confusing.  Concentrate on case (c), the most common case, where the kernel has a data buffer, and the user has another one.

 

Disks, pg. 300

 

Today’s disks have their own microcontroller, so they can do complicated things such as buffering data in their own memory.

 

First look at pg. 25, to see pic of platters, surfaces, heads, and disk arm.

 

A track is made up of a certain number of sectors, each 512 bytes.

 

A cylinder is the set of tracks at one arm position, and represents the set of sectors that can be read or written without seeking from the current arm position, so are “close together” on disk.

 

A partition is a set of cylinders.

 

A typical disk rotates at 7200 rpm = 120 rev/sec, or 8.3 ms/rotation.  Average rotational delay is half that, about 4.2 ms.

 

A seek is an arm motion to the right cylinder and takes about

--0 ms if already on right cylinder

--1 ms if at next cylinder

--up to 10 ms if further away.

 

The data transfer goes from disk surface to disk track buffer, then from there to the bus.

 

Example of 20G disk, pg. 301: 281 sectors/track, each 512 bytes, so 140K/track.  This is transferred to a track buffer in one rotation time of 8.3ms, so a transfer rate of 17M/sec.

 

However it is listed as having a transfer time of 17 µsec for one sector, or 512 bytes / 17 µsec ≈ 30M/sec.  This is the transfer rate out of the disk to the bus.  Since the data comes from the track buffer, this can be faster than reading directly off the platter.  In fact it is transferred to the bus in many bursts, so the bus can be used for other things too.

 

Each read or write to a disk involves seek, rotational delay, and transfer time, but these vary widely.

 

Example of random access: block anywhere on disk

 

                           To 100K block    To 4K block
Seek time                  10 ms            10 ms
Rot. delay                 4 ms             4 ms
Transfer to track buffer   6 ms             0.2 ms
Transfer to bus            3 ms             0.1 ms
Total                      23 ms            14 ms

 

Example of access to next block on disk

 

 

                           To 100K block                     To 4K block
Seek time                  0 ms (usually on same cylinder)   0 ms
Rot. delay                 0 ms                              0 ms
Transfer to track buffer   0 ms (already there)              0 ms
Transfer to bus            3 ms                              0.1 ms
Total                      3 ms                              0.1 ms

 

In the 4K block case, the track buffer will usually already have the data when the second read comes in, so the read is very fast.

 

The most important thing is to avoid thousands of seeks all over the disk.  Don’t worry about a few seeks.  A thousand random seeks would take 14 secs.

 

Recall the picture of one disk controller running 2 disks.  The one controller should be able to keep both disks busy at the same time, a feature called overlapped seeks.

 

The pattern of sectors on tracks isn’t as simple as you might think—see pg. 302.  So a simple system of “logical block numbers” across the whole disk makes it easier to use.  Note that the numbers go over cylinder 1, then cylinder 2, and so on, so that close-by numbers correspond to close-by disk sectors in terms of seek time.
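
The usual mapping from (cylinder, head, sector) geometry to a logical block number looks like this (a sketch assuming the simple geometry, with sectors traditionally numbered from 1):

/* logical block number for a disk with nheads heads and nsects sectors/track */
long logical_block(long cyl, long head, long sect, long nheads, long nsects)
{
    return (cyl * nheads + head) * nsects + (sect - 1);
}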

 

Latest disks: Serial ATA disks.  The “serial” comes from the fact that they use a bit stream from the disk to the controller, not parallel wires, one per bit, as in previous “IDE” or ATA disks.  Allows hot swap, previously only available with SCSI disks, which have used serial bitstreams for years.  Basically you get SCSI-like technology at ATA prices, i.e., much lower.  There are other features too, like flush-buffers-on-command, that database systems should use.

 

RAID—read to pg. 306.  Redundant Array of Inexpensive/Independent Disks

 

Now available for home systems.  Example: Dell desktop, most expensive on website, $1699 with 80GB 7200rpm SATA hard disk, + $289 for either RAID 0 with 500 GB, or RAID 1 with 250GB, since they both use two 250GB drives inside.  Also smaller RAID for +$153 or larger (400/800) for +$689.

 

Basic idea: make a box with disks inside, one controller for the whole thing so it looks like an ordinary disk to the OS.

 

Different levels of RAID

RAID 0 is “striping”, also available via OS software + ordinary disks.  No redundancy.  Good performance and easier to use than several separate disks.

Large i/o’s end up being serviced by multiple disks in parallel.  Great for handling huge files.

 

RAID 1 is “mirroring”, simple second-copy redundancy.  Writing not faster but some reading is.  Easy to understand recovery: copy whole disk.

 

RAID 2,3, 4—NIU (not in use)

 

RAID 5—used for slowly changing data, min of 3 disks.  Take the exclusive OR of the first stripes on n-1 disks and write it to the nth disk; this is called parity.  Handle the second stripes similarly, etc.  Then if one disk crashes, the lost data (or parity) can be reconstructed.  Better space utilization than RAID 1, but harder to recover.
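
A tiny sketch of the parity idea, one byte per disk (real RAID 5 also rotates the parity among the disks):

#include <stdio.h>

int main(void)
{
    unsigned char d0 = 0x5a, d1 = 0x3c, d2 = 0xf0;  /* one byte from each of 3 data disks */
    unsigned char parity = d0 ^ d1 ^ d2;            /* written to the parity disk */

    /* disk 1 crashes: XOR the survivors with the parity to recover its data */
    unsigned char recovered = parity ^ d0 ^ d2;
    printf("d1 = %02x, recovered = %02x\n", d1, recovered);
    return 0;
}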

 

RAID 6—can handle failure of any 2 disks. See http://www.pcguide.com/ref/hdd/perf/raid/levels/singleLevel6-c.html

 

 

File Systems: Chap 6, plus Chap 10, 11

 

Look at pg. 740-741.

 

Shows disk layout for UNIX systems.

 

Idea of inode: structure describing one file: info in ls -l, plus pointers to disk contents.  See pg. 742.  Devices have special info.

Can do “ls -i” to see inode numbers of files.  Note that the filenames themselves are not in the inode but rather in the directory file, along with the inode number.

 

One inode can have multiple directory entries, meaning that the same file contents are known under multiple names.  To try this out, do “ln foo bar” to link a new name to file foo.  Then do “ls -i” to see that they have the same inode number.

 

pg. 743: lots of info in this picture.  Note that the open file table is in the kernel global memory, while the file descriptor tables are in the process table entries.

 

Shows how parent and child share an fd passed via fork and still usable (open) after exec.  Clearly you need to know the fd to use it (and we know the whole user space is wiped out and rebuilt by exec), but stdout = 1 always, and stdin = 0 and stderr = 2, so once one of these is set up by the shell, it can be used by all the processes run from the shell.
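
A sketch of how a shell sets this up for output redirection (filenames illustrative):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fork() == 0) {                         /* child */
        dup2(fd, 1);                           /* make fd 1 (stdout) refer to out.txt */
        execlp("ls", "ls", "-l", (char *)0);   /* ls inherits fd 1 across exec */
    }
    return 0;
}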

 

Also shows how the contents of the file are held together with index blocks.  Looks like the page-table picture, but can go down more levels.  Can handle TB (terabyte) files.
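
A back-of-the-envelope check of that claim (block size and pointer size here are assumed values, not the book’s exact numbers):

#include <stdio.h>

int main(void)
{
    long long bsize = 4096;                 /* assumed block size */
    long long nptr  = bsize / 4;            /* 4-byte block numbers: 1024 per index block */
    long long blocks = 12                   /* typical number of direct pointers */
                     + nptr                 /* single indirect */
                     + nptr * nptr          /* double indirect */
                     + nptr * nptr * nptr;  /* triple indirect */
    printf("max file ~ %lld bytes\n", blocks * bsize);   /* about 4 TB */
    return 0;
}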

 

Read more on filesystems for the final!