CS444 Class 21  Swapping, Page Faults

Handout: Intro to hw4

 

Intro to hw4—go over handout

 

We’ll discuss this further next class.

 

Back to Memory Management.

 

Note that virtual memory can be much larger than physical memory, but when this is true, and processes are actively using the virtual memory, there is a lot of paging going on.

 

 

Stressing memory management: Last year a student wrote a “depth bomb” program to severely exercise ulab’s memory management, which we had previously observed to be hardly used: it had had only 22 revolutions of the clock daemon in almost 3 years of uptime.  His program created about 2000 processes, each malloc’ing 4MB of memory and then using it, forcing the OS to provide memory for them.  That’s 8GB of memory, but ulab has only about 400MB of physical memory, so it had to do a lot of paging to keep up.  The system did survive, and after these runs showed 39 revolutions of the clock daemon.  I suspect that many malloc’s failed, because ulab has only about 1GB of swap space, and the brk system call used in malloc allocates swap space for the malloc’d memory.  But even if only 800MB of mallocs succeeded, that’s still much more than the 400MB of physical memory, so it has the same effect.

 

I assume he checked that no other users were working on ulab at the time he ran the program, since this would be a significant “denial of service” to them.

 

 

 

Swap space or swap area: an area of disk or SSD (solid state disk, aka flash memory) to hold page data. Typically sized at twice physical memory size. This area has no file system: it’s just a bunch of page images. Of course the OS has to track which pages are in use, and by what process. Swapping can be done to a file at some cost in performance (pg. 769).
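
As a sketch of the bookkeeping this implies (hypothetical names and layout, not any particular OS), a bitmap of in-use slots plus an owner table is enough:

/* Hypothetical swap-slot bookkeeping: a bitmap of in-use slots plus
   an owner table mapping each slot to (process, virtual page). */
#define NSLOTS 262144                /* e.g. 1GB of swap / 4KB pages */

static unsigned char slot_used[NSLOTS / 8];
static struct { int pid; unsigned long vpage; } slot_owner[NSLOTS];

int alloc_swap_slot(int pid, unsigned long vpage)
{
    for (int i = 0; i < NSLOTS; i++)
        if (!(slot_used[i / 8] & (1 << (i % 8)))) {
            slot_used[i / 8] |= 1 << (i % 8);
            slot_owner[i].pid = pid;
            slot_owner[i].vpage = vpage;
            return i;                /* slot number = page offset into swap area */
        }
    return -1;                       /* swap space exhausted */
}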

 

Uses of swap space:

UNIX/Linux, Windows: page-outs from the paging system.  When a dirty page is reclaimed, its contents are written out to swap space.

Solaris: swaps whole process images to swap space if idle too long (20 min.) or need for memory is high…

 

Swapping under extreme load on Solaris UNIX

If the clock algorithm is unable to get the number of pages on the free list up even to a low threshold, it causes swapping to start on actually-active processes, a desperate action.  This should never happen on a healthy system.  It is discussed at the bottom of pg. 718 and the top of the next page.  Linux apparently just does more and more paging instead.

 

In real OS execution, PDs and PTs (page directories and page tables) come and go with the processes.  Each process has a PD, and at least a few PTs, to support its virtual memory for code, data, stack, and DLLs.  In Linux, there is 1 GB of kernel virtual memory that uses the upper one quarter of the PD, plus PTs.
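
As a reminder of the two-level structure, here is how the x86-32 MMU splits a 32-bit linear address (this is the standard hardware layout); the kernel's upper 1 GB corresponds to PD indices 768-1023:

#define PD_INDEX(la)     (((la) >> 22) & 0x3ff)   /* top 10 bits: PD entry */
#define PT_INDEX(la)     (((la) >> 12) & 0x3ff)   /* next 10 bits: PT entry */
#define PAGE_OFFSET(la)  ((la) & 0xfff)           /* low 12 bits: byte in page */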

 

On a given x86 processor, the current process has CPU register CR3 pointing to its PD, and thus all of its virtual memory mapped in, including the kernel in the case of x86 Linux.  There is typically a group of other processes on the system with their PDs and PTs in memory, but not currently provided a CPU, so not mapped in.

 

<diagram of physical memory holding PDs and PTs, but one mapped setup pointed to from CR3 in the CPU>

 

Tan, Sec. 3.6.1 pp 227-230.  MMU actions for processes

Four times the OS has paging-related work:

1. Process creation (fork): create page directory (PD) and at least one PT

2. Execution: exec: map in executable file pages; process switch: make the CPU use a different PD

3. Page Fault: fix up one page, PTE

4. Process Termination (exit or exception): deallocate PD, PTs

 

At process switch, the CR3 register is loaded with the PA of the PD of the newly scheduled process, and that causes a whole new process image to be mapped in.  The caches need to be flushed, both the instruction/data and TLBs.  Just after a process switch, cache misses are frequent until the caches again have the important data in them.  This cache flush action is a big performance effect attached to a process switch.
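
The reload itself is a single privileged instruction.  A minimal sketch (x86-32, GCC inline assembly); writing CR3 is what implicitly flushes the non-global TLB entries:

/* Point the MMU at the new process's page directory.  pd_phys must be
   the physical, page-aligned address of the PD. */
static inline void load_cr3(unsigned long pd_phys)
{
    asm volatile("movl %0, %%cr3" : : "r"(pd_phys) : "memory");
}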

 

Note that switching between threads of one process does not involve reloading CR3 or flushing caches, and thus is significantly more lightweight.

 

MMU and system security

The MMU is very important to the job of keeping a user process "bottled up" in its virtual machine.  Each address in a user program is tested on every instruction, so the process can't see anything it shouldn't, in particular the kernel code and data.

The MMU causes a page fault or general protection exception for addresses that fail its test, and this causes the kernel to execute and figure out what to do.  Each page is marked U or S, which lets the kernel see things that user-level execution can't, and generates an exception if user code tries to access kernel memory.

Naturally, the instruction to change the CR3 is privileged.  The page tables and page directory are hidden away from the user program in the kernel data area.
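
For reference, the protection-relevant low bits of an x86-32 PTE (standard hardware layout):

#define PTE_P   0x001   /* present: if 0, any access page-faults */
#define PTE_RW  0x002   /* writable (else read-only) */
#define PTE_US  0x004   /* 1 = user (U), 0 = supervisor (S) */

A user-mode access to a page whose U bit is clear, or a write to a read-only page, raises the exception just described.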

 

Page Fault Handling in x86.

 

 

 

 

Back to the page fault handler: look at Tan., pg. 228 for steps.

 

  1. MMU causes trap, puts faulting LA in CR2
  2. Assembly (as) routine saves regs (this is part of the OS)
  3. PF handler (in C) figures out which page not present
  4. OS checks whether this is a valid page of this process (else it usually kills the process), then gets a page from the free list.
  5. Drop this step (the free list handling gets dirty pages written out earlier)
  6. OS locates needed data, often in executable file or swap space, schedules disk i/o, blocks this process  <-- Note blocking in PF
  7. Disk interrupt signals page in, PTE updated, wakeup process.
  8. Faulting instruction needs reexecuting—process is still in kernel after scheduling, back in PF handler, with user PC on bottom of stack where it can be adjusted (backed up) (this is done in the PF handler, not the disk interrupt as Tan seems to say).
  9. The PF handler returns to the as routine
  10. The as routine does the iret, using the backed-up PC, and resumes user execution.
  11. (added)  The user code re-executes the instruction that caused the PF, more successfully this time.
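
Putting steps 3-8 together, here is a hypothetical C skeleton of the PF handler (all names are illustrative, not any real kernel's API):

void pf_handler(struct trapframe *tf)        /* entered from the as routine */
{
    unsigned long la = read_cr2();           /* step 1: faulting LA */
    struct region *r = find_region(curproc, la);
    if (r == NULL) {                         /* step 4: invalid address */
        kill_process(curproc, SIGSEGV);
        return;
    }
    struct page *pg = get_free_page();       /* step 4: page from free list */
    if (r->backing)                          /* executable file or swap space */
        read_page_blocking(r, la, pg);       /* step 6: blocks this process */
    set_pte(curproc->pd, la, pg, r->prot);   /* step 7: PTE now present */
    back_up_pc(tf);                          /* step 8: saved user PC -> faulting instr. */
}                                            /* steps 9-10: return to as routine, iret */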

 

Tan., pg. 230.  Instruction backup is not such a problem in x86 or Sparc, because there is no auto-incrementation done along with memory access, and there is at most one operand memory address per instruction.

Examples of PFs

 

  1. First ref to data page—page contents are in executable file.  PF handler blocks.
  2. First ref to BSS page (uninitialized data)—no blocking, just assign page from free list.
  3. Ref that extends the user stack—same as 2.
  4. First ref to text page (code)—as in 1, or if this program is in use by another process, arrange sharing of code page already in memory.
  5. Reref after pageout to swap space—block while read in from swap space.
  6. Ref to address outside of program image: fails validity test in step 4 above, causes “segmentation violation” in Solaris, usually kills process.
  7. Ref to malloc’d memory (heap): malloc itself only allocates swap space, not real memory, so the memory is added by PFs, like #2 (see the sketch just below).
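
Here is that sketch, a small runnable illustration of #2 and #7 (assuming 4KB pages): each first write to a new heap page takes a PF that assigns a page from the free list.

#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 24;                 /* 16MB: brk allocates only swap space */
    char *p = malloc(n);
    if (p == NULL)
        return 1;
    for (size_t i = 0; i < n; i += 4096)
        p[i] = 1;                       /* first touch of each page causes a PF */
    free(p);
    return 0;
}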

 

Note that a PF is more like a system call than an interrupt.  They are both exceptions or “traps.”  When the kernel is executing after a trap, it is executing on behalf of the current process, so the process entry and process image are relevant and usable.  No problem in blocking.  An interrupt is quite different.  Interrupt handlers execute as guests of a “random” process.  They normally don’t access process data, only kernel global data relevant to their device.

 

 

Tan, pg. 758 Linux memory management.

Discussion of process image regions—don’t forget DLLs too now.

 

pg. 758 pic. showing 2 processes in memory sharing their code pages.  But note that only one of these is “current” in use of the CPU (unless there are multiple CPUs).  The other one is in memory but not scheduled at the moment.  On the x86, the current one has the CPU register CR3 pointing at its page directory, which makes its whole process image mapped in.  Other CPU architectures have similar master registers for the top level of their paging structures.

 

Memory-mapped files. Fig. 10-13, pg. 761

A region of a file can be mapped into a process image.  Then writes to that part of VA space cause corresponding writes to the file pages.  If two processes map the same region of a file, they end up with shared memory with data that persists in the filesystem.  However, this is not commonly used in applications, partly because the solution is not portable, and partly because there are subtle issues such as exactly when the file writes occur. 

 

Often, the memory-mapping mechanism is used for shared memory, ignoring the file itself.  You can use /dev/zero, the OS-supplied effectively infinite file of zeroes, instead of a real file.  See mmap_nofile.c.
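
A minimal sketch of that technique (not necessarily the actual mmap_nofile.c); substituting a real file's descriptor gives the file-backed mapping of Fig. 10-13:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return 1;
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    p[0] = 'x';                 /* shared with any child forked after the mmap */
    printf("%c\n", p[0]);
    return 0;
}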

 

Memory-related system calls.

Most paging actions are done without memory-specific system calls, but malloc does need a system call to get memory assigned to the process.  The underlying system call is brk.  Memory-mapped files are set up and torn down with the mmap and munmap system calls, covered above.
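
For illustration, the break can be moved directly with sbrk, the traditional library wrapper around brk; malloc does the equivalent internally when its pool runs out:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);     /* current break (top of the data segment) */
    sbrk(4096);                 /* grow the process image by one page */
    void *after = sbrk(0);
    printf("break: %p -> %p\n", before, after);
    return 0;
}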

 

Finished with memory management coverage. Skipping Chap 4, on to Chap. 5, already partly covered.

Chap. 5 I/O

Reading: Chap 5 to pg. 332, 339-347, skip DMA, 348-360, Chap 10, Sec. 10.5: 771-773, 775-776

 

Block vs. char devices (mainly a UNIX idea)

 

Each device under UNIX has a special file, also known as “device node”.  Tan., pg. 734 example is “/dev/lp” for a line printer device.  Classically, device nodes were kept in directory /dev.  They are not ordinary files, but rather filenames and associated information about a device. 

 

When you display them with “ls -l”, you see a “c” for char device or “b” for block device as the first character of the listing, as you would see a directory marked “d”.  For example a line printer would be a char device:

 

ls -l /dev/lp

crw-rw-rw-  1  root ... /dev/lp

 

On Solaris, the devices have been reorganized into many subdirectories by device type, and with symbolic links to other names, so it’s a bit hard to find the actual device nodes.  For example, on ulab, we have /dev/board1, the serial line to mtip system 1.  We have to follow two symbolic links before we find the device node:

 

blade57(6)% ls -l /dev/board1

lrwxrwxrwx   1 root            5 Apr 25  2008 /dev/board1 -> ttyrf

blade57(8)% ls -l /dev/ttyrf

lrwxrwxrwx   1 root           30 Sep 19  2002 /dev/ttyrf -> ../devices/pseudo/ptsl@0:ttyrf

blade57(9)% ls -l /dev/../devices/pseudo/ptsl@0:ttyrf

crw-rw-rw-   1 root      26,  47 Dec  3 08:59 /dev/../devices/pseudo/ptsl@0:ttyrf

 

The c shows that we finally found the device node.