CS444 Class 21  Page Replacement Algorithms, Page Faults

 

Free list: List of pages the kernel can use when it needs memory. Needs continual replenishment by “reclaiming” pages.

Reclaim a page: remove a page of memory from a process VA space (or kernel data area), so its PTE P bit goes from 1 to 0.

Page replacement algorithm: decision algorithm for which pages to reclaim to keep the free list in good shape

Page age: time back to the last reference to a page. Mostly a theoretical concept, since current MMU hardware doesn’t track page age, only provides an R/A (Referenced/Accessed) bit.

 

Reclaiming a dirty page: possible, but requires writing the page contents to swap space (a “page out” action).

 

Page Replacement Algorithms

 

Would like to reclaim “old” pages (large page age), since they seem to be less used than others, but MMU doesn’t track page age, just provides R/A bit.

 

Idea: Suppose “old” pages are ones with age > 30 min.

Algorithm: Clear the R bit in a group of pages, come back 30 min later, see which pages still have R=0, reclaim them. 

This is the basic idea of the Clock Algorithm, though it is entirely flexible on the time period.

 

Practical algorithm using R/A bit:

Clock algorithm, pg. 205.  Here the hand points to the page that this algorithm hasn’t examined for the longest time as it cycles through the rest of the pages.  Although this is a low-cost algorithm, it isn’t in direct use by Linux, because more complicated algorithms (which we won’t cover in detail) work better using the same R/A bit.  Solaris uses this approach.

 

LRU, the touchstone of page replacement algorithms—Sec. 3.4.6

 

If LRU isn’t excellent, what’s better?  I proposed and proved things about “LRU-K”, an LRU-family algorithm that tracks the last K references—it’s mentioned on the Wikipedia page on Page Replacement Algorithms!

 

Linux: the PFRA (Page Frame Reclaiming Algorithm), described (confusingly) on pp. 769-771, does use the A bit, and is part of Linux 2.6 and Linux 3.6, the current version (I checked the Linux sources—see links to online Linux kernel sources on the cs444 class web page).  Don’t worry about it in detail for this class; just remember that Linux has a page replacement algorithm that uses the R/A bit. I was interested because it looks like an approximation to LRU-K, not just LRU.

 

Android: Android uses a Linux process for each app execution, with virtual memory and demand paging, and a JVM on top of the Linux VA space, but limits memory to (say) 48MB for a Java app. It does not page user memory out to persistent storage, however; that is, it normally runs without swap space. Also, Android can throw a process out of memory when it determines that the memory is needed for higher-priority processes.  So an app has to store its own context somewhere to be ready to restart when the user returns to it. The app receives an onPause() event when it gets paused, at which point it should save its state, according to Gargenta, Learning Android, O’Reilly, p. 29.

 

Solaris uses a two-handed clock algorithm, but has responded to the problem of sweeps (at least those caused by file access) by using a priority system to keep balance between file access pages and process image pages.

 

Paging in practice: UNIX/Linux commands vmstat and uptime

You can see how many revolutions of the clock hand have happened on a Solaris system since the last reboot by “vmstat -s”, and how many days of uptime by “uptime”.   Try it out!  More on this next time.  On Linux, vmstat -s shows a lot of stats, but not this one.

The 2-handed clock algorithm is executed by the page daemon, process 2 on “classic” BSD UNIX, now a kernel thread in Solaris and Linux.  But to keep up traditions, the kernel thread appears as process 2 “pageout” on Solaris.  (See it with “ps -ax | head” on ulab.)

Useful small article on vmstat from the Linux Journal.

 

Swapping under extreme load on Solaris UNIX

If the 2-handed clock algorithm is unable to get the #pages on the free list up to even a certain fraction of lotsfree, it causes swapping to start on actually-active processes, a desperate action.  This should never happen on a healthy system.  It is discussed at the bottom of pg. 718 and top of the next page. Linux apparently just uses more and more paging.

 

In the real OS execution, PDs and PTs come and go with the processes. Each process has a PD, and at least a few PTs, to support its virtual memory: code, data, stack, and shared libraries (DLLs).  In (32-bit x86) Linux, there’s 1 GB of kernel virtual memory that uses the upper quarter of the PD, plus its PTs.

 

The current process on a given x86 processor has CPU register CR3 pointing to its PD, thus mapping in all its virtual memory, including the kernel for x86 Linux. There are typically a group of other processes on the system with their PDs and PTs in memory, but not currently provided a CPU, so not mapped in.

 

Tan, Sec. 3.6.1 pp 227-230.  MMU actions for processes

Four times the OS has paging-related work:

  1. Process creation (fork): create the page directory (PD) and at least one PT
  2. Execution: exec: map in the executable file’s pages; process switch: make the CPU use a different PD
  3. Page fault: fix up one page and its PTE
  4. Process termination (exit or exception): deallocate the PD and PTs

 

At process switch, the CR3 register is loaded with the PA of the PD of the newly scheduled process, and that causes a whole new process image to be mapped in.  The caches need to be flushed, both the instruction/data caches and the TLBs.  Just after a process switch, cache misses are frequent until the caches again hold the important data.  This cache-flush action is a big performance cost attached to a process switch.

 

Note that switching between threads of one process does not involve reloading CR3 or flushing caches, and thus is significantly more lightweight.

 

MMU and system security

The MMU is very important to the job of keeping a user process "bottled up" in its virtual machine.  Every address a user program generates is checked on every instruction, so the process can't see anything it shouldn't, in particular the kernel code and data.

The MMU causes a page fault or general protection exception for addresses that fail its test, and this causes the kernel to execute and figure out what to do. Each page is marked U or S to allow the kernel to see things the user level execution can't, and generate an exception if a user execution tries to access kernel memory.

Naturally, the instruction to change the CR3 is privileged.  The page tables and page directory are hidden away from the user program in the kernel data area.

 

Page Fault Handling in x86.

 

Back to the page fault handler: look at Tan., pg. 228 for steps.

 

  1. MMU causes trap, puts faulting LA in CR2
  2. Assembler ("as") routine saves regs (this is part of the OS)
  3. PF handler (in C) figures out which page is not present
  4. OS checks whether this page is a valid page of this process (else usually kills the process), gets a page from the free list
  5. Drop this step (the free-list handling gets dirty pages written out earlier)
  6. OS locates the needed data, often in the executable file or swap space, schedules disk I/O, blocks this process  <-- Note blocking in PF
  7. Disk interrupt signals page in, PTE updated, wakeup process
  8. Faulting instruction needs re-executing—the process is still in the kernel after scheduling, back in the PF handler, with the user PC on the bottom of the stack where it can be adjusted (backed up) (this is done in the PF handler, not the disk interrupt as Tan seems to say)
  9. The PF handler returns to the as routine
  10. The as routine does the iret, using the backed-up PC, and resumes user execution
  11. (added)  The user code re-executes the instruction that caused the PF, more successfully this time

 

Tan., pg. 230.  Instruction backup is not such a problem on x86 or SPARC, because there is no auto-increment addressing done along with memory access, and there is at most one memory operand address per instruction.

 

Examples of PFs

 

  1. First ref to data page—page contents are in executable file.  PF handler blocks.
  2. First ref to BSS page (uninitialized data)—no blocking, just assign page from free list.
  3. Ref that extends the user stack—same as 2.
  4. First ref to text page (code)—as in 1, or if this program is in use by another process, arrange sharing of code page already in memory.
  5. Reref after pageout to swap space—block while read in from swap space.
  6. Ref to address outside of program image: fails validity test in step 4 above, causes “segmentation violation” in Solaris, usually kills process.
  7. Ref to malloc’d memory (heap): malloc itself only allocates swap space, not real memory, so the memory is added by PFs, like #2.

 

Note that a PF is more like a system call than an interrupt.  They are both exceptions or “traps.”  When the kernel is executing after a trap, it is executing on behalf of the current process, so the process entry and process image are relevant and usable.  No problem in blocking.  An interrupt is quite different.  Interrupt handlers execute as guests of a “random” process.  They normally don’t access process data, only kernel global data relevant to their device.

 

 

Tan, pg. 758 Linux memory management.

Discussion of process image regions—don’t forget DLLs too now.

 

Pg. 758 picture showing 2 processes in memory sharing their code pages.  But note that only one of these is “current” in its use of the CPU (unless there are multiple CPUs); the other is in memory but not scheduled at the moment.  On the x86, the current one has the CPU register CR3 pointing at its page directory, which maps in its whole process image.  Other CPU architectures have similar master registers for the top level of their paging support.

 

Memory-mapped files. Fig. 10-13, pg. 761

A region of a file can be mapped into a process image.  Then writes to that part of VA space cause corresponding writes to the file pages.  If two processes map the same region of a file, they end up with shared memory with data that persists in the filesystem.  However, this is not commonly used in applications, partly because the solution is not portable, and partly because there are subtle issues such as exactly when the file writes occur. 

 

Often, the memory-mapping mechanism is used for shared memory, ignoring the file itself.  You can use /dev/zero, the OS-supplied effectively infinite file of zeroes, instead of a real file.  See mmap_nofile.c.

 

Memory-related system calls.

Most paging actions are done without memory-specific system calls, but malloc does need a system call to get memory assigned to the process. The underlying system call is brk.  Memory-mapped