Thurs., Nov. 16 Demand Paging

 

Continuing with Chap. 4 in Tan., pg. 211

hw4 due Tues, Nov. 28, no programming, no late days

 

TLB, aka “address cache”, really a cache of PTEs.

 

Why needed?  Recall how the MMU maps a VA, say 0x12345678, into a PA:

            first 10 bits, 0001 0010 00 = 0x48, determines, via lookup in the page dir, the PT to use

            middle 10 bits, 11 0100 0101 = 0x345, determines which PTE in that PT to use

 

So only after two memory references is the MMU ready to do the actual memory access it was asked to do: the access to some PFN recorded in the PTE, at offset 0x678.  Recall the “memory wall” we discussed in cs241: today’s CPUs are so fast that their speed is limited mostly by the time it takes to read in instructions and data.  Obviously three memory references for every VA access would slow things down terribly.
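The bit arithmetic of the split can be checked directly.  A small sketch, using the VA 0x12345678 from above and the x86 10/10/12-bit two-level layout:

```python
def split_va(va):
    """Split a 32-bit VA into x86 two-level paging fields:
    top 10 bits index the page directory, next 10 bits index
    the page table, low 12 bits are the offset in the 4KB page."""
    pde_index = (va >> 22) & 0x3FF   # page directory entry index
    pte_index = (va >> 12) & 0x3FF   # page table entry index
    offset    = va & 0xFFF           # byte within the page
    return pde_index, pte_index, offset

pde, pte, off = split_va(0x12345678)
print(hex(pde), hex(pte), hex(off))  # 0x48 0x345 0x678
```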

 

The TLB caches PTEs, to speed up the VA to PA mapping.  In CS241, we studied instruction and data caching (L1 and L2 caches), so we see that the x86 has several kinds of caches.  The instruction and data caches work in units of 32-byte “cache lines”, whereas the TLB works in units of 4-byte PTEs.  In fact, the Pentium has 2 TLBs, one for instruction addresses and one for data addresses.  TLB sizes have grown over successive versions of the CPU.  For example, the Pentium 4 has 128 slots in its instruction TLB.

 

Tan. discusses “software TLB management”, pg. 212, but let’s not worry about this.  Also skip “inverted PTs”, not needed for 32-bit systems.

 

TLBs are constructed from “associative memory”, in which parallel requests are made to all the various entries in the cache at once:  “Anyone got the PTE for VPN 345?”  The one matching entry returns the PTE.  If the VPN is not cached, the PT in memory is accessed.
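In software, the effect of that parallel associative lookup can be modeled as a dictionary keyed by VPN.  A sketch only (hardware compares all entries at once; the class name, size, and the crude FIFO eviction here are made up for illustration):

```python
class TLB:
    """Toy TLB model: VPN -> PTE cache with FIFO eviction."""
    def __init__(self, size=64):
        self.size = size
        self.entries = {}              # vpn -> pte

    def lookup(self, vpn, page_table):
        if vpn in self.entries:        # hit: no extra memory reference
            return self.entries[vpn]
        pte = page_table[vpn]          # miss: walk the PT in memory
        if len(self.entries) >= self.size:
            # evict the oldest-inserted entry (dicts keep insertion order)
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpn] = pte
        return pte

tlb = TLB()
pt = {0x345: 0xDEADB000}               # hypothetical PFN for VPN 0x345
print(hex(tlb.lookup(0x345, pt)))      # miss: walks the page table
print(hex(tlb.lookup(0x345, pt)))      # hit: served from the TLB
```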

 

Page Replacement Algorithms: deciding which page to throw out of memory

 

Page replacement is the one spot in memory management where the OS actually has discretion, so this is where the interesting decisions are made.  By contrast, when pages are brought into memory under demand paging, there is very little choice about it—the program needs the page and cannot make progress without it.

 

The OS maintains a free list of pages to use as needed.  It gets a page from the free list at page fault time, and puts pages back on the free list by a kind of garbage-collection scheme with the fancy name of “page replacement algorithm”.  Garbage collection is a real technical term used by Java VM people and compiler writers as well as OS folk.

 

 

Demand Paging and the circulation of memory pages.

 

With demand paging, a program is not “loaded into memory”, but rather only mapped into memory.  PTEs with P=0 are set up for it, except for say one page of stack and possibly one page of code, which are given actual physical memory.
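The P=0 setup and first-touch allocation can be sketched as a small simulation (the sizes are hypothetical, and a real fault handler would also read the page’s contents in from the executable file or swap):

```python
from collections import deque

free_list = deque(range(16))            # 16 free physical frames

class PTE:
    def __init__(self):
        self.present = False            # P=0: mapped but not in memory
        self.pfn = None

page_table = [PTE() for _ in range(8)]  # small hypothetical process image

def access(vpn):
    pte = page_table[vpn]
    if not pte.present:                 # page fault on first touch
        pte.pfn = free_list.popleft()   # take a frame from the free list
        pte.present = True              # (real OS also fills the page)
    return pte.pfn

access(0)   # fault: a frame is allocated
access(0)   # no fault: page already present
```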

 

 

 

 

We see that the pages go round and round in the system, spending time on the free list and then in some process image (or possibly in a page table or other kernel object—the kernel uses paging for some of its own data too.)

 

Tan. discusses the free list on pg. 717.

 

 

Page Age in theory and practice: the page reference bit.

The idea of page age is important to the understanding of good memory management, but when we look at a PTE, we don’t see a page age field.  All we see is a reference bit, denoted “R” in Tan., “A” in x86.  This tells us the page has been referenced since the last time A was set to zero.  This tiny clue is used by several important algorithms in actual use by OSs.

 

(However it is not in universal use.  According to Inside Windows 2000 (pg. 460), Windows 2000 uses the A bit if it is running on a single processor, but not if it is running on a multiprocessor system.  Each of multiple CPUs has its own TLB, and to make setting A=0 effective, that VPN’s entry must be flushed from all those TLBs, an operation deemed not worth it.)

 

First A/R-bit algorithm: Clock algorithm, pg. 218.  Here the hand points to the oldest page in a sense, but not exactly the LRU sense.  It’s just the one that hasn’t been examined by this algorithm for the longest time as it cycled through the rest of the pages.  Although this is a low-cost algorithm, it isn’t in use because the two-handed algorithm (discussed below) works better using the same A/R bit.
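Even though it is not used in practice, the basic clock mechanism is easy to sketch.  A simulation (frame count and reference pattern are made up; the hand clears A on referenced pages and evicts the first page it finds with A=0):

```python
class Frame:
    def __init__(self):
        self.referenced = False   # the A/R bit

frames = [Frame() for _ in range(4)]
hand = 0

def clock_evict():
    """Advance the hand until a frame with A=0 is found; return it."""
    global hand
    while True:
        f = frames[hand]
        if f.referenced:
            f.referenced = False  # second chance: clear A, move on
            hand = (hand + 1) % len(frames)
        else:
            victim = hand         # A=0: this page gets evicted
            hand = (hand + 1) % len(frames)
            return victim

frames[0].referenced = True
print(clock_evict())  # 1: frame 0 gets a second chance, frame 1 is evicted
```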

 

LRU is the touchstone of page replacement algorithms; the practical algorithms below approximate it.

 

Two-handed Clock Algorithm: commonly used in UNIX implementations, Tan., pg. 718.

 

Unlike the clock algorithm, the hands of the clock in this algorithm sweep around the pages in the order they lie in physical memory, so no “extra” data structure is needed.  The first hand to pass a page sets A=0, and the second one arrives a while later and checks if A is still 0, and if so, puts the page on the free list (after writing it out if it is dirty.)

 

The number of pages on the free list is used as a guide to how far to sweep across memory with this algorithm.  The target number of pages is “lotsfree”, so the algorithm runs until this number is collected, and then quits.  In a memory-rich system, the algorithm only makes a few “turns” or “revolutions” around memory in a day. 
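The two-hand mechanism and the lotsfree target can be simulated in a few lines.  A sketch only: hand spacing, scan rate, and lotsfree are tuned parameters in real kernels, and in this toy run nothing re-references a page between the hands:

```python
NFRAMES = 8
referenced = [True] * NFRAMES     # the A bit for each physical frame
free_list = []
LOTSFREE = 3                      # target size of the free list
GAP = 2                           # distance between the two hands

def two_handed_sweep():
    front = 0
    back = -GAP                   # back hand trails the front hand
    while len(free_list) < LOTSFREE:
        referenced[front % NFRAMES] = False    # front hand clears A
        if back >= 0 and not referenced[back % NFRAMES]:
            free_list.append(back % NFRAMES)   # A still 0: reclaim it
            # (a dirty page would be written out first)
        front += 1
        back += 1

two_handed_sweep()
print(free_list)  # the frames the front hand cleared that stayed at A=0
```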

 

Page replacement algorithms

LRU (least recently used): the touchstone, but too expensive for OSs; it is used by database servers.

Clock algorithm, two-handed clock algorithm: rough approximations to LRU that can be afforded; covered last time.

NFU: read about it, pg. 220-221

 

Idea of Working Set for a Process

 

Because of the tendency of programs to have “locality of reference”, that is, clustering of references to certain regions, there normally is a set of pages the process uses a lot, plus an additional number it uses less often.  That core set is called its working set.
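The working set over a window of the last tau references can be computed directly from a reference string (a sketch of the definition, not of how an OS would actually track it; the reference string is invented to show locality):

```python
def working_set(refs, t, tau):
    """Pages referenced in the window (t - tau, t] of the
    reference string refs (a list of page numbers)."""
    start = max(0, t - tau)
    return set(refs[start:t])

# A reference string with locality: the process loops over
# pages 1,2,3 for a while, then moves on to pages 4,5.
refs = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5, 4, 5]
print(working_set(refs, 9, 4))   # {1, 2, 3}: the early working set
print(working_set(refs, 13, 4))  # {4, 5}: the set after the shift
```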

 

It would be good if the OS could keep all the working sets in memory for all the active processes.  In fact this happens under LRU to a pretty good approximation if the physical memory is large enough to hold all the working sets, plus a margin.  Even with the rough approximation to LRU of the 2-handed clock algorithm, this is true. 

 

However, LRU can be “fooled” by sequential access to files or memory, that is, sweeps over huge amounts of data.  In that case, all these pages look recently accessed, and the older pages of process images are swept out.  All it takes is one crazy process (or maybe this sweep is the most important thing the computer ever does.)

 

To be really fair, processes should be individually monitored and protected from one another’s sweeps.  That line of thinking points to establishing a “working set manager” that checks on processes and makes sure they are not hogging pages.  The developers of VMS were so sure of this approach they didn’t bother to put a reference bit in the VAX’s PTE, a big problem for the UNIX community, who wanted to run UNIX on VAXes.  Since the developers of Windows NT were the same group, it’s not too surprising that Windows 2000 also has a working set manager, although it also uses the reference bit in the case of uniprocessors.

 

Solaris uses the two-handed algorithm, but has responded to the problem of sweeps (at least those caused by file access) by using a priority system to keep balance between file access pages and process image pages.

Paging in practice: UNIX commands vmstat and uptime

You can see how many revolutions of the clock hand have happened on a Solaris system since the last reboot with “vmstat -s”, and how many days of uptime with “uptime”.  Try it out!  More on this next time.  On Linux, “vmstat -s” shows a lot of stats, but not this one.

The 2-handed clock algorithm is executed by the page daemon, process 2 on “classic” BSD UNIX, now a kernel thread in Solaris and Linux.  But to keep up traditions, the kernel thread appears as process 2 “pageout” on Solaris.  (See it with “ps ax | head” on ulab.)

Swapping under extreme load on UNIX

If the 2-handed clock algorithm is unable to get the number of pages on the free list up to even a certain fraction of lotsfree, it causes swapping of actually-active processes to start, a desperate action.  This should never happen on a healthy system.  It is discussed at the bottom of pg. 718 and the top of the next page.

 

Note: Pg. 721, picture of 3 page table levels in Linux.  But only 2 of these are in actual use for x86—the kernel has to follow the actual hardware setup!  The 3 levels are in actual use for the Alpha processor (now somewhat obsolete.)