Paging
Memory is divided up into equal-sized blocks called pages, typically 4K (x86) or 8K (SPARC) in size. The VA space for each process is quietly divided into pages—the page division marks are not apparent to programmers. If the program extends even one byte into a page, the whole page is used (in theory; if no reference is ever made to that one byte, it may never be given real memory). Thus we don't want pages to be too big, but not too small either, because each page needs to be kept track of, an overhead.
The physical memory is divided up into the same size blocks, called page frames. All the page frames are exactly the same in terms of memory resources. The idea of paging is that any page frame can be used as memory under any (virtual) page. A page table tells which page frame is assigned to each page of (some part of) a VA space.
The x86 paging system
A nice system, well adapted to use for any 32-bit OS you might wish to write, and in use by all the versions of UNIX on x86, plus Win2K, to implement a “flat address space.” The x86 also offers segmentation, but this is not in use by these OS’s, except in a very trivial way, so we will try to ignore it.
Note: good memory management for 64-bit OS’s is more challenging, and this system is not ready for that. We’ll stick to 32-bit systems for memory management examples.
x86 paging uses 4K pages, and a group of 1024 (1K) pages uses one page table (or PT.) 1K*4K = 4M, so each page table specifies VA to PA mappings for 4M of VA space. Since it takes 1K blocks of 4M to make 4G, we see that there are (up to) 1K PTs for a full 32-bit address space.
A page table has one entry for each page, called a page table entry (PTE.)
The x86 PTE looks like this, showing important bits:
| Page frame number (pfn): bits 12-31 |  |  |  |  |  | D | A |  |  | U | W | V |
V: valid bit (pfn is real), W: this page is writable, U: this page can be accessed in user mode, A: this page has been accessed, D: this page has been written on (dirty)
The biggest thing in it is a page frame number (pfn), which specifies the page frame for this page, if the valid bit is on. If the V bit is off, the pfn is meaningless.
Example: PTE is 0x00504007. Here the pfn is 0x504, and this page is valid, user-accessible, and writable, but not yet accessed or written on.
Here is a simple page table example, an x86 page table somewhere in memory:
0x00002005 (PTE0)
0x00008005 (PTE1)
0x00006005 (PTE2)
…
0x00000000 (PTE6)
… (1K 32-bit entries in all for one PT, so a PT fits exactly on one page)
This boils down to a page number mapping:
page 0 -> pfn 2 i.e., VAs 0 – 0xfff map to PAs 0x2000-2fff
page 1 -> pfn 8 i.e., VAs 0x1000 – 1fff map to PAs 0x8000-8fff
page 2 -> pfn 6
...
page 6 -> nothing (V bit = 0) VAs 0x6000-6fff are not (yet) supplied with real memory
…
page 1023 -> … VAs 0x3ff000 – 0x3fffff map to …
For example, since page 1 -> pfn 8, VA 0x1234 maps to PA 0x8234. You see that the last three hex digits represent the 12 bits of the page offset, here 0x234. The 1 to the left of 234 in the VA is the page number and the 8 to the left of the 234 is the pfn.
The MMU of the CPU does this mapping during instruction execution with no help from the OS at that moment. The OS has to set up the PTEs in page tables held in the kernel data area, but once the PTs are set up, the hardware does the VA->PA mapping over and over.
x86: Multiple Page Tables are used to cover a 32-bit address space
So far we’ve been looking at one page table handling 4M of VA space, using 1024 PTE’s, each mapping one 4K page. A real system has (up to) a whole 32-bit address space, 4G of memory space. 4M vs. 4G -- that’s a factor of 1K bigger. So it takes 1K = 1024 PT’s to cover the whole 32-bit address space.
These 1024 PTs are managed via a “page directory” of 1024 PDEs, each 4 bytes, and with the same format as PTEs except that the D bit is not in use. The resulting system of a page directory at one level, page tables at the next level, and finally the pages themselves, is called a two-level paging system. The page directory location (by PA) is held in the CR3 register of the x86 CPU. See Love, pg. 267 for a diagram of a 3-level paging system.
A Process VA space is defined by its PTs
A small process image would typically have 3 PTs in use, one for the code+data in the first 4M of VA space, one at the top end of user space for the stack, and one covering the C library DLL area. That’s 3 PTs + PD, or 4 pages of memory overhead, or 16K. Not bad at all. Threads sharing an address space share this set of PTs plus PD.
(The PD is at VA 51000 in the SAPC, in case you want to look at it. Pretty boring though, since it points to only one page table, at VA 52000.)
Address space switch
(part of process switch, switch_mm, pg. 57)
Each process address space has its own PD (page directory). Just reload the CR3 CPU register and bingo, a whole new address space is mapped in. Note that several processes have their image in memory, ready to run, when timesharing is going on, but only one is current for each CPU.
Page faults
If the V bit is off when the MMU tries to translate a VA, the CPU takes a page fault exception and puts the offending address in CPU register CR2. The page fault handler of the kernel figures out whether the page is appropriate to the process, and if so, how to fill it in. Often the data has to be read off disk, so the process is blocked for a while. When the page is set up properly and the PTE filled in, the handler returns with an iret; since the saved EIP of a page fault points at the very instruction that caused the fault, that instruction is re-executed. The re-execution should go through smoothly now that the page is valid.
What about the rule of no blocking in interrupt handlers? An exception handler is executing in the process context of the process that had the problem, so it’s not the same as an interrupt handler. It can cause its process to block.
Demand Paging
Linux, like Solaris and Windows, uses demand paging. A user image for a process starts out with hardly any pages and has to “demand” them by generating page faults.
Copy On Write (Love pg. 31)
This paging trick allows for an efficient fork implementation: the kernel doesn't really copy the whole user image; instead it maps parent and child onto the same page frames, with W=0 in both sets of PTEs. Then when a protection fault occurs on a write, the kernel figures out why, copies just that one page, and redoes the faulting PTE with W=1 (the other process's PTE may be fixed up later).
Protection of the Kernel Memory
Each page has a U bit to mark if it’s usable by user-level code. Thus the user code and kernel code can sit in the same hardware address space (4GB of VA space in a 32-bit system), yet whenever the CPU is executing at user level, the kernel part, typically the highest 1GB for 32-bit Linux, upper 2GB for 32-bit Windows, is completely unavailable.