CS444 Class 18

Handout: debuglog for hw3

The only tricky coding in sched.c is schedule(), because it calls asmswtch, a completely weird function.  The code is set up like this:

  1. Decision code: ends up setting a new curproc while still knowing oldproc, both pointers to proctab entries. The current code in schedule_one only chooses between the process i passed in as an argument and process 0; you need to fix it.
  2. asmswtch(oldproc, curproc)
  3. return

You don’t have to do it this way, but be aware that asmswtch does return after the given process is rescheduled, and then the process should return from schedule and go on with its business (unless the process is exiting).  Don’t let it fall into inappropriate code because you think asmswtch never returns!
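For concreteness, here is one possible shape for schedule(), just a sketch: proctab, curproc, and asmswtch are the hw3 names, but NPROC, RUN, and the p_status field are stand-ins for whatever your headers actually call them, and the round-robin policy shown is only one reasonable choice.

/* Sketch only -- adapt names (NPROC, RUN, p_status) to your hw3 headers. */
void schedule(void)
{
  struct proc *oldproc = curproc;
  int start = curproc - proctab;   /* index of the current process */
  int i;

  /* 1. Decision code: pick a new curproc.  Here: the next RUN process in a
        round-robin scan.  (Assumes some process, e.g. process 0, is always
        RUN, so the scan finds something.) */
  for (i = 1; i <= NPROC; i++) {
    if (proctab[(start + i) % NPROC].p_status == RUN) {
      curproc = &proctab[(start + i) % NPROC];
      break;
    }
  }

  /* 2. Context switch.  asmswtch returns only after this process is rescheduled. */
  if (curproc != oldproc)
    asmswtch(oldproc, curproc);

  /* 3. Return and let this process go on with its business. */
}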

The calls into the scheduler are straightforward to code except for the tsleep calls from ttywrite, etc.  There we had a busy loop of attempted enqueues into the output queue, and we need to replace it by blocking the process.  Ttywrite is already coded, but you need to do ttyread and the new sleep system call.

Might try:

if (enqueue(...) == FULLQ)
       tsleep(OUTPUT);
/* can we enqueue now?  Not for sure! */

Suppose two processes, A and B, were unblocked by wakeup.  If A gets to run first, it will use up the space in the output queue, but B will still be marked RUN, and will be chosen to run a little while later.  It will return from tsleep, but there will be no space in the output queue for it to use.  It needs to call tsleep again.  So we really need something like:

while (enqueue(...) == FULLQ)
       tsleep(OUTPUT);

But what about mutex?  We know from hw2 that enqueue needs mutex (cli()) to protect the queue data structure against concurrent modification by the dequeue in the interrupt handler, which can run in the middle of an enqueue unless we turn off interrupts.

Is that enough?  Suppose we write guarded_enqueue(), which is just enqueue’s body surrounded by cli and sti (or a set_eflags back to the original value), and similarly guarded_sleep().  Then we would have:

while (guarded_enqueue(...) == FULLQ)
       guarded_sleep(OUTPUT);
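
As a sketch of what guarded_enqueue might look like (the queue type and parameters here are assumptions; use your hw2 enqueue’s actual signature):

/* Sketch only: wrap the enqueue call in the cli/sti mutex. */
int guarded_enqueue(struct Queue *q, char ch)
{
  int rv;

  cli();                 /* keep the interrupt handler's dequeue out */
  rv = enqueue(q, ch);
  sti();                 /* or restore the saved EFLAGS, if IF might already be 0 */
  return rv;
}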

But there’s a sneaky problem here called the lost wakeup problem.  It was discussed in Tan, pg. 127 in the context of user-level sleep and wakeup calls.  Luckily we are working inside the kernel here, so we are allowed to turn off interrupts as needed.

The problem is this.  Suppose that just after guarded_enqueue returns with FULLQ, an interrupt happens that causes a dequeue and a call to wakeup.  But this process is not yet blocked, so it is not unblocked.  Instead, the wakeup does nothing at all to proctab[] (that’s OK in itself, maybe there’s no more output.)  Then this code resumes and calls guarded_sleep, and this process may sleep forever (if it’s the last non-zombie process, for example.)

The solution is to put the cli()—sti() mutex region across more code:

cli();
while (enqueue(...) == FULLQ)
       tsleep(OUTPUT);
sti();

It’s a little scary to call tsleep with interrupts off, but in fact they won’t be off long: tsleep immediately calls schedule, which finds another process to run, and asmswtch swaps in that process’s saved EFLAGS, IF bit included.  The other process resumes execution, typically with IF=0 inside asmswtch, but it soon sets IF back to 1 as it returns toward user code.  So the mutex (IF=0) only holds for execution in this process.  The important thing for us here is that an interrupt can no longer occur in that bad spot described above.  So even though there is a “mutex hole” at the call to tsleep, we have what we need here.
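
Ttyread needs the same treatment on the input side.  A hedged sketch of its per-character loop, assuming an EMPTYQ return value and an INPUT waitcode analogous to FULLQ and OUTPUT, and an input queue reachable as shown (your names will differ):

/* Sketch only: block until the input interrupt handler has queued a character. */
cli();
while ((ch = dequeue(&tty->read_queue)) == EMPTYQ)
    tsleep(INPUT);       /* interrupts come back on once another process runs */
sti();
/* ch now holds one input character; copy it to the user buffer, etc. */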

New system call for hw3: sleep(int ms)

 

A new system call, so give it a number. Also a new waitcode. You can use tsleep to block a process, but it isn’t reasonable to use the broadcast twakeup here, so just set the process status when the time is up, or implement wakeup_one(int pid). Note that you don’t want to call schedule here, because we are using a non-preemptive scheduler, which means that processes run until they block or exit. You need the pid argument in wakeup_one() because this will be called from the tick handler, which has figured out the pid that needs waking up when its sleep is over.
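
A possible shape for this, as a sketch only: the field names (p_status, p_waketime), the SLEEPING waitcode, and the assumption that ticks are 1 ms apart are all placeholders for whatever your hw3 code actually uses.

/* Sketch: the sleep(ms) system call handler records a wakeup time and blocks. */
void sleep_syscall(int ms)
{
  curproc->p_waketime = tick_count + ms;   /* assumes a 1-ms tick counter */
  tsleep(SLEEPING);                        /* blocks; schedule() runs someone else */
}

/* Sketch: called from the tick handler when it finds a process whose time is up. */
void wakeup_one(int pid)
{
  proctab[pid].p_status = RUN;   /* just mark it runnable -- no schedule() call,
                                    since our scheduler is non-preemptive */
}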

 

 

Memory Management

Read Chap. 3 to pg. 194, esp. 189-194, then 3.3.3, TLBs, 3.3.4 multilevel page tables,

3.4 on page replacement algorithms, skipping 3.4.8,

3.6 through pg. 232, Linux pp. 758-762

 

Primitive Memory Management, and how to do better

Examples: SAPC, MSDOS

 

Both the user code and OS code sit in the same memory space, and each can see the other.  The data is also in this space and mutually accessible, and must be writable (the code might be in ROM, and thus protected from write.)  The user code can accidentally or on purpose corrupt the OS data.  Thus the system is seriously insecure.

 

What do we need in order to get this user-OS separation?  We need help from the hardware, and software that uses that hardware properly.

 

More specifically, we need help from memory management hardware (the MMU) to restrict memory access based on user vs kernel execution.  That means we need to have the CPU keep track of whether it’s executing user or kernel code, and make sure the transition between the two is very carefully handled.

 

User vs. Kernel Mode Execution

All CPUs used for serious OS’s have at least two modes of execution, for user and kernel.  In fact, the x86 has 4 “rings” for this, but only two of them are in use by UNIX or Win2K.  Thus we will stick to the simple picture of two levels, user and kernel.  In the x86, the current execution mode (ring number) is held in the low two bits of the CS (code segment) register of the CPU.

 

User-Kernel Transitions

We have already studied the mechanism of system calls, and saw how careful they are to transfer only a minimal amount of information from the user execution environment to the kernel, just a syscall number and the arguments.  The whole syscall execution looks like the execution of a single instruction in the user code, the syscall trap instruction.

 

Interrupts during user execution also make the user-kernel mode transition.  We have seen that they utilize the same basic CPU trap cycle as the system calls, with the additional trick of turning off interrupts, and some way of getting a device identifier.  They can’t even be detected by user code unless it does high-precision timekeeping.

 

 

The MMU (part of the CPU) Picture, pg. 189

The MMU needs to know which addresses are user, which are kernel, as well as the current CPU mode, user or kernel.  Then when it is handling a certain address, it can detect a user-mode access to kernel address, and make it fail.  We will see how this works.

 

The MMU is another helper to the CPU proper, but it is not like the other helper, the PIC, which we called the “secretary to the CPU,” because it remembered things and organized work for the CPU.  The MMU just takes orders, more like a telephone for the boss: type in a number and the phone places a particular call.  Similarly, the CPU asks the MMU for the data at a certain address.

 

An MMU is needed to support the concept of an address space, discussed in Tanenbaum starting on pg. 179.  It is possible to support address spaces without using paging, but we will not study that case.  It is covered on pp. 180 (last paragraph)-181.

 

Def, pg. 180: an address space is the set of addresses that a process can use to address memory. Each process has its own address space.  This holds for UNIX/Linux and Windows.

 

For example, on 32-bit Linux, processes have 3GB-sized address spaces going from A = 0 to 3G-1 = 0xbfffffff. Two such processes can each use address 0x100234 and read different values, because these are two different address spaces.

 

In fact not all addresses are readable: some will cause an exception on access because no memory has been allocated there yet.

 

Swapping:  Copying program-image data between memory and special areas on disk is called swapping. In current systems, this normally happens only to idle processes (multiple minutes idle.)  If a system is running too many processes to fit the non-idle ones all in memory (at least their most important pages), it is in trouble (overloaded).  The process swap rate can be used as a danger signal.  Swapping whole programs back and forth to disk is discussed in Section 3.2.2, but we don’t need a lot of detail on this.

 

Paging (starting pg. 189): This is the main method of memory management for systems operating within their normal range (not overloaded.)  If a process can be provided with memory under its most commonly-used pages, and extra memory as needed for less-commonly used pages, it can run well and share the memory efficiently with other processes using paging.

 

Under paging, the memory is divided up into equal-sized blocks called pages, typically 4K (x86) or 8K (Sparc) in size.  The VA (virtual address) space for each process is the memory it sees and works with, i.e., its address space.  It is quietly divided into pages—the page division marks are not apparent to programmers.  If the program extends even one byte into a page, the whole page is used (in theory, but if no reference is made to this one byte, it may never be given real memory.)  Thus we don’t want pages to be too big (that last partly-used page wastes memory), or too small either (each page needs to be kept track of, an overhead).
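
As a tiny illustration of that rounding (plain arithmetic, nothing hw3-specific):

/* How many 4K pages does an n-byte region claim?  Round up to whole pages. */
#define PAGE_SIZE 4096
unsigned int pages_needed(unsigned int n)
{
  return (n + PAGE_SIZE - 1) / PAGE_SIZE;   /* e.g. 10000 bytes -> 3 pages = 12288 bytes */
}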

 

The physical memory is divided up into the same size blocks, called page frames.  All the page frames are exactly the same in terms of memory resources.  The idea of paging is that any page frame can be used as memory under any virtual page.  A page table (PT for short) tells which page frame is assigned to each page of (some part of) a VA space.

 

Parts of the CPU, pg. 189 pic

Here we see that the MMU is part of the CPU, specifically the part closer to the bus, which makes sense since the bus carries the addresses out to memory to specify which RAM chip is accessed (the physical location.)  The upper part of the CPU contains the ALU, the arithmetic-logical unit, i.e. the brains of the CPU.  This part works in VAs.  You could add the registers to the upper box: these all contain VAs if they contain addresses.

The PC (x86 EIP) is a VA, and specifies which next instruction to pull in from memory.  A read request by VA is generated in the upper CPU and passed to the MMU, which maps it to a PA (with the help of a page table), and the PA specifies the exact physical location.

 

Next: look at examples on pg. 190 and 191.

 

For next time:

 

Look at Tan’s page table example on p. 190, using 4K pages, so could be an x86 example.  Boils down to page number mapping:

page 0 -> pfn 2   VA 0K-4K (really 4K-1), PAs 8K-12K

page 1 -> pfn 1

page 2 -> pfn 6

...

page 6 -> nothing   VA 24K-28K

Examples on pg. 191

MOV REG, 8192  (movl 8192, %eax, say)

Here page 0 is 0K-4K in the VA space  (4K-1 to be exact), page 1 is 4K-8K, etc., as on pg. 190.

So  8192 = 8K is at the start of page 2.  It’s mapped to page 6 in physical memory, page frame 6, with PA= 6*4K = 24K.  So the PA access here is to PA 24K. The physical memory is accessed and the value there copied into the REG.

MOV REG, 32780  is accessing VA 32K+12 (in virtual page 8), which is not mapped here (see “X” in its spot). When the MMU tries to map this address, it fails and causes an exception called a “page fault”.  Assuming this process should be able to use this memory, the OS page fault handler gets a page of memory and adds it to the process image.

What if the address is not at the start of a page?  If it’s at 8K+20, the PA is 24K+20, and so on.  Look at this in hex:  4K = 0x1000 (how handy!)  20 = 0x14, so 8K+20 = 0x2014, and the corresponding PA is 24K+20 = 0x6014.

VA 0x2014 (virtual page 2)  -> PA 0x6014  (page frame 6)

In binary:  See lowest 12 bits the same, see page numbers in higher bits:

VA 0x2014 = 0010 0000 0001 0100 ->

PA 0x6014 = 0110 0000 0001 0100

We see that the last 12 bits are the same in both addresses. This can be called the in-page offset.  The page numbers are the parts that are mapped.  The page table holds the vpn-pfn mapping, plus other status bits per page.
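
In C, for 4K pages, splitting an address into page number and 12-bit offset is just a shift and a mask.  A sketch (here “pagetable” is a plain array of pfns indexed by vpn, standing in for the real page table):

/* Sketch: by-hand VA-to-PA translation for 4K pages. */
unsigned int translate(unsigned int va, unsigned int pagetable[])
{
  unsigned int vpn    = va >> 12;         /* page number: the upper bits     */
  unsigned int offset = va & 0xfff;       /* in-page offset: the low 12 bits */
  unsigned int pfn    = pagetable[vpn];   /* the mapping stored in the PT    */

  return (pfn << 12) | offset;            /* the PA keeps the same offset    */
}
/* With pagetable[2] == 6, translate(0x2014, pagetable) returns 0x6014, as above. */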

Page 192 Page Table (first half)

Vpn   pfn present bit

7     0  0

6     0  0

5     3  1

4     4  1

3     0  1

2     6  1

1     1  1

0     2  1

 

The vpns are just the array indexes, so the entries of the array, the actual page table, hold pfn’s and present bits.  As discussed on the next page, the present bit is the most important bit, but there are others.

Multi-level page tables—needed for efficient 4-GB address spaces. Otherwise, with 4-byte PTEs and 4KB pages, a monolithic PT would need 4GB/4KB = 1M entries, or 1M*4 bytes = 4MB, expensive in memory.

The x86 32-bit paging system, a two-level paging system

 

A nice system, well adapted to use for any 32-bit OS you might wish to write, and in use by all the versions of UNIX on x86, plus Windows, to implement a “flat address space.”  The x86 also offers segmentation, but this is not in use by these OS’s, except in a very trivial way, so we will try to ignore it.

 

Note: good memory management for 64-bit OS’s is more challenging, and this system is not ready for that.  We’ll stick to 32-bit systems.

 

Uses 4K pages, and a group of 1024 (1K) pages uses one page table (or PT.)  1K*4K = 4M, so each page table specifies VA to PA mappings for 4M of VA space.  The SAPC has 4MB of usable memory because it has only one PT set up, plus a "page directory" to point to it. (In fact our SAPC systems have more than 4MB of memory, but it's not usable because of lack of PTs.)

Since it takes 1K blocks of 4M to make 4G, we see that there are (up to) 1K PTs for a full 32-bit address space.  Like pic on pg. 199.

 


If there were only one PT, it would have to be huge, a waste of memory for tiny programs.  By using a two-level PT system, like Figure 4-12, pg. 208, the setup can be customized to the actual process configuration of memory.

 


The 1K PTs are located with the help of the page directory, the top-level page table.  Each lower-level PT has an entry in the page directory with the pfn of the PT to locate it.  These page directory entries are 4 bytes long, so the whole page directory fits in 4KB, or one page.

Similarly, each PT has one entry for each page, called a page table entry (PTE), also 4 bytes long.  The 1K entries of a PT also fit exactly in one 4KB page.

 

This system is much better than a monolithic single PT, because the page directory entries can be marked as "not present", meaning there is no PT needed for that area of virtual memory.
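
Putting the pieces together, here is a sketch in C of what the two-level lookup amounts to.  The 10-10-12 bit split is real x86; everything else (returning a PAGE_FAULT sentinel instead of raising the exception, treating physical addresses as ordinary pointers) is a simplification for illustration.

#define PAGE_FAULT 0xffffffffu   /* stand-in result; real hardware raises an exception */

/* Sketch of the x86 32-bit two-level lookup; flag bits other than "present" ignored. */
unsigned int walk(unsigned int va, unsigned int *pdir)   /* pdir = page directory, 1024 entries */
{
  unsigned int pdi    = (va >> 22) & 0x3ff;   /* top 10 bits: page directory index */
  unsigned int pti    = (va >> 12) & 0x3ff;   /* next 10 bits: page table index    */
  unsigned int offset = va & 0xfff;           /* low 12 bits: in-page offset       */
  unsigned int pde, pte, *pt;

  pde = pdir[pdi];
  if (!(pde & 1))                   /* directory entry not present: no PT for this 4M region */
    return PAGE_FAULT;
  pt = (unsigned int *)(pde & ~0xfff);        /* the PT sits at this (physical) address */

  pte = pt[pti];
  if (!(pte & 1))                   /* PTE not present: page fault */
    return PAGE_FAULT;

  return (pte & ~0xfff) | offset;   /* page frame bits from the PTE, plus the offset */
}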