CS644 Thurs, Apr. 22 Linux Core Concepts, cont.

Looking at Love, Chap. 4, from pg. 48

schedule():  This is almost exactly like resched(), except that it takes care of its own mutex needs, unlike resched(), which requires the caller to have already obtained the kernel mutex (IF=0).

Like resched(), schedule() can be called with the current process in either a blocked state or a runnable state; it then chooses the best thread/process to run next.

Skip section “Calculating Priority and Timeslice”. This may be out of date.

Sleeping and Waking Up.   Note that “sleeping” and “blocking” are used interchangeably in Linux/UNIX terminology. Here we encounter “wait queues”, which are like Xinu semaphore queues but more general. There is one wait queue for each source of wait, e.g., one for each semaphore, one for each terminal line’s input, and another for its output.

Blocking/Going to sleep

Recall the simple Xinu pattern:

1.      make sure IF=0

2.      set current pstate to wait state

3.      if waiting on a semaphore, add the current process to the semaphore’s queue

4.      resched()

But Linux schedule() doesn’t need the mutex already held, and Linux can’t use IF=0 for kernel mutex anyway (since it runs on multiprocessors), so a sequence has been developed that allows the calling code to hold no mutex at the call to schedule(): see pg. 53.

Remember our discussion of user-level mutex around a blocking system call, where we released the mutex before the call. Similarly, the idea here is to hold no kernel mutex across schedule(), to avoid the same kind of problem we saw there.
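To have the code in front of us, here is a sketch of roughly the sequence Love gives on pg. 53 (the wait queue head wq and the test “condition” are placeholders, and the book’s version also deals with pending signals, which is omitted here):

DECLARE_WAIT_QUEUE_HEAD(wq);            /* the wait queue for this source of wait */
DECLARE_WAITQUEUE(wait, current);       /* an entry naming the current process */

add_wait_queue(&wq, &wait);             /* join the wait queue */
set_current_state(TASK_INTERRUPTIBLE);  /* mark ourselves blocked before testing */
while (!condition) {                    /* e.g., “has input data arrived yet?” */
        schedule();                     /* give up the CPU; no mutex held here */
        set_current_state(TASK_INTERRUPTIBLE);
}
set_current_state(TASK_RUNNING);        /* condition is true: runnable again */
remove_wait_queue(&wq, &wait);          /* leave the wait queue */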

It is important to realize that kernel mutex is held inside the various calls we see in this code:

·         inside add_wait_queue() and remove_wait_queue() to keep the wait queue healthy

·         inside “condition”, which stands for code that checks whether, for example, data has shown up for an input device, and if so, updates the status so that the next call to “condition” returns the right answer; that way only one caller gets the good news.

·         inside set_current_state(), although this may not be needed if it’s a single assignment.

·         schedule() itself

The lack of mutex across these various calls means that a process can be seen by other activities to be on a wait queue while runnable (by its state); put another way, this state is accepted as a “consistent kernel state” in Linux. By loosening the model of kernel state, more concurrency can be attained, at the cost of greater complexity. Similarly, a process can be observed as blocked by its process state but still running.

To understand why there needs to be a loop around the call to schedule() in this code, we need to look at how processes are unblocked by wake_up().

wake_up(): broadcasts a wakeup to all tasks waiting on a certain wait queue. If two processes are on a wait queue, both are unblocked by wake_up(). Then both run through the sleep loop and check the condition. Only one will see the good news from the condition and return to its caller. The other finds the condition still false, so it loops around and calls schedule() again.
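The waker’s side then looks roughly like this (a sketch; data_ready is a made-up flag standing in for whatever makes “condition” true):

/* in the interrupt handler or syscall code that produces the event: */
data_ready = 1;           /* make “condition” true, under its own kernel mutex */
wake_up(&wq);             /* broadcast: every task waiting on wq is made runnable */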

No mutex held at call to schedule(): it’s a rule.

Now we’ve seen how schedule() can do its work without pre-existing mutex, with careful programming. In fact, the rule is that no kernel mutexes (spinlocks) are allowed to be held at the call to schedule(). A debug version of schedule() checks for this. More on this in a moment.

Figure 4.3: needs the fix noted in the errata. Also, we see there are additional minor states in the state transitions.

Preemption and Context Switching (pg. 57)

Just as in Xinu, where resched() calls ctxsw() to do the actual CPU context switch, Linux has schedule() call context_switch() to do the actual CPU switch and the memory management switch as well. The saved processor registers are in thread_struct, part of the task_struct.

Where is schedule/resched called from?

·         System call code that needs to block: this process can’t run now

·         System call kill(currpid)/exit: this process can no longer run ever

·         System call code that has unblocked something (like signal in Xinu, sem_post in Linux)

·         Interrupt handlers, esp. the timer interrupt: main case of preemption (user preemption or kernel preemption)

·         When a syscall is about to return to the user: another case of preemption (user preemption) (not in Xinu)

·         When a reschedule has been delayed by kernel spinlocks: the call is made when the last unlock happens (so in the unlock code) (kernel preemption) (not in Xinu)

 

Since schedule() can’t be called under kernel mutex in Linux, there are times when the kernel just remembers that it wants to call it, using the per-process “need_resched” flag. Then later, when the mutex has been released (at the end of a syscall or interrupt handler, by the spinlock unlock code), schedule() is called.

 

Kernel preemption

Allowed as much as possible, but not if locks (spinlocks) are held, so they are counted in the per-thread preempt_count. If preempt_count > 0, no call to schedule() is allowed, so no preemption. The need_resched flag is used to remember the need to reschedule, and when preempt_count reaches 0 via an unlock operation, the call to schedule() is made.
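A minimal user-space sketch of this bookkeeping (not actual kernel code; the names and the single-flag treatment of need_resched are simplifications):

#include <stdbool.h>

static int  preempt_count;   /* how many spinlocks this thread currently holds */
static bool need_resched;    /* a reschedule was wanted but not yet allowed */

static void schedule(void) { /* stand-in for the real scheduler */ }

void sketch_spin_lock(void)
{
        preempt_count++;     /* preemption now disabled for this thread */
        /* ... acquire the actual lock byte here ... */
}

void sketch_spin_unlock(void)
{
        /* ... release the actual lock byte here ... */
        if (--preempt_count == 0 && need_resched) {
                need_resched = false;
                schedule();  /* safe now: no spinlocks held */
        }
}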

 

pg. 59: “Kernel preemption can also occur explicitly” is confusing. It’s best not to use the word “preemption” for explicit calls to schedule() in system call code. These are considered “voluntary” because they are coded into the kernel actions.  Preemption occurs when a process switch happens that is not shown explicitly in kernel code for an individual syscall, because it’s either in an interrupt or in the general end-of-syscall code. See bulleted list above.

Spinlocks (for Linux and also database servers)

Love, pp. 137-140 (stop just after Table 9.3)

Spinlocks are used for short-term mutex in multiprocessor kernels and in multithreaded servers running at user level, such as database servers. Spinlocks are based on special instructions that are not privileged, so user-level programs such as database servers can use them.

By short-term we mean usually well under a millisecond, so no disk access can be allowed while the mutex is held. That is plenty of time to access shared data in memory, which is the main reason to use kernel mutex.

As discussed in the “What Do I Lock?” note on pg. 138 of Love, it is important to identify the shared data that a spinlock is guarding. We saw a first example with the runqueues. Another example is in the implementation of kernel semaphores, the mechanism for long-term mutex in the kernel.

Recall that in Xinu, there is only one kernel mutex, guarding the entire set of kernel global variables. In Linux, there are many spinlocks, each guarding a portion of the set of kernel global variables.

Terminology: a stretch of code that needs mutex because it is accessing certain kernel global variables is called a critical region. Typically there are several critical regions for a certain mutex. The mutex itself is associated with the variables it is guarding, not with code directly. 

See Love, pg. 138 for a code sample (text uses Linux 2.6.10):

spinlock_t mr_lock = SPIN_LOCK_UNLOCKED;  // the byte that holds the spinlock status

spin_lock(&mr_lock);

...  (critical region: at most one thread is executing in here or in any other critical region for this mutex)

spin_unlock(&mr_lock);


Note: the initialization of the spinlock by the first line above is now (Linux 2.6.33) deprecated. You should now use DEFINE_SPINLOCK(x). If you want to see all the kernel spinlocks, search the source for DEFINE_SPINLOCK using the link at the end of the class web page. In the newer way, the first line would be DEFINE_SPINLOCK(mr_lock), or more commonly “static DEFINE_SPINLOCK(mr_lock);” outside any function, that is, defining mr_lock as an external static variable. 

Spinlocks are almost as easy to use as disable(ps) and restore(ps), though we do need to define the byte of memory, usually in an external variable that is part of a struct or group of variables that also contains the data that needs to be guarded.
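For example, the usual packaging looks something like this (a sketch with made-up names; spin_lock_init() does the run-time initialization):

struct msg_queue {
        spinlock_t lock;     /* guards count and buf below */
        int        count;
        char       buf[128];
};

static struct msg_queue mq;

void msg_queue_setup(void)
{
        spin_lock_init(&mq.lock);   /* initialize the lock before first use */
        mq.count = 0;
}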

For Linux, we are considering systems with multiple CPUs that all share a single system bus. The memory is also on this one bus, and shared by all the CPUs. This is the current normal “SMP” (symmetric multiprocessing) architecture for an x86 (32 or 64 bit) server system. Of course Linux also runs on uniprocessors, but that’s an easier case, so we won’t worry about it here.

Spinlocks are based on special CPU instructions that lock the bus. For x86, these are called “interlocked” instructions; for example, the interlocked decrement, decb A with the lock prefix, locks the bus, reads the byte at the indicated address A, decrements it, and writes it back before unlocking the bus. None of the other processors can read or write the byte (or do anything else with the bus) while this is going on. This can’t be done too often without causing performance problems!

A spinlock is based on a byte at some kernel address A, on any platform. For x86, the byte is 1 if the spinlock is available (unlocked) and 0 or negative if the spinlock is locked.

The following holds for Linux 2.6.10, the book’s version, on 32-bit x86. By Linux 2.6.33, the exact implementation has changed, but it still depends (as it must) on interlocked instructions. Also, in 2.6.33, a specialized “mutex” type is available that is a spinlock, but may not be used in interrupt handlers. This way it can be faster.

The thread that wants a spinlock based at address A (in kernel memory) interlock-decrements A, and if this makes the byte equal to 0, it has the spinlock and continues. If the decrement shows that the byte value has gone negative, it has to try again, but it doesn’t retry until reading A shows the lock is free (reading A is naturally atomic). So it spins reading A while some other thread is using the lock. Finally the other thread releases the lock by writing 1 into A (this is also atomic). The spinning thread will see this, and if there are no other spinners, will be able to get the lock. With multiple spinners, there may be multiple interlocked decb’s, but only one will win by seeing the byte go from 1 to 0.
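Here is a user-space C model of that algorithm (not the kernel’s code; the GCC builtin __sync_sub_and_fetch stands in for the interlocked decrement, and plain volatile reads stand in for the atomic reads of the byte):

#include <stdint.h>

typedef struct { volatile int8_t slock; } byte_spinlock_t;   /* 1 = free */

void byte_spin_lock(byte_spinlock_t *lock)
{
        for (;;) {
                /* interlocked decrement, like “lock decb A” on x86 */
                if (__sync_sub_and_fetch(&lock->slock, 1) == 0)
                        return;              /* byte went 1 -> 0: we own the lock */
                /* went negative: spin with plain reads until it looks free */
                while (lock->slock <= 0)
                        ;                    /* no bus locking while spinning */
        }
}

void byte_spin_unlock(byte_spinlock_t *lock)
{
        lock->slock = 1;                     /* a single byte store releases it */
}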

We see that CPU time is burned by threads spinning while the critical code is being executed by another thread, and the occasional bus locking is hard on performance too, but this is needed for proper mutex.

Clearly the system wants the mutex holder to run as fast as possible through the critical region, so it can release the lock. To this end, preemption is disabled for spinlock holders. But the “need_resched” flag remembers the request, and preemption happens when the thread’s last spinlock is unlocked. With preemption disabled, no calls to schedule() happen unless they are explicitly coded in syscall code. It is up to the programmer of that code to avoid calling schedule() while a spinlock is held.

Note that the processes spinning, waiting to get a spinlock, are preemptible unless the caller of spin_lock was already running without preemption.

Spinlocks can be used in interrupt handlers, because spinning is not “sleeping”, i.e., blocked. It doesn’t require a process switch. Instead, the process is running but not making progress.

However, spinlocks that have at least one of their critical sections in an interrupt handler need to turn off interrupts on the local CPU (in all critical sections) to avoid a deadlock: see Love pg. 138.

An example of this is the kernel semaphore service: up can be called from interrupt handlers (but not down, since it blocks), so both its up and down need to use the IF=0 spinlock calls.
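In code, the interrupt-disabling variants look like this (same mr_lock example as before; flags saves and restores the caller’s interrupt state):

unsigned long flags;

spin_lock_irqsave(&mr_lock, flags);      /* IF=0 on this CPU, then take the lock */
/* ... critical region also reachable from an interrupt handler ... */
spin_unlock_irqrestore(&mr_lock, flags); /* release the lock, restore interrupt state */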

Next time: Kernel Semaphores (Love pp 143-146), Interrupts in Linux (Love, Chap. 6)