CS644
Tues, Apr. 27: Linux Kernel Synchronization, Interrupt Handling
handout: Interrupt Handling Summary
Project available, due May 4, 11. More on this next time.
Longer Term Kernel Mutex: Kernel Semaphores, and new “mutex” functionality
Spinlocks are great for guarding global memory structures, but no good for guarding resources across disk accesses or other hardware-related or otherwise longer delays.
Kernel semaphores are used for longer-term mutex: see Love, pp 143-146.
For example, console_sem guards printk, the console printing service.
The code can block while holding a kernel semaphore, and also can be preempted.
Kernel semaphores are built on spinlocks, just as Xinu semaphores are built on Xinu kernel mutex. We could take the Xinu semaphore code and turn it into an implementation:
disable(ps) --> spin_lock_irqsave(&sem_mutex, flags)
restore(ps) --> spin_unlock_irqrestore(&sem_mutex, flags)
But this is not enough, because of the calls to resched that turn into calls to schedule():
resched() --> { spin_unlock_irq(&sem_mutex); schedule(); spin_lock_irq(&sem_mutex); }
Recall that schedule() handles its own mutex needs. Since kernel mutex in Xinu in fact does not hold *across* the call to resched, we are obtaining the same critical-section guarding effect as in the Xinu code.
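The translated semaphore can be sketched at user level. Here is a minimal userspace analog (an assumption for illustration, not the kernel code): a C11 atomic_flag plays the role of sem_mutex, sched_yield() stands in for schedule(), and the ksem_* names are invented. A real kernel semaphore puts the waiter on a wait queue and sleeps instead of yield-looping.

```c
#include <stdatomic.h>
#include <sched.h>

typedef struct {
    atomic_flag mutex;   /* plays the role of sem_mutex */
    int count;
} ksem_t;

static void spin_lock(atomic_flag *f)   { while (atomic_flag_test_and_set(f)) ; }
static void spin_unlock(atomic_flag *f) { atomic_flag_clear(f); }

void ksem_init(ksem_t *s, int count) {
    atomic_flag_clear(&s->mutex);
    s->count = count;
}

void ksem_down(ksem_t *s) {
    spin_lock(&s->mutex);           /* disable(ps) -> spin_lock_irqsave */
    while (s->count <= 0) {
        /* resched(): drop the spinlock around the call to the scheduler */
        spin_unlock(&s->mutex);     /* spin_unlock_irq(&sem_mutex) */
        sched_yield();              /* schedule() */
        spin_lock(&s->mutex);       /* spin_lock_irq(&sem_mutex) */
    }
    s->count--;
    spin_unlock(&s->mutex);         /* restore(ps) -> spin_unlock_irqrestore */
}

void ksem_up(ksem_t *s) {
    spin_lock(&s->mutex);
    s->count++;
    spin_unlock(&s->mutex);
}
```

Note how the lock is dropped across the yield, exactly the critical point made above about resched().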
In Linux 2.6.10, the book’s version, kernel semaphores were used all over the place in the kernel for mutex, esp. in the file system code.
In Linux 2.6.33, kernel semaphores have been partially replaced by a new “mutex” type, which first tries to acquire the lock by spinlock-style methods, sleeps if that fails, and allows sleeping while held. It may not be used in interrupt handlers (whereas kernel semaphores can do “up” in an interrupt handler). It is significantly faster than semaphores. To set one up, use DEFINE_MUTEX(var) instead of DECLARE_MUTEX(var) for a semaphore (poor notation!).
If you are interested, see http://alinux.tv/Kernel-2.6.29/mutex-design.txt (or use the current linux kernel source link on the class web page and search for mutex-design)
Here is a quote from there, where x86 means 32-bit x86:
On x86, the locking fastpath is 2 instructions:
c0377ccb <mutex_lock>:
c0377ccb: f0 ff 08 lock decl (%eax)
c0377cce: 78 0e js c0377cde <.text.lock.mutex>
c0377cd0: c3 ret
the unlocking fastpath is equally tight:
c0377cd1 <mutex_unlock>:
c0377cd1: f0 ff 00 lock incl (%eax)
c0377cd4: 7e 0f jle c0377ce5 <.text.lock.mutex+0x7>
c0377cd6: c3 ret
By fastpath, the author means the “uncontended” case, where no other thread requests the lock for the duration. This shows that the lock is in a 32-bit word instead of a byte, and the unlock action is more complicated than the spinlock’s write of 1 to the lock address. That’s because the caller of this code needs to know for sure whether or not there are other lock-waiters, so it can wake one up. A negative number in the lock word represents the number of waiters.
Such a hybrid lock has been used for years in Solaris. It can be implemented at user-level too, by using the same fastpath, plus a semaphore syscall for waiting/waking up.
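The fastpath quoted above can be mimicked in portable C11 atomics. This is a sketch under stated assumptions: the lock word starts at 1 (unlocked), 0 means locked with no waiters, negative counts waiters; the hmutex_* names are invented here, and the real slowpath (recording a waiter, sleeping, waking one up) is omitted.

```c
#include <stdatomic.h>

typedef struct { atomic_int count; } hmutex_t;  /* 1 = unlocked */

void hmutex_init(hmutex_t *m) { atomic_init(&m->count, 1); }

int hmutex_trylock_fastpath(hmutex_t *m) {
    /* lock decl (%eax): if the result is negative (js), the lock was
     * already held and we fall into the slowpath. A failed attempt
     * leaves the count decremented, just as the asm does. */
    if (atomic_fetch_sub(&m->count, 1) == 1)
        return 1;               /* uncontended: we own the lock */
    return 0;                   /* contended: slowpath would sleep */
}

void hmutex_unlock_fastpath(hmutex_t *m) {
    /* lock incl (%eax): if the new value is <= 0 (jle), there are
     * waiters and the slowpath must wake one. */
    if (atomic_fetch_add(&m->count, 1) + 1 <= 0) {
        /* slowpath: wake a waiter (omitted in this sketch) */
    }
}
```

The point of the negative-count encoding is visible here: the single atomic operation both takes the lock and tells the unlocker whether anyone needs waking.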
In Linux 2.6.*, there is also the BKL (Big Kernel Lock) for old code. See Love, pg. 149.
Linux has many interrupt handlers, most of which are parts of “device drivers” for the hundreds of devices supported by Linux. The tick handler is one which is not part of a device driver.
A device driver is all the software associated with a certain device. It has code that executes with open, close, read and write to the device, and also an interrupt handler (that can handle both input and output), typically. This can all be in one .c file such as e1000.c for the Ethernet driver for my laptop. Some device drivers are “dynamically loaded” into the running kernel.
The core of the kernel can’t know all the devices, so it sets up a registration system for interrupt handlers. See Love, pg. 77-80. The registration function, request_irq(), hands over a pointer to the interrupt handler function for a particular device. That way, at interrupt time, the core kernel can call into the right device driver to handle the interrupt.
The registration function also accepts a pointer to data (usually a struct) for this device. This pointer is passed as an argument at interrupt time, so that the driver can find the particular data for this device. A driver can handle multiple devices of the same type, for example, two Ethernet interfaces or two CDROMs.
Note the discussion of shared IRQ lines. This happens when multiple devices are connected to a single pin of the PIC. Then each must have a status register so that the interrupt handler code can figure out if a device is actually ready and do the appropriate action.
The flag SA_INTERRUPT has been renamed IRQF_DISABLED, not a huge improvement, but perhaps less misleading. It means that the irq handler being registered will be run with IF=0 on the local processor. Only a few drivers involved in timekeeping use this flag, so most i/o device interrupts run with IF=1 on the local processor, and this allows recursive interrupts.
The interrupt handler itself is a function with prototype shown on pg. 80 (also seen as the type of the most important argument of request_irq() on pg. 78.) The last arg has been dropped by 2.6.33, so an interrupt handler now has only 2 args. Recall that interrupts do not themselves carry data to the CPU. The only clue at interrupt time is the IRQ number that determines which IDT slot is used. That is one argument. The other is the registered device data pointer.
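The registration-and-dispatch idea can be modeled in ordinary C. This is a toy userspace model, not the real kernel API: the names request_irq and do_IRQ are borrowed from the text, but the signatures, return values, and table are simplified inventions (real handlers return irqreturn_t, and shared IRQ lines are not modeled here).

```c
#include <stddef.h>

#define NR_IRQS 16

typedef int (*irq_handler_t)(int irq, void *dev);

static struct {
    irq_handler_t handler;
    void *dev;              /* per-device data, passed back at interrupt time */
} irq_table[NR_IRQS];

int request_irq(int irq, irq_handler_t fn, void *dev) {
    if (irq < 0 || irq >= NR_IRQS || irq_table[irq].handler)
        return -1;          /* out of range, or line already taken (non-shared) */
    irq_table[irq].handler = fn;
    irq_table[irq].dev = dev;
    return 0;
}

/* Core kernel side: the only clues at interrupt time are the IRQ number
 * and the pointer registered for it. */
int do_IRQ(int irq) {
    if (irq_table[irq].handler)
        return irq_table[irq].handler(irq, irq_table[irq].dev);
    return -1;              /* spurious interrupt: nobody registered */
}

/* Example device-driver side: per-device data plus a handler. */
struct my_dev { int irq_count; };

int my_handler(int irq, void *dev) {
    struct my_dev *d = dev; /* recover this device's data from the pointer */
    d->irq_count++;
    return irq;
}
```

This shows why the dev pointer matters: a driver handling two identical devices registers the same handler twice with two different struct pointers, and the dispatch hands back the right one.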
Look at Love, pg. 83 for an example interrupt handler, in drivers/char/rtc.c in the Linux source tree. On x86, the CMOS_READ macro expands to
outb_p((addr), RTC_PORT(0)); inb_p(RTC_PORT(1));
which expands further to the RTC i/o port numbers 0x70 and 0x71:
outb_p((addr), 0x70); inb_p(0x71);
So we see this is accessing the hardware directly. It is using spinlocks for mutex, guarding the rtc_irq_data kernel global variable, which can also be accessed from syscall code in rtc_read(). A second spinlock guards the “callback” setting. I think this is for asynchronous i/o.
Interrupt Context: pg. 84-85
When an interrupt handler is executing, the kernel is in interrupt context on that CPU, instead of the basic process context. Some process was executing at the time of the first of the current set of (possibly recursive) interrupts. Its task struct is still “current”, because an interrupt doesn’t have its own task struct. Only the tick handler does anything with this task struct, to count down the timeslice; the i/o device driver interrupt handlers ignore it. The need_resched flag and preempt_count (more below) are in the process descriptor too, and are consulted in the preemption decision.
Here is some discussion of interrupt stack, which we already covered.
Implementation of Interrupt Handling.
Look at Figure 6.1
“processor interrupts the kernel” --alternatively, could interrupt the user execution
Shows the basic kernel functions:
starts in entry.S, the assembly language syscall/interrupt handler envelope, where it collects some info and does “call do_IRQ”.
do_IRQ(): in core kernel, checks registered interrupt handlers, also sends EOI to PIC so device driver interrupt handlers don’t have to worry about this. Different processors have different PICs. Also masks off this IRQ# in the PIC, so device drivers don’t have to worry about handling recursive interrupts for their own device. After call to handle_IRQ_event() returns, restores PIC mask state.
Fig. 6.1 makes it look like execution doesn’t return to do_IRQ() before getting to ret_from_intr(), back in entry.S where the interrupt was first handled. But it does, so reroute the arrows coming in to ret_from_intr() to go back to do_IRQ, and then put an arrow from do_IRQ to ret_from_intr, or even more accurately, show it returning to its caller in entry.S, and in there, jumping to ret_from_intr.
Handout x86 Linux interrupt handling shows this another way, with additional info on IF and preempt_count, the counter that counts three things: the number of spinlocks held by the current thread, the number of softirqs on this CPU (special continuations of interrupt handling work, can be 0 or 1), and the number of hardirqs (the real interrupt handlers in execution, so the recursion level of this interrupt).
The preempt_count is crucial to both scheduling softirqs and determining whether it is OK to preempt this process when it is about to return from the interrupt. See the details on the handout.
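The three counts can share one word because each gets its own bit field. Here is a sketch of the packing, with field widths as in 2.6-era kernels (an assumption for illustration; check the handout or the kernel's hardirq.h headers for the exact layout):

```c
/* Field widths within preempt_count (illustrative 2.6-era values). */
#define PREEMPT_BITS   8   /* spinlocks held / explicit preempt_disable() */
#define SOFTIRQ_BITS   8   /* softirq nesting (0 or 1 in practice) */
#define HARDIRQ_BITS  10   /* hardirq nesting: recursion level of interrupts */

#define PREEMPT_SHIFT  0
#define SOFTIRQ_SHIFT  (PREEMPT_SHIFT + PREEMPT_BITS)   /* 8 */
#define HARDIRQ_SHIFT  (SOFTIRQ_SHIFT + SOFTIRQ_BITS)   /* 16 */

#define PREEMPT_MASK  (((1 << PREEMPT_BITS) - 1) << PREEMPT_SHIFT)
#define SOFTIRQ_MASK  (((1 << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK  (((1 << HARDIRQ_BITS) - 1) << HARDIRQ_SHIFT)

/* Preemption on return from interrupt is OK only if the whole word is 0:
 * no spinlocks held, no softirq running, no nested hardirq remaining. */
int preempt_ok(unsigned int preempt_count) {
    return preempt_count == 0;
}
```

The payoff of the packing is that the preemption decision is a single compare of the whole word against zero, rather than three separate checks.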
Interrupt Control, pp. 88-92
You can skip these details. Most kernel code just uses spinlocks (plain or irqsave versions) or kernel mutexes or sometimes kernel semaphores, which in turn use these detailed calls as needed.
Chap. 7 Bottom Halves: just the idea
Here is where softirqs are explained. They provide a way of offloading interrupt handler work so the interrupt handler can return quickly. So their execution context is intermediate between interrupt context and process context, though closer to interrupt context. Like interrupt handlers, they may not sleep. Each CPU can be doing at most one softirq at each moment in time. Writing new softirq tasks is tricky, so the softirq implementation is wrapped into a tasklet, which is still pretty tricky. And there are also work queues, which run in kernel threads.
If you need to write your own interrupt handler, you are not required to use a bottom half. It’s really for device drivers that are going to be part of Linux and are expected to have the best possible performance. Today’s processors are so fast that it may easily be fine to put all the interrupt handling in the basic interrupt handler.
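The bottom-half idea itself is simple, and can be modeled at user level. This is a toy model, not the kernel's softirq/tasklet API: the (simulated) interrupt handler does the minimum and queues the rest, and the queued work runs later, outside interrupt context. The bh_* names and queue length are invented for the sketch.

```c
#include <stddef.h>

#define BH_QUEUE_LEN 8

typedef void (*bh_fn)(void *data);

static struct { bh_fn fn; void *data; } bh_queue[BH_QUEUE_LEN];
static int bh_head, bh_tail;

/* "Top half": called from the interrupt handler; just records the work. */
int bh_schedule(bh_fn fn, void *data) {
    int next = (bh_tail + 1) % BH_QUEUE_LEN;
    if (next == bh_head)
        return -1;                    /* queue full: drop or handle inline */
    bh_queue[bh_tail].fn = fn;
    bh_queue[bh_tail].data = data;
    bh_tail = next;
    return 0;
}

/* "Bottom half": run later, with interrupts enabled, e.g. on the way
 * out of interrupt handling or from a kernel thread. */
void bh_run_pending(void) {
    while (bh_head != bh_tail) {
        bh_queue[bh_head].fn(bh_queue[bh_head].data);
        bh_head = (bh_head + 1) % BH_QUEUE_LEN;
    }
}

/* Example deferred work item. */
static int bytes_processed;
static void process_buffer(void *data) { bytes_processed += *(int *)data; }
```

The design choice being illustrated: the interrupt handler's cost shrinks to one enqueue, and the expensive processing moves to a context where sleeping and preemption rules are looser.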
Chap. 8 Kernel Synchronization: read this and these notes...
Idea of a race condition: a synchronization bug where two threads execute simultaneously in what should be a critical section, racing to access shared variables and possibly corrupting them.
pg. 124: “almost all processors implement an atomic test-and-set instruction” Note that the interlocked decrement can be considered a “test and set” instruction, because it atomically decrements the count and tests whether the result is negative.
pg. 125: “synchronizing with user space” Reading or writing to user memory can cause “page faults”, which require a sleep to handle. We haven’t covered this yet, so don’t worry about it.
pg. 125: “It is a bug if code in the kernel sleeps in the middle of a critical section”. This is assuming a critical section guarded by spinlocks. It’s OK to sleep with kernel semaphores or mutexes.
Deadlocks: the bane of multiple locks. These were discussed in Chap. 4, pp. 45-46, on the subject of runqueues, where runqueue address order determines the lock order to avoid deadlocks.
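The address-order rule can be written down directly. Here is a userspace sketch using pthread mutexes (the kernel's runqueue code follows the same pattern; this version is an analog, not the kernel code): always take the lower-addressed lock first, so two CPUs locking the same pair can never each hold one and wait for the other.

```c
#include <pthread.h>

/* Lock two locks in a globally consistent order: by address. */
void double_lock(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a == b) {                 /* same lock: take it once */
        pthread_mutex_lock(a);
    } else if (a < b) {           /* address order fixes the global order */
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}

void double_unlock(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a);      /* unlock order doesn't matter for deadlock */
    if (a != b)
        pthread_mutex_unlock(b);
}
```

Whichever order the caller passes the two locks, both callers acquire them in the same order, which breaks the circular-wait condition for deadlock.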
The 2.6.33 kernel contains a lock dependency service (lockdep) that tracks locks (spinlocks and mutexes) looking for deadlock possibilities. Very nice idea. Database servers also track lock dependencies and abort one transaction if they find a deadlock. But Linux tries to maintain a deadlock-free system by reporting possible deadlocks as bugs.
Next: back to Xinu, to look at device drivers. Read Comer, Chap 11, 12.