CS644 Tues, Mar. 23

Questions on hw3?

Midterm next Tues., review next class

 

Read Love, Chap. 5, Syscalls

 

Handout: User-kernel Transitions

 

Look at the timelines from class 9 on the handout. Now we want to dig down into the transitions.

 

A UNIX/Linux syscall is made with a special trap instruction that causes a user->kernel CPU mode transition.

See Love, pg. 67 for general picture:

·         user app code calls C library “wrapper” for system call read

·         wrapper code does actual syscall instruction for read

·         ----user to kernel transition-----

·         kernel picks up in system_call() trap handler, calls sys_read()  (or sys_write for write, etc.)

·         ...  (can block in here)

·         kernel ends work with iret

·         ---kernel to user transition---

·         back to user space, next instruction after syscall instruction
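
The sequence above can be tried from user code. This is only a minimal sketch, assuming Linux with glibc: the generic syscall() function and the SYS_read constant from <sys/syscall.h> stand in for the hand-written wrapper code, and raw_read and wrapped_read are hypothetical names used just for illustration.

```c
#define _GNU_SOURCE        /* makes unistd.h declare syscall() */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_read: the syscall number for read */
#include <sys/types.h>     /* ssize_t */
#include <unistd.h>        /* read(), syscall() */

/* Hypothetical helper: bypasses the read() wrapper and hands the syscall
 * number and arguments to the generic syscall() entry point, which
 * executes the trap instruction.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t raw_read(int fd, void *buf, size_t count)
{
    return (ssize_t)syscall(SYS_read, fd, buf, count);
}

/* The usual path: the C library wrapper issues the same trap for us. */
ssize_t wrapped_read(int fd, void *buf, size_t count)
{
    return read(fd, buf, count);
}
```

Both functions end up in the same kernel code, so reading from the same file descriptor with either one should behave identically (including blocking inside the kernel when no data is available).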

 

 

Each instruction cycle can be classified as running “as user” or “as kernel”, except for the system call instruction itself, which involves a transition from user to kernel, and iret, used for kernel to user transitions at the end of interrupt handling (and end of UNIX/Linux syscall handling).

 

In UNIX/Linux, user mode vs. kernel mode is determined exactly by the CPU’s user vs. kernel execution mode (on x86, the current privilege level held in the low two bits of the CS register).

 

Xinu syscalls

 

Xinu syscall is just a normal function call (x86 call instruction), so its execution does not change the CPU user/kernel mode. In fact, Xinu runs in CPU kernel mode even in user code. But we still claim that we can classify each instruction cycle as user or kernel execution.

 

Xinu syscall execution

So simpler sequence:

·         user app calls read()

·         ----user to kernel transition-----

·         read() executes, returns with ordinary ret instruction (can block in here)

·         ---kernel to user transition---

·         back to app code at next instruction after call

 

Interrupts

Each interrupt causes an interrupt cycle, which causes a user->kernel CPU mode transition (if the CPU was running in user mode). Interrupt handlers are kernel code in all our OS’s. Interrupts can come between any two user or kernel instructions as long as IF=1.

 

Kinds of code:

·         user code

·         kernel code: system call code, interrupt handlers, initialization code

·         utility code like strcpy: counts as user code if called from user code, or kernel code if called from kernel

 

The third point really only applies to Xinu, because in UNIX/Linux, the kernel is separately built with its own copy of the library.

 

Stacks

The execution stack provides the working memory of any program, saying how it got where it is, and how to return back. We need to understand the management of stacks to understand execution environments.

 

from the handout:

 

Xinu use of stacks:

·         Kernel stack grows on top of user stack starting from a syscall in user code.

·         Interrupt stack grows on top of user or kernel stack.

·         This all works because a process is only making progress in one activity at a time: a system call in execution means the user code is not making progress, an interrupt means the underlying user and/or kernel code is not making progress. When the interrupt is done, the underlying activity resumes.

 

UNIX/Linux use of stacks:

·         Each thread has a kernel stack separate from its user stack

·         Interrupt stack may be separate or grow on top of kernel stack

 

This arrangement is a little neater, and means the kernel never writes on the user stack area unless explicitly requested to by system call arguments.

 

Xinu Examples using hw2 solution (thanks to Truong Tran!)

 

Case of plain user stack: worker calls lckwait, a user function:

Breakpoint 1, lckwait (lock={mutex = 86, users = 0x109010}) at lock.c:54

54        return wait(lock.mutex);

(gdb) where

#0  lckwait (lock={mutex = 86, users = 0x109010}) at lock.c:54

#1  0x100ace in worker (c=66 'B', lockp=0x3fd7dc) at test1.c:67

(gdb) c

Continuing.

 

Case of kernel stack growing on top of user stack, from syscall wait:

Breakpoint 2, enqueue (item=47, tail=223) at ../sys/queue.c:18

18              tptr = &q[tail];

(gdb) where

#0  enqueue (item=47, tail=223) at ../sys/queue.c:18

#1  0x102879 in wait (sem=86) at ../sys/wait.c:28        <--syscall “wait”

#2  0x1008f8 in lckwait (lock={mutex = 86, users = 0x109010}) at lock.c:54

#3  0x100ace in worker (c=66 'B', lockp=0x3fd7dc) at test1.c:67  <---worker process

(gdb) b putc

Breakpoint 3 at 0x102aa2: file ../sys/putc.c, line 17.

(gdb) c

Continuing.

 

Case of kernel stack growing on top of user stack, from syscall putc

Breakpoint 3, putc (descrp=0, ch=108 'l') at ../sys/putc.c:17

17              if (isbaddev    (descrp) )

(gdb) where

#0  putc (descrp=0, ch=108 'l') at ../sys/putc.c:17        <--syscall “putc”

#1  0x100be9 in _doprnt (

    fmt=0x100923 "ockkill: see lock.users = %d, nusers = %d\n", args=0x3fd7c0,

    func=0x102a94 <putc>, farg=0) at doprnt.c:44

#2  0x100b9f in printf (

    fmt=0x100922 "lockkill: see lock.users = %d, nusers = %d\n", args=0)

    at printf.c:15

#3  0x100974 in lckkill (lock={mutex = 86, users = 0x109010}, nusers=2) at lock.c:79

#4  0x100a87 in main (argc=0, argv=0x102d70) at test1.c:55   <---main process start

#5  0x101948 in startmain () at ../sys/main.c:18   <--kernel sets up user main

 

(gdb) b clkint

Breakpoint 6 at 0x100752

(gdb) c

Continuing.

 

Case of interrupt stack on top of user stack: clock interrupt during printf execution

Breakpoint 6, 0x100752 in clkint () at ../sys/initialize.c:201

201     }

(gdb) where

#0  0x100752 in clkint () at ../sys/initialize.c:201

#1  0x100be9 in _doprnt (fmt=0x100925 "kkill: see lock.users = %d, nusers = %d\n",

    args=0x3fd7c0, func=0x102a94 <putc>, farg=0) at doprnt.c:44

#2  0x100b9f in printf (

    fmt=0x100922 "lockkill: see lock.users = %d, nusers = %d\n", args=0)

    at printf.c:15

#3  0x100974 in lckkill (lock={mutex = 86, users = 0x109010}, nusers=2) at lock.c:79

#4  0x100a87 in main (argc=0, argv=0x102d70) at test1.c:55

#5  0x101948 in startmain () at ../sys/main.c:18

 

 

In Xinu, we can determine from the stack backtrace whether the system is running as user or kernel. If the stack bottom is main or a user process function used in create (and recorded in proctab[i].paddr), we have a user process. Reading up the stack from there, if we find a syscall (or an interrupt handler such as clkint), execution is now in the kernel. If we find none, execution is as user.

 

In theory, we could do this at any point in time, working from the current ESP value, into the current execution stack.
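
The backtrace rule can be written down as a small classifier. This is only a sketch: it works on function names as printed by gdb’s where command, not on a live Xinu stack, and the table of kernel entry points is an assumption limited to the names seen in the examples above (wait, putc, clkint). The names classify_backtrace and kernel_entry_points are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum exec_mode { MODE_USER = 0, MODE_KERNEL = 1 };

/* Assumed table: the syscalls and the interrupt handler that appear in
 * the gdb examples. A real Xinu would need the full syscall list. */
static const char *const kernel_entry_points[] = { "wait", "putc", "clkint" };

static int is_kernel_entry(const char *name)
{
    size_t n = sizeof kernel_entry_points / sizeof kernel_entry_points[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, kernel_entry_points[i]) == 0)
            return 1;
    }
    return 0;
}

/* frames[] holds the function names of one backtrace, in any order.
 * If any frame is a syscall or interrupt handler, the process is
 * currently executing kernel code; otherwise it is executing as user. */
enum exec_mode classify_backtrace(const char *const frames[], size_t nframes)
{
    for (size_t i = 0; i < nframes; i++) {
        if (is_kernel_entry(frames[i]))
            return MODE_KERNEL;
    }
    return MODE_USER;
}
```

Applied to the four backtraces above, the lckwait/worker stack classifies as user and the other three as kernel. Note that startmain is deliberately not in the table: it sits below main as setup code and does not mean a syscall is in progress.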

 

In UNIX/Linux on x86, we can just look at the current CS register.
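
Reading CS is not a privileged operation, so a user program can check its own mode. A minimal sketch, assuming GCC or Clang inline assembly on x86 or x86_64; current_privilege_level is a hypothetical name, and on other CPUs the function just reports -1.

```c
/* Returns the current privilege level (the low two bits of CS) on x86:
 * 0 means kernel mode, 3 means user mode.
 * Returns -1 on CPUs where this sketch does not apply. */
int current_privilege_level(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned short cs;
    /* Copy the CS segment register into a general-purpose register. */
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    return cs & 3;
#else
    return -1;
#endif
}
```

Called from an ordinary Linux process on x86, this should report 3, since the program can only ever observe itself while running in user mode.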

 

More details on system calls

 

Each system call has a number. Although write has been number 4 in all my previous experience, it is now 1 on x86_64, as defined in /usr/include/asm/unistd_64.h.  That explains the code we saw in hw1’s solution:

 

0x00007fe7ab8ffd89 <write+9>:   mov    $0x1,%eax

0x00007fe7ab8ffd8e <write+14>:  syscall

 

This shows how the one syscall instruction can handle all the different system calls. The arguments are accessible from the registers, put there in preparation for the call to write. This is different from what we see with i386-gcc, which passes function arguments on the stack. In current processors and C compilers for them, accesses to memory (such as pushes on the stack) are minimized for performance.
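
The same two instructions can be issued by hand. This is a sketch, assuming GCC or Clang inline assembly on x86_64 Linux, where the syscall number goes in rax and the first three arguments in rdi, rsi and rdx; write_via_syscall_insn is a hypothetical name. On other platforms the sketch falls back to the library’s generic syscall() function.

```c
#define _GNU_SOURCE        /* makes unistd.h declare syscall() */
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write (1 on x86_64) */
#include <sys/types.h>     /* ssize_t */
#include <unistd.h>

/* Writes len bytes from buf to fd without going through write().
 * On x86_64 Linux the raw return value is the byte count, or a
 * negative errno value on failure (errno itself is not set).
 * The fallback path returns -1 and sets errno instead. */
ssize_t write_via_syscall_insn(int fd, const void *buf, size_t len)
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                    /* result comes back in rax */
        : "a"((long)SYS_write),        /* syscall number in rax    */
          "D"((long)fd),               /* 1st argument in rdi      */
          "S"(buf),                    /* 2nd argument in rsi      */
          "d"(len)                     /* 3rd argument in rdx      */
        : "rcx", "r11", "memory");     /* syscall overwrites rcx, r11 */
    return (ssize_t)ret;
#else
    return (ssize_t)syscall(SYS_write, fd, buf, len);
#endif
}
```

Changing only the number loaded into rax selects a different system call, which is the point made above: one trap instruction serves them all.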

 

Controlled Transitions and system security

 

OK, so we see that we can pinpoint the moments that the user surrenders control to the OS, and vice versa. This is crucial to the whole design of a safe OS, just as national security depends on handling things very carefully at ports of entry: it is made very clear whether you’re on the inside or the outside of the checkpoint, and only special handling gets you across the divide.

 

In Linux/UNIX, the user code is trapped quite well by running in user mode of the CPU (no privileged instructions like cli) and with IF=1, so interrupts can get the OS working again for sure.  All the user code is allowed to do is use the CPU and memory it is given (all private memory unless shared memory syscalls have been used), and make system calls to ask for services.

 

User-kernel Separation

The fact that the system calls are trap-type instructions is crucial. This mechanism was invented in the late 60s, before UNIX was born in 1970, and UNIX has always had this feature. So have other important OS’s:

 

UNIX/Linux

Windows NT/2000/XP/Vista/7

DOS

 

UNIX* and Windows* (as listed above) also have the user run in user mode on the CPU and the kernel run in kernel mode, also important to keeping the user from taking over the machine. DOS does not do this, so it is not at all secure.

 

Of course, if the user code is allowed to use file system calls to wipe out system binaries, or install new ones, the system is still not really secure. This was the case with older Windows OS’s (NT/2000/XP): the user was given administrative rights by default, so “user mode” execution was way too powerful. This was never true with UNIX/Linux. Finally, with Windows Vista, users by default run at “standard user rights”, not administrative rights. See http://207.46.16.252/en-us/magazine/2007.06.uac.aspx, which ends:

 With Windows Vista, Windows users can for the first time perform most daily tasks and run most software using standard user rights, and many corporations can now deploy standard user accounts.

 

So you see that system security involves more than user-kernel transition handling, but this is a prerequisite to the rest of the story.

 

Admin/root capability

 

In UNIX/Linux, the administrators have “root” capability to do things like change system files and reboot the system. As discussed in Love, pg. 70, current Linux has subdivided the blanket root capability into separate capabilities. Examples: CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_NICE, ... They are per-thread attributes. They are supposed to be picked up from an executable file, but this doesn’t seem to be supported yet in any Linux filesystem. So this talk of capabilities boils down to nothing yet.

 

In real life, the distinction in UNIX/Linux is between privileged threads (with user id UID = 0 for root) and non-privileged threads (other UIDs). The privileged threads are not checked for detailed capabilities. To do privileged system calls, you log in as root or “setuid to root” (similar in effect to login, but can be done in a running process by the setuid syscall) and thus run with UID=0, in which case you can do any system call that checks for privilege.

 

Note that root privilege is entirely separate from privilege for instructions. Privileged instructions are defined by the CPU architecture, whereas root or admin privilege is defined by the OS kernel.

 

Privileged instructions: cli is our standard example, but inb and outb are also privileged by default on x86 Linux. The x86 CPU can be programmed per thread and per port to make inb and outb privileged or not. You can enable inb and outb for user level (making them non-privileged) via the ioperm syscall, which of course is a privileged syscall, so you need to be running as root at that point, but you can revert to your usual user id in the same process and still use the ports. If you’re interested, see http://www.faqs.org/docs/Linux-mini/IO-Port-Programming.html#ss2.1
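
The ioperm-then-drop-root pattern described above might look like the following. This is a sketch under assumptions: Linux on x86 with <sys/io.h> available, and a program started with root privilege (for example setuid root). enable_port_access is a hypothetical name, and the port range used in the usage note is illustrative only.

```c
#include <errno.h>
#include <unistd.h>        /* getuid(), setuid() */
#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
#include <sys/io.h>        /* ioperm() */
#endif

/* Asks the kernel to let this process use inb/outb on `count` ports
 * starting at `first_port`, then gives up root by switching back to
 * the real user id. Returns 0 on success, -1 with errno set on failure
 * (EPERM when the process lacks the needed privilege). */
int enable_port_access(unsigned long first_port, unsigned long count)
{
#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
    if (ioperm(first_port, count, 1) != 0)
        return -1;                 /* privileged syscall was refused */
    if (setuid(getuid()) != 0)
        return -1;                 /* could not revert to the real uid */
    return 0;                      /* port access remains enabled */
#else
    (void)first_port;
    (void)count;
    errno = ENOSYS;                /* no port I/O permission bitmap here */
    return -1;
#endif
}
```

A call such as enable_port_access(0x378, 3) would request three consecutive ports; run as an ordinary user it should fail with EPERM, which is the privilege check in action.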

 

 

The kernel in memory

 

Where is the kernel?  In Linux on x86_64, it’s near the top of the 64-bit address space. Recall from OS-lect4 that only 47 bits are available for usable addressing at the low end and 47 bits at the high end of the 64-bit address space.  The kernel lives in the upper part. We can see the addresses by variations on the following command:

 

sf06.cs.umb.edu$ grep sys_w /boot/System.map-2.6.27-9-server

ffffffff802527e0 T sys_wait4

ffffffff802528d0 T sys_waitpid

ffffffff802528f0 T sys_waitid

ffffffff802874e0 T compat_sys_waitid

ffffffff802875e0 T compat_sys_wait4

ffffffff802e9d20 T sys_write

ffffffff802ea440 T sys_writev

ffffffff803257e0 T compat_sys_writev

ffffffff803417e0 t proc_sys_write

 

 

These addresses are all about 0x80000000 bytes below the top of the 64-bit address space, that is, 0x800 MB = 2 GB from the top.
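
The distance from the top can be checked with a little unsigned arithmetic. A minimal sketch; bytes_below_top and mb_below_top are hypothetical helper names, and the addresses fed to them are the ones from the System.map listing above.

```c
#include <stdint.h>

/* Distance in bytes from addr up to the top of the 64-bit address
 * space (2^64). Unsigned arithmetic wraps modulo 2^64, so 0 - addr
 * is exactly 2^64 - addr. */
uint64_t bytes_below_top(uint64_t addr)
{
    return (uint64_t)0 - addr;
}

/* The same distance in whole megabytes (units of 2^20 bytes). */
uint64_t mb_below_top(uint64_t addr)
{
    return bytes_below_top(addr) >> 20;
}
```

For sys_write at ffffffff802e9d20 this gives 0x7fd162e0 bytes, or 2045 MB, just under the 2 GB mark.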

 

We see that both the user and kernel can be represented as separate parts of the huge 64-bit address space, with the user image (of the current user process running on this CPU) in the low part and the kernel in the high part. This makes it easy for the kernel to access user space of the current process as it is doing a syscall for it.

 

Memory Protection

 

What prevents our user level code from calling sys_write directly? Hardware memory protection. Will cover later. It is based on the user vs kernel mode of execution as kept in the CS register.
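
We can watch the protection work by letting a child process touch a kernel address and seeing what happens to it. This is a sketch, assuming a POSIX system with fork and waitpid; probe_address is a hypothetical name, and any kernel address passed to it (such as the sys_write address from the System.map listing) is specific to that one kernel build.

```c
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that tries to read one byte at addr.
 * Returns the number of the signal that killed the child,
 * 0 if the read went through, or -1 if fork/waitpid failed. */
int probe_address(uintptr_t addr)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: volatile forces the compiler to really do the load. */
        volatile const unsigned char *p = (volatile const unsigned char *)addr;
        unsigned char byte = *p;
        (void)byte;
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On x86_64 Linux, probing an address in the kernel half should report SIGSEGV, while probing the address of one of the program’s own variables should report 0. The parent survives either way because the fault happens in the child.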

 

Summary on System Security

With syscalls, user vs. kernel CPU modes, and memory protection based on user/kernel mode, we see that the user execution is effectively bottled up in the virtual machine given to it, and can only use syscalls to cause trouble. In UNIX/Linux syscalls, the notion of root vs. non-root privilege is there to keep ordinary users from doing dangerous things. Ordinary users, including software developers, run as non-root users (almost) all the time.

 

Windows NT/2000/XP/Vista/7 has syscalls, user vs. kernel CPU modes, and memory protection based on user/kernel mode. However, until Vista, users normally ran with Administrative privileges, causing many problems with viruses damaging system files, etc. Even with Vista, many users find they need Admin rights. Hopefully Windows 7 has improved this situation.