CS641 Class 5

Last time: Using the stack for calling functions and holding local variables and arguments and preserved registers.  Note there is a document on this topic now.  MIPS (gcc) Calling Conventions, with more examples. Read this for hw2.

Logical Instructions and Masks: Sec 2.6

C and Java operators <<, >>, &, |, ~ (>>> for Java only)

<<: left shift, easy to use, can do shortcut multiplies.

Ex:   x = 0x1<<4;  makes x = 0x10, binary 10000.  Zeroes come in from the right.

y = x<<1;   doubles x for y, y = x <<2 ;   multiplies x by 4, etc.

>>:  right shift, need to be more careful, understand “sign extension”.

y = x>>2;    divides x by 4 if x is positive, or negative, because it shifts in 1s instead of 0s if x < 0 (“sign extension”, since the sign bit (most significant bit) is 1 for negative numbers, 0 for non-negatives.

If you want to shift in 0s, Java (but not C) has a special unsigned right shift >>>.  In C, have to clear out unwanted 1s in some cases.

Example: create a number with bits 3-9 on (7 bits), rest off.

Partial solution: Start with -1, all 1s on, left shift by 3, get bits 3-31 on. But right shift of this will bring in more 1s from the left, so can’t finish the job this way, at least in C.

Note: -1 has all bits = 1, so that (-1) + 1 = 0.

Java: start with -1, >>> by 32-7 = 25, then left shift by 3.((-1)>>>25)<<3

C:   0x7f<<3   does it, since 0x7f has 7 bits on.  But not generalizable.

Note that 0x7f = (1<<7) – 1,  so ((1<<7)-1) <<3 is generalizable, works in C or Java.

This number with bits 3-9 on, can be called a mask for bits 3-9.

int mask3_9 = ((1<<7)-1) <<3;

It can be used to pick out the value in these bits from an int value

°         Key idea: ANDing a bit with 0 produces a 0 while ANDing a bit with 1 produces the original bit.

val = (x&mask3_9) >> 3;  /* pick up the number in bits 3-9 */

Here & is the bitwise and of C/Java, and MIPS too.

 

Example:

                x              1011 0110 1010 0100 0011 1101 1001 1010

                mask     0000 0000 0000 0000 0000 0111 1111 1000  = 0x7f8

The result of bitwise anding these:

                                0000 0000 0000 0000 0000 0101 1001 1000

 

Masking 1 bit: good for working through the bits of a number

int mask_i = (1<<i);

Example: count the on-bits in x;

int i, count = 0;

for (i=0; i<32;i++)

   if (x&(1<<i))          ( or x&(1<<i) != 0 in Java)

      count++;

 

MIPS Assembler Logical Ops

View contents of register as 32 raw bits rather than as a single 32-bit number

MIPS Logical Operators are all bitwise, meaning that bit 0 of the output is produced by the respective bit 0’s of the inputs, bit 1 by the bit 1’s, etc.

If x of our last example is in $t0, and the mask is in $t1, we can do the bitwise and into $t2:

and          $t2,$t0,$t1

or we could use immediate for the mask:

andi         $t2,$t0,0x7f8

or we could build the mask up using our expression.

To get the value in the 7 bits, we right shift this result (MIPS shift right works like >>>, bringing in 0s)

srli      $t2, 3

Note that the #2 method on pg. 105 doesn’t work directly in C. In Java, need to use >>>, then <<. In MIPS, use srl, then sll.  The AND of #1 leaves the value in the same bits as the mask, need right shift to normalize.

 

Next: Sec. 2.13, the C Sort example.

swap.c and sort.c are in $cs641/programs

swap.c:

void swap(int v[], int k)

{

    int temp;

    temp = v[k];

    v[k] = v[k+1];

    v[k+1]=temp;

}

Basic idea: array v is an arg, which means its starting address (the value of the v name) is passed from the caller. 

In C, an array doesn’t “know” its length, unlike in Java, where an array is an object. The C compiler only knows an array’s length if it has access to the array’s definition, such as “int a[100];.  The swap function here does not have such access, so it only knows the starting address of v, passed as an arg.

This function is swapping neighbors in the array, the elements at k and k+1.

Right away we know this sort is not that efficient. An efficient (in-place) sort moves elements much further on each step.

Recall that C compiles each function completely separately, so it’s perfectly OK to put swap.c in a separate file from sort.c, its caller.  It’s no handicap to C.

Unoptimized Compile

Frame: save $fp + pad, $gp, temp as local+pad: 24 bytes

set up frame, save $fp, save arg0, arg1 in 24($fp) and 28($fp)

First action of body: temp = v[k]

load k from stack, mult by 4 (sll by 2), add to v also loaded from stack, creating ptr to v[k]

load  v-at-k using ptr

store v-at-k to local var temp on stack at 8($fp)

        lw      $2,28($fp)

        sll     $3,$2,2

        lw      $2,24($fp)

        addu    $2,$3,$2

        lw      $2,0($2)

        sw      $2,8($fp)

Similarly, next assignment is handled similarly: load everything needed from stack, end up storing one result back to stack.  Even in the middle statement, nothing is reused from doing the rhs when doing the lhs.

When we use optimization (mips-gcc-O2),  the assembly code ends up looking like the code in the book, pg. 151:

No stack frame, use args a0 = v, a1 = k from their initial registers.

compute 4*k by sll by 2 (k in $a1)

add 4*k to v in $a0, getting ptr to v[k] in $t1

load old-v-at-k into $t0 using ptr in $t1

load old-v-at-k+1 into $t2 by using offset of 4 from ptr in $t1

store old-v-at-k+1 from $t2 using ptr in $t1 (putting it in v[k])

store old-v-at-k  using offset of 4 from ptr in $t1  (putting it in v[k+1])

return to caller using $ra

 

Note: if hand-written, should have register usage comments:

# $t0: old v[k] value

# $t1: &v[k]

# $t2: old v[k+1] value

 

Comparing these:              Wow!                                                 

 

# instructions    

#memory ref’s

Unoptimized

31

18

Optimized

7

4

 

Actually, there are more memory references due to instruction fetches, but we won’t count them yet.

The sort function: can be in sort.c, separate from swap.c:

void sort(int v[], int n)

{

    int i, j;

    for (i=0; i < n; i++)

        for (j =i-1; j >= 0&& v[j] > v[j+1]; j--)

            swap(v,j);

}

 

Unoptimized frame: $ra, $fp, $gp, 2 local vars, space for 4 args: 40 bytes

$fp set up

arg 0 (v) saved to 40($fp), arg 1 (n) saved to 44($fp)

24($fp) used for local var i

28($fp) used for local var j

Call made by loading a0 and a1 with needed args, jal to swap.

Usual inefficient code to do the needed actions.

Optimized Code: Different from book code

Book code:

frame of 20 bytes: save $ra, $s0, $s1, $s2, $s3, no args, so 5*4 = 20, which should be padded to 24 to meet MIPS rules of double-word alignment of sp. (Noted on pg. B-27!)

Once the s-reg’s are saved, they can be used with the assurance that they will retain their value across the calls to swap.

Clearly a register usage comment would be very welcome here.

mips-gcc-O2 code:

frame of 48 bytes: save $ra, $s0-$s4, $gp, 4 args.  So one more s-register than the book’s version.

Register usage:

s0: &v[j]   (also used for &v[j+1) via offsetting by 4)

s1: j

s2: i

s3: v

s4: n

 

So we see that the two args and two local vars each get a register, plus one more “made up” by the compiler for its convenience in accessing the array.

Here $gp is actually saved in code generated by” .cprestore 16”, and later used in calling swap.

To judge efficiency, we count the instructions and memory ref’s in the inner loop.

Inner loop counts

# instructions (counting 1 for call)

#memory ref’s (not counting call)

mips-gcc-O2

10

2

Book code

12

2

 

We see that in both cases, the memory ref’s are minimal, just the needed array elements in the compare.  Our C compiler apparently saves a couple of instructions in the inner loop by testing the inner termination condition in two places, once before the inner loop and once inside.