CS641 Class 13
Midterm: Wed., Mar 30 (open books, notes)
Handout: Mandelbrot program with added synchronized method
Cache Coherence: Sec. 5.8, review from last time
Look at Fig. 5.35, and see the base situation that is leading up to problems: after A stores X, B’s cache has a stale value.
Will a subsequent read by B get the wrong value?
Well, if stale values are allowed to hang around for a significant time and are then used, it’s a real problem. It turns out that we tolerate a little bit of staleness for performance’s sake. After all, the threads on different processors are not synchronized unless we ask for special synchronization.
Look at the rules on pg. 535. I’ll write them in database notation: w1(x) means write by processor 1 on data item x. r2(x) is a read of x by processor 2.
Cache Coherence Rules
1. w1(x) r1(x): the read must return the value previously written by the same processor
2. w2(x) r1(x) with no w(x)’s between: the read returns the written value, provided the write and read are sufficiently separated in time (the write needs time to propagate to the other processor)
3. w1(x)w2(x) can be done out of order, but whatever order of writes is seen by one processor must be seen by all; this is the “serialization order” chosen by the system. (For example, if one processor sees x take w1’s value and then w2’s, every processor sees that same succession.)
What this is aiming at is a good story about what happened to an x value in observable memory. No processor can see a different succession of values in x from another processor. Each processor has to act like a uniprocessor on reads and writes it does for its own programs.
How can a system keep even this much order? Snooping protocols on buses are the most common way.
What is a bus?
A bus has a set of wires connected to multiple devices for their communication (pg. 582). A typical bus has address lines, data lines, a R/W line, and, for a multiprocessor, lines for the id of the processor doing the current bus cycle (read or write). It has a clock signal that runs the show, so the cycles/sec of this clock signal measures its speed, for example 800 MHz or 1333 MHz. 1000 MHz means one cycle every ns, so 800 MHz means one cycle every 1.25 ns.
In the SMP diagram, pg 639, the Interconnection Network can be a memory bus. Alternatively, it can be some other fast network, but we won’t consider that case here. The “front side bus” shown in Fig. 6.9, pg. 585, and Fig. A.2.2, pg. A-8, is a memory bus.
In the case of the memory bus, the CPUs’ caches and the main memory are connected to it, plus bridges to other buses.
<picture of parallel wires, connected devices, including multiple CPU/cache devices, and of course memory modules>
All the caches are connected to this common bus, so they can interpret all the reads and writes of the other caches. Also, they can use the bus to talk to other caches.
Snooping Protocol on a Bus
Write Invalidate Protocol. Consider the situation of Fig. 5.35, pg. 535, where caches A and B have copies of X, and then cache A does a write of X. A can notify all other caches of this (using the bus), and B can remove the stale entry. The result is the sequence shown in Fig. 5.36, pg. 537, for a write-back cache.
When B reads X after the invalidation, there is a cache miss. The resulting request for the memory value by B is in fact cancelled by A, which answers the request using its copy. So the caches cooperate to avoid accessing memory, the bottleneck. In this particular protocol, the memory copy is updated at this point to avoid ambiguity in where the definitive copy of X is. (A more complicated protocol can track “ownership”, i.e., who has the definitive copy.)
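Concretely, the sequence runs roughly like this (paraphrasing Fig. 5.36; write-back caches, X initially 0 in memory; in the last step A answers B’s miss and the memory copy is updated):

    Processor activity    Bus activity          Cache A    Cache B    Memory X
    (initially)           -                     -          -          0
    A reads X             cache miss for X      0          -          0
    B reads X             cache miss for X      0          0          0
    A writes 1 to X       invalidation for X    1          invalid    0
    B reads X             cache miss for X      1          1          1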
False Sharing Idea, where a write-back cache starts acting like a write-through cache
Suppose memory items X and Y sit in the same cache line, and processor 1 updates only X while processor 2 updates only Y. Since they belong to the same cache line, the W1(X) puts the line in cache 1; then the W2(Y) requires a read of the line, answered by cache 1 but also instituting a write to memory. The write of the line then invalidates cache 1’s copy, so the next W1(X) ends up writing the line to memory again, and so on. Lots of memory action, even though the two processors never touch the same variable.
If X and Y are on separate cache lines, everything happens in cache (assuming write back), and only occasionally is memory updated.
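A small Java demo makes the cost visible (a sketch, not part of the handout; the 64-slot gap assumes a cache line is no larger than 64 longs = 512 bytes, and the timings are only illustrative):

    // Two threads hammer two counters that either share a cache line or sit
    // far apart. With adjacent slots, the line ping-pongs between the caches.
    public class FalseSharingDemo {
        static long run(long[] counters, int i0, int i1) throws InterruptedException {
            Thread t0 = new Thread(() -> { for (int i = 0; i < 100_000_000; i++) counters[i0]++; });
            Thread t1 = new Thread(() -> { for (int i = 0; i < 100_000_000; i++) counters[i1]++; });
            long start = System.nanoTime();
            t0.start(); t1.start();
            t0.join(); t1.join();
            return (System.nanoTime() - start) / 1_000_000;  // elapsed milliseconds
        }

        public static void main(String[] args) throws InterruptedException {
            long[] a = new long[128];
            System.out.println("same cache line (ms):      " + run(a, 0, 1));
            System.out.println("separate cache lines (ms): " + run(a, 0, 64));
        }
    }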
Memory Consistency
Memory coherence allows a lot of sloppiness in when a write takes effect for other processors. A coherent memory could allow W1(X)W1(Y) to write out Y before X, so another processor could see the new Y value but the old X value. Memory consistency can rule this out.
Memory Consistency Rules (pg. 538)
1. A write does not complete (allowing another write by the same processor) until all processors have “seen” the effect of that write. That is, after this, a read by any processor will get the new value.
2. A processor does not reorder writes with respect to any other memory operations.
These ensure that if processor 1 does W1(X)W1(Y), no other processor can do R2(Y)R2(X) and see the new value of Y but the old value of X.
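In Java terms this is the classic flag-and-data idiom (a sketch, not from the handout): the writer does W1(X)W1(Y) and the reader does R2(Y)R2(X). Declaring the flag volatile is how Java asks for this consistency guarantee:

    class OrderingDemo {
        static int data = 0;                   // plays the role of X
        static volatile boolean flag = false;  // plays the role of Y

        // W1(X) W1(Y): volatile keeps the two writes ordered for other threads
        static void writer() { data = 42; flag = true; }

        // R2(Y) R2(X): if the reader sees flag == true, it must see data == 42
        static void reader() { if (flag) System.out.println(data); }
    }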
Last time we discussed the problem of lost update in the simple case of counting in parallel. Two threads, running on two processors, can both read a count of 42, separately add 1, and both write 43, instead of the expected 44. This can even happen for two threads on a uniprocessor.
We saw that with the help of a mutex, we can fix this problem. When the mutex is “set”, only one thread can run, so the mutex forces the threads to take turns in the increment code. Each thread has to set the mutex, run the increment code, and release the mutex. Sometimes when a thread is setting the mutex, it has to wait a while in the set-mutex code, while the other thread finishes up.
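In code, the pattern looks like this (a minimal sketch using Java’s ReentrantLock as the mutex, not the handout’s Mandelbrot code):

    import java.util.concurrent.locks.ReentrantLock;

    class Counter {
        private int count = 0;
        private final ReentrantLock mutex = new ReentrantLock();

        void increment() {
            mutex.lock();        // set the mutex; may wait while the other thread finishes up
            try {
                count++;         // the increment code: read, add 1, write back
            } finally {
                mutex.unlock();  // release the mutex
            }
        }
    }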
The mystery remains: how can we implement a mutex?
In practice, we use synchronization supplied by Java or the OS, but this just transfers the question to another environment. At some level in the software, we reach the hardware—how does it work there?
Each processor architecture has special instructions for this. In x86, there are “interlocked” instructions that can lock the bus, stalling all the processors attached to it, while one processor does its action, such as decrementing a memory location. See $cs641/../cs644/class22.html for more info if interested.
Some processors offer an “atomic exchange” instruction. MIPS has something like this.
Using an atomic exchange instruction
For a mutex, set up a memory variable, the “lock variable”, say mylock.
mylock = 0 means the lock is free (the mutex is free)
mylock = 1 means the lock is locked (the mutex is taken)
To set the lock (get the mutex):
get-mutex: put 1 in a register
do atomic-exchange, exchanging values in the register and mylock (this needs bus-lock or something powerful)
test the register: if 0, this thread has the lock, can run critical code, so return from get-mutex
if 1, the lock is taken, try again, by branching to get-mutex
To release the lock: write 0 to mylock
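Java exposes an atomic exchange as AtomicInteger.getAndSet, so the whole get-mutex loop can be sketched directly (a sketch of the idea, not the hardware-level code):

    import java.util.concurrent.atomic.AtomicInteger;

    class SpinLock {
        private final AtomicInteger mylock = new AtomicInteger(0);  // 0 = free, 1 = taken

        void getMutex() {
            // atomic exchange: put 1 in mylock, get the old value back
            while (mylock.getAndSet(1) == 1) {
                // old value was 1: the lock is taken, try again
            }
            // old value was 0: this thread has the lock and can run critical code
        }

        void releaseMutex() {
            mylock.set(0);  // write 0 to mylock
        }
    }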
MIPS uses two instructions for atomic exchange, load-linked and store-conditional, and requires a loop (code, pg. 138) to have the effect of atomic exchange, so the code for get-mutex ends up being a double loop.
These looping constructs give another name to such a lock: a “spin lock”. CPU time is wasted while a thread spins, so we only want to use this kind of mutex for very short periods of time, that is, shorter than thread-switch time. Guarding access to shared memory variables is the main use case.
For longer needed mutex times, such as covering a disk access, we want the thread to block in the OS so it can run other threads. The OS uses this kind of mutex to implement “semaphores” that block callers as appropriate. Most apps just use the OS offerings, but database servers may directly use spinlocks.
Clearly the two MIPS instructions don’t lock the bus between them. Something like this must be happening, in the case of snooping on a bus:
load-linked on X by Pi: all caches create an entry for X showing load-linked access by Pi
load-linked on X by Pj: seen by Pi and Pj, both caches now know X has 2 requests by Pi and Pj
store-conditional on X by Pi: succeeds, invalidates all cache entries on X
store-conditional on X by Pj: fails, since its request on X is no longer there.
Synchronization in Java programs: see handout code
Ref: http://download.oracle.com/javase/tutorial/essential/concurrency/syncmeth.html
See the handout. Java provides a mutex for each object (at least in effect). It’s important to use the right object’s mutex for the job. Here, each thread has its own object and does the increments, so we put the synchronized method in the top-level object; that way the mutex in use is global to the program.
The code for the increment is slowed down by a little delay, to make it more likely that synch problems will actually occur. And they do, as shown by the run results on a dual-processor machine.
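The shape of that code is roughly this (hypothetical names, not the actual Mandelbrot program): every worker thread funnels its increments through the one shared top-level object’s synchronized method, so they all contend for the same mutex.

    class Top {
        private int count = 0;
        synchronized void increment() { count++; }  // locks this Top object's mutex
        int getCount() { return count; }
    }

    class Worker extends Thread {
        private final Top top;   // the one shared top-level object
        Worker(Top top) { this.top = top; }
        public void run() { for (int i = 0; i < 1_000_000; i++) top.increment(); }
    }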
Digital Logic
° Next several weeks: we’ll study how a modern processor is built starting with basic logic elements as building blocks.
° Why study logic design?
• Understand what processors can do fast and what they can’t do fast (avoid slow things if you want your code to run fast!)
Logic Gates
° Basic building blocks are logic gates.
• Can build gates with transistors and resistors
• Can represent and reason about gates with truth tables and Boolean algebra
• We assume you know truth tables and Boolean algebra from a math course.
• Appendix C in the textbook CD-ROM has a review if needed