Chap. 7 Multicores, Multiprocessors, and Clusters
Flynn's taxonomy: SIMD and MIMD are the categories in common use; SISD (an ordinary uniprocessor) and MISD are also defined, but MISD is rarely seen in practice.
Terminology
core: one processor, usually in a multi-core chip
microprocessor: one chip, these days usually containing several cores
serial vs. parallel: how many program counters (PCs) are active at once
parallel processing program: one that can utilize multiple processors (cores)
multiprocessing: utilizing multiple processors
multithreading: ability to run multiple threads (doesn’t require multiprocessing, just a decent OS)
Effective parallel programming is hard
Those little parts of the program that set things up and wrap up results loom large. And if there's only one disk, the I/O doesn't get any faster no matter how many processors there are.
Example on pg. 635: given 100 processors, we want a speedup of 90; how much sequential code can we afford?
Answer: only 0.1% of the total computation.
Pretty depressing
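Where the 0.1% comes from (a sketch of the Amdahl's Law arithmetic, with f = the fraction of the work that can be parallelized and 100 processors): speedup = 1/((1 - f) + f/100), so requiring speedup = 90 means (1 - f) + f/100 = 1/90, i.e., 1 - 0.99f = 1/90, giving f ≈ 0.9989 and a sequential fraction 1 - f ≈ 0.0011, about 0.1%.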
Second example: sum of 10 numbers (done serially) and sum of two 10x10 matrices
Let t = time of one addition
Uniprocessor: Time = 10t + 100t = 110t.
10-way Multiprocessor: each does 10 adds for matrix, time = 100t/10 + 10t = 20t, 5.5 times faster (vs. 10x more processors)
100-way multiprocessor: each does 1 add for matrix, time = 100t/100 + 10t = 11t, 10 times faster (vs 100x more processors)
Scale up the problem (the matrix is now 100x100):
Uniprocessor: Time = 10t + 10000 t = 10010 t.
10-way Multiprocessor: each does 1000 adds for matrix, time = 10000t/10 + 10t = 1010t, 9.9 times faster (vs. 10x more processors)
100-way multiprocessor: each does 100 adds for matrix, time = 10000t/100 + 10t = 110t, 91 times faster (vs 100x more processors)
So scaling up the problem makes the parallelism count for more and more, as long as the main computation is perfectly parallelizable.
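In general (my notation, not the book's): with a 10-number serial sum plus an n x n matrix add spread over p processors, time = 10t + (n^2)t/p, so speedup = (10 + n^2)/(10 + n^2/p). Plugging in n = 10, p = 10 gives 110t/20t = 5.5, and n = 100, p = 100 gives 10010t/110t = 91, matching the cases above; as n grows, the fixed 10t matters less and the speedup approaches p.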
Nearly perfectly parallelizable tasks:
--Mandelbrot set computations (see the sketch after this list)
--Ray tracing for animation
--others?
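Why Mandelbrot is nearly perfectly parallelizable, as a minimal Java sketch (my own, not the course's example, which we'll see later): each pixel's iteration count depends only on that pixel's own coordinates, so pixels (or rows of pixels) can be handed to different cores with no shared data and no synchronization.

class Mandel {
    // Iteration count for the point c = (cx, cy); max caps the loop.
    static int iterations(double cx, double cy, int max) {
        double x = 0, y = 0;                        // z starts at 0
        int n = 0;
        while (x * x + y * y <= 4.0 && n < max) {   // escape test: stop when |z| > 2
            double xt = x * x - y * y + cx;         // real part of z^2 + c
            y = 2 * x * y + cy;                     // imaginary part of z^2 + c
            x = xt;
            n++;
        }
        return n;
    }
}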
Sec. 7.3 Shared Memory Multiprocessors (SMP)
Each processor has its own cache (the caches communicate to keep shared data coherent), and a shared bus or other connection to their shared memory.
The memory is allocated to processes, and normally processes can’t see each other’s memory. But a process can run threads, which do share memory, so one thread can write x and another read it.
To keep threads from destroying shared data, we need synchronization capabilities. These are provided by the OS, or directly by hardware instructions that can be used to implement "spinlocks". This is possible on MIPS; there is some (rather confusing) discussion in Sec. 2.11.
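As a concrete illustration (a minimal sketch only; the class name SpinLock is hypothetical, and Java's atomic compare-and-set plays roughly the role that the MIPS ll/sc pair plays at the hardware level):

import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        // Spin until we atomically change false -> true; only one thread can win at a time.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();    // JDK 9+ hint that we are busy-waiting
        }
    }

    void unlock() {
        locked.set(false);          // release: lets some spinning thread's CAS succeed
    }
}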
SMP machines: the common 8-way server with perhaps 32GB of memory.
SMP machines can have huge memory: at the UMB Venture Development Center, Symmetric Computing's machine can take up to 1.54TB of memory, with 96 cores on each of 3 units connected by 40 Gbps Mellanox InfiniBand host bus adapters. Weiwei worked there last summer.
Skip now to Sec. 7.4; come back later for multi-threaded programming.
Sec. 7.4 Clusters, etc.
See Fig. 7.4, pg. 641, for systems each with its own memory.
Current systems of this kind communicate over 1Gb Ethernet switches: e.g., Vertica, etc.
Also, clusters can be made out of cloud computing nodes from Amazon EC2, etc., at a very reasonable price, like $4/hr. for 8 nodes.
Video on a signup and install (single node). Cloud computing intro: slides from an academic conference.
At the DBDay miniconference in January, I heard of several ongoing academic research projects using this for their implementations.
Example: web app using a cluster:
20 web server nodes
25 app server nodes (running Java web app)
1 DB server node: here’s where the sharing of data takes place
The sum example on pg. 643 is crazy; you need a much bigger job before it's worth going to a cluster.
Ray tracing: a paper on the ray tracing in the Cars movie used SIMD instructions on the PowerPC CPU (Apple G5).
Specifically, the AltiVec and SSE instruction sets. See the Wikipedia article on AltiVec.
Idea of SIMD instruction: set up vector of data, do one operation on each element all at once.
From AltiVec's Wikipedia page:
Both AltiVec and SSE feature 128-bit vector registers that can represent sixteen 8-bit signed or unsigned chars, eight 16-bit signed or unsigned shorts, four 32-bit ints or four 32-bit floating point variables. Both provide cache-control instructions intended to minimize cache pollution when working on streams of data.
Cache pollution describes situations where an executing computer program loads data into CPU cache unnecessarily, thus causing other needed data to be evicted from the cache into lower levels of the memory hierarchy, potentially all the way down to main memory, thus causing a performance hit.
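To make the SIMD idea concrete, here is a sketch using Java's incubating Vector API (jdk.incubator.vector, JDK 16+, so much newer than the book; run with --add-modules jdk.incubator.vector). A 128-bit "species" holds four floats, matching the AltiVec/SSE register width quoted above, so one vector add does four element adds at once.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

class VectorAdd {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_128;   // 4 x 32-bit floats

    static void add(float[] a, float[] b, float[] c) {
        int i = 0;
        int upper = S.loopBound(a.length);       // largest multiple of 4 <= a.length
        for (; i < upper; i += S.length()) {
            FloatVector va = FloatVector.fromArray(S, a, i);
            FloatVector vb = FloatVector.fromArray(S, b, i);
            va.add(vb).intoArray(c, i);          // one SIMD add handles 4 elements
        }
        for (; i < a.length; i++) {              // scalar cleanup for the leftover tail
            c[i] = a[i] + b[i];
        }
    }
}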
SIMD instructions for MIMD CPUs are discussed briefly on pg. 649.
Sec. 7.6 GPUs: use SIMD instructions heavily.
Hardware setup (pg. A-8): the GPU is about as close to the CPU as memory is:
· GPU connects with “x16” PCI-Express, with 16 GB/s bandwidth
· memory (DDR2 = modern DRAM) connects at 667 MT/s on a 128-bit bus, i.e., 667*(128/8) MB/s = 10.7 GB/s bandwidth.
Although the GPU has its "own" memory, pg. A-9 says it's accessible from the CPU, and the GPU can read main memory too.
The GPU has "hardware threading" to help overcome memory latency.
Example: 768MB of RAM, 24x32 = 768 threads, so each thread handles 1MB of memory in simple cases.
Main job of GPU: determine screen memory from commands sent by programs (OpenGL API, for example).
Need a new screen image about every 10 ms, so a thread needs to process about 1MB of memory every 10 ms, an overall rate of 1MB/.01 sec = 100MB/sec.
By memory bandwidth: 1MB/(86.4 GB/sec) = 1/86.4 ms ≈ .012 ms, tiny compared to 10 ms. Doesn't sound hard, but the thread can't wait on memory very often: 10 ms/50 ns = 200,000 memory-latency periods per refresh, and 1MB/200,000 = 5 bytes, so even one 50 ns wait per 5 bytes read would use up the whole 10 ms.
(The total memory of 768MB takes about 8.9 ms to transfer at 86.4 GB/sec, so the bus to GPU memory is heavily loaded, and there's not much locality, so caching is not helpful. The memory transfers go on separately from the SIMD instruction executions, except for the need to wait for memory.)
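To put a number on "can't wait very often": if we only allow memory stalls to take, say, 10% of the 10 ms (my own assumption, just to illustrate), that is 1 ms/50 ns = 20,000 stalls while reading 1 MB, i.e., roughly one stall per 50 bytes; so a thread can't afford to miss on every access, and it's the hardware threading mentioned above that hides most of the latency.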
Can we harness this power for apps? Yes, "C for GPUs" (CUDA), pg. 656.
New category for such GPUs, SIMT: Single instruction multiple thread (more info in Appendix A)
Hardware figures out which threads can do a certain instruction, and makes them do it all together, while the other threads wait their turn. Very flexible setup.
Hardware threads: each thread has its own PC, just like an OS thread. But there's no "thread switching" in the same sense as in an OS, where one particular thread runs on a core at each point in time. The hardware tracks the PCs and manages multiple floating-point and integer processing units in a fine-grained manner, making progress in little runs with subsets of threads doing the same operations, until some need to wait for memory.
Back to 7.3, SMP
Here we have OS threads and parallel programs, a somewhat more ordinary environment we should all understand.
Java supports threads.
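As a warm-up, a minimal Java sketch (my own, not the course's example; the thread count of 4 and the million-element array are arbitrary) of the shared-memory style: worker threads each sum their own slice of a shared array, and the serial wrap-up joins them and combines the partial sums.

class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        final int N = 4;                              // number of worker threads
        final double[] data = new double[1_000_000];  // shared, read by all threads
        java.util.Arrays.fill(data, 1.0);

        final double[] partial = new double[N];       // one slot per thread: no locking needed
        Thread[] workers = new Thread[N];
        for (int t = 0; t < N; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                int lo = id * data.length / N, hi = (id + 1) * data.length / N;
                double s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                partial[id] = s;                      // each thread writes only its own slot
            });
            workers[t].start();
        }

        double total = 0;
        for (int t = 0; t < N; t++) {                 // serial wrap-up: join and reduce
            workers[t].join();
            total += partial[t];
        }
        System.out.println("sum = " + total);
    }
}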
Look at Mandelbrot example next time.