CMPEN 431 Computer Architecture, Fall 2019
Chapter 6B: Introduction to Message Passing Multiprocessors
Sampson, PSU, 2019 [Adapted from Computer Organization and Design, 5th Edition, Patterson & Hennessy, © 2014, MK, and Mary Jane Irwin]

Chapter 6B.2 – Review: Shared Memory Multiprocessors (SMP)
❑ Q1 – A single address space is shared by all cores
❑ Q2 – Cores coordinate/communicate through shared variables in memory (via loads and stores)
  • Use of shared data must be coordinated via synchronization primitives (locks) that allow only one core at a time to access the data
❑ SMPs come in two styles
  • Uniform memory access (UMA) multiprocessors
  • Nonuniform memory access (NUMA) multiprocessors
[Figure: cores, each with its own cache, connected through an interconnection network to a shared memory and I/O]

Chapter 6B.3 – Message Passing Multiprocessors (MPP)
❑ Each core has its own private address space
❑ Q1 – Cores share data by explicitly sending and receiving information (message passing)
❑ Q2 – Coordination is built into the message passing primitives (message send and message receive)
[Figure: cores, each with its own cache and its own private memory, connected through an interconnection network]

Chapter 6B.4 – Communication in Network Connected Multiprocessors
❑ Implicit communication via loads and stores (SMP)
  • Hardware architects have to provide coherent caches and process (thread) synchronization primitives (like ll and sc)
  • Lower communication overhead
  • Harder to overlap computation with communication
  • More efficient to use an address to fetch remote data when it is needed than to send it in case it might be needed
❑ Explicit communication via sends and receives (MPP)
  • Simplest solution for hardware architects
  • Higher communication overhead
  • Easier to overlap computation with communication
  • Easier for the programmer to optimize communication
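As a concrete illustration of the explicit style (MPI itself is summarized later in this chapter), below is a minimal, hypothetical two-process MPI program in C; the value being sent and the rank pairing are illustrative assumptions, not something from the slides.

  /* Minimal point-to-point example: rank 0 explicitly sends one double to
     rank 1, which explicitly receives it. Run with at least 2 ranks, e.g.
     mpicc send_recv.c && mpirun -np 2 ./a.out */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      double value = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 3.14;   /* illustrative payload */
          /* explicit send: destination rank, tag, and communicator are all named */
          MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          /* explicit receive: blocks until the matching send arrives */
          MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %f\n", value);
      }

      MPI_Finalize();
      return 0;
  }

Every transfer names its partner and its data explicitly, which is exactly the property that makes communication costs visible to the programmer.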
Chapter 6B.5 – SMP/MPP Example: Intel Xeon Phi Coprocessor
❑ Intel's Many Integrated Core architecture (MIC, pronounced "Mike")
  • Up to 8 coprocessors (72 cores each) per host server; SMP within a coprocessor, MPP between coprocessors
❑ Three generations: Knight's Ferry (3100), Knight's Corner (5100), Knight's Landing (7100)
  • Knight's Landing in 14nm FinFETs, 2nd+ quarter of 2015
  • 72 Atom (Silvermont) cores, static 2-way superscalar, 4 threads per core (FGMT), so 288 threads, 1.238GHz (1.33GHz Turbo mode)
  • Each core has two 512-bit vector units and supports AVX-512F SIMD instructions
  • On-chip interconnect: mesh NoC?
  • Up to 384GB of DDR4 DRAM and 16GB of stacked 3D MCDRAM
  • 3+ TeraFLOPS per coprocessor
  • TDP of 300W (estimated 15W per core, so how??)
❑ Programming tools: OpenMP, OpenCL, Cilk, and specialized versions of Intel's Fortran, C++ and scientific libraries

Chapter 6B.6 – Xeon Phi Knight's Landing
http://www.anandtech.com/show/8217/intels-knights-landing-coprocessor-detailed

Chapter 6B.7 – Summing 100,000 Numbers on a 100-Core MPP
❑ Start by distributing 1000 elements of vector A to each core's local memory and summing each subset in parallel

  sum = 0;
  for (i = 0; i < 1000; i = i + 1)
    sum = sum + Al[i];   /* sum local array subset */

❑ The cores then coordinate in adding together the sub-sums (Cn is this core's number, send(x,y) sends value y to core x, and receive() receives a value)

  half = 100;
  limit = 100;
  repeat
    half = (half+1)/2;                            /* dividing line */
    if (Cn >= half && Cn < limit) send(Cn - half, sum);
    if (Cn < (limit/2)) sum = sum + receive();
    limit = half;
  until (half == 1);                              /* final sum in C0's sum */

Chapter 6B.9 – An Example with 10 Cores
[Figure: with half = 10, cores C5 to C9 send their sums to C0 to C4, which receive and add; with limit = 5 and half = 3, C3 and C4 send to C0 and C1; with limit = 3 and half = 2, C2 sends to C0; with limit = 2 and half = 1, C1 sends to C0, which holds the final sum]
❑ The key is how long it takes to send packets back and forth across the network
  • Software protocol stack (CmpEn 362)
  • Hardware interconnection network and traffic load
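The repeat ... until pseudocode above maps almost line for line onto MPI's blocking send and receive. The sketch below is an illustrative translation, not part of the original slides; it assumes each rank has already computed its local partial sum and keeps the slide's names (Cn, half, limit); the function name tree_sum is hypothetical.

  #include <mpi.h>

  /* Tree reduction of per-rank partial sums; the result is valid on rank 0 only. */
  double tree_sum(double sum, MPI_Comm comm)
  {
      int Cn, P;
      MPI_Comm_rank(comm, &Cn);   /* this core's number, as in the slide */
      MPI_Comm_size(comm, &P);    /* number of cores (100 in the slide) */

      int half = P, limit = P;
      do {
          half = (half + 1) / 2;                    /* dividing line */
          if (Cn >= half && Cn < limit)             /* upper half sends ... */
              MPI_Send(&sum, 1, MPI_DOUBLE, Cn - half, 0, comm);
          if (Cn < limit / 2) {                     /* ... lower half receives */
              double other;
              MPI_Recv(&other, 1, MPI_DOUBLE, Cn + half, 0, comm, MPI_STATUS_IGNORE);
              sum = sum + other;
          }
          limit = half;
      } while (half > 1);                           /* final sum in C0's sum */
      return sum;
  }

In practice the same reduction is usually expressed with a single collective call, e.g. MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm), which lets the MPI library choose the communication tree.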
Chapter 6B.10 – Pros and Cons of Message Passing
❑ Message sending and receiving is much slower than, for example, addition
❑ But message passing multiprocessors are much easier for hardware architects to design
  • Don't have to worry about cache coherency, for example
❑ The advantage for programmers is that communication is explicit, so there are fewer "performance surprises" than with the implicit communication in cache-coherent SMPs
  • Message passing standard MPI-2.2 (www.mpi-forum.org)
❑ However, it's harder to port a sequential program to a message passing multiprocessor, since every communication must be identified in advance
  • With cache-coherent shared memory the hardware figures out what data needs to be communicated

Chapter 6B.11 – Aside: Quick Summary of MPI
❑ The MPI Standard describes
  • point-to-point message passing
  • collective communications
  • group and communicator concepts
  • process topologies
  • environmental management
  • process creation and management
  • one-sided communications
  • extended collective operations
  • external interfaces
  • I/O functions
  • a profiling interface
❑ Language bindings for C, C++ and Fortran are defined
  • http://www.mpi-forum.org/docs/docs.html

Chapter 6B.12 – Concurrency and Parallelism
❑ Programs are designed to be sequential or concurrent
  • Sequential – only one activity, behaving in the "usual" way
  • Concurrent – multiple, simultaneous activities, designed as independent operations or as cooperating threads or processes
    – The various parts of a concurrent program need not execute simultaneously, or in a particular sequence, but they do need to coordinate their activities by exchanging information in some way
❑ A key challenge is to build parallel (concurrent) programs that have high performance on multiprocessors as the number of cores increases – programs that scale. Problems that arise:
  • Scheduling threads on cores close to the memory space where their data primarily resides
  • Load balancing threads across cores and dealing with thermal hot-spots
  • Time for synchronization of threads
  • Overhead for communication between threads

Chapter 6B.15 – Encountering Amdahl's Law
❑ Speedup due to enhancement E:

  Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

❑ Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:

  ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
  Speedup w/ E = 1 / ((1-F) + F/S)

Chapter 6B.17 – Example 1: Amdahl's Law
  (Recall: Speedup w/ E = 1 / ((1-F) + F/S))
❑ Consider an enhancement that runs 20 times faster but is only usable 25% of the time
  Speedup w/ E = 1/(.75 + .25/20) = 1.31
❑ What if it's usable only 15% of the time?
  Speedup w/ E = 1/(.85 + .15/20) = 1.17
❑ Amdahl's Law tells us that to achieve linear speedup with 100 cores (so 100 times faster), none of the original computation can be scalar!
❑ To get a speedup of 90 from 100 cores, the percentage of the original program that could be scalar would have to be 0.1% or less
  Speedup w/ E = 1/(.001 + .999/100) = 90.99

Chapter 6B.19 – Example 2: Amdahl's Law
  (Recall: Speedup w/ E = 1 / ((1-F) + F/S))
❑ Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 cores
  Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
❑ What if there are 100 cores?
  Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
❑ What if the matrices are 100 by 100 (10,010 adds in total) on 10 cores?
  Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
❑ What if there are 100 cores?
  Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91
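The arithmetic in both examples can be checked by evaluating the slide's formula directly; the short C program below is only a worked check, not something from the slides.

  #include <stdio.h>

  /* Amdahl's Law: Speedup w/ E = 1 / ((1-F) + F/S) */
  static double speedup(double F, double S)
  {
      return 1.0 / ((1.0 - F) + F / S);
  }

  int main(void)
  {
      /* Example 1: enhancement is 20x faster, usable 25% then 15% of the time */
      printf("%.2f\n", speedup(0.25, 20));             /* 1.31 */
      printf("%.2f\n", speedup(0.15, 20));             /* 1.17 */

      /* Example 2: 10 scalar adds (serial) plus matrix adds (parallel)       */
      /* 10x10 matrices: F = 100/110 of the work is parallelizable            */
      printf("%.1f\n", speedup(100.0/110.0, 10));      /* 5.5  */
      printf("%.1f\n", speedup(100.0/110.0, 100));     /* 10.0 */
      /* 100x100 matrices: F = 10000/10010                                    */
      printf("%.1f\n", speedup(10000.0/10010.0, 10));  /* 9.9  */
      printf("%.1f\n", speedup(10000.0/10010.0, 100)); /* 91.0 */
      return 0;
  }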
Chapter 6B.20 – Multiprocessor Scaling
❑ Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem
  • Strong scaling – when good speedup is achieved on a multiprocessor without increasing the size of the problem
  • Weak scaling – when good speedup is achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of cores and the total size of memory
❑ But Amdahl was an optimist – you will probably need extra time to patch together the parts of the computation that were done in parallel

Chapter 6B.21 – Multiprocessor Benchmarks

  Benchmark          Scaling         Reprogram?          Description
  LINPACK            Weak            Yes                 Dense matrix linear algebra (http://www.top500.org/project/linpack/)
  SPECrate           Weak            No                  Parallel SPEC programs for job-level parallelism
  SPLASH 2           Strong          No                  Independent job parallelism (both kernels and applications, from high-performance computing)
  NAS Parallel       Weak            Yes (C or Fortran)  Five kernels, mostly from computational fluid dynamics
  PARSEC             Weak            No                  Multithreaded programs that use Pthreads and OpenMP; 9 applications and 3 kernels – 8 with data parallelism, 3 with pipelined parallelism
  Berkeley Patterns  Strong or Weak  Yes                 13 design patterns implemented by frameworks or kernels

Chapter 6B.22 – DGEMM (Double precision GEneral Matrix Multiply) Example
❑ DGEMM: a BLAS (Basic Linear Algebra Subprograms) routine; the part of LINPACK used for performance measurements
  C = C + A * B

  void dgemm (int n, double* A, double* B, double* C)
  {
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j) {
        double cij = C[i+j*n];          /* cij = C[i][j] */
        for (int k = 0; k < n; ++k)
          cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
        C[i+j*n] = cij;                 /* C[i][j] = cij */
      }
  }

Chapter 6B.23 – Multithreaded, Blocked OpenMP DGEMM
❑ The #pragma omp directive makes the outermost for loop operate in parallel; do_block computes one BLOCKSIZE by BLOCKSIZE block (a sketch of it follows below)

  void dgemm (int n, double* A, double* B, double* C)
  {
  #pragma omp parallel for
    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
      for ( int si = 0; si < n; si += BLOCKSIZE )
        for ( int sk = 0; sk < n; sk += BLOCKSIZE )
          do_block(n, si, sj, sk, A, B, C);
  }
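The do_block helper is not shown on the slide; the version below is a minimal sketch consistent with the unblocked dgemm above (same column-major layout, element [i][j] at offset i + j*n), assuming BLOCKSIZE divides n. Define or declare it before the OpenMP dgemm so the call compiles.

  #define BLOCKSIZE 32

  /* Compute one BLOCKSIZE x BLOCKSIZE block of C += A * B, starting at
     row si and column sj, using the k-range [sk, sk+BLOCKSIZE). */
  static void do_block(int n, int si, int sj, int sk,
                       double* A, double* B, double* C)
  {
      for (int i = si; i < si + BLOCKSIZE; ++i)
          for (int j = sj; j < sj + BLOCKSIZE; ++j) {
              double cij = C[i + j*n];              /* cij = C[i][j] */
              for (int k = sk; k < sk + BLOCKSIZE; ++k)
                  cij += A[i + k*n] * B[k + j*n];   /* cij += A[i][k] * B[k][j] */
              C[i + j*n] = cij;                     /* C[i][j] = cij */
          }
  }

Blocking keeps a working set of roughly three BLOCKSIZE by BLOCKSIZE tiles in the cache, which is what gives the blocked version its locality advantage over the plain triple loop.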
Chapter 6B.24 – DGEMM Scaling: Thread Count, Matrix Size

Chapter 6B.25 – Multiprocessor Basics

  # of Cores
  • Communication model: Message passing 8 to 2048+; SMP NUMA 8 to 256+; SMP UMA 2 to 32
  • Physical connection: Network 8 to 256+; Bus 2 to 8

❑ Q1 – How do they share data?
  • A single physical address space shared by all cores, or message passing
❑ Q2 – How do they coordinate?
  • Through atomic operations on shared variables in memory (via loads and stores), or via message passing
❑ Q3 – How scalable is the architecture? How many cores?

Chapter 6B.29 – Multiple Types of Parallelism: DGEMM Using AVX
❑ Taking advantage of subword parallelism (AVX), instruction-level parallelism (compiler loop unrolling), and caches (matrix blocking)
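The slide's figure with the AVX code is not reproduced here, so the following is only a sketch of what the subword-parallel inner loop can look like with 256-bit AVX intrinsics: each __m256d register holds 4 doubles, i.e. 4 adjacent elements of one column of C. It assumes n is a multiple of 4 and a compiler flag such as -mavx; the function name dgemm_avx is hypothetical.

  #include <immintrin.h>

  void dgemm_avx(int n, double* A, double* B, double* C)
  {
      for (int i = 0; i < n; i += 4)                     /* 4 rows of C at a time */
          for (int j = 0; j < n; ++j) {
              __m256d c0 = _mm256_loadu_pd(C + i + j*n); /* c0 = C[i..i+3][j] */
              for (int k = 0; k < n; ++k)
                  /* c0 += A[i..i+3][k] * B[k][j], with B[k][j] replicated 4 times */
                  c0 = _mm256_add_pd(c0,
                           _mm256_mul_pd(_mm256_loadu_pd(A + i + k*n),
                                         _mm256_broadcast_sd(B + k + j*n)));
              _mm256_storeu_pd(C + i + j*n, c0);         /* C[i..i+3][j] = c0 */
          }
  }

Unrolling the k loop (instruction-level parallelism) and calling this from the blocked, OpenMP-parallel driver (caches and MIMD) combines all of the types of parallelism the slide lists.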
Chapter 6B.34 – Speedup Versus Parallelism
❑ For x86 processors
❑ Assumptions
  • MIMD: 2 cores per chip added every two years
  • SIMD: SIMD widths (registers and ALUs) double every four years

Chapter 6B.36 – GPU Architectures
❑ GPUs are highly multithreaded (data-parallel)
  • Use thread switching to hide memory latency, with less reliance on multi-level caches
  • Graphics memory is wide and high-bandwidth
❑ The trend is toward general purpose GPUs
  • Heterogeneous CPU/GPU systems
  • CPU for sequential code, GPU for parallel code
❑ Programming languages/APIs
  • DirectX, OpenGL
  • C for Graphics (Cg), High Level Shader Language (HLSL)
  • NVIDIA's Compute Unified Device Architecture (CUDA)

Chapter 6B.37 – NVIDIA Tesla
[Figure: 14 streaming multiprocessors, each with 8 streaming processors]

Chapter 6B.38 – Apple's A6 Processor (2 CPUs, 3 GPUs)

Chapter 6B.39 – Intel's Broadwell-L
❑ "Tick" node at 14nm FinFET
  • 82mm² die
  • 4.5W TDP
❑ Haswell's "tick" partner
  • IPC 5% faster
  • Larger ROB
  • Larger BP tables
  • Larger L2 TLB (1K to 1.5K entries)
  • New L2 TLB for large pages (16 entries for 1GB pages)
(From MPR, Sep 2014)

Chapter 6B.40 – Graphics Performance
(From MPR, Sep 2014)

Chapter 6B.41 – Intel's Gen8 GPU Architecture
(From MPR, Sep 2014)

Chapter 6B.46 – Loosely Coupled Clusters
❑ Multiple off-the-shelf computers (each with its own private address space and OS) connected via a local area network (LAN), functioning as a "single" multiprocessor
  • Search engines, Web servers, email servers, databases, ...
  • The current trend is toward grid or cloud computing, using wide- or global-area networks, perhaps borrowing from other participants, or selling a service as a subset of an existing corporate network
❑ The N copies of the OS limit the memory space available to applications
❑ Improved system availability and expandability
  • Easy to replace a machine without bringing down the whole system
  • Allows rapid, incremental expandability
❑ Economy-of-scale advantages with respect to costs

Chapter 6B.48 – Top500 Architecture Styles
❑ Uniprocessor – one core only (on the Top500 list, a vector processor)
❑ SIMD – one control core, many compute units
❑ SMP – Shared Memory multiProcessors
❑ MPP – Message Passing multiProcessors
❑ Constellation – a collection of network-connected SMPs
❑ Clusters – collections of independent, network-connected PCs (commodity processors and memory), with either
  • Separate (not coherent) address spaces, communicating via message passing over a commodity network, or
  • A single (coherent) address space, communicating via direct inter-node memory access over a custom inter-node network

Chapter 6B.49 – Top500 Architecture Styles, 1993-2009
❑ Uniprocessors and SIMDs disappeared while Clusters and Constellations grew from <3% to 80%. Now it's 99% Clusters and MPPs.
[Figure: stacked counts of the 500 systems by architecture style (Cluster, Constellation, SIMD, MPP, SMP, Uniprocessor) for each year from 1993 to 2009; Nov 2009 data from http://www.top500.org/]

Chapter 6B.50 – From the Top500 Lists and Earlier ...
❑ First Megaflop (10^6) system – CDC 7600, 1971
  • In today's vocabulary, a superscalar processor (http://en.wikipedia.org/wiki/CDC_7600)
❑ First Gigaflop (10^9) system – Cray-2, 1986
  • A vector processor, 4 CPUs, 2 GB memory (http://en.wikipedia.org/wiki/Cray-2)
❑ First Teraflop (10^12) system – Intel ASCI Red, built for Sandia National Laboratories, Albuquerque, 1996
  • 7,264 Pentium Pro processors (200 MHz), later increased to 9,632 Pentium II OverDrive processors (333 MHz)
❑ First Petaflop (10^15) system – IBM Roadrunner, built for Los Alamos National Laboratory, 2008
  • 12,960 IBM Cell + 6,480 AMD Opteron processors, all multicore (http://en.wikipedia.org/wiki/IBM_Roadrunner)
❑ First Exaflop (10^18) system – 2020? Oak Ridge National Laboratory performed a calculation at 1.8×10^18 operations per second (which is not the same as 1.8×10^18 flops) on the Summit OLCF-4 supercomputer
❑ Top this year – https://www.top500.org/lists/2019/06/
