**************************************** Module 2: Parallel Hardware and Software **************************************** Serial System ============= The traditional von Neumann architecture ---------------------------------------- * CPU - Control unit - Arithmetic logic unit (ALU) * Memory * Input/Output (I/O) Operating system ---------------- * Process - Multitasking * Thread Cache ----- * Definition: A cache is a smaller, faster type of volatile computer memory that provides high-speed data access to the processor and stores frequently used computer programs, applications, and data. * Locality - Temporal locality + A memory location that is referenced once is likely to be referenced again soon. - Spatial locality + A memory location near the location that is referenced once is likely to be referenced again soon. * Leveled - L1, L2, L3, ... from faster/smaller to slower/larger * Cache line * Cache address mapping * Cache hit/miss * Cache mapping - Direct mapping (temporal locality) + Select one cache line to store in the cache + A line of data can be stored in any cache line - Full associative mapping (spatial locality) + Select a block around the line of data and store the whole block in the cache + The whole cache is occupied by the block + A line of data can only be stored in one cache line - Set-associative mapping (both) + Split the cache into several sets + Select a block around the line of data and store the whole block in the cache + A set is occupied by the block + A line of data can only be stored in one cache line per set. Thus, there are the number of sets possible way to store the line of data. * Cache eviction policy Virtual memory -------------- * Use secondary storage as an extension of main memory * Page - Fixed length block * Swapping - unused page moved to secondary storage * Virtual address * Page table - mapping from virtual address to physical address - Translation lookaside buffer (TLB) - cache for page table * Page fault - page not in memory Instruction level parallelism (ILP) ----------------------------------- Internal feature of a CPU to allow parallelism. Hardware dependant. * pipeline - unique subunit * multiple issue - replicated subunits Thread level parallelism (TLP) ------------------------------ Parallel Hardware ================= + Type of parallelism * data parallelism * task parallelism Flynn's taxonomy ---------------- + Single instruction single data stream (SISD) * classic von Neumann + Single instruction multiple data (SIMD) * Vector processor - Dedicated vector processor - General CPUs with vector instructions + Intel SIMD intrinsics: MMX, SSE, AVX * GPUs single core (Graphical Processing Unit) + Multiple instruction multiple data stream (MIMD) * Types - shared-memory - distributed memory Shared-memory SIMD ------------------ + Memory access * UMA (Uniform Memory Access) * NUMA (Non-Uniform Memory Access) + interconnections * bus * switch - crossbar Distributed-memory SIMD ----------------------- + interconnections * direct (nodes are connected by a single switch) - fully connected - hypercube * indirect (nodes are connected by multiple switches) - crossbar - omega network * Time to send a message - latency - preparation time - bandwidth - transfer time - :math:`time = latency + size/bandwidth` * bisect width/bandwidth Cache coherence in shared-memory system --------------------------------------- + Solutions * snooping * directory + false sharing - different part of a cache line is shared Parallel Software ================= + Focus on Single program multiple data (SPMD) programs * A single program is executed by multiple processes/threads/nodes * Each process/thread/node executes the same program * Each process/thread/node has its own data Shared-memory system -------------------- + Threads * Dynamic threads - fork-join model * Static threads + Communicating by I/O to the shared memory * nondeterminism - racing condition - critical section - mutual exclusion lock (MUTEX) + Input/Output * Always have one thread for stdin * Usually have one thread for stdout, stderr + Thread-safety + Libraries * POSIX thread (fork-join model) * OpenMP Distributed-memory system ------------------------- + Communicating by message passing + The Message-Passing Interface (MPI) standard * MPICH * OpenMPI Performance Model ================= Variables --------- + Serial run-time :math:`T_{serial}` + Parallel run-time :math:`T_{parallel}` + Number of processes/threads/nodes :math:`p` + Problem size :math:`N` Metrics ------- + Speedup :math:`S = T_{serial}/T_{parallel}` + Efficiency :math:`E = \frac{S}{p} = \frac{T_{serial}}{p \cdot T_{parallel}}` Relationships ------------- + Ideal * :math:`S = T_{serial}/T_{parallel} = p` + With overhead * :math:`T_{parallel} = T_{serial}/p + T_{overhead}` * :math:`S = T_{serial}/T_{parallel} < p` + Amdahl's law * Let :math:`f` be the fraction of the program that can be parallelized * Then the speedup :math:`S = \frac{1}{(1-f) + \frac{f}{p}}` * Diminishing return .. image:: /_static/amdahls_law.png :alt: Amdahl's law :width: 400px .. container:: footnote credit: https://en.wikipedia.org/wiki/Amdahl%27s_law + Scalability * strong scaling - Constant :math:`E` with increasing :math:`p` for a fixed :math:`N` - plot: + :math:`E` vs :math:`P` + fixed :math:`N` + Expect nearly flat line * weak - Constant :math:`E` with increasing :math:`p` for a fixed :math:`N/p` ratio - plot: + :math:`E` vs :math:`P` + fixed :math:`N/p` + Expect nearly flat line + Timing Parallel Program Design ======================= + Foster's methodology #. Partition #. Communication #. Agglomeration #. Mapping + Histogram example * code example available Addition Content ================ .. toctree:: :maxdepth: 2 :caption: Contents: intel-intrinsics profiling