Parallel and High Performance Computing

Bibliographic Details
Main author: Robey, Robert
Other authors: Zamora, Yuliana
Format: Electronic book
Language: English
Published: New York : Manning Publications Co. LLC, 2021.
View at the Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009633549506719
Table of Contents:
  • Intro
  • Parallel and High Performance Computing
  • Copyright
  • Dedication
  • contents
  • front matter
  • foreword
  • Yulie Zamora, University of Chicago, Illinois
  • How we came to write this book
  • acknowledgments
  • about this book
  • Who should read this book
  • Part 1 Introduction to parallel computing
  • 1 Why parallel computing?
  • 1.1 Why should you learn about parallel computing?
  • 1.1.1 What are the potential benefits of parallel computing?
  • 1.1.2 Parallel computing cautions
  • 1.2 The fundamental laws of parallel computing
  • 1.2.1 The limit to parallel computing: Amdahl's Law
  • 1.2.2 Breaking through the parallel limit: Gustafson-Barsis's Law
  • 1.3 How does parallel computing work?
  • 1.3.1 Walking through a sample application
  • 1.3.2 A hardware model for today's heterogeneous parallel systems
  • 1.3.3 The application/software model for today's heterogeneous parallel systems
  • 1.4 Categorizing parallel approaches
  • 1.5 Parallel strategies
  • 1.6 Parallel speedup versus comparative speedups: Two different measures
  • 1.7 What will you learn in this book?
  • 1.7.1 Additional reading
  • 1.7.2 Exercises
  • Summary
  • 2 Planning for parallelization
  • 2.1 Approaching a new project: The preparation
  • 2.1.1 Version control: Creating a safety vault for your parallel code
  • 2.1.2 Test suites: The first step to creating a robust, reliable application
  • 2.1.3 Finding and fixing memory issues
  • 2.1.4 Improving code portability
  • 2.2 Profiling: Probing the gap between system capabilities and application performance
  • 2.3 Planning: A foundation for success
  • 2.3.1 Exploring with benchmarks and mini-apps
  • 2.3.2 Design of the core data structures and code modularity
  • 2.3.3 Algorithms: Redesign for parallel
  • 2.4 Implementation: Where it all happens
  • 2.5 Commit: Wrapping it up with quality
  • 2.6 Further explorations
  • 2.6.1 Additional reading
  • 2.6.2 Exercises
  • Summary
  • 3 Performance limits and profiling
  • 3.1 Know your application's potential performance limits
  • 3.2 Determine your hardware capabilities: Benchmarking
  • 3.2.1 Tools for gathering system characteristics
  • 3.2.2 Calculating theoretical maximum flops
  • 3.2.3 The memory hierarchy and theoretical memory bandwidth
  • 3.2.4 Empirical measurement of bandwidth and flops
  • 3.2.5 Calculating the machine balance between flops and bandwidth
  • 3.3 Characterizing your application: Profiling
  • 3.3.1 Profiling tools
  • 3.3.2 Empirical measurement of processor clock frequency and energy consumption
  • 3.3.3 Tracking memory during run time
  • 3.4 Further explorations
  • 3.4.1 Additional reading
  • 3.4.2 Exercises
  • Summary
  • 4 Data design and performance models
  • 4.1 Performance data structures: Data-oriented design
  • 4.1.1 Multidimensional arrays
  • 4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
  • 4.1.3 Array of Structures of Arrays (AoSoA)
  • 4.2 Three Cs of cache misses: Compulsory, capacity, conflict
  • 4.3 Simple performance models: A case study
  • 4.3.1 Full matrix data representations
  • 4.3.2 Compressed sparse storage representations
  • 4.4 Advanced performance models
  • 4.5 Network messages
  • 4.6 Further explorations
  • 4.6.1 Additional reading
  • 4.6.2 Exercises
  • Summary
  • 5 Parallel algorithms and patterns
  • 5.1 Algorithm analysis for parallel computing applications
  • 5.2 Performance models versus algorithmic complexity
  • 5.3 Parallel algorithms: What are they?
  • 5.4 What is a hash function?
  • 5.5 Spatial hashing: A highly-parallel algorithm
  • 5.5.1 Using perfect hashing for spatial mesh operations
  • 5.5.2 Using compact hashing for spatial mesh operations
  • 5.6 Prefix sum (scan) pattern and its importance in parallel computing
  • 5.6.1 Step-efficient parallel scan operation
  • 5.6.2 Work-efficient parallel scan operation
  • 5.6.3 Parallel scan operations for large arrays
  • 5.7 Parallel global sum: Addressing the problem of associativity
  • 5.8 Future of parallel algorithm research
  • 5.9 Further explorations
  • 5.9.1 Additional reading
  • 5.9.2 Exercises
  • Summary
  • Part 2 CPU: The parallel workhorse
  • 6 Vectorization: FLOPs for free
  • 6.1 Vectorization and single instruction, multiple data (SIMD) overview
  • 6.2 Hardware trends for vectorization
  • 6.3 Vectorization methods
  • 6.3.1 Optimized libraries provide performance for little effort
  • 6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
  • 6.3.3 Teaching the compiler through hints: Pragmas and directives
  • 6.3.4 Crappy loops, we got them: Use vector intrinsics
  • 6.3.5 Not for the faint of heart: Using assembler code for vectorization
  • 6.4 Programming style for better vectorization
  • 6.5 Compiler flags relevant for vectorization for various compilers
  • 6.6 OpenMP SIMD directives for better portability
  • 6.7 Further explorations
  • 6.7.1 Additional reading
  • 6.7.2 Exercises
  • Summary
  • 7 OpenMP that performs
  • 7.1 OpenMP introduction
  • 7.1.1 OpenMP concepts
  • 7.1.2 A simple OpenMP program
  • 7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
  • 7.2.1 Loop-level OpenMP for quick parallelization
  • 7.2.2 High-level OpenMP for better parallel performance
  • 7.2.3 MPI plus OpenMP for extreme scalability
  • 7.3 Examples of standard loop-level OpenMP
  • 7.3.1 Loop-level OpenMP: Vector addition example
  • 7.3.2 Stream triad example
  • 7.3.3 Loop-level OpenMP: Stencil example
  • 7.3.4 Performance of loop-level examples
  • 7.3.5 Reduction example of a global sum using OpenMP threading
  • 7.3.6 Potential loop-level OpenMP issues
  • 7.4 Variable scope importance for correctness in OpenMP
  • 7.5 Function-level OpenMP: Making a whole function thread parallel
  • 7.6 Improving parallel scalability with high-level OpenMP
  • 7.6.1 How to implement high-level OpenMP
  • 7.6.2 Example of implementing high-level OpenMP
  • 7.7 Hybrid threading and vectorization with OpenMP
  • 7.8 Advanced examples using OpenMP
  • 7.8.1 Stencil example with a separate pass for the x and y directions
  • 7.8.2 Kahan summation implementation with OpenMP threading
  • 7.8.3 Threaded implementation of the prefix scan algorithm
  • 7.9 Threading tools essential for robust implementations
  • 7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
  • 7.9.2 Finding your thread race conditions with Intel® Inspector
  • 7.10 Example of a task-based support algorithm
  • 7.11 Further explorations
  • 7.11.1 Additional reading
  • 7.11.2 Exercises
  • Summary
  • 8 MPI: The parallel backbone
  • 8.1 The basics for an MPI program
  • 8.1.1 Basic MPI function calls for every MPI program
  • 8.1.2 Compiler wrappers for simpler MPI programs
  • 8.1.3 Using parallel startup commands
  • 8.1.4 Minimum working example of an MPI program
  • 8.2 The send and receive commands for process-to-process communication
  • 8.3 Collective communication: A powerful component of MPI
  • 8.3.1 Using a barrier to synchronize timers
  • 8.3.2 Using the broadcast to handle small file input
  • 8.3.3 Using a reduction to get a single value from across all processes
  • 8.3.4 Using gather to put order in debug printouts
  • 8.3.5 Using scatter and gather to send data out to processes for work
  • 8.4 Data parallel examples
  • 8.4.1 Stream triad to measure bandwidth on the node
  • 8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
  • 8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
  • 8.5 Advanced MPI functionality to simplify code and enable optimizations
  • 8.5.1 Using custom MPI data types for performance and code simplification
  • 8.5.2 Cartesian topology support in MPI
  • 8.5.3 Performance tests of ghost cell exchange variants
  • 8.6 Hybrid MPI plus OpenMP for extreme scalability
  • 8.6.1 The benefits of hybrid MPI plus OpenMP
  • 8.6.2 MPI plus OpenMP example
  • 8.7 Further explorations
  • 8.7.1 Additional reading
  • 8.7.2 Exercises
  • Summary
  • Part 3 GPUs: Built to accelerate
  • 9 GPU architectures and concepts
  • 9.1 The CPU-GPU system as an accelerated computational platform
  • 9.1.1 Integrated GPUs: An underused option on commodity-based systems
  • 9.1.2 Dedicated GPUs: The workhorse option
  • 9.2 The GPU and the thread engine
  • 9.2.1 The compute unit is the streaming multiprocessor (or subslice)
  • 9.2.2 Processing elements are the individual processors
  • 9.2.3 Multiple data operations by each processing element
  • 9.2.4 Calculating the peak theoretical flops for some leading GPUs
  • 9.3 Characteristics of GPU memory spaces
  • 9.3.1 Calculating theoretical peak memory bandwidth
  • 9.3.2 Measuring the GPU stream benchmark
  • 9.3.3 Roofline performance model for GPUs
  • 9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
  • 9.4 The PCI bus: CPU to GPU data transfer overhead
  • 9.4.1 Theoretical bandwidth of the PCI bus
  • 9.4.2 A benchmark application for PCI bandwidth
  • 9.5 Multi-GPU platforms and MPI
  • 9.5.1 Optimizing the data movement between GPUs across the network
  • 9.5.2 A higher performance alternative to the PCI bus
  • 9.6 Potential benefits of GPU-accelerated platforms
  • 9.6.1 Reducing time-to-solution
  • 9.6.2 Reducing energy use with GPUs
  • 9.6.3 Reduction in cloud computing costs with GPUs
  • 9.7 When to use GPUs
  • 9.8 Further explorations
  • 9.8.1 Additional reading
  • 9.8.2 Exercises