Multicore and GPU Programming: An Integrated Approach
Multicore and GPU Programming offers broad coverage of the key parallel computing skillsets: multicore CPU programming and manycore "massively parallel" computing. Using threads, OpenMP, MPI, and CUDA, it teaches the design and development of software capable of taking advantage of today’s...
| Field | Value |
|---|---|
| Other Authors: | |
| Format: | Electronic book |
| Language: | English |
| Published: | Amsterdam : Morgan Kaufmann, [2015] |
| Edition: | First edition |
| Subjects: | |
| View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629823706719 |
Table of Contents:
- Front Cover
- Multicore and GPU Programming: An Integrated Approach
- Copyright
- Dedication
- Contents
- List of Tables
- Preface
- What Is in This Book
- Using This Book as a Textbook
- Software and Hardware Requirements
- Sample Code
- Chapter 1: Introduction
- 1.1 The era of multicore machines
- 1.2 A taxonomy of parallel machines
- 1.3 A glimpse of contemporary computing machines
- 1.3.1 The Cell BE processor
- 1.3.2 Nvidia's Kepler
- 1.3.3 AMD's APUs
- 1.3.4 Multicore to many-core: Tilera's TILE-Gx8072 and Intel's Xeon Phi
- 1.4 Performance metrics
- 1.5 Predicting and measuring parallel program performance
- 1.5.1 Amdahl's law
- 1.5.2 Gustafson-Barsis's rebuttal
- Exercises
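
For orientation, the two performance laws listed under Section 1.5 are conventionally stated as follows. This is the standard textbook notation, not necessarily the book's own symbols.

```latex
% Amdahl's law: speedup on N processors when a fraction f of the
% sequential run time can be parallelized.
S_A(N) = \frac{1}{(1 - f) + \frac{f}{N}}

% Gustafson-Barsis's scaled speedup, where s is the serial fraction
% measured on the N-processor run (the workload grows with N).
S_G(N) = N - (N - 1)\, s
```
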
- Chapter 2: Multicore and parallel program design
- 2.1 Introduction
- 2.2 The PCAM methodology
- 2.3 Decomposition patterns
- 2.3.1 Task parallelism
- 2.3.2 Divide-and-conquer decomposition
- 2.3.3 Geometric decomposition
- 2.3.4 Recursive data decomposition
- 2.3.5 Pipeline decomposition
- 2.3.6 Event-based coordination decomposition
- 2.4 Program structure patterns
- 2.4.1 Single-program, multiple-data
- 2.4.2 Multiple-program, multiple-data
- 2.4.3 Master-worker
- 2.4.4 Map-reduce
- 2.4.5 Fork/join
- 2.4.6 Loop parallelism
- 2.5 Matching decomposition patterns with program structure patterns
- Exercises
- Chapter 3: Shared-memory programming: threads
- 3.1 Introduction
- 3.2 Threads
- 3.2.1 What is a thread?
- 3.2.2 What are threads good for?
- 3.2.3 Thread creation and initialization
- 3.2.3.1 Implicit thread creation
- 3.2.4 Sharing data between threads
- 3.3 Design concerns
- 3.4 Semaphores
- 3.5 Applying semaphores in classical problems
- 3.5.1 Producers-consumers
- 3.5.2 Dealing with termination
- 3.5.2.1 Termination using a shared data item
- 3.5.2.2 Termination using messages
- 3.5.3 The barbershop problem: introducing fairness
- 3.5.4 Readers-writers
- 3.5.4.1 A solution favoring the readers
- 3.5.4.2 Giving priority to the writers
- 3.5.4.3 A fair solution
- 3.6 Monitors
- 3.6.1 Design approach 1: critical section inside the monitor
- 3.6.2 Design approach 2: monitor controls entry to critical section
- 3.7 Applying monitors in classical problems
- 3.7.1 Producers-consumers revisited
- 3.7.1.1 Producers-consumers: buffer manipulation within the monitor
- 3.7.1.2 Producers-consumers: buffer insertion/extraction exterior to the monitor
- 3.7.2 Readers-writers
- 3.7.2.1 A solution favoring the readers
- 3.7.2.2 Giving priority to the writers
- 3.7.2.3 A fair solution
- 3.8 Dynamic vs. static thread management
- 3.8.1 Qt's thread pool
- 3.8.2 Creating and managing a pool of threads
- 3.9 Debugging multithreaded applications
- 3.10 Higher-level constructs: multithreaded programming without threads
- 3.10.1 Concurrent map
- 3.10.2 Map-reduce
- 3.10.3 Concurrent filter
- 3.10.4 Filter-reduce
- 3.10.5 A case study: multithreaded sorting
- 3.10.6 A case study: multithreaded image matching
- Exercises
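
To give a flavor of the producers-consumers material listed above (Sections 3.5 and 3.7), here is a minimal monitor-style bounded buffer written with standard C++ threads. It is an illustrative sketch only, not code from the book, whose own examples are built around semaphores, monitors, and Qt's threading classes (cf. Section 3.8.1 and Appendix A); all identifiers are mine.

```cpp
// Illustrative sketch: a monitor-style bounded buffer with one producer and
// one consumer, using standard C++ threads. Not code from the book.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> buffer;                 // shared bounded buffer
const std::size_t CAPACITY = 8;
std::mutex m;
std::condition_variable notFull, notEmpty;

void producer(int count) {
    for (int i = 0; i < count; ++i) {
        std::unique_lock<std::mutex> lock(m);
        notFull.wait(lock, [] { return buffer.size() < CAPACITY; });
        buffer.push(i);                 // deposit an item
        notEmpty.notify_one();
    }
}

void consumer(int count) {
    for (int i = 0; i < count; ++i) {
        std::unique_lock<std::mutex> lock(m);
        notEmpty.wait(lock, [] { return !buffer.empty(); });
        int item = buffer.front();      // extract an item
        buffer.pop();
        notFull.notify_one();
        lock.unlock();
        std::printf("consumed %d\n", item);
    }
}

int main() {
    const int N = 100;                  // fixed item count handles termination
    std::thread p(producer, N), c(consumer, N);
    p.join();
    c.join();
    return 0;
}
```
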
- Chapter 4: Shared-memory programming: OpenMP
- 4.1 Introduction
- 4.2 Your First OpenMP Program
- 4.3 Variable Scope
- 4.3.1 OpenMP Integration V.0: Manual Partitioning
- 4.3.2 OpenMP Integration V.1: Manual Partitioning Without a Race Condition
- 4.3.3 OpenMP Integration V.2: Implicit Partitioning with Locking
- 4.3.4 OpenMP Integration V.3: Implicit Partitioning with Reduction
- 4.3.5 Final Words on Variable Scope
- 4.4 Loop-Level Parallelism
- 4.4.1 Data Dependencies
- 4.4.1.1 Flow Dependencies
- 4.4.1.2 Antidependencies
- 4.4.1.3 Output Dependencies
- 4.4.2 Nested Loops
- 4.4.3 Scheduling
- 4.5 Task Parallelism
- 4.5.1 The sections Directive
- 4.5.1.1 Producers-Consumers in OpenMP
- 4.5.2 The task Directive
- 4.6 Synchronization Constructs
- 4.7 Correctness and Optimization Issues
- 4.7.1 Thread Safety
- 4.7.2 False Sharing
- 4.8 A Case Study: Sorting in OpenMP
- 4.8.1 Bottom-Up Mergesort in OpenMP
- 4.8.2 Top-Down Mergesort in OpenMP
- 4.8.3 Performance Comparison
- Exercises
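
As a taste of the loop-level parallelism and reduction topics listed above (Sections 4.3.4 and 4.4), here is a minimal OpenMP parallel sum. It is an illustrative sketch, not code from the book; the data and identifiers are mine.

```cpp
// Illustrative sketch: a parallel sum using an OpenMP reduction clause.
// Compile with, e.g., g++ -fopenmp. Not code from the book.
#include <cstdio>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<double> data(N, 0.5);   // dummy input data

    double sum = 0.0;
    // Each thread accumulates a private partial sum; OpenMP combines them.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += data[i];

    std::printf("sum = %f\n", sum);
    return 0;
}
```
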
- Chapter 5: Distributed memory programming
- 5.1 Communicating Processes
- 5.2 MPI
- 5.3 Core concepts
- 5.4 Your first MPI program
- 5.5 Program architecture
- 5.5.1 SPMD
- 5.5.2 MPMD
- 5.6 Point-to-Point communication
- 5.7 Alternative Point-to-Point communication modes
- 5.7.1 Buffered Communications
- 5.8 Non-blocking communications
- 5.9 Point-to-Point Communications: Summary
- 5.10 Error reporting and handling
- 5.11 Collective communications
- 5.11.1 Scattering
- 5.11.2 Gathering
- 5.11.3 Reduction
- 5.11.4 All-to-All Gathering
- 5.11.5 All-to-All Scattering
- 5.11.6 All-to-All Reduction
- 5.11.7 Global Synchronization
- 5.12 Communicating objects
- 5.12.1 Derived Datatypes
- 5.12.2 Packing/Unpacking
- 5.13 Node management: communicators and groups
- 5.13.1 Creating Groups
- 5.13.2 Creating Intra-Communicators
- 5.14 One-sided communications
- 5.14.1 RMA Communication Functions
- 5.14.2 RMA Synchronization Functions
- 5.15 I/O considerations
- 5.16 Combining MPI processes with threads
- 5.17 Timing and Performance Measurements
- 5.18 Debugging and profiling MPI programs
- 5.19 The Boost.MPI library
- 5.19.1 Blocking and Non-Blocking Communications
- 5.19.2 Data Serialization
- 5.19.3 Collective Operations
- 5.20 A case study: diffusion-limited aggregation
- 5.21 A case study: brute-force encryption cracking
- 5.21.1 Version #1: "plain-vanilla" MPI
- 5.21.2 Version #2: combining MPI and OpenMP
- 5.22 A Case Study: MPI Implementation of the Master-Worker Pattern
- 5.22.1 A Simple Master-Worker Setup
- 5.22.2 A Multithreaded Master-Worker Setup
- Exercises
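
For a flavor of the point-to-point and collective material listed above (e.g., Sections 5.4 and 5.11.3), here is a minimal SPMD program in which every rank contributes one integer to a sum reduced onto rank 0. It is an illustrative sketch, not code from the book; identifiers are mine.

```cpp
// Illustrative sketch: a minimal MPI reduction. Build with mpicxx and
// run with mpirun -np 4 ./a.out (for example). Not code from the book.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int contribution = rank;            // each rank's value
    int total = 0;
    // Collective reduction: rank 0 receives the sum of all contributions.
    MPI_Reduce(&contribution, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("Sum of ranks 0..%d = %d\n", size - 1, total);

    MPI_Finalize();
    return 0;
}
```
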
- Chapter 6: GPU programming
- 6.1 GPU Programming
- 6.2 CUDA's programming model: threads, blocks, and grids
- 6.3 CUDA's execution model: streaming multiprocessors and warps
- 6.4 CUDA compilation process
- 6.5 Putting together a CUDA project
- 6.6 Memory hierarchy
- 6.6.1 Local Memory/Registers
- 6.6.2 Shared Memory
- 6.6.3 Constant Memory
- 6.6.4 Texture and Surface Memory
- 6.7 Optimization techniques
- 6.7.1 Block and Grid Design
- 6.7.2 Kernel Structure
- 6.7.3 Shared Memory Access
- 6.7.4 Global Memory Access
- 6.7.5 Page-Locked and Zero-Copy Memory
- 6.7.6 Unified Memory
- 6.7.7 Asynchronous Execution and Streams
- 6.7.7.1 Stream Synchronization: Events and Callbacks
- 6.8 Dynamic parallelism
- 6.9 Debugging CUDA programs
- 6.10 Profiling CUDA programs
- 6.11 CUDA and MPI
- 6.12 Case studies
- 6.12.1 Fractal Set Calculation
- 6.12.1.1 Version #1: One thread per pixel
- 6.12.1.2 Version #2: Pinned host and pitched device memory
- 6.12.1.3 Version #3: Multiple pixels per thread
- 6.12.1.4 Evaluation
- 6.12.2 Block Cipher Encryption
- 6.12.2.1 Version #1: The case of a standalone GPU machine
- 6.12.2.2 Version #2: Overlapping GPU communication and computation
- 6.12.2.3 Version #3: Using a cluster of GPU machines
- 6.12.2.4 Evaluation
- Exercises
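
As a flavor of the thread/block/grid and memory-hierarchy topics listed above (Sections 6.2, 6.6, and 6.7.6), here is the canonical one-thread-per-element vector-addition kernel using managed memory. It is an illustrative CUDA C++ sketch, not code from the book; identifiers and sizes are mine.

```cuda
// Illustrative sketch: vector addition with one thread per element,
// using unified (managed) memory. Not code from the book.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);       // managed memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int block = 256;
    const int grid = (N + block - 1) / block;  // enough blocks to cover N
    vecAdd<<<grid, block>>>(a, b, c, N);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```
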
- Chapter 7: The Thrust template library
- 7.1 Introduction
- 7.2 First steps in Thrust
- 7.3 Working with Thrust datatypes
- 7.4 Thrust algorithms
- 7.4.1 Transformations
- 7.4.2 Sorting and searching
- 7.4.3 Reductions
- 7.4.4 Scans/prefix sums
- 7.4.5 Data management and manipulation
- 7.5 Fancy iterators
- 7.6 Switching device back ends
- 7.7 Case studies
- 7.7.1 Monte Carlo integration
- 7.7.2 DNA Sequence alignment
- Exercises
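
To illustrate the container-and-algorithm style covered above (Sections 7.3-7.4), here is a minimal Thrust sketch that sorts and reduces a device vector. It is illustrative only, not code from the book; the data is made up.

```cpp
// Illustrative sketch: sorting and reducing on the device with Thrust.
// Build with nvcc. Not code from the book.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

int main() {
    thrust::host_vector<int> h(1000);
    for (int i = 0; i < (int)h.size(); ++i)
        h[i] = ((int)h.size() - i) % 37;          // some unsorted data

    thrust::device_vector<int> d = h;             // copy host -> device
    thrust::sort(d.begin(), d.end());             // parallel sort on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel reduction

    std::printf("min = %d, sum = %d\n", (int)d[0], sum);
    return 0;
}
```
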
- Chapter 8: Load balancing
- 8.1 Introduction
- 8.2 Dynamic load balancing: the Linda legacy
- 8.3 Static Load Balancing: The Divisible Load Theory Approach
- 8.3.1 Modeling Costs
- 8.3.2 Communication Configuration
- 8.3.3 Analysis
- 8.3.3.1 N-Port, Block-Type, Single-Installment Solution
- 8.3.3.2 One-Port, Block-Type, Single-Installment Solution
- 8.3.4 Summary - Short Literature Review
- 8.4 DLTlib: A library for partitioning workloads
- 8.5 Case studies
- 8.5.1 Hybrid Computation of a Mandelbrot Set "Movie": A Case Study in Dynamic Load Balancing
- 8.5.2 Distributed Block Cipher Encryption: A Case Study in Static Load Balancing
- Appendix A: Compiling Qt programs
- A.1 Using an IDE
- A.2 The qmake Utility
- Appendix B: Running MPI programs
- B.1 Preparatory Steps
- B.2 Computing Nodes Discovery for MPI Program Deployment
- B.2.1 Host Discovery with the nmap Utility
- B.2.2 Automatic Generation of a Hostfile
- Appendix C: Time measurement
- C.1 Introduction
- C.2 POSIX High-Resolution Timing
- C.3 Timing in Qt
- C.4 Timing in OpenMP
- C.5 Timing in MPI
- C.6 Timing in CUDA
- Appendix D: Boost.MPI
- D.1 Mapping from MPI C to Boost.MPI
- Appendix E: Setting up CUDA
- E.1 Installation
- E.2 Issues with GCC
- E.3 Running CUDA without an Nvidia GPU
- E.4 Running CUDA on Optimus-Equipped Laptops
- E.5 Combining CUDA with Third-Party Libraries
- Appendix F: DLTlib
- F.1 DLTlib Functions
- F.1.1 Class Network: Generic Methods
- F.1.2 Class Network: Query Processing
- F.1.3 Class Network: Image Processing
- F.1.4 Class Network: Image Registration
- F.2 DLTlib Files
- Glossary
- Bibliography
- Index