CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming

CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presu...
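
As a taste of the programming model the book teaches (see section 1.3.1, "A first CUDA Fortran program"), here is a minimal, illustrative sketch, not taken from the book, of a CUDA Fortran program that increments an array on the GPU. Module and variable names are hypothetical, and such code would typically be built with the NVIDIA HPC SDK compiler (e.g., nvfortran -cuda incr.cuf):

    module simple_ops_m
    contains
      ! Kernel: each thread increments one array element by b
      attributes(global) subroutine increment(a, b)
        implicit none
        integer, intent(inout) :: a(:)   ! array resides in device memory
        integer, value :: b              ! scalar passed by value from the host
        integer :: i
        i = threadIdx%x                  ! one thread per element (single block)
        a(i) = a(i) + b
      end subroutine increment
    end module simple_ops_m

    program main
      use cudafor
      use simple_ops_m
      implicit none
      integer, parameter :: n = 256
      integer :: a(n), b
      integer, device :: a_d(n)          ! device-resident copy of a

      a = 1
      b = 3
      a_d = a                            ! host-to-device transfer via assignment
      call increment<<<1, n>>>(a_d, b)   ! launch one block of n threads
      a = a_d                            ! device-to-host transfer
      print *, merge('Test Passed', 'Test Failed', all(a == 1 + b))
    end program main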


Bibliographic Details
Other Authors: Ruetsch, Gregory (author); Fatica, Massimiliano (author)
Format: eBook
Language: English
Published: Cambridge, MA : Morgan Kaufmann, [2024]
Edition: Second edition
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009840465706719
Table of Contents:
  • Front Cover
  • CUDA Fortran for Scientists and Engineers
  • Copyright
  • Contents
  • Preface to the Second Edition
  • Preface to the First Edition
  • Acknowledgments
  • 1 CUDA Fortran programming
  • 1 Introduction
  • 1.1 A brief history of GPU computing
  • 1.2 Parallel computation
  • 1.3 Basic concepts
  • 1.3.1 A first CUDA Fortran program
  • 1.3.1.1 CUDA Fortran compilation
  • 1.3.2 Extending to larger arrays
  • 1.3.3 Multidimensional arrays
  • 1.3.4 Interfaces for device code
  • 1.3.5 Managed data
  • 1.3.6 Kernel loop directives and CUF kernels
  • 1.4 Determining CUDA hardware features and limits
  • 1.4.1 Choosing a device to run on
  • 1.4.2 Floating point precision
  • 1.4.2.1 Accommodating variable precision
  • 1.5 Error handling
  • 1.6 Compiling CUDA Fortran code
  • 1.7 CUDA Driver, Toolkit, and compatibility
  • 2 Correctness, accuracy, and debugging
  • 2.1 Assessing correctness of results
  • 2.1.1 Non-associativity of floating point arithmetic
  • 2.1.2 Fused-multiply add
  • 2.1.3 Flags affecting floating-point accuracy
  • 2.2 Debugging
  • 2.2.1 Printing from device code
  • 2.2.2 Debugging with cuda-gdb
  • 2.2.2.1 System requirements
  • 2.2.2.2 Compilation
  • 2.2.2.3 Setting breakpoints
  • 2.2.2.4 Focus - software and hardware coordinates
  • 2.2.2.5 CUDA activity status
  • 2.2.2.6 Single-stepping in device code
  • 2.2.2.7 Examining program state
  • 2.2.3 compute-sanitizer
  • 3 Performance measurement and metrics
  • 3.1 Measuring execution time
  • 3.1.1 Host-device synchronization and CPU timers
  • 3.1.2 Timing via CUDA events
  • 3.1.3 Nsight Systems command-line interface nsys
  • 3.1.3.1 Nsight Systems graphical user interface nsys-ui
  • 3.1.4 Customizing profiling with nvtx
  • 3.1.4.1 Basic NVTX tooling interfaces
  • 3.1.4.2 Advanced NVTX tooling interfaces
  • 3.1.4.3 Automated NVTX instrumentation
  • 3.2 Instruction, bandwidth, and latency bound kernels
  • 3.3 Memory bandwidth
  • 3.3.1 Theoretical peak bandwidth
  • 3.3.2 Effective bandwidth
  • 3.3.3 Actual data throughput vs. effective bandwidth
  • 4 Synchronization
  • 4.1 Synchronization of kernel execution and data transfers
  • 4.1.1 Pageable versus pinned host memory
  • 4.1.2 Streams
  • 4.1.2.1 Creating streams
  • 4.1.3 Asynchronous transfers via cudaMemcpyAsync()
  • 4.1.4 Synchronization barriers
  • 4.1.4.1 cudaDeviceSynchronize()
  • 4.1.4.2 cudaStreamSynchronize()
  • 4.1.4.3 cudaEventSynchronize()
  • 4.1.4.4 Querying streams and events
  • 4.1.5 Advanced stream topics
  • 4.1.5.1 The default stream
  • 4.1.5.2 Non-blocking streams
  • 4.1.5.3 Stream priorities
  • 4.2 Synchronization of kernel threads on the device
  • 4.2.1 Shared memory
  • 4.2.2 Synchronizing threads within a block
  • 4.2.3 Warps, warp synchronization, and warp-level primitives
  • 4.2.3.1 SHFL functions
  • 4.2.4 Atomics
  • 4.2.5 Memory fences
  • 4.2.6 Cooperative groups
  • 4.2.6.1 Grid synchronization
  • 4.2.6.2 Thread block clusters
  • 5 Optimization
  • 5.1 Transfers between host and device
  • 5.1.1 Pinned memory
  • 5.2 Device memory
  • 5.2.1 ECC (Error Correcting Code)
  • 5.2.2 Global memory
  • 5.2.2.1 Declaring global array arguments in kernels
  • 5.2.2.2 Coalesced global memory access
  • 5.2.3 Local memory
  • 5.2.4 Constant memory
  • 5.2.5 L1 and L2 caches
  • 5.2.6 Shared memory
  • 5.2.6.1 Configuring shared memory
  • 5.2.6.2 Global memory coalescing through shared memory
  • 5.2.6.3 Shared memory bank conflicts
  • 5.2.7 Registers
  • 5.2.7.1 Exchanging register data between threads in a warp
  • 5.3 Execution configuration
  • 5.3.1 Thread-level parallelism
  • 5.3.2 Instruction-level parallelism
  • 5.3.2.1 Asynchronous data transfers between global and shared memory
  • 5.3.2.2 Instruction-level parallelism in CUF kernels
  • 5.4 Instruction optimization
  • 5.4.1 Device intrinsics
  • 5.4.1.1 Directed rounding
  • 5.4.1.2 C intrinsics
  • 5.4.1.3 Fast math intrinsics
  • 5.4.1.4 Compiler options
  • 5.4.2 Divergent warps
  • 6 Porting tips and techniques
  • 6.1 CUF kernels
  • 6.2 Conditional inclusion of code
  • 6.3 Renaming variables
  • 6.3.1 Renaming via use statements
  • 6.3.2 Renaming via the associate construct
  • 6.4 Minimizing memory footprint for work arrays
  • 6.5 Array compaction
  • 7 Interfacing with CUDA C code and CUDA libraries
  • 7.1 Calling user-written CUDA C code
  • 7.1.1 The ignore_tkr directive
  • 7.2 cuBLAS
  • 7.2.1 Legacy cuBLAS API
  • 7.2.2 New cuBLAS API
  • 7.2.3 Batched cuBLAS routines
  • 7.2.4 GEMM with tensor cores
  • 7.3 cuSPARSE
  • 7.4 cuSOLVER
  • 7.5 cuTENSOR
  • 7.5.1 Low-level cuTENSOR interfaces
  • 7.6 Thrust
  • 8 Multi-GPU programming
  • 8.1 CUDA multi-GPU features
  • 8.1.1 Peer-to-peer communication
  • 8.1.1.1 Requirements for peer-to-peer communication
  • 8.1.2 Peer-to-peer direct transfers
  • 8.1.3 Peer-to-peer transpose
  • 8.2 Multi-GPU programming with MPI
  • 8.2.1 Assigning devices to MPI ranks
  • 8.2.2 MPI transpose
  • 8.2.3 GPU-aware MPI transpose
  • 2 Case studies
  • 9 Monte Carlo method
  • 9.1 CURAND
  • 9.2 Computing π with CUF kernels
  • 9.2.1 IEEE-754 precision
  • 9.3 Computing π with reduction kernels
  • 9.3.1 Reductions with SHFL instructions
  • 9.3.2 Reductions with atomic locks
  • 9.3.3 Reductions using the grid_group cooperative group
  • 9.4 Accuracy of summation
  • 9.5 Option pricing
  • 10 Finite difference method
  • 10.1 Nine-point 1D finite difference stencil
  • 10.1.1 Data reuse and shared memory
  • 10.1.2 The x-derivative kernel
  • 10.1.2.1 Performance of the x-derivative kernel
  • 10.1.3 Derivatives in y and z
  • 10.1.4 Nonuniform grids
  • 10.2 2D Laplace equation
  • 11 Applications of the fast Fourier transform
  • 11.1 CUFFT
  • 11.2 Spectral derivatives
  • 11.3 Convolution
  • 11.4 Poisson solver
  • 11.4.1 Vortex dynamics
  • 12 Ray tracing
  • 12.1 Generating an image file
  • PPM file format
  • Implementation
  • 12.2 Vectors in CUDA Fortran
  • Implementation of the RGB module
  • 12.3 Rays, a simple camera, and background
  • First ray-tracing code
  • 12.4 Adding a sphere
  • Theory
  • Implementation
  • 12.5 Surface normals and multiple objects
  • Surface normals
  • Multiple objects
  • 12.6 Antialiasing
  • Implementation
  • CUDA implementation
  • 12.7 Material types
  • 12.7.1 Diffuse materials
  • Implementation
  • Random number generation
  • Modifications to the color() routine
  • 12.7.2 Metal
  • Theory
  • Implementation
  • Material derived types
  • Initialization of spheres and material types
  • The scatter() function and Cray pointers
  • 12.7.3 Dielectrics
  • Theory
  • Implementation
  • 12.8 Positionable camera
  • 12.9 Defocus blur
  • Implementation
  • 12.10 Where next?
  • 12.11 Triangles
  • Theory
  • Implementation
  • 12.12 Lights
  • Implementation
  • 12.13 Textures
  • Theory
  • Implementation
  • 3 Appendices
  • A System and environment management
  • A.1 Environment variables
  • A.1.1 General
  • A.1.2 Just-in-time compilation
  • A.2 nvidia-smi - System Management Interface
  • A.2.1 Enabling and disabling ECC
  • A.2.2 Compute mode
  • A.2.3 Persistence mode
  • A.2.4 Topology
  • References
  • Index
  • Back Cover