Advances in GPU Research and Practice
Advances in GPU Research and Practice focuses on research and practice in GPU-based systems. The topics covered range from hardware and architectural issues to high-level concerns such as application systems, parallel programming, middleware, and power and energy.
Other Authors: |
Format: | Electronic book |
Language: | English |
Published: | Amsterdam : Elsevier, [2017] |
Edition: | First edition |
Series: | Emerging trends in computer science & applied computing |
Subjects: |
View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630332406719 |
Table of Contents:
- Front Cover
- Advances in GPU Research and Practice
- Copyright
- Dedication
- Contents
- List of Contributors
- Preface
- Acknowledgments
- Part 1: Programming and tools
- Chapter 1: Formal analysis techniques for reliable GPU programming: current solutions and call to action
- 1 GPUs in Support of Parallel Computing
- Bugs in parallel and GPU code
- 2 A quick introduction to GPUs
- Organization of threads
- Memory spaces
- Barrier synchronization
- Warps and lock-step execution
- Dot product example
- 3 Correctness issues in GPU programming
- Data races
- Lack of forward progress guarantees
- Floating-point accuracy
- 4 The need for effective tools
- 4.1 A Taxonomy of Current Tools
- 4.2 Canonical Schedules and the Two-Thread Reduction
- Race freedom implies determinism
- Detecting races: "all for one and one for all"
- Restricting to a canonical schedule
- Reduction to a pair of threads
- 4.3 Symbolic Bug-Finding Case Study: GKLEE
- 4.4 Verification Case Study: GPUVerify
- 5 Call to Action
- GPUs will become more pervasive
- Current tools show promise
- Solving basic correctness issues
- Equivalence checking
- Clarity from vendors and standards bodies
- User validation of tools
- Acknowledgments
- References
- Chapter 2: SnuCL: A unified OpenCL framework for heterogeneous clusters
- 1 Introduction
- 2 OpenCL
- 2.1 Platform Model
- 2.2 Execution Model
- 2.3 Memory Model
- 2.4 Synchronization
- 2.5 Memory Consistency
- 2.6 OpenCL ICD
- 3 Overview of SnuCL framework
- 3.1 Limitations of OpenCL
- 3.2 SnuCL CPU
- 3.3 SnuCL Single
- 3.4 SnuCL Cluster
- 3.4.1 Processing synchronization commands
- 4 Memory management in SnuCL Cluster
- 4.1 Space Allocation to Memory Objects
- 4.2 Minimizing Copying Overhead
- 4.3 Processing Memory Commands
- 4.4 Consistency Management
- 4.5 Detecting Memory Objects Written by a Kernel
- 5 SnuCL extensions to OpenCL
- 6 Performance evaluation
- 6.1 Evaluation Methodology
- 6.2 Performance
- 6.2.1 Scalability on the medium-scale GPU cluster
- 6.2.2 Scalability on the large-scale CPU cluster
- 7 Conclusions
- Acknowledgments
- References
- Chapter 3: Thread communication and synchronization on massively parallel GPUs
- 1 Introduction
- 2 Coarse-Grained Communication and Synchronization
- 2.1 Global Barrier at the Kernel Level
- 2.2 Local Barrier at the Work-Group Level
- 2.3 Implicit Barrier at the Wavefront Level
- 3 Built-In Atomic Functions on Regular Variables
- 4 Fine-Grained Communication and Synchronization
- 4.1 Memory Consistency Model
- 4.1.1 Sequential consistency
- 4.1.2 Relaxed consistency
- 4.2 The OpenCL 2.0 Memory Model
- 4.2.1 Relationships between two memory operations
- 4.2.2 Special atomic operations and stand-alone memory fence
- 4.2.3 Release and acquire semantics
- 4.2.4 Memory order parameters
- 4.2.5 Memory scope parameters
- 5 Conclusion and Future Research Direction
- References
- Chapter 4: Software-level task scheduling on GPUs
- 1 Introduction, Problem Statement, and Context
- 2 Nondeterministic behaviors caused by the hardware
- 2.1 P1: Greedy
- 2.2 P2: Not Round-Robin
- 2.3 P3: Nondeterministic Across Runs
- 2.4 P4: Oblivious to Nonuniform Data Sharing
- 2.5 P5: Serializing Multikernel Co-Runs
- 3 SM-centric transformation
- 3.1 Core Ideas
- 3.1.1 SM-centric task selection
- Correctness issues
- 3.1.2 Filling-retreating scheme
- 3.2 Implementation
- 3.2.1 Details
- 4 Scheduling-enabled optimizations
- 4.1 Parallelism Control
- 4.2 Affinity-Based Scheduling
- 4.2.1 Evaluation
- 4.3 SM Partitioning for Multi-Kernel Co-Runs
- 4.3.1 Evaluation
- 5 Other scheduling work on GPUs
- 5.1 Software Approaches
- 5.2 Hardware Approaches
- 6 Conclusion and future work
- Acknowledgments
- References
- Chapter 5: Data placement on GPUs
- 1 Introduction
- 2 Overview
- 3 Memory specification through MSL
- 4 Compiler support
- 4.1 Deriving Data Access Patterns: A Hybrid Approach
- 4.1.1 Reuse distance model
- 4.1.2 Staging code to be agnostic to placement
- 5 Runtime support
- 5.1 Lightweight Performance Model
- 5.2 Finding the Best Placements
- 5.2.1 Dealing with phases
- 6 Results
- 6.1 Results With Irregular Benchmarks
- 6.2 Results With Regular Benchmarks
- 7 Related Work
- 8 Summary
- References
- Part 2: Algorithms and applications
- Chapter 6: Biological sequence analysis on GPU
- 1 Introduction
- 2 Pairwise Sequence Comparison and Sequence-Profile Comparison
- 2.1 Pairwise Sequence Comparison
- 2.1.1 Smith-Waterman algorithm
- Phase 1: Create the similarity matrix
- Phase 2: Obtain the best local alignment
- Smith-Waterman variations
- 2.1.2 Basic Local Alignment Tool
- Phase 1: Seeding
- Phase 2: Extension
- Phase 3: Evaluation
- BLAST
- 2.2 Sequence-Profile Comparison
- 2.2.1 Hidden Markov models
- 2.2.2 The Viterbi algorithm
- 2.2.3 The MSV algorithm
- 3 Design Aspects of GPU Solutions for Biological Sequence Analysis
- 3.1 Task-Parallelism vs Data-Parallelism
- 3.2 Sorting Sequence Database to Achieve Load Balance
- 3.3 Use of GPU Memory Hierarchy
- 3.4 GPU Solution Used as a Filter
- 4 GPU Solutions for Pairwise Sequence Comparison
- 4.1 GPU Solutions Using Exact Algorithms
- 4.1.1 Manavski and Valle [27]
- 4.1.2 Ligowski and Rudnicki [38]
- 4.1.3 Blazewicz et al. [29]
- 4.1.4 Li et al. [52]
- 4.1.5 Ino et al. [28,30]
- 4.1.6 Liu et al. [45-47]
- 4.1.7 Sandes and de Melo [39-41] and Sandes et al. [42]
- 4.1.8 Comparative overview
- 4.2 GPU Solutions Using BLAST
- 4.2.1 Vouzis and Sahinidis [31]
- 4.2.2 Liu et al. [48]
- 4.2.3 Zhang et al. [43]
- 4.2.4 Comparative overview
- 5 GPU Solutions for Sequence-Profile Comparison
- 5.1 GPU Solutions Using the Viterbi Algorithm
- 5.1.1 Horn et al. [32]
- 5.1.2 Du et al. [44]
- 5.1.3 Walters et al. [33]
- 5.1.4 Yao et al. [34]
- 5.1.5 Ganesan et al. [49]
- 5.1.6 Ferraz and Moreano [36]
- 5.2 GPU Solutions Using the MSV Algorithm
- 5.2.1 Li et al. [35]
- 5.2.2 Cheng and Butler [37]
- 5.2.3 Araújo Neto and Moreano [50]
- 5.3 Comparative Overview
- 6 Conclusion and Perspectives
- References
- Chapter 7: Graph algorithms on GPUs
- 1 Graph representation for GPUs
- 1.1 Adjacency Matrices
- 1.2 Adjacency Lists
- 1.3 Edge Lists
- 2 Graph traversal algorithms: the breadth first search (BFS)
- 2.1 The Frontier-Based Parallel Implementation of BFS
- 2.2 BFS-4K
- 3 The single-source shortest path (SSSP) problem
- 3.1 The SSSP Implementations for GPUs
- 3.2 H-BF: An Efficient Implementation of the Bellman-Ford Algorithm
- 4 The APSP problem
- 4.1 The APSP Implementations for GPUs
- 5 Load Balancing and Memory Accesses: Issues and Management Techniques
- 5.1 Static Mapping Techniques
- 5.1.1 Work-items to threads
- 5.1.2 Virtual warps
- 5.2 Semidynamic Mapping Techniques
- 5.2.1 Dynamic virtual warps + dynamic parallelism
- 5.2.2 CTA + warp + scan
- 5.3 Dynamic Mapping Techniques
- 5.3.1 Direct search
- 5.3.2 Local warp search
- 5.3.3 Block search
- 5.3.4 Two-phase search
- 5.4 The Multiphase Search Technique
- 5.5 Coalesced Expansion
- 5.6 Iterated Searches
- References
- Chapter 8: GPU alignment of two and three sequences
- 1 Introduction
- 1.1 Pairwise alignment
- 1.2 Alignment of Three Sequences
- 2 GPU architecture
- 3 Pairwise alignment
- 3.1 Smith-Waterman Algorithm
- 3.2 Computing the Score of the Best Local Alignment
- 3.3 Computing the Best Local Alignment
- 3.3.1 StripedAlignment
- 3.3.2 ChunkedAlignment1
- 3.3.3 ChunkedAlignment2
- 3.3.4 Memory requirements
- 3.4 Experimental Results
- StripedScore
- StripedAlignment
- ChunkedAlignment1
- 4 Alignment of three sequences
- 4.1 Three-Sequence Alignment Algorithm
- 4.2 Computing the Score of the Best Alignment
- 4.2.1 Layered algorithm
- GPU computational strategy
- Analysis
- 4.2.2 Sloped algorithm
- GPU computational strategy
- Analysis
- 4.3 Computing the Best Alignment
- 4.3.1 Layered-BT1
- 4.3.2 Layered-BT2
- 4.3.3 Layered-BT3
- 4.4 Experimental Results
- 4.4.1 Computing the score of the best alignment
- 4.4.2 Computing the alignment
- 5 Conclusion
- References
- Chapter 9: Augmented block Cimmino distributed algorithm for solving tridiagonal systems on GPU
- 1 Introduction
- 2 ABCD Solver for tridiagonal systems
- 3 GPU implementation and optimization
- 3.1 QR Method and Givens Rotation
- 3.2 Sparse Storage Format
- 3.3 Coalesced Memory Access
- 3.4 Boundary Padding
- 3.4.1 Padding of the augmented matrix
- 3.4.2 Padding for Givens rotation
- 4 Performance evaluation
- 4.1 Comparison With CPU Implementation
- 4.2 Speedup by Memory Coalescing
- 4.3 Boundary Padding
- 5 Conclusion and future work
- References
- Chapter 10: GPU computing applied to linear and mixed-integer programming
- 1 Introduction
- 2 Operations Research in Practice
- 3 Exact Optimization Algorithms
- 3.1 The Simplex Method
- 3.2 Dynamic Programming
- 3.2.1 Knapsack problems
- 3.2.2 Multiple-choice knapsack problem
- 3.3 Branch-and-Bound
- 3.3.1 Knapsack problem
- 3.3.2 Flow-shop scheduling problem
- 3.3.3 Traveling salesman problem
- 4 Metaheuristics
- 4.1 Genetic Algorithms
- 4.1.1 The traveling salesman problem
- 4.1.2 Scheduling problems
- 4.1.3 Knapsack problems
- 4.2 Ant Colony Optimization