Advances in GPU Research and Practice
Advances in GPU Research and Practice focuses on research and practice in GPU-based systems. The topics covered range from hardware and architectural issues to high-level concerns such as application systems, parallel programming, middleware, and power and energy.
Other Authors: |
Format: | Electronic book |
Language: | English |
Published: | Amsterdam : Elsevier, [2017] |
Edition: | First edition |
Series: | Emerging trends in computer science & applied computing |
Subjects: |
View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630332406719 |
Table of Contents:
- Front Cover
- Advances in GPU Research and Practice
- Copyright
- Dedication
- Contents
- List of Contributors
- Preface
- Acknowledgments
- Part 1: Programming and tools
- Chapter 1: Formal analysis techniques for reliable GPU programming: current solutions and call to action
- 1 GPUs in Support of Parallel Computing
- Bugs in parallel and GPU code
- 2 A quick introduction to GPUs
- Organization of threads
- Memory spaces
- Barrier synchronization
- Warps and lock-step execution
- Dot product example
- 3 Correctness issues in GPU programming
- Data races
- Lack of forward progress guarantees
- Floating-point accuracy
- 4 The need for effective tools
- 4.1 A Taxonomy of Current Tools
- 4.2 Canonical Schedules and the Two-Thread Reduction
- Race freedom implies determinism
- Detecting races: "all for one and one for all"
- Restricting to a canonical schedule
- Reduction to a pair of threads
- 4.3 Symbolic Bug-Finding Case Study: GKLEE
- 4.4 Verification Case Study: GPUVerify
- 5 Call to Action
- GPUs will become more pervasive
- Current tools show promise
- Solving basic correctness issues
- Equivalence checking
- Clarity from vendors and standards bodies
- User validation of tools
- Acknowledgments
- References
- Chapter 2: SnuCL: A unified OpenCL framework for heterogeneous clusters
- 1 Introduction
- 2 OpenCL
- 2.1 Platform Model
- 2.2 Execution Model
- 2.3 Memory Model
- 2.4 Synchronization
- 2.5 Memory Consistency
- 2.6 OpenCL ICD
- 3 Overview of SnuCL framework
- 3.1 Limitations of OpenCL
- 3.2 SnuCL CPU
- 3.3 SnuCL Single
- 3.4 SnuCL Cluster
- 3.4.1 Processing synchronization commands
- 4 Memory management in SnuCL Cluster
- 4.1 Space Allocation to Memory Objects
- 4.2 Minimizing Copying Overhead
- 4.3 Processing Memory Commands
- 4.4 Consistency Management
- 4.5 Detecting Memory Objects Written by a Kernel
- 5 SnuCL extensions to OpenCL
- 6 Performance evaluation
- 6.1 Evaluation Methodology
- 6.2 Performance
- 6.2.1 Scalability on the medium-scale GPU cluster
- 6.2.2 Scalability on the large-scale CPU cluster
- 7 Conclusions
- Acknowledgments
- References
- Chapter 3: Thread communication and synchronization on massively parallel GPUs
- 1 Introduction
- 2 Coarse-Grained Communication and Synchronization
- 2.1 Global Barrier at the Kernel Level
- 2.2 Local Barrier at the Work-Group Level
- 2.3 Implicit Barrier at the Wavefront Level
- 3 Built-In Atomic Functions on Regular Variables
- 4 Fine-Grained Communication and Synchronization
- 4.1 Memory Consistency Model
- 4.1.1 Sequential consistency
- 4.1.2 Relaxed consistency
- 4.2 The OpenCL 2.0 Memory Model
- 4.2.1 Relationships between two memory operations
- 4.2.2 Special atomic operations and stand-alone memory fence
- 4.2.3 Release and acquire semantics
- 4.2.4 Memory order parameters
- 4.2.5 Memory scope parameters
- 5 Conclusion and Future Research Direction
- References
- Chapter 4: Software-level task scheduling on GPUs
- 1 Introduction, Problem Statement, and Context
- 2 Nondeterministic behaviors caused by the hardware
- 2.1 P1: Greedy
- 2.2 P2: Not Round-Robin
- 2.3 P3: Nondeterministic Across Runs
- 2.4 P4: Oblivious to Nonuniform Data Sharing
- 2.5 P5: Serializing Multikernel Co-Runs
- 3 SM-centric transformation
- 3.1 Core Ideas
- 3.1.1 SM-centric task selection
- Correctness issues
- 3.1.2 Filling-retreating scheme
- 3.2 Implementation
- 3.2.1 Details
- 4 Scheduling-enabled optimizations
- 4.1 Parallelism Control
- 4.2 Affinity-Based Scheduling
- 4.2.1 Evaluation
- 4.3 SM Partitioning for Multi-Kernel Co-Runs
- 4.3.1 Evaluation
- 5 Other scheduling work on GPUs
- 5.1 Software Approaches
- 5.2 Hardware Approaches
- 6 Conclusion and future work
- Acknowledgments
- References
- Chapter 5: Data placement on GPUs
- 1 Introduction
- 2 Overview
- 3 Memory specification through MSL
- 4 Compiler support
- 4.1 Deriving Data Access Patterns: A Hybrid Approach
- 4.1.1 Reuse distance model
- 4.1.2 Staging code to be agnostic to placement
- 5 Runtime support
- 5.1 Lightweight Performance Model
- 5.2 Finding the Best Placements
- 5.2.1 Dealing with phases
- 6 Results
- 6.1 Results With Irregular Benchmarks
- 6.2 Results With Regular Benchmarks
- 7 Related Work
- 8 Summary
- References
- Part 2: Algorithms and applications
- Chapter 6: Biological sequence analysis on GPU
- 1 Introduction
- 2 Pairwise Sequence Comparison and Sequence-Profile Comparison
- 2.1 Pairwise Sequence Comparison
- 2.1.1 Smith-Waterman algorithm
- Phase 1: Create the similarity matrix
- Phase 2: Obtain the best local alignment
- Smith-Waterman variations
- 2.1.2 Basic Local Alignment Tool
- Phase 1: Seeding
- Phase 2: Extension
- Phase 3: Evaluation
- BLAST
- 2.2 Sequence-Profile Comparison
- 2.2.1 Hidden Markov models
- 2.2.2 The Viterbi algorithm
- 2.2.3 The MSV algorithm
- 3 Design Aspects of GPU Solutions for Biological Sequence Analysis
- 3.1 Task-Parallelism vs Data-Parallelism
- 3.2 Sorting Sequence Database to Achieve Load Balance
- 3.3 Use of GPU Memory Hierarchy
- 3.4 GPU Solution Used as a Filter
- 4 GPU Solutions for Pairwise Sequence Comparison
- 4.1 GPU Solutions Using Exact Algorithms
- 4.1.1 Manavski and Valle [27]
- 4.1.2 Ligowski and Rudnicki [38]
- 4.1.3 Blazewicz et al. [29]
- 4.1.4 Li et al. [52]
- 4.1.5 Ino et al. [28,30]
- 4.1.6 Liu et al. [45-47]
- 4.1.7 Sandes and de Melo [39-41] and Sandes et al. [42]
- 4.1.8 Comparative overview
- 4.2 GPU Solutions Using BLAST
- 4.2.1 Vouzis and Sahinidis [31]
- 4.2.2 Liu et al. [48]
- 4.2.3 Zhang et al. [43]
- 4.2.4 Comparative overview
- 5 GPU Solutions for Sequence-Profile Comparison
- 5.1 GPU Solutions Using the Viterbi Algorithm
- 5.1.1 Horn et al. [32]
- 5.1.2 Du et al. [44]
- 5.1.3 Walters et al. [33]
- 5.1.4 Yao et al. [34]
- 5.1.5 Ganesan et al. [49]
- 5.1.6 Ferraz and Moreano [36]
- 5.2 GPU Solutions Using the MSV Algorithm
- 5.2.1 Li et al. [35]
- 5.2.2 Cheng and Butler [37]
- 5.2.3 Araújo Neto and Moreano [50]
- 5.3 Comparative Overview
- 6 Conclusion and Perspectives
- References
- Chapter 7: Graph algorithms on GPUs
- 1 Graph representation for GPUs
- 1.1 Adjacency Matrices
- 1.2 Adjacency Lists
- 1.3 Edge Lists
- 2 Graph traversal algorithms: the breadth first search (BFS)
- 2.1 The Frontier-Based Parallel Implementation of BFS
- 2.2 BFS-4K
- 3 The single-source shortest path (SSSP) problem
- 3.1 The SSSP Implementations for GPUs
- 3.2 H-BF: An Efficient Implementation of the Bellman-Ford Algorithm
- 4 The APSP problem
- 4.1 The APSP Implementations for GPUs
- 5 Load Balancing and Memory Accesses: Issues and Management Techniques
- 5.1 Static Mapping Techniques
- 5.1.1 Work-items to threads
- 5.1.2 Virtual warps
- 5.2 Semidynamic Mapping Techniques
- 5.2.1 Dynamic virtual warps + dynamic parallelism
- 5.2.2 CTA + warp + scan
- 5.3 Dynamic Mapping Techniques
- 5.3.1 Direct search
- 5.3.2 Local warp search
- 5.3.3 Block search
- 5.3.4 Two-phase search
- 5.4 The Multiphase Search Technique
- 5.5 Coalesced Expansion
- 5.6 Iterated Searches
- References
- Chapter 8: GPU alignment of two and three sequences
- 1 Introduction
- 1.1 Pairwise alignment
- 1.2 Alignment of Three Sequences
- 2 GPU architecture
- 3 Pairwise alignment
- 3.1 Smith-Waterman Algorithm
- 3.2 Computing the Score of the Best Local Alignment
- 3.3 Computing the Best Local Alignment
- 3.3.1 StripedAlignment
- 3.3.2 ChunkedAlignment1
- 3.3.3 ChunkedAlignment2
- 3.3.4 Memory requirements
- 3.4 Experimental Results
- StripedScore
- StripedAlignment
- ChunkedAlignment1
- 4 Alignment of three sequences
- 4.1 Three-Sequence Alignment Algorithm
- 4.2 Computing the Score of the Best Alignment
- 4.2.1 Layered algorithm
- GPU computational strategy
- Analysis
- 4.2.2 Sloped algorithm
- GPU computational strategy
- Analysis
- 4.3 Computing the Best Alignment
- 4.3.1 Layered-BT1
- 4.3.2 Layered-BT2
- 4.3.3 Layered-BT3
- 4.4 Experimental Results
- 4.4.1 Computing the score of the best alignment
- 4.4.2 Computing the alignment
- 5 Conclusion
- References
- Chapter 9: Augmented block Cimmino distributed algorithm for solving tridiagonal systems on GPU
- 1 Introduction
- 2 ABCD Solver for tridiagonal systems
- 3 GPU implementation and optimization
- 3.1 QR Method and Givens Rotation
- 3.2 Sparse Storage Format
- 3.3 Coalesced Memory Access
- 3.4 Boundary Padding
- 3.4.1 Padding of the augmented matrix
- 3.4.2 Padding for Givens rotation
- 4 Performance evaluation
- 4.1 Comparison With CPU Implementation
- 4.2 Speedup by Memory Coalescing
- 4.3 Boundary Padding
- 5 Conclusion and future work
- References
- Chapter 10: GPU computing applied to linear and mixed-integer programming
- 1 Introduction
- 2 Operations Research in Practice
- 3 Exact Optimization Algorithms
- 3.1 The Simplex Method
- 3.2 Dynamic Programming
- 3.2.1 Knapsack problems
- 3.2.2 Multiple-choice knapsack problem
- 3.3 Branch-and-Bound
- 3.3.1 Knapsack problem
- 3.3.2 Flow-shop scheduling problem
- 3.3.3 Traveling salesman problem
- 4 Metaheuristics
- 4.1 Genetic Algorithms
- 4.1.1 The traveling salesman problem
- 4.1.2 Scheduling problems
- 4.1.3 Knapsack problems
- 4.2 Ant Colony Optimization