Distributed machine learning with Python : accelerating model training and serving with distributed systems

Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter...


Bibliographic Details
Other Authors: Wang, Guanhua (author)
Format: Electronic book
Language: English
Published: Birmingham ; Mumbai : Packt Publishing, 2022.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009660437406719
Table of Contents:
  • Intro
  • Title page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Section 1 - Data Parallelism
  • Chapter 1: Splitting Input Data
  • Single-node training is too slow
  • The mismatch between data loading bandwidth and model training bandwidth
  • Single-node training time on popular datasets
  • Accelerating the training process with data parallelism
  • Data parallelism - the high-level bits
  • Stochastic gradient descent
  • Model synchronization
  • Hyperparameter tuning
  • Global batch size
  • Learning rate adjustment
  • Model synchronization schemes
  • Summary
  • Chapter 2: Parameter Server and All-Reduce
  • Technical requirements
  • Parameter server architecture
  • Communication bottleneck in the parameter server architecture
  • Sharding the model among parameter servers
  • Implementing the parameter server
  • Defining model layers
  • Defining the parameter server
  • Defining the worker
  • Passing data between the parameter server and worker
  • Issues with the parameter server
  • The parameter server architecture introduces a high coding complexity for practitioners
  • All-Reduce architecture
  • Reduce
  • All-Reduce
  • Ring All-Reduce
  • Collective communication
  • Broadcast
  • Gather
  • All-Gather
  • Summary
  • Chapter 3: Building a Data Parallel Training and Serving Pipeline
  • Technical requirements
  • The data parallel training pipeline in a nutshell
  • Input pre-processing
  • Input data partition
  • Data loading
  • Training
  • Model synchronization
  • Model update
  • Single-machine multi-GPUs and multi-machine multi-GPUs
  • Single-machine multi-GPU
  • Multi-machine multi-GPU
  • Checkpointing and fault tolerance
  • Model checkpointing
  • Load model checkpoints
  • Model evaluation and hyperparameter tuning
  • Model serving in data parallelism
  • Summary
  • Chapter 4: Bottlenecks and Solutions
  • Communication bottlenecks in data parallel training
  • Analyzing the communication workloads
  • Parameter server architecture
  • The All-Reduce architecture
  • The inefficiency of state-of-the-art communication schemes
  • Leveraging idle links and host resources
  • Tree All-Reduce
  • Hybrid data transfer over PCIe and NVLink
  • On-device memory bottlenecks
  • Recomputation and quantization
  • Recomputation
  • Quantization
  • Summary
  • Section 2 - Model Parallelism
  • Chapter 5: Splitting the Model
  • Technical requirements
  • Single-node training error - out of memory
  • Fine-tuning BERT on a single GPU
  • Trying to pack a giant model inside one state-of-the-art GPU
  • ELMo, BERT, and GPT
  • Basic concepts
  • RNN
  • ELMo
  • BERT
  • GPT
  • Pre-training and fine-tuning
  • State-of-the-art hardware
  • P100, V100, and DGX-1
  • NVLink
  • A100 and DGX-2
  • NVSwitch
  • Summary
  • Chapter 6: Pipeline Input and Layer Split
  • Vanilla model parallelism is inefficient
  • Forward propagation
  • Backward propagation
  • GPU idle time between forward and backward propagation
  • Pipeline input
  • Pros and cons of pipeline parallelism
  • Advantages of pipeline parallelism
  • Disadvantages of pipeline parallelism
  • Layer split
  • Notes on intra-layer model parallelism
  • Summary
  • Chapter 7: Implementing Model Parallel Training and Serving Workflows
  • Technical requirements
  • Wrapping up the whole model parallelism pipeline
  • A model parallel training overview
  • Implementing a model parallel training pipeline
  • Specifying communication protocol among GPUs
  • Model parallel serving
  • Fine-tuning transformers
  • Hyperparameter tuning in model parallelism
  • Balancing the workload among GPUs
  • Enabling/disabling pipeline parallelism
  • NLP model serving
  • Summary
  • Chapter 8: Achieving Higher Throughput and Lower Latency
  • Technical requirements
  • Freezing layers
  • Freezing layers during forward propagation
  • Reducing computation cost during forward propagation
  • Freezing layers during backward propagation
  • Exploring memory and storage resources
  • Understanding model decomposition and distillation
  • Model decomposition
  • Model distillation
  • Reducing bits in hardware
  • Summary
  • Section 3 - Advanced Parallelism Paradigms
  • Chapter 9: Hybrid of Data and Model Parallelism
  • Technical requirements
  • Case study of Megatron-LM
  • Layer split for model parallelism
  • Row-wise trial-and-error approach
  • Column-wise trial-and-error approach
  • Cross-machine for data parallelism
  • Implementation of Megatron-LM
  • Case study of Mesh-TensorFlow
  • Implementation of Mesh-TensorFlow
  • Pros and cons of Megatron-LM and Mesh-TensorFlow
  • Summary
  • Chapter 10: Federated Learning and Edge Devices
  • Technical requirements
  • Sharing knowledge without sharing data
  • Recapping the traditional data parallel model training paradigm
  • No input sharing among workers
  • Communicating gradients for collaborative learning
  • Case study: TensorFlow Federated
  • Running edge devices with TinyML
  • Case study: TensorFlow Lite
  • Summary
  • Chapter 11: Elastic Model Training and Serving
  • Technical requirements
  • Introducing adaptive model training
  • Traditional data parallel training
  • Adaptive model training in data parallelism
  • Adaptive model training (AllReduce-based)
  • Adaptive model training (parameter server-based)
  • Traditional model-parallel model training paradigm
  • Adaptive model training in model parallelism
  • Implementing adaptive model training in the cloud
  • Elasticity in model inference
  • Serverless
  • Summary
  • Chapter 12: Advanced Techniques for Further Speed-Ups
  • Technical requirements
  • Debugging and performance analytics
  • General concepts in the profiling results
  • Communication results analysis
  • Computation results analysis
  • Job migration and multiplexing
  • Job migration
  • Job multiplexing
  • Model training in a heterogeneous environment
  • Summary
  • Index
  • About Packt
  • Other Books You May Enjoy