LLM Engineer's Handbook: Master the Art of Engineering Large Language Models from Concept to Production

The field of Artificial Intelligence has undergone rapid advancements, and Large Language Models (LLMs) are at the forefront of this revolution. This LLM book provides practical insights into designing, training, and deploying LLMs in real-world scenarios by leveraging MLOps best practices. This com...


Bibliographic Details
Other Authors: Iusztin, Paul (author); Labonne, Maxime (author)
Format: E-book
Language: English
Published: Birmingham, England: Packt Publishing, [2024]
Edition: First edition
Series: Expert Insight
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009849106806719
Table of Contents:
  • Cover
  • Copyright
  • Contributors
  • Table of Contents
  • Preface
  • Chapter 1: Understanding the LLM Twin Concept and Its Architecture
  • Understanding the LLM twin concept
  • What is an LLM twin?
  • Why building an LLM twin matters
  • Why not use ChatGPT (or another similar chatbot)?
  • Planning the MVP of the LLM twin product
  • What is an MVP?
  • Defining the LLM twin MVP
  • Building ML systems with feature/training/inference pipelines
  • The problem with building ML systems
  • The issue with previous solutions
  • The solution - ML pipelines for ML systems
  • The feature pipeline
  • The training pipeline
  • The inference pipeline
  • Benefits of the FTI architecture
  • Designing the system architecture of the LLM twin
  • Listing the technical details of the LLM twin architecture
  • How to design the LLM twin architecture using the FTI pipeline design
  • Data collection pipeline
  • Feature pipeline
  • Training pipeline
  • Inference pipeline
  • Final thoughts on the FTI design and the LLM twin architecture
  • Summary
  • References
  • Chapter 2: Tooling and Installation
  • Python ecosystem and project installation
  • Poetry: dependency and virtual environment management
  • Poe the Poet: task execution tool
  • MLOps and LLMOps tooling
  • Hugging Face: model registry
  • ZenML: orchestrator, artifacts, and metadata
  • Orchestrator
  • Artifacts and metadata
  • How to run and configure a ZenML pipeline
  • Comet ML: experiment tracker
  • Opik: prompt monitoring
  • Databases for storing unstructured and vector data
  • MongoDB: NoSQL database
  • Qdrant: vector database
  • Preparing for AWS
  • Setting up an AWS account, an access key, and the CLI
  • SageMaker: training and inference compute
  • Why AWS SageMaker?
  • Summary
  • References
  • Chapter 3: Data Engineering
  • Designing the LLM Twin's data collection pipeline
  • Implementing the LLM Twin's data collection pipeline
  • ZenML pipeline and steps
  • The dispatcher: How do you instantiate the right crawler?
  • The crawlers
  • Base classes
  • GitHubCrawler class
  • CustomArticleCrawler class
  • MediumCrawler class
  • The NoSQL data warehouse documents
  • The ORM and ODM software patterns
  • Implementing the ODM class
  • Data categories and user document classes
  • Gathering raw data into the data warehouse
  • Troubleshooting
  • Selenium issues
  • Import our backed-up data
  • Summary
  • References
  • Chapter 4: RAG Feature Pipeline
  • Understanding RAG
  • Why use RAG?
  • Hallucinations
  • Old information
  • The vanilla RAG framework
  • Ingestion pipeline
  • Retrieval pipeline
  • Generation pipeline
  • What are embeddings?
  • Why embeddings are so powerful
  • How are embeddings created?
  • Applications of embeddings
  • More on vector DBs
  • How does a vector DB work?
  • Algorithms for creating the vector index
  • DB operations
  • An overview of advanced RAG
  • Pre-retrieval
  • Retrieval
  • Post-retrieval
  • Exploring the LLM Twin's RAG feature pipeline architecture
  • The problem we are solving
  • The feature store
  • Where does the raw data come from?
  • Designing the architecture of the RAG feature pipeline
  • Batch pipelines
  • Batch versus streaming pipelines
  • Core steps
  • Change data capture: syncing the data warehouse and feature store
  • Why is the data stored in two snapshots?
  • Orchestration
  • Implementing the LLM Twin's RAG feature pipeline
  • Settings
  • ZenML pipeline and steps
  • Querying the data warehouse
  • Cleaning the documents
  • Chunk and embed the cleaned documents
  • Loading the documents to the vector DB
  • Pydantic domain entities
  • OVM
  • The dispatcher layer
  • The handlers
  • The cleaning handlers
  • The chunking handlers
  • The embedding handlers
  • Summary
  • References
  • Chapter 5: Supervised Fine-Tuning
  • Creating an instruction dataset
  • General framework
  • Data quantity
  • Data curation
  • Rule-based filtering
  • Data deduplication
  • Data decontamination
  • Data quality evaluation
  • Data exploration
  • Data generation
  • Data augmentation
  • Creating our own instruction dataset
  • Exploring SFT and its techniques
  • When to fine-tune
  • Instruction dataset formats
  • Chat templates
  • Parameter-efficient fine-tuning techniques
  • Full fine-tuning
  • LoRA
  • QLoRA
  • Training parameters
  • Learning rate and scheduler
  • Batch size
  • Maximum length and packing
  • Number of epochs
  • Optimizers
  • Weight decay
  • Gradient checkpointing
  • Fine-tuning in practice
  • Summary
  • References
  • Chapter 6: Fine-Tuning with Preference Alignment
  • Understanding preference datasets
  • Preference data
  • Data quantity
  • Data generation and evaluation
  • Generating preferences
  • Tips for data generation
  • Evaluating preferences
  • Creating our own preference dataset
  • Preference alignment
  • Reinforcement Learning from Human Feedback
  • Direct Preference Optimization
  • Implementing DPO
  • Summary
  • References
  • Chapter 7: Evaluating LLMs
  • Model evaluation
  • Comparing ML and LLM evaluation
  • General-purpose LLM evaluations
  • Domain-specific LLM evaluations
  • Task-specific LLM evaluations
  • RAG evaluation
  • Ragas
  • ARES
  • Evaluating TwinLlama-3.1-8B
  • Generating answers
  • Evaluating answers
  • Analyzing results
  • Summary
  • References
  • Chapter 8: Inference Optimization
  • Model optimization strategies
  • KV cache
  • Continuous batching
  • Speculative decoding
  • Optimized attention mechanisms
  • Model parallelism
  • Data parallelism
  • Pipeline parallelism
  • Tensor parallelism
  • Combining approaches
  • Model quantization
  • Introduction to quantization
  • Quantization with GGUF and llama.cpp
  • Quantization with GPTQ and EXL2
  • Other quantization techniques
  • Summary
  • References
  • Chapter 9: RAG Inference Pipeline
  • Understanding the LLM twin's RAG inference pipeline
  • Exploring the LLM twin's advanced RAG techniques
  • Advanced RAG pre-retrieval optimizations: query expansion and self-querying
  • Query expansion
  • Self-querying
  • Advanced RAG retrieval optimization: filtered vector search
  • Advanced RAG post-retrieval optimization: reranking
  • Implementing the LLM twin's RAG inference pipeline
  • Implementing the retrieval module
  • Bringing everything together into the RAG inference pipeline
  • Summary
  • References
  • Chapter 10: Inference Pipeline Deployment
  • Criteria for choosing deployment types
  • Throughput and latency
  • Data
  • Understanding inference deployment types
  • Online real-time inference
  • Asynchronous inference
  • Offline batch transform
  • Monolithic versus microservices architecture in model serving
  • Monolithic architecture
  • Microservices architecture
  • Choosing between monolithic and microservices architectures
  • Exploring the LLM Twin's inference pipeline deployment strategy
  • The training versus the inference pipeline
  • Deploying the LLM Twin service
  • Implementing the LLM microservice using AWS SageMaker
  • What are Hugging Face's DLCs?
  • Configuring SageMaker roles
  • Deploying the LLM Twin model to AWS SageMaker
  • Calling the AWS SageMaker Inference endpoint
  • Building the business microservice using FastAPI
  • Autoscaling capabilities to handle spikes in usage
  • Registering a scalable target
  • Creating a scalable policy
  • Minimum and maximum scaling limits
  • Cooldown period
  • Summary
  • References
  • Chapter 11: MLOps and LLMOps
  • The path to LLMOps: Understanding its roots in DevOps and MLOps
  • DevOps
  • The DevOps lifecycle
  • The core DevOps concepts
  • MLOps
  • MLOps core components
  • MLOps principles
  • ML vs. MLOps engineering
  • LLMOps
  • Human feedback
  • Guardrails
  • Prompt monitoring
  • Deploying the LLM Twin's pipelines to the cloud
  • Understanding the infrastructure
  • Setting up MongoDB
  • Setting up Qdrant
  • Setting up the ZenML cloud
  • Containerize the code using Docker
  • Run the pipelines on AWS
  • Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker
  • Adding LLMOps to the LLM Twin
  • LLM Twin's CI/CD pipeline flow
  • More on formatting errors
  • More on linting errors
  • Quick overview of GitHub Actions
  • The CI pipeline
  • GitHub Actions CI YAML file
  • The CD pipeline
  • Test out the CI/CD pipeline
  • The CT pipeline
  • Initial triggers
  • Trigger downstream pipelines
  • Prompt monitoring
  • Alerting
  • Summary
  • References
  • Appendix: MLOps Principles
  • 1. Automation or operationalization
  • 2. Versioning
  • 3. Experiment tracking
  • 4. Testing
  • Test types
  • What do we test?
  • Test examples
  • 5. Monitoring
  • Logs
  • Metrics
  • System metrics
  • Model metrics
  • Drifts
  • Monitoring vs. observability
  • Alerts
  • 6. Reproducibility
  • Packt Page
  • Other Books You May Enjoy
  • Index