Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications

Learn to build cost-effective apps using large language models. In Large Language Model-Based Solutions: How to Deliver Value with Cost-Effective Generative AI Applications, Principal Data Scientist at Amazon Web Services, Shreyas Subramanian, delivers a practical guide for developers and data scientists...


Bibliographic Details
Other Authors: Subramanian, Shreyas (author)
Format: Electronic book
Language: English
Published: Hoboken, New Jersey : John Wiley, 2024.
Edition: 1st ed.
Series: Tech Today Series
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009811313906719
Table of Contents:
  • Cover
  • Contents at a Glance
  • Title Page
  • Copyright Page
  • Dedication Page
  • About the Author
  • About the Technical Editor
  • Contents
  • Introduction
  • GenAI Applications and Large Language Models
  • Importance of Cost Optimization
  • Challenges and Opportunities
  • Micro Case Studies
  • OpenAI: Leading the Way
  • Hugging Face: Open-Source Community Building
  • Bloomberg GPT: LLMs in Large Commercial Institutions
  • Who Is This Book For?
  • Summary
  • Chapter 1 Introduction
  • Overview of GenAI Applications and Large Language Models
  • The Rise of Large Language Models
  • Neural Networks, Transformers, and Beyond
  • GenAI vs. LLMs: What's the Difference?
  • The Three-Layer GenAI Application Stack
  • The Infrastructure Layer
  • The Model Layer
  • The Application Layer
  • Paths to Productionizing GenAI Applications
  • Sample LLM-Powered Chat Application
  • The Importance of Cost Optimization
  • Cost Assessment of the Model Inference Component
  • Cost Assessment of the Vector Database Component
  • Benchmarking Setup and Results
  • Other Factors to Consider
  • Cost Assessment of the Large Language Model Component
  • Summary
  • Chapter 2 Tuning Techniques for Cost Optimization
  • Fine-Tuning and Customizability
  • Basic Scaling Laws You Should Know
  • Parameter-Efficient Fine-Tuning Methods
  • Adapters Under the Hood
  • Prompt Tuning
  • Prefix Tuning
  • P-tuning
  • IA3
  • Low-Rank Adaptation
  • Cost and Performance Implications of PEFT Methods
  • Summary
  • Chapter 3 Inference Techniques for Cost Optimization
  • Introduction to Inference Techniques
  • Prompt Engineering
  • Impact of Prompt Engineering on Cost
  • Estimating Costs for Other Models
  • Clear and Direct Prompts
  • Adding Qualifying Words for Brief Responses
  • Breaking Down the Request
  • Example of Using Claude for PII Removal
  • Conclusion
  • Providing Context
  • Examples of Providing Context
  • RAG and Long Context Models
  • Recent Work Comparing RAG with Long Context Models
  • Conclusion
  • Context and Model Limitations
  • Indicating a Desired Format
  • Example of Formatted Extraction with Claude
  • Trade-Off Between Verbosity and Clarity
  • Caching with Vector Stores
  • What Is a Vector Store?
  • How to Implement Caching Using Vector Stores
  • Conclusion
  • Chains for Long Documents
  • What Is Chaining?
  • Implementing Chains
  • Example Use Case
  • Common Components
  • Tools That Implement Chains
  • Comparing Results
  • Conclusion
  • Summarization
  • Summarization in the Context of Cost and Performance
  • Efficiency in Data Processing
  • Cost-Effective Storage
  • Enhanced Downstream Applications
  • Improved Cache Utilization
  • Summarization as a Preprocessing Step
  • Enhanced User Experience
  • Conclusion
  • Batch Prompting for Efficient Inference
  • Batch Inference
  • Experimental Results
  • Using the accelerate Library
  • Using the DeepSpeed Library
  • Batch Prompting
  • Example of Using Batch Prompting
  • Model Optimization Methods
  • Quantization
  • Code Example
  • Recent Advancements: GPTQ
  • Parameter-Efficient Fine-Tuning Methods
  • Recap of PEFT Methods
  • Code Example
  • Cost and Performance Implications
  • Summary
  • References
  • Chapter 4 Model Selection and Alternatives
  • Introduction to Model Selection
  • Motivating Example: The Tale of Two Models
  • The Role of Compact and Nimble Models
  • Examples of Successful Smaller Models
  • Quantization for Powerful but Smaller Models
  • Text Generation with Mistral 7B
  • Zephyr 7B and Aligned Smaller Models
  • CogVLM for Language-Vision Multimodality
  • Prometheus for Fine-Grained Text Evaluation
  • Orca 2 and Teaching Smaller Models to Reason
  • Breaking Traditional Scaling Laws with Gemini and Phi
  • Phi 1, 1.5, and 2 B Models
  • Gemini Models
  • Domain-Specific Models
  • Step 1 - Training Your Own Tokenizer
  • Step 2 - Training Your Own Domain-Specific Model
  • More References for Fine-Tuning
  • Evaluating Domain-Specific Models vs. Generic Models
  • The Power of Prompting with General-Purpose Models
  • Summary
  • Chapter 5 Infrastructure and Deployment Tuning Strategies
  • Introduction to Tuning Strategies
  • Hardware Utilization and Batch Tuning
  • Memory Occupancy
  • Strategies to Fit Larger Models in Memory
  • KV Caching
  • PagedAttention
  • How Does PagedAttention Work?
  • Comparisons, Limitations, and Cost Considerations
  • AlphaServe
  • How Does AlphaServe Work?
  • Impact of Batching
  • Cost and Performance Considerations
  • S3: Scheduling Sequences with Speculation
  • How Does S3 Work?
  • Performance and Cost
  • Streaming LLMs with Attention Sinks
  • Fixed to Sliding Window Attention
  • Extending the Context Length
  • Working with Infinite Length Context
  • How Does StreamingLLM Work?
  • Performance and Results
  • Cost Considerations
  • Batch Size Tuning
  • Frameworks for Deployment Configuration Testing
  • Cloud-Native Inference Frameworks
  • Deep Dive into Serving Stack Choices
  • Batching Options
  • Options in DJL Serving
  • High-Level Guidance for Selecting Serving Parameters
  • Automatically Finding Good Inference Configurations
  • Creating a Generic Template
  • Defining an HPO Space
  • Searching the Space for Optimal Configurations
  • Results of Inference HPO
  • Inference Acceleration Tools
  • TensorRT and GPU Acceleration Tools
  • CPU Acceleration Tools
  • Monitoring and Observability
  • LLMOps and Monitoring
  • Why Is Monitoring Important for LLMs?
  • Monitoring and Updating Guardrails
  • Summary
  • Conclusion
  • Index
  • EULA