Data Lake for Enterprises: Leveraging Lambda Architecture for Building an Enterprise Data Lake
A practical guide to implementing your enterprise data lake using Lambda Architecture as the base. About This Book: Build a full-fledged data lake for your organization with popular big data technologies, using the Lambda Architecture as the base. Delve into the big data technologies required to meet mo...
Other authors:
Format: E-book
Language: English
Published: Birmingham, England: Packt, 2017
Edition: 1st edition
Subjects:
View at Universitat Ramon Llull Library: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630227806719
Table of Contents:
- Cover
- Copyright
- Credits
- Foreword
- About the Authors
- About the Reviewers
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Part 1 - Overview
- Part 2 - Technical Building blocks of Data Lake
- Part 3 - Bringing It All Together
- Chapter 1: Introduction to Data
- Exploring data
- What is Enterprise Data?
- Enterprise Data Management
- Big data concepts
- Big data and 4Vs
- Relevance of data
- Quality of data
- Where does this data live in an enterprise?
- Intranet (within enterprise)
- Internet (external to enterprise)
- Business applications hosted in cloud
- Third-party cloud solutions
- Social data (structured and unstructured)
- Data stores or persistent stores (RDBMS or NoSQL)
- Traditional data warehouse
- File stores
- Enterprise's current state
- Enterprise digital transformation
- Enterprises embarking on this journey
- Some examples
- Data lake use case enlightenment
- Summary
- Chapter 2: Comprehensive Concepts of a Data Lake
- What is a Data Lake?
- Relevance to enterprises
- How does a Data Lake help enterprises?
- Data Lake benefits
- How does a Data Lake work?
- Differences between Data Lake and Data Warehouse
- Approaches to building a Data Lake
- Lambda Architecture-driven Data Lake
- Data ingestion layer - ingest for processing and storage
- Batch layer - batch processing of ingested data
- Speed layer - near real time data processing
- Data storage layer - store all data
- Serving layer - data delivery and exports
- Data acquisition layer - get data from source systems
- Messaging layer - guaranteed data delivery
- Exploring the Data Ingestion Layer
- Exploring the Lambda layer
- Batch layer
- Speed layer
- Serving layer
- Data push
- Data pull
- Data storage layer
- Batch process layer
- Speed layer
- Serving layer
- Relational data stores
- Distributed data stores
- Summary
- Chapter 3: Lambda Architecture as a Pattern for Data Lake
- What is Lambda Architecture?
- History of Lambda Architecture
- Principles of Lambda Architecture
- Fault-tolerant principle
- Immutable Data principle
- Re-computation principle
- Components of a Lambda Architecture
- Batch layer
- Speed layer
- CAP Theorem
- Eventual consistency
- Serving layer
- Complete working of a Lambda Architecture
- Advantages of Lambda Architecture
- Disadvantages of Lambda Architectures
- Technology overview for Lambda Architecture
- Applied lambda
- Enterprise-level log analysis
- Capturing and analyzing sensor data
- Real-time mailing platform statistics
- Real-time sports analysis
- Recommendation engines
- Analyzing security threats
- Multi-channel consumer behaviour
- Working examples of Lambda Architecture
- Kappa architecture
- Summary
- Chapter 4: Applied Lambda for Data Lake
- Knowing Hadoop distributions
- Selection factors for a big data stack for enterprises
- Technical capabilities
- Ease of deployment and maintenance
- Integration readiness
- Batch layer for data processing
- The NameNode server
- The secondary NameNode Server
- Yet Another Resource Negotiator (YARN)
- Data storage nodes (DataNode)
- Speed layer
- Flume for data acquisition
- Source for event sourcing
- Interceptors for event interception
- Channels for event flow
- Sink as an event destination
- Spark Streaming
- DStreams
- Data Frames
- Checkpointing
- Apache Flink
- Serving layer
- Data repository layer
- Relational databases
- Big data tables/views
- Data services with data indexes
- NoSQL databases
- Data access layer
- Data exports
- Data publishing
- Summary
- Chapter 5: Data Acquisition of Batch Data using Apache Sqoop
- Context in data lake - data acquisition
- Data acquisition layer
- Data acquisition of batch data - technology mapping
- Why Apache Sqoop?
- History of Sqoop
- Advantages of Sqoop
- Disadvantages of Sqoop
- Workings of Sqoop
- Sqoop 2 architecture
- Sqoop 1 versus Sqoop 2
- Ease of use
- Ease of extension
- Security
- When to use Sqoop 1 and Sqoop 2
- Functioning of Sqoop
- Data import using Sqoop
- Data export using Sqoop
- Sqoop connectors
- Types of Sqoop connectors
- Sqoop support for HDFS
- Sqoop working example
- Installation and Configuration
- Step 1 - Installing and verifying Java
- Step 2 - Installing and verifying Hadoop
- Step 3 - Installing and verifying Hue
- Step 4 - Installing and verifying Sqoop
- Step 5 - Installing and verifying PostgreSQL (RDBMS)
- Step 6 - Installing and verifying HBase (NoSQL)
- Configure data source (ingestion)
- Sqoop configuration (database drivers)
- Configuring HDFS as destination
- Sqoop Import
- Import complete database
- Import selected tables
- Import selected columns from a table
- Import into HBase
- Sqoop Export
- Sqoop Job
- Job command
- Create job
- List Job
- Run Job
- Sqoop 2
- Sqoop in purview of SCV use case
- When to use Sqoop
- When not to use Sqoop
- Real-time Sqooping: a possibility?
- Other options
- Native big data connectors
- Talend
- Pentaho's Kettle (PDI - Pentaho Data Integration)
- Summary
- Chapter 6: Data Acquisition of Stream Data using Apache Flume
- Context in Data Lake: data acquisition
- What is Stream Data?
- Batch and stream data
- Data acquisition of stream data - technology mapping
- What is Flume?
- Sqoop and Flume
- Why Flume?
- History of Flume
- Advantages of Flume
- Disadvantages of Flume
- Flume architecture principles
- The Flume Architecture
- Distributed pipeline - Flume architecture
- Fan Out - Flume architecture
- Fan In - Flume architecture
- Three tier design - Flume architecture
- Advanced Flume architecture
- Flume reliability level
- Flume event - Stream Data
- Flume agent
- Flume agent configurations
- Flume source
- Custom Source
- Flume Channel
- Custom channel
- Flume sink
- Custom sink
- Flume configuration
- Flume transaction management
- Other flume components
- Channel processor
- Interceptor
- Channel Selector
- Sink Groups
- Sink Processor
- Event Serializers
- Context Routing
- Flume working example
- Installation and Configuration
- Step 1: Installing and verifying Flume
- Step 2: Configuring Flume
- Step 3: Start Flume
- Flume in purview of SCV use case
- Kafka Installation
- Example 1 - RDBMS to Kafka
- Example 2: Spool messages to Kafka
- Example 3: Interceptors
- Example 4 - Memory channel, file channel, and Kafka channel
- When to use Flume
- When not to use Flume
- Other options
- Apache Flink
- Apache NiFi
- Summary
- Chapter 7: Messaging Layer using Apache Kafka
- Context in Data Lake - messaging layer
- Messaging layer
- Messaging layer - technology mapping
- What is Apache Kafka?
- Why Apache Kafka?
- History of Kafka
- Advantages of Kafka
- Disadvantages of Kafka
- Kafka architecture
- Core architecture principles of Kafka
- Data stream life cycle
- Working of Kafka
- Kafka message
- Kafka producer
- Persistence of data in Kafka using topics
- Partitions - Kafka topic division
- Kafka message broker
- Kafka consumer
- Consumer groups
- Other Kafka components
- Zookeeper
- MirrorMaker
- Kafka programming interface
- Kafka core APIs
- Kafka REST interface
- Producer and consumer reliability
- Kafka security
- Kafka as message-oriented middleware
- Scale-out architecture with Kafka
- Kafka connect
- Kafka working example
- Installation
- Producer - putting messages into Kafka
- Kafka Connect
- Consumer - getting messages from Kafka
- Setting up multi-broker cluster
- Kafka in the purview of an SCV use case
- When to use Kafka
- When not to use Kafka
- Other options
- RabbitMQ
- ZeroMQ
- Apache ActiveMQ
- Summary
- Chapter 8: Data Processing using Apache Flink
- Context in a Data Lake - Data Ingestion Layer
- Data Ingestion Layer
- Data Ingestion Layer - technology mapping
- What is Apache Flink?
- Why Apache Flink?
- History of Flink
- Advantages of Flink
- Disadvantages of Flink
- Working of Flink
- Flink architecture
- Client
- Job Manager
- Task Manager
- Flink execution model
- Core architecture principles of Flink
- Flink Component Stack
- Checkpointing in Flink
- Savepoints in Flink
- Streaming window options in Flink
- Time window
- Count window
- Tumbling window configuration
- Sliding window configuration
- Memory management
- Flink APIs
- DataStream API
- Flink DataStream API example
- Streaming connectors
- DataSet API
- Flink DataSet API example
- Table API
- Flink domain specific libraries
- Gelly - Flink Graph API
- FlinkML
- FlinkCEP
- Flink working example
- Installation
- Example - data processing with Flink
- Data generation
- Step 1 - Preparing streams
- Step 2 - Consuming Streams via Flink
- Step 3 - Streaming data into HDFS
- Flink in purview of SCV use cases
- User Log Data Generation
- Flume Setup
- Flink Processors
- When to use Flink
- When not to use Flink
- Other options
- Apache Spark
- Apache Storm
- Apache Tez
- Summary
- Chapter 9: Data Store Using Apache Hadoop
- Context for Data Lake - Data Storage and Lambda Batch layer
- Data Storage and the Lambda Batch Layer
- Data Storage and Lambda Batch Layer - technology mapping
- What is Apache Hadoop?
- Why Hadoop?
- History of Hadoop
- Advantages of Hadoop