Data Lake for Enterprises: Leveraging Lambda Architecture for Building an Enterprise Data Lake

A practical guide to implementing your enterprise data lake using Lambda Architecture as the base.

About This Book:
  • Build a full-fledged data lake for your organization with popular big data technologies, using the Lambda architecture as the base
  • Delve into the big data technologies required to meet mo...


Bibliographic Details
Other Authors: John, Tomcy (author); Misra, Pankaj (author); Benjamin, Thomas (writer of foreword)
Format: Electronic book
Language: English
Published: Birmingham, England: Packt, 2017.
Edition: 1st edition
Subjects:
View in Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630227806719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • Foreword
  • About the Authors
  • About the Reviewers
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Part 1 - Overview
  • Part 2 - Technical Building blocks of Data Lake
  • Part 3 - Bringing It All Together
  • Chapter 1: Introduction to Data
  • Exploring data
  • What is Enterprise Data?
  • Enterprise Data Management
  • Big data concepts
  • Big data and 4Vs
  • Relevance of data
  • Quality of data
  • Where does this data live in an enterprise?
  • Intranet (within enterprise)
  • Internet (external to enterprise)
  • Business applications hosted in cloud
  • Third-party cloud solutions
  • Social data (structured and unstructured)
  • Data stores or persistent stores (RDBMS or NoSQL)
  • Traditional data warehouse
  • File stores
  • Enterprise's current state
  • Enterprise digital transformation
  • Enterprises embarking on this journey
  • Some examples
  • Data lake use case enlightenment
  • Summary
  • Chapter 2: Comprehensive Concepts of a Data Lake
  • What is a Data Lake?
  • Relevance to enterprises
  • How does a Data Lake help enterprises?
  • Data Lake benefits
  • How does a Data Lake work?
  • Differences between Data Lake and Data Warehouse
  • Approaches to building a Data Lake
  • Lambda Architecture-driven Data Lake
  • Data ingestion layer - ingest for processing and storage
  • Batch layer - batch processing of ingested data
  • Speed layer - near real time data processing
  • Data storage layer - store all data
  • Serving layer - data delivery and exports
  • Data acquisition layer - get data from source systems
  • Messaging Layer - guaranteed data delivery
  • Exploring the Data Ingestion Layer
  • Exploring the Lambda layer
  • Batch layer
  • Speed layer
  • Serving layer
  • Data push
  • Data pull
  • Data storage layer
  • Batch process layer
  • Speed layer
  • Serving layer
  • Relational data stores
  • Distributed data stores
  • Summary
  • Chapter 3: Lambda Architecture as a Pattern for Data Lake
  • What is Lambda Architecture?
  • History of Lambda Architecture
  • Principles of Lambda Architecture
  • Fault-tolerant principle
  • Immutable Data principle
  • Re-computation principle
  • Components of a Lambda Architecture
  • Batch layer
  • Speed layer
  • CAP Theorem
  • Eventual consistency
  • Serving layer
  • Complete working of a Lambda Architecture
  • Advantages of Lambda Architecture
  • Disadvantages of Lambda Architecture
  • Technology overview for Lambda Architecture
  • Applied Lambda
  • Enterprise-level log analysis
  • Capturing and analyzing sensor data
  • Real-time mailing platform statistics
  • Real-time sports analysis
  • Recommendation engines
  • Analyzing security threats
  • Multi-channel consumer behaviour
  • Working examples of Lambda Architecture
  • Kappa architecture
  • Summary
  • Chapter 4: Applied Lambda for Data Lake
  • Knowing Hadoop distributions
  • Selection factors for a big data stack for enterprises
  • Technical capabilities
  • Ease of deployment and maintenance
  • Integration readiness
  • Batch layer for data processing
  • The NameNode server
  • The secondary NameNode Server
  • Yet Another Resource Negotiator (YARN)
  • Data storage nodes (DataNode)
  • Speed layer
  • Flume for data acquisition
  • Source for event sourcing
  • Interceptors for event interception
  • Channels for event flow
  • Sink as an event destination
  • Spark Streaming
  • DStreams
  • Data Frames
  • Checkpointing
  • Apache Flink
  • Serving layer
  • Data repository layer
  • Relational databases
  • Big data tables/views
  • Data services with data indexes
  • NoSQL databases
  • Data access layer
  • Data exports
  • Data publishing
  • Summary
  • Chapter 5: Data Acquisition of Batch Data using Apache Sqoop
  • Context in Data Lake - data acquisition
  • Data acquisition layer
  • Data acquisition of batch data - technology mapping
  • Why Apache Sqoop
  • History of Sqoop
  • Advantages of Sqoop
  • Disadvantages of Sqoop
  • Workings of Sqoop
  • Sqoop 2 architecture
  • Sqoop 1 versus Sqoop 2
  • Ease of use
  • Ease of extension
  • Security
  • When to use Sqoop 1 and Sqoop 2
  • Functioning of Sqoop
  • Data import using Sqoop
  • Data export using Sqoop
  • Sqoop connectors
  • Types of Sqoop connectors
  • Sqoop support for HDFS
  • Sqoop working example
  • Installation and Configuration
  • Step 1 - Installing and verifying Java
  • Step 2 - Installing and verifying Hadoop
  • Step 3 - Installing and verifying Hue
  • Step 4 - Installing and verifying Sqoop
  • Step 5 - Installing and verifying PostgreSQL (RDBMS)
  • Step 6 - Installing and verifying HBase (NoSQL)
  • Configure data source (ingestion)
  • Sqoop configuration (database drivers)
  • Configuring HDFS as destination
  • Sqoop Import
  • Import complete database
  • Import selected tables
  • Import selected columns from a table
  • Import into HBase
  • Sqoop Export
  • Sqoop Job
  • Job command
  • Create job
  • List Job
  • Run Job
  • Sqoop 2
  • Sqoop in purview of SCV use case
  • When to use Sqoop
  • When not to use Sqoop
  • Real-time Sqooping: a possibility?
  • Other options
  • Native big data connectors
  • Talend
  • Pentaho's Kettle (PDI - Pentaho Data Integration)
  • Summary
  • Chapter 6: Data Acquisition of Stream Data using Apache Flume
  • Context in Data Lake: data acquisition
  • What is Stream Data?
  • Batch and stream data
  • Data acquisition of stream data - technology mapping
  • What is Flume?
  • Sqoop and Flume
  • Why Flume?
  • History of Flume
  • Advantages of Flume
  • Disadvantages of Flume
  • Flume architecture principles
  • The Flume Architecture
  • Distributed pipeline - Flume architecture
  • Fan Out - Flume architecture
  • Fan In - Flume architecture
  • Three tier design - Flume architecture
  • Advanced Flume architecture
  • Flume reliability level
  • Flume event - Stream Data
  • Flume agent
  • Flume agent configurations
  • Flume source
  • Custom Source
  • Flume Channel
  • Custom channel
  • Flume sink
  • Custom sink
  • Flume configuration
  • Flume transaction management
  • Other flume components
  • Channel processor
  • Interceptor
  • Channel Selector
  • Sink Groups
  • Sink Processor
  • Event Serializers
  • Context Routing
  • Flume working example
  • Installation and Configuration
  • Step 1: Installing and verifying Flume
  • Step 2: Configuring Flume
  • Step 3: Start Flume
  • Flume in purview of SCV use case
  • Kafka Installation
  • Example 1 - RDBMS to Kafka
  • Example 2: Spool messages to Kafka
  • Example 3: Interceptors
  • Example 4 - Memory channel, file channel, and Kafka channel
  • When to use Flume
  • When not to use Flume
  • Other options
  • Apache Flink
  • Apache NiFi
  • Summary
  • Chapter 7: Messaging Layer using Apache Kafka
  • Context in Data Lake - messaging layer
  • Messaging layer
  • Messaging layer - technology mapping
  • What is Apache Kafka?
  • Why Apache Kafka
  • History of Kafka
  • Advantages of Kafka
  • Disadvantages of Kafka
  • Kafka architecture
  • Core architecture principles of Kafka
  • Data stream life cycle
  • Working of Kafka
  • Kafka message
  • Kafka producer
  • Persistence of data in Kafka using topics
  • Partitions - Kafka topic division
  • Kafka message broker
  • Kafka consumer
  • Consumer groups
  • Other Kafka components
  • Zookeeper
  • MirrorMaker
  • Kafka programming interface
  • Kafka core APIs
  • Kafka REST interface
  • Producer and consumer reliability
  • Kafka security
  • Kafka as message-oriented middleware
  • Scale-out architecture with Kafka
  • Kafka connect
  • Kafka working example
  • Installation
  • Producer - putting messages into Kafka
  • Kafka Connect
  • Consumer - getting messages from Kafka
  • Setting up multi-broker cluster
  • Kafka in the purview of an SCV use case
  • When to use Kafka
  • When not to use Kafka
  • Other options
  • RabbitMQ
  • ZeroMQ
  • Apache ActiveMQ
  • Summary
  • Chapter 8: Data Processing using Apache Flink
  • Context in a Data Lake - Data Ingestion Layer
  • Data Ingestion Layer
  • Data Ingestion Layer - technology mapping
  • What is Apache Flink?
  • Why Apache Flink?
  • History of Flink
  • Advantages of Flink
  • Disadvantages of Flink
  • Working of Flink
  • Flink architecture
  • Client
  • Job Manager
  • Task Manager
  • Flink execution model
  • Core architecture principles of Flink
  • Flink Component Stack
  • Checkpointing in Flink
  • Savepoints in Flink
  • Streaming window options in Flink
  • Time window
  • Count window
  • Tumbling window configuration
  • Sliding window configuration
  • Memory management
  • Flink APIs
  • DataStream API
  • Flink DataStream API example
  • Streaming connectors
  • DataSet API
  • Flink DataSet API example
  • Table API
  • Flink domain specific libraries
  • Gelly - Flink Graph API
  • FlinkML
  • FlinkCEP
  • Flink working example
  • Installation
  • Example - data processing with Flink
  • Data generation
  • Step 1 - Preparing streams
  • Step 2 - Consuming Streams via Flink
  • Step 3 - Streaming data into HDFS
  • Flink in purview of SCV use cases
  • User Log Data Generation
  • Flume Setup
  • Flink Processors
  • When to use Flink
  • When not to use Flink
  • Other options
  • Apache Spark
  • Apache Storm
  • Apache Tez
  • Summary
  • Chapter 9: Data Store Using Apache Hadoop
  • Context for Data Lake - Data Storage and Lambda Batch Layer
  • Data Storage and the Lambda Batch Layer
  • Data Storage and Lambda Batch Layer - technology mapping
  • What is Apache Hadoop?
  • Why Hadoop?
  • History of Hadoop
  • Advantages of Hadoop