Big Data Analytics with Hadoop 3 build highly effective analytics solutions to gain valuable insight into your big data
Explore big data concepts, platforms, analytics, and their applications using the power of Hadoop 3 About This Book Learn Hadoop 3 to build effective big data analytics solutions on-premise and on cloud Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink Expl...
Otros Autores: | |
---|---|
Formato: | Libro electrónico |
Idioma: | Inglés |
Publicado: |
Birmingham ; Mumbai :
Packt
[2018]
|
Edición: | 1st edition |
Materias: | |
Ver en Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009631733506719 |
Tabla de Contenidos:
- Cover
- Title Page
- Copyright and Credits
- Packt Upsell
- Contributors
- Table of Contents
- Preface
- Chapter 1: Introduction to Hadoop
- Hadoop Distributed File System
- High availability
- Intra-DataNode balancer
- Erasure coding
- Port numbers
- MapReduce framework
- Task-level native optimization
- YARN
- Opportunistic containers
- Types of container execution
- YARN timeline service v.2
- Enhancing scalability and reliability
- Usability improvements
- Architecture
- Other changes
- Minimum required Java version
- Shell script rewrite
- Shaded-client JARs
- Installing Hadoop 3
- Prerequisites
- Downloading
- Installation
- Setup password-less ssh
- Setting up the NameNode
- Starting HDFS
- Setting up the YARN service
- Erasure Coding
- Intra-DataNode balancer
- Installing YARN timeline service v.2
- Setting up the HBase cluster
- Simple deployment for HBase
- Enabling the co-processor
- Enabling timeline service v.2
- Running timeline service v.2
- Enabling MapReduce to write to timeline service v.2
- Summary
- Chapter 2: Overview of Big Data Analytics
- Introduction to data analytics
- Inside the data analytics process
- Introduction to big data
- Variety of data
- Velocity of data
- Volume of data
- Veracity of data
- Variability of data
- Visualization
- Value
- Distributed computing using Apache Hadoop
- The MapReduce framework
- Hive
- Downloading and extracting the Hive binaries
- Installing Derby
- Using Hive
- Creating a database
- Creating a table
- SELECT statement syntax
- WHERE clauses
- INSERT statement syntax
- Primitive types
- Complex types
- Built-in operators and functions
- Built-in operators
- Built-in functions
- Language capabilities
- A cheat sheet on retrieving information
- Apache Spark
- Visualization using Tableau
- Summary.
- Chapter 3: Big Data Processing with MapReduce
- The MapReduce framework
- Dataset
- Record reader
- Map
- Combiner
- Partitioner
- Shuffle and sort
- Reduce
- Output format
- MapReduce job types
- Single mapper job
- Single mapper reducer job
- Multiple mappers reducer job
- SingleMapperCombinerReducer job
- Scenario
- MapReduce patterns
- Aggregation patterns
- Average temperature by city
- Record count
- Min/max/count
- Average/median/standard deviation
- Filtering patterns
- Join patterns
- Inner join
- Left anti join
- Left outer join
- Right outer join
- Full outer join
- Left semi join
- Cross join
- Summary
- Chapter 4: Scientific Computing and Big Data Analysis with Python and Hadoop
- Installation
- Installing standard Python
- Installing Anaconda
- Using Conda
- Data analysis
- Summary
- Chapter 5: Statistical Big Data Computing with R and Hadoop
- Introduction
- Install R on workstations and connect to the data in Hadoop
- Install R on a shared server and connect to Hadoop
- Utilize Revolution R Open
- Execute R inside of MapReduce using RMR2
- Summary and outlook for pure open source options
- Methods of integrating R and Hadoop
- RHADOOP - install R on workstations and connect to data in Hadoop
- RHIPE - execute R inside Hadoop MapReduce
- R and Hadoop Streaming
- RHIVE - install R on workstations and connect to data in Hadoop
- ORCH - Oracle connector for Hadoop
- Data analytics
- Summary
- Chapter 6: Batch Analytics with Apache Spark
- SparkSQL and DataFrames
- DataFrame APIs and the SQL API
- Pivots
- Filters
- User-defined functions
- Schema - structure of data
- Implicit schema
- Explicit schema
- Encoders
- Loading datasets
- Saving datasets
- Aggregations
- Aggregate functions
- count
- first
- last
- approx_count_distinct
- min
- max
- avg
- sum.
- kurtosis
- skewness
- Variance
- Standard deviation
- Covariance
- groupBy
- Rollup
- Cube
- Window functions
- ntiles
- Joins
- Inner workings of join
- Shuffle join
- Broadcast join
- Join types
- Inner join
- Left outer join
- Right outer join
- Outer join
- Left anti join
- Left semi join
- Cross join
- Performance implications of join
- Summary
- Chapter 7: Real-Time Analytics with Apache Spark
- Streaming
- At-least-once processing
- At-most-once processing
- Exactly-once processing
- Spark Streaming
- StreamingContext
- Creating StreamingContext
- Starting StreamingContext
- Stopping StreamingContext
- Input streams
- receiverStream
- socketTextStream
- rawSocketStream
- fileStream
- textFileStream
- binaryRecordsStream
- queueStream
- textFileStream example
- twitterStream example
- Discretized Streams
- Transformations
- Windows operations
- Stateful/stateless transformations
- Stateless transformations
- Stateful transformations
- Checkpointing
- Metadata checkpointing
- Data checkpointing
- Driver failure recovery
- Interoperability with streaming platforms (Apache Kafka)
- Receiver-based
- Direct Stream
- Structured Streaming
- Getting deeper into Structured Streaming
- Handling event time and late date
- Fault-tolerance semantics
- Summary
- Chapter 8: Batch Analytics with Apache Flink
- Introduction to Apache Flink
- Continuous processing for unbounded datasets
- Flink, the streaming model, and bounded datasets
- Installing Flink
- Downloading Flink
- Installing Flink
- Starting a local Flink cluster
- Using the Flink cluster UI
- Batch analytics
- Reading file
- File-based
- Collection-based
- Generic
- Transformations
- GroupBy
- Aggregation
- Joins
- Inner join
- Left outer join
- Right outer join
- Full outer join
- Writing to a file
- Summary.
- Chapter 9: Stream Processing with Apache Flink
- Introduction to streaming execution model
- Data processing using the DataStream API
- Execution environment
- Data sources
- Socket-based
- File-based
- Transformations
- map
- flatMap
- filter
- keyBy
- reduce
- fold
- Aggregations
- window
- Global windows
- Tumbling windows
- Sliding windows
- Session windows
- windowAll
- union
- Window join
- split
- Select
- Project
- Physical partitioning
- Custom partitioning
- Random partitioning
- Rebalancing partitioning
- Rescaling
- Broadcasting
- Event time and watermarks
- Connectors
- Kafka connector
- Twitter connector
- RabbitMQ connector
- Elasticsearch connector
- Cassandra connector
- Summary
- Chapter 10: Visualizing Big Data
- Introduction
- Tableau
- Chart types
- Line charts
- Pie chart
- Bar chart
- Heat map
- Using Python to visualize data
- Using R to visualize data
- Big data visualization tools
- Summary
- Chapter 11: Introduction to Cloud Computing
- Concepts and terminology
- Cloud
- IT resource
- On-premise
- Cloud consumers and Cloud providers
- Scaling
- Types of scaling
- Horizontal scaling
- Vertical scaling
- Cloud service
- Cloud service consumer
- Goals and benefits
- Increased scalability
- Increased availability and reliability
- Risks and challenges
- Increased security vulnerabilities
- Reduced operational governance control
- Limited portability between Cloud providers
- Roles and boundaries
- Cloud provider
- Cloud consumer
- Cloud service owner
- Cloud resource administrator
- Additional roles
- Organizational boundary
- Trust boundary
- Cloud characteristics
- On-demand usage
- Ubiquitous access
- Multi-tenancy (and resource pooling)
- Elasticity
- Measured usage
- Resiliency
- Cloud delivery models
- Infrastructure as a Service.
- Platform as a Service
- Software as a Service
- Combining Cloud delivery models
- IaaS + PaaS
- IaaS + PaaS + SaaS
- Cloud deployment models
- Public Clouds
- Community Clouds
- Private Clouds
- Hybrid Clouds
- Summary
- Chapter 12: Using Amazon Web Services
- Amazon Elastic Compute Cloud
- Elastic web-scale computing
- Complete control of operations
- Flexible Cloud hosting services
- Integration
- High reliability
- Security
- Inexpensive
- Easy to start
- Instances and Amazon Machine Images
- Launching multiple instances of an AMI
- Instances
- AMIs
- Regions and availability zones
- Region and availability zone concepts
- Regions
- Availability zones
- Available regions
- Regions and endpoints
- Instance types
- Tag basics
- Amazon EC2 key pairs
- Amazon EC2 security groups for Linux instances
- Elastic IP addresses
- Amazon EC2 and Amazon Virtual Private Cloud
- Amazon Elastic Block Store
- Amazon EC2 instance store
- What is AWS Lambda?
- When should I use AWS Lambda?
- Introduction to Amazon S3
- Getting started with Amazon S3
- Comprehensive security and compliance capabilities
- Query in place
- Flexible management
- Most supported platform with the largest ecosystem
- Easy and flexible data transfer
- Backup and recovery
- Data archiving
- Data lakes and big data analytics
- Hybrid Cloud storage
- Cloud-native application data
- Disaster recovery
- Amazon DynamoDB
- Amazon Kinesis Data Streams
- What can I do with Kinesis Data Streams?
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing
- Benefits of using Kinesis Data Streams
- AWS Glue
- When should I use AWS Glue?
- Amazon EMR
- Practical AWS EMR cluster
- Summary
- Index.