Mastering Spark for Data Science: master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products

About This Book: Develop and apply advanced analytical techniques with Spark. Learn how to tell a compelling story with data science using Spark's...
Other Authors: | |
---|---|
Format: | eBook |
Language: | English |
Published: | Birmingham, England: Packt Publishing, 2017 |
Edition: | 1st edition |
Subjects: | |
View at the Universitat Ramon Llull Library: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630128206719 |
Table of Contents:
- Cover
- Copyright
- Credits
- Foreword
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: The Big Data Science Ecosystem
- Introducing the Big Data ecosystem
- Data management
- Data management responsibilities
- The right tool for the job
- Overall architecture
- Data Ingestion
- Data Lake
- Reliable storage
- Scalable data processing capability
- Data science platform
- Data Access
- Data technologies
- The role of Apache Spark
- Companion tools
- Apache HDFS
- Advantages
- Disadvantages
- Installation
- Amazon S3
- Advantages
- Disadvantages
- Installation
- Apache Kafka
- Advantages
- Disadvantages
- Installation
- Apache Parquet
- Advantages
- Disadvantages
- Installation
- Apache Avro
- Advantages
- Disadvantages
- Installation
- Apache NiFi
- Advantages
- Disadvantages
- Installation
- Apache YARN
- Advantages
- Disadvantages
- Installation
- Apache Lucene
- Advantages
- Disadvantages
- Installation
- Kibana
- Advantages
- Disadvantages
- Installation
- Elasticsearch
- Advantages
- Disadvantages
- Installation
- Accumulo
- Advantages
- Disadvantages
- Installation
- Summary
- Chapter 2: Data Acquisition
- Data pipelines
- Universal ingestion framework
- Introducing the GDELT news stream
- Discovering GDELT in real-time
- Our first GDELT feed
- Improving with publish and subscribe
- Content registry
- Choices and more choices
- Going with the flow
- Metadata model
- Kibana dashboard
- Quality assurance
- Example 1 - Basic quality checking, no contending users
- Example 2 - Advanced quality checking, no contending users
- Example 3 - Basic quality checking, 50% utility due to contending users
- Summary
- Chapter 3: Input Formats and Schema
- A structured life is a good life
- GDELT dimensional modeling
- GDELT model
- First look at the data
- Core global knowledge graph model
- Hidden complexity
- Denormalized models
- Challenges with flattened data
- Issue 1 - Loss of contextual information
- Issue 2 - Re-establishing dimensions
- Issue 3 - Including reference data
- Loading your data
- Schema agility
- Reality check
- GKG ELT
- Position matters
- Avro
- Spark-Avro method
- Pedagogical method
- When to perform Avro transformation
- Parquet
- Summary
- Chapter 4: Exploratory Data Analysis
- The problem, principles and planning
- Understanding the EDA problem
- Design principles
- General plan of exploration
- Preparation
- Introducing mask based data profiling
- Introducing character class masks
- Building a mask based profiler
- Setting up Apache Zeppelin
- Constructing a reusable notebook
- Exploring GDELT
- GDELT GKG datasets
- The files
- Special collections
- Reference data
- Exploring the GKG v2.1
- The Translingual files
- A configurable GCAM time series EDA
- Plot.ly charting on Apache Zeppelin
- Exploring translation sourced GCAM sentiment with plot.ly
- Concluding remarks
- A configurable GCAM Spatio-Temporal EDA
- Introducing GeoGCAM
- Does our spatial pivot work?
- Summary
- Chapter 5: Spark for Geographic Analysis
- GDELT and oil
- GDELT events
- GDELT GKG
- Formulating a plan of action
- GeoMesa
- Installing
- GDELT Ingest
- GeoMesa Ingest
- MapReduce to Spark
- Geohash
- GeoServer
- Map layers
- CQL
- Gauging oil prices
- Using the GeoMesa query API
- Data preparation
- Machine learning
- Naive Bayes
- Results
- Analysis
- Summary
- Chapter 6: Scraping Link-Based External Data
- Building a web scale news scanner
- Accessing the web content
- The Goose library
- Integration with Spark
- Scala compatibility
- Serialization issues
- Creating a scalable, production-ready library
- Build once, read many
- Exception handling
- Performance tuning
- Named entity recognition
- Scala libraries
- NLP walkthrough
- Extracting entities
- Abstracting methods
- Building a scalable code
- Build once, read many
- Scalability is also a state of mind
- Performance tuning
- GIS lookup
- GeoNames dataset
- Building an efficient join
- Offline strategy - Bloom filtering
- Online strategy - Hash partitioning
- Content deduplication
- Context learning
- Location scoring
- Names de-duplication
- Functional programming with Scalaz
- Our de-duplication strategy
- Using the mappend operator
- Simple clean
- DoubleMetaphone
- News index dashboard
- Summary
- Chapter 7: Building Communities
- Building a graph of persons
- Contact chaining
- Extracting data from Elasticsearch
- Using the Accumulo database
- Setup Accumulo
- Cell security
- Iterators
- Elasticsearch to Accumulo
- A graph data model in Accumulo
- Hadoop input and output formats
- Reading from Accumulo
- AccumuloGraphxInputFormat and EdgeWritable
- Building a graph
- Community detection algorithm
- Louvain algorithm
- Weighted Community Clustering (WCC)
- Description
- Preprocessing stage
- Initial communities
- Message passing
- Community back propagation
- WCC iteration
- Gathering community statistics
- WCC Computation
- WCC iteration
- GDELT dataset
- The Bowie effect
- Smaller communities
- Using Accumulo cell level security
- Summary
- Chapter 8: Building a Recommendation System
- Different approaches
- Collaborative filtering
- Content-based filtering
- Custom approach
- Uninformed data
- Processing bytes
- Creating a scalable code
- From time to frequency domain
- Fast Fourier transform
- Sampling by time window
- Extracting audio signatures
- Building a song analyzer
- Selling data science is all about selling cupcakes
- Using Cassandra
- Using the Play framework
- Building a recommender
- The PageRank algorithm
- Building a Graph of Frequency Co-occurrence
- Running PageRank
- Building personalized playlists
- Expanding our cupcake factory
- Building a playlist service
- Leveraging the Spark job server
- User interface
- Summary
- Chapter 9: News Dictionary and Real-Time Tagging System
- The mechanical Turk
- Human intelligence tasks
- Bootstrapping a classification model
- Learning from Stack Exchange
- Building text features
- Training a Naive Bayes model
- Laziness, impatience, and hubris
- Designing a Spark Streaming application
- A tale of two architectures
- The CAP theorem
- The Greeks are here to help
- Importance of the Lambda architecture
- Importance of the Kappa architecture
- Consuming data streams
- Creating a GDELT data stream
- Creating a Kafka topic
- Publishing content to a Kafka topic
- Consuming Kafka from Spark Streaming
- Creating a Twitter data stream
- Processing Twitter data
- Extracting URLs and hashtags
- Keeping popular hashtags
- Expanding shortened URLs
- Fetching HTML content
- Using Elasticsearch as a caching layer
- Classifying data
- Training a Naive Bayes model
- Thread safety
- Predict the GDELT data
- Our Twitter mechanical Turk
- Summary
- Chapter 10: Story De-duplication and Mutation
- Detecting near duplicates
- First steps with hashing
- Standing on the shoulders of the Internet giants
- Simhashing
- The Hamming weight
- Detecting near duplicates in GDELT
- Indexing the GDELT database
- Persisting our RDDs
- Building a REST API
- Area of improvement
- Building stories
- Building term frequency vectors
- The curse of dimensionality, the data science plague
- Optimizing KMeans
- Story mutation
- The Equilibrium state
- Tracking stories over time
- Building a streaming application
- Streaming KMeans
- Visualization
- Building story connections
- Summary
- Chapter 11: Anomaly Detection on Sentiment Analysis
- Following the US elections on Twitter
- Acquiring data in stream
- Acquiring data in batch
- The search API
- Rate limit
- Analysing sentiment
- Massaging Twitter data
- Using the Stanford NLP
- Building the Pipeline
- Using Timely as a time series database
- Storing data
- Using Grafana to visualize sentiment
- Number of processed tweets
- Give me my Twitter account back
- Identifying the swing states
- Twitter and the Godwin point
- Learning context
- Visualizing our model
- Word2Graph and Godwin point
- Building a Word2Graph
- Random walks
- A Small Step into sarcasm detection
- Building features
- #LoveTrumpsHates
- Scoring Emojis
- Training a KMeans model
- Detecting anomalies
- Summary
- Chapter 12: TrendCalculus
- Studying trends
- The TrendCalculus algorithm
- Trend windows
- Simple trend
- User Defined Aggregate Functions
- Simple trend calculation
- Reversal rule
- Introducing the FHLS bar structure
- Visualize the data
- FHLS with reversals
- Edge cases
- Zero values
- Completing the gaps
- Stackable processing
- Practical applications
- Algorithm characteristics
- Advantages
- Disadvantages
- Possible use cases
- Chart annotation
- Co-trending
- Data reduction
- Indexing
- Fractal dimension
- Streaming proxy for piecewise linear regression
- Summary
- Chapter 13: Secure Data
- Data security
- The problem
- The basics
- Authentication and authorization
- Access control lists (ACL)
- Role-based access control (RBAC)
- Access
- Encryption
- Data at rest