Apache Spark machine learning blueprints: develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide

Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide. About This Book: Customize Apache Spark and R to fit your analytical needs in customer research, fraud detection, risk analytics, and recommendation engine development. Develop a set of practical Ma...

Bibliographic Details
Other Authors:	Liu, Alex (author)
Format:	eBook
Language:	English
Published:	Birmingham : Packt Publishing 2016.
Edition:	1st edition
Series:	Community experience distilled.
Subjects:	Spark (Electronic resource : Apache Software Foundation). Machine learning. Big data. Information retrieval.
See on Biblioteca Universitat Ramon Llull:	https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630193006719

Table of Contents:

Cover
Copyright
Credits
About the Author
About the Reviewer
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Spark for Machine Learning
Spark overview and Spark advantages
Spark overview
Spark advantages
Spark computing for machine learning
Machine learning algorithms
MLlib
Other ML libraries
Spark RDD and dataframes
Spark RDD
Spark dataframes
Dataframes API for R
ML frameworks, RM4Es and Spark computing
ML frameworks
RM4Es
The Spark computing framework
ML workflows and Spark pipelines
ML as a step-by-step workflow
ML workflow examples
Spark notebooks
Notebook approach for ML
Step 1: Getting the software ready
Step 2: Installing the Knitr package
Step 3: Creating a simple report
Spark notebooks
Summary
Chapter 2: Data Preparation for Spark ML
Accessing and loading datasets
Accessing publicly available datasets
Loading datasets into Spark
Exploring and visualizing datasets
Data cleaning
Dealing with data incompleteness
Data cleaning in Spark
Data cleaning made easy
Identity matching
Identity issues
Identity matching on Spark
Entity resolution
Short string comparison
Long string comparison
Record deduplication
Identity matching made better
Crowdsourced deduplication
Configuring the crowd
Using the crowd
Dataset reorganizing
Dataset reorganizing tasks
Dataset reorganizing with Spark SQL
Dataset reorganizing with R on Spark
Dataset joining
Dataset joining and its tool - the Spark SQL
Dataset joining in Spark
Dataset joining with the R data table package
Feature extraction
Feature development challenges
Feature development with Spark MLlib
Feature development with R
Repeatability and automation
Dataset preprocessing workflows
Spark pipelines for dataset preprocessing
Dataset preprocessing automation
Summary
Chapter 3: A Holistic View on Spark
Spark for a holistic view
The use case
Fast and easy computing
Methods for a holistic view
Regression modeling
The SEM approach
Decision trees
Feature preparation
PCA
Grouping by category to use subject knowledge
Feature selection
Model estimation
MLlib implementation
The R notebooks' implementation
Model evaluation
Quick evaluations
RMSE
ROC curves
Results explanation
Impact assessments
Deployment
Dashboard
Rules
Summary
Chapter 4: Fraud Detection on Spark
Spark for fraud detection
The use case
Distributed computing
Methods for fraud detection
Random forest
Decision trees
Feature preparation
Feature extraction from LogFile
Data merging
Model estimation
MLlib implementation
R notebooks implementation
Model evaluation
A quick evaluation
Confusion matrix and false positive ratios
Results explanation
Big influencers and their impacts
Deploying fraud detection
Rules
Scoring
Summary
Chapter 5: Risk Scoring on Spark
Spark for risk scoring
The use case
Apache Spark notebooks
Methods of risk scoring
Logistic regression
Preparing coding in R
Random forest and decision trees
Preparing coding
Data and feature preparation
OpenRefine
Model estimation
The DataScientistWorkbench for R notebooks
R notebooks implementation
Model evaluation
Confusion matrix
ROC
Kolmogorov-Smirnov
Results explanation
Big influencers and their impacts
Deployment
Scoring
Summary
Chapter 6: Churn Prediction on Spark
Spark for churn prediction
The use case
Spark computing
Methods for churn prediction
Regression models
Decision trees and Random forest
Feature preparation
Feature extraction
Feature selection
Model estimation
Spark implementation with MLlib
Model evaluation
Results explanation
Calculating the impact of interventions
Deployment
Scoring
Intervention recommendations
Summary
Chapter 7: Recommendations on Spark
Apache Spark for a recommendation engine
The use case
SPSS on Spark
Methods for recommendation
Collaborative filtering
Preparing coding
Data treatment with SPSS
Missing data nodes on SPSS modeler
Model estimation
SPSS on Spark - the SPSS Analytics server
Model evaluation
Recommendation deployment
Summary
Chapter 8: Learning Analytics on Spark
Spark for attrition prediction
The use case
Spark computing
Methods of attrition prediction
Regression models
About regression
Preparing for coding
Decision trees
Preparing for coding
Feature preparation
Feature development
Feature selection
Principal components analysis
ML feature selection
Model estimation
Spark implementation with the Zeppelin notebook
Model evaluation
A quick evaluation
The confusion matrix and error ratios
Results explanation
Calculating the impact of interventions
Calculating the impact of main causes
Deployment
Rules
Scoring
Summary
Chapter 9: City Analytics on Spark
Spark for service forecasting
The use case
Spark computing
Methods of service forecasting
Regression models
About regression
Preparing for coding
Time series modeling
About time series
Preparing for coding
Data and feature preparation
Data merging
Feature selection
Model estimation
Spark implementation with the Zeppelin notebook
Spark implementation with the R notebook
Model evaluation
RMSE calculation with MLlib
RMSE calculation with R
Explanations of the results
Biggest influencers
Visualizing trends
The rules of sending out alerts
Scores to rank city zones
Summary
Chapter 10: Learning Telco Data on Spark
Spark for using Telco Data
The use case
Spark computing
Methods for learning from Telco Data
Descriptive statistics and visualization
Linear and logistic regression models
Decision tree and random forest
Data and feature development
Data reorganizing
Feature development and selection
Model estimation
SPSS on Spark - SPSS Analytics Server
Model evaluation
RMSE calculations with MLlib
RMSE calculations with R
Confusion matrix and error ratios with MLlib and R
Results explanation
Descriptive statistics and visualizations
Biggest influencers
Special insights
Visualizing trends
Model deployment
Rules to send out alerts
Scores subscribers for churn and for Call Center calls
Scores subscribers for purchase propensity
Summary
Chapter 11: Modeling Open Data on Spark
Spark for learning from open data
The use case
Spark computing
Methods for scoring and ranking
Cluster analysis
Principal component analysis
Regression models
Score resembling
Data and feature preparation
Data cleaning
Data merging
Feature development
Feature selection
Model estimation
SPSS on Spark - SPSS Analytics Server
Model evaluation
RMSE calculations with MLlib
RMSE calculations with R
Results explanation
Comparing ranks
Biggest influencers
Deployment
Rules for sending out alerts
Scores for ranking school districts
Summary
Index
