Apache Spark machine learning blueprints: develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide

Develop a range of cutting-edge machine learning projects with Apache Spark using this actionable guide. About This Book: Customize Apache Spark and R to fit your analytical needs in customer research, fraud detection, risk analytics, and recommendation engine development. Develop a set of practical Ma...

Bibliographic Details
Other Authors: Liu, Alex (author)
Format: E-book
Language: English
Published: Birmingham : Packt Publishing, 2016.
Edition: 1st edition
Series: Community experience distilled.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630193006719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • About the Reviewer
  • www.PacktPub.com
  • Table of Contents
  • Preface
  • Chapter 1: Spark for Machine Learning
  • Spark overview and Spark advantages
  • Spark overview
  • Spark advantages
  • Spark computing for machine learning
  • Machine learning algorithms
  • MLlib
  • Other ML libraries
  • Spark RDD and dataframes
  • Spark RDD
  • Spark dataframes
  • Dataframes API for R
  • ML frameworks, RM4Es and Spark computing
  • ML frameworks
  • RM4Es
  • The Spark computing framework
  • ML workflows and Spark pipelines
  • ML as a step-by-step workflow
  • ML workflow examples
  • Spark notebooks
  • Notebook approach for ML
  • Step 1: Getting the software ready
  • Step 2: Installing the Knitr package
  • Step 3: Creating a simple report
  • Spark notebooks
  • Summary
  • Chapter 2: Data Preparation for Spark ML
  • Accessing and loading datasets
  • Accessing publicly available datasets
  • Loading datasets into Spark
  • Exploring and visualizing datasets
  • Data cleaning
  • Dealing with data incompleteness
  • Data cleaning in Spark
  • Data cleaning made easy
  • Identity matching
  • Identity issues
  • Identity matching on Spark
  • Entity resolution
  • Short string comparison
  • Long string comparison
  • Record deduplication
  • Identity matching made better
  • Crowdsourced deduplication
  • Configuring the crowd
  • Using the crowd
  • Dataset reorganizing
  • Dataset reorganizing tasks
  • Dataset reorganizing with Spark SQL
  • Dataset reorganizing with R on Spark
  • Dataset joining
  • Dataset joining and its tool - the Spark SQL
  • Dataset joining in Spark
  • Dataset joining with the R data table package
  • Feature extraction
  • Feature development challenges
  • Feature development with Spark MLlib
  • Feature development with R
  • Repeatability and automation
  • Dataset preprocessing workflows
  • Spark pipelines for dataset preprocessing
  • Dataset preprocessing automation
  • Summary
  • Chapter 3: A Holistic View on Spark
  • Spark for a holistic view
  • The use case
  • Fast and easy computing
  • Methods for a holistic view
  • Regression modeling
  • The SEM approach
  • Decision trees
  • Feature preparation
  • PCA
  • Grouping by category to use subject knowledge
  • Feature selection
  • Model estimation
  • MLlib implementation
  • The R notebooks' implementation
  • Model evaluation
  • Quick evaluations
  • RMSE
  • ROC curves
  • Results explanation
  • Impact assessments
  • Deployment
  • Dashboard
  • Rules
  • Summary
  • Chapter 4: Fraud Detection on Spark
  • Spark for fraud detection
  • The use case
  • Distributed computing
  • Methods for fraud detection
  • Random forest
  • Decision trees
  • Feature preparation
  • Feature extraction from LogFile
  • Data merging
  • Model estimation
  • MLlib implementation
  • R notebooks implementation
  • Model evaluation
  • A quick evaluation
  • Confusion matrix and false positive ratios
  • Results explanation
  • Big influencers and their impacts
  • Deploying fraud detection
  • Rules
  • Scoring
  • Summary
  • Chapter 5: Risk Scoring on Spark
  • Spark for risk scoring
  • The use case
  • Apache Spark notebooks
  • Methods of risk scoring
  • Logistic regression
  • Preparing coding in R
  • Random forest and decision trees
  • Preparing coding
  • Data and feature preparation
  • OpenRefine
  • Model estimation
  • The DataScientistWorkbench for R notebooks
  • R notebooks implementation
  • Model evaluation
  • Confusion matrix
  • ROC
  • Kolmogorov-Smirnov
  • Results explanation
  • Big influencers and their impacts
  • Deployment
  • Scoring
  • Summary
  • Chapter 6: Churn Prediction on Spark
  • Spark for churn prediction
  • The use case
  • Spark computing
  • Methods for churn prediction
  • Regression models
  • Decision trees and Random forest
  • Feature preparation
  • Feature extraction
  • Feature selection
  • Model estimation
  • Spark implementation with MLlib
  • Model evaluation
  • Results explanation
  • Calculating the impact of interventions
  • Deployment
  • Scoring
  • Intervention recommendations
  • Summary
  • Chapter 7: Recommendations on Spark
  • Apache Spark for a recommendation engine
  • The use case
  • SPSS on Spark
  • Methods for recommendation
  • Collaborative filtering
  • Preparing coding
  • Data treatment with SPSS
  • Missing data nodes on SPSS modeler
  • Model estimation
  • SPSS on Spark - the SPSS Analytics server
  • Model evaluation
  • Recommendation deployment
  • Summary
  • Chapter 8: Learning Analytics on Spark
  • Spark for attrition prediction
  • The use case
  • Spark computing
  • Methods of attrition prediction
  • Regression models
  • About regression
  • Preparing for coding
  • Decision trees
  • Preparing for coding
  • Feature preparation
  • Feature development
  • Feature selection
  • Principal components analysis
  • ML feature selection
  • Model estimation
  • Spark implementation with the Zeppelin notebook
  • Model evaluation
  • A quick evaluation
  • The confusion matrix and error ratios
  • Results explanation
  • Calculating the impact of interventions
  • Calculating the impact of main causes
  • Deployment
  • Rules
  • Scoring
  • Summary
  • Chapter 9: City Analytics on Spark
  • Spark for service forecasting
  • The use case
  • Spark computing
  • Methods of service forecasting
  • Regression models
  • About regression
  • Preparing for coding
  • Time series modeling
  • About time series
  • Preparing for coding
  • Data and feature preparation
  • Data merging
  • Feature selection
  • Model estimation
  • Spark implementation with the Zeppelin notebook
  • Spark implementation with the R notebook
  • Model evaluation
  • RMSE calculation with MLlib
  • RMSE calculation with R
  • Explanations of the results
  • Biggest influencers
  • Visualizing trends
  • The rules of sending out alerts
  • Scores to rank city zones
  • Summary
  • Chapter 10: Learning Telco Data on Spark
  • Spark for using Telco Data
  • The use case
  • Spark computing
  • Methods for learning from Telco Data
  • Descriptive statistics and visualization
  • Linear and logistic regression models
  • Decision tree and random forest
  • Data and feature development
  • Data reorganizing
  • Feature development and selection
  • Model estimation
  • SPSS on Spark - SPSS Analytics Server
  • Model evaluation
  • RMSE calculations with MLlib
  • RMSE calculations with R
  • Confusion matrix and error ratios with MLlib and R
  • Results explanation
  • Descriptive statistics and visualizations
  • Biggest influencers
  • Special insights
  • Visualizing trends
  • Model deployment
  • Rules to send out alerts
  • Scores subscribers for churn and for Call Center calls
  • Scores subscribers for purchase propensity
  • Summary
  • Chapter 11: Modeling Open Data on Spark
  • Spark for learning from open data
  • The use case
  • Spark computing
  • Methods for scoring and ranking
  • Cluster analysis
  • Principal component analysis
  • Regression models
  • Score resembling
  • Data and feature preparation
  • Data cleaning
  • Data merging
  • Feature development
  • Feature selection
  • Model estimation
  • SPSS on Spark - SPSS Analytics Server
  • Model evaluation
  • RMSE calculations with MLlib
  • RMSE calculations with R
  • Results explanation
  • Comparing ranks
  • Biggest influencers
  • Deployment
  • Rules for sending out alerts
  • Scores for ranking school districts
  • Summary
  • Index