R Data Mining: Implement data mining techniques through practical use cases and real-world datasets

Mine valuable insights from your data using popular tools and techniques in R. About This Book: Understand the basics of data mining and why R is a perfect tool for it. Manipulate your data using popular R packages such as ggplot2, dplyr, and so on to gather valuable business insights from it. Apply e...

Bibliographic Details
Other Authors: Cirillo, Andrea (author)
Format: E-book
Language: English
Published: Birmingham, England ; Mumbai, [India] : Packt Publishing, 2017.
Edition: 1st edition
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630080406719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • About the Reviewers
  • www.PacktPub.com
  • Customer Feedback
  • Table of Contents
  • Preface
  • Chapter 1: Why to Choose R for Your Data Mining and Where to Start
  • What is R?
  • A bit of history
  • R's points of strength
  • Open source inside
  • Plugin ready
  • Data visualization friendly
  • Installing R and writing R code
  • Downloading R
  • R installation for Windows and macOS
  • R installation for Linux OS
  • Main components of a base R installation
  • Possible alternatives to write and run R code
  • RStudio (all OSs)
  • The Jupyter Notebook (all OSs)
  • Visual Studio (Windows users only)
  • R foundational notions
  • A preliminary R session
  • Executing R interactively through the R console
  • Creating an R script
  • Executing an R script
  • Vectors
  • Lists
  • Creating lists
  • Subsetting lists
  • Data frames
  • Functions
  • R's weaknesses and how to overcome them
  • Learning R effectively and minimizing the effort
  • The tidyverse
  • Leveraging the R community to learn R
  • Where to find the R community
  • Engaging with the community to learn R
  • Handling large datasets with R
  • Further references
  • Summary
  • Chapter 2: A First Primer on Data Mining - Analysing Your Bank Account Data
  • Acquiring and preparing your banking data
  • Data model
  • Summarizing your data with pivot-like tables
  • A gentle introduction to the pipe operator
  • An even more gentle introduction to the dplyr package
  • Installing the necessary packages and loading your data into R
  • Installing and loading the necessary packages
  • Importing your data into R
  • Defining the monthly and daily sum of expenses
  • Visualizing your data with ggplot2
  • Basic data visualization principles
  • Less but better
  • Not every chart is good for your message
  • Scatter plot
  • Line chart
  • Bar plot
  • Other advanced charts
  • Colors have to be chosen carefully
  • A bit of theory - chromatic circle, hue, and luminosity
  • Visualizing your data with ggplot
  • One more gentle introduction - the grammar of graphics
  • A layered grammar of graphics - ggplot2
  • Visualizing your banking movements with ggplot2
  • Visualizing the number of movements per day of the week
  • Further references
  • Summary
  • Chapter 3: The Data Mining Process - CRISP-DM Methodology
  • The CRISP-DM methodology data mining cycle
  • Business understanding
  • Data understanding
  • Data collection
  • How to perform data collection with R
  • Data import from TXT and CSV files
  • Data import from different types of format already structured as tables
  • Data import from unstructured sources
  • Data description
  • How to perform data description with R
  • Data exploration
  • What to use in R to perform this task
  • The summary() function
  • Box plot
  • Histograms
  • Data preparation
  • Modelling
  • Defining a data modeling strategy
  • How similar problems were solved in the past
  • Emerging techniques
  • Classification of modeling problems
  • How to perform data modeling with R
  • Evaluation
  • Clustering evaluation
  • Classification evaluation
  • Regression evaluation
  • How to judge the adequacy of a model's performance
  • What to use in R to perform this task
  • Deployment
  • Deployment plan development
  • Maintenance plan development
  • Summary
  • Chapter 4: Keeping the House Clean - The Data Mining Architecture
  • A general overview
  • Data sources
  • Types of data sources
  • Unstructured data sources
  • Structured data sources
  • Key issues of data sources
  • Databases and data warehouses
  • The third wheel - the data mart
  • One-level database
  • Two-level database
  • Three-level database
  • Technologies
  • SQL
  • MongoDB
  • Hadoop
  • The data mining engine
  • The interpreter
  • The interface between the engine and the data warehouse
  • The data mining algorithms
  • User interface
  • Clarity
  • Clarity and mystery
  • Clarity and simplicity
  • Efficiency
  • Consistency
  • Syntax highlight
  • Auto-completion
  • How to build a data mining architecture in R
  • Data sources
  • The data warehouse
  • The data mining engine
  • The interface between the engine and the data warehouse
  • The data mining algorithms
  • The user interface
  • Further references
  • Summary
  • Chapter 5: How to Address a Data Mining Problem - Data Cleaning and Validation
  • On a quiet day
  • Data cleaning
  • Tidy data
  • Analysing the structure of our data
  • The str function
  • The describe function
  • head, tail, and View functions
  • Evaluating your data tidiness
  • Every row is a record
  • Every column shows an attribute
  • Every table represents an observational unit
  • Tidying our data
  • The tidyr package
  • Long versus wide data
  • The spread function
  • The gather function
  • The separate function
  • Applying tidyr to our dataset
  • Validating our data
  • Fitness for use
  • Conformance to standards
  • Data quality controls
  • Consistency checks
  • Data type checks
  • Logical checks
  • Domain checks
  • Uniqueness checks
  • Performing data validation on our data
  • Data type checks with str()
  • Domain checks
  • The final touch - data merging
  • left_join function
  • Moving beyond left_join
  • Further references
  • Summary
  • Chapter 6: Looking into Your Data Eyes - Exploratory Data Analysis
  • Introducing summary EDA
  • Describing the population distribution
  • Quartiles and Median
  • Mean
  • The mean and phenomenon going on within sub populations
  • The mean being biased by outlier values
  • Computing the mean of our population
  • Variance
  • Standard deviation
  • Skewness
  • Measuring the relationship between variables
  • Correlation
  • The Pearson correlation coefficient
  • Distance correlation
  • Weaknesses of summary EDA - the Anscombe quartet
  • Graphical EDA
  • Visualizing a variable distribution
  • Histogram
  • Reporting date histogram
  • Geographical area histogram
  • Cash flow histogram
  • Boxplot
  • Checking for outliers
  • Visualizing relationships between variables
  • Scatterplots
  • Adding title, subtitle, and caption to the plot
  • Setting axis and legend
  • Adding explicative text to the plot
  • Final touches on colors
  • Further references
  • Summary
  • Chapter 7: Our First Guess - a Linear Regression
  • Defining a data modelling strategy
  • Data modelling notions
  • Supervised learning
  • Unsupervised learning
  • The modeling strategy
  • Applying linear regression to our data
  • The intuition behind linear regression
  • The math behind the linear regression
  • Ordinary least squares technique
  • Model requirements - what to look for before applying the model
  • Residuals' uncorrelation
  • Residuals' homoscedasticity
  • How to apply linear regression in R
  • Fitting the linear regression model
  • Validating model assumption
  • Visualizing fitted values
  • Preparing the data for visualization
  • Developing the data visualization
  • Further references
  • Summary
  • Chapter 8: A Gentle Introduction to Model Performance Evaluation
  • Defining model performance
  • Fitting versus interpretability
  • Making predictions with models
  • Measuring performance in regression models
  • Mean squared error
  • R-squared
  • R-squared meaning and interpretation
  • R-squared computation in R
  • Adjusted R-squared
  • R-squared misconceptions
  • The R-squared doesn't measure the goodness of fit
  • A low R-squared doesn't mean your model is not statistically significant
  • Measuring the performance in classification problems
  • The confusion matrix
  • Confusion matrix in R
  • Accuracy
  • How to compute accuracy in R
  • Sensitivity
  • How to compute sensitivity in R
  • Specificity
  • How to compute specificity in R
  • How to choose the right performance statistics
  • A final general warning - training versus test datasets
  • Further references
  • Summary
  • Chapter 9: Don't Give up - Power up Your Regression Including Multiple Variables
  • Moving from simple to multiple linear regression
  • Notation
  • Assumptions
  • Variables' collinearity
  • Tolerance
  • Variance inflation factors
  • Addressing collinearity
  • Dimensionality reduction
  • Stepwise regression
  • Backward stepwise regression
  • From the full model to the n-1 model
  • Forward stepwise regression
  • Double direction stepwise regression
  • Principal component regression
  • Fitting a multiple linear model with R
  • Model fitting
  • Variable assumptions validation
  • Residual assumptions validation
  • Dimensionality reduction
  • Principal component regression
  • Stepwise regression
  • Linear model cheat sheet
  • Further references
  • Summary
  • Chapter 10: A Different Outlook to Problems with Classification Models
  • What is classification and why do we need it?
  • Linear regression limitations for categorical variables
  • Common classification algorithms and models
  • Logistic regression
  • The intuition behind logistic regression
  • The logistic function estimates a response variable enclosed within an upper and lower bound
  • The logistic function estimates the probability of an observation pertaining to one of the two available categories
  • The math behind logistic regression
  • Maximum likelihood estimator
  • Model assumptions
  • Absence of multicollinearity between variables
  • Linear relationship between explanatory variables and log odds
  • Large enough sample size
  • How to apply logistic regression in R
  • Fitting the model
  • Reading the glm() estimation output
  • The level of statistical significance of the association between the explanatory variable and the response variable