Mastering data mining with Python: find patterns hidden in your data

Learn how to create more powerful data mining applications with this comprehensive Python guide to advanced data analytics techniques. About This Book: Dive deeper into data mining with Python - don't be complacent, sharpen your skills! From the most common elements of data mining to cutting-edge...

Bibliographic Details
Other Authors: Squire, Megan (author)
Format: E-book
Language: English
Published: Birmingham, England ; Mumbai, India : Packt Publishing, 2016.
Edition: 1st edition
Series: Community experience distilled.
Subjects:
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630314406719
Table of Contents:
  • Cover
  • Copyright
  • Credits
  • About the Author
  • About the Reviewers
  • www.PacktPub.com
  • Table of Contents
  • Preface
  • Expanding Your Data Mining Toolbox
  • What is data mining?
  • How do we do data mining?
  • The Fayyad et al. KDD process
  • The Han et al. KDD process
  • The CRISP-DM process
  • The Six Steps process
  • Which data mining methodology is the best?
  • What are the techniques used in data mining?
  • What techniques are we going to use in THIS book?
  • How do we set up our data mining work environment?
  • Summary
  • Association Rule Mining
  • What are frequent itemsets?
  • The diapers and beer urban legend
  • Frequent itemset mining basics
  • Towards association rules
  • Support
  • Confidence
  • Association rules
  • An example with data
  • Added value - fixing a flaw in the plan
  • Methods for finding frequent itemsets
  • A project - discovering association rules in software project tags
  • Summary
  • Entity Matching
  • What is entity matching?
  • Merging data
  • Merging datasets vertically
  • Merging datasets horizontally
  • Techniques for matching
  • Attribute-based similarity matching
  • Be careful of pairwise comparisons
  • Leverage rare values
  • Methods for matching attributes
  • Range-based or distance from target
  • String edit distance
  • Hamming distance
  • Levenshtein distance
  • Soundex
  • Leveraging disjoint sets
  • Context-based similarity matching
  • Machine learning-based entity matching
  • Evaluation of entity matching techniques
  • Efficiency - how long does it take to do the matching?
  • Effectiveness - how accurate are the matches that we generate?
  • Usefulness - how practical is the matching procedure to use?
  • Entity matching project
  • Difficulties with matching software projects
  • Two examples
  • Matching on project names
  • Matching on people names
  • Matching on URLs
  • Matching on topics and description keywords
  • The dataset
  • The code
  • The results
  • How many entity matches did we find?
  • How good are the pairs we found?
  • Summary
  • Network Analysis
  • What is a network?
  • Measuring a network
  • Degree of a network
  • Diameter of a network
  • Walks, paths, and trails in a network
  • Components of a network
  • Centrality of a network
  • Closeness centrality
  • Degree centrality
  • Betweenness centrality
  • Other measures of centrality
  • Representing graph data
  • Adjacency matrix
  • Edge lists and adjacency lists
  • Differences between graph data structures
  • Importing data into a graph structure
  • Adjacency list format
  • Edge list format
  • GEXF and GraphML
  • GDF
  • Python pickle
  • JSON
  • JSON node and link series
  • JSON trees
  • Pajek format
  • A real project
  • Exploring the data
  • Generating the network files
  • Understanding our data as a network
  • Generating simple network metrics
  • Playing with the parameters of a network
  • Analyzing subgraphs
  • Analyzing cliques and centrality in the subgraphs
  • Looking for change over time
  • Summary
  • Sentiment Analysis
  • What is sentiment analysis?
  • The basics of sentiment analysis
  • The structure of an opinion
  • Document-level and sentence-level analysis
  • Important features of opinions
  • Sentiment analysis algorithms
  • General-purpose data collections
  • Hu and Liu's sentiment analysis lexicon
  • SentiWordNet
  • Vader sentiment
  • Sentiment mining application
  • Motivating the project
  • Data preparation
  • Data analysis of chat messages
  • Data analysis of e-mail messages
  • Summary
  • Named Entity Recognition in Text
  • Why look for named entities?
  • Techniques for named entity recognition
  • Tagging parts of speech
  • Classes of named entities
  • Building and evaluating NER systems
  • NER and partial matches
  • Handling partial matches
  • Named entity recognition project
  • A simple NER tool
  • Apache Board meeting minutes
  • Django IRC chat
  • GnuIRC summaries
  • LKML e-mails
  • Summary
  • Automatic Text Summarization
  • What is automatic text summarization?
  • Tools for text summarization
  • Naive text summarization using NLTK
  • Text summarization using Gensim
  • Text summarization using Sumy
  • Sumy's Luhn summarizer
  • Sumy's TextRank summarizer
  • Sumy's LSA summarizer
  • Sumy's Edmundson summarizer
  • Summary
  • Topic Modeling
  • What is topic modeling?
  • Latent Dirichlet Allocation
  • Gensim for topic modeling
  • Understanding Gensim LDA topics
  • Understanding Gensim LDA passes
  • Applying a Gensim LDA model to new documents
  • Serializing Gensim LDA objects
  • Serializing a dictionary
  • Serializing a corpus
  • Serializing a model
  • Gensim LDA for a larger project
  • Summary
  • Mining for Data Anomalies
  • What are data anomalies?
  • Missing data
  • Locating missing data
  • Zero values
  • Fixing missing data
  • Ignore the problem rows
  • Fix the problem manually
  • Use a fabricated value
  • Use a central measure
  • Use Last Observation Carried Forward
  • Use a similar value
  • Use the most likely value
  • Data errors
  • Truncated fields
  • Data type and character set errors
  • Logic or semantic errors
  • Outliers
  • Visual mining for outliers
  • Statistical detection of outliers
  • Summary
  • Index