Working with text: tools, techniques and approaches for text mining
What is text mining, and how can it be used? What relevance do these methods have to everyday work in information science and the digital humanities? How does one develop competences in text mining? Working with Text provides a series of cross-disciplinary perspectives on text mining and its applica...
Other Authors: | |
---|---|
Format: | E-book |
Language: | English |
Published: | Amsterdam, [Netherlands] : Chandos Publishing, 2016. |
Edition: | First edition |
Series: | Chandos information professional series. |
Subjects: | |
View at Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629874206719 |
Table of Contents:
- Front Cover
- Working with Text: Tools, Techniques and Approaches for Text Mining
- Copyright
- Contents
- Contributors
- Preface
- Acknowledgements
- Chapter 1: Working with Text
- 1.1. Introduction: Portraits of the Past
- 1.2. The Reading Robot
- 1.3. From Data to Text Mining
- 1.4. Definitions of Text Mining
- 1.5. Exploring the Disciplinary Neighbourhood
- 1.6. Prerequisites for Text Mining
- 1.7. Learning Minecraft: What Makes a Text Miner?
- 1.8. Contemporary Attitudes to Text Mining
- 1.9. Conclusions
- References
- Chapter 2: A Day at Work (with Text): A Brief Introduction
- 2.1. Introduction
- 2.2. Encouraging an Interest in Text Mining
- 2.3. Legal and Ethical Aspects of Text Mining
- 2.3.1. Activities
- 2.3.2. Selecting or Compiling a Data Set or Corpus
- 2.3.3. Building a Data Set
- 2.4. Manual Annotation: Preparing for Evaluation
- 2.4.1. Avoidance of Overfitting
- 2.4.2. Characterising the Problem
- 2.4.3. Managing User Expectations
- 2.5. Common Text Mining Tasks
- 2.6. Basic Corpus Analysis
- 2.6.1. Evaluating Frequency of Word and Term Use
- 2.6.2. Identifying Characteristic Words or Terms
- 2.7. Preprocessing a Text
- 2.8. Extracting Features from a Text
- 2.9. Information Extraction
- 2.9.1. Terminology Extraction
- 2.9.2. Named Entity Recognition (NER)
- 2.9.3. Entity Disambiguation
- 2.9.4. Relationship Extraction and Coreference Resolution
- 2.9.5. Fact Extraction
- 2.9.6. Temporal Information Extraction
- 2.9.7. Automated Geotagging or Geoindexing of Text
- 2.10. Applications of Indexing and Metadata Extraction
- 2.11. Extraction of Subjective Views
- 2.11.1. Opinion Mining
- 2.11.2. Sentiment Analysis
- 2.12. Build, Customise or Apply? Choosing an Appropriate Implementation
- 2.13. Evaluation
- 2.13.1. Evaluating Accuracy: Precision, Recall and F-measure
- 2.13.2. Process Cost in Resources, Time and Processing Resources
- 2.13.3. Fit with User Requirements and Expectations
- 2.13.4. Contextualisation of Results
- 2.14. The Role of Visualisation in Text Mining
- 2.15. Visualisation Tools and Frameworks
- 2.16. Conclusions
- References
- Chapter 3: If You Find Yourself in a Hole, Stop Digging: Legal and Ethical Issues of Text/Data Mining in Research
- 3.1. Introduction
- 3.1.1. The Relationship Between Law and Ethics
- 3.1.2. Law, Technology and Change
- 3.2. Key Legal Issues in Data Mining
- 3.2.1. Data as Property
- 3.2.1.1. Data Mining and Utilitarian Copyright Perspectives
- 3.2.2. Data as Personally Identifying Information (PII)
- 3.2.2.1. Text/Data Mining and Accountability
- 3.2.3. Primum non Nocere: Some Thoughts on False Positives, Patternicity and Liability
- 3.2.3.1. The Road to Hell May Be Paved With Risky (but Often Necessary) Assumptions
- 3.2.3.2. You Call it a "Failure Mode", I Call it "Possible Grounds for Legal Action"
- 3.3. Ethics
- 3.3.1. A Research Ethics Framework for Data/Text Mining
- 3.4. Conclusions: Working on the Borders of Law and Ethics
- References
- Chapter 4: Responsible Content Mining
- 4.1. Introduction to Content Mining
- 4.2. Obtaining Permission to Content Mine
- 4.2.1. Copyright Law as Applied to Content Mining
- 4.2.2. Database Rights as Applied to Content Mining
- 4.2.3. Contractual Restrictions on Content Mining
- 4.2.4. Practical Advice for Obtaining Permissions
- 4.3. Responsible Crawling
- 4.3.1. Understanding the Impact of Crawling
- 4.3.2. Respecting Crawler Limits
- 4.3.3. Choosing and Configuring Crawler Software
- 4.4. Publication of Results
- 4.4.1. Access and Licensing
- 4.4.2. Downstream Use
- 4.5. Citation and Acknowledgement
- 4.6. Proposed Best Practice Guidelines for Content Mining
- References
- Chapter 5: Text Mining for Semantic Search in Europe PubMed Central Labs
- 5.1. Introduction
- 5.2. Previous Work
- 5.2.1. Biomedical Text Mining
- 5.2.2. Applications of Text Mining in Search Engines
- Linked Data
- Semantic Metadata Search
- Faceting and Clustering
- Fact and Passage Retrieval
- Normalisation and Query Expansion
- Question Suggestion
- 5.3. Design and Implementation
- 5.3.1. Motivation for EvidenceFinder's Design
- 5.3.2. Index Design and Implementation
- 5.3.3. Parsing Full Papers to Support Fact Retrieval
- Collection-Scale Processing
- Updating Workflow
- 5.3.4. Query Processing
- Mapping from Queries to Questions
- Generating Grammatically Correct Questions
- Generating Questions that Make Scientific Sense
- Generating Questions that are not too Similar
- 5.3.5. Integration with Full-Text Query Service
- 5.4. Performance and Critique
- 5.4.1. Response Times and Throughput
- 5.4.2. Degree of Success in Question Generation
- 5.4.3. User Feedback and Analysis of Question Usefulness
- Early Focus Groups
- In-Depth Evaluation
- 5.5. Conclusions
- 5.6. Availability
- References
- Appendix: Resources Used for Indexing
- Chapter 6: Extracting Information from Social Media with GATE
- 6.1. Introduction
- 6.2. Social Media Streams: Characteristics, Challenges and Opportunities
- 6.3. The GATE Family of Text Mining Tools: An Overview
- 6.3.1. GATE Developer
- 6.3.2. GATE Embedded
- 6.3.3. GATE Cloud
- 6.4. Information Extraction: An Overview
- 6.5. IE from Social Media with GATE
- 6.5.1. Language Identification
- 6.5.2. Tokenisation
- 6.5.3. Normalisation
- 6.5.4. Part of Speech Tagging
- 6.5.5. Stemming and Morphological Analysis
- 6.5.6. Named Entity Recognition
- 6.6. Conclusion and Future Work
- Acknowledgements
- References
- Chapter 7: Newton: Building an Authority-Driven Company Tagging and Resolution System
- 7.1. Introduction
- 7.2. Related Work
- 7.3. System Overview
- 7.3.1. UIMA, ClearTK, uimaFIT
- 7.3.2. The Text Mining Framework
- 7.3.3. Newton System Design
- 7.4. Learning Company Name Links
- 7.4.1. Company Name Lookup
- 7.4.2. Company Classification
- 7.4.3. Training the Classifier
- 7.4.4. Evaluation
- 7.4.4.1. Data
- 7.4.4.2. Experiments
- 7.4.4.3. Results
- 7.5. System Development
- 7.5.1. Feature Consistency Testing
- 7.5.2. Performance Tracker
- 7.5.2.1. Performance Recording
- 7.5.3. Performance Reporting
- 7.6. Conclusions
- Acknowledgements
- References
- Chapter 8: Automatic Language Identification
- 8.1. Introduction
- 8.1.1. Language Identification: A Classification Task
- 8.2. Historical Overview
- 8.2.1. Distinguishing between Similar Languages
- 8.3. Computational Techniques
- 8.3.1. Short Words
- 8.3.2. N-Gram Language Models
- 8.3.2.1. Unigrams
- 8.3.2.2. Bigrams and Higher-Order N-Grams
- 8.3.3. Classification Methods
- 8.3.3.1. Out-of-Place Metric
- 8.3.3.2. Information Gain: Estimating the Best Features
- 8.4. Applications and Related Tasks
- 8.4.1. Internet Data and Short Texts
- 8.4.2. Discriminating and Language Varieties
- 8.4.3. Native Language Identification
- 8.5. Conclusion
- Acknowledgements
- References
- Chapter 9: User-Driven Text Mining of Historical Text
- 9.1. Related Work on Text Mining Historical Documents
- 9.2. The Trading Consequences System
- 9.3. Data Collections
- 9.4. Challenges of Processing Digitised Historical Text
- 9.4.1. Optical Character Recognition Errors
- 9.4.2. Text Mining Tables
- 9.5. Text Mining Component
- 9.6. User-Driven Text Mining
- 9.7. Conclusion
- Acknowledgements
- References
- Chapter 10: Automatic Text Indexing with SKOS Vocabularies in HIVE
- 10.1. Introduction
- 10.2. Automatic Indexing with Machine Learning
- 10.3. Algorithms for Text Data Mining: KEA, KEA++ and MAUI
- 10.4. Algorithm Training and Workflow
- 10.5. The HIVE System
- 10.6. Text Mining for Documents Indexing Using SKOS Vocabularies in HIVE
- 10.7. Conclusions
- Acknowledgements
- References
- Chapter 11: The PIMMS Project and Natural Language Processing for Climate Science
- 11.1. Introduction
- 11.2. Methodology
- 11.2.1. Controlled Vocabularies and Common Information Model
- 11.2.2. The ACPGeo Project (Extending ChemicalTagger)
- 11.2.3. Phrase Parsing
- 11.3. Results
- 11.3.1. Overview of ACPGeo Phrases
- 11.3.1.1. Palaeotime Phrases
- 11.3.1.2. Palaeotime Phrase Omissions and False Positives
- 11.3.2. Extraction of Tagged Metadata in a Format Suitable for Comparison with the CIM
- 11.4. Overall Conclusions and Suggestions for Further Work
- 11.4.1. ACPGeo Phrases
- 11.4.2. XML Conversion to Compare with CV and CIM from PIMMS
- 11.4.3. CIM XML: Applicability
- 11.4.4. Overall Conclusion
- Acknowledgements
- References
- Chapter 12: Building Better Mousetraps: A Linguist in NLP
- Chapter 13: Raúl Garreta, Co-founder of Tryolabs.com, Tells Emma Tonkin About the Journey from Software Engineering Gradu...
- Appendix A: Resources for Text Mining
- A.1. Introduction
- A.2. Text Mining Software and Libraries
- A.3. Text Mining Frameworks and Packages
- A.3.1. UIMA
- A.3.2. GATE: Text Mining in Java or on the Desktop
- A.3.3. OpenNLP
- A.3.4. NLTK: Text Mining in Python
- A.3.5. Stanford Parser and Part of Speech Tagger
- A.3.6. Weka
- A.3.7. UIUC NLP Tools
- A.3.8. LingPipe
- A.3.9. The TM Package: Text Mining in R
- A.3.10. spaCy
- A.3.11. Mallet
- A.3.12. MontyLingua
- A.3.13. Textmining 1.0 (Python)
- A.3.14. Sempre