Data Architecture: A Primer for the Data Scientist
Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There rem...
Other Authors: | , , |
---|---|
Format: | eBook |
Language: | English |
Published: | London, England : Academic Press, [2019] |
Edition: | Second edition |
Subjects: | |
See on Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630748606719 |
Table of Contents:
- Front Cover
- Data Architecture: A Primer for the Data Scientist
- Copyright
- Dedication
- Contents
- Chapter 1.1: An Introduction to Data Architecture
- Subdividing Data
- Repetitive/Nonrepetitive Unstructured Data
- The Great Divide of Data
- Textual/Nontextual Data
- The Different Forms of Data
- Business Value
- Chapter 1.2: The Data Infrastructure
- Two Types of Repetitive Data
- Repetitive Structured Data
- Repetitive Big Data
- The Two Infrastructures
- What's Being Optimized?
- Comparing the Two Infrastructures
- Chapter 1.3: The "Great Divide"
- Classifying Corporate Data
- The "Great Divide"
- Repetitive Unstructured Data
- Nonrepetitive Unstructured Data
- Different Worlds
- Chapter 1.4: Demographics of Corporate Data
- Chapter 1.5: Corporate Data Analysis
- Chapter 1.6: The Life Cycle of Data: Understanding Data Over Time
- Chapter 1.7: A Brief History of Data
- Paper Tape and Punch Cards
- Magnetic Tapes
- Disk Storage
- Data Base Management System (DBMS)
- Coupled Processors
- Online Transaction Processing
- Data Warehouse
- Parallel Data Management
- Data Vault
- Big Data
- The Great Divide
- Chapter 2.1: The End-State Architecture-The "World Map"
- Architectural Components
- Different Kinds of Data in the End State Architecture
- Shaping the Data Through Models
- Where Is the Data Warehouse?
- Where Different Types of Questions Are Answered Across the End State Architecture
- Data in the Data Lake
- Metadata in the End State Architecture
- Networked Metadata
- An Evolutionary Experience
- The Data Lake Architecture
- Chapter 3.1: Transformations in the End-State Architecture
- Redundant Data
- Transformations
- Customizing Data
- Transforming Text
- Transforming Application Data
- Transforming Data Into a Customized State
- Transforming Data Into Bulk Storage
- Transforming Data Generated Automatically
- Transforming Bulk Data
- Transformation and Redundancy
- Chapter 4.1: A Brief History of Big Data
- An Analogy-Taking the High Ground
- Taking the High Ground
- Standardization With the 360
- Online Transaction Processing
- Enter Teradata and MPP Processing
- Then Came Hadoop and Big Data
- IBM and Hadoop
- Holding the High Ground
- Chapter 4.2: What Is Big Data?
- Another Definition
- Large Volumes
- Inexpensive Storage
- The Roman Census Approach
- Unstructured Data
- Data in Big Data
- Context in Repetitive Data
- Nonrepetitive Data
- Context in Nonrepetitive Data
- Chapter 4.3: Parallel Processing
- Chapter 4.4: Unstructured Data
- Textual Information-Everywhere
- Decisions Based on Structured Data
- The Business Value Proposition
- Repetitive and Nonrepetitive Unstructured Information
- Ease of Analysis
- Contextualization
- Some Approaches to Contextualization
- Map Reduce
- Manual Analysis
- Chapter 4.5: Contextualizing Repetitive Unstructured Data
- Parsing Repetitive Unstructured Data
- Recasting the Output Data
- Chapter 4.6: Textual Disambiguation
- From Narrative Into an Analytical Data Base
- Input Into Textual Disambiguation
- Mapping
- Input/Output
- Document Fracturing/Named Value Processing
- Preprocessing a Document
- E-mails-A Special Case
- Spreadsheets
- Report Decompilation
- Chapter 4.7: Taxonomies
- Data Models/Taxonomies
- Applicability of Taxonomies
- What Is a Taxonomy?
- Taxonomies in Multiple Languages
- Commercial or Private Taxonomies?
- Dynamics of Taxonomies and Textual Disambiguation
- Taxonomies and Textual Disambiguation-Separate Technologies
- Different Types of Taxonomies
- Taxonomies-Maintenance Over Time
- Chapter 5.1: The Siloed Application Environment
- The Challenge of Siloed Applications
- Building Siloed Applications
- What Does a Siloed Application Look Like?
- Current Valued Data
- Minimal Historical Data
- High Availability
- Overlap Between Siloed Applications
- Frozen Business Requirements
- Dismantling Siloed Applications
- Chapter 6.1: Introduction to Data Vault 2.0
- Data Vault Origins and Background
- The "Old" Data Vault 1.0
- The New and Updated Data Vault 2.0
- What Is Data Vault 2.0 Modeling?
- A Business View
- A Technical View
- How Is Data Vault 2.0 Methodology Defined?
- A Business View
- A Technical View
- Why Do We Need a Data Vault 2.0 Architecture?
- Where Does Data Vault 2.0 Implementation Fit?
- What Are the Business Benefits of Data Vault 2.0?
- What Is Data Vault 1.0?
- Chapter 6.2: Introduction to Data Vault Modeling
- What Is a Data Vault Model Concept?
- Data Vault Model Defined
- Components of a Data Vault Model
- What Makes Business Keys So Interesting?
- What Does This Have to Do With Data Vault and Data Warehousing?
- How Does This Translate to Data Vault Modeling?
- Why Restructure the Data From the Staging Area?
- What Are the Basic Rules of the Data Vault Model?
- Why Do We Need Many to Many Link Structures?
- Primary Key Options for Data Vault 2.0
- Sequence Numbers
- Hash Keys
- Business Keys
- Source System Sequence Business Keys
- Multipart Source Business Keys
- Chapter 6.3: Introduction to Data Vault Architecture
- What Is a Data Vault 2.0 Architecture?
- How Does NoSQL Fit in to the Architecture?
- What Are the Objectives of the Data Vault 2.0 Architecture?
- What Is the Objective of the Data Vault 2.0 Model?
- What Are Hard and Soft Business Rules?
- How Does Managed Self Service BI Fit in the Architecture?
- Chapter 6.4: Introduction to Data Vault Methodology
- Data Vault 2.0 Methodology Overview
- How Does CMMI Contribute to the Methodology?
- If CMMI Is So Great, Why Should We Care About Agility Then?
- Why Include PMP, SDLC If CMMI and Agile Should Be All That's Needed?
- So Then, What Does Six Sigma Contribute to the Data Vault 2 Methodology?
- Where Does TQM (Total Quality Management) Fit in to All of This?
- Chapter 6.5: Introduction to Data Vault Implementation
- Implementation Overview
- What's So Important About Patterns?
- Why Does Reengineering Happen Because of Big Data?
- Why Do We Need to Virtualize Our Data Marts?
- What Is Managed Self-Service BI?
- Chapter 7.1: The Operational Environment: A Short History
- Commercial Uses of the Computer
- The First Applications
- Ed Yourdon and the Structured Revolution
- The SDLC
- Disk Technology
- Enter the DBMS
- Response Time and Availability
- Corporate Computing Today
- Chapter 7.2: The Standard Work Unit
- Elements of Response Time
- An Hourglass Analogy
- The Racetrack Analogy
- Your Vehicle Runs as Fast as the Vehicle in Front of It
- The Standard Work Unit
- The SLA
- Chapter 7.3: Data Modeling for the Structured Environment
- The Purpose of the Roadmap
- Granular Data Only
- The ERD
- The DIS
- Physical Data Base Design
- Relating the Different Levels of the Data Model
- An Example of the Linkage
- Generic Data Models
- Operational Data Models/Data Warehouse Data Models
- Chapter 8.1: A Brief History of Data Architecture
- Chapter 8.2: Big Data/Existing System Interface
- The Big Data/Existing Systems Interface
- The Repetitive Raw Big Data/Existing Systems Interface
- Exception Based Data
- The Nonrepetitive Raw Big Data/Existing Systems Interface
- Into the Existing Systems Environment
- The "Context Enriched" Big Data Environment
- Analyzing Structured Data/Unstructured Data Together
- Chapter 8.3: The Data Warehouse/Operational Environment Interface
- The Operational/Data Warehouse Interface
- The Classical ETL Interface
- The ODS and the ETL Interface
- The Staging Area
- Changed Data Capture
- Inline Transformation
- ELT Processing
- Chapter 8.4: Data Architecture: A High-Level Perspective
- A High Level Perspective
- Redundancy
- The System of Record
- Different Types of Questions
- Different Communities
- Chapter 9.1: Repetitive Analytics: Some Basics
- Different Kinds of Analysis
- Looking for Patterns
- Heuristic Processing
- Freezing Data
- The Sandbox
- The "Normal" Profile
- Distillation, Filtering
- Subsetting Data
- Bias of the Sample
- Filtering Data
- Repetitive Data and Context
- Linking Repetitive Records
- Log Tape Records
- Analyzing Points of Data
- Outliers
- Data Over Time
- Chapter 9.2: Analyzing Repetitive Data
- Log Data
- Active/Passive Indexing of Data
- Summary/Detailed Data
- Metadata in Big Data
- Linking Data
- Chapter 9.3: Repetitive Analysis
- Internal, External Data
- Universal Identifiers
- Security
- Filtering, Distillation
- Archiving Results
- Metrics
- Chapter 10.1: Nonrepetitive Data
- Inline Contextualization
- Taxonomy/Ontology Processing
- Custom Variables
- Homographic Resolution
- Acronym Resolution
- Negation Analysis
- Numeric Tagging
- Date Tagging
- Date Standardization
- List Processing
- Associative Word Processing
- Stop Word Processing
- Word Stemming
- Document Metadata
- Document Classification
- Proximity Analysis
- Functional Sequencing Within Textual ETL
- Internal Referential Integrity
- Preprocessing, Postprocessing
- Chapter 10.2: Mapping
- Chapter 10.3: Analytics From Nonrepetitive Data
- Call Center Information
- Medical Records
- Chapter 11.1: Operational Analytics: Response Time
- Transaction Response Time
- Chapter 12.1: Operational Analytics
- Different Perspectives of Data