Data architecture: a primer for the data scientist

Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There rem...

Bibliographic Details
Other Authors: Inmon, William H., author; Linstedt, Daniel, author; Levins, Mary, author
Format: eBook
Language: English
Published: London, England : Academic Press [2019]
Edition: Second edition
See on Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009630748606719
Table of Contents:
  • Front Cover
  • Data Architecture: A Primer for the Data Scientist
  • Copyright
  • Dedication
  • Contents
  • Chapter 1.1: An Introduction to Data Architecture
  • Subdividing Data
  • Repetitive/Nonrepetitive Unstructured Data
  • The Great Divide of Data
  • Textual/Nontextual Data
  • The Different Forms of Data
  • Business Value
  • Chapter 1.2: The Data Infrastructure
  • Two Types of Repetitive Data
  • Repetitive Structured Data
  • Repetitive Big Data
  • The Two Infrastructures
  • What's Being Optimized?
  • Comparing the Two Infrastructures
  • Chapter 1.3: The "Great Divide"
  • Classifying Corporate Data
  • The "Great Divide"
  • Repetitive Unstructured Data
  • Nonrepetitive Unstructured Data
  • Different Worlds
  • Chapter 1.4: Demographics of Corporate Data
  • Chapter 1.5: Corporate Data Analysis
  • Chapter 1.6: The Life Cycle of Data: Understanding Data Over Time
  • Chapter 1.7: A Brief History of Data
  • Paper Tape and Punch Cards
  • Magnetic Tapes
  • Disk Storage
  • Data Base Management System (DBMS)
  • Coupled Processors
  • Online Transaction Processing
  • Data Warehouse
  • Parallel Data Management
  • Data Vault
  • Big Data
  • The Great Divide
  • Chapter 2.1: The End-State Architecture - The "World Map"
  • Architectural Components
  • Different Kinds of Data in the End State Architecture
  • Shaping the Data Through Models
  • Where Is the Data Warehouse?
  • Where Different Types of Questions Are Answered Across the End State Architecture
  • Data in the Data Lake
  • Metadata in the End State Architecture
  • Networked Metadata
  • An Evolutionary Experience
  • The Data Lake Architecture
  • Chapter 3.1: Transformations in the End-State Architecture
  • Redundant Data
  • Transformations
  • Customizing Data
  • Transforming Text
  • Transforming Application Data
  • Transforming Data Into a Customized State
  • Transforming Data Into Bulk Storage
  • Transforming Data Generated Automatically
  • Transforming Bulk Data
  • Transformation and Redundancy
  • Chapter 4.1: A Brief History of Big Data
  • An Analogy - Taking the High Ground
  • Taking the High Ground
  • Standardization With the 360
  • Online Transaction Processing
  • Enter Teradata and MPP Processing
  • Then Came Hadoop and Big Data
  • IBM and Hadoop
  • Holding the High Ground
  • Chapter 4.2: What Is Big Data?
  • Another Definition
  • Large Volumes
  • Inexpensive Storage
  • The Roman Census Approach
  • Unstructured Data
  • Data in Big Data
  • Context in Repetitive Data
  • Nonrepetitive Data
  • Context in Nonrepetitive Data
  • Chapter 4.3: Parallel Processing
  • Chapter 4.4: Unstructured Data
  • Textual Information - Everywhere
  • Decisions Based on Structured Data
  • The Business Value Proposition
  • Repetitive and Nonrepetitive Unstructured Information
  • Ease of Analysis
  • Contextualization
  • Some Approaches to Contextualization
  • Map Reduce
  • Manual Analysis
  • Chapter 4.5: Contextualizing Repetitive Unstructured Data
  • Parsing Repetitive Unstructured Data
  • Recasting the Output Data
  • Chapter 4.6: Textual Disambiguation
  • From Narrative Into an Analytical Data Base
  • Input Into Textual Disambiguation
  • Mapping
  • Input/Output
  • Document Fracturing/Named Value Processing
  • Preprocessing a Document
  • E-mails - A Special Case
  • Spreadsheets
  • Report Decompilation
  • Chapter 4.7: Taxonomies
  • Data Models/Taxonomies
  • Applicability of Taxonomies
  • What Is a Taxonomy?
  • Taxonomies in Multiple Languages
  • Commercial or Private Taxonomies?
  • Dynamics of Taxonomies and Textual Disambiguation
  • Taxonomies and Textual Disambiguation - Separate Technologies
  • Different Types of Taxonomies
  • Taxonomies - Maintenance Over Time
  • Chapter 5.1: The Siloed Application Environment
  • The Challenge of Siloed Applications
  • Building Siloed Applications
  • What Does a Siloed Application Look Like?
  • Current Valued Data
  • Minimal Historical Data
  • High Availability
  • Overlap Between Siloed Applications
  • Frozen Business Requirements
  • Dismantling Siloed Applications
  • Chapter 6.1: Introduction to Data Vault 2.0
  • Data Vault Origins and Background
  • The "Old" Data Vault 1.0
  • The New and Updated Data Vault 2.0
  • What Is Data Vault 2.0 Modeling?
  • A Business View
  • A Technical View
  • How Is Data Vault 2.0 Methodology Defined?
  • A Business View
  • A Technical View
  • Why Do We Need a Data Vault 2.0 Architecture?
  • Where Does Data Vault 2.0 Implementation Fit?
  • What Are the Business Benefits of Data Vault 2.0?
  • What Is Data Vault 1.0?
  • Chapter 6.2: Introduction to Data Vault Modeling
  • What Is a Data Vault Model Concept?
  • Data Vault Model Defined
  • Components of a Data Vault Model
  • What Makes Business Keys So Interesting?
  • What Does This Have to Do With Data Vault and Data Warehousing?
  • How Does This Translate to Data Vault Modeling?
  • Why Restructure the Data From the Staging Area?
  • What Are the Basic Rules of the Data Vault Model?
  • Why Do We Need Many to Many Link Structures?
  • Primary Key Options for Data Vault 2.0
  • Sequence Numbers
  • Hash Keys
  • Business Keys
  • Source System Sequence Business Keys
  • Multipart Source Business Keys
  • Chapter 6.3: Introduction to Data Vault Architecture
  • What Is a Data Vault 2.0 Architecture?
  • How Does NoSQL Fit in to the Architecture?
  • What Are the Objectives of the Data Vault 2.0 Architecture?
  • What Is the Objective of the Data Vault 2.0 Model?
  • What Are Hard and Soft Business Rules?
  • How Does Managed Self Service BI Fit in the Architecture?
  • Chapter 6.4: Introduction to Data Vault Methodology
  • Data Vault 2.0 Methodology Overview
  • How Does CMMI Contribute to the Methodology?
  • If CMMI Is So Great, Why Should We Care About Agility Then?
  • Why Include PMP, SDLC If CMMI and Agile Should Be All That's Needed?
  • So Then, What Does Six Sigma Contribute to the Data Vault 2 Methodology?
  • Where Does TQM (Total Quality Management) Fit in to All of This?
  • Chapter 6.5: Introduction to Data Vault Implementation
  • Implementation Overview
  • What's So Important About Patterns?
  • Why Does Reengineering Happen Because of Big Data?
  • Why Do We Need to Virtualize Our Data Marts?
  • What Is Managed Self-Service BI?
  • Chapter 7.1: The Operational Environment: A Short History
  • Commercial Uses of the Computer
  • The First Applications
  • Ed Yourdon and the Structured Revolution
  • The SDLC
  • Disk Technology
  • Enter the DBMS
  • Response Time and Availability
  • Corporate Computing Today
  • Chapter 7.2: The Standard Work Unit
  • Elements of Response Time
  • An Hourglass Analogy
  • The Racetrack Analogy
  • Your Vehicle Runs as Fast as the Vehicle in Front of It
  • The Standard Work Unit
  • The SLA
  • Chapter 7.3: Data Modeling for the Structured Environment
  • The Purpose of the Roadmap
  • Granular Data Only
  • The ERD
  • The DIS
  • Physical Data Base Design
  • Relating the Different Levels of the Data Model
  • An Example of the Linkage
  • Generic Data Models
  • Operational Data Models/Data Warehouse Data Models
  • Chapter 8.1: A Brief History of Data Architecture
  • Chapter 8.2: Big Data/Existing System Interface
  • The Big Data/Existing Systems Interface
  • The Repetitive Raw Big Data/Existing Systems Interface
  • Exception Based Data
  • The Nonrepetitive Raw Big Data/Existing Systems Interface
  • Into the Existing Systems Environment
  • The "Context Enriched" Big Data Environment
  • Analyzing Structured Data/Unstructured Data Together
  • Chapter 8.3: The Data Warehouse/Operational Environment Interface
  • The Operational/Data Warehouse Interface
  • The Classical ETL Interface
  • The ODS and the ETL Interface
  • The Staging Area
  • Changed Data Capture
  • Inline Transformation
  • ELT Processing
  • Chapter 8.4: Data Architecture: A High-Level Perspective
  • A High Level Perspective
  • Redundancy
  • The System of Record
  • Different Types of Questions
  • Different Communities
  • Chapter 9.1: Repetitive Analytics: Some Basics
  • Different Kinds of Analysis
  • Looking for Patterns
  • Heuristic Processing
  • Freezing Data
  • The Sandbox
  • The "Normal" Profile
  • Distillation, Filtering
  • Subsetting Data
  • Bias of the Sample
  • Filtering Data
  • Repetitive Data and Context
  • Linking Repetitive Records
  • Log Tape Records
  • Analyzing Points of Data
  • Outliers
  • Data Over Time
  • Chapter 9.2: Analyzing Repetitive Data
  • Log Data
  • Active/Passive Indexing of Data
  • Summary/Detailed Data
  • Metadata in Big Data
  • Linking Data
  • Chapter 9.3: Repetitive Analysis
  • Internal, External Data
  • Universal Identifiers
  • Security
  • Filtering, Distillation
  • Archiving Results
  • Metrics
  • Chapter 10.1: Nonrepetitive Data
  • Inline Contextualization
  • Taxonomy/Ontology Processing
  • Custom Variables
  • Homographic Resolution
  • Acronym Resolution
  • Negation Analysis
  • Numeric Tagging
  • Date Tagging
  • Date Standardization
  • List Processing
  • Associative Word Processing
  • Stop Word Processing
  • Word Stemming
  • Document Metadata
  • Document Classification
  • Proximity Analysis
  • Functional Sequencing Within Textual ETL
  • Internal Referential Integrity
  • Preprocessing, Postprocessing
  • Chapter 10.2: Mapping
  • Chapter 10.3: Analytics From Nonrepetitive Data
  • Call Center Information
  • Medical Records
  • Chapter 11.1: Operational Analytics: Response Time
  • Transaction Response Time
  • Chapter 12.1: Operational Analytics
  • Different Perspectives of Data