Pentaho data integration beginner's guide
Extract, Transform, and Load (ETL) is the essence of data integration and this book shows you how to achieve it quickly and efficiently using Pentaho Data. A hands-on guide that you’ll find an indispensable time-saver. Manipulate your data by exploring, transforming, validating, and integrating it L...
Main Author: | |
---|---|
Format: | eBook |
Language: | Inglés |
Published: |
Birmingham :
Packt Publishing
2013.
|
Edition: | Second edition |
Subjects: | |
See on Biblioteca Universitat Ramon Llull: | https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009628113306719 |
Table of Contents:
- Cover
- Copyright
- Credits
- About the Author
- About the Reviewers
- www.PacktPub.com
- Table of Contents
- Preface
- Chapter 1: Getting Started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Exploring the Pentaho Demo
- Pentaho Data Integration
- Using PDI in real world scenarios
- Loading data warehouses or datamarts
- Integrating data
- Data cleansing
- Migrating information
- Exporting data
- Integrating PDI along with other Pentaho tools
- Installing PDI
- Time for action - installing PDI
- Launching the PDI graphical designer - Spoon
- Time for action - starting and customizing Spoon
- Spoon
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
- Creating your first transformation
- Time for action - creating a hello world transformation
- Directing Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
- Installing MySQL
- Time for action - installing MySQL on Windows
- Time for action - installing MySQL on Ubuntu
- Summary
- Chapter 2: Getting Started with Transformations
- Designing and previewing transformations
- Time for action - creating a simple transformation and getting familiar with the design process
- Getting familiar with editing features
- Using the mouse-over assistance toolbar
- Working with grids
- Understanding the Kettle rowset
- Looking at the results in the Execution Results pane
- The Logging tab
- The Step Metrics tab
- Running transformations in an interactive fashion
- Time for action - generating a range of dates and inspecting the data as it is being created
- Adding or modifying fields by using different PDI steps
- The Select values step
- Getting fields
- Date fields
- Handling errors.
- Time for action - avoiding errors while converting the estimated time from string to integer
- The error handling functionality
- Time for action - configuring the error handling to see the description of the errors
- Personalizing the error handling
- Summary
- Chapter 3: Manipulating Real-world Data
- Reading data from files
- Time for action - reading results of football matches from files
- Input files
- Input steps
- Reading several files at once
- Time for action - reading all your files at a time using a single text file input step
- Time for action - reading all your files at a time using a single text file input step and regular expressions
- Regular expressions
- Troubleshooting reading files
- Sending data to files
- Time for action - sending the results of matches to a plain file
- Output files
- Output steps
- Getting system information
- Time for action - reading and writing matches files with flexibility
- The Get System Info step
- Running transformations from a terminal window
- Time for action - running the matches transformation from a terminal window
- XML files
- Time for action - getting data from an XML file with information about countries
- What is XML?
- PDI transformation files
- Getting data from XML files
- XPath
- Configuring the Get data from XML step
- Kettle variables
- How and when you can use variables
- Summary
- Chapter 4: Filtering, Searching, and Performing Other Useful Operations with Data
- Sorting data
- Time for action - sorting information about matches with the Sort rows step
- Calculations on groups of rows
- Time for action - calculating football match statistics by grouping data
- Group by Step
- Numeric fields
- Filtering
- Time for action - counting frequent words by filtering
- Time for action - refining the counting task by filtering even more.
- Filtering rows using the Filter rows step
- Looking up data
- Time for action - finding out which language people speak
- The Stream lookup step
- Data cleaning
- Time for action - fixing words before counting them
- Cleansing data with PDI
- Summary
- Chapter 5: Controlling the Flow of Data
- Splitting streams
- Time for action - Browsing new features of PDI by copying a dataset
- Copying rows
- Distributing rows
- Time for action - Assigning tasks by distributing
- Splitting the stream based on conditions
- Time for action - Assigning tasks by filtering priorities with the Filter rows step
- PDI steps for splitting the stream based on conditions
- Time for action - Assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
- Time for action - Gathering progress and merging it all together
- PDI options for merging streams
- Time for action - Giving priority to Bouchard by using the Append Stream
- Treating invalid data by splitting and merging streams
- Time for action - Treating errors in the estimated time to avoid discarding rows
- Treating rows with invalid data
- Summary
- Chapter 6: Transforming Your Data by Coding
- Doing simple tasks with the JavaScript step
- Time for action - counting frequent words by coding in JavaScript
- Using the JavaScript language in PDI
- Inserting JavaScript code using the Modified JavaScript Value Step
- Adding fields
- Modifying fields
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
- Time for action - changing a list of house descriptions with JavaScript
- Looping over the dataset rows
- Doing simple tasks with the Java Class step
- Time for action - counting frequent words by coding in Java
- Using the Java language in PDI.
- Inserting Java code using the User Defined Java Class step
- Adding fields
- Modifying fields
- Sending rows to the next step
- Data types equivalence
- Testing the Java Class using the Test class button
- Transforming the dataset with Java
- Time for action - splitting the field to rows using Java
- Avoiding coding by using purpose built steps
- Summary
- Chapter 7: Transforming the Rowset
- Converting rows to columns
- Time for action - enhancing the films file by converting rows to columns
- Converting row data to column data by using the Row Denormaliser step
- Aggregating data with a Row Denormaliser step
- Time for action - aggregating football matches data with the Row Denormaliser step
- Using Row Denormaliser for aggregating data
- Normalizing data
- Time for action - enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normaliser step
- Summarizing the PDI steps that operate on sets of rows
- Generating a custom time dimension dataset by using Kettle variables
- Time for action - creating the time dimension dataset
- Getting variables
- Time for action - parameterizing the start and end date of the time dimension dataset
- Using the Get Variables step
- Summary
- Chapter 8: Working with Databases
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
- Time for action - creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Exploring the Steel Wheels database
- Time for action - exploring the sample database
- A brief word about SQL
- Exploring any configured database with the database explorer
- Querying a database
- Time for action - getting data about shipped orders
- Getting data from the database with the Table input step.
- Using the SELECT statement for generating a new dataset
- Making flexible queries using parameters
- Time for action - getting orders in a range of dates using parameters
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
- Time for action - getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries
- Sending data to a database
- Time for action - loading a table with a list of manufacturers
- Inserting new data into a database table with the Table output step
- Inserting or updating data by using other PDI steps
- Time for action - inserting new products or updating existent ones
- Time for action - testing the update of existent products
- Inserting or updating with the Insert/Update Step
- Eliminating data from a database
- Time for action - deleting data about discontinued items
- Deleting records of a database table with the Delete step
- Summary
- Chapter 9: Performing Advanced Operations with Databases
- Preparing the environment
- Time for action - populating the Jigsaw database
- Exploring the Jigsaw database model
- Looking up data in a database
- Doing simple lookups
- Time for action - using a Database lookup step to create a list of products to buy
- Looking up values in a database with the Database lookup step
- Performing complex lookups
- Time for action - using a Database join step to create a list of suggested products to buy
- Joining data from the database to the stream data by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
- Time for action - loading a region dimension with a Combination lookup/update step
- Time for action - testing the transformation that loads the region dimension
- Describing data with dimensions.
- Loading Type I SCD with a Combination lookup/update step.