ETL Validations in Pentaho
We present a framework for validation. In our case very focused to validate that ETL load has worked well.
The purpose and use (desired) for this STValidations (free download for previous link), is to automate the validation routine and sometimes we fail to do just that. For routine.
Its use is quite simple and are in a version 0.1, so it can evolve much and that we have few use cases how to make sure their robustness and to cover all needs. Hope this code from Stratebi helps you.
ETL is a process that reads from a table a list of queries to be executed and the expected result. After that, run the query and compares it with the expected result if the result is correct is a log of executions correct and if the result is not expected is a log of botched executions. At the end sends an email with the two records to the user to decide so you can review the implementation.
Here you can see the aspect of work in general:
The purpose and use (desired) for this STValidations (free download for previous link), is to automate the validation routine and sometimes we fail to do just that. For routine.
Its use is quite simple and are in a version 0.1, so it can evolve much and that we have few use cases how to make sure their robustness and to cover all needs. Hope this code from Stratebi helps you.
ETL is a process that reads from a table a list of queries to be executed and the expected result. After that, run the query and compares it with the expected result if the result is correct is a log of executions correct and if the result is not expected is a log of botched executions. At the end sends an email with the two records to the user to decide so you can review the implementation.
Here you can see the aspect of work in general:
And the detail of the validation:
Simple ... powerful and versatile. From simple queries like "select count (*) from table" to validate that you have records to complex queries for comparison of values in different tables. In fact, if what I see is data in a table so I think that most of the validations can be performed using this method.
You can follow instructions in this video tutorial:
You can check more videos in our Youtube Channel
The process, in step settings, read the configuration data from configuration.properties, check that there is a validation table and if not creates and inserts a validation sample. Then, the process reads all existing queries in the table, runs and compare the result with the expected result.
The lookup table looks like and filled manually (we are at version 0.1, remember .....):
You can follow instructions in this video tutorial:
You can check more videos in our Youtube Channel
The process, in step settings, read the configuration data from configuration.properties, check that there is a validation table and if not creates and inserts a validation sample. Then, the process reads all existing queries in the table, runs and compare the result with the expected result.
The lookup table looks like and filled manually (we are at version 0.1, remember .....):
We did that because we had some problems of inconsistency in some projects that made clear that the loads must be validated and is a tedious and repetitive work.
- We must always validate charges. We have detected errors due to inconsistent or unexpected data input formats. And that can not be controlled unless the charges are validated.
- Loads must be validated especially after making a change in the ETL and it is pretty boring, perform the same query validation.
- We must make a series of routine consultations to ensure that the data is equal to the source.
- If you can automate ... I'd rather be doing other things.
Strengths and weaknesses:
- It is a version 0.1 we have tested in a couple of clients and so far so good. But we know that, for example, still can not compare data from different sources faith.
- Queries can become heavy and this is a potentially VERY expensive in terms of resource consumption. That's why ETL is a process that can be run independently attached to the ETL process or in another moment of time with little load. We do it well. The process does not run right after the ETL but at 6:00 a.m. how a separate process.
- Do not rely on the BI server. Simple and useful.
Well, I think it can be very useful in projects and we hope you find it useful too!!
By the way ... this is the error log to send:













