Talend Database Joins

If you're reading data from a database, then it's fairly likely that you'll want to Join data from two or more of your Tables.

How you go about this in both your Database Design and your Talend Job design will have a major impact on the overall performance of your Job. Understanding how Talend interacts with your database is a key requirement to getting the maximum throughput in your Talend Jobs.

This tutorial is based on MySQL, but refers to other database vendors where useful. You may find it helpful to download the latest version of MySQL so that you can try some of the techniques that are described here.

tMysqlInput Joins

The tMysqlInput component allows you to read data from your MySQL database. Using this component, there are 3 basic ways that you can join data: -

  • Use a tMap component
  • SQL Join
  • Use a tJoin component (I do not use the tJoin component)

There are other techniques that may be used, for example, using Database Views; however, these are outside the scope of this tutorial.

A Simple Join in Talend

In our Talend Database Tutorial, we created two tables, Address and Person, read the entire content of each table, using a Full Table Scan, and then joined the data in Talend using a tMap component.

This approach works well: we want to process every row in our tables, our database is on a fast network connection, and we've plenty of memory and CPU cycles available on our Talend Server. The Job gets through the data very quickly and has the minimum of impact on our source database (which is often a key goal in our overall ETL strategy).

Query Performance

From a performance point of view, one of the first decisions to make, when reading your database table, is whether you're going to perform a Full Table Scan, or use an Index to select a sub-set of your data. It may be that you can constrain data in all of the tables that you are joining; however, often this is not the case.

Test Case #1

In this test case, we have run the example Job from our Talend Database Tutorial. By default, this example creates just a few hundred rows in our database. Let's increase this to 500,000 Address rows and 2,000,000 Person rows. Each Person will be linked to an Address. This provides a sufficient quantity of data to perform some analysis.

If we now want to extract every Person row together with its associated Address, then we will be reading every row from the Person table and (almost certainly) every row from the Address table. This means that it is sensible to access our data using a Full Table Scan, and we can either join our data in the database using an SQL Join (Test Case #2) or use a tMap component to join the data on the Talend Server (as we have done in Talend Database Tutorial).

This Job will process a large amount of data and both of our data sources will need to be held in memory. If we were to simply increase the rows generated by the example from Talend Database Tutorial, then it is almost certain that we would see the error java.lang.OutOfMemoryError: Java heap space. There are two simple solutions to this: increase the memory available to Java, or stream our data.

Increase Available Memory

We can increase the amount of memory available to our Java process from the Advanced settings view of the Run tab. Increase the value -Xmx1024M to -Xmx3072M. This should provide ample memory to run this Job, with the quantity of data that we currently have.
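As a rough sketch, and not a definitive recommendation (the default values and layout vary between Talend versions, and -Xms256M here is just an illustrative starting heap size), the JVM arguments in the Advanced settings view might then read: -

  -Xms256M
  -Xmx3072M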

For further information on these parameters, read Java -X Command-line Options.

Image 1

Stream Data

Note: this option is specific to MySQL; other options may be available for other database vendors.

The component tMysqlInput allows us to stream data. This means that we can start processing our data as we read it, rather than loading it all into memory first. This can be a much more efficient way of processing our data and can use much less memory.

In our example, we can stream Person, but we cannot stream Address. The reason that we are not able to stream Address is that it is the look-up table in our Join, and all of its data needs to be available before we start processing Person. Streaming can be a much more efficient way of processing our data, where it's an appropriate option. There are some limitations; for example, we would not be able to start writing data through the same shared connection while streaming, so a second shared connection would need to be established.

To stream data, select the tMysqlInput component's Advanced settings tab and check Enable stream, as shown below.

Image 2

Test Result #1

The following screenshot shows the execution of our Job, with data being streamed from the Person table, as opposed to increasing the available memory. In my local testing, this proved to be the optimum way of processing this data. There are many factors that must be taken into account when tuning your Jobs, so it should not be assumed that this is always the best approach.

Image 3

Test Case #2

For this test case, we'll modify our example so that we push our Join to the database, using an SQL Join rather than joining using the tMap component. To do this, I've deleted the tMap component and one of the tMysqlInput components from the Job. I've modified both the Query and the Schema of the remaining tMysqlInput component, as follows: -

"SELECT   `Person`.`Id`,
  `Person`.`Gender`,
  `Person`.`Title`,
  `Person`.`Salutation`,
  `Person`.`FirstName`,
  `Person`.`LastName`,
  `Person`.`DateOfBirth`,
  `Person`.`UUID`,
  `Person`.`Password`,
  `Person`.`PasswordSalt`,
  `Person`.`HashedPassword`,
  `Person`.`EmailAddress`,
  `Address`.`Street`,
  `Address`.`Town`,
  `Address`.`County`,
  `Address`.`Postcode`
FROM `Person`, `Address`
where `Address`.`Id` = `Person`.`AddressId`"

The Query is simply an amalgamation of the two queries from the previous test case, as is the Schema. The two Address keys Person.AddressId and Address.Id have been dropped from both the Query and the Schema, whereas previously they were dropped from the output of the tMap component. Although we could leave these columns in our output, for these examples we're flattening the data into a single de-normalised record, so they are not required.
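As an aside, the same Query could equally be written using explicit ANSI join syntax; the result set is identical, and some readers find the join condition easier to spot. A sketch, with the column list abbreviated by a comment: -

"SELECT   `Person`.`Id`,
  -- ... the same column list as above ...
  `Address`.`Postcode`
FROM `Person`
INNER JOIN `Address`
        ON `Address`.`Id` = `Person`.`AddressId`"

Whether you prefer this form or the implicit join is largely a matter of style; MySQL treats the two the same way for an inner join.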

Test Result #2

The following screenshot shows the execution of our Job, for our second test case. Data is streamed from the Person table, as it was for the first test case. In this instance, throughput can be seen to be marginally reduced; however, performance can be affected by many factors and this should not be taken as evidence that one of these approaches is always superior to the other.

Image 4

Test Cases #1 and #2 Review

We've now completed two test cases, with each Job ultimately returning the same result. In our local test, a tMap Join appears to have performed better than an SQL Join; however, we need to understand what other factors may affect the performance in differing situations. What else should we consider?

Metadata and Hidden Logic

We'll discuss this subject further in the subsequent test cases; however, we do need to consider Metadata and hidden logic. Arguably, one of the key reasons that ETL Tools are chosen over lower-level languages such as a 3GL is the desire for Metadata that is not buried in complex code. As well as any consideration for Metadata itself, you will also want to consider the downside of burying any Business Logic in your SQL, as this makes Debugging, Maintenance, and Support more complex.

Test Case #3

Now let's consider that we do not want to process every row in our database tables. In this test case, we only want to read rows from the Person table for people with a Title of "Dr". Data will be restricted using a tFilter component, as shown below. This test is based on test case #1; that is, we perform our join in Talend, rather than using an SQL Join.

Image 5


Filtering data using a tFilter component means that we will still read all of the data from the Person table but then discard any rows that do not match our constraint. As can be seen from the screenshot below, we are only interested in approximately 6% of the Person rows. You'll also see that we still need to load all of the Address rows, to perform our look-up.

Image 6

Test Case #4

As we're only interested in a small sub-set of our Person data, it seems sensible that we do not pass the entire content of the table to Talend. We'll now remove the tFilter component and push our constraint down to the database.

The constraint WHERE `Person`.`Title` = 'Dr' has now been added to the Query statement of the Person tMysqlInput component, as sketched below. As can be seen from the screenshot that follows, there is an immediate performance improvement over test case #3.
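This is only a sketch; it assumes that the Person query selects the same Person columns as the earlier examples, together with the AddressId key that the tMap look-up needs (the column list is abbreviated with a comment): -

"SELECT   `Person`.`Id`,
  `Person`.`Title`,
  `Person`.`FirstName`,
  `Person`.`LastName`,
  -- ... the remaining Person columns ...
  `Person`.`AddressId`
FROM `Person`
where `Person`.`Title` = 'Dr'"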

Image 7

Test Case #5

Let's see if we can improve things further, by adding an Index to the database column Person.Title. Even though we improved performance significantly in the previous test, we are still performing a Full Table Scan of Person; the unwanted rows are now discarded at the database level, rather than within Talend as they were in test case #3. In the case of MySQL, MySQL Workbench (or a similar product) can be used to add our new Index using the following statement: create index person_idx01 on Person (Title);.
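If you want to check that the optimiser is actually using the new Index, MySQL's EXPLAIN can be run against an equivalent query in MySQL Workbench. This is just a sketch, and the exact output columns vary between MySQL versions: -

explain
select `Person`.`Id`, `Person`.`AddressId`
from   `Person`
where  `Person`.`Title` = 'Dr';

If the Index is being used, the key column of the EXPLAIN output should show person_idx01.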

Again, as can be seen from the following screenshot, we have improved the throughput of our Job. Generally speaking, the lower the percentage of rows in the table that are of interest, the more useful an Index will be. This benefit should be weighed against the additional cost of writing data to your database table, both in terms of the speed of writes and the extra storage required by the Index. You may also want to consider how often data is constrained against the column in question, to better understand whether the extra throughput justifies the overhead.

Image 8

Test Case #6

The final test expands on the previous test, but now pushes the join down to the database, as we did in test case #2. The single change that we need to make to the test case #2 Job is to add an additional predicate to the SQL Where Clause: add and `Person`.`Title` = 'Dr' to the Where Clause of the tMysqlInput component's Query statement.
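For clarity, this is a sketch of the amended Query; it is simply the Query from test case #2 with the extra predicate appended (the column list is abbreviated with a comment): -

"SELECT   `Person`.`Id`,
  -- ... the same column list as test case #2 ...
  `Address`.`Postcode`
FROM `Person`, `Address`
where `Address`.`Id` = `Person`.`AddressId`
  and `Person`.`Title` = 'Dr'"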

Again, we can see an improvement in throughput as we have not had to transfer the entire content of the Address table to Talend.

Image 9

Review

We've completed six test cases. The first two tests showed that, for joining two reasonably large datasets, joining within Talend gave an improvement in performance. For the remaining four tests, where we were interested in a relatively small percentage of rows, we saw incremental improvements as we pushed the work down to the database.

These tests should not be taken as a firm rule. The intention is to show some of the different techniques that may be used. In a real-world application, there are many factors that need to be taken into account and there may be other performance goals, not just the throughput of your Talend Job. Often, you will find yourself extracting data from systems that have their own performance constraints, so you also need to consider the social aspect of your Job and that it may need to have a light footprint on these systems.

As previously discussed in this article, you also need to consider Metadata and whether or not you are hiding logic in your SQL. As a general rule, I never perform any data mapping within SQL statements and limit their customisation to SQL Joins and SQL Where Clauses.

Comparison to other ETL Tools

If you've previously used other ETL Tools, then this may be a key area where you will want to consider any functional differences. Business Objects Data Services (BODS), for example, not only supports an SQL component, but also a Table component. BODS allows you to use a Join component to join the data from two Tables (as well as constraining your data) and will then push the entire statement down to the database, preserving readability and Metadata. Talend does not re-write multiple SQL statements so that they can be pushed down to the database in this way.





© www.TalendByExample.com