Expert Consultancy from Yellow Pelican

Talend Database Tutorial

A site about Talend

Talend Database Tutorial

This tutorial provides comprehensive information on the functionality that Talend offers, when working with databases. You may find it helpful to read our article Working with Databases before taking this tutorial; however, it is not a prerequisite.

This tutorial is based on MySQL; but referring to other database vendors if useful. You may find it helpful to download the latest version of MySQL so that you can try some of the techniques that are described here.

Download Example Job

You may Download Example Job and import it in to a Talend Project of your choice. This tutorial supports Talend 5.2.2 and above.

Download

Covered in this Tutorial

The example Job in this tutorial will perform the following tasks: -

  • Establish a connection to a MySQL database
  • Generate pseudo-random Address data and write this to the database table Address
  • Generate pseudo-random Person data, link this to a random address, and then write this to the database table Person
  • Read both the Person and Address tables and Join these on Address.Id
  • Write the combined output to a file

This tutorial uses several reusable Jobs that are described on TalendByExample. These Jobs are design to simplify development, allowing you to quickly build robust and maintainable Jobs.

LibContextReader

LibContextReader reads Context from a file, simplifying the parameterisation of your Jobs.

To run the example Job TBEMySQLTutorial, you will need to create the following Context files. These files should be placed in a directory located from your Home directory, i.e. $HOME/talend/context.

Default.General.cfg

This file holds general Context variables. Change the value of outputDir so that it identifies a directory where you would like the example Job to write any files that it creates.

# www.TalendByExample.com
#
# This is a sample Talend Context File, for reading by LibContextReader
outputDir=/tmp

Default.MySQL.cfg

This file holds MySQL Context variables that will allow the example Job to connect, create tables and then read/write data. Amend these parameters to suit your environment. I would recommend that you test the MySQL connection using the tools provided by MySQL prior to running the Job. This is the easiest way to eliminate any connectivity issues.

# www.TalendByExample.com
#
# This is a sample Talend Context File, for connecting to a MySQL Database
mYSQLHost=localhost
mYSQLPort=3306
mYSQLDatabase=TalendByExample
mYSQLAdditionalJDBCParameters="noDatetimeStringSync=true"
mYSQLUsername=root
mYSQLPassword=root

LibMySQLSharedConnection

LibMySQLSharedConnection establishes a shared connection with a MySQL database. Connection parameters are read from a Context file, simplifying the parameterisation of your Jobs.

LibTBEAddressGenerator, LibTBEPersonGenerator

These two reusable Jobs generate their respective pseudo-random data. These will provide input for the example Job. For more information on these reusable Jobs, read Talend Data Generation.

Example Job Overview

If you've already created the Context files Default.General.cfg and Default.MySQL.cfg, you should be able to run the example Job TBEMySQLTutorial. The following screenshot shows the general design of the Job.

Image 1

Context

This example Job supports the Context of Default, Test and Production. It is a personal choice as to whether or not you only use the Default Context, or have additional ones for your different environments. You may still find it helpful to add the additional Context to your Jobs, even if you do not think you'll use them, as this gives flexibility later on.

You'll see from the screenshot below that a Context Group named General has been added. This, currently, contains just a single variable outputDir. I have added this to a group as it seems likely that we will want to use this in many of our task-specific Jobs. The variables numberOfAddress and numberOfPerson are very local to this Job, so I've added these as built-in.

Image 4

Loading Context

The first task, is to load some general context. This tells our Job something about the environment in which it is executing. Context is loaded using the reusable Job LibContextReader. This Job will load Context from the default file $HOME/talend/context/Default.General.cfg.

Image 2

Establish Shared Database Connection

The reusable Job LibMySQLSharedConnection is used to establish a connection with our Database. Although this Job is database vendor-specific, it can be easily adapted to work with the database vendor of your choice. Context for this connection is provided from the default Context file, $HOME/talend/context/Default.MySQL.cfg. This connection is shared, meaning that you can refer to it from any tMysqlConnection components located in the parent Job, its children and siblings.

Image 3


The following screenshot shows that we've configured the component tMysqlConnection_1 so that it uses the shared connection "MySQLSharedConnection", that has been established by the reusable Job LibMySQLSharedConnection. "MySQLSharedConnection" is the default Shared DB Connection Name that is established by tLibMySQLSharedConnection, and may me overridden by passing the parameter MySQLSharedDBConnectionName. To use the shared connection, check the option Use or register a shared DB Connection and then enter the connection name in the field Shared DB Connection Name.

Image 5

Transaction Control

In a later article, we'll look at the different Transaction Control strategies that we can employ in our Job. For now, we'll set this connection to Auto Commit. This option can be found on the Advanced settings tab of the tMysqlConnection component, as shown in the following screenshot. Transactions will be committed every Number of rows per insert, as specified in the component tMysqlOutput.

Image 8

Creating Our Data

Now that we've established a shared connection to our database, we can create some data. To do this, two pseudo-random data generators will be used. In our example, these have been configured so that they will create 500 Address rows and 2000 Person rows. Each Person row will be linked to a random Address. Read Talend Data Generation, for more information on these data generators.

When adding child Jobs that are capable of returning data to their parent (using the tRunJob component), it is important to remember to hit Copy Child Job Schema. This gives the tRunJob component the same Schema as the child Job. This button can be found on the Basic settings tab.

Image 6

Configuring tMysqlOutput Components

The following screenshot shows the configuration of one of the two tMysqlOutput components. Both components have been configured in a similar manner.

Image 7

Use an existing connection

Checking this option makes the Component List drop-down available. Here, you can select the tMysqlConnection component that is to provide the database connection.

Table

Here, you can specify the name of the table name that both Data Definition Language (DDL) and Data Manipulation Language (DML) operations will be performed against.

Action on table

This option allows you to specify any Data Definition Language (DDL) operation that should be performed against Table. This example will create a new table each time the Job runs (dropping it first if it already exists) and, therefore, this option has been set to Drop table if exists and create.

Action on data

This option allows you to specify the Data Manipulation Language (DML) operation that should be performed. As this Job creates a new table each time it runs, this option is set to Insert as we want to only create new rows.

Schema

We would usually expect the Schema to have been propagated to this component, when it was connected to the preceding component. If this has not happened, hit the Sync columns button. This will always copy the Schema from the preceding component.

Number of rows per insert

Note. This option is dependent on both database vendor and Action on data. Please check the options available for your database vendor and the action that you are performing. For more information, read t[DB]Output Reference.

As can be seen from the screenshot below, Number of rows per insert can be found on the Advanced settings tab (when you check the option Extend Insert) and has a default value of 100. This value specifies the number of rows that will be inserted before a transaction is Committed. This value works in conjunction with Auto Commit, that we specified for our tMysqlConnection component.

Image 10

Reading Our Data

Now that we've created some data in our MySQL database, we can now read it back. As can be seen from the screenshot below, we'll read both our Person and Address tables, join them using a tMap component, and then write the combined output to a file using a tFileOutputDelimited component. I have chosen to use a tMap component over a tJoin (Comparison to tMap), as I find no use case for tJoin.

Image 11

Configuring tMysqlInput Components

The following screenshot shows the configuration of one of the two tMysqlInput components. Both components have been configured in a similar manner.

Image 12

Use an existing connection

As with our tMysqlOutput components, checking this option makes the Component List drop-down available. Here, you can select the tMysqlConnection component that is to provide the database connection.

Pre-configured Options

In this example, Schema, Table Name, Query Type and Query were pre-configured. This is because this component was added from the Talend DB Connections Repository. This is the best way of adding tMysqlInput components to your Job. You can modify these once they have been added (including setting Use an existing connection). For more information on the Talend DB Connections Repository, read our article Working with Databases.

Mapping our Output Data

The following screenshot shows the Mapping Editor of the tMap component that joins our two MySQL data sources. The mapping of the data is fairly basic. The entire Person and Address rows are written, less the two Address keys Address.Id and Person.AddressId. We are dropping these keys as we have now Normalized our data.

You will see that Address is the Lookup of our two inputs and that we've instructed our tMap component to perform a look-up based on the relationship between the two tables, based on Address.Id.

Image 13

Creating our Output

Finally, we'll write our output to a delimited file using a tFileOutputDelimited component. For this example, we've left all of the default values except for File Name. Here, we've set the value to context.outputDir + "/PersonAddress.txt". This allows us to locate the file in a directory of our choice, as specified in the Context file $HOME/talend/context/Default.General.cfg.

Image 14

Conclusion

In this tutorial, we have used some reusable Jobs that will help to make our Talend development easier. Use these in your own Jobs, or use the techniques employed, to help your understanding of Talend. Do avoid embedding database connection details inside individual components.

We've used some fairly basic functionality to read and write to a MySQL database. In later tutorials, we'll build on these techniques.




Expert Consultancy from Yellow Pelican
comments powered by Disqus

© www.TalendByExample.com