Expert Consultancy from Yellow Pelican

Job Parallelization

A site about Talend

If you want to improve the throughput of your Jobs, you may want to consider Parallelization (Parallelisation). Talend allows you to run SubJobs in parallel, also known as Multi-threading.

There are two basic methods for running SubJobs in parallel:

  • Set the Job's Multi thread execution property to TRUE.
  • Use a tParallelize component.

The tParallelize component is only available in the Enterprise Edition of Talend and is outside the scope of this documentation. You can achieve all of the functionality of tParallelize by Multi-threading your Jobs.

Considerations

Before rushing in and setting up all of your Jobs to run in parallel, there are several things that you should consider. It is better to have a well-architected Job than to simply run everything in parallel.

  • What is the throughput requirement?
  • What are the dependencies within your Job?
  • What will be the effect on other resources?

Throughput Requirement

Of course, you should always design your Job to be as efficient as practicable; however, you should set a performance goal before looking at your tuning opportunities. You should also weigh the cost of development against the gain you expect.

Dependencies

If you're going to run two SubJobs in parallel, then you need to consider the dependencies between these two SubJobs. There may also be subsequent SubJobs that are dependent on the completion of both of these two SubJobs.

Effect on Other Resources

While ensuring that you get the maximum throughput possible for your Job, you also need to consider the effect that this may have on other resources. This includes, but is not limited to:

  • Impact on source database servers
  • Impact on target database servers
  • Impact on Talend Job Server - Memory, CPU
  • Impact on Network Infrastructure

If, for example, you're reading from an OLTP database to populate your Data Warehouse, you won't be very popular if you fire off 20 bulk-extracts in parallel. It is also true that over-parallelization will have a negative impact on your overall throughput, as resources hit their limits.
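To make the idea of limiting parallelism concrete, here is a minimal plain-Java sketch (not code generated by Talend). It pushes 20 simulated extracts through a fixed pool of 4 threads, so no more than 4 run at any one time. The class name BoundedExtracts, the task count, the pool size and the 50 ms sleep are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: this is not Talend-generated code.
class BoundedExtracts {

    // Runs taskCount simulated extracts on at most maxParallel threads and
    // returns the highest number of extracts observed running at once.
    static int runExtracts(int taskCount, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(50); // stand-in for a bulk extract
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every extract to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("Peak concurrent extracts: " + runExtracts(20, 4));
    }
}
```

The pool size is the knob to tune: raise it only while the source, target, Job Server and network still have headroom.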

A Simple Parallel Job

The following screenshots show a simple Job, with two SubJobs executing in parallel. Each of these SubJobs is a tRunJob component, whose child Job announces its commencement, sleeps for a period of time and then announces its ending. You'll see from the console output that these two SubJobs executed in parallel, with ParallelExampleSJ2 completing first.
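If you'd like to see the same behaviour outside of Talend Studio, the plain-Java sketch below mimics it under some assumptions: two threads stand in for the two SubJobs, each announcing its start, sleeping and announcing its end. The names SubJob1 and SubJob2 and the sleep durations are hypothetical, and this is not the code that Talend generates.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration only: this is not Talend-generated code.
class ParallelSketch {

    // Simulates a child Job: announce start, sleep, announce end.
    static Runnable childJob(String name, long sleepMillis, List<String> log) {
        return () -> {
            log.add(name + " started");
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            log.add(name + " ended");
        };
    }

    // Starts both simulated SubJobs at once and waits for both to finish.
    static List<String> runBoth() {
        List<String> log = new CopyOnWriteArrayList<>();
        Thread first = new Thread(childJob("SubJob1", 600, log));
        Thread second = new Thread(childJob("SubJob2", 100, log));
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        runBoth().forEach(System.out::println);
    }
}
```

Because the second task sleeps for a much shorter time, its "ended" message normally appears before the first task's, just as the shorter SubJob completes first in the console output.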

Job Overview

Note that Multi thread execution is set to TRUE.

Image 1

Job Execution

Image 2

Job Design

As well as setting the parent Job's Multi thread execution property to TRUE, you'll notice that the two SubJobs are not connected. Usually, for non-parallel Jobs, these would be connected using the OnSubJobOk trigger. Not only does this ensure that the second SubJob only executes on the successful completion of the first, but it also determines the execution order. For a non-parallel Job, SubJobs usually execute in the order that they were added; however, it is good practice to always connect them.
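For contrast, the non-parallel case can be pictured with the plain-Java sketch below. It is only a loose analogy for a chain of OnSubJobOk triggers: each step runs in order and the chain stops at the first step that reports failure. The class SequentialSketch and the SubJob names are hypothetical, and this is not how Talend implements triggers internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only: a loose analogy for OnSubJobOk chaining.
class SequentialSketch {

    // Runs the steps in insertion order and returns the names of those that
    // were started. A step returning false stops the chain, so later steps
    // never start.
    static List<String> runInOrder(Map<String, BooleanSupplier> subJobs) {
        List<String> started = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> entry : subJobs.entrySet()) {
            started.add(entry.getKey());
            if (!entry.getValue().getAsBoolean()) {
                break; // no "Ok" signal, so the next SubJob is not triggered
            }
        }
        return started;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> chain = new LinkedHashMap<>();
        chain.put("SubJob1", () -> true);
        chain.put("SubJob2", () -> true);
        System.out.println(runInOrder(chain));
    }
}
```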

Adding a Dependency

Let's say we now want to add a third SubJob that should only commence execution on completion of both of the other two SubJobs. How do we achieve this?

This third SubJob can only have a single OnSubJobOk connector attached, so we cannot easily use some simple triggers to do this.

The solution that I use is to wrap up any SubJobs that should run in parallel into a new Parallel Group Job, set this Job's Multi thread execution property to TRUE, and then make the new SubJob dependent on this group. This solution both achieves the desired result and provides clarity in the Job design.
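The same pattern can be sketched in plain Java, assuming an ExecutorService stands in for the Parallel Group Job. The group runs two simulated SubJobs in parallel and only returns once both have finished successfully; the parent then runs the third SubJob. All names here (ParallelGroupSketch, SubJob1 to SubJob3) and the sleep durations are hypothetical, and this is not Talend-generated code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: this is not Talend-generated code.
class ParallelGroupSketch {

    // Simulates a SubJob: announce start, sleep, announce end.
    static Callable<Void> subJob(String name, long sleepMillis, List<String> log) {
        return () -> {
            log.add(name + " started");
            Thread.sleep(sleepMillis);
            log.add(name + " ended");
            return null;
        };
    }

    // Stand-in for the Parallel Group Job: runs its members in parallel and
    // returns only when all of them have completed without error.
    static void runGroup(List<String> log) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<Void>> members = Arrays.asList(
                    subJob("SubJob1", 300, log),
                    subJob("SubJob2", 100, log));
            for (Future<Void> f : pool.invokeAll(members)) {
                f.get(); // rethrows if a member failed
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Parallel group failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for the single-threaded parent Job: group first, then SubJob3.
    static List<String> runParent() {
        List<String> log = new CopyOnWriteArrayList<>();
        runGroup(log);
        log.add("SubJob3 started");
        log.add("SubJob3 ended");
        return log;
    }

    public static void main(String[] args) {
        runParent().forEach(System.out::println);
    }
}
```

The parent itself stays single-threaded; only the group is parallel, which is what keeps the dependency simple to express.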

The following screenshots show this amended design.

Job Overview

Note that Multi thread execution is now set to FALSE for the parent Job. It is the Job ParallelExampleGroup1 that is now Multi-threaded.

Image 3

Job Execution

Image 4

Job Design

As can be seen from the above screenshots, our first two SubJobs have been encapsulated in a new SubJob, ParallelExampleGroup1, and our new SubJob, ParallelExampleSJ3, is dependent on this SubJob.

In creating this second example, I simply renamed ParallelExample to ParallelExampleGroup1, recreated ParallelExample, and then added the two dependent SubJobs to it.





© www.TalendByExample.com