Expert Consultancy from Yellow Pelican

Talend Job Recovery using Checkpoints

A site about Talend

Job Recovery

With any software, be it Talend or otherwise, a good way to think about running it is that it should be able to run with impunity.

What I mean by this is that, when running your software, you should not have any concerns that it will do the wrong thing. This can be especially true of Data Integration and other similar tasks, which may be complex and move large volumes of data from one system to another.

Talend is no exception to the rule that you need to write robust software that can handle errors in a controlled manner. In this article, we're going to look specifically at a Job's restart capability. That is, you've run a Job and, at some point in its processing, there has been an error. You now need to recover from this situation and, logically, resume processing from the point of failure.

Checkpoints

Checkpoints are a way to break your Job down into a number of logical steps. When a Job fails, you can analyse the checkpoints to see where the point of failure is, and then resume from that point, performing any additional recovery steps if needed. Even for the simplest Job, you may want a single checkpoint, allowing you to handle the recovery of the Job in its entirety.

Talend Enterprise Checkpoints

Talend Enterprise allows checkpoints to be defined for OnSubjobOk events. I tend not to use enterprise-only features and it is also helpful to understand how to do this for yourself.

tCheckpoint Component

Unfortunately, the tCheckpoint component does not exist. It would be a relatively simple component to implement and maybe, one day, I'll have a go at writing one. In the meantime, the rest of this article looks at what this component might do if it did exist and, of course, how you can implement checkpoints using the standard components provided by Talend.

Logically Grouping Your Tasks

Your Talend project is likely to contain a number of Jobs that each perform a specific work stream. You may also have one or more master Jobs whose responsibility is to execute and control a number of child Jobs. If I'm working on a data migration project to migrate data from SAP to Salesforce, then I might have a master Job named SAPSalesforce, i.e. its name identifies both the source and the target. This is my master Job. I may then have a number of child Jobs that perform specific tasks within this migration, for example, SAPSalesforceAccount and SAPSalesforceContact. I always create a Folder for these child Jobs, for example, LibSSAPSalesforce; this helps to organise the Talend Repository and means that only master Jobs are located at the top level.

Moving beyond this initial organisation, I may choose to sub-divide a complex child Job into a number of additional child Jobs. For example, SAPSalesforceAccount may call the child Jobs SAPSalesforceAccountRead, SAPSalesforceAccountWrite and SAPSalesforceAccountAudit.

You will, of course, have your own naming conventions and organisational requirements. The point is that it is helpful to think about the organisation of your Jobs at the start, so that not only are they easier to maintain, but you can also start thinking about Job control and recovery.

Checkpointing Your Job

Let's now look at checkpointing the imaginary Job that we described above. The Job hierarchy looks something like this. We won't worry too much about what these Jobs actually do; we'll just think about how we could checkpoint the Jobs themselves.

	SAPSalesforce (master Job)
		SAPSalesforceAccount
			SAPSalesforceAccountRead (Reads data from SAP)
			SAPSalesforceAccountWrite (Writes data to Salesforce)
			SAPSalesforceAccountAudit (Audits the load)
		SAPSalesforceContact
			...
		...

Now we can look at checkpointing SAPSalesforceAccount. Within this Job, we call three child Jobs. For now, our simple requirement is to determine whether the previous execution of this Job has failed and, if so, from what point we should recover.

		SAPSalesforceAccount
			// CHECKPOINT #0
			SAPSalesforceAccountRead (Reads data from SAP)
			// CHECKPOINT #1
			SAPSalesforceAccountWrite (Writes data to Salesforce)
			// CHECKPOINT #2
			SAPSalesforceAccountAudit (Audits the load)
			// CHECKPOINT #3

As can be seen from the above, our Job SAPSalesforceAccount has four checkpoints, and these identify the progress of our Job: having done nothing (CHECKPOINT #0), having completed all processing (CHECKPOINT #3), or something in between. If our Job has made it as far as CHECKPOINT #1, then we know that we've read all of our data from SAP, but we do not know the status of writing data to Salesforce.
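The mapping from the last checkpoint reached to the work still outstanding can be sketched in a few lines of Java. This is only an illustration: the class AccountResumePlan and the step labels "Read", "Write" and "Audit" (shorthand for the three child Jobs) are assumptions made for this example, not anything provided by Talend.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps the last checkpoint reached to the steps still to run.
final class AccountResumePlan {
    // Child Jobs in execution order; step i runs between CHECKPOINT #i and CHECKPOINT #(i + 1).
    static final List<String> STEPS = Arrays.asList("Read", "Write", "Audit");

    // Returns the steps still to run, given the highest checkpoint recorded so far.
    static List<String> remainingSteps(int lastCheckpoint) {
        if (lastCheckpoint < 0 || lastCheckpoint > STEPS.size()) {
            throw new IllegalArgumentException("Unknown checkpoint: " + lastCheckpoint);
        }
        return STEPS.subList(lastCheckpoint, STEPS.size());
    }

    public static void main(String[] args) {
        // The previous run reached CHECKPOINT #1: the read finished, the write did not.
        System.out.println(remainingSteps(1)); // prints [Write, Audit]
    }
}
```

A run that stopped at CHECKPOINT #1 therefore resumes with the write step, and a run that reached CHECKPOINT #3 has nothing left to do.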

As well as telling us where to resume our processing, checkpointing also gives us the opportunity to run some recovery code, i.e. logic that should only be executed for a specific recovery scenario. For now, we'll assume that SAPSalesforceAccountWrite does the right thing (has its own checkpointing and recovery) and that the only requirement for this Job is to resume at the point of failure.
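Should a Job later need recovery code, one possible shape for it is a small registry that ties a piece of logic to the checkpoint a failed run stopped at. The sketch below is an assumption about how this could be organised; RecoveryHooks, onResumeFrom and recover are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of recovery logic, keyed by the checkpoint a failed run stopped at.
final class RecoveryHooks {
    private final Map<Integer, Runnable> hooks = new HashMap<>();

    // Registers logic that should run only when resuming from the given checkpoint.
    void onResumeFrom(int checkpoint, Runnable recovery) {
        hooks.put(checkpoint, recovery);
    }

    // Runs the hook registered for lastCheckpoint, if any; returns whether one ran.
    boolean recover(int lastCheckpoint) {
        Runnable hook = hooks.get(lastCheckpoint);
        if (hook == null) {
            return false;
        }
        hook.run();
        return true;
    }
}
```

With this arrangement, a hook registered for CHECKPOINT #1 would run only when the previous execution failed somewhere between CHECKPOINT #1 and CHECKPOINT #2, and would be skipped in every other case.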

What is a checkpoint?

Checkpoints are simple transactions that you can record about your Job's execution. Your Job can both read and write these transactions. At the start of the Job's execution, it reads these transactions to determine the status of its last execution. As it runs, it writes additional transactions to record its progress.
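The read-then-write cycle described here can be sketched as follows. The example keeps the transactions in an in-memory list purely for illustration, and the class name CheckpointedRun is an assumption; it shows the control flow only, not how Talend itself would run the steps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a Job that reads its checkpoints at start-up and writes them as it runs.
final class CheckpointedRun {
    // The checkpoint transactions recorded so far, oldest first (in memory for this sketch).
    final List<Integer> checkpoints = new ArrayList<>();

    // Runs the steps not yet covered by a checkpoint, recording progress after each one.
    void run(List<Runnable> steps) {
        // Read: where did the last execution get to?
        int last = checkpoints.isEmpty() ? 0 : checkpoints.get(checkpoints.size() - 1);
        if (checkpoints.isEmpty()) {
            checkpoints.add(0); // CHECKPOINT #0: nothing done yet
        }
        for (int i = last; i < steps.size(); i++) {
            steps.get(i).run();     // may throw, leaving the log at checkpoint i
            checkpoints.add(i + 1); // Write: step i has completed
        }
    }
}
```

If the second of three steps throws, the log ends at checkpoint 1; calling run again with the same steps skips the first step and carries on from the second.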

Where should I store my checkpoints?

There are two places that you may choose to store your checkpoints: in a database, or in a regular file on your file system. There are pros and cons to both approaches, mostly centring on the recoverability of your checkpoints. Yes, not only does your Job need to be recoverable, but you also need to make sure that you do not lose your checkpoints. For the purposes of this article, we're going to store our checkpoints in a regular file on our magical file system that never loses data.
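As a rough idea of what file-based storage could look like, the sketch below appends one line per checkpoint and reads the file back to find the last checkpoint for a Job. The class name FileCheckpointStore and the "job;checkpoint" line format are assumptions chosen for this example; it relies only on the standard java.nio.file API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical file-backed checkpoint store; one "job;checkpoint" line per transaction.
final class FileCheckpointStore {
    private final Path file;

    FileCheckpointStore(Path file) {
        this.file = file;
    }

    // Appends a checkpoint transaction, creating the file on first use.
    void write(String job, int checkpoint) {
        String line = job + ";" + checkpoint + System.lineSeparator();
        try {
            Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the last checkpoint recorded for the Job, or -1 if none has been recorded.
    int lastCheckpoint(String job) {
        if (!Files.exists(file)) {
            return -1;
        }
        int last = -1;
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String[] parts = line.split(";");
                if (parts.length == 2 && parts[0].equals(job)) {
                    last = Integer.parseInt(parts[1].trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return last;
    }
}
```

Appending rather than rewriting keeps the full history of transactions, at the cost of the file growing over time; the sketch makes no attempt to handle a file that is lost or corrupted.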





© www.TalendByExample.com