A site about Talend
In this first article on XML, we'll create a new input file definition in our Metadata Repository. For this example, we'll use some data that is freely available from Wikipedia, subject to their Terms & Conditions. This data is both simple and useful, so it provides a good real-world use for some XML content.
For further information on downloading the Wikipedia Database, follow the link.
The following shows a few entries from the Wikipedia Abstract Database. These files are large. You will find it helpful to create your own small sample file or to use the data shown below.
<feed>
  <doc>
    <title>Wikipedia: A</title>
    <url>http://en.wikipedia.org/wiki/A</url>
    <abstract>A (named a , plural aes) is the first letter and vowel in the ISO basic Latin alphabet. It is similar to the Ancient Greek letter alpha, from which it derives.</abstract>
    <links>
      <sublink linktype="nav"><anchor>History</anchor><link>http://en.wikipedia.org/wiki/A#History</link></sublink>
      <sublink linktype="nav"><anchor>Typographic variants</anchor><link>http://en.wikipedia.org/wiki/A#Typographic_variants</link></sublink>
    </links>
  </doc>
  <doc>
    <title>Wikipedia: Alabama</title>
    <url>http://en.wikipedia.org/wiki/Alabama</url>
    <abstract>Elevation adjusted to North American Vertical Datum of 1988.</abstract>
    <links>
      <sublink linktype="nav"><anchor>Etymology</anchor><link>http://en.wikipedia.org/wiki/Alabama#Etymology</link></sublink>
      <sublink linktype="nav"><anchor>History</anchor><link>http://en.wikipedia.org/wiki/Alabama#History</link></sublink>
    </links>
  </doc>
  <doc>
    <title>Wikipedia: Achilles</title>
    <url>http://en.wikipedia.org/wiki/Achilles</url>
    <abstract>In Greek mythology, Achilles (; , Akhilleus, ) was a Greek hero of the Trojan War and the central character and greatest warrior of Homer's Iliad. Achilles was said to be a demigod; his mother was the nymph Thetis, and his father, Peleus, was the king of the Myrmidons.</abstract>
    <links>
      <sublink linktype="nav"><anchor>Etymology</anchor><link>http://en.wikipedia.org/wiki/Achilles#Etymology</link></sublink>
      <sublink linktype="nav"><anchor>Birth</anchor><link>http://en.wikipedia.org/wiki/Achilles#Birth</link></sublink>
    </links>
  </doc>
</feed>
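If you would rather cut a small sample from the full download than type one by hand, something like the following Python sketch could do it. It assumes the file has the same feed/doc layout as the entries above; the function name write_sample and its max_docs parameter are made up for this illustration and are not part of Talend.

```python
import io
import xml.etree.ElementTree as ET


def write_sample(source, target, max_docs=3):
    """Copy the first max_docs <doc> elements from source into a small <feed> file.

    source and target may be file paths or binary file objects.
    Returns the number of <doc> elements written.
    """
    sample_root = ET.Element("feed")
    # iterparse reads the input incrementally; we stop as soon as we have
    # enough <doc> elements, so the rest of the large file is never parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "doc":
            sample_root.append(elem)
            if len(sample_root) >= max_docs:
                break
    ET.ElementTree(sample_root).write(target, encoding="utf-8", xml_declaration=True)
    return len(sample_root)
```

A call such as write_sample("abstract-dump.xml", "sample.xml") would then produce a file you can browse to in the wizard; both file names here are placeholders.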
Open the New XML File dialog by selecting Metadata->File XML and then choosing Create file xml from the popup menu (mouse right-click).
For this tutorial, set Name to WikipediaAbstract and press Next.
Select the specification model. This may be either Input XML or Output XML. For this tutorial, we are creating an Input XML model. Hit the Next button to proceed.
Hit the Browse button, navigate to your sample XML file and then select it. You should see it displayed in the Schema Viewer pane. Hit the Next button to continue.
This dialog allows you to perform your schema mapping. The two key aspects of this are defining an Xpath loop expression and the Fields to extract. You can input these values manually, or drag values from the Source Schema pane.
The Xpath loop expression field allows us to specify an Absolute XPath expression. In our sample XML file, we have a number of <doc>...</doc> elements, and these are the elements that we would like to loop through to produce our row data.
You can drag the element doc from the Source Schema pane to Absolute XPath expression.
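To see what the loop expression selects, here is a rough Python sketch using the standard library's ElementTree. It assumes the loop expression for this sample is /feed/doc, which is what the structure of the sample file implies; the trimmed SAMPLE string is only for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample data, for illustration only.
SAMPLE = """<feed>
  <doc><title>Wikipedia: A</title><url>http://en.wikipedia.org/wiki/A</url></doc>
  <doc><title>Wikipedia: Alabama</title><url>http://en.wikipedia.org/wiki/Alabama</url></doc>
</feed>"""

root = ET.fromstring(SAMPLE)  # root is the <feed> element

# ElementTree evaluates paths relative to the element they are called on,
# so the absolute XPath /feed/doc becomes "./doc" when starting at <feed>.
loop_elements = root.findall("./doc")

# Each element matched by the loop expression becomes one output row.
for doc in loop_elements:
    print(doc.tag, "->", doc.findtext("title"))
```

Every doc element matched by the loop becomes one row; the sublink elements nested deeper in the file are ignored unless they are mapped explicitly.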
We can now specify the fields to extract. As with the Xpath loop expression, we can drag these elements from the Source Schema pane. The elements of interest are title, url and abstract. Drag these elements across to the Relative or absolute XPath expression fields in the Fields to extract grid.
The Fields to extract grid allows you to specify both the Relative or absolute XPath expression and the (output) Column Name. You'll see from the following screenshot that the element abstract has been renamed to abstractText. This is because abstract is a Java reserved word, so it cannot be used as a column name in the Java code that Talend generates.
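The mapping from relative XPath expressions to column names can be pictured with a short Python sketch. This is not how Talend implements it; it only mirrors the grid, and the names FIELDS and extract_rows are invented for the example.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample data, for illustration only.
SAMPLE = """<feed>
  <doc>
    <title>Wikipedia: Alabama</title>
    <url>http://en.wikipedia.org/wiki/Alabama</url>
    <abstract>Elevation adjusted to North American Vertical Datum of 1988.</abstract>
  </doc>
</feed>"""

# Relative XPath expression -> output column name, as in the Fields to extract grid.
# The abstract element is mapped to the column abstractText.
FIELDS = {"title": "title", "url": "url", "abstract": "abstractText"}


def extract_rows(xml_text, loop_path="./doc", fields=FIELDS):
    """Return one dict per loop element, keyed by output column name."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.findall(loop_path):
        # Each field path is evaluated relative to the current loop element.
        rows.append({column: element.findtext(path) for path, column in fields.items()})
    return rows


rows = extract_rows(SAMPLE)
```

Each row carries the keys title, url and abstractText, which is the shape you should expect to see in the Preview pane.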
Now that you've completed the mapping, hit the Refresh Preview button on the Preview pane. The dialog should now look like the following screenshot, with the Preview pane showing correctly mapped row data. Once the file has been correctly mapped, hit the Next button to proceed.
You will now be presented with the definition of the output Schema. Talend has made its best effort at correctly defining this schema by sampling the available data. The datatypes have been correctly mapped; however, we can now take the opportunity to increase the column lengths, as shown in the next screenshot. When you are happy that your schema is correctly defined, hit the Finish button to complete this operation.
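The reason for increasing the lengths is that a small sample only tells you the longest value seen so far. The Python sketch below illustrates the idea; it is not Talend's algorithm, and the function suggest_lengths and its headroom factor are assumptions made for this example.

```python
def suggest_lengths(rows, headroom=2.0):
    """Suggest a length for each column from sampled rows.

    rows is a list of dicts (column name -> string or None).
    The longest sampled value is multiplied by headroom, because the
    full file is likely to contain longer values than the sample.
    """
    longest = {}
    for row in rows:
        for column, value in row.items():
            # Treat missing values (None) as empty strings.
            longest[column] = max(longest.get(column, 0), len(value or ""))
    return {column: int(length * headroom) for column, length in longest.items()}
```

With a handful of sampled rows, the suggested lengths are only a starting point; a field such as abstractText can be far longer in the full file than in any three-entry sample.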
You have now successfully created an XML input file definition and can use it within your Jobs to read and process XML data. In the next article in this series, we will use this data in an example Job.