Let us Talend – Part 2 series

Let's start our first Talend task by downloading the software. Go to https://www.talend.com/products/ and download "Open Studio for Data Integration". There are many products there for specific needs, but this is the version I have chosen to start with.

After downloading and installing, go to the installation directory and launch the exe file. This fires up the GUI and asks you to create a new project (a project is just a folder containing all the files related to one of your ETL tasks). It should look like this:

I have given the name "MyFirst_POC"; you can use whatever name you like, then click Finish. This loads the initial files and creates some folders (look inside the installation folder: there is a folder named "workspace", and many initial folders will have been created inside it). Once loading finishes, the tool opens.

Before diving into the tool, there is something you need to know. In Talend, every small executable you create is called a "Job". Under the hood, a Job is simply a single generated Java class.
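To make the "a Job is a Java class" idea concrete, here is a purely illustrative sketch. The class and method names are made up; the code Talend actually generates for a Job is far larger.

```java
// Purely illustrative sketch: the real code Talend generates for a Job
// is much larger, but conceptually every Job compiles down to one Java
// class whose methods run the components you wired together.
public class JobSketch {

    // Extraction, transformation and loading all happen in generated
    // methods called from here.
    public String run() {
        return "Job finished";
    }

    public static void main(String[] args) {
        System.out.println(new JobSketch().run()); // Job finished
    }
}
```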

Let's take the following scenario for one simple task (copying the contents of a CSV file into a new file with the same format):

  • Load the data from a CSV file (Extraction)
  • Make changes if necessary (Transformation)
  • Write the data to a new CSV file (Load)

Load the data from a CSV file (Extraction)

Before proceeding, let's get a sample file to load the data from, which I have attached below.

Since every executable or task is called a "Job" in Talend, let's start by creating a Job to read the data from this input file. You can create the Job by right-clicking "Job Designs" in the Repository pane and selecting "Create Job", as shown below:

Give it a name; I have used "Job1" and added a short description.

Now there is one more thing you need to learn about, called a "Component". A component is another Java class that performs one specific operation. Talend ships with a large number of components, which is a big part of what makes the tool popular. For example, the component tMysqlInput extracts data from a MySQL table. Every Job has at least one component defined within it. You can visualize Jobs and components as below:

Each Job can consist of sub-jobs, each of which in turn contains a few components, or the Job can consist of components directly.

Coming back to our use case, we can think of "loading the input CSV file" as one component that does the job of loading the data. So which component do we have for this? It's tFileInputDelimited. Its purpose is to open a file, read it row by row, split each row into fields, and send those fields, as defined in the schema, to the next Job component via a Row link.

Let's go ahead and create this component in our Job. Go to the "Palette", which contains the list of available components, search for the component mentioned above, and drag it onto the Job window:

After the drag and drop, you should see a warning icon on the component. Hover over it and you will see two warnings: "No schema has been defined" and "No output connector has been linked". This is expected, given the component's definition (repeated from above): tFileInputDelimited opens a file and reads it row by row, splitting each row into fields as defined in the schema (the schema, defined inside the component, describes which columns each row contains and their data types), then sends those fields to the next Job component (the output component, which we haven't added yet) via a Row link.
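Conceptually, what tFileInputDelimited does boils down to something like the following sketch. All names here are made up for illustration; the real generated code also handles data types, encodings, and rejected rows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of tFileInputDelimited's behavior: read a file
// row by row, skip the header, and split each row into fields using
// the separator.
public class DelimitedReaderSketch {

    public static List<String[]> readRows(BufferedReader reader,
                                          String separator,
                                          int headerLines) {
        List<String[]> rows = new ArrayList<>();
        try {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= headerLines) {
                    continue; // the "Header" setting: skip this many lines
                }
                rows.add(line.split(separator, -1)); // one entry per schema column
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample data, not the attached file.
        String csv = "Id,Name,JoinDate\n1,Alice,15-01-2013\n2,Bob,20-02-2013";
        List<String[]> rows =
                readRows(new BufferedReader(new StringReader(csv)), ",", 1);
        System.out.println(rows.size() + " data rows read"); // 2 data rows read
    }
}
```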

So let's go ahead and define the schema for that component. To do that, click the component twice (not a double-click, but two single clicks, and make sure to click the smaller rectangle, as that is the component). Click the "…" button next to "Edit schema", and a new window opens where you can define the columns of the CSV file. The screenshot below shows how I have defined them.

For date columns, press "Ctrl + Space" to show the different formats available. Make sure to uncheck the "Nullable" checkbox for the "Id" column, then press "OK". Next, in the "File name/Stream" and "Field Separator" fields, select the CSV file (attached in the section above) and enter a comma as the separator. Also set "Header" to "1" to indicate that the file has a header row; otherwise you will get an error.

That completes Step 1 (the Extraction phase: loading data from the CSV file into the input component).

Make changes if necessary (Transformation)

Now we need a second component, one that transforms the data (if necessary). The component that does this is tMap. It routes and/or transforms data from one component (typically an input component) to another (typically an output component). So, as in the section before, search the Palette for tMap and drag it onto the Job window. Then right-click the first component and select "Row -> Main" to draw an arrow from the input component to the map component. In essence, your screen should look like this:

After that, rename the "row1" link to "Employee Input", just for reference, to make things easier to follow. Notice also that the tool has surrounded the components with a pale blue box, marking them as a subjob. We have added the map component, but we haven't done anything inside it yet; that's because the map component needs an output component to map the inputs to. So let's go ahead and create an output component (the third phase of ETL: Load).

Output the data into a new CSV file (Load)

The output component we want in this scenario is tFileOutputDelimited, since our input is delimited and we want the output in the same format. Drag and drop it onto the Job window, then configure it the same way as the input component: give it a file name and a separator. As before, connect the map component to this output component by right-clicking the map component and selecting "Row -> New output (main)", and give the output link a name such as "MappedEmployee". In a nutshell, your screen should look like this:

You may notice an error on the map component. This is because no schema has been defined for it yet. To define one, click the map component; the Map Editor opens, as shown below:

In the screenshot above I have done an additional step: copying the columns from the input side onto the "MappedEmployee" output. This is done in the bottom window: select all the columns, press the "Copy" icon at the bottom, go to the "MappedEmployee" region, and press the "Paste" icon. After that, press "Auto map!" at the top right. This auto-maps the columns, filling in some default expressions (ignore these for now; we can discuss them later). Press "OK" and you will be asked whether you want to propagate your changes; click "Yes" (this copies the output schema from "MappedEmployee" onto the output component).
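Under the hood, the auto-mapped tMap in this job amounts to a plain field-by-field copy from the input row to the output row. Here is an illustrative sketch; the row classes and field names are made up (only "id" mirrors the Id column mentioned earlier), and Talend generates its own row classes.

```java
// Illustrative sketch of what the auto-mapped tMap amounts to in this
// job: a field-by-field copy from the input row to the output row.
public class MapSketch {

    static class EmployeeInput {   // schema of the "Employee Input" link
        int id;
        String name;
        String joinDate;
    }

    static class MappedEmployee {  // schema of the "MappedEmployee" link
        int id;
        String name;
        String joinDate;
    }

    // Each assignment corresponds to one default expression that
    // auto-mapping fills in (e.g. "row1.Id" in the expression column).
    static MappedEmployee map(EmployeeInput in) {
        MappedEmployee out = new MappedEmployee();
        out.id = in.id;
        out.name = in.name;
        out.joinDate = in.joinDate;
        return out;
    }

    public static void main(String[] args) {
        EmployeeInput in = new EmployeeInput();
        in.id = 1;
        in.name = "Alice";
        in.joinDate = "15-01-2013";
        System.out.println(map(in).name); // Alice
    }
}
```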

Now press "Save All" to save everything you have done so far. Before running the Job, open the "Problems" tab: go to "Window -> Show View", search for "Problems", and press "Open". This view shows the problems that occur when the Job runs.

Run the Job, and you might get an error like "Couldn't parse date ...". This happens because the date format we chose does not match the format used in the CSV file. Go back to the input component, edit the schema definition, and change the date format (use "Ctrl + Space") as shown below:
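To see why the format matters, here is a small standalone sketch that reproduces the same kind of failure with Java's SimpleDateFormat (Talend-generated code uses Java date patterns for parsing). The sample value and patterns below are illustrative, not taken from the attached file.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// A date value only parses if the pattern set in the schema matches
// what is actually in the file; a mismatch is what surfaces as the
// "Couldn't parse date" runtime error in the Job.
public class DateFormatCheck {

    public static boolean matches(String value, String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setLenient(false); // reject sloppy matches
        try {
            fmt.parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("15-01-2013", "dd-MM-yyyy")); // true
        System.out.println(matches("15-01-2013", "MM/dd/yyyy")); // false
    }
}
```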

Then run the Job again. If everything went fine, it will have copied the contents of the input file into the output file: extracting the data from the input file, transforming it (no transformation in this case), and finally loading it, which is the full ETL cycle.

That's it! You have finally done a small demo of an ETL operation in Talend.

I will explore further and write up some of the use cases I have tried in upcoming posts in this series.

Comments are always welcome!
