DataStage Looping for Multiple Table Processing
Each of DataStage’s (hereby known as DS for the rest of this post) components has a particular purpose within the DS ecosystem. However, the ‘Loop’ Stage can radically speed up your development and execution, albeit in certain instances. Read on and find out if it fits your needs. The ‘Loop’ stage can accept textual or numeric values. It accepts different delimiter values, as well, such as comma, or space. You can define a day of the week (“Monday, Tuesday, Wednesday”) or a file name (“Payroll_file01”), or a table name, for instance.
In this article, we are just going to focus on using loops for tables. Say that you have 20 or more tables of data that you need copied from a source database to a destination database on a daily or weekly basis.
You could create a job for each table. It goes without saying that for 20 tables or more this would take some time. If you need to make a change in one job, you might have to make that change in your other jobs.
However, if you can use RCP (Run-time Propagation automatically defines the schema of each table at run time), then you are in luck. You can incorporate the ‘Loop’ stage, by creating a single job and providing a list of table names.
DS’s ‘Loop’ stage makes the operation of copying data from multiple tables simple. Below is the description of a typical job incorporating this stage.
- Creating a Loop Job
- Creation of a Parallel Job - A
- Creation of a Sequence (Master) Job - B
- Executing the Job
Creating a Loop Job
We will use an example to demonstrate the usage of this Stage. We have been provided with the list of tables, the source, and destination configuration information. Our task is to copy the data from the source to the target for these tables. Choose your table names.
These are named for demonstration:
CREATING A LOOP JOB
1. Create a Parallel job with two DB2 Connector stages (Your Database type might differ, so choose the appropriate one, DB2, Neteeza, etc). The left side connector will be your source database. The connector on the right side will be your target connector.
2. Save this job with a name of your choice. Remember all job names in a DS Project are required to be unique and cannot start with a number. We will refer to this job as TEST_LOOP.
3. Go to the Menu, while in DS Designer, and choose Edit > Job Properties
4. The Job Properties window will display. Click on the ‘Parameters’ tab. We will be adding a parameter entry as a placeholder for the table name. Type the following and click OK to save:
Parameter name: TABLE_NAME
Type: String (chosen from the dropdown menu)
5. Your TEST_LOOP job should still be displaying in the Designer. Double click on the source connector to open the Properties. Click on the Properties tab.
6. Fill in the connection details for your source (Database name, User Name, Password) either manually or by using parameters. Using Parameters is recommended for a few reasons, but there is one outstanding reason. If a password expires, you only need to change it in one place, the parameter section. All jobs using the parameter will be updated with the new password.
Change and save
7. Change ‘Generate SQL’ to Yes. Click the button beside Table name and choose your newly created parameter called ‘TABLE_NAME’. This is the placeholder that will read a list of tables. We will provide that later. Save the changes by clicking OK.
8. Repeat this process for the Target connector. Fill in the connection information, change Generate SQL to Yes, and add the TABLE_NAME parameter. In the target, we must also define our write mode.
(Insert, Insert New Rows Only, Update, Delete then Insert….). We must also define the Table action (Append, Truncate, Replace). Click OK.
Creation of the Sequence (Master) Job – B
We will create a Sequence job called SEQ_LOOP that will act as a Master job. Sequence jobs control Parallel jobs. In our case, the Sequence loop will pass the name of the table down to the Parallel job so that it can execute the table copy.
1. With your newly created SEQ_LOOP job open, go to the menu and choose Edit > Job Properties. We are going to add the TABLE_NAME parameter in the Properties section. Click OK to save.
2. Go to the menu and choose View > Palette. Drag the following stages onto your canvas and name them:
Start Loop Stage: START_LOOP
End Loop Stage: END_LOOP
Job Activity Stage: GENERIC_JOB
Execute Command Stage: EC_SLEEP
3. Starting with the START_LOOP, left drag to the GENERIC_JOB, to join the stages. Join the others in the same way. It should look like this:
4. Double click the START_LOOP to access the properties. Choose these settings:
- Loop Type: List Loop
- Delimiter: Comma
- Delimited Values: TEST.TABLE1, TEST.TABLE2, TEST.TABLE3, TEST.TABLE4.
(You will change this to your table names)
Click OK to save.
This is the list of tables we will be copying. In the Parallel job each item on the list is represented by the #TABLE_NAME# parameter. As stated, your table names will be different. This job reads each value and processes the table in the Parallel job.
For each table it processes, it runs one iteration of the loop. In doing so, it passes the table name to the Job Activity stage.
Describing this further, in the Job Activity stage named GENERIC_JOB, we define the Parallel job. In our case it is our previously created job called TEST_LOOP. TABLE_NAME is already listed as a parameter. Click on the row under the ‘Value Expression’ box. Click on the square with the ellipsis (3 dots). Click on ‘Activity Variable’ and choose the Start_Loop.$Counter. Click Ok to save.
5. Moving on to the Execute command stage named EC_SLEEP, this stage is included to ensure that the job does not process the next table too fast. In some DS loop operations, there have been issues with the job crashing or freezing. For this reason, we will pause the jobs for 5 seconds.
In the ExecCommand tab, type the UNIX command ‘sleep 5’ in the Command box.
6. The End Loop Stage named END_LOOP has no parameters or settings.
Executing the Job
Finally, we are ready to run our job. Running the SEQ_LOOP job will copy each of your defined tables, sequentially. You can use the DS Director to schedule this job to run daily, or weekly. If you need to add or remove tables, just change the list in the START_LOOP.
In addition to the method described above, you can also have the list of tables in a text file and read that into the TABLE_NAME parameter. The advantage of this, is that you can have a dynamically changing list of table names.
We have described a rudimentary way of using the Loop Stage. It is worth repeating that the loop definition can be characters, or numeric, or Parameter/Variables.
Of course, you can modify and add complexity to the job. For instance, you could use Sequential file input, instead of Database input. In addition, you can add logging, or file manipulation quite easily. These are just some of the ways you can enhance this type of job.