How to Turn Text into Data Using tNormalize and tJavaFlex in Talend

In this example, we have a set of XML files, each representing the response to a stock update (a fabricated example). Each XML file has one critical field containing a text-based description of the result of the transaction. We need to analyse the transaction results, and so need the data in a structured format.

The interesting part of the process was applying information from one row down to later rows, using a “tJavaFlex” component to hold onto some “state” as the process runs.

This article was written using the open source Talend Studio for Big Data, which can be downloaded from here.

The field with the data we need to parse is “<result_summary>“. You can see below that this field tells us what action was taken during a transaction.

  • The “Product Category” section of the product had a change to the “Classification”
  • The “Nutritional Information” section of the product had an addition and a removal of an ingredient, as well as a change to an ingredient description
  • The primary keys are provided in square brackets [] e.g. [ 1987261 ]

We need to get this data into this format; the key features are that we have separated out the action being taken, the field the action was taken against, the primary key (if available) and the “section” the change occurred in.

The end result looks like this; the important fields are highlighted in yellow.

The hardest thing conceptually to resolve is how we apply the “section” value to the relevant rows of data. In effect, we have to extract the section name from one row, and then apply it to subsequent rows (and identify when a new section has been entered).

We will break this XML text field into individual rows of data using a “tNormalize” component, splitting on an “end of line” character, and then process each line of data.

We have to somehow take these “section” values:

…and apply them to the lines below, to end up like this:

Conceptually we have to somehow “know” when we are entering a section, and then hold onto the section name so we can apply it to the next row(s).

For this, we have to use a tJavaFlex to remember the section we are processing.

In this example, the XML field has a lot of structure we can use. It has semi-colons we can use to break things up, and each set of changes is introduced by a section name. Each transaction record is prefixed with the type of change (such as “Remove” or “Add”), and primary keys are always provided in square brackets.

So the high-level process is:

  1. Break the XML field (result_summary) into rows based on “end of line” character using “tNormalize”
  2. Filter out empty rows
  3. Filter out rows not in the format “Text: More Text”
  4. Detect when we enter a “Section” and hold on to the section name
  5. Split the fields using substring and split functions (using “:” and “[“)
  6. Remove the “Section: Section Name” row from the final output

Step 1: Iterate over the XML files

We delete the output file (if it already exists) using a “tFileDelete” component, and iterate over the XML files using a “tFileList” component:

Step 2: Read the XML file

We use a “tFileInputXML” component to read in the XML file for the current iteration. This is a simple XML schema and easy to map. The important field we need to parse data from is the “result_summary” field:

Step 3: Break the “result_summary” field into rows

We need the “result_summary” field to generate an individual row of data for each line of text. For this, we use a “tNormalize” component to break the field into separate rows based on the “end of line” character (“\n” in Java in this case).

So now we are generating multiple records in the downstream flow based on the number of lines in the “result_summary” field:
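
Expressed as plain Java, the fan-out that “tNormalize” gives us looks something like this (a conceptual sketch only; the component does this for us, and the sample text is made up):

    // One input value becomes one output row per line of text.
    String resultSummary = "Product Changes:\n"
            + "Section: Product Category\n"
            + "Change: Classification\n";
    for (String line : resultSummary.split("\n")) {
        System.out.println(line); // each line flows downstream as its own row
    }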

Step 4: Filter out unwanted rows

We only want rows in the form “Text: More Text”. This keeps the section headings and removes noise such as “Product Changes:” (which has a colon at the end rather than in the middle of the text). We also get rid of empty rows.

For this, we use a “tFilterRow” component with a regular expression in the “Match” function, so that only values with a “:” in the middle flow through:
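
The exact expression will depend on your data, but one possible pattern for the “Match” function is shown below; it requires at least one character, then a colon, then at least one more character, so “Product Changes:” (trailing colon) and empty rows both fail to match and are filtered out:

    ".+:.+"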

Step 5: Add a tJavaFlex and update the schema

The “tJavaFlex” will perform the bulk of the work. We add fields to the schema here for the individual fields we plan to extract/calculate (“section”, “action”, “field”, “code”), as well as a field for the XML file name:

Step 6: Define the variables in the Start section of tJavaFlex

These variables will hold the calculated field values we are trying to extract, including the section name variable, which will hold on to the current section being processed.

Note that the scope of these variables (subjob level) means we can calculate the section name while processing the “Section: Section Name” row, and retain it for when we are processing later rows.
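
As a sketch, the Start section could look like this (the variable names are this example’s own choices):

    // Start code: runs once, before the first row arrives.
    // "section" deliberately persists across rows, so it can be applied
    // to every row until the next section heading is encountered.
    String section = null;
    String action = null;
    String field = null;
    String code = null;
    String fileName = null;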

Step 7: Grab the file name and clear the variables

The main code is executed for every input row. We clear the variables, apart from the “section” variable.

Note: We only want to change the section variable when we enter a new section.
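
In the Main section, that might look like this (assuming the iterating component is named “tFileList_1”; the global variable follows Talend’s usual <component>_CURRENT_FILE naming):

    // Main code: runs once per input row.
    // Grab the file name for the current tFileList iteration.
    fileName = (String) globalMap.get("tFileList_1_CURRENT_FILE");

    // Reset the per-row variables - but NOT "section", which must
    // survive until we meet the next "Section: ..." heading row.
    action = null;
    field = null;
    code = null;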

Step 8: Work out if the row being processed is a Section row or not

If the value being processed starts with “Section”, we split the row on the “:” character and store the section name in the variable, so it can be “remembered” and applied to subsequent rows. After we “split” the value into an array, the section name will be the second item in the array (index [1]).

If the value doesn’t start with “Section”, then it must be a value indicating a transaction, e.g. “Remove:Ingredient [ 1817567 ]”. We can therefore split this again using the “:” character, and take the action (e.g. “Remove”) and the field (e.g. “Ingredient [ 1817567 ]”) into the “action” and “field” variables:
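
A sketch of that logic in the Main section, assuming the incoming connection is named “row1” and the normalized text column is named “line” (both names are this example’s assumptions):

    String value = row1.line.trim();

    if (value.startsWith("Section")) {
        // e.g. "Section: Nutritional Information" -> "Nutritional Information"
        section = value.split(":")[1].trim();
    } else {
        // e.g. "Remove:Ingredient [ 1817567 ]" - the upstream filter
        // guarantees a ":" in the middle, so we get at least two items.
        String[] parts = value.split(":");
        action = parts[0].trim(); // e.g. "Remove"
        field = parts[1].trim();  // e.g. "Ingredient [ 1817567 ]"
    }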

Step 9: Extract any primary keys

We check whether we have a “primary key”, which will be inside square brackets, and extract the value into the “code” variable.

We use a combination of standard String methods (“contains”, “substring” and “indexOf”) to extract the code from any square brackets.

We may as well also remove the primary key from the “field” value we have already extracted, i.e. “Ingredient [ 1817567 ]” becomes “Ingredient”. Since we are extracting the primary key into a separate “code” field, we don’t need it as part of the field name:
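
Continuing the sketch, the extraction could look like this:

    // Pull any primary key out of the square brackets, then strip the
    // brackets (and the key) from the field name itself.
    if (field != null && field.contains("[") && field.contains("]")) {
        int open = field.indexOf("[");
        int close = field.indexOf("]");
        code = field.substring(open + 1, close).trim(); // "1817567"
        field = field.substring(0, open).trim();        // "Ingredient"
    }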

Step 10: Write the calculated fields into the output flow
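
As a sketch, assuming the outgoing connection is named “row2” and the file name column we added in Step 5 is named “filename” (both assumptions), the assignments at the end of the Main section would look like this:

    // Pass the original line through, plus the calculated fields.
    row2.line = row1.line;
    row2.section = section;
    row2.action = action;
    row2.field = field;
    row2.code = code;
    row2.filename = fileName;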

Step 11: Discard the “Section” line

We only needed the section line (“Section: Nutritional Information”) so we could grab the section name and apply it to other rows. We can discard this row now using a “tFilterRow” component.

We simply keep the values which don’t start with “Section” (note the “!” at the start of the expression):
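
In “tFilterRow” advanced mode, the condition would look something like this (again assuming the text column is named “line”):

    // Keep only rows that are NOT section headings.
    !input_row.line.startsWith("Section")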

Last Step: Write the value out to a file

We have extracted the data into a structured format, and have used “tJavaFlex” to detect state changes (entering a new “Section”), allowing us to apply data across rows.

We could likely do this with a “tMemorizeRows” component somehow, but that’s for another time. I am guessing it wouldn’t be straightforward, as we wouldn’t know in advance how many rows to memorize…

Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer. 

About the Author - Stephen Mackenzie, Data Analyst, Pharmaceutical Industry
An experienced data analyst with over 14 years of experience in the pharmaceutical industry, developing innovative solutions across a range of technologies including Talend, Oracle, Oracle Apex, Tibco Spotfire, Excel VBA, Java, SQL Server and Informatica.

 
