What’s New in Talend Data Preparation 1.2?
And now, ladies and gentlemen (drumroll), please welcome Talend Data Preparation 1.2 Free Desktop! In early July, we also introduced Talend Summer 16, which includes the commercial version of this 1.2 release, making it the first unified integration platform for governed, self-service data preparation.
In this blog, we will first run through the new functions of the free desktop version of Talend Data Preparation, and then present what differentiates the commercial version from the free desktop version.
If you haven’t heard of Talend Data Preparation before, I would suggest you take a look at this video before continuing this article and downloading the tool.
Standardizing Phone Numbers in Talend Data Preparation
According to a survey by EDQ.com, most marketing organizations are struggling with the quality of their customer and prospect data. Marketing departments welcome Talend Data Preparation because it provides an easy way to standardize and cleanse the most common contact data such as e-mails, first and last names, or job roles.
In Talend Data Preparation 1.2, phone standardization capabilities are now included as well! Not only can the tool automatically recognize phone numbers together with their country (such as US, UK, Japan, France or Germany) and cleanse potentially invalid values accordingly, but the new “format phone numbers” capability also lets you standardize them according to their country, in formats such as international, national, E164 and RFC3966, in a single click.
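To make the four output formats concrete, here is a minimal Python sketch for a single US number. It is an illustration only, not the tool's implementation: the function name format_us_phone is hypothetical, it assumes a 10-digit US number, and it does not validate area codes.

```python
import re


def format_us_phone(raw: str) -> dict:
    """Return a US phone number in four common formats (hypothetical helper)."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading US country code
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit US number: {raw!r}")
    area, prefix, line = digits[:3], digits[3:6], digits[6:]
    return {
        "international": f"+1 {area}-{prefix}-{line}",
        "national": f"({area}) {prefix}-{line}",
        "E164": f"+1{digits}",
        "RFC3966": f"tel:+1-{area}-{prefix}-{line}",
    }


formats = format_us_phone("650.253.0000")
for name, value in formats.items():
    print(f"{name:>13}: {value}")
```

The same input yields all four representations, which is why a single standardization step is enough before loading contacts into a downstream system.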
Reorganizing Data in a Dataset
In many cases, there’s a need to reorganize data so that it can conform to a specific output. Scenarios like this include importing prepared data as leads into Marketo, contacts into Salesforce, products into Magento, or campaign codes into Google Analytics. Previous versions of Talend Data Preparation already provided many actions on data columns, such as duplicate, create, concatenate, split or delete. In Talend Data Preparation 1.2, we’ve also included the ability to reorder columns.
Dragging and dropping a column to a new place adds a “reorder columns” function to your data prep recipe. You can also use the swap column function in the action panel to reorganize your columns.
Math, Numbers and Date
While marketers will love the previously mentioned capabilities for contact data management, we’ve also included some features for business operations and financial users. These users work with numbers and dates daily, so, for them, we added numerous math functions: rounding in down or half up mode, negate, min & max, cosine, sine, tangent, base 10 and natural logarithm, square root and exponential.
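For readers unsure how the down and half up modes differ, the following Python sketch shows both on the same values. It uses Python's decimal module as a stand-in, and the helper name round_value is hypothetical; it is not the tool's own code.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

# Map the mode names used in this post to Python's rounding constants.
MODES = {"down": ROUND_DOWN, "half_up": ROUND_HALF_UP}


def round_value(value: str, mode: str, places: int = 0) -> Decimal:
    """Round a decimal string to `places` digits using the given mode."""
    quantum = Decimal(10) ** -places  # e.g. places=2 gives 0.01
    return Decimal(value).quantize(quantum, rounding=MODES[mode])


for text in ("2.5", "2.49", "-2.5"):
    print(text, round_value(text, "down"), round_value(text, "half_up"))
```

Down mode truncates toward zero, while half up mode rounds a trailing 5 away from zero, which is the behavior most finance users expect.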
Date calculations have been improved as well, with the new modify date function that adds or subtracts a given amount of a time unit.
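As a rough illustration of what adding or subtracting an amount of a time unit means, here is a small Python sketch. The function name modify_date and the list of units are assumptions for this example; calendar units such as months are left out because they need extra rules.

```python
from datetime import datetime, timedelta

# Units supported by this sketch (an assumption, not the tool's list).
UNITS = {"weeks", "days", "hours", "minutes", "seconds"}


def modify_date(value: datetime, amount: int, unit: str) -> datetime:
    """Shift a date by `amount` units; a negative amount subtracts."""
    if unit not in UNITS:
        raise ValueError(f"unsupported unit: {unit}")
    return value + timedelta(**{unit: amount})


start = datetime(2016, 7, 31)
print(modify_date(start, 1, "days"))    # one day later
print(modify_date(start, -2, "weeks"))  # two weeks earlier
```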
Data Masking of Sensitive Data
Although I’m a fan of the phone standardization features, I must personally admit that the data masking function is the most compelling new feature included. Protecting sensitive business data, including personally identifiable information (PII), is no longer optional at a time when a new story about a data leak is in the news every day. Today, not only do organizations need to consider data masking as a mandatory building block in their data integration architecture, but they also need to deliver it to a company-wide audience. Delivering data masking for everyone: that’s what this new function is all about.
Not only is it ridiculously easy to use, but it is also pretty smart! Talend Data Preparation will automatically adapt to data domains, especially those that hold PII. Applying "Data Masking" to a US phone number field, for instance, will generate fictitious phone numbers that conform to a US phone pattern. Apply it to a "first name" field, and it will shuffle the first names. Apply it to an e-mail field, and it will mask the left part of the e-mail. Protecting your sensitive data has never been so easy and affordable (free!).
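The three behaviors described above can be sketched in a few lines of Python. This is only an illustration of the idea of domain-aware masking: the helper names, the choice of "X" as a masking character and the phone pattern are all assumptions, not the product's actual algorithm.

```python
import random
import re

# Pattern the fictitious numbers are meant to match (an assumption).
US_PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def mask_email(email: str) -> str:
    """Mask the left (local) part of an e-mail and keep the domain."""
    local, sep, domain = email.partition("@")
    return "X" * len(local) + sep + domain


def shuffle_values(values, seed=None):
    """Return the same values in a random order, so rows no longer line up."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def fake_us_phone(rng: random.Random) -> str:
    """Generate a fictitious number that follows a US phone pattern."""
    area, prefix = rng.randint(200, 999), rng.randint(200, 999)
    return f"({area}) {prefix}-{rng.randint(0, 9999):04d}"


rng = random.Random(42)
print(mask_email("jane.doe@example.com"))
print(shuffle_values(["Ann", "Bob", "Cleo"], seed=1))
print(fake_us_phone(rng))
```

In each case the masked column keeps the shape of the original data, so downstream steps and tests keep working while the real values stay hidden.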
User Experience Improvement
Talend Data Preparation is all about enabling self-service data for a wider business audience. As a consequence, user experience is key. We designed the product so that anyone can get started with the tool in a matter of minutes.
Talend Data Preparation 1.2 takes user experience a step further. One of the most important changes is that the data preparation now takes a more central role in the tool, as it is what drives the outcome. A user’s “preparations” are the first thing shown when opening the product. Choosing to create a new preparation opens a screen where you can easily find and select the datasets that you want to use. You can create a new dataset by importing external files, or select an existing one from the most recently used or from those that you tagged as favorites (or those that have been sanctioned by authorized colleagues in the case of the commercial version).
If you upgraded to 1.2 from a previous version of Talend Data Preparation, you’ve probably noticed significant performance and UI responsiveness improvements thanks to a smarter caching mechanism. Repackaging the application allows you to run it from a standard browser, inheriting its capabilities (bookmarking your favorite preparation, for instance, opening multiple browser tabs, copying and pasting, or using the back and forward buttons). This also solved some stability issues that our community highlighted in our forum.
From Personal Productivity to Managed Self-Service
Now that we've gone over the new capabilities included in the free version of Talend Data Preparation, let's talk for a few minutes about what separates the commercial version from the free version. Simply put, the commercial version elevates Talend Data Preparation from a personal productivity tool into a collaborative platform for data-driven activities.
It also inherits Talend enterprise-class capabilities such as the 900 components and connectors to data sources; the high-end, cluster-ready, server-based capabilities; the monitoring and automation capabilities of Talend Administration Center; and the security layer with role-based access.
The commercial version runs on top of the Talend 6.2 platform. When you get your Talend licenses, you also get a link to install the Data Preparation 1.2 server. Your license gives access to 2 free named user licenses for Data Preparation, and you can always purchase additional ones. Let’s take a look at some of the additional features that this commercial version provides (you can also check out this blog from my colleague Mark Balkenende for a deeper dive, or take a look at this video).
Self-service data access without putting data at risk through live and batch datasets
One of the most distinctive features of the commercial version is its ability to deliver any data as a self-service, in batch or in real time, without compromising control and governance. Through the new tDatasetOutput component, any data flow that you can run through a Talend job can be published to Talend Data Preparation and accessed in self-service by authorized users. This opens the data inventory to virtually any data source and format, leveraging the wide connectivity and transformation capabilities of the Talend platform. Enterprise data can be channeled to self-service data preparation in a controlled way. For example, a data source can be exposed, but only after being masked, and only to authorized users. Such datasets can be pre-processed in batch or alternatively delivered as live datasets in real time.
Managing Large Datasets with Smart Sampling
Data Preparation done well is an interactive user experience. Delivering this in-memory data discovery experience is not "rocket science" for small datasets, but gets trickier when dealing with large data volumes. Through the commercial version, Talend delivers self-service data preparation at scale through a lightweight, cluster-ready backbone based on the latest technology. The tool automatically detects large datasets and then exposes a sample rather than the whole dataset for discovering, cleansing, shaping and enriching the data. This preserves the highly interactive in-memory user experience no matter the size of the dataset. Then, once the preparation has been designed on a sample, and only when needed, the user can run it in the background on the full batch of records.
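The design-on-a-sample, run-on-everything pattern can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the sample size of 10,000 rows, the head-of-file sampling strategy and the function names are all hypothetical, and the real product runs on a server backbone rather than a single script.

```python
from itertools import islice

SAMPLE_SIZE = 10_000  # hypothetical cap on rows shown interactively


def interactive_view(rows, sample_size=SAMPLE_SIZE):
    """Return at most `sample_size` rows for interactive, in-memory work."""
    return list(islice(rows, sample_size))


def run_preparation(rows, steps):
    """Lazily apply every step of a recipe, in order, to each row."""
    for row in rows:
        for step in steps:
            row = step(row)
        yield row


# Design the recipe on the sample...
recipe = [str.strip, str.upper]
sample = interactive_view(f" name{i} " for i in range(25_000))
preview = list(run_preparation(sample, recipe))

# ...then replay the same recipe on the full dataset in the background.
full_result = run_preparation((f" name{i} " for i in range(25_000)), recipe)
print(len(preview), next(full_result))
```

Because the recipe is just an ordered list of steps, the steps designed on the sample are exactly the ones replayed on the full batch of records.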
Sharing Work, Certifying Datasets and Promoting a Data Prep into an Enterprise-Class DI Scenario
Last but not least, the new capabilities in Talend Data Preparation help the tool go well beyond personal productivity.
Not only can business users save, update, and rerun their data preparations anytime in the tool, but they can now also share them with their co-workers (and their datasets as well).
Authorized users can also certify datasets, so that business users know which datasets have been sanctioned and which have not.
Those same authorized users can also share preparations with IT or with any other authorized shared service, which can then run those preparations from a Talend job designed with the studio and deliver the results to a wide audience. By allowing this pushdown processing of Data Prep into Talend Data Integration, any data preparation can be promoted as an integral part of an enterprise-class data integration scenario.
This function is delivered through the tDataPrepRun component in the Talend Studio that can be integrated into any Talend Job, real-time or batch.
And now it’s your turn to play
The magic of open source comes from the voice of the community. Less than six months after its launch, Talend Data Preparation has been downloaded more than 20,000 times, and we are getting great feedback in our forum. This inspires our roadmap and fuels the rapid evolution of Talend Data Preparation. Please keep sharing what you like, dislike or would like to see in future versions of the product, as well as your use cases. You’ll be rewarded with a better product.