Talend Big Data Basics

Talend provides a development environment that enables users to interact with many Big Data sources and targets without having to understand or write complicated code.

Talend Big Data Basics introduces the components, shipped with several Talend products, that interact with Big Data systems.

Duration: 1 day (7 hours)
Target audience: Anyone who wants to use Talend Studio to interact with Big Data systems
Prerequisites: Completion of Talend Data Integration Basics or Talend Data Integration Advanced

Course objectives

After completing this course, you will be able to:

  • Create cluster metadata manually, from configuration files, or automatically
  • Create HDFS and Hive metadata
  • Connect to your cluster to use HDFS, HBase, Hive, Pig, Sqoop, and MapReduce
  • Read and write data in HDFS (HDFS, HBase)
  • Read and write tables in HDFS (Hive, Sqoop)
  • Process tables stored in HDFS with Hive
  • Process data stored in HDFS with Pig
  • Process data stored in HDFS with Big Data batch Jobs

Course agenda

Big Data in context

  • Concepts

Basic concepts

  • Opening a project
  • Monitoring the Hadoop cluster
  • Creating cluster metadata manually
  • Creating cluster metadata from Hadoop configuration files
  • Creating cluster metadata using a wizard

Reading and writing data in HDFS

  • Storing a file in HDFS
  • Storing multiple files in HDFS
  • Reading data from HDFS
  • Storing sparse datasets with HBase

Working with tables

  • Importing tables with Sqoop
  • Creating tables with Hive

Processing data and tables in HDFS

  • Processing Hive tables with Jobs
  • Profiling Hive tables (optional)
  • Processing data with Pig
  • Processing data with a Big Data batch Job
  • Migrating a standard Job to a batch Job

Clickstream use case

  • Clickstream use case: resource management with YARN
  • Setting up a development environment
  • Loading data files onto HDFS
  • Enriching logs
  • Computing statistics
  • Understanding MapReduce Jobs
  • Using Talend Studio to configure a resource request to YARN