Java Big Data Programming with Hadoop

Duration: 5 Days

Course Background

Hadoop is a highly scalable framework that aims to provide high availability of large data sets, located on anything from a single server to clusters of thousands of computers, coupled with a relatively simple programming model. This course aims to provide attendees with a sound understanding of the Hadoop stack, together with in-depth coverage of practical Hadoop application development in Java. The course is a mixture of instructor-taught material and programming labs.

Hadoop Architecture and Associated Utilities
Hadoop Distributed File System (HDFS)
MapReduce
Structured data storage with HBase
Cassandra multi-master database
Approaches to Data warehousing with Hive
Parallel programming data retrieval with Pig
Data mining with Mahout
Cloud computing with MapReduce

Course Prerequisites and Target Audience

This course is for experienced Java programmers and IT system developers who need to get up to speed with Hadoop application development. Background knowledge of the essentials of cloud computing and of distributed programming would be helpful but is not essential.

Course Outline

Hadoop - A Brief History and Overview

The Big Data Economy - facts, opportunities and challenges

Hadoop Architecture

Hadoop Common
Hadoop Distributed File System (HDFS)
HDFS Clusters – NameNodes, DataNodes & Clients
Metadata
Hadoop and Amazon Elastic Compute Cloud (EC2)
Hadoop and OpenStack
Web-based Administration
Installing, Configuring and Running Hadoop
The Apache Hadoop Ecosystem - MapReduce, YARN and HDFS

Working with HDFS

Hadoop Configuration API
HDFS API Overview
HDFS File CRUD API
File Compression and Decompression
Type Serialization and Deserialization
Sequence Files

MapReduce

Parallel Programming Concepts Underlying MapReduce
MapReduce - Map functions
MapReduce Parallel Processing - Heuristics and Patterns
Hadoop and Amazon Elastic Compute Cloud (EC2)
Hadoop and OpenStack
Failover
YARN - an overview
MapReduce on YARN

MapReduce Programming in Practice

MapReduce - Map Phase and Reduce Phase
MapReduce API – Key Java Classes and their Hierarchy
Implementation of MapReduce Programs
Setting Mapper Counts and Reducer Counts
Combiners and Partitioners
MapReduce Configuration
Speculative Execution
Task JVM Reuse
Compression

Advanced MapReduce Programming

Output Formatting
Customised Data Formats
Input formats
Counters
Multithreading
Distributed Caching
MapReduce Streaming and Pipes - Hadoop Streaming
Exception handling
Logging and Debugging
Unit testing - MRUnit - an introduction

Data Importing

Importing data with Flume
Importing data with Sqoop

Data warehousing application with Hive

Hive - history and architecture
Downloading, Installing and Configuring Hive
Loading Data into Hive
Hive Query Statements
Hive Schema Violations
Hive Built In Functions
Joining and Partitioning Data with Hive
Data Summarisation, Granularity and Star Schemas
Ad-hoc queries
Analysing large datasets - Strategies, Performance, Costs
Querying with HiveQL, an SQL-like Query Language

Parallel processing application development with Pig

Pig History and Architecture
Downloading, Installing, Configuring and Running Pig
Introduction to the Pig Language Syntax
Relational Algebra and the design of Pig queries
Core Relational Operators – DISTINCT, FILTER, SPLIT, ORDER BY, LIMIT, GROUP, FOREACH
Pig Built-in Functions
Debug Operators
Parallel evaluation in Pig

Apache Mahout - Data Mining of Hadoop data

Clustering
Classification
Batch-based collaborative filtering - patterns and strategies

HBase - Storage of very large amounts of structured data

Architecture of HBase
Understanding the HBase Data Model
Scalability of HBase
Downloading, Installing and Configuring HBase
HBase Shell
HBase Java API for CRUD Operations
Optimising HBase read/write access

Cassandra

Cassandra as a Multi-Master Database
The Cassandra Data Model
Eventual Consistency
Cassandra usage scenarios

Cloud Computing with Hadoop

Using Hadoop with Amazon Web Services (AWS)
Using Hadoop with OpenStack

Oozie - Workflow

The Workflow Concept - An Overview
Oozie as a workflow scheduler
Oozie Java main action
Working with the Oozie Java API
