Java Big Data Programming with Hadoop

Duration: 5 Days

Course Background

Hadoop is a highly scalable framework that aims to provide high availability of large data sets, whether they are located on a single server or spread across clusters of thousands of computers, coupled with a relatively simple programming model. This course aims to provide attendees with a sound understanding of the Hadoop stack, together with in-depth coverage of practical Hadoop application development in Java. The course will be a mixture of instructor-taught material and programming labs.

  • Hadoop Architecture and Associated Utilities
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • Structured data storage with HBase
  • Cassandra multi-master database
  • Approaches to data warehousing with Hive
  • Parallel programming data retrieval with Pig
  • Data mining with Mahout
  • Cloud computing with MapReduce
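
To give a flavour of the "relatively simple programming model" mentioned in the Course Background, here is a minimal word-count sketch of the map, shuffle and reduce steps in plain Java. It deliberately does not use the Hadoop API: the class name `WordCountSketch` and its methods are hypothetical names chosen for illustration, and everything runs in a single JVM rather than on a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the MapReduce model,
// not code written against the Hadoop libraries.
public class WordCountSketch {

    // Map step: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group all emitted values by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse the values for one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: run map over every line, then shuffle, then reduce per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, clusters=1, data=1, hadoop=1, with=1}
        System.out.println(wordCount(List.of("big data with hadoop", "big clusters")));
    }
}
```

In a real Hadoop job the map and reduce steps would be distributed across many machines and the framework would perform the shuffle; the sketch only shows the shape of the logic a programmer supplies.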

Course Prerequisites and Target Audience

This course is for experienced Java programmers and IT system developers who need to get up to speed with Hadoop application development. Background knowledge of the essentials of cloud computing and of distributed programming would be helpful but is not essential.

Course Outline

  • Hadoop - A Brief History and Overview
    • The Big Data Economy - facts, opportunities and challenges
  • Hadoop Architecture
    • Hadoop Common
    • Hadoop Distributed File System (HDFS)
    • HDFS Clusters – NameNodes, DataNodes & Clients
    • Metadata
    • Hadoop and Amazon Elastic Compute Cloud (EC2)
    • Hadoop and OpenStack
    • Web-based Administration
    • Installing, Configuring and Running Hadoop
    • The Apache Hadoop Ecosystem - MapReduce, YARN and HDFS
  • Working with HDFS
    • Hadoop Configuration API
    • HDFS API Overview
    • HDFS File CRUD API
    • File Compression and Decompression
    • Type Serialization and Deserialization
    • Sequence Files
  • MapReduce
    • Parallel Programming Concepts Underlying MapReduce
    • MapReduce - Map functions
    • MapReduce Parallel Processing - Heuristics and Patterns
    • Hadoop and Amazon Elastic Compute Cloud (EC2)
    • Hadoop and OpenStack
    • Failover
    • YARN - an overview
    • MapReduce on YARN
  • MapReduce Programming in Practice
    • MapReduce - Map Phase and Reduce Phase
    • MapReduce API – Key Java Classes and their Hierarchy
    • Implementation of MapReduce Programs
    • Setting Mapper Counts and Reducer Counts
    • Combiners and Partitioners
    • MapReduce Configuration
    • Speculative Execution
    • Task JVM Reuse
    • Compression
  • Advanced MapReduce Programming
    • Output Formatting
    • Customised Data Formats
    • Input formats
    • Counters
    • Multithreading
    • Distributed Caching
    • MapReduce Streaming and Pipes - Hadoop Streaming
    • Exception handling
    • Logging and Debugging
    • Unit testing - MRUnit - an introduction
  • Data Importing
    • Importing data with Flume
    • Importing data with Sqoop
  • Data warehousing application with Hive
    • Hive - history and architecture
    • Downloading, Installing and Configuring Hive
    • Loading Data into Hive
    • Hive Query Statements
    • Hive Schema Violations
    • Hive Built In Functions
    • Joining and Partitioning Data with Hive
    • Data Summarisation, Granularity and Star Schemas
    • Ad-hoc queries
    • Analysing large datasets - Strategies, Performance, Costs
    • Querying with HiveQL, an SQL-like Query Language
  • Parallel processing application development with Pig
    • Pig History and Architecture
    • Downloading, Installing, Configuring and Running Pig
    • Introduction to the Pig Language Syntax
    • Relational Algebra and the design of Pig queries
    • Core Relational Operators – DISTINCT, FILTER, SPLIT, ORDER BY, LIMIT, GROUP, FOREACH
    • Pig Built-in Functions
    • Debug Operators
    • Parallel evaluation in Pig
  • Apache Mahout - Data Mining of Hadoop data
    • Clustering
    • Classification
    • Batch-based collaborative filtering - patterns and strategies
  • HBase - Storage of very large amounts of structured data
    • Architecture of HBase
    • Understanding the HBase Data Model
    • Scalability of HBase
    • Downloading, Installing and Configuring HBase
    • HBase Shell
    • HBase Java API for CRUD Operations
    • Optimising HBase read/write access
  • Cassandra
    • Cassandra as a Multi-Master Database
    • The Cassandra Data Model
    • Eventual Consistency
    • Cassandra usage scenarios
  • Cloud Computing with Hadoop
    • Using Hadoop with Amazon Web Services (AWS)
    • Using Hadoop with OpenStack
  • Oozie - Workflow
    • The Workflow Concept - An Overview
    • Oozie as a workflow scheduler
    • Oozie Java main action
    • Working with the Oozie Java API