
Practical Introduction to Big Data and Hadoop

Duration: 5 Days

Course Background

Hadoop is a highly scalable framework that aims to provide high availability of large data sets, located on anything from a single server to clusters of thousands of computers, coupled with a relatively simple programming model. This course aims to provide attendees with a sound understanding of the Hadoop stack, including its risks and benefits. The course will be a mixture of instructor-taught material, presentations and structured labs. Topics covered will include:

  • Hadoop Architecture and Associated Utilities
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • Structured data storage with HBase
  • Cassandra multi-master database
  • Approaches to data warehousing with Hive
  • Parallel programming data retrieval with Pig
  • Data mining with Mahout
  • Cloud computing with MapReduce
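
As a rough illustration of the "relatively simple programming model" mentioned in the course background, the sketch below imitates the map, shuffle and reduce steps of a word count job in plain Python, entirely in memory. It does not use Hadoop or any Hadoop API; the function names (map_words, shuffle, reduce_counts, run_job) and the sample input are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input standing in for lines of a large file.
    sample = ["big data on hadoop", "hadoop stores big data"]
    print(run_job(sample))
```

On a real cluster the map and reduce functions would run in parallel on many machines and the framework would perform the grouping; here everything runs in a single process purely to show the shape of the model.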

Course Prerequisites and Target Audience

This course is suitable for Software Developers, Software Architects, IT Managers and IT Directors, Data Warehouse Managers and Business Intelligence Specialists. Attendees are expected to have some understanding of enterprise application development and business systems integration, as well as some knowledge of object-oriented programming, preferably, but not necessarily, in Java. The structured workshops are designed so that the goal is to understand existing code rather than to develop complex code from scratch.

Course Outline

  • Hadoop - A Brief History and Overview
    • The Big Data Economy - facts, opportunities and challenges
  • Hadoop Architecture
    • Hadoop Common
    • Hadoop Distributed File System (HDFS)
    • HDFS Clusters – NameNodes, DataNodes & Clients
    • Metadata
    • Hadoop and Amazon Elastic Compute Cloud (EC2)
    • Hadoop and OpenStack
    • Web-based Administration
    • Installing, Configuring and Running Hadoop
  • Working with HDFS
    • Hadoop Configuration API
    • HDFS API Overview
    • HDFS File CRUD API
    • File Compression and Decompression
    • Type Serialization and Deserialization
    • Sequence Files
  • Data Importing
    • Importing data with Flume
    • Importing data with Sqoop
  • MapReduce
    • Processing and Generating large data sets
    • MapReduce - Map functions
    • Programming MapReduce using SQL / Bash / Python
    • MapReduce Parallel Processing - Heuristics and Patterns
    • Hadoop and Amazon Elastic Compute Cloud (EC2)
    • Hadoop and OpenStack
    • Failover
    • YARN - an overview
    • MapReduce on YARN
  • MapReduce Programming in Practice
    • MapReduce - Basic Programming Concepts – Map Phase and Reduce Phase
    • Intensive overview of Java
    • MapReduce API – Key Java Classes and their Hierarchy
    • Implementation of basic MapReduce Programs
    • Setting Mapper Counts and Reducer Counts
    • Combiners and Partitioners
    • MapReduce Configuration
    • Speculative Execution
    • Task JVM Reuse
    • Compression
  • Advanced MapReduce Programming in Java
    • Output Formatting
    • Customised Data Formats
    • Input formats
    • Counters
    • Multithreading
    • Distributed Caching
    • MapReduce Streaming and Pipes - Hadoop Streaming
    • Exception handling
    • Logging and Debugging
    • Unit testing - MRUnit - an introduction
  • Hive - Application Development and Data Analysis
    • Hive - history and architecture
    • Downloading, Installing and Configuring Hive
    • Loading Data into Hive
    • Hive Query Statements
    • Hive Schema Violations
    • Hive Built In Functions
    • Joining and Partitioning Data with Hive
    • Data Summarisation, Granularity and Star Schemas
    • Ad-hoc queries
    • Analysing large datasets - Strategies, Performance, Costs
    • Querying with HiveQL, an SQL-like Query Language
  • Java Hive Clients
    • JDBC clients
    • Thrift Java clients
    • SHDP - Spring for Apache Hadoop
    • Using Java Hadoop clients with Spring
  • Parallel processing application development with Pig
    • Pig History and Architecture
    • Downloading, Installing, Configuring and Running Pig
    • Introduction to the Pig Language Syntax
    • Relational Algebra and the design of Pig queries
    • Core Relational Operators – DISTINCT, FILTER, SPLIT, ORDER BY, LIMIT, GROUP, FOREACH
    • Pig Built-in Functions
    • Debug Operators
    • Parallel evaluation in Pig
    • Embedding Pig Latin in Java Programs
    • Combining Hadoop, Pig and Java MapReduce
  • Apache Mahout - Data Mining of Hadoop data
    • Clustering
    • Classification
    • Batch-based collaborative filtering - patterns and strategies
  • HBase - Storage of very large amounts of structured data
    • Architecture of HBase
    • Understanding the HBase Data Model
    • Scalability of HBase
    • Downloading, Installing and Configuring HBase
    • HBase Shell
    • Connecting to HBase with Java
    • HBase Java API for CRUD Operations
    • Optimising HBase read/write access
    • Handling Big Data with HBase and Java
  • Cassandra
    • Cassandra as a Multi-Master Database
    • The Cassandra Data Model
    • Eventual Consistency
    • Cassandra usage scenarios
  • Cloud Computing with Hadoop
    • Using Hadoop with Amazon Web Services (AWS)
    • Using Hadoop with OpenStack