Introduction to MapReduce
Duration: 5 Days
Course Background
MapReduce is a key technology for working with Hadoop. The aim of this course is to provide a sound foundation in MapReduce programming with Hadoop and in the kinds of applications that can be implemented using MapReduce. The course also covers the practical aspects of monitoring, profiling and optimising Hadoop-based MapReduce implementations.
Course Prerequisites and Target Audience
Attendees are expected to have a basic knowledge of Linux configuration and Bash scripting, a reasonable knowledge of Java programming and a basic understanding of Hadoop.
Course Outline
- Overview of Hadoop and Big Data
- Hadoop Architecture and Building Blocks
- Introduction to HDFS and MapReduce
- Overview of the Hadoop CLI (Command Line Interface)
- MapReduce Programming
- Interacting with HDFS via its Java programming API
- Hadoop data types (classes)
- InputFormat
- OutputFormat
- Mapper
- Reducer
- Combiner
- Partitioner
- Anatomy of a MapReduce Job Run
- Job Monitoring and Scheduling
- MapReduce Application Design Considerations
- The MapReduce Ecosystem - An Overview
- Oozie
- Flume
- Sqoop
- Streaming API
- HCatalog
- ZooKeeper
- HBase
- HBase Architecture
- Hive
- MapReduce Algorithms and Patterns
- MapReduce Use Cases
- Counting and Summation
- Data Collating
- Filtering (“Grepping”), Parsing, and Validation
- Sorting
- Iterative Message Passing (Graph Processing)
- PageRank and Mapper-Side Data Aggregation
- Distinct Values (Unique Items Counting)
- Cross-Correlation
- Relational MapReduce Patterns
- Projection
- Union, Intersection and Difference
- GroupBy and Aggregation
- Joining - Repartition Joins and Replicated Joins
- MapReduce Monitoring and Performance Tuning
- Debugging MapReduce Applications