Python Big Data Programming with Hadoop

Duration: 5 Days

Course Background

Hadoop is a highly scalable framework that aims to provide high availability of large data sets, located on anything from a single server to clusters of thousands of computers, coupled with a relatively simple programming model. This course aims to provide attendees with a sound understanding of the Hadoop stack and the key Python Hadoop frameworks. A key motivation for using Python in conjunction with Big Data is the availability of powerful Python tools for data plotting and data analysis, namely NumPy, SciPy, Matplotlib and Pandas. The course will be a mixture of instructor-taught material and programming labs. The underlying theme of the course will be the use of Python for the exploration, analysis and visualisation of Big Data held in Hadoop.

Course Prerequisites and Target Audience

This course is for experienced Python programmers who need to get up to speed with Pythonic Hadoop application development. A background knowledge of the essentials of cloud computing and of distributed programming would be helpful but is not essential.

Course Outline

Hadoop - A Brief History and Overview

The Big Data Economy - facts, opportunities and challenges

Hadoop Architecture

Hadoop Common
Hadoop Distributed File System (HDFS)
HDFS Clusters – NameNodes, DataNodes & Clients
Metadata
Hadoop and Amazon Elastic Compute Cloud (EC2)
Hadoop and OpenStack
Web-based Administration
Installing, Configuring and Running Hadoop
The Apache Hadoop Ecosystem - MapReduce, YARN and HDFS

Working with HDFS

Hadoop Configuration API
HDFS API Overview
HDFS File CRUD API
File Compression and Decompression
Type Serialization and Deserialization
Sequence Files

MapReduce

Parallel Programming Concepts Underlying MapReduce
MapReduce - Map functions
MapReduce Parallel Processing - Heuristics and Patterns

MapReduce Programming in Practice

MapReduce - Map Phase and Reduce Phase
Hadoop Streaming and MapReduce
Implementing MapReduce programs in Python using Hadoop Streaming - Overview
Implementing Python Mappers
Implementing Python Reducers
Implementing Mappers and Reducers using Iterators and Generators
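
As a flavour of the lab material for this module, the following is a minimal sketch of a word-count mapper and reducer written as Python generators. It assumes the usual Hadoop Streaming convention of tab-separated key/value lines, with the reducer input already sorted by key. The function names (map_words, parse_pairs, reduce_counts, run_mapper, run_reducer) are illustrative only and are not taken from the course materials.

```python
import sys
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Mapper: lazily yield a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def parse_pairs(lines):
    """Parse 'key<TAB>count' lines, the format the mapper writes out."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def reduce_counts(pairs):
    """Reducer: sum the counts per key; assumes pairs arrive sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming mapper script could call (hypothetical name)."""
    for word, count in map_words(stdin):
        stdout.write(f"{word}\t{count}\n")


def run_reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming reducer script could call (hypothetical name)."""
    for word, total in reduce_counts(parse_pairs(stdin)):
        stdout.write(f"{word}\t{total}\n")


# Local simulation of map -> sort (shuffle) -> reduce, with no cluster needed.
sample = ["Hadoop stores data", "Hadoop processes data"]
word_counts = dict(reduce_counts(sorted(map_words(sample))))
```

Because each stage is a generator, records are processed one at a time and the whole input never needs to fit in memory. On a cluster, the sort between the two stages is performed by Hadoop itself; the sorted() call here only stands in for that step.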

Advanced MapReduce Programming with pydoop

The pydoop framework - an overview
The internal structure of pydoop
Implementing MapReduce Hadoop applications in Python
Combining pydoop with SciPy, Matplotlib and Pandas

Data Importing

Importing data with Flume
Importing data with Sqoop

Data warehousing application with Hive

Hive - history and architecture
Loading Data into Hive
Hive Query Statements
Hive Schema Violations
Hive Built In Functions
Joining and Partitioning Data with Hive
Data Summarisation, Granularity and Star Schemas
Ad-hoc queries
Analysing large datasets - Strategies, Performance, Costs
Querying with HiveQL, an SQL-like Query Language
HiPy, a Python framework for Apache Hive

Apache Thrift and Python

Remote Procedure Calls - Key Concepts
Apache Thrift - an overview
Apache Thrift and Python
Accessing a Hive Server using Thrift and Python
Big Data exploration and visualisation using Python and Hive

Parallel processing application development with Pig

Pig History and Architecture
Downloading, Installing, Configuring and Running Pig
Introduction to the Pig Language Syntax
Relational Algebra and the design of Pig queries
Core Relational Operators – DISTINCT, FILTER, SPLIT, ORDER BY, LIMIT, GROUP, FOREACH
Pig Built-in Functions
Debug Operators
Parallel evaluation in Pig
Embedding Pig Latin in Python
Big Data exploration and visualisation using Python and Pig

Apache Mahout - Data Mining of Hadoop data

Clustering
Classification
Batch-based collaborative filtering - patterns and strategies
Integrating Java and Python via JPype
Using Mahout with Python via JPype
Refining mined data with Pandas
Visualising mined data with Matplotlib

HBase - Storage of very large amounts of structured data

Architecture of HBase
Understanding the HBase Data Model
Scalability of HBase
Downloading, Installing and Configuring HBase
HBase Shell
HBase Java API for CRUD Operations
Optimising HBase read/write access
Accessing HBase from Python via the Thrift HBase interface
Accessing HBase from Python via Starbase, a Python wrapper for the HBase REST API

Cassandra

Cassandra as a Multi-Master Database
The Cassandra Data Model
Eventual Consistency
Cassandra usage scenarios

Cloud Computing with Hadoop

Using Hadoop with Amazon Web Services (AWS)
Using Hadoop with OpenStack

Oozie - Workflow

The Workflow Concept - An Overview
Oozie as a workflow scheduler
Oozie Java main action
Combining Oozie workflows with Python streaming MapReduce
Understanding and working with the Python-Oozie-Client
