Python Big Data Programming with Hadoop
Duration: 5 Days
Course Background
Hadoop is a highly scalable framework that aims to provide high availability of large data sets hosted on anything from a single server to clusters of thousands of machines, coupled with a relatively simple programming model. This course aims to give attendees a sound understanding of the Hadoop stack and of the key Python Hadoop frameworks. A key motivation for using Python in conjunction with Big Data is the availability of powerful Python tools for data analysis and plotting, namely NumPy, SciPy, Matplotlib and Pandas. The course is a mixture of instructor-taught material and programming labs. The underlying theme of the course is the use of Python for the exploration, analysis and visualisation of Hadoop Big Data.
- Hadoop - A Brief History and Overview
- Hadoop Architecture
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- HDFS Clusters – NameNodes, DataNodes & Clients
- Metadata
- Hadoop and Amazon Elastic Compute Cloud (EC2)
- Hadoop and OpenStack
- Web-based Administration
- Installing, Configuring and Running Hadoop
- The Apache Hadoop Ecosystem - MapReduce, YARN and HDFS
- Working with HDFS
- Hadoop Configuration API
- HDFS API Overview
- HDFS File CRUD API
- File Compression and Decompression
- Type Serialization and Deserialization
- Sequence Files
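Hadoop's Writable serialization and SequenceFiles are Java-side APIs, but the underlying idea, length-prefixed binary records packed back to back, can be sketched in plain Python with `struct`. The record layout below is a hypothetical teaching analogue, not Hadoop's actual on-disk format.

```python
import struct

# Hypothetical record layout: a 4-byte big-endian length prefix followed
# by UTF-8 bytes -- the same length-prefixed idea that Hadoop's Writable
# serialization and SequenceFiles use for keys and values.
def serialize_text(value: str) -> bytes:
    data = value.encode("utf-8")
    return struct.pack(">I", len(data)) + data

def deserialize_text(buf: bytes, offset: int = 0):
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    value = buf[start:start + length].decode("utf-8")
    return value, start + length  # the value and the offset of the next record

# Round-trip two records packed back to back, as in a sequence file body
blob = serialize_text("hello") + serialize_text("world")
first, pos = deserialize_text(blob)
second, _ = deserialize_text(blob, pos)
```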
- MapReduce
- Parallel Programming Concepts Underlying MapReduce
- MapReduce - Map functions
- MapReduce Parallel Processing - Heuristics and Patterns
- MapReduce Programming in Practice
- MapReduce - Map Phase and Reduce Phase
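The map and reduce phases above can be modelled in a few lines of in-memory Python. This is an illustration of the execution model only, not a Hadoop API: map emits (key, value) pairs, the framework sorts and shuffles them by key, and reduce folds each key's values into a result.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    # Map phase: each record may emit any number of (key, value) pairs
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle/sort phase: bring equal keys together
    pairs.sort(key=itemgetter(0))
    # Reduce phase: fold each key's values into one result
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

# Word count, the canonical MapReduce example
lines = ["big data", "big hadoop"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {"big": 2, "data": 1, "hadoop": 1}
```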
- Hadoop Streaming and MapReduce
- Implementing MapReduce programs in Python using Hadoop Streaming - Overview
- Implementing Python Mappers
- Implementing Python Reducers
- Implementing Mappers and Reducers using Iterators and Generators
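A Hadoop Streaming word count can be sketched as a generator-based mapper and reducer: Hadoop pipes input splits to the mapper on stdin, sorts the mapper's tab-separated output by key, and pipes it to the reducer on stdin. On the cluster each function would read `sys.stdin` and `print` its output; here the pipeline (including Hadoop's intermediate sort) is simulated locally.

```python
import io
from itertools import groupby

def mapper(stream):
    # Emit one "word<TAB>1" line per word, streaming line by line
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(stream):
    # Input arrives sorted by key, so groupby collects each word's counts
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the streaming pipeline locally; on a cluster, Hadoop performs
# the sort between the two stages.
map_out = sorted(mapper(io.StringIO("big data\nbig hadoop\n")))
result = dict(line.split("\t") for line in reducer(iter(map_out)))
# result == {"big": "2", "data": "1", "hadoop": "1"}
```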
- Advanced MapReduce Programming with pydoop
- The pydoop framework - an overview
- The internal structure of pydoop
- Implementing MapReduce Hadoop applications in Python
- Combining pydoop with SciPy, Matplotlib and Pandas
- Data Importing
- Importing data with Flume
- Importing data with Sqoop
- Data warehousing applications with Hive
- Hive - history and architecture
- Loading Data into Hive
- Hive Query Statements
- Hive Schema Violations
- Hive Built In Functions
- Joining and Partitioning Data with Hive
- Data Summarisation, Granularity and Star Schemas
- Ad-hoc queries
- Analysing large datasets - Strategies, Performance, Costs
- Querying with HiveQL, an SQL-like query language
- HiPy, a Python framework for Apache Hive
- Apache Thrift and Python
- Remote Procedure Calls - Key Concepts
- Apache Thrift - an overview
- Using Apache Thrift from Python
- Accessing a Hive Server using Thrift and Python
- BigData exploration and visualisation using Python and Hive
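Issuing HiveQL from Python is typically a two-step affair: build the query, then send it over Thrift to a running HiveServer2 (for example with the PyHive package). Since execution needs a live cluster, the sketch below shows only query construction; the table and column names are placeholders, and the connection step appears as a comment.

```python
# Sending the query requires a running HiveServer2 and a Thrift-based
# client, e.g. with the PyHive package (hostname is a placeholder):
#   from pyhive import hive
#   cursor = hive.connect("hiveserver-host").cursor()
#   cursor.execute(query)
def summarise_by(table: str, group_col: str, value_col: str, limit: int = 10) -> str:
    """Build a HiveQL summarisation query over a table."""
    return (
        f"SELECT {group_col}, COUNT(*) AS n, AVG({value_col}) AS mean_value "
        f"FROM {table} GROUP BY {group_col} ORDER BY n DESC LIMIT {limit}"
    )

# Hypothetical weblog table: count and average bytes sent per status code
query = summarise_by("weblogs", "status_code", "bytes_sent")
```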
- Parallel processing application development with Pig
- Pig History and Architecture
- Downloading, Installing, Configuring and Running Pig
- Introduction to the Pig Language Syntax
- Relational Algebra and the design of Pig queries
- Core Relational Operators – DISTINCT, FILTER, SPLIT, ORDER BY, LIMIT, GROUP, FOREACH
- Pig Built-in Functions
- Debug Operators
- Parallel evaluation in Pig
- Embedding Pig Latin in Python
- BigData exploration and visualisation using Python and Pig
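Pig's core operators are relational-algebra operations, and their semantics can be made concrete by mirroring each one on an in-memory list of tuples in plain Python. This illustrates what Pig evaluates (in parallel, over HDFS) for each operator; it is not a Pig API, and the sample relation is invented for the sketch.

```python
from itertools import groupby

# Input relation: (user, url) tuples, as a LOAD statement would produce
visits = [("ann", "a.com"), ("bob", "b.com"), ("ann", "a.com"), ("ann", "c.com")]

# FILTER visits BY url == 'a.com';
filtered = [t for t in visits if t[1] == "a.com"]

# DISTINCT visits;
distinct = sorted(set(visits))

# GROUP visits BY user;  then  FOREACH ... GENERATE group, COUNT(visits);
by_user = sorted(visits, key=lambda t: t[0])
counts = [(user, len(list(group)))
          for user, group in groupby(by_user, key=lambda t: t[0])]

# ORDER counts BY count DESC;  LIMIT 1;
top = sorted(counts, key=lambda t: t[1], reverse=True)[:1]
# top == [("ann", 3)]
```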
- Apache Mahout - Data Mining of Hadoop data
- Clustering
- Classification
- Batch-based collaborative filtering - patterns and strategies
- Integrating Java and Python via JPype
- Using Mahout with Python via JPype
- Refining mined data with Pandas
- Visualising mined data with matplotlib
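To make the clustering topic concrete, here is a toy one-dimensional k-means in pure Python. It is a local teaching sketch of the technique that Mahout runs as MapReduce jobs over HDFS-resident data, not Mahout's implementation, and the sample points are invented.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy 1-D k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

# Two well-separated groups of points; start with poor initial centroids
points = [1.0, 1.25, 0.75, 9.0, 9.5, 8.5]
centres = kmeans_1d(points, centroids=[0.0, 5.0])
# centres == [1.0, 9.0]
```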
- HBase - Storage of very large amounts of structured data
- Architecture of HBase
- Understanding the HBase Data Model
- Scalability of HBase
- Downloading, Installing and Configuring HBase
- HBase Shell
- HBase Java API for CRUD Operations
- Optimising HBase read/write access
- Accessing HBase from Python via the Thrift HBase interface
- Accessing HBase from Python via Starbase, a Python wrapper for the HBase REST API
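HBase addresses every versioned cell by (row key, column family:qualifier, timestamp). A dict-based model makes that addressing concrete; it is an in-memory illustration only - real access from Python goes through the Thrift interface or a REST wrapper such as Starbase, both of which need a running cluster.

```python
import time

class ToyHTable:
    """Toy model of HBase's data model: row key -> {"family:qualifier":
    [(timestamp, value), ...]} with the newest version first. Real tables
    live in HBase region servers; this only illustrates the addressing."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, ts=None):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.insert(0, (ts if ts is not None else time.time(), value))

    def get(self, row, column):
        # Return the newest version of the cell, like a default HBase Get
        cells = self.rows.get(row, {}).get(column, [])
        return max(cells, key=lambda c: c[0])[1] if cells else None

table = ToyHTable()
table.put("row1", "info:name", "ada", ts=1)
table.put("row1", "info:name", "ada lovelace", ts=2)  # a newer version
name = table.get("row1", "info:name")
# name == "ada lovelace"
```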
- Cassandra
- Cassandra as a Multi-Master Database
- The Cassandra Data Model
- Eventual Consistency
- Cassandra usage scenarios
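In Cassandra's multi-master model, replicas accept writes for the same key independently and converge later by keeping the cell with the highest write timestamp (last-write-wins). The simulation below is a teaching sketch of that reconciliation rule, not the Cassandra protocol itself; keys and values are invented.

```python
def reconcile(*replicas):
    """Merge replica states, keeping the highest-timestamped value per key."""
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged

# Two replicas accepted conflicting writes for the same key
replica_a = {"user:42": (100, "ann@old.example")}
replica_b = {"user:42": (105, "ann@new.example")}  # the later write wins

converged = reconcile(replica_a, replica_b)
# Both replicas adopt the merged state and now agree: eventual consistency
replica_a = replica_b = converged
```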
- Cloud Computing with Hadoop
- Using Hadoop with Amazon Web Services (AWS)
- Using Hadoop with OpenStack
- Oozie - Workflow
- The Workflow Concept - An Overview
- Oozie as a workflow scheduler
- Oozie Java main action
- Combining Oozie workflows with Python streaming MapReduce
- Understanding and working with the Python-Oozie-Client
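An Oozie workflow that drives a Python streaming MapReduce job is defined in XML. The fragment below is a sketch of such a workflow with a single streaming map-reduce action; the workflow name, script names and HDFS paths are placeholders.

```xml
<!-- Sketch of an Oozie workflow with one streaming map-reduce action
     running Python mapper/reducer scripts; names and paths are placeholders. -->
<workflow-app name="py-wordcount" xmlns="uri:oozie:workflow:0.4">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <streaming>
        <mapper>python mapper.py</mapper>
        <reducer>python reducer.py</reducer>
      </streaming>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/data/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/data/output</value>
        </property>
      </configuration>
      <file>mapper.py</file>
      <file>reducer.py</file>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Word count failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```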
Course Prerequisites and Target Audience
This course is for experienced Python programmers who need to get up to speed with Pythonic Hadoop application development. Background knowledge of the essentials of cloud computing and of distributed programming would be helpful but is not essential.
Course Outline
- The Big Data Economy - facts, opportunities and challenges