Data Analysis with Hive and Pig
Duration: 5 Days
Course Background
This 5-day hands-on training course covers developing applications for, and analyzing, Big Data stored in Apache Hadoop 2.0 using Pig and Hive. The course covers the details of Hadoop 2.0, YARN, and the Hadoop Distributed File System (HDFS), and provides an overview of MapReduce. A major part of the course explores the use of Pig and Hive to carry out Big Data analytics. Supporting topics, such as importing data using Sqoop and Flume and defining workflows using Oozie, are also covered.
Course Prerequisites and Target Audience
The course is aimed at Data Analysts, Business Intelligence (BI) Analysts and BI Developers, SAS Developers, and other analysts who need to answer questions about and analyze the Big Data stored in a Hadoop cluster.
Course Outline
- Hadoop - History and Background
- Overview of the Hadoop Architecture and the Hadoop Ecosystem
- HDFS - The Hadoop Distributed File System
- Overview of HDFS and HDFS commands
- Using HDFS commands to add/remove files and folders from HDFS
- Sqoop and how it is used to transfer data between HDFS and an RDBMS (Relational Database Management System)
- The MapReduce Framework and YARN
- Conceptual introduction to MapReduce
- Running MapReduce jobs
- Running YARN applications
- Pig
- Relational Algebra and the Rationale underlying Pig
- The Pig Latin language - an introduction
- Using Pig to explore and transform data
- Split a dataset using Pig
- Join two datasets using Pig
- Use Pig to transform and export a dataset for use with Hive
- Use HCatLoader and HCatStorer to retrieve HCatalog schemas from within a Pig script
- Hive
- Hive tables
- How Hive tables are stored in HDFS
- Introduction to HiveQL - the Hive Query Language
- Use Hive to extract useful information from a dataset
- Overview of how Hive queries are executed as MapReduce jobs
- Perform joins on two datasets with Hive
- Advanced Hive - windowing, views and ORC (Optimized Row Columnar) files
- Working with the Hive analytics functions (rank, dense_rank, cume_dist, row_number)
- Use of custom reducers to reduce the number of underlying MapReduce jobs generated from a Hive query
- Example Case Studies
- Analyzing and sessionizing clickstream data using the Pig DataFu library
- Analysis and filtering of NYSE stock prices
- Statistical data analysis combining Hive and R via RHive
- RHadoop - Fine grained interaction between R and Hadoop
- RODBC/RJDBC - Hadoop SQL access from R
- Workflows and Oozie