Big Data Analysis with Hadoop and R
Duration: 5 Days
Course Background
This advanced course explores how R and Hadoop can be combined to build powerful data analysis and visualisation applications. It covers approaches in which MapReduce modules are written in R and then run on Hadoop's parallel MapReduce framework to process very large amounts of data. It also covers the RHadoop packages, which allow R to interact with the Hadoop Distributed File System (HDFS), to write and submit MapReduce jobs, and to interact with HBase.
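As a rough illustration of the MapReduce model described above, the following minimal sketch runs a word count through separate map, shuffle, and reduce steps in a single process. It is written in Python purely as a language-neutral illustration and is not course material; in the course itself the equivalent logic is written in R and submitted to Hadoop. All function names here (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reducer: collapse all counts for one word into a single total."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle, and reduce in-process on a small list of lines."""
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["R meets Hadoop", "Hadoop runs MapReduce", "R runs anywhere"]
    print(run_job(sample))
```

On a real cluster the mapper and reducer run in parallel across many nodes and the framework handles the grouping step; only the two user-supplied functions change from job to job.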
Course Prerequisites and Target Audience
This course is aimed at data scientists, statisticians, data architects, and engineers who plan to process and analyse very large amounts of data using R and Hadoop. Attendees are expected to have reasonable R programming experience and a sound basic understanding of Hadoop and MapReduce.
Course Outline
- Foundations
- Intensive review of Hadoop
- Intensive review of Map/Reduce
- Intensive review of R
- R Packages for interacting with Hadoop - An Overview
- Hadoop InteractiVE - the R hive package
- Hadoop Streaming
- RHIPE
- segue
- RHadoop
- RHadoop - R Packages Set - an Overview
- rhdfs package - for accessing HDFS from R
- rmr2 - provides Hadoop MapReduce interfaces to R
- rhbase - provides an R to HBase interface
- rhdfs
- Installation
- Overview of the R API for accessing HDFS
- Browsing, reading and modifying HDFS files from R
- Populating HDFS with data via R
- rhbase
- Installation
- Overview of the rhbase API
- Composable map reduce jobs
- rmr2
- Installation
- Implementing Map/Reduce application code using rmr2
- Overview of the rmr2 API
- rmr2 input and output formatters
- rmr2 mappers and reducers
- Debugging using the local backend
- Backend processing options
- Saving results locally for further analysis
- Data Analytics with R and Hadoop - Life Cycle
- Problem Identification
- Data requirements design
- Data preprocessing
- Data analysis
- Data visualisation
- Data Analytics and Machine Learning
- Supervised machine learning - Linear regression, Logistic regression
- Unsupervised machine learning - clustering
- Recommendation algorithms and their implementation using R and Hadoop
- Importing and Exporting Data from Databases and Spreadsheets
- MySQL, PostgreSQL and SQLite
- Excel
- MongoDB
- Hive
- HBase
- SQL Server
- Oracle