Data Mining of BigData
Duration: 3 Days
Course Background
This course surveys the field of Data Analysis and Data Mining and also considers the special problems and issues arising from having to handle very large amounts of data. The course concentrates on concepts, theories and methods and is, in the main, database and framework agnostic. Possible practical approaches and frameworks are surveyed, but without going into programming details.
Course Prerequisites and Target Audience
The course is aimed at data analysts and statisticians who need to get up to speed with data analysis and data mining in general, and also with the analysis and mining of data held in the Cloud.
Course Outline
- Data Mining - Origins and Concepts
- The value and meaning of data
- Patterns and correlations
- Statistics and reality
- Machine learning and Statistical learning
- Data mining - theory and practice
- Scientific method - an overview
- Frameworks and approaches to data mining
- Data mining as a process
- Micro-economic and inductive database approaches to data mining
- Importance of domain knowledge
- Business objectives underlying data mining
- Data - understanding, acquisition, integration, description and quality assessment
- Data Modeling, Understanding and Preparation
- Elements of data understanding - acquisition, extraction, description, assessment, profiling, transformation, imputations, weighting and balancing, filtering and smoothing, abstraction, reduction, sampling, discretization, derivations
- Feature selection, ranking and subset selection
- Data Mining Tools and Algorithms
- Data access tools
- Data exploration tools
- Modeling - management and analysis tools
- Data mining algorithms - an overview
- Statistical data analysis
- Basic data mining algorithms
- Association rules
- Neural networks
- Genetic Algorithms
- Radial Basis Function (RBF) networks
- Generalised additive models (GAMs)
- Classification and Regression Trees (CART)
- General CHAID (Chi-squared Automatic Interaction Detection) and Decision Trees
- Generalised EM (Expectation Maximisation) and k-means Cluster Analysis
- Multivariate Adaptive Regression Splines
- Kohonen networks
- Statistical learning theory and support vector machines
- Text Mining and Natural Language Processing (NLP)
- Statistical aspects of NLP
- Understanding the process of Text Analysis
- Applications
- Security
- Biomedicine
- Media and Marketing
- Sentiment analysis
- Review of commercial and open source software
- Introduction to the R text mining framework
- Introduction to the Python Natural Language Processing Toolkit (NLTK)
- Classification and Classifiers
- Concepts and limitations
- Underlying assumptions
- Overview of Classification Methods
- Nearest neighbour
- CHAID
- Logistic Regression
- Neural networks
- Naive Bayesian classifiers
- Numerical Prediction Methods
- Parametric statistical modeling
- Linear regression and Generalised Linear Models (GLMs)
- Nonlinear regression
- Machine learning methods based on Classification and Regression Trees (CART)
- Evaluation and Enhancement of Models
- Splitting data for model evaluation purposes
- Avoiding overfitting
- Enhancement heuristics
- Exploiting Ensembles of models
- Applications and Case studies
- Medical informatics
- Bioinformatics
- Customer response prediction
- Fraud detection
- Security
- Issues associated with Hadoop BigData
- Getting to Know the Data
- Transforming the Data
- Aggregating the Data
- Star Schemas
- Partitioning the Data
- Implementing a Star Schema in Hive
- Data mining and HBase