Data Analysis and Visualisation Courses
Introduction to Data Mining with R
Duration: 5 Days
Course Background
In general terms, data mining comprises techniques and algorithms for finding interesting patterns in large datasets. Hundreds of algorithms, if not more, now exist for tasks such as frequent pattern mining, clustering, and classification. Understanding how these algorithms work and how to use them effectively is a continuing challenge for data mining analysts, researchers, and practitioners, in particular because an algorithm's behaviour and the patterns it produces can change significantly with its parameters. In practice, most of the data mining literature is too abstract about the actual use of the algorithms, and parameter tuning is usually a frustrating task. On the other hand, a large number of implementations are available, such as those in the R project, but their documentation focuses mainly on implementation details and offers little discussion of the parameter-related trade-offs of each one. This course aims to provide a mix of practice and theory, as well as "filling in knowledge gaps" and "dusting away cobwebs of topics not visited for several years".
Course Prerequisites and Target Audience
Attendees are expected to have a sound knowledge of R programming, statistics and relational databases.
Course Outline
- Overview of R
- Overview of PostgreSQL / MySQL
- Overview of Data Mining and various aspects of using data mining techniques
  - Historical data vs. operational data
  - Data warehouses and data marts
  - Philosophy and concepts of data mining
  - Data granularity issues
  - Star schemas
  - Data quality issues
  - Data complexity issues
  - Computational complexity issues
- Data Mining Project Life Cycle
  - Problem definition - characterisation of problem and possible solutions / approaches
  - Data evaluation - accessibility, evaluation and data quality
  - Feature extraction and enhancement
  - Prototyping - prototype planning and model development
  - Model evaluation
  - Implementation
  - Iteration
- Methodologies for mining classification and prediction patterns and available R packages
  - Regression models
  - Bayes classifiers
  - Decision trees
  - Multi-layer feedforward artificial neural networks
  - Support vector machines
  - Supervised clustering
- Methodologies for mining clustering and association patterns - Theory and R Practice
  - Hierarchical clustering
  - Partitional clustering
  - Self-organising maps
  - Probability distribution estimation
  - Association rules
  - Bayesian networks
- Methodologies for mining data reduction patterns - Theory and R Practice
  - Principal components analysis
  - Multi-dimensional scaling
  - Latent variable analysis
- Methodologies for mining outlier and anomaly patterns - Theory and R Practice
  - Univariate control charts
  - Multivariate control charts
- Methodologies for mining sequential and time series patterns - Theory and R Practice
  - Autocorrelation-based time series analysis
  - Hidden Markov models for sequential pattern mining
  - Wavelet analysis
  - Hilbert transform
  - Nonlinear time series analysis
- Data Mining Genres - case studies
  - Supervised Learning Genre - Detecting and Characterising Known Patterns
  - Forensic Analysis Genre - Detecting, Characterising, and Exploiting Hidden Patterns
  - Knowledge Acquisition, Representation, and Use Genre