# Data Analysis and Visualisation Courses

# Introduction to Data Mining with Python

## Duration: 5 Days

## Course Background

In general terms, data mining comprises techniques and algorithms for discovering interesting patterns in large datasets. Hundreds of algorithms, if not more, now exist for tasks such as frequent pattern mining, clustering, and classification. Understanding how these algorithms work and how to use them effectively is a continuing challenge for data mining analysts, researchers, and practitioners, in particular because an algorithm's behavior and the patterns it produces may change significantly as a function of its parameters. In practice, most of the data mining literature is too abstract about the actual use of the algorithms, and parameter tuning is usually a frustrating task. On the other hand, a large number of implementations are available, such as those developed by the Python scientific community, but their documentation focuses mainly on implementation details and rarely discusses the parameter-related trade-offs of each algorithm. This course aims to provide a mix of practice and theory, while also "filling in knowledge gaps" and "dusting away cobwebs of topics not visited for several years".
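
To make the point about parameter sensitivity concrete, here is a minimal sketch of frequent pattern mining on a toy dataset. The transactions, the `frequent_itemsets` helper and the thresholds are all invented for illustration and are not part of the course material. The brute-force enumeration is only suitable for a handful of items.

```python
from itertools import combinations

# Toy market-basket transactions (invented for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is at least min_support.

    Brute-force enumeration of all item combinations: fine for a toy
    example, far too slow for real data.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[candidate] = support
    return result


# The same data yields very different "patterns" as the threshold changes.
for threshold in (0.8, 0.6, 0.4, 0.2):
    found = frequent_itemsets(TRANSACTIONS, threshold)
    print(f"min_support={threshold}: {len(found)} frequent itemsets")
```

On this toy data the number of reported itemsets grows from 2 at a minimum support of 0.8 to 11 at 0.2, which is the kind of parameter-driven change the course examines in more realistic settings.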

## Course Prerequisites and Target Audience

Attendees are expected to have a sound knowledge of R programming, statistics and relational databases.

## Course Outline

- Overview of Python
- Overview of PostgreSQL / MySQL
- Overview of Data Mining and various aspects of using data mining techniques
    - Historical data vs. operational data
    - Data warehouses and data marts
    - Philosophy and concepts of data mining
    - Data granularity issues
    - Star schemas
    - Data quality issues
    - Data complexity issues
    - Computational complexity issues
- Python Tools and Packages for the Data Miner and Data Explorer
    - NumPy, SciPy and Matplotlib
    - PyBrain - neural networks and machine learning package
    - NLTK - natural language processing package
    - Orange - data mining package
    - Elefant - machine learning package
    - NetworkX - graph (network) analysis package
    - Scrapy - screen scraping and web crawling framework, used to crawl websites and extract structured data from web pages
    - Pattern - web mining module
    - scikit-learn - machine learning toolkit
- Data Mining Project Life Cycle
    - Problem definition - characterisation of problem and possible solutions / approaches
    - Data evaluation - accessibility, evaluation and data quality
    - Feature extraction and enhancement
    - Prototyping - prototype planning and model development
    - Model evaluation
    - Implementation
    - Iteration
- Methodologies for mining classification and prediction patterns - Theory and Pythonic Practice
    - Regression models
    - Bayes classifiers
    - Decision trees
    - Multi-layer feedforward artificial neural networks
    - Support vector machines
    - Supervised clustering
- Methodologies for mining clustering and association patterns - Theory and Pythonic Practice
    - Hierarchical clustering
    - Partitional clustering
    - Self-organising maps
    - Probability distribution estimation
    - Association rules
    - Bayesian networks
- Methodologies for mining data reduction patterns - Theory and Pythonic Practice
    - Principal components analysis
    - Multi-dimensional scaling
    - Latent variable analysis
- Methodologies for mining outlier and anomaly patterns - Theory and Pythonic Practice
    - Univariate control charts
    - Multivariate control charts
- Methodologies for mining sequential and time series patterns - Theory and Pythonic Practice
    - Autocorrelation-based time series analysis
    - Hidden Markov models for sequential pattern mining
    - Wavelet analysis
    - Hilbert transform
    - Nonlinear time series analysis
- Data Mining Genres - case studies
    - Supervised Learning Genre - Detecting and Characterizing Known Patterns
    - Forensic Analysis Genre - Detecting, Characterizing, and Exploiting Hidden Patterns
    - Knowledge Acquisition, Representation, and Use Genre
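
As a small taste of the "Theory and Pythonic Practice" sessions in the outline above, the univariate control chart topic can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the course's own material. It assumes limits at the baseline mean plus or minus `k` sample standard deviations (production charts often estimate the spread from moving ranges instead). The function names and measurements are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, centre, upper) with limits at mean +/- k standard deviations.

    Simplification: the spread is the sample standard deviation of the baseline.
    """
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre, centre + k * spread


def out_of_control(values, lower, upper):
    """Return the indices of observations that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements, invented for illustration.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
new_values = [10.0, 10.3, 11.5, 9.9, 8.2]

lower, centre, upper = control_limits(baseline)
flagged = out_of_control(new_values, lower, upper)
print(f"limits: [{lower:.2f}, {upper:.2f}], flagged indices: {flagged}")
```

Changing `k` shifts the balance between false alarms and missed anomalies, which is the kind of parameter trade-off described in the course background.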