Big Data and NoSQL Courses
Introduction to BigData and NoSQL - Concepts, Strategies and Practice
Duration: 3 Days
Course Background
Wikipedia provides the following definition of BigData: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." The diversity of data types includes multimedia data (books, audio, video, images) as well as the more traditional business transaction records stored in relational databases. Because of the need to handle a large variety of data types in a holistic way, older non-relational database technologies (e.g. hierarchical and Codasyl type databases) have been re-discovered and new ones invented. Handling and exploiting very large volumes of data (hundreds to millions of terabytes) poses special problems, including challenges such as:
- Capture
- Curation
- Storage
- Search
- Sharing
- Transfer
- Analysis
- Visualization
Increasing interest in larger data sets is based partly on the additional information that can be derived from analysing a single large set of related data, as contrasted with separate smaller sets holding the same total amount of data. The hope is that exploring these massive data sets may make it possible, for example, to:
- Spot business trends
- Determine the quality of research
- Gain insights into how to prevent diseases
- Combat crime
- Determine real-time network flows in e.g. transportation or communications networks
Key goals of the course include:
- Providing an understanding of big data and how it can be applied to store, manage, process and analyse massive amounts of unstructured and poly-structured data
- Exploring the various technologies underpinning big data such as Hadoop and NoSQL
- Discussing how big data systems might complement traditional data warehousing and business intelligence solutions and processes
- Surveying the opportunities for exploiting suitable big data sets to help government agencies, healthcare providers and businesses provide a better service to their clients and customers
Topics covered:
- Background, history and concepts underlying big data and NoSQL
- BigData lifecycle models and pragmatic roadmaps
- Comprehensive categorization and survey of NoSQL products and packages
- Walk through of a reference architecture and platform stack
- Analysis and Design techniques for realising NoSQL solutions
- High level overview of hardware, networking and software frameworks underpinning BigData
- Evolving trends and protecting against obsolescence
- Discussion of some actual projects and lessons learned
Course Prerequisites and Target Audience
This 3-day workshop is aimed at business leaders and senior management, strategic planners, investors and concerned citizens. It aims to provide an up-to-date and fairly comprehensive overview of the underlying technologies and how they are evolving, of business trends concerning the use of BigData, and of the variety of potential opportunities, as well as the risks, involved in developing and using BigData systems.
Course Outline
- BigData - some definitions and views
  - Three V's of Big Data - Velocity, Volume, Variety
  - BigData as Technology - Hadoop and other NoSQL ways for storing and manipulating data
  - BigData = Transactions + Interactions + Observations
  - BigData as a zoo containing various species of information, e.g. Process-Mediated Data, Human-Sourced Information and Machine-Generated Data
  - Big Data as a source of “New Signals”
  - BigData as a means of analysing data that was previously ignored
  - BigData as an evolutionary step in Human Civilisation
- Foundations of NoSQL
  - The data deluge - the generators of Big Data (structured vs. unstructured data)
  - The limitations of SQL and Relational Databases
  - Hierarchical and Codasyl databases - re-evaluated
  - Parallel Computing vs. Distributed Computing
  - Origins of NoSQL - GFS, MapReduce, BigTable, Dynamo
  - CAP Theorem: Consistency, Availability, Partition Tolerance (illustrated in the sketch below)
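The CAP trade-off can be made concrete with a toy model: two replicas that normally synchronise every write, where, during a network partition, each node must either reject writes (preserving consistency at the cost of availability) or accept them (staying available at the risk of divergent replicas). Below is a minimal Python sketch; the class and method names are illustrative and not taken from any real database.

```python
class Replica:
    """A toy replica node holding a single key-value map."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class TinyCluster:
    """Two replicas; 'mode' decides behaviour during a partition."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.partitioned = False  # is the network split?
        self.a, self.b = Replica("a"), Replica("b")

    def write(self, node, key, value):
        other = self.b if node is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistent but unavailable: refuse the write.
                raise RuntimeError("partition: write rejected")
            node.data[key] = value   # AP: available, replicas may diverge
        else:
            node.data[key] = value
            other.data[key] = value  # synchronous replication

cluster = TinyCluster(mode="AP")
cluster.write(cluster.a, "x", 1)
cluster.partitioned = True
cluster.write(cluster.a, "x", 2)       # accepted on node a only
print(cluster.a.data, cluster.b.data)  # {'x': 2} {'x': 1} -- divergence
```

In "CP" mode the same write raises an error instead: the cluster stays consistent by becoming unavailable for the duration of the partition.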
- Ways of thinking when it comes to big data
  - Sampling vs. all the data
  - Clean data vs. messy data
  - Causality vs. correlations
- Varieties of NoSQL Technology
  - Key-Value stores (Voldemort, Dynamo) - the basic model is sketched in code below
  - Key-Data stores (Redis)
  - Key-Document stores (CouchDB, MongoDB, Riak)
  - Column Family stores (BigTable, HBase, Cassandra)
  - Graph stores (Neo4J, HyperGraphDB)
  - Comparison of licenses for different open source NoSQL databases
  - Example Use cases for various NoSQL databases
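The simplest of these models, the key-value store, can be sketched in a few lines of Python. The point is the interface (put/get/delete by opaque key), not the engineering underneath; real products such as Redis or Dynamo add persistence, replication and richer data structures on top of this idea. The key naming convention below is just an illustration.

```python
class KeyValueStore:
    """Minimal in-memory key-value store: values are opaque to the store."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42:name", "Alice")
store.put("user:42:cart", ["book", "lamp"])  # value structure is up to the caller
print(store.get("user:42:name"))             # Alice
```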
- Hadoop - Architecture, Uses and Limitations
  - Masters and Slaves architecture
  - HDFS Architecture - NameNode plus DataNodes
  - HDFS HA - High Availability mechanisms
  - HDFS write and read pipelines
  - Heartbeats and Rack Awareness
  - Accessing HDFS via the command line or via a web GUI
  - Loading data into HDFS and retrieving data from HDFS
  - MapReduce Architecture underlying Hadoop - JobTracker and TaskTrackers
  - Data Locality
  - MapReduce in practice - keys/values, shuffle/sort, combiner, partitioner (see the word-count sketch after this list)
  - MapReduce application patterns
  - The Hadoop ecosystem projects - an overview
    - Hive
    - Pig
    - Oozie
    - Mahout
    - Sqoop
    - Talend
    - Flume
  - Who is using Hadoop and Why
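To make the keys/values, shuffle/sort and reduce phases concrete, here is a pure-Python simulation of the classic MapReduce word-count example. It models the data flow only and uses none of the actual Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit (word, 1) pairs for each word in a line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort phase: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce phase: sum the counts for one word."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(mapper(line) for line in lines)
result = [reducer(k, vs) for k, vs in shuffle(mapped)]
print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```

In Hadoop itself the shuffle also partitions keys across multiple reducers, and an optional combiner applies reducer-style aggregation to each mapper's local output first, cutting the volume of data sent over the network.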
- Column Family Architectures - HBase and Cassandra compared
  - Masters/Slaves (HBase) vs. Peer to Peer (Cassandra ring)
  - Log Structured Merge Trees - Write Ahead Log, memory buffers, flushing, bloom filters, block indexes, compactions (see the sketch after this list)
  - Principles of Column Family data modeling
    - Rows and columns
    - Column families and regions
  - HBase specific elements
    - ZooKeeper
    - HBase Master
    - Heartbeats
    - Failure detection
    - Recovery
  - Cassandra specific elements
    - Gossip protocol
    - Anti-entropy
    - Replica placement strategies
    - Snitches, failure detection and recovery
    - Data partitioning - one data center vs. multiple data centers
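The log-structured merge approach that both HBase and Cassandra use can be reduced to a toy: writes go to an in-memory buffer (the memtable); when it fills, it is flushed as an immutable sorted run (an SSTable); reads consult the memtable first and then the runs from newest to oldest. This sketch deliberately omits the write-ahead log, bloom filters and compaction, and all names are illustrative.

```python
import bisect

class TinyLSM:
    """Toy log-structured merge store: a memtable plus sorted runs."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # newest first; each run is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Flush the memtable as an immutable sorted run (an 'SSTable')."""
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # freshest data first
            return self.memtable[key]
        for run in self.sstables:      # then newest run to oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
print(db.get("a"), db.get("c"))  # 3 4 -- the newest value wins
```

Compaction, not shown here, periodically merges old runs together so that reads never have to scan an unbounded number of SSTables.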
- MongoDB - an overview
  - Document-oriented storage
  - BSON specification
  - Indexing
  - GridFS
  - MongoDB architecture
    - Replication
    - HA (High Availability)
    - Auto-sharding
    - Wire protocol - Communication stream
  - Inserting a document
  - Querying a Collection (both shown in the sketch after this list)
  - MapReduce and MongoDB
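Inserting a document and querying a collection look like this with the Python driver. The sketch assumes a mongod instance on localhost:27017 and the pymongo package; the database, collection and field names are made up for the example.

```python
from pymongo import MongoClient

# Assumes a local mongod on the default port; names below are arbitrary.
client = MongoClient("mongodb://localhost:27017/")
orders = client["shop"]["orders"]

# Insert a document: schema-free, nested structure stored as BSON.
orders.insert_one({
    "customer": "Alice",
    "items": [{"sku": "book-1", "qty": 2}],
    "total": 31.98,
})

# Query the collection with a filter document; an index supports the lookup.
orders.create_index("customer")
for doc in orders.find({"customer": "Alice"}, {"_id": 0}):
    print(doc)
```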
- Neo4J - an overview
  - Use cases for graph databases
  - Neo4J architecture and fundamentals
  - Nodes and Relationships
  - Querying a graph with a traversal (see the sketch after this list)
  - Index lookups
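The idea behind nodes, relationships and traversals can be modelled directly in Python; this toy omits properties, indexes, transactions and the Cypher query language, and is not Neo4J's API.

```python
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.rels = []  # outgoing (relationship_type, target) pairs

    def connect(self, rel_type, other):
        self.rels.append((rel_type, other))

def traverse(start, rel_type, max_depth):
    """Breadth-first traversal following one relationship type."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth > 0:
            yield node, depth
        if depth < max_depth:
            for rtype, nxt in node.rels:
                if rtype == rel_type and nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.connect("KNOWS", bob)
bob.connect("KNOWS", carol)

# Friends-of-friends: everyone reachable from Alice via KNOWS within 2 hops.
for person, depth in traverse(alice, "KNOWS", max_depth=2):
    print(person.name, depth)  # Bob 1, Carol 2
```

Queries like this, which follow relationships from a starting node rather than joining tables, are where graph databases outperform relational ones.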
- Big Data - The Business Perspective
  - The Business Transformation Imperative
  - The Big Data Business Model Maturity Index
    - Business Monitoring, Business Insights and Business Optimization
    - Data Monetization
    - Business Metamorphosis
  - Big Data Business Model Maturity Metrics
- Big Data - Business Impact
  - Management via use of the correct metrics
  - Discovering Data Monetization Opportunities
  - Evaluating Digital Media Data Assets and Understanding Target Users
  - Data Monetization as a consequence of good Transformations and data Enrichment
- Organisational aspects of BigData projects
  - Understanding the Data Analytics Lifecycle
  - Data Scientist Roles and Responsibilities
  - Data Discovery, Preparation and Planning
  - Model Planning and model building
  - Communication of the Results and operationalisation
  - Emergence of New Organizational Roles
    - User Experience Team
    - New Senior Management Roles
  - Organizational Creativity - stimulation and management
- BigData and Decision Support Systems
  - Overview of classical decision support systems
  - Incorporating BigData applications into Decision Support Systems
  - Strategic thinking and BigData
  - BigData and the Art and Anatomy of Judgement
  - Soft systems, satisficing and complexity in the context of BigData strategic thinking
- Aspects of BigData value creation
  - Technology Drivers
    - Access to More Detailed Transactional Data
    - Access to Unstructured Data
    - Access to Low-latency (Real-Time) Data
    - Integration of Predictive Analytics
  - Business Drivers - examples
    - Predictive maintenance
    - Customer satisfaction
  - BigData and Porter's Value Chain model
  - Unlocking value by linking social and business results
  - BigData and Porter's Five Forces Model
- BigData Use Cases
  - The Big Data Envisioning Process
    - Step 1: Research Business Initiatives
    - Step 2: Acquire and Analyze the Data
    - Step 3: Ideation Workshop - Brainstorm New Ideas
    - Step 4: Ideation Workshop - Prioritize Big Data Use Cases
    - Step 5: Document the Steps to Follow
  - Using User Experience Mockups to Drive the Envisioning Process
  - The Prioritization Process
  - The Prioritization Matrix Process
- Business Process Modeling (BPM) and Solution Engineering
  - Introduction to the BigData Solution Engineering method
    - Step 1: Understand How the Organization Makes Money
    - Step 2: Identify the Organization’s Key Business Initiatives
    - Step 3: Brainstorm the Big Data Business Impact
    - Step 4: Break Down the Business Initiative Into Use Cases
    - Step 5: Prove Out the Use Cases
    - Step 6: Design and Implement the Big Data Solution
  - Overview of BPM