Introduction to Big Data

Course description:

The production of data is expanding at an astounding pace. The explosion of accessible data through social media, the extensive use of web crawling, and the widespread availability of sensor data have provided organizations with unprecedented amounts of data for collection and analysis. This two-day workshop explores the drivers behind Big Data and uses use cases from a wide variety of industries to illustrate the power of new technologies to harness Big Data and generate meaningful insights. Participants will be introduced to Hadoop and key-value data storage, the central components of the Big Data movement. These systems allow the distributed processing of very large data sets, both structured and unstructured.

During this course, participants will learn how Hadoop works through hands-on experience with the Hadoop Distributed File System (HDFS) and MapReduce. Participants will also be introduced to several ecosystem components, such as HBase, Hive, Impala, and Spark, that are used in Big Data reporting systems.

This is an introductory course in Big Data and Hadoop, but it goes beyond the basics to introduce some technical components. It is appropriate both for those who simply want to learn more about Hadoop and Big Data and for those who are looking to begin a path toward becoming a Hadoop developer.

Course outline:

Day 1 material

What is big data?

Volume, variety, velocity, and veracity
Comparing big data to conventional reporting systems

Strengths and weaknesses of big data solutions

Processing bottlenecks
Data integration challenges
Data redundancy versus speed
Lack of data integrity

Key-value data systems and HDFS

Relational, dimensional, and key-value data models
Navigating data in HDFS

Parallel processing

Big data hardware setups
Retrieving and processing data in big data environments

Day 2 material

MapReduce

Writing MapReduce algorithms in Java
Executing MapReduce code on HDFS data

Common big data algorithms

Reusing mappers and reducers
Common mappers: case, explode, filter, keyspace, identity
Common reducers: sum, average, identity

Big data reporting with Hive

What is Hive?
Writing Hive statements

The big data ecosystem (Hive, Impala, Spark, etc.)

Introduction to other Hadoop ecosystem tools: Pig, Sqoop, Flume, Oozie, Spark, Impala
Comparison of in-memory processing options

Squeezing value from big data

Best practices for wringing value from big data

Instructor:

Andrew Harrison is an Assistant Professor of Information Systems at the University of Cincinnati Lindner College of Business. His research interests include consumer fraud, deception, security systems, privacy, media capabilities, and virtual worlds.

For more information about these classes, or for custom training classes, please contact:

Marilyn Kump

Program Director

513-556-5710

kumpm@ucmail.uc.edu