Introduction to Big Data
Course description:
The production of data is expanding at an astounding pace. The explosion of accessible data through social media, the extensive use of web crawling, and the widespread availability of sensor data have provided organizations with unprecedented amounts of data for collection and analysis. This two-day workshop explores the drivers behind Big Data and uses use cases from a wide variety of industries to illustrate the power of new technologies to harness Big Data and generate meaningful insights. Participants will be introduced to Hadoop and key-value data storage, the central components of the Big Data movement. These systems allow the distributed processing of very large sets of structured and unstructured data.
During this course, participants will learn how Hadoop works through hands-on experience with the Hadoop Distributed File System (HDFS) and MapReduce. Participants will also be introduced to several ecosystem components, such as HBase, Hive, Impala, and Spark, that are used in Big Data reporting systems.
This is an introductory course in Big Data and Hadoop, but it goes beyond the basics to introduce some technical components. It is appropriate both for those who simply want to learn more about Hadoop and Big Data and for those who are looking to begin on a path to becoming a Hadoop developer.
Course outline:
Day 1 material
What is big data?
- Volume, variety, velocity, and veracity
- Comparing big data to conventional reporting systems
Strengths and weaknesses of big data solutions
- Processing bottlenecks
- Data integration challenges
- Data redundancy versus speed
- Lack of data integrity
Key-value data systems and HDFS
- Relational, dimensional, and key-value data models
- Navigating data in HDFS
Parallel processing
- Big data hardware setups
- Retrieving and processing data in big data environments
Day 2 material
MapReduce
- Writing MapReduce algorithms in Java
- Executing MapReduce code on HDFS data
Common big data algorithms
- Reusing mappers and reducers
- Common mappers: case, explode, filter, keyspace, identity
- Common reducers: sum, average, identity
Big data reporting with Hive
- What is Hive?
- Writing Hive statements
The big data ecosystem (Hive, Impala, Spark, etc.)
- Introduction to other Hadoop ecosystem tools: Pig, Sqoop, Flume, Oozie, Spark, Impala
- Comparison of in-memory processing options
Squeezing value from big data
- Best practices for wringing value from big data
Instructor:
Andrew Harrison is an Assistant Professor of Information Systems at the University of Cincinnati Lindner College of Business. His research interests include consumer fraud, deception, security systems, privacy, media capabilities, and virtual worlds.