Big Data and Spark

Hadoop? MapReduce? Spark? Hive? Making sense of the tools used to analyze big data can seem confusing and overwhelming at times. Dr. Harrison and Dr. Shan will help you understand how these components function and form the core of big data analytics systems. The emphasis of this course will be on understanding the fundamental principles of big data systems using Hadoop and Spark.

Spark can process huge volumes of data in real time and is a dominant choice for performing analytics at scale. The Hadoop Distributed File System (HDFS), in turn, forms the backbone of most big data systems. In this course, participants will learn the theory behind how these tools work so they can understand when, and how, to implement them effectively. The relative strengths and weaknesses of various big data systems will be highlighted to explain how Spark has emerged as a popular choice for analyzing dynamic, high-velocity, and high-volume data.

Participants will also get hands-on experience using HDFS and Spark to illustrate the power of big data analytics.

Intended audience

This is an introductory course in Big Data and Spark, but it will go beyond the basics to introduce some technical components. Most big data analytics will be performed using Spark and HiveQL, a query language based on SQL. Participants will also use basic Linux commands for operating Hadoop. This course is appropriate for those who want to learn more about how Spark and HDFS function and those who are looking to begin a career in big data analytics.


Andrew Harrison is an assistant professor of Information Systems in the Lindner College of Business at the University of Cincinnati. His research interests include consumer fraud, deception, security systems, privacy, media capabilities, and virtual worlds.

Zhe (Jay) Shan is an assistant professor in the Department of Operations, Business Analytics, and Information Systems in the Lindner College of Business at the University of Cincinnati. He earned his Ph.D. in Business Administration and Operations Research from Penn State University Smeal College of Business in 2011. Before joining UC, he worked as an assistant professor of Information Systems at Manhattan College School of Business for two years.


  • "It was the most useful content I've ever received on the big data/Hadoop/Spark topic."
  • "Excellent job by Andrew and Jay. The organization was great."
  • "Useful mix of lecture and labs to really make the topics stick. I now have a better practical understanding of tools that are part of the Hadoop system."
  • "Awesome for my level of knowledge in Hadoop."
  • "Excellent overview of rapidly changing topic. Very knowledgeable instructors."
  • "Fantastic course. Provided a good knowledge of the big data tools ecosystem."