Big Data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and extract insights from large datasets. In this course we will discuss the challenges created by Big Data and some of the state-of-the-art approaches to deal with them. In this curricular unit students will obtain practical experience with the Hadoop, Hive, and Spark tools and understand their role in the analytical workflow of a data scientist. Lectures will cover the complex and heterogeneous Big Data ecosystem and the privacy and societal implications of these technologies, while in the labs students will gain hands-on experience with these tools.

Intended Learning Outcomes

  • Explain what Big Data is and what its implications for society are;
  • Identify the sources of Big Data;
  • Explain the core technologies that enabled the Big Data revolution;
  • Understand the role and importance of the Hadoop Ecosystem;
  • Explain what Map-Reduce is, and describe its role in the Hadoop Ecosystem;
  • Perform file manipulations with Hadoop;
  • Set up a Hive Data Warehouse in a Hadoop system;
  • Explore and analyze data with Hive;
  • Understand what Spark is;
  • Load, transform, and analyze data using Spark;
  • Develop Spark applications to create machine learning models;
  • Analyze network data using Spark GraphX;
  • Deploy a Spark cluster on Amazon Web Services (AWS).