SKILL SETS Code up machine learning algorithms on single machines and on clusters of machines / Amazon AWS / Working on problems with terabytes of data / Machine learning pipelines for petabyte-scale data / Algorithmic design / Parallel computing
TOOLS Apache Hadoop / Apache Spark
DESIGNED BY James G. Shanahan
This course builds on and goes beyond the collect-and-analyze phase of big data, focusing on how machine learning algorithms can be rewritten and extended to work at scale on petabytes of data, both structured and unstructured, to generate sophisticated models used for real-time predictions. Conceptually, the course is divided into two parts. The first covers fundamental concepts of MapReduce parallel computing through the eyes of Hadoop, MrJob, and Spark, while diving deep into Spark Core, DataFrames, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more. The second part focuses on hands-on algorithmic design and development in parallel computing environments (Spark): decision tree learning, graph processing algorithms (PageRank, shortest path), gradient descent algorithms (support vector machines), and matrix factorization. Students will use MapReduce parallel compute frameworks for industrial applications and deployments in fields such as advertising, finance, healthcare, and search engines. Examples and exercises will be made available in Python notebooks (Hadoop Streaming, MrJob, and pySpark).
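To give a flavor of the MapReduce pattern covered in the first part of the course, below is a minimal pySpark sketch of a word count, the canonical map-and-reduce warm-up exercise. It is purely illustrative and not drawn from the course materials; the app name and sample input are placeholders.

from pyspark.sql import SparkSession

# Illustrative sketch: a classic MapReduce-style word count in pySpark.
spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
sc = spark.sparkContext

# Placeholder input; in practice this would be a large distributed dataset.
lines = sc.parallelize(["big data at scale", "machine learning at scale"])

counts = (lines.flatMap(lambda line: line.split())  # map: split each line into words
               .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # reduce: sum the counts for each word

print(counts.collect())
spark.stop()

The same pattern of a map step followed by a key-grouped reduce underlies the larger-scale algorithms listed above, from PageRank to gradient descent.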