On Thursday, September 21, Women in Big Data and IBM hosted a workshop for about 60 attendees, ranging from students to professionals, on building data science applications on IBM’s Bluemix cloud platform. Karmen Leung, Global Sales Executive for IBM Analytics & Cloud, opened with remarks on the Women in Big Data initiatives and on the networking opportunities the day offered.
Next came an overview by Shaweta Joshi, Data Scientist at IBM Open Source Analytics. Shaweta noted that a growing number of companies are moving their data science practices into the cloud to accelerate data collection, data pre-processing, machine learning model training, and model deployment. She also discussed the IBM Bluemix Cloud Platform, the IBM Watson Data Platform, and the core attributes of IBM’s Data Science Experience (DSX) platform.
Lab Session 1: Following Shaweta’s overview, Stacey Ronaghan, Data Scientist at IBM, started the first lab session of the day, Coding with Apache Spark: A Deeper Look at Apache Spark. Many attendees had used or were familiar with the Hadoop MapReduce framework. Stacey compared Spark with MapReduce and highlighted that Spark is an in-memory framework for distributed data processing and iterative analysis. She also introduced the main libraries of the Spark stack: Spark SQL, Spark Streaming, MLlib, and GraphX. In the lab itself, Stacey worked with the attendees as they experimented with Spark syntax in an IPython notebook on the IBM DSX platform. She covered basic operations on Spark RDDs (Resilient Distributed Datasets) and manipulating data with RDDs.
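To give a flavor of the RDD-style operations covered in this session, here is a minimal sketch in plain Python, not the lab's actual notebook code. It mimics the flatMap, filter, map, and reduceByKey pattern on an ordinary list, so it runs without a Spark cluster. The sample sentences and variable names are made up for illustration. In Spark, the same steps would be chained as methods on an RDD and evaluated lazily across the cluster.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data standing in for a distributed dataset.
lines = [
    "spark keeps data in memory",
    "mapreduce writes data to disk",
    "spark supports iterative analysis",
]

# flatMap-style step: split every line into words and flatten the result.
words = [word for line in lines for word in line.split()]

# filter-style step: keep only words longer than four characters.
long_words = [w for w in words if len(w) > 4]

# map-style step: pair each word with an initial count of 1.
pairs = [(w, 1) for w in long_words]

# reduceByKey-style step: sum the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# reduce-style action: collapse all counts into a single total.
total = reduce(lambda a, b: a + b, (n for _, n in pairs))

print(dict(counts))
print("total long words:", total)
```

The point of the pattern is that each step takes a collection and returns a new one, which is what lets Spark keep intermediate results in memory and reuse them across iterations.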
Lab Session 2: After a brief lunch break, Jihyoung Kim, Data Scientist at IBM, started the second lab session: Create Machine Learning Models and Deployment – Titanic Analysis. Jihyoung guided the attendees through the Model module of the IBM DSX platform to build supervised machine learning models on the Titanic survivors dataset, deploy a model, and run it against a test dataset. Jihyoung also demonstrated building a predictive analytics model with the SPSS Modeler GUI tools, which handle loading, partitioning, and transforming data and building models without writing a single line of code.
Lab Session 3: Shaweta Joshi led the third lab session: Notebooks – Breast Cancer Analysis. She walked through the data pre-processing and machine learning modeling steps of a typical project: data exploration, data preparation, data visualization, feature engineering, dimensionality reduction, model prototyping, result evaluation, and experimentation with different models. The Breast Cancer Analysis project used an openly available UC Irvine machine learning dataset and was built in an IPython interactive notebook. After Shaweta’s introduction, attendees continued on their own and performed their own analysis of the Breast Cancer dataset.
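The workflow described in this session can be sketched in a few lines of Python with scikit-learn. This is an illustrative outline under stated assumptions, not the workshop's notebook. It assumes scikit-learn's bundled Breast Cancer Wisconsin (Diagnostic) data, which originates from the UC Irvine repository, is a reasonable stand-in for the dataset used in the lab. The choices of scaling, PCA with five components, and logistic regression are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data exploration: load the features and labels and check their shape.
X, y = load_breast_cancer(return_X_y=True)
print("samples, features:", X.shape)

# Data preparation: hold out a stratified test set for result evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Model prototyping: scale the features, reduce dimensionality with PCA,
# then fit a simple classifier. Each step is an assumption for this sketch.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Result evaluation: accuracy on the held-out test set.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Swapping the last pipeline step for another estimator is one simple way to experiment with different models while keeping the preparation steps fixed.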
After the lab sessions, Karmen Leung wrapped up the workshop with closing comments.
GitHub link to the workshop code:
Steps for setting up IBM Bluemix and Data Science Experience (DSX):