Women in Big Data Global



WiBD and IBM Host Workshop on Accelerating Digital Innovation with Data Science in the Cloud


By Sabrina Kong

October 5, 2017


On Thursday, September 21, Women in Big Data and IBM hosted a workshop for about 60 attendees, ranging from professionals to students, to introduce building data science applications on IBM’s Bluemix cloud platform. Karmen Leung, Global Sales Executive for IBM Analytics & Cloud, gave the opening remarks, covering the Women in Big Data initiatives and the opportunities for networking.

Karmen Leung gives opening remarks

Next came an insightful overview by Shaweta Joshi, Data Scientist at IBM Open Source Analytics. Shaweta noted that a growing number of companies are moving their data science practices into the cloud to accelerate data collection, data pre-processing, machine learning model training, and model deployment. She also discussed the IBM Bluemix Cloud Platform, the IBM Watson Data Platform, and the core attributes of IBM’s Data Science Experience (DSX) platform.

Shaweta Joshi discusses industry trends

Lab Session 1: Following Shaweta’s overview, Stacey Ronaghan, Data Scientist at IBM, started the first lab session of the day, Coding with Apache Spark: A Deeper Look at Apache Spark. Many attendees had used or knew of the Hadoop MapReduce framework. Stacey compared Spark with MapReduce and highlighted that Spark is an in-memory framework for distributed data processing and iterative analysis. She also introduced the core libraries of the Spark ecosystem: Spark SQL, Spark Streaming, MLlib, and GraphX. In the lab, Stacey worked with the attendees to experiment with Spark syntax in an IPython notebook on the IBM DSX platform, covering basic operations on Spark RDDs (Resilient Distributed Datasets) and manipulating data with them.
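To give a flavor of the basic RDD operations mentioned above, here is a minimal PySpark sketch. It is not the workshop's notebook code: the function names (tokenize, is_long_word, count_long_words), the sample task, and the local[*] master setting are assumptions made for illustration. On DSX a SparkContext is normally already available, in which case getOrCreate simply returns it.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def is_long_word(word, min_length=4):
    """Return True for words with at least min_length characters."""
    return len(word) >= min_length


def count_long_words(lines, min_length=4):
    """Count occurrences of long words using basic RDD operations.

    Hypothetical helper for illustration; requires PySpark and a Java runtime.
    """
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    # Assumption: run locally on all cores; an existing context is reused.
    conf = SparkConf().setMaster("local[*]").setAppName("rdd-sketch")
    sc = SparkContext.getOrCreate(conf)

    rdd = sc.parallelize(lines)          # create an RDD from a local list
    words = rdd.flatMap(tokenize)        # transformation: one line -> many words
    long_words = words.filter(lambda w: is_long_word(w, min_length))
    pairs = long_words.map(lambda w: (w, 1))   # transformation: word -> (word, 1)
    counts = pairs.reduceByKey(add)      # transformation: sum the 1s per word
    return dict(counts.collect())        # action: triggers the actual computation
```

Transformations such as flatMap, filter, and map are lazy; nothing runs until an action such as collect is called. With PySpark installed, count_long_words(["Spark is fast", "Spark runs in memory"]) would count "spark" twice and "fast", "runs", and "memory" once each.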

Stacey Ronaghan starts the first presentation and Lab session of the day

Lab Session 2: After a brief lunch break, Jihyoung Kim, Data Scientist at IBM, started the second lab session: Create Machine Learning Models and Deployment – Titanic Analysis. Jihyoung guided the attendees through the Model module of the IBM DSX platform to build supervised machine learning models with the Titanic Survivors dataset and deploy them on a test dataset. Jihyoung also demonstrated building a predictive analytics model with the SPSS Modeler GUI tools, which handle loading, partitioning, and transforming data and building models without writing a single line of code.

Jihyoung Kim kicks off the afternoon session with the second presentation and lab

Lab Session 3: Shaweta Joshi led the third lab session: Notebooks – Breast Cancer Analysis. She walked through the data pre-processing and machine learning modeling steps in a typical project, including data exploration, data preparation, data visualization, feature engineering, dimensionality reduction, model prototyping, result evaluation, and experimentation with different models. The Breast Cancer Analysis project used an open-source UC Irvine machine learning dataset and was built in an IPython interactive notebook. After Shaweta’s introduction, attendees continued on their own, analyzing the Breast Cancer dataset.
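A few of the steps listed above (data preparation, dimensionality reduction, model prototyping, and result evaluation) can be sketched with scikit-learn. This is not the workshop notebook. It assumes the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn, which originates from the UC Irvine repository but may not be the exact dataset used in the lab. The function name, the choice of logistic regression, and the parameter values are likewise illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_breast_cancer_model(n_components=5, seed=0):
    """Train a simple classifier and return its accuracy on held-out data.

    Hypothetical helper for illustration only.
    """
    # Data preparation: load features and labels, then hold out 25% for testing.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Scale features, reduce dimensionality with PCA, then fit a classifier.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)

    # Result evaluation: mean accuracy on the held-out test split.
    return model.score(X_test, y_test)
```

Swapping LogisticRegression for another estimator, or changing n_components, is one way to experiment with different models while keeping the same train/test split.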

After the lab sessions, Karmen Leung wrapped up this very informative workshop with closing comments.


GitHub link to the code for the workshop:


Steps for setting up IBM Bluemix and Data Science Experience (DSX):

  1. To set up a Spark service in your IBM Bluemix environment, navigate to https://new-console.ng.bluemix.net and register. Once you are logged in to your Bluemix account, click the Catalog tab and enter “Spark” in the search bar. Select the “Apache Spark” service, make sure “US South” is selected as the region to deploy in, and click “Create”.
  2. To set up your IBM Data Science Experience (DSX) environment, where you create and run notebooks, navigate to http://datascience.ibm.com and log in using your existing Bluemix user ID.