
The Data Science Process

By Shawn Hainsworth posted 07-23-2019 17:19

  

Introduction


This is the second post in my series of machine learning posts in advance of the conference.  I hope this will provide for a lively conversation during our session:  Machine Learning and AI in Action (AI Series, Session 2). 

In my first post, A Non-Technical Introduction to Machine Learning Algorithms, I discussed the various machine learning algorithms, and the types of business problems which can be solved using artificial intelligence and machine learning.

In this post, I will discuss the process of implementing a complete, end-to-end data science experiment.  This post should help you understand how to get started, and also provide an overview of the tasks involved in building a machine learning experiment.

Below is a diagram of the data science process, which I will use to detail the various steps required in any machine learning experiment.  This diagram is available from Microsoft in the documentation of the Microsoft Team Data Science Process (TDSP).  

Data Science Process

Business Understanding


At the beginning of any data science project, we must clearly define the business problem and the objectives of the experiment.  Remember that machine learning algorithms can be supervised or unsupervised, meaning we can have a labeled objective, such as predicting profitability, or a more exploratory objective, such as clustering clients to understand deeper relationships and behavioral patterns.
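
As a quick illustration of the unsupervised case, here is a minimal clustering sketch with scikit-learn; the client features below are made up for illustration:

    # A minimal sketch of an unsupervised (clustering) objective with scikit-learn.
    # The client features below are made up for illustration.
    import pandas as pd
    from sklearn.cluster import KMeans

    clients = pd.DataFrame({
        "annual_billings": [120, 450, 90, 610, 300, 75],
        "matter_count":    [4, 15, 3, 22, 10, 2],
    })

    # Group clients into three clusters based on behavior, with no labeled outcome.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(clients)
    clients["cluster"] = kmeans.labels_
    print(clients)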

For supervised algorithms, we have specific scoring metrics by which we can evaluate the effectiveness of a model.  We should define the range of acceptable accuracy, such as predicting correctly in 80% of new cases.

We should also understand trade-offs between precision and recall.  In other words, are we interested in identifying all cases where a certain condition is true, even if we have a number of false positives?  For example, in conflict checks we cannot miss any conflicts even if our algorithm returns a number of false hits.  See this post on The Legal BI Guy for a more detailed discussion of Precision vs. Recall.
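
To make the trade-off concrete, here is a minimal sketch of computing precision and recall with scikit-learn; the labels below are made up, not real conflict-check data:

    # A minimal sketch of the precision/recall trade-off using scikit-learn.
    # The labels below are illustrative only.
    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual positives (1) vs. negatives (0)
    y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 1, 0]   # what a hypothetical model predicted

    # Precision: of the cases we flagged, how many were truly positive?
    print("Precision:", precision_score(y_true, y_pred))

    # Recall: of the true positives, how many did we catch?
    print("Recall:", recall_score(y_true, y_pred))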

Data Acquisition & Understanding

In the data acquisition and understanding phase, there are a number of steps.

Data Sources

We often must combine data from different data sources, such as SQL databases and flat files (Excel, CSV).  We may need to extract data from text documents, such as contracts, using natural language processing.  In addition, we may need to aggregate data at different levels (groupings).  Finally, we must plan for latency.  How frequently is the data refreshed?
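
As a rough sketch of what this can look like in code, the following combines a SQL table with a CSV file and aggregates to a common grouping level using pandas; the connection string, table, file, and column names are all hypothetical placeholders:

    # A minimal sketch of combining data sources with pandas.
    # The connection string, table, file, and column names are hypothetical placeholders.
    import pandas as pd
    import sqlalchemy

    engine = sqlalchemy.create_engine("mssql+pyodbc://my_dsn")            # SQL database source
    clients = pd.read_sql("SELECT client_id, industry FROM clients", engine)
    invoices = pd.read_csv("invoices.csv")                                # flat-file source

    # Aggregate the detail rows to the client level before joining.
    totals = invoices.groupby("client_id", as_index=False)["amount"].sum()

    # Combine the two sources on the shared key.
    merged = clients.merge(totals, on="client_id", how="left")
    print(merged.head())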

Pipeline

What is the flow of data through the various stages of our experiment?  Are we working with streaming data or batch data?  How is unstructured data stored: on the file system, or in a data lake?  Are we creating intermediate data stores?

Environment

The environment includes all of the resources involved in our experiment, including separate test and production environments.  It may involve on-premises resources as well as resources in the cloud, and a mix of technologies (SQL, NoSQL, data lakes, Apache Spark, etc.).

Wrangling, Exploration and Cleaning

Wrangling involves bringing the data for our experiment together.  This requires joining data sources and potentially integrating unstructured data.  Data exploration includes a thorough understanding of each of our data columns.  This step involves reviewing statistical distributions, identifying outliers, identifying categorical columns, and reviewing the scale of numeric columns.  We can use data visualizations (histograms, line plots, heatmaps, etc.) to help us understand the data.  Finally, we must clean the data.  This step involves handling missing data by either removing rows or columns or imputing values.
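
The following is a minimal sketch, with hypothetical file and column names, of what exploration and cleaning steps can look like in pandas:

    # A minimal sketch of data exploration and cleaning with pandas.
    # The file and column names are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("experiment_data.csv")

    # Explore: statistical distributions, data types, and missing values.
    print(df.describe())
    print(df.dtypes)
    print(df.isnull().sum())

    # Visualize a numeric column to spot outliers and skew.
    df["revenue"].plot(kind="hist", bins=30)

    # Clean: drop rows missing the label, impute a numeric column with its median.
    df = df.dropna(subset=["label"])
    df["revenue"] = df["revenue"].fillna(df["revenue"].median())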

I am in the process of writing an end-to-end machine learning experiment with full source code available on GitHub.  

In Part 1, I provide example code in Python to import, merge, visualize and clean a public IPO data source.

Modeling

The modeling phase includes feature engineering, as well as scoring and evaluating our model.  Please note the bi-directional arrows between Data Acquisition & Understanding and Modeling.  These two steps are generally performed iteratively.  After building some initial models, we may re-visit the data phase to get a further understanding of the data, or to refine the handling of outliers, missing values, etc.

Feature Engineering


In the feature engineering step, we identify the most relevant features for our model.  Selecting the most relevant features for training a model has the following benefits:
  • Improves model accuracy
  • Reduces over-fitting
  • Reduces training time
In addition to selecting the most relevant features, we often need to apply the following transformations:
  • Binning data columns
  • Converting to indicator columns
  • Normalizing and scaling data columns
The primary objective of these transformations is to improve the performance of the algorithm we select.
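
Here is a minimal sketch of these transformations using pandas and scikit-learn; the column names and bin edges are made up for illustration:

    # A minimal sketch of common feature engineering transformations.
    # Column names and bin edges are hypothetical placeholders.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({
        "age":     [25, 40, 58, 33, 47],
        "region":  ["East", "West", "East", "South", "West"],
        "revenue": [120.0, 340.5, 98.2, 210.0, 415.7],
    })

    # Binning a numeric column into categories.
    df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "mid", "senior"])

    # Converting a categorical column to indicator (dummy) columns.
    df = pd.get_dummies(df, columns=["region"])

    # Normalizing / scaling a numeric column to zero mean and unit variance.
    df["revenue_scaled"] = StandardScaler().fit_transform(df[["revenue"]]).ravel()

    print(df.head())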

In Part 2 of the IPO machine learning experiment referenced above, I provide example code in Python for common feature engineering tasks.

Model Training

In the model training step, we split data into training and test data sets.  In this way, we can train our models on a different data set than the data we use to test and evaluate them.  Once we have completed the data preparation, training models is generally straightforward.  In my experience, training a number of different models using different algorithms, including training ensemble models, generally produces better results.
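
A minimal sketch of this step with scikit-learn, using generated toy data rather than a real data set, might look like the following:

    # A minimal sketch of splitting data and training several candidate models.
    # Toy data is generated for illustration.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest":       RandomForestClassifier(n_estimators=200),  # an ensemble model
    }

    # Train each model on the training set and score it on the held-out test set.
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, "test accuracy:", model.score(X_test, y_test))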

Model algorithms often have a number of hyperparameters which can be used to adjust both the performance and accuracy of the algorithm. While we can use trial and error to evaluate variations on a limited number of hyperparameter values, we often need to programmatically tune a model's hyperparameters by iterating over the parameter space and evaluating the results of each run.
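
For example, scikit-learn's GridSearchCV iterates over a parameter grid and scores each combination with cross-validation; the sketch below uses an illustrative grid on generated toy data:

    # A minimal sketch of programmatic hyperparameter tuning with GridSearchCV.
    # The parameter grid is illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    param_grid = {
        "n_estimators": [100, 200],
        "max_depth":    [5, 10, None],
    }

    # Evaluate every combination in the grid with 5-fold cross-validation.
    search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
    search.fit(X, y)

    print("Best parameters:", search.best_params_)
    print("Best cross-validated score:", search.best_score_)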

Model Evaluation

Different machine learning algorithms return different types of results.  There are, therefore, different methods of evaluating a model's accuracy, as described below.

Binary Classification (Logistic) Algorithms

In binary classification algorithms, we are predicting one of two values, usually true or false.  There are a number of ways to evaluate a model's accuracy, depending on the objective of the experiment.  As mentioned above, we can try to optimize a model for precision, for recall, or for a balance of the two.  Another common technique is to generate a receiver operating characteristic (ROC) curve, and to calculate the area under the curve (AUC).
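
A minimal sketch of computing the ROC curve and AUC for a binary classifier with scikit-learn, again on generated toy data, is shown below:

    # A minimal sketch of ROC/AUC evaluation for a binary classifier.
    # Toy data is generated for illustration.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

    fpr, tpr, thresholds = roc_curve(y_test, probs)    # points on the ROC curve
    print("AUC:", roc_auc_score(y_test, probs))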

Regression Algorithms

Regression algorithms predict a value, and are often evaluated by looking at the sum of squared errors and the coefficient of determination (R²).  For an introduction to linear regression, see the following Legal BI Guy blog post.
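
A minimal sketch of these regression metrics, using generated toy data, might look like this:

    # A minimal sketch of evaluating a regression model.
    # Toy data is generated for illustration.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)

    sse = np.sum((y_test - preds) ** 2)                # sum of squared errors
    print("Sum of squared errors:", sse)
    print("Coefficient of determination (R^2):", r2_score(y_test, preds))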

Deployment and Customer Acceptance

Once we have a trained model, we need to deploy the model, often as a web service, in order to generate predictions on incoming data.  In addition to monitoring the model, we need to consider retraining and concept drift.
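
As one hedged illustration, and not necessarily how the referenced experiment is deployed, a trained and pickled model could be exposed as a simple web service with Flask; the model file and JSON payload below are hypothetical:

    # A minimal sketch of serving a pickled model as a web service with Flask.
    # "model.pkl" and the expected JSON payload are hypothetical.
    import pickle
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]      # e.g. {"features": [[0.2, 1.5, 3.1]]}
        prediction = model.predict(features).tolist()
        return jsonify({"prediction": prediction})

    if __name__ == "__main__":
        app.run(port=5000)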

Most importantly, we must ensure that the model's accuracy meets the business objectives and that we verify customer acceptance.


#DataScience