Applied Introduction to Machine Learning

By Cassie Siljander posted 11-13-2017 10:48



Is there anything in your work life that makes you think, “Why am I still doing this manually?” or “I wish I could see predictions based on historical data to help make decisions.” Perhaps you just think, in general, things could be done more efficiently through machine learning and process improvement. Well, you are probably right.

I had something like that in my work life and decided that there was, in fact, a better way to do it.  In this post, I will give an overview of some fundamental machine learning concepts, and how I used Azure Machine Learning to automate helpdesk ticket assignment.  Regardless of your background, you will be able to understand the basic concepts of machine learning.

What is machine learning and AI?

At a high level, machine learning applies statistical methods to data, often large datasets and sometimes free text, so that a computer can learn patterns and make data-driven predictions or decisions. There are uses for this concept in virtually every field to help make decisions and automate processes.  Some examples of machine learning that you likely use today are text and speech recognition, fraud detection, market segmentation, product recommendations, pricing, classification, and more.

Machine learning models have three major components: an Input Layer (called Features), a Hidden Layer (where the magic happens), and an Output Layer (called Scored Labels).  The Input Features are the data points used to train the model and to get predictions. The Hidden Layer varies based on the type of machine learning algorithm that is being implemented. (I will talk more about the types of machine learning algorithms later in this post.)  Once the model is trained, it can output the predicted values, or Scored Labels. I am not going to go into the statistical formulas; however, if you have a statistics background, it will be useful in understanding the “how” of machine learning.

Machine Learning Model diagram

The two main types of machine learning are Supervised Learning and Unsupervised Learning.  With Supervised Learning, the model is provided with examples of the input features together with the known output labels, so it can “learn” from historical data how inputs should be classified or what values should be predicted.  In Unsupervised Learning, the model is provided with a dataset of input features but no output labels.  Unsupervised Learning is used to detect patterns. A good example would be credit card fraud detection: when something doesn’t fit the pattern, a flag is raised. This post will discuss Supervised Learning, and specifically a multiclass classification model.

Predict the future from the past

A successful application of machine learning first requires the availability of suitable data to train a model.  If the model is given bad data, it will return bad results: garbage in, garbage out. For my application, the question I wanted to answer was “Who should this helpdesk ticket be assigned to in our IT Development department?”  First, I had to find the helpdesk ticket data and then dig through multiple possible variables that could be used as the input features. When trying to figure out what inputs would be helpful, ask yourself, “What are the key indicators that would predict an outcome?”  The features used in the helpdesk model were Customer, Company, Category, and Description (the full text of the ticket). The Scored Label (output) is the Assignee ID, which identifies the person who was assigned to the ticket. Additional features can be added later if it’s discovered during testing that the model needs more information, so don’t stress too much about getting this right on the first try.  Once the features have been identified and gathered for the model, it’s time to start building!


Screenshot of Microsoft Azure Machine Learning Studio. The model shown is the Fish Help Desk Triage v1.

Move the data into Azure, preprocess the data, and train the model

Now that the data features have been identified, the dataset can be imported into the Azure Machine Learning Studio as a Saved Dataset. If the dataset is already stored in Azure, use the Import Data module to access the database, or set up a gateway to work with on-premises data. The helpdesk model used an uploaded CSV dataset since the data was stored outside of Azure and a gateway is not currently set up.  Once the data is in Azure, we need to process and clean the data with different modules. There are many options to choose from to prepare and process the data to train the model. The modules used will vary based on the data types, the problem being solved, and the kind of model being built.

For the helpdesk model, the data passed through several modules before reaching the machine learning algorithm module: clean missing data, preprocess the description text, and extract n-gram features.  Preprocessing large text features like the description helps to simplify the text.  Azure has a list of stop words (words to remove from the text), and a custom list may be added or used in place of the canned list. The Preprocess Text module in Azure performs a variety of functions, such as case normalization, lemmatization (instances of “running” become “run”), sentence detection, number removal, special character removal, and regular expression replacement. This is an essential piece for large text features. Microsoft states “N-Gram are used to featurize long text strings, extracting only the most important pieces of information.”  This means the preprocessed text is broken into smaller pieces (single words and short word sequences), each of which can be assigned a numerical value and processed in the formula. Next, split the data 70/30 to train and score the model.  Use seventy percent of the data to train the model; use the remaining thirty percent to score the model. This is a commonly recommended split, but it can be adapted if needed.
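To make the preprocessing, n-gram, and split steps more concrete, here is a minimal Python sketch that uses only the standard library. It is not what the Azure modules do internally; it is a simplified stand-in. The stop-word list, the function names, and the sample ticket text are all made up for illustration, and lemmatization is left out entirely.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; Azure ships its own list).
STOP_WORDS = {"a", "an", "the", "is", "to", "on", "my", "i", "and", "of", "in"}


def preprocess(text):
    """Lowercase, drop digits and special characters, remove stop words."""
    text = text.lower()                      # case normalization
    text = re.sub(r"[^a-z\s]", " ", text)    # remove numbers and special characters
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=2):
    """Count all 1-grams up to max_n-grams; the counts act as numeric features."""
    tokens = preprocess(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows, then split 70 percent train, 30 percent test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10            # integer math avoids rounding surprises
    return shuffled[:cut], shuffled[cut:]


# Hypothetical ticket description, for illustration only.
print(preprocess("The printer is NOT working on floor 3!"))
print(ngram_counts("Reset my password, reset my password"))
```

Each distinct n-gram becomes one column of numbers, which is how free text turns into something an algorithm can work with.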

Which machine learning algorithm should be used?

In a way, this is the tricky part. When I first started looking into this, the types of algorithms seemed quite cryptic.  Luckily, Microsoft has a handy algorithm cheat sheet. Think about the result you expect from the data features posted to the model. If the problem requires predicting numeric values, then use a regression algorithm. If the problem requires predicting a type or class, use a classification algorithm. If there are two possible classes (true or false, green or blue), then use a two-class classification algorithm.  The helpdesk model had about fifteen classes (Assignee IDs), which means it needed a multiclass classification (three or more classes) algorithm. OK, we have figured out half of the answer. Now which type of multiclass classification should be used? There are multiple options that will not be discussed here.  The one that was right for the helpdesk model was the Multiclass Logistic Regression algorithm. It uses a logistic function to turn the input features into a probability for each of the three or more possible classes (think back to your statistics class now). Still with me? Don’t go cross-eyed! Everything seems complicated the first time you hear it, and everything seems simple once you understand.  For some, this may make sense the first time they read it, and others may need to reread a few times. We will all get there eventually.
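For readers who want to see the idea behind multiclass logistic regression in code, here is a minimal sketch of the prediction step only: each class gets a linear score from the features, and the softmax function (the multiclass form of the logistic function) turns those scores into probabilities. The weights, assignee names, and function names below are made up for illustration; a real model learns its weights during training, and this is not a description of how Azure implements the algorithm.

```python
import math


def softmax(scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    highest = max(scores.values())  # subtracting the max keeps exp() stable
    exps = {label: math.exp(s - highest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def score_classes(features, weights, biases):
    """One linear score per class: bias + sum(weight * feature value)."""
    return {
        label: biases[label]
        + sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in weights.items()
    }


def predict_proba(features, weights, biases):
    """Multiclass logistic regression prediction for a single example."""
    return softmax(score_classes(features, weights, biases))


# Made-up weights for three hypothetical assignees (not learned from real data).
weights = {
    "assignee_a": {"password": 2.0, "printer": -1.0},
    "assignee_b": {"password": -1.0, "printer": 2.0},
    "assignee_c": {},
}
biases = {"assignee_a": 0.0, "assignee_b": 0.0, "assignee_c": 0.0}

# A ticket whose only n-gram feature is one occurrence of "password".
probs = predict_proba({"password": 1.0}, weights, biases)
print(probs)
```

With these made-up weights, the ticket that mentions "password" gets its highest probability for assignee_a, because that class has the largest positive weight on that feature.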

Score, Evaluate, Update, and Consume the Model

The hard part is done; now it is time to score and evaluate the model.  The thirty percent of the data that was not used to train the model is used to score it.  This shows whether the model is predicting correctly and helps identify areas for improvement.  If the model has a low accuracy score, go back, adjust different steps, and retest until the accuracy is acceptable. Once the model reaches a satisfactory accuracy, it can be consumed as an API by an application or by Excel.  Azure has a nifty “Set up Web Service” button to create the service in Azure, which can be called in request/response mode for a single item or in batch mode for large datasets. For the helpdesk model, the Azure service is used.  When a helpdesk ticket is assigned to the IT Development department, a record is added to a SQL table to be processed.  A job runs every 5 minutes to check whether a new item needs to be assigned.  If there is a new item, the job grabs the input features and posts them to the Azure service, and the scored probabilities are returned and logged to the database.  Then an email is sent to the top three most probable assignees as predicted by the helpdesk model. The N-Gram keywords are included in the body of the email so the recipients can see at a glance what keywords were found. The keywords help the recipient understand what the ticket is about without going to the helpdesk program itself.  Additionally, the email includes a link to the helpdesk ticket so the assignee can review the ticket right away.
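The two calculations in this step, measuring accuracy on the held-out data and picking the top three assignees from the scored probabilities, can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the scored probabilities have already been parsed into a dictionary that maps each Assignee ID to a probability; the actual response format of the Azure web service is not shown here.

```python
def accuracy(predicted, actual):
    """Fraction of held-out tickets whose predicted assignee matches the real one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def top_assignees(scored_probabilities, k=3):
    """Return the k most probable (assignee, probability) pairs, highest first."""
    ranked = sorted(
        scored_probabilities.items(), key=lambda item: item[1], reverse=True
    )
    return ranked[:k]


# Made-up values for illustration only.
print(accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"]))
print(top_assignees({"a": 0.1, "b": 0.5, "c": 0.05, "d": 0.35}))
```

Sending the ticket to the top three candidates rather than only the single best guess gives the process some tolerance for a model that is right most of the time but not always.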

Maintenance: Retraining the Model

As with everything, evolution is inevitable, and the model will need to stay up to date as variables change. The helpdesk model is updated when we add or remove staff from the department.  We repeat the data gathering steps and retrain the model by processing the new data through the model as we did with the initial dataset. This piece has been automated.

There are multiple platforms out there to build machine learning models

There are many other platforms and services available that make machine learning more attainable to those of us who are not data scientists. In addition to Microsoft’s offering, products such as Amazon Machine Learning on AWS and Google’s TensorFlow are available, with varying levels of complexity and skills needed to implement them.  Azure Machine Learning Studio was accessible (no coding required) and has good documentation to help understand how to use the modules. I have not built anything with Google TensorFlow or Amazon Machine Learning, so I cannot speak to the advantages and disadvantages of one platform versus another.

Additionally, you can build machine learning models with programming languages such as Python or R.  If you decide to build from scratch, the main Python data science libraries are pandas, NumPy, scikit-learn (sklearn), NLTK, and Matplotlib.
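As a rough idea of what the code route looks like, here is a minimal scikit-learn sketch of a pipeline similar in spirit to the helpdesk model: n-gram features from the ticket description, a 70/30 split, and a multiclass logistic regression. It is an assumption-laden illustration, not a port of the Azure experiment. The tickets and assignee IDs are made up, only the Description feature is used (the real model also used Customer, Company, and Category), and a dataset this small says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up ticket descriptions and assignee IDs, for illustration only.
descriptions = [
    "Cannot log in, password expired",
    "Password reset link not working",
    "Locked out of account after wrong password",
    "Need a password reset for the portal",
    "Printer on second floor is jammed",
    "Printer will not connect to laptop",
    "Office printer prints blank pages",
    "Printer driver install fails",
    "Sales report shows wrong totals",
    "Monthly report export is broken",
    "Report page loads very slowly",
    "Need new column added to the report",
]
assignees = ["assignee_101"] * 4 + ["assignee_102"] * 4 + ["assignee_103"] * 4

# Hold out 30 percent for scoring; stratify keeps every assignee in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    descriptions, assignees, test_size=0.3, random_state=0, stratify=assignees
)

model = make_pipeline(
    # Lowercase the text, drop English stop words, count 1-grams and 2-grams.
    CountVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
    # Logistic regression handles three or more classes directly.
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))

# Probability for each assignee, in the order given by model.classes_.
probabilities = model.predict_proba(["I forgot my password"])[0]
print(dict(zip(model.classes_, probabilities)))
```

pandas would typically come in for loading and cleaning the ticket data, NLTK for lemmatization, and Matplotlib for charting results; they are left out here to keep the sketch short.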

Oh, the possibilities!

Machine learning allows for an amazing array of possibilities and applications. The uses seem endless, from pricing predictions to document classification to building smarter bots.  We are at a time now where we can use applied statistics, natural language processing, big data, and services to create applications that are smarter and make processes more efficient and accurate.  We can start to automate the things that we always knew should be automated but did not know how to tackle.  Machine learning and AI are pushing all industries forward, and the sooner we get on that bandwagon, the sooner we will start to see the benefits.