Issue 30

Machine Learning in Practice

Sergiu Indrie
Software Engineer @ iQuest
PROGRAMMING

I believe quality software is about making the user's life as easy as possible: not just by executing the required job flawlessly and efficiently, but also by simplifying the task without losing functionality. Many of our everyday tasks could be made easier if our software were able to make suggestions, guess our intent and automate some of our work. This is one of the use cases for machine learning techniques. Beyond that, applications such as web search engines, image and sound recognition, spam detection and many more all use machine learning to solve problems that would otherwise be almost unsolvable.

Explanation

Although machine learning might appear complex and hard to apply, like everything in computing it all comes down to 1s and 0s (perhaps even more so than usual).

The basic principle behind it is pretty simple math: I want to determine the similarity of complex items, but I can't even begin to declare comparison functions, so I just convert the complex items into numbers and let the machine calculate their correlation.

Let's take a simple scenario:

We have a user base with the following data (user_id, age, city):

And we want to provide similar users with similar hobby content (the idea being that similar people like similar things). Thus, by converting the input data into numerical data (manually or automatically),

we can plot the data in the following chart:

Based on the above graph we can see that the two points in the upper side are closer to each other (especially if we apply weights to relevant attributes). This would indicate that the two users (user_id 2 and 3) are similar. In this case, this is due to the close age values.

Mathematically, this would be done by computing the distance between the two points, using something like a weighted Euclidean distance

d(u, v) = √( w_age · (u_age - v_age)² + w_city · (u_city - v_city)² )

where u and v are two vectors, representing two points (items), each composed of two values, age and city (u_age, u_city), and w_age and w_city are the weights of the age and city attributes.
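As a sketch, this weighted distance can be computed in a few lines of plain Java; the users, attribute encodings and weights below are made-up illustration values:

```java
// Sketch of a weighted Euclidean distance between two feature vectors.
// The users, attribute values and weights below are made-up examples.
public class WeightedDistance {

    public static double distance(double[] u, double[] v, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) {
            double diff = u[i] - v[i];
            sum += w[i] * diff * diff; // weight each squared difference
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // two users encoded as (age, city); same city, ages two years apart
        double[] user2 = {27, 1};
        double[] user3 = {29, 1};
        double[] weights = {1.0, 10.0}; // emphasize the city attribute
        System.out.println(distance(user2, user3, weights)); // prints 2.0
    }
}
```

Raising a weight makes differences in that attribute count for more, which is how the "weights on relevant attributes" mentioned above would be applied in practice.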

By providing the machine learning solution with lots of positive and negative examples (training data), the algorithm can fine tune itself to correctly determine the resemblance of two items.

Three important use cases of machine learning that use the above principle are:

  1. Recommendations - analyzes a user's preferences and, based on user similarity, suggests potential new preferences

  2. Clustering - groups similar items together (requires no explicit training data)

  3. Classification - assigns items to categories based on previously learned assignments (the algorithm that does this is also referred to as a classifier)

Sample Application

One of the popular machine learning libraries in Java is Weka. We will use the Weka APIs to write a simple classification machine learning application.

We want to replicate Gmail's Priority Inbox functionality in the Geary desktop email client so that, when a new email is received, we can automatically assign it to one of 5 categories: Primary, Social, Promotions, Updates and Forums.

Data Representation

First, we need a way to represent our data. The 5 categories are easy: we'll use an index from 1 to 5. The incoming email content is a bit trickier. For these types of problems, the solution is generally to encode a piece of text using the bag-of-words model, which means representing the text as a word-count vector.

Let's take the following two text documents:

  1. I like apples.

  2. I ate. I slept.

Based on these two documents, we create a common dictionary assigning an index to each word.

Thus, we can create the following representations of the two pieces of text:

1, 1, 1, 0, 0

2, 0, 0, 1, 1

Each number in the above two vectors indicates the word count of that index in the dictionary. (The second vector starts with 2, meaning that the word "I" is used twice in the document).
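The encoding above can be sketched in plain Java; the lower-casing and the split-on-non-letters tokenizer are simplifying assumptions for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the bag-of-words encoding: count, for each dictionary word,
// how many times it occurs in the text. Tokenization is deliberately naive.
public class BagOfWords {

    public static int[] encode(String text, List<String> dictionary) {
        int[] counts = new int[dictionary.size()];
        // lower-case, then split on anything that is not a letter
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            int idx = dictionary.indexOf(token);
            if (idx >= 0) counts[idx]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("i", "like", "apples", "ate", "slept");
        System.out.println(Arrays.toString(encode("I like apples.", dict)));
        // prints [1, 1, 1, 0, 0]
        System.out.println(Arrays.toString(encode("I ate. I slept.", dict)));
        // prints [2, 0, 0, 1, 1]
    }
}
```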

To simplify this application, we will use a partial bag-of-words model: a custom dictionary containing only one relevant keyword for each category (dad, congratulate, discount, reminder, group).

Using the above dictionary we'll encode the following email

Hi dad. Remember to pick me up from preschool. Oh and dad, don't be late!

into the following word vector

2, 0, 0, 0, 0
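This keyword-counting step could be sketched as follows; the tokenization is again a simplifying assumption:

```java
import java.util.Arrays;

// Partial bag-of-words using the five-keyword dictionary from the article.
public class KeywordCounter {

    static final String[] KEYWORDS = {"dad", "congratulate", "discount", "reminder", "group"};

    public static int[] countKeywords(String email) {
        int[] counts = new int[KEYWORDS.length];
        for (String token : email.toLowerCase().split("[^a-z]+")) {
            for (int i = 0; i < KEYWORDS.length; i++) {
                if (KEYWORDS[i].equals(token)) counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String email = "Hi dad. Remember to pick me up from preschool. "
                     + "Oh and dad, don't be late!";
        System.out.println(Arrays.toString(countKeywords(email)));
        // prints [2, 0, 0, 0, 0]
    }
}
```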

Weka Code

In order to start writing a solution using the Weka APIs, first we need to define the data format. We will start by describing a data row.

// create the 5 numeric attributes corresponding to 
// the 5 category keywords
Attribute dadCount = new Attribute("dad");
Attribute congratulateCount = new Attribute("congratulate");
Attribute discountCount = new Attribute("discount");
Attribute reminderCount = new Attribute("reminder");
Attribute groupCount = new Attribute("group");

Each row must be associated with one of the 5 categories, thus we proceed with defining the category attribute.

// Declare the class/category attribute along with 
// its values
FastVector categories = new FastVector(5);
categories.addElement("1");
categories.addElement("2");
categories.addElement("3");
categories.addElement("4");
categories.addElement("5");
Attribute category = new Attribute("category", categories);

And finally we join all attributes into an attribute vector.

// Declare the feature vector (the definition of one data row)
FastVector rowDefinition = new FastVector(6);
rowDefinition.addElement(dadCount);
rowDefinition.addElement(congratulateCount);
rowDefinition.addElement(discountCount);
rowDefinition.addElement(reminderCount);
rowDefinition.addElement(groupCount);
rowDefinition.addElement(category);

Training Data

Machine learning classification algorithms need training data to adjust internal parameters to provide the best results. In our scenario, we need to provide the implementation with examples of email category assignments.

For the purpose of this demo, we will generate our own training data by using the class below.

public class TrainingDataCreator {

    public static void main(String[] args) throws IOException {
        ICombinatoricsVector<String> valuesVector =
            Factory.createVector(new String[]{"0", "1", "2", "3", "4"});
        Generator<String> gen =
            Factory.createPermutationWithRepetitionGenerator(valuesVector, 5);

        File f = new File("category-training-data.csv");
        FileOutputStream stream = FileUtils.openOutputStream(f);
        for (ICombinatoricsVector<String> perm : gen) {
            if (Math.random() < 0.3) { // restrict the training set size
                String match = determineMatch(perm.getVector()); // first highest count wins
                String features = StringUtils.join(perm.getVector(), ",");
                IOUtils.write(String.format("%s,%s\n", features, match), stream);
            }
        }
        stream.close();
    }
}

TrainingDataCreator uses permutations with repetition to generate training rows with word counts ranging from 0 to 4. Since we want this to be realistic training data, we refrain from creating a complete training set and only include about 30% of the permutations. Also, we want the classification algorithm to be able to make some sense of the data, so we set the category for each row based on the first highest word count (e.g. for the word counts dad=2, congratulate=3, discount=0, reminder=1, group=3, congratulate is the first word with the highest count, 3, so the category is 2).
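The determineMatch helper is not shown in the listing; a hypothetical implementation of the "first highest count wins" rule could look like this:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical implementation of the determineMatch helper referenced by
// TrainingDataCreator: the category is the 1-based index of the first
// attribute with the highest word count.
public class MatchRule {

    public static String determineMatch(List<?> counts) {
        int best = 0;
        for (int i = 1; i < counts.size(); i++) {
            int current = Integer.parseInt(counts.get(i).toString());
            int bestSoFar = Integer.parseInt(counts.get(best).toString());
            if (current > bestSoFar) best = i; // strictly greater: first max wins
        }
        return String.valueOf(best + 1); // categories are 1-based
    }

    public static void main(String[] args) {
        // dad=2, congratulate=3, discount=0, reminder=1, group=3
        System.out.println(determineMatch(Arrays.asList(2, 3, 0, 1, 3))); // prints 2
    }
}
```

Using a strict "greater than" comparison is what makes the first of several tied maxima win, matching the example in the text.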

The generated file will contain lines like the following

3,1,1,0,0,1
4,1,1,0,0,1
3,3,2,4,1,4

Now that we've generated the training data, we need to train our classifier. We start by creating the training set.

// Create an empty training set and set the class (category) index
Instances trainingSet = new Instances("Training", rowDefinition, TRAINING_SET_SIZE);
trainingSet.setClassIndex(5);
addInstances(rowDefinition, trainingSet, "category-training-data.csv");

Inside the addInstances method we read the CSV file and create an Instance object holding a training data row for each CSV row.

// Create the instance
Instance trainingRow = new Instance(6);
trainingRow.setValue((Attribute) rowDefinition.elementAt(0), Integer.valueOf(values[0]));
trainingRow.setValue((Attribute) rowDefinition.elementAt(1), Integer.valueOf(values[1]));
trainingRow.setValue((Attribute) rowDefinition.elementAt(2), Integer.valueOf(values[2]));
trainingRow.setValue((Attribute) rowDefinition.elementAt(3), Integer.valueOf(values[3]));
trainingRow.setValue((Attribute) rowDefinition.elementAt(4), Integer.valueOf(values[4]));
trainingRow.setValue((Attribute) rowDefinition.elementAt(5), values[5]);

// add the instance
trainingSet.add(trainingRow);

The next step is to build a classifier using the generated training set.

// Create a naïve bayes classifier
Classifier cModel = new NaiveBayes();
cModel.buildClassifier(trainingSet);

And to put our classification algorithm to the final test, let's say we have an incoming email containing the word dad 5 times, with all other word counts being 0. (Notice that the training data only contained word counts up to 4, meaning that the classifier must extrapolate to determine the correct category, which, naturally, is 1 - Primary.)

// Create the test instance
Instance incomingEmail = new Instance(6);
incomingEmail.setValue((Attribute) rowDefinition.elementAt(0), 5);
incomingEmail.setValue((Attribute) rowDefinition.elementAt(1), 0);
incomingEmail.setValue((Attribute) rowDefinition.elementAt(2), 0);
incomingEmail.setValue((Attribute) rowDefinition.elementAt(3), 0);
incomingEmail.setValue((Attribute) rowDefinition.elementAt(4), 0);
incomingEmail.setDataset(trainingSet); // inherits format rules
double prediction = cModel.classifyInstance(incomingEmail);
System.out.println(trainingSet.classAttribute().value((int) prediction));

The above code will output 1, meaning the Primary category.

A few samples of unknown test data are represented in the table below.

As we can see, the prediction accuracy is never 100%, but by fine-tuning the algorithm parameters and carefully selecting the vector features and training data, we can reach a high correctness rate.

Weka also provides APIs for evaluating your solution. By creating two data sets (the initial training set containing 30% of the permutations, and a test set containing all permutations for word counts from 0 to 4) and using the Evaluation class, we find that training on only 30% of the full set already yields 95% accuracy.

// Test the model
Evaluation eTest = new Evaluation(trainingSet);
eTest.evaluateModel(cModel, testSet);
System.out.println(eTest.pctCorrect());
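For intuition, pctCorrect simply reports the share of correctly classified rows as a percentage; a minimal stand-alone sketch (with made-up predictions) of the same calculation:

```java
// Sketch of what a percent-correct metric computes: the share of test
// rows whose predicted category matches the true one, as a percentage.
// The prediction/actual arrays below are made-up examples.
public class Accuracy {

    public static double pctCorrect(int[] predicted, int[] actual) {
        int hits = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) hits++;
        }
        return 100.0 * hits / predicted.length;
    }

    public static void main(String[] args) {
        int[] predicted = {1, 2, 3, 4, 1};
        int[] actual    = {1, 2, 3, 4, 5};
        System.out.println(pctCorrect(predicted, actual)); // prints 80.0
    }
}
```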

Ending

The above sample code was written using Weka because I feel its APIs are easier to use and understand than those of another popular Java machine learning framework, Apache Mahout. If your use case can tolerate less than 100% prediction accuracy and you have the possibility to construct training sets, the advantages may outweigh the costs. Our tests show that, by choosing the best algorithm for your data, you can obtain training times of under 1 second for 28,000 training rows (5 attributes + category) and similar prediction times for 1,000 test rows.

Machine learning has tremendous potential to be used across all software domains (even mobile apps), and as the above use case shows, you don't have to be a mathematician to use the available tools. Let the learning begin!

References

  1. http://architects.dzone.com/articles/machine-learning-measuring
  2. http://en.wikipedia.org/wiki/Bag-of-words_model
  3. http://www.cs.waikato.ac.nz/ml/weka/
  4. https://github.com/rjmarsan/Weka-for-Android
  5. https://github.com/fgarcialainez/ML4iOS
  6. Introduction to Machine Learning, by Alex Smola, S.V.N. Vishwanathan
  7. Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
