Showing posts with label artificial intelligence.

Friday, October 14, 2016

Applying Auto-encoders to MNIST

This is a companion article to my new book, Practical Machine Learning with H2O, published by O’Reilly. Beyond the sheer, unadulterated pleasure you will get from reading it, I’m also recommending it for readers of this article, because I’m only going to lightly introduce topics such as H2O, MNIST, and even auto-encoders, that are covered in much more depth in the book.

That Light Introduction

H2O is a powerful, scalable and fast machine-learning server/framework, with APIs in R and Python (as well as CoffeeScript, Scala via its Spark interface, and others). It has relatively few machine learning algorithms, but they are generally the ones you would have settled on using anyway; each has been optimized to scale across clusters and big data, and each has many parameters for tuning.
MNIST is a machine learning problem to recognize which of the digits 0 to 9 a set of 784 pixels represents. There are 60,000 training samples, and 10,000 test samples. To avoid inadvertently over-fitting to the test samples, I split the 60K into 50K training data and 10K validation data.
Auto-encoders are the unsupervised version of deep learning (neural nets, if you prefer). The Wikipedia article is a good introduction to the idea. By setting the input_dropout_ratio parameter H2O supports the “Denoising autoencoder” variation, and with hidden_dropout_ratios H2O supports the “Sparse autoencoder” variation.
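The 50K/10K split of the training samples mentioned above can be sketched in plain R like this. It is only an illustration: the seed, the random assignment and the variable names are my assumptions here, and it works on row numbers rather than on an actual H2O frame.

```r
# Sketch: divide 60,000 row numbers into 50,000 training and 10,000 validation rows.
set.seed(123)                    # hypothetical seed, only so the split is repeatable
n <- 60000
shuffled <- sample(n)            # a random permutation of 1..n
train_idx <- shuffled[1:50000]   # rows to train on
valid_idx <- shuffled[50001:n]   # rows held back for validation
```

The 10,000 test samples are left untouched, which is the whole point of carving a validation set out of the training data.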

The Aim

More layers in a supervised deep learning neural network can often give better results on complex problems. But the more layers you have, the harder it can be to train. Auto-encoders to the rescue! You run an auto-encoder on the raw inputs, and it will self-organize them - extract some information from them. You then take the middle hidden layer from the auto-encoder, and use that as the inputs to your supervised learning algorithm. Or possibly you do it again, with another auto-encoder, to extract (theoretically) an even higher-level abstraction.
I decided to try this on the MNIST data.
My initial approach was to treat it as a data compression problem: to see how few hidden neurons, in a single layer, I could get a perfect score with. I.e. this autoencoder had 784 input neurons, N hidden neurons, and 784 output neurons. There is just one hidden layer, so to request, say, a 784x200x784 layout in H2O I just do hidden=200; the input layer is implicit from the data, and the output layer is implicit because I specify autoencoder=TRUE. Unfortunately, even with N=784, I couldn’t get an MSE of 0.0.
(You should be able to see how there is one trivial way for such a network to get the perfect score: for each neuron in the middle layer, exactly one incoming weight should be 1.0 and all the others should be 0.0, and then the same for the outgoing weights. However, also appreciate how hard it would be for training to discover this if all 784 weights leading in and all 784 weights leading out of each neuron started off life as a random number.)
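To make that trivial solution concrete, here is a small plain-R sketch of it. Note the assumption: it uses a linear (identity) activation, which is a simplification, because the models in this article use tanh. So it illustrates the weight pattern only, not an actual H2O model.

```r
set.seed(1)
n <- 784
x <- runif(n)                # stand-in for the pixel values of one sample

perm <- sample(n)            # which input pixel each hidden neuron copies
W_in <- diag(n)[perm, ]      # row i has a single 1.0 (in column perm[i]), the rest 0.0
W_out <- t(W_in)             # matching outgoing weights, which undo the shuffle

hidden <- W_in %*% x         # each hidden neuron holds exactly one pixel
output <- W_out %*% hidden   # each output neuron reads exactly one hidden neuron
mse <- mean((output - x)^2)  # reconstruction error for this sample
```

Every row and column of W_in contains exactly one 1.0, so the output is an exact copy of the input. There are 784! such arrangements, but each is a needle in a haystack of random starting weights.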

Getting Practical

So, under time pressure, I took an “educated guess” approach, and also an “ensemble” approach. I made three autoencoders (one of them being two-step, so four models in total), and used them together. The code to make them, and their outputs, is wrapped up in a couple of R functions, which I’ll show in a moment, but first a look at the parameters of the four models:
  • AE200: This uses a single layer of 200 hidden neurons. I set input_dropout_ratio = 0.3, which means that as each training sample is used for training, a random 30% of its pixels are set to 0. This should make it more robust and less likely to over-fit. I also use L2 regularization, set to 1e-4 (which is fairly high).
  • AE32: This uses just 32 hidden neurons, so is going to be a less perfect representation than AE200. To compensate for that, I use a lower input_dropout_ratio = 0.1, and also lowered L2 regularization to 1e-5.
  • AE768: Not used directly. One hidden neuron per input pixel (almost) means it is more “rephrasing” than “compressing”. I used the same input_dropout_ratio = 0.3 and l2 = 1e-4 settings as AE200. (Among the 784 pixel columns there are a few around the edge that are exactly zero in all training data, so provide no information, and can be thrown away; that is where the 768 came from.)
  • AE128: This was built from the output of AE768. No input dropout, and just a bit of L2 regularization (1e-5).
L2 regularization penalizes large weights (e.g. see https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization). It helps make sure all pixels get considered, rather than allowing the algorithm to over-fit to one particular pixel.
All four models used the tanh activation function, and were given 20 epochs.
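To picture what an input dropout of 0.3 does to a single training sample, here is a rough plain-R sketch. The pixel values are random stand-ins, and blanking exactly 30% of them is my simplification; how H2O chooses the pixels internally is not something this sketch tries to reproduce.

```r
set.seed(42)
pixels <- runif(784)                 # stand-in for one MNIST sample
ratio <- 0.3                         # same value as input_dropout_ratio for AE200

# Pick a random 30% of the pixel positions and blank them out.
dropped <- sample(length(pixels), size = round(ratio * length(pixels)))
noisy <- pixels
noisy[dropped] <- 0

# The auto-encoder is shown `noisy` but is scored on how well it rebuilds `pixels`.
```

Because a different random set of pixels is blanked each time a sample is used, the network cannot rely on any one pixel always being present.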

The Model Generation Code

The following listing shows the R code, for the above model descriptions in H2O, wrapped up in a function that takes two parameters:
  • data is the H2O frame to use for training data
  • x is which columns of that frame to use
create_MNIST_autoencoders <- function(data, x){
  # AE200: 200 hidden neurons, with heavy input dropout and L2
  m_AE200 <- h2o.deeplearning(
    x, training_frame = data,
    hidden = c(200),
    model_id = "AE200",
    autoencoder = T,
    input_dropout_ratio = 0.3,  #Quite high
    l2 = 1e-4,  #Quite high
    activation = "Tanh",
    export_weights_and_biases = T,
    ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
    )

  # AE32: much tighter compression, so lighter dropout and L2
  m_AE32 <- h2o.deeplearning(
    x, training_frame = data,
    hidden = c(32),
    model_id = "AE32",
    autoencoder = T,
    input_dropout_ratio = 0.1,  #Fairly low
    l2 = 1e-5,  #Fairly low
    activation = "Tanh",
    export_weights_and_biases = T,
    ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
    )

  # AE768: first step of the two-step model; its output feeds AE128
  m_AE768 <- h2o.deeplearning(
    x, training_frame = data,
    hidden = c(768),
    model_id = "AE768",
    autoencoder = T,
    input_dropout_ratio = 0.3,  #Quite high
    l2 = 1e-4,  #Quite high
    activation = "Tanh",
    export_weights_and_biases = T,
    ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
    )

  # The 768 hidden-layer outputs of AE768, as a new H2O frame
  f_AE768 <- h2o.deepfeatures(m_AE768, data)

  # AE128: trained on the AE768 features, not on the raw pixels
  m_AE128 <- h2o.deeplearning(
    1:768, training_frame = f_AE768,
    hidden = c(128),
    model_id = "AE128",
    autoencoder = T,
    input_dropout_ratio = 0,  #No dropout
    l2 = 1e-5,   #Just a bit of L2
    activation = "Tanh",
    #export_weights_and_biases = T,
    #ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
    )

  # Order matters: generate_from_MNIST_autoencoders() relies on it
  return(list(m_AE200, m_AE32, m_AE768, m_AE128))
}
Feeding the output of one auto-encoder (AE768 in this case) into another (AE128) is done with h2o.deepfeatures().
These two lines are just for troubleshooting/visualization:
export_weights_and_biases = T,
ignore_const_cols = F,
And, in a sense, this one is too:
train_samples_per_iteration = 0
This says I want it to always score the model’s MSE at the end of every epoch. I did this so I could see the shape of the score history chart, and so get a feel for whether 20 epochs was enough. Normally touching train_samples_per_iteration counts as micro-management, because the default is to intelligently choose when to score based on some targets for time spent training vs. scoring, and some communication overhead targets. (See the explanation in chapter 8 of the book, if you crave more detail.)

Using The Models To Make Pixels

The second helper function is shown next. It returns (a handle to) an H2O frame that has 200 + 32 + 128 columns from the autoencoders, plus any additional columns you specify in columns (which must include at least the answer column).
generate_from_MNIST_autoencoders <- function(models, data, columns){
  stopifnot(length(models) == 4)
  names(models) = c("AE200", "AE32", "AE768", "AE128")

  f_AE200 <- h2o.deepfeatures(models[["AE200"]], data)
  f_AE32 <- h2o.deepfeatures(models[["AE32"]], data)
  f_AE768 <- h2o.deepfeatures(models[["AE768"]], data)
  f_AE128 <- h2o.deepfeatures(models[["AE128"]], f_AE768)

  h2o.cbind(f_AE200, f_AE32, f_AE128, data[,columns] )
  }
Notice that, if you don’t include the original 784 pixel columns in columns, the auto-encoded features will effectively replace the raw pixel data. (This was my intention, but you don’t have to do that.)

Usage Example

I’ll assume you have 785 columns, where the pixel data is in columns 1 to 784, and the answer, 0 to 9, is in column 785. If you have the book, you will be familiar with this convention:
train <- #...50K of training data
valid <- #...10K of validation data
test <- #...10K of test data
x <- 1:784
y <- 785
To use them I then write:
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, y)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, y)
test_ae <- generate_from_MNIST_autoencoders(models, test, y)
m <- h2o.deeplearning(1:360, 361, train_ae, validation_frame = valid_ae)
Here I train a deep learning model with all defaults, but it could be a more complex deep learning model, or it could be a random forest, GBM or any other algorithm supported by H2O.
When using it for predictions, remember to use test_ae, not test (and similarly, in production, any future data has to be put through all four auto-encoder models). So following on from the above, you could evaluate it with:
h2o.performance(m, test_ae)

Usage: extended data

If you are lucky enough to have read the book, you will know I added 113 columns of “extended information” to the MNIST data. Though I got rid of the pixels, I chose to keep the extended columns alongside the auto-encoder generated data.
This is how the above code looks if you are using the extended MNIST data:
x <- 114:897
columns <- c(1:113,898)
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, columns)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, columns)
test_ae <- generate_from_MNIST_autoencoders(models, test, columns)
m <- h2o.deeplearning(1:473, 474, train_ae, validation_frame = valid_ae)
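The column numbers passed to h2o.deeplearning() in both usage examples come from simple arithmetic, which the following plain-R sketch reproduces. The helper name ae_frame_layout is hypothetical (it is not part of H2O or of the functions above), and it assumes the answer is the last of the columns you chose to keep.

```r
# Number of columns contributed by each auto-encoder (AE768 is not used directly).
ae_widths <- c(AE200 = 200, AE32 = 32, AE128 = 128)

# Hypothetical helper: given how many original columns are kept (including the
# answer column), return the predictor columns and the answer column.
ae_frame_layout <- function(n_kept_columns) {
  n_total <- sum(ae_widths) + n_kept_columns
  list(x = 1:(n_total - 1), y = n_total)
}

basic <- ae_frame_layout(1)         # only the answer column is kept
extended <- ae_frame_layout(114)    # 113 extended columns plus the answer
```

So the basic case gives predictors 1:360 with the answer in column 361, and the extended case gives 1:473 with the answer in column 474.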

Summary

Informally, I can tell you that a deep learning model built on 473 auto-encoded/extended columns was significantly better than one built on 897 pixel/extended columns. However, I also increased the amount of training data at the same time (see http://darrendev.blogspot.com/2016/10/applying-rs-imager-library-to-mnist.html ), so I cannot tell you the relative contributions of those two changes.
But, I was pleased with the results. And the “educated guess” approach of choosing three distinct auto-encoder models and combining their outputs also seemed to work well. It might be that just one of the auto-encoder models is carrying all the useful information, and the others could be dropped? That is a good experiment to do (let us know if you do it!), but my hunch is that the “ensemble” approach is what allows the educated guess approach to work.

Wednesday, October 12, 2016

Multinomial Ensemble In H2O

Categorization Ensemble In H2O

An ensemble is the term for using two or more machine learning models together, as a team. There are various types (normally categorized in the literature as boosting, bagging and stacking). In this article I want to look at making teams of high-quality categorizing models. (It handles both multinomial and binomial categorization.)
I will use H2O. Any machine-learning framework that returns probabilities for each of the possible categories can be used, but I will assume a little bit of familiarity with H2O… and therefore I highly recommend my book: Practical Machine Learning with H2O! By the way, I will use R here, though any language supported by H2O can be used.

Getting The Probabilities

Here is the core function (in R). It takes a list of models ¹, and runs predict on each of them. The rest of the code is a bit of manipulation to return a 3D array of just the probabilities (h2o.predict() returns its chosen answer in the “predict” column, so the setdiff() is to get rid of it).
getPredictionProbabilities <- function(models, data){
sapply(models, function(m){
  p <- h2o.predict(m, data)
  as.matrix(p[, setdiff(colnames(p), "predict") ])
  }, simplify = "array")
}

Using Them

getPredictionProbabilities() returns a 3D array:
  • First dimension is nrow(data)
  • Second dimension is the number of possible outcomes (labelled “p1”, “p2”, …)
  • Third dimension is length(models)
  • Value is 0.0 to 1.0.
To make the next explanation a bit less abstract, let’s assume it is MNIST data (where the goal is to look at raw pixels and guess which of 10 digits it is), and that we have made three models, and that we are trying it on 10,000 test samples. Therefore we have a 10000 x 10 x 3 array. Some example slices of it:
  • predictions[1,,] is the predictions for the first sample row; returning a 10 x 3 matrix.
  • predictions[1:5, "p3",] is the probability of each of the first 5 rows being the digit “2”. So, five rows and three columns, one per model. (It is “2”, because “p1” is “0”, “p2” is “1”, etc.).
  • predictions[1:5,,1] are the probabilities of the first model on the first 5 test samples, for each of the 10 categories.
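If you want to play with that array shape without an H2O cluster to hand, here is a toy version in plain R. mockPredict() is a hypothetical stand-in I made up for h2o.predict(): it returns a data frame with a “predict” column plus one probability column per category, filled with random numbers, for 4 samples, 3 categories and 2 "models".

```r
# Hypothetical stand-in for h2o.predict(): random probabilities, not a real model.
mockPredict <- function(seed, nRows, nCats = 3) {
  set.seed(seed)
  raw <- matrix(runif(nRows * nCats), nrow = nRows)
  probs <- raw / rowSums(raw)                  # make each row sum to 1
  colnames(probs) <- paste0("p", seq_len(nCats))
  data.frame(predict = max.col(probs) - 1, probs)
}

seeds <- c(m1 = 11, m2 = 22)                   # one seed per pretend model

# Same shape of code as getPredictionProbabilities()
probabilities <- sapply(seeds, function(s) {
  p <- mockPredict(s, nRows = 4)
  as.matrix(p[, setdiff(colnames(p), "predict")])
}, simplify = "array")

oneSample <- probabilities[1, , ]              # categories x models
oneCategory <- probabilities[1:2, "p3", ]      # samples x models
oneModel <- probabilities[1:2, , 1]            # samples x categories
```

The key detail is simplify = "array": because every model returns a matrix of the same size, sapply() stacks them along a new third dimension instead of flattening them.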
When I just want the predictions of the team, and don’t care about the details, I use this wrapper function (explained in the next section):
predictTeam <- function(models, data){
  probabilities <- getPredictionProbabilities(models, data)
  apply(
    probabilities, 1,
    function(m) which.max(apply(m, 1, sum))
    )
  }

3D Arrays In R

A quick diversion into dealing with multi-dimensional arrays in R. apply(probabilities, 1, FUN) will apply FUN to each row in the first dimension. I.e. FUN will be called 10,000 times, and will be given a 10x3 matrix (m in the above code). I then use apply() on the first dimension of m, calling sum. sum() is called 10 times (once per category) and each time is given a vector of length 3, containing one probability per model.
The last piece of the puzzle is which.max(), which tells us which category has the biggest combined confidence from all models. It returns 1 if category 1 (i.e. “0” if MNIST), 2 if category 2 (“1” if MNIST), etc. This gets returned by the function, and therefore gets returned by the outer apply(), and therefore is what predictTeam() returns.
Aside: Using sum() is equivalent to mean(), but just very slightly faster. E.g. 30 > 27 > 24, and if you divide through by 3, then 10 > 9 > 8.
Your answer column in your training data is a factor, so to convert that answer into the string it represents, use levels(). E.g. if f is the factor representing your answer, and p is what predictTeam() returned, then levels(f)[p] will get the string. (With MNIST, there is a short-cut: p-1 will do it.)
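As a worked example of the above, here is the core of predictTeam() run on a tiny hand-made array (2 samples, 3 categories, 3 models) in plain R, so no H2O is needed. The probabilities are invented purely for illustration, and the rowSums() variant is an alternative I am suggesting, not something used elsewhere in this article.

```r
# One matrix per model: rows are samples, columns are categories.
m1 <- matrix(c(0.6, 0.3, 0.1,
               0.2, 0.5, 0.3), nrow = 2, byrow = TRUE)
m2 <- matrix(c(0.4, 0.5, 0.1,
               0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)
m3 <- matrix(c(0.5, 0.4, 0.1,
               0.2, 0.3, 0.5), nrow = 2, byrow = TRUE)
probabilities <- array(c(m1, m2, m3), dim = c(2, 3, 3))

# Same logic as predictTeam(): sum over models, then take the biggest category.
team <- apply(probabilities, 1, function(m) which.max(apply(m, 1, sum)))

# Alternative: collapse the model dimension in one call.
teamSums <- rowSums(probabilities, dims = 2)       # samples x categories
team2 <- max.col(teamSums, ties.method = "first")

# Convert the category indices back to their labels.
f <- factor(c("0", "1", "2"))
teamLabels <- levels(f)[team]
```

Here team is 1 for the first sample and 3 for the second, i.e. the labels “0” and “2”. Notice that model 1 on its own would have chosen category 2 for the second sample, and model 2 on its own would have chosen category 2 for the first sample; in both cases the team out-votes them.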

Counting Correctness

A common question to ask of an ensemble is how many test samples every member of the team got right, and how many none of the team got right. I’ll look at that in four steps. First up is to ask what each model’s best answer is.³
predictByModel <- apply(probabilities, 1,
 function(sample) apply(sample, 2, which.max)
 )
Assuming the earlier MNIST example, this gives a 3 x 10000 int matrix (where the values are 1 to 10).
Next, we need a vector of the correct answers, and they must be the category index, not the actual value. (In the case of MNIST data it is simply adding 1, so “0” becomes 1, “1” becomes 2, etc.)
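For example, if the correct answers are already in R as a factor, the category index is just the factor's integer code. This is a small sketch with made-up answers; getting the answer column out of an H2O frame and into R is not shown, and the variable names are only for illustration.

```r
# Made-up answers, using the ten MNIST digit labels as the factor levels.
answers <- factor(c("0", "3", "9", "3"), levels = as.character(0:9))

# The integer codes are the category indices (1 to 10).
correctAnswers <- as.integer(answers)

# The MNIST short-cut: the digit value plus one gives the same thing.
viaShortcut <- as.integer(as.character(answers)) + 1L
```

Both give 1, 4, 10, 4. The first form works for any factor; the second only works because the MNIST labels happen to be the numbers 0 to 9.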
eachModelsCorrectness <- apply(predictByModel, 1,
 function(modelPredictions) modelPredictions == correctAnswersTest
 )
This returns a 10000 x 3 logical (boolean) matrix. It tells us if each model got each answer correct. The next step returns a 10000 element vector, which will have 0 if they all got it wrong, 1 if only one member of the team got the correct answer, 2 if two of them, and 3 if all three members of the team got it right.
cmc <- apply(eachModelsCorrectness, 1, sum)
(cmc stands for count of model correctness.) table(cmc) will give you those four numbers. E.g. you might see:
 0    1    2    3
67   59  105 9769
Indicating that there were 67 that none of them got right, 59 samples that only one of the three models managed, 105 that just one model got wrong, and 9769 that all the models could do.
I sometimes find cumsum(table(cmc)) to be more useful:
 0     1     2     3
67   126   231 10000
Either way, if the first number is high, you need stronger model(s). Better team members. If it is low, but the ensemble is performing poorly (i.e. no better than the best individual model), then you need more variety in the models.
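Here are those steps run end-to-end on a toy array in plain R, small enough to check by hand: 4 samples, 3 categories, 3 models. The picks and the correct answers are invented for illustration, arranged so that the four samples have 3, 2, 1 and 0 models correct respectively.

```r
# Which category each model favours for each sample (rows: samples, cols: models).
picks <- matrix(c(1, 1, 1,
                  2, 2, 3,
                  3, 1, 2,
                  1, 2, 2), nrow = 4, byrow = TRUE)

# Build the 3D probability array: 0.8 for the favoured category, 0.1 for the rest.
probabilities <- array(0.1, dim = c(4, 3, 3))
for (s in 1:4) for (m in 1:3) probabilities[s, picks[s, m], m] <- 0.8

correctAnswersTest <- c(1, 2, 3, 3)       # category indices, not labels

# Step 1: each model's best answer (models x samples).
predictByModel <- apply(probabilities, 1,
  function(sample) apply(sample, 2, which.max))

# Step 2: did each model get each sample right? (samples x models)
eachModelsCorrectness <- apply(predictByModel, 1,
  function(modelPredictions) modelPredictions == correctAnswersTest)

# Step 3: how many models were right, per sample.
cmc <- apply(eachModelsCorrectness, 1, sum)

# Step 4: summarise.
counts <- table(cmc)
running <- cumsum(counts)

# The samples that no model got right.
noneRight <- which(cmc == 0)
```

cmc comes out as 3, 2, 1, 0, so table(cmc) has a count of one in each of its four columns, and noneRight is sample 4.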

Further Ideas

By slicing and dicing the 3D probability array, you can do different things. You might want to poke into what those difficult 67 samples were, the ones that none of your models can get. Another idea, which I will deal with in a future blog post, is how you can use the probabilities to flag low-confidence predictions that need more investigation. (You can use this with an ensemble, or just with a single model.)
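As a taste of the low-confidence idea, here is one possible way to do it, sketched in plain R on a toy array (3 samples, 2 categories, 2 models). Treat it as an assumption-laden illustration: using the team's mean probability as the confidence, and the threshold of 0.6, are both choices I made up for this example.

```r
# Toy probabilities; array() fills the first dimension (samples) fastest.
probabilities <- array(
  c(0.9, 0.55, 0.2,   0.1, 0.45, 0.8,    # model 1: category 1, then category 2
    0.8, 0.45, 0.3,   0.2, 0.55, 0.7),   # model 2: category 1, then category 2
  dim = c(3, 2, 2))

# Average over the models to get the team's probability per sample and category.
teamMean <- apply(probabilities, c(1, 2), mean)

# Confidence is how strongly the team backs its favourite category.
confidence <- apply(teamMean, 1, max)

threshold <- 0.6                          # hypothetical cut-off
lowConfidence <- which(confidence < threshold)
```

Only the second sample gets flagged: the two models disagree about it, so the team's best probability is just 0.5.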

Performance

In my experience, this kind of team will usually give better performance than any individual model in the team. It gives the biggest jump in strength when:
  • The models are very distinct.
  • The models are fairly close in strength.
  • Just a few models. Making 4 or 5 models, evaluating them on the valid data, and using the strongest 3 for the ensemble, works reasonably well. However, if you manage to make near-strength and very distinct models, the more the merrier.
For instance, three deep learning models, all built with the same parameters, each getting about 2% error individually, might manage 1.8% error as a team. But two deep learning models, one random forest, one GBM, all getting 2% error, might give 1.5% or even lower, when used together. Conversely, two very similar deep learning models that have 2% error combined with a GBM with 3% error and a random forest getting 4% error might do poorly, possibly even worse than your single best model.

Summary

This approach to making an ensemble for categorization problems is straightforward and usually very effective. Because it works with models of different types it at first glance seems closer to stacking, but really it is more like the bagging of random forest (except using probabilities, instead of mode⁴ as in random forest).

Footnotes

[1]: All the models in models must all be modelling the same thing, i.e. they must each have been trained on data with the exact same columns.²
Normally each model will have been trained on the exact same data, but the precise phrasing is deliberate: interesting ensembles can be built with models trained on different subsets of your data. (By the way, subsets and ensemble are the two fundamental ideas behind the random forest algorithm.)
[2]: You can take this even further and only require they are all trained with the same response variable, not even the same columns. However, you’ll need to adapt getPredictionProbabilities() in that case, because each call to h2o.predict() will then need to be given a different data object.
[3]: I do it this way because I already had probabilities made. But if it was all that you were interested in, this was the long way to get it. The direct way was to do the exact opposite of getPredictionProbabilities(): keep the “predict” column, and get rid of the other columns!
[4]: I did experiment with simple one vote per model (aka mode), instead of summing probabilities, and generally got worse results. But it is worth bearing that kind of ensemble in mind. If you do more rigorous experiments comparing the two methods, I’d be very interested to hear what kind of problems, if any, that voting (mode) is superior on.

Thursday, June 4, 2009

The Shodan Go Bet

I am a natural optimist. A hopeless optimist. I spend hours battling against myself with carefully constructed cynicism, but things still get past my guard. And one day, back in 1997, I made the mistake of putting my money where my mouth is. Let me tell you where it all started...

http://dcook.org/gobet/

(About a bet I made with John Tromp, in 1997, that a computer could beat him at the game of go before the year 2011; the above URL is to publicize this bet and also has places where you can vote with your opinion and leave comments. If you enjoy it please do link to it, blog about it, tell your friends, etc.)