Friday, March 31, 2017

The Seven Day A Year Bug

I’ll cut straight to the chase: when you use d.setMonth(m - 1) in JavaScript, always set the optional second parameter.

What’s that, you didn’t know there was one? Neither did I until earlier today. It allows you to set the date. Cute, I thought at the time, minor time-saver, but hardly worth complicating an API for.

Ooh, how wrong I was. Let me take you back to when it happened. Friday, March 31st….

After a long coding session, I did a check-in. And then ran all unit tests. That is strange, three failing, but in code I hadn’t touched all day. I peer at the code, but it looks correct - it was to do with dates, specifically months, and I was correctly subtracting 1.

Aside: JavaScript dates inherit C’s approach of counting months from 0. In the first draft of this blog post I used a more judgemental phrase than “approach”. But to be fair, it was a 1970s design decision, and the world was different back then. Google “1970s men fashion”.

So, back to the test failures. I start using “git checkout xxxx” to go back to earlier versions, to see exactly when it broke. Running all tests every time. I know something fishy is going on, by the time I’ve gone back 10 days and the tests still fail. I am fairly sure I ran all tests yesterday, and I am certain it hasn’t been 10 days.

Timezones?! Unlikely, but we did put the clocks back last weekend. But a quick test refutes that. (TZ=XXX mocha . will run your unit tests in timezone XXX.)

So, out of ideas, I litter the failing code with console.log lines, to find out what is going on.

Here is what is happening. I initialize a Date object to the current date (to set the current year), then call setMonth(). I don’t use the day, so don’t explicitly set it. I was calling setMonth(8), expecting to see “September”, but the unit test was being given “October”. Where it gets interesting is that the default date today is March 31st. In other words, when I set month to September the date object becomes “September 31st”, which isn’t allowed. So it automatically changes it to October 1st.

You hopefully see where the title of this piece comes from now? If I was setting a date in February I would have discovered the bug two days earlier, and if my unit test had chose October instead of September, the bug would never have been detected. If I’d thought, “ah, I’ll run them Monday”, the bug would not have been discovered until someone used the code in production on May 31st. I’d have processed their bug report on June 1st and told them, “can’t reproduce it”. And they’d have gone, “Oh, you’re right, neither can I now.”

To conclude with a happy ending, I changed all occurrences of d.setMonth(m - 1) into d.setMonth(m-1, 1), and the test failures all went away. I also changed all occurrences of d.setMonth(m-1);d.setDate(v) (where v is the day of the month) into: d.setMonth(m-1, v) not because it is shorter and I can impress people with my knowledge of JavaScript API calls, but because two separate calls was a bug that I simply didn’t have a unit test for.

But writing that unit test can wait until Monday.

Friday, February 24, 2017

NorDevCon 2017: code samples

This is the sample code, in Python and R, for the talk I gave yesterday at the NorDevCon 2017, pre-meeting talks.

To install h2o for Python, from the commandline do:

  pip install h2o

To install it in R, from inside an R session do:

install.packages("h2o")

Either way, they should get all the dependencies that you need.

The data was the “train.csv.zip” file found at Kaggle (You need to sign-up to Kaggle to be allowed to download it.) The following scripts assume you have unzipped it and put train.csv in the same directory as the scripts.

That Kaggle URL is also where the description of fields is to be found.

Here is how to prepare H2O, and the data, in Python:

import h2o

h2o.init()

data = h2o.import_file("train.csv")

data["Ht"].cor(data["Wt"])

factorsList = ['Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41']

data[factorsList] = data[factorsList].asfactor()

# Split off a random 10% to use to evaluate
# the models we build.
train, test = data.split_frame([0.9], seed=123)

# Sanity check
train.ncol
test.ncol
train.nrow
test.nrow

# What the data looks like:
train.head(rows=1)
test.head(rows=1)

Here is the very quick deep learning model:

m_DL = h2o.estimators.H2ODeepLearningEstimator(epochs=1)
m_DL.train(x, y, train)

(I made the powerful one-liner claim in the talk but, as you can see, in Python they are two-liners.)

Then to evaluate that model:

m_DL

m_DL.predict( test[1, x] )  #Ask prediction about first test record

m_DL.predict( test[range(1,6), x] ).cbind(test[range(1,6), y] )  #Compare result for first 6 records

m_DL.model_performance(test) #Average performance on all 6060 test records

m_DL.model_performance(train)  #For comparison: the performance on the data it was trained on

Here is the default GBM model:

m_GBM = h2o.estimators.H2OGradientBoostingEstimator()
m_GBM.train(x, y, train)

m_GBM.model_performance(test)

Then here is the tuned GBM model - basically it is all about giving it more trees to play with:

m_GBM_best = h2o.estimators.H2OGradientBoostingEstimator(
    sample_rate=0.95,
    ntrees=200,
    stopping_tolerance=0,stopping_rounds=4,stopping_metric="MSE"
    )
m_GBM_best.train(x, y, train, validation_frame=test)

m_GBM_best.model_performance(test)

And here is the tuned deep learning model:

m_DL_best = h2o.estimators.H2ODeepLearningEstimator(
    activation="RectifierWithDropout",
    hidden=[300,300,300],
    l1=1e-5,
    l2=0,
    input_dropout_ratio=0.2,
    hidden_dropout_ratios=[0.4, 0.4, 0.4],
    epochs=1000,
    stopping_tolerance=0,stopping_rounds=4,stopping_metric="MSE"
    )
m_DL_best.train(x, y, train, validation_frame=test)

m_DL_best.model_performance(test)

And here is the R code, that does the same as the above:

library(h2o)

h2o.init(nthreads=-1)


data = h2o.importFile("train.csv")
# View it on Flow

h2o.cor(data$Wt, data$BMI)


factorsList = c('Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41')
data[,factorsList] <- as.factor(data[,factorsList])


splits <- h2o.splitFrame(data, 0.9, seed=123)
train <- h2o.assign(splits[[1]], "train")  #90% for training
test <- h2o.assign(splits[[2]], "test")  #10% to evaluate with

ncol(train)   #128
ncol(test)    #128

nrow(train)  #53321
nrow(test)  #6060

t(head(train, 1))
t( as.matrix(test[1,1:127]) )

m_DL <- h2o.deeplearning(2:127, 128, train)
m_DL <- h2o.deeplearning(2:127, 128, train, epochs = 1)  #7 to 9 secs
#system.time( m_DL <- h2o.deeplearning(2:127, 128, train) )  #42.5 secs

h2o.predict(m_DL, test[1,2:127])

h2o.cbind(
  h2o.predict(m_DL, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 7.402184        8
# 2 5.414277        1
# 3 6.946732        8
# 4 6.542647        1
# 5 2.596471        6
# 6 6.224758        5


h2o.performance(m_DL, test)
# H2ORegressionMetrics: deeplearning
# 
# MSE:  3.770782
# RMSE:  1.94185
# MAE:  1.444321
# RMSLE:  0.4248774
# Mean Residual Deviance :  3.770782




######
m_GBM <- h2o.gbm(2:127, 128, train)  #7.3s

h2o.predict(m_GBM, test[1, 2:127])

h2o.cbind(
  h2o.predict(m_GBM, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 6.934054        8
# 2 5.231893        1
# 3 7.135411        8
# 4 5.906502        1
# 5 3.056508        6
# 6 5.049540        5

h2o.performance(m_GBM, test)

# MSE:  3.599897
# RMSE:  1.89734
# MAE:  1.433456
# RMSLE:  0.4225507
# Mean Residual Deviance :  3.599897


##########

#Takes 20-30secs
m_GBM_best = h2o.gbm(
  2:127, 128, train,

  sample_rate = 0.95,
  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",
  ntrees = 200

  )

#h2o.performance gave MSE of 3.473637856428858

plot(m_GBM_best)

h2o.scoreHistory(m_GBM_best)


####################

# 3-4 minutes  (204secs)
m_DL_best <- h2o.deeplearning(
  2:127, 128, train,
  epochs = 1000,

  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",

  activation = "RectifierWithDropout",
  hidden = c(300, 300, 300),
  l1 = 1e-5,
  l2 = 0,
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.4, 0.4, 0.4)
  )  


h2o.performance(m_DL_best, test)

# MSE:  3.609624
# RMSE:  1.899901
# MAE:  1.444417
# RMSLE:  0.4164153
# Mean Residual Deviance :  3.609624

Finally, and not surprisingly, I can highly recommend my own book, if you would like to learn more about how to use H2O. Examples in the book are on three different data sets, and go into more depth about the different machine learning algorithms that H2O offers, as well as some ideas about how to tune:

From O’Reilly here:
http://shop.oreilly.com/product/0636920053170.do

From Amazon UK:
https://www.amazon.co.uk/Practical-Machine-Learning-Darren-Cook/dp/149196460X

(And other good bookshops, of course!)

Thanks!

Friday, October 14, 2016

Applying Auto-encoders to MNIST

Applying Auto-encoders to MNIST

This is a companion article to my new book, Practical Machine Learning with H2O, published by O’Reilly. Beyond the sheer, unadulterated pleasure you will get from reading it, I’m also recommending it for readers of this article, because I’m only going to lightly introduce topics such as H2O, MNIST, and even auto-encoders, that are covered in much more depth in the book.

That Light Introduction

H2O is a powerful, scalable and fast machine-learning server/framework, with APIs in R and Python (as well as Coffeescript, Scala via its Spark interface, and others). It has relatively few machine learning algorithms, but they are generally the ones you would have settled on using anyway, and each has been optimized to scale across clusters and big data, and each has many parameters for tuning.
MNIST is a machine learning problem to recognize which of the digits 0 to 9 a set of 784 pixels represents. There are 60,000 training samples, and 10,000 test samples. To avoid inadvertently over-fitting to the test samples, I split the 60K into 50K training data and 10K validation data.
Auto-encoders are the unsupervised version of deep-learning (neural nets, if you prefer). The Wikipedia article is a good introduction to the idea. By setting the input_dropout_ratio parameter H2O supports the “Denoising autoencoder” variation, and withhidden_dropout_ratios H2O supports the “Sparse autoencoder” variation.

The Aim

More layers in a supervised deep learning neural network can often give better results on complex problems. But the more layers you have, the harder it can be to train. Auto-encoders to the rescue! You run an auto-encoder on the raw inputs, and it will self-organize them - extract some information from them. You then take the middle hidden layer from the auto-encoder, and use that as the inputs to your supervised learning algorithm. Or possibly you do it again, with another auto-encoder, to extract (theoretically) an even higher-level abstraction.
I decided to try this on the MNIST data.
My initial approach was to treat it as a data compression problem: to see how few hidden neurons, in a single layer, I could get a perfect score with. I.e. this autoencoder had 784 input neurons, N hidden neurons, and 784 output neurons. Just one hidden layer, so to request, say, a 784x200x784 layout, in H2O I just do hidden=200, and the input layer is implicit from the data, and the output it implict because I specify autoencoder=TRUE. Unfortunately, even with N=784, I couldn’t get an MSE of 0.0.
(You should be able to see how there is one trivial way for such a network to get the perfect score: for each neuron in the middle layer, exactly one incoming weight should be 1.0 and all the others should be 0.0, and then the same for the outgoing weights. However, also appreciate how hard it would be for training to discover this if all 784 weights leading in and all 784 weights leading out of each neuron started off life as a random number.)

Getting Practical

So, under time pressure, I took an “educated guess” approach, and also an “ensemble” approach. I made three autoencoders (one of them being two-step, so four models in total), and used them together. The code to make them, and their outputs, is wrapped up in a couple of R functions, which I’ll show in a moment, but first a look at the parameters of the four models:
AE200: This uses a single layer of 200 hidden neurons. I set input_dropout_ratio = 0.3, which means as each training sample was used for training it would be setting a random 30% of the pixels to 0. This should make it more robust, less likely to over-fit. I also use L2 regularization, set to 1e-4 (which is fairly high).
AE32: This uses just 32 hidden neurons, so is going to be less perfect representation than AE200. To compensate for that, I use a lower input_dropout_ratio = 0.1, and also lowered L2 regularization to 1e-5.
AE768: Not used directly. One hidden neuron per input pixel (almost), means it is more “rephrasing” rather than “compressing”. I used the same input_dropout_ratio = 0.3 and l2 = 1e-4 settings as AE200. (Among the 784 pixels columns there are a few around the edge that are exactly zero in all training data, so provide no information, and can be thrown away; that is where the 768 came from.)
AE128: This was built from the output of AE768. No input dropout, and just a bit of L2 regularization (1e-5).
L2 regularization penalizes large weights (e.g. see https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization). It helps make sure all pixels get considered, rather than allowing the algorithm to over-fit to one particular pixel.
All four models used the tanh activation function, and were given 20 epochs.

The Model Generation Code

The following listing shows the R code, for the above model descriptions in H2O, wrapped up in a function that takes two parameters:
  • data is the H2O frame to use for training data
  • x is which columns of that frame to use
_
create_MNIST_autoencoders <- function(data, x){
m_AE200 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(200),
  model_id = "AE200",
  autoencoder=T,
  input_dropout_ratio = 0.3,  #Quite high
  l2 = 1e-4,  #Quite high
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

m_AE32 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(32),
  model_id = "AE32",
  autoencoder = T,
  input_dropout_ratio = 0.1,  #Fairly low
  l2 = 1e-5,  #Fairly low
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

m_AE768 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(768),
  model_id = "AE768",
  autoencoder = T,
  input_dropout_ratio = 0.3,  #Quite high
  l2 = 1e-4,  #Quite high
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

f_AE768 = h2o.deepfeatures(m_AE768, data)

m_AE128 <- h2o.deeplearning(
  1:768, training_frame=f_AE768,
  hidden = c(128),
  model_id = "AE128",
  autoencoder = T,
  input_dropout_ratio = 0,  #No dropout
  l2 = 1e-5,   #Just a bit of L2
  activation = "Tanh",
  #export_weights_and_biases = T,
  #ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

return(list(m_AE200, m_AE32, m_AE768, m_AE128))
}
Feeding the output of one auto-encoder (AE768 in this case) into another (AE128) is done with h2o.deepfeatures().
These two lines are just for troubleshooting/visualization:
export_weights_and_biases = T,
ignore_const_cols = F,
And, in a sense, this one is too:
train_samples_per_iteration = 0
This says I want it to always score the model’s MSE at the end of every epoch. I did this so I could see the shape of the score history chart, and so get a feel for if 20 epochs was enough. Normally touching train_samples_per_iteration counts as micro-management, because the default is to intelligently choose when to score based on some targets for time spent training vs. scoring, and some communication overhead targets. (See the explanation in chapter 8 of the book, if you crave more detail.)

Using The Models To Make Pixels

The second helper function is shown next. It returns (a handle to) an H20 frame that has 200 + 32 + 128 columns from the autoencoders, plus any additional columns you specify in columns (which must include at least the answer column).
generate_from_MNIST_autoencoders <- function(models, data, columns){
  stopifnot(length(models) == 4)
  names(models) = c("AE200", "AE32", "AE768", "AE128")

  f_AE200 <- h2o.deepfeatures(models[["AE200"]], data)
  f_AE32 <- h2o.deepfeatures(models[["AE32"]], data)
  f_AE768 <- h2o.deepfeatures(models[["AE768"]], data)
  f_AE128 <- h2o.deepfeatures(models[["AE128"]], f_AE768)

  h2o.cbind(f_AE200, f_AE32, f_AE128, data[,columns] )
  }
Notice how, if you don’t include the original 768 pixel columns in columns that the auto-encoded features will effectively replace the raw pixel data. (This was my intention, but you don’t have to do that.)

Usage Example

I’ll assume you have 785 columns, where the pixel data is in columns 1 to 784, and the answer, 0 to 9, is in column 785. If you have the book, you will be familiar with this convention:
train <- #...50K of training data
valid <- #...10K of validation data
test <- #...10K of test data
x <- 1:784
y <- 785
To use them I then write:
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, y)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, y)
test_ae <- generate_from_MNIST_autoencoders(models, test, y)
m <- h2o.deeplearning(1:360, 361, train_ae, validation_frame = valid_ae)
Here I train a deep learning model with all defaults, but it could be a more complex deep learning model, or it could be a random forest, GBM or any other algorithm supported by H2O.
When using it for predictions, remember to use test_ae, not test (and similarly, in production, any future data has to be put through all four auto-encoder models). So following on from the above, you could evaluate it with:
h2o.performance(m, test_ae)

Usage: extended data

If you are lucky enough to have read the book, you will know I added 113 columns of “extended information” to the MNIST data. Though I got rid of the pixels, I chose to keep the extended columns alongside the auto-encoder generated data.
This is how the above code looks if you are using the extended MNIST data:
x <- 114:897
columns <- c(1:113,898)
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, columns)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, columns)
test_ae <- generate_from_MNIST_autoencoders(models, test, columns)
m <- h2o.deeplearning(1:473, 474, train_ae, validation_frame = valid_ae)

Summary

Informally, I can tell you that a deep learning model built on 473 auto-encoded/extended columns was significantly better than one built on 897 pixel/extended columns. However, I also increased the amount of training data at the same time (see http://darrendev.blogspot.com/2016/10/applying-rs-imager-library-to-mnist.html ), so I cannot tell you the relative contributions of those two changes.
But, I was pleased with the results. And the “educated guess” approach of choosing three distinct auto-encoder models and combining their outputs also seemed to work well. It might be that just one of the auto-encoder models is carrying all the useful information, and the others could be dropped? That is a good experiment to do (let us know if you do it!), but my hunch is that the “ensemble” approach is what allows the educated guess approach to work.

Thursday, October 13, 2016

Applying R's imager library to MNIST digits

Applying R’s imager library to MNIST digits

Introduction

For machine learning, when it comes to training data, the more the better! Well, there are a whole bunch of disclaimers on that statement, but as long as the data is representative of the type of data the machine learning model will be used on in production, doubling the amount of unique training data will be more useful than doubling the number of epochs (if deep learning) or trees (if random forest, GBM, etc.).
I’m going to concentrate in this article on the MNIST data set, and look at how to use R’s imager library to increase the number of training samples. I take a deep look at MNIST in my new book, Practical Machine Learning with H2O, which (not surprisingly) I highly recommend. But, briefly, the MNIST data is handwritten digits, and the challenge is to identify which of the 10 digits, 0 to 9, each one is.
There are 60,000 training sample, plus 10,000 test samples (and everyone uses the same test set - so watch out for the inadvertent over-fitting in published papers). The samples are 28x28 pixels, each a 0 to 255 greyscale value. I usually split the 60,000 samples into 50K for training, 10K for validation (to make sure I am not over-fitting on the test data).

imager and parallel

I used this R library for dealing with images: http://dahtah.github.io/imager/ which is based on a C++ library called CImg. Most function calls are just wrappers around the C++ code, which means they are fairly quick. It is well-documented with a good starter tutorial.
I used version 0.20 for all my development. I have just seen that 0.30 is now out, and in particular offers native parallelization. This is great! Out of scope for this article, but I used imager in conjunction with R’s parallel functions, and found the latter quite clunky, with time spent copying data structures limiting scalability. On the other hand, the docs say these new parallel options work best on large images, and the 28x28 MNIST images are certainly not that. So maybe I am still stuck using parApply() and friends.

The Approach

In the 20,000 unseen samples (10K valid, 10K test), there are often examples of bad handwriting that we don’t get to see in our 50,000 training samples. Therefore I am most interested in generated bad handwriting samples, not nice neat ones.
I spent a lot of time experimenting with what imager can produce, and settled on generating these effects, each with a random element:
  • rotate
  • warp (make it “scruffier”)
  • shift (move it 1 pixel up, down, left or right)
  • bold (make it fatter - my code)
  • dilate (make it fatter - cimg code)
  • erode (make it thinner)
  • erodedilate (one or the other)
  • scratches (add lines)
  • blotches (remove blobs)
I also defined “all” and “all2” which combined most of them.
In the full code I create the image in an imager (cimg) object called im, then copy it to im2. Each subsequent operation is performed on im2. im is left unchanged, but can be referred to for the initial state.

Rotate

The code to rotate comes in two parts. Here is the first part:
needSharpen = FALSE
angle = rnorm(1, 0, 8)
if(angle < -1.0 || angle > 1.0){
  im2 = imrotate(im2, angle, interpolation = 2)
  nPix = (width(im2) - width(im)) / 2
  im2 = crop.borders(im2 , nPix = nPix)
  needSharpen = TRUE
  }
The use of rnorm(sd=8), means 68% of time it’ll be +/-8°, only 5% of the time more than +/-16°. If my goal was simply more training samples, I’d have perhaps used as smaller sd, and/or clipped to a maximum rotation of 10°. But, as mentioned earlier, I wanted more scruffy handwriting. The if() block is a CPU optimization - if rotation is less than 1° don’t bother doing anything.
The imrotate() command takes the current im2 and replaces it with one that is rotated. This creates a larger image. To see what is going on, try running this (self-contained) script (see the inline comments for what is happening):
library(imager)

# Make 28x28 "mid-grey" square
im <- as.cimg(rep(128, 28*28), x = 28, y = 28)

#Prepare to plot side-by-side
par(mfrow = c(1,2))

#Show initial square 28x28
plot(im)

#Show rotated square, 34x34
plot(imrotate(im, angle = 16))
The output is like this:
You can see the image has become larger, to contain the rotated version. (That image also shows how imager’s plot command will scale the colours based on the range, and that my choice of 128 could have been any non-zero number. When there is only a single value (128), it chooses a grey. After rotating we have 0 for the background, 128 for the square, so it does 0 as black, 128 as white.)
For rotating MNIST digits:
  • I want to keep the 28x28 size
  • All the interesting content is in the middle, so clipping is fine.
So I call crop.borders(), which takes an argument nPix saying how many pixels to remove on each side. If it has grown from 28 to 34 pixels square, nPix will be 3.

Feeling A Bit Vague…

Here is what one of the MNIST digits looks like rotated 30° at a time, 11 times.




In a perfect world, 12 rotations would give you exactly the image you started with. But you can see the effect of each rotation is to blur it slightly. If we did another lap, even your clever mammalian brain would no longer recognize it as a 4.
The author of the imager library provided a match.hist() function (see it, and the surrounding discussion, here: https://github.com/dahtah/imager/issues/17 ) which does a good (but not perfect) job. Here are the histograms of the image before rotation, after rotation, and then after match.hist:




You can judge the success from looking at the image on the right, or by seeing how the bars on the rightmost histogram match those of the leftmost one. (Yes, even though the bumps are very small, their height, and where they are, really matter!)

You better sharpen up…

You would have noticed the earlier rotate code set needSharpen to true. That is used by the following code. Some of the time it uses the imager library’s imsharpen(), some of the time match.hist(), and some of the time a hack I wrote to make dim pixels dimmer, and bright pixels brighter.
if(needSharpen){
  if(runif(1) < 0.8){
    im2 <- imsharpen(im2, amplitude = 55)
    }
  if(runif(1) < 0.3){
    im2 <- ifelse(im2 < 128, im2-16, im2)
    im2 <- ifelse(im2 < 0, 0, im2)
    im2 <- ifelse(im2 > 200, im2+8, im2)
    im2 <- ifelse(im2 > 150, im2+8, im2)
    im2 <- ifelse(im2 > 100, im2+8, im2)
    im2 <- ifelse(im2 > 255, 255, im2)
  }else{
    im2 <- match.hist(im2, im)
    }
  }

The Others

The other image modifications, listed earlier, use imwarp(), imshift(), pmax() with imshift() (for a bold effect), dilate_square(), and erode_square(). The blotches and scratches were done by putting random noise on an image, then using pmax() or pmin() to combine them.
If there is interest I can write another article going into the details.

Timings

On a 2.8GHz single-core, I recorded these timings to process 60,000 28x28 images. (It was a 36-core 8xlarge EC2 machine, but my R script, and imager (at the time), only used one thread.)
  • 304s to run “bold”.
  • 296s to run “shift”
  • 417s for warp
  • 447s for rotate
  • 517s to 539s for all and all2
warp and rotate need to do the sharpen step, which is why they take longer.

Summary

I made 20 files, so over 95% of my training data was generated. As you will discover if you read the book, this generated data gave a very useful boost in model strength, though of course dramatically increased learning-time due to having 1.2 million training rows instead of 50,000. An interesting property was that it found the sampled data harder to learn: I got lower error rates on the unseen valid and test data sets than on the seen training data. This is a consequence of my deliberate decision to bias towards noisy data and scruffy handwriting.
Generating additional training data is a good way to prevent over-fitting, but generating data that is still representative is always a challenge. Normally you’d check mean, sd, the distribution, etc. to see how you’ve done, and usually your only technique is to jitter - add a bit of random noise. With images you have a rich set of geometric transformations to choose from, and you often use your eyes to see how it has done. Though, as we saw from the image histograms, there are some objective techniques available too.