Showing posts with label R. Show all posts
Showing posts with label R. Show all posts

Wednesday, December 6, 2017

Word Embeddings, NLP and I18N in H2O

Word embeddings can be thought of as a dimension-reduction tool, needing a sequence of tokens to learn from. They are really that generic, but I’ve only ever heard of them used for languages; i.e. the sequences are sentences, the tokens are words (or compound words, or n-grams, or morphemes).
This blog post is for code I presented recently on how to use the H2O implementation of word embeddings, aka word2vec. The main thing being demonstrated is that they apply equally well for any language, but you may need some language-specific tokenization, and other data engineering, first.
Here is the preparation code, in R; bring in H2O, and define a couple of helper functions.
library(h2o)

h2o.init(nthreads = -1)


show <- function(w, v){
  x = as.data.frame(h2o.cbind(w, v))
  x = unique(x)
  plot(x[,2:3], pch=16, type="n")
  text(x[,2:3], x[,1], cex = 2.2)
}


reduceShow <- function(w,v){
m <- h2o.prcomp(v, 1:ncol(v), k = 2, impute_missing=T)
p <- h2o.predict(m, v)
show(w, p)
}
Then, I define an artificial corpus, and try with word embedding dimensions of 2, 4 and 9. For dimensions above 2, reduceShow() is using PCA to just show the first two dimensions.
eg1 = c(
  "I like to drive a blue car",
  "I like to drive a red car",
  "I like to drive a green car",
  "I like to drive a blue lorry",
  "I like to drive a yellow lorry",
  "I like to drive a brown lorry",
  "I like to drive a green lorry",
  "I like to drive a red Ferrari",
  "I like to drive a blue Mercedes"
)

eg1.words <- h2o.tokenize(
  as.character(as.h2o(eg1)), "\\\\W+")

head(eg1.words,12)

eg1.wordsNoNA <- eg1.words[!is.na(eg1.words),]

eg1.wv <- h2o.word2vec(eg1.words,
                       min_word_freq = 1,
                       vec_size = 2)

eg1.vectors = h2o.transform(eg1.wv,
                            eg1.wordsNoNA,
                            "NONE")
show(eg1.wordsNoNA, eg1.vectors)


eg1.wv4 <- h2o.word2vec(eg1.words,
                        min_word_freq = 1,
                        vec_size = 4)

eg1.vectors4 = h2o.transform(eg1.wv4,
                             eg1.wordsNoNA,
                             "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors4)

eg1.wv9 <- h2o.word2vec(eg1.words,
                        min_word_freq = 1,
                        vec_size = 9,
                        epochs = 50 * 9)

eg1.vectors9 = h2o.transform(eg1.wv9,
                             eg1.wordsNoNA,
                             "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors9)
Those results are fairly poor, as we only have 9 sentences; we can do better by sampling those 9 sentences 1000 times. I.e. more data, even if it is exactly the same data, is better!
eg2 = sample(eg1, size = 1000, replace = T)

#(rest of code exactly the same, just changing eg1 to eg2)
What about Japanese? Here are the same 9 sentences (well, almost “This is a …” each time, rather than “I drive a …”), but hand-tokenized in a realistic way (in particular の is a separate token). I’ve gone straight to having 1000 sentences, as we know that helps:
  # これは青い車です。
  # これは赤い車です。
  # これは緑の車です。
  # これは青いトラックです。
  # これは黄色いトラックです。
  # これは茶色のトラックです。
  # これは緑のトラックです。
  # これは赤いフェラーリです。
  # これは青いメルセデスです。

ja1 = c(  #Pre-tokenized
  "これ","は","青い","車","です","。",NA,
  "これ","は","赤い","車","です","。",NA,
  "これ","は","緑","の","車","です","。",NA,  # ***
  "これ","は","青い","トラック","です","。",NA,
  "これ","は","黄色い","トラック","です","。",NA,
  "これ","は","茶色","の","トラック","です","。",NA,  # ***
  "これ","は","緑","の","トラック","です","。",NA,  # ***
  "これ","は","赤い","フェラーリ","です","。",NA,
  "これ","は","青い","メルセデス","です","。",NA
  )
ja2 = rep(ja1, times = 120)
# nrow(ja2) is 7920 tokens; representing 1080 sentences.
The code to try it is exactly as with the English, except we just import ja2 and don’t run it through h2o.tokenize().
ja2.words = as.character(as.h2o(ja2))
head(ja2.words, 12)

ja2.wordsNoNA <- ja2.words[!is.na(ja2.words),]

ja2.wv2 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 2,
                        epochs = 20)
ja2.vectors2 = h2o.transform(ja2.wv2,
                             ja2.wordsNoNA,
                             "NONE")
show(ja2.wordsNoNA, ja2.vectors2)

ja2.wv4 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 4,
                        epochs = 20)
ja2.vectors4 = h2o.transform(ja2.wv4,
                             ja2.wordsNoNA,
                             "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors4)


ja2.wv9 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 9,
                        epochs = 50)
ja2.vectors9 = h2o.transform(ja2.wv9,
                             ja2.wordsNoNA,
                             "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors9)
In Python? I’ll just quickly show the key changes (there are full word embedding examples in Python, for H2O, floating around, e.g. https://github.com/h2oai/h2o-meetups/blob/master/2017_11_14_NLP_H2O/Amazon%20Reviews.ipynb )
To bring the data in and tokenize it:
eg1 = [...]
sentences = h2o.H2OFrame(eg1).ascharacter()
eg1_words = sentences.tokenize("\\W+")
eg1_words.head()
Then to make the embeddings:
from h2o.estimators.word2vec import H2OWord2vecEstimator
eg1_wv = H2OWord2vecEstimator(vec_size = 2, min_word_freq = 1)
eg1_wv.train(eg1_words)
And to get the vectors for visualization:
eg1_vectors = eg1_wv.transform(eg1_words_no_NA, "NONE")
(The Python code is untested as I type this - if I have a typo, let me know in the comments or at darren at dcook dot org, and I will fix it.)

Friday, October 14, 2016

Applying Auto-encoders to MNIST

Applying Auto-encoders to MNIST

This is a companion article to my new book, Practical Machine Learning with H2O, published by O’Reilly. Beyond the sheer, unadulterated pleasure you will get from reading it, I’m also recommending it for readers of this article, because I’m only going to lightly introduce topics such as H2O, MNIST, and even auto-encoders, that are covered in much more depth in the book.

That Light Introduction

H2O is a powerful, scalable and fast machine-learning server/framework, with APIs in R and Python (as well as Coffeescript, Scala via its Spark interface, and others). It has relatively few machine learning algorithms, but they are generally the ones you would have settled on using anyway, and each has been optimized to scale across clusters and big data, and each has many parameters for tuning.
MNIST is a machine learning problem to recognize which of the digits 0 to 9 a set of 784 pixels represents. There are 60,000 training samples, and 10,000 test samples. To avoid inadvertently over-fitting to the test samples, I split the 60K into 50K training data and 10K validation data.
Auto-encoders are the unsupervised version of deep-learning (neural nets, if you prefer). The Wikipedia article is a good introduction to the idea. By setting the input_dropout_ratio parameter H2O supports the “Denoising autoencoder” variation, and withhidden_dropout_ratios H2O supports the “Sparse autoencoder” variation.

The Aim

More layers in a supervised deep learning neural network can often give better results on complex problems. But the more layers you have, the harder it can be to train. Auto-encoders to the rescue! You run an auto-encoder on the raw inputs, and it will self-organize them - extract some information from them. You then take the middle hidden layer from the auto-encoder, and use that as the inputs to your supervised learning algorithm. Or possibly you do it again, with another auto-encoder, to extract (theoretically) an even higher-level abstraction.
I decided to try this on the MNIST data.
My initial approach was to treat it as a data compression problem: to see how few hidden neurons, in a single layer, I could get a perfect score with. I.e. this autoencoder had 784 input neurons, N hidden neurons, and 784 output neurons. Just one hidden layer, so to request, say, a 784x200x784 layout, in H2O I just do hidden=200, and the input layer is implicit from the data, and the output it implict because I specify autoencoder=TRUE. Unfortunately, even with N=784, I couldn’t get an MSE of 0.0.
(You should be able to see how there is one trivial way for such a network to get the perfect score: for each neuron in the middle layer, exactly one incoming weight should be 1.0 and all the others should be 0.0, and then the same for the outgoing weights. However, also appreciate how hard it would be for training to discover this if all 784 weights leading in and all 784 weights leading out of each neuron started off life as a random number.)

Getting Practical

So, under time pressure, I took an “educated guess” approach, and also an “ensemble” approach. I made three autoencoders (one of them being two-step, so four models in total), and used them together. The code to make them, and their outputs, is wrapped up in a couple of R functions, which I’ll show in a moment, but first a look at the parameters of the four models:
AE200: This uses a single layer of 200 hidden neurons. I set input_dropout_ratio = 0.3, which means as each training sample was used for training it would be setting a random 30% of the pixels to 0. This should make it more robust, less likely to over-fit. I also use L2 regularization, set to 1e-4 (which is fairly high).
AE32: This uses just 32 hidden neurons, so is going to be less perfect representation than AE200. To compensate for that, I use a lower input_dropout_ratio = 0.1, and also lowered L2 regularization to 1e-5.
AE768: Not used directly. One hidden neuron per input pixel (almost), means it is more “rephrasing” rather than “compressing”. I used the same input_dropout_ratio = 0.3 and l2 = 1e-4 settings as AE200. (Among the 784 pixels columns there are a few around the edge that are exactly zero in all training data, so provide no information, and can be thrown away; that is where the 768 came from.)
AE128: This was built from the output of AE768. No input dropout, and just a bit of L2 regularization (1e-5).
L2 regularization penalizes large weights (e.g. see https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization). It helps make sure all pixels get considered, rather than allowing the algorithm to over-fit to one particular pixel.
All four models used the tanh activation function, and were given 20 epochs.

The Model Generation Code

The following listing shows the R code, for the above model descriptions in H2O, wrapped up in a function that takes two parameters:
  • data is the H2O frame to use for training data
  • x is which columns of that frame to use
_
create_MNIST_autoencoders <- function(data, x){
m_AE200 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(200),
  model_id = "AE200",
  autoencoder=T,
  input_dropout_ratio = 0.3,  #Quite high
  l2 = 1e-4,  #Quite high
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

m_AE32 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(32),
  model_id = "AE32",
  autoencoder = T,
  input_dropout_ratio = 0.1,  #Fairly low
  l2 = 1e-5,  #Fairly low
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

m_AE768 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(768),
  model_id = "AE768",
  autoencoder = T,
  input_dropout_ratio = 0.3,  #Quite high
  l2 = 1e-4,  #Quite high
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

f_AE768 = h2o.deepfeatures(m_AE768, data)

m_AE128 <- h2o.deeplearning(
  1:768, training_frame=f_AE768,
  hidden = c(128),
  model_id = "AE128",
  autoencoder = T,
  input_dropout_ratio = 0,  #No dropout
  l2 = 1e-5,   #Just a bit of L2
  activation = "Tanh",
  #export_weights_and_biases = T,
  #ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

return(list(m_AE200, m_AE32, m_AE768, m_AE128))
}
Feeding the output of one auto-encoder (AE768 in this case) into another (AE128) is done with h2o.deepfeatures().
These two lines are just for troubleshooting/visualization:
export_weights_and_biases = T,
ignore_const_cols = F,
And, in a sense, this one is too:
train_samples_per_iteration = 0
This says I want it to always score the model’s MSE at the end of every epoch. I did this so I could see the shape of the score history chart, and so get a feel for if 20 epochs was enough. Normally touching train_samples_per_iteration counts as micro-management, because the default is to intelligently choose when to score based on some targets for time spent training vs. scoring, and some communication overhead targets. (See the explanation in chapter 8 of the book, if you crave more detail.)

Using The Models To Make Pixels

The second helper function is shown next. It returns (a handle to) an H20 frame that has 200 + 32 + 128 columns from the autoencoders, plus any additional columns you specify in columns (which must include at least the answer column).
generate_from_MNIST_autoencoders <- function(models, data, columns){
  stopifnot(length(models) == 4)
  names(models) = c("AE200", "AE32", "AE768", "AE128")

  f_AE200 <- h2o.deepfeatures(models[["AE200"]], data)
  f_AE32 <- h2o.deepfeatures(models[["AE32"]], data)
  f_AE768 <- h2o.deepfeatures(models[["AE768"]], data)
  f_AE128 <- h2o.deepfeatures(models[["AE128"]], f_AE768)

  h2o.cbind(f_AE200, f_AE32, f_AE128, data[,columns] )
  }
Notice how, if you don’t include the original 768 pixel columns in columns that the auto-encoded features will effectively replace the raw pixel data. (This was my intention, but you don’t have to do that.)

Usage Example

I’ll assume you have 785 columns, where the pixel data is in columns 1 to 784, and the answer, 0 to 9, is in column 785. If you have the book, you will be familiar with this convention:
train <- #...50K of training data
valid <- #...10K of validation data
test <- #...10K of test data
x <- 1:784
y <- 785
To use them I then write:
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, y)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, y)
test_ae <- generate_from_MNIST_autoencoders(models, test, y)
m <- h2o.deeplearning(1:360, 361, train_ae, validation_frame = valid_ae)
Here I train a deep learning model with all defaults, but it could be a more complex deep learning model, or it could be a random forest, GBM or any other algorithm supported by H2O.
When using it for predictions, remember to use test_ae, not test (and similarly, in production, any future data has to be put through all four auto-encoder models). So following on from the above, you could evaluate it with:
h2o.performance(m, test_ae)

Usage: extended data

If you are lucky enough to have read the book, you will know I added 113 columns of “extended information” to the MNIST data. Though I got rid of the pixels, I chose to keep the extended columns alongside the auto-encoder generated data.
This is how the above code looks if you are using the extended MNIST data:
x <- 114:897
columns <- c(1:113,898)
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, columns)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, columns)
test_ae <- generate_from_MNIST_autoencoders(models, test, columns)
m <- h2o.deeplearning(1:473, 474, train_ae, validation_frame = valid_ae)

Summary

Informally, I can tell you that a deep learning model built on 473 auto-encoded/extended columns was significantly better than one built on 897 pixel/extended columns. However, I also increased the amount of training data at the same time (see http://darrendev.blogspot.com/2016/10/applying-rs-imager-library-to-mnist.html ), so I cannot tell you the relative contributions of those two changes.
But, I was pleased with the results. And the “educated guess” approach of choosing three distinct auto-encoder models and combining their outputs also seemed to work well. It might be that just one of the auto-encoder models is carrying all the useful information, and the others could be dropped? That is a good experiment to do (let us know if you do it!), but my hunch is that the “ensemble” approach is what allows the educated guess approach to work.

Thursday, October 13, 2016

Applying R's imager library to MNIST digits

Applying R’s imager library to MNIST digits

Introduction

For machine learning, when it comes to training data, the more the better! Well, there are a whole bunch of disclaimers on that statement, but as long as the data is representative of the type of data the machine learning model will be used on in production, doubling the amount of unique training data will be more useful than doubling the number of epochs (if deep learning) or trees (if random forest, GBM, etc.).
I’m going to concentrate in this article on the MNIST data set, and look at how to use R’s imager library to increase the number of training samples. I take a deep look at MNIST in my new book, Practical Machine Learning with H2O, which (not surprisingly) I highly recommend. But, briefly, the MNIST data is handwritten digits, and the challenge is to identify which of the 10 digits, 0 to 9, each one is.
There are 60,000 training sample, plus 10,000 test samples (and everyone uses the same test set - so watch out for the inadvertent over-fitting in published papers). The samples are 28x28 pixels, each a 0 to 255 greyscale value. I usually split the 60,000 samples into 50K for training, 10K for validation (to make sure I am not over-fitting on the test data).

imager and parallel

I used this R library for dealing with images: http://dahtah.github.io/imager/ which is based on a C++ library called CImg. Most function calls are just wrappers around the C++ code, which means they are fairly quick. It is well-documented with a good starter tutorial.
I used version 0.20 for all my development. I have just seen that 0.30 is now out, and in particular offers native parallelization. This is great! Out of scope for this article, but I used imager in conjunction with R’s parallel functions, and found the latter quite clunky, with time spent copying data structures limiting scalability. On the other hand, the docs say these new parallel options work best on large images, and the 28x28 MNIST images are certainly not that. So maybe I am still stuck using parApply() and friends.

The Approach

In the 20,000 unseen samples (10K valid, 10K test), there are often examples of bad handwriting that we don’t get to see in our 50,000 training samples. Therefore I am most interested in generated bad handwriting samples, not nice neat ones.
I spent a lot of time experimenting with what imager can produce, and settled on generating these effects, each with a random element:
  • rotate
  • warp (make it “scruffier”)
  • shift (move it 1 pixel up, down, left or right)
  • bold (make it fatter - my code)
  • dilate (make it fatter - cimg code)
  • erode (make it thinner)
  • erodedilate (one or the other)
  • scratches (add lines)
  • blotches (remove blobs)
I also defined “all” and “all2” which combined most of them.
In the full code I create the image in an imager (cimg) object called im, then copy it to im2. Each subsequent operation is performed on im2. im is left unchanged, but can be referred to for the initial state.

Rotate

The code to rotate comes in two parts. Here is the first part:
needSharpen = FALSE
angle = rnorm(1, 0, 8)
if(angle < -1.0 || angle > 1.0){
  im2 = imrotate(im2, angle, interpolation = 2)
  nPix = (width(im2) - width(im)) / 2
  im2 = crop.borders(im2 , nPix = nPix)
  needSharpen = TRUE
  }
The use of rnorm(sd=8), means 68% of time it’ll be +/-8°, only 5% of the time more than +/-16°. If my goal was simply more training samples, I’d have perhaps used as smaller sd, and/or clipped to a maximum rotation of 10°. But, as mentioned earlier, I wanted more scruffy handwriting. The if() block is a CPU optimization - if rotation is less than 1° don’t bother doing anything.
The imrotate() command takes the current im2 and replaces it with one that is rotated. This creates a larger image. To see what is going on, try running this (self-contained) script (see the inline comments for what is happening):
library(imager)

# Make 28x28 "mid-grey" square
im <- as.cimg(rep(128, 28*28), x = 28, y = 28)

#Prepare to plot side-by-side
par(mfrow = c(1,2))

#Show initial square 28x28
plot(im)

#Show rotated square, 34x34
plot(imrotate(im, angle = 16))
The output is like this:
You can see the image has become larger, to contain the rotated version. (That image also shows how imager’s plot command will scale the colours based on the range, and that my choice of 128 could have been any non-zero number. When there is only a single value (128), it chooses a grey. After rotating we have 0 for the background, 128 for the square, so it does 0 as black, 128 as white.)
For rotating MNIST digits:
  • I want to keep the 28x28 size
  • All the interesting content is in the middle, so clipping is fine.
So I call crop.borders(), which takes an argument nPix saying how many pixels to remove on each side. If it has grown from 28 to 34 pixels square, nPix will be 3.

Feeling A Bit Vague…

Here is what one of the MNIST digits looks like rotated 30° at a time, 11 times.




In a perfect world, 12 rotations would give you exactly the image you started with. But you can see the effect of each rotation is to blur it slightly. If we did another lap, even your clever mammalian brain would no longer recognize it as a 4.
The author of the imager library provided a match.hist() function (see it, and the surrounding discussion, here: https://github.com/dahtah/imager/issues/17 ) which does a good (but not perfect) job. Here are the histograms of the image before rotation, after rotation, and then after match.hist:




You can judge the success from looking at the image on the right, or by seeing how the bars on the rightmost histogram match those of the leftmost one. (Yes, even though the bumps are very small, their height, and where they are, really matter!)

You better sharpen up…

You would have noticed the earlier rotate code set needSharpen to true. That is used by the following code. Some of the time it uses the imager library’s imsharpen(), some of the time match.hist(), and some of the time a hack I wrote to make dim pixels dimmer, and bright pixels brighter.
if(needSharpen){
  if(runif(1) < 0.8){
    im2 <- imsharpen(im2, amplitude = 55)
    }
  if(runif(1) < 0.3){
    im2 <- ifelse(im2 < 128, im2-16, im2)
    im2 <- ifelse(im2 < 0, 0, im2)
    im2 <- ifelse(im2 > 200, im2+8, im2)
    im2 <- ifelse(im2 > 150, im2+8, im2)
    im2 <- ifelse(im2 > 100, im2+8, im2)
    im2 <- ifelse(im2 > 255, 255, im2)
  }else{
    im2 <- match.hist(im2, im)
    }
  }

The Others

The other image modifications, listed earlier, use imwarp(), imshift(), pmax() with imshift() (for a bold effect), dilate_square(), and erode_square(). The blotches and scratches were done by putting random noise on an image, then using pmax() or pmin() to combine them.
If there is interest I can write another article going into the details.

Timings

On a 2.8GHz single-core, I recorded these timings to process 60,000 28x28 images. (It was a 36-core 8xlarge EC2 machine, but my R script, and imager (at the time), only used one thread.)
  • 304s to run “bold”.
  • 296s to run “shift”
  • 417s for warp
  • 447s for rotate
  • 517s to 539s for all and all2
warp and rotate need to do the sharpen step, which is why they take longer.

Summary

I made 20 files, so over 95% of my training data was generated. As you will discover if you read the book, this generated data gave a very useful boost in model strength, though of course dramatically increased learning-time due to having 1.2 million training rows instead of 50,000. An interesting property was that it found the sampled data harder to learn: I got lower error rates on the unseen valid and test data sets than on the seen training data. This is a consequence of my deliberate decision to bias towards noisy data and scruffy handwriting.
Generating additional training data is a good way to prevent over-fitting, but generating data that is still representative is always a challenge. Normally you’d check mean, sd, the distribution, etc. to see how you’ve done, and usually your only technique is to jitter - add a bit of random noise. With images you have a rich set of geometric transformations to choose from, and you often use your eyes to see how it has done. Though, as we saw from the image histograms, there are some objective techniques available too.

Wednesday, October 12, 2016

Multinomial Ensemble In H2O

Categorization Ensemble In H2O

An ensemble is the term for using two or more machine learning models together, as a team. There are various types (normally categorized in the literature as boosting, bagging and stacking). In this article I want to look at making teams of high-quality categorizing models. (It handles both multinomial and binomial categorization.)
I will use H2O. Any machine-learning framework that returns probabilities for each of the possible categories can be used, but I will assume a little bit of familiarity with H2O… and therefore I highly recommend my book: Practical Machine Learning with H2O! By the way, I will use R here, though any language supported by H2O can be used.

Getting The Probabilities

Here is the core function (in R). It takes a list of models ¹, and runs predict on each of them. The rest of the code is a bit of manipulation to return a 3D array of just the probabilities (h2o.predict() returns its chosen answer in the “predict” column, so the setdiff() is to get rid of it).
getPredictionProbabilities <- function(models, data){
sapply(models, function(m){
  p <- h2o.predict(m, data)
  as.matrix(p[, setdiff(colnames(p), "predict") ])
  }, simplify = "array")
}

Using Them

getPredictionProbabilities() returns a 3D array:
  • First dimension is nrows(data)
  • Second dimension is the number of possible outcomes (labelled “p1”, “p2”, …)
  • Third dimension is length(models)
  • Value is 0.0 to 1.0.
To make the next explanation a bit less abstract, let’s assume it is MNIST data (where the goal is to look at raw pixels and guess which of 10 digits it is), and that we have made three models, and that we are trying it on 10,000 test samples. Therefore we have a 10000 x 10 x 3 array. Some example slices of it:
  • predictions[1,,] is the predictions for the first sample row; returning a 10 x 3 matrix.
  • predictions[1:5, "p3",] is the probability of each of the first 5 rows being the digit “2”. So, five rows and three columns, one per model. (It is “2”, because “p1” is “0”, “p2” is “1”, etc.).
  • predictions[1:5,,1] are the probabilites of the first model on the first 5 test samples, for each of the 10 categories.
When I just want the predictions of the team, and don’t care about the details, I use this wrapper function (explained in the next section):
predictTeam <- function(models, data){
  probabilities <- getPredictionProbabilities(models, data)
  apply(
    probabilities, 1,
    function(m) which.max(apply(m, 1, sum))
    )
  }

3D Arrays In R

A quick diversion into dealing with multi-dimensional arrays in R. apply(probabilities, 1, FUN) will apply FUN to each row in the first dimension. I.e. FUN will be called 10,000 times, and will be given a 10x3 matrix (m in the above code). I then use apply() on the first dimension of m, calling sum. sum() is called 10 times (once per category) and each time is given a vector of length 3, containing one probability per model.
The last piece of the puzzle is which.max(), which tells us which category has the biggest combined confidence from all models. It returns 1 if category 1 (i.e. “0” if MNIST), 2 if category 2 (“1” if MNIST), etc. This gets returned by the function, and therefore gets returned by the outer apply(), and therefore is what predictTeam() returns.
Aside: Using sum() is equivalent to mean(), but just very slightly faster. E.g. 30 > 27 > 24, and if you divide through by 3, then 10 > 9 > 8.
Your answer column in your training data is a factor, so to convert that answer into the string it represents, use levels(). E.g. if f is the factor representing your answer, and p is what predictTeam()returned, then levels(f)[p] will get the string. (With MNIST, there is a short-cut: p-1 will do it.)

Counting Correctness

A common question to ask of an ensemble is how many test samples every member of the team got right, and how many none of the team got right. I’ll look at that in 4-steps. First up is to ask what each models best answer is.³
predictByModel <- apply(probabilities, 1,
 function(sample) apply(sample, 2, which.max)
 )
Assuming the earlier MNIST example, this gives a 3 x 10000 int matrix (where the values are 1 to 10).
Next, we need a vector of the correct answers, and they must be the category index, not the actual value. (In the case of MNIST data it is simply adding 1, so “0” becomes 1, “1” becomes 2, etc.)
eachModelsCorrectness <- apply(predictByModel, 1,
 function(modelPredictions) modelPredictions == correctAnswersTest
 )
This returns a 10000 x 3 logical (boolean) matrix. It tells us if each model got each answer correct. The next step returns a 10000 element vector, which will have 0 if they all got it wrong, 1 if only one member of the team got the correct answer, 2 if two of them, and 3 if all three members of the team got it right.
cmc <- apply(eachModelsCorrectness, 1, sum)
(cmc for count of model correctness) table(cmc) will give you those four numbers, E.g. you might see:
 0    1    2    3
67   59  105 9769
Indicating that there were 67 that none of them got right, 59 samples that only one of the three models managed, 105 that just one model got wrong, and 9769 that all the models could do.
I sometimes find cumsum(table(cmc)) to be more useful:
 0     1     2     3
67   126   231 10000
Either way, if the first number is high, you need stronger model(s). Better team members. If it is low, but the ensemble is performing poorly (i.e. no better than the best individual model), then you need more variety in the models.

Further Ideas

By slicing and dicing the 3D probability array, you can do different things. You might want to poke into what those difficult 67 samples were, that none of your models can get. Another idea, which I will deal in a future blog post, is how you can use the probability to indicate those with low confidence that need more investigation work. (You can use this with an ensemble, or just with a single model.)

Performance

In my experience, in general, this kind of team will always give better performance than any individual model in the team. It gives the biggest jump in strength, when:
  • The models are very distinct.
  • The models are fairly close in strength.
  • Just a few models. Making 4 or 5 models, evaluating them on the valid data, and using the strongest 3 for the ensemble, works reasonably well. However if you managing to make near-strength and very distinct models, the more the merrier.
For instance, three deep learning models, all built with the same parameters, each getting about 2% error individually, might manage 1.8% error as a team. But two deep learning models, one random forest, one GBM, all getting 2% error, might give 1.5% or even lower, when used together. Conversely, two very similar deep learning models that have 2% error combined with a GBM with 3% error and a random forest getting 4% error might do poorly, possibly even worse than your single best model.

Summary

This approach to making an ensemble for categorization problems is straightforward and usually very effective. Because it works with models of different types it at first glance seems closer to stacking, but really it is more like the bagging of random forest (except using probabilities, instead of mode⁴ as in random forest).

Footnotes

[1]: All the models in models must all be modelling the same thing, i.e. they must each have been trained on data with the exact same columns.²
Normally each model will have been trained on the exact same data, but the precise phrasing is deliberate: interesting ensembles can be built with models trained on different subsets of your data. (By the way, subsets and ensemble are the two fundamental ideas behind the random forest algorithm.)
[2]: You can take this even further and only require they are all trained with the same response variable, not even the same columns. However, you’ll need to adapt getPredictionProbabilities() in that case, because each call to h2o.predict() will then need to be given a different data object.
[3]: I do it this way because I already had probabilities made. But if it was all that you were interested in, this was the long way to get it. The direct way was to do the exact opposite of getPredictionProbabilities(): keep the “predict” column, and get rid of the other columns!
[4]: I did experiment with simple one vote per model (aka mode), instead of summing probabilities, and generally got worse results. But it is worth bearing that kind of ensemble in mind. If you do more rigorous experiments comparing the two methods, I’d be very interested to hear what kind of problems, if any, that voting (mode) is superior on.

Friday, September 23, 2016

Hacking the H2O R API

Hacking the H2O R API

H2O comes with a comprehensive R API, but sometimes you want to do something that it does not (yet) support. This article will show how to add a couple of functions for fetching and saving models. Beyond giving you these functions, I want to show how to approach hacking on the API, including using internals. (Code in this article has been tested on the 3.8.2.x, 3.8.3.x and 3.10.0.x releases.)
If you want to learn more about H2O, and machine learning, may I recommend my book: Practical Machine Learning with H2O, published by O’Reilly? (It is “coming really soon” as I write this!) And, my company, QQ Trend, are available for helping you with all your machine learning needs, everything from a few hours of H2O-related consulting to helping you build massive models to solve the mysteries of life. (Contact me at dc at qqtrend.com )

Saving it all for another day

Say you have 30 models stored on H2O, and you want to save them all. The scenario might be that you want to stop the cluster overnight, but want to use your current set of models as the starting point for better models tomorrow. Or in an ensemble. Or something. At the time of writing H2O does not offer this functionality, in any of its various APIs and front-ends. So I want to write an h2o.saveAllModels() function.
Breaking that down a bit, I’m going to need these two functions:
  • get a list of all models
  • save models, given a list of models or model IDs.
Let’s start with the “or model IDs” requirement. H2O’s API offers h2o.saveModel(), but that only takes a model object, so how can we use it when all we have is an ID?

Exposing The Guts…

I am a huge fan of open source. H2O is open source. R is open source. But there is open, and then there is open, and one of the things I like about R is if you want to see how something was implemented, you just type the function name.
Type h2o.saveModel (without parentheses) in an R session (where you’ve already done library(h2o), of course) and you will see the source code. Here it is; notice how the only part of object that it uses is the model id - that is a stroke of luck, because it means that (under the covers) the API works just the way we needed it to!
function (object, path = "", force = FALSE) 
{
    #... Error-checking elided ...
    path <- file.path(path, object@model_id)
    res <- .h2o.__remoteSend(
      paste0("Models.bin/", object@model_id),
      dir = path, force = force, h2oRestApiVersion = 99)
    res$dir
}
If you are new to H2O, you need to understand that all the hard work is done in a Java application (which can equally well be running on your machine or on a cluster the other side of the world), and the clients (whether R, Python or Flow’s CoffeeScript) are all using the same REST API to send commands to it. So it should be no surprise to see .h2o.__remoteSend there; it is making a call to the “Models.bin” REST endpoint.
.h2o.__remoteSend is a private function in the R API. That means you cannot call it directly. Luckily, R doesn’t get in our way like Java or C++ would. We can use the package name followed by the triple colon operator to run it from normal user code: h2o:::.h2o.__remoteSend(...)
WARNING: Remember that hacking with the internals of an API is not future-proof. An upgrade might break everything you’ve written. …ooh, look at us, adrenaline flowing, living the life of danger. (Note to self: need to get out more.)

Let’s Write Some Code!

We now have enough to make the saveModels() function:
h2o.saveModels <- function(models, path, force = FALSE){
sapply(models, function(id){
  if(is.object(id))id <- id@model_id
  res <- h2o:::.h2o.__remoteSend(
    paste0("Models.bin/", id),
    dir = file.path(path, id),
    force = force, h2oRestApiVersion = 99)
  res$dir
  }, USE.NAMES=F)
}
The if(is.object(id))id = id@model_id line is what allows it to work with a mix of model id strings or model objects. The use of sapply(..., USE.NAMES=F) means it returns a character vector, containing the full path of each model file that was saved. Use it as follows:
h2o.saveModels(
  c("DL:defaults", "DL:200x200-500", "RF:100-40"),
  "/path/to/h2o_models/todays_hard_work",
  force = TRUE
  )
and it will output:
[1] "/path/to/h2o_models/todays_hard_work/DL:defaults"
[2] "/path/to/h2o_models/todays_hard_work/DL:200x200-500"
[3] "/path/to/h2o_models/todays_hard_work/RF:100-40"
(By the way, there is one irritating problem with this function: if any failure occurs, such as a file already existing, or a model ID not found, it stops with a long error message, and doesn’t attempt to save the other models. I’ll leave improving that to you. Hint: consider wrapping the h2o:::.h2o.__remoteSend() call with ?tryCatch.)

What Models Have I Made?

Next, how to get a list of all models? The Flow interface has getModels, and the REST API has GET /3/Models, but the R and Python APIs do not; the closest they have is h2o.ls() which returns the names of all data frames, models, prediction results, etc. with no (reliable) way to tell them apart. But GET /3/Models is not ideal either, because it returns everything about the model, whereas all we want is the model id. Having trawled through the H2O source (BTW, If you fancied submitting a patch to add a GET /3/ModelIds/ command, this looks like a good starting point I.e. what we want is just the first half of that function) it appears we are stuck with this. It is unlikely to matter unless you have 1000s of models, or a slow connection to your remote H2O cluster.
Start the same way as before by typing h2o.getModel (no parentheses) into R. Ooh! That function is long, and it is doing an awful lot. If you want a generally useful h2o.getModels() function I leave that as another of those exercises for the reader. Instead I’m going to call my function
h2o.getAllModelIds(), and limit the scope to just that, which makes the code much simpler. (Did you notice the pro tip there: just by calling my function “getAllModelIds” instead of “getModels” I saved myself hours of work. You see kids, naming really does matter.)
Here it is:
h2o.getAllModelIds <- function(){
d <- h2o:::.h2o.__remoteSend(method = "GET", "Models")
sapply(d[[3]], function(x) x$model_id$name)
}
Line 1 says get all the models. Line 2 says filter just the model id out, and throw the rest of it away. (Yeah, that d[[3]] bit is particularly fragile.)
Anyway, the final step is to simply put our two new functions together:
h2o.saveAllModels <- function(path){
h2o.saveModels(h2o.getAllModelIds(), path)
}
Use it, as shown here:
fnames <- h2o.saveAllModels("/path/to/todays_hard_work")
On one test, length(fnames) returned 154 (I’d been busy), and those 154 models totalled 150MB. However some models (e.g. random forest) are bigger than others, so make sure you have plenty of disk space to hand, just in case. Speaking of which, h2o.saveAllModels() should work equally well with S3 or HDFS destinations.

The day after the night before…

I could’ve done a dput(fnames) after running h2o.saveAllModels(), and saved the output somewhere. But as I’m not putting anything else in that particular directory, I can get the list again with Sys.glob(). So, I might start my next day’s session as follows.
 library(h2o)
 h2o.init(nthreads = -1)
 fnames <- Sys.glob("/path/to/todays_hard_work/*")
 models <- lapply(fnames, h2o.loadModel)
Voila! models will be an R list of the H2O model objects.

Clusters

If you are working on a remote cluster, with more than one node, there is a little twist to be aware of. h2o.saveModel() (and therefore our h2o.saveModels() extension) will create files on whichever node of the cluster your client is connected to. (At least, as of 3.10.0.7; I suspect this behaviour might change in future.)
But h2o.loadModel() will look for it on the file system of node 1 of the cluster. And node 1 is not (necessarily) the first node you listed in your flatfile. Instead it is the one listed first in h2o.clusterStatus().
This won’t concern you if you saved to HDFS or S3.

Bonus

What I actually use to load model files is shown below. It will get all the files from sub-directories. (Look at the list.file() documentation for how you can use pattern to choose just some files to be loaded in.)
h2o.loadModelsDirectory <- function(path, pattern=NULL, recursive=T, verbose=F){
fnames <- list.files(path, pattern = pattern, recursive = recursive, full.names = T, include.dirs = F)
h2o.loadModels(fnames, verbose)
}
(This code still has one problem: if I’m loading in work from both yesterday and two days ago, and I had made a model called “DF:default” on both days, I lose one of them. Sorting that out is my final exercise for the reader - please post your answer, or a link to your answer, in the comments!)

Tuesday, August 9, 2016

H2O Upgrade: Detailed Steps

H2O Upgrade: Detailed Steps

I wanted to upgrade both R and Python to the latest version of H2O (as of Aug 9th 2016). Here are the exact steps, and I think you will find them relevant even if you only need to update one or the other of those clients. Remember if following this at a later date, that you should follow the spirit of it, rather than copy-and-pasting: all the version numbers will have changed. This was with Linux Mint, but it should apply equally well to all other Linux distros.
Make sure you first close any R or Python clients that are using the H2O library; and separately shutdown H2O if it is still running after that.

The First Time

The below instructions are all for upgrading, which means I know I have all the dependencies in place. If this is your first H2O install, well I’d first recommend you buy my new book: Practical Machine Learning with H2O, published by O’Reilly. (Coming really soon, as I type this! Let me know if you are interested, and I'll send the best discount code I can find.)
As a quick guide, from R I recommend you use CRAN
install.packages("h2o")
and from Python I recommend you use pip:
pip install h2o
Both these approaches get all the dependences for you; you may end up back a version or two from the very latest, but it won’t matter.

The Download

cd /usr/local/src/
wget http://download.h2o.ai/versions/h2o-3.10.0.3.zip
unzip h2o-3.10.0.3.zip
It made a “h2o-3.10.0.3” directory, with python and R sub-directories.

R

I installed the h2o the first time as root, so I will continue to do that, hence the sudo:
cd /usr/local/src/h2o-3.10.0.3/R/
sudo R
Then:
remove.packages("h2o")
install.packages("h2o_3.10.0.3.tar.gz")
Then ctrl-d to exit.

Python

cd /usr/local/src/h2o-3.10.0.3/python/
sudo pip uninstall h2o
sudo pip install -U h2o-3.10.0.3-py2.py3-none-any.whl 
(The -U means upgrade any dependencies; the first time I forgot it, and ended up with some very weird errors when trying to do anything in Python.)

The Test

I started RStudio, and ran:
library(h2o)
h2o.init(nthreads=-1)
as.h2o(iris)
I then started ipython and ran:
import h2o,pandas
h2o.init()
iris = h2o.get_frame("iris")
print(iris)
As well as making sure the data arrived, I’m also checking the h2o.init() call in both cases said the cluster version was “3.10.0.3”.

AWS Scripts

If you use the AWS scripts (https://github.com/h2oai/h2o-3/tree/master/ec2) and want to make sure EC2 instances start with exactly the same version as you have installed locally, the file to edit is h2o-cluster-download-h2o.sh. (If not using those scripts, just skip this section.)
First find the h2oBranch= line and set it to “rel-turing” (notice the “g” on the end - there is also a version without the “g”!). Then comment out the two curl calls that follow, and instead set version to be whatever you have above, and build to be the last digit in the version number. So, for 3.10.0.3, I set:
h2oBranch=rel-turing

#echo "Fetching latest build number for branch ${h2oBranch}..."
#curl --silent -o latest https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/latest
h2oBuild=3

#echo "Fetching full version number for build ${h2oBuild}..."
#curl --silent -o project_version https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/${h2oBuild}/project_version
h2oVersion=3.10.0.3
The rest of that script, and the other EC2 scripts, can be left untouched.

Summary

Well that was easy! No excuses! Having said that, I recommend you upgrade cautiously - I have seen some hard-to-test-for regressions (e.g. model learning no longer scaling as well over a cluster) when grabbing the latest version.

Monday, February 29, 2016

Factorials and Rationals in R

At the recent NorDevCon, Richard Astbury used the example of factorial to compare some “modern” languages (the theme of his talk being that they were all invented decades ago). Among them was Scheme, which impressed me by having native support for very large numbers and rational numbers.

I felt like my go-to-language for maths-stuff, R, ought to be able to do that, too. The shortest R way to calculate factorials (using built-in functions) I can find is:

prod(1:5)

which gives 120. But it gives me Inf if I try prod(1:50).

BTW, I hoped this might work:

`*`(1:5)

but * operator only takes two arguments (which is ironic for a language where everything is a vector, i.e. an array).

It seems I need to use the gmp package to get big number and rational number support. Here is how factorial can be written:

library(gmp)
prod(as.bigz(1:50))

which outputs
“30414093201713378043612608166064768844377641568960512000000000000”

That is not bad: still fairly short and, as you can see, vectorized operations are still trivial (as.bigz(1:50) creates a vector of 50 gmp numbers).

as.bigq(2, 3) is how you create a rational (⅔). Here is a quick example of creating vectors of rational numbers:

as.bigq(1, 1:50)

which outputs:

Big Rational ('bigq') object of length 50:
[1] 1 1/2 1/3 1/4 1/5 1/6 1/7 1/8 1/9 1/10 1/11 1/12 1/13 1/14 1/15
[16] 1/16 1/17 1/18 1/19 1/20 1/21 1/22 1/23 1/24 1/25 1/26 1/27 1/28 1/29 1/30
[31] 1/31 1/32 1/33 1/34 1/35 1/36 1/37 1/38 1/39 1/40 1/41 1/42 1/43 1/44 1/45
[46] 1/46 1/47 1/48 1/49 1/50

Rational operations also work nicely:

as.bigq(1,1:50) + as.bigq(1,3)

giving:
4/3 5/6 2/3 ... 49/138 50/141 17/48 52/147 53/150

Of course that is not as nice as having built-in rationals, but good enough for those relatively few times when you need them.

Tuesday, October 13, 2015

Maths Tests With R

This maths problem hit the news recently, about how far a crocodile should swim up the bank before going on land, in order to catch a zebra in the shortest possible time. It appears to be an A-level question, i.e. for 18-year old students.

The first two questions are arithmetic; more about understanding the question being asked. But the main question is obviously calculus: you are supposed to differentiate, and find out where it is zero.

I happened to have R open at the time, and my calculus is a bit rusty on how to differentiate a square root. So, this is what I typed:

T = function(x){
(5 * (36 + x^2) ^ 0.5) + 4 * (20-x)
}

(curly brackets were optional: it could all have been on one line.)

Then to answer the three questions:

T(20)
T(0)
optimize(T, lower=0, upper=20)

I.e. if he swims the whole way it is 10.44 seconds, if he cuts to land immediately it takes 11 seconds, and the 3rd line tells me he should swim 8 metres, then cut to land, and it will take 9.8 seconds.

Or, if you want to see how I should have solved it, and be reminded how to do a tricky differentiation, go to https://www.youtube.com/watch?v=xko48OoTAQU and watch from 5:00 to about 10:00. For comparison, It took me less than 1 minute to write the function and get the solutions. R itself ran instantly, of course.

As a data scientist, the important thing here is I use the same techniques when things get messy. If you show me enough observations of crocodiles catching zebras, I can give you an estimated function that also takes into account the speed of the flowing water, the wind speed, the age of the zebra, the water temperature, the air temperature, the weight of the crocodile, and when he last ate!

Sunday, June 16, 2013

RUnit and callback functions and scope

I have a function (called something() in the below example code) that takes a callback function as a parameter. To be flexible it can take the callback as a Function object, or as a character string, or as a list (which keeps the function name and the parameters together).

But I hit a problem when trying to write a unit test for something(). It works with a global function as the callback (such as mean in the below code), but not for a locally defined function (thingy in the below code):
thingy=function(x){ mean(x) }

test.something=function(){
x=1:10

something(mean,x)
something("mean",x)
something( list("mean",x) )
something(thingy,x)
something("thingy",x)    #Fails
something( list("thingy",x) )   #Fails
}
The problem appears to be that functions in a test file are all loaded in a special environment. And, when I thought about it, that makes sense, unit test functions are supposed to be small and independent, and I don't want to be defining a function at the same level as the unit tests.

I tried a few things, but here is the one that worked:
test.something=function(){ 
assign("thingy",function(x){ mean(x) }, envir=globalenv())
 
x=1:10

something(mean,x)
something("mean",x)
something( list("mean",x) )
something(thingy,x)
something("thingy",x)
something( list("thingy",x) )

remove("thingy", envir=globalenv())
}
So, we explicitly put our thingy() in the global environment, where everyone can find it and make use of it. And we define it inside the test function, no earlier. Removing it at the end is optional, but Murphy's Law states that if we don't then it will end up clashing with something. Late on a Friday afternoon, in six months time, when we've forgotten about this code.

Monday, October 15, 2012

Various R Tips

#1   How to validate a commandline argument only uses allowed values?


Say we have a commandline R script, where the first argument is allowed to be a csv list. But that csv list can only contain 'A', 'B' or 'C'. E.g. "myscript.R A" is valid, as is "myscript.R C,B", as is "myscript.R A,C,A,B,C", but not "myscript.R X", or "myscript.R A,B,X,C,B,A"

    my_list=strsplit(argv[1],',')[[1]]
    if(! all (my_list %in% c('A','B','C') ) ) quit()

The first line is the idiom for cracking open a csv list, and getting a vector of character strings. The [[1]] at the end is just noise, live with it.

The middle part of the second line is:
    my_list %in% allowed_values

It returns a vector of logical values the same length as my_list. What we want is for all the elements to be TRUE (which means all the items in our csv list are in the allowed_values list). If there is even a single FALSE (meaning the user gave at least one bad value in the csv list) then we fail, and in this example we call quit().  (See next tip for something less crude that calling quit() with no explanation!)


Thursday, September 13, 2012

<- vs. = in R

One of the first thing to strike a programmer new to R is <- all over the place. "Ugly!" may be a first reaction. But probe a bit deeper and you'll find you can use = just as well. "So why bother with <- at all?!" is a common reaction.The most pragmatic advice I've seen from experienced R programmers is it really just comes down to personal preference.

As I'm coming from experience in C++, PHP, Javascript, python, java, (the list goes on), I use = in my R code. Except in one case, which I'll come to in a moment. I've used = in almost all my R-related blog posts and StackOverflow questions and answers and no-one has ever taken offence (AFAIK), shunned me (AFAIK), or done an edit to change them all to <-. So it appears to be acceptable.

Downsides


But there are three downsides I can think of, one important, one social and one bogus.

The first downside is R also uses the single equals sign for a named parameter assignment in a function call. This generally doesn't matter because using assignment in parameters is bad form in C-like languages, so I don't get confused. Just about the only time it ever matters to me is timing code. I *have* to write:

     timing = system.time( x <- do_calculation(a=1,b=2) )


If I write the following then x won't get assigned:

  timing = system.time( x = do_calculation(a=1,b=2) )

The second downside is the R community considers <- to be standard. So all packages use it, all books use it. If you want to be part of the in-crowd you have to use it too.

The third downside is it is easier to use search-and-replace to convert all your "<-" to "=", than it is to go the other way. But this is suspicious: the above example shows why you cannot do "<-" to "=" completely automatically anyway.

Upsides

Are there any upsides to preferring = to compensate? Yes, though they'll sound petty to people who believe using <- is the Only Way. First it is one less keystroke. Second, in comparisons, this does not work:

    if(x<-5)cat("x is less than minus five\n")

So you must put spaces around < and > in R; it is not just a style thing as it is in other languages.

The third reason is when I'm using <- it is communicating intent: I'm deliberately doing an assignment to a variable in the parameter list of a function call. As that is considered bad form in most languages, I like how it stands out.

I've saved the biggest upside for last: familiarity to programmers coming from any of the other C-like languages. (R is a C-like language too.)

Comparisons

You thought I'd mention ==, and how it can be confused with = ? That potential confusion exists in all C-like languages, and we just know to look out for it. And, anyway, it still exists in R:

   f(x=1)   vs.    f(x==1)

When I write one of those, did I mean the other?

Language Design

If I was designing R from scratch, how would I do it differently? I love how I can assign to named parameters in R - it is perhaps the most beautiful feature of the language. But it is an overload of the = sign, and one not found in other C-like languages, so I'd be tempted to change it to use ":" (it looks a bit like object notation in Javascript, but without the curly brackets). So the above example would become:

   timing = system.time( x = do_calculation(a:1,b:2) )

Notice how I can use "x=" safely, because it now has no other meaning. Also notice how this helps with the f(x=1)   vs.    f(x==1) confusion too. But I'm contradicting my comment above, about liking the way <- stands out. So maybe I want it to look like this:

   timing = system.time( x <- do_calculation(a:1,b:2) )

Now a lint tool can warn about use of a single equals sign in a function call, because there should never be one. Hhhmmm, I'm not convinced but it is something to chew on.

Do you have any thoughts or constructive criticism? Please add a comment.

Thursday, August 23, 2012

Small R/XTS Code Snippets And Tips

1. Splitting by week

The split function in xts can split your data into weeks:  split(data,f="weeks")
Under the covers it uses endpoints, which in turn uses a C function of the same name in the XTS package. The problem I discovered it that it considers the start of the week as Monday, 00:00:00, and the end of the week as Sunday, 23:59:59. This is a problem because the FX markets start at 9pm or 10pm on a Sunday night in the UTC/GMT timezone (at Sunday, 5pm in New York timezone).

What we want is for it to treat Sunday 00:00:00 as the start of the week. That gives us good buffer on both sides (about 24hrs after Nasdaq after-market hours finish, and about 21hrs before the FX markets open). If you look at the source of endpoints (just type endpoints in an interactive R session) you'll see it has the magic constant: 3L * 86400L. After some trial and error I found we want that to be 4L not 3L. Though it may be brittle, as you're using XTS internals, you can use this line in your code:

    ep= .Call("endpoints", .index(data)+4L*86400, 604800L, k=k, PACKAGE="xts")

(I don't suppose the XTS package can consider this a bug, as there may be people relying on the current "weeks" behaviour; but maybe they could add a "sunweeks" option to endpoints and split? What do you think?)

P.S. A minor optimization for the XTS endpoints function:
    if (on %in% c("years", "quarters", "months", "weeks", "days"))
could read:
    if (on %in% c("years", "quarters", "months", "days"))
because weeks is not using posixltindex, and it takes CPU time to generate.

2. Put an XTS object in a string

Say you want to show the first 4 rows of an xts object in a string. The first trick you need is capture.output(). This takes the print() output and puts it in a string. But it returns each row as an entry in a vector. So we'll use paste() to convert that to a single string. Here is our code:


    msg = paste("Here are the first 5 rows:",paste(capture.output(x[1:5,]),collapse="\n") )

See item 3 for an example of this.

3. Show me just the NA rows

Here is some test data:
    library(xts)
    x=xts(  data.frame(a=c(1,2,3,NA,5,6), b=c(100,99,NA,NA,NA,95)),
            as.Date("2012-10-01")-6:1)
Which looks like:

                a   b
    2012-09-25  1 100
    2012-09-26  2  99
    2012-09-27  3  NA
    2012-09-28 NA  NA
    2012-09-29  5  NA
    2012-09-30  6  95


Imagine there should be none, and you want to print them out in an error message. This is the command to just show the rows with no NAs:

    x[complete.cases(x),]

And therefore just the rows with  NAs:


    x[!complete.cases(x),]

So, to print a fatal error message that shows the NA rows:

    if(any(is.na(x))){
      stop( paste("We have NAs:",

        paste(capture.output(x[!complete.cases(x),]),collapse="\n")
        ))
      }


(see Item 2, "Put an XTS object in a string", if the second half of that line looks scarier than a rabid dog who has just bitten through his leash.)


4. Loop through as key=>value


In many languages there is a foreach(container as key=>value) type construct. I've not found something so compact for XTS objects, so I use this:
    for(ix in 1:nrow(x)){
      datestamp=index(x)[ix]
      b=x$b[ix]  #Or, b=coredata(x$b)[ix]

      #print(datestamp);print(b)
      }

I've shown two ways of setting 'b'. The first way leaves 'b' as an XTS object. In the second way 'b' is a number. If using the latter, I'm sure it is more efficient to do the coredata() call before the loop. I.e. like this:
    xcd=coredata(x$b)
    for(ix in 1:nrow(x)){
      datestamp=index(x)[ix]
      b=xcd[ix]

      #print(datestamp);print(b)
      }

 

5. Merging, but excluding values only in one xts object

Here is some test data:
  library(xts)
  Sys.setenv(TZ = "UTC")
  d=xts(1:5,Sys.Date()+(2:6))
  ix=Sys.Date()+(1:5)

So, d is our data, whereas ix lists the days we should have data for.
ix looks like:
  [1] "2013-01-10" "2013-01-11" "2013-01-12" "2013-01-13" "2013-01-14"

and d looks like:
             [,1]
  2013-01-11    1
  2013-01-12    2
  2013-01-13    3
  2013-01-14    4
  2013-01-15    5


In other words, d is missing an entry for 2013-01-10, so we need to add an NA entry for it. But also d has an entry for 2013-01-15, which it shouldn't have yet. (In a finance context it might be a bar that we are still collecting ticks for, so we don't have final values for yet; in a business context it might be a value that has not been approved by management for release yet.)

When we do merge(d,xts(,ix),all=F) we get:
             d
  2013-01-11 1
  2013-01-12 2
  2013-01-13 3
  2013-01-14 4



When we do merge(d,xts(,ix),all=T) we get:
             d
  2013-01-10 NA
  2013-01-11  1
  2013-01-12  2
  2013-01-13  3
  2013-01-14  4
  2013-01-15  5

Neither is what we want. The solution is  merge(d,xts(,ix),join='right') which gives us:
  2013-01-10 NA
  2013-01-11  1
  2013-01-12  2
  2013-01-13  3
  2013-01-14  4


6a. Combining two xts objects that have same column but different timestamps

Here is some test data:
  library(xts)
  Sys.setenv(TZ = "UTC")
  a=xts(1:5,as.Date("2013-02-01")+1:5);colnames(a)="v"
  b=xts(0:-2,as.Date("2013-02-01")-0:2);colnames(b)="v"

When we do merge(a,b) we get:
             v v.1
 2013-01-30 NA  -2
 2013-01-31 NA  -1
 2013-02-01 NA   0
 2013-02-02  1  NA
 2013-02-03  2  NA
 2013-02-04  3  NA
 2013-02-05  4  NA
 2013-02-06  5  NA


You can play with various values for the join parameter, but it won't help. The solution is  rbind(a,b):
             v
 2013-01-30 -2
 2013-01-31 -1
 2013-02-01  0
 2013-02-02  1
 2013-02-03  2
 2013-02-04  3
 2013-02-05  4
 2013-02-06  5


What if both a and b have a timestamp in common? Then you get two rows. In that case you can remove the duplicates afterwards:
  x=rbind(a,b)
  x=x[!duplicated(index(x)),]

The duplicate from b is the one that gets removed. If you wanted the one from a to be removed instead, then pass fromLast = TRUE to duplicated(), or simply do rbind(b,a) !

6b. Combining two xts objects that have same columns but different timestamps

This follows on from 6a, but when the xts object have more than one column. Here is some test data:
  library(xts)
  Sys.setenv(TZ = "UTC")
  a=xts(data.frame(x=1:5,y=5:9),as.Date("2013-02-01")+1:5)
  b=xts(data.frame(x=0:-2,y=10:12),as.Date("2013-02-01")-0:2)

Again,  rbind(a,b) is the answer: 
             x  y
 2013-01-30 -2 12
 2013-01-31 -1 11
 2013-02-01  0 10
 2013-02-02  1  5
 2013-02-03  2  6
 2013-02-04  3  7
 2013-02-05  4  8
 2013-02-06  5  9


Thursday, October 20, 2011

R: what to do when the jitter gets too much

The subject sounds like this will be a discussion of an over-full bladder... instead, brace yourself for some serious functional programming.

So, I had this little bit of R to generate a "rough" exponent curve (as test data):
  x=1:40
  y=exp( jitter( x, factor=30 ) ^ 0.5 )

The problem was it was sometimes giving NA values - too much jitter!
I started off with this fix:

  y[is.na(y)]=exp( x^0.5 )

I.e. wherever it was an NA use the un-jittered value.

Unfortunately it was often complaining with "number of items to replace is not a multiple of replacement length". In this case this was not a warning to ignore, because it revealed I was doing something very wrong. exp(x) is a vector of 40 items. y[is.na(y)] is a vector of however many entries in y are NA. The dangerous thing is if there is just one NA in y, but it is the 5th element, it will be set to exp(1)^0.5, instead of exp(5)^0.5.

Let's cut straight to the solution:
  y[is.na(y)]=exp(x[is.na(y)]^0.5)

x[is.na(y)] means the values out of x where the same-sized y-vector has a NA in that position. So, for the above example where just the 5th element in y is NA, then x[is.na(y)] will be "5".

Simple when you know how. (?!)


Sunday, July 31, 2011

Rcpp: embedding C++ in R scripts

Rcpp is an R package to aid in making R extensions written in C++. It also includes a side-project for embedding R code in a C++ program (including being able to use all the R packages), but most amazing of all is the integration with R's inline package. Let me show you a minimal (but complete) script:
library('inline')
src='Rcpp::List output(1);output[0]="Hello World";return output;'
fun=cxxfunction(signature(),src,plugin="Rcpp")
fun()
Yes, it is the classic Hello World script, the output looks like:
[[1]]
[1] "Hello World"
What do I call this amazing? The contents of src is the body of a C++ function. The cxxfunction takes your C++ code, sends it to your C++ compiler, compiles it with the necessary headers, and finally embeds it in your R session as an R extension. It just works, there is no catch (*).

(I could have used a character object instead of a list. The Rcpp::List is just like an R list, meaning each element can be of a different type. If you've always wanted something like this in C++, also take a look at boost::any.)

The cxxfunction() call takes 1.9s to run on my machine, so the overhead to start-up the compiler and run the compilation process is not too bad.

Okay, not a motivating example, but the Rcpp project is not just technically amazing, it is also well documented. You can find some more interesting examples in the documentation, and if this has got your interest you will enjoy this video (1.5hrs) of a talk the Rcpp main authors did at Google.

By the way, if anyone knows of a PHP project for embedding C++ like this, please let me know!

*: You need to install the Rcpp and inline packages, and you need R version 2.12.0 or later. If you are using Ubuntu and your default version of R is too old, there are good instructions here on getting the latest R version as a deb package.

Sunday, July 17, 2011

R: extract column from xts object as a vector

A quickie for the R language... you will look at the answer and think it is hardly worth writing a blog post over. Well it turns out that everyone has thought the same... which is why it took me over half an hour of failed searching and trial and error to work it out.

I've an XTS object, x, and I've added a column which is the sum of the other columns:
  x$sum=apply(x,1,sum)
print(x$sum) shows me the sum for each row, along with the datestamp of that row. I.e. x$sum is still an XTS object. Normally this is good, but I wanted it as a simple vector, without the dates associated. Here is how you do that:
  as.vector(x$sum)
Why did I want that? Simply because summary(x$sum) uses up 7 lines, half of it being noise about the datestamps; summary(as.vector(x$sum)) is 2 lines, all signal, no noise.

Thursday, July 14, 2011

Customize less: less annoying

Open ~/.bashrc (or ~/.bash_profile) and add this line at the end:
LESS=-FRX;export LESS
Don't ask why, just trust me. ('cos actually I don't know what it means either...)

This was the magic incantation to tell "git log" to wrap commit messages (instead of just chopping off everything extra).

But it has had the delightful side-effect of man now works properly. About 6 or 7 years ago I used to be able to type "man whatever", press q, and the advice would stay on screen. Then linux (all distros as far as I could tell) changed to hiding the man page as soon as you press q. Frequently I'd have to have two terminal windows open, simply so I could keep the man page open while I type my command. Yes, you're ahead of me; the above LESS command fixes this. If I'd known it would be so easy I'd have done this years ago!!

Even better, the help system in the R language was working just like man pages (and it was even more annoying there), and that has been fixed too.

It is a miracle, I tell you, a miracle! July 13th should go down as "Saint LESS=-FRX" day; in 100 years time it will be a public holiday and children will be taught about it in schools. They might even be taught what it means...

Monday, July 4, 2011

R: foreach parallelization

I was experimenting with the foreach and doMC packages in R, which make your code multi-threaded. It actually took a bit of refactoring to use foreach(), as there seems to be no shared memory between the threads - they each took their own copy of the in-scope variables. So I had to return my results as a one-row data frame from each iteration, and combine them when the foreach loop had finished.

Here are my results; the user CPU column was giving me strange numbers, so I'm just going to list total time spent (wall clock time). I have three different loops, so here are the base timings (running the foreach loop in sequential mode, without doMC loaded) for each:
4.33         7.72         16.14
My CPU has 4 cores, but the OS sees it as 8 virtual cores. Here are my results for 1, 2, 4 and 8 threads:
1   4.36         7.90         16.20
  2   2.50 [2.18]  4.30 [3.95]   8.80 [8.10]  [8-15% slower]
  4   1.46 [1.09]  2.50 [1.98]   5.06 [4.05]  [25-35% slower]
  8   1.32 [0.55]  2.30 [0.99]   4.30 [2.02]  [110-140% slower]

All times are in seconds, and this loop represents most of the time spent in my script, so while the results are a long way from linear, they represent a useful speed-up. The numbers in square brackets show the speeds if I had got linear improvement.

By the way, my foreach loop had 200 to 250 iterations. The above results tell me that when each foreach loop iteration does more work we get better efficiency. This is fairly course parallelization, which suggests to me that there is lots of room for improvement in the doMC() code.
UPDATE: When running with top, and cores set to 4, I notice 4 new instances of R appear each time it hits the foreach loop, and then they disappear after. I.e. it appears to be using processes, not threads, and creating them on-demand! No wonder my results indicate so much overhead!