Friday, October 14, 2016

Applying Auto-encoders to MNIST


This is a companion article to my new book, Practical Machine Learning with H2O, published by O’Reilly. Beyond the sheer, unadulterated pleasure you will get from reading it, I’m also recommending it for readers of this article, because I’m only going to lightly introduce topics such as H2O, MNIST, and even auto-encoders, which are covered in much more depth in the book.

That Light Introduction

H2O is a powerful, scalable and fast machine-learning server/framework, with APIs in R and Python (as well as CoffeeScript, Scala via its Spark interface, and others). It has relatively few machine learning algorithms, but they are generally the ones you would have settled on using anyway; each has been optimized to scale across clusters and big data, and each has many parameters for tuning.
MNIST is a machine learning problem to recognize which of the digits 0 to 9 a set of 784 pixels represents. There are 60,000 training samples, and 10,000 test samples. To avoid inadvertently over-fitting to the test samples, I split the 60K into 50K training data and 10K validation data.
Auto-encoders are the unsupervised version of deep-learning (neural nets, if you prefer). The Wikipedia article is a good introduction to the idea. By setting the input_dropout_ratio parameter H2O supports the “Denoising autoencoder” variation, and with hidden_dropout_ratios H2O supports the “Sparse autoencoder” variation.
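For illustration, here is roughly how those two parameters appear in a call (just a sketch, with placeholder values, assuming train is an H2O frame of the 784 pixel columns; note that hidden_dropout_ratios needs one of the “...WithDropout” activations):
m <- h2o.deeplearning(
  1:784, training_frame = train,
  autoencoder = TRUE,
  hidden = c(128),
  activation = "TanhWithDropout",   #Needed for hidden_dropout_ratios
  input_dropout_ratio = 0.2,        #Denoising: zero 20% of each input
  hidden_dropout_ratios = c(0.3),   #Sparse-style: drop 30% of hidden units
  epochs = 20
  )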

The Aim

More layers in a supervised deep learning neural network can often give better results on complex problems. But the more layers you have, the harder it can be to train. Auto-encoders to the rescue! You run an auto-encoder on the raw inputs, and it will self-organize them - extract some information from them. You then take the middle hidden layer from the auto-encoder, and use that as the inputs to your supervised learning algorithm. Or possibly you do it again, with another auto-encoder, to extract (theoretically) an even higher-level abstraction.
I decided to try this on the MNIST data.
My initial approach was to treat it as a data compression problem: to see how few hidden neurons, in a single layer, I could get a perfect score with. I.e. this autoencoder had 784 input neurons, N hidden neurons, and 784 output neurons. Just one hidden layer, so to request, say, a 784x200x784 layout, in H2O I just do hidden=200; the input layer is implicit from the data, and the output is implicit because I specify autoencoder=TRUE. Unfortunately, even with N=784, I couldn’t get an MSE of 0.0.
(You should be able to see how there is one trivial way for such a network to get the perfect score: for each neuron in the middle layer, exactly one incoming weight should be 1.0 and all the others should be 0.0, and then the same for the outgoing weights. However, also appreciate how hard it would be for training to discover this if all 784 weights leading in and all 784 weights leading out of each neuron started off life as a random number.)

Getting Practical

So, under time pressure, I took an “educated guess” approach, and also an “ensemble” approach. I made three autoencoders (one of them being two-step, so four models in total), and used them together. The code to make them, and their outputs, is wrapped up in a couple of R functions, which I’ll show in a moment, but first a look at the parameters of the four models:
AE200: This uses a single layer of 200 hidden neurons. I set input_dropout_ratio = 0.3, which means as each training sample was used for training it would be setting a random 30% of the pixels to 0. This should make it more robust, less likely to over-fit. I also use L2 regularization, set to 1e-4 (which is fairly high).
AE32: This uses just 32 hidden neurons, so is going to be a less perfect representation than AE200. To compensate for that, I use a lower input_dropout_ratio = 0.1, and also lowered L2 regularization to 1e-5.
AE768: Not used directly. One hidden neuron per input pixel (almost) means it is more “rephrasing” than “compressing”. I used the same input_dropout_ratio = 0.3 and l2 = 1e-4 settings as AE200. (Among the 784 pixel columns there are a few around the edge that are exactly zero in all training data, so provide no information, and can be thrown away; that is where the 768 came from.)
AE128: This was built from the output of AE768. No input dropout, and just a bit of L2 regularization (1e-5).
L2 regularization penalizes large weights (e.g. see https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization). It helps make sure all pixels get considered, rather than allowing the algorithm to over-fit to one particular pixel.
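In loss-function terms, the l2 parameter is the λ in a penalty term added to the reconstruction error (shown schematically):

$$\text{loss} = \text{reconstruction error} + \lambda \sum_i w_i^2$$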
All four models used the tanh activation function, and were given 20 epochs.

The Model Generation Code

The following listing shows the R code for the above model descriptions, wrapped up in a function that takes two parameters:
  • data is the H2O frame to use for training data
  • x is which columns of that frame to use
create_MNIST_autoencoders <- function(data, x){
m_AE200 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(200),
  model_id = "AE200",
  autoencoder=T,
  input_dropout_ratio = 0.3,  #Quite high
  l2 = 1e-4,  #Quite high
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

m_AE32 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(32),
  model_id = "AE32",
  autoencoder = T,
  input_dropout_ratio = 0.1,  #Fairly low
  l2 = 1e-5,  #Fairly low
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

m_AE768 <- h2o.deeplearning(
  x, training_frame = data,
  hidden = c(768),
  model_id = "AE768",
  autoencoder = T,
  input_dropout_ratio = 0.3,  #Quite high
  l2 = 1e-4,  #Quite high
  activation = "Tanh",
  export_weights_and_biases = T,
  ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

f_AE768 = h2o.deepfeatures(m_AE768, data)

m_AE128 <- h2o.deeplearning(
  1:768, training_frame=f_AE768,
  hidden = c(128),
  model_id = "AE128",
  autoencoder = T,
  input_dropout_ratio = 0,  #No dropout
  l2 = 1e-5,   #Just a bit of L2
  activation = "Tanh",
  #export_weights_and_biases = T,
  #ignore_const_cols = F,
  train_samples_per_iteration = 0,
  epochs = 20
  )

return(list(m_AE200, m_AE32, m_AE768, m_AE128))
}
Feeding the output of one auto-encoder (AE768 in this case) into another (AE128) is done with h2o.deepfeatures().
These two lines are just for troubleshooting/visualization:
export_weights_and_biases = T,
ignore_const_cols = F,
And, in a sense, this one is too:
train_samples_per_iteration = 0
This says I want it to always score the model’s MSE at the end of every epoch. I did this so I could see the shape of the score history chart, and so get a feel for whether 20 epochs were enough. Normally touching train_samples_per_iteration counts as micro-management, because the default is to intelligently choose when to score, based on targets for time spent training vs. scoring, and for communication overhead. (See the explanation in chapter 8 of the book, if you crave more detail.)
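To actually look at that score history, something like this works (a sketch; the exact columns in the table vary a little between H2O versions, and m is assumed to be one of the four models returned by create_MNIST_autoencoders()):
h2o.scoreHistory(m)   #One row per scoring event (here, per epoch)
plot(m)               #Or let H2O plot the scoring history for you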

Using The Models To Make Pixels

The second helper function is shown next. It returns (a handle to) an H2O frame that has 200 + 32 + 128 columns from the autoencoders, plus any additional columns you specify in columns (which must include at least the answer column).
generate_from_MNIST_autoencoders <- function(models, data, columns){
  stopifnot(length(models) == 4)
  names(models) = c("AE200", "AE32", "AE768", "AE128")

  f_AE200 <- h2o.deepfeatures(models[["AE200"]], data)
  f_AE32 <- h2o.deepfeatures(models[["AE32"]], data)
  f_AE768 <- h2o.deepfeatures(models[["AE768"]], data)
  f_AE128 <- h2o.deepfeatures(models[["AE128"]], f_AE768)

  h2o.cbind(f_AE200, f_AE32, f_AE128, data[,columns] )
  }
Notice that if you don’t include the original 768 pixel columns in columns, the auto-encoded features will effectively replace the raw pixel data. (This was my intention, but you don’t have to do that.)

Usage Example

I’ll assume you have 785 columns, where the pixel data is in columns 1 to 784, and the answer, 0 to 9, is in column 785. If you have the book, you will be familiar with this convention:
train <- #...50K of training data
valid <- #...10K of validation data
test <- #...10K of test data
x <- 1:784
y <- 785
To use them I then write:
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, y)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, y)
test_ae <- generate_from_MNIST_autoencoders(models, test, y)
m <- h2o.deeplearning(1:360, 361, train_ae, validation_frame = valid_ae)
Here I train a deep learning model with all defaults, but it could be a more complex deep learning model, or it could be a random forest, GBM or any other algorithm supported by H2O.
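For example, swapping in a random forest or a GBM on the same auto-encoded frame is just (again with all defaults, as a sketch):
m_rf  <- h2o.randomForest(1:360, 361, train_ae, validation_frame = valid_ae)
m_gbm <- h2o.gbm(1:360, 361, train_ae, validation_frame = valid_ae)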
When using it for predictions, remember to use test_ae, not test (and similarly, in production, any future data has to be put through all four auto-encoder models). So following on from the above, you could evaluate it with:
h2o.performance(m, test_ae)

Usage: extended data

If you are lucky enough to have read the book, you will know I added 113 columns of “extended information” to the MNIST data. Though I got rid of the pixels, I chose to keep the extended columns alongside the auto-encoder generated data.
This is how the above code looks if you are using the extended MNIST data:
x <- 114:897
columns <- c(1:113,898)
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, columns)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, columns)
test_ae <- generate_from_MNIST_autoencoders(models, test, columns)
m <- h2o.deeplearning(1:473, 474, train_ae, validation_frame = valid_ae)

Summary

Informally, I can tell you that a deep learning model built on 473 auto-encoded/extended columns was significantly better than one built on 897 pixel/extended columns. However, I also increased the amount of training data at the same time (see http://darrendev.blogspot.com/2016/10/applying-rs-imager-library-to-mnist.html ), so I cannot tell you the relative contributions of those two changes.
But, I was pleased with the results. And the “educated guess” approach of choosing three distinct auto-encoder models and combining their outputs also seemed to work well. It might be that just one of the auto-encoder models is carrying all the useful information, and the others could be dropped? That is a good experiment to do (let us know if you do it!), but my hunch is that the “ensemble” approach is what allows the educated guess approach to work.

Thursday, October 13, 2016

Applying R's imager library to MNIST digits


Introduction

For machine learning, when it comes to training data, the more the better! Well, there are a whole bunch of disclaimers on that statement, but as long as the data is representative of the type of data the machine learning model will be used on in production, doubling the amount of unique training data will be more useful than doubling the number of epochs (if deep learning) or trees (if random forest, GBM, etc.).
I’m going to concentrate in this article on the MNIST data set, and look at how to use R’s imager library to increase the number of training samples. I take a deep look at MNIST in my new book, Practical Machine Learning with H2O, which (not surprisingly) I highly recommend. But, briefly, the MNIST data is handwritten digits, and the challenge is to identify which of the 10 digits, 0 to 9, each one is.
There are 60,000 training samples, plus 10,000 test samples (and everyone uses the same test set - so watch out for inadvertent over-fitting in published papers). The samples are 28x28 pixels, each a 0 to 255 greyscale value. I usually split the 60,000 samples into 50K for training and 10K for validation (to make sure I am not over-fitting on the test data).

imager and parallel

I used this R library for dealing with images: http://dahtah.github.io/imager/ which is based on a C++ library called CImg. Most function calls are just wrappers around the C++ code, which means they are fairly quick. It is well-documented with a good starter tutorial.
I used version 0.20 for all my development. I have just seen that 0.30 is now out, and in particular offers native parallelization. This is great! Out of scope for this article, but I used imager in conjunction with R’s parallel functions, and found the latter quite clunky, with time spent copying data structures limiting scalability. On the other hand, the docs say these new parallel options work best on large images, and the 28x28 MNIST images are certainly not that. So maybe I am still stuck using parApply() and friends.

The Approach

In the 20,000 unseen samples (10K valid, 10K test), there are often examples of bad handwriting that we don’t get to see in our 50,000 training samples. Therefore I am most interested in generating bad handwriting samples, not nice neat ones.
I spent a lot of time experimenting with what imager can produce, and settled on generating these effects, each with a random element:
  • rotate
  • warp (make it “scruffier”)
  • shift (move it 1 pixel up, down, left or right)
  • bold (make it fatter - my code)
  • dilate (make it fatter - cimg code)
  • erode (make it thinner)
  • erodedilate (one or the other)
  • scratches (add lines)
  • blotches (remove blobs)
I also defined “all” and “all2” which combined most of them.
In the full code I create the image in an imager (cimg) object called im, then copy it to im2. Each subsequent operation is performed on im2. im is left unchanged, but can be referred to for the initial state.
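To make that concrete, here is a minimal sketch of the outer step, going from one 784-pixel row to a cimg object and back again (applyEffect() is just a placeholder for whichever of the operations listed above is being applied, not a real function):
library(imager)

distortRow <- function(pixels){   #pixels: numeric vector of length 784, values 0-255
  im <- as.cimg(pixels, x = 28, y = 28)
  im2 <- im                       #im stays untouched, as the reference copy
  im2 <- applyEffect(im2)         #Placeholder: rotate, warp, shift, etc.
  as.vector(im2)                  #Back to 784 numbers, ready to write out
  }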

Rotate

The code to rotate comes in two parts. Here is the first part:
needSharpen = FALSE
angle = rnorm(1, 0, 8)
if(angle < -1.0 || angle > 1.0){
  im2 = imrotate(im2, angle, interpolation = 2)
  nPix = (width(im2) - width(im)) / 2
  im2 = crop.borders(im2 , nPix = nPix)
  needSharpen = TRUE
  }
The use of rnorm(sd=8) means 68% of the time the rotation will be within +/-8°, and only 5% of the time more than +/-16°. If my goal was simply more training samples, I’d perhaps have used a smaller sd, and/or clipped to a maximum rotation of 10°. But, as mentioned earlier, I wanted more scruffy handwriting. The if() block is a CPU optimization - if the rotation is less than 1°, don’t bother doing anything.
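If you want to sanity-check those percentages, a quick simulation will do it:
angles <- rnorm(1e5, 0, 8)
mean(abs(angles) <= 8)    #Roughly 0.68
mean(abs(angles) > 16)    #Roughly 0.05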
The imrotate() command takes the current im2 and replaces it with one that is rotated. This creates a larger image. To see what is going on, try running this (self-contained) script (see the inline comments for what is happening):
library(imager)

# Make 28x28 "mid-grey" square
im <- as.cimg(rep(128, 28*28), x = 28, y = 28)

#Prepare to plot side-by-side
par(mfrow = c(1,2))

#Show initial square 28x28
plot(im)

#Show rotated square, 34x34
plot(imrotate(im, angle = 16))
The output is like this:
You can see the image has become larger, to contain the rotated version. (That image also shows how imager’s plot command will scale the colours based on the range, and that my choice of 128 could have been any non-zero number. When there is only a single value (128), it chooses a grey. After rotating we have 0 for the background, 128 for the square, so it does 0 as black, 128 as white.)
For rotating MNIST digits:
  • I want to keep the 28x28 size
  • All the interesting content is in the middle, so clipping is fine.
So I call crop.borders(), which takes an argument nPix saying how many pixels to remove on each side. If it has grown from 28 to 34 pixels square, nPix will be 3.

Feeling A Bit Vague…

Here is what one of the MNIST digits looks like rotated 30° at a time, 11 times.




In a perfect world, 12 rotations would give you exactly the image you started with. But you can see the effect of each rotation is to blur it slightly. If we did another lap, even your clever mammalian brain would no longer recognize it as a 4.
The author of the imager library provided a match.hist() function (see it, and the surrounding discussion, here: https://github.com/dahtah/imager/issues/17 ) which does a good (but not perfect) job. Here are the histograms of the image before rotation, after rotation, and then after match.hist:




You can judge the success from looking at the image on the right, or by seeing how the bars on the rightmost histogram match those of the leftmost one. (Yes, even though the bumps are very small, their height, and where they are, really matter!)

You better sharpen up…

You will have noticed that the earlier rotate code set needSharpen to TRUE. That is used by the following code. Some of the time it uses the imager library’s imsharpen(), some of the time match.hist(), and some of the time a hack I wrote to make dim pixels dimmer, and bright pixels brighter.
if(needSharpen){
  if(runif(1) < 0.8){  #80% of the time: sharpen
    im2 <- imsharpen(im2, amplitude = 55)
    }
  if(runif(1) < 0.3){  #30% of the time: my contrast-stretching hack
    im2 <- ifelse(im2 < 128, im2-16, im2)  #Dim pixels get dimmer...
    im2 <- ifelse(im2 < 0, 0, im2)         #...but not below 0
    im2 <- ifelse(im2 > 200, im2+8, im2)   #Bright pixels get brighter...
    im2 <- ifelse(im2 > 150, im2+8, im2)
    im2 <- ifelse(im2 > 100, im2+8, im2)
    im2 <- ifelse(im2 > 255, 255, im2)     #...but capped at 255
  }else{                                   #Otherwise: match histogram to the original
    im2 <- match.hist(im2, im)
    }
  }

The Others

The other image modifications, listed earlier, use imwarp(), imshift(), pmax() with imshift() (for a bold effect), dilate_square(), and erode_square(). The blotches and scratches were done by putting random noise on an image, then using pmax() or pmin() to combine them.
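As a taster, here is a sketch of how the bold effect can be built from those pieces: pmax() of the image with copies shifted one pixel in each direction (this is my reconstruction from the description above, not the exact code used):
bold <- function(im2){
  pmax(im2,
       imshift(im2, delta_x = 1), imshift(im2, delta_x = -1),
       imshift(im2, delta_y = 1), imshift(im2, delta_y = -1))
  }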
If there is interest I can write another article going into the details.

Timings

On a 2.8GHz single-core, I recorded these timings to process 60,000 28x28 images. (It was a 36-core 8xlarge EC2 machine, but my R script, and imager (at the time), only used one thread.)
  • 304s to run “bold”
  • 296s to run “shift”
  • 417s for warp
  • 447s for rotate
  • 517s to 539s for all and all2
warp and rotate need to do the sharpen step, which is why they take longer.

Summary

I made 20 files, so over 95% of my training data was generated. As you will discover if you read the book, this generated data gave a very useful boost in model strength, though of course it dramatically increased learning-time, due to having 1.2 million training rows instead of 50,000. An interesting property was that the model found the generated data harder to learn: I got lower error rates on the unseen valid and test data sets than on the seen training data. This is a consequence of my deliberate decision to bias towards noisy data and scruffy handwriting.
Generating additional training data is a good way to prevent over-fitting, but generating data that is still representative is always a challenge. Normally you’d check mean, sd, the distribution, etc. to see how you’ve done, and usually your only technique is to jitter - add a bit of random noise. With images you have a rich set of geometric transformations to choose from, and you often use your eyes to see how it has done. Though, as we saw from the image histograms, there are some objective techniques available too.

Wednesday, October 12, 2016

Multinomial Ensemble In H2O

Categorization Ensemble In H2O

An ensemble is the term for using two or more machine learning models together, as a team. There are various types (normally categorized in the literature as boosting, bagging and stacking). In this article I want to look at making teams of high-quality categorizing models. (It handles both multinomial and binomial categorization.)
I will use H2O. Any machine-learning framework that returns probabilities for each of the possible categories can be used, but I will assume a little bit of familiarity with H2O… and therefore I highly recommend my book: Practical Machine Learning with H2O! By the way, I will use R here, though any language supported by H2O can be used.

Getting The Probabilities

Here is the core function (in R). It takes a list of models ¹, and runs predict on each of them. The rest of the code is a bit of manipulation to return a 3D array of just the probabilities (h2o.predict() returns its chosen answer in the “predict” column, so the setdiff() is to get rid of it).
getPredictionProbabilities <- function(models, data){
sapply(models, function(m){
  p <- h2o.predict(m, data)
  as.matrix(p[, setdiff(colnames(p), "predict") ])
  }, simplify = "array")
}

Using Them

getPredictionProbabilities() returns a 3D array:
  • First dimension is nrows(data)
  • Second dimension is the number of possible outcomes (labelled “p1”, “p2”, …)
  • Third dimension is length(models)
  • Value is 0.0 to 1.0.
To make the next explanation a bit less abstract, let’s assume it is MNIST data (where the goal is to look at raw pixels and guess which of 10 digits it is), and that we have made three models, and that we are trying it on 10,000 test samples. Therefore we have a 10000 x 10 x 3 array. Some example slices of it:
  • predictions[1,,] is the predictions for the first sample row; returning a 10 x 3 matrix.
  • predictions[1:5, "p3",] is the probability of each of the first 5 rows being the digit “2”. So, five rows and three columns, one per model. (It is “2”, because “p1” is “0”, “p2” is “1”, etc.).
  • predictions[1:5,,1] are the probabilities of the first model on the first 5 test samples, for each of the 10 categories.
When I just want the predictions of the team, and don’t care about the details, I use this wrapper function (explained in the next section):
predictTeam <- function(models, data){
  probabilities <- getPredictionProbabilities(models, data)
  apply(
    probabilities, 1,
    function(m) which.max(apply(m, 1, sum))
    )
  }

3D Arrays In R

A quick diversion into dealing with multi-dimensional arrays in R. apply(probabilities, 1, FUN) will apply FUN to each row in the first dimension. I.e. FUN will be called 10,000 times, and will be given a 10x3 matrix (m in the above code). I then use apply() on the first dimension of m, calling sum. sum() is called 10 times (once per category) and each time is given a vector of length 3, containing one probability per model.
The last piece of the puzzle is which.max(), which tells us which category has the biggest combined confidence from all models. It returns 1 if category 1 (i.e. “0” if MNIST), 2 if category 2 (“1” if MNIST), etc. This gets returned by the function, and therefore gets returned by the outer apply(), and therefore is what predictTeam() returns.
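If that is hard to picture, a toy example with a small array shows the shapes involved:
probs <- array(runif(6 * 4 * 3), dim = c(6, 4, 3))          #6 rows, 4 categories, 3 models
dim(probs[1, , ])                                           #4 3: one matrix per row
apply(probs, 1, function(m) which.max(apply(m, 1, sum)))    #6 predictions, each from 1 to 4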
Aside: Using sum() is equivalent to mean(), but just very slightly faster. E.g. 30 > 27 > 24, and if you divide through by 3, then 10 > 9 > 8.
Your answer column in your training data is a factor, so to convert that answer into the string it represents, use levels(). E.g. if f is the factor representing your answer, and p is what predictTeam() returned, then levels(f)[p] will get the string. (With MNIST, there is a short-cut: p-1 will do it.)

Counting Correctness

A common question to ask of an ensemble is how many test samples every member of the team got right, and how many none of the team got right. I’ll look at that in four steps. First up is to ask what each model’s best answer is.³
predictByModel <- apply(probabilities, 1,
 function(sample) apply(sample, 2, which.max)
 )
Assuming the earlier MNIST example, this gives a 3 x 10000 int matrix (where the values are 1 to 10).
Next, we need a vector of the correct answers, and they must be the category index, not the actual value. (In the case of MNIST data it is simply adding 1, so “0” becomes 1, “1” becomes 2, etc.)
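correctAnswersTest is a plain R vector; a minimal sketch of one way to build it, assuming test is the H2O test frame and column 785 holds the digit as a factor:
answers <- as.data.frame(test[, 785])[[1]]                     #Factor of "0" to "9"
correctAnswersTest <- as.integer(as.character(answers)) + 1    #MNIST short-cut: "0" -> 1, "1" -> 2, etc.
(For a non-MNIST problem you would match() the answers against the factor levels, instead of using the +1 short-cut.)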
eachModelsCorrectness <- apply(predictByModel, 1,
 function(modelPredictions) modelPredictions == correctAnswersTest
 )
This returns a 10000 x 3 logical (boolean) matrix. It tells us if each model got each answer correct. The next step returns a 10000 element vector, which will have 0 if they all got it wrong, 1 if only one member of the team got the correct answer, 2 if two of them, and 3 if all three members of the team got it right.
cmc <- apply(eachModelsCorrectness, 1, sum)
(cmc stands for count of model correctness.) table(cmc) will give you those four numbers. E.g. you might see:
 0    1    2    3
67   59  105 9769
Indicating that there were 67 that none of them got right, 59 samples that only one of the three models managed, 105 that just one model got wrong, and 9769 that all the models could do.
I sometimes find cumsum(table(cmc)) to be more useful:
 0     1     2     3
67   126   231 10000
Either way, if the first number is high, you need stronger model(s). Better team members. If it is low, but the ensemble is performing poorly (i.e. no better than the best individual model), then you need more variety in the models.

Further Ideas

By slicing and dicing the 3D probability array, you can do different things. You might want to poke into what those difficult 67 samples were, that none of your models can get. Another idea, which I will deal with in a future blog post, is how you can use the probability to indicate those with low confidence that need more investigation work. (You can use this with an ensemble, or just with a single model.)

Performance

In my experience, this kind of team will almost always give better performance than any individual model in the team. It gives the biggest jump in strength, when:
  • The models are very distinct.
  • The models are fairly close in strength.
  • Just a few models. Making 4 or 5 models, evaluating them on the valid data, and using the strongest 3 for the ensemble, works reasonably well. However, if you manage to make very distinct models of near-equal strength, the more the merrier.
For instance, three deep learning models, all built with the same parameters, each getting about 2% error individually, might manage 1.8% error as a team. But two deep learning models, one random forest, one GBM, all getting 2% error, might give 1.5% or even lower, when used together. Conversely, two very similar deep learning models that have 2% error combined with a GBM with 3% error and a random forest getting 4% error might do poorly, possibly even worse than your single best model.

Summary

This approach to making an ensemble for categorization problems is straightforward and usually very effective. Because it works with models of different types it at first glance seems closer to stacking, but really it is more like the bagging of random forest (except using probabilities, instead of mode⁴ as in random forest).

Footnotes

[1]: The models in models must all be modelling the same thing, i.e. they must each have been trained on data with the exact same columns.²
Normally each model will have been trained on the exact same data, but the precise phrasing is deliberate: interesting ensembles can be built with models trained on different subsets of your data. (By the way, subsets and ensemble are the two fundamental ideas behind the random forest algorithm.)
[2]: You can take this even further and only require they are all trained with the same response variable, not even the same columns. However, you’ll need to adapt getPredictionProbabilities() in that case, because each call to h2o.predict() will then need to be given a different data object.
[3]: I do it this way because I already had probabilities made. But if that is all you are interested in, this is the long way to get it. The direct way is to do the exact opposite of getPredictionProbabilities(): keep the “predict” column, and get rid of the other columns!
[4]: I did experiment with simple one vote per model (aka mode), instead of summing probabilities, and generally got worse results. But it is worth bearing that kind of ensemble in mind. If you do more rigorous experiments comparing the two methods, I’d be very interested to hear what kind of problems, if any, that voting (mode) is superior on.

Friday, September 23, 2016

Hacking the H2O R API


H2O comes with a comprehensive R API, but sometimes you want to do something that it does not (yet) support. This article will show how to add a couple of functions for fetching and saving models. Beyond giving you these functions, I want to show how to approach hacking on the API, including using internals. (Code in this article has been tested on the 3.8.2.x, 3.8.3.x and 3.10.0.x releases.)
If you want to learn more about H2O, and machine learning, may I recommend my book: Practical Machine Learning with H2O, published by O’Reilly? (It is “coming really soon” as I write this!) And, my company, QQ Trend, are available for helping you with all your machine learning needs, everything from a few hours of H2O-related consulting to helping you build massive models to solve the mysteries of life. (Contact me at dc at qqtrend.com )

Saving it all for another day

Say you have 30 models stored on H2O, and you want to save them all. The scenario might be that you want to stop the cluster overnight, but want to use your current set of models as the starting point for better models tomorrow. Or in an ensemble. Or something. At the time of writing H2O does not offer this functionality, in any of its various APIs and front-ends. So I want to write an h2o.saveAllModels() function.
Breaking that down a bit, I’m going to need these two functions:
  • get a list of all models
  • save models, given a list of models or model IDs.
Let’s start with the “or model IDs” requirement. H2O’s API offers h2o.saveModel(), but that only takes a model object, so how can we use it when all we have is an ID?

Exposing The Guts…

I am a huge fan of open source. H2O is open source. R is open source. But there is open, and then there is open, and one of the things I like about R is if you want to see how something was implemented, you just type the function name.
Type h2o.saveModel (without parentheses) in an R session (where you’ve already done library(h2o), of course) and you will see the source code. Here it is; notice how the only part of object that it uses is the model id - that is a stroke of luck, because it means that (under the covers) the API works just the way we needed it to!
function (object, path = "", force = FALSE) 
{
    #... Error-checking elided ...
    path <- file.path(path, object@model_id)
    res <- .h2o.__remoteSend(
      paste0("Models.bin/", object@model_id),
      dir = path, force = force, h2oRestApiVersion = 99)
    res$dir
}
If you are new to H2O, you need to understand that all the hard work is done in a Java application (which can equally well be running on your machine or on a cluster the other side of the world), and the clients (whether R, Python or Flow’s CoffeeScript) are all using the same REST API to send commands to it. So it should be no surprise to see .h2o.__remoteSend there; it is making a call to the “Models.bin” REST endpoint.
.h2o.__remoteSend is a private function in the R API. That means you cannot call it directly. Luckily, R doesn’t get in our way like Java or C++ would. We can use the package name followed by the triple colon operator to run it from normal user code: h2o:::.h2o.__remoteSend(...)
WARNING: Remember that hacking with the internals of an API is not future-proof. An upgrade might break everything you’ve written. …ooh, look at us, adrenaline flowing, living the life of danger. (Note to self: need to get out more.)

Let’s Write Some Code!

We now have enough to make the saveModels() function:
h2o.saveModels <- function(models, path, force = FALSE){
sapply(models, function(id){
  if(is.object(id))id <- id@model_id
  res <- h2o:::.h2o.__remoteSend(
    paste0("Models.bin/", id),
    dir = file.path(path, id),
    force = force, h2oRestApiVersion = 99)
  res$dir
  }, USE.NAMES=F)
}
The if(is.object(id)) id <- id@model_id line is what allows it to work with a mix of model id strings or model objects. The use of sapply(..., USE.NAMES=F) means it returns a character vector, containing the full path of each model file that was saved. Use it as follows:
h2o.saveModels(
  c("DL:defaults", "DL:200x200-500", "RF:100-40"),
  "/path/to/h2o_models/todays_hard_work",
  force = TRUE
  )
and it will output:
[1] "/path/to/h2o_models/todays_hard_work/DL:defaults"
[2] "/path/to/h2o_models/todays_hard_work/DL:200x200-500"
[3] "/path/to/h2o_models/todays_hard_work/RF:100-40"
(By the way, there is one irritating problem with this function: if any failure occurs, such as a file already existing, or a model ID not found, it stops with a long error message, and doesn’t attempt to save the other models. I’ll leave improving that to you. Hint: consider wrapping the h2o:::.h2o.__remoteSend() call with ?tryCatch.)
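For instance, the middle of h2o.saveModels() could become something like this (just a sketch of that hint; here a failed save produces a warning and an NA in the returned vector):
res <- tryCatch(
  h2o:::.h2o.__remoteSend(
    paste0("Models.bin/", id),
    dir = file.path(path, id),
    force = force, h2oRestApiVersion = 99),
  error = function(e){
    warning("Could not save ", id, ": ", conditionMessage(e))
    list(dir = NA)
    }
  )
res$dir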

What Models Have I Made?

Next, how to get a list of all models? The Flow interface has getModels, and the REST API has GET /3/Models, but the R and Python APIs do not; the closest they have is h2o.ls() which returns the names of all data frames, models, prediction results, etc. with no (reliable) way to tell them apart. But GET /3/Models is not ideal either, because it returns everything about the model, whereas all we want is the model id. Having trawled through the H2O source (BTW, if you fancied submitting a patch to add a GET /3/ModelIds/ command, the handler for GET /3/Models looks like a good starting point, i.e. what we want is just the first half of that function), it appears we are stuck with this. It is unlikely to matter unless you have 1000s of models, or a slow connection to your remote H2O cluster.
Start the same way as before by typing h2o.getModel (no parentheses) into R. Ooh! That function is long, and it is doing an awful lot. If you want a generally useful h2o.getModels() function I leave that as another of those exercises for the reader. Instead I’m going to call my function h2o.getAllModelIds(), and limit the scope to just that, which makes the code much simpler. (Did you notice the pro tip there: just by calling my function “getAllModelIds” instead of “getModels” I saved myself hours of work. You see kids, naming really does matter.)
Here it is:
h2o.getAllModelIds <- function(){
d <- h2o:::.h2o.__remoteSend(method = "GET", "Models")
sapply(d[[3]], function(x) x$model_id$name)
}
Line 1 says get all the models. Line 2 says filter just the model id out, and throw the rest of it away. (Yeah, that d[[3]] bit is particularly fragile.)
Anyway, the final step is to simply put our two new functions together:
h2o.saveAllModels <- function(path){
h2o.saveModels(h2o.getAllModelIds(), path)
}
Use it, as shown here:
fnames <- h2o.saveAllModels("/path/to/todays_hard_work")
On one test, length(fnames) returned 154 (I’d been busy), and those 154 models totalled 150MB. However some models (e.g. random forest) are bigger than others, so make sure you have plenty of disk space to hand, just in case. Speaking of which, h2o.saveAllModels() should work equally well with S3 or HDFS destinations.

The day after the night before…

I could’ve done a dput(fnames) after running h2o.saveAllModels(), and saved the output somewhere. But as I’m not putting anything else in that particular directory, I can get the list again with Sys.glob(). So, I might start my next day’s session as follows.
 library(h2o)
 h2o.init(nthreads = -1)
 fnames <- Sys.glob("/path/to/todays_hard_work/*")
 models <- lapply(fnames, h2o.loadModel)
Voila! models will be an R list of the H2O model objects.

Clusters

If you are working on a remote cluster, with more than one node, there is a little twist to be aware of. h2o.saveModel() (and therefore our h2o.saveModels() extension) will create files on whichever node of the cluster your client is connected to. (At least, as of 3.10.0.7; I suspect this behaviour might change in future.)
But h2o.loadModel() will look for it on the file system of node 1 of the cluster. And node 1 is not (necessarily) the first node you listed in your flatfile. Instead it is the one listed first in h2o.clusterStatus().
This won’t concern you if you saved to HDFS or S3.

Bonus

What I actually use to load model files is shown below. It will get all the files from sub-directories. (Look at the list.files() documentation for how you can use pattern to choose just some files to be loaded in.)
h2o.loadModelsDirectory <- function(path, pattern=NULL, recursive=T, verbose=F){
fnames <- list.files(path, pattern = pattern, recursive = recursive, full.names = T, include.dirs = F)
h2o.loadModels(fnames, verbose)
}
(This code still has one problem: if I’m loading in work from both yesterday and two days ago, and I had made a model called “DF:default” on both days, I lose one of them. Sorting that out is my final exercise for the reader - please post your answer, or a link to your answer, in the comments!)
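h2o.loadModels() itself is not shown here; a minimal sketch of such a wrapper (assuming verbose just means printing each file name as it is loaded) would be:
h2o.loadModels <- function(fnames, verbose = FALSE){
  lapply(fnames, function(fname){
    if(verbose)cat("Loading", fname, "\n")
    h2o.loadModel(fname)
    })
  }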

Tuesday, August 9, 2016

H2O Upgrade: Detailed Steps


I wanted to upgrade both R and Python to the latest version of H2O (as of Aug 9th 2016). Here are the exact steps, and I think you will find them relevant even if you only need to update one or the other of those clients. Remember if following this at a later date, that you should follow the spirit of it, rather than copy-and-pasting: all the version numbers will have changed. This was with Linux Mint, but it should apply equally well to all other Linux distros.
Make sure you first close any R or Python clients that are using the H2O library; and separately shutdown H2O if it is still running after that.

The First Time

The below instructions are all for upgrading, which means I know I have all the dependencies in place. If this is your first H2O install, well I’d first recommend you buy my new book: Practical Machine Learning with H2O, published by O’Reilly. (Coming really soon, as I type this! Let me know if you are interested, and I'll send the best discount code I can find.)
As a quick guide, from R I recommend you use CRAN
install.packages("h2o")
and from Python I recommend you use pip:
pip install h2o
Both these approaches get all the dependencies for you; you may end up back a version or two from the very latest, but it won’t matter.

The Download

cd /usr/local/src/
wget http://download.h2o.ai/versions/h2o-3.10.0.3.zip
unzip h2o-3.10.0.3.zip
It made a “h2o-3.10.0.3” directory, with python and R sub-directories.

R

I installed h2o as root the first time, so I will continue to do that, hence the sudo:
cd /usr/local/src/h2o-3.10.0.3/R/
sudo R
Then:
remove.packages("h2o")
install.packages("h2o_3.10.0.3.tar.gz")
Then ctrl-d to exit.

Python

cd /usr/local/src/h2o-3.10.0.3/python/
sudo pip uninstall h2o
sudo pip install -U h2o-3.10.0.3-py2.py3-none-any.whl 
(The -U means upgrade any dependencies; the first time I forgot it, and ended up with some very weird errors when trying to do anything in Python.)

The Test

I started RStudio, and ran:
library(h2o)
h2o.init(nthreads=-1)
as.h2o(iris)
I then started ipython and ran:
import h2o,pandas
h2o.init()
iris = h2o.get_frame("iris")
print(iris)
As well as making sure the data arrived, I’m also checking the h2o.init() call in both cases said the cluster version was “3.10.0.3”.

AWS Scripts

If you use the AWS scripts (https://github.com/h2oai/h2o-3/tree/master/ec2) and want to make sure EC2 instances start with exactly the same version as you have installed locally, the file to edit is h2o-cluster-download-h2o.sh. (If not using those scripts, just skip this section.)
First find the h2oBranch= line and set it to “rel-turing” (notice the “g” on the end - there is also a version without the “g”!). Then comment out the two curl calls that follow, and instead set version to be whatever you have above, and build to be the last digit in the version number. So, for 3.10.0.3, I set:
h2oBranch=rel-turing

#echo "Fetching latest build number for branch ${h2oBranch}..."
#curl --silent -o latest https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/latest
h2oBuild=3

#echo "Fetching full version number for build ${h2oBuild}..."
#curl --silent -o project_version https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/${h2oBuild}/project_version
h2oVersion=3.10.0.3
The rest of that script, and the other EC2 scripts, can be left untouched.

Summary

Well that was easy! No excuses! Having said that, I recommend you upgrade cautiously - I have seen some hard-to-test-for regressions (e.g. model learning no longer scaling as well over a cluster) when grabbing the latest version.

Monday, March 21, 2016

WebGL: RWD, Mobile-First And Progressive Enhancement

WebGL: adapting to the device

This is a rather under-studied area, but it is going to become more important as WebGL is increasingly used to make websites. This article is a summary of my thoughts and learnings on the topic, so far.

I should say that my context is using WebGL for things other than games. Informational websites, educational apps, data visualization, etc., etc.

Please use the comments to add related links you can recommend; or just to disagree.

What is RWD?

Responsive Web Design. The idea is that, rather than make a mobile version of your website and a separate desktop version of your website, you make a single version in such a way that it will adapt and be viewable on all devices, from mobile phones (both portrait and landscape), through tablets, to desktop computers.

Mobile-First? Progressive Enhancement?

Mobile First is the idea that you first make your site work on the smallest screen, and the device with least capability. Then for the larger screens you add more sections.

This is in contrast to starting at the other end: make beautiful graphics designed for a FullHD desktop monitor, using both mouse and keyboard, then hiding and removing things as you move to the smaller devices.

Just remember Mobile-First is a guideline, not a rule. If you end up with a desktop site where the user is getting frustrated by having to simulate touch gestures with their mouse, then you’ve missed the point.

It Is Hard!

RWD for a complex website can get rather hard. On toy examples it all seems nice and simple. But then add some forms. Make it multilingual. Add some CSS transitions and animations. Add user uploaded text or images. Then just as you start to crack under the strain of all the combinations that need testing, the fatal blow: the client insists on Internet Explorer 8 being supported.

But if you thought RWD and UX for normal websites was hard, then WebGL/3D takes it to a whole new dimension…

Progressive Enhancement In 3D

Progressive enhancement can be obvious things like using lower-polygon models and lower-resolution textures, or adding/removing shadows (see below).

But it can also be quite subtle things: in a 3D game presentation I did recently, the main avatar had an “idle” state animation: his chest moved up and down. But this requires redrawing the whole screen 60 times a second; without that idle animation the screen only needs to be redrawn when the character moves. Removing the idle animation can extend mobile battery life by an order of magnitude.

And that can lead to political issues. If you’ve ever seen a designer throw a tantrum just because you re-saved his graphics as 80% quality jpegs, think about what will happen if two-thirds of the design budget, and over three-quarters of the designer’s time went on making those subtle animations, and you’ve just switched them off for most of your users.

By the way, it is also about the difference between zero and one continuous animations. Remember an always-on “flyover” effect counts as animation. An arcade game where the user’s character is constantly being moved around the screen does too. So, once one effect requires constantly re-drawing the scene, the extra load of adding those little avatar animations will be negligible.

Lower-Poly Models

I mentioned this in passing above. Be aware that it is often more involved than with 2D images, where using Gimp/Photoshop to turn an 800x600 image into a 320x240 one at lower quality can be automated. In fact you may end up doubling your designer costs, if they have to make two versions.

If the motivation for low-poly is to reduce download time, you could consider running a sub-surf modifier once the data has been downloaded. Or describing the shape with a spline and dynamically extruding it.

If the motivation is to reduce the number of polygons to reduce CPU/GPU effort, again consider the extrude approach but using different splines, and/or different bevels.

Shadows

Adding shadows increases the realism of a 3D scene, but adds more CPU/GPU effort. Also more programmer effort: you need to specify which objects cast shadows, which objects receive shadows, and which light sources cast shadows. (All libraries I mentioned in Comparison Of Three WebGL Libraries handle shadows in this way.)

For many data visualization tasks, shadows are unnecessary, and could even get in the way. Even for games they are usually an optional extra. But in some applications the sense of depth that shadows give can really improve the user experience (UX).

If you have a fixed viewing angle on your 3D scene, and fixed lighting, you can use pre-made shadows: these are simply a 2D graphic that holds the shadow.

VR

With virtual reality headsets you will be updating two displays, and it has to be at a very high refresh rate, so it is demanding on hardware.

But virtual reality is well-suited for progressive enhancement: just make sure your website or application is fully usable without it, but if a user has the headset they are able to experience a deeper immersion.

Controls In 3D

Your standard web page didn’t need much controlling: up/down was the only axis of movement, plus being able to click a link. Touch-only mobile devices could adapt easily: drag up/down, and tap to follow a link.

Mouseover hints are usually done as progressive enhancements, meaning they are not available to people not using a device with a mouse. (Meaning in mobile apps I often have no idea what all the different icons do…)

If your WebGL involves the user navigating around a 3D world, the four arrow keys can be a very natural approach. But there is no common convention on a touch-only device. Some games show a semi-transparent joystick control on top, so you press that for the 4 directions. Others have you touch the left/right halves of the screen to steer left and right, and perhaps you move at a constant speed.

Another approach is to touch/click the point you want to move to, and have your app choose the route, and animate following it.

Zoom is an interesting one, as the approach for standard web sites can generally be used for 3D too. There are two conventions on mobile: the pinch to grow/shrink, or double-tap to zoom a fixed distance (and double-tap to restore). With a mouse, the scroll-wheel, while holding down ctrl, zooms. With only a keyboard, ctrl and plus/minus, with ctrl and zero to restore to default zoom.

Friday, March 18, 2016

Timestamp helper in Handlebars

Handlebars is a widely-used templating language for web pages. In a nutshell, the variables to insert go between {{ and }}. Easy. It offers a few bits of logic, such as if/else clauses, for-each loops, etc. But, just as usefully, Handlebars allows you to add helper functions of your own.
In this article I will show a nice little Handlebars helper to format datestamps and timestamps. Its raison d’etre is its support for multiple languages and timezones. The simplest use case (assuming birthday is their birthday in some common format):
<p>Your birthday is on {{timestamp birthday}}.</p>
It builds on top of sugar.js’s Date enhancements; I was going to do this article without using them, to keep it focused, but that would have made it unreasonably complex.
There are two ways to configure it: with global variables, or with per-tag options. For most applications, setting the globals once will be best. Here are the globals it expects to find:
  • tzOffset: the number of seconds your timezone is ahead of UTC. E.g. if in Japan, then tzOffset = 9*3600. If in the U.K. this is either 0 or 3600 depending on if it is summer time or not.
  • lang: The user-interface language, e.g. “en” for English, “ja” for Japanese, etc.
(By the way, if setting lang to something other than “en”, you will also need to have included locale support into sugar.js for the languages you are supporting - this is easy, see the sugar.js customize page, and check Date Locales.)
The default timestamp format is the one built-in to sugar.js for your specified language. All these configuration options (the two above, and format) can be overridden when using the tag. E.g. if start is the timestamp of when an online event starts, you could write:
<p>The live streaming will start at
{{timestamp start tzOffset=0}} UTC,
which is {{timestamp start tzOffset=32400}}
in Tokyo and {{timestamp start tzOffset=-25200}}
in San Francisco.</p>
Here is the basic version:
Handlebars.registerHelper('timestamp', function(t, options){
var offset = options.hash.tzOffset;
if(!offset)offset = tzOffset;

if(!Object.isDate(t)){
    if(!t)return "";
    if(Object.isString(t))t = Date.create(t + "+0000").setUTC(true).addSeconds(offset);
    else t = Date.create(t*1000).setUTC(true).addSeconds(offset);
    }
else t = t.clone().addSeconds(offset);

if(!t.isValid())return "";

var code = options.hash.lang;
if(!code)code = lang;   //Use global as default

var format = options.hash.format ? options.hash.format : '';
return t.format(format, code);
});
The first two-thirds of the function turn t into a Date object, coping with the cases where it is already a Date object, a string (in UTC, and in any common format that Date.create() can cope with), or a number (in which case it is seconds since Jan 1st 1970 UTC). However, be careful if giving a pre-made Date object: make sure it was the time in UTC and specifies that it is in UTC.
The rest of the function just chooses the language and format, and returns the formatted date string.
If you were paying attention you would have noticed t stores a lie. E.g. for 5pm BST, t would be given as 4pm UTC. We then turn it into a date that claims to be 5pm UTC. Basically this is to stop format() being too clever, and adjusting for local browser time. (This trick is so you can show a date in a browser for something other than the user’s local timezone.)
But it does mean that if you include any of the timezone specifiers in your format string, they will wrongly claim it is UTC. {{timestamp theDeadline format="{HH}:{mm} {tz}" }} will output 17:00 +0000.
To allow you to explicitly specify the true timezone, here is an enhanced version:
Handlebars.registerHelper('timestamp', function(t, options){
var offset = options.hash.tzOffset;
if(!offset)offset = tzOffset;   //Use global as default
if(!Object.isDate(t)){
    if(!t)return "";
    if(Object.isString(t))t = Date.create(t + "+0000").setUTC(true).addSeconds(offset);
    else t = Date.create(t*1000).setUTC(true).addSeconds(offset);
    }
else t = t.clone().addSeconds(offset);
if(!t.isValid())return "";

var code = options.hash.lang;
if(!code)code = lang;   //Use global as default

var format = options.hash.format ? options.hash.format : '';
var s = t.format(format, code);
if(options.hash.appendTZ)s+=tzString;
if(options.hash.append)s+=options.hash.append;
return s;
});
(the only change is to add a couple of lines near the end)
Now if you specify appendTZ=true then it will append the global tzString. Alternatively you can append any text you want by specifying append. So, our earlier example becomes one of these:
{{timestamp theDeadline format="{HH}:{mm}" appendTZ=true}}
{{timestamp theDeadline format="{HH}:{mm}" append="BST"}}
{{timestamp theDeadline format="{HH}:{mm}" append=theDeadlineTimezone}}
The first one assumes a global tzString is set. The second one hard-codes the timezone, which is unlikely to be the case; the third one is the same idea but getting timezone from another variable.
VERSION INFO: The above code is for sugar.js v.1.5.0, which is the latest version at the time of writing, and likely to be so for a while. If you need it for sugar.js 1.4.x then please change all occurrences of setUTC(true) to utc().

Wednesday, March 9, 2016

Comparison Of Three WebGL Libraries


For many people, WebGL is a technology for making browser-based games, but I am more interested in all the other uses: data visualization, data presentation, making web sites look fantastic, new and interesting user experience (UX), etc. (I have spent many years using Flash for similar things.)

What is WebGL?

WebGL is an API to allow browsers to use a GPU to speed up 2D and 3D graphics; you write in a mix of JavaScript and a shader language. Because it is low-level and complex I recommend against writing in raw WebGL; use a library instead.

It is supported on just about any popular OS/browser combination, including working on tablets and mobile phones. Your device does not need to have a dedicated GPU to run WebGL.

What libraries are there?

There are actually quite a few choices, but for this article I will focus on the three libraries I have made (non-trivial) WebGL applications with: Three.JS, Babylon.JS and Superpowers.
The first two are fairly low-level (Babylon.JS has a few more abstractions built-in), meaning you will be thinking in terms of vertices, faces, 3D coordinates, cameras, lighting, etc. A 3D graphics background will be useful. Superpowers is higher-level, but more focused on games development. Some Blender (or equivalent) skills will also come in handy, whichever library you go for.

Three.js And Its Resources


Three.JS is the most established WebGL library, with some published books, many demos (http://threejs.org/, https://stemkoski.github.io/Three.js/ and others), even a Udacity course.

However it has scant regard for backwards compatibility, meaning that frequently the code in the published books (or the source code of older demos and tutorials) will not work with the latest library version. It has a relatively aggressive developer community, who think that having an uncommented demo of a feature counts as documentation.

It uses the MIT license (the most liberal open source - fine for commercial use), hosted on github; bug reports to github, but support questions to StackOverflow’s [three.js] tag.

Babylon.js And Its Resources

Babylon.JS is now two years old, and was developed at Microsoft in France, though it is open source (Apache license, so fine for commercial use). It is primarily intended for making games, but is flexible enough for other work.

Like Three.JS, it has plenty of demos, and again they are often undocumented. There is an active web forum; explanations and experiments there often link to the Babylon Playground, which is a live coding editor. There is also a very useful eight-hour training video course (free), presented by the two Davids who created Babylon.JS. (There is a just-released book, https://www.packtpub.com/game-development/babylonjs-essentials, but I’ve not seen it, so cannot comment.)

Superpowers And Its Resources

Superpowers is a bit different: it is a gaming system, with its own IDE. It is very new, only released as open source (ISC license, which is basically the nice liberal MIT license again) in the middle of January 2016, though appears to have a year’s closed development behind it. (The IDE is cross-platform; it has been running nicely for me on Linux, I’ve not tried it on other platforms.)

Some of the initial batch of demos and games have been released on GitHub (kind of as open-source - the licenses are a bit vague, especially regarding re-use of assets), which has been my main source of learning. A few tutorials have also appeared recently (GameFromScratch.com, and https://itch.io/board/11494/tutorials-guides).

What grabbed my attention was the quality and completeness of the Fat Kevin game, combined with the fact that I could download all source and assets for it, to learn from. (The Discover Superpowers demo is similar, but simpler, so easier to learn from.)

Support is through forums on itch.io, with separate English and French sections. This requires yet another user account; I find it a shame they didn’t use StackOverflow, Github, or at least HTML5 Game Devs (as Babylon did). I’d not heard of itch.io (“an open marketplace for independent digital creators with a focus on independent video games”) before, but I think their choice tells you how they see Superpowers being used.

The coding language is TypeScript, basically JavaScript 1.8 plus types; it is worth specifying those types, as then the IDE’s helpful autocomplete can work. Note that Superpowers is closely tied to the IDE - you need to be clicking and dragging things; doing everything in code is not realistic (though this might just be the style of the initial few games). Superpowers is built on Three.JS, but I’m not seeing anything exposed, so I don’t think you can take a Three.JS example and use it directly.

Conclusion

Which library to choose? I suggest you try out the demos for each of these, and choose the library whose demos cover all the things you want to do. If the choice comes down to Three.JS vs. Babylon.JS, and you cannot find a killer reason to choose one over the other, that is because it doesn’t really matter: they can each do 95+% of what the other can. Follow your hunch, choose one, and dive in and learn it.

Finally, I should say that WebGL for website development is hard: your programmer(s) will need 3D experience, as will your graphic designer(s). If you are using RWD/mobile-first to target both mobile and desktop, it is even more complex. My company, QQ Trend Ltd., can help (contact me at dc [at] qqtrend.com).

Monday, March 7, 2016

Gradients in Three.JS

Gradients in Three.js

3D charts with gradients
Many years ago, I made some charts in Flash: 3D histograms using boxes and pyramids. More as proof of concept than anything else, I used two types of gradients on the sides:
  • Colour gradients (e.g. red at the bottom of the bar, gradually changing to be yellow at the top)
  • Opacity gradients (e.g. solid at the bottom, gradually changing to be only 20% opaque at the top)
Recently I’ve been trying to reproduce (and go beyond) those charts in WebGL. Gradients seem to be both harder, and less flexible, than they were in ActionScript/Flash.

I’ve been working with two libraries, Three.JS and Babylon.JS. In Babylon.JS I couldn’t find any examples of how to do either type of gradient. In Three.JS I believe there is no support for opacity gradients, but colour gradients are possible, and that will be the theme of this article.

Three.js: mesh, geometry, material

I will assume some familiarity with WebGL and Three.JS concepts, but the most essential knowledge you need to follow along with this article is:
  • a geometry is a shape.
  • a material is its appearance.
  • a mesh is a geometry plus a material.
Most of the time your geometry and your material are orthogonal. E.g. if you have a red shiny material you can apply it equally easily to your pyramid or your torus. And you can just as easily tile a grass image on either of those shapes.
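
For example, here is a minimal sketch of that (the geometry sizes and positions are arbitrary, and it assumes a scene variable like the one in the full example later in this article):

  // One shiny red material...
  var shinyRed = new THREE.MeshPhongMaterial({color: 0xff0000, shininess: 80});
  // ...shared by two very different geometries: a 4-sided cone (a pyramid) and a torus
  var pyramid = new THREE.Mesh(new THREE.CylinderGeometry(0, 1, 2, 4), shinyRed);
  var torus = new THREE.Mesh(new THREE.TorusGeometry(1, 0.4, 16, 32), shinyRed);
  pyramid.position.x = -2;  // just to keep the two shapes apart
  torus.position.x = 2;
  scene.add(pyramid);
  scene.add(torus);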

A more tightly coupled (less orthogonal) example is a game character (a mesh) you have made in, say, Blender, with a special texture map (a material) to give it a face and clothes. The mesh and the material are basically tied together. However, if the mesh comes with multiple poses, or animations, the same texture map works for all of them. And you can repaint the texture map to give your mesh (in any of its poses) new clothes.

In contrast, gradients are highly coupled, at least in the way I will show you here. Like coupling in software engineering, this is bad: I cannot prepare a red-to-yellow gradient material and then apply it to any mesh; instead I have to embed the gradient description into each mesh, in a way specific to that mesh.

Vertex Colours

The way it works is you can switch a material to use VertexColors. E.g.
  var mat = new THREE.MeshPhongMaterial({vertexColors:THREE.VertexColors});
And then over in the mesh you specify a colour for each vertex. If you do this, then Three.JS will, for each triangle, blend the vertex colours in a smooth gradient. All faces in Three.JS are triangles, and vertices are referenced through each face, so you end up with lots of lines like this:
  myGeo.faces[ix].vertexColors = [c1, c2, c3];
where each of c1, c2 and c3 are THREE.Color instances.

By the way, I said opacity gradients were not possible with this technique (because vertexColors takes an RGB triplet, not RGBA), but it is still possible to make the whole mesh semi-transparent.
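
For instance, continuing from the mat snippet above, two properties give you whole-mesh transparency (the same trick is used in the factory function below):

  mat.transparent = true;  // tell Three.JS this material needs alpha blending
  mat.opacity = 0.5;       // 0.0 is invisible, 1.0 is fully opaque; applied to the whole mesh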

BoxGeometry Vertices

A THREE.BoxGeometry is used to make a 6-faced cuboid shape. To be more precise, it is a shape made up of 12 triangles (two on each face). To be able to set vertexColors you need to know the order of those 12 triangles. I reverse-engineered it by colouring each in turn (a sketch of that trick follows the list), to get the following list:
  • 0: the top-left triangle of one of the side faces. Its vertex 0 is top-left, vertex 1 is bottom-left, vertex 2 is top-right (anti-clockwise).
  • 1: the bottom-right triangle of that same face. Its vertex 0 is bottom-left, vertex 1 is bottom-right, vertex 2 is top-right (anti-clockwise).
  • 2/3: the same again, for the opposite side face.
  • 4/5: the top face (triangle 4’s vertex 0 touches triangle 2’s vertex 0).
  • 6/7: the bottom face.
  • 8/9: one of the two remaining side faces.
  • 10/11: the other side face.
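
If you want to re-check that ordering yourself (worth doing against newer Three.JS releases, given the backwards-compatibility caveats earlier), here is a sketch of the colouring trick. It assumes geo is the THREE.BoxGeometry and its mesh uses a vertexColors material as shown above; change the 4 and reload to map out all 12 triangles:

  var loud = new THREE.Color(0xff0000), quiet = new THREE.Color(0xffffff);
  for(var ix = 0; ix < 12; ++ix){
    // paint triangle 4 red and everything else white, to see where it is
    geo.faces[ix].vertexColors = (ix == 4) ? [loud, loud, loud] : [quiet, quiet, quiet];
    }
  // if you recolour after the first render, you may also need:
  // geo.colorsNeedUpdate = true;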

A Factory Function

Here is a factory function to make a box mesh with colour c1 on the base, colour c2 on the top, and each side having a smooth linear gradient from c1 at the bottom to c2 at the top.

You can specify c1 and c2 as either a hex code (e.g. 0xff0000) or as a THREE.Color object. w, d, h are the three dimensions of the box. opacity is optional, and can be from 0.0 (invisible) to 1.0 (fully opaque - the default).

function makeGradientCube(c1, c2, w, d, h, opacity){
    if(typeof opacity === 'undefined')opacity = 1.0;
    if(typeof c1 === 'number')c1 = new THREE.Color( c1 );
    if(typeof c2 === 'number')c2 = new THREE.Color( c2 );

    var cubeGeometry = new THREE.BoxGeometry(w, h, d);

    var cubeMaterial = new THREE.MeshPhongMaterial({
        vertexColors: THREE.VertexColors
        });

    if(opacity < 1.0){
        cubeMaterial.opacity = opacity;
        cubeMaterial.transparent = true;
        }

    for(var ix = 0; ix < 12; ++ix){
        if(ix == 4 || ix == 5){ //Top face, all c2
            cubeGeometry.faces[ix].vertexColors = [c2, c2, c2];
            }
        else if(ix == 6 || ix == 7){ //Bottom face, all c1
            cubeGeometry.faces[ix].vertexColors = [c1, c1, c1];
            }
        else if(ix % 2 == 0){ //First triangle on each side face
            cubeGeometry.faces[ix].vertexColors = [c2, c1, c2];
            }
        else{ //Second triangle on each side face
            cubeGeometry.faces[ix].vertexColors = [c1, c1, c2];
            }
        }

    return new THREE.Mesh(cubeGeometry, cubeMaterial);
    }
 
Given the earlier explanation, I hope the code is self-explanatory: make a material where all we set is that it will use VertexColors (and, optionally, that it is partially transparent), make a box geometry, then go through all 12 triangles, work out which face each belongs to, and set the colours of its three corners accordingly.

A Full Example

Here is a complete example (you’ll need to paste in the above code, where shown), to quickly demonstrate that it works. (This was tested with r74, but as far as I know it should work back to at least r65.)

The code is minimal: make a scene, with a camera and a light, and put a gradient box (2x3 units at the base, 6 units high) at the centre of the scene. The gradient goes from red to a pale yellow. I made it slightly transparent (the 0.8 for the final parameter), but as it is the only object in the scene this has no effect (except to dim the colours a bit, because of the black background)!

<!DOCTYPE html>
<html>
<head>
  <title>Gradient test</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r74/three.min.js"></script>
  <script>
function makeGradientCube(c1, c2, w, d, h, opacity){/*As above*/}

function init() {
  var scene = new THREE.Scene();
  var camera = new THREE.PerspectiveCamera(45,
    window.innerWidth / window.innerHeight, 0.1, 1000);

  var renderer = new THREE.WebGLRenderer();
  renderer.setClearColor(0x000000, 1.0);
  renderer.setSize(window.innerWidth, window.innerHeight);

  var dirLight = new THREE.DirectionalLight();
  dirLight.position.set(30, 10, 20);
  scene.add(dirLight);

  scene.add( makeGradientCube(0xff0000, 0xffff66, 2,3,6, 0.8) );

  camera.position.set(10,10,10);
  camera.lookAt(scene.position);

  document.body.appendChild(renderer.domElement);
  renderer.render(scene, camera);
  }

window.onload = init;
  </script>
</head>
<body>
</body>
</html>


Sources

The above code was based on studying this example and trying to work out how it was doing that. It is undocumented - par for the course with Three.JS examples, sadly. I also peeked at the Three.JS source code. If you want more undocumented code examples of using THREE.VertexColors, see https://stemkoski.github.io/Three.js/Vertex-Colors.html

Future Work

First, if you write your own shaders I believe anything and everything is possible.
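
For instance (an untested sketch, not code from the charts above), a THREE.ShaderMaterial could fade opacity with height; Three.JS supplies position, projectionMatrix and modelViewMatrix to your shaders automatically:

var fadeMat = new THREE.ShaderMaterial({
  transparent: true,  // needed so the alpha we output actually gets blended
  uniforms: {
    color: { type: "c", value: new THREE.Color(0xff0000) },
    minY:  { type: "f", value: -3.0 },  // object-space bottom of a centred box 6 units high
    maxY:  { type: "f", value:  3.0 }   // object-space top of that box
    },
  vertexShader: [
    "varying float vY;",
    "void main(){",
    "  vY = position.y;  // pass the height up to the fragment shader",
    "  gl_Position = projectionMatrix * modelViewMatrix * vec4(position, 1.0);",
    "}"
    ].join("\n"),
  fragmentShader: [
    "uniform vec3 color;",
    "uniform float minY;",
    "uniform float maxY;",
    "varying float vY;",
    "void main(){",
    "  float t = clamp((vY - minY) / (maxY - minY), 0.0, 1.0);",
    "  gl_FragColor = vec4(color, mix(1.0, 0.2, t));  // solid at the base, 20% opaque at the top",
    "}"
    ].join("\n")
  });
// Note: a ShaderMaterial only does what you write - there is no Phong lighting here.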

Second, I wonder about making a gradient in a 2D canvas, and using that as a texture map. And/or using it as the alpha map to create an opacity gradient.
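
And a rough sketch of the second idea (again untested, with arbitrary sizes and colours): draw the gradient with the ordinary 2D canvas API, then wrap the canvas in a THREE.Texture:

var canvas = document.createElement('canvas');
canvas.width = 64;
canvas.height = 256;
var ctx = canvas.getContext('2d');
var grad = ctx.createLinearGradient(0, canvas.height, 0, 0);  // bottom to top
grad.addColorStop(0.0, '#ff0000');  // red at the bottom...
grad.addColorStop(1.0, '#ffff66');  // ...pale yellow at the top
ctx.fillStyle = grad;
ctx.fillRect(0, 0, canvas.width, canvas.height);

var texture = new THREE.Texture(canvas);
texture.needsUpdate = true;  // tell Three.JS the canvas now has content
var gradMat = new THREE.MeshPhongMaterial({ map: texture });  // usable on any mesh

If that works, it would also avoid the per-mesh coupling complained about earlier, since the same texture could be reused on any mesh.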

Either of those may be the subject of a future article. In the meantime, if you know a good tutorial on using gradients in either Babylon.JS or Three.JS, please link to it in the comments. Thanks, and thanks for reading!

Monday, February 29, 2016

Factorials and Rationals in R

At the recent NorDevCon, Richard Astbury used the example of factorial to compare some “modern” languages (the theme of his talk being that they were all invented decades ago). Among them was Scheme, which impressed me by having native support for very large numbers and rational numbers.

I felt like my go-to language for maths stuff, R, ought to be able to do that, too. The shortest R way to calculate factorials (using built-in functions) that I can find is:

prod(1:5)

which gives 120. But prod(1:50) only gives an approximate double (3.041409e+64) rather than the exact 65-digit integer, and for big enough inputs (prod(1:171) and beyond) it overflows to Inf.

BTW, I hoped this might work:

`*`(1:5)

but the * operator only takes two arguments (which is ironic for a language where everything is a vector, i.e. an array).

It seems I need to use the gmp package to get big number and rational number support. Here is how factorial can be written:

library(gmp)
prod(as.bigz(1:50))

which outputs
“30414093201713378043612608166064768844377641568960512000000000000”

That is not bad: still fairly short and, as you can see, vectorized operations are still trivial (as.bigz(1:50) creates a vector of 50 gmp numbers).

as.bigq(2, 3) is how you create a rational (⅔). Here is a quick example of creating vectors of rational numbers:

as.bigq(1, 1:50)

which outputs:

Big Rational ('bigq') object of length 50:
[1] 1 1/2 1/3 1/4 1/5 1/6 1/7 1/8 1/9 1/10 1/11 1/12 1/13 1/14 1/15
[16] 1/16 1/17 1/18 1/19 1/20 1/21 1/22 1/23 1/24 1/25 1/26 1/27 1/28 1/29 1/30
[31] 1/31 1/32 1/33 1/34 1/35 1/36 1/37 1/38 1/39 1/40 1/41 1/42 1/43 1/44 1/45
[46] 1/46 1/47 1/48 1/49 1/50

Rational operations also work nicely:

as.bigq(1,1:50) + as.bigq(1,3)

giving:
4/3 5/6 2/3 ... 49/138 50/141 17/48 52/147 53/150

Of course that is not as nice as having built-in rationals, but good enough for those relatively few times when you need them.