Darren's Developer Diary
<h2 id="word-embeddings-nlp-and-i18n-in-h2o">
Word Embeddings, NLP and I18N in H2O (2017-12-06)</h2>
Word embeddings can be thought of as a dimension-reduction tool that needs a sequence of tokens to learn from. They really are that generic, but I’ve only ever heard of them used for languages; i.e. the sequences are sentences, the tokens are words (or compound words, or n-grams, or morphemes).<br />
This blog post contains the code I presented recently on how to use the H2O implementation of word embeddings, aka word2vec. The main thing being demonstrated is that they apply equally well to any language, but you may need some language-specific tokenization, and other data engineering, first.<br />
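For English, tokenization can be as simple as splitting on non-word characters, which is what the <code>\W+</code> regex handed to H2O's tokenizer does later in this post. Here is a rough plain-Python equivalent (a sketch of the idea, not H2O's implementation):

```python
import re

sentences = ["I like to drive a blue car",
             "I like to drive a red lorry"]
# Split on runs of non-word characters, mirroring the "\W+" regex
# given to h2o.tokenize(); drop any empty strings the split leaves.
tokens = [[t for t in re.split(r"\W+", s) if t] for s in sentences]
print(tokens[0])  # ['I', 'like', 'to', 'drive', 'a', 'blue', 'car']
```

For Japanese there is no such easy regex, which is why the sentences further down are tokenized by hand.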
Here is the preparation code, in R; bring in H2O, and define a couple of helper functions.<br />
<pre><code>library(h2o)
h2o.init(nthreads = -1)
show <- function(w, v){
  x <- as.data.frame(h2o.cbind(w, v))
  x <- unique(x)
  plot(x[,2:3], pch = 16, type = "n")
  text(x[,2:3], x[,1], cex = 2.2)
}
reduceShow <- function(w, v){
  m <- h2o.prcomp(v, 1:ncol(v), k = 2, impute_missing = T)
  p <- h2o.predict(m, v)
  show(w, p)
}
</code></pre>
Then I define an artificial corpus, and try word embedding dimensions of 2, 4 and 9. For dimensions above 2, <code>reduceShow()</code> uses PCA to show just the first two dimensions.<br />
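What <code>reduceShow()</code> asks the server to do with <code>h2o.prcomp()</code> can be sketched in a few lines of NumPy; this is an illustration of the projection onto the first two principal components, not H2O's implementation:

```python
import numpy as np

def first_two_pcs(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # scores on PC1 and PC2

X = np.random.default_rng(0).normal(size=(20, 9))  # e.g. twenty 9-D embeddings
P = first_two_pcs(X)
print(P.shape)  # (20, 2)
```

SVD returns the components sorted by explained variance, so the two columns of `P` are the best 2-D view of the embedding cloud in the least-squares sense.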
<pre><code>eg1 = c(
 "I like to drive a blue car",
"I like to drive a red car",
"I like to drive a green car",
"I like to drive a blue lorry",
"I like to drive a yellow lorry",
"I like to drive a brown lorry",
"I like to drive a green lorry",
"I like to drive a red Ferrari",
"I like to drive a blue Mercedes"
)
eg1.words <- h2o.tokenize(
  as.character(as.h2o(eg1)), "\\\\W+")
head(eg1.words, 12)
eg1.wordsNoNA <- eg1.words[!is.na(eg1.words),]

eg1.wv <- h2o.word2vec(eg1.words,
  min_word_freq = 1,
  vec_size = 2)
eg1.vectors <- h2o.transform(eg1.wv,
  eg1.wordsNoNA,
  "NONE")
show(eg1.wordsNoNA, eg1.vectors)

eg1.wv4 <- h2o.word2vec(eg1.words,
  min_word_freq = 1,
  vec_size = 4)
eg1.vectors4 <- h2o.transform(eg1.wv4,
  eg1.wordsNoNA,
  "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors4)

eg1.wv9 <- h2o.word2vec(eg1.words,
  min_word_freq = 1,
  vec_size = 9,
  epochs = 50 * 9)
eg1.vectors9 <- h2o.transform(eg1.wv9,
  eg1.wordsNoNA,
  "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors9)
</code></pre>
Those results are fairly poor, because we only have 9 sentences; we can do better by sampling those 9 sentences 1000 times, with replacement. That is, more training data helps, even when it is just repetitions of the same data!<br />
<pre><code>eg2 = sample(eg1, size = 1000, replace = T)
#(rest of code exactly the same, just changing eg1 to eg2)
</code></pre>
What about Japanese? Here are the same 9 sentences (well, almost: “This is a …” each time, rather than “I like to drive a …”), but hand-tokenized in a realistic way (in particular, の is a separate token). I’ve gone straight to having over 1000 sentences, as we know that helps:<br />
<pre><code> # これは青い車です。
# これは赤い車です。
# これは緑の車です。
# これは青いトラックです。
# これは黄色いトラックです。
# これは茶色のトラックです。
# これは緑のトラックです。
# これは赤いフェラーリです。
# これは青いメルセデスです。
ja1 = c( #Pre-tokenized
"これ","は","青い","車","です","。",NA,
"これ","は","赤い","車","です","。",NA,
"これ","は","緑","の","車","です","。",NA, # ***
"これ","は","青い","トラック","です","。",NA,
"これ","は","黄色い","トラック","です","。",NA,
"これ","は","茶色","の","トラック","です","。",NA, # ***
"これ","は","緑","の","トラック","です","。",NA, # ***
"これ","は","赤い","フェラーリ","です","。",NA,
"これ","は","青い","メルセデス","です","。",NA
)
ja2 = rep(ja1, times = 120)
# length(ja2) is 7920 tokens, representing 1080 sentences.
</code></pre>
The code to try it is exactly as with the English, except we just import <code>ja2</code> and don’t run it through <code>h2o.tokenize()</code>.<br />
<pre><code>ja2.words <- as.character(as.h2o(ja2))
head(ja2.words, 12)
ja2.wordsNoNA <- ja2.words[!is.na(ja2.words),]
ja2.wv2 <- h2o.word2vec(ja2.words,
  min_word_freq = 1,
  vec_size = 2,
  epochs = 20)
ja2.vectors2 <- h2o.transform(ja2.wv2,
  ja2.wordsNoNA,
  "NONE")
show(ja2.wordsNoNA, ja2.vectors2)

ja2.wv4 <- h2o.word2vec(ja2.words,
  min_word_freq = 1,
  vec_size = 4,
  epochs = 20)
ja2.vectors4 <- h2o.transform(ja2.wv4,
  ja2.wordsNoNA,
  "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors4)

ja2.wv9 <- h2o.word2vec(ja2.words,
  min_word_freq = 1,
  vec_size = 9,
  epochs = 50)
ja2.vectors9 <- h2o.transform(ja2.wv9,
  ja2.wordsNoNA,
  "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors9)
</code></pre>
In Python? I’ll just quickly show the key changes. (There are full word-embedding examples in Python, for H2O, floating around, e.g. <a href="https://github.com/h2oai/h2o-meetups/blob/master/2017_11_14_NLP_H2O/Amazon%20Reviews.ipynb">https://github.com/h2oai/h2o-meetups/blob/master/2017_11_14_NLP_H2O/Amazon%20Reviews.ipynb</a>.)<br />
To bring the data in and tokenize it:<br />
<pre><code>eg1 = [...]
sentences = h2o.H2OFrame(eg1).ascharacter()
eg1_words = sentences.tokenize("\\W+")
eg1_words.head()
</code></pre>
Then to make the embeddings:<br />
<pre><code>from h2o.estimators.word2vec import H2OWord2vecEstimator
eg1_wv = H2OWord2vecEstimator(vec_size = 2, min_word_freq = 1)
eg1_wv.train(training_frame = eg1_words)
</code></pre>
And to get the vectors for visualization:<br />
<pre><code># eg1_words_no_NA is eg1_words with the NA separator rows removed,
# the equivalent of the R eg1.wordsNoNA frame above
eg1_vectors = eg1_wv.transform(eg1_words_no_NA, "NONE")
</code></pre>
(The Python code is untested as I type this - if I have a typo, let me know in the comments or at darren at dcook dot org, and I will fix it.)
<h2 id="the-seven-day-a-year-bug">
The Seven Day A Year Bug (2017-03-31)</h2>
I’ll cut straight to the chase: when you use <code>d.setMonth(m - 1)</code> in JavaScript, <i>always</i> set the optional second parameter.<br />
<br />
What’s that, you didn’t know there was one? Neither did I until earlier today. It allows you to set the day of the month at the same time. Cute, I thought at the time, a minor time-saver, but hardly worth complicating an API for.<br />
<br />
Ooh, how wrong I was. Let me take you back to when it happened. Friday, March 31st….<br />
<br />
After a long coding session, I did a check-in, and then ran all the unit tests. That’s strange: three failing, but in code I hadn’t touched all day. I peer at the code, but it looks correct - it was to do with dates, specifically months, and I was correctly subtracting 1.<br />
<br />
Aside: JavaScript dates inherit C’s approach of counting months from 0. In the first draft of this blog post I used a more judgemental phrase than “approach”. But to be fair, it was a 1970s design decision, and the world was different back then. Google “1970s men fashion”.<br />
<br />
So, back to the test failures. I start using <code>git checkout xxxx</code> to go back to earlier versions, to see exactly when it broke, running all the tests every time. By the time I’ve gone back 10 days and the tests still fail, I know something fishy is going on. I am fairly sure I ran all tests yesterday, and I am <i>certain</i> it hasn’t been 10 days.<br />
<br />
Timezones?! Unlikely, but the clocks did change last weekend. A quick test refutes that, though. (<code>TZ=XXX mocha .</code> will run your unit tests in timezone XXX.)<br />
<br />
So, out of ideas, I litter the failing code with <code>console.log</code> lines, to find out what is going on.<br />
<br />
Here is what is happening. I initialize a Date object to the current date (to set the current year), then call <code>setMonth()</code>. I don’t use the day, so don’t explicitly set it. I was calling <code>setMonth(8)</code>, expecting to see “September”, but the unit test was being given “October”. Where it gets interesting is that the default date today is March 31st. In other words, when I set month to September the date object becomes “September 31st”, which isn’t allowed. So it automatically changes it to October 1st.<br />
<br />
You hopefully see where the title of this piece comes from now? If I was setting a date in February I would have discovered the bug two days earlier, and if my unit test had chosen October instead of September, the bug would never have been detected. If I’d thought, “ah, I’ll run them Monday”, the bug would not have been discovered until someone used the code <i>in production</i> on May 31st. I’d have processed their bug report on June 1st and told them, “can’t reproduce it”. And they’d have gone, “Oh, you’re right, neither can I now.”<br />
<br />
To conclude with a happy ending, I changed all occurrences of <code>d.setMonth(m - 1)</code> into <code>d.setMonth(m - 1, 1)</code>, and the test failures all went away. I also changed all occurrences of <code>d.setMonth(m - 1);d.setDate(v)</code> (where <code>v</code> is the day of the month) into <code>d.setMonth(m - 1, v)</code>; not because it is shorter and I can impress people with my knowledge of JavaScript API calls, but because making two separate calls was a bug that I simply didn’t have a unit test for.<br />
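To see the rollover behaviour concretely, here is a sketch in Python of the rule described above; the <code>js_set_month</code> helper is hypothetical, mimicking what JavaScript's <code>Date.setMonth</code> does with an out-of-range day:

```python
from datetime import date, timedelta

def js_set_month(d, month, day=None):
    """Mimic JavaScript Date.setMonth(month[, day]): month is 0-based,
    and an out-of-range day rolls over into the following month."""
    if day is None:
        day = d.day                  # JS keeps the current day of the month
    y, m = d.year + month // 12, month % 12
    return date(y, m + 1, 1) + timedelta(days=day - 1)

today = date(2017, 3, 31)
print(js_set_month(today, 8))        # "September 31st" rolls to 2017-10-01
print(js_set_month(today, 8, 1))     # with the second argument: 2017-09-01
```

Setting the second argument pins the day of the month, so the result can never overflow into the next month.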
<br />
But writing that unit test can wait until Monday.
<h2 id="nordevcon-2017-code-samples">
NorDevCon 2017: code samples (2017-02-24)</h2>
<p>This is the sample code, in Python and R, for the pre-meeting talk I gave yesterday at NorDevCon 2017.</p>
<p>To install h2o for Python, from the commandline do:</p>
<pre><code> pip install h2o
</code></pre>
<p>To install it in R, from inside an R session do:</p>
<pre><code>install.packages("h2o")
</code></pre>
<p>Either way, that should pull in all the dependencies you need.</p>
<p>The data was the “train.csv.zip” file found at <a href="https://www.kaggle.com/c/prudential-life-insurance-assessment/data">Kaggle</a>. (You need to sign up to Kaggle to be allowed to download it.) The following scripts assume you have unzipped it and put train.csv in the same directory as the scripts.</p>
<p>That Kaggle URL is also where the description of fields is to be found.</p>
<p>Here is how to prepare H2O, and the data, in Python:</p>
<pre class="prettyprint"><code class="language-python hljs "><span class="hljs-keyword">import</span> h2o
h2o.init()
data = h2o.import_file(<span class="hljs-string">"train.csv"</span>)
data[<span class="hljs-string">"Ht"</span>].cor(data[<span class="hljs-string">"Wt"</span>])
factorsList = [<span class="hljs-string">'Product_Info_1'</span>, <span class="hljs-string">'Product_Info_2'</span>, <span class="hljs-string">'Product_Info_3'</span>, <span class="hljs-string">'Product_Info_5'</span>, <span class="hljs-string">'Product_Info_6'</span>, <span class="hljs-string">'Product_Info_7'</span>, <span class="hljs-string">'Employment_Info_2'</span>, <span class="hljs-string">'Employment_Info_3'</span>, <span class="hljs-string">'Employment_Info_5'</span>, <span class="hljs-string">'InsuredInfo_1'</span>, <span class="hljs-string">'InsuredInfo_2'</span>, <span class="hljs-string">'InsuredInfo_3'</span>, <span class="hljs-string">'InsuredInfo_4'</span>, <span class="hljs-string">'InsuredInfo_5'</span>, <span class="hljs-string">'InsuredInfo_6'</span>, <span class="hljs-string">'InsuredInfo_7'</span>, <span class="hljs-string">'Insurance_History_1'</span>, <span class="hljs-string">'Insurance_History_2'</span>, <span class="hljs-string">'Insurance_History_3'</span>, <span class="hljs-string">'Insurance_History_4'</span>, <span class="hljs-string">'Insurance_History_7'</span>, <span class="hljs-string">'Insurance_History_8'</span>, <span class="hljs-string">'Insurance_History_9'</span>, <span class="hljs-string">'Family_Hist_1'</span>, <span class="hljs-string">'Medical_History_2'</span>, <span class="hljs-string">'Medical_History_3'</span>, <span class="hljs-string">'Medical_History_4'</span>, <span class="hljs-string">'Medical_History_5'</span>, <span class="hljs-string">'Medical_History_6'</span>, <span class="hljs-string">'Medical_History_7'</span>, <span class="hljs-string">'Medical_History_8'</span>, <span class="hljs-string">'Medical_History_9'</span>, <span class="hljs-string">'Medical_History_11'</span>, <span class="hljs-string">'Medical_History_12'</span>, <span class="hljs-string">'Medical_History_13'</span>, <span class="hljs-string">'Medical_History_14'</span>, <span class="hljs-string">'Medical_History_16'</span>, <span 
class="hljs-string">'Medical_History_17'</span>, <span class="hljs-string">'Medical_History_18'</span>, <span class="hljs-string">'Medical_History_19'</span>, <span class="hljs-string">'Medical_History_20'</span>, <span class="hljs-string">'Medical_History_21'</span>, <span class="hljs-string">'Medical_History_22'</span>, <span class="hljs-string">'Medical_History_23'</span>, <span class="hljs-string">'Medical_History_25'</span>, <span class="hljs-string">'Medical_History_26'</span>, <span class="hljs-string">'Medical_History_27'</span>, <span class="hljs-string">'Medical_History_28'</span>, <span class="hljs-string">'Medical_History_29'</span>, <span class="hljs-string">'Medical_History_30'</span>, <span class="hljs-string">'Medical_History_31'</span>, <span class="hljs-string">'Medical_History_33'</span>, <span class="hljs-string">'Medical_History_34'</span>, <span class="hljs-string">'Medical_History_35'</span>, <span class="hljs-string">'Medical_History_36'</span>, <span class="hljs-string">'Medical_History_37'</span>, <span class="hljs-string">'Medical_History_38'</span>, <span class="hljs-string">'Medical_History_39'</span>, <span class="hljs-string">'Medical_History_40'</span>, <span class="hljs-string">'Medical_History_41'</span>]
data[factorsList] = data[factorsList].asfactor()
<span class="hljs-comment"># Split off a random 10% to use to evaluate</span>
<span class="hljs-comment"># the models we build.</span>
train, test = data.split_frame([<span class="hljs-number">0.9</span>], seed=<span class="hljs-number">123</span>)
<span class="hljs-comment"># Sanity check</span>
train.ncol
test.ncol
train.nrow
test.nrow
<span class="hljs-comment"># What the data looks like:</span>
train.head(rows=<span class="hljs-number">1</span>)
test.head(rows=<span class="hljs-number">1</span>)
# Define the predictor columns (x) and the response (y);
# column 0 is the Id, so exclude it
y = "Response"
x = [c for c in train.columns if c not in ["Id", y]]</code></pre>
<p>Here is the very quick deep learning model:</p>
<pre class="prettyprint"><code class="language-python hljs ">m_DL = h2o.estimators.H2ODeepLearningEstimator(epochs=<span class="hljs-number">1</span>)
m_DL.train(x, y, train)</code></pre>
<p>(I made the <em>powerful one-liner</em> claim in the talk but, as you can see, in Python they are two-liners.)</p>
<p>Then to evaluate that model:</p>
<pre class="prettyprint"><code class="language-python hljs ">m_DL
m_DL.predict( test[<span class="hljs-number">0</span>, x] ) <span class="hljs-comment">#Ask prediction about the first test record</span>
m_DL.predict( test[range(<span class="hljs-number">0</span>,<span class="hljs-number">6</span>), x] ).cbind(test[range(<span class="hljs-number">0</span>,<span class="hljs-number">6</span>), y] ) <span class="hljs-comment">#Compare predictions with actuals for the first 6 records</span>
m_DL.model_performance(test) <span class="hljs-comment">#Average performance on all 6060 test records</span>
m_DL.model_performance(train) <span class="hljs-comment">#For comparison: the performance on the data it was trained on</span></code></pre>
<p>Here is the default GBM model:</p>
<pre class="prettyprint"><code class="language-python hljs ">m_GBM = h2o.estimators.H2OGradientBoostingEstimator()
m_GBM.train(x, y, train)
m_GBM.model_performance(test)</code></pre>
<p>Then here is the tuned GBM model - basically it is all about giving it more trees to play with:</p>
<pre class="prettyprint"><code class="language-python hljs ">m_GBM_best = h2o.estimators.H2OGradientBoostingEstimator(
sample_rate=<span class="hljs-number">0.95</span>,
ntrees=<span class="hljs-number">200</span>,
stopping_tolerance=<span class="hljs-number">0</span>,stopping_rounds=<span class="hljs-number">4</span>,stopping_metric=<span class="hljs-string">"MSE"</span>
)
m_GBM_best.train(x, y, train, validation_frame=test)
m_GBM_best.model_performance(test)</code></pre>
<p>And here is the tuned deep learning model:</p>
<pre class="prettyprint"><code class="language-python hljs ">m_DL_best = h2o.estimators.H2ODeepLearningEstimator(
activation=<span class="hljs-string">"RectifierWithDropout"</span>,
hidden=[<span class="hljs-number">300</span>,<span class="hljs-number">300</span>,<span class="hljs-number">300</span>],
l1=<span class="hljs-number">1e-5</span>,
l2=<span class="hljs-number">0</span>,
input_dropout_ratio=<span class="hljs-number">0.2</span>,
hidden_dropout_ratios=[<span class="hljs-number">0.4</span>, <span class="hljs-number">0.4</span>, <span class="hljs-number">0.4</span>],
epochs=<span class="hljs-number">1000</span>,
stopping_tolerance=<span class="hljs-number">0</span>,stopping_rounds=<span class="hljs-number">4</span>,stopping_metric=<span class="hljs-string">"MSE"</span>
)
m_DL_best.train(x, y, train, validation_frame=test)
m_DL_best.model_performance(test)</code></pre>
<p>And here is the R code, that does the same as the above:</p>
<pre class="prettyprint"><code class="language-R hljs vala">library(h2o)
h2o.init(nthreads=-<span class="hljs-number">1</span>)
data = h2o.importFile(<span class="hljs-string">"train.csv"</span>)
<span class="hljs-preprocessor"># View it on Flow</span>
h2o.cor(data$Wt, data$BMI)
factorsList = c(<span class="hljs-string">'Product_Info_1'</span>, <span class="hljs-string">'Product_Info_2'</span>, <span class="hljs-string">'Product_Info_3'</span>, <span class="hljs-string">'Product_Info_5'</span>, <span class="hljs-string">'Product_Info_6'</span>, <span class="hljs-string">'Product_Info_7'</span>, <span class="hljs-string">'Employment_Info_2'</span>, <span class="hljs-string">'Employment_Info_3'</span>, <span class="hljs-string">'Employment_Info_5'</span>, <span class="hljs-string">'InsuredInfo_1'</span>, <span class="hljs-string">'InsuredInfo_2'</span>, <span class="hljs-string">'InsuredInfo_3'</span>, <span class="hljs-string">'InsuredInfo_4'</span>, <span class="hljs-string">'InsuredInfo_5'</span>, <span class="hljs-string">'InsuredInfo_6'</span>, <span class="hljs-string">'InsuredInfo_7'</span>, <span class="hljs-string">'Insurance_History_1'</span>, <span class="hljs-string">'Insurance_History_2'</span>, <span class="hljs-string">'Insurance_History_3'</span>, <span class="hljs-string">'Insurance_History_4'</span>, <span class="hljs-string">'Insurance_History_7'</span>, <span class="hljs-string">'Insurance_History_8'</span>, <span class="hljs-string">'Insurance_History_9'</span>, <span class="hljs-string">'Family_Hist_1'</span>, <span class="hljs-string">'Medical_History_2'</span>, <span class="hljs-string">'Medical_History_3'</span>, <span class="hljs-string">'Medical_History_4'</span>, <span class="hljs-string">'Medical_History_5'</span>, <span class="hljs-string">'Medical_History_6'</span>, <span class="hljs-string">'Medical_History_7'</span>, <span class="hljs-string">'Medical_History_8'</span>, <span class="hljs-string">'Medical_History_9'</span>, <span class="hljs-string">'Medical_History_11'</span>, <span class="hljs-string">'Medical_History_12'</span>, <span class="hljs-string">'Medical_History_13'</span>, <span class="hljs-string">'Medical_History_14'</span>, <span class="hljs-string">'Medical_History_16'</span>, <span 
class="hljs-string">'Medical_History_17'</span>, <span class="hljs-string">'Medical_History_18'</span>, <span class="hljs-string">'Medical_History_19'</span>, <span class="hljs-string">'Medical_History_20'</span>, <span class="hljs-string">'Medical_History_21'</span>, <span class="hljs-string">'Medical_History_22'</span>, <span class="hljs-string">'Medical_History_23'</span>, <span class="hljs-string">'Medical_History_25'</span>, <span class="hljs-string">'Medical_History_26'</span>, <span class="hljs-string">'Medical_History_27'</span>, <span class="hljs-string">'Medical_History_28'</span>, <span class="hljs-string">'Medical_History_29'</span>, <span class="hljs-string">'Medical_History_30'</span>, <span class="hljs-string">'Medical_History_31'</span>, <span class="hljs-string">'Medical_History_33'</span>, <span class="hljs-string">'Medical_History_34'</span>, <span class="hljs-string">'Medical_History_35'</span>, <span class="hljs-string">'Medical_History_36'</span>, <span class="hljs-string">'Medical_History_37'</span>, <span class="hljs-string">'Medical_History_38'</span>, <span class="hljs-string">'Medical_History_39'</span>, <span class="hljs-string">'Medical_History_40'</span>, <span class="hljs-string">'Medical_History_41'</span>)
data[,factorsList] <- as.factor(data[,factorsList])
splits <- h2o.splitFrame(data, <span class="hljs-number">0.9</span>, seed=<span class="hljs-number">123</span>)
train <- h2o.assign(splits[[<span class="hljs-number">1</span>]], <span class="hljs-string">"train"</span>) #<span class="hljs-number">90</span>% <span class="hljs-keyword">for</span> training
test <- h2o.assign(splits[[<span class="hljs-number">2</span>]], <span class="hljs-string">"test"</span>) #<span class="hljs-number">10</span>% to evaluate with
ncol(train) #<span class="hljs-number">128</span>
ncol(test) #<span class="hljs-number">128</span>
nrow(train) #<span class="hljs-number">53321</span>
nrow(test) #<span class="hljs-number">6060</span>
t(head(train, <span class="hljs-number">1</span>))
t( as.matrix(test[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>:<span class="hljs-number">127</span>]) )
m_DL <- h2o.deeplearning(<span class="hljs-number">2</span>:<span class="hljs-number">127</span>, <span class="hljs-number">128</span>, train)
m_DL <- h2o.deeplearning(<span class="hljs-number">2</span>:<span class="hljs-number">127</span>, <span class="hljs-number">128</span>, train, epochs = <span class="hljs-number">1</span>) #<span class="hljs-number">7</span> to <span class="hljs-number">9</span> secs
<span class="hljs-preprocessor">#system.time( m_DL <- h2o.deeplearning(2:127, 128, train) ) #42.5 secs</span>
h2o.predict(m_DL, test[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>:<span class="hljs-number">127</span>])
h2o.cbind(
h2o.predict(m_DL, test[<span class="hljs-number">1</span>:<span class="hljs-number">6</span>, <span class="hljs-number">2</span>:<span class="hljs-number">127</span>]),
test[<span class="hljs-number">1</span>:<span class="hljs-number">6</span>, <span class="hljs-number">128</span>]
)
<span class="hljs-preprocessor"># predict Response</span>
<span class="hljs-preprocessor"># 1 7.402184 8</span>
<span class="hljs-preprocessor"># 2 5.414277 1</span>
<span class="hljs-preprocessor"># 3 6.946732 8</span>
<span class="hljs-preprocessor"># 4 6.542647 1</span>
<span class="hljs-preprocessor"># 5 2.596471 6</span>
<span class="hljs-preprocessor"># 6 6.224758 5</span>
h2o.performance(m_DL, test)
<span class="hljs-preprocessor"># H2ORegressionMetrics: deeplearning</span>
<span class="hljs-preprocessor"># </span>
<span class="hljs-preprocessor"># MSE: 3.770782</span>
<span class="hljs-preprocessor"># RMSE: 1.94185</span>
<span class="hljs-preprocessor"># MAE: 1.444321</span>
<span class="hljs-preprocessor"># RMSLE: 0.4248774</span>
<span class="hljs-preprocessor"># Mean Residual Deviance : 3.770782</span>
<span class="hljs-preprocessor">######</span>
m_GBM <- h2o.gbm(<span class="hljs-number">2</span>:<span class="hljs-number">127</span>, <span class="hljs-number">128</span>, train) #<span class="hljs-number">7.3</span>s
h2o.predict(m_GBM, test[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>:<span class="hljs-number">127</span>])
h2o.cbind(
h2o.predict(m_GBM, test[<span class="hljs-number">1</span>:<span class="hljs-number">6</span>, <span class="hljs-number">2</span>:<span class="hljs-number">127</span>]),
test[<span class="hljs-number">1</span>:<span class="hljs-number">6</span>, <span class="hljs-number">128</span>]
)
<span class="hljs-preprocessor"># predict Response</span>
<span class="hljs-preprocessor"># 1 6.934054 8</span>
<span class="hljs-preprocessor"># 2 5.231893 1</span>
<span class="hljs-preprocessor"># 3 7.135411 8</span>
<span class="hljs-preprocessor"># 4 5.906502 1</span>
<span class="hljs-preprocessor"># 5 3.056508 6</span>
<span class="hljs-preprocessor"># 6 5.049540 5</span>
h2o.performance(m_GBM, test)
<span class="hljs-preprocessor"># MSE: 3.599897</span>
<span class="hljs-preprocessor"># RMSE: 1.89734</span>
<span class="hljs-preprocessor"># MAE: 1.433456</span>
<span class="hljs-preprocessor"># RMSLE: 0.4225507</span>
<span class="hljs-preprocessor"># Mean Residual Deviance : 3.599897</span>
<span class="hljs-preprocessor">##########</span>
<span class="hljs-preprocessor">#Takes 20-30secs</span>
m_GBM_best = h2o.gbm(
<span class="hljs-number">2</span>:<span class="hljs-number">127</span>, <span class="hljs-number">128</span>, train,
sample_rate = <span class="hljs-number">0.95</span>,
validation_frame = test,
stopping_tolerance = <span class="hljs-number">0</span>,
stopping_rounds = <span class="hljs-number">4</span>,
stopping_metric = <span class="hljs-string">"MSE"</span>,
ntrees = <span class="hljs-number">200</span>
)
<span class="hljs-preprocessor">#h2o.performance gave MSE of 3.473637856428858</span>
plot(m_GBM_best)
h2o.scoreHistory(m_GBM_best)
<span class="hljs-preprocessor">####################</span>
<span class="hljs-preprocessor"># 3-4 minutes (204secs)</span>
m_DL_best <- h2o.deeplearning(
<span class="hljs-number">2</span>:<span class="hljs-number">127</span>, <span class="hljs-number">128</span>, train,
epochs = <span class="hljs-number">1000</span>,
validation_frame = test,
stopping_tolerance = <span class="hljs-number">0</span>,
stopping_rounds = <span class="hljs-number">4</span>,
stopping_metric = <span class="hljs-string">"MSE"</span>,
activation = <span class="hljs-string">"RectifierWithDropout"</span>,
hidden = c(<span class="hljs-number">300</span>, <span class="hljs-number">300</span>, <span class="hljs-number">300</span>),
l1 = <span class="hljs-number">1e-5</span>,
l2 = <span class="hljs-number">0</span>,
input_dropout_ratio = <span class="hljs-number">0.2</span>,
hidden_dropout_ratios = c(<span class="hljs-number">0.4</span>, <span class="hljs-number">0.4</span>, <span class="hljs-number">0.4</span>)
)
h2o.performance(m_DL_best, test)
<span class="hljs-preprocessor"># MSE: 3.609624</span>
<span class="hljs-preprocessor"># RMSE: 1.899901</span>
<span class="hljs-preprocessor"># MAE: 1.444417</span>
<span class="hljs-preprocessor"># RMSLE: 0.4164153</span>
<span class="hljs-preprocessor"># Mean Residual Deviance : 3.609624</span></code></pre>
<p>Finally, and not surprisingly, I can highly recommend my own book if you would like to learn more about how to use H2O. The examples in the book use three different data sets, and go into more depth on the different machine learning algorithms that H2O offers, as well as some ideas about how to tune them:</p>
<p>From O’Reilly here: <br>
<a href="http://shop.oreilly.com/product/0636920053170.do">http://shop.oreilly.com/product/0636920053170.do</a></p>
<p>From Amazon UK: <br>
<a href="https://www.amazon.co.uk/Practical-Machine-Learning-Darren-Cook/dp/149196460X">https://www.amazon.co.uk/Practical-Machine-Learning-Darren-Cook/dp/149196460X</a></p>
<p>(And other good bookshops, of course!)</p>
<p>Thanks!</p>
<h2 id="applying-auto-encoders-to-mnist">
Applying Auto-encoders to MNIST (2016-10-14)</h2>
This is a companion article to my new book, Practical Machine Learning with H2O, <a href="http://shop.oreilly.com/product/0636920053170.do">published by O’Reilly</a>. Beyond the sheer, unadulterated pleasure you will get from reading it, I’m also recommending it for readers of this article, because I’m only going to lightly introduce topics such as H2O, MNIST, and even auto-encoders, that are covered in much more depth in the book.<br />
<h3 id="that-light-introduction">
That Light Introduction</h3>
H2O is a powerful, scalable and fast machine-learning server/framework, with APIs in R and Python (as well as Coffeescript, Scala via its Spark interface, and others). It has relatively few machine learning algorithms, but they are generally the ones you would have settled on using anyway, and each has been optimized to scale across clusters and big data, and each has many parameters for tuning.<br />
MNIST is a machine learning problem to recognize which of the digits 0 to 9 a set of 784 pixels represents. There are 60,000 training samples, and 10,000 test samples. To avoid inadvertently over-fitting to the test samples, I split the 60K into 50K training data and 10K validation data.<br />
Auto-encoders are the unsupervised version of deep-learning (neural nets, if you prefer). The <a href="https://en.wikipedia.org/wiki/Autoencoder">Wikipedia article</a> is a good introduction to the idea. By setting the <code>input_dropout_ratio</code> parameter H2O supports the “Denoising autoencoder” variation, and with <code>hidden_dropout_ratios</code> H2O supports the “Sparse autoencoder” variation.<br />
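Conceptually, <code>input_dropout_ratio</code> means each training sample has that fraction of its inputs randomly zeroed before being fed in; a minimal NumPy sketch of the corruption step (an illustration, not H2O's internals):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.01, 1.0, size=(100, 784))  # a batch of non-zero "pixel" rows
keep = rng.uniform(size=X.shape) >= 0.3      # input_dropout_ratio = 0.3
X_noisy = X * keep                           # roughly 30% of inputs zeroed
print((X_noisy == 0).mean())                 # close to 0.3
```

The network is trained to reconstruct the clean `X` from the corrupted `X_noisy`, which is what makes the denoising variant robust.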
<h3 id="the-aim">
The Aim</h3>
More layers in a supervised deep learning neural network can often give better results on complex problems. But the more layers you have, the harder it can be to train. Auto-encoders to the rescue! You run an auto-encoder on the raw inputs, and it will self-organize them - extract some information from them. You then take the middle hidden layer from the auto-encoder, and use <i>that</i> as the inputs to your supervised learning algorithm. Or possibly you do it again, with another auto-encoder, to extract (theoretically) an even higher-level abstraction.<br />
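As a toy illustration of that pipeline (definitely not H2O's implementation): a tiny NumPy autoencoder trained by plain gradient descent, whose hidden activations <code>H</code> are the compressed features you would hand to the supervised learner. The 16-4-16 layout and random data are arbitrary assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_hid = 16, 4
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_in)); b2 = np.zeros(n_in)
X = rng.normal(size=(200, n_in))

def forward(X):
    H = np.tanh(X @ W1 + b1)      # the compressed "code" layer
    return H, H @ W2 + b2         # linear reconstruction of the inputs

_, R = forward(X)
mse_before = ((R - X) ** 2).mean()
lr = 1.0
for _ in range(1000):             # batch gradient descent on reconstruction MSE
    H, R = forward(X)
    dR = 2 * (R - X) / X.size
    dW2, db2 = H.T @ dR, dR.sum(0)
    dH = (dR @ W2.T) * (1 - H ** 2)   # back through tanh
    dW1, db1 = X.T @ dH, dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
H, R = forward(X)
mse_after = ((R - X) ** 2).mean()
print(mse_after < mse_before)     # reconstruction error has dropped
# H (200x4) is the feature matrix a supervised model would train on
```

Stacking simply repeats this: train another autoencoder whose input is `H`.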
I decided to try this on the MNIST data.<br />
My initial approach was to treat it as a data compression problem: to see how few hidden neurons, in a single layer, I could get a perfect score with. I.e. this autoencoder had 784 input neurons, N hidden neurons, and 784 output neurons. Just one hidden layer, so to request, say, a 784x200x784 layout, in H2O I just do <code>hidden=200</code>; the input layer is implicit from the data, and the output layer is implicit because I specify <code>autoencoder=TRUE</code>. Unfortunately, even with N=784, I couldn’t get an MSE of 0.0.<br />
(You should be able to see how there is one trivial way for such a network to get the perfect score: for each neuron in the middle layer, exactly one incoming weight should be 1.0 and all the others should be 0.0, and then the same for the outgoing weights. However, also appreciate how hard it would be for training to discover this if all 784 weights leading in and all 784 weights leading out of each neuron started off life as a random number.)<br />
<h3 id="getting-practical">
Getting Practical</h3>
So, under time pressure, I took an “educated guess” approach, and also an “ensemble” approach. I made three autoencoders (one of them being two-step, so four models in total), and used them together. The code to make them, and their outputs, is wrapped up in a couple of R functions, which I’ll show in a moment, but first a look at the parameters of the four models:<br />
<b>AE200:</b> This uses a single layer of 200 hidden neurons. I set <code>input_dropout_ratio = 0.3</code>, which means as each training sample was used for training it would be setting a random 30% of the pixels to 0. This should make it more robust, less likely to over-fit. I also use L2 regularization, set to 1e-4 (which is fairly high).<br />
<b>AE32:</b> This uses just 32 hidden neurons, so is going to be a less faithful representation than AE200. To compensate for that, I use a lower <code>input_dropout_ratio = 0.1</code>, and also lowered the L2 regularization to 1e-5.<br />
<b>AE768:</b> Not used directly. With (almost) one hidden neuron per input pixel, it is “rephrasing” rather than “compressing”. I used the same <code>input_dropout_ratio = 0.3</code> and <code>l2 = 1e-4</code> settings as AE200. (Among the 784 pixel columns there are a few around the edge that are exactly zero in all training data, so they provide no information and can be thrown away; that is where the 768 came from.)<br />
<b>AE128:</b> This was built from the output of AE768. No input dropout, and just a bit of L2 regularization (1e-5).<br />
L2 regularization penalizes large weights (e.g. see <a href="https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization">https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization</a>). It helps make sure all pixels get considered, rather than allowing the algorithm to over-fit to one particular pixel.<br />
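As a rough sketch of what that <code>l2</code> value does (my own illustration of the general idea - H2O’s exact scaling of the penalty may differ): the training loss gains a term proportional to the sum of squared weights, so spreading small weights across many pixels is cheaper than leaning heavily on one pixel.<br />

```r
# L2 penalty sketch: penalty = l2 * sum(weights^2)
# (illustrative only, not H2O's exact internal formula)
l2 <- 1e-4
w_concentrated <- c(10, 0, 0, 0)         # over-fit to one pixel
w_spread       <- c(2.5, 2.5, 2.5, 2.5)  # same total weight, spread out
l2 * sum(w_concentrated^2)  # 0.01
l2 * sum(w_spread^2)        # 0.0025 -- the spread-out weights are cheaper
```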
All four models used the tanh activation function, and were given 20 epochs.<br />
<h3 id="the-model-generation-code">
The Model Generation Code</h3>
The following listing shows the R code for the above model descriptions, wrapped up in a function that takes two parameters:<br />
<ul>
<li><code>data</code> is the H2O frame to use for training data</li>
<li><code>x</code> is which columns of that frame to use</li>
</ul>
<pre><code>create_MNIST_autoencoders <- function(data, x){
  m_AE200 <- h2o.deeplearning(
    x, training_frame = data,
    hidden = c(200),
    model_id = "AE200",
    autoencoder = T,
    input_dropout_ratio = 0.3, #Quite high
    l2 = 1e-4, #Quite high
    activation = "Tanh",
    export_weights_and_biases = T,
    ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
  )
  m_AE32 <- h2o.deeplearning(
    x, training_frame = data,
    hidden = c(32),
    model_id = "AE32",
    autoencoder = T,
    input_dropout_ratio = 0.1, #Fairly low
    l2 = 1e-5, #Fairly low
    activation = "Tanh",
    export_weights_and_biases = T,
    ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
  )
  m_AE768 <- h2o.deeplearning(
    x, training_frame = data,
    hidden = c(768),
    model_id = "AE768",
    autoencoder = T,
    input_dropout_ratio = 0.3, #Quite high
    l2 = 1e-4, #Quite high
    activation = "Tanh",
    export_weights_and_biases = T,
    ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
  )
  f_AE768 <- h2o.deepfeatures(m_AE768, data)
  m_AE128 <- h2o.deeplearning(
    1:768, training_frame = f_AE768,
    hidden = c(128),
    model_id = "AE128",
    autoencoder = T,
    input_dropout_ratio = 0, #No dropout
    l2 = 1e-5, #Just a bit of L2
    activation = "Tanh",
    #export_weights_and_biases = T,
    #ignore_const_cols = F,
    train_samples_per_iteration = 0,
    epochs = 20
  )
  return(list(m_AE200, m_AE32, m_AE768, m_AE128))
}
</code></pre>
Feeding the output of one auto-encoder (<code>AE768</code> in this case) into another (<code>AE128</code>) is done with <code>h2o.deepfeatures()</code>.<br />
These two lines are just for troubleshooting/visualization:<br />
<pre><code>export_weights_and_biases = T,
ignore_const_cols = F,
</code></pre>
And, in a sense, this one is too:<br />
<pre><code>train_samples_per_iteration = 0
</code></pre>
This says I want it to <i>always</i> score the model’s MSE at the end of every epoch. I did this so I could see the shape of the score history chart, and so get a feel for whether 20 epochs were enough. Normally, touching <code>train_samples_per_iteration</code> counts as micro-management, because the default is to choose intelligently when to score, based on targets for time spent training vs. scoring, and on communication overhead targets. (See the explanation in chapter 8 of the book, if you crave more detail.)<br />
<h3 id="using-the-models-to-make-pixels">
Using The Models To Make Pixels</h3>
The second helper function is shown next. It returns (a handle to) an H2O frame that has 200 + 32 + 128 columns from the autoencoders, plus any additional columns you specify in <code>columns</code> (which must include at least the answer column).<br />
<pre><code>generate_from_MNIST_autoencoders <- function(models, data, columns){
  stopifnot(length(models) == 4)
  names(models) <- c("AE200", "AE32", "AE768", "AE128")
  f_AE200 <- h2o.deepfeatures(models[["AE200"]], data)
  f_AE32  <- h2o.deepfeatures(models[["AE32"]], data)
  f_AE768 <- h2o.deepfeatures(models[["AE768"]], data)
  f_AE128 <- h2o.deepfeatures(models[["AE128"]], f_AE768)
  h2o.cbind(f_AE200, f_AE32, f_AE128, data[, columns])
}
</code></pre>
Notice how, if you don’t include the original 768 pixel columns in <code>columns</code>, the auto-encoded features will effectively <i>replace</i> the raw pixel data. (This was my intention, but you don’t have to do that.)<br />
<h3 id="usage-example">
Usage Example</h3>
I’ll assume you have 785 columns, where the pixel data is in columns 1 to 784, and the answer, 0 to 9, is in column 785. If you have the book, you will be familiar with this convention:<br />
<pre><code>train <- #...50K of training data
valid <- #...10K of validation data
test <- #...10K of test data
x <- 1:784
y <- 785
</code></pre>
To use them I then write:<br />
<pre><code>models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, y)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, y)
test_ae <- generate_from_MNIST_autoencoders(models, test, y)
m <- h2o.deeplearning(1:360, 361, train_ae, validation_frame = valid_ae)
</code></pre>
Here I train a deep learning model with all defaults, but it could be a more complex deep learning model, or it could be a random forest, GBM or any other algorithm supported by H2O.<br />
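The <code>1:360</code> and <code>361</code> in that call are just column bookkeeping, from the widths of the three auto-encoder feature frames (simple arithmetic, no H2O needed):<br />

```r
# Where 1:360 and 361 come from:
n_features <- 200 + 32 + 128  # deep features from AE200, AE32 and AE128
n_features                    # 360, so columns 1:360 are the inputs...
n_features + 1                # ...and column 361 is the answer, y
```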
When using it for predictions, remember to use <code>test_ae</code>, not <code>test</code> (and similarly, in production, any future data has to be put through all four auto-encoder models). So following on from the above, you could evaluate it with:<br />
<pre><code>h2o.performance(m, test_ae)
</code></pre>
<h3 id="usage-extended-data">
Usage: extended data</h3>
If you are lucky enough to have read the book, you will know I added 113 columns of “extended information” to the MNIST data. Though I got rid of the pixels, I chose to keep the extended columns alongside the auto-encoder generated data.<br />
This is how the above code looks if you are using the extended MNIST data:<br />
<pre><code>x <- 114:897
columns <- c(1:113,898)
models <- create_MNIST_autoencoders(train, x)
train_ae <- generate_from_MNIST_autoencoders(models, train, columns)
valid_ae <- generate_from_MNIST_autoencoders(models, valid, columns)
test_ae <- generate_from_MNIST_autoencoders(models, test, columns)
m <- h2o.deeplearning(1:473, 474, train_ae, validation_frame = valid_ae)
</code></pre>
<h3 id="summary">
Summary</h3>
Informally, I can tell you that a deep learning model built on 473 auto-encoded/extended columns was significantly better than one built on 897 pixel/extended columns. However, I also increased the amount of training data at the same time (see <a href="http://darrendev.blogspot.com/2016/10/applying-rs-imager-library-to-mnist.html">http://darrendev.blogspot.com/2016/10/applying-rs-imager-library-to-mnist.html</a> ), so I cannot tell you the relative contributions of those two changes.<br />
But, I was pleased with the results. And the “educated guess” approach of choosing three distinct auto-encoder models and combining their outputs also seemed to work well. It might be that just one of the auto-encoder models is carrying all the useful information, and the others could be dropped? That is a good experiment to do (let us know if you do it!), but my hunch is that the “ensemble” approach is what allows the educated guess approach to work.
<h2 id="applying-rs-imager-library-to-mnist-digits">
Applying R’s imager library to MNIST digits</h2>
<h3 id="introduction">
Introduction</h3>
For machine learning, when it comes to training data, the more the better! Well, there are a whole bunch of disclaimers on that statement, but as long as the data is representative of the type of data the machine learning model will be used on in production, doubling the amount of unique training data will be more useful than doubling the number of epochs (if deep learning) or trees (if random forest, GBM, etc.).<br />
I’m going to concentrate in this article on the MNIST data set, and look at how to use R’s imager library to increase the number of training samples. I take a deep look at MNIST in my new book, <a href="http://shop.oreilly.com/product/0636920053170.do">Practical Machine Learning with H2O</a>, which (not surprisingly) I highly recommend. But, briefly, the MNIST data is handwritten digits, and the challenge is to identify which of the 10 digits, 0 to 9, each one is.<br />
There are 60,000 training samples, plus 10,000 test samples (and everyone uses the same test set - so watch out for inadvertent over-fitting in published papers). The samples are 28x28 pixels, each a 0 to 255 greyscale value. I usually split the 60,000 samples into 50K for training and 10K for validation (to make sure I am not over-fitting on the test data).<br />
<h3 id="imager-and-parallel">
imager and parallel</h3>
I used this R library for dealing with images: <a href="http://dahtah.github.io/imager/">http://dahtah.github.io/imager/</a> which is based on a C++ library called CImg. Most function calls are just wrappers around the C++ code, which means they are fairly quick. It is well-documented with a <a href="http://dahtah.github.io/imager/gettingstarted.html">good starter tutorial</a>.<br />
I used version 0.20 for all my development. I have just seen that 0.30 is now out, and in particular offers native parallelization. This is great! Out of scope for this article, but I used imager in conjunction with R’s parallel functions, and found the latter quite clunky, with time spent copying data structures limiting scalability. On the other hand, <a href="http://dahtah.github.io/imager/parallel.html">the docs</a> say these new parallel options work best on large images, and the 28x28 MNIST images are certainly not that. So maybe I am still stuck using <code>parApply()</code> and friends.<br />
<h3 id="the-approach">
The Approach</h3>
In the 20,000 unseen samples (10K valid, 10K test), there are often examples of bad handwriting that we don’t get to see in our 50,000 training samples. Therefore I am most interested in generating <i>bad</i> handwriting samples, not nice neat ones.<br />
I spent a lot of time experimenting with what imager can produce, and settled on generating these effects, each with a random element:<br />
<ul>
<li>rotate</li>
<li>warp (make it “scruffier”)</li>
<li>shift (move it 1 pixel up, down, left or right)</li>
<li>bold (make it fatter - my code)</li>
<li>dilate (make it fatter - cimg code)</li>
<li>erode (make it thinner)</li>
<li>erodedilate (one or the other)</li>
<li>scratches (add lines)</li>
<li>blotches (remove blobs)</li>
</ul>
I also defined “all” and “all2” which combined most of them.<br />
In the full code I create the image in an imager (<code>cimg</code>) object called <code>im</code>, then copy it to <code>im2</code>. Each subsequent operation is performed on <code>im2</code>. <code>im</code> is left unchanged, but can be referred to for the initial state.<br />
<h3 id="rotate">
Rotate</h3>
The code to rotate comes in two parts. Here is the first part:<br />
<pre><code>needSharpen <- FALSE
angle <- rnorm(1, 0, 8)
if(angle < -1.0 || angle > 1.0){
  im2 <- imrotate(im2, angle, interpolation = 2)
  nPix <- (width(im2) - width(im)) / 2
  im2 <- crop.borders(im2, nPix = nPix)
  needSharpen <- TRUE
}
</code></pre>
The use of <code>rnorm(sd=8)</code> means that 68% of the time the angle will be within +/-8°, and only 5% of the time beyond +/-16°. If my goal were simply more training samples, I’d perhaps have used a smaller <code>sd</code>, and/or clipped to a maximum rotation of 10°. But, as mentioned earlier, I wanted more scruffy handwriting. The <code>if()</code> block is a CPU optimization - if the rotation is less than 1°, don’t bother doing anything.<br />
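Those percentages are easy to check with a quick simulation (my own sanity check, base R only):<br />

```r
# Sanity-checking the spread of rnorm(1, 0, 8) rotation angles:
set.seed(42)
angles <- rnorm(100000, mean = 0, sd = 8)
mean(abs(angles) <= 8)   # about 0.68: within one sd, i.e. +/-8 degrees
mean(abs(angles) > 16)   # about 0.05: beyond two sds, i.e. +/-16 degrees
```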
The <code>imrotate()</code> command takes the current <code>im2</code> and replaces it with one that is rotated. This creates a larger image. To see what is going on, try running this (self-contained) script (see the inline comments for what is happening):<br />
<pre><code>library(imager)
# Make 28x28 "mid-grey" square
im <- as.cimg(rep(128, 28*28), x = 28, y = 28)
#Prepare to plot side-by-side
par(mfrow = c(1,2))
#Show initial square 28x28
plot(im)
#Show rotated square, 34x34
plot(imrotate(im, angle = 16))
</code></pre>
The output is like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWh7QqnWvWXoiHNkMMt6_spZXR4984Gk5IK_sq6BCK1hHMbl4AKbAKsIgeIc7Aztoot6A6sQmzW20MR-QikvGQliGv5nORZeGeTR5yObYQv-3BzElGQ3S5teFanIwb6q4xyeXGeIZ4L9aN/s1600/rotating_square_by_16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWh7QqnWvWXoiHNkMMt6_spZXR4984Gk5IK_sq6BCK1hHMbl4AKbAKsIgeIc7Aztoot6A6sQmzW20MR-QikvGQliGv5nORZeGeTR5yObYQv-3BzElGQ3S5teFanIwb6q4xyeXGeIZ4L9aN/s320/rotating_square_by_16.png" width="320" /></a></div>
You can see the image has become larger, to contain the rotated version. (That image also shows how imager’s plot command will scale the colours based on the range, and that my choice of 128 could have been any non-zero number. When there is only a single value (128), it chooses a grey. After rotating we have 0 for the background, 128 for the square, so it does 0 as black, 128 as white.)<br />
For rotating MNIST digits:<br />
<ul>
<li>I want to keep the 28x28 size</li>
<li>All the interesting content is in the middle, so clipping is fine.</li>
</ul>
So I call <code>crop.borders()</code>, which takes an argument <code>nPix</code> saying how many pixels to remove on <i>each</i> side. If it has grown from 28 to 34 pixels square, nPix will be 3.<br />
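Following the rotated-square example above (28x28 grows to 34x34 at 16°), the arithmetic is:<br />

```r
# nPix, as computed in the rotate code, for the 28 -> 34 example:
width_im  <- 28   # width(im): the original
width_im2 <- 34   # width(im2) after the rotation
nPix <- (width_im2 - width_im) / 2
nPix  # 3: crop 3 pixels from each side to restore 28x28
```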
<h3 id="feeling-a-bit-vague">
Feeling A Bit Vague…</h3>
Here is what one of the MNIST digits looks like rotated 30° at a time, 11 times.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCFzIMxWDma3tKghZBJeGou_CKU0pqT2B3HBjmWTsYaFBqfp331ldOy7mghsp0WJodiZW5VtKDKoQx01YzuxXv1hsDE2WME8vae9yh2xtenaFldzT9oQUZZW42rhNb0jO2WpxllheCXCvm/s1600/mnist12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCFzIMxWDma3tKghZBJeGou_CKU0pqT2B3HBjmWTsYaFBqfp331ldOy7mghsp0WJodiZW5VtKDKoQx01YzuxXv1hsDE2WME8vae9yh2xtenaFldzT9oQUZZW42rhNb0jO2WpxllheCXCvm/s320/mnist12.png" width="320" /></a></div>
<br />
<br />
<br />
In a perfect world, 12 rotations would give you exactly the image you started with. But you can see the effect of each rotation is to blur it slightly. If we did another lap, even your clever mammalian brain would no longer recognize it as a 4.<br />
The author of the imager library provided a <code>match.hist()</code> function (see it, and the surrounding discussion, here: <a href="https://github.com/dahtah/imager/issues/17">https://github.com/dahtah/imager/issues/17</a> ) which does a good (but not perfect) job. Here are the histograms of the image before rotation, after rotation, and then after <code>match.hist</code>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbUayhAlzXFjWiJRYSjcueT4Y9BAoJhSxjo6bet36PWWK0HQriuVVgzPygenvevG0Ij4cqtPGyCk8asvT-mmiFsPY1wsjRkebp6HLjrk5DKNnsBpHMeHzD3ZUp3Y09u3O7jWptV_fjLa1O/s1600/mnist_hist.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="164" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbUayhAlzXFjWiJRYSjcueT4Y9BAoJhSxjo6bet36PWWK0HQriuVVgzPygenvevG0Ij4cqtPGyCk8asvT-mmiFsPY1wsjRkebp6HLjrk5DKNnsBpHMeHzD3ZUp3Y09u3O7jWptV_fjLa1O/s320/mnist_hist.png" width="320" /></a></div>
<br />
<br />
<br />
You can judge the success from looking at the image on the right, or by seeing how the bars on the rightmost histogram match those of the leftmost one. (Yes, even though the bumps are very small, their height, and where they are, really matter!)<br />
<h3 id="you-better-sharpen-up">
You better sharpen up…</h3>
You may have noticed the earlier rotate code set <code>needSharpen</code> to true. That is used by the following code. Some of the time it uses the imager library’s <code>imsharpen()</code>, some of the time <code>match.hist()</code>, and some of the time a hack I wrote to make dim pixels dimmer and bright pixels brighter.<br />
<pre><code>if(needSharpen){
  if(runif(1) < 0.8){
    im2 <- imsharpen(im2, amplitude = 55)
  }
  if(runif(1) < 0.3){
    im2 <- ifelse(im2 < 128, im2 - 16, im2)
    im2 <- ifelse(im2 < 0, 0, im2)
    im2 <- ifelse(im2 > 200, im2 + 8, im2)
    im2 <- ifelse(im2 > 150, im2 + 8, im2)
    im2 <- ifelse(im2 > 100, im2 + 8, im2)
    im2 <- ifelse(im2 > 255, 255, im2)
  }else{
    im2 <- match.hist(im2, im)
  }
}
</code></pre>
<h3 id="the-others">
The Others</h3>
The other image modifications, listed earlier, use <code>imwarp()</code>, <code>imshift()</code>, <code>pmax()</code> with <code>imshift()</code> (for a bold effect), <code>dilate_square()</code>, and <code>erode_square()</code>. The blotches and scratches were done by putting random noise on an image, then using <code>pmax()</code> or <code>pmin()</code> to combine them.<br />
If there is interest I can write another article going into the details.<br />
<h3 id="timings">
Timings</h3>
On a 2.8GHz single-core, I recorded these timings to process 60,000 28x28 images. (It was a 36-core 8xlarge EC2 machine, but my R script, and imager (at the time), only used one thread.)<br />
<ul>
<li>304s to run “bold”.</li>
<li>296s to run “shift”</li>
<li>417s for warp</li>
<li>447s for rotate</li>
<li>517s to 539s for all and all2</li>
</ul>
warp and rotate need to do the sharpen step, which is why they take longer.<br />
<h3 id="summary">
Summary</h3>
I made 20 files, so over 95% of my training data was generated. As you will discover if you read <a href="http://shop.oreilly.com/product/0636920053170.do">the book</a>, this generated data gave a very useful boost in model strength, though of course dramatically increased learning-time due to having 1.2 million training rows instead of 50,000. An interesting property was that it found the sampled data <i>harder</i> to learn: I got lower error rates on the unseen valid and test data sets than on the <i>seen</i> training data. This is a consequence of my deliberate decision to bias towards noisy data and scruffy handwriting.<br />
Generating additional training data is a good way to prevent over-fitting, but generating data that is still <i>representative</i> is always a challenge. Normally you’d check mean, sd, the distribution, etc. to see how you’ve done, and usually your only technique is to jitter - add a bit of random noise. With images you have a rich set of geometric transformations to choose from, and you often use your eyes to see how it has done. Though, as we saw from the image histograms, there are some objective techniques available too.
<h2 id="categorization-ensemble-in-h2o">
Categorization Ensemble In H2O</h2>
An ensemble is the term for using two or more machine learning models together, as a team. There are various types (normally categorized in the literature as boosting, bagging and stacking). In this article I want to look at making teams of high-quality categorizing models. (The approach handles both multinomial and binomial categorization.)<br />
I will use <a href="http://www.h2o.ai/">H2O</a>. Any machine-learning framework that returns probabilities for each of the possible categories can be used, but I will assume a little bit of familiarity with H2O… and therefore I highly recommend my book: <a href="http://shop.oreilly.com/product/0636920053170.do">Practical Machine Learning with H2O</a>! By the way, I will use R here, though any language supported by H2O can be used.<br />
<h3 id="getting-the-probabilities">
Getting The Probabilities</h3>
Here is the core function (in R). It takes a list of models ¹, and runs predict on each of them. The rest of the code is a bit of manipulation to return a 3D array of just the probabilities (<code>h2o.predict()</code> returns its chosen answer in the “predict” column, so the <code>setdiff()</code> is to get rid of it).<br />
<pre><code>getPredictionProbabilities <- function(models, data){
  sapply(models, function(m){
    p <- h2o.predict(m, data)
    as.matrix(p[, setdiff(colnames(p), "predict")])
  }, simplify = "array")
}
</code></pre>
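To see just the <code>setdiff()</code> step in isolation, here it is on plain column names (base R only, no H2O cluster needed; the names mimic what <code>h2o.predict()</code> returns for a 3-category problem):<br />

```r
# h2o.predict() returns "predict" plus one probability column per category;
# setdiff() drops the "predict" column, keeping just the probabilities.
cn <- c("predict", "p1", "p2", "p3")
setdiff(cn, "predict")   # "p1" "p2" "p3"
```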
<h3 id="using-them">
Using Them</h3>
<code>getPredictionProbabilities()</code> returns a 3D array:<br />
<ul>
<li>First dimension is <code>nrows(data)</code></li>
<li>Second dimension is the number of possible outcomes (labelled “p1”, “p2”, …)</li>
<li>Third dimension is <code>length(models)</code></li>
<li>Value is 0.0 to 1.0.</li>
</ul>
To make the next explanation a bit less abstract, let’s assume it is MNIST data (where the goal is to look at raw pixels and guess which of 10 digits it is), and that we have made three models, and that we are trying it on 10,000 test samples. Therefore we have a 10000 x 10 x 3 array. Some example slices of it:<br />
<ul>
<li><code>predictions[1,,]</code> is the predictions for the first sample row; returning a 10 x 3 matrix.</li>
<li><code>predictions[1:5, "p3",]</code> is the probability of each of the first 5 rows being the digit “2”. So, five rows and three columns, one per model. (It is “2”, because “p1” is “0”, “p2” is “1”, etc.).</li>
<li><code>predictions[1:5,,1]</code> are the probabilities of the first model on the first 5 test samples, for each of the 10 categories.</li>
</ul>
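You can try those slices on a dummy array of the same shape, without H2O (the "p1" to "p10" names are set by hand here, mimicking what <code>getPredictionProbabilities()</code> produces):<br />

```r
# A fake 10000 x 10 x 3 probability array, with named category columns:
p <- array(runif(10000 * 10 * 3), dim = c(10000, 10, 3),
           dimnames = list(NULL, paste0("p", 1:10), NULL))
dim(p[1, , ])        # 10 3  -- first sample: categories x models
dim(p[1:5, "p3", ])  # 5 3   -- five samples, one category, three models
dim(p[1:5, , 1])     # 5 10  -- first model: five samples x ten categories
```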
When I just want the predictions of the team, and don’t care about the details, I use this wrapper function (explained in the next section):<br />
<pre><code>predictTeam <- function(models, data){
  probabilities <- getPredictionProbabilities(models, data)
  apply(
    probabilities, 1,
    function(m) which.max(apply(m, 1, sum))
  )
}
</code></pre>
<h3 id="3d-arrays-in-r">
3D Arrays In R</h3>
A quick diversion into dealing with multi-dimensional arrays in R. <code>apply(probabilities, 1, FUN)</code> will apply FUN to each row in the first dimension. I.e. FUN will be called 10,000 times, and will be given a 10x3 matrix (<code>m</code> in the above code). I then use <code>apply()</code> on the first dimension of <code>m</code>, calling <code>sum</code>. <code>sum()</code> is called 10 times (once per category) and each time is given a vector of length 3, containing one probability per model.<br />
The last piece of the puzzle is <code>which.max()</code>, which tells us which category has the biggest combined confidence from all models. It returns 1 if category 1 (i.e. “0” if MNIST), 2 if category 2 (“1” if MNIST), etc. This gets returned by the function, and therefore gets returned by the outer <code>apply()</code>, and therefore is what <code>predictTeam()</code> returns.<br />
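The whole chain is easy to verify on a tiny hand-made array - 2 samples, 3 categories, 2 models - small enough to check by eye (base R only):<br />

```r
# probabilities[sample, category, model]; array() fills the sample
# dimension fastest, so values are paired (sample1, sample2) per category.
probabilities <- array(
  c(0.1, 0.5,  0.7, 0.3,  0.2, 0.2,   # model 1
    0.2, 0.1,  0.6, 0.2,  0.2, 0.7),  # model 2
  dim = c(2, 3, 2))
# Sum across models per category, then pick the biggest, per sample:
team <- apply(probabilities, 1, function(m) which.max(apply(m, 1, sum)))
team  # 2 3: sample 1 sums to (0.3, 1.3, 0.4), sample 2 to (0.6, 0.5, 0.9)
```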
<b>Aside:</b> Using <code>sum()</code> is equivalent to <code>mean()</code>, but just very slightly faster. E.g. 30 > 27 > 24, and if you divide through by 3, then 10 > 9 > 8.<br />
Your answer column in your training data is a factor, so to convert that answer into the string it represents, use <code>levels()</code>. E.g. if <code>f</code> is the factor representing your answer, and <code>p</code> is what <code>predictTeam()</code> returned, then <code>levels(f)[p]</code> will get the string. (With MNIST, there is a short-cut: <code>p-1</code> will do it.)<br />
<h3 id="counting-correctness">
Counting Correctness</h3>
A common question to ask of an ensemble is how many test samples every member of the team got right, and how many none of the team got right. I’ll look at that in 4 steps. First up is to ask what each model’s best answer is.³<br />
<pre><code>predictByModel <- apply(probabilities, 1,
  function(sample) apply(sample, 2, which.max)
)
</code></pre>
Assuming the earlier MNIST example, this gives a 3 x 10000 int matrix (where the values are 1 to 10). <br />
Next, we need a vector of the correct answers, and they must be the category index, not the actual value. (In the case of MNIST data it is simply adding 1, so “0” becomes 1, “1” becomes 2, etc.)<br />
<pre><code>eachModelsCorrectness <- apply(predictByModel, 1,
  function(modelPredictions) modelPredictions == correctAnswersTest
)
</code></pre>
This returns a 10000 x 3 logical (boolean) matrix. It tells us if each model got each answer correct. The next step returns a 10000 element vector, which will have 0 if they all got it wrong, 1 if only one member of the team got the correct answer, 2 if two of them, and 3 if all three members of the team got it right.<br />
<pre><code>cmc <- apply(eachModelsCorrectness, 1, sum)
</code></pre>
(<code>cmc</code> for count of model correctness.) <code>table(cmc)</code> will give you those four numbers. E.g. you might see:<br />
<pre><code> 0 1 2 3
67 59 105 9769
</code></pre>
Indicating that there were 67 that none of them got right, 59 samples that only one of the three models managed, 105 that just one model got wrong, and 9769 that all the models could do.<br />
I sometimes find <code>cumsum(table(cmc))</code> to be more useful:<br />
<pre><code> 0 1 2 3
67 126 231 10000
</code></pre>
Either way, if the first number is high, you need stronger model(s). Better team members. If it is low, but the ensemble is performing poorly (i.e. no better than the best individual model), then you need more variety in the models.<br />
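The counting logic is easy to check on a tiny fake matrix (base R only; the TRUE/FALSE values are invented for illustration):<br />

```r
# Rows are test samples, columns are the three models; TRUE = correct.
eachModelsCorrectness <- matrix(
  c(TRUE,  TRUE,  FALSE,   # sample 1: two models right
    TRUE,  TRUE,  TRUE,    # sample 2: all three right
    FALSE, FALSE, FALSE),  # sample 3: none right
  ncol = 3, byrow = TRUE)
cmc <- apply(eachModelsCorrectness, 1, sum)
cmc         # 2 3 0
table(cmc)  # one sample each in buckets 0, 2 and 3
```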
<h3 id="further-ideas">
Further Ideas</h3>
By slicing and dicing the 3D probability array, you can do different things. You might want to poke into what those difficult 67 samples were, that none of your models can get. Another idea, which I will deal in a future blog post, is how you can use the probability to indicate those with low confidence that need more investigation work. (You can use this with an ensemble, or just with a single model.)<br />
<h3 id="performance">
Performance</h3>
In my experience, this kind of team will almost always give better performance than any individual model in the team. It gives the biggest jump in strength when:<br />
<ul>
<li>The models are very distinct. </li>
<li>The models are fairly close in strength.</li>
<li>Just a few models. Making 4 or 5 models, evaluating them on the valid data, and using the strongest 3 for the ensemble works reasonably well. However, if you manage to make <i>similar-strength</i> and <i>very distinct</i> models, the more the merrier.</li>
</ul>
For instance, three deep learning models, all built with the same parameters, each getting about 2% error individually, might manage 1.8% error as a team. But two deep learning models, one random forest, one GBM, all getting 2% error, might give 1.5% or even lower, when used together. Conversely, two very similar deep learning models that have 2% error combined with a GBM with 3% error and a random forest getting 4% error might do poorly, possibly even worse than your single best model.<br />
<h3 id="summary">
Summary</h3>
This approach to making an ensemble for categorization problems is straightforward and usually very effective. Because it works with models of different types it at first glance seems closer to stacking, but really it is more like the bagging of random forest (except using probabilities, instead of mode⁴ as in random forest).<br />
<h3 id="footnotes">
Footnotes</h3>
[1]: All the models in <code>models</code> must be modelling the same thing, i.e. they must each have been trained on data with the exact same columns.² <br />
Normally each model will have been trained on the exact same data, but the precise phrasing is deliberate: interesting ensembles can be built with models trained on different subsets of your data. (By the way, subsets and ensemble are the two fundamental ideas behind the random forest algorithm.)<br />
[2]: You can take this even further and only require they are all trained with the same response variable, not even the same columns. However, you’ll need to adapt <code>getPredictionProbabilities()</code> in that case, because each call to <code>h2o.predict()</code> will then need to be given a different <code>data</code> object.<br />
[3]: I do it this way because I already had <code>probabilities</code> made. But if it was all that you were interested in, this was the long way to get it. The direct way was to do the exact opposite of <code>getPredictionProbabilities()</code>: keep the “predict” column, and get rid of the other columns!<br />
[4]: I did experiment with simple one vote per model (aka mode), instead of summing probabilities, and generally got worse results. But it is worth bearing that kind of ensemble in mind. If you do more rigorous experiments comparing the two methods, I’d be very interested to hear what kind of problems, if any, that voting (mode) is superior on.
<h2 id="hacking-the-h2o-r-api">
Hacking the H2O R API</h2>
<a href="http://www.h2o.ai/">H2O</a> comes with a comprehensive R API, but sometimes you want to do something that it does not (yet) support. This article will show how to add a couple of functions for fetching and saving models. Beyond giving you these functions, I want to show how to approach hacking on the API, including using internals. (Code in this article has been tested on the 3.8.2.x, 3.8.3.x and 3.10.0.x releases.)<br />
If you want to learn more about H2O, and machine learning, may I recommend my book: Practical Machine Learning with H2O, published by <a href="http://shop.oreilly.com/product/0636920053170.do">O’Reilly</a>? (It is “coming really soon” as I write this!) And, my company, QQ Trend, are available for helping you with all your machine learning needs, everything from a few hours of H2O-related consulting to helping you build massive models to solve the mysteries of life. (Contact me at dc at qqtrend.com )<br />
<h3 id="saving-it-all-for-another-day">
Saving it all for another day</h3>
Say you have 30 models stored on H2O, and you want to save them all. The scenario might be that you want to stop the cluster overnight, but want to use your current set of models as the starting point for better models tomorrow. Or in an ensemble. Or something. At the time of writing H2O does not offer this functionality, in any of its various APIs and front-ends. So I want to write an <code>h2o.saveAllModels()</code> function.<br />
Breaking that down a bit, I’m going to need these two functions:<br />
<ul>
<li>get a list of all models</li>
<li>save models, given a list of models or model IDs.</li>
</ul>
Let’s start with the “or model IDs” requirement. H2O’s API offers <code>h2o.saveModel()</code>, but that only takes a model object, so how can we use it when all we have is an ID?<br />
<h3 id="exposing-the-guts">
Exposing The Guts…</h3>
I am a huge fan of open source. H2O is open source. R is open source. But there is open, and then there is <i>open</i>, and one of the things I like about R is if you want to see how something was implemented, you just type the function name.<br />
Type <code>h2o.saveModel</code> (<i>without</i> parentheses) in an R session (where you’ve already done <code>library(h2o)</code>, of course) and you will see the source code. Here it is; notice how the only part of <code>object</code> that it uses is the model id - that is a stroke of luck, because it means that (under the covers) the API works just the way we needed it to!<br />
<pre class="prettyprint"><code class="language-R hljs php"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-params">(object, path = <span class="hljs-string">""</span>, force = FALSE)</span>
{</span>
<span class="hljs-comment">#... Error-checking elided ...</span>
path <- file.path(path, object@model_id)
res <- .h2o.__remoteSend(
paste0(<span class="hljs-string">"Models.bin/"</span>, object@model_id),
dir = path, force = force, h2oRestApiVersion = <span class="hljs-number">99</span>)
res<span class="hljs-variable">$dir</span>
}</code></pre>
If you are new to H2O, you need to understand that all the hard work is done in a Java application (which can equally well be running on your machine or on a cluster on the other side of the world), and the clients (whether R, Python or Flow’s CoffeeScript) are all using the same REST API to send commands to it. So it should be no surprise to see <code>.h2o.__remoteSend</code> there; it is making a call to the “Models.bin” REST endpoint.<br />
<code>.h2o.__remoteSend</code> is a private function in the R API. That means you cannot call it directly. Luckily, R doesn’t get in our way like Java or C++ would. We can use the package name followed by the <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/ns-dblcolon.html">triple colon operator</a> to run it from normal user code: <code>h2o:::.h2o.__remoteSend(...)</code><br />
WARNING: Remember that hacking with the internals of an API is not future-proof. An upgrade might break everything you’ve written. …ooh, look at us, adrenaline flowing, living the life of danger. (Note to self: need to get out more.)<br />
<h3 id="lets-write-some-code">
Let’s Write Some Code!</h3>
We now have enough to make the <code>saveModels()</code> function:<br />
<pre class="prettyprint"><code class="language-R hljs php">h2o.saveModels <- <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(models, path, force = FALSE)</span>{</span>
sapply(models, <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(id)</span>{</span>
<span class="hljs-keyword">if</span>(is.object(id))id <- id@model_id
res <- h2o:::.h2o.__remoteSend(
paste0(<span class="hljs-string">"Models.bin/"</span>, id),
dir = file.path(path, id),
force = force, h2oRestApiVersion = <span class="hljs-number">99</span>)
res<span class="hljs-variable">$dir</span>
}, <span class="hljs-keyword">USE</span>.<span class="hljs-title">NAMES</span>=<span class="hljs-title">F</span>)
}</code></pre>
The <code>if(is.object(id))id = id@model_id</code> line is what allows it to work with a mix of model id strings or model objects. The use of <code>sapply(..., USE.NAMES=F)</code> means it returns a character vector, containing the full path of each model file that was saved. Use it as follows:<br />
<pre class="prettyprint"><code class="language-R hljs php">h2o.saveModels(
c(<span class="hljs-string">"DL:defaults"</span>, <span class="hljs-string">"DL:200x200-500"</span>, <span class="hljs-string">"RF:100-40"</span>),
<span class="hljs-string">"/path/to/h2o_models/todays_hard_work"</span>,
force = <span class="hljs-keyword">TRUE</span>
)</code></pre>
and it will output:<br />
<pre class="prettyprint"><code class=" hljs json">[<span class="hljs-number">1</span>] <span class="hljs-string">"/path/to/h2o_models/todays_hard_work/DL:defaults"</span>
[<span class="hljs-number">2</span>] <span class="hljs-string">"/path/to/h2o_models/todays_hard_work/DL:200x200-500"</span>
[<span class="hljs-number">3</span>] <span class="hljs-string">"/path/to/h2o_models/todays_hard_work/RF:100-40"</span></code></pre>
(By the way, there is one irritating problem with this function: if any failure occurs, such as a file already existing, or a model ID not found, it stops with a long error message, and doesn’t attempt to save the other models. I’ll leave improving that to you. Hint: consider wrapping the <code>h2o:::.h2o.__remoteSend()</code> call with <code>?tryCatch</code>.)<br />
<h3 id="what-models-have-i-made">
What Models Have I Made?</h3>
Next, how to get a list of all models? The Flow interface has <code>getModels</code>, and the REST API has <code>GET /3/Models</code>, but the R and Python APIs do not; the closest they have is <code>h2o.ls()</code>, which returns the names of all data frames, models, prediction results, etc., with no (reliable) way to tell them apart. But <code>GET /3/Models</code> is not ideal either, because it returns everything about each model, whereas all we want is the model id. Having trawled through the H2O source, it appears we are stuck with this. (BTW, if you fancied submitting a patch to add a <code>GET /3/ModelIds/</code> command, this looks like <a href="https://github.com/h2oai/h2o-3/blob/739ec856995d066ccecb8eb605ca9ef5a9d3baa6/h2o-core/src/main/java/water/api/ModelsHandler.java#L23">a good starting point</a>; i.e. what we want is just the first half of that function.) It is unlikely to matter unless you have 1000s of models, or a slow connection to your remote H2O cluster.<br />
Start the same way as before by typing <code>h2o.getModel</code> (no parentheses) into R. Ooh! That function is long, and it is doing an awful lot. If you want a generally useful <code>h2o.getModels()</code> function I leave that as another of those exercises for the reader. Instead I’m going to call my function <br />
<code>h2o.getAllModelIds()</code>, and limit the scope to just that, which makes the code <i>much</i> simpler. (Did you notice the pro tip there: just by calling my function “getAllModelIds” instead of “getModels” I saved myself hours of work. You see kids, naming really does matter.)<br />
Here it is:<br />
<pre class="prettyprint"><code class="language-R hljs lua">h2o.getAllModelIds <- <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">()</span></span>{
d <- h2o:::.h2o.__remoteSend(method = <span class="hljs-string">"GET"</span>, <span class="hljs-string">"Models"</span>)
sapply(d<span class="hljs-string">[[3]]</span>, <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(x)</span></span> x$model_id$name)
}</code></pre>
Line 1 says get all the models. Line 2 says filter just the model id out, and throw the rest of it away. (Yeah, that <code>d[[3]]</code> bit is particularly fragile.)<br />
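If you ever call that REST endpoint from outside R, the same extraction can be done by key name rather than by position, which is a little less fragile than <code>d[[3]]</code>. A hypothetical JavaScript sketch with a mocked response; the <code>models</code> key and the <code>model_id.name</code> nesting are assumptions inferred from the R code above, not checked against every H2O version:

```javascript
// Hypothetical sketch: pull model ids out of a GET /3/Models response
// by key name. The response shape is an assumption, not verified
// against every H2O release.
function extractModelIds(response) {
  return response.models.map(function (m) {
    return m.model_id.name;
  });
}

// Mocked response, for illustration only:
const mock = {
  models: [
    { model_id: { name: "DL:defaults" } },
    { model_id: { name: "RF:100-40" } }
  ]
};

console.log(extractModelIds(mock)); // [ 'DL:defaults', 'RF:100-40' ]
```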
Anyway, the final step is to simply put our two new functions together:<br />
<pre class="prettyprint"><code class="language-R hljs javascript">h2o.saveAllModels <- <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(path)</span>{</span>
h2o.saveModels(h2o.getAllModelIds(), path)
}</code></pre>
Use it, as shown here:<br />
<pre class="prettyprint"><code class="language-R hljs haskell"><span class="hljs-title">fnames</span> <- h2o.saveAllModels(<span class="hljs-string">"/path/to/todays_hard_work"</span>)</code></pre>
On one test, <code>length(fnames)</code> returned 154 (I’d been busy), and those 154 models totalled 150MB. However some models (e.g. random forest) are bigger than others, so make sure you have plenty of disk space to hand, just in case. Speaking of which, <code>h2o.saveAllModels()</code> should work equally well with S3 or HDFS destinations.<br />
<h3 id="the-day-after-the-night-before">
The day after the night before…</h3>
I could’ve done a <code>dput(fnames)</code> after running <code>h2o.saveAllModels()</code>, and saved the output somewhere. But as I’m not putting anything else in that particular directory, I can get the list again with <code>Sys.glob()</code>. So, I might start my next day’s session as follows.<br />
<pre class="prettyprint"><code class="language-R hljs avrasm"> library(h2o)
h2o<span class="hljs-preprocessor">.init</span>(nthreads = -<span class="hljs-number">1</span>)
fnames <- Sys<span class="hljs-preprocessor">.glob</span>(<span class="hljs-string">"/path/to/todays_hard_work/*"</span>)
models <- lapply(fnames, h2o<span class="hljs-preprocessor">.loadModel</span>)</code></pre>
Voila! <code>models</code> will be an R list of the H2O model objects.<br />
<h3 id="clusters">
Clusters</h3>
If you are working on a remote cluster, with more than one node, there is a little twist to be aware of. <code>h2o.saveModel()</code> (and therefore our <code>h2o.saveModels()</code> extension) will create files on whichever node of the cluster your client is connected to. (At least, as of 3.10.0.7; I suspect this behaviour might change in future.)<br />
But <code>h2o.loadModel()</code> will look for it on the file system of node 1 of the cluster. And node 1 is not (necessarily) the first node you listed in your flatfile. Instead it is the one listed first in <code>h2o.clusterStatus()</code>.<br />
This won’t concern you if you saved to HDFS or S3.<br />
<h3 id="bonus">
Bonus</h3>
What I actually use to load model files is shown below. It will get all the files from sub-directories. (Look at the <code>list.files()</code> documentation for how you can use <code>pattern</code> to choose just some files to be loaded in.)<br />
<pre class="prettyprint"><code class="language-R hljs r">h2o.loadModelsDirectory <- function(path, pattern=NULL, recursive=T, verbose=F){
  fnames <- list.files(path, pattern = pattern, recursive = recursive, full.names = T, include.dirs = F)
  lapply(fnames, function(f){
    if(verbose)print(f)  #Show each file as it is loaded
    h2o.loadModel(f)
  })
}</code></pre>
(This code still has one problem: if I’m loading in work from both yesterday and two days ago, and I had made a model called “DF:default” on both days, I lose one of them. Sorting that out is my final exercise for the reader - please post your answer, or a link to your answer, in the comments!)Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-80142278496452612592016-08-09T14:34:00.001-07:002016-10-17T13:18:14.279-07:00H2O Upgrade: Detailed Steps<h2 id="h2o-upgrade-detailed-steps">
H2O Upgrade: Detailed Steps</h2>
I wanted to upgrade both R and Python to the latest version of H2O (as of Aug 9th 2016). Here are the exact steps, and I think you will find them relevant even if you only need to update one or the other of those clients. Remember, if following this at a later date, to follow the spirit of it rather than copy-and-paste: all the version numbers will have changed. This was with Linux Mint, but it should apply equally well to all other Linux distros.<br />
Make sure you first close any R or Python clients that are using the H2O library; and separately shutdown H2O if it is still running after that.<br />
<h3 id="the-first-time">
The First Time</h3>
The below instructions are all for upgrading, which means I know I have all the dependencies in place. If this is your first H2O install, well I’d first recommend you buy my new book: Practical Machine Learning with H2O, <a href="http://shop.oreilly.com/product/0636920053170.do" target="_blank">published by O’Reilly.</a> (Coming really soon, as I type this! Let me know if you are interested, and I'll send the best discount code I can find.)<br />
As a quick guide, from R I recommend you use CRAN<br />
<pre><code>install.packages("h2o")
</code></pre>
and from Python I recommend you use pip:<br />
<pre><code>pip install h2o
</code></pre>
Both these approaches get all the dependencies for you; you may end up a version or two behind the very latest, but it won’t matter.<br />
<h3 id="the-download">
The Download</h3>
<pre><code>cd /usr/local/src/
wget http://download.h2o.ai/versions/h2o-3.10.0.3.zip
unzip h2o-3.10.0.3.zip
</code></pre>
It made a “h2o-3.10.0.3” directory, with python and R sub-directories.<br />
<h3 id="r">
R</h3>
I installed the h2o package as root the first time, so I will continue to do that, hence the <code>sudo</code>:<br />
<pre><code>cd /usr/local/src/h2o-3.10.0.3/R/
sudo R
</code></pre>
Then:<br />
<pre><code>remove.packages("h2o")
install.packages("h2o_3.10.0.3.tar.gz")
</code></pre>
Then ctrl-d to exit.<br />
<h3 id="python">
Python</h3>
<pre><code>cd /usr/local/src/h2o-3.10.0.3/python/
sudo pip uninstall h2o
sudo pip install -U h2o-3.10.0.3-py2.py3-none-any.whl
</code></pre>
(The -U means upgrade any dependencies; the first time I forgot it, and ended up with some very weird errors when trying to do anything in Python.)<br />
<h3 id="the-test">
The Test</h3>
I started RStudio, and ran:<br />
<pre><code>library(h2o)
h2o.init(nthreads=-1)
as.h2o(iris)
</code></pre>
I then started ipython and ran:<br />
<pre><code>import h2o,pandas
h2o.init()
iris = h2o.get_frame("iris")
print(iris)
</code></pre>
As well as making sure the data arrived, I’m also checking that the <code>h2o.init()</code> call in both cases reported the cluster version as “3.10.0.3”.<br />
<h3 id="aws-scripts">
AWS Scripts</h3>
If you use the AWS scripts (<a href="https://github.com/h2oai/h2o-3/tree/master/ec2">https://github.com/h2oai/h2o-3/tree/master/ec2</a>) and want to make sure EC2 instances start with exactly the same version as you have installed locally, the file to edit is h2o-cluster-download-h2o.sh. (If not using those scripts, just skip this section.)<br />
First find the <code>h2oBranch=</code> line and set it to “rel-turing” (notice the “g” on the end - there is also a version without the “g”!). Then comment out the two curl calls that follow, and instead set version to be whatever you have above, and build to be the last digit in the version number. So, for 3.10.0.3, I set:<br />
<pre><code>h2oBranch=rel-turing
#echo "Fetching latest build number for branch ${h2oBranch}..."
#curl --silent -o latest https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/latest
h2oBuild=3
#echo "Fetching full version number for build ${h2oBuild}..."
#curl --silent -o project_version https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/${h2oBuild}/project_version
h2oVersion=3.10.0.3
</code></pre>
The rest of that script, and the other EC2 scripts, can be left untouched.<br />
<h3 id="summary">
Summary</h3>
Well that was easy! No excuses! Having said that, I recommend you upgrade cautiously - I have seen some hard-to-test-for regressions (e.g. model learning no longer scaling as well over a cluster) when grabbing the latest version.Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-5975366840317900652016-03-21T05:08:00.001-07:002016-03-21T05:08:02.646-07:00WebGL: RWD, Mobile-First And Progressive Enhancement<h1 id="webgl-adapting-to-the-device">WebGL: adapting to the device</h1>
<p>This is a rather under-studied area, but it is going to become more important as WebGL is increasingly used to make websites. This article is a summary of my thoughts and learnings on the topic, so far.</p>
<p>I should say that my context is using WebGL for things other than games. Informational websites, educational apps, data visualization, etc., etc.</p>
<p>Please use the comments to add related links you can recommend; or just to disagree.</p>
<h2 id="what-is-rwd">What is RWD?</h2>
<p>Responsive Web Design. The idea is that, rather than make a mobile version of your website and a separate desktop version of your website, you make a single version in such a way that it will adapt and be viewable on all devices, from mobile phones (both portrait and landscape), through tablets, to desktop computers.</p>
<h2 id="mobile-first-progressive-enhancement">Mobile-First? Progressive Enhancement?</h2>
<p><em>Mobile First</em> is the idea that you first make your site work on the smallest screen, and the device with least capability. Then for the larger screens you add more sections.</p>
<p>This is in contrast to starting at the other end: make beautiful graphics designed for a FullHD desktop monitor, using both mouse and keyboard, then hiding and removing things as you move to the smaller devices.</p>
<p>Just remember Mobile-First is a guideline, not a rule. If you end up with a desktop site where the user is getting frustrated by having to simulate touch gestures with their mouse, then you’ve missed the point.</p>
<h2 id="it-is-hard">It Is Hard!</h2>
<p>RWD for a complex website can get rather hard. On toy examples it all seems nice and simple. But then add some forms. Make it multilingual. Add some CSS transitions and animations. Add user uploaded text or images. Then just as you start to crack under the strain of all the combinations that need testing, the fatal blow: the client insists on Internet Explorer 8 being supported.</p>
<p>But if you thought RWD and UX for normal websites was hard, then WebGL/3D takes it to a whole new dimension…</p>
<h2 id="progressive-enhancement-in-3d">Progressive Enhancement In 3D</h2>
<p>Progressive enhancement can be obvious things like using lower-polygon models and lower-resolution textures, or adding/removing shadows (see below).</p>
<p>But it can also be quite subtle things: in a 3D game presentation I did recently, the main avatar had an “idle” state animation: his chest moved up and down. But this requires redrawing the <em>whole</em> screen 60 times a second; without that idle animation the screen only needs to be redrawn when the character moves. Removing the idle animation can extend mobile battery life by an order of magnitude. </p>
<p>And that can lead to political issues. If you’ve ever seen a designer throw a tantrum just because you re-saved his graphics as 80% quality jpegs, think about what will happen if two-thirds of the design budget, and over three-quarters of the designer’s time went on making those subtle animations, and you’ve just switched them off for most of your users.</p>
<p>By the way, the key difference is between zero and one continuous animations. Remember an always-on “flyover” effect counts as animation. An arcade game where the user’s character is constantly being moved around the screen does too. So, once one effect requires constantly re-drawing the scene, the extra load of adding those little avatar animations will be negligible.</p>
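<p>To make the battery-saving idea concrete, here is a minimal, library-agnostic sketch of the dirty-flag pattern: redraw only when something has been invalidated. <code>createRenderLoop</code> and its methods are made-up names, and the render callback stands in for whatever your WebGL library provides; in a browser you would drive <code>tick()</code> from <code>requestAnimationFrame</code>.</p>

```javascript
// Dirty-flag rendering: only redraw when something actually changed.
// "renderFn" is a placeholder for your library's render call,
// e.g. () => renderer.render(scene, camera) in Three.JS.
function createRenderLoop(renderFn) {
  let needsRedraw = true;        // draw the very first frame
  return {
    invalidate() { needsRedraw = true; },  // call on any scene change
    tick() {                               // call once per animation frame
      if (!needsRedraw) return false;      // nothing changed: skip the draw
      needsRedraw = false;
      renderFn();
      return true;
    }
  };
}

// Usage sketch, counting how many real draws happen:
let draws = 0;
const loop = createRenderLoop(() => { draws++; });
loop.tick();        // initial frame: draws
loop.tick();        // nothing changed: skipped
loop.invalidate();  // e.g. the character moved
loop.tick();        // draws again
console.log(draws); // 2
```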
<h3 id="lower-poly-models">Lower-Poly Models</h3>
<p>I mentioned this in passing above. Be aware that it is often more involved than with 2D images, where using Gimp/Photoshop to turn an 800x600 image into a 320x240 one at lower quality can be automated. In fact you may end up doubling your designer costs, if they have to make two versions.</p>
<p>If the motivation for low-poly is to reduce download time, you could consider running a sub-surf modifier once the data has been downloaded, or describing the shape with a spline and dynamically extruding it.</p>
<p>If the motivation is to reduce the number of polygons to reduce CPU/GPU effort, again consider the extrude approach but using different splines, and/or different bevels.</p>
<h3 id="shadows">Shadows</h3>
<p>Adding shadows increases the realism of a 3D scene, but adds more CPU/GPU effort. Also more programmer effort: you need to specify which objects cast shadows, which objects receive shadows, and which light sources cast shadows. (All libraries I mentioned in <a href="http://darrendev.blogspot.co.uk/2016/03/comparison-of-three-webgl-libraries.html">Comparison Of Three WebGL Libraries</a> handle shadows in this way.)</p>
<p>For many data visualization tasks, shadows are unnecessary, and could even get in the way. Even for games they are usually an optional extra. But in some applications the sense of depth that shadows give can really improve the user experience (UX).</p>
<p>If you have a fixed viewing angle on your 3D scene, and fixed lighting, you can use pre-made shadows: these are simply a 2D graphic that holds the shadow.</p>
<h3 id="vr">VR</h3>
<p>With virtual reality headsets you will be updating two displays, and it has to be at a very high refresh rate, so it is demanding on hardware.</p>
<p>But virtual reality is well-suited for progressive enhancement: just make sure your website or application is fully usable without it, but if a user has the headset they are able to experience a deeper immersion.</p>
<h2 id="controls-in-3d">Controls In 3D</h2>
<p>Your standard web page didn’t need much controlling: up/down was the only axis of movement, and being able to click a link. Touch-only mobile devices could adapt easily: drag up/down, and tap to follow a link.</p>
<p>Mouseover hints are usually done as progressive enhancements, meaning they are not available to people not using a device with a mouse. (Meaning in mobile apps I often have no idea what all the different icons do…)</p>
<p>If your WebGL involves the user navigating around a 3D world, the four arrow keys can be a very natural approach. But there is no common convention on a touch-only device. Some games show a semi-transparent joystick control on top, so you press that for the 4 directions. Others have you touch the left/right halves of the screen to steer left and right, and perhaps you move at a constant speed. </p>
<p>Another approach is to touch/click the point you want to move to, and have your app choose the route, and animate following it.</p>
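<p>That touch/click-to-move approach can be sketched in a few lines of library-agnostic JavaScript; <code>stepToward</code> is a hypothetical helper you would call once per frame with your character’s current position and the clicked target:</p>

```javascript
// Move "pos" toward "target" at a constant speed per frame.
// Returns true once the target has been reached.
function stepToward(pos, target, speed) {
  const dx = target.x - pos.x;
  const dy = target.y - pos.y;
  const dist = Math.hypot(dx, dy);
  if (dist <= speed) {            // close enough: snap to the target
    pos.x = target.x;
    pos.y = target.y;
    return true;
  }
  pos.x += (dx / dist) * speed;   // otherwise take one step along the line
  pos.y += (dy / dist) * speed;
  return false;
}

// Usage sketch: call once per frame until it returns true.
const pos = { x: 0, y: 0 };
const target = { x: 10, y: 0 };
let frames = 0;
while (!stepToward(pos, target, 3)) frames++;
console.log(frames, pos.x); // 3 10
```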
<p>Zoom is an interesting one, as the approach for standard web sites can generally be used for 3D too. There are two conventions on mobile: the <em>pinch</em> to grow/shrink, or <em>double-tap</em> to zoom a fixed distance (and double-tap to restore). With a mouse, the scroll-wheel, while holding down ctrl, zooms. With only a keyboard, ctrl and plus/minus, with ctrl and zero to restore to default zoom.</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-36485914266859771672016-03-18T05:41:00.001-07:002016-03-29T00:34:28.594-07:00Timestamp helper in Handlebars<a href="http://handlebarsjs.com/">Handlebars</a> is a widely-used templating language for web pages. In a nutshell, the variables to insert go between <code>{{</code> and <code>}}</code>. Easy. It offers a few bits of logic, such as if/else clauses, for-each loops, etc. But, just as usefully, Handlebars allows you to add helper functions of your own.<br />
In this article I will show a nice little Handlebars helper to format datestamps and timestamps. Its raison d’être is its support for multiple languages and timezones. The simplest use case (assuming <code>birthday</code> is their birthday in some common format):<br />
<pre class="prettyprint"><code class=" hljs handlebars"><span class="xml"><span class="hljs-tag"><<span class="hljs-title">p</span>></span>Your birthday is on </span><span class="hljs-expression">{{<span class="hljs-variable">timestamp</span> <span class="hljs-variable">birthday</span>}}</span><span class="xml">.<span class="hljs-tag"></<span class="hljs-title">p</span>></span></span></code></pre>
It builds on top of <a href="http://sugarjs.com/">sugar.js</a>’s <a href="http://sugarjs.com/dates">Date enhancements</a>; I was going to do this article without using them, to keep it focused, but that would have made it unreasonably complex.<br />
There are two ways to configure it: with global variables, or with per-tag options. For most applications, setting the globals once will be best. Here are the globals it expects to find:<br />
<ul>
<li><code>tzOffset</code>: the number of seconds your timezone is ahead of UTC. E.g. if in Japan, then <code>tzOffset = 9*3600</code>. If in the U.K. this is either <code>0</code> or <code>3600</code> depending on if it is summer time or not.</li>
<li><code>lang</code>: The user-interface language, e.g. “en” for English, “ja” for Japanese, etc.</li>
</ul>
(By the way, if setting lang to something other than “en”, you will also need to have included locale support into sugar.js for the languages you are supporting - this is easy, see the <a href="http://sugarjs.com/customize">sugar.js customize page</a>, and check Date Locales.)<br />
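If you would rather default <code>tzOffset</code> to the browser’s own timezone instead of setting it by hand, it can be derived from the standard <code>Date.prototype.getTimezoneOffset()</code>. Note that it reports minutes <i>behind</i> UTC (so Japan gives -540), meaning the sign flips:

```javascript
// getTimezoneOffset() returns minutes *behind* UTC (e.g. -540 in Japan),
// so flip the sign and convert to seconds to get "seconds ahead of UTC",
// matching the tzOffset global the helper expects.
var tzOffset = -new Date().getTimezoneOffset() * 60;

// Sanity check: real-world zones sit within UTC-12:00 .. UTC+14:00.
console.log(tzOffset >= -12 * 3600 && tzOffset <= 14 * 3600); // true
```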
The default timestamp format is the one built-in to sugar.js for your specified language. All these configuration options (the two above, and format) can be overridden when using the tag. E.g. if <code>start</code> is the timestamp of when an online event starts, you could write:<br />
<pre class="prettyprint"><code class=" hljs handlebars"><span class="xml"><span class="hljs-tag"><<span class="hljs-title">p</span>></span>The live streaming will start at
</span><span class="hljs-expression">{{<span class="hljs-variable">timestamp</span> <span class="hljs-variable">start</span> <span class="hljs-variable">tzOffset</span>=0}}</span><span class="xml"> UTC,
which is </span><span class="hljs-expression">{{<span class="hljs-variable">timestamp</span> <span class="hljs-variable">start</span> <span class="hljs-variable">tzOffset</span>=32400}}</span><span class="xml">
in Tokyo and </span><span class="hljs-expression">{{<span class="hljs-variable">timestamp</span> <span class="hljs-variable">start</span> <span class="hljs-variable">tzOffset</span>=<span class="hljs-variable">-</span>25200}}</span><span class="xml">
in San Francisco.<span class="hljs-tag"></<span class="hljs-title">p</span>></span></span></code></pre>
Here is the basic version:<br />
<pre class="prettyprint"><code class=" hljs javascript">Handlebars.registerHelper(<span class="hljs-string">'timestamp'</span>, <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(t, options)</span>{</span>
<span class="hljs-keyword">var</span> offset = options.hash.tzOffset;
<span class="hljs-keyword">if</span>(!offset)offset = tzOffset;
<span class="hljs-keyword">if</span>(!<span class="hljs-built_in">Object</span>.isDate(t)){
<span class="hljs-keyword">if</span>(!t)<span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
<span class="hljs-keyword">if</span>(<span class="hljs-built_in">Object</span>.isString(t))t = <span class="hljs-built_in">Date</span>.create(t + <span class="hljs-string">"+0000"</span>).setUTC(<span class="hljs-literal">true</span>).addSeconds(offset);
<span class="hljs-keyword">else</span> t = <span class="hljs-built_in">Date</span>.create(t*<span class="hljs-number">1000</span>).setUTC(<span class="hljs-literal">true</span>).addSeconds(offset);
}
<span class="hljs-keyword">else</span> t = t.clone().addSeconds(offset);
<span class="hljs-keyword">if</span>(!t.isValid())<span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
<span class="hljs-keyword">var</span> code = options.hash.lang;
<span class="hljs-keyword">if</span>(!code)code = lang; <span class="hljs-comment">//Use global as default</span>
<span class="hljs-keyword">var</span> format = options.hash.format ? options.hash.format : <span class="hljs-string">''</span>;
<span class="hljs-keyword">return</span> t.format(format, code);
});</code></pre>
The first two-thirds of the function turn <code>t</code> into a <code>Date</code> object, coping with whether it was already a <code>Date</code> object, or a string (in UTC, and in any common format that <code>Date.create()</code> can cope with), or a number (in which case it is seconds since Jan 1st 1970 UTC). However, be careful if giving a pre-made <code>Date</code> object: make sure it is the time in UTC <i>and</i> specifies that it is in UTC.<br />
The rest of the function just chooses the language and format, and returns the formatted date string.<br />
If you were paying attention you would have noticed <code>t</code> stores a lie. E.g. for 5pm BST, <code>t</code> would be given as 4pm UTC. We then turn it into a date that claims to be 5pm UTC. Basically this is to stop format() being too clever, and adjusting for local browser time. (This trick is so you can show a date in a browser for something other than the user’s local timezone.)<br />
But it does mean that if you include any of the timezone specifiers in your format string, they will wrongly claim it is UTC. <code>{{timestamp theDeadline format="{HH}:{mm} {tz}" }}</code> will output <code>17:00 +0000</code>.<br />
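The underlying trick is easy to see with a plain <code>Date</code>, independent of sugar.js: shift the epoch value by the offset, then format the result as if it were UTC (a minimal sketch):

```javascript
// The "stored lie": add the zone offset to the epoch seconds, then format
// the result as UTC. For 16:00 UTC with a +1h offset (BST) we get a Date
// that *claims* to be 17:00 UTC.
var epochSecs = 57600;   // 16:00 UTC on Jan 1st 1970
var tzOffset = 3600;     // one hour ahead of UTC
var shifted = new Date((epochSecs + tzOffset) * 1000);
var hhmm = shifted.toISOString().substring(11, 16);
console.log(hhmm);       // "17:00"
```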
To allow you to explicitly specify the true timezone, here is an enhanced version:<br />
<pre class="prettyprint"><code class=" hljs javascript">Handlebars.registerHelper(<span class="hljs-string">'timestamp'</span>, <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(t, options)</span>{</span>
<span class="hljs-keyword">var</span> offset = options.hash.tzOffset;
<span class="hljs-keyword">if</span>(!offset)offset = tzOffset; <span class="hljs-comment">//Use global as default</span>
<span class="hljs-keyword">if</span>(!<span class="hljs-built_in">Object</span>.isDate(t)){
<span class="hljs-keyword">if</span>(!t)<span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
<span class="hljs-keyword">if</span>(<span class="hljs-built_in">Object</span>.isString(t))t = <span class="hljs-built_in">Date</span>.create(t + <span class="hljs-string">"+0000"</span>).setUTC(<span class="hljs-literal">true</span>).addSeconds(offset);
<span class="hljs-keyword">else</span> t = <span class="hljs-built_in">Date</span>.create(t*<span class="hljs-number">1000</span>).setUTC(<span class="hljs-literal">true</span>).addSeconds(offset);
}
<span class="hljs-keyword">else</span> t = t.clone().addSeconds(offset);
<span class="hljs-keyword">if</span>(!t.isValid())<span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;
<span class="hljs-keyword">var</span> code = options.hash.lang;
<span class="hljs-keyword">if</span>(!code)code = lang; <span class="hljs-comment">//Use global as default</span>
<span class="hljs-keyword">var</span> format = options.hash.format ? options.hash.format : <span class="hljs-string">''</span>;
<span class="hljs-keyword">var</span> s = t.format(format, code);
<span class="hljs-keyword">if</span>(options.hash.appendTZ)s+=tzString;
<span class="hljs-keyword">if</span>(options.hash.append)s+=options.hash.append;
<span class="hljs-keyword">return</span> s;
});</code></pre>
(the only change is to add a couple of lines near the end)<br />
Now if you specify <code>appendTZ=true</code> then it will append the global <code>tzString</code>. Alternatively you can append any text you want by specifying <code>append</code>. So, our earlier example becomes one of these:<br />
<pre class="prettyprint"><code class=" hljs handlebars"><span class="hljs-expression">{{timestamp theDeadline format="{HH}:{mm}" appendTZ=true}}</span>
<span class="hljs-expression">{{timestamp theDeadline format="{HH}:{mm}" append="BST"}}</span>
<span class="hljs-expression">{{timestamp theDeadline format="{HH}:{mm}" append=theDeadlineTimezone}}</span></code></pre>
The first one assumes a global <code>tzString</code> is set. The second hard-codes the timezone, which is rarely what you want; the third is the same idea, but takes the timezone from another variable. <br />
VERSION INFO: The above code is for sugar.js v.1.5.0, which is the latest version at the time of writing, and likely to be so for a while. If you need it for sugar.js 1.4.x then please change all occurrences of <code>setUTC(true)</code> to <code>utc()</code>.Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-7177598851824998192016-03-09T01:58:00.001-08:002016-03-09T03:06:46.005-08:00Comparison Of Three WebGL Libraries<h1 id="comparison-of-three-webgl-libraries">
Comparison Of Three WebGL Libraries</h1>
For many people, WebGL is a technology for making browser-based games, but I am more interested in all the other uses: data visualization, data presentation, making web sites look fantastic, new and interesting user experience (UX), etc. (I have spent many years using Flash for similar things.)<br />
<h2 id="what-is-webgl">
What is WebGL?</h2>
WebGL is an API that lets browsers use the GPU to speed up 2D and 3D graphics; you write in a mix of JavaScript and a shader language. Because it is low-level and complex, I recommend against writing raw WebGL; use a library instead.<br />
<br />
It is supported on just about any popular OS/browser combination, including tablets and mobile phones. Your device does not need a dedicated GPU to run WebGL.<br />
<h2 id="what-libraries-are-there">
What libraries are there?</h2>
There are actually quite a few choices, but for this article I will focus on the three libraries I have made (non-trivial) WebGL applications with:<br />
<ul>
<li>Three.JS (<a href="http://threejs.org/">http://threejs.org/</a>)</li>
<li>Babylon.JS (<a href="http://www.babylonjs.com/">http://www.babylonjs.com/</a>)</li>
<li>Superpowers (<a href="http://superpowers-html5.com/">http://superpowers-html5.com/</a>)</li>
</ul>
The first two are fairly low-level (Babylon.JS has a few more abstractions built-in), meaning you will be thinking in terms of vertices, faces, 3D coordinates, cameras, lighting, etc. A 3D graphics background will be useful. Superpowers is higher-level, but more focused on games development. Some Blender (or equivalent) skills will also come in handy, whichever library you go for.<br />
<h2 id="threejs-and-its-resources">
Three.js And Its Resources</h2>
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNbwMlg80BZ74_JOJOKpz7DmI30pOX9N8jpQIW2VGcHyStBeyJqrcnhicDDM38LQ55WZrDkcczn_8geVSvmS75Qr4uk_-ppWtVANdl-bsCUNJPdtE1ibZZMMI5xh12Zkul-l3XmRX_cWbH/s1600/threejs.screenshot.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNbwMlg80BZ74_JOJOKpz7DmI30pOX9N8jpQIW2VGcHyStBeyJqrcnhicDDM38LQ55WZrDkcczn_8geVSvmS75Qr4uk_-ppWtVANdl-bsCUNJPdtE1ibZZMMI5xh12Zkul-l3XmRX_cWbH/s1600/threejs.screenshot.png" /></a>Three.JS is the most established WebGL library, with some published books, many demos (<a href="http://threejs.org/">http://threejs.org/</a>, <a href="https://stemkoski.github.io/Three.js/">https://stemkoski.github.io/Three.js/</a> and others), even <a href="https://www.udacity.com/course/interactive-3d-graphics--cs291">a Udacity course</a>.<br />
<br />
However, it has scant regard for backwards compatibility, meaning the code in the published books (and the source code of older demos and tutorials) frequently will not work with the latest library version. It also has a relatively aggressive developer community, who think that having an uncommented demo of a feature counts as documentation.<br />
<br />
It uses the MIT license (the most liberal open-source license - fine for commercial use) and is <a href="https://github.com/mrdoob/three.js">hosted on GitHub</a>; bug reports go to GitHub, but support questions to <a href="https://stackoverflow.com/questions/tagged/three.js">StackOverflow’s [three.js] tag</a>.<br />
<h2 id="babylonjs-and-its-resources">
Babylon.js And Its Resources</h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk7wlV9Xa1PXQKX7BKOc9fsKqzwG7FD3gJXOB2E3tH6rLHbMWVtWC1mybj0BWGmbWNOEz8F0lz4J7ddxHFlycTLz55Pn1bEyD8hMevv36_mcofkRcQ4AubLd4_45Sb6Kuel6rEAALK929c/s1600/babylonjs.screenshot.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk7wlV9Xa1PXQKX7BKOc9fsKqzwG7FD3gJXOB2E3tH6rLHbMWVtWC1mybj0BWGmbWNOEz8F0lz4J7ddxHFlycTLz55Pn1bEyD8hMevv36_mcofkRcQ4AubLd4_45Sb6Kuel6rEAALK929c/s1600/babylonjs.screenshot.png" /></a>Babylon.JS is now two years old, and was developed at Microsoft in France, though it is open source (Apache license, so fine for commercial use). It is primarily intended for making games, but is flexible enough for other work.<br />
<br />
Like Three.JS, it has <a href="http://www.babylonjs.com/">plenty of demos</a>, and again they are often undocumented. There is <a href="http://www.html5gamedevs.com/forum/16-babylonjs/">an active web forum</a>; explanations and experiments there often link to <a href="http://www.babylonjs-playground.com/">the Babylon Playground</a>, which is a live coding editor. There is also a very useful <a href="https://mva.microsoft.com/en-US/training-courses/introduction-to-webgl-3d-with-html5-and-babylon-js-8421?l=PjfDpUKz_4304984382">eight-hour training video course</a> (free), presented by the two Davids who created Babylon.JS. (There is a just-released book, <a href="https://www.packtpub.com/game-development/babylonjs-essentials">https://www.packtpub.com/game-development/babylonjs-essentials</a>, but I’ve not seen it, so cannot comment.)<br />
<h2 id="superpowers-and-its-resources">
Superpowers And Its Resources</h2>
Superpowers is a bit different: it is a gaming system, with its own IDE. It is very new, only released as open source (ISC license, which is basically the nice liberal MIT license again) in the middle of January 2016, though appears to have a year’s closed development behind it. (The IDE is cross-platform; it has been running nicely for me on Linux, I’ve not tried it on other platforms.)<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsdiZNTcTIgeR-FIFDslPLXplAYWzYH0VCTVnV4WJ1AYx3s7EHGzop7aBCX_nHVMK9wfnEryQVcVcxy7fjWQ561S4a33WMb39YILG4pxVNKm7qAYPUDBWgdaOjmvh_h_e5kL3F1l5i8Ycn/s1600/TGE%252BhM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsdiZNTcTIgeR-FIFDslPLXplAYWzYH0VCTVnV4WJ1AYx3s7EHGzop7aBCX_nHVMK9wfnEryQVcVcxy7fjWQ561S4a33WMb39YILG4pxVNKm7qAYPUDBWgdaOjmvh_h_e5kL3F1l5i8Ycn/s1600/TGE%252BhM.png" /></a>Some of the initial batch of <a href="https://itch.io/c/27733/games-made-with-superpowers">demos and games</a> have been released on GitHub (kind of as open-source - the licenses are a bit vague, especially regarding re-use of assets), which has been my main source of learning. A few tutorials have also appeared recently (<a href="http://www.gamefromscratch.com/post/2016/02/01/Superpowers-Tutorial-Series-Part-One-Getting-Started.aspx">GameFromScratch.com</a>, and <a href="https://itch.io/board/11494/tutorials-guides">https://itch.io/board/11494/tutorials-guides</a>).<br />
<br />
What grabbed my attention was the quality and <i>completeness</i> of <a href="https://sparklinlabs.itch.io/fat-kevin">the Fat Kevin game</a>, combined with the fact that I could download all source and assets for it, to learn from. (The <a href="https://sparklinlabs.itch.io/discover-superpowers">Discover Superpowers demo</a> is similar, but simpler, so easier to learn from.)<br />
<br />
Support is through <a href="https://itch.io/engine/superpowers/community">forums on itch.io</a>, with separate English and French sections. This requires yet another user account; I find it a shame they didn’t use StackOverflow, Github, or at least HTML5 Game Devs (as Babylon did). I’d not heard of itch.io (“an open marketplace for independent digital creators with a focus on independent video games”) before, but I think their choice tells you how they see Superpowers being used.<br />
<br />
The coding language is TypeScript, essentially JavaScript plus static types; it is worth declaring those types, as then the IDE’s helpful autocomplete can work. Note that Superpowers is closely tied to the IDE - you need to be clicking and dragging things; doing everything in code is not realistic (though this might just be the style of the initial few games). Superpowers is built on Three.JS, but I’m not seeing anything exposed, so I don’t think you can take a Three.JS example and use it directly.<br />
<h2 id="conclusion">
Conclusion</h2>
Which library to choose? I suggest you try out the demos for each, and pick the library whose demos cover all the things you want to do. If the choice comes down to Three.JS vs. Babylon.JS and you cannot find a killer reason to prefer one, that is because it doesn’t really matter: each can do 95%+ of what the other can. Follow your hunch, choose one, and dive in and learn it.<br />
<br />
Finally, I should say that WebGL for website development is hard: your programmer(s) will need 3D experience, as will your graphic designer(s). If you are using RWD/mobile-first to target both mobile and desktop, it is even more complex. My company, QQ Trend Ltd., can help (contact me at dc [at] qqtrend.com).Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-59202199203109303522016-03-07T12:56:00.001-08:002016-03-08T01:14:19.777-08:00Gradients in Three.JS<h1 id="gradients-in-threejs">
Gradients in Three.js</h1>
<a href="https://blogger.googleusercontent.com/img/proxy/AVvXsEjdR5xDRyC7QasyqVTDMC-8sOcdBrPwysQT78FmH7WNGnslp0ZHbSYQALwyabgFyWO8GhLDVlWqP6vteVIPeHBBI9gSspogXX7kjAyI8g44ONKIaOt3-ZcnPgiuX36H78KKKg=" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="3D charts with gradients" border="0" src="http://dcook.org/work/charts/q.png" title="" /></a>Many years ago, I made some charts in Flash: 3D histograms using boxes and pyramids. More as proof of concept than anything else, I used two types of gradients on the sides:<br />
<ul>
<li>Colour gradients (e.g. red at the bottom of the bar, gradually changing to be yellow at the top)</li>
<li>Opacity gradients (e.g. solid at the bottom, gradually changing to be only 20% opaque at the top)</li>
</ul>
Recently I’ve been trying to reproduce (and go beyond) those charts in WebGL. Gradients seem to be both harder, and less flexible, than they were in ActionScript/Flash.<br />
<br />
I’ve been working with two libraries, <a href="http://threejs.org/" target="_blank">Three.JS</a> and <a href="http://www.babylonjs.com/" target="_blank">Babylon.JS</a>. In Babylon.JS I couldn’t find any examples of how to do either type of gradient. In Three.JS I believe there is no support for opacity gradients, but colour gradients are possible, and that will be the theme of this article.<br />
<h2 id="threejs-mesh-geometry-material">
Three.js: mesh, geometry, material</h2>
I will assume some familiarity with WebGL and Three.JS concepts, but the essential knowledge you need to follow this article is:<br />
<ul>
<li><code>geometry</code> is a shape.</li>
<li><code>material</code> is the appearance.</li>
<li><code>mesh</code> is a geometry plus a material.</li>
</ul>
Most of the time your geometry and your material are orthogonal. E.g. if you have a red shiny material you can apply it equally easily to your pyramid or your torus. And you can just as easily tile a grass image on either of those shapes.<br />
<br />
A more tightly coupled (less orthogonal) example is a game character (a mesh) you have made in, say, Blender, with a special texture map (a material) to give it a face and clothes. The mesh and the material are basically tied together. However, if the mesh comes with multiple poses, or animations, the same texture map works for all of them. And you can repaint the texture map to give your mesh (in any of its poses) new clothes.<br />
<br />
In contrast, gradients are highly coupled; at least in the way I will show you here. Like coupling in software engineering, this is bad: I cannot prepare a red-to-yellow gradient material, and then apply it to any mesh; instead I have to embed the gradient description into that mesh, in a way specific to that mesh.<br />
<h2 id="vertex-colours">
Vertex Colours</h2>
The way it works is you can switch a material to use VertexColors. E.g.<br />
<code class=" hljs cs"><span class="hljs-keyword"> var</span> mat = <span class="hljs-keyword">new</span> THREE.MeshPhongMaterial({vertexColors:THREE.VertexColors});</code><br />
And then over in the mesh you specify a colour for each vertex. If you do this, then Three.JS will, for each triangle, blend the vertex colours in a smooth gradient. All faces in Three.JS are triangles, and vertices are referenced through each face, so you end up with lots of lines like this:<br />
<pre class="prettyprint"><code class=" hljs avrasm"></code></pre>
<pre class="prettyprint"><code class=" hljs avrasm"> myGeo<span class="hljs-preprocessor">.faces</span>[ix]<span class="hljs-preprocessor">.vertexColors</span> = [c1, c2, c3]<span class="hljs-comment">;</span></code></pre>
where each of <code>c1</code>, <code>c2</code> and <code>c3</code> are <code>THREE.Color</code> instances.<br />
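As an aside, the blending itself is plain barycentric interpolation. The following framework-free sketch (my own illustration, not Three.JS code) shows the weighted average the rasteriser computes at each point of a triangle:<br />

```javascript
// Barycentric blend of three vertex colours (each [r, g, b] in 0..1) at a
// point whose barycentric weights satisfy w1 + w2 + w3 = 1. This is
// essentially what the GPU does across each triangle when vertexColors
// are set on the material.
function blend(c1, c2, c3, w1, w2, w3) {
  return c1.map((_, i) => w1 * c1[i] + w2 * c2[i] + w3 * c3[i]);
}

const red = [1, 0, 0], paleYellow = [1, 1, 0.4];
// Half-way along an edge from a red corner towards a pale-yellow corner:
console.log(blend(red, paleYellow, paleYellow, 0.5, 0.5, 0));
```

The blend at the midpoint of that edge comes out as [1, 0.5, 0.2], i.e. an orange half-way between the two corner colours.<br />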
<br />
By the way, I said opacity gradients were not possible with this technique (because <code>vertexColors</code> takes an RGB triplet, not RGBA), but it is still possible to make the whole mesh semi-transparent.<br />
<h2 id="boxgeometry-vertices">
BoxGeometry Vertices</h2>
A <code>THREE.BoxGeometry</code> is used to make a 6-faced cuboid shape. To be more precise, it is a shape made up of 12 triangles (two on each face). To be able to set vertexColors you need to know the order of those 12 triangles. I reverse-engineered it, by colouring each in turn, to get the following list:<br />
<ul>
<li>Face 0: the top-left triangle of one of the side faces; its vertex 0 is top-left, vertex 1 is bottom-left, vertex 2 is top-right (anti-clockwise).</li>
<li>Face 1: the bottom-right triangle of the same face; its vertex 0 is bottom-left, vertex 1 is bottom-right, vertex 2 is top-right (anti-clockwise).</li>
<li>Faces 2/3: the same for the opposite side.</li>
<li>Faces 4/5: the top. (Face 4’s vertex 0 touches face 2’s vertex 0.)</li>
<li>Faces 6/7: the bottom.</li>
<li>Faces 8/9: one of the remaining sides.</li>
<li>Faces 10/11: the other side.</li>
</ul>
<h2 id="a-factory-function">
A Factory Function</h2>
Here is a factory function to make a box mesh with colour c1 on the base, colour c2 on the top, and each side having a smooth linear gradient from c1 at the bottom to c2 at the top.<br />
<br />
You can specify <code>c1</code> and <code>c2</code> as either a hex code (e.g. <code>0xff0000</code>) or as a <code>THREE.Color</code> object. <code>w</code>, <code>d</code> and <code>h</code> are the three dimensions of the box. <code>opacity</code> is optional, and ranges from 0.0 (invisible) to 1.0 (fully opaque - the default).<br />
<br />
<pre class="prettyprint"><code class=" hljs javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">makeGradientCube</span><span class="hljs-params">(c1, c2, w, d, h, opacity)</span>{</span>
<span class="hljs-keyword">if</span>(<span class="hljs-keyword">typeof</span> opacity === <span class="hljs-string">'undefined'</span>)opacity = <span class="hljs-number">1.0</span>;
<span class="hljs-keyword">if</span>(<span class="hljs-keyword">typeof</span> c1 === <span class="hljs-string">'number'</span>)c1 = <span class="hljs-keyword">new</span> THREE.Color( c1 );
<span class="hljs-keyword">if</span>(<span class="hljs-keyword">typeof</span> c2 === <span class="hljs-string">'number'</span>)c2 = <span class="hljs-keyword">new</span> THREE.Color( c2 );
<span class="hljs-keyword">var</span> cubeGeometry = <span class="hljs-keyword">new</span> THREE.BoxGeometry(w, h, d);
<span class="hljs-keyword">var</span> cubeMaterial = <span class="hljs-keyword">new</span> THREE.MeshPhongMaterial({
vertexColors:THREE.VertexColors
});
<span class="hljs-keyword">if</span>(opacity < <span class="hljs-number">1.0</span>){
cubeMaterial.opacity = opacity;
cubeMaterial.transparent = <span class="hljs-literal">true</span>;
}
<span class="hljs-keyword">for</span>(<span class="hljs-keyword">var</span> ix=<span class="hljs-number">0</span>;ix<<span class="hljs-number">12</span>;++ix){
<span class="hljs-keyword">if</span>(ix==<span class="hljs-number">4</span> || ix==<span class="hljs-number">5</span>){ <span class="hljs-comment">//Top edge, all c2</span>
cubeGeometry.faces[ix].vertexColors = [c2,c2,c2];
}
<span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span>(ix==<span class="hljs-number">6</span> || ix==<span class="hljs-number">7</span>){ <span class="hljs-comment">//Bottom edge, all c1</span>
cubeGeometry.faces[ix].vertexColors = [c1,c1,c1];
}
<span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span>(ix%<span class="hljs-number">2</span> ==<span class="hljs-number">0</span>){ <span class="hljs-comment">//First triangle on each side edge</span>
cubeGeometry.faces[ix].vertexColors = [c2,c1,c2];
}
<span class="hljs-keyword">else</span>{ <span class="hljs-comment">//Second triangle on each side edge</span>
cubeGeometry.faces[ix].vertexColors = [c1,c1,c2];
}
}
<span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> THREE.Mesh(cubeGeometry, cubeMaterial);
}</code></pre>
<pre class="prettyprint"><code class=" hljs javascript"> </code></pre>
Given the earlier explanation, I hope the code is self-explanatory: make a material where all we set is that we will use VertexColors (and, optionally, that it is partially transparent), make a box mesh, and then go through all 12 faces, work out which face it is, and set the colours of the three corners accordingly.<br />
<h2 id="a-full-example">
A Full Example</h2>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1xeCSjETgYcdtl7t8y1-qjT_yBw_TWELj8i7wfrFP-RYRCEg6egFIU8NhbcboILi388GKxw0KGmuEV6TOQhf46e5Q1GUMjUP9AvaePvH8t9V5CTM4bmwEK1mL0qdKpC_rZQb4uuWWo5T_/s1600/gradient_result.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="218" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1xeCSjETgYcdtl7t8y1-qjT_yBw_TWELj8i7wfrFP-RYRCEg6egFIU8NhbcboILi388GKxw0KGmuEV6TOQhf46e5Q1GUMjUP9AvaePvH8t9V5CTM4bmwEK1mL0qdKpC_rZQb4uuWWo5T_/s320/gradient_result.png" width="320" /></a>Here is a complete example (you’ll need to paste in the above code, where shown), to quickly demonstrate that it works. (This was tested with r74, but as far as I know it should work back to at least r65.)<br />
<br />
The code is minimal: make a scene, with a camera and a light, and put a gradient box (2x3 units at the base, 6 units high) at the centre of the scene. The gradient goes from red to a pale yellow. I made it slightly transparent (the 0.8 for the final parameter), but as it is the only object in the scene this has no effect (except to dim the colours a bit, because of the black background)!<br />
<br />
<pre class="prettyprint"><code class=" hljs xml"><span class="hljs-doctype"><!DOCTYPE html></span>
<span class="hljs-tag"><<span class="hljs-title">html</span>></span>
<span class="hljs-tag"><<span class="hljs-title">head</span>></span>
<span class="hljs-tag"><<span class="hljs-title">title</span>></span>Gradient test<span class="hljs-tag"></<span class="hljs-title">title</span>></span>
<span class="hljs-tag"><<span class="hljs-title">script</span> <span class="hljs-attribute">src</span>=<span class="hljs-value">"https://cdnjs.cloudflare.com/ajax/libs/three.js/r74/three.min.js"</span>></span><span class="hljs-tag"></<span class="hljs-title">script</span>></span>
<span class="hljs-tag"></<span class="hljs-title">head</span>></span>
<span class="hljs-tag"><<span class="hljs-title">script</span>></span><span class="javascript">
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">makeGradientCube</span><span class="hljs-params">(c1, c2, w, d, h, opacity)</span>{</span><span class="hljs-comment">/*As above*/</span>}
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">init</span><span class="hljs-params">()</span> {</span>
<span class="hljs-keyword">var</span> scene = <span class="hljs-keyword">new</span> THREE.Scene();
<span class="hljs-keyword">var</span> camera = <span class="hljs-keyword">new</span> THREE.PerspectiveCamera(<span class="hljs-number">45</span>,
window.innerWidth / window.innerHeight, <span class="hljs-number">0.1</span>, <span class="hljs-number">1000</span>);
<span class="hljs-keyword">var</span> renderer = <span class="hljs-keyword">new</span> THREE.WebGLRenderer();
renderer.setClearColor(<span class="hljs-number">0x000000</span>, <span class="hljs-number">1.0</span>);
renderer.setSize(window.innerWidth, window.innerHeight);
<span class="hljs-keyword">var</span> dirLight = <span class="hljs-keyword">new</span> THREE.DirectionalLight();
dirLight.position.set(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>, <span class="hljs-number">20</span>);
scene.add(dirLight);
scene.add( makeGradientCube(<span class="hljs-number">0xff0000</span>, <span class="hljs-number">0xffff66</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">6</span>, <span class="hljs-number">0.8</span>) );
camera.position.set(<span class="hljs-number">10</span>,<span class="hljs-number">10</span>,<span class="hljs-number">10</span>);
camera.lookAt(scene.position);
document.body.appendChild(renderer.domElement);
renderer.render(scene, camera);
}
window.onload = init;
</span><span class="hljs-tag"></<span class="hljs-title">script</span>></span>
<span class="hljs-tag"><<span class="hljs-title">body</span>></span>
<span class="hljs-tag"></<span class="hljs-title">body</span>></span>
<span class="hljs-tag"></<span class="hljs-title">html</span>></span></code></pre>
<div id="sources">
<br /></div>
<div id="sources">
<br /></div>
<h2 id="sources">
Sources</h2>
The above code was based on studying <a href="http://threejs.org/examples/webgl_geometry_colors.html" target="_blank">this example</a> and trying to work out how it was doing that. It is undocumented - par for the course with Three.JS examples, sadly. I also peeked at the Three.JS source code. If you want more undocumented code examples of using <code>THREE.VertexColors</code>, see <a href="https://stemkoski.github.io/Three.js/Vertex-Colors.html" target="_blank">https://stemkoski.github.io/Three.js/Vertex-Colors.html</a><br />
<h2 id="future-work">
Future Work</h2>
First, if you write your own shaders I believe anything and everything is possible.<br />
<br />
Second, I wonder about making a gradient in a 2D canvas, and using that as a texture map. And/or using it as the alpha map to create an opacity gradient.<br />
<br />
Either of those may be the subject of a future article. In the meantime, if you know a good tutorial on using gradients in either Babylon.JS or Three.JS, please link to it in the comments. Thanks, and thanks for reading!
Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-26858333607807591462016-02-29T10:02:00.001-08:002016-03-08T06:07:17.409-08:00Factorials and Rationals in R<p>At the recent <a href="http://www.nordevcon.com/">NorDevCon</a>, <a href="https://richorama.github.io/">Richard Astbury</a> used the example of factorial to compare some “modern” languages (the theme of his talk being that they were all invented decades ago). Among them was Scheme, which impressed me by having native support for very large numbers and rational numbers.</p>
<p>I felt like my go-to-language for maths-stuff, R, ought to be able to do that, too. The shortest R way to calculate factorials (using built-in functions) I can find is:</p>
<pre class="prettyprint"><code class=" hljs erlang"><span class="hljs-function"><span class="hljs-title">prod</span><span class="hljs-params">(<span class="hljs-number">1</span>:<span class="hljs-number">5</span>)</span></span></code></pre>
<p>which gives 120. But a double only carries about 15 significant digits, so <code>prod(1:50)</code> only gives an approximation (about <code>3.041409e+64</code>), and large enough factorials overflow to <code>Inf</code>.</p>
<p>BTW, I hoped this might work:</p>
<pre class="prettyprint"><code class=" hljs autohotkey"><span class="hljs-escape">`*</span><span class="hljs-escape">`(</span><span class="hljs-number">1</span>:<span class="hljs-number">5</span>)</code></pre>
<p>but the <code>*</code> operator only takes two arguments (which is ironic for a language where everything is a vector, i.e. an array).</p>
<p>It seems I need to use the gmp package to get big number and rational number support. Here is how factorial can be written:</p>
<pre class="prettyprint"><code class=" hljs scss"><span class="hljs-function">library(gmp)</span>
<span class="hljs-function">prod(as.<span class="hljs-function">bigz(<span class="hljs-number">1</span>:<span class="hljs-number">50</span>)</span>)</span></code></pre>
<p>which outputs <br>
“30414093201713378043612608166064768844377641568960512000000000000”</p>
<p>That is not bad: still fairly short and, as you can see, vectorized operations are still trivial (<code>as.bigz(1:50)</code> creates a vector of 50 gmp numbers). </p>
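<p>As an aside (my own cross-check, not part of the post), any language with arbitrary-precision integers reproduces that 65-digit answer; for example, JavaScript’s native <code>BigInt</code>:</p>

```javascript
// Compute 50! exactly, using JavaScript's built-in BigInt type.
let fact = 1n;
for (let i = 2n; i <= 50n; i++) fact *= i;

console.log(fact.toString());
// 30414093201713378043612608166064768844377641568960512000000000000
```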
<p><code>as.bigq(2, 3)</code> is how you create a rational (⅔). Here is a quick example of creating vectors of rational numbers:</p>
<pre class="prettyprint"><code class=" hljs perl">as.big<span class="hljs-string">q(1, 1:50)</span></code></pre>
<p>which outputs:</p>
<p><code>Big Rational ('bigq') object of length 50:</code> <br>
<code>[1] 1 1/2 1/3 1/4 1/5 1/6 1/7 1/8 1/9 1/10 1/11 1/12 1/13 1/14 1/15</code> <br>
<code>[16] 1/16 1/17 1/18 1/19 1/20 1/21 1/22 1/23 1/24 1/25 1/26 1/27 1/28 1/29 1/30</code> <br>
<code>[31] 1/31 1/32 1/33 1/34 1/35 1/36 1/37 1/38 1/39 1/40 1/41 1/42 1/43 1/44 1/45</code> <br>
<code>[46] 1/46 1/47 1/48 1/49 1/50</code></p>
<p>Rational operations also work nicely:</p>
<pre class="prettyprint"><code class=" hljs perl">as.big<span class="hljs-string">q(1,1:50)</span> + as.big<span class="hljs-string">q(1,3)</span></code></pre>
<p>giving: <br>
<code>4/3 5/6 2/3 ... 49/138 50/141 17/48 52/147 53/150</code></p>
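<p>If you do not have gmp to hand, exact rationals are only a gcd away. This JavaScript sketch (a toy fraction type of my own, purely for illustration) reproduces the first and last few terms of <code>as.bigq(1,1:50) + as.bigq(1,3)</code>:</p>

```javascript
// Minimal exact-fraction arithmetic (BigInt numerator/denominator),
// mimicking the gmp example above.
const gcd = (a, b) => (b === 0n ? a : gcd(b, a % b));

function addFrac([n1, d1], [n2, d2]) {
  const n = n1 * d2 + n2 * d1, d = d1 * d2, g = gcd(n, d);
  return [n / g, d / g];                 // reduced to lowest terms
}

const sums = [];
for (let k = 1n; k <= 50n; k++) {
  const [n, d] = addFrac([1n, k], [1n, 3n]);
  sums.push(`${n}/${d}`);
}
console.log(sums.slice(0, 3).join(" "));   // 4/3 5/6 2/3
console.log(sums.slice(-3).join(" "));     // 17/48 52/147 53/150
```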
<p>Of course that is not as nice as having built-in rationals, but good enough for those relatively few times when you need them.</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-63560190677619786052015-10-13T14:20:00.001-07:002015-10-13T14:21:36.517-07:00Maths Tests With R<p>This <a href="http://www.bbc.co.uk/news/uk-scotland-34476699">maths problem</a> hit the news recently, about how far a crocodile should swim up the bank before going on land, in order to catch a zebra in the shortest possible time. It appears to be an A-level question, i.e. for 18-year old students.</p>
<p>The first two questions are simple arithmetic, more about understanding the question being asked. But the main question is obviously calculus: you are supposed to differentiate, and find out where the derivative is zero.</p>
<p>I happened to have R open at the time, and my calculus is a bit rusty on how to differentiate a square root. So, this is what I typed:</p>
<pre class="prettyprint"><code class=" hljs javascript">T = <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">(x)</span>{</span>
(<span class="hljs-number">5</span> * (<span class="hljs-number">36</span> + x^<span class="hljs-number">2</span>) ^ <span class="hljs-number">0.5</span>) + <span class="hljs-number">4</span> * (<span class="hljs-number">20</span>-x)
}</code></pre>
<p>(curly brackets were optional: it could all have been on one line.)</p>
<p>Then to answer the three questions:</p>
<pre class="prettyprint"><code class=" hljs r"><span class="hljs-literal">T</span>(<span class="hljs-number">20</span>)
<span class="hljs-literal">T</span>(<span class="hljs-number">0</span>)
optimize(<span class="hljs-literal">T</span>, lower=<span class="hljs-number">0</span>, upper=<span class="hljs-number">20</span>)</code></pre>
<p>I.e. if he swims the whole way it takes 10.44 seconds; if he cuts to land immediately it takes 11 seconds; and the third line tells me he should swim 8 metres before cutting to land, taking 9.8 seconds.</p>
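<p>For comparison, the same minimisation is easy to reproduce outside R. Here is an illustrative JavaScript sketch (my own code: a plain golden-section search standing in for R’s <code>optimize()</code>, and assuming, as in the original exam question, that T is measured in tenths of a second):</p>

```javascript
// T(x): the crocodile's total travel time (in tenths of a second, as the
// exam posed it) if it swims to a point x metres along the bank, then walks.
const T = x => 5 * Math.sqrt(36 + x * x) + 4 * (20 - x);

// Golden-section search over [lo, hi]: a simple stand-in for R's optimize().
function minimize(f, lo, hi, tol = 1e-9) {
  const phi = (Math.sqrt(5) - 1) / 2;
  let a = lo, b = hi;
  while (b - a > tol) {
    const c = b - phi * (b - a), d = a + phi * (b - a);
    if (f(c) < f(d)) b = d; else a = c;  // keep the half containing the minimum
  }
  return (a + b) / 2;
}

const xBest = minimize(T, 0, 20);
console.log(xBest.toFixed(2), T(xBest).toFixed(2));  // 8.00 98.00
```

<p>In these units T(20) ≈ 104.4 and T(0) = 110, matching the 10.44 s and 11 s quoted above.</p>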
<p>Or, if you want to see how I should have solved it, and be reminded how to do a tricky differentiation, go to <a href="https://www.youtube.com/watch?v=xko48OoTAQU">https://www.youtube.com/watch?v=xko48OoTAQU</a> and watch from 5:00 to about 10:00. For comparison, it took me less than 1 minute to write the function and get the solutions. R itself ran instantly, of course.</p>
<p>As a data scientist, the important thing here is that I use the same techniques when things get messy. If you show me enough observations of crocodiles catching zebras, I can give you an estimated function that also takes into account the speed of the flowing water, the wind speed, the age of the zebra, the water temperature, the air temperature, the weight of the crocodile, and when he last ate!</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-46763655783230791002015-09-15T13:35:00.001-07:002016-03-29T06:19:40.295-07:00Markdown Editors (for Linux or cross-platform)I’ve been using markdown more and more, and use <a href="http://pandoc.org/">pandoc</a> to make a PDF from it. But it often comes out differently to how I expect, so I have been looking for an editor with live preview.<br />
<br />
(This article has been updated, end-March 2016, to add Atom; and I've checked for any improvements in StackEdit, RStudio, NetBeans 8.1 and Remarkable. I've also added spell-checking as a requirement.)<br />
<br />
Quick summary: all of them are (fatally) flawed.<br />
<br />
Here is a quick review of each; well, more a summary of the flaws of each. (When I say “wrong” below, I am treating pandoc, with default settings, as correct.)<br />
<br />
<a href="https://www.blogger.com/stackedit.io">stackedit.io</a>: Online. That is a fatal flaw when looking for an offline editor! It also gets the 1-2-3 case wrong (see below). Not open source. I do like how it exports to Blogger, though. (I use it for writing blog posts.) Has same spell-checking as your browser (which is good). Additional Fatal Flaw: your data is stored in the browser, so almost impossible to backup separately; I've just discovered my recent clearing of all cookies has destroyed all my articles.<br />
<br />
<a href="http://rmarkdown.rstudio.com/">RStudio</a>: It supports its own .rmd format (which allows embedding live R code inside markdown), but can also be a normal Markdown editor. But no live preview, so you need to keep clicking “Preview HTML” to see what it looks like (though it does have syntax hilighting as you type). It handles underlines wrongly (see below). No spell-checking.<br />
<br />
<a href="http://remarkableapp.github.io/linux.html">Remarkable</a>: Gets the 1-2-3 case wrong. Open source and looks nice. But it feels black-box-ish. E.g. I don’t know how the code syntax hilighting works, and I don’t know if writing <code>{php}</code> or <code>{r}</code> is being listened to. (It highlighted a short PHP code snippet, with or without a hint, but not R code.) Another fatal flaw: it resets the preview window to the top every time you add a new line, making it useless for a document longer than one screen. It is also very slow - a distinct sluggishness as you type.<br />
<br />
<a href="https://github.com/voldyman/MarkMyWords">Mark My Words</a>: Gets both the 1-2-3 and underline cases correct in the preview window, but the underline case is wrong in the syntax highlighting in the editor window! The icons at the top are a bit confusing.<br />
<br />
NetBeans, with the <a href="http://plugins.netbeans.org/plugin/50964/markdown-support">“Markdown Support” plugin</a>. 8.0 had an unusable preview window, but as of NetBeans 8.1 that is better; however the editor and preview windows scroll completely independently, rather than staying in sync. No word-wrap in the editor window. On the plus side it gets the 1-2-3 and underline cases correct. (NB. NetBeans also has most of the problems I point out with Atom, below: in fact they are very similar.)<br />
<br />
<a href="http://pad.haroopress.com/">Haroopad</a>: This makes me nervous, as it does not appear to be open source, and is a 40MB download. It gets the 1-2-3 case wrong. It also doesn’t do syntax highlighting (but that is not an essential feature for me). No spell-checking. No new releases in the past six months, so this may be a dead/dying project.<br />
<br />
<a href="https://atom.io/" target="_blank">Atom</a> (<a href="https://github.com/atom/markdown-preview" target="_blank">built-in plugin</a>): Currently (March 2016) this is the number one choice at a comparison of <a href="http://www.slant.co/topics/2134/~markdown-editors-for-linux" target="_blank">Linux Markdown Editors</a>, so I just installed it. I think it is one to watch, because Atom is actively developed and with some more development the markdown support could become the best of the bunch. It handles 1-2-3 and underline cases correctly. There is live preview, but sadly the two panes are unconnected - when you scroll in one, the other just sits there, with no way to sync them. Also they do not agree what is correct markdown: the left window goes all weird with "*.txt", whereas the preview window handles it just fine. It underlines wrong spellings, but right-click does not suggest the correct spellings. It is more a programmer's editor than a writing tool, e.g. ctrl-b with a word highlighted does not make it bold. No print/export options.<br />
<br />
<br />
My choice? Initially I went with Remarkable, as the best of the bunch (it beats Haroopad by being open source, and much smaller), but I hadn't discovered the one-screenful limit at that point. Go with stackedit.io if working in a browser is okay for you. <b>March 2016 Update:</b> I've been using Haroopad for the past 6 months, but the lack of spell-checking has become an irritation. Atom and NetBeans have very similar pros and cons; of the two I prefer Atom. Not sure if it is quite good enough yet to make me switch from Haroopad, though...<br />
<br />
<br />
(Your suggestion? Let me know in the comments.)<br />
<hr />
<b>The 1-2-3 problem.</b> When I type:<br />
<pre class="prettyprint"><code class=" hljs ">1
2
3</code></pre>
<br />
(i.e. 1, 2 and 3 each on their own line, with no blank lines between them)<br />
<br />
I should see “1 2 3”. A blank line is needed to start a new paragraph. It is nice if it shows the line break, but no good if I send that code to pandoc and all my neat formatting is lost! [BUT, there is a flag: pandoc’s <code>hard_line_breaks</code> extension (e.g. <code>pandoc -f markdown+hard_line_breaks</code>) treats every newline within a paragraph as a hard line break, which is how I’d rather it worked.]<br />
<b>The underline problem.</b> When I type:<br />
<blockquote>
Then you should open my_special_file.txt</blockquote>
It should not treat those underlines as italics or bold formatting. That formatting only applies with preceding whitespace:<br />
<blockquote>
This word is in _italics_ this one is in __bold__</blockquote>
That appears like this:<br />
<blockquote>
This word is in <i>italics</i> this one is in <b>bold</b></blockquote>
Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com3tag:blogger.com,1999:blog-1625401331703494000.post-52644675483670448032015-09-03T10:49:00.001-07:002015-09-03T10:50:45.431-07:00Format Japanese date with kanji day-of-weekIn Japanese, there are single kanji for each day of the week. <br />
<pre class="prettyprint"><code class=" hljs cs"><span class="hljs-keyword">var</span> days = [<span class="hljs-string">'日'</span>,<span class="hljs-string">'月'</span>,<span class="hljs-string">'火'</span>,<span class="hljs-string">'水'</span>,<span class="hljs-string">'木'</span>,<span class="hljs-string">'金'</span>,<span class="hljs-string">'土'</span>];</code></pre>
(If you want to mutter them under your breath, at work, to impress colleagues, nichi-getsu-ka-sui-moku-kin-do.)<br />
In JavaScript, to put them in a date use <code>days[d.getDay()]</code> (where <code>d</code> is a Date object).<br />
I use <a href="http://sugarjs.com/dates">sugar.js</a>, which adds a <code>format()</code> function (amongst loads of other useful stuff) to the <code>Date</code> class; I now extend it further with this:<br />
<pre class="prettyprint"><code class=" hljs javascript"><span class="hljs-built_in">Date</span>.prototype.format_ja_MMDDK = <span class="hljs-function"><span class="hljs-keyword">function</span><span class="hljs-params">()</span>{</span>
<span class="hljs-keyword">var</span> days = [<span class="hljs-string">'日'</span>,<span class="hljs-string">'月'</span>,<span class="hljs-string">'火'</span>,<span class="hljs-string">'水'</span>,<span class="hljs-string">'木'</span>,<span class="hljs-string">'金'</span>,<span class="hljs-string">'土'</span>];
<span class="hljs-keyword">return</span> <span class="hljs-keyword">this</span>.format(<span class="hljs-string">"{MM}月{dd}日"</span>) + <span class="hljs-string">"("</span> + days[<span class="hljs-keyword">this</span>.getDay()] + <span class="hljs-string">")"</span>;
};</code></pre>
(If you hate underscores, feel free to use <code>formatJaMMDDK()</code> or anything else you like, for that matter.)<br />
Here is one way you might use it (jQuery-syntax):<br />
<pre class="prettyprint"><code class=" hljs javascript">$(<span class="hljs-string">'.todaysDate'</span>).html(<span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>().format_ja_MMDDK());</code></pre>
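If you are not using sugar.js, the same formatting is easy in plain JavaScript. This is a sketch: <code>formatJaMMDD</code> and the <code>pad</code> helper are my own names, not part of any library.

```javascript
// Plain-JS equivalent, no sugar.js needed. formatJaMMDD and pad are
// hypothetical names, not from any library.
function formatJaMMDD(d) {
  var days = ['日','月','火','水','木','金','土'];
  function pad(n) { return (n < 10 ? '0' : '') + n; }
  // getMonth() is 0-based, so add 1 for display.
  return pad(d.getMonth() + 1) + '月' + pad(d.getDate()) + '日'
       + '(' + days[d.getDay()] + ')';
}

// new Date(2015, 8, 3) is 3rd September 2015 (a Thursday):
formatJaMMDD(new Date(2015, 8, 3));   // → "09月03日(木)"
```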
Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-79109273646596418012015-06-02T11:08:00.001-07:002015-06-02T11:16:12.467-07:00Easylogging++: how to get one log file per day<p>I <a href="http://darrendev.blogspot.com/2014/08/c-logging-easylogging.html">introduced EasyLogging++ before</a>. This article will build on that to show how to rotate the logs daily.</p>
<p>In a nutshell, assuming your log filename already has date specifiers in it, all you have to do is run these two lines, at midnight each day:</p>
<pre><code>auto L = el::Loggers::getLogger("default");
L->reconfigure();
</code></pre>
<p>If you have multiple loggers, repeat that for all of them.</p>
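<p>How do you know when midnight is? Here is one way to compute the wake-up time for a <code>sleep_until()</code> loop, using only the standard library. (A sketch: <code>nextMidnight()</code> is my own helper, not part of EasyLogging++.)</p>

```cpp
#include <chrono>
#include <ctime>

// Returns the next local midnight, suitable for
// std::this_thread::sleep_until(). (nextMidnight is a hypothetical
// helper, not part of EasyLogging++.)
std::chrono::system_clock::time_point nextMidnight() {
    std::time_t now = std::time(nullptr);
    std::tm local = *std::localtime(&now);
    local.tm_mday += 1;      // tomorrow... (mktime normalizes the overflow)
    local.tm_hour = 0;       // ...at 00:00:00
    local.tm_min = 0;
    local.tm_sec = 0;
    local.tm_isdst = -1;     // let mktime work out daylight saving
    return std::chrono::system_clock::from_time_t(std::mktime(&local));
}
```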
<p>I recommend using a config file to configure EasyLogging++; but if you are configuring it completely in your code, and your FILENAME entry does not include date specifiers, you can instead change the filename at any time with this:</p>
<pre><code>el::Loggers::reconfigureAllLoggers(
    el::ConfigurationType::Filename,
"/path/to/logs/my-new-filename.log"
);
</code></pre>
<p>But, going back to the first approach, here is a complete program to show creating a new log file every 20 seconds (!). First create a logging.conf file with these contents:</p>
<pre><code>-- default
* GLOBAL:
FORMAT = "%datetime{%Y-%M-%d %H:%m:%s.%g},%level,%thread,%msg"
Milliseconds_Width = 4
TO_FILE = true
FILENAME = "info.%datetime{%Y%M%d_%H%m%s}.log"
LOG_FLUSH_THRESHOLD = 5
</code></pre>
<p>(The <code>FORMAT</code>, and <code>Milliseconds_Width</code> lines are optional, but useful for checking it worked.)</p>
<p>Here is the full code:</p>
<pre><code>#define _ELPP_THREAD_SAFE
#define _ELPP_NO_DEFAULT_LOG_FILE
#include "easylogging++.h"
#include &lt;chrono&gt;
#include &lt;thread&gt;
_INITIALIZE_EASYLOGGINGPP
namespace sc = std::chrono;
int main(int,char**){
el::Loggers::configureFromGlobal("logging.conf");
LOG(INFO)<<"The program has started!";
std::thread logRotatorThread([](){
const sc::seconds wakeUpDelta = sc::seconds(20);
auto nextWakeUp = sc::system_clock::now() + wakeUpDelta;
while(true){
std::this_thread::sleep_until(nextWakeUp);
nextWakeUp += wakeUpDelta;
LOG(INFO) << "About to rotate log file!";
auto L = el::Loggers::getLogger("default");
if(L == nullptr)LOG(ERROR)<<"Oops, it is not called default!";
else L->reconfigure();
}
});
logRotatorThread.detach();
//Main thread
for(int n=0; n < 1000; ++n){
LOG(TRACE) << n;
std::this_thread::sleep_for(sc::milliseconds(100));
}
LOG(INFO) << "Shutting down.";
return 0;
}
</code></pre>
<p>I compiled it with this command:</p>
<pre><code>g++ -std=c++11 -Wall -Werror logtest.cpp -lpthread -o logtest
</code></pre>
<p>and then ran it with this command:</p>
<pre><code>./logtest
</code></pre>
<p>It should be easy to follow. I set up a dedicated thread to call <code>reconfigure()</code> every 20 seconds, and then the main thread logs a counter about 10 times/second.</p>
<p>You’ll end up with about 5 log files, and you can examine them to see that no log commands were lost.</p>
<p>If I was coding for a mission-critical application, where missing even a single log line would be considered <em>Very Bad</em>, I might set up a mutex and a lock to make sure the main thread is not active when the call to <code>reconfigure()</code> happens. I don’t know for sure if that is needed, or if it is guaranteed to be safe. If you know for sure one way or the other, please leave a comment! </p>
<p>But, for a once/day log rotation, in most applications this is a small enough risk that I would not want the overhead of the extra mutex, and I would go with the code shown above.</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com1tag:blogger.com,1999:blog-1625401331703494000.post-17617691839216177252015-03-19T06:52:00.001-07:002015-03-19T06:52:45.420-07:00Add a forwarding alias with google mail<p>The goal was to create a special email alias to forward to a 3rd party. E.g. accountant@example.com would be forwarded to joe.bloggs@my.accountant.com This is easy when you host your own mail server (edit /etc/aliases) or even when your domain uses cPanel (find the forwarding icon under mail config). But my company email is hosted at google…</p>
<p>Well, here are easy 22-step instructions for adding a forwarding address to a company email account hosted by google. You will need a handkerchief, strong resolve and the co-operation of each of the people receiving the forwarded email.</p>
<ol>
<li>sign in to gmail</li>
<li>Under the settings cog icon, choose “manage this domain” (do not choose “settings”!)</li>
<li>Click users from the left menu</li>
<li>Click the user </li>
<li>The “Account” section is a button, click it to open it up.</li>
<li>Scroll down to Aliases, and click “Add an alias”.</li>
<li>Type in the alias name. Don’t be put off by the big “CANCEL” button, and the lack of a submit button. Just go down to the bottom and click SAVE CHANGES.</li>
<li>Now, deep breath, go back to gmail, and this time you do want to choose “settings” from the settings cog.</li>
<li>Choose filters from the blue menu along the top.</li>
<li>Choose “Create a new filter”</li>
<li>Put the new alias email address in the “To:” block. (Careful: this is not the “create new filter” page, but is a search page!!)</li>
<li>Click the “create filter with this search” link in the bottom right.</li>
<li>Click “add a forwarding address”.</li>
<li>Do it again.</li>
<li>Input the address.</li>
<li>Check your email. Click the link in the email you get. If it is a 3rd party, wait for them to click it.</li>
<li>Sob into your handkerchief. Then console yourself, as you are nearly there.</li>
<li>Repeat 13 to 16 if more than one address.</li>
<li>Now go back to do steps 10, 11 and 12 again. This time don’t click “add forwarding address”, but choose your target from the dropdown box.</li>
<li>Click save, and I think you are done. Do a test, and see if it arrives! If it does not, give it more time. For me the email arrived in my main email immediately (when it shouldn’t have at all), but then turned up in the forwardee inbox 19 seconds later.</li>
<li>I think if you want to forward it to more than one person you need to set up a whole new forwarding filter for each forwardee.</li>
<li>Resolve never to moan about how clunky cPanel is, ever again.</li>
</ol>
<p>(To be fair, I suspect steps 2 to 7 are not required, which may be why I am receiving a copy. And, of course, 10, 11 and 12 were a mistake the first time round. But the above took so long that I don’t have the time to do any more experimentation today.)</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-23889363720463544912015-02-22T07:03:00.001-08:002015-02-22T07:06:26.568-08:00Gnucash Timezone Problem Workaround<p>GnuCash is accounting software. It is usable, which is the most important compliment software can get, but it has one annoying bug, first reported in 2002. Let’s assume it will never be fixed, and just work around it.</p>
<p>The bug is that dates are stored in the xml file as timestamps. E.g. if you say a transaction happened on 2014-06-21, then it gets stored as “2014-06-21 00:00:00”. That is the <em>bug</em>. The <em>problem</em> is that it also stores the user's current timezone. So, if I type that in when in the Asia/Tokyo timezone it will actually store it as “2014-06-21 00:00:00 +0900”. If I then open the gnucash file on a server in the Europe/London timezone, and re-save, the file now stores this: “2014-06-20 16:00:00 +0100”. It has moved it to BST timezone, but in the UI all you now see is that the transaction is on the wrong day! (That was a real problem, but if you think that is exotic, the original bug report was filed by someone who merely moved from one state in the U.S. to another state!)</p>
<p>Anyway, my proposed workaround is to always run gnucash in the UTC timezone. On Linux you do this from commandline by starting it like this:</p>
<pre class="prettyprint"><code class=" hljs fix"><span class="hljs-attribute">TZ</span>=<span class="hljs-string">UTC gnucash</span></code></pre>
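<p>A small wrapper script saves you from having to remember to type that every time. This is a sketch: the <code>gnucash-utc</code> name is my own, and the <code>date</code> line is only there to demonstrate that the <code>TZ</code> override really takes effect.</p>

```shell
# Create a wrapper so every launch runs in UTC ("gnucash-utc" is my name,
# not an official script). Put it somewhere on your PATH, e.g. /usr/local/bin.
cat > gnucash-utc <<'EOF'
#!/bin/sh
TZ=UTC exec gnucash "$@"
EOF
chmod +x gnucash-utc

# The principle is easy to verify with date, which honours TZ the same way:
TZ=UTC date +%Z    # prints UTC, whatever the system timezone is
```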
<p>In a graphical desktop environment, you might have associated your *.gnucash files, so double-clicking them opens them in gnucash. The following instructions work for xfce (thunar file manager), but I suspect gnome is exactly the same.</p>
<ol>
<li>Right click any gnucash file, and choose properties</li>
<li>Choose open with, and at the bottom it says “Other application…”</li>
<li>Give <code>bash -c "TZ=UTC gnucash %f"</code></li>
<li>Close it, and test it by double-clicking any gnucash file.</li>
</ol>
<p>(I confirmed it had worked by looking at the raw XML.)</p>
<p>What to do if you have someone else working on your files who does not want to do this? Or is on Windows (where I don’t know a similar workaround)? Well, the only solution I know is that they temporarily move their whole machine to UTC, open, edit and save your gnucash files, then restore their machine back to their real timezone. (Remember that UTC is different from Europe/London.)</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-12878203309090713322015-02-05T04:41:00.001-08:002015-02-05T04:43:17.613-08:00PHP Sending mails twice?? (doing stuff twice)<p>Just had a mammoth troubleshooting session, because a very simple PHP script to send an email kept sending them twice. I’d only just configured postfix to send email, so I kept looking for problems there. I kept staring at the PHP code, and could see no problem, but it looked more and more like PHP was calling postfix twice. Commenting php.ini settings in and out made no difference. This was commandline PHP, so nothing to do with browser reloads, or anything like that.</p>
<p>Then I had the brainwave to append a random number to the bottom of the body text. Different numbers; in fact, not just that, but the 2nd email got both numbers! So it is <em>definitely</em> my PHP script. But I still couldn’t see it.</p>
<p>Stripped down, so the problem is more obvious, it looked like this:</p>
<pre class="prettyprint"><code class=" hljs php"><span class="hljs-variable">$bodyText</span> = <span class="hljs-string">"Whatever"</span>;
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Test</span>{</span>
<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">test</span><span class="hljs-params">()</span>{</span>
<span class="hljs-variable">$bodyText</span>.=<span class="hljs-string">"RANDOM="</span>.mt_rand(<span class="hljs-number">1</span>,<span class="hljs-number">1000</span>);
mail(<span class="hljs-string">"me@example.com"</span>, <span class="hljs-string">"Test"</span>, <span class="hljs-variable">$bodyText</span>);
}
}
<span class="hljs-variable">$R</span> = <span class="hljs-keyword">new</span> Test;
<span class="hljs-variable">$R</span>->test();</code></pre>
<p>I’ve been spending too much time jumping between languages. And I’d also got used to PHP constructors being called <code>__construct()</code> and forgot that it still offers backwards compatibility for using the class name. Yep, that’s right, PHP function (and class) names are case-insensitive, and so <code>test()</code> was being treated as the constructor of <code>class Test</code>. So one mail was being sent from the constructor, the second from my explicit function call. Grrrr….</p>
<p>(The above is also a possible explanation for problems like “PHP calls web service twice”, or “PHP has double log entries” or “PHP does something twice”!)</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-48435160667507365022014-10-19T09:32:00.001-07:002014-10-19T09:32:41.645-07:00Rackspace vs. AWS Oct 2014<p>I’ve previously blogged about Rackspace vs. AWS. For small servers, Rackspace has generally come out on top. But this year the pendulum has swung the other way.</p>
<p>First, Amazon released their t2.micro instance. The t1.micro was really just for learning the AWS API on, but the t2.micro is the kind of instance you can do real work on. (In fact I’ve been running the test version of a DB-backed website on one, for a month or two now, with no issues.)</p>
<p>A t2.micro server is <span>$</span>11.35/month in Ireland, (<span>$</span>15.64/month in U.S. East or Tokyo - I didn’t expect that! Europe cheaper than the U.S.!). This includes a 20GB magnetic disk, at about <span>$</span>1/month. (SSD is <span>$</span>2/month). <span>$</span>11.35 is £7.10/month (at today’s <span>$</span>1.60/£).</p>
<p>The second change this year is that Rackspace have introduced a compulsory service level fee, of £35/month. This is per data centre. (They might claim it is pay as you go, but small customers won’t do enough go-ing, so will always simply be paying £35/month.)</p>
<p>The third change is Rackspace have done away with their low-end servers. You used to be able to get a “next generation” 512MB server for 2p/hour (<a href="http://www.rackspace.co.uk/cloud/servers/existing-customer-pricing">http://www.rackspace.co.uk/cloud/servers/existing-customer-pricing</a>), and if you are an existing customer you still can. That works out at £14.40/month. The new minimum server is £52.52/month (including the £35 service level charge). (Of course, the new minimum server is higher spec., but as all I needed was the previous minimum spec, that does not matter.)</p>
<p>By the way, existing customers get a different pricing system, and don’t need to pay the service level charge. However, the per-hour prices are higher, e.g. for “Performance 1” (1GB RAM, 20GB SSD) it is £21.60/month, compared to £17.52/month (+£35) for new customers. (I guess this is related to their point - the service level was hidden in the prices, and now they’re just breaking it out… well, if the price was exactly the same, without the minimum fee, I’d have no problem with that.)</p>
<p>In Rackspace’s favour is their locations. They have a London data centre, the closest Amazon has is Ireland. They have a Hong Kong data centre, the closest Amazon can manage is Tokyo or Singapore.</p>
<p>In AWS’s favour, you have one account, globally, whereas with Rackspace it is one account per data centre.</p>
<p>BTW, one way they both make life difficult is by hiding their calculators. Here is AWS: <br>
<a href="http://calculator.s3.amazonaws.com/index.html">http://calculator.s3.amazonaws.com/index.html</a> <br>
And here is Rackspace UK’s: <br>
<a href="http://www.rackspace.co.uk/calculator">http://www.rackspace.co.uk/calculator</a></p>
<p>So, to sum up, if all you need is minimum spec, Rackspace are now either £14.40 or £52.52/month (depending on if you are an existing customer at your desired data centre or not) while Amazon is now £7.10/month. </p>
<p>(Be aware that the prices are always in motion; but if you think I’ve misunderstood something above, please let me know in the comments.)</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-80601167069222478872014-10-13T09:58:00.001-07:002014-10-13T10:00:24.167-07:00Fixing SVG font glyphs by hand<p>In a web font, I had one glyph where the background and foreground was reversed. Being Mr.Pragmatic, I just went with it, setting foreground and background colours appropriately. But it is an RWD site, and at certain scalings a 1 pixel white line was appearing. Here is how I fixed it.</p>
<p>I opened it in a text editor. What I had was:</p>
<pre class="prettyprint"><code class=" hljs r">&lt;svg ... x="0px" y="0px" width="71px" height="67px" viewBox="0 0 71 67" enable-background="new 0 0 71 67"&gt;
  &lt;rect fill="#000000" width="71" height="67"/&gt;
  &lt;g&gt;
    &lt;path fill="#FFFFFF" d="M49.562,...,31.799z"/&gt;
    &lt;path fill="#FFFFFF" d="M46,...z"/&gt;
  &lt;/g&gt;
&lt;/svg&gt;</code></pre>
<p>The fix was as simple as deleting that <code>&lt;rect&gt;</code> line, and then setting the fill colour of the two <code>&lt;path&gt;</code> tags to be <code>#000000</code>. </p>
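<p>For reference, after the edit the glyph markup is just the two (now black) paths, with the path data elided here exactly as above:</p>

```xml
<svg ... x="0px" y="0px" width="71px" height="67px" viewBox="0 0 71 67" enable-background="new 0 0 71 67">
  <g>
    <path fill="#000000" d="M49.562,...,31.799z"/>
    <path fill="#000000" d="M46,...z"/>
  </g>
</svg>
```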
<p>When I made the webfont (I use the excellent icomoon.io site to do this) it looked fine. But it looked smaller than the other glyphs in the font. I can add padding easily with CSS, but removing it can be more work.</p>
<p>Conclusion: I don’t know how to do this. I could play around with the <code>viewBox="..."</code> in the <code><svg></code> tag, and change the appearance in Inkscape, but it made no difference in icomoon. Similarly, I could select the whole path, scale it in Inkscape, but still no change in icomoon. So, having done the important fix, I gave up on this one (and will stick to controlling the size and padding from CSS.)</p>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-12917463871725310912014-09-24T01:39:00.001-07:002014-09-24T01:41:44.520-07:00Annoying keyboard shortcuts in Xfce<p>Quickie for mint16/17 using Xfce. You may have found the ctrl-Fn keys have been mapped to change the workspace you are on? This behaviour is highly annoying as applications often define ctrl-Fn to do something.</p>
<p>Do <em>not</em> try to fix it from the graphical configuration window. It does not work.</p>
<p>Instead, <em>as root</em>, open this file in a text editor:</p>
<p><code>/etc/xdg/xdg-default/xfce4/xfconf/xfce-perchannel-xml/xfce4-keyboard-shortcuts.xml</code></p>
<p>(Because of links there are about three or four files pointing to this file; it doesn’t matter which one you edit.)</p>
<p>Look for lines like:</p>
<pre><code>&lt;property name="&lt;Control&gt;F5" type="string" value="..."/&gt;
</code></pre>
<p>Then delete all those lines; there are 12 in total, from F1 to F12. (In my case they were not in a block, so you had to hunt around to catch all twelve.)</p>
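<p>If hunting down the twelve lines by hand gets tedious, sed can do it in one pass. One caveat: in the raw XML file the angle brackets around <code>Control</code> are stored as XML entities, which is what the pattern below matches. This is a sketch that works on a scratch copy (the two demo lines are made up by me); apply the same sed line, as root, to the real file once you are happy with it — the <code>.bak</code> suffix keeps a backup.</p>

```shell
# Sketch: demonstrate the deletion on a scratch copy first. The two lines
# below are invented stand-ins for the real file's contents; note the raw
# XML stores "<" and ">" as entities, so sed must match those.
cat > scratch.xml <<'EOF'
<property name="&lt;Control&gt;F5" type="string" value="workspace_5_key"/>
<property name="&lt;Alt&gt;Tab" type="string" value="cycle_windows_key"/>
EOF
sed -i.bak '/&lt;Control&gt;F[0-9]/d' scratch.xml
cat scratch.xml    # only the Alt-Tab line survives
```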
<p>Then you need to logout of your desktop, and log back in again for the change to take effect.</p>
<blockquote>
<p>Written with <a href="https://stackedit.io/">StackEdit</a>.</p>
</blockquote>Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0tag:blogger.com,1999:blog-1625401331703494000.post-65831773585728848332014-08-27T15:08:00.001-07:002015-06-04T09:03:07.770-07:00C++ Logging: EasyLogging++<h2 id="the-basics-of-easylogging">
The Basics Of EasyLogging++</h2>
I wanted a <i>basic</i> logging library for C++. The first one I looked at required I first install Java to be able to compile it. Eh? For a modern C++ library? So then I added “header-only” to my key requirements. And shortly after that I sadly started to resign myself to writing my own. Then, luckily, I stumbled across <a href="https://github.com/easylogging/easyloggingpp">EasyLogging++</a>.<br />
<br />
I retrofitted a couple of projects today, stripping out the ad hoc logging, and replacing it with this. The documentation is good, but sometimes gets lost in the details, and I felt a simpler tutorial was needed. This is my attempt at it:<br />
<br />
Here is the Hello World example:<br />
<pre><code>#define _ELPP_THREAD_SAFE
#include "easylogging++.h"
_INITIALIZE_EASYLOGGINGPP
int main(int,char**){
    LOG(INFO)&lt;&lt;"Hello World!";
}</code></pre>
I’ve decided to request it be thread-safe, right from this first example, because most C++11 apps use threads. Remove that line if you definitely have a single-threaded application. (Speaking of which, this library is C++11 only; but there is a link on their website to an earlier version that supports older C++).<br />
<br />
The above program prints “Hello World!” to stdout. But, obviously, you want to log to a file. And for a project of any worthwhile size you will end up with a logging configuration file anyway, so let’s add one now. Here is “logging.conf”:<br />
<pre><code>-- default
* GLOBAL:
    TO_FILE = true
    FILENAME = "info.%datetime{%Y%M%d}.log"
    TO_STANDARD_OUTPUT = false
* WARNING:
    TO_STANDARD_OUTPUT = true
* ERROR:
    TO_STANDARD_OUTPUT = true
* FATAL:
    TO_STANDARD_OUTPUT = true</code></pre>
Here I am saying I want to have one log file per day, using the YYYYMMDD datestamp in the log filename, and that it will store messages of all log levels. I’m also saying that I want TRACE and INFO messages to only go to the file, but WARNING, ERROR and FATAL to go to both the file and stdout. There may be more elegant ways to do that, but the above works.<br />
<br />
You use the config file by adding just one line at the start of <code>main()</code>:<br />
<pre><code>#define _ELPP_THREAD_SAFE
#include "easylogging++.h"
_INITIALIZE_EASYLOGGINGPP
int main(int,char**){
    el::Loggers::configureFromGlobal("logging.conf");
    LOG(INFO)&lt;&lt;"Hello World!";
}</code></pre>
That code will write “Hello World\n” to e.g. “info.20140827.log”<br />
<br />
And that is all you need to know; all your other questions will be answered by the documentation. Do please spend some time with the documentation as there is a lot of functionality in this library (e.g. Conditional logging, <a href="https://github.com/easylogging/easyloggingpp#occasional-logging">Occasional Logging</a>, log output of STL containers, log output for your own classes, <a href="https://github.com/easylogging/easyloggingpp#datetime-format-specifiers">datestamps to customizable sub-second accuracy</a>, run-time disabling of certain log levels, and even more.)<br />
<br />
One feature it does not have, that I wanted, is an asynchronous log queue. I.e. a thread grabs the lock just long enough to push a string on to a queue, with another dedicated thread doing the actual writing to disk. This makes sure your worker threads do not get caught up waiting for a lock because another thread is waiting for disk I/O to finish. However, another feature that EasyLogging++ does have lessens the impact of this: it only flushes to disk every N log messages. So effectively strings <i>are</i> being pushed to a queue, and it is only once every N times that a thread gets caught waiting for disk I/O to finish. N defaults to 256; I reduced it in my config file to 5, which gives a fair balance between thread wait and log latency (and the risk of losing log messages). This is not as good as an asynchronous log queue, but I can live with it.<br />
<blockquote>
Written with <a href="https://stackedit.io/">StackEdit</a>.</blockquote>
Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com8tag:blogger.com,1999:blog-1625401331703494000.post-48855496174073742472014-08-16T04:35:00.002-07:002014-08-16T04:35:58.123-07:00Using C++11 std::future to push data from producer to multiple consumers<br />This is an example of using C++11's std::future to move data from a data
producer to multiple consumer threads, in a very stable and thread-safe
way. And hopefully it is at least as efficient as the alternatives:<br /><br />
<a class="ot-anchor aaTEdf" href="http://stackoverflow.com/a/25339704/841830" rel="nofollow" target="_blank">http://stackoverflow.com/a/25339704/841830</a><br />
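The core idea can be sketched in a few lines: many consumers can each hold a copy of a <code>std::shared_future</code> and block until the producer fulfils the single <code>std::promise</code>. (A sketch only — <code>broadcastOnce</code> is my own name, and this is not necessarily the exact code in the linked answer.)<br />

```cpp
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper (broadcastOnce is my name): the producer publishes
// one value via a promise; each of n consumers blocks on its own copy of
// the shared_future and records what it received.
std::vector<int> broadcastOnce(int value, int nConsumers) {
    std::promise<int> p;
    std::shared_future<int> sf = p.get_future().share();
    std::vector<int> results(nConsumers);
    std::vector<std::thread> consumers;
    for (int i = 0; i < nConsumers; ++i)
        consumers.emplace_back([sf, &results, i]{
            results[i] = sf.get();   // every consumer sees the same value
        });
    p.set_value(value);              // the producer publishes exactly once
    for (auto& t : consumers) t.join();
    return results;
}
```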
<br />
Critiques of my approach (or of the alternatives) are very welcome.Anonymoushttp://www.blogger.com/profile/03151050273122725547noreply@blogger.com0