Wednesday, December 6, 2017

Word Embeddings, NLP and I18N in H2O

Word embeddings can be thought of as a dimension-reduction tool, needing a sequence of tokens to learn from. They are really that generic, but I’ve only ever heard of them used for languages; i.e. the sequences are sentences, the tokens are words (or compound words, or n-grams, or morphemes).
This blog post is for code I presented recently on how to use the H2O implementation of word embeddings, aka word2vec. The main thing being demonstrated is that they apply equally well for any language, but you may need some language-specific tokenization, and other data engineering, first.
Here is the preparation code, in R; bring in H2O, and define a couple of helper functions.

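library(h2o)
h2o.init(nthreads = -1)

# Plot each unique word at the position given by its 2-d vector
show <- function(w, v){
  x = as.data.frame(h2o.cbind(w, v))
  x = unique(x)
  plot(x[,2:3], pch=16, type="n")
  text(x[,2:3], x[,1], cex = 2.2)
}

# Use PCA to reduce the vectors to 2 dimensions, then plot them
reduceShow <- function(w,v){
  m <- h2o.prcomp(v, 1:ncol(v), k = 2, impute_missing=T)
  p <- h2o.predict(m, v)
  show(w, p)
}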
Then, I define an artificial corpus, and try with word embedding dimensions of 2, 4 and 9. For dimensions above 2, reduceShow() uses PCA to reduce the vectors to their first two principal components, and plots those.
eg1 = c(
  "I like to drive a blue car",
  "I like to drive a red car",
  "I like to drive a green car",
  "I like to drive a blue lorry",
  "I like to drive a yellow lorry",
  "I like to drive a brown lorry",
  "I like to drive a green lorry",
  "I like to drive a red Ferrari",
  "I like to drive a blue Mercedes"
  )

eg1.words <- h2o.tokenize(
  as.character(as.h2o(eg1)), "\\W+")

# h2o.tokenize() puts an NA after each sentence; drop them for plotting
eg1.wordsNoNA <- eg1.words[!is.na(eg1.words),]

eg1.wv <- h2o.word2vec(eg1.words,
                       min_word_freq = 1,
                       vec_size = 2)

eg1.vectors = h2o.transform(eg1.wv, eg1.wordsNoNA, "NONE")
show(eg1.wordsNoNA, eg1.vectors)

eg1.wv4 <- h2o.word2vec(eg1.words,
                        min_word_freq = 1,
                        vec_size = 4)

eg1.vectors4 = h2o.transform(eg1.wv4, eg1.wordsNoNA, "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors4)

eg1.wv9 <- h2o.word2vec(eg1.words,
                        min_word_freq = 1,
                        vec_size = 9,
                        epochs = 50 * 9)

eg1.vectors9 = h2o.transform(eg1.wv9, eg1.wordsNoNA, "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors9)
Those results are fairly poor, as we only have 9 sentences; we can do better by sampling those 9 sentences 1000 times. I.e. more data, even if it is exactly the same data, is better!
eg2 = sample(eg1, size = 1000, replace = T)

#(rest of code exactly the same, just changing eg1 to eg2)
What about Japanese? Here are the same 9 sentences (well, almost: “This is a …” each time, rather than “I like to drive a …”), but hand-tokenized in a realistic way (in particular の is a separate token). I’ve gone straight to having about 1000 sentences, as we know that helps:
  # これは青い車です。
  # これは赤い車です。
  # これは緑の車です。
  # これは青いトラックです。
  # これは黄色いトラックです。
  # これは茶色のトラックです。
  # これは緑のトラックです。
  # これは赤いフェラーリです。
  # これは青いメルセデスです。

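ja1 = c(  #Pre-tokenized
  "これ","は","青い","車","です","。",NA,
  "これ","は","赤い","車","です","。",NA,
  "これ","は","緑","の","車","です","。",NA,  # ***
  "これ","は","青い","トラック","です","。",NA,
  "これ","は","黄色い","トラック","です","。",NA,
  "これ","は","茶色","の","トラック","です","。",NA,  # ***
  "これ","は","緑","の","トラック","です","。",NA,  # ***
  "これ","は","赤い","フェラーリ","です","。",NA,
  "これ","は","青い","メルセデス","です","。",NA
  )
ja2 = rep(ja1, times = 120)
# length(ja2) is 7920 entries (including the NA sentence separators); representing 1080 sentences.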
The code to try it is exactly as with the English, except we just import ja2 and don’t run it through h2o.tokenize().
ja2.words = as.character(as.h2o(ja2))
head(ja2.words, 12)

ja2.wordsNoNA <- ja2.words[!is.na(ja2.words),]

ja2.wv2 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 2,
                        epochs = 20)
ja2.vectors2 = h2o.transform(ja2.wv2, ja2.wordsNoNA, "NONE")
show(ja2.wordsNoNA, ja2.vectors2)

ja2.wv4 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 4,
                        epochs = 20)
ja2.vectors4 = h2o.transform(ja2.wv4, ja2.wordsNoNA, "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors4)

ja2.wv9 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 9,
                        epochs = 50)
ja2.vectors9 = h2o.transform(ja2.wv9, ja2.wordsNoNA, "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors9)
In Python? I’ll just quickly show the key changes (there are full word embedding examples in Python, for H2O, floating around).
To bring the data in and tokenize it:
eg1 = [...]
sentences = h2o.H2OFrame(eg1).ascharacter()
eg1_words = sentences.tokenize("\\W+")
Then to make the embeddings:
from h2o.estimators.word2vec import H2OWord2vecEstimator
eg1_wv = H2OWord2vecEstimator(vec_size = 2, min_word_freq = 1)
eg1_wv.train(training_frame = eg1_words)
And to get the vectors for visualization:
eg1_vectors = eg1_wv.transform(eg1_words_no_NA, "NONE")
(The Python code is untested as I type this - if I have a typo, let me know in the comments or at darren at dcook dot org, and I will fix it.)

Friday, March 31, 2017

The Seven Day A Year Bug

I’ll cut straight to the chase: when you use d.setMonth(m - 1) in JavaScript, always set the optional second parameter.

What’s that, you didn’t know there was one? Neither did I until earlier today. It allows you to set the date. Cute, I thought at the time, minor time-saver, but hardly worth complicating an API for.

Ooh, how wrong I was. Let me take you back to when it happened. Friday, March 31st….

After a long coding session, I did a check-in. And then ran all unit tests. That is strange, three failing, but in code I hadn’t touched all day. I peer at the code, but it looks correct - it was to do with dates, specifically months, and I was correctly subtracting 1.

Aside: JavaScript dates inherit C’s approach of counting months from 0. In the first draft of this blog post I used a more judgemental phrase than “approach”. But to be fair, it was a 1970s design decision, and the world was different back then. Google “1970s men fashion”.

So, back to the test failures. I start using “git checkout xxxx” to go back to earlier versions, to see exactly when it broke. Running all tests every time. I know something fishy is going on, by the time I’ve gone back 10 days and the tests still fail. I am fairly sure I ran all tests yesterday, and I am certain it hasn’t been 10 days.

Timezones?! Unlikely, but we did put the clocks forward last weekend. But a quick test refutes that. (TZ=XXX mocha . will run your unit tests in timezone XXX.)

So, out of ideas, I litter the failing code with console.log lines, to find out what is going on.

Here is what is happening. I initialize a Date object to the current date (to set the current year), then call setMonth(). I don’t use the day, so don’t explicitly set it. I was calling setMonth(8), expecting to see “September”, but the unit test was being given “October”. Where it gets interesting is that the default date today is March 31st. In other words, when I set month to September the date object becomes “September 31st”, which isn’t allowed. So it automatically changes it to October 1st.
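
The rollover is easy to reproduce. What follows is a minimal sketch, not the code from the failing tests: it assumes "today" is pinned to March 31st 2017 (at noon, local time) rather than taken from the real clock, so that it behaves the same whenever it is run.

```javascript
// Pin "today" to Friday, March 31st 2017 (noon, local time) instead of
// using new Date(), so the result is the same whenever this is run.
const d = new Date(2017, 2, 31, 12);  // Months count from 0: 2 is March

d.setMonth(8);  // Ask for September, leaving the day of the month at 31

// September has only 30 days, so the date rolls over into October.
console.log(d.getMonth());  // 9, i.e. October
console.log(d.getDate());   // 1
```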

You hopefully see where the title of this piece comes from now? If I had been setting a date in February I would have discovered the bug two days earlier, and if my unit test had chosen October instead of September, the bug would never have been detected. If I’d thought, “ah, I’ll run them Monday”, the bug would not have been discovered until someone used the code in production on May 31st. I’d have processed their bug report on June 1st and told them, “can’t reproduce it”. And they’d have gone, “Oh, you’re right, neither can I now.”
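
To put numbers on that, here is a rough sketch that walks through every day of a year and counts how often a bare setMonth() call would land in the wrong month. The helper name countBadDays is made up for this illustration, and each day is taken at noon to stay clear of any daylight-saving changes around midnight.

```javascript
// Count the days of a given year on which "new Date()" followed by
// setMonth(month) would end up in the wrong month.
// (countBadDays is a hypothetical helper, named only for illustration.)
function countBadDays(year, month) {
  let bad = 0;
  // Step through the year one day at a time, at noon local time.
  for (let today = new Date(year, 0, 1, 12);
       today.getFullYear() === year;
       today.setDate(today.getDate() + 1)) {
    const d = new Date(today.getTime());  // Copy, as if it were "now"
    d.setMonth(month);                    // The one-argument call
    if (d.getMonth() !== month) bad++;    // Rolled over into the next month
  }
  return bad;
}

console.log(countBadDays(2017, 8));  // September: 7
console.log(countBadDays(2017, 9));  // October: 0
console.log(countBadDays(2017, 1));  // February: 29
```

For a 30-day month such as September, the only bad days are the seven 31sts in the year. A 31-day target month never fails, and February fails on every 29th, 30th and 31st.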

To conclude with a happy ending, I changed all occurrences of d.setMonth(m - 1) into d.setMonth(m - 1, 1), and the test failures all went away. I also changed all occurrences of d.setMonth(m - 1); d.setDate(v) (where v is the day of the month) into d.setMonth(m - 1, v). Not because it is shorter and I can impress people with my knowledge of JavaScript API calls, but because making two separate calls was a bug that I simply didn’t have a unit test for.

But writing that unit test can wait until Monday.
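
For reference, the two-call failure and both fixes described above can be sketched like this. It is only an illustration under stated assumptions: today() is a hypothetical helper that pins the date to March 31st 2017, and m and v follow the naming used in the text (m counts months from 1, v is the day of the month).

```javascript
// Hypothetical helper: "today" is pinned to March 31st 2017 at noon.
function today() {
  return new Date(2017, 2, 31, 12);
}

const m = 9;   // September, counting months from 1
const v = 15;  // Day of the month

// Two separate calls: setMonth() has already rolled over to October 1st
// by the time setDate() is called, so we end up on October 15th.
const twoCalls = today();
twoCalls.setMonth(m - 1);
twoCalls.setDate(v);
console.log(twoCalls.getMonth() + 1, twoCalls.getDate());  // 10 15

// One call, setting the month and the day together.
const oneCall = today();
oneCall.setMonth(m - 1, v);
console.log(oneCall.getMonth() + 1, oneCall.getDate());  // 9 15

// When the day is not needed at all, pin it to the 1st.
const monthOnly = today();
monthOnly.setMonth(m - 1, 1);
console.log(monthOnly.getMonth() + 1, monthOnly.getDate());  // 9 1
```

Pinning the starting date like this is also what a unit test for the bug would need, since a test that relies on the real clock only fails on a handful of days a year.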

Friday, February 24, 2017

NorDevCon 2017: code samples

This is the sample code, in Python and R, for the talk I gave yesterday at the NorDevCon 2017, pre-meeting talks.

To install h2o for Python, from the commandline do:

  pip install h2o

To install it in R, from inside an R session do:

  install.packages("h2o")

Either way, they should get all the dependencies that you need.

The data file was found at Kaggle. (You need to sign up to Kaggle to be allowed to download it.) The following scripts assume you have unzipped it and put train.csv in the same directory as the scripts.

That Kaggle URL is also where the description of fields is to be found.

Here is how to prepare H2O, and the data, in Python:

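import h2o
h2o.init()

data = h2o.import_file("train.csv")

# As in the R code below: columns 2 to 127 are the predictors,
# and column 128 ("Response") is what we want to predict.
x = data.names[1:127]
y = "Response"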

factorsList = ['Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41']

data[factorsList] = data[factorsList].asfactor()

# Split off a random 10% to use to evaluate
# the models we build.
train, test = data.split_frame([0.9], seed=123)

# Sanity check
print(train.shape)
print(test.shape)

# What the data looks like:
print(train.head(1))
Here is the very quick deep learning model:

m_DL = h2o.estimators.H2ODeepLearningEstimator(epochs=1)
m_DL.train(x, y, train)

(I made the powerful one-liner claim in the talk but, as you can see, in Python they are two-liners.)

Then to evaluate that model:


m_DL.predict( test[0, x] )  #Ask prediction about first test record (Python counts rows from 0)

m_DL.predict( test[0:6, x] ).cbind( test[0:6, y] )  #Compare result for first 6 records

m_DL.model_performance(test) #Average performance on all 6060 test records

m_DL.model_performance(train)  #For comparison: the performance on the data it was trained on

Here is the default GBM model:

m_GBM = h2o.estimators.H2OGradientBoostingEstimator()
m_GBM.train(x, y, train)


Then here is the tuned GBM model - basically it is all about giving it more trees to play with:

m_GBM_best = h2o.estimators.H2OGradientBoostingEstimator(
    sample_rate=0.95, ntrees=200,
    stopping_tolerance=0, stopping_rounds=4, stopping_metric="MSE")
m_GBM_best.train(x, y, train, validation_frame=test)


And here is the tuned deep learning model:

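m_DL_best = h2o.estimators.H2ODeepLearningEstimator(
    epochs=1000, stopping_tolerance=0, stopping_rounds=4, stopping_metric="MSE",
    activation="RectifierWithDropout", hidden=[300, 300, 300],
    l1=1e-5, l2=0, input_dropout_ratio=0.2,
    hidden_dropout_ratios=[0.4, 0.4, 0.4])
m_DL_best.train(x, y, train, validation_frame=test)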


And here is the R code, that does the same as the above:



library(h2o)
h2o.init(nthreads = -1)

data = h2o.importFile("train.csv")
# View it on Flow

h2o.cor(data$Wt, data$BMI)

factorsList = c('Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41')
data[,factorsList] <- as.factor(data[,factorsList])

splits <- h2o.splitFrame(data, 0.9, seed=123)
train <- h2o.assign(splits[[1]], "train")  #90% for training
test <- h2o.assign(splits[[2]], "test")  #10% to evaluate with

ncol(train)   #128
ncol(test)    #128

nrow(train)  #53321
nrow(test)  #6060

t(head(train, 1))
t( as.matrix(test[1,1:127]) )

m_DL <- h2o.deeplearning(2:127, 128, train)
m_DL <- h2o.deeplearning(2:127, 128, train, epochs = 1)  #7 to 9 secs
#system.time( m_DL <- h2o.deeplearning(2:127, 128, train) )  #42.5 secs

h2o.predict(m_DL, test[1,2:127])

h2o.cbind(
  h2o.predict(m_DL, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 7.402184        8
# 2 5.414277        1
# 3 6.946732        8
# 4 6.542647        1
# 5 2.596471        6
# 6 6.224758        5

h2o.performance(m_DL, test)
# H2ORegressionMetrics: deeplearning
# MSE:  3.770782
# RMSE:  1.94185
# MAE:  1.444321
# RMSLE:  0.4248774
# Mean Residual Deviance :  3.770782

m_GBM <- h2o.gbm(2:127, 128, train)  #7.3s

h2o.predict(m_GBM, test[1, 2:127])

h2o.cbind(
  h2o.predict(m_GBM, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 6.934054        8
# 2 5.231893        1
# 3 7.135411        8
# 4 5.906502        1
# 5 3.056508        6
# 6 5.049540        5

h2o.performance(m_GBM, test)

# MSE:  3.599897
# RMSE:  1.89734
# MAE:  1.433456
# RMSLE:  0.4225507
# Mean Residual Deviance :  3.599897


#Takes 20-30secs
m_GBM_best = h2o.gbm(
  2:127, 128, train,

  sample_rate = 0.95,
  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",
  ntrees = 200
  )


#h2o.performance gave MSE of 3.473637856428858




# 3-4 minutes  (204secs)
m_DL_best <- h2o.deeplearning(
  2:127, 128, train,
  epochs = 1000,

  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",

  activation = "RectifierWithDropout",
  hidden = c(300, 300, 300),
  l1 = 1e-5,
  l2 = 0,
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.4, 0.4, 0.4)
  )

h2o.performance(m_DL_best, test)

# MSE:  3.609624
# RMSE:  1.899901
# MAE:  1.444417
# RMSLE:  0.4164153
# Mean Residual Deviance :  3.609624

Finally, and not surprisingly, I can highly recommend my own book, if you would like to learn more about how to use H2O. Examples in the book are on three different data sets, and go into more depth about the different machine learning algorithms that H2O offers, as well as some ideas about how to tune:

It is available from O’Reilly and from Amazon UK (and other good bookshops, of course!).