Wednesday, December 6, 2017

Word Embeddings, NLP and I18N in H2O

Word embeddings can be thought of as a dimension-reduction tool that needs a sequence of tokens to learn from. They really are that generic, but I’ve only ever heard of them being used for languages; i.e. the sequences are sentences and the tokens are words (or compound words, or n-grams, or morphemes).
This blog post covers code I presented recently on how to use the H2O implementation of word embeddings, a.k.a. word2vec. The main thing being demonstrated is that they apply equally well to any language, but you may need some language-specific tokenization, and other data engineering, first.
Here is the preparation code, in R; bring in H2O, and define a couple of helper functions.
library(h2o)

h2o.init(nthreads = -1)


# Plot each unique word at its 2-D embedding co-ordinates.
#   w: single-column H2O frame of words
#   v: H2O frame of 2-D word vectors (one row per word)
show <- function(w, v){
  x <- as.data.frame(h2o.cbind(w, v))
  x <- unique(x)
  plot(x[, 2:3], type = "n")         # set up the axes, but draw no points
  text(x[, 2:3], x[, 1], cex = 2.2)  # label each position with its word
}


# Reduce higher-dimensional vectors to 2-D with PCA, then plot them with show().
reduceShow <- function(w, v){
  m <- h2o.prcomp(v, 1:ncol(v), k = 2, impute_missing = TRUE)
  p <- h2o.predict(m, v)  # project each vector onto the first two principal components
  show(w, p)
}
Then I define an artificial corpus, and try word embedding dimensions of 2, 4 and 9. For dimensions above 2, reduceShow() uses PCA to project the vectors down to two dimensions for plotting.
eg1 = c(
  "I like to drive a blue car",
  "I like to drive a red car",
  "I like to drive a green car",
  "I like to drive a blue lorry",
  "I like to drive a yellow lorry",
  "I like to drive a brown lorry",
  "I like to drive a green lorry",
  "I like to drive a red Ferrari",
  "I like to drive a blue Mercedes"
)

eg1.words <- h2o.tokenize(
  as.character(as.h2o(eg1)), "\\\\W+")

head(eg1.words,12)

eg1.wordsNoNA <- eg1.words[!is.na(eg1.words),]

eg1.wv <- h2o.word2vec(eg1.words,
                       min_word_freq = 1,
                       vec_size = 2)

eg1.vectors = h2o.transform(eg1.wv,
                            eg1.wordsNoNA,
                            "NONE")
show(eg1.wordsNoNA, eg1.vectors)


eg1.wv4 <- h2o.word2vec(eg1.words,
                        min_word_freq = 1,
                        vec_size = 4)

eg1.vectors4 = h2o.transform(eg1.wv4,
                             eg1.wordsNoNA,
                             "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors4)

eg1.wv9 <- h2o.word2vec(eg1.words,
                        min_word_freq = 1,
                        vec_size = 9,
                        epochs = 50 * 9)

eg1.vectors9 = h2o.transform(eg1.wv9,
                             eg1.wordsNoNA,
                             "NONE")
reduceShow(eg1.wordsNoNA, eg1.vectors9)
Those results are fairly poor, as we only have 9 sentences; we can do better by sampling those 9 sentences 1000 times. That is, more data helps, even if it is exactly the same data!
eg2 = sample(eg1, size = 1000, replace = T)

#(rest of code exactly the same, just changing eg1 to eg2)
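For reference, here is that pipeline spelled out for the 2-dimension case (the 4- and 9-dimension runs change in exactly the same way):
eg2.words <- h2o.tokenize(
  as.character(as.h2o(eg2)), "\\\\W+")

eg2.wordsNoNA <- eg2.words[!is.na(eg2.words),]

eg2.wv <- h2o.word2vec(eg2.words,
                       min_word_freq = 1,
                       vec_size = 2)

eg2.vectors = h2o.transform(eg2.wv,
                            eg2.wordsNoNA,
                            "NONE")
show(eg2.wordsNoNA, eg2.vectors)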
What about Japanese? Here are the same 9 sentences (well, almost: “This is a …” each time, rather than “I like to drive a …”), but hand-tokenized in a realistic way (in particular, の is a separate token); a sketch of automating that tokenization follows the code below. I’ve gone straight to having 1000 sentences, as we know that helps:
  # これは青い車です。 (This is a blue car.)
  # これは赤い車です。 (This is a red car.)
  # これは緑の車です。 (This is a green car.)
  # これは青いトラックです。 (This is a blue lorry.)
  # これは黄色いトラックです。 (This is a yellow lorry.)
  # これは茶色のトラックです。 (This is a brown lorry.)
  # これは緑のトラックです。 (This is a green lorry.)
  # これは赤いフェラーリです。 (This is a red Ferrari.)
  # これは青いメルセデスです。 (This is a blue Mercedes.)

ja1 = c(  #Pre-tokenized
  "これ","は","青い","車","です","。",NA,
  "これ","は","赤い","車","です","。",NA,
  "これ","は","緑","の","車","です","。",NA,  # ***
  "これ","は","青い","トラック","です","。",NA,
  "これ","は","黄色い","トラック","です","。",NA,
  "これ","は","茶色","の","トラック","です","。",NA,  # ***
  "これ","は","緑","の","トラック","です","。",NA,  # ***
  "これ","は","赤い","フェラーリ","です","。",NA,
  "これ","は","青い","メルセデス","です","。",NA
  )
ja2 = rep(ja1, times = 120)
# length(ja2) is 7920 tokens, representing 1080 sentences.
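In a real project you would not tokenize Japanese by hand, of course. As a sketch only (not part of the original code), a morphological analyser such as MeCab can do this step; the RMeCab package wraps it for R, assuming MeCab and a dictionary are installed locally:
# Sketch: needs the MeCab morphological analyser installed, plus the RMeCab package
# that wraps it for R (install.packages("RMeCab", repos = "http://rmecab.jp/R") is
# the usual route, assuming that repository is still current).
library(RMeCab)

unlist(RMeCabC("これは緑の車です。"))
# Returns one morpheme per element, named by part of speech;
# in particular, の comes out as its own token.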
The code to try it is exactly as with the English, except we just import ja2 and don’t run it through h2o.tokenize().
ja2.words = as.character(as.h2o(ja2))
head(ja2.words, 12)

ja2.wordsNoNA <- ja2.words[!is.na(ja2.words),]

ja2.wv2 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 2,
                        epochs = 20)
ja2.vectors2 = h2o.transform(ja2.wv2,
                             ja2.wordsNoNA,
                             "NONE")
show(ja2.wordsNoNA, ja2.vectors2)

ja2.wv4 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 4,
                        epochs = 20)
ja2.vectors4 = h2o.transform(ja2.wv4,
                             ja2.wordsNoNA,
                             "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors4)


ja2.wv9 <- h2o.word2vec(ja2.words,
                        min_word_freq = 1,
                        vec_size = 9,
                        epochs = 50)
ja2.vectors9 = h2o.transform(ja2.wv9,
                             ja2.wordsNoNA,
                             "NONE")
reduceShow(ja2.wordsNoNA, ja2.vectors9)
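Plots are not the only way to sanity-check the embeddings. As a quick sketch (again, not in the original post), h2o.findSynonyms() returns the nearest words to a given word in a word2vec model; with a model like ja2.wv9 one would hope to see the other colours near 青い (blue), and the other vehicles near トラック (lorry):
# Sketch: nearest neighbours in the 9-dimension Japanese model.
h2o.findSynonyms(ja2.wv9, "青い", count = 5)
h2o.findSynonyms(ja2.wv9, "トラック", count = 5)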
In Python? I’ll just quickly show the key changes (there are full H2O word embedding examples in Python floating around, e.g. https://github.com/h2oai/h2o-meetups/blob/master/2017_11_14_NLP_H2O/Amazon%20Reviews.ipynb).
To bring the data in and tokenize it:
import h2o
h2o.init()

eg1 = [...]
sentences = h2o.H2OFrame(eg1).ascharacter()
eg1_words = sentences.tokenize("\\W+")
eg1_words.head()
Then to make the embeddings:
from h2o.estimators.word2vec import H2OWord2vecEstimator

eg1_wv = H2OWord2vecEstimator(vec_size = 2, min_word_freq = 1)
eg1_wv.train(training_frame = eg1_words)
And to get the vectors for visualization:
eg1_vectors = eg1_wv.transform(eg1_words_no_NA, aggregate_method = "NONE")
(The Python code is untested as I type this - if I have a typo, let me know in the comments or at darren at dcook dot org, and I will fix it.)