Friday, March 31, 2017

The Seven-Day-a-Year Bug

I’ll cut straight to the chase: when you use d.setMonth(m - 1) in JavaScript, always set the optional second parameter.

What’s that, you didn’t know there was one? Neither did I until earlier today. It lets you set the day of the month at the same time. Cute, I thought at the time, a minor time-saver, but hardly worth complicating an API for.
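For reference, here is the shape of the call (this is the standard JavaScript Date API; the second argument is the day of the month):

d.setMonth(monthIndex)              // leaves the day of the month alone
d.setMonth(monthIndex, dayOfMonth)  // sets month and day in one go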

Ooh, how wrong I was. Let me take you back to when it happened. Friday, March 31st….

After a long coding session, I did a check-in, then ran all the unit tests. That is strange: three failing, but in code I hadn’t touched all day. I peer at the code, but it looks correct. It was to do with dates, specifically months, and I was correctly subtracting 1.

Aside: JavaScript dates inherit C’s approach of counting months from 0. In the first draft of this blog post I used a more judgemental phrase than “approach”. But to be fair, it was a 1970s design decision, and the world was different back then. Google “1970s men fashion”.
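To make that aside concrete, using nothing beyond the standard Date API:

new Date(2017, 0, 31).getMonth()    // 0, i.e. January
new Date(2017, 11, 25).getMonth()   // 11, i.e. December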

So, back to the test failures. I start using “git checkout xxxx” to go back to earlier versions, to see exactly when it broke, running all the tests every time. I know something fishy is going on by the time I’ve gone back 10 days and the tests still fail: I am fairly sure I ran all the tests yesterday, and I am certain it hasn’t been 10 days.

Timezones?! Unlikely, but the clocks did change last weekend. A quick test refutes that idea, though. (TZ=XXX mocha . will run your unit tests in timezone XXX.)
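For example (the timezone names here are just illustrative; any valid TZ value works):

TZ=UTC mocha .            # no daylight-saving complications at all
TZ=Europe/London mocha .  # the zone whose clocks just changed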

So, out of ideas, I litter the failing code with console.log lines, to find out what is going on.

Here is what was happening. I initialize a Date object to the current date (to get the current year), then call setMonth(). I don’t use the day, so I don’t explicitly set it. I was calling setMonth(8), expecting to get “September”, but the unit test was being given “October”. Where it gets interesting is that the default date today is March 31st. In other words, when I set the month to September the date object becomes “September 31st”, which isn’t allowed, so it automatically gets normalized to October 1st.
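A minimal reproduction, if you run it on the 31st of a month:

var d = new Date();   // suppose today is March 31st
d.setMonth(8);        // asks for "September 31st", which doesn't exist...
d.getMonth();         // 9, i.e. October: it has rolled over to October 1st
d.getDate();          // 1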

You hopefully see where the title of this piece comes from now? If I had been setting a date in February I would have discovered the bug two days earlier, and if my unit test had chosen October instead of September, the bug would never have been detected. If I’d thought, “ah, I’ll run them on Monday”, the bug would not have been discovered until someone used the code in production on May 31st. I’d have processed their bug report on June 1st and told them, “Can’t reproduce it.” And they’d have gone, “Oh, you’re right, neither can I now.”

To conclude with a happy ending: I changed all occurrences of d.setMonth(m - 1) into d.setMonth(m - 1, 1), and the test failures all went away. I also changed all occurrences of d.setMonth(m - 1); d.setDate(v) (where v is the day of the month) into d.setMonth(m - 1, v), not because it is shorter and I can impress people with my knowledge of JavaScript API calls, but because the two separate calls were a bug that I simply didn’t have a unit test for.
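To see why the two-call version is a bug, imagine again that today is the 31st:

var d = new Date();    // March 31st
d.setMonth(8);         // rolls over to October 1st...
d.setDate(15);         // ...so we end up on October 15th, not September 15th

var d2 = new Date();   // March 31st
d2.setMonth(8, 15);    // September 15th: month and day set in one go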

But writing that unit test can wait until Monday.

Friday, February 24, 2017

NorDevCon 2017: code samples

This is the sample code, in Python and R, for the talk I gave yesterday at NorDevCon 2017 (one of the pre-meeting talks).

To install H2O for Python, from the command line do:

  pip install h2o

To install it in R, from inside an R session do:

  install.packages("h2o")

Either way, that should pull in all the dependencies you need.

The data was the “train.csv.zip” file found on Kaggle. (You need to sign up to Kaggle to be allowed to download it.) The following scripts assume you have unzipped it and put train.csv in the same directory as the scripts.

That Kaggle page is also where the description of the fields can be found.

Here is how to prepare H2O, and the data, in Python:

import h2o

h2o.init()

data = h2o.import_file("train.csv")

data["Ht"].cor(data["Wt"])

factorsList = ['Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41']

data[factorsList] = data[factorsList].asfactor()

# Split off a random 10% to use to evaluate
# the models we build.
train, test = data.split_frame([0.9], seed=123)

# Sanity check
train.ncol
test.ncol
train.nrow
test.nrow

# What the data looks like:
train.head(rows=1)
test.head(rows=1)
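The train() calls below take x (the predictor columns) and y (the response column), which I haven’t shown being defined. Here is a minimal sketch, assuming, to match the 2:127 and 128 column indices used in the R version further down, that the first column is the row id and the last column, Response, is the one being predicted:

y = "Response"
x = train.names[1:-1]   # every column except the first (the id) and the last (the response)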

Here is the very quick deep learning model:

m_DL = h2o.estimators.H2ODeepLearningEstimator(epochs=1)
m_DL.train(x, y, train)

(In the talk I claimed these models were powerful one-liners but, as you can see, in Python they are two-liners.)

Then to evaluate that model:

m_DL

m_DL.predict( test[0, x] )  #Ask prediction about first test record (rows are 0-based in Python)

m_DL.predict( test[range(0,6), x] ).cbind(test[range(0,6), y] )  #Compare result for first 6 records

m_DL.model_performance(test) #Average performance on all 6060 test records

m_DL.model_performance(train)  #For comparison: the performance on the data it was trained on

Here is the default GBM model:

m_GBM = h2o.estimators.H2OGradientBoostingEstimator()
m_GBM.train(x, y, train)

m_GBM.model_performance(test)

Then here is the tuned GBM model; basically, it is all about giving it more trees to play with:

m_GBM_best = h2o.estimators.H2OGradientBoostingEstimator(
    sample_rate=0.95,      # train each tree on a 95% sample of the rows
    ntrees=200,            # allow up to 200 trees (the default is 50)...
    stopping_tolerance=0,  # ...but stop early once 4 scoring rounds in a row
    stopping_rounds=4,     # bring no improvement in MSE
    stopping_metric="MSE"
    )
m_GBM_best.train(x, y, train, validation_frame=test)

m_GBM_best.model_performance(test)

And here is the tuned deep learning model:

m_DL_best = h2o.estimators.H2ODeepLearningEstimator(
    activation="RectifierWithDropout",
    hidden=[300,300,300],
    l1=1e-5,
    l2=0,
    input_dropout_ratio=0.2,
    hidden_dropout_ratios=[0.4, 0.4, 0.4],
    epochs=1000,
    stopping_tolerance=0,stopping_rounds=4,stopping_metric="MSE"
    )
m_DL_best.train(x, y, train, validation_frame=test)

m_DL_best.model_performance(test)

And here is the R code, which does the same as the above:

library(h2o)

h2o.init(nthreads=-1)


data = h2o.importFile("train.csv")
# View it on Flow

h2o.cor(data$Wt, data$BMI)


factorsList = c('Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41')
data[,factorsList] <- as.factor(data[,factorsList])


splits <- h2o.splitFrame(data, 0.9, seed=123)
train <- h2o.assign(splits[[1]], "train")  #90% for training
test <- h2o.assign(splits[[2]], "test")  #10% to evaluate with

ncol(train)   #128
ncol(test)    #128

nrow(train)  #53321
nrow(test)  #6060

t(head(train, 1))
t( as.matrix(test[1,1:127]) )

m_DL <- h2o.deeplearning(2:127, 128, train)  #The promised one-liner, with all defaults
m_DL <- h2o.deeplearning(2:127, 128, train, epochs = 1)  #The quicker version used here: 7 to 9 secs
#system.time( m_DL <- h2o.deeplearning(2:127, 128, train) )  #42.5 secs

h2o.predict(m_DL, test[1,2:127])

h2o.cbind(
  h2o.predict(m_DL, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 7.402184        8
# 2 5.414277        1
# 3 6.946732        8
# 4 6.542647        1
# 5 2.596471        6
# 6 6.224758        5


h2o.performance(m_DL, test)
# H2ORegressionMetrics: deeplearning
# 
# MSE:  3.770782
# RMSE:  1.94185
# MAE:  1.444321
# RMSLE:  0.4248774
# Mean Residual Deviance :  3.770782




######
m_GBM <- h2o.gbm(2:127, 128, train)  #7.3s

h2o.predict(m_GBM, test[1, 2:127])

h2o.cbind(
  h2o.predict(m_GBM, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 6.934054        8
# 2 5.231893        1
# 3 7.135411        8
# 4 5.906502        1
# 5 3.056508        6
# 6 5.049540        5

h2o.performance(m_GBM, test)

# MSE:  3.599897
# RMSE:  1.89734
# MAE:  1.433456
# RMSLE:  0.4225507
# Mean Residual Deviance :  3.599897


##########

#Takes 20-30secs
m_GBM_best = h2o.gbm(
  2:127, 128, train,

  sample_rate = 0.95,
  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",
  ntrees = 200

  )

#h2o.performance gave MSE of 3.473637856428858

plot(m_GBM_best)

h2o.scoreHistory(m_GBM_best)


####################

# 3-4 minutes  (204secs)
m_DL_best <- h2o.deeplearning(
  2:127, 128, train,
  epochs = 1000,

  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",

  activation = "RectifierWithDropout",
  hidden = c(300, 300, 300),
  l1 = 1e-5,
  l2 = 0,
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.4, 0.4, 0.4)
  )  


h2o.performance(m_DL_best, test)

# MSE:  3.609624
# RMSE:  1.899901
# MAE:  1.444417
# RMSLE:  0.4164153
# Mean Residual Deviance :  3.609624

Finally, and not surprisingly, I can highly recommend my own book if you would like to learn more about how to use H2O. The examples in the book cover three different data sets, and it goes into more depth on the different machine learning algorithms that H2O offers, as well as giving some ideas about how to tune them:

From O’Reilly here:
http://shop.oreilly.com/product/0636920053170.do

From Amazon UK:
https://www.amazon.co.uk/Practical-Machine-Learning-Darren-Cook/dp/149196460X

(And other good bookshops, of course!)

Thanks!