Friday, September 23, 2016

Hacking the H2O R API

H2O comes with a comprehensive R API, but sometimes you want to do something that it does not (yet) support. This article will show how to add a couple of functions for fetching and saving models. Beyond giving you these functions, I want to show how to approach hacking on the API, including using internals. (Code in this article has been tested on the 3.8.2.x, 3.8.3.x and 3.10.0.x releases.)
If you want to learn more about H2O, and machine learning, may I recommend my book: Practical Machine Learning with H2O, published by O’Reilly? (It is “coming really soon” as I write this; I will add ordering information as soon as I have it!) And, my company, QQ Trend, are available for helping you with all your machine learning needs, everything from a few hours of H2O-related consulting to helping you build massive models to solve the mysteries of life. (Contact me at dc at qqtrend.com )

Saving it all for another day

Say you have 30 models stored on H2O, and you want to save them all. The scenario might be that you want to stop the cluster overnight, but want to use your current set of models as the starting point for better models tomorrow. Or in an ensemble. Or something. At the time of writing H2O does not offer this functionality, in any of its various APIs and front-ends. So I want to write an h2o.saveAllModels() function.
Breaking that down a bit, I’m going to need these two functions:
  • get a list of all models
  • save models, given a list of models or model IDs.
Let’s start with the “or model IDs” requirement. H2O’s API offers h2o.saveModel(), but that only takes a model object, so how can we use it when all we have is an ID?

Exposing The Guts…

I am a huge fan of open source. H2O is open source. R is open source. But there is open, and then there is open, and one of the things I like about R is that, if you want to see how something was implemented, you just type the function name.
Type h2o.saveModel (without parentheses) in an R session (where you’ve already done library(h2o), of course) and you will see the source code. Here it is; notice how the only part of object that it uses is the model id - that is a stroke of luck, because it means that (under the covers) the API works just the way we needed it to!
function (object, path = "", force = FALSE) 
{
    #... Error-checking elided ...
    path <- file.path(path, object@model_id)
    res <- .h2o.__remoteSend(
      paste0("Models.bin/", object@model_id),
      dir = path, force = force, h2oRestApiVersion = 99)
    res$dir
}
If you are new to H2O, you need to understand that all the hard work is done in a Java application (which can equally well be running on your machine or on a cluster the other side of the world), and the clients (whether R, Python or Flow’s CoffeeScript) are all using the same REST API to send commands to it. So it should be no surprise to see .h2o.__remoteSend there; it is making a call to the “Models.bin” REST endpoint.
.h2o.__remoteSend is a private function in the R API. That means you cannot call it directly. Luckily, R doesn’t get in our way like Java or C++ would. We can use the package name followed by the triple colon operator to run it from normal user code: h2o:::.h2o.__remoteSend(...)
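Two quick tricks that help when poking around like this (nothing H2O-specific, just base R):
h2o:::.h2o.__remoteSend                    # print the source of a private function
ls(getNamespace("h2o"), all.names = TRUE)  # list everything in the namespace, private or not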
WARNING: Remember that hacking with the internals of an API is not future-proof. An upgrade might break everything you’ve written. …ooh, look at us, adrenaline flowing, living the life of danger. (Note to self: need to get out more.)
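One cheap safety net, if you do ship code that relies on internals, is to refuse to run on versions you have not tested. A sketch (adjust the bounds to whatever you have actually verified):
stopifnot(packageVersion("h2o") >= "3.8.2",
          packageVersion("h2o") < "3.11")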

Let’s Write Some Code!

We now have enough to make the saveModels() function:
h2o.saveModels <- function(models, path, force = FALSE){
  sapply(models, function(id){
    if(is.object(id)) id <- id@model_id
    res <- h2o:::.h2o.__remoteSend(
      paste0("Models.bin/", id),
      dir = file.path(path, id),
      force = force, h2oRestApiVersion = 99)
    res$dir
    }, USE.NAMES=F)
}
The if(is.object(id)) id <- id@model_id line is what allows it to work with a mix of model id strings and model objects. The use of sapply(..., USE.NAMES=F) means it returns a character vector, containing the full path of each model file that was saved. Use it as follows:
h2o.saveModels(
  c("DL:defaults", "DL:200x200-500", "RF:100-40"),
  "/path/to/h2o_models/todays_hard_work",
  force = TRUE
  )
and it will output:
[1] "/path/to/h2o_models/todays_hard_work/DL:defaults"
[2] "/path/to/h2o_models/todays_hard_work/DL:200x200-500"
[3] "/path/to/h2o_models/todays_hard_work/RF:100-40"
(By the way, there is one irritating problem with this function: if any failure occurs, such as a file already existing, or a model ID not found, it stops with a long error message, and doesn’t attempt to save the other models. I’ll leave improving that to you. Hint: consider wrapping the h2o:::.h2o.__remoteSend() call with ?tryCatch.)
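For what it's worth, here is one possible shape for that, as a sketch only (I've called it h2o.saveModels2 so as not to clobber the version above): each failure becomes an NA entry plus a warning, instead of aborting the remaining saves.
h2o.saveModels2 <- function(models, path, force = FALSE){
  sapply(models, function(id){
    if(is.object(id)) id <- id@model_id
    tryCatch({
      res <- h2o:::.h2o.__remoteSend(
        paste0("Models.bin/", id),
        dir = file.path(path, id),
        force = force, h2oRestApiVersion = 99)
      res$dir
      }, error = function(e){
        warning("Could not save ", id, ": ", conditionMessage(e))
        NA_character_
      })
    }, USE.NAMES=F)
}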

What Models Have I Made?

Next, how to get a list of all models? The Flow interface has getModels, and the REST API has GET /3/Models, but the R and Python APIs do not; the closest they have is h2o.ls(), which returns the names of all data frames, models, prediction results, etc., with no (reliable) way to tell them apart. But GET /3/Models is not ideal either, because it returns everything about each model, whereas all we want is the model id. Having trawled through the H2O source, it appears we are stuck with that. (By the way, if you fancied submitting a patch to add a GET /3/ModelIds command, the existing Models handler looks like a good starting point: what we want is just the first half of that function.) It is unlikely to matter unless you have thousands of models, or a slow connection to your remote H2O cluster.
Start the same way as before by typing h2o.getModel (no parentheses) into R. Ooh! That function is long, and it is doing an awful lot. If you want a generally useful h2o.getModels() function, I leave that as another of those exercises for the reader. Instead I'm going to call my function h2o.getAllModelIds(), and limit the scope to just that, which makes the code much simpler. (Did you notice the pro tip there: just by calling my function "getAllModelIds" instead of "getModels" I saved myself hours of work. You see kids, naming really does matter.)
Here it is:
h2o.getAllModelIds <- function(){
  d <- h2o:::.h2o.__remoteSend(method = "GET", "Models")
  sapply(d[[3]], function(x) x$model_id$name)
}
The first line of the body fetches all the models. The second keeps just the model ids, and throws everything else away. (Yeah, that d[[3]] bit is particularly fragile.)
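If that positional index makes you nervous, you can look the field up by name instead. In the builds I have poked at, the JSON response calls the list "models", but treat that as an assumption and check your own response with str(d, max.level = 1):
h2o.getAllModelIds <- function(){
  d <- h2o:::.h2o.__remoteSend(method = "GET", "Models")
  sapply(d$models, function(x) x$model_id$name)   #assumes the field is named "models"
}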
Anyway, the final step is to simply put our two new functions together:
h2o.saveAllModels <- function(path){
  h2o.saveModels(h2o.getAllModelIds(), path)
}
Use it, as shown here:
fnames <- h2o.saveAllModels("/path/to/todays_hard_work")
On one test, length(fnames) returned 154 (I’d been busy), and those 154 models totalled 150MB. However some models (e.g. random forest) are bigger than others, so make sure you have plenty of disk space to hand, just in case. Speaking of which, h2o.saveAllModels() should work equally well with S3 or HDFS destinations.
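For example, the calls might look like this (treat these URIs as placeholders; the exact scheme depends on how your cluster is set up to talk to HDFS or S3):
h2o.saveAllModels("hdfs://namenode/models/todays_hard_work")
h2o.saveAllModels("s3n://my-bucket/models/todays_hard_work")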

The day after the night before…

I could’ve done a dput(fnames) after running h2o.saveAllModels(), and saved the output somewhere. But as I’m not putting anything else in that particular directory, I can get the list again with Sys.glob(). So, I might start my next day’s session as follows.
 library(h2o)
 h2o.init(nthreads = -1)
 fnames <- Sys.glob("/path/to/todays_hard_work/*")
 models <- lapply(fnames, h2o.loadModel)
Voila! models will be an R list of the H2O model objects.
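And because it is an ordinary R list, the usual tools apply. For instance (a sketch; @algorithm holds the algorithm name, e.g. "drf" for distributed random forest):
sapply(models, function(m) m@model_id)                   # list the ids that came back
rf <- Filter(function(m) m@algorithm == "drf", models)   # keep just the random forests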

Clusters

If you are working on a remote cluster, with more than one node, there is a little twist to be aware of. h2o.saveModel() (and therefore our h2o.saveModels() extension) will create files on whichever node of the cluster your client is connected to. (At least, as of 3.10.0.7; I suspect this behaviour might change in future.)
But h2o.loadModel() will look for it on the file system of node 1 of the cluster. And node 1 is not (necessarily) the first node you listed in your flatfile. Instead it is the one listed first in h2o.clusterStatus().
This won’t concern you if you saved to HDFS or S3.

Bonus

What I actually use to load model files is shown below. It will get all the files from sub-directories too. (Look at the list.files() documentation for how you can use pattern to choose just some files to be loaded in.)
h2o.loadModelsDirectory <- function(path, pattern=NULL, recursive=T, verbose=F){
  fnames <- list.files(path, pattern = pattern, recursive = recursive,
    full.names = T, include.dirs = F)
  h2o.loadModels(fnames, verbose)
}
(This code still has one problem: if I’m loading in work from both yesterday and two days ago, and I had made a model called “DF:default” on both days, I lose one of them. Sorting that out is my final exercise for the reader - please post your answer, or a link to your answer, in the comments!)
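By the way, the h2o.loadModels() that h2o.loadModelsDirectory() calls is not a standard H2O function; it is another little helper of mine that I didn't show. A minimal sketch of what it could look like (verbose is assumed to just print each file as it loads):
h2o.loadModels <- function(fnames, verbose = F){
  lapply(fnames, function(fname){
    if(verbose) cat("Loading", fname, "\n")
    h2o.loadModel(fname)
    })
}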

Tuesday, August 9, 2016

H2O Upgrade: Detailed Steps

I wanted to upgrade both the R and Python clients to the latest version of H2O (as of Aug 9th, 2016). Here are the exact steps, and I think you will find them relevant even if you only need to update one or the other of those clients. If following this at a later date, remember to follow the spirit of it rather than copy-and-pasting: all the version numbers will have changed. This was with Linux Mint, but it should apply equally well to all other Linux distros.
Make sure you first close any R or Python clients that are using the H2O library; and separately shutdown H2O if it is still running after that.
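If the cluster is still running, one way to stop it from R is shown below (a sketch; it assumes your client can still connect, and prompt = FALSE skips the are-you-sure question):
library(h2o)
h2o.init()    # attaches to the already-running cluster (add ip/port if remote)
h2o.shutdown(prompt = FALSE)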

The First Time

The instructions below are all for upgrading, which means I know I already have all the dependencies in place. If this is your first H2O install, well, I'd first recommend you buy my new book: Practical Machine Learning with H2O, published by O'Reilly… but it is not out as I type this! So let me know if you are interested, and I'll not just let you know when it is available, but also send the best discount code I can find.
As a quick guide, from R I recommend you use CRAN
install.packages("h2o")
and from Python I recommend you use pip:
pip install h2o
Both these approaches get all the dependencies for you; you may end up a version or two behind the very latest, but it won't matter.

The Download

cd /usr/local/src/
wget http://download.h2o.ai/versions/h2o-3.10.0.3.zip
unzip h2o-3.10.0.3.zip
It made an "h2o-3.10.0.3" directory, with python and R sub-directories.

R

I installed h2o the first time as root, so I will continue to do that, hence the sudo:
cd /usr/local/src/h2o-3.10.0.3/R/
sudo R
Then:
remove.packages("h2o")
install.packages("h2o_3.10.0.3.tar.gz")
Then ctrl-d to exit.

Python

cd /usr/local/src/h2o-3.10.0.3/python/
sudo pip uninstall h2o
sudo pip install -U h2o-3.10.0.3-py2.py3-none-any.whl 
(The -U means upgrade any dependencies; the first time I forgot it, and ended up with some very weird errors when trying to do anything in Python.)

The Test

I started RStudio, and ran:
library(h2o)
h2o.init(nthreads=-1)
as.h2o(iris)
I then started ipython and ran:
import h2o,pandas
h2o.init()
iris = h2o.get_frame("iris")
print(iris)
As well as making sure the data arrived, I'm also checking that the h2o.init() call in both cases reported the cluster version as "3.10.0.3".

AWS Scripts

If you use the AWS scripts (https://github.com/h2oai/h2o-3/tree/master/ec2) and want to make sure EC2 instances start with exactly the same version as you have installed locally, the file to edit is h2o-cluster-download-h2o.sh. (If not using those scripts, just skip this section.)
First find the h2oBranch= line and set it to “rel-turing” (notice the “g” on the end - there is also a version without the “g”!). Then comment out the two curl calls that follow, and instead set version to be whatever you have above, and build to be the last digit in the version number. So, for 3.10.0.3, I set:
h2oBranch=rel-turing

#echo "Fetching latest build number for branch ${h2oBranch}..."
#curl --silent -o latest https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/latest
h2oBuild=3

#echo "Fetching full version number for build ${h2oBuild}..."
#curl --silent -o project_version https://h2o-release.s3.amazonaws.com/h2o/${h2oBranch}/${h2oBuild}/project_version
h2oVersion=3.10.0.3
The rest of that script, and the other EC2 scripts, can be left untouched.

Summary

Well that was easy! No excuses! Having said that, I recommend you upgrade cautiously - I have seen some hard-to-test-for regressions (e.g. model learning no longer scaling as well over a cluster) when grabbing the latest version.

Monday, March 21, 2016

WebGL: RWD, Mobile-First And Progressive Enhancement

WebGL: adapting to the device

This is a rather under-studied area, but it is going to become more important as WebGL is increasingly used to make websites. This article is a summary of my thoughts and learnings on the topic, so far.

I should say that my context is using WebGL for things other than games. Informational websites, educational apps, data visualization, etc., etc.

Please use the comments to add related links you can recommend; or just to disagree.

What is RWD?

Responsive Web Design. The idea is that, rather than make a mobile version of your website and a separate desktop version of your website, you make a single version in such a way that it will adapt and be viewable on all devices, from mobile phones (both portrait and landscape), through tablets, to desktop computers.

Mobile-First? Progressive Enhancement?

Mobile First is the idea that you first make your site work on the smallest screen, and the device with least capability. Then for the larger screens you add more sections.

This is in contrast to starting at the other end: make beautiful graphics designed for a FullHD desktop monitor, using both mouse and keyboard, then hide and remove things as you move to the smaller devices.

Just remember Mobile-First is a guideline, not a rule. If you end up with a desktop site where the user is getting frustrated by having to simulate touch gestures with their mouse, then you’ve missed the point.

It Is Hard!

RWD for a complex website can get rather hard. On toy examples it all seems nice and simple. But then add some forms. Make it multilingual. Add some CSS transitions and animations. Add user uploaded text or images. Then just as you start to crack under the strain of all the combinations that need testing, the fatal blow: the client insists on Internet Explorer 8 being supported.

But if you thought RWD and UX for normal websites was hard, then WebGL/3D takes it to a whole new dimension…

Progressive Enhancement In 3D

Progressive enhancement can be obvious things like using lower-polygon models and lower-resolution textures, or adding/removing shadows (see below).

But it can also be quite subtle things: in a 3D game presentation I did recently, the main avatar had an “idle” state animation: his chest moved up and down. But this requires redrawing the whole screen 60 times a second; without that idle animation the screen only needs to be redrawn when the character moves. Removing the idle animation can extend mobile battery life by an order of magnitude.

And that can lead to political issues. If you’ve ever seen a designer throw a tantrum just because you re-saved his graphics as 80% quality jpegs, think about what will happen if two-thirds of the design budget, and over three-quarters of the designer’s time went on making those subtle animations, and you’ve just switched them off for most of your users.

By the way, the cost difference that matters is between zero and one continuous animations. Remember that an always-on "flyover" effect counts as animation, as does an arcade game where the user's character is constantly being moved around the screen. So, once one effect requires constantly re-drawing the scene, the extra load of adding those little avatar animations will be negligible.

Lower-Poly Models

I mentioned this in passing above. Be aware that it is often more involved than with 2D images, where using Gimp/Photoshop to turn an 800x600 image into a 320x240 one, at lower quality, can be automated. In fact you may end up doubling your designer costs if they have to make two versions.

If the motivation for low-poly is to reduce download time, you could consider running a sub-surf (subdivision surface) modifier once the data has been downloaded, or describing the shape with a spline and dynamically extruding it.

If the motivation is to reduce the number of polygons to reduce CPU/GPU effort, again consider the extrude approach but using different splines, and/or different bevels.

Shadows

Adding shadows increases the realism of a 3D scene, but adds more CPU/GPU effort. Also more programmer effort: you need to specify which objects cast shadows, which objects receive shadows, and which light sources cast shadows. (All libraries I mentioned in Comparison Of Three WebGL Libraries handle shadows in this way.)

For many data visualization tasks, shadows are unnecessary, and could even get in the way. Even for games they are usually an optional extra. But in some applications the sense of depth that shadows give can really improve the user experience (UX).

If you have a fixed viewing angle on your 3D scene, and fixed lighting, you can use pre-made shadows: these are simply a 2D graphic that holds the shadow.

VR

With virtual reality headsets you will be updating two displays, and it has to be at a very high refresh rate, so it is demanding on hardware.

But virtual reality is well-suited for progressive enhancement: just make sure your website or application is fully usable without it, but if a user has the headset they are able to experience a deeper immersion.

Controls In 3D

Your standard web page didn’t need much controlling: up/down scrolling was the only axis of movement, plus being able to click a link. Touch-only mobile devices could adapt easily: drag up/down, and tap to follow a link.

Mouseover hints are usually done as a progressive enhancement, meaning they are not available to people on devices without a mouse. (Which is why, in mobile apps, I often have no idea what all the different icons do…)

If your WebGL involves the user navigating around a 3D world, the four arrow keys can be a very natural approach. But there is no common convention on a touch-only device. Some games show a semi-transparent joystick control on top, so you press that for the 4 directions. Others have you touch the left/right halves of the screen to steer left and right, and perhaps you move at a constant speed.

Another approach is to touch/click the point you want to move to, and have your app choose the route, and animate following it.

Zoom is an interesting one, as the approach for standard web sites can generally be used for 3D too. There are two conventions on mobile: pinch to grow/shrink, or double-tap to zoom in a fixed amount (and double-tap again to restore). With a mouse, the scroll-wheel zooms while ctrl is held down. With only a keyboard, it is ctrl and plus/minus, with ctrl and zero to restore the default zoom.

Friday, March 18, 2016

Timestamp helper in Handlebars

Handlebars is a widely-used templating language for web pages. In a nutshell, the variables to insert go between {{ and }}. Easy. It offers a few bits of logic, such as if/else clauses, for-each loops, etc. But, just as usefully, Handlebars allows you to add helper functions of your own.
In this article I will show a nice little Handlebars helper to format datestamps and timestamps. Its raison d’etre is its support for multiple languages and timezones. The simplest use case (assuming birthday is their birthday in some common format):
<p>Your birthday is on {{timestamp birthday}}.</p>
It builds on top of sugar.js’s Date enhancements; I was going to do this article without using them, to keep it focused, but that would have made it unreasonably complex.
There are two ways to configure it: with global variables, or with per-tag options. For most applications, setting the globals once will be best. Here are the globals it expects to find:
  • tzOffset: the number of seconds your timezone is ahead of UTC. E.g. if in Japan, then tzOffset = 9*3600. If in the U.K. this is either 0 or 3600 depending on if it is summer time or not.
  • lang: The user-interface language, e.g. “en” for English, “ja” for Japanese, etc.
(By the way, if setting lang to something other than “en”, you will also need to have included locale support into sugar.js for the languages you are supporting - this is easy, see the sugar.js customize page, and check Date Locales.)
The default timestamp format is the one built-in to sugar.js for your specified language. All these configuration options (the two above, and format) can be overridden when using the tag. E.g. if start is the timestamp of when an online event starts, you could write:
<p>The live streaming will start at
{{timestamp start tzOffset=0}} UTC,
which is {{timestamp start tzOffset=32400}}
in Tokyo and {{timestamp start tzOffset=-25200}}
in San Francisco.</p>
Here is the basic version:
Handlebars.registerHelper('timestamp', function(t, options){
  var offset = options.hash.tzOffset;
  //Check against undefined (not just falsy), so that tzOffset=0 in a tag works
  if(offset === undefined)offset = tzOffset;   //Use global as default

  //Turn t into a Date, whether it arrived as a Date, a string or a number
  if(!Object.isDate(t)){
    if(!t)return "";
    if(Object.isString(t))t = Date.create(t + "+0000").setUTC(true).addSeconds(offset);
    else t = Date.create(t*1000).setUTC(true).addSeconds(offset);
    }
  else t = t.clone().addSeconds(offset);

  if(!t.isValid())return "";

  var code = options.hash.lang;
  if(!code)code = lang;   //Use global as default

  var format = options.hash.format ? options.hash.format : '';
  return t.format(format, code);   //code, not the global lang, so the per-tag override works
});
The first two-thirds of the function turn t into a Date object, coping with it being already a Date object, a string (in UTC, and in any common format that Date.create() can cope with), or a number (in which case it is seconds since Jan 1st 1970, UTC). However, be careful if giving a pre-made Date object: make sure it holds the time in UTC and is flagged as being in UTC.
The rest of the function just chooses the language and format, and returns the formatted date string.
If you were paying attention you would have noticed t stores a lie. E.g. for 5pm BST, t would be given as 4pm UTC. We then turn it into a date that claims to be 5pm UTC. Basically this is to stop format() being too clever, and adjusting for local browser time. (This trick is so you can show a date in a browser for something other than the user’s local timezone.)
But it does mean that if you include any of the timezone specifiers in your format string, they will wrongly claim it is UTC. {{timestamp theDeadline format="{HH}:{mm} {tz}" }} will output 17:00 +0000.
To allow you to explicitly specify the true timezone, here is an enhanced version:
Handlebars.registerHelper('timestamp', function(t, options){
  var offset = options.hash.tzOffset;
  //Check against undefined (not just falsy), so that tzOffset=0 in a tag works
  if(offset === undefined)offset = tzOffset;   //Use global as default
  if(!Object.isDate(t)){
    if(!t)return "";
    if(Object.isString(t))t = Date.create(t + "+0000").setUTC(true).addSeconds(offset);
    else t = Date.create(t*1000).setUTC(true).addSeconds(offset);
    }
  else t = t.clone().addSeconds(offset);
  if(!t.isValid())return "";

  var code = options.hash.lang;
  if(!code)code = lang;   //Use global as default

  var format = options.hash.format ? options.hash.format : '';
  var s = t.format(format, code);   //code, not the global lang, so the per-tag override works
  if(options.hash.appendTZ)s += tzString;
  if(options.hash.append)s += options.hash.append;
  return s;
});
(the only change is to add a couple of lines near the end)
Now if you specify appendTZ=true then it will append the global tzString. Alternatively you can append any text you want by specifying append. So, our earlier example becomes one of these:
{{timestamp theDeadline format="{HH}:{mm}" appendTZ=true}}
{{timestamp theDeadline format="{HH}:{mm}" append="BST"}}
{{timestamp theDeadline format="{HH}:{mm}" append=theDeadlineTimezone}}
The first one assumes a global tzString is set. The second one hard-codes the timezone string, which is unlikely to be what you want; the third one is the same idea but gets the timezone from another variable.
VERSION INFO: The above code is for sugar.js v.1.5.0, which is the latest version at the time of writing, and likely to be so for a while. If you need it for sugar.js 1.4.x then please change all occurrences of setUTC(true) to utc().