Thursday, October 20, 2011

R: what to do when the jitter gets too much

The subject sounds like this will be a discussion of an over-full bladder... instead, brace yourself for some serious functional programming.

So, I had this little bit of R to generate a "rough" exponent curve (as test data):
  x=1:40
  y=exp( jitter( x, factor=30 ) ^ 0.5 )

The problem was it was sometimes giving NA values - too much jitter!
I started off with this fix:

  y[is.na(y)]=exp( x^0.5 )

I.e. wherever it was an NA use the un-jittered value.

Unfortunately it was often complaining with "number of items to replace is not a multiple of replacement length". In this case this was not a warning to ignore, because it revealed I was doing something very wrong. exp(x) is a vector of 40 items. y[is.na(y)] is a vector of however many entries in y are NA. The dangerous thing is if there is just one NA in y, but it is the 5th element, it will be set to exp(1)^0.5, instead of exp(5)^0.5.

Let's cut straight to the solution:
  y[is.na(y)]=exp(x[is.na(y)]^0.5)

x[is.na(y)] means the values out of x where the same-sized y-vector has a NA in that position. So, for the above example where just the 5th element in y is NA, then x[is.na(y)] will be "5".

Simple when you know how. (?!)


No comments: