Thursday, August 23, 2012

Small R/XTS Code Snippets And Tips

1. Splitting by week

The split function in xts can split your data into weeks:  split(data,f="weeks")
Under the covers it uses endpoints, which in turn uses a C function of the same name in the XTS package. The problem I discovered it that it considers the start of the week as Monday, 00:00:00, and the end of the week as Sunday, 23:59:59. This is a problem because the FX markets start at 9pm or 10pm on a Sunday night in the UTC/GMT timezone (at Sunday, 5pm in New York timezone).

What we want is for it to treat Sunday 00:00:00 as the start of the week. That gives us good buffer on both sides (about 24hrs after Nasdaq after-market hours finish, and about 21hrs before the FX markets open). If you look at the source of endpoints (just type endpoints in an interactive R session) you'll see it has the magic constant: 3L * 86400L. After some trial and error I found we want that to be 4L not 3L. Though it may be brittle, as you're using XTS internals, you can use this line in your code:

    ep= .Call("endpoints", .index(data)+4L*86400, 604800L, k=k, PACKAGE="xts")

(I don't suppose the XTS package can consider this a bug, as there may be people relying on the current "weeks" behaviour; but maybe they could add a "sunweeks" option to endpoints and split? What do you think?)

P.S. A minor optimization for the XTS endpoints function:
    if (on %in% c("years", "quarters", "months", "weeks", "days"))
could read:
    if (on %in% c("years", "quarters", "months", "days"))
because weeks is not using posixltindex, and it takes CPU time to generate.

2. Put an XTS object in a string

Say you want to show the first 4 rows of an xts object in a string. The first trick you need is capture.output(). This takes the print() output and puts it in a string. But it returns each row as an entry in a vector. So we'll use paste() to convert that to a single string. Here is our code:


    msg = paste("Here are the first 5 rows:",paste(capture.output(x[1:5,]),collapse="\n") )

See item 3 for an example of this.

3. Show me just the NA rows

Here is some test data:
    library(xts)
    x=xts(  data.frame(a=c(1,2,3,NA,5,6), b=c(100,99,NA,NA,NA,95)),
            as.Date("2012-10-01")-6:1)
Which looks like:

                a   b
    2012-09-25  1 100
    2012-09-26  2  99
    2012-09-27  3  NA
    2012-09-28 NA  NA
    2012-09-29  5  NA
    2012-09-30  6  95


Imagine there should be none, and you want to print them out in an error message. This is the command to just show the rows with no NAs:

    x[complete.cases(x),]

And therefore just the rows with  NAs:


    x[!complete.cases(x),]

So, to print a fatal error message that shows the NA rows:

    if(any(is.na(x))){
      stop( paste("We have NAs:",

        paste(capture.output(x[!complete.cases(x),]),collapse="\n")
        ))
      }


(see Item 2, "Put an XTS object in a string", if the second half of that line looks scarier than a rabid dog who has just bitten through his leash.)


4. Loop through as key=>value


In many languages there is a foreach(container as key=>value) type construct. I've not found something so compact for XTS objects, so I use this:
    for(ix in 1:nrow(x)){
      datestamp=index(x)[ix]
      b=x$b[ix]  #Or, b=coredata(x$b)[ix]

      #print(datestamp);print(b)
      }

I've shown two ways of setting 'b'. The first way leaves 'b' as an XTS object. In the second way 'b' is a number. If using the latter, I'm sure it is more efficient to do the coredata() call before the loop. I.e. like this:
    xcd=coredata(x$b)
    for(ix in 1:nrow(x)){
      datestamp=index(x)[ix]
      b=xcd[ix]

      #print(datestamp);print(b)
      }

 

5. Merging, but excluding values only in one xts object

Here is some test data:
  library(xts)
  Sys.setenv(TZ = "UTC")
  d=xts(1:5,Sys.Date()+(2:6))
  ix=Sys.Date()+(1:5)

So, d is our data, whereas ix lists the days we should have data for.
ix looks like:
  [1] "2013-01-10" "2013-01-11" "2013-01-12" "2013-01-13" "2013-01-14"

and d looks like:
             [,1]
  2013-01-11    1
  2013-01-12    2
  2013-01-13    3
  2013-01-14    4
  2013-01-15    5


In other words, d is missing an entry for 2013-01-10, so we need to add an NA entry for it. But also d has an entry for 2013-01-15, which it shouldn't have yet. (In a finance context it might be a bar that we are still collecting ticks for, so we don't have final values for yet; in a business context it might be a value that has not been approved by management for release yet.)

When we do merge(d,xts(,ix),all=F) we get:
             d
  2013-01-11 1
  2013-01-12 2
  2013-01-13 3
  2013-01-14 4



When we do merge(d,xts(,ix),all=T) we get:
             d
  2013-01-10 NA
  2013-01-11  1
  2013-01-12  2
  2013-01-13  3
  2013-01-14  4
  2013-01-15  5

Neither is what we want. The solution is  merge(d,xts(,ix),join='right') which gives us:
  2013-01-10 NA
  2013-01-11  1
  2013-01-12  2
  2013-01-13  3
  2013-01-14  4


6a. Combining two xts objects that have same column but different timestamps

Here is some test data:
  library(xts)
  Sys.setenv(TZ = "UTC")
  a=xts(1:5,as.Date("2013-02-01")+1:5);colnames(a)="v"
  b=xts(0:-2,as.Date("2013-02-01")-0:2);colnames(b)="v"

When we do merge(a,b) we get:
             v v.1
 2013-01-30 NA  -2
 2013-01-31 NA  -1
 2013-02-01 NA   0
 2013-02-02  1  NA
 2013-02-03  2  NA
 2013-02-04  3  NA
 2013-02-05  4  NA
 2013-02-06  5  NA


You can play with various values for the join parameter, but it won't help. The solution is  rbind(a,b):
             v
 2013-01-30 -2
 2013-01-31 -1
 2013-02-01  0
 2013-02-02  1
 2013-02-03  2
 2013-02-04  3
 2013-02-05  4
 2013-02-06  5


What if both a and b have a timestamp in common? Then you get two rows. In that case you can remove the duplicates afterwards:
  x=rbind(a,b)
  x=x[!duplicated(index(x)),]

The duplicate from b is the one that gets removed. If you wanted the one from a to be removed instead, then pass fromLast = TRUE to duplicated(), or simply do rbind(b,a) !

6b. Combining two xts objects that have same columns but different timestamps

This follows on from 6a, but when the xts object have more than one column. Here is some test data:
  library(xts)
  Sys.setenv(TZ = "UTC")
  a=xts(data.frame(x=1:5,y=5:9),as.Date("2013-02-01")+1:5)
  b=xts(data.frame(x=0:-2,y=10:12),as.Date("2013-02-01")-0:2)

Again,  rbind(a,b) is the answer: 
             x  y
 2013-01-30 -2 12
 2013-01-31 -1 11
 2013-02-01  0 10
 2013-02-02  1  5
 2013-02-03  2  6
 2013-02-04  3  7
 2013-02-05  4  8
 2013-02-06  5  9