Sunday, July 31, 2011

Rcpp: embedding C++ in R scripts

Rcpp is an R package to aid in making R extensions written in C++. It also includes a side-project for embedding R code in a C++ program (including being able to use all the R packages), but most amazing of all is the integration with R's inline package. Let me show you a minimal (but complete) script:
src='Rcpp::List output(1);output[0]="Hello World";return output;'
Yes, it is the classic Hello World script; the output looks like:
[1] "Hello World"
Why do I call this amazing? The contents of src is the body of a C++ function. The cxxfunction() call takes your C++ code, sends it to your C++ compiler, compiles it with the necessary headers, and finally embeds it in your R session as an R extension. It just works; there is no catch (*).
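The src line above is only the body of the C++ function. A minimal sketch of the full script might look like the following, assuming the Rcpp and inline packages and a working C++ compiler are installed. The name fun is made up for illustration.

```r
library(Rcpp)
library(inline)

# Body of the C++ function: build a one-element list and return it.
src <- 'Rcpp::List output(1); output[0] = "Hello World"; return output;'

# Compile the body into an R-callable function that takes no arguments.
# plugin = "Rcpp" asks inline to add the Rcpp headers and linker flags.
fun <- cxxfunction(signature(), body = src, plugin = "Rcpp")

print(fun())
```

Wrapping the cxxfunction() call in system.time() is one way to see how long the compile step takes on your own machine.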

(I could have used a character object instead of a list. The Rcpp::List is just like an R list, meaning each element can be of a different type. If you've always wanted something like this in C++, also take a look at boost::any.)
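For comparison, here is a sketch of the character-vector variant mentioned above. It assumes the same setup (Rcpp, inline and a C++ compiler); src_chr and fun_chr are hypothetical names.

```r
library(Rcpp)
library(inline)

# Same idea, but returning a character vector rather than a list.
src_chr <- 'Rcpp::CharacterVector output(1); output[0] = "Hello World"; return output;'
fun_chr <- cxxfunction(signature(), body = src_chr, plugin = "Rcpp")

# The result is an ordinary R character vector of length 1.
print(fun_chr())
```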

The cxxfunction() call takes 1.9s to run on my machine, so the overhead to start up the compiler and run the compilation process is not too bad.

Okay, not a motivating example, but the Rcpp project is not just technically amazing, it is also well documented. You can find some more interesting examples in the documentation, and if this has got your interest you will enjoy this video (1.5hrs) of a talk the main Rcpp authors gave at Google.

By the way, if anyone knows of a PHP project for embedding C++ like this, please let me know!

*: You need to install the Rcpp and inline packages, and you need R version 2.12.0 or later. If you are using Ubuntu and your default version of R is too old, there are good instructions here on getting the latest R version as a deb package.

Sunday, July 17, 2011

R: extract column from xts object as a vector

A quickie for the R language... you will look at the answer and think it is hardly worth writing a blog post over. Well it turns out that everyone has thought the same... which is why it took me over half an hour of failed searching and trial and error to work it out.

I have an xts object, x, and I've added a column, sum, which is the sum of the other columns.
print(x$sum) shows me the sum for each row, along with the datestamp of that row. I.e. x$sum is still an xts object. Normally this is good, but I wanted it as a simple vector, without the dates associated. Here is how you do that:
as.vector(x$sum)
Why did I want that? Simply because summary(x$sum) uses up 7 lines, half of it being noise about the datestamps; summary(as.vector(x$sum)) is 2 lines, all signal, no noise.
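To make that concrete, here is a small sketch using made-up data. The dates and the columns a and b are hypothetical, and adding the column with rowSums() is just one way to do it, not necessarily how the original column was built. It assumes the xts package is installed.

```r
library(xts)

# Hypothetical data: three daily rows with two numeric columns.
dates <- as.Date("2011-07-01") + 0:2
x <- xts(cbind(a = c(1, 2, 3), b = c(10, 20, 30)), order.by = dates)

# Add a column holding the row-wise sum of the other columns.
x$sum <- rowSums(x)

# Still an xts object, so the dates come along with it.
print(class(x$sum))

# Plain numeric vector, no dates attached.
v <- as.vector(x$sum)
print(v)
print(summary(v))
```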

Thursday, July 14, 2011

Customize less: less annoying

Open ~/.bashrc (or ~/.bash_profile) and add this line at the end:
export LESS=-FRX
Don't ask why, just trust me. ('cos actually I don't know what it means either...)

This was the magic incantation to tell "git log" to wrap commit messages (instead of just chopping off everything extra).

But it has had the delightful side-effect that man now works properly. About 6 or 7 years ago I used to be able to type "man whatever", press q, and the advice would stay on screen. Then Linux (all distros as far as I could tell) changed to hiding the man page as soon as you press q. Frequently I'd have to have two terminal windows open, simply so I could keep the man page open while I typed my command. Yes, you're ahead of me; the above LESS command fixes this. If I'd known it would be so easy I'd have done this years ago!!

Even better, the help system in the R language was working just like man pages (and it was even more annoying there), and that has been fixed too.

It is a miracle, I tell you, a miracle! July 13th should go down as "Saint LESS=-FRX" day; in 100 years' time it will be a public holiday and children will be taught about it in schools. They might even be taught what it means...

Monday, July 4, 2011

R: foreach parallelization

I was experimenting with the foreach and doMC packages in R, which make your code multi-threaded. It actually took a bit of refactoring to use foreach(), as there seems to be no shared memory between the threads - they each took their own copy of the in-scope variables. So I had to return my results as a one-row data frame from each iteration, and combine them when the foreach loop had finished.
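The pattern described above can be sketched roughly like this, assuming the foreach and doMC packages are installed (doMC is not available on Windows). The loop body, the column names and the core count are placeholders rather than the original code.

```r
library(foreach)
library(doMC)

registerDoMC(cores = 4)  # assumed core count; adjust for your machine

# Each iteration returns a one-row data frame; .combine = rbind stacks
# them into a single data frame once all iterations have finished.
results <- foreach(i = 1:10, .combine = rbind) %dopar% {
  data.frame(iteration = i, value = sqrt(i))  # placeholder for real work
}

print(results)
```

Swapping %dopar% for %do% runs the same loop sequentially, which is handy for getting base timings.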

Here are my results; the user CPU column was giving me strange numbers, so I'm just going to list total time spent (wall clock time). I have three different loops, so here are the base timings (running the foreach loop in sequential mode, without doMC loaded) for each:
         Loop 1        Loop 2        Loop 3
         4.33          7.72          16.14
My CPU has 4 cores, but the OS sees it as 8 virtual cores. Here are my results for 1, 2, 4 and 8 threads:
Threads  Loop 1        Loop 2        Loop 3        vs. linear
  1      4.36          7.90          16.20
  2      2.50 [2.18]   4.30 [3.95]    8.80 [8.10]  [8-15% slower]
  4      1.46 [1.09]   2.50 [1.98]    5.06 [4.05]  [25-35% slower]
  8      1.32 [0.55]   2.30 [0.99]    4.30 [2.02]  [110-140% slower]

All times are in seconds, and this loop represents most of the time spent in my script, so while the results are a long way from linear, they represent a useful speed-up. The numbers in square brackets show the times I would have seen with a perfectly linear speed-up.
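As a quick check on the table, the bracketed figures and the "slower" percentages can be recomputed from the measured times. This is just a sketch in base R. The variable names are made up for illustration, and it assumes the ideal time is the 1-thread time divided by the thread count.

```r
threads <- c(2, 4, 8)

# Measured wall-clock times (seconds) for the three loops.
one_thread <- c(4.36, 7.90, 16.20)
measured <- rbind(
  c(2.50, 4.30, 8.80),  # 2 threads
  c(1.46, 2.50, 5.06),  # 4 threads
  c(1.32, 2.30, 4.30)   # 8 threads
)

# Ideal times under a linear speed-up: rows are thread counts, columns are loops.
ideal <- outer(1 / threads, one_thread)

# How much slower the measured time is than the ideal, in percent.
overhead_pct <- round(100 * (measured / ideal - 1))

rownames(ideal) <- rownames(overhead_pct) <- paste(threads, "threads")
print(round(ideal, 2))
print(overhead_pct)
```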

By the way, my foreach loop had 200 to 250 iterations. The above results tell me that when each foreach loop iteration does more work we get better efficiency. This is fairly coarse parallelization, which suggests to me that there is lots of room for improvement in the doMC code.
UPDATE: When running with top, and cores set to 4, I notice 4 new instances of R appear each time it hits the foreach loop, and then they disappear after. I.e. it appears to be using processes, not threads, and creating them on-demand! No wonder my results indicate so much overhead!
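One way to check this from inside R, rather than watching top, is to ask each iteration for its process ID. This is only a sketch, assuming foreach and doMC are installed. If the work really is done in separate processes, the IDs reported by the workers should differ from the parent's.

```r
library(foreach)
library(doMC)

registerDoMC(cores = 4)  # assumed core count

parent_pid <- Sys.getpid()

# Ask each iteration which operating-system process it is running in.
worker_pids <- foreach(i = 1:4, .combine = c) %dopar% Sys.getpid()

print(parent_pid)
print(unique(worker_pids))
```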

Sunday, July 3, 2011

Apache: use both PHP module and PHP-CGI

I had a need the other day to configure Apache so that it uses the PHP Apache module for all directories except one, where I wanted it to use the CGI version of PHP. The Apache module is more efficient, but runs as part of the Apache process. I wanted one URL to run in its own process. (I was troubleshooting and had a hunch this might help; luckily it did in this case!) Instructions for both Linux and Windows follow.

On Ubuntu I needed some preparation: 1. install the php5-cgi package (different from php5-cli, which is what we use when running PHP from the command line); 2. enable the actions module.
On Windows I was using xampp, which follows an Everything-But-The-Kitchen-Sink philosophy, and all I needed was already there.

Here is the code I added (to my VirtualHost). First for Ubuntu:
ScriptAlias /usephpcgi /usr/bin
Action  application/x-httpd-php-cgi  /usephpcgi/php5-cgi
And then for Windows XAMPP:
ScriptAlias /usephpcgi C:/xampp/php/
Action  application/x-httpd-php-cgi  /usephpcgi/php-cgi.exe
(That is copy and paste from working configurations, so I think the trailing slash must be optional!)

Then for both Ubuntu and Windows:
AddType application/x-httpd-php-cgi .phpcgi
Now, I have to admit my dirty secret: I cheated. Instead of enabling php-cgi for all *.php in one directory, I left *.php going to the Apache PHP module, and I created the *.phpcgi extension to use the CGI binary. Initially this was simply because I managed to get it working that way, but on reflection I realized I preferred it: I can switch a script between the PHP module and php-cgi just by changing the extension, and I can use php-cgi anywhere in my VirtualHost. If that does not sound so useful, I should explain that the script in question is already hidden behind an Alias, something like this, so no public-facing URLs need to change:
AliasMatch ^/admin/(.*?)/(.*)$ /path/to/admin.phpcgi

What about my original plan to configure it for a directory, without changing the file extensions? I had trouble with this and gave up; but Ben kindly left a comment, so now I know how to do it. First, you still need the ScriptAlias and Action directives shown above. Then it is simply this:
<Directory /var/www/path/to>
    <FilesMatch "\.ph(p3?|tml)$">
        SetHandler application/x-httpd-php-cgi
    </FilesMatch>
</Directory>
As Ben explains (see comment #1 below), the reason we need the FilesMatch inside the Directory is that mod_php sets a global FilesMatch directive, and that takes priority over our attempts to use AddType or AddHandler for a directory.