Monday, July 4, 2011

R: foreach parallelization

I was experimenting with the foreach and doMC packages in R, which make your code multi-threaded. It actually took a bit of refactoring to use foreach(), as there seems to be no shared memory between the threads - they each took their own copy of the in-scope variables. So I had to return my results as a one-row data frame from each iteration, and combine them when the foreach loop had finished.

Here are my results; the user CPU column was giving me strange numbers, so I'm just going to list total time spent (wall clock time). I have three different loops, so here are the base timings (running the foreach loop in sequential mode, without doMC loaded) for each:
4.33         7.72         16.14
My CPU has 4 cores, but the OS sees it as 8 virtual cores. Here are my results for 1, 2, 4 and 8 threads:
1   4.36         7.90         16.20
  2   2.50 [2.18]  4.30 [3.95]   8.80 [8.10]  [8-15% slower]
  4   1.46 [1.09]  2.50 [1.98]   5.06 [4.05]  [25-35% slower]
  8   1.32 [0.55]  2.30 [0.99]   4.30 [2.02]  [110-140% slower]

All times are in seconds, and this loop represents most of the time spent in my script, so while the results are a long way from linear, they represent a useful speed-up. The numbers in square brackets show the speeds if I had got linear improvement.

By the way, my foreach loop had 200 to 250 iterations. The above results tell me that when each foreach loop iteration does more work we get better efficiency. This is fairly course parallelization, which suggests to me that there is lots of room for improvement in the doMC() code.
UPDATE: When running with top, and cores set to 4, I notice 4 new instances of R appear each time it hits the foreach loop, and then they disappear after. I.e. it appears to be using processes, not threads, and creating them on-demand! No wonder my results indicate so much overhead!

1 comment:

Unknown said...

I rewrote the function that contained the foreach() to be more "R-like". I.e. to use vectors instead of an inner loop. When the inner loop is just doing 10 iterations, this gave slightly quicker timings, but when 60 iterations it is much quicker (4-5 times quicker). In fact the new algorithm runs almost independently of inner-loop iterations!

The downsides were that this was much harder to write, and contained more bugs: I compared results with the old version and 9 times out of 10 when there was a difference the old code was correct. The other 10% of the time it was a deliberate difference that I documented: running with vectors requires thinking about the problem in a different way, and the code cannot be translated exactly (at least not without making the code horribly complex).