Darren's Developer Diary: 2012

Friday, December 21, 2012

EC2: move a large file between Windows instances

Moving a file between linux machines is easy-peasy, just use scp. (You can be sure ssh/scp is on all your linux ec2 machines.) Windows? Sigh, Windows. To get ssh/scp on Windows you need to install cygwin, and that is a non-trivial step to take.

So, how to move a 60GB file from one Windows ec2 instance to another Windows ec2 instance (in a different region)? Here is what I did:

Install CloudBerry Pro on the source server. (Must be the pro version: you can get a 14-day free trial; when that expires it is a $30 cost)[1]
Install CloudBerry free version on the target server.
Test copying over a small file, via your S3 account, to make sure it works. I'm assuming you already know how to use this type of two-pane file-copy application. (I created the bucket in the same region as the target machine: that means the upload takes longer than the download.)
In CloudBerry Pro, Tools menu, then Options, then choose Compression And Encryption tab. Check "Use Compression".[2]
Copy the big file. It gives no progress.
When it had finished it said it was 21% done. Very confusing. And on the server it just showed as a 13GB file, not a 60GB one.[3]
Download to your target server, using CloudBerry free version. (Yes, the free version works fine for downloading large files and compressed files.)
Rename your downloaded file with a ".gz" extension, as that is what it actually is.[3]
Install 7-zip, if you don't have a program that can deal with gzip files. It tells me the file is 2GB compressed, 13GB uncompressed. Ignore that, it is just being stupid. Decompress it, and you get a 60GB file.

Phew, hard work. If you needed to do it regularly you should install cygwin and use scp! <soap-box>Or port your applications to linux where the living is easy. Apart from the fact that running Windows machines is harder, linux cloud instances are also significantly cheaper to run.</soap-box>

[1]: I've heard, but not confirmed, that you can uninstall it from one machine, then use the same install key on a different machine. If true, that is quite a fair license, and I encourage you to support them.

[2]: As we saw, this creates more work, so uncheck after doing your big file.

[3]: I think CloudBerry Pro should have put a .gz extension on the file, when it uploaded it, to make it clear what was going on.

Saturday, December 15, 2012

g++ linker: behaviour change

I was working late on a Friday night, setting up an Ubuntu 12.04 machine. Final step was compiling a couple of C++ programs, that worked fine on Ubuntu 10.04, 11.04 and Centos 5. I got linker complaints regarding boost::program_options. Sounded like it wasn't installed. Strange, as I'd installed "libboost-all-dev".

Poke around... all the files seem to be there. The lib files are in /usr/lib/. Surely g++ is already looking there, but I added "-L/usr/lib/" to be sure. No help. I'd tried all the various ideas I found on StackOverflow, until eventually I found http://stackoverflow.com/a/11250976/841830 where "panickal" suggested the list of libraries has to come last. Yeah sure. But getting desperate by this point I give it a try... and it works!

Specifically, in my Makefile, I had to change this line:
    $(TARGET): $(OBJS)
    $(CXX) -s $(LDFLAGS) $(OBJS) -o $@

To look like this:
    $(TARGET): $(OBJS)
    $(CXX) -s $(OBJS) -o $@ $(LDFLAGS)

So somewhere between g++ 4.5.2 and g++ 4.6.3 it has gone from being easy-going and taking parameters in any order, to requiring certain ones to come at the end. A strange evolution. (I understand why the ordering amongst the library files matters, I just don't see why they now have to all go together at the end.)

But luckily I was using a Makefile with some structure, so the fix was trivial, and my Friday night did not become a Saturday morning!

Tuesday, November 20, 2012

The cloud and Wall Street

An enjoyable video on how Wall Street uses the cloud, and why generally they don't: http://www.infoq.com/presentations/Cloud-Wall-Street

If you only have one minute, here is my summary:

Yes, banks could not just save money but also make their development more nimble and perhaps even more reliable, by moving to the cloud. But because they have loads of legacy systems that integrate in complex ways, there is the element of "if it ain't broke then don't try to fix it." An even bigger reason is moving would be a major project that would distract energy at all levels of the company from their real business of making money. They'd rather make money from increasing sales than make money from reducing costs.

If you have 62 minutes, and have an interest in the intersection of IT and finance, the whole thing is worth your time.
If you have less time, and are interested in why they should be using the cloud more, that is 35:30 to 39:00.
If you want to understand why they don't, 32:00 to 35:30, and questions from 39:00 to 44:00. Then 48:00 to 52:00.
If you are interested in Apache Ambari, and creating your own clouds, that is 52:00 to 60:00; it is only loosely related to the main theme of the talk.
The question at 60:00 on application rot is interesting.

Note: he generally uses cloud in the sense of virtualization on heavy-duty hardware, that you own and install. (As opposed to the sense of compute units running out there somewhere, that you pay for by the hour, and that you can start and stop just when you need them.)

Thursday, November 1, 2012

Problems with .net4 regasm.exe on shared disk

I've been struggling trying to use the .NET4 version of regasm on a C#4 DLL. The error was:

    RegAsm : error RA0000 : Could not load file or assembly '...' or one of its dependencies. Operation is not supported.

It turns out this is a security thing, due to not trusting my network drive (where the DLL is kept).

The fix was to go to c:\windows\Microsoft.NET\Framework\v4.0.30319\ and open regasm.exe.config in a text editor. Mine now looks like this (the line that does the work is loadFromRemoteSources, which tells the runtime to trust assemblies loaded from remote locations such as a network share):

<configuration>
    <startup useLegacyV2RuntimeActivationPolicy="true">
        <requiredRuntime imageVersion="v4.0.30319" safemode="true" version="v4.0.30319"/>
        <supportedRuntime sku="client" version="v4.0"/>
    </startup>
    <runtime>
        <loadFromRemoteSources enabled="true"/>
    </runtime>
</configuration>

I didn't have to do this with the .NET2 version of regasm.

By the way, if you get an error complaining about "CoCreateInstance failed with error 0x80040154", that (in my case) was due to having previously registered this DLL as a .NET2 assembly with the .NET2 version of regasm. Recompiling the DLL for either .NET3 or .NET3.5 was fine, but targeting .NET4 gave that error. Hence the need to register it again using v4 of regasm.

Monday, October 15, 2012

Various R Tips

#1 How to validate a commandline argument only uses allowed values?

Say we have a commandline R script, where the first argument is allowed to be a csv list. But that csv list can only contain 'A', 'B' or 'C'. E.g. "myscript.R A" is valid, as is "myscript.R C,B", as is "myscript.R A,C,A,B,C", but not "myscript.R X", or "myscript.R A,B,X,C,B,A"

    my_list=strsplit(argv[1],',')[[1]]
    if(! all (my_list %in% c('A','B','C') ) ) quit()

The first line is the idiom for cracking open a csv list, and getting a vector of character strings. The [[1]] at the end is just noise, live with it.

The middle part of the second line is:
    my_list %in% allowed_values

It returns a vector of logical values the same length as my_list. What we want is for all the elements to be TRUE (which means all the items in our csv list are in the allowed_values list). If there is even a single FALSE (meaning the user gave at least one bad value in the csv list) then we fail, and in this example we call quit(). (See next tip for something less crude than calling quit() with no explanation!)
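For anyone who, like me, thinks in C++ first, here is a rough sketch of the same "split, then check every item is allowed" logic. The function names split_csv and all_allowed are made up for this illustration; it is only meant to show the shape of the check, not to replace the R two-liner.

```cpp
#include <algorithm>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split "A,C,B" into {"A","C","B"} (the strsplit step).
std::vector<std::string> split_csv(const std::string& arg) {
    std::vector<std::string> items;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) items.push_back(item);
    return items;
}

// True only if every item in the csv list is in the allowed set
// (the all(my_list %in% allowed_values) step).
bool all_allowed(const std::string& arg, const std::set<std::string>& allowed) {
    const std::vector<std::string> items = split_csv(arg);
    return std::all_of(items.begin(), items.end(),
        [&allowed](const std::string& s) { return allowed.count(s) > 0; });
}
```

A caller would do something like if(!all_allowed(argv[1], allowed)) return 1; which is the equivalent of the quit() above.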

Thursday, September 13, 2012

Microsoft Azure cloud hosting: vapourware??

I blogged before about Microsoft being a surprise new player in the linux IaaS cloud arena. Well, yesterday, I had a burning need for a new virtual server, and what's more I needed a Windows server. So, I decided to take this chance to evaluate Azure. Went through the prices again, watched a couple of setup videos: it all looks competitive, and looks like it might be easier to manage Windows cloud instances than on Amazon EC2.

I signed up (created a LiveID). Then had to give my address and credit card details, to apply for their 90 day free trial. No problems there, though one minor gripe: Japanese postcodes are three digits, then four more digits. It refused "123", and then it refused "1230001". You have to put it in as "123-0001".

Then it tells me "Setting up your Windows Azure subscription". It takes forever, then after two very long minutes it comes back and says: "Sorry! We could not activate this feature. Please contact support." Gulp, I just gave these cowboys my credit card.

However going to my account page shows the 90 day free trial is activated. OK, we're rolling... no, we're not, I click add subscription, end up in that same long "Setting up your Windows Azure subscription" screen. But this time it does something different after 30 seconds, and it looks like my account and trial are activated. (Incidentally I ended up with two emails, both telling me my credit card has been charged for $0.00)

It takes quite a bit of clicking around to find the screen where you create new instances - no link from inside the Account page, as far as I can see. Anyway, once there I get told I cannot create a virtual server without signing up for the "preview program". First mention of that!! So, I sign-up for it and I get told:
We are sorry, but we could not complete that operation.

I then click the "portal" button in top right and end up at a page that tells me I've been accepted to the "preview"?!

That then takes me to a page where I still cannot start a virtual server. I get told I need to sign up for the preview program. Umm, didn't we just do this?

Logged off, on again.

This time, at the top I see a green "preview" button. It tells me the interface I'm trying to use is a crippled new version, the old version is the one that actually works. (I'm paraphrasing.) I click that and get told to install silverlight. Silverlight?!!

FAIL.

Off to Amazon EC2, and my Windows instance was up in 20 minutes (which is still far too long, I can get a Linux instance running to the same level of usefulness in 2-3 minutes, but that is a rant for another time... at least Amazon are not wasting my time telling me they have a service that in fact they don't have.)

<- vs. = in R

One of the first things to strike a programmer new to R is <- all over the place. "Ugly!" may be a first reaction. But probe a bit deeper and you'll find you can use = just as well. "So why bother with <- at all?!" is a common reaction. The most pragmatic advice I've seen from experienced R programmers is it really just comes down to personal preference.

As I'm coming from experience in C++, PHP, Javascript, python, java, (the list goes on), I use = in my R code. Except in one case, which I'll come to in a moment. I've used = in almost all my R-related blog posts and StackOverflow questions and answers and no-one has ever taken offence (AFAIK), shunned me (AFAIK), or done an edit to change them all to <-. So it appears to be acceptable.

Downsides

But there are three downsides I can think of, one important, one social and one bogus.

The first downside is R also uses the single equals sign for a named parameter assignment in a function call. This generally doesn't matter because using assignment in parameters is bad form in C-like languages, so I don't get confused. Just about the only time it ever matters to me is timing code. I *have* to write:

timing = system.time( x <- do_calculation(a=1,b=2) )

If I write the following then x won't get assigned:

timing = system.time( x = do_calculation(a=1,b=2) )

The second downside is the R community considers <- to be standard. So all packages use it, all books use it. If you want to be part of the in-crowd you have to use it too.

The third downside is it is easier to use search-and-replace to convert all your "<-" to "=", than it is to go the other way. But this is suspicious: the above example shows why you cannot do "<-" to "=" completely automatically anyway.

Upsides

Are there any upsides to preferring = to compensate? Yes, though they'll sound petty to people who believe using <- is the Only Way. First it is one less keystroke. Second, in comparisons, this does not work:

if(x<-5)cat("x is less than minus five\n")

So you must put spaces around < and > in R; it is not just a style thing as it is in other languages.

The third reason is when I'm using <- it is communicating intent: I'm deliberately doing an assignment to a variable in the parameter list of a function call. As that is considered bad form in most languages, I like how it stands out.

I've saved the biggest upside for last: familiarity to programmers coming from any of the other C-like languages. (R is a C-like language too.)

Comparisons

You thought I'd mention ==, and how it can be confused with = ? That potential confusion exists in all C-like languages, and we just know to look out for it. And, anyway, it still exists in R:

f(x=1) vs. f(x==1)

When I write one of those, did I mean the other?

Language Design

If I was designing R from scratch, how would I do it differently? I love how I can assign to named parameters in R - it is perhaps the most beautiful feature of the language. But it is an overload of the = sign, and one not found in other C-like languages, so I'd be tempted to change it to use ":" (it looks a bit like object notation in Javascript, but without the curly brackets). So the above example would become:

   timing = system.time( x = do_calculation(a:1,b:2) )

Notice how I can use "x=" safely, because it now has no other meaning. Also notice how this helps with the f(x=1)   vs.    f(x==1) confusion too. But I'm contradicting my comment above, about liking the way <- stands out. So maybe I want it to look like this:

   timing = system.time( x <- do_calculation(a:1,b:2) )

Now a lint tool can warn about use of a single equals sign in a function call, because there should never be one. Hhhmmm, I'm not convinced but it is something to chew on.

Do you have any thoughts or constructive criticism? Please add a comment.

Monday, September 10, 2012

There are SPOFs hiding everywhere!

One of my test systems sent me a few hundred emails between 2:38 and 7:45am JST. Just a test server so it didn't go to my phone and I noticed the problem around 7:30am. No paying clients were affected.

I dived straight in and found my Rackspace UK box couldn't find api.qqtrend.com. But DNS lookup worked for me. Other DNS was working on the Rackspace box. I also found it couldn't ping the DNS server (at GoDaddy). But I could from my office LAN. And I also could from a U.S. server.

Conclusion (totally wrong - see below): Rackspace UK data centre had issues.

So I logged in to Rackspace, to check for alerts, and post a support ticket. There was a mention, phrased very vaguely, saying "someone else" has DNS problems. Hhmmm. By this time I could ping the GoDaddy server, but DNS lookup still failed. Uncertain I pinged around a bit more, and by that time DNS had started working.

I.e. The underlying problem had already been fixed, it was just taking time to spread through the internet, and I could have "fixed it" with no effort by just staying in bed 15 minutes longer. Oh well.

Anyway, it turns out it was a sociopath attacking GoDaddy: http://www.bbc.co.uk/news/business-19549367

Here are his reasons:
"i'm taking godaddy down bacause well i'd like to test how the cyber security is safe and for more reasons that i can not talk now." (sic)

OK, English may not be his first language. But even allowing for that, he is not coming across as an upstanding member of society. Not protesting, no cause, just wanting to see if he had the skills to annoy a lot of people. This is a guy (gal?) who badly needs a girlfriend/boyfriend. If you know him, introduce him to someone. Please.

I know GoDaddy had a big PR screw-up by initially supporting SOPA, but they had the courage and sense to listen to people and change their position. Still a good company in my mind.

But, the silver lining is it nicely illustrated we (QQ Trend) have a Single Point Of Failure, at the DNS and registrar level, that had been overlooked. We'd previously got server1 and server2 as the two endpoints. We have them in different continents, and different cloud providers (no secret: Amazon and Rackspace). And I thought that was solidity to boast about. I was about to add server3 in a third continent as an option for customers who really, really need 100% uptime.

But what today's problem reveals is that if all three servers are on the same domain: server1.example.com, server2.example.com, server3.example.com, then we have a potential issue with DNS, and even with the registrar.

I think we need an alternative domain, at a completely different registrar, with DNS at an independent ISP. Then at the script level we add that in as one of the failover endpoints. They'll point to the same three servers.
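To make the "failover endpoints at the script level" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the function name first_reachable, the probe callback (which stands in for whatever "can I resolve and connect to this host?" check the client really uses), and the second domain in the usage note.

```cpp
#include <functional>
#include <string>
#include <vector>

// Try each endpoint in order and return the first one the probe accepts.
// Returns an empty string if every endpoint fails.
std::string first_reachable(
        const std::vector<std::string>& endpoints,
        const std::function<bool(const std::string&)>& probe) {
    for (const std::string& host : endpoints) {
        if (probe(host)) return host;
    }
    return std::string();
}
```

The endpoint list would hold the usual names first (server1.example.com, server2.example.com, ...) followed by the same servers under the alternative domain (say server1.example.net). If DNS for the first domain is down, the probe fails for every name on it and the loop falls through to the second domain.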

For instance, however big Amazon or GoDaddy (or any infrastructure provider) get, however many data centres they have around the world, they are still open to attacks, politics and human error inside their organization. We're service providers building on top of their infrastructure. It is our job to accept their limitations and do something about it.

Thursday, August 23, 2012

Small R/XTS Code Snippets And Tips

1. Splitting by week

The split function in xts can split your data into weeks: split(data,f="weeks")
Under the covers it uses endpoints, which in turn uses a C function of the same name in the XTS package. The problem I discovered is that it considers the start of the week as Monday, 00:00:00, and the end of the week as Sunday, 23:59:59. This is a problem because the FX markets start at 9pm or 10pm on a Sunday night in the UTC/GMT timezone (at Sunday, 5pm in New York timezone).

What we want is for it to treat Sunday 00:00:00 as the start of the week. That gives us good buffer on both sides (about 24hrs after Nasdaq after-market hours finish, and about 21hrs before the FX markets open). If you look at the source of endpoints (just type endpoints in an interactive R session) you'll see it has the magic constant: 3L * 86400L. After some trial and error I found we want that to be 4L not 3L. Though it may be brittle, as you're using XTS internals, you can use this line in your code:

ep= .Call("endpoints", .index(data)+4L*86400, 604800L, k=k, PACKAGE="xts")

(I don't suppose the XTS package can consider this a bug, as there may be people relying on the current "weeks" behaviour; but maybe they could add a "sunweeks" option to endpoints and split? What do you think?)
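Why does changing 3L to 4L move the week boundary from Monday to Sunday? My understanding (an assumption from reading the constant, not from the XTS documentation) is that the timestamp is shifted by a few days and then integer-divided by the number of seconds in a week. The epoch, 1970-01-01, was a Thursday, so a shift of three days lines the division up with Monday, and four days lines it up with Sunday. The sketch below, with a made-up function name week_number, shows just that arithmetic:

```cpp
#include <cstdint>

const std::int64_t SECS_PER_DAY  = 86400;
const std::int64_t SECS_PER_WEEK = 604800;  // 7 * 86400

// Week number for a UTC epoch timestamp (seconds since 1970-01-01, a Thursday).
// offset_days = 3 puts the week boundary at Monday 00:00;
// offset_days = 4 puts it at Sunday 00:00.
// Assumes epoch_secs >= 0, so integer division rounds down.
std::int64_t week_number(std::int64_t epoch_secs, int offset_days) {
    return (epoch_secs + offset_days * SECS_PER_DAY) / SECS_PER_WEEK;
}
```

With an offset of 3, Sunday 1970-01-04 00:00:00 lands in the same week as the Saturday before it; with an offset of 4 it starts a new week, which is the behaviour we want for FX data.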

P.S. A minor optimization for the XTS endpoints function:
if (on %in% c("years", "quarters", "months", "weeks", "days"))
could read:
if (on %in% c("years", "quarters", "months", "days"))
because weeks is not using posixltindex, and it takes CPU time to generate.

2. Put an XTS object in a string

Say you want to show the first 5 rows of an xts object in a string. The first trick you need is capture.output(). This takes the print() output and puts it in a string. But it returns each row as an entry in a vector. So we'll use paste() to convert that to a single string. Here is our code:

msg = paste("Here are the first 5 rows:",paste(capture.output(x[1:5,]),collapse="\n") )

See item 3 for an example of this.

3. Show me just the NA rows

Here is some test data:
    library(xts)
    x=xts( data.frame(a=c(1,2,3,NA,5,6), b=c(100,99,NA,NA,NA,95)),
      as.Date("2012-10-01")-6:1)
Which looks like:

                a   b
    2012-09-25  1 100
    2012-09-26  2  99
    2012-09-27  3  NA
    2012-09-28 NA  NA
    2012-09-29  5  NA
    2012-09-30  6  95

Imagine there should be no NAs at all, and you want to print the offending rows in an error message. This is the command to just show the rows with no NAs:

    x[complete.cases(x),]

And therefore just the rows with NAs:

    x[!complete.cases(x),]

So, to print a fatal error message that shows the NA rows:

    if(any(is.na(x))){
      stop( paste("We have NAs:",
        paste(capture.output(x[!complete.cases(x),]),collapse="\n")
        ))
      }

(see Item 2, "Put an XTS object in a string", if the second half of that line looks scarier than a rabid dog who has just bitten through his leash.)

4. Loop through as key=>value

In many languages there is a foreach(container as key=>value) type construct. I've not found something so compact for XTS objects, so I use this:
    for(ix in 1:nrow(x)){
      datestamp=index(x)[ix]
      b=x$b[ix] #Or, b=coredata(x$b)[ix]
      #print(datestamp);print(b)
      }

I've shown two ways of setting 'b'. The first way leaves 'b' as an XTS object. In the second way 'b' is a number. If using the latter, I'm sure it is more efficient to do the coredata() call before the loop. I.e. like this:
    xcd=coredata(x$b)
    for(ix in 1:nrow(x)){
      datestamp=index(x)[ix]
      b=xcd[ix]
      #print(datestamp);print(b)
      }

5. Merging, but excluding values only in one xts object

Here is some test data:
library(xts)
Sys.setenv(TZ = "UTC")
d=xts(1:5,Sys.Date()+(2:6))
ix=Sys.Date()+(1:5)

So, d is our data, whereas ix lists the days we should have data for.
ix looks like:
[1] "2013-01-10" "2013-01-11" "2013-01-12" "2013-01-13" "2013-01-14"

and d looks like:
             [,1]
2013-01-11    1
2013-01-12    2
2013-01-13    3
2013-01-14    4
2013-01-15    5

In other words, d is missing an entry for 2013-01-10, so we need to add an NA entry for it. But also d has an entry for 2013-01-15, which it shouldn't have yet. (In a finance context it might be a bar that we are still collecting ticks for, so we don't have final values for yet; in a business context it might be a value that has not been approved by management for release yet.)

When we do merge(d,xts(,ix),all=F) we get:
           d
2013-01-11 1
2013-01-12 2
2013-01-13 3
2013-01-14 4

When we do merge(d,xts(,ix),all=T) we get:
            d
2013-01-10 NA
2013-01-11  1
2013-01-12  2
2013-01-13  3
2013-01-14  4
2013-01-15  5

Neither is what we want. The solution is merge(d,xts(,ix),join='right') which gives us:
            d
2013-01-10 NA
2013-01-11  1
2013-01-12  2
2013-01-13  3
2013-01-14  4

6a. Combining two xts objects that have same column but different timestamps

Here is some test data:
library(xts)
Sys.setenv(TZ = "UTC")
a=xts(1:5,as.Date("2013-02-01")+1:5);colnames(a)="v"
b=xts(0:-2,as.Date("2013-02-01")-0:2);colnames(b)="v"

When we do merge(a,b) we get:
            v v.1
2013-01-30 NA  -2
2013-01-31 NA  -1
2013-02-01 NA   0
2013-02-02  1  NA
2013-02-03  2  NA
2013-02-04  3  NA
2013-02-05  4  NA
2013-02-06  5  NA

You can play with various values for the join parameter, but it won't help. The solution is rbind(a,b):
            v
2013-01-30 -2
2013-01-31 -1
2013-02-01  0
2013-02-02  1
2013-02-03  2
2013-02-04  3
2013-02-05  4
2013-02-06  5

What if both a and b have a timestamp in common? Then you get two rows. In that case you can remove the duplicates afterwards:
x=rbind(a,b)
x=x[!duplicated(index(x)),]

The duplicate from b is the one that gets removed. If you wanted the one from a to be removed instead, then pass fromLast = TRUE to duplicated(), or simply do rbind(b,a) !
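If the "keep the first row for each timestamp" rule is not obvious, here is the same idea sketched in C++. This is an analogy only, not how xts implements it: the Row type and the function name combine_keep_first are invented for the illustration, and it assumes (as described above) that rows from the first argument come before rows from the second when timestamps tie.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<long, std::string> Row;  // (timestamp, value)

// Concatenate a then b, order by timestamp, and keep only the first
// row seen for each timestamp (so a wins over b on a tie).
std::vector<Row> combine_keep_first(const std::vector<Row>& a,
                                    const std::vector<Row>& b) {
    std::vector<Row> out(a);
    out.insert(out.end(), b.begin(), b.end());
    // stable_sort preserves the a-before-b order among equal timestamps.
    std::stable_sort(out.begin(), out.end(),
        [](const Row& l, const Row& r) { return l.first < r.first; });
    // unique keeps the first element of each run of equal timestamps.
    out.erase(std::unique(out.begin(), out.end(),
        [](const Row& l, const Row& r) { return l.first == r.first; }),
        out.end());
    return out;
}
```

Swapping the arguments, combine_keep_first(b, a), makes b win instead, which mirrors the rbind(b,a) trick.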

6b. Combining two xts objects that have same columns but different timestamps

This follows on from 6a, but for when the xts objects have more than one column. Here is some test data:
library(xts)
Sys.setenv(TZ = "UTC")
a=xts(data.frame(x=1:5,y=5:9),as.Date("2013-02-01")+1:5)
b=xts(data.frame(x=0:-2,y=10:12),as.Date("2013-02-01")-0:2)

Again, rbind(a,b) is the answer:
            x  y
2013-01-30 -2 12
2013-01-31 -1 11
2013-02-01  0 10
2013-02-02  1  5
2013-02-03  2  6
2013-02-04  3  7
2013-02-05  4  8
2013-02-06  5  9

Sunday, July 1, 2012

Mock The Socket, in PHP

I wanted to put a unit test around some PHP code that uses a socket, and of course hit a problem: how do I control what a call to fgets returns? You see, in PHP, you cannot replace one of the built-in functions: you get told "Fatal error: Cannot redeclare fgets() ...".

Rename Then Override Then Rename Again!

I asked on StackOverflow, not expecting much response, but almost immediately got told about the rename_function(). Wow! I'd never heard of that before. The challenge then was that this is in the apd extension, which was last released in 2004 and does not support php 5.3. I've put instructions on how to get it installed on the StackOverflow question so I won't repeat them here.

The next challenge you'll meet is that naively using rename_function to move a function out of the way fails. You still get told "Fatal error: Cannot redeclare fgets() ..." ?! You need to use override_function to replace its behaviour. All Done? Not quite, what you discover next is that you can only override one function. Eh?! But all is not lost: the comments in the marvellous PHP manual described the solution, which goes like:

Use override_function to define the new behaviour
Use rename_function to give a better name to the old, original function.

However, when you go to restore a function (see further down), it turns out that does not work. What you actually need to do is:

Use rename_function to give a name to the old, original function, so we can find it later.
Use override_function to define the new behaviour
Use rename_function to give a dummy name to __overridden__

You do those three steps for each function you want to replace. Here is a complete example that shows how to override fgets and feof to return strings from a global array. NOTE: this is a simplistic example; I should really be overriding fopen and fclose too (they'd be set to do nothing).

$GLOBALS['fgets_strings']=array(
    "Line 1",
    "Line 2",
    "Line 3",
    );

rename_function('fgets','real_fgets');
override_function('fgets','$handle,$length=null','return array_shift($GLOBALS["fgets_strings"])."\n";');
rename_function("__overridden__", 'dummy_fgets');

rename_function('feof','real_feof');
override_function('feof','$handle','return count($GLOBALS["fgets_strings"])==0;');
rename_function("__overridden__", 'dummy_feof');

$fname="rename_test.php";
$fp=fopen($fname,"r");

if($fp)while(!feof($fp)){
    echo fgets($fp);
    }
fclose($fp);

Mock The Sock(et)

So, what about the original challenge, to unit test a socket function? Here is some very minimal code to request a page from a web server; let's pretend we want to test this code:

$fp=fsockopen('127.0.0.1',80);

fwrite($fp,"GET / HTTP/1.1\r\n");
fwrite($fp,"Host: 127.0.0.1\r\n");
fwrite($fp,"\r\n");

if($fp)while(!feof($fp)){
    echo fgets($fp);
    }

fclose($fp);

To take control of its behaviour, we prepend the following block; the above code does not have to be touched at all.

rename_function('fwrite','real_fwrite');
override_function('fwrite','$fp,$s','');
rename_function("__overridden__", 'dummy_fwrite');

rename_function('fsockopen','real_fsockopen');
override_function('fsockopen',
    '$hostname,$port=-1,&$errno=null,&$errstr=null,$timeout=null',
    'return fopen("socket_mock_contents.txt","r");'
    );
rename_function("__overridden__", 'dummy_fsockopen');

I.e. We replace the call to fsockopen with a call to fopen, and tell it to read our special file. (If you don't want to use an external file, and instead want a fully self-contained test, with the contents in a string, you could use phpstringstream or, if you don't want Yet Another Dependency, you could write your own, as the code is fairly short and straightforward.)

The other thing to note about the above code is that we had to replace fwrite as well. This is needed because we're creating a read-only stream to be the stand-in for a read-write stream. If you are using other functions (e.g. ftell or stream_set_blocking) you will need to consider if those functions need a mock version too.

The Fly In The Ointment

The thing about replacing a global function is that you're replacing a global function, as in, globally! Any other code that calls that function is going to call your mock version. Maybe we can get away with this with fsockopen, but it becomes quite a major problem if you are replacing things like fgets or fwrite, in a phpUnit unit test, as phpUnit is quite likely to call those functions itself!

So, we want to restore the functions once we're done with them? It is very easy to get this wrong and get a segfault. You must use rename_function before using override_function, the first time. The following code has been tested and restores the behaviour of the above example:

override_function('fwrite','$fp,$s','return real_fwrite($fp,$s);');
rename_function("__overridden__", 'dummy2_fwrite');

override_function('fsockopen','$hostname,$port','return real_fsockopen($hostname,$port);');
rename_function("__overridden__", 'dummy2_fsockopen');

Notice how we still need to deal with __overridden__ each time.

Food For Thought

There is another approach, which might be more robust. It involves checking inside each overridden function if this is the stream you want to be falsifying data for, and whenever it isn't you call the original versions. Here I'll show how to just do that with the fwrite function:

$GLOBALS["mock_fp"]=null;

rename_function('fwrite','real_fwrite');
override_function('fwrite','$fp,$s','
if($fp==$GLOBALS["mock_fp"]){echo "Skipping a fwrite.\n";return;}   //Do nothing
return real_fwrite($fp,$s);
');
rename_function("__overridden__", 'dummy_fwrite');

rename_function('fsockopen','real_fsockopen');
override_function('fsockopen','$hostname,$port=-1,&$errno=null,&$errstr=null,$timeout=null',
    'return $GLOBALS["mock_fp"]=fopen("socket_mock_contents.txt","r");');
rename_function("__overridden__", 'dummy_fsockopen');

Then we can test it, as follows:
$fp=fsockopen('127.0.0.1',80);

fwrite($fp,"GET / HTTP/1.1\r\n");
fwrite($fp,"Host: 127.0.0.1\r\n");
fwrite($fp,"\r\n");

if($fp)while(!feof($fp)){
    echo fgets($fp);
    }

fclose($fp);

$fp=fopen("tmp.txt","w");
fwrite($fp,"My output\n");
fclose($fp);

Thursday, June 28, 2012

Why did my std::runtime_error turn into an unknown exception?!

Been debugging some C++ code. It started off with a program just dying. I narrowed that down to an uncaught exception. I (eventually) narrowed that down to my own code throwing a std::runtime_error. Then much head-scratching ensued, as I was catching it in the catch( ... ) block instead of the catch (std::exception& e) block. I was even more surprised when I could reproduce it in an example that is as minimal as you can get. There is a bug in this code. Give yourself 20 seconds to see if you can spot it before moving on:

#include <iostream>
#include <stdexcept>

int main(int,char**){
try{
    throw new std::runtime_error("Hello!");
}catch (std::runtime_error& e){
    std::cerr<<"Runtime Error:" << e.what() << "\n";
}catch (std::exception& e){
    std::cerr<<"Exception:" << e.what() << "\n";
}catch (...){
    std::cerr<<"Unknown Exception\n";
}

return 0;
}

.
.
.

I'll give you a clue: if, like me, you also spend a lot of time in PHP you will find spotting the bug a lot harder.

.
.
.

Yes, it's that new keyword. In PHP you throw exceptions with throw new ErrorClass. If you do that in C++ you are throwing a pointer. Pointers have to be caught as pointers, not as the class they point to. (And, to quote from item 13 in More Effective C++: "Furthermore, catch-by-pointer runs contrary to the conventions established by the language itself.") I knew that, and never intended to throw by pointer; I wish g++ gave a warning for it.

So, to answer the question at the top: my std::runtime_error did not turn into an unknown exception; my std::runtime_error* did !!
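To see the difference side by side, here is a small sketch that throws the same exception both ways and reports which catch clause handled it. The function name which_handler and the returned labels are just for this illustration. Note the extra catch clause for std::runtime_error*: that is the only typed handler that matches the accidental throw new.

```cpp
#include <stdexcept>
#include <string>

// Returns a label for the catch clause that handled the exception.
// by_pointer == true reproduces the accidental "throw new" from above.
std::string which_handler(bool by_pointer) {
    try {
        if (by_pointer) {
            throw new std::runtime_error("Hello!");  // throws std::runtime_error*
        }
        throw std::runtime_error("Hello!");          // throws the object itself
    } catch (const std::runtime_error&) {
        return "runtime_error&";
    } catch (const std::exception&) {
        return "exception&";
    } catch (std::runtime_error* e) {
        delete e;  // allocated with new by the thrower, so we must free it
        return "runtime_error*";
    } catch (...) {
        return "unknown";
    }
}
```

Remove the std::runtime_error* clause and the by-pointer case falls through to catch(...), which is exactly the "Unknown Exception" I was seeing.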

Thursday, June 7, 2012

Push One Git Branch To A Different Server

I have a GitHub repository, for a library, with two branches on GitHub: master and custom. I then have a third branch, which is application code (including information such as passwords) that I don't want to go on GitHub. I want to put that third branch on another server.

Before I started my .git/config looked like:

    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
    [remote "upstream"]
        url = git@github.com:fennb/phirehose.git
        fetch = +refs/heads/*:refs/remotes/upstream/*
    [branch "master"]
        remote = upstream
        merge = refs/heads/master
    [remote "origin"]
        url = git@github.com:DarrenCook/phirehose.git
        fetch = +refs/heads/*:refs/remotes/origin/*

git branch tells me

      custom
      master
    * application

As in my previous article on setting up a remote git repository, I'll assume three machines are involved:

devel: where most development is done
store: the central repository server
www: the web server, where I want to checkout the branch

So, first is preparation. On store:
    cd /var/git/
    mkdir phirehose.git
    cd phirehose.git
    git init --bare

Next create an alias for it on devel:
    git remote add application store:/var/git/phirehose.git/
(Note: "store" is in my ssh config: it is an alias that covers server URL, username, port, etc.)

Then (still on devel machine) I need to type:
    git config push.default current

Now check that worked by typing 'git remote -v'
    origin   git@github.com:DarrenCook/phirehose.git (fetch)
    origin   git@github.com:DarrenCook/phirehose.git (push)
    application    store:/var/git/phirehose.git/ (fetch)
    application    store:/var/git/phirehose.git/ (push)
    upstream   git@github.com:fennb/phirehose.git (fetch)
    upstream   git@github.com:fennb/phirehose.git (push)

And in .git/config these lines will have been appended at the end:
    [remote "application"]
        url = store:/var/git/phirehose.git/
        fetch = +refs/heads/*:refs/remotes/application/*
    [push]
        default = current

Now I can upload the branch with:
    git push application

Rush over to 'store' server, and you should see the files in /var/git/phirehose.git/

Now we need to do a quick fix; I'm going to go out on a limb and say this is a git bug. But, anyway, if you look inside the HEAD file you see:
    ref: refs/heads/master

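But refs/heads/master does not exist, because only the application branch was pushed! Edit HEAD (or run git symbolic-ref HEAD refs/heads/application inside the bare repository) so it looks like: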
    ref: refs/heads/application

Now you can go to your 'www' machine and do a "git clone" command.
On your 'www' machine if you type git branch you will see just:
   * application

UPDATE:
One last thing. Now you have 2+ remote repositories, on your 'devel' machine, I recommend you type:
    git config push.default nothing

This means that a bare git push will no longer work: you always have to specify what you want to push. This is a safety catch to prevent you pushing application to the public GitHub. (I did, and there is no way back, so my only choice was to delete the entire GitHub repository!) So, when I'm in the application branch and I want to push to store I type:
         git push application HEAD

See this StackOverflow thread for some background (including a note that default git behaviour might change from 1.7.10+)

Monday, May 28, 2012

php-webdriver bindings for selenium: how to add time-outs

Not all webpages finish loading. In particular I've a page that keeps streaming data back to the client, and never finishes. (For instance it might be used from an ajax call.) I want to test this from Selenium, but have been hitting problems. The main problem is Selenium's get() function, which is used to fetch a fresh URL, does not return until the page has finished loading [1]. In my case that meant never, and so my test script locked up!

However all is not lost; you can specify a page load timeout. It is hidden in the protocol docs, but I've added it to the php webdriver library I use (v0.9). See the three functions below [2]; just paste them in to the bottom of WebDriver.php.

I also needed one bug fix in WebDriver.php's public function get($url). It currently ends with:
    $response=curl_exec($session);

Just after that line you should add this:
    return $this->extractValueFromJsonResponse($response);

The time-out, and that bug fix, can be used like this:

require_once "/usr/local/src/selenium/php-webdriver-bindings-0.9.0/phpwebdriver/WebDriver.php";
$webdriver = new WebDriver("localhost", "4444");
$webdriver->connect("firefox");
$webdriver->setPageLoadTimeout(2000);   //2 seconds
$url="http://example.com/forever.php"; //A page that never finishes loading
$obj=$webdriver->get($url);
if($obj===null){
    $current_url=$webdriver->getCurrentUrl();
    if(!$current_url){
        //Selenium-server not running
        }
    else{
        //It worked! (it completed loading in under two seconds)
        }
    }
elseif($obj->class=='org.openqa.selenium.TimeoutException'){
    //It timed out
    }
elseif($obj->class=='org.openqa.selenium.remote.UnreachableBrowserException'){
    //Browser was closed (or selenium-server was shutdown)
    }
else{
    echo "FAILED:";print_r($obj);
    }

This is useful stuff. There is still one problem left for me: I wanted to load two seconds worth of data and then look at it. But I cannot. The browser refuses to listen to selenium while it is loading a page! So though get() returned control to my script after two seconds, I cannot do anything with that control (except close the browser window), because the URL is still actually loading. And it will do that forever!! (I've played with an interesting alternative approach, which also fails, but suggests that a solution is possible. But that is out of the scope of this post, which is to show how to add the time limit functions to php-webdriver-bindings.)

[1]: This is browser-specific behaviour, not by Selenium design. Firefox and Chrome, at least, behave this way.

[2]: Consider this code released, with no warranty, under the MIT license, and permission granted to use in the php-webdriver-bindings project with no attribution required.

    /**
     * Set wait for a page to load.
     *
     * This timeout is for the get() function. (Firefox and Chrome, at least, won't return from get()
     * until a page is fully loaded. If remote server is streaming content, they would never return
     * without this time-out.)
     *
     * @param Number $timeout Number of milliseconds to wait.
     * @author Darren Cook, 2012
     * @internal http://code.google.com/p/selenium/wiki/JsonWireProtocol#/session/:sessionId/timeouts
     */
    public function setPageLoadTimeout($timeout) {
        $request = $this->requestURL . "/timeouts";
        $session = $this->curlInit($request);
        $args = array('type'=>'page load', 'ms' => $timeout);
        $jsonData = json_encode($args);
        $this->preparePOST($session, $jsonData);
        curl_exec($session);
    }

    /**
     * Set wait for a script to finish.
     *
     * @param Number $timeout Number of milliseconds to wait.
     * @author Darren Cook, 2012
     * @internal http://code.google.com/p/selenium/wiki/JsonWireProtocol#/session/:sessionId/timeouts
     */
    public function setAsyncScriptTimeout($timeout) {
        $request = $this->requestURL . "/timeouts";
        $session = $this->curlInit($request);
        $args = array('type'=>'script', 'ms' => $timeout);
        $jsonData = json_encode($args);
        $this->preparePOST($session, $jsonData);
        curl_exec($session);
    }

    /**
     * Set implicit wait.
     *
     * This is for waiting for page elements to appear. Not useful for scripts or
     * waiting for the initial get() call to time out.
     *
     * @param Number $timeout Number of milliseconds to wait.
     * @author Darren Cook, 2012
     * @internal http://code.google.com/p/selenium/wiki/JsonWireProtocol#/session/:sessionId/timeouts
     */
    public function setImplicitWaitTimeout($timeout) {
        $request = $this->requestURL . "/timeouts";
        $session = $this->curlInit($request);
        $args = array('type'=>'implicit', 'ms' => $timeout);
        $jsonData = json_encode($args);
        $this->preparePOST($session, $jsonData);
        curl_exec($session);
    }

Friday, May 18, 2012

EC2 and Windows: a match made in... Hell

Henry Ford is famous for his lack of flexibility over the Model T: You can have it in any colour you want, as long as it is black.

I think Amazon have taken a leaf out of his book, when offering their Windows instances: You can have any size disk you want, as long as it is 30GB.

Don't believe me? Go on, try automating the creation of a 100GB boot disk Windows machine. Or creating one from the web interface, without any post-configuration steps in Windows itself.

(I'll save you the Google: here are the AWS engineers telling you the multiple steps needed to achieve that. The instructions are different depending on the exact version of Windows.)

In fact, try automating anything to do with Windows configuration, using EC2. You can't. It can't be, won't be, scripted. You always have to log on afterwards (going through the EC2 Console to get your almost-impossible-to-type password) to do something. Usually quite a tedious and time-consuming something.
The bottom line: Windows is not designed for the cloud.

...and yet, some of my clients, and some of my potential clients, insist on trying to use Windows anyway. The cloud is where they feel they should be, so people want to move their legacy apps there. Whenever I ask them how they do it, or why they do it, they seem to find justification. It is the cloud, look we're scaling. We're faster. It works! Like a man let out of prison, and running free in the meadow... only he still has the manacles and chains from his time doing time. Am I carrying my metaphor too far by wishing people would stop and take the Linux axe to the manacles before rushing off to the meadow?

Are you an expert at automating Windows on EC2? Please post a comment showing off your knowledge. I'm willing to learn, and will edit this article if you convince me it can be done :-)

Tuesday, April 10, 2012

Cloud options compared: EC2, Rackspace, HP

I've been participating in HP Cloud's closed beta program, and they have now announced their billing. They use the same cloud system and API as Rackspace, making it a fair comparison. (See Rackspace cloud server pricing and Rackspace cloud file pricing.)

Executive summary: HP have priced themselves slightly lower, but don't have the two smallest server configurations. So, Rackspace is still better than HP or Amazon EC2 if you just need something minimal, as I described here.
(2012-08-15 UPDATE: Rackspace are stopping their minimal server config: no more 256MB option in their "nextgen" V2 API. Also scheduled images are not available in the V2 API (yet).)

HP offer slightly less disk space than Rackspace, for the same memory size. HP's bandwidth is slightly cheaper than Rackspace; incoming bandwidth is free on both.

HP's CDN offering is weird; based on your billing location, rather than where your servers are, or where your customers are! If you are a U.S. or European company it is cheaper. If not it is more expensive. However, for Japan, HK, Singapore it is only a fraction more expensive, so not a showstopper. Also it is tiered: if you're spending more than $200/month on CDN bandwidth it will work out cheaper still.
If your headquarters are not in North America, Europe, Latin America, Japan, Hong Kong or Singapore then HP pricing says they'd rather you took your CDN business elsewhere.

(2012-06-12 UPDATE: I've added a surprise Linux cloud hosting option to the below: Microsoft Azure! They are competitive at the low end: 1GHz CPU, 768MB RAM, 25GB storage, 4GB outbound bandwidth (inbound is free), is $12.51/month (about 1.5c/hr). There is an extra 20% discount for paying for 6 months. They are also reasonable at the high end (except 14GB is the most memory they can offer). (price calculator))

Comparing HP's cheapest option to the closest (based on memory size) alternative:
HP: 1GB RAM, 30GB disk, 1 virtual core: $0.04/hr ($29/month)
Rackspace: 1GB RAM, 40GB disk, 1/16th of a 4-CPU server: $0.06/hr ($43.80/month)
EC2 Small: 1.7GB RAM, 160GB disk, 1 virtual core: $0.08/hr ($58/month)
Azure Small: 1.6GHz CPU, 1.75GB RAM, 25GB disk: $0.08/hr ($60/month)
(NB. the XSmall, with 768MB RAM, 1GHz CPU is $12.50/month)

At the top of the CPU range:
HP: 32GB RAM, 960GB disk, 8 virtual cores: $1.28/hr ($934/month)
Rackspace: 31GB RAM, 1200GB disk, 4-CPUs: $1.80/hr ($1,314/month)
EC2, High-CPU Extra Large: 7GB RAM, 1690GB disk, 8 virtual cores (20 compute units): $0.66/hr ($481/month)
EC2, High-Memory Quadruple Extra Large: 68GB RAM, 1690GB disk, 8 virtual cores (26 compute units): $1.80/hr ($1,314/month)
Azure, XLarge: 8 x 1.6GHz CPU, 14GB RAM, 975GB disk: $0.64/hr ($575/month)

(So, if you are CPU bound then Amazon is best, if memory bound then HP is best.)

Overall, the pricing seems reasonable. However Rackspace have a brand oozing stability and reliability, and Amazon are huge and have data centres all over the world, so I'm not sure HP prices are low enough to worry their competitors. The 50% discount during the beta program makes them a good buy, short-term, though!

Sunday, March 18, 2012

Changing Rackspace Cloud Instance Sizes

Following on from my post about the actual costs of using Rackspace, this article is about the "Actual downtime of changing the size of a Rackspace cloud server".

I wanted to go from a 10GB disk to 20GB disk. I got confused by the interface and accidentally went from 10GB to 40GB. So I then also got to try downsizing from 40GB to 20GB!

Q1. Does my IP address change? Do I need a new SSH key?
A. No and No.

Q2. How long does it take?
A. The 40GB to 20GB downsize took 13 minutes in total. The 10GB to 40GB upsize was about the same or slightly quicker. If you like to be careful and take a disk image beforehand, that takes about another 10-15 minutes (for about 9GB, so allow more if your disk is bigger).

Q3. How much downtime?
A. At least 30 seconds... Prior to the move I saw activity in one program at 06:28:27; the new server was fully running at 06:29:05-ish, and the first new activity in that program was at 06:30:03.

The reason for the almost 60-second delay in that particular program is that it gets started from a monitoring script (that I wrote) that runs as a 1-minute cron job. The cron job didn't run at 06:29:00, so it had to wait until 06:30:00. If I had started my program from the init.d scripts, or even from rc.local, the downtime would only have been 30-40 seconds. That is the same downtime you can expect for a webserver.
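
The arithmetic behind those timings can be sketched as follows. Both helper names are hypothetical, and the sketch assumes cron fires exactly on the minute boundary.

```cpp
// Hypothetical helper: convert hh:mm:ss to seconds since midnight.
int to_seconds(int hours, int minutes, int seconds) {
    return hours * 3600 + minutes * 60 + seconds;
}

// Hypothetical helper: seconds to wait until the next once-a-minute cron run,
// given the time of day in seconds since midnight.
int seconds_until_next_cron_tick(int seconds_since_midnight) {
    const int into_minute = seconds_since_midnight % 60;
    return into_minute == 0 ? 0 : 60 - into_minute;
}
```

With the server back at 06:29:05, seconds_until_next_cron_tick(to_seconds(6, 29, 5)) is 55, so the monitoring script could not restart the program before 06:30:00. The whole gap in that program's activity, to_seconds(6, 30, 3) - to_seconds(6, 28, 27), is 96 seconds.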

Q4. Does it depend on the size of the server?
A. I believe so. Bigger servers would have more files to copy.

Q5. Do in-memory caches survive? Do background processes keep running?
A. No and no. The resize implies a reboot. Any background process you'd started manually before needs restarting manually again when the new server comes up. Use /etc/rc.local or cronjobs to auto-start things.

Q6. Do I need to be around when I resize?
A. Yes, to get it started. You could then go away. You are supposed to verify the new server when it comes up, but I think that if you don't, then after a few hours it assumes you are happy that everything is running smoothly. (The point of the verify step is that they keep the old image around and can quickly revert. I did not test how quick the revert would be.)

Q7. How do I change server size without any downtime at all?
A. Big question. The glib answer is: If you have to ask, then it is too difficult. The slightly more helpful answer is: if a web server then run two servers, with a load balancer in front of them; if a database server, look into database clusters.

Thursday, February 16, 2012

shared_from_this causing Exception tr1::bad_weak_ptr

I've been having a rotten week, with my boost::asio program constantly giving me a segmentation fault... and it is not even doing the real work yet. It crashes when a client disconnects. The error message is:

   Exception: tr1::bad_weak_ptr

My code has now been littered with debug lines, lots of them showing usage counts of the shared_ptr in the hope of tracking down at what stage it is going wrong:

   std::cout << "this.use_count=" << shared_from_this().use_count() << "\n";

If that is unfamiliar, my class is defined like this:
   class client_session :
   public boost::enable_shared_from_this< client_session >{ ... }

This allows me to pass around shared pointers to this, from inside the class being pointed at, and is one of the essential tools you need to do anything useful with boost::asio.

I've been reading tutorials, studying other people's code, and progressively adding more shared pointers around objects that I am sure do not really need them. Nothing would shake it.

My code is using cross-references: the client connection object stores a vector of references to the data sources it uses, and each data source stores a vector of references to the clients subscribed to it. When I say reference I mean it holds a smart pointer instance. It is not that complicated but surely the problem must be in that cross-referencing? So, in desperation I deleted the entire data source class, and all that subscribing and unsubscribing code. Eh? It still crashes.

But then I noticed this code:
~client_session(){
    std::cout << "In client_session destructor (this.use_count="
        << shared_from_this().use_count() << ")\n";
    unsubscribe_from_all(); //'cos we'll no longer be valid after this
    }

I knew (!!) the problem was not in the destructor, but had to (!!) be before that point, because that first debug line was never reached. If you're already laughing at me, have a healthy helping of kudos. Yes, it was that call to shared_from_this() causing the crash! I was reaching the destructor, but crashing before it could print my debug line.

You see, in C++, an object does not really exist until the end of the constructor, and no longer really exists once you enter the destructor. More specifically, the destructor only runs after the last shared pointer to the object has gone, so shared_from_this() has nothing left to share and throws bad_weak_ptr. You must not use shared_from_this() in the destructor, or in any function called from the destructor. And when I thought again about what unsubscribe_from_all() (which was also calling shared_from_this()) does, I realized the destructor could never be called while any data source still holds a reference to us. So that call is not needed. The destructor code became:

~client_session(){
    std::cout << "In client_session destructor.\n";
    assert(subscriptions.size()==0);
    }

...and the crashes went away.

There is something very, very annoying about knowing the bug I've been chasing for *two solid days* was in the debug code I added to track down the bug.
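
The failure can be reproduced in a few lines. This is a minimal sketch that uses std::enable_shared_from_this instead of the boost/tr1 version, with a made-up session class standing in for client_session. The exception is caught inside the destructor purely so that the demonstration can report it. C++17 specifies that shared_from_this() throws std::bad_weak_ptr here; older standards leave it formally undefined, although the mainstream implementations I know of throw as well.

```cpp
#include <memory>

// Set by the destructor so the caller can see what happened.
bool g_threw_in_destructor = false;

// Hypothetical stand-in for client_session.
class session : public std::enable_shared_from_this<session> {
public:
    ~session() {
        try {
            // The last shared_ptr is already gone by the time we get here,
            // so there is nothing left for shared_from_this() to share.
            std::shared_ptr<session> self = shared_from_this();
            (void)self;
        } catch (const std::bad_weak_ptr&) {
            // Caught only for demonstration; letting an exception escape
            // a destructor would terminate the program.
            g_threw_in_destructor = true;
        }
    }
};

// Hypothetical helper: create a session, release it, and report whether
// the destructor's call to shared_from_this() threw.
bool destructor_call_throws() {
    g_threw_in_destructor = false;
    {
        std::shared_ptr<session> s = std::make_shared<session>();
        std::shared_ptr<session> again = s->shared_from_this();  // fine: s owns it
    }  // both pointers are released here, so the destructor runs
    return g_threw_in_destructor;
}
```

The same call that works while a shared pointer still owns the object fails once the destructor has started, which is exactly what my debug line was doing.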