Monday, March 30, 2009

6GB on 32-bit linux

Well, it all started when a friend said their SMT (statistical machine translation) system was ready to download and install. He then casually mentioned it is a bit of a memory hog, "4Gb minimum, 8Gb preferred".

Wow. I looked at my up-until-then-perfectly-adequate-some-might-even-say-overkill 2 gigabytes and felt like a salesman who had just been told he ought to upgrade his sensible car for a sports car in order to look more successful!

I have dual-channel, and two slots free. Another 4Gb was under 5000 yen at Dospara (Japanese only), where I had bought this computer, so I emailed them to confirm it would work. After a few emails and a bit of research, I found out:
* Windows 32-bit will only go up to 3Gb.
* 64-bit Windows, 64-bit Linux are both fine well beyond 8Gb.
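* 32-bit linux, with a PAE-enabled kernel, can use more than 4Gb altogether, though each process is still limited to a 4Gb address space (of which roughly 3Gb is usable by the program itself).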

(I wanted to stick with 32-bit linux, as Flash is critical to my work and has no 64-bit version.)

Dospara were rather cautious, saying they don't support linux, but I went for it. When I plugged in the extra 4Gb, the BIOS correctly recognized 6Gb. Then Ubuntu said I had 3Gb. But that was okay, as I'd been expecting it. I went to the package manager and selected the "linux-server" meta-package (its 32-bit kernel is built with PAE, which is what lets it address the rest of the memory), then rebooted.

Drum roll please: "free" reports I have 6Gb available. I'm using 475 Mb, and have 5,745 Mb free. See, I told you I didn't need it. But this is city driving. You wait until I take this beast down the local Formula One track, otherwise known as Difficult AI Problems.

Oh, while I was there I also bought a 1Tb SATA drive. Yes, that is a "T". A whole terabyte in a little, diddy box. It was only 10,000 yen (you could get one for 7,880 yen if you go for Seagate, but a quick bit of research showed lots of unhappy people, so I went for Western Digital which seemed to be the reliability brand).

What do you mean, "I bet he doesn't need that much storage either"? Just because my current 250Gb drive still has 143Gb of free space after 18 months doesn't mean I won't suddenly need more capacity...

And you can bet that when the ladies hear I am part of the Terabyte Generation I'm going to be fighting them off with a stick. Oooh, yeah! I am so sexy.

Monday, March 16, 2009

Periods in regex character classes

I'm trying to match Japanese numbers with a regex, and this was originally a long article describing what seemed to be a bug when using the following regex:
$regex='/((0|1|2|3|4|5|6|7|8|9|.)+)(万|億|兆)/u';

(a simplified version of my actual code, but this demonstrated the problem equally well)

Now, periods in character classes (a list of alternatives in square brackets) are not special, so don't need escaping. But they do need escaping anywhere else. I knew this, which is why I was scratching my head.

So, naturally, I felt a bit foolish when I realized I was using a list of alternatives in parentheses. It wasn't a character class at all.
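To make the difference concrete, here is a minimal sketch comparing the parenthesized alternation with an unescaped period against a real character class. The helper name `match_with_pattern` and the sample string are made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the text captured before a 万/億/兆 suffix, or null.
function match_with_pattern($pattern, $text) {
    return preg_match($pattern, $text, $m) ? $m[1] : null;
}

// Alternation in parentheses with an unescaped period: "." matches any character.
$loose = '/((0|1|2|3|4|5|6|7|8|9|.)+)(万|億|兆)/u';

// A real character class: the period is literal here and needs no escaping.
$strict = '/([0-9.]+)(万|億|兆)/u';

// The "." alternative also swallows the leading 約 ("approximately").
echo match_with_pattern($loose, '約730.28万'), "\n";

// Only digits and periods are captured.
echo match_with_pattern($strict, '約730.28万'), "\n";
```

With no 万, 億 or 兆 suffix in the input, both patterns fail to match, which is the "no match" behaviour described further down.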

But, now, this was even more confusing because I had already been down that street. My code (which in hindsight was correct) had started out like this:
$regex='/((0|1|2|3|4|5|6|7|8|9|\.)+)(万|億|兆)/u';

and that kept saying no match. E.g. for "730.28".

Did I jump out of my bath and shout "Eureka!" when I realized the problem? No. It was more of a forehead slapper. It wasn't "730.28万". It was just "730.28". The code works by trying to match the above regex and, if that fails, passing the string to another function. I'd been debugging the wrong function. Once I was looking at the right function it was a trivial fix.

Moral of the story: don't be stupid.

(Or more seriously, "Time To Refactor", as the code is now tangled enough that it was too easy for me to get entangled in the wrong place.)

Sunday, March 8, 2009

What is an ontology?

I went to InterOntology 09 at Keio University a week ago. Actually I only attended the first couple of talks and the (free!) banquet; I couldn't make the other talks. Ontology is one of those words I have had trouble grasping, and I attended with no more ambition than understanding what it means.

My ambition was not fulfilled, but I was relieved to know that most attendees were just as confused as me. Okay, relieved is perhaps the wrong word. And by "confused", I don't mean people stood around scratching their heads. I mean people were using it in different ways. Most people who said they were "building ontologies" were actually building databases.

Someone I met there kindly sent me this link:
What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model?

This (in conjunction with its first comment) is an excellent article. A bit heavy, but definitely worth the effort.

I feel it backs up the opinion I've been forming that ontology is a word that does not really need to exist. People "building ontologies" are generally data modeling, or taxonomy-building, or semantic-network-building. These terms are more specific, and contain more information about exactly what you are doing.

People using the word ontology may want to emphasize that the grammar allows for validating models using logic. Personally I would rather call that validating the data model, though I do see how a widely used ontology representation language could encourage high quality data validation tools. But they are nowhere near that yet - using SPARQL (pronounced sparkle) appears to be more like writing in assembler than the SQL its name hints at.

On the other hand, if everyone listened to me there would have been no InterOntology conference, and I'd have missed out on a free dinner. Such things need to be taken into consideration. Perhaps I should add "Professional Ontologist" to my business card after all? ;-)

Thursday, March 5, 2009

Comparing two arrays in PHP when elements can be in any order

This comes up so often. Usually I want to compare two CSV strings, where order does not matter.

My usual strategy is to first explode(',',$s) both strings. Then loop through one array, seeing if that element exists in the second array. If it does not then we instantly fail. If it does match then remove that element from the second array. At the end, if no failures and if the second array is empty, then they are the same.
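A minimal sketch of that loop-and-remove strategy, for reference. The function name `compare_csv_by_removal` is hypothetical, and the sketch assumes simple CSV strings with no quoting or stray whitespace.

```php
<?php
// Hypothetical name; a sketch of the loop-and-remove strategy described above.
function compare_csv_by_removal($csv1, $csv2) {
    $list1 = explode(',', $csv1);
    $list2 = explode(',', $csv2);
    foreach ($list1 as $item) {
        // Strict search, so "1" and "01" are treated as different strings.
        $key = array_search($item, $list2, true);
        if ($key === false) return false; // Not in the second list: instant fail.
        unset($list2[$key]);              // Matched, so remove it from the second list.
    }
    // Anything left over in the second list was never matched.
    return count($list2) === 0;
}
```

Because each match is removed from the second list, duplicates are counted: 'a,a,b' and 'a,b,b' are reported as different.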

When writing the above code for the umpteenth time, I wondered (also for the umpteenth time) if there was a better way. I checked the PHP manual and still no "compare_arrays_can_be_in_any_order()" function. (Or, am I missing it - please let me know!)

But maybe I can use a couple of functions?

First idea: array_intersect(). Returns all elements of array1 that are in array2. So:

$intersect1=array_intersect($list1,$list2);
if($intersect1!==$list1)return false; //Some element of list1 is missing from list2.
$intersect2=array_intersect($list2,$list1);
if($intersect2!==$list2)return false; //Some element of list2 is missing from list1.
return true; //They are subsets of each other. I.e. equal as sets (duplicate counts are ignored).

I've not tried this, as my second idea was better: sort(). The implementation is:

function compare_arrays_can_be_in_any_order($list1,$list2){
    sort($list1);
    sort($list2);
    return ($list1===$list2);
}

Note: this is destructive, as sort() works in place and so both list1 and list2 are modified. Written as above this is fine, as I pass the arrays in by value and the caller's copies are untouched. But be careful if pasting just that code into another function.
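The two ideas are not quite interchangeable, and a small sketch shows where they differ. The wrapper name `same_elements_ignoring_counts` is hypothetical (the array_intersect() idea above was left untried), and the sort-based function is repeated only so the block is self-contained.

```php
<?php
// Hypothetical wrapper around the array_intersect() idea.
function same_elements_ignoring_counts($list1, $list2) {
    if (array_intersect($list1, $list2) !== $list1) return false;
    if (array_intersect($list2, $list1) !== $list2) return false;
    return true;
}

// The sort()-based version, copied from above.
function compare_arrays_can_be_in_any_order($list1, $list2) {
    sort($list1);
    sort($list2);
    return ($list1 === $list2);
}

$x = array('a', 'a', 'b');
$y = array('a', 'b', 'b');

// Every element of each list appears somewhere in the other, so this passes.
var_dump(same_elements_ignoring_counts($x, $y));

// After sorting, the lists differ at the second position, so this fails.
var_dump(compare_arrays_can_be_in_any_order($x, $y));
```

So the intersect approach treats the lists as sets, while the sort approach also requires each value to appear the same number of times.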

Here is the function to solve my original csv matching problem:

function compare_csv_strings_can_be_in_any_order($csv1,$csv2){
    $list1=explode(',',$csv1);
    $list2=explode(',',$csv2);
    sort($list1);
    sort($list2);
    return ($list1===$list2);
}

Bear in mind this will only work for simple csv strings: if one uses quotes and the other doesn't, or one uses whitespace around the comma and the other doesn't, you need something more sophisticated.

Now your homework question to make sure you are paying attention: sort() takes an optional parameter of flags, to specify sorting items numerically, as strings, or as strings using current locale. Do I need to specify any flags to have my code work with any type of array data, and to work on a machine using a different locale?

What about comparing arrays of arrays?

And final question, the one of most interest to me: do you have a better way? (Better could mean any of: less code, quicker execution or fewer restrictions/disclaimers!)

Tuesday, March 3, 2009

Specify the default, stupid!

I've been struggling all week to connect to a Microsoft SQL server. It was SQL Express. I'm using the "ADODB.Connection" COM object from PHP.

(The reason to use that is you can specify the 3rd param of the COM constructor as CP_UTF8 (without quotes) and then it will convert from UTF-8 to the UCS-2 that SQL Server works in, all behind the scenes.)

First, if it says it cannot connect due to there being no server: you have to enable TCP/IP connections in SQL Server. They are off by default.

That got us to the next error, which said "接続が正しくありません。" which translates as "The connection is incorrect." Not much to go on. I think this may be the same as "error 26", but I'm not 100% sure on that.

My code could connect fine to the production DB machine, running SQL workstation (which is the more expensive version I believe). And a client on another machine, which could also connect to that production DB, could not connect to our SQL Express machine.

So, with all fingers pointing at the SQL Express server, we tried things and googled things and swore at things. Wrong barking tree!

I've given the answer in the subject. SQL Server listens on port 1433 by default. Our SQL server was listening on that port, so you would think no need to specify it. But this DSN fails:

DRIVER={SQL Server};SERVER={127.0.0.1};UID={myuser};PWD={mypassword};DATABASE={mydatabase}

whereas this one works:
DRIVER={SQL Server};SERVER={127.0.0.1,1433};UID={myuser};PWD={mypassword};DATABASE={mydatabase}

Only for SQL Express it seems. No need to specify the default port when connecting to SQL workstation.
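Putting the pieces together, a minimal sketch of building the working connection string and opening it through COM might look like this. The helper names `build_sqlserver_dsn` and `open_utf8_connection` are hypothetical, and the credentials are the placeholders from the DSN above. The COM class only exists on Windows builds of PHP with the COM extension enabled, so the connecting function is defined but not called here.

```php
<?php
// Hypothetical helper: builds the connection string, always naming the port
// explicitly, even when it is the default 1433.
function build_sqlserver_dsn($host, $user, $password, $database, $port = 1433) {
    return 'DRIVER={SQL Server};'
         . 'SERVER={' . $host . ',' . $port . '};'
         . 'UID={' . $user . '};PWD={' . $password . '};'
         . 'DATABASE={' . $database . '}';
}

// Hypothetical helper: opens an ADODB connection with UTF-8 conversion enabled.
// Requires PHP's COM extension, which is Windows-only.
function open_utf8_connection($dsn) {
    // 3rd constructor param is the codepage; CP_UTF8 converts strings to and from UTF-8.
    $conn = new COM('ADODB.Connection', null, CP_UTF8);
    $conn->Open($dsn);
    return $conn;
}

$dsn = build_sqlserver_dsn('127.0.0.1', 'myuser', 'mypassword', 'mydatabase');
echo $dsn, "\n";
```

On a Windows machine, `open_utf8_connection($dsn)` would then return the open connection object.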

Nice one Microsoft. Just got to work on quality control and documentation a bit more, oh, and consistency, and then they could probably go professional with this software business of theirs.