Saturday, December 19, 2009

Right Sed Fred, I'm too sexy to search and replace

Last week I hand-edited 22 XML files to change one attribute in each. I had only ten minutes, and I knew that solution could be done in that time.

Today I had exactly the same task, but with less time pressure, I went hunting. Here is the solution I used:
sed -i 's/old/new/g' *.xml

-i means edit the files in place. I had struggled with sed years ago and thought it was a horrible monster that made emacs look user-friendly in comparison. But that is so simple. Perhaps I was hurt before by trying to do something that couldn't be described as a simple regex?

Actually I vaguely remember my need at that time was to modify all html files in a directory tree, which sed cannot do on its own. But with find it can:
find . -iname \*.html -execdir sed -i 's/<html>/<html mytag="test">/g' {} +

That inserts an attribute in the html tag of all *.html files in the current directory and its subdirectories. (Shamelessly stolen from comments on this page, and then I did a quick test to confirm I hadn't introduced a typo.)

Cool. One step closer to unix guruness. (sed is also available in cygwin, which is where I was actually doing the edit that started this article.)

UPDATE: for an example where tr is more useful than sed see find, grep and tr.

UPDATE: Here is an expansion of the find+sed example above. I wanted to alter three types of extensions: html, php, phtml. And I only wanted to alter those files in just certain subdirectories. The -regex parameter of find seemed to do the job:

find . -regex './\(dir1\|dir2/subdir1\|dir3\|dir4\)/.+\.\(html\|php\|phtml\)' -execdir sed -i 's/ABC/XYZ/g' {} +

That is the command exactly as you run it at the bash command prompt. Notice that, in the regex, not just (), but also | need to be prefixed with a single backslash.

Tuesday, December 1, 2009

php Architect: no more print version

The subject says it all: the publishers of the excellent php|a magazine have suddenly stopped the print edition. It has gone PDF only.
(Click php|a on the right to see my other blog articles about php|a magazine.)

This is such a shame: I would willingly pay much more for a print version than a PDF version. In fact I was doing exactly that. Then a year ago they converted everyone to Print+PDF subscribers, cut the price dramatically, and gave me 12 extra free issues to make up for the price cut. In fact that only works out as 6 or so free issues, but as they were free I can hardly complain.

It is just a shame they fiddled with their business model, as obviously the price cut made the print magazine too expensive to keep producing.

Why do I prefer the print version? It can be read on the train, read in the mountains, is light and portable. It can be kept on my bookshelf, pulled off and opened on my desk while I'm coding. As an author, a print magazine is also much more impressive. It is something physical I can show a potential client, something I can show my Mum.

Monday, November 23, 2009

Escaping CSV in C++

There are two escaping rules for each field in a comma-separated value row:
1. Change each double quote to two double quotes.
2. Surround with double quotes if the field contains a comma or double quote.

These are the rules used by Excel and most other software that deals with CSV data.

As an example, if my fields are:
hello world
a,b,c,
"CSV" is popular
""

Then it becomes:
hello world,"a,b,c,","""CSV"" is popular",""""""

C++ has a justified reputation as a hard language for text manipulation. Boost has libraries to make it a little easier, but I didn't want to add Boost as a dependency for a project I was working on. Fortunately std::string's replace() function turned out to be more powerful than I had realized:

void output_csv(std::ostream &out,std::string s){
    if(s.find('"')!=std::string::npos){ //Escape double-quotes
        std::string::size_type pos=0;
        while((pos=s.find('"',pos))!=std::string::npos){
            s.replace(pos,1,"\"\"");
            pos+=2; //Need to skip over those two quotes, to avoid an infinite loop!
        }
        out<<'"'<<s<<'"';
    }
    else if(s.find(',')!=std::string::npos)out<<'"'<<s<<'"'; //Need to surround with "..."
    else out<<s; //No escaping needed
}

If you like compact code then the while loop can be rewritten:

void output_csv(std::ostream &out,std::string s){
    if(s.find('"')!=std::string::npos){ //Escape double-quotes
        for(std::string::size_type n=0;(n=s.find('"',n))!=std::string::npos;n+=2)s.replace(n,1,"\"\"");
        out<<'"'<<s<<'"';
    }
    else if(s.find(',')!=std::string::npos)out<<'"'<<s<<'"';
    else out<<s;
}
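
To check the function against the example row above, here is a minimal self-contained sketch. csv_row is a made-up helper (it was not part of my project) that joins the escaped fields with commas. It assumes fields never contain newlines, which most CSV dialects would also want quoted.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Write one field, applying the two escaping rules.
void output_csv(std::ostream &out, std::string s) {
    if (s.find('"') != std::string::npos) {
        // Rule 1: double every double-quote, stepping past the pair just written.
        for (std::string::size_type n = 0;
             (n = s.find('"', n)) != std::string::npos; n += 2)
            s.replace(n, 1, "\"\"");
        out << '"' << s << '"';
    } else if (s.find(',') != std::string::npos) {
        out << '"' << s << '"';  // Rule 2: surround with quotes
    } else {
        out << s;  // No escaping needed
    }
}

// Hypothetical helper: escape each field and join them with commas.
std::string csv_row(const std::vector<std::string> &fields) {
    std::ostringstream out;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i) out << ',';
        output_csv(out, fields[i]);
    }
    return out.str();
}
```

Passing the four fields encoded in the example row (hello world, then a,b,c, then "CSV" is popular, then two double quotes) gives back exactly that row.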

P.S. If you need to do the same in PHP, PHP 5.1 finally introduced fputcsv for it. The comments on that page show how to do it in older versions of PHP; my fclib library also contains functions for it.

Monday, November 16, 2009

samba and strange permissions

On my linux server I run a samba share, which is used by both Linux and Windows clients on the LAN. I moved it from an old FedoraCore machine to Ubuntu a few months ago, and ever since have been getting strange permissions: text files kept becoming executable (but only for user, not group or other).

It took me this long to realize what was going on, and when I tried to track it down about a week ago I concluded it was SciTE being strange on just samba partitions. I.e. I'd edit a file with rw-r--r-- permissions, save it, and it would end up with "rwxr--r--". Every time. But not on normal partitions, and gedit didn't do it on the samba partition. I noticed today that files created by a PHP script also got those weird permissions, and the penny dropped: gedit must be explicitly setting permissions when it saves a file. SciTE wasn't the cause of the problem at all.

So, I went hunting again. I referred to Mount samba shares with utf8 encoding using cifs a lot, but in fact it didn't give me the answer I wanted: the instructions there gave me the same problem. (It did show me how to set my samba partition to mount from /etc/fstab, replacing my crude entry in rc.local though.)

Hunting through the troubleshooting section I found the "nounix" flag and tried that. Initially it made things worse, giving all files rwxrwxrwx permissions. Then I changed from "file_mode=0777,dir_mode=0777" to "file_mode=0644,dir_mode=0755" (which was what I had originally), and that combined with nounix works! All text files are rw-r--r-- before and after saving. Oh, the other change I added relative to the above page was including "uid=darren,gid=darren". Otherwise files were owned by "nobody:root" and I didn't have permission to edit anything (even with the suggested 0777 settings).

My guess is that my old FedoraCore samba server didn't have the unix extensions, Ubuntu 8 does, and somehow those unix extensions are misconfigured in Ubuntu. My samba server configuration is all defaults however... Anyway, it now works the way it has for the previous few years, so I'm happy.

UPDATE: I just realized this has also fixed another irritation - delete (that moves to the Trash folder) hasn't been working on that partition, but now it does. Trash folder vs. direct delete was only a minor factor; what was really annoying was every time I pressed delete it then popped up a dialog box requiring me to confirm it.

Wednesday, November 4, 2009

How much is that password worth?

These people have published details about how they used Amazon EC2 to crack passwords:

Personally, I skipped all the details and went straight to the interesting conclusions page:

It tells you how much it will cost someone (in EC2 charges) to crack your passwords, based on their lengths and the range of characters you use.

I used to think 8 characters was a good password. Seems it is worth about $3, or $45 if I've mixed in some numbers. Gulp. And all this is assuming there are no dictionary words in there. Double gulp.
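
The arithmetic behind figures like these is just the size of the search space: the number of possible characters raised to the power of the password length. Here is a rough sketch. The function names are made up, and it assumes cost is simply proportional to the number of candidates tried (the real cracking rate and EC2 pricing are in the article, so I have left them out).

```cpp
#include <cstdint>

// Number of candidate passwords: charset_size ^ length.
// (Overflows 64 bits for long passwords; fine for this illustration.)
std::uint64_t keyspace(std::uint64_t charset_size, unsigned length) {
    std::uint64_t total = 1;
    for (unsigned i = 0; i < length; ++i) total *= charset_size;
    return total;
}

// How many times more expensive a brute-force search becomes when the
// character set grows, assuming cost is proportional to keyspace.
double cost_ratio(std::uint64_t small_set, std::uint64_t big_set, unsigned length) {
    return static_cast<double>(keyspace(big_set, length)) /
           static_cast<double>(keyspace(small_set, length));
}
```

For 8 characters, going from 26 lowercase letters to 36 (letters plus digits) multiplies the search space by about 13.5. Assuming the $3 figure is for lowercase only, that is roughly in line with the jump to $45.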

Sunday, October 25, 2009

Regex Article in php|a

The August edition of PHP Architect magazine contained my introductory article on regexes. It has a PHP slant but mainly it is about regexes. Well, most of it is introductory but it contains a bonus section at the end using advanced regexes in SQL to repair a database in situ.

The print magazine arrived by air mail a couple of weeks ago and I finally got to read it. I was pleased (and relieved) that it read well. If you have read it I'd love to get some constructive criticism - especially if you are uncomfortable with regexes.

It was also my first print article to use colour, and I think that worked well too.

One correction in the complicated "Repairing With Regexes" section at the end of the article. The text says "So \1 has to be written \\\\1 (that is four backslashes)." That seemed a strange thing to write, so I took a look at the unit test source code (included with the PDF download version of php|a magazine) I had written. Four backslashes is correct when in PHP, but when in SQL it should be "\\1" not "\1".
(I just checked my emails from when we were proof-reading, and one of the edits had removed all those backslashes; I caught that at the time but ironically I got it wrong when putting them back in.)

The other minor correction I only realized after reading Arne Blanket's column in the same magazine issue! I had written: " IP address after that at sign, such as darren@, which, while unusual, is technically valid". In fact "darren@[]" is the technically valid form. Which is doubly annoying because my article's regex would reject those square brackets and I hadn't explicitly pointed that out. (No harm done, though, as this email form is highly discouraged.)

Arne's column also points out that top-level domains are no longer just two or three characters. So my '\.[a-zA-Z]{2,3}$' suggestion should really have been '\.[a-zA-Z]{2,}$'. Luckily, that suggestion was just in a list of other ideas, not part of the article's main regex.
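
To make the difference between the two patterns concrete, here is a small sketch using C++'s std::regex (my choice for illustration; the article itself was about PHP). The helper name and the test addresses are made up.

```cpp
#include <regex>
#include <string>

// True if the pattern matches somewhere in the text.
// Both patterns below are anchored with $, so in practice this checks
// how the text ends.
bool ends_with_tld(const std::string &text, const std::string &pattern) {
    return std::regex_search(text, std::regex(pattern));
}

// The article's original suggestion, and the corrected one.
const std::string tld_2_or_3 = "\\.[a-zA-Z]{2,3}$";
const std::string tld_2_plus = "\\.[a-zA-Z]{2,}$";
```

With these, an address ending in .com passes both patterns, while one ending in .info is rejected by tld_2_or_3 and accepted by tld_2_plus.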

(By the way I enjoyed, and will blog about real soon, some other articles in this particular issue; if you are not a subscriber, and your work involves data and PHP, then this is the back issue you should get!)

Wednesday, October 14, 2009

Control firefox from PHP?

Internet Explorer can be controlled from a COM object interface, and therefore from PHP (i.e. any scripting language that has COM support).

But is there a way to get script control of firefox? Ideally I'm looking for a platform-independent solution, and something I can use from PHP. Google is not helping (PHP's dominance as a server-side technology overwhelms the client-side related hits).

Here is my dream PHP script:

$firefox=new FirefoxInstance();
foreach($links as $id=>$info){
if(!$found)echo "MLSN link is missing...\n";



Et cetera. I.e. I'm talking about operating firefox the same way a user does; I know I can grab the raw HTML, parse it, etc. all from PHP, but that doesn't test a web site the same way clicking links in a browser does. Especially web pages with javascript, iframes, AJAX, etc.

(I'd heard of XPCOM but, if I've understood it correctly, it is a library to build firefox and its extensions, not something to control firefox? It also has no PHP bindings.)

BTW, going back to controlling IE from the COM interface, I don't suppose anyone has seen a detailed tutorial on how to use it to fill out and submit forms? I only ever see simple examples of how to set the URL, but I believe full control should be possible.
Dec 16th 2009 UPDATE: This came to the top of my to-do list so I read the MSDN docs on the Internet Explorer COM object, and now it is my understanding that I cannot manipulate and submit forms via the COM object. None of the example usage even hinted at doing this.

Thursday, September 24, 2009

PHP, reliability, ASP and remote desktops

I've been working in ASP on an IIS server recently. Not my environment of choice, but the scripts I need to write are relatively simple, so I decided it was easier to go with it than ask for PHP to be installed (which then requires apache). This is quite a helpful page:

We'll see how it goes. In many of the systems I work on, windows or linux, I usually use commandline PHP as the glue. I'm always amazed at how rock solid it is. Especially on Windows it is always the most stable element. E.g. I've a 24/7 PHP script that deals with COM objects. The COM objects crash regularly. But I just catch the exception in PHP, delete the COM object, create another one and off we go. Needing to use 3rd party libraries only available in C is the only reason I write C++ these days, but they cause me no end of trouble (mysterious crashes deep in windows system calls for instance). Windows itself needs to be rebooted every 3 or 4 weeks or it starts to go all wobbly on us, and my 24/7 PHP scripts simply run for those 4 weeks without a single problem or restart.

Of course saying PHP is more stable than C would be stupid; PHP is itself a C app. The difference is that PHP is a well-crafted piece of code, better than the COM objects, 3rd party C libraries and Microsoft OSes I'm comparing it to.

Going back to IIS, I don't have it installed locally, so have been developing on a remote server. "rdesktop" wins this month's "application I wish I'd discovered months ago" award. Previously I would have to boot up my windows notebook just to use remote desktop. rdesktop gives me this functionality on linux, works perfectly (touch wood) and copy and paste works too.

Thursday, September 3, 2009

Windows SEH in C++

A windows app was mysteriously dying, roughly once a day; no clues in any of the various log files.
So I enabled Windows SEH:

(See MSDN article about _set_se_translator)

And here is my implementation of that function:

void win32_exception_handler(unsigned int code, EXCEPTION_POINTERS *ep){
    throw new win32_exception(code,ep);
}

And the class (cut down to basics) is:

class win32_exception{
    std::string info;
public:
    win32_exception(unsigned int code,EXCEPTION_POINTERS *ep){
        //Describe it in info
    }
    const char *what()const{return info.c_str();}
};

Then, I have this code (which was already here before I enabled SEH):

try{
    //(the call to main_loop goes here)
}
catch(std::exception &e){
    ErrorLog()<<"Got a std::exception thrown by main_loop:"<<e.what()<<" (will treat as fatal)\n";
    return -9;
}
catch(win32_exception &e){
    ErrorLog()<<"Caught win32_exception thrown by main_loop:"<<e.what()<<" (will treat as fatal)\n";
    return -9;
}
catch(...){
    ErrorLog()<<"Got an unexpected exception thrown by main_loop (will treat as fatal)\n";
    return -9;
}

Now, what I don't get is today the program died with this message:
Got an unexpected exception thrown by main_loop (will treat as fatal)

I'm staring and staring at the code and don't get why the win32_exception code didn't catch and instead some other exception type got caught. Enabling SEH was the only change; all such exceptions should go through my function, and that only throws win32_exception instances.

The other mystery to me is what the type of this "unexpected" exception is.

(I'm hoping to come back and add to this blog entry when I work out what is going on!)
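
One possible explanation, offered as a guess rather than a confirmed diagnosis: the translator function does "throw new win32_exception(code,ep)", which throws a pointer (win32_exception*), while the handler catches "win32_exception &". A thrown pointer never matches a catch-by-reference clause for the class itself, so it would fall through to the catch-all. The sketch below reproduces that with a portable stand-in class; there are no Windows headers or SEH involved, and demo_exception and original_handlers are made-up names.

```cpp
#include <string>

// Portable stand-in for win32_exception (hypothetical; no SEH involved).
class demo_exception {
    std::string info;
public:
    explicit demo_exception(const std::string &s) : info(s) {}
    const char *what() const { return info.c_str(); }
};

// Mirrors the shape of the handlers in the post: one catch-by-reference
// clause followed by a catch-all. Reports which one ran.
std::string original_handlers(bool throw_pointer) {
    try {
        if (throw_pointer)
            throw new demo_exception("by pointer");  // like the translator; leaks here
        throw demo_exception("by value");
    } catch (demo_exception &e) {
        return std::string("reference handler: ") + e.what();
    } catch (...) {
        return "catch-all";
    }
}
```

If that is the cause, then either dropping the "new" in the translator (so it throws by value) or adding a "catch(win32_exception *e)" clause that deletes the pointer should route the exception to the intended handler.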

Sunday, August 23, 2009

php6 - maybe not so painful?

I long ago learned that if you want to get development work done then always use the distro's own version of PHP, Apache, etc, rather than compiling your own version. Otherwise every upgrade due to a security or bug fix requires you to first realize the need (costs time keeping up with the study) and then do something about it (costs time). Time is money; not spending the time is insecure. So I outsource all of that to the experts at my distro or ISP.

Consequences are that my code avoids using the latest feature, and tries to work on as wide a set of versions as possible. PHP 5.0.0 was released in July 2004 yet at the start of this year two of my clients were still using php 4. One of them was still using php 4.2. (They've both upgraded this year; one because they moved to a new ISP, the other because we were troubleshooting code and running out of ideas, and wondered if moving to php 5 might magically fix it.)

Which is a long-winded introduction to say that PHP 6 will be out before long, and I'm very unexcited. It brings native unicode; but the mbstring extension already did the job fine. It removes bad stuff that was already deprecated, but I didn't use it anyway.

This article made me concerned it may turn out to be a negative experience overall, rather than a neutral one:

It says PHP4-style classes are disappearing. Yet they work fine in PHP 5, and E_STRICT has not been complaining if I've not explicitly written "public". Of even more concern is php 5.3 apparently brings in this change too?

However, it turns out this was just a poor Linux Magazine article. What is being removed in php 6 is "ZE1 compatibility mode" (see the php manual description); it had defaulted to off, and only affected the way objects were copied/compared. The "php 4" example Linux Magazine shows is valid php 5 code, and (as far as I can tell) will be valid php 6 code too.

So that was a red herring. But I'm also bothered how (string) works for binary strings in php 5, but will need to be changed to (binary) for php 6 - that is a very annoying change for anyone storing binary data in strings (disabling unicode is not an option if you also need to store UTF-8 in other strings)!

Thursday, August 20, 2009

Storing website translations in SQL

In php|a magazine, April 2009, there was an article called "Storing Multilingual Records in the MySQL Database". As the title says, it has some mysql-specific elements, but the concepts are quite general. The author introduced and compared three alternatives; as I have used a fourth way I thought I'd write about it here.

The situation was a product database, where the name, url and description have to be localized. But some entries will be untranslated and should fall back to using the default language entry.

1. Translations in a separate table

Here is the database schema:

product: product_id, group_id, price.
product_translation: product_id (links to product), language_id (links to language), name, url, description.
language: language_id, collation.

This requires an SQL join which is hard for mysql to execute efficiently. Fulltext indexes are also a problem, as the name and description fields contain text from different languages. The only good point of this approach is that it is easy to add a new language.

2. Data Copy

Here the group_id and price fields have been moved into the product_translation table. That makes the SQL queries a bit cleaner, as it saves one join, but doesn't really solve the other problems. And the data redundancy is an accident waiting to happen (I'm having a physical reaction just thinking about it).

3. Translation Directly In The Database

This is like 2 above, but each language gets its own field. For instance if we have three languages, English (en), German (de) and Japanese (ja) then it looks like:

product: product_id, language_id (links to language), name_en, url_en, description_en, name_de, url_de, description_de, name_ja, url_ja, description_ja, price, group_id.

The advantage this brings is using the default language is easy; you can either just always select the default language field, or use mysql's IFNULL function. E.g. either:
SELECT name_de,name_en FROM product WHERE product_id=123;
(and then check in your PHP script to see if name_de is blank, and if so use name_en instead)
or:
SELECT IFNULL(name_de,name_en) AS name FROM product WHERE product_id=123;

The other advantage is FULLTEXT indices work well now, as only one language is kept in each text column. Disadvantages are the space-wasting if actual translations are sparse, the work required to add a new language, and possibly hitting the mysql maximum row size (64K).

4. One table per language

This is the approach I've used in MLSN, among others:

product: product_id, group_id, price.
product_translation_en: product_id, name, url, description.
product_translation_de: product_id, name, url, description.
product_translation_ja: product_id, name, url, description.

Fulltext indices work well, joins are relatively easy, no data duplication. Adding a language feels cleaner than solution three above, as it doesn't require modifying an existing table, just adding a new one. If the site takes off then different languages can be split to different servers easily. And the language table can disappear as collation will now be part of the field definition when the table is created. (I think the same can be said of solution three, but the php|a article kept the language table in its schema - was that an oversight, or am I missing something?)

The dark side? Getting the default language requires two queries. Or would something like this (untested) SQL work?
SELECT ISNULL(, FROM product_translation_de de,product_translation_en en WHERE de.product_id=123 AND en.product_id=123


The 4th option works well for me, but I'd be interested to hear arguments against it. Perhaps you use something different again?

Monday, July 27, 2009

New fclib and dcflash releases

A few days ago I updated my dcflash project on SourceForge. This is an open source (MIT license) actionscript library, with a data visualization emphasis:

In particular there is now an AS3 version. In fact the AS2 and AS3 releases currently have only a couple of classes in common. The AS3 classes are mostly for loading images, movies, and setting up slideshows; none of the chart classes from the AS2 version (or the only partially ported and unreleased AS1 library) are there yet. As I say in the release notes, I'm open to offers to port them to AS3. (A look at will show you the kind of things in the AS1 and AS2 libraries.)

And now today I've put up a new release of fclib. This is an open source (MIT license again) PHP library. Fairly ad hoc, but naturally lots of internationalization-related classes. The new 0.4.19 release contains some arabic classes, but other than that it is mostly just tweaks and small improvements:

Wednesday, July 22, 2009

Date comparisons in C++

Just been scratching my head over a code snippet that looks like this:

time_t now=...;
time_t last_tick=...;

struct tm *timeinfo=localtime(&now);
struct tm *tick_timeinfo=localtime(&last_tick);
if(tick_timeinfo->tm_mday==timeinfo->tm_mday &&
  tick_timeinfo->tm_mon==timeinfo->tm_mon &&
  tick_timeinfo->tm_year==timeinfo->tm_year)return OK;
else return BAD;

I'd realized this code was always returning OK, even when last_tick was yesterday, a month ago, or even from last year.
Can you see the bug? The above is all you need to know. Answers in a comment. No stamp required.

(To be fair to my bruised ego, the problem can really only be one thing in the way I've presented it above.)

Kudos to the first reply. And I'd be interested to hear how people do date comparisons in C++ more safely.
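
For what it's worth, one common culprit with code shaped like this: localtime() returns a pointer to a single shared internal buffer, so the second call overwrites the result of the first and both pointers end up describing the same time. Here is a minimal sketch of a safer comparison, which copies each result into its own struct before comparing. same_local_day is a made-up name, and on POSIX systems localtime_r would be another option.

```cpp
#include <ctime>

// True if both timestamps fall on the same calendar day in local time.
// (Checking for a null return from localtime is omitted for brevity.)
bool same_local_day(std::time_t a, std::time_t b) {
    // Copy each result immediately: localtime() reuses one static buffer,
    // so keeping two returned pointers means holding the same struct twice.
    const std::tm ta = *std::localtime(&a);
    const std::tm tb = *std::localtime(&b);
    return ta.tm_mday == tb.tm_mday &&
           ta.tm_mon == tb.tm_mon &&
           ta.tm_year == tb.tm_year;
}
```

Because each struct is an independent copy, a timestamp from three days ago no longer compares equal to now.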

Tuesday, July 21, 2009

Screen still going blank! (/proc/mtrr)

I wrote before how my screen keeps going blank, and how it had seemed to get worse after upgrading to software RAID. It has still been doing it, and the past week or so I've been systematically disabling things to see if I can fix it. It still happens, but less frequently, so some combination of things I've disabled may have helped?

But, it does still happen, so I went googling again. I think I may have found something:

It seems the Intel video driver and/or the linux kernel cope badly with unusual memory configurations. I moved to software RAID at the same time as I moved from 2G to 6G main memory! Here is my /proc/mtrr:

reg00: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1
reg01: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
reg02: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
reg03: base=0x100000000 (4096MB), size=2048MB: write-back, count=1
reg04: base=0x180000000 (6144MB), size= 512MB: write-back, count=1
reg05: base=0x1a0000000 (6656MB), size= 256MB: write-back, count=1
reg06: base=0xcf700000 (3319MB), size= 1MB: uncachable, count=1

This looks like one of the problem ones from the above bug report. Unfortunately I'm at a bit of a loss what to do with it now. My understanding of the problem is something like main memory and video memory overlap, so when some program uses that memory it kills my video and X dies. My ctrl-alt-F1, breathe in, breathe out, ctrl-alt-F7 "fix" must reset X?? (Hhhmm, how come I never see problems with any programs having their memory overwritten by the graphics card though?)

Here are my dmesg entries either side of the only mtrr complaint:
[ 142.691737] [drm] Initialized drm 1.1.0 20060810
[ 142.700818] ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 17
[ 142.700827] PCI: Setting latency timer of device 0000:00:02.0 to 64
[ 142.700880] [drm] Initialized i915 1.6.0 20060119 on minor 0
[ 142.715171] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
[ 143.311587] set status page addr 0x00033000

These are the best explanations I've found so far:

Both the solutions in the 2nd link talk about modifying the video=... parameter given to the kernel. But I don't have one of them. I just tried throwing it on the end of the kernel commands but the /proc/mtrr output is unchanged, so I don't think it had any effect:
kernel /vmlinuz-2.6.24-24-server root=/dev/md3 ro quiet splash nomttr

I've just tried changing Advanced|Chipset|Northbridge|Memory Remap from Enabled to Disabled, in my bios. Main memory has dropped from 6G to about 5.3G, and /proc/mtrr has changed to:
reg00: base=0xd0000000 (3328MB), size= 256MB: uncachable, count=1
reg01: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
reg02: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
reg03: base=0x100000000 (4096MB), size=2048MB: write-back, count=1
reg04: base=0xcf700000 (3319MB), size= 1MB: uncachable, count=1
reg05: base=0xcf800000 (3320MB), size= 8MB: uncachable, count=1

Well, it looks just as dubious to my untrained eye (for starters it still seems to think there is 6G of memory), but at least it is different. Let's see what happens... no, the screen still goes blank.

What about DVMT mode?
I currently am in "DVMT" mode, with 256M. I've plenty of memory, so let's try fixed mode, with 256M. It made no difference to the /proc/mtrr output, and the problem still happens.

(There is also an "ASMT resolution" option in the bios, which is "enabled". Google isn't helping me much here, but some explanation is here and it doesn't seem to be to do with video memory.)

(I've been holding off on posting this for the past week, in the hope I'd resolve the problem. But unfortunately I haven't yet. I guess fiddling with /proc/mtrr directly is needed, but I don't have time to investigate that currently.)

Friday, July 17, 2009

Email on ubuntu (exim)

When I installed real player a few months back I noticed it had installed exim. I thought that highly suspicious - what was real player planning to do behind my back? Well, I thought I'd try removing it today, and discovered realplay depends on lsb-core (linux standards), and lsb-core depends on exim. realplay was innocent.

The magic command to remove exim is:
dpkg --purge --ignore-depends=lsb-core --ignore-depends=mailx exim4 exim4-base exim4-config exim4-daemon-light

But after doing that I changed my mind, as I discovered I couldn't send any email from the commandline. Did somebody just say "Duh!"? Well I was hoping I could then configure something to tell it to use SMTP directly to my ISP. (Why do I want to send email from commandline? Well, yes to test programs, but much more importantly than that I realized my system has been sending me warning emails and they've been undeliverable, so for instance the daily mail telling me about my RAID problem has not been reaching me...)

Anyway, lsb-core and mailx were screaming at me that their dependencies were missing, so re-installing exim was just a matter of telling them to re-install. Then the key step I had been missing was this:
sudo dpkg-reconfigure exim4-config

Still not easy, but it becomes much clearer if you know that what exim calls a "smarthost" is what the rest of the world calls "my ISP"! I.e. in my case I told it to use a smarthost for outgoing email, no incoming email, and for the smarthost I just looked in my thunderbird configuration for the SMTP server and gave it that address. It also kept asking me for a default domain and I chose "" for everything.

Seems to have done the job!

Oh, the raid problem? It is just complaining I didn't set up swap as RAID. This was deliberate, see my original post on raid setup. At the time I wrote that I didn't understand the issues involved regarding RAID and swap. But if the system is complaining I will set it up as raid to make the message go away.

Tuesday, July 14, 2009

Alternative http logins from firefox?

Open source works on the "scratch-an-itch" principle, and so there should be solutions for the itches that developers have. So I must be using the wrong keywords as I just cannot track down the firefox plugin I need.

When I build a web site I need to login as different users. E.g. a normal-permissions user, and an admin-level user. I've entered both, firefox is storing both of them, but it only ever offers me one. Ironically way back when Firefox was called the Mozilla Application Suite it did have this - when two or more usernames had been saved it would pop-up a dialog letting you choose. Firefox 1.5 was a two steps forward, one step back kind of release in that respect. (Of course Firefox 2 and Firefox 3 were both "0.6 steps forward, 0.55 step backward" releases, with the kicker that you need to waste time finding replacement plugins, but that is a rant for another time.)

Help! Please? There must be some firefox plugin that does this simple task.

P.S. I'm also keen to have some function that will allow me to selectively remove certain http authentications, but not the blanket "Miscellaneous|Clear Private Data|HTTP Authentication" option of the Web Developer plugin.

Thursday, June 4, 2009

The Shodan Go Bet

I am a natural optimist. A hopeless optimist. I spend hours battling against myself with carefully constructed cynicism, but things still get past my guard. And one day, back in 1997, I made the mistake of putting my money where my mouth is. Let me tell you where it all started...

(About a bet I made with John Tromp, in 1997, that a computer could beat him at the game of go before the year 2011; the above URL is to publicize this bet and also has places where you can vote with your opinion and leave comments. If you enjoy it please do link to it, blog about it, tell your friends, etc.)

Thursday, May 28, 2009

Linux partition advice

When I first partitioned my disk I put /boot on the first partition and gave it 96M, following some advice found online I assume. This is not enough. I suggest giving it 256M. It is still a fraction of your hard disk. And because /boot is special it is impossible, or at least complex, to move some files to other partitions and link to them.

About a week ago ubuntu upgrade started complaining about a kernel upgrade problem. I ignored it for a few days assuming it would sort itself out. But it kept happening. Then when I viewed details of the upgrade I noticed (just briefly before it vanished off-screen) it was saying not enough diskspace on /boot.

Not again! When I tried to upgrade from Ubuntu 7 to Ubuntu 8 lack of space on /boot caused problems then too.

Poking around I also found I still had linux-generic packages installed, even though I'd switched to the linux-server kernel (see 6Gb on 32-bit linux). I deleted all packages that had the word "generic" in their name. After a reboot I still had one "generic" file left in /boot which I then just deleted. There was also a *.bak file for my current kernel. Datestamp was for a week ago, so I deleted that too.

It doesn't look like I've broken anything, and I'm now down to using 40M on /boot with 48M free (and I still have Ubuntu 7's linux-server kernel in there, which I think is now pointless, so I could reduce it even more).

Therefore, you can get away with a mere 96M /boot partition, it just requires more time and effort.

Conversely, I think a /boot partition above a certain size (1G?) causes problems at boot time, which is the whole reason for having a separate /boot partition. But I'm no expert, and that may be old-fashioned advice, and every BIOS on every motherboard made in the last 10 years may in fact be fine.

I dunno, and am too busy with more interesting stuff to study up on it, which is why I'll go for a 256M partition on my next computer.
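
To catch a full /boot before the updater does, something like the following could be run now and then. It is only a sketch: has_free_mb is a made-up helper name, the 50M threshold in the usage comment is an arbitrary choice, and it assumes a POSIX df and awk.

```shell
# Succeed if the filesystem holding PATH has at least MIN_MB megabytes free.
# has_free_mb is a hypothetical helper name, not a standard tool.
has_free_mb() {
    path=$1
    min_mb=$2
    # df -Pk prints one header line, then: filesystem, 1K-blocks, used, available, ...
    free_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    [ "$((free_kb / 1024))" -ge "$min_mb" ]
}

# Typical use:
#   has_free_mb /boot 50 || echo "/boot is getting full"
```

On a Debian-style system, "dpkg -l 'linux-image*'" then shows which kernel packages are taking up the space.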

Monday, May 25, 2009

Japanese NLP mailing list

A new mailing list for discussing Japanese NLP (natural language processing) in English has been set up, by Jim Breen (of JMDict fame):

There is a lot of software for processing Japanese text which is only documented in Japanese, and even then only minimally documented. So the new list is an ideal place (for those of us more comfortable in English than Japanese) for asking about how to use chasen, cabocha, namazu, etc. Or to describe what you are trying to do and get program and data suggestions. Hopefully people will also post about new software and data releases, related conferences, new academic papers, and so on.

Also, if the above interests you then you will also want to know about this Ubuntu repository for all kinds of NLP software:

Much of the Japanese stuff is UTF-8 ready (as opposed to the EUC-JP that academic Japanese software still likes to default to).

Thursday, May 21, 2009

Adobe PDF reader and Japanese fonts

Another casualty of my recent enforced Ubuntu upgrade was Japanese fonts in PDF files. Adobe acroread has also moved from version 8 to version 9. When you meet Japanese in a PDF file it tells you the URL to visit to get the Asian font pack. Unfortunately that page only has Asian fonts for acroread 8 and earlier!

The link should be:

(I'm mentioning this as it took a bit of work to discover it.)

Scroll down to the add-ons section, and it seems each language is now its own file, and the files are much bigger. (I don't know the difference between a "Font Pack" and a "Font Packs"; the files are identical so I chose the latter.)

Unzip the bz2 file with "tar xjf FontPack910_jpn_i486-linux.tar.bz2"
Then move into the JPNKIT directory and type "./INSTALL"

The install process asks:
"Enter the location where you installed the Adobe Reader? /opt"

I didn't install it, Ubuntu did. However it seems Ubuntu is putting it in /opt!
Strange for a package-based distro to put anything there, but I accepted the /opt default and it worked. (This was different in Ubuntu 7, as I remember having to try lots of paths until I guessed the one it was after.)

Incidentally I have already got as an extra repository, but there is no acroread-fonts or similar package. Perhaps there is some legal issue (though I thought's raison d'etre was packages with legal issues).

Saturday, May 9, 2009

Moving encrypted partition to software RAID

I moved most of my partitions to software RAID a few weeks ago. But I left /home/darren/ because it was encrypted. However a few days ago I moved it too. Here is how.

The quick overview: it is exactly like moving any other type of existing partition to software RAID, except that where you would normally format /dev/md7 before copying the existing data over to it, you instead set up crypt on /dev/md7 and format the resulting /dev/mapper device.

Detailed Steps

These instructions assume that you have moved other partitions to software RAID, or are at least familiar with the process (see previous article). All commands are run as root.

If you have not got an existing crypt partition you need to prepare for it:
* Install cryptsetup
* modprobe dm-crypt
* Add dm-crypt to /etc/modules
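
The last of those steps is easy to get wrong by appending the same line twice. A minimal sketch of doing it idempotently, assuming a POSIX shell and grep; ensure_line is a hypothetical helper of my own, not part of cryptsetup:

```shell
# Append LINE to FILE unless an identical line is already there.
# ensure_line is a hypothetical helper, not part of cryptsetup.
ensure_line() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# As root, the three preparation steps would then be roughly:
#   apt-get install cryptsetup
#   modprobe dm-crypt
#   ensure_line dm-crypt /etc/modules
```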

I will be setting up /dev/md7 for software RAID 1, but with just /dev/sdb7 in the raid array initially, i.e.
mdadm --create /dev/md7 --level=1 --raid-disks=2 missing /dev/sdb7

Next I set up /dev/md7 for crypt with these commands:
cryptsetup luksFormat /dev/md7 -c aes -s 256 -h sha256
cryptsetup luksOpen /dev/md7 somecryptraid
mke2fs -j /dev/mapper/somecryptraid -L somecryptraid

Each command only takes a few seconds to run. You will be prompted for a password. Forget that password and the data on your partition is lost forever; there is no recovery ability. So choose carefully.

The next command sets fsck to run every 100 days, however many times the filesystem is mounted (-c 0 switches off the mount-count check). This is just personal preference, and completely optional (the default is to run fsck more frequently):
tune2fs -c 0 -i 100 /dev/mapper/somecryptraid

Run "mkdir /mnt/md7" then edit /etc/fstab and add this line:
/dev/mapper/somecryptraid /mnt/md7 ext3 defaults,noatime 0 0

What we are doing here is saying we want our new encrypted software raid partition to be mounted somewhere temporary, so we can copy our existing data over to it.
(the "noatime" flag is optional, and nothing to do with software RAID or crypt)

And edit /etc/crypttab to add this line:
somecryptraid /dev/md7 none luks

This is the entry that causes the password prompt at boot time. If you already have an encrypted partition then you are adding the above in addition to your existing entry: you will have two encrypted partitions for the next boot.

Now reboot into single-user mode (the recovery kernel). You should be prompted for the password for your new crypt partition (in addition to any existing crypt partition). Now I move /home/darren to the new crypt partition with:
cd /home/darren/
cp -dpRx . /mnt/md7

This took over an hour to run for me.

Once it finishes edit /etc/fstab to change the /dev/mapper/somecryptraid entry from /mnt/md7 to be /home/darren. And comment out the previous entry for /home/darren. I.e. the new entry looks like:
/dev/mapper/somecryptraid /home/darren ext3 defaults,noatime 0 0

Reboot. When running df you should see a line like:
/dev/mapper/somecryptraid ... ... ... ... /home/darren

You can now remove (or comment out) the old "somecrypt" entry from crypttab. Also use fdisk to change the /dev/sda7 entry from "83 Linux" to "fd Linux RAID autodetect" (use the fdisk "t" command to do this). Reboot again.

Now you should be able to run:
mdadm --add /dev/md7 /dev/sda7

This will take a while to run, as it mirrors the data from /dev/sdb7 to /dev/sda7. "watch cat /proc/mdstat" will show its progress.
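
If you would rather have a one-line answer than watch the progress bar, the status markers in /proc/mdstat can be scanned for arrays that are still missing a member. This is a rough sketch, assuming the usual "[2/1] [_U]" style of status line; degraded_arrays is a name I made up.

```shell
# Print the name of every md array whose status shows a missing member.
# Reads /proc/mdstat-style text on stdin; degraded_arrays is a hypothetical name.
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                      # e.g. "md7 : active raid1 sdb7[1]"
        /\[[U_]*_[U_]*\]/ { print name }         # e.g. "... blocks [2/1] [_U]"
    '
}

# Typical use:
#   degraded_arrays < /proc/mdstat
```

An underscore in the final bracket marks a missing or still-syncing member, so no output means every array has all its disks.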

"rmdir /mnt/md7" as a tidy-up step at the end. You might also want to delete commented out lines in /etc/crypttab and /etc/fstab if you like to keep those files lean and clean.


When I ran "mdadm --add /dev/md7 /dev/sda7" I got this error:
"mdadm: add new device failed for /dev/sda7 as 2: No space left on device"

I tracked this down to slightly different sizes as reported by fdisk:
/dev/sda7 16988 23361 51199092 fd Linux RAID autodetect
/dev/sdb7 16988 23361 51199123+ fd Linux RAID autodetect

Even though the start and end cylinders are the same, the number of blocks is different! When I change the view ("u" command in fdisk) to show start and end sectors, the problem is clearer:
/dev/sda7 272896281 375294464 51199092 fd Linux RAID autodetect
/dev/sdb7 272896218 375294464 51199123+ fd Linux RAID autodetect

In other words /dev/sda7 starts 63 sectors later. My fix was to delete /dev/sda7 from the partition table, and then recreate it. Then /dev/sda7 showed the same start/end sector as /dev/sdb7. Weird and spooky, but now the "mdadm --add" command worked (after another reboot of course).

Tip: use fdisk to check your partition sizes are exactly the same before starting! The other way to fix this would have been to make /dev/sdb7 smaller but, by the time I realized, it was too late to do that.
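
The arithmetic behind those fdisk figures can be checked in a couple of lines. A small sketch, assuming 512-byte sectors; sector_count is just an illustrative name:

```shell
# Number of 512-byte sectors in a partition, given fdisk's start and end sectors.
sector_count() {
    echo $(( $2 - $1 + 1 ))
}

sector_count 272896281 375294464   # /dev/sda7: prints 102398184
sector_count 272896218 375294464   # /dev/sdb7: prints 102398247
```

Halving those gives the 1K block counts fdisk showed: 51199092 for sda7, and 51199123 and a half for sdb7 (hence the "+"). The partition being added must be at least as big as the one already in the array, which is why the smaller sda7 was refused.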

Wednesday, April 29, 2009

Ubuntu Hardy: Hard(y) Work

After last week's upgrade on my notebook, I did the forced upgrade from Ubuntu 7 to Ubuntu 8.04 on my main machine yesterday. Numerous breakages.

Easy Fixes
* Skype disappeared. I still had the deb package and installing it again cured that. All my settings were still around.

* SciTE: It overwrote my old settings. Luckily I keep a backup copy and merged in my differences.

* Evolution was back. So I deleted again all packages it would let me delete (some parts are entwined around the heart of either Ubuntu or Gnome).

* The Permit Cookies firefox plugin didn't like firefox 3. You have to go to the developer's own site to get a version that will install.

* SCIM (Japanese input) hot key had stopped working. This fixed itself when I fixed something else (my guess is the xorg.conf change for the Alt-F1/F7 problem - see below).

Hard Fixes
* pcmanfm. This is better than nautilus. Full stop. But after the hardy upgrade trying to launch anything fails - it tries to launch, but after 20s or so gives up. This even includes opening the terminal. I tried reinstalling, and tried getting the latest version, but same problems.
Solution: I gave up. My nautilus problems seem to have gone, and I installed nautilus-terminal package which gives me the "open a terminal in this directory" feature that I'd become addicted to. (Cannot get F4 to open terminal though: the closest I can manage is "Alt-F, t")

* Ctrl-Alt-F1/F7 no longer worked. See my screen blanking blog entry for the xorg.conf fix for this.

* Real Player Plugin. Firefox 3 decided to move the goal posts and keep plugins in a new directory; so you have to move the firefox plugins to the correct place (/usr/lib/firefox-addons/plugins/) yourself.
That still wasn't enough. I've been disabling various other plugins, reinstalling real player, and lots of swearing. I finally have it working, but I cannot tell you exactly which steps were needed.
In the end I used the Real Player's deb, which is highly suspicious as it has dependencies such as "exim" (a mail server)?!! Many years ago Real got a bad reputation for malware, and I'm wondering if they are still up to their bad tricks?
But even just that deb didn't work. So, I think that the thing that finally worked was deleting the totem-mozilla package. (I'd already disabled all plugins with the word totem in them however, which was no help, so I'm not entirely sure... the real Real Player plugin calls itself "Helix DNA Plugin", and I think I'd also disabled that... which with hindsight was probably the cause of all my problems?)

Other Fixes

A few months back the red icon to shutdown gnome stopped working. Hardy didn't fix that. But I did find the cause today: Preference | Sessions | Enable Power Manager. A tad confusing - you'd think it was something only needed on notebooks. But, anyway, that was the key.

Things That Did Work Okay:
* Samba (aka windows) shares
* Software RAID
* Encrypted Partitions
* Email/Browser/sound/video card/etc.


Thankfully Ubuntu 8 is an LTS (Long Term Support) release, so I shouldn't need to go through this for another couple of years.
And while I now hate Ubuntu I have been reminded about its most important feature: a huge user base. Which means typing "Ubuntu 8 some problem" into Google will always find someone having the same problem, usually with some kind of solution.

Thanks to all the hundreds of Ubuntu users who answer questions, and blog about their fixes!

Friday, April 24, 2009

Ubuntu Expired On Me!

So, I boot up my notebook into linux (Ubuntu 7) for the first time in a while, maybe 2-3 months, to prepare it for a meeting. I'm a good boy and first go to install the updates I know will be there. It says it cannot find some. Strange.

On a hunch I wonder if Ubuntu 7 has reached end of life. But, no, this page tells me Gutsy Gibbon (7.10) is still current:

It is wrong. Ubuntu obviously haven't read their own press releases. It EOL-ed a week ago:

What is annoying is how EOL breaks everything. Packages won't validate. Worse, I cannot install some new software (such as mysql; but I could install lame). This is not simply "we're not going to do updates any more". This is "We've pulled the plug. Should've upgraded earlier. Loser."

Two forehead slaps for Ubuntu in the space of a week. I'll be back to Fedora at this rate!

Anyway, I started the upgrade to 8.04, but after running for 20 minutes it finally told me it was going to take 19hrs to run. Not good timing for my meeting!

By the way, 19hrs was a bad estimate. It takes 10-20 minutes to get it started, then 3-4 hours downloading, then 1+hrs installing. It took 6 hrs in total for me. You can leave it alone for the downloading, but need to be around to answer questions during the install stage.

Wednesday, April 22, 2009

Screen keeps going blank

I mentioned this in previous entry (on software RAID). My screen keeps going blank. Ctrl-Alt-F1, breathe in, breathe out, Ctrl-Alt-F7 fixes it. I.e. that must reset the video card or X.

It used to happen once a day or less. Since switching to RAID it is happening roughly once an hour. But it happened 9 times in the past hour, which was past my threshold, so I went hunting...

...and came up empty-handed. Xorg.0.log has entries but they match what happens if I do the Ctrl-Alt-F1/F7 sequence. /var/log/acpid also has an entry that I think goes along with Ctrl-Alt-F7. But nothing else. No errors in either. Nothing in messages, syslog et al that could be associated.

/var/log/gdm/:0.log also gets an entry when I do ctrl-alt-F7, but nothing beyond that.

Nothing else in /var/log at all.

I'm stumped. I doubt over-heating, and I doubt CPU load (as all I've been doing the past hour is typing in a text file). w reports load averages of "0.00 0.09 0.07".

Not a very informative blog, sorry. Anyone got any suggestions? I don't want to mess around trying different video cards, monitors, etc.; it is still just a minor irritation.

UPDATE: After updating to Ubuntu 8 I still had this problem but, worse, my Ctrl-Alt-F1/F7 fix no longer worked! A bit of googling tracked down a solution:
In /etc/X11/xorg.conf, in "InputDevice" section, I commented out these three lines:

#Option "XkbLayout" "jp,jp"
#Option "XkbVariant" "latin,"
#Option "XkbOptions" "grp:alt_shift_toggle,grp_led:scroll"

and replaced with just:

Option "XkbLayout" "jp"

Even better, Ctrl-Alt-F1..F6 now give me a terminal! They never have in Ubuntu, instead they just gave me a blank screen. The above xorg.conf has been there since the start - Ubuntu 8 upgrade didn't alter it.
(Oh, and the above change has not fixed the screen going blank problem, so I'm still no closer to fixing that...)

Update: I've noticed if I move the mouse, especially the mouse wheel, just after the screen goes blank, it comes back. It doesn't always work - not sure if there are two problems here, or if I'm too slow reaching for the mouse sometimes.
Anyway, if that suggests a potential cause for someone let me know!
(By the way, I have screensaver set to blank, after 30 minutes of inactivity. Just to see if that is related I've changed screensaver to some animation, and I'll let you know!)

Monday, April 20, 2009

Software RAID on an existing linux system

So, as mentioned in a previous article I bought a terabyte hard disk, mainly to impress the ladies. It doesn't seem to have been working too well in that respect. I considered sending it back, claiming it needs more testosterone. But I decided to instead set up software RAID 1.

My main guide for this task was this article:
Rather than repeat everything I will just comment on what I did differently. Note that despite the name of that article, it applied fine to Ubuntu 7, and I suspect it will apply equally well to any linux distro of the past two or three years.

First, my disks were different sizes: 320G vs. 1T. So I opened fdisk -l /dev/sda in one terminal, then ran fdisk /dev/sdb in a second terminal and set up the first six partitions (plus one extended partition) to be the same as /dev/sda. I then set up /dev/sdb8 to be one big partition with the rest of the disk, which I mounted as /backup.

In all software RAID guides they number the partitions /dev/md0, /dev/md1, etc. That is confusing so I decided to keep the same numbering schemes as my hard disks; it turns out this is fine. So I used commands like this:
mdadm --create /dev/md3 --level=1 --raid-disks=2 missing /dev/sdb3

compared to this from the HowToForge article:
mdadm --create /dev/md2 --level=1 --raid-disks=2 missing /dev/sdb3

The above article did all partitions in one go. I wanted to take baby steps. So I started by mirroring one partition, choosing a non-critical one, and ignoring all that stuff about grub configuration. It worked well. I then did /var and /, leaving /boot unmirrored and therefore again ignoring all that grub configuration stuff. /var went smoothly but / did not. It kept saying /dev/sda3 was in use: "mdadm: Cannot open /dev/sda3: Device or resource busy". But df didn't list it. Trawling /var/log/* didn't list it. "Eh?" thought I.

Eventually I realized that grub was set to boot from /dev/sda3, not from /. I.e. grub refers to the underlying device, not the mount point.

Key Advice: mirroring one partition at a time is fine, but then do "/", "/boot" and all that grub stuff in one go.

Now, something the HowToForge article didn't mention at all was doing file copies in single-user mode. I know just enough unix administration to scare myself and the idea of taking a copy of /var and / while all the daemons were still running made me nervous. So I tried to boot to single-user mode.

Discovery #1: Ubuntu is weird. They don't use run levels like the rest of the linux world. Sure, their numbering scheme may make more sense, but it also causes much surprise.

Discovery #2: "telinit 1" doesn't work. The GUI shuts down and then nothing. I can see the system is still running but nothing on screen. Nothing on Ctrl-Alt-F1 through to Ctrl-Alt-F12. I think ctrl-alt-del caused a reboot.

So, I booted, and chose the "recovery kernel" from the grub menu. Recovery being another name for single-user mode. It asks for the root password. Ah. Ubuntu doesn't use one. I have to continue into a normal GUI boot.

Key Advice: Set a root password. This has nothing to do with software RAID: you never know when you are going to want to boot into a recovery kernel. I think Ubuntu deserves a forehead slap for this design decision? To set a root password, do "sudo passwd": it will first ask for your user password (that is how sudo works), then it will say "Enter new UNIX password:". I chose the same password as my normal user; easy to remember. (If that makes you nervous, remember that this is equal security to being able to run "sudo passwd"; more importantly, it saves me having to write down my root password somewhere.)

Other comments: the "cp -dpRx" command is the most time-consuming step for those partitions when you have a lot of data. The system will be sluggish while doing this (and if you are in single-user mode you cannot be browsing or doing anything else anyway). But also when you do the "mdadm --add /dev/md3 /dev/sda3" to actually create a genuine RAID partition it will be copying a lot of data, and your system will be sluggish for the time this takes (about 10 minutes per 30G). Bear this in mind.

Swap. I've created /dev/md2 as a swap RAID, but haven't used it. So /dev/md2 just contains /dev/sdb2. And "cat /proc/swaps" tells me only /dev/sda2. Apparently if my sda disk dies my swap will disappear and my system might crash. On the other hand, mirroring swap has a performance downside (apparently). I don't understand the trade-offs well, but this is a workstation, not a server, and a crash should /dev/sda ever die is acceptable. I've also got more memory than I need so I suspect I could switch swap off completely. Anyway, so far Inertia born from Ignorance means I'm doing nothing about this. Expert opinions are welcome.

Everything works. Just two things I've noticed. Sometimes the system doesn't boot. It comes up in an "ash" shell, which is one of those recursive acronyms standing for "Ash is Satanic Hell". So far the only command I've mastered is "reboot", which works well - I've not had two boot failures in a row. After 10-20 reboots I think this boot failure is happening about 20-25% of the time. I've no idea how to troubleshoot it.
UPDATE: this boot problem doesn't seem to be happening since upgrading to Ubuntu 8. Perhaps the upgrade fixed some mistake in my grub.conf? Maybe related is that I no longer have an (hd1,0) section in grub.conf; they are all (hd0,0). I think I'll leave it like that: if "hd0" dies I can edit the grub boot string that once, and then edit grub.conf.

The other thing is my screen sometimes goes blank. The system is fine, it is just like the video card has died. My solution has always been: Ctrl-Alt-F1, pause for half a second, then Ctrl-Alt-F7. I guess this process resets the video card. It used to happen maybe once a day or so. Now it seems to be happening once an hour on average, and is becoming almost irritating enough that I'll have to look into it. The fact that running RAID could affect the display feels like an important clue.

Finally, my home partition is encrypted, using crypt. I have not moved it to software RAID yet, for that reason. I will really soon (as all my data that could really benefit from the security of software RAID is currently the only data not protected by it)! I will run crypt on top of software RAID, rather than the other way around; I'm just not sure the best way to do the file copy.
UPDATE: now done, it was straightforward, see Moving encrypted partition to software RAID.

Tuesday, April 14, 2009

Excel and PDF from PHP

Been meaning to post this for ages, but PHP Architect magazine for November 2008 had a useful article on libraries and code for alternatives to HTML.

For PDF, html2pdf was introduced. Unlike other PHP PDF libraries (e.g. PDFLib, FPDF) this one has you make your content in HTML, which then gets converted. For those of us comfortable with HTML, that means no learning curve.

For Excel, PHPExcel is introduced. This has classes to read and write excel files in a variety of formats. I get this request a lot but so far we have always decided outputting good old CSV is best (simple to make, compact, easy to import into Excel). But if complex layout, or built-in functions, are required this looks worth a look. It can also output into PDF.

Also very interesting is in-memory spreadsheets, with lots of mathematical functions supported. I've not tried it, and neither the article nor the website mentions explicitly if this will work on Linux (or a Windows machine without Excel installed), but it is described as standalone so it should. It might be an interesting way to port someone's spreadsheet to a PHP app!

The article also briefly introduces Google charts, and how to include the produced image in your excel file.

Monday, March 30, 2009

6GB on 32-bit linux

Well, it all started when a friend said their SMT (statistical machine translation) system was ready to download and install. He then casually mentioned it is a bit of a memory hog, "4Gb minimum, 8Gb preferred".

Wow. I looked at my up-until-then-perfectly-adequate-some-might-even-say-overkill 2 gigabytes and felt like a salesman who had just been told he ought to upgrade his sensible car for a sports car in order to look more successful!

I have dual-channel, and two slots free. Another 4Gb was under 5000 yen at Dospara (Japanese only), where I had bought this computer, so I emailed them to confirm it would work. After a few emails, and a bit of research I found out:
* Windows 32-bit will only go up to 3Gb.
* 64-bit Windows, 64-bit Linux are both fine well beyond 8Gb.
* 32-bit linux will allow up to 4Gb per process, and can use more than 4Gb altogether.

(I wanted to stick with 32-bit linux, as Flash is critical to my work and has no 64-bit version.)

Dospara were rather cautious, saying they don't support linux, but I went for it. When I plugged in the extra 4Gb, the bios correctly recognized 6Gb. Then Ubuntu said I had 3Gb. But that was okay, as I'd been expecting it. I went to the package manager and selected the "linux-server" meta-package, then rebooted.

Drum roll please: "free" reports I have 6Gb available. I'm using 475 Mb, and have 5,745 Mb free. See, I told you I didn't need it. But this is city driving. You wait until I take this beast down the local Formula One track, otherwise known as Difficult AI Problems.
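
For anyone wanting to check their own machine first, here is a rough sketch. One detail not spelled out above, so treat it as my assumption about your setup: a 32-bit kernel only reaches beyond 4Gb if it and the CPU support PAE. The helper names are invented, and both read /proc-style text on stdin.

```shell
# Print total RAM in MB from /proc/meminfo-style text on stdin.
mem_total_mb() {
    awk '$1 == "MemTotal:" { print int($2 / 1024) }'
}

# Succeed if /proc/cpuinfo-style text on stdin lists the "pae" CPU flag.
has_pae() {
    grep -Eq '^flags.*[[:space:]]pae([[:space:]]|$)'
}

# Typical use:
#   mem_total_mb < /proc/meminfo
#   has_pae < /proc/cpuinfo && echo "CPU supports PAE"
```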

Oh, while I was there I also bought a 1Tb SATA drive. Yes, that is a "T". A whole terabyte in a little, diddy box. It was only 10,000 yen (you could get one for 7,880 yen if you go for Seagate, but a quick bit of research showed lots of unhappy people, so I went for Western Digital which seemed to be the reliability brand).

What do you mean: "I bet he doesn't need that much storage either" ? Just because my current 250Gb drive still has 143G of free space after 18 months, doesn't mean I suddenly won't need more capacity...

And you can bet that when the ladies hear I am part of the Terabyte Generation I'm going to be fighting them off with a stick. Oooh, yeah! I am so sexy.

Monday, March 16, 2009

Periods in regex character classes

I'm trying to match Japanese numbers with a regex, and this was originally a long article describing what seemed to be a bug when using the following regex:

(a simplified version of my actual code, but this demonstrated the problem equally well)

Now, periods in character classes (a list of alternatives in square brackets) are not special, so don't need escaping. But they do need escaping anywhere else. I knew this, which is why I was scratching my head.

So, naturally, I felt a bit foolish when I realized I was using a list of alternatives in parentheses. It wasn't a character class at all.
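
The difference is easy to demonstrate from the command line. A small sketch using grep -E, with simplified patterns invented for illustration (they are not the regex from my code); matches is a hypothetical helper:

```shell
# Succeed if TEXT (second argument) matches the extended regex PATTERN in full.
matches() {
    printf '%s\n' "$2" | grep -Eq -- "^($1)\$"
}

matches '[0-9.]+'     '730.28' && echo "class: literal period matches"
matches '[0-9.]+'     '730x28' || echo "class: x is rejected"
matches '(0|3|7|.)+'  '730x28' && echo "group: unescaped period matches anything"
matches '(0|3|7|\.)+' '730x28' || echo "group: escaped period rejects x"
```

All four messages are printed: inside square brackets the period is literal, while inside parentheses it is a wildcard unless escaped.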

But, now, this was even more confusing because I had already been down that street. My code (which in hindsight was correct) had started out like this:

and that kept saying no match. E.g. for "730.28".

Did I jump out of my bath and shout "Eureka!" when I realized the problem? No. It was more of a forehead slapper. It wasn't "730.28万". It was just "730.28". The code works by trying to match the above regex, then if it fails passes it to another function. I'd been debugging the wrong function. Once I was looking at the right function it was a trivial fix.

Moral of the story: don't be stupid.

(Or more seriously, "Time To Refactor", as the code is now tangled enough that it was too easy for me to get entangled in the wrong place.)

Sunday, March 8, 2009

What is an ontology?

I went to InterOntology 09, at Keio University a week ago. Actually I only attended the first couple of talks and the (free!) banquet; I couldn't make the other talks. Ontology is one of those words I have had trouble grasping, and I attended with no more ambition than understanding what it means.

My ambition was not fulfilled, but I was relieved to know that most attendees were just as confused as me. Okay, relieved is perhaps the wrong word. And by "confused", I don't mean people stood around scratching their heads. I mean people were using it in different ways. Most people who said they were "building ontologies" were actually building databases.

Someone I met there kindly sent me this link:
What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model?

This (in conjunction with its first comment) is an excellent article. A bit heavy, but definitely worth the effort.

I feel it backs up the opinion I've been forming that ontology is a word that does not really need to exist. People "building ontologies" are generally data modeling, or taxonomy-building, or semantic-network-building. These terms are more specific, and contain more information about exactly what you are doing.

People using the word ontology may want to emphasize that the grammar allows for validating models using logic. Personally I would rather call that validating the data model, though I do see how a widely used ontology representation language could encourage high quality data validation tools. But they are nowhere near that yet - using SPARQL (pronounced sparkle) appears to be more like writing in assembler than the SQL its name hints at.

On the other hand, if everyone listened to me there would have been no InterOntology conference, and I'd have missed out on a free dinner. Such things need to be taken into consideration. Perhaps I should add "Professional Ontologist" to my business card after all? ;-)

Thursday, March 5, 2009

Comparing two arrays in PHP when elements can be in any order

This comes up so often. Usually I want to compare two CSV strings, where order does not matter.

My usual strategy is to first explode(',',$s) both strings. Then loop through one array, seeing if that element exists in the second array. If it does not then we instantly fail. If it does match then remove that element from the second array. At the end, if no failures and if the second array is empty, then they are the same.

When writing the above code for the umpteenth time I wondered, also for the umpteenth time, if there was a better way. I checked the PHP manual and still no "compare_arrays_can_be_in_any_order()" function. (Or, am I missing it - please let me know!)

But maybe I can use a couple of functions?

First idea: array_intersect(). Returns all elements of array1 that are in array2. So:

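$intersect1=array_intersect($list1,$list2); //Elements of list1 that are also in list2.
$intersect2=array_intersect($list2,$list1); //Elements of list2 that are also in list1.
if($intersect1!==$list1)return false; //list1 is not a subset of list2.
if($intersect2!==$list2)return false; //list2 is not a subset of list1.
return true; //Each is a subset of the other. I.e. equal (if duplicates are ignored).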

I've not tried this, as my second idea was better: sort(). The implementation is:

function compare_arrays_can_be_in_any_order($list1,$list2){
sort($list1); sort($list2);
return ($list1===$list2);
}

Note: this is destructive, as both list1 and list2 are modified. Written as above this is fine as I pass the arrays in by value. But be careful if pasting just that code into another function.

Here is the function to solve my original csv matching problem:

function compare_csv_strings_can_be_in_any_order($csv1,$csv2){
$list1=explode(',',$csv1); $list2=explode(',',$csv2); sort($list1); sort($list2);
return ($list1===$list2);
}

Bear in mind this will only work for simple csv strings: if one uses quotes and the other doesn't, or one uses whitespace around the comma and the other doesn't, you need something more sophisticated.
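
For comparison, the same sort-then-compare idea works outside PHP too. Here is a rough shell sketch (csv_same_items is an invented name), with the same limitation: simple comma-separated values only, no quoting and no stray whitespace.

```shell
# Succeed if two simple CSV strings hold the same items, in any order.
# Duplicates count: "a,b" and "a,b,b" are treated as different.
csv_same_items() {
    a=$(printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort)
    b=$(printf '%s\n' "$2" | tr ',' '\n' | LC_ALL=C sort)
    [ "$a" = "$b" ]
}

# Typical use:
#   csv_same_items "red,green,blue" "blue,red,green" && echo "same"
```

Forcing LC_ALL=C makes the sort order byte-based, so the result does not depend on the machine's locale.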

Now your homework question to make sure you are paying attention: sort() takes an optional parameter of flags, to specify sorting items numerically, as strings, or as strings using current locale. Do I need to specify any flags to have my code work with any type of array data, and to work on a machine using a different locale?

What about comparing arrays of arrays?

And final question, the one of most interest to me: do you have a better way? (Better could mean any of: less code, quicker execution or fewer restrictions/disclaimers!)

Tuesday, March 3, 2009

Specify the default, stupid!

I've been struggling all week to connect to a Microsoft SQL server. It was SQL Express. I'm using "ADODB.Connection" COM object from PHP.

(The reason to use that is you can specify the 3rd param as CP_UTF8 (without quotes) and then it will convert from UTF-8 to the UCS-2 that SQL Server works in, all behind the scenes.)

First, if it says it cannot connect due to there being no server: you have to enable TCP/IP connections in SQL Server. They are off by default.

That got us to the next error, which said "接続が正しくありません。" which translates as "The connection is incorrect." Not much to go on. I think this may be the same as "error 26", but I'm not 100% sure on that.

My code could connect fine to the production DB machine, running SQL workstation (which is the more expensive version I believe). And a client on another machine that could also connect to that production DB, could not connect to our SQL Express machine.

So, with all fingers pointing at the SQL Express server, we tried things and googled things and swore at things. Wrong barking tree!

I've given the answer in the subject. SQL Server listens on port 1433 by default. Our SQL server was listening on that port, so you would think no need to specify it. But this DSN fails:

DRIVER={SQL Server};
SERVER={};UID={myuser};PWD={mypassword}; DATABASE={mydatabase}

whereas this one works:
DRIVER={SQL Server};
SERVER={,1433};UID={myuser};PWD={mypassword}; DATABASE={mydatabase}

Only for SQL Express it seems. No need to specify the default port when connecting to SQL workstation.
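
One way to never be bitten by this again is to build the connection string in one place and always include the port. A minimal sketch, in shell purely for illustration: make_dsn, its argument order and the host name "dbhost" are all hypothetical, and only the keys are taken from the DSN above.

```shell
# Build an ODBC-style connection string that always names the port explicitly.
# make_dsn and its argument order are hypothetical; the keys are those used above.
make_dsn() {
    host=$1 user=$2 password=$3 database=$4
    port=${5:-1433}   # 1433 is SQL Server's default, but spell it out anyway
    printf 'DRIVER={SQL Server};SERVER={%s,%s};UID={%s};PWD={%s};DATABASE={%s}\n' \
        "$host" "$port" "$user" "$password" "$database"
}

# Typical use:
#   make_dsn dbhost myuser mypassword mydatabase
```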

Nice one Microsoft. Just got to work on quality control and documentation a bit more, oh, and consistency, and then they could probably go professional with this software business of theirs.

Sunday, February 15, 2009

PHP comments from PHP!

php|architect December 2008 edition's most interesting article was on phpdoc. I'm a fanatic when it comes to the javadoc style of commenting. When I create a new function the first thing I do is write function somename(){}, then the second thing I do is write /** */ above it.

If I'm refactoring then I might copy and paste in some code straightaway, but usually the third thing I do is document. I describe what the function does, er, will do, and usually define the @return tag (if I've not worked out what data, if any, it should return then I've got ahead of myself - how can I even name a function if I don't know what it produces??). The @param tags I usually leave until after I've written and tested the function, as they can change a lot (i.e. I need an algorithm before I know what inputs my algorithm takes).

During the writing of the function body I will usually jump up to the comment block and add an @internal comment (describing the implementation), or a @todo comment ("Need to add error-checking on call to somefunc()" is a very common one!).

To quote from the php|a article: "What's really exciting, though, is that since PHP 5.1, the Reflection classes can also supply the subject's "doc comment"."


The php|a article introduced a base class for automatically creating setters for class vars. My own interest is in somehow using it for contract-based programming (also called DBC: Design By Contract).

I believe the Java implementation of DBC uses docblock entries, but with a pre-compiler. Pre-compilers suck: being able to edit and run the same source code file is a big bonus, and maybe reflection of docblocks enables that?

Of course, using comments is a poor cousin of DBC built into the language itself (especially if it can do compile-time analysis). But, for some reason I've not yet grasped, language designers just don't understand how important it is. They limp along with lint and asserts instead.

Back to the docblocks, I've not thought out the details; does PHP provide some hook that can be called when each function is entered and exited? If not I guess you'd have to write VALIDATE() as the first line of each function, and a RETURN() function everywhere you use return, which would be rather intrusive.

Either way, it could then look for @paramvalidate tags, which could describe the valid data to be found in parameters. It could look for @invariant tags in the class doc block to check that the object is valid on both entry and exit.

Did I mention that DBC finds loads of bugs, without you having to write a single unit test? All you have to do is document properly. Which, of course, you are going to do anyway if you are a professional programmer.

But did you realize DBC can also be used for optimization, by providing hints to the optimizer? As a concrete example (I'm thinking about C++ here) you have two input variables, both described as being in the range 1..100. When they are multiplied together the range of the product is 1..10000, which fits comfortably in even a 16-bit int, so there is no need to consider overflow. If you then divide by that product, a divide-by-zero error is impossible, because zero is outside its range. I'm not a compiler writer, and haven't written assembly in about 15 years, but I'm sure the above information, or similar, can help.

As another example, one that also applies to PHP, static analysis of possible values can tell you that for an if/elseif/elseif/else block only the second elseif block can ever be reached. The compiler can then both report this as a warning and throw away all the dead conditional stuff.

But, that is all a pipe dream, it requires DBC being implemented in the language itself, and I've drifted from the point of this article. Which was: a PHP script can read its own comments! Cool!

Thursday, January 29, 2009

Merging PDF files (and more!)

My mobile phone company decided, 7 months ago, to save the world by not sending paper invoices any more. Unfortunately they didn't tell me (and registering to see invoices online required jumping through lots of hoops, while holding a PIN number I didn't know).
All sorted now, and I only lost one invoice (they only keep the last six online). So I downloaded six shiny new PDF files. Like my phone company I also want to save the world, so wanted to print them four to a page.

The solution is pdftk (in Ubuntu's package list; also available for other major distributions, I believe). This simple command merged them into one file:
pdftk 2008*.pdf cat output all.pdf
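
For anyone who wants to reuse that, here is a rough sketch of the merge wrapped up as a shell function, plus a second function for the "remove one page out of the middle" trick. The function names merge_pdfs and drop_page are made up for this illustration; the only real pdftk features used are the cat operation, its page ranges (with the "end" keyword) and the output keyword.

```shell
# merge_pdfs OUTPUT INPUT... : concatenate the inputs, in the order given, into OUTPUT.
# (Hypothetical helper name; it just wraps the command shown above.)
merge_pdfs() {
    output=$1
    shift
    pdftk "$@" cat output "$output"
}

# drop_page INPUT PAGE OUTPUT : copy INPUT to OUTPUT, leaving out page PAGE.
# Sketch only: PAGE must be neither the first nor the last page,
# because the two ranges below assume there are pages on both sides of it.
drop_page() {
    pdftk "$1" cat "1-$(($2 - 1))" "$(($2 + 1))-end" output "$3"
}
```

So "merge_pdfs all.pdf 2008*.pdf" does the same job as the one-liner, and "drop_page all.pdf 5 trimmed.pdf" would keep pages 1-4 and 6 onwards. The shell expands 2008*.pdf in sorted order, so files named by date get merged chronologically.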

It just worked. Nice. It also does lots of other cool stuff, like adding or removing encryption, removing just one page out of the middle of a PDF, rotating pages and extracting text. No mention of i18n, so I don't know (yet) how it will cope with Japanese PDFs. Or PDFs where the text is actually an image. The project's home page is

It appears there are also Windows and Mac versions.

Wednesday, January 21, 2009

Gnome file extensions are screwing up my life!

Okay, perhaps a touch of frustration crept into that choice of subject.
When you double-click a file in the gnome file manager, a lot of the time it works. When it doesn't, you are in deep trouble. I've visited this problem before:

I never did fix the AIR problem - I had to uninstall AIR completely.
Now, since yesterday, all my inc files refuse to open in SciTE. They say "this claims to be an inc file, but the contents look like PHP. Even though SciTE is your editor of choice for both inc and php files I'm not going to let you open it. Please also check behind the sofa for terrorists."

I suspect it was because I was editing an inc file in jedit a few days ago and manually chose PHP for the syntax highlighting. But if so, how and why is jedit allowed to alter my system configuration??
(I've also noticed that when clicking links within zip files gedit is now associated with txt files; it used to be open office writer. Something weird is going on, but at least this one is an improvement: application associations in zip files are a world of their own, and I've never tracked down any method of configuring them.)

Following my advice in the second of my above blog entries I tried making a php.xml file for *.inc files. But no luck there.

Somebody help! How do I restore this? Or, even better, how do I switch off Gnome's stupid wrong-extension-is-a-security-risk check? In many years of using gnome it has never once complained in a useful way, or detected a genuine problem. Every single time it has just got in the way.

(If you type "gnome-control-center" there is supposed to be an icon in there for file types. It is missing in ubuntu 7! Or perhaps it was only in there in gnome 1 and got removed in gnome 2.)

UPDATE: Problem half-solved. I've installed PCMan File Manager (pcmanfm), and replaced Nautilus everywhere. See the instructions here:
(they say Ubuntu 8 but everything seems to apply to Ubuntu 7 fine.)
I had to go through each text file extension and assign it to /usr/bin/scite (it was defaulting to jedit for most of them, and defaulting to nothing for the troublesome *.inc), but after I did that it is working nicely. File associations appear to be much more transparent than in Nautilus.
However, following the instructions in the 2nd post here for replacing the desktop didn't go so well. The icons could not be moved around on the desktop, and launchers were not working. So I undid those changes, and will now try to change the directory launchers on my desktop to use pcmanfm. It seems I'll need to go through and replace each "location launcher" with an application launcher that calls pcmanfm. A bit of a pain, but it looks like it will work.

Perhaps I'll investigate kubuntu... Surely KDE cannot be worse than Gnome/Nautilus?

Tuesday, January 13, 2009

linux shell: two directories quick switch

Sometimes I have to work in two directories from the command line. For instance, to check that a log file got created, then go back to the directory where the command I am testing is run from. The quick tip is simply "cd -". This takes you back to the previous directory. If you type it again it takes you back to where you were, acting as a nice toggle.

"cd" with no parameters takes you to your home directory. "cd ~/somewhere" takes you directly to a directory relative to your home directory (i.e. tilde expands to your home directory). I had a hunt around but couldn't find any other special characters to use with "cd", but if you know one please let me know.
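
Here is a small sketch of the toggle in action, which you can paste into a script. The two directories are throwaway ones made with mktemp purely for the demo, and the variable names are my own. The one extra fact it relies on is that the shell remembers the previous directory in $OLDPWD, which is what "cd -" jumps to.

```shell
# Two scratch directories standing in for "where I run the command"
# and "where the logs end up" (both hypothetical, created just for this demo).
work_dir=$(mktemp -d)
log_dir=$(mktemp -d)

cd "$work_dir"        # where the command under test is run from
cd "$log_dir"         # hop over to check the log file got created
cd - > /dev/null      # back to the work directory ("cd -" prints the path, so silence it)
first_toggle=$PWD
cd - > /dev/null      # and over to the log directory again
second_toggle=$PWD

# The shell keeps the directory you just left in OLDPWD.
echo "now in:     $PWD"
echo "previously: $OLDPWD"
```

After the second toggle you are sitting in the log directory, with the work directory waiting in $OLDPWD for the next "cd -".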

(This tip was found in Linux Magazine, June 2008)