Monday, March 16, 2009

Periods in regex character classes

I'm trying to match Japanese numbers with a regex, and this was originally a long article describing what seemed to be a bug when using the following regex:
$regex='/((0|1|2|3|4|5|6|7|8|9|.)+)(万|億|兆)/u';

(a simplified version of my actual code, but this demonstrated the problem equally well)

Now, periods inside a character class (a list of alternatives in square brackets) are not special, so they don't need escaping. Anywhere else, an unescaped period matches any character, so a literal period has to be escaped. I knew this, which is why I was scratching my head.

So, naturally, I felt a bit foolish when I realized I was using a list of alternatives in parentheses. It wasn't a character class at all.
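
To make the difference concrete, here is a minimal sketch, assuming PHP's preg_match and a source file saved as UTF-8. The two patterns are the ones from this post; the test strings and variable names are made up for illustration.

```php
<?php
// Unescaped period inside a parenthesized alternation: "." matches any character.
$unescaped = '/((0|1|2|3|4|5|6|7|8|9|.)+)(万|億|兆)/u';
// Escaped period: only a literal "." is accepted alongside the digits.
$escaped = '/((0|1|2|3|4|5|6|7|8|9|\.)+)(万|億|兆)/u';

// Hypothetical inputs, chosen only to show the difference.
$good = '730.28万';
$junk = 'abc万';

$unescapedMatchesGood = preg_match($unescaped, $good); // 1
$unescapedMatchesJunk = preg_match($unescaped, $junk); // 1, the "." swallowed "abc"
$escapedMatchesGood   = preg_match($escaped, $good);   // 1
$escapedMatchesJunk   = preg_match($escaped, $junk);   // 0, no digit or period before the kanji
```

Both patterns accept the well-formed input, which is why the unescaped version can look fine until something that isn't a number comes along.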

But, now, this was even more confusing because I had already been down that street. My code (which in hindsight was correct) had started out like this:
$regex='/((0|1|2|3|4|5|6|7|8|9|\.)+)(万|億|兆)/u';

and that kept reporting no match, e.g. for "730.28".

Did I jump out of my bath and shout "Eureka!" when I realized the problem? No. It was more of a forehead slapper. The input wasn't "730.28万". It was just "730.28", with no unit suffix, so the regex was right to reject it. The code works by trying to match the above regex and then, if that fails, passing the string to another function. I'd been debugging the wrong function. Once I was looking at the right function it was a trivial fix.
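
A minimal sketch of that try-then-fall-back flow, assuming PHP 7.1 or later. The function names, the return shape, and the fallback pattern are all hypothetical stand-ins, not the real code; only the first regex comes from this post.

```php
<?php
// Stage 1: number followed by a unit kanji. Returns null when there is no match.
function parseWithUnit(string $text): ?array
{
    $regex = '/((0|1|2|3|4|5|6|7|8|9|\.)+)(万|億|兆)/u';
    if (preg_match($regex, $text, $m) !== 1) {
        return null; // no unit suffix, so the caller falls through
    }
    // Group 1 is the whole number, group 3 is the unit.
    return ['number' => $m[1], 'unit' => $m[3]];
}

// Stage 2: hypothetical stand-in for the "other function" (digits and periods only).
function parsePlain(string $text): ?array
{
    if (preg_match('/^[0-9.]+$/', $text) !== 1) {
        return null;
    }
    return ['number' => $text, 'unit' => null];
}

// Try the unit pattern first, then fall back to the plain-number parser.
function parseJapaneseNumber(string $text): ?array
{
    return parseWithUnit($text) ?? parsePlain($text);
}
```

With this layout, "730.28" is rejected by the first stage by design and handled by the second, so a "no match" from the first regex is not a bug in itself.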

Moral of the story: don't be stupid.

(Or more seriously, "Time To Refactor", as the code is now tangled enough that it was too easy for me to get entangled in the wrong place.)

3 comments:

darren said...

Postscript
You may be wondering why I used alternatives in parentheses, instead of alternatives in a character class.
Short answer: kanji (in UTF-8) didn't seem to match in a character class, even with the 'u' flag.

If you don't believe me then please show me your non-ascii-in-character-class code that works and I'll take another look.
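
For reference, a minimal sketch of a character-class version that should match, assuming PHP's PCRE functions and a source file saved as UTF-8. The anchored pattern and the variable names are illustrative, not taken from the real code. The second pattern drops the 'u' flag to show one way such a class can fail: without it, the class is treated as a set of single bytes rather than whole characters.

```php
<?php
// Character-class version; with the u flag each kanji is one character.
$withU = '/^([0-9.]+)([万億兆])$/u';
// Same pattern without the u flag: the class becomes a set of single bytes.
$withoutU = '/^([0-9.]+)([万億兆])$/';

$subject = '730.28万';

$matchedWithU    = preg_match($withU, $subject, $m); // 1
$number          = $m[1];                            // "730.28"
$unit            = $m[2];                            // "万"
$matchedWithoutU = preg_match($withoutU, $subject);  // 0, only the first byte of 万 fits the class
```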

keith.s.wilkinson said...

Seems that many people like the O'Reilly Pocket Guide (ISBN 0596514271) because it summarizes regex differences between different computer languages.
There are now at least three CJKV books: the new Lunde (ISBN 0596514476), the Unicode book (059610121X), and the Fonts & Encodings book (0596102429). I vaguely recall one of them (one of the first two?) mentioning regexes. (I don't yet own these, but have seen them in Kinokuniya books.)

keith.s.wilkinson said...

The new Regex Cookbook (O'Reilly, ISBN 0596520689) has about 12 pages of Unicode matching examples.