Sunday, October 25, 2009

Regex Article in php|a

The August edition of PHP Architect magazine contained my introductory article on regexes. It has a PHP slant but mainly it is about regexes. Well, most of it is introductory but it contains a bonus section at the end using advanced regexes in SQL to repair a database in situ.

The print magazine arrived by air mail a couple of weeks ago and I finally got to read it. I was pleased (and relieved) that it read well. If you have read it I'd love to get some constructive criticism - especially if you are uncomfortable with regexes.

It was also my first print article to use colour, and I think that worked well too.

One correction to the complicated "Repairing With Regexes" section at the end of the article. The text says "So \1 has to be written \\\\1 (that is four backslashes)." That seemed a strange thing to write, so I took a look at the unit test source code I had written (it is included with the PDF download version of php|a magazine). Four backslashes is correct in PHP, but in SQL it should be "\\1", not "\1".
(I just checked my emails from when we were proof-reading, and one of the edits had removed all those backslashes; I caught that at the time but ironically I got it wrong when putting them back in.)
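To see why the backslash count changes from layer to layer, here is a rough sketch in Python (not PHP, purely for illustration). It assumes a MySQL-style string literal in which a backslash escapes the character after it; `sql_unescape` is a made-up helper that imitates that behaviour, not a real database function.

```python
def sql_unescape(literal: str) -> str:
    """Hypothetical helper: imitate a SQL string-literal parser in which a
    backslash escapes the next character (assumed MySQL-style behaviour)."""
    out = []
    chars = iter(literal)
    for ch in chars:
        if ch == "\\":
            # Drop the backslash and keep the character it escapes.
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)

# Four backslashes typed in the host language's source code...
host_source_string = "\\\\1"
# ...become two backslashes in the string that is sent to the database.
assert len(host_source_string) == 3  # backslash, backslash, "1"

# The SQL literal parser halves them again, so the regex engine sees \1.
seen_by_regex = sql_unescape(host_source_string)

# With only one backslash in the SQL text, the backreference is lost.
broken = sql_unescape("\\1")
```

In this sketch `seen_by_regex` is a backslash followed by `1`, while `broken` is just `1`, which is the failure the correction above is about.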

The other minor correction I only realized after reading Arne Blanket's column in the same magazine issue! I had written: " IP address after that at sign, such as darren@, which, while unusual, is technically valid". In fact "darren@[]" is the technically valid form. That is doubly annoying, because my article's regex would reject those square brackets and I hadn't explicitly pointed that out. (No harm done, though, as this email form is highly discouraged.)

Arne's column also points out that top-level domains are no longer just two or three characters. So my '\.[a-zA-Z]{2,3}$' suggestion should really have been '\.[a-zA-Z]{2,}$'. Luckily, that suggestion was just in a list of other ideas, not part of the article's main regex.
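As a quick check of the difference, here is a small Python sketch (Python's `re` module rather than PHP's PCRE functions, though the two patterns are the ones quoted above). The example addresses are made up for illustration.

```python
import re

OLD_TLD = re.compile(r"\.[a-zA-Z]{2,3}$")  # two or three letters only
NEW_TLD = re.compile(r"\.[a-zA-Z]{2,}$")   # two or more letters

def ends_with_tld(pattern: re.Pattern, address: str) -> bool:
    """Return True if the address ends in a dot followed by an allowed TLD."""
    return pattern.search(address) is not None

short_tld = "user@example.com"     # hypothetical address
long_tld = "user@example.museum"   # hypothetical address with a long TLD

old_results = (ends_with_tld(OLD_TLD, short_tld), ends_with_tld(OLD_TLD, long_tld))
new_results = (ends_with_tld(NEW_TLD, short_tld), ends_with_tld(NEW_TLD, long_tld))
```

The old pattern accepts the `.com` address but rejects the `.museum` one; the relaxed pattern accepts both.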

(By the way I enjoyed, and will blog about real soon, some other articles in this particular issue; if you are not a subscriber, and your work involves data and PHP, then this is the back issue you should get!)


keith.s.wilkinson said...

Professional editing, professional graphic design, and first-class content -- a fascinating read from cover to cover. Much better than Dr. Dobbs used to be, in my opinion. "Beware! Addictive!" you should have warned ;-)

keith.s.wilkinson said...

> I enjoyed, and will blog about real soon, some other articles in this particular issue.

The article on improved garbage collection in PHP 5.3 was interesting. One popular CMS is based on an open-source framework that supports older versions of PHP; the CMS can post material from any blog on any page of the web site, and apparently uses regexes internally to create the corresponding SQL query.
One programmer commented that his work mainly involves creating large web sites. He'd rather use the CMS than the framework because so much is already done for him, but the regex memory consumption problem causes big sites to crash and burn unless the PHP memory limit is increased to a very large figure (the CMS vendor responded that they recommend a PHP memory limit of at least 64M -- this is far more than the PHP default, and far more than most hosting companies offer as standard). However if the programmer customizes the regex part of the CMS to use less memory, then his customer can no longer apply standard CMS updates from the CMS vendor, and the vendor will no longer support it -- the customer must ask the programmer to apply updates -- which zaps one key advantage of using a company-supported CMS.

keith.s.wilkinson said...

I think the main reason PHP has a bad reputation for security, and why most Japanese Internet providers don't support it, is that form spam can be a major headache.

"Canned" forms, for non-programmers to use, often have the "mailto:" address in a hidden field, and spammers will try to alter this on the fly and hijack the form to relay spam.

Most of the little spam that gets through my ICA signup form at comes from "people" who enter alphabetic characters, or numbers outside of the range "" or 1-3 in the "number attending" field. Ideally I should use a PCRE that lets thru' "" or 1-3 entered as either ascii or unicode 全角数字 -- so that is the next little exercise when I get around to it... It's useful to have a field like this that can be used as a honeypot to catch spam ;-)
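A minimal sketch of that check, written in Python for illustration rather than as the PHP/PCRE version the form would actually need. It assumes the field may be empty or hold a single digit 1-3, in either ASCII or full-width form; `valid_attending` is a hypothetical name. In PHP the same character class would presumably need the `u` modifier so that the full-width digits are read as UTF-8 characters.

```python
import re

# Empty string, or exactly one digit from 1-3 in ASCII or full-width form.
ATTENDING = re.compile(r"[1-3１-３]?")

def valid_attending(value: str) -> bool:
    """Hypothetical validator for the "number attending" field."""
    return ATTENDING.fullmatch(value) is not None

# Anything that fails here can be treated as a probable spam submission.
samples = ["", "2", "２", "4", "abc", "12"]
results = [valid_attending(s) for s in samples]
```

Here the first three samples pass and the last three fail, so the field doubles as the honeypot described above.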

I mainly use regexes to stop attempts to send spam through this ICA signup form, however (in my experience with the ICA site) malicious DOS attempts -- repeated hammering of the form with no "@" in the email address field -- seem to be the heaviest load on the server. The "lightest" way of detecting whether or not there's an "@" seems to be not PCRE but strpos().
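The same "cheapest test first" idea can be sketched in Python, where the `in` operator plays the role of strpos() and the `re` module plays the role of PCRE. The email pattern below is only an illustrative assumption, not the regex from the article.

```python
import re

# Illustrative pattern only: something@something.tld with no spaces.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")

def looks_like_email(value: str) -> bool:
    """Reject obvious junk with a plain substring test before using a regex."""
    # Cheapest test first: no regex engine is involved here.
    if "@" not in value:
        return False
    # Only requests that pass the cheap test pay for the full pattern match.
    return EMAIL_SHAPE.match(value) is not None
```

A request with no "@" at all returns after a single substring search, which is the case that hammers the form most often in the scenario described above.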

The "kind" way of handling such "fatal" errors -- displaying an error message next to the invalid field(s) -- allows automated DOS spammers to repeatedly hammer the form, so is best avoided as far as possible: it's better to disable the form and so stop the auto-spammer ;-)

Ideally, each time that "naughty" text is caught by the form's PCRE, the text (and possibly the number of occurrences) should be logged, and the most frequently-used stuff periodically moved automatically or semi-automatically up to the front of the list of blacklisted words. Likewise, naughty words that are no longer being used do not need to be tested for. Similarly, when blacklisted IPs cease spamming, they can be automatically or semi-automatically dropped from the blacklist. This kind of "adaptive" programming is another interesting challenge I'd like to explore.
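One way this could look, as a rough Python sketch: count which blacklisted words actually fire, then periodically move the busiest ones to the front and drop the ones that never matched. The class name, method names and sample words are all hypothetical, and plain substring tests stand in for the form's PCRE checks. The same bookkeeping could be applied to blacklisted IPs.

```python
from collections import Counter

class AdaptiveBlacklist:
    """Hypothetical sketch of a blacklist that reorders itself by hit count."""

    def __init__(self, words):
        self.words = list(words)  # checked in this order
        self.hits = Counter()     # hits since the last rebalance

    def first_match(self, text):
        """Return the first blacklisted word found in text, or None."""
        lowered = text.lower()
        for word in self.words:
            if word in lowered:
                self.hits[word] += 1  # log the occurrence
                return word
        return None

    def rebalance(self, drop_unused=False):
        """Move frequently hit words to the front; optionally drop unused ones."""
        if drop_unused:
            self.words = [w for w in self.words if self.hits[w] > 0]
        self.words.sort(key=lambda w: self.hits[w], reverse=True)
        self.hits.clear()

blacklist = AdaptiveBlacklist(["casino", "lottery", "loan"])
blacklist.first_match("Win the LOTTERY today")
blacklist.first_match("lottery results inside")
blacklist.first_match("casino night")
blacklist.rebalance(drop_unused=True)
```

After the rebalance, "lottery" is tested first, "casino" second, and "loan" has been dropped because nothing matched it during the period. The "semi-automatic" variant would simply show the proposed new order to a human before applying it.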

darren said...

Hi Keith,
Interesting point about the extra load of preg_match compared to strpos; normally it is not going to be significant in amongst all the other stuff the script does, but when trying to detect and terminate a junk request as quickly as possible I can see how it could make a measurable difference.