Email address validation: an addendum

When I was writing my last article, I wasn’t expecting too much of a response. Perhaps a comment or two from the 30-50 visits I get for most of my posts would be nice. The post actually circulated a bit more than usual, and I got quite a bit of interesting feedback. Hopefully these afterthoughts will get to at least some of those who read the original article (whether it’s the usual 50, or the 50,000).

One crucial thing that came up was the reasoning behind rejecting an email address that the user enters into the form. It boils down to wanting or needing to be able to contact the user at a later time. A very common case is account recovery. If a user loses their password, and you don’t have any way of contacting them, they’re probably stuffed. If you can tell them as quickly as possible that what they’ve entered doesn’t look right, then you’re going to save a lot of bother.

A few people commented that they often get less technically savvy users entering all kinds of incorrect things, ranging from just the local part (or conversely just the domain) of their email address to their desired username on the service they’re trying to register with. Given that this sort of thing goes on, validating that an email address is composed of an “@” symbol with some characters on either side is sensible (of course, you should still send a validation email if you want to make sure you’re being given correct data).
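A minimal Python sketch of that kind of loose check might look like the following. The function name looks_like_email is my own invention for illustration, and splitting on the last “@” is an assumption made so that quoted local parts containing an “@” still pass.

```python
def looks_like_email(address: str) -> bool:
    """Loose sanity check: an "@" with at least one character on either side.

    This does not prove the address exists; it only catches obvious slips
    such as entering just a username or just a domain.
    """
    # Split on the LAST "@" so a quoted local part containing "@" still passes.
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)
```

Something like looks_like_email("someone@example.com") passes, while a bare username or a bare domain does not. Anything stricter is better left to the validation email.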

As for using such validation to prevent fake account creation, it’s trivially bypassed. If I’m working for nasty-corporation.com and want to sign up a bunch of accounts on your site to post loads of spam, I’m probably going to be able to generate plausible-looking email addresses, or better still create valid addresses on my own domain to register with. As annoying as they can be, some form of CAPTCHA is better for this (preferably one with an audio alternative to the picture, for accessibility reasons), and doesn’t rely on spammers being totally incompetent.

Just one more thing: I know the comment system built into WordPress does a poor job of all this. More than a few of you were good enough to point it out. I’m switching away from the default one soon anyway.

Email address validation: please stop

It’s something that’s been bugging me for a long time. All around the web, people are making flawed attempts at validating email addresses, causing a headache for their users, and probably for themselves.

I really started to notice this when I began to use the disposable addresses system that Gmail provides. Any mail sent to <youraddress>+<some_other_string>@gmail.com arrives in the Gmail inbox for <youraddress>@gmail.com. This is quite handy, and I personally use it for automatically tagging email I receive. For instance, for any email related to unicorns, I’d simply enter “<myaddress>+unicorns@gmail.com” on the sign-up form, and my mail filters would automatically tag all mail sent to that address for me (as an aside, these don’t really work as “proper” disposable email addresses as it’s easy to just strip everything after the “+” character in the local part of the address, and get the proper address). Sounds great, right? Well it is, until half of the internet fails at email address validation and rejects it.
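To show how little effort it takes to recover the underlying address from a tagged one, here is a rough Python sketch. The function name strip_plus_tag is hypothetical, and the code assumes Gmail-style tagging where everything after the first “+” in the local part is the tag; other providers may treat “+” differently.

```python
def strip_plus_tag(address: str) -> str:
    """Drop a Gmail-style "+tag" from the local part of an address."""
    # Split on the last "@" to separate the local part from the domain.
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" at all: leave the input untouched rather than guess.
        return address
    # Keep only what comes before the first "+" in the local part.
    base, _, _tag = local.partition("+")
    return f"{base}@{domain}"
```

Calling strip_plus_tag("myaddress+unicorns@gmail.com") gives back "myaddress@gmail.com", which is why these addresses are handy for filtering but are not truly disposable.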

The problem is that the email address specification allows for far more than most programmers expect it to. For instance, characters like “! $ & * - = ^ ` | ~ # % ' + / ? _ { }” are all valid, along with a whole bunch of others (even “@” if you quote or escape it). Some of these are a tad silly. Using another “@” sign by escaping, for instance, is just confusing, and is probably only used by sociopaths. Reject some of the others, however, and you’ll start to annoy your users.
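As an illustration of how this goes wrong, here is a Python sketch of the sort of over-strict pattern that ends up in sign-up forms. The pattern is a made-up example, not taken from any particular site or library, and the sample addresses are placeholders that use only characters from the list above.

```python
import re

# Hypothetical over-strict pattern: letters, digits, dot, underscore and
# hyphen in the local part, and a 2-4 letter top-level domain.
STRICT = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$")


def strict_accepts(address: str) -> bool:
    """Return True if the over-strict pattern accepts the address."""
    return STRICT.match(address) is not None


# Placeholder addresses that are syntactically fine but use "+", "'" or "/".
valid_addresses = [
    "someone+unicorns@gmail.com",
    "o'brien@example.com",
    "first/last@example.com",
]

# Every one of these is wrongly rejected by the strict pattern.
wrongly_rejected = [a for a in valid_addresses if not strict_accepts(a)]
```

A plain address such as "someone@example.com" sails through, which is exactly why this kind of bug tends to go unnoticed by whoever wrote the pattern.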

I was recently at a talk given by Andrew Godwin at FOSDEM. In it, he mentioned a problem Django ran into, where the regular expression used for email validation would hang on long input (scratch that, I think this is the bug he mentioned; that other one is hideously old). After some head scratching, they came up with an improved regular expression which didn’t have the issue. I’m not sure that either version actually validates according to the specification though, and if the validation errs on the side of being too strict, it’s probably out there irritating people right now. As a fun aside, Perl’s Mail::RFC822::Address module gives you a glimpse of a regular expression that actually follows the specification from RFC 822.

Even the best validation is only going to get you a syntactically correct email address, with no guarantee that it actually exists. If you want to know that you’re being given a valid address, send it an email and have the user click a validation link in it, and stop annoying your users!
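For what it’s worth, the validation-link approach needs very little code. The Python sketch below is only an outline under some loud assumptions: the in-memory dict stands in for a database table, the example.com URL is a placeholder, the function names are hypothetical, and actually sending the email (plus token expiry) is left out.

```python
import secrets

# token -> address; an in-memory stand-in for a real database table.
pending = {}


def start_verification(address: str) -> str:
    """Create a one-time token for the address and return the link to email."""
    token = secrets.token_urlsafe(32)  # unguessable, URL-safe random token
    pending[token] = address
    # Placeholder URL; a real site would use its own verification endpoint.
    return f"https://example.com/verify?token={token}"


def confirm(token: str):
    """Return the verified address for a token, or None if it is unknown.

    The token is removed on first use, so a link only works once.
    """
    return pending.pop(token, None)
```

The point is that the address is only treated as valid once someone who can read its inbox clicks the link, which no amount of regular expression cleverness can tell you.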

EDIT: I wrote a little follow-up article on some of the points raised by commenters.