Something has been bugging me for a long time. All around the web, people are making flawed attempts at validating email addresses, causing headaches for their users and probably for themselves.

I really started to notice this when I began using the disposable address system that Gmail provides. Any mail sent to <youraddress>+<some_other_string>@gmail.com arrives in the Gmail inbox for <youraddress>@gmail.com. This is quite handy, and I personally use it to tag the email I receive automatically. For instance, for any email related to unicorns, I’d simply enter <myaddress>+unicorns@gmail.com on the sign-up form, and my mail filters would tag all mail sent to that address for me. (As an aside, these don’t really work as “proper” disposable email addresses, because it’s easy to strip the + character and everything after it from the local part of the address and recover the real address.) Sounds great, right? Well, it is, until half of the internet fails at email address validation and rejects it.
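To show how little protection the + trick gives, here is a minimal sketch of the stripping step in Python. The function name strip_plus_tag is my own, and the sketch assumes a plain, unquoted local part.

```python
def strip_plus_tag(address: str) -> str:
    """Drop a '+tag' suffix from the local part of an address.

    Hypothetical helper for illustration; assumes an unquoted local part.
    """
    # Split on the last '@' so the domain is separated from the local part.
    local, sep, domain = address.rpartition("@")
    if not sep:
        return address  # no '@' at all, so leave the input untouched
    # Keep only what comes before the first '+'.
    base = local.split("+", 1)[0]
    return f"{base}@{domain}"


print(strip_plus_tag("someone+unicorns@gmail.com"))
```

Anyone who wants the underlying address can recover it in a couple of lines, which is why the tag is useful for filtering but not for hiding.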

The problem is that the email address specification allows for far more than most programmers expect it to. For instance, characters like ! $ & * - = ^ ` | ~ # % ' + / ? _ { } are all valid in the local part (the bit before the @), along with a whole bunch of others (even @ if you quote or escape it). Some of these are a tad silly. Using another @ sign by escaping it, for instance, is just confusing, and is probably only done by sociopaths. Reject some of the others, however, and you’ll start to annoy your users.
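As a rough illustration, here is the kind of overly strict pattern I mean, written in Python. The pattern is made up for this example and is not taken from any particular site or library.

```python
import re

# Hypothetical "letters, digits, dots, underscores and dashes only" pattern.
# It is an example of what NOT to do.
STRICT = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def strict_accepts(address: str) -> bool:
    """Return True if the overly strict pattern accepts the address."""
    return STRICT.match(address) is not None


samples = [
    "plain@example.com",
    "me+unicorns@example.com",
    "o'brien@example.com",
    "a!b#c@example.com",
]

for address in samples:
    verdict = "accepted" if strict_accepts(address) else "rejected"
    print(f"{address}: {verdict}")
```

Only the first sample gets through, even though the other three use nothing but characters from the list above.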

I was recently at a talk given by Andrew Godwin at FOSDEM. In it, he mentioned a problem Django ran into, where the regular expression used for email validation would hang on long input (scratch that, I think this is the bug he mentioned; that other one is hideously old). After some head scratching, they came up with an improved regular expression, which didn’t have the issue. I’m not sure that either solution actually validates according to the specification though, and if the validation errs on the side of being too strict, it’s probably out there irritating people right now. As a fun aside, Perl’s Mail::RFC822::Address module gives you a glimpse of a regular expression that actually follows the specification from RFC 822.

Even the best validation is only going to get you a syntactically correct email address, with no guarantee that it actually exists. If you want to know that you’re being given a working address, send it an email and have the user click a validation link in it, and stop annoying your users!
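A minimal sketch of that approach in Python might look like the following. Everything here is an assumption for illustration: the function names, the in-memory pending dictionary standing in for a real database, and the example.com URL. Actually sending the email is left out.

```python
import secrets

# token -> address; an in-memory stand-in for a real datastore (assumption).
pending = {}


def looks_like_email(address: str) -> bool:
    """Very loose check: something, then an '@', then something."""
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)


def start_verification(address: str) -> str:
    """Record a random token for the address and return the link to email."""
    if not looks_like_email(address):
        raise ValueError("that does not look like an email address")
    token = secrets.token_urlsafe(32)
    pending[token] = address
    # Hypothetical URL; a real site would use its own verification endpoint.
    return f"https://example.com/verify?token={token}"


def confirm(token: str):
    """Return the verified address, or None if the token is unknown or used."""
    return pending.pop(token, None)
```

The syntax check is deliberately lax. The real test is whether the user receives the message and clicks the link, at which point confirm() hands back an address you know works.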

EDIT: I wrote a little follow up article on some of the points raised by commenters.