Properly validating e-mail addresses (or converting EBNF to Parsec)
by Porges
Update: See the better code in the next post.
In recent times there have been several calls for websites to properly validate email addresses. Invariably, the compiled regex from Perl’s RFC822 is pasted up as The Way To Do It. The problem with this is (as the source code from the Perl module notes) is that email addresses cannot be validated by a simple regex (due to requiring parenthesis-matching). The Perl code addresses this by first stripping out all comments and then parsing via regex.
With this in mind, I thought that implementing the Addr-Spec specification from RFC 5322 (only released less than 6 months ago) might be a good test of the Haskell library Parsec. So, without further ado I went ahead and translated the EBNF from RFC 5322 directly into Parsec.
The mapping is something like this:
- juxtaposition
>>- /
<|>- *
many- 1*
many1- []
optional
Here is the result:
import Text.Parsec import Text.Parsec.Char import Data.Char (chr) isValid :: String -> Bool isValid x = let result = valid x in either (const False) (const True) result valid :: String -> Either ParseError () valid x = parse addrSpec "" x ignore x = x >> return () addrSpec = localPart >> char '@' >> domain >> eof localPart = dotAtom <|> quotedString <|> obsLocalPart <?> "local part" domain = dotAtom <|> domainLiteral <|> obsDomain <?> "domain" domainLiteral = optional cfws >> char '[' >> many ( optional fws >> dtext) >> optional fws >> char ']' >> optional cfws <?> "domain literal" ranges = oneOf . map chr . concat vchar = ranges [[0x21..0x7E]] -- from Backus-Naur RFC dtext = ranges [[33..90],[94..126]] <|> obsDtext qtext = ranges [[33],[35..91],[93..126]] <|> obsQtext atext = alphaNum <|> oneOf "!#$%&'*+-/=?^_`{|}~" ctext = ranges [[33..39],[42..91],[93..126]] <|> obsCtext wsp = char ' ' <|> char '\t' <?> "space or tab" cr = char '\r' <?> "carriage return" lf = char '\n' <?> "line feed" crlf = cr >> lf <?> "CRLF line ending" -- # modification: added try cfws = try (many1 (optional fws >> comment) >> optional fws) <|> ignore fws -- # modification from RFC: adding try because of overlap fws = try (optional (many wsp >> crlf) >> many1 wsp) <|> many1 wsp <|> obsFws -- # modification: added try comment = between (char '(') (char ')') (many (try (optional fws >> ccontent)) >> optional fws) <?> "comment" ccontent = ignore ctext <|> ignore quotedPair <|> comment atom = optional cfws >> many1 atext >> optional cfws dotAtomText = many1 atext >> many (char '.' >> many1 atext) dotAtom = optional cfws >> dotAtomText >> optional cfws -- # other change from RFC -- merge prefix quotedPair = char '\\' >> ((vchar <|> wsp) <|> obsQp) qcontent = qtext <|> quotedPair quotedString = optional cfws >> char '\"' >> many (optional fws >> qcontent) >> optional fws >> char '\"' >> optional cfws <?> "quoted string" -- # Obsolete syntax obsNoWsCtl = ranges [[1..8],[11..12],[14..31],[127]] obsCtext = obsNoWsCtl obsDtext = obsNoWsCtl <|> quotedPair obsQtext = obsNoWsCtl -- # change: see above obsQp = (char (chr 0) <|> obsNoWsCtl <|> lf <|> cr) obsLocalPart = word >> many (char '.' >> word) >> return () obsDomain = atom >> many (char '.' >> atom) >> return () obsFws = many1 wsp >> many (crlf >> many1 wsp) >> return [] word = atom <|> quotedString |
A note before I continue: since Parsec by default does no backtracking (in order to remain efficient), there are a couple of places (two that I’ve found so far) where the original EBNF needs to be changed slightly. I have noted these in the source above. It is possible there are a couple more places that need fixing, but I haven’t run this against a large test suite yet to find them. (They are most likely to be in the ‘obsolete syntax’ section.)
And of course, some demonstrations (keep in mind that there is an extra level of escaping operating here… where relevant I’ve included the unescaped email address in a comment):
isValid "porges@example.com" == True isValid "porges@@example.com" == False isValid "\"porges@\"@example.com" == True -- # "porges@"@porg.es isValid "\"por(g)es@\"@example.com" == True -- # "por(g)es@"@porg.es isValid "porges(comment)@example.com" == True isValid "porges(comme(nests)nt)@example.com" == True isValid "porges(comme(nests)nt())@example.com" == True isValid "porges(()comme(nests)nt())@example.com" == True isValid "()porges(()comme(nests)nt())@example.com" == True isValid "((lol)porges(()comme(nests)nt())@example.com" == False isValid "((lol))porges(()comme(nests)nt())@example.com" == True isValid "(lol))porges(()comme(nests)nt())@example.com" == False isValid "((lol))porges(()comme(nests)nt())@example.com" == True isValid "\"s\\\0\"@example.com" == True -- # "s\NUL"@example.com, where NUL is actually the -- # null character! Yep, can't strlen() on email addresses... |
I managed to find this post on email addresses by Phil Haack which has the following tests:
[("\"Abc\\@def\"@example.com",True), ("\"Fred Bloggs\"@example.com",True), ("\"Joe\\Blow\"@example.com",True), ("\"Abc@def\"@example.com",True), ("customer/department=shipping@example.com",True), ("$A12345@example.com",True), ("!def!xyz%abc@example.com",True), ("_somename@example.com",True), ("NotAnEmail",False), ("@NotAnEmail",False), ("\"test\\\\blah\"@example.com",True), ("\"test\\blah\"@example.com",True), -- # Phil gets false for this, which I think is wrong -- # (Dominic Sayers notes the same at the end of the comment thread) ("\"test\\\rblah\"@example.com",True), ("\"test\rblah\"@example.com",False), ("\"test\\\"blah\"@example.com",True), ("\"test\"blah\"@example.com",False), ("customer/department@example.com",True), ("$A12345@example.com",True), ("!def!xyz%abc@example.com",True), ("_Yosemite.Sam@example.com",True), ("~@example.com",True), (".wooly@example.com",False), ("wo..oly@example.com",False), ("pootietang.@example.com",False), (".@example.com",False), ("\"Austin@Powers\"@example.com",True), ("Ima.Fool@example.com",True), ("\"Ima.Fool\"@example.com",False), ("\"Ima Fool\"@example.com",False), ("Ima Fool@example.com",False)] |
Next job is to test it against this batch…
Update: Yep, fails in several areas with the obsolete syntax. I’ve fixed one above. (Note that I’m not concerned with the failures in the domain part of the address, as the RFC5322 EBNF for this is more liberal than the tests require.)
Might have to refactor the syntax… there is a large overlap with the obsolete syntax. (Or just use ‘try’, but that’s not so efficient.)
How does this support unicode domains?
It doesn’t; RFC5322 has a rather liberal handling of the domain name part. If I finish this I might extend it with support for the recent draft RFCs regarding UTF-8.
This seems really useful. If you are not inclined to hackagize and maintain this yourself after the initial release, I volunteer.
This (actually the better version in the next post) is now on hackage as “email-validate”.