Properly validating e-mail addresses (or converting EBNF to Parsec)

by George Pollard

Update: See the better code in the next post.

In recent times there have been several calls for websites to properly validate email addresses. Invariably, the compiled regex from Perl’s RFC822 is pasted up as The Way To Do It. The problem with this is (as the source code from the Perl module notes) is that email addresses cannot be validated by a simple regex (due to requiring parenthesis-matching). The Perl code addresses this by first stripping out all comments and then parsing via regex.

With this in mind, I thought that implementing the Addr-Spec specification from RFC 5322 (only released less than 6 months ago) might be a good test of the Haskell library Parsec. So, without further ado I went ahead and translated the EBNF from RFC 5322 directly into Parsec.

The mapping is something like this:

juxtaposition
>>
/
<|>
*
many
1*
many1
[]
optional

Here is the result:

import Text.Parsec
import Text.Parsec.Char
import Data.Char (chr)
 
isValid :: String -> Bool
isValid x = let result = valid x in
	either (const False) (const True) result
 
valid :: String -> Either ParseError ()
valid x = parse addrSpec "" x
 
ignore x = x >> return ()
 
addrSpec = localPart >> char '@' >> domain >> eof
 
localPart = dotAtom <|> quotedString <|> obsLocalPart <?> "local part"
domain = dotAtom <|> domainLiteral <|> obsDomain <?> "domain"
 
domainLiteral = optional cfws >> char '[' >>
		many ( optional fws >> dtext) >>
		optional fws  >> char ']' >> optional cfws
		<?> "domain literal"
 
ranges = oneOf . map chr . concat
vchar = ranges [[0x21..0x7E]] -- from Backus-Naur RFC
dtext = ranges [[33..90],[94..126]] <|> obsDtext
qtext = ranges [[33],[35..91],[93..126]] <|> obsQtext
atext = alphaNum <|> oneOf "!#$%&'*+-/=?^_`{|}~"
ctext = ranges [[33..39],[42..91],[93..126]] <|> obsCtext
wsp = char ' '
	<|> char '\t'
	<?> "space or tab"
 
cr = char '\r' <?> "carriage return"
lf = char '\n' <?> "line feed"
crlf = cr >> lf <?> "CRLF line ending"
 
-- # modification: added try
cfws = try (many1 (optional fws >> comment) >> optional fws) <|> ignore fws
-- # modification from RFC: adding try because of overlap
fws = try (optional (many wsp >> crlf) >> many1 wsp)
	<|> many1 wsp
	<|> obsFws
 
-- # modification: added try
comment = between (char '(') (char ')') (many (try (optional fws >> ccontent)) >> optional fws)
	<?> "comment"
ccontent = ignore ctext
	<|> ignore quotedPair
	<|> comment
 
atom = optional cfws >> many1 atext >> optional cfws
dotAtomText = many1 atext >> many (char '.' >> many1 atext)
dotAtom = optional cfws >> dotAtomText >> optional cfws
 
-- # other change from RFC -- merge prefix
quotedPair = char '\\' >> ((vchar <|> wsp) <|> obsQp)
qcontent = qtext <|> quotedPair
quotedString = optional cfws >>	char '\"' >> many (optional fws >> qcontent) >>
	optional fws >>	char '\"' >> optional cfws
	<?> "quoted string"
 
-- # Obsolete syntax
obsNoWsCtl = ranges [[1..8],[11..12],[14..31],[127]]
obsCtext = obsNoWsCtl
obsDtext = obsNoWsCtl <|> quotedPair
obsQtext = obsNoWsCtl
-- # change: see above
obsQp = (char (chr 0) <|> obsNoWsCtl <|> lf <|> cr)
obsLocalPart = word >> many (char '.' >> word) >> return ()
obsDomain = atom >> many (char '.' >> atom) >> return ()
obsFws = many1 wsp >> many (crlf >> many1 wsp) >> return []
word = atom <|> quotedString

A note before I continue: since Parsec by default does no backtracking (in order to remain efficient), there are a couple of places (two that I’ve found so far) where the original EBNF needs to be changed slightly. I have noted these in the source above. It is possible there are a couple more places that need fixing, but I haven’t run this against a large test suite yet to find them. (They are most likely to be in the ‘obsolete syntax’ section.)

And of course, some demonstrations (keep in mind that there is an extra level of escaping operating here… where relevant I’ve included the unescaped email address in a comment):

isValid "porges@example.com" == True
isValid "porges@@example.com" == False
isValid "\"porges@\"@example.com" == True -- # "porges@"@porg.es
isValid "\"por(g)es@\"@example.com" == True -- # "por(g)es@"@porg.es
isValid "porges(comment)@example.com" == True
isValid "porges(comme(nests)nt)@example.com" == True
isValid "porges(comme(nests)nt())@example.com" == True
isValid "porges(()comme(nests)nt())@example.com" == True
isValid "()porges(()comme(nests)nt())@example.com" == True
isValid "((lol)porges(()comme(nests)nt())@example.com" == False
isValid "((lol))porges(()comme(nests)nt())@example.com" == True
isValid "(lol))porges(()comme(nests)nt())@example.com" == False
isValid "((lol))porges(()comme(nests)nt())@example.com" == True
isValid "\"s\\\0\"@example.com" == True
-- # "s\NUL"@example.com, where NUL is actually the
-- # null character! Yep, can't strlen() on email addresses...

I managed to find this post on email addresses by Phil Haack which has the following tests:

[("\"Abc\\@def\"@example.com",True),
("\"Fred Bloggs\"@example.com",True),
("\"Joe\\Blow\"@example.com",True),
("\"Abc@def\"@example.com",True),
("customer/department=shipping@example.com",True),
("$A12345@example.com",True),
("!def!xyz%abc@example.com",True),
("_somename@example.com",True),
("NotAnEmail",False),
("@NotAnEmail",False),
("\"test\\\\blah\"@example.com",True),
("\"test\\blah\"@example.com",True),
-- # Phil gets false for this, which I think is wrong
-- # (Dominic Sayers notes the same at the end of the comment thread)
("\"test\\\rblah\"@example.com",True),
("\"test\rblah\"@example.com",False),
("\"test\\\"blah\"@example.com",True),
("\"test\"blah\"@example.com",False),
("customer/department@example.com",True),
("$A12345@example.com",True),
("!def!xyz%abc@example.com",True),
("_Yosemite.Sam@example.com",True),
("~@example.com",True),
(".wooly@example.com",False),
("wo..oly@example.com",False),
("pootietang.@example.com",False),
(".@example.com",False),
("\"Austin@Powers\"@example.com",True),
("Ima.Fool@example.com",True),
("\"Ima.Fool\"@example.com",False),
("\"Ima Fool\"@example.com",False),
("Ima Fool@example.com",False)]

Next job is to test it against this batch…

Update: Yep, fails in several areas with the obsolete syntax. I’ve fixed one above. (Note that I’m not concerned with the failures in the domain part of the address, as the RFC5322 EBNF for this is more liberal than the tests require.)

Might have to refactor the syntax… there is a large overlap with the obsolete syntax. (Or just use ‘try’, but that’s not so efficient.)