Email address validation: Simpler, Faster, More Correct

So, I have merged the obsolete-syntax into the code from the last post. This has resulted in shorter, cleaner, faster validation which is also more correct.

I didn’t like the fact that in the old code there were places where explicit try points needed to be included. It seems that these arose because the ‘obsolete’ syntax was tacked-on to the EBNF for the normal syntax, creating much overlap. Since I merged the syntaxes together, there are no explicit try points needed (there are some implicit ones, I believe, such as in optional). This makes the code both faster and easier to understand.

module Text.Email.Validation (isValid)
where
 
import Text.Parsec
import Text.Parsec.Char
import Data.Char (chr)
 
isValid :: String -> Bool
isValid x = 	either (const False) (const True) (valid x)
 
simply = (>> return ())
-- simply converts a parser returning something to a parser returning nothing
 
valid :: String -> Either ParseError ()
valid = parse addrSpec ""
 
addrSpec = localPart >> char '@' >> domain >> eof
 
localPart = dottedAtoms
domain = dottedAtoms <|> domainLiteral 
 
dottedAtoms = simply $ (optional cfws >> (atom <|> quotedString) >> optional cfws)
	`sepBy1` (char '.')
atom = simply $ many1 atomText
atomText = simply $ alphaNum <|> oneOf "!#$%&'*+-/=?^_`{|}~"
 
domainLiteral =  between (optional cfws >> char '[') (char ']' >> optional cfws) $
	many (optional fws >> domainText) >> optional fws
domainText = ranges [[33..90],[94..126]] <|> obsNoWsCtl
 
quotedString = between (char '"') (char '"') $
	many (optional fws >> quotedContent) >> optional fws
quotedContent = quotedText <|> quotedPair
quotedText = ranges [[33],[35..91],[93..126]] <|> obsNoWsCtl
quotedPair = char '\\' >> (vchar <|> wsp <|> lf <|> cr <|> obsNoWsCtl <|> nullChar)
 
cfws = simply $ many (comment <|> fws)
fws = (many1 wsp >> optional (crlf >> many1 wsp))
	<|> (many1 (crlf >> many1 wsp) >> return ())
 
comment = simply $ between (char '(') (char ')') $
	many (commentContent <|> fws)
commentContent = commentText <|> quotedPair <|> comment
commentText = ranges [[33..39],[42..91],[93..126]] <|> obsNoWsCtl
 
nullChar = simply $ char '\0'
wsp = simply $ oneOf " \t"
cr = simply $ char '\r'
lf = simply $ char '\n'
crlf = simply $ cr >> lf
vchar = ranges [[0x21..0x7e]]
obsNoWsCtl = ranges [[1..8],[11,12],[14..31],[127]]
ranges = simply . oneOf . map chr . concat

This now passes all of Dominic Sayer’s tests that it is meant to—the domain validation used in Dominic Sayer’s tests is more strict than RFC5322 specifies. Expect this to change!

For those who’d like to know, email addresses that now parse that didn’t before include the often-used (‘|’ is merely to indicate the end of whitespace):

I.                        |
 am.                  |
 a.      |
 nice.|
 guy@(yeah)you.com

Comments 2

  1. Mauricio Scheffer wrote:

    Very nice! I just ported this to F#. It passes 84% of Dominic’s tests… could you further explain how these tests are more strict than the RFC?

    Posted 12 Apr 2009 at 9:00 pm
  2. Porges wrote:

    Hi Mauricio, the tests are more strict than the RFC because the RFC has very loose validation for the domain-name part of the email address (you’ll note that all of the test failures are to do with this).

    I am currently working on some other code that will help with this

    Posted 16 Apr 2009 at 10:46 am

Trackbacks & Pingbacks 1

  1. From porges - Properly validating e-mail addresses (or converting EBNF to Parsec) on 15 Mar 2009 at 1:47 am

    [...] Update: See the better code in the next post. [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *