Pearls Before Piglets

While Googling my way through the interwebs, I came across the 2008 Western Australian Certificate of Education sample examination for Stage 2 Biological Sciences. It contains this diagram:

Diagram

If you’re wondering, the entire hierarchy is drawn from Umberto Eco’s novel Baudolino.

What can we fit in 140 characters?

This is in reference to the current ‘Twitter image encoding challenge’ running on Stack Overflow.

If we want to restrict ourselves to assigned, non-control, non-private Unicode characters, then by my reckoning that gives us 129,775 available characters.

wget http://unicode.org/Public/UNIDATA/UnicodeData.txt
awk -F ';' -f countUnichars.awk UnicodeData.txt | bc

countUnichars.awk source:

BEGIN { print "ibase=16" } # set bc to hex mode
 
$2 ~ /Private/ { # skip any lines with "private" in the description
    getline;
}
 
n { # if n is set, then print the range for bc to calculate
    printf("(%s-%s+1)+",$1,n);
    n="";
}
 
$2 ~ /First>/ { # set n if the start of a range
    n=$1;
    getline;
}
 
$3 !~ "C." { # otherwise count anything that isn't some kind of a control character
    i++;
}
 
END { # print out the count of everything else
    printf("%X\n",i)
}

This means we can store 2377 whole bits (297 bytes) per message (this is $\lfloor 140 \times \log_2(129775) \rfloor$), so with a 16-colour palette we can store about 594 pixels ($2377 / \log_2(16)$), which can almost reproduce the Mona Lisa thumbnail on the contest page.
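For anyone who wants to check the arithmetic, here is the calculation step by step. It takes the 129,775 figure from the script above as given; the decimal values are rounded approximations.

```latex
\begin{align*}
\log_2(129775) &\approx 16.98565 \\
140 \times 16.98565 &\approx 2377.99
  \quad\Rightarrow\quad \lfloor 2377.99 \rfloor = 2377 \text{ bits} \\
\lfloor 2377 / 8 \rfloor &= 297 \text{ bytes} \\
\lfloor 2377 / \log_2(16) \rfloor &= \lfloor 2377 / 4 \rfloor = 594 \text{ pixels}
\end{align*}
```

As one illustration, 594 pixels would be enough for a 27 × 22 image at 4 bits per pixel.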

Rhythmbox Plugin: Stop after current track

I have wanted this for a while, and my brother linking me to this post was the last straw.

So here is a very quick and simple plugin; it puts a button on the toolbar that you can click when you want to stop playback after the current song. I based the toolbar button code on Alexandre Rosenfeld’s lastfm-queue plugin, since I had no idea where to start with that.

Download it here, and put it into ~/.gnome2/rhythmbox/plugins/stop_after_song/. Activate it in Rhythmbox’s plugin dialog.

Email address validation: Simpler, Faster, More Correct

So, I have merged the obsolete syntax into the code from the last post. This has resulted in shorter, cleaner, faster validation which is also more correct.

I didn’t like the fact that in the old code there were places where explicit try points needed to be included. It seems that these arose because the ‘obsolete’ syntax was tacked on to the EBNF for the normal syntax, creating much overlap. Since I merged the syntaxes together, there are no explicit try points needed (there are some implicit ones, I believe, such as in optional). This makes the code both faster and easier to understand.
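To illustrate the point about try points, here is a minimal sketch using a toy grammar (it is not taken from the email parser). The names overlapping, withTry, merged and accepts are made up for this example. It assumes Parsec's usual behaviour, where string consumes input on a partial match, so a second alternative sharing the same prefix is never attempted unless the first is wrapped in try.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Two alternatives that share the prefix "ab". On input "abd" the first
-- alternative consumes "ab" before failing, so <|> never tries the second.
overlapping :: Parser String
overlapping = string "abc" <|> string "abd"

-- Explicit try point: rewind to the start if the first alternative fails.
withTry :: Parser String
withTry = try (string "abc") <|> string "abd"

-- Merged grammar: the shared prefix is parsed once, so no backtracking
-- (and no try) is needed.
merged :: Parser String
merged = do
  prefix <- string "ab"
  c <- oneOf "cd"
  return (prefix ++ [c])

-- Hypothetical helper: does the parser accept the whole input?
accepts :: Parser a -> String -> Bool
accepts p = either (const False) (const True) . parse (p >> eof) ""

main :: IO ()
main = mapM_ print
  [ accepts overlapping "abd"  -- False
  , accepts withTry "abd"      -- True
  , accepts merged "abd"       -- True
  ]
```

The merged version does the same job as the try version, but never has to re-read input it has already consumed, which is roughly what merging the two syntaxes buys in the validator.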

module Text.Email.Validation (isValid)
where
 
import Text.Parsec
import Text.Parsec.Char
import Data.Char (chr)
 
isValid :: String -> Bool
isValid x = either (const False) (const True) (valid x)
 
simply = (>> return ())
-- simply converts a parser returning something to a parser returning nothing
 
valid :: String -> Either ParseError ()
valid = parse addrSpec ""
 
addrSpec = localPart >> char '@' >> domain >> eof
 
localPart = dottedAtoms
domain = dottedAtoms <|> domainLiteral 
 
dottedAtoms = simply $ (optional cfws >> (atom <|> quotedString) >> optional cfws)
	`sepBy1` (char '.')
atom = simply $ many1 atomText
atomText = simply $ alphaNum <|> oneOf "!#$%&'*+-/=?^_`{|}~"
 
domainLiteral =  between (optional cfws >> char '[') (char ']' >> optional cfws) $
	many (optional fws >> domainText) >> optional fws
domainText = ranges [[33..90],[94..126]] <|> obsNoWsCtl
 
quotedString = between (char '"') (char '"') $
	many (optional fws >> quotedContent) >> optional fws
quotedContent = quotedText <|> quotedPair
quotedText = ranges [[33],[35..91],[93..126]] <|> obsNoWsCtl
quotedPair = char '\\' >> (vchar <|> wsp <|> lf <|> cr <|> obsNoWsCtl <|> nullChar)
 
cfws = simply $ many (comment <|> fws)
fws = (many1 wsp >> optional (crlf >> many1 wsp))
	<|> (many1 (crlf >> many1 wsp) >> return ())
 
comment = simply $ between (char '(') (char ')') $
	many (commentContent <|> fws)
commentContent = commentText <|> quotedPair <|> comment
commentText = ranges [[33..39],[42..91],[93..126]] <|> obsNoWsCtl
 
nullChar = simply $ char '\0'
wsp = simply $ oneOf " \t"
cr = simply $ char '\r'
lf = simply $ char '\n'
crlf = simply $ cr >> lf
vchar = ranges [[0x21..0x7e]]
obsNoWsCtl = ranges [[1..8],[11,12],[14..31],[127]]
ranges = simply . oneOf . map chr . concat

This now passes all of Dominic Sayers’ tests that it is meant to pass (the domain validation used in those tests is stricter than RFC 5322 specifies). Expect this to change!

For those who’d like to know, email addresses that now parse that didn’t before include the often-used (‘|’ is merely to indicate the end of whitespace):

I.                        |
 am.                  |
 a.      |
 nice.|
 guy@(yeah)you.com