What can we fit in 140 characters?

by George Pollard

This is in reference to the current ‘Twitter image encoding challenge’ running on StackOverflow.

If we want to restrict ourselves to assigned, non-control, non-private Unicode characters, then by my reckoning that gives us 129,775 available characters.

wget http://unicode.org/Public/UNIDATA/UnicodeData.txt
awk -F ';' -f countUnichars.awk UnicodeData.txt | bc

countUnichars.awk source:

BEGIN { print "ibase=16" } # tell bc to read the numbers that follow as hexadecimal

$2 ~ /First>/ { # start of a range: remember its first code point
    first = $1;
    getline; # move on to the matching "Last>" line
    if ($3 !~ /^C/) { # skip control, format, surrogate and private-use ranges
        printf("(%s-%s+1)+", $1, first); # range size, for bc to evaluate
    }
    next;
}

$3 !~ /^C/ { # count every single character outside the C* ("Other") categories
    i++;
}

END { # close the sum with the count of single characters, in hex
    printf("%X\n", i)
}
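
As a rough cross-check on the awk count, the same filter can be sketched in Python with the standard unicodedata module. This is only a sketch under assumptions, not the method used above: it relies on whichever Unicode version the Python build ships with (reported by unicodedata.unidata_version), so its total will generally differ from the figure quoted here, and count_usable is a name made up for illustration.

```python
import unicodedata


def count_usable(limit=0x110000):
    """Count code points below `limit` whose general category is not C*.

    Categories starting with "C" cover control (Cc), format (Cf),
    surrogate (Cs), private use (Co) and unassigned (Cn) code points.
    """
    return sum(
        1
        for cp in range(limit)
        if not unicodedata.category(chr(cp)).startswith("C")
    )


if __name__ == "__main__":
    # The total depends on the Unicode data bundled with the interpreter.
    print("Unicode version:", unicodedata.unidata_version)
    print("usable characters:", count_usable())
```

Restricting the limit makes the filter easy to sanity-check by hand: count_usable(0x80) covers ASCII only and counts the 95 printable characters from space to tilde.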

With 129,775 characters to choose from, each 140-character message can store 2377 bits (297 whole bytes); this is $\lfloor 140 \times \log_2(129775) \rfloor$. With a 16-colour palette each pixel costs $\log_2(16) = 4$ bits, so that is about 594 pixels ($2377 / 4$), which can almost reproduce the Mona Lisa thumbnail on the contest page.
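
The arithmetic can be checked with a few lines of Python. This is a minimal sketch that takes the 129,775 figure as given; capacity_bits and pixel_count are hypothetical helper names, and the palette is assumed to use a whole number of bits per pixel.

```python
def capacity_bits(alphabet_size, length=140):
    """Return floor(length * log2(alphabet_size)) using exact integer arithmetic."""
    # alphabet_size ** length is the number of distinct messages, and a
    # positive integer m satisfies floor(log2(m)) == m.bit_length() - 1.
    return (alphabet_size ** length).bit_length() - 1


def pixel_count(bits, palette_size=16):
    """Return how many whole pixels fit, assuming a whole number of bits per pixel."""
    bits_per_pixel = (palette_size - 1).bit_length()  # 4 for 16 colours
    return bits // bits_per_pixel


if __name__ == "__main__":
    bits = capacity_bits(129775)
    print(bits, bits // 8, pixel_count(bits))  # 2377 297 594
```

Working with the integer 129775 ** 140 avoids any floating-point rounding in the logarithm, which matters here because the unrounded value (about 2377.99) sits close to the next whole bit.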