What can we fit in 140 characters?

This is in reference to the current ‘Twitter image encoding challenge’ running on StackOverflow.

If we want to restrict ourselves to assigned, non-control, non-private Unicode characters, then by my reckoning that gives us 129,775 available characters.

wget http://unicode.org/Public/UNIDATA/UnicodeData.txt
awk -F ';' UnicodeData.txt -f countUnichars.awk | bc

countUnichars.awk source:

BEGIN { print "ibase=16" } # set bc to hex mode
 
$2 ~ /Private/ { # skip any lines with "private" in the description
    getline;
}
 
n { # if n is set, then print the range for bc to calculate
    printf("(%s-%s+1)+",$1,n);
    n="";
}
 
$2 ~ /First>/ { # set n if the start of a range
    n=$1;
    getline;
}
 
$3 !~ "C." { # otherwise count anything that isn't some kind of a control character
    i++;
}
 
END { # print out the count of everything else
    printf("%X\n",i)
}

This means we can store exactly 2377 bits (297 bytes) per message (this is \lfloor\log_2(129775) \times 140\rfloor), so if we use a 16-colour palette we can store about 594 pixels (2377/\log_2(16)), which can almost reproduce the Mona Lisa thumbnail in the contest page.

Post a Comment

Your email is never published nor shared. Required fields are marked *