<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>porges &#187; Unicode</title>
	<atom:link href="http://porg.es/blog/tag/unicode/feed" rel="self" type="application/rss+xml" />
	<link>http://porg.es/blog</link>
	<description></description>
	<lastBuildDate>Thu, 12 Jan 2012 23:45:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Unicode as she is broke</title>
		<link>http://porg.es/blog/unicode-as-she-is-broke</link>
		<comments>http://porg.es/blog/unicode-as-she-is-broke#comments</comments>
		<pubDate>Mon, 14 Mar 2011 08:34:02 +0000</pubDate>
		<dc:creator>Porges</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[Broken]]></category>
		<category><![CDATA[criticism]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://porg.es/blog/?p=418</guid>
		<description><![CDATA[Do you think your string-handling code is robust? Are there any problems with the following snippets? // Write the first character: Console.WriteLine&#40;s&#91;0&#93;&#41;; &#160; // Reverse the string: var backwards = s.ToArray&#40;&#41;; Array.Reverse&#40;backwards&#41;; Console.WriteLine&#40;new string&#40;backwards&#41;&#41;; &#160; // List the characters, separated by commas: Console.WriteLine&#40;string.Join&#40;&#34;, &#34;, s.ToArray&#40;&#41;&#41;&#41;; All of these are potential bugs. Just set the following [...]]]></description>
			<content:encoded><![CDATA[<p>Do you think your string-handling code is robust? Are there any problems with the following snippets?</p>
<p><span id="more-418"></span></p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;"><span style="color: #008080; font-style: italic;">// Write the first character:</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span>s<span style="color: #008000;">&#91;</span><span style="color: #FF0000;">0</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
&nbsp;
<span style="color: #008080; font-style: italic;">// Reverse the string:</span>
var backwards <span style="color: #008000;">=</span> s<span style="color: #008000;">.</span><span style="color: #0000FF;">ToArray</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
Array<span style="color: #008000;">.</span><span style="color: #0000FF;">Reverse</span><span style="color: #008000;">&#40;</span>backwards<span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">new</span> <span style="color: #6666cc; font-weight: bold;">string</span><span style="color: #008000;">&#40;</span>backwards<span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
&nbsp;
<span style="color: #008080; font-style: italic;">// List the characters, separated by commas:</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">string</span><span style="color: #008000;">.</span><span style="color: #0000FF;">Join</span><span style="color: #008000;">&#40;</span><span style="color: #666666;">&quot;, &quot;</span>, s<span style="color: #008000;">.</span><span style="color: #0000FF;">ToArray</span><span style="color: #008000;">&#40;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span></pre></div></div>

 <a class="simple-footnote" title="Note that even if you have a UTF-16 implementation with a &#8220;non-broken&#8221; char, the reversal given here isn&#8217;t valid. For one thing, it will put combining characters in the wrong place. The reversal of a string is also potentially locale-dependent: one can argue that the reversal of &#8220;Dijkstra&#8221; (Dutch) should be &#8220;artsijkD&#8221;." id="return-note-418-1" href="#note-418-1"><sup>1</sup></a>
<p>All of these are potential bugs.</p>
<p>Just set the following and you&#8217;ll get some nice output:</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;"><span style="color: #6666cc; font-weight: bold;">string</span> s <span style="color: #008000;">=</span> <span style="color: #666666;">&quot;𠂔&quot;</span><span style="color: #008000;">;</span></pre></div></div>

<pre>�
��
�, �</pre>
<p>There are other problems as well:</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;"><span style="color: #6666cc; font-weight: bold;">string</span> A <span style="color: #008000;">=</span> <span style="color: #666666;">&quot;𝔄&quot;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// MATHEMATICAL FRAKTUR CAPITAL A, if you can't see it</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">char</span><span style="color: #008000;">.</span><span style="color: #0000FF;">IsUpper</span><span style="color: #008000;">&#40;</span>A<span style="color: #008000;">&#91;</span><span style="color: #FF0000;">0</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// False</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">char</span><span style="color: #008000;">.</span><span style="color: #0000FF;">IsUpper</span><span style="color: #008000;">&#40;</span>A<span style="color: #008000;">&#91;</span><span style="color: #FF0000;">1</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// False</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">char</span><span style="color: #008000;">.</span><span style="color: #0000FF;">IsUpper</span><span style="color: #008000;">&#40;</span>A,<span style="color: #FF0000;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// True - this is the correct way</span></pre></div></div>

<p>Why is this?</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;">Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">char</span><span style="color: #008000;">.</span><span style="color: #0000FF;">GetUnicodeCategory</span><span style="color: #008000;">&#40;</span>A<span style="color: #008000;">&#91;</span><span style="color: #FF0000;">0</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// Surrogate</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">char</span><span style="color: #008000;">.</span><span style="color: #0000FF;">GetUnicodeCategory</span><span style="color: #008000;">&#40;</span>A<span style="color: #008000;">&#91;</span><span style="color: #FF0000;">1</span><span style="color: #008000;">&#93;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// Surrogate</span>
Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span><span style="color: #6666cc; font-weight: bold;">char</span><span style="color: #008000;">.</span><span style="color: #0000FF;">GetUnicodeCategory</span><span style="color: #008000;">&#40;</span>A,<span style="color: #FF0000;">0</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span> <span style="color: #008080; font-style: italic;">// UppercaseLetter</span></pre></div></div>

<p>Ah.</p>
<h2>The UCS-2 Hangover</h2>
<p>These are all symptoms of &#8216;migrated&#8217; libraries, originally developed for UCS-2 (in the early days of Unicode), and now touted as UTF-16.</p>
<p>Whereas UCS-2 was a fixed-width, 16-bit format, UTF-16 is a variable-width format which uses pairs of &#8216;surrogate&#8217; 16-bit units to support characters that UCS-2 does not. These are the characters outside the Basic Multilingual Plane (BMP), sometimes termed the &#8216;astral characters&#8217;.</p>
<p>Libraries with <code>char</code> as a 16-bit quantity are living a lie. They live in a delusional world of fixed-width characters, where they let you believe you can index, substring, and play around at will, whereas:</p>
<ul>
<li>You cannot select arbitrary substrings &mdash; you might get half a character included.</li>
<li>You cannot index into the string arbitrarily &mdash; you might select half a character.</li>
<li>You cannot treat a string as a bunch of isolated characters &mdash; the surrogate pairs will get broken up.</li>
</ul>
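<p>A minimal sketch of code-point-aware versions of the three opening snippets, assuming .NET 4 or later. <code>CodePoints.Split</code> is a hypothetical helper, not part of the framework; the only library call it leans on is <code>char.IsSurrogatePair</code>. It keeps surrogate pairs together but, as the footnotes point out, it does nothing about combining characters:</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;">using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: splits s into one string per code point,
    // so that a surrogate pair is never separated.
    public static List&lt;string&gt; Split(string s)
    {
        var parts = new List&lt;string&gt;();
        for (int i = 0; i &lt; s.Length; i++)
        {
            // A well-formed surrogate pair occupies two UTF-16 code units.
            int width = char.IsSurrogatePair(s, i) ? 2 : 1;
            parts.Add(s.Substring(i, width));
            i += width - 1; // skip the low surrogate, if there was one
        }
        return parts;
    }
}

public static class Program
{
    public static void Main()
    {
        string s = "𠂔";
        List&lt;string&gt; parts = CodePoints.Split(s);

        // Write the first character:
        Console.WriteLine(parts[0]);

        // Reverse the string (by code point, not by char):
        var backwards = new List&lt;string&gt;(parts);
        backwards.Reverse();
        Console.WriteLine(string.Concat(backwards));

        // List the characters, separated by commas:
        Console.WriteLine(string.Join(", ", parts));
    }
}</pre></div></div>

<p>The price is an extra pass over the string and a list of small strings, which is exactly the kind of bookkeeping a fixed-width <code>char</code> was supposed to make unnecessary.</p>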
<h2>So?</h2>
<p>The general response seems to be &#8220;it doesn&#8217;t matter, most of the stuff I deal with is in the BMP, so I don&#8217;t care&#8221;.</p>
<p>A simple example of the problem with this: if any of these &#8216;orphan surrogates&#8217; gets into an XML document, you have an application-stopping bug on your hands. (Just ask <a href="http://www.linqpad.net/">LINQPad</a>, which refuses to run any of the examples in the first block of code. Luckily it has good exception handling.) <a href="http://www.google.co.nz/search?q=&quot;The+surrogate+pair+is+invalid&quot;">A quick Google search</a> shows that many developers are already encountering this problem.</p>
<p>All of the latest CJK extensions (B, C, D) also live in the <a href="http://en.wikipedia.org/wiki/Supplementary_Ideographic_Plane#Supplementary_Ideographic_Plane">Supplementary Ideographic Plane</a> (SIP, the second non-BMP plane). I haven&#8217;t really been able to find out just how many of these characters are actually used in any real sense, but you&#8217;re going to encounter them eventually (and <a href="http://www.unicode.org/charts/PDF/U1F600.pdf">recent additions to Unicode</a> might start turning up).</p>
<h3>There&#8217;s a hole in your abstraction</h3>
<p>The problem with UCS-2&ndash;flavoured UTF-16 is that it hands all the work of handling surrogate characters correctly over to the developer.</p>
<p>Underlining the third character in a string sounds like a simple task. But, if someone slips in a non-BMP character, then&#8230;</p>
 <a class="simple-footnote" title="Again, this won&#8217;t work perfectly even with a &#8220;non-broken&#8221; char type. Combining characters need to be considered. This argues for a more sophisticated library in general, one that has an idea of a visible &#8216;stack&#8217; of codepoints (or some other basic glyph construct). A discussion for another time&#8230;" id="return-note-418-2" href="#note-418-2"><sup>2</sup></a>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;"><span style="color: #0600FF; font-weight: bold;">foreach</span> <span style="color: #008000;">&#40;</span>var s <span style="color: #0600FF; font-weight: bold;">in</span> <span style="color: #008000;">new</span><span style="color: #008000;">&#91;</span><span style="color: #008000;">&#93;</span> <span style="color: #008000;">&#123;</span><span style="color: #666666;">&quot;0123456789&quot;</span>, <span style="color: #666666;">&quot;〇一二三四五六七八九&quot;</span>, <span style="color: #666666;">&quot;𝟘123456789&quot;</span><span style="color: #008000;">&#125;</span><span style="color: #008000;">&#41;</span>
<span style="color: #008000;">&#123;</span>
	Console<span style="color: #008000;">.</span><span style="color: #0000FF;">WriteLine</span><span style="color: #008000;">&#40;</span>s<span style="color: #008000;">.</span><span style="color: #0000FF;">Substring</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">0</span>,<span style="color: #FF0000;">2</span><span style="color: #008000;">&#41;</span> <span style="color: #008000;">+</span> <span style="color: #666666;">&quot;&lt;span style='border:2px #FF4500 solid'&gt;&quot;</span> <span style="color: #008000;">+</span> s<span style="color: #008000;">&#91;</span><span style="color: #FF0000;">2</span><span style="color: #008000;">&#93;</span> <span style="color: #008000;">+</span> <span style="color: #666666;">&quot;&lt;/span&gt;&quot;</span> <span style="color: #008000;">+</span> s<span style="color: #008000;">.</span><span style="color: #0000FF;">Substring</span><span style="color: #008000;">&#40;</span><span style="color: #FF0000;">3</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">&#41;</span><span style="color: #008000;">;</span>
<span style="color: #008000;">&#125;</span></pre></div></div>

<p>
01<span style='border:2px #FF4500 solid'>2</span>3456789<br />
〇一<span style='border:2px #FF4500 solid'>二</span>三四五六七八九<br />
𝟘<span style='border:2px #FF4500 solid'>1</span>23456789
</p>
<p>&#8230;and now the developer has to write their own linear-time indexing functions to actually find the character they need. This is not a good abstraction.</p>
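<p>For illustration, here is roughly what such a hand-written indexing function ends up looking like. This is only a sketch: <code>CodePointIndex.OffsetOf</code> is a hypothetical name, and it still counts code points rather than visible glyphs.</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;">using System;

public static class CodePointIndex
{
    // Hypothetical helper: returns the UTF-16 offset at which the n-th
    // code point of s (counting from zero) begins. Takes O(n) time,
    // because every earlier code point has to be inspected.
    public static int OffsetOf(string s, int n)
    {
        int offset = 0;
        for (int k = 0; k &lt; n; k++)
        {
            if (offset &gt;= s.Length)
                throw new ArgumentOutOfRangeException("n");
            offset += char.IsSurrogatePair(s, offset) ? 2 : 1;
        }
        return offset;
    }
}

public static class Program
{
    public static void Main()
    {
        foreach (var s in new[] { "0123456789", "〇一二三四五六七八九", "𝟘123456789" })
        {
            int start = CodePointIndex.OffsetOf(s, 2); // where the third code point begins
            int end = CodePointIndex.OffsetOf(s, 3);   // where the fourth begins
            Console.WriteLine(s.Substring(0, start)
                + "&lt;span style='border:2px #FF4500 solid'&gt;"
                + s.Substring(start, end - start)
                + "&lt;/span&gt;"
                + s.Substring(end));
        }
    }
}</pre></div></div>

<p>With this in place all three strings get the box around their third character, at the cost of walking the string from the start on every lookup.</p>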
<h2>From here</h2>
<p>Once you give up the idea that you can safely treat UTF-16 as a fixed-width format, it really doesn&#8217;t have anything going for it. (Perhaps this is why the idea is so firmly embedded &mdash; elements of <a href="http://en.wikipedia.org/wiki/Cognitive_dissonance">cognitive dissonance</a>.)</p>
<p>Here&#8217;s a quick (mainly biased) diagram:</p>
<p><a href="http://porg.es/blog/wp-content/uploads/2010/11/unicode.png"><img src="http://porg.es/blog/wp-content/uploads/2010/11/unicode.png" alt="" title="Comparison of UTFs" width="546" height="300" class="aligncenter size-full wp-image-460" /></a></p>
<div class="simple-footnotes"><p class="notes">Notes:</p><ol><li id="note-418-1">Note that even if you have a UTF-16 implementation with a &#8220;non-broken&#8221; <code>char</code>, the reversal given here isn&#8217;t valid. For one thing, it will put combining characters in the wrong place. The reversal of a string is also potentially locale-dependent: one can argue that the reversal of &#8220;Dijkstra&#8221; (Dutch) should be &#8220;artsijkD&#8221;. <a href="#return-note-418-1">&#8617;</a></li><li id="note-418-2">Again, this won&#8217;t work perfectly even with a &#8220;non-broken&#8221; <code>char</code> type. Combining characters need to be considered. This argues for a more sophisticated library in general, one that has an idea of a visible &#8216;stack&#8217; of codepoints (or some other basic glyph construct). A discussion for another time&#8230; <a href="#return-note-418-2">&#8617;</a></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://porg.es/blog/unicode-as-she-is-broke/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Symbols used to represent functions</title>
		<link>http://porg.es/blog/symbols-used-to-represent-functions</link>
		<comments>http://porg.es/blog/symbols-used-to-represent-functions#comments</comments>
		<pubDate>Mon, 22 Jun 2009 03:51:22 +0000</pubDate>
		<dc:creator>Porges</dc:creator>
				<category><![CDATA[utility]]></category>
		<category><![CDATA[9995]]></category>
		<category><![CDATA[iec]]></category>
		<category><![CDATA[iso]]></category>
		<category><![CDATA[keyboard]]></category>
		<category><![CDATA[Reference]]></category>
		<category><![CDATA[symbols]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://porg.es/blog/?p=356</guid>
		<description><![CDATA[I was looking for some standard symbols to represent the Control key and the Alt key, and couldn&#8217;t find one until I came across ISO/IEC 9995-7. Because I had much trouble finding a free copy of the document on the ’Net, I have made a table of the symbols and their functions below. I have [...]]]></description>
			<content:encoded><![CDATA[<p>I was looking for some standard symbols to represent the Control key and the Alt key, and couldn&#8217;t find any until I came across ISO/IEC 9995-7. <img src="http://porg.es/blog/wp-content/plugins/wp-smiley-switcher/noktahhitam/icon_smile.gif" alt="" /></p>
<p>Because I had a lot of trouble finding a free copy of the document on the ’Net, I have made a table of the symbols and their functions below. I have marked those not present in Unicode as �.</p>
<p>Some examples: copy is usually [⎈ + C] and close is usually [⎇ + F4].</p>
<table>
<thead>
<tr>
<th style="text-align:right;padding:3px">Symbol</th>
<th>Meaning/Summary</th>
</tr>
</thead>
<tbody>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇧</td>
<td>Select level 2 (AKA Shift)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇫</td>
<td>Lock level 2 (AKA Shift-Lock)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇬</td>
<td>Caps lock</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇭</td>
<td>Num lock</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇮</td>
<td>Select level 3</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇯</td>
<td>Lock level 3</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇨</td>
<td>Group select</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇰</td>
<td>Group lock</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">␣</td>
<td>Space</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⍽</td>
<td>No-break space</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎀</td>
<td>Insert</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎁</td>
<td>Underline (continuous)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎂</td>
<td>Underline (discontinuous)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎃</td>
<td>Emphasize</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎄</td>
<td>Compose characters</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎅</td>
<td>Center</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⌫</td>
<td>Delete backwards</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">␥</td>
<td>Delete</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎚</td>
<td>Clear screen</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇳</td>
<td>Scrolling (I assume this means Scroll lock)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Help</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎙</td>
<td>Print Screen</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⏎</td>
<td>Return</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎆</td>
<td>Enter</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎇</td>
<td>Alternate (Alt key)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎈</td>
<td>Control (Ctrl key)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎉</td>
<td>Pause</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎊</td>
<td>Break/Interrupt</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎋</td>
<td>Escape</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎌</td>
<td>Undo</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">↑</td>
<td>Cursor up</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">↓</td>
<td>Cursor down</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">←</td>
<td>Cursor left</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">→</td>
<td>Cursor right</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">↟</td>
<td>Fast cursor up</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">↡</td>
<td>Fast cursor down</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">↞</td>
<td>Fast cursor left</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">↠</td>
<td>Fast cursor right</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇱</td>
<td>Home (Beginning)</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇲</td>
<td>End</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎗</td>
<td>Previous page</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎘</td>
<td>Next page</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇤</td>
<td>Tab left</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⇥</td>
<td>Tab right</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Line up</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Line down</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Backspace</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Partial line up</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Partial line down</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Partial space left</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Partial space right</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Set margin left</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Set margin right</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Release margin left</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Release margin right</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">�</td>
<td>Release both margins</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">+</td>
<td>Addition</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">−</td>
<td>Subtraction</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">×</td>
<td>Multiplication</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">÷</td>
<td>Division</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">=</td>
<td>Equals</td>
</tr>
<tr>
<td style="font-size:2em;text-align:right;padding:3px">⎖</td>
<td>Decimal separator</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://porg.es/blog/symbols-used-to-represent-functions/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What can we fit in 140 characters?</title>
		<link>http://porg.es/blog/what-can-we-fit-in-140-characters</link>
		<comments>http://porg.es/blog/what-can-we-fit-in-140-characters#comments</comments>
		<pubDate>Wed, 27 May 2009 07:09:25 +0000</pubDate>
		<dc:creator>Porges</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[awk]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[silly]]></category>
		<category><![CDATA[stackoverflow]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://porg.es/blog/?p=327</guid>
		<description><![CDATA[This is in reference to the current ‘Twitter image encoding challenge’ running on StackOverflow. If we want to restrict ourselves to assigned, non-control, non-private Unicode characters, then by my reckoning that gives us 129,775 available characters. wget http://unicode.org/Public/UNIDATA/UnicodeData.txt awk -F ';' UnicodeData.txt -f countUnichars.awk &#124; bc countUnichars.awk source: BEGIN &#123; print &#34;ibase=16&#34; &#125; # set [...]]]></description>
			<content:encoded><![CDATA[<p>This is in reference to the current <a href="http://stackoverflow.com/questions/891643/twitter-image-encoding-challenge">‘Twitter image encoding challenge’ running on StackOverflow</a>.</p>
<p>If we want to restrict ourselves to assigned, non-control, non-private Unicode characters, then by my reckoning that gives us 129,775 available characters.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">wget</span> http:<span style="color: #000000; font-weight: bold;">//</span>unicode.org<span style="color: #000000; font-weight: bold;">/</span>Public<span style="color: #000000; font-weight: bold;">/</span>UNIDATA<span style="color: #000000; font-weight: bold;">/</span>UnicodeData.txt
<span style="color: #c20cb9; font-weight: bold;">awk</span> <span style="color: #660033;">-F</span> <span style="color: #ff0000;">';'</span> <span style="color: #660033;">-f</span> countUnichars.awk UnicodeData.txt <span style="color: #000000; font-weight: bold;">|</span> <span style="color: #c20cb9; font-weight: bold;">bc</span></pre></div></div>

<p><tt>countUnichars.awk</tt> source:</p>

<div class="wp_syntax"><div class="code"><pre class="awk" style="font-family:monospace;"><span style="color: #C20CB9; font-weight: bold;">BEGIN</span> <span style="color: #7a0874; font-weight: bold;">&#123;</span> <span style="color: #0BD507; font-weight: bold;">print</span> <span style="color: #ff0000;">&quot;ibase=16&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#125;</span> <span style="color:#808080;"># set bc to hex mode</span>
&nbsp;
<span style="color:#000088;">$2</span> <span style="color:#C4C364;">~</span> <span style="color:black;">/</span>Private<span style="color:black;">/</span> <span style="color: #7a0874; font-weight: bold;">&#123;</span> <span style="color:#808080;"># skip any lines with &quot;private&quot; in the description</span>
    <span style="color: #0BD507; font-weight: bold;">getline</span>;
<span style="color: #7a0874; font-weight: bold;">&#125;</span>
&nbsp;
n <span style="color: #7a0874; font-weight: bold;">&#123;</span> <span style="color:#808080;"># if n is set, then print the range for bc to calculate</span>
    <span style="color: #0BD507; font-weight: bold;">printf</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #ff0000;">&quot;(%s-%s+1)+&quot;</span>,<span style="color:#000088;">$1</span>,n<span style="color: #7a0874; font-weight: bold;">&#41;</span>;
    n=<span style="color: #ff0000;">&quot;&quot;</span>;
<span style="color: #7a0874; font-weight: bold;">&#125;</span>
&nbsp;
<span style="color:#000088;">$2</span> <span style="color:#C4C364;">~</span> <span style="color:black;">/</span>First<span style="color:black;">&gt;</span><span style="color:black;">/</span> <span style="color: #7a0874; font-weight: bold;">&#123;</span> <span style="color:#808080;"># set n if the start of a range</span>
    n=<span style="color:#000088;">$1</span>;
    <span style="color: #0BD507; font-weight: bold;">getline</span>;
<span style="color: #7a0874; font-weight: bold;">&#125;</span>
&nbsp;
<span style="color:#000088;">$3</span> <span style="color:#C4C364;">!~</span> <span style="color: #ff0000;">&quot;C.&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#123;</span> <span style="color:#808080;"># otherwise count anything that isn't some kind of a control character</span>
    i<span style="color:black;">++</span>;
<span style="color: #7a0874; font-weight: bold;">&#125;</span>
&nbsp;
<span style="color: #C20CB9; font-weight: bold;">END</span> <span style="color: #7a0874; font-weight: bold;">&#123;</span> <span style="color:#808080;"># print out the count of everything else</span>
    <span style="color: #0BD507; font-weight: bold;">printf</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #ff0000;">&quot;%X<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>,i<span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #7a0874; font-weight: bold;">&#125;</span></pre></div></div>

<p>This means we can store exactly 2377 bits (297 bytes) per message (this is <img src='/blog/wp-content/plugins/latexrender/pictures/3d148faa2b961edebbfea91810c1ab28.gif' title='\lfloor\log_2(129775) \times 140\rfloor' alt='\lfloor\log_2(129775) \times 140\rfloor' align=absmiddle>), so if we use a 16-colour palette we can store about 594 pixels (<img src='/blog/wp-content/plugins/latexrender/pictures/cba7624bee8dfdd5dae1931dda7f495d.gif' title='2377/\log_2(16)' alt='2377/\log_2(16)' align=absmiddle>), which can <em>almost</em> reproduce the <i>Mona Lisa</i> thumbnail in the contest page.</p>
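<p>The arithmetic can be checked with a few lines of C#. This is just a sketch: <code>Capacity.Bits</code> is a hypothetical helper, and the alphabet size is taken from the count above rather than recomputed.</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;">using System;

public static class Capacity
{
    // Hypothetical helper: whole bits that fit into `length` symbols
    // drawn from an alphabet of `alphabetSize` distinct characters.
    public static int Bits(int alphabetSize, int length)
    {
        // Each symbol carries log2(alphabetSize) bits.
        return (int)Math.Floor(Math.Log(alphabetSize, 2) * length);
    }
}

public static class Program
{
    public static void Main()
    {
        int bits = Capacity.Bits(129775, 140);
        int bytes = bits / 8;  // whole bytes
        int pixels = bits / 4; // log2(16) = 4 bits per pixel in a 16-colour palette

        Console.WriteLine("{0} bits, {1} bytes, {2} pixels", bits, bytes, pixels);
        // prints "2377 bits, 297 bytes, 594 pixels"
    }
}</pre></div></div>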
]]></content:encoded>
			<wfw:commentRss>http://porg.es/blog/what-can-we-fit-in-140-characters/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode breaks Google search</title>
		<link>http://porg.es/blog/unicode-breaks-google-search</link>
		<comments>http://porg.es/blog/unicode-breaks-google-search#comments</comments>
		<pubDate>Mon, 26 Jun 2006 02:08:19 +0000</pubDate>
		<dc:creator>Porges</dc:creator>
				<category><![CDATA[Broken]]></category>
		<category><![CDATA[Critique]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://porg.es/blog/unicode-breaks-google-search</guid>
		<description><![CDATA[A search for the phrase &#34;It’s like a light of a new day,&#34; breaks in more than one way. Not only does Google search fail to recognize that &#8220;it&#8217;s&#8221; is a word, it also ignores the quote marks, searching for the phrase as individual words.]]></description>
			<content:encoded><![CDATA[<p>A <a href="http://www.google.com/search?q=%22It%E2%80%99s+like+a+light+of+a+new+day%2C%22&#038;start=0&#038;ie=utf-8&#038;oe=utf-8&#038;client=firefox&#038;rls=org.mozilla:en-US:unofficial">search for the phrase &quot;It’s like a light of a new day,&quot;</a> breaks in more than one way.</p>
<p>Not only does Google search fail to recognize that &#8220;it&#8217;s&#8221; is a word, it also ignores the quote marks, searching for the phrase as individual words.</p>
]]></content:encoded>
			<wfw:commentRss>http://porg.es/blog/unicode-breaks-google-search/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Baleegal!</title>
		<link>http://porg.es/blog/baleegal</link>
		<comments>http://porg.es/blog/baleegal#comments</comments>
		<pubDate>Sat, 24 Jun 2006 11:37:45 +0000</pubDate>
		<dc:creator>Porges</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Reference]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://porg.es/blog/baleegal</guid>
		<description><![CDATA[XML 1.0 allows you to insert characters from the C1 control code range, whilst those from the C0 range are outright forbidden. XML 1.1 allows you to insert characters from the C0 range as long as they are escaped as character entity references, and mandates that you do the same for those from the C1 [...]]]></description>
			<content:encoded><![CDATA[<p>XML 1.0 allows you to insert characters from the C1 control code range directly, whilst those from the C0 range (apart from tab, line feed and carriage return) are outright forbidden.</p>
<p>XML 1.1 allows you to insert characters from the C0 range (U+0000 is still excluded) <em>as long as they are escaped as numeric character references</em>, and mandates that you do the same for those from the C1 range (apart from NEL, U+0085).</p>
<p>This little fact is the reason why not every well-formed XML 1.0 document is well-formed under XML 1.1. Nasty.</p>
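<p>A rough sketch of the difference in code. The ranges below are my transcription of the <code>Char</code> production of XML 1.0 and the <code>Char</code> and <code>RestrictedChar</code> productions of XML 1.1, so check them against the specifications before relying on them; the class and method names are made up for this example.</p>

<div class="wp_syntax"><div class="code"><pre class="csharp" style="font-family:monospace;">using System;

public static class XmlChars
{
    static bool InRange(int cp, int lo, int hi)
    {
        return lo &lt;= cp &amp;&amp; cp &lt;= hi;
    }

    // XML 1.0 Char: may appear in a document, literally or as a reference.
    public static bool IsXml10Char(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || InRange(cp, 0x20, 0xD7FF)
            || InRange(cp, 0xE000, 0xFFFD)
            || InRange(cp, 0x10000, 0x10FFFF);
    }

    // XML 1.1 Char: everything except U+0000, surrogates, U+FFFE and U+FFFF.
    public static bool IsXml11Char(int cp)
    {
        return InRange(cp, 0x1, 0xD7FF)
            || InRange(cp, 0xE000, 0xFFFD)
            || InRange(cp, 0x10000, 0x10FFFF);
    }

    // XML 1.1 RestrictedChar: legal only when written as a character reference.
    public static bool IsXml11Restricted(int cp)
    {
        return InRange(cp, 0x1, 0x8) || InRange(cp, 0xB, 0xC)
            || InRange(cp, 0xE, 0x1F)
            || InRange(cp, 0x7F, 0x84) || InRange(cp, 0x86, 0x9F);
    }

    // True if the code point may appear unescaped in an XML 1.1 document.
    public static bool IsXml11Literal(int cp)
    {
        return IsXml11Char(cp) &amp;&amp; !IsXml11Restricted(cp);
    }

    public static void Main()
    {
        int c1 = 0x80; // a C1 control character
        Console.WriteLine(IsXml10Char(c1));    // True: fine as-is in XML 1.0
        Console.WriteLine(IsXml11Literal(c1)); // False: XML 1.1 wants &amp;#x80; instead
    }
}</pre></div></div>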
]]></content:encoded>
			<wfw:commentRss>http://porg.es/blog/baleegal/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

