<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>porges &#187; utf8</title>
	<atom:link href="http://porg.es/blog/tag/utf8/feed" rel="self" type="application/rss+xml" />
	<link>http://porg.es/blog</link>
	<description></description>
	<lastBuildDate>Thu, 12 Jan 2012 23:45:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Counting Characters in UTF-8 Strings Is Fast(er)</title>
		<link>http://porg.es/blog/counting-characters-in-utf-8-strings-is-faster</link>
		<comments>http://porg.es/blog/counting-characters-in-utf-8-strings-is-faster#comments</comments>
		<pubDate>Wed, 04 Jun 2008 05:34:57 +0000</pubDate>
		<dc:creator>Porges</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[speed]]></category>
		<category><![CDATA[strings]]></category>
		<category><![CDATA[strlen]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://porg.es/blog/?p=130</guid>
		<description><![CDATA[‘Counting Characters in UTF-8 Strings Is Fast’ by Kragen Sitaker shows several ways to count characters in UTF-8 strings, using both assembly and C. But, with a few assumptions, we can go faster. Assumption One: We are dealing with a valid UTF-8 string Making this assumption means that once we hit the start of a multi-byte character [...]]]></description>
			<content:encoded><![CDATA[<p>‘<a href="http://canonical.org/~kragen/strlen-utf8.html">Counting Characters in UTF-8 Strings Is Fast</a>’ by Kragen Sitaker shows several ways to count characters in UTF-8 strings, using both assembly and C. But, with a few assumptions, we can go faster.</p>
<h3>Assumption One: We are dealing with a valid UTF-8 string</h3>
<p>Making this assumption means that once we hit the start of a multi-byte character we can skip straight past its follower bytes without examining them. It also means we don&#8217;t check for invalid bytes (<s>this sends the algorithm into an infinite loop if run on non-valid input</s> malformed data can make the algorithm run past the end of the buffer).</p>
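<p>A minimal sketch of the skipping idea, with an explicit length bound added so that malformed input cannot push a read past the buffer. The names <code>utf8_seq_len</code> and <code>utf8_count_bounded</code> are made up for illustration; this is not the routine that was benchmarked.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many bytes to skip for a sequence that starts
   with `lead`. Assumes valid UTF-8, so only the top four bits are checked. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                 /* 0xxxxxxx: ASCII */
    switch (lead & 0xF0) {
        case 0xE0: return 3;      /* 1110xxxx */
        case 0xF0: return 4;      /* 11110xxx */
        default:   return 2;      /* 110xxxxx (or an invalid byte) */
    }
}

/* Hypothetical bounded variant: counts characters in the first `len` bytes.
   The index may overshoot `len`, but no byte at or beyond `len` is read. */
size_t utf8_count_bounded(const char *s, size_t len)
{
    assert(s != NULL);
    size_t i = 0, count = 0;
    while (i < len) {
        i += utf8_seq_len((unsigned char)s[i]);
        ++count;
    }
    return count;
}
```

<p>The bounds check costs one extra comparison per character, so it trades a little of the speed for safety on untrusted input.</p>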
<h3>Assumption Two: Most strings are ASCII</h3>
<p>Therefore, run a simple ASCII count routine beforehand. As soon as we hit a non-ASCII character, switch into counting UTF-8.</p>
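<p>The ASCII pre-pass on its own might look something like this. It is only a sketch: <code>ascii_prefix_len</code> is a hypothetical name, and it tests the high bit with a mask instead of relying on the sign of <code>char</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the leading run of ASCII bytes.
   Stops at the NUL terminator or at the first byte with the high bit set. */
size_t ascii_prefix_len(const char *s)
{
    assert(s != NULL);
    size_t i = 0;
    /* 0x01..0x7F are ASCII, 0x00 ends the string, 0x80 and up are not ASCII. */
    while (s[i] != '\0' && ((unsigned char)s[i] & 0x80) == 0)
        i++;
    return i;
}
```

<p>For a pure ASCII string the return value is already the character count, so nothing else needs to run.</p>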
<h3>The code</h3>
<p>Note: the code below relies on <code>char</code> being a signed 8-bit type, so that any byte with the high bit set compares as negative.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">int</span> porges_strlen2<span style="color: #009900;">&#40;</span><span style="color: #993333;">char</span> <span style="color: #339933;">*</span>s<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
        <span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//Go fast if string is only ASCII.</span>
        <span style="color: #666666; font-style: italic;">//Loop while not at end of string,</span>
        <span style="color: #666666; font-style: italic;">// and not reading anything with highest bit set.</span>
        <span style="color: #666666; font-style: italic;">//If highest bit is set, number is negative.</span>
        <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
                i<span style="color: #339933;">++;</span>
&nbsp;
        <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&lt;=</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">65</span><span style="color: #009900;">&#41;</span> <span style="color: #666666; font-style: italic;">// all follower bytes (0x80-0xBF) have signed values of -65 or less</span>
                <span style="color: #b1b100;">return</span> <span style="color: #339933;">-</span><span style="color: #0000dd;">1</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">// invalid</span>
&nbsp;
        <span style="color: #666666; font-style: italic;">//Note, however, that the following code does *not*</span>
        <span style="color: #666666; font-style: italic;">// check for invalid characters.</span>
        <span style="color: #666666; font-style: italic;">//The above is just included to bail out on the tests :)</span>
&nbsp;
        <span style="color: #993333;">int</span> count <span style="color: #339933;">=</span> i<span style="color: #339933;">;</span>
        <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
                <span style="color: #666666; font-style: italic;">//if ASCII just go to next character</span>
                <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>      i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">;</span>
                <span style="color: #b1b100;">else</span>
                <span style="color: #666666; font-style: italic;">//select amongst multi-byte starters</span>
                <span style="color: #b1b100;">switch</span> <span style="color: #009900;">&#40;</span><span style="color: #208080;">0xF0</span> <span style="color: #339933;">&amp;</span> s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
                <span style="color: #009900;">&#123;</span>
                        <span style="color: #b1b100;">case</span> <span style="color: #208080;">0xE0</span><span style="color: #339933;">:</span> i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
                        <span style="color: #b1b100;">case</span> <span style="color: #208080;">0xF0</span><span style="color: #339933;">:</span> i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">4</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
                        <span style="color: #b1b100;">default</span><span style="color: #339933;">:</span>   i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">2</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
                <span style="color: #009900;">&#125;</span>
                <span style="color: #339933;">++</span>count<span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #b1b100;">return</span> count<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
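<p>The magic number -65 in the code above comes from the bit pattern of follower bytes: they are 10xxxxxx, i.e. 0x80 to 0xBF, which is -128 to -65 when read as a signed 8-bit value. A sketch of the same tests written against the bit patterns, so they do not depend on whether <code>char</code> is signed (all three function names are hypothetical):</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Follower (continuation) bytes are 10xxxxxx, i.e. 0x80..0xBF. */
bool is_follower(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* ASCII bytes are 0xxxxxxx, i.e. 0x00..0x7F (this includes NUL). */
bool is_ascii(char c)
{
    return ((unsigned char)c & 0x80) == 0;
}

/* True when the string begins with a follower byte,
   which no valid UTF-8 string does. */
bool starts_with_follower(const char *s)
{
    assert(s != NULL);
    return is_follower(s[0]);
}
```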

<h3>Results</h3>
<p>I used Kragen’s testing code, but removed all <code>strlen</code>s that didn’t do UTF-8 counting, and added one test for valid UTF-8 text (just the phrase ‘こんにちは’ repeated). The result is roughly twice as fast as the next-best routine on both the ASCII-only and the UTF-8 tests. The improvement on ASCII comes from the ASCII-only loop, and the improvement on UTF-8 comes from skipping follower bytes.</p>
<pre><code>"": 0 0 0 0 0
"hello, world": 12 12 12 12 12
"naïve": 5 5 5 5 5
"こんにちは": 5 5 5 5 5
1: all 'a':
1:           porges_strlen2(string) =   33554431: 0.034672
1:         ap_strlen_utf8_s(string) =   33554431: 0.068210
1:         my_strlen_utf8_c(string) =   33554431: 0.071038
1:         my_strlen_utf8_s(string) =   33554431: 0.135856
2: all '\xe3':
2:           porges_strlen2(string) =   11184811: 0.032115
2:         ap_strlen_utf8_s(string) =   33554431: 0.068228
2:         my_strlen_utf8_c(string) =   33554431: 0.071050
2:         my_strlen_utf8_s(string) =   33554431: 0.152513
3: all '\x81':
3:           porges_strlen2(string) =         -1: 0.000001
3:         my_strlen_utf8_s(string) =          0: 0.068339
3:         ap_strlen_utf8_s(string) =          0: 0.068547
3:         my_strlen_utf8_c(string) =          0: 0.071039
4: all konichiwa:
4:           porges_strlen2(string) =   11184810: 0.032143
4:         ap_strlen_utf8_s(string) =   11184810: 0.068271
4:         my_strlen_utf8_c(string) =   11184810: 0.071036
4:         my_strlen_utf8_s(string) =   11184810: 0.089478
</code></pre>
<p>Note also that the invalid UTF-8 inputs (tests 2 and 3) give strange results; this is because the algorithm isn’t meant to work on them! (The first invalid input is all 3-byte starters, so the count is the byte count divided by 3 due to skipping, and the second is all follower bytes, so the code bails out.)</p>
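<p>To make the arithmetic for test 2 concrete: every byte looks like a 3-byte starter, so each loop iteration consumes three bytes and the count is the byte count divided by 3, rounded up. With 33554431 bytes that gives 11184811, matching the output above. Since 33554431 = 3 × 11184810 + 1, the final skip also jumps over the terminator, which is exactly the overrun mentioned under Assumption One. A small sketch of the calculation (<code>steps_when_skipping</code> is a hypothetical helper, not part of the test code):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of loop iterations (and so the value returned)
   when every one of the n bytes looks like a `width`-byte starter. */
size_t steps_when_skipping(size_t n, size_t width)
{
    assert(width > 0);
    return (n + width - 1) / width;   /* ceiling of n / width */
}
```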
<h3>Going faster</h3>
<p>By dropping back to the ASCII counter whenever we hit ASCII again, we go even faster. This helps with text (such as English) that is mostly ASCII with only a few multibyte characters.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">int</span> porges_strlen2<span style="color: #009900;">&#40;</span><span style="color: #993333;">char</span> <span style="color: #339933;">*</span>s<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
        <span style="color: #993333;">int</span> i <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
        <span style="color: #993333;">int</span> iBefore <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
        <span style="color: #993333;">int</span> count <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">;</span>
&nbsp;
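        <span style="color: #666666; font-style: italic;">//ASCII fast path. The label inside the loop lets the</span>
        <span style="color: #666666; font-style: italic;">// UTF-8 loop below jump back in when it sees ASCII again.</span>
        <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>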
                ascii<span style="color: #339933;">:</span>  i<span style="color: #339933;">++;</span>
&nbsp;
        count <span style="color: #339933;">+=</span> i<span style="color: #339933;">-</span>iBefore<span style="color: #339933;">;</span>
        <span style="color: #b1b100;">while</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
                <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&gt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
                <span style="color: #009900;">&#123;</span>
                        iBefore <span style="color: #339933;">=</span> i<span style="color: #339933;">;</span>
                        <span style="color: #b1b100;">goto</span> ascii<span style="color: #339933;">;</span>
                <span style="color: #009900;">&#125;</span>
                <span style="color: #b1b100;">else</span>
                <span style="color: #b1b100;">switch</span> <span style="color: #009900;">&#40;</span><span style="color: #208080;">0xF0</span> <span style="color: #339933;">&amp;</span> s<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
                <span style="color: #009900;">&#123;</span>
                        <span style="color: #b1b100;">case</span> <span style="color: #208080;">0xE0</span><span style="color: #339933;">:</span> i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">3</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
                        <span style="color: #b1b100;">case</span> <span style="color: #208080;">0xF0</span><span style="color: #339933;">:</span> i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">4</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
                        <span style="color: #b1b100;">default</span><span style="color: #339933;">:</span>   i <span style="color: #339933;">+=</span> <span style="color: #0000dd;">2</span><span style="color: #339933;">;</span> <span style="color: #000000; font-weight: bold;">break</span><span style="color: #339933;">;</span>
                <span style="color: #009900;">&#125;</span>
                <span style="color: #339933;">++</span>count<span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
        <span style="color: #b1b100;">return</span> count<span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
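<p>The same control flow can be written without the <code>goto</code> into the loop body. This is a sketch only, under the same assumption of valid, NUL-terminated UTF-8; <code>utf8_strlen_structured</code> is a hypothetical name, it has not been benchmarked, and it uses <code>unsigned char</code> and <code>size_t</code> where the code above uses signed <code>char</code> and <code>int</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical goto-free restructuring of the drop-back-to-ASCII idea.
   Assumes valid, NUL-terminated UTF-8; malformed input can still overrun. */
size_t utf8_strlen_structured(const char *str)
{
    assert(str != NULL);
    const unsigned char *s = (const unsigned char *)str;
    size_t i = 0, count = 0;

    for (;;) {
        /* Fast path: count a whole run of ASCII bytes with one subtraction. */
        size_t run_start = i;
        while (s[i] != 0 && s[i] < 0x80)
            i++;
        count += i - run_start;

        if (s[i] == 0)
            return count;

        /* Slow path: step over multi-byte characters until ASCII or the end. */
        while (s[i] >= 0x80) {
            switch (s[i] & 0xF0) {
                case 0xE0: i += 3; break;
                case 0xF0: i += 4; break;
                default:   i += 2; break;
            }
            ++count;
        }
    }
}
```

<p>Whether a compiler generates code as tight as the <code>goto</code> version is something I would measure rather than assume.</p>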

<p>Oddly, the ‘konichiwa’ test also gets faster even though it is pure multibyte text, and I’m not sure exactly why&#8230; probably something to do with branch prediction or another arcane CPU topic I don’t understand. :)</p>
<pre><code>4: all konichiwa:
4:           porges_strlen2(string) =   11184810: 0.026017
4:         ap_strlen_utf8_s(string) =   11184810: 0.068320
4:         my_strlen_utf8_c(string) =   11184810: 0.071035
4:         my_strlen_utf8_s(string) =   11184810: 0.089464
5: mixed english:
5:           porges_strlen2(string) =   32435949: 0.040342
5:         my_strlen_utf8_c(string) =   32435949: 0.071035
5:         ap_strlen_utf8_s(string) =   32435949: 0.078233
5:         my_strlen_utf8_s(string) =   32435949: 0.160676</code></pre>
<p>Without the drop-back-to-ASCII modification:</p>
<pre><code>5: mixed english:
5:           porges_strlen2(string) =   32435949: 0.067753</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://porg.es/blog/counting-characters-in-utf-8-strings-is-faster/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

