Ridiculous UTF-8 character counting

code 6 June 2008 | 10 Comments

So, Colin Percival has posted a UTF-8 strlen which improves on my previous post. While his code runs slightly slower than mine on my PC, I assume that’s because his code is aimed at a 64-bit architecture. With 32-bits (reading 4 bytes at a time, instead of 8 ) it doesn’t quite get the same speed up.

That said, the vectorization code is clearly an improvement on mine, so let’s take that ball and run with it!

The Code

Now we use SIMD instructions to vectorize the counting of characters. I modified this from Colin’s routine, and I’m sure he has some bit-fiddling up his sleeves that would make this run even faster

As it is, I used a straightforward algorithm to extract the information.

#define GetMask(x) __builtin_ia32_pmovmskb128(x)
#define LoadBytes(x) __builtin_ia32_loaddqu(x)
#define CompareEquality(x,y) __builtin_ia32_pcmpeqb128((x),(y))
#define Or(x,y) __builtin_ia32_por128((x),(y))
#define NotExpected(x) __builtin_expect((x),0)
#define And(x,y) __builtin_ia32_pand128((x),(y))
 
typedef unsigned char v16qi __attribute__ ((vector_size(16)));
 
const char mask[16] = {
    0xc0, 0xc0, 0xc0, 0xc0,
    0xc0, 0xc0, 0xc0, 0xc0,
    0xc0, 0xc0, 0xc0, 0xc0,
    0xc0, 0xc0, 0xc0, 0xc0
};
const char match[16] = {
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80
};
const char zero[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
unsigned char HammingWeight[65536]; //initialized elsewhere
 
static size_t cp_strlen_utf8_sse2(const char *_s)
{
    const char *s;
    const v16qi allZero = LoadBytes(zero);
    const v16qi masking = LoadBytes(mask);
    const v16qi matching = LoadBytes(match);
 
    v16qi row;
    size_t count = 0;
    unsigned char b;
 
    // unaligned bytes
    for (s = _s; (uintptr_t) (s) & (sizeof(v16qi) - 1); s++) {
        b = *s;
        if (b == '\0')
            goto done;
        count += (b >> 7) & ((~b) >> 6);
    }
 
    /* Handle complete blocks. */
    for (;; s += sizeof(v16qi)) {
        /* Prefetch */
        __builtin_prefetch(&s[256], 0, 0);
 
        /* Load Bytes */
        row = LoadBytes(s);
 
        /* Expect this to be false :) */
        if (NotExpected(GetMask(
                                   /* Check for zero bytes */
                                      CompareEquality(allZero, row))))
            break;
 
        /* Count number of non-starter bytes */
 
        row = And(row, masking);
        row = CompareEquality(row, matching);
        count += HammingWeight[GetMask(row)];
    }
 
    //leftover bytes
    for (;; s++) {
        b = *s;
        if (b == '\0')
            break;
        count += (b >> 7) & ((~b) >> 6);
    }
 
  done:
    return ((s - _s) - count);
}

Results

This counts about twice as fast as GCC/libc’s standard, non-UTF-8 strlen. Note the discrepancies between my timings of Colin’s code and his own tests. Damn thou, 32-bits!

"": 0 0 0 0 0 0 0
"hello, world": 12 12 12 12 12 12 12
"naïve": 6 6 6 5 5 5 5
"こんにちは": 15 15 15 5 5 5 5
"abcdefghijklmnopqrstuvwxyzβ": 28 28 28 27 27 27 27
testing 33554424 bytes of repeated "hello, world":
                      gcc_strlen =   33554424: 0.019331 +/- 0.001076
                      kjs_strlen =   33554424: 0.035095 +/- 0.000530
                       cp_strlen =   33554424: 0.021472 +/- 0.000310
                 kjs_strlen_utf8 =   33554424: 0.070260 +/- 0.000240
                  gp_strlen_utf8 =   33554424: 0.035144 +/- 0.000471
                  cp_strlen_utf8 =   33554424: 0.050539 +/- 0.000342
             cp_strlen_utf8_sse2 =   33554424: 0.010297 +/- 0.001551
testing 33554430 bytes of repeated "naïve":
                      gcc_strlen =   33554430: 0.019176 +/- 0.000824
                      kjs_strlen =   33554430: 0.035090 +/- 0.000478
                       cp_strlen =   33554430: 0.021472 +/- 0.000323
                 kjs_strlen_utf8 =   27962025: 0.070347 +/- 0.000354
                  gp_strlen_utf8 =   27962025: 0.054802 +/- 0.000299
                  cp_strlen_utf8 =   27962025: 0.050595 +/- 0.000602
             cp_strlen_utf8_sse2 =   27962025: 0.010011 +/- 0.001453
testing 33554430 bytes of repeated "こんにちは":
                      gcc_strlen =   33554430: 0.019331 +/- 0.000836
                      kjs_strlen =   33554430: 0.035225 +/- 0.000411
                       cp_strlen =   33554430: 0.021429 +/- 0.000309
                 kjs_strlen_utf8 =   11184810: 0.070249 +/- 0.000312
                  gp_strlen_utf8 =   11184810: 0.026545 +/- 0.000621
                  cp_strlen_utf8 =   11184810: 0.050512 +/- 0.000273
             cp_strlen_utf8_sse2 =   11184810: 0.010246 +/- 0.001466
testing 33554416 bytes of repeated "abcdefghijklmnopqrstuvwxyzβ":
                      gcc_strlen =   33554416: 0.019308 +/- 0.001091
                      kjs_strlen =   33554416: 0.035070 +/- 0.000486
                       cp_strlen =   33554416: 0.021441 +/- 0.000289
                 kjs_strlen_utf8 =   32356044: 0.070287 +/- 0.000297
                  gp_strlen_utf8 =   32356044: 0.043681 +/- 0.000429
                  cp_strlen_utf8 =   32356044: 0.050402 +/- 0.000204
             cp_strlen_utf8_sse2 =   32356044: 0.010407 +/- 0.001371

Tagged in , , , , , , , , ,

10 Responses on “Ridiculous UTF-8 character counting”

  1. On my Opteron, I get your SSE2 code taking 18.7ms (compared to 14.1ms for my non-sse2 version). Clearly your CPU, whatever it is, has faster SSE2 support than my Opteron.

    I’m going to see if I can coax some more performance out of the SSE2 code, though…

  2. Vineet Kumar says:

    That oughta be “damn thee”. =)

    (Also, the required name and email fields on this comment form are hard to decipher. Icons too small and similar, with not enough meaning.)

  3. Shaneal Manek says:

    Isn’t this just mental masturbation at some point?

    But boy it is fun

    I really need to learn x86 assembler – I only know a bit of MIPs. The instruction set seems huge …

  4. Stephen says:

    I see your code has way more variance than the others, probably due to the hamming lookup table.

    You’d be better off accumulating the counters and doing a horizontal add at the end.

    e.g. put a nested loop into the block loop which executes at most 255 times, then accumulate the bytes to the sum and continue.

  5. Matthew Hartman says:

    Keep masturbating guys (sorry, I couldn’t resist). I love reading this kind of constructive banter.

  6. mark++ says:

    Core2 and AMD’s Quad Core parts have 128-bit wide SSE execution units. Prior architectures from both companies, including classic Opteron, used double-pumped 64-bit execution units. This will likely make a difference.

  7. Inside the loop, you should load with movdqa instead of movdqu, since you know the addresses will be aligned.

  8. ast_tree says:

    Are you running an Intel© viral ad? What’s next, multithread/multiprocess/MPI Cluster version?

    Maybe if you try the profile mode of icc you can speed up vectorization or lower the cache-miss (if any remains).

  9. Wade says:

    Shaneal,

    It is, and it isnt masturabation. Somewhere, someone will need this code because they need to make their strlens _really_ fast.

  10. Porges says:

    Joshua,

    Unfortunately, AFAICT, there is no way to do this from the C code. If I edit the assembly manually I lose about 0.001 s

Leave a Reply