Ridiculous UTF-8 character counting

So, Colin Percival has posted a UTF-8 strlen which improves on my previous post. While his code runs slightly slower than mine on my PC, I assume that’s because his code is aimed at a 64-bit architecture. With 32-bits (reading 4 bytes at a time, instead of 8 ) it doesn’t quite get the same speed up.

That said, the vectorization code is clearly an improvement on mine, so let’s take that ball and run with it!

The Code

Now we use SIMD instructions to vectorize the counting of characters. I modified this from Colin’s routine, and I’m sure he has some bit-fiddling up his sleeves that would make this run even faster

As it is, I used a straightforward algorithm to extract the information.

#define GetMask(x) __builtin_ia32_pmovmskb128(x)
#define LoadBytes(x) __builtin_ia32_loaddqu(x)
#define CompareEquality(x,y) __builtin_ia32_pcmpeqb128((x),(y))
#define Or(x,y) __builtin_ia32_por128((x),(y))
#define NotExpected(x) __builtin_expect((x),0)
#define And(x,y) __builtin_ia32_pand128((x),(y))
 
typedef unsigned char v16qi __attribute__ ((vector_size(16)));
 
const char mask[16] = {
    0xc0, 0xc0, 0xc0, 0xc0,
    0xc0, 0xc0, 0xc0, 0xc0,
    0xc0, 0xc0, 0xc0, 0xc0,
    0xc0, 0xc0, 0xc0, 0xc0
};
const char match[16] = {
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80
};
const char zero[16] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
unsigned char HammingWeight[65536]; //initialized elsewhere
 
static size_t cp_strlen_utf8_sse2(const char *_s)
{
    const char *s;
    const v16qi allZero = LoadBytes(zero);
    const v16qi masking = LoadBytes(mask);
    const v16qi matching = LoadBytes(match);
 
    v16qi row;
    size_t count = 0;
    unsigned char b;
 
    // unaligned bytes
    for (s = _s; (uintptr_t) (s) & (sizeof(v16qi) - 1); s++) {
        b = *s;
        if (b == '\0')
            goto done;
        count += (b >> 7) & ((~b) >> 6);
    }
 
    /* Handle complete blocks. */
    for (;; s += sizeof(v16qi)) {
        /* Prefetch */
        __builtin_prefetch(&s[256], 0, 0);
 
        /* Load Bytes */
        row = LoadBytes(s);
 
        /* Expect this to be false :) */
        if (NotExpected(GetMask(
                                   /* Check for zero bytes */
                                      CompareEquality(allZero, row))))
            break;
 
        /* Count number of non-starter bytes */
 
        row = And(row, masking);
        row = CompareEquality(row, matching);
        count += HammingWeight[GetMask(row)];
    }
 
    //leftover bytes
    for (;; s++) {
        b = *s;
        if (b == '\0')
            break;
        count += (b >> 7) & ((~b) >> 6);
    }
 
  done:
    return ((s - _s) - count);
}

Results

This counts about twice as fast as GCC/libc’s standard, non-UTF-8 strlen. Note the discrepancies between my timings of Colin’s code and his own tests. Damn thou, 32-bits!

"": 0 0 0 0 0 0 0
"hello, world": 12 12 12 12 12 12 12
"naïve": 6 6 6 5 5 5 5
"こんにちは": 15 15 15 5 5 5 5
"abcdefghijklmnopqrstuvwxyzβ": 28 28 28 27 27 27 27
testing 33554424 bytes of repeated "hello, world":
                      gcc_strlen =   33554424: 0.019331 +/- 0.001076
                      kjs_strlen =   33554424: 0.035095 +/- 0.000530
                       cp_strlen =   33554424: 0.021472 +/- 0.000310
                 kjs_strlen_utf8 =   33554424: 0.070260 +/- 0.000240
                  gp_strlen_utf8 =   33554424: 0.035144 +/- 0.000471
                  cp_strlen_utf8 =   33554424: 0.050539 +/- 0.000342
             cp_strlen_utf8_sse2 =   33554424: 0.010297 +/- 0.001551
testing 33554430 bytes of repeated "naïve":
                      gcc_strlen =   33554430: 0.019176 +/- 0.000824
                      kjs_strlen =   33554430: 0.035090 +/- 0.000478
                       cp_strlen =   33554430: 0.021472 +/- 0.000323
                 kjs_strlen_utf8 =   27962025: 0.070347 +/- 0.000354
                  gp_strlen_utf8 =   27962025: 0.054802 +/- 0.000299
                  cp_strlen_utf8 =   27962025: 0.050595 +/- 0.000602
             cp_strlen_utf8_sse2 =   27962025: 0.010011 +/- 0.001453
testing 33554430 bytes of repeated "こんにちは":
                      gcc_strlen =   33554430: 0.019331 +/- 0.000836
                      kjs_strlen =   33554430: 0.035225 +/- 0.000411
                       cp_strlen =   33554430: 0.021429 +/- 0.000309
                 kjs_strlen_utf8 =   11184810: 0.070249 +/- 0.000312
                  gp_strlen_utf8 =   11184810: 0.026545 +/- 0.000621
                  cp_strlen_utf8 =   11184810: 0.050512 +/- 0.000273
             cp_strlen_utf8_sse2 =   11184810: 0.010246 +/- 0.001466
testing 33554416 bytes of repeated "abcdefghijklmnopqrstuvwxyzβ":
                      gcc_strlen =   33554416: 0.019308 +/- 0.001091
                      kjs_strlen =   33554416: 0.035070 +/- 0.000486
                       cp_strlen =   33554416: 0.021441 +/- 0.000289
                 kjs_strlen_utf8 =   32356044: 0.070287 +/- 0.000297
                  gp_strlen_utf8 =   32356044: 0.043681 +/- 0.000429
                  cp_strlen_utf8 =   32356044: 0.050402 +/- 0.000204
             cp_strlen_utf8_sse2 =   32356044: 0.010407 +/- 0.001371

Comments 10

  1. Colin Percival wrote:

    On my Opteron, I get your SSE2 code taking 18.7ms (compared to 14.1ms for my non-sse2 version). Clearly your CPU, whatever it is, has faster SSE2 support than my Opteron.

    I’m going to see if I can coax some more performance out of the SSE2 code, though…

    Posted 06 Jun 2008 at 3:29 am
  2. Vineet Kumar wrote:

    That oughta be “damn thee”. =)

    (Also, the required name and email fields on this comment form are hard to decipher. Icons too small and similar, with not enough meaning.)

    Posted 06 Jun 2008 at 3:37 am
  3. Shaneal Manek wrote:

    Isn’t this just mental masturbation at some point?

    But boy it is fun

    I really need to learn x86 assembler – I only know a bit of MIPs. The instruction set seems huge …

    Posted 06 Jun 2008 at 3:38 am
  4. Stephen wrote:

    I see your code has way more variance than the others, probably due to the hamming lookup table.

    You’d be better off accumulating the counters and doing a horizontal add at the end.

    e.g. put a nested loop into the block loop which executes at most 255 times, then accumulate the bytes to the sum and continue.

    Posted 06 Jun 2008 at 4:49 am
  5. Matthew Hartman wrote:

    Keep masturbating guys (sorry, I couldn’t resist). I love reading this kind of constructive banter.

    Posted 06 Jun 2008 at 4:59 am
  6. mark++ wrote:

    Core2 and AMD’s Quad Core parts have 128-bit wide SSE execution units. Prior architectures from both companies, including classic Opteron, used double-pumped 64-bit execution units. This will likely make a difference.

    Posted 06 Jun 2008 at 6:05 am
  7. Joshua Haberman wrote:

    Inside the loop, you should load with movdqa instead of movdqu, since you know the addresses will be aligned.

    Posted 06 Jun 2008 at 6:30 am
  8. ast_tree wrote:

    Are you running an Intel© viral ad? What’s next, multithread/multiprocess/MPI Cluster version?

    Maybe if you try the profile mode of icc you can speed up vectorization or lower the cache-miss (if any remains).

    Posted 06 Jun 2008 at 8:08 am
  9. Wade wrote:

    Shaneal,

    It is, and it isnt masturabation. Somewhere, someone will need this code because they need to make their strlens _really_ fast.

    Posted 06 Jun 2008 at 10:09 am
  10. Porges wrote:

    Joshua,

    Unfortunately, AFAICT, there is no way to do this from the C code. If I edit the assembly manually I lose about 0.001 s

    Posted 06 Jun 2008 at 11:40 am

Post a Comment

Your email is never published nor shared. Required fields are marked *