C++ UTF-8 strlen function


When Unicode first came into being, it was generally implemented as UCS-2, a fixed-width 16-bit encoding that has since been superseded by UTF-16. Java and Microsoft Windows, for instance, store Unicode strings as 16-bit units (Windows in little-endian byte order). UCS-2 uses 2 bytes to store each character (wide chars), so the character é, which has the code point U+00E9, is encoded as \x00 \xe9 in big-endian order. This works okay, except you are limited to the range U+0000 to U+FFFF (the Basic Multilingual Plane, or BMP). UTF-16 goes beyond this range by using surrogate pairs, two 16-bit units per character, but it's kind of a hack in my opinion.
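
To make the 2-byte layout and the surrogate-pair workaround concrete, here is a minimal sketch of a UTF-16BE encoder. The name utf16be_encode is hypothetical (it is not a standard library function), and the sketch assumes the caller passes a valid code point no larger than U+10FFFF that is not itself a surrogate.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16BE bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::string utf16be_encode(std::uint32_t cp)
{
    std::string out;
    // Append one 16-bit unit, high byte first (big-endian).
    auto put16 = [&out](std::uint32_t unit) {
        out += static_cast<char>((unit >> 8) & 0xFF);
        out += static_cast<char>(unit & 0xFF);
    };
    if (cp < 0x10000) {
        put16(cp);                    // BMP: a single 16-bit unit
    } else {
        cp -= 0x10000;                // leaves a 20-bit value
        put16(0xD800 | (cp >> 10));   // high surrogate carries the top 10 bits
        put16(0xDC00 | (cp & 0x3FF)); // low surrogate carries the bottom 10 bits
    }
    return out;
}
```

With this sketch, utf16be_encode(0xE9) yields the two bytes \x00 \xe9, while a code point outside the BMP such as U+1F600 yields four bytes, \xd8 \x3d \xde \x00.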

UTF-8 is superior because it is backwards compatible with ASCII (1 byte per character for the first 128 code points), and it uses the high bits of the first byte to indicate how many bytes a multi-byte character takes. This means a UTF-8 character can be 1, 2, 3, or 4 bytes long. Chinese, for instance, usually takes up 3 bytes per character.
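
The bit layout is easiest to see from the encoding side. The following is a rough sketch of a UTF-8 encoder; utf8_encode is a hypothetical name, and the sketch assumes a valid code point (no surrogates, nothing above U+10FFFF) rather than checking for one.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::string utf8_encode(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 0bbbbbbb: plain ASCII, 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110bbbbb 10bbbbbb: 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110bbbb 10bbbbbb 10bbbbbb: 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb: 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, utf8_encode(0xE9) gives the 2 bytes \xc3 \xa9 for é, and utf8_encode(0x4F60) gives the 3 bytes \xe4 \xbd \xa0 for 你.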

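The C standard strlen function returns the number of bytes before the terminating null, not the number of characters. If the string is encoded in UTF-8, it is nice to have a way to see how many UTF-8 characters (code points) there are. For instance, 你好 (ni hao in Chinese) is encoded in UTF-8 as 6 bytes, "\xe4\xbd\xa0\xe5\xa5\xbd", but it is only 2 characters long. The following program counts characters by looking at the first byte of each one:
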
#include <iostream>
#include <string>
 
using namespace std;
 
// Returns the number of UTF-8 characters (code points) in s,
// or 0 if an invalid lead byte is found.
int utf8_strlen(const string& s);
 
int main(int argc, char *argv[])
{
    string hello = "hello world"; // length 11
    string portg = "ol\xc3\xa1 mundo"; // olá mundo, length 9
    string nihao = "\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c"; // 你好世界, length 4
 
    cout << "string: " << hello << " length:" << utf8_strlen(hello) << endl;
    cout << "string: " << portg << " length:" << utf8_strlen(portg) << endl;
    cout << "string: " << nihao << " length:" << utf8_strlen(nihao) << endl;
    return 0;
}
 
int utf8_strlen(const string& str)
{
    int c, i, ix, q;
    // i walks over bytes, q counts characters. The lead byte of each
    // character tells us how many continuation bytes to skip.
    // Note: continuation bytes themselves are skipped, not validated.
    for (q = 0, i = 0, ix = str.length(); i < ix; i++, q++)
    {
        c = (unsigned char) str[i];
        if      (c <= 127)           i += 0; // 0bbbbbbb: 1 byte (ASCII)
        else if ((c & 0xE0) == 0xC0) i += 1; // 110bbbbb: 2-byte character
        else if ((c & 0xF0) == 0xE0) i += 2; // 1110bbbb: 3-byte character
        else if ((c & 0xF8) == 0xF0) i += 3; // 11110bbb: 4-byte character
        // 5- and 6-byte lead bytes (111110bb, 1111110b) are unnecessary in 4-byte UTF-8
        else return 0; // invalid UTF-8: not a lead byte
    }
    return q;
}

Output:

string: hello world length:11
string: olá mundo length:9
string: 你好世界 length:4
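
As an alternative to walking lead bytes, a common shortcut is to count every byte that is not a continuation byte (10bbbbbb), since each character contributes exactly one such byte. The sketch below uses a hypothetical name, utf8_count_code_points, and assumes the input is already valid UTF-8; unlike utf8_strlen it never reports an error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count UTF-8 code points by counting the bytes that
// are NOT continuation bytes. Assumes s is valid UTF-8 (no validation).
std::size_t utf8_count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10bbbbbb, i.e. (c & 0xC0) == 0x80.
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the 你好世界 string used in the program, size() reports 12 bytes while this function reports 4 characters, the same count utf8_strlen gives.
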
code snippets are licensed under Creative Commons CC-By-SA 3.0 (unless otherwise specified)

Adeet Phanse on 2016-06-21 17:51:57
Hi thanks this code really helped me understand what was going on with the UTF Encoding because I was very confused why string length was returning much larger values than the actual length of the string. Thanks a lot for writing this!