UTF-8 Validation in C++



Suppose we have a list of integers representing the data, one integer per byte. We have to check whether it is a valid UTF-8 encoding or not. One UTF-8 character can be 1 to 4 bytes long. There are some properties −

  • For a 1-byte character, the first bit is 0, followed by its Unicode code.

  • For an n-byte character, the first n bits are all 1s, the (n+1)th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.
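The two properties above can be sketched as a pair of bit-mask checks. The helper names `sequenceLength` and `isContinuation` are hypothetical and are used here only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: returns the total number of bytes (1 to 4) announced by
// a leading byte, or 0 if the byte cannot start a character.
int sequenceLength(std::uint8_t b) {
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // 10xxxxxx or 11111xxx
}

// Hypothetical helper: true when the two most significant bits are 10.
bool isContinuation(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}
```

For instance, `sequenceLength(197)` gives 2 because 197 is 11000101, and `isContinuation(130)` is true because 130 is 10000010.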

So the encoding technique is as follows −

Character number range (hexadecimal) | UTF-8 octet sequence (binary)
0000 0000-0000 007F                  | 0xxxxxxx
0000 0080-0000 07FF                  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF                  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF                  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
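To see how the x positions in the table are filled, here is a minimal sketch of an encoder that follows the four rows directly. The function name `encodeUtf8` is hypothetical, and the sketch assumes the code point is at most 0x10FFFF; it does not check for surrogates or other excluded values.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one code point (assumed <= 0x10FFFF) into UTF-8
// octets by distributing its bits over the x positions of the matching row.
std::vector<std::uint8_t> encodeUtf8(std::uint32_t cp) {
    using B = std::uint8_t;
    if (cp <= 0x7F) {    // 0xxxxxxx
        return {static_cast<B>(cp)};
    }
    if (cp <= 0x7FF) {   // 110xxxxx 10xxxxxx
        return {static_cast<B>(0xC0 | (cp >> 6)),
                static_cast<B>(0x80 | (cp & 0x3F))};
    }
    if (cp <= 0xFFFF) {  // 1110xxxx 10xxxxxx 10xxxxxx
        return {static_cast<B>(0xE0 | (cp >> 12)),
                static_cast<B>(0x80 | ((cp >> 6) & 0x3F)),
                static_cast<B>(0x80 | (cp & 0x3F))};
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return {static_cast<B>(0xF0 | (cp >> 18)),
            static_cast<B>(0x80 | ((cp >> 12) & 0x3F)),
            static_cast<B>(0x80 | ((cp >> 6) & 0x3F)),
            static_cast<B>(0x80 | (cp & 0x3F))};
}
```

With this sketch, code point 0x142 falls in the second row and encodes to the two octets 197 and 130.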

So if the input is like [197, 130, 1], this represents the octet sequence 11000101 10000010 00000001, so the result is true. It is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.
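The octet sequence for an input can be reproduced with `std::bitset`. The helper name `toOctets` is hypothetical, and the sketch assumes only the low 8 bits of each integer matter.

```cpp
#include <bitset>
#include <string>
#include <vector>

// Hypothetical helper: renders the low 8 bits of each value as a binary octet,
// with the octets separated by single spaces.
std::string toOctets(const std::vector<int>& data) {
    std::string out;
    for (int value : data) {
        if (!out.empty()) out += ' ';
        out += std::bitset<8>(static_cast<unsigned long long>(value & 0xFF)).to_string();
    }
    return out;
}
```

Calling `toOctets({197, 130, 1})` yields the string `11000101 10000010 00000001`, matching the sequence above.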

To solve this, we will follow these steps (the divisions are integer divisions, and the values compared against are written in binary) −

  • cnt := 0

  • for i in range 0 to size of data array − 1

    • x := data[i]

    • if cnt is 0, then

      • if x/32 = 110, then set cnt as 1

      • otherwise when x/16 = 1110, then cnt := 2

      • otherwise when x/8 = 11110, then cnt := 3

      • otherwise when x/128 is not 0, then return false

    • otherwise

      • if x/64 is not 10, then return false

      • decrease cnt by 1

  • return true when cnt is 0, otherwise return false

Example(C++)

Let us see the following implementation to get a better understanding −

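#include <iostream>
#include <vector>
using namespace std;
class Solution {
   public:
   bool validUtf8(vector<int>& data) {
      int cnt = 0; // continuation bytes still expected
      for(size_t i = 0; i < data.size(); i++){
         int x = data[i];
         if(!cnt){
            // Leading byte: its prefix tells how many continuation bytes follow
            if((x >> 5) == 0b110){
               cnt = 1;
            }
            else if((x >> 4) == 0b1110){
               cnt = 2;
            }
            else if((x >> 3) == 0b11110){
               cnt = 3;
            }
            else if((x >> 7) != 0) return false; // not of the form 0xxxxxxx either
         } else {
            // Continuation byte: must be of the form 10xxxxxx
            if((x >> 6) != 0b10) return false;
            cnt--;
         }
      }
      return cnt == 0; // no character may be left unfinished
   }
};
int main(){
   Solution ob;
   vector<int> v = {197,130,1};
   cout << (ob.validUtf8(v));
}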

Input

[197,130,1]

Output

1
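It can also help to try inputs that should be rejected. The following is a minimal sketch of the same steps written as a free function with bit masks; the name `isValidUtf8` is hypothetical, and masking each value with 0xFF is an assumption that only the low 8 bits of each integer carry data. Like the steps above, it checks only the bit patterns and does not reject overlong encodings.

```cpp
#include <vector>

// Hypothetical free-function variant of the steps above.
bool isValidUtf8(const std::vector<int>& data) {
    int pending = 0;  // continuation bytes still expected
    for (int value : data) {
        const int b = value & 0xFF;  // assumption: only the low 8 bits are data
        if (pending == 0) {
            if ((b & 0xE0) == 0xC0) pending = 1;         // 110xxxxx
            else if ((b & 0xF0) == 0xE0) pending = 2;    // 1110xxxx
            else if ((b & 0xF8) == 0xF0) pending = 3;    // 11110xxx
            else if ((b & 0x80) != 0) return false;      // not 0xxxxxxx
        } else {
            if ((b & 0xC0) != 0x80) return false;        // not 10xxxxxx
            --pending;
        }
    }
    return pending == 0;  // false if the last character is truncated
}
```

For example, [235, 140, 4] is rejected because 235 (11101011) announces a 3-byte character but the third byte, 4 (00000100), does not start with 10. A lone continuation byte such as [145] and a truncated sequence such as [240, 162, 138] are rejected as well.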
Updated on: 2020-05-02T08:26:28+05:30
