393.UTF-8 Validation

Tags: [bit], [encoding], [UTF-8]

Com: {g}

Link: https://leetcode.com/problems/utf-8-validation/?tab=Description

A character in UTF8 can be from1 to 4 bytes long, subjected to the following rules:

  1. For 1-byte character, the first bit is a 0, followed by its unicode code.
  2. For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.

This is how the UTF-8 encoding would work:Given an array of integers representing the data, return whether it is a valid utf-8 encoding.

Note:
The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

data = [235, 140, 4], which represented the octet sequence: 11101011 10001100 00000100.

Return false.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Better Solution:

class Solution(object):
    def validUtf8(self, data):
        """
        :type data: List[int]
        :rtype: bool
        """
        if not data:
            return False

        index = 0
        while index < len(data):
            num_of_bytes = self.get_num_of_leading_ones(data[index])
            if num_of_bytes > 4 or num_of_bytes == 1 or index + num_of_bytes > len(data):
                return False

            i = index + 1
            while i < len(data) and i < index + num_of_bytes:
                num_of_ones = self.get_num_of_leading_ones(data[i])
                if num_of_ones != 1:
                    return False

                i += 1

            index = i

        return True

    def get_num_of_leading_ones(self, num):
        counter = 0
        for i in xrange(8):
            bit = (num >> i) & 1
            if bit == 0:
                counter = 0
            else:
                counter += 1

        return counter

Revelation:

  • The number of bytes cannot be equal to 1, because if the UTF-8 has one byte, then its most significant bit should be 0.

Note:

  • Time complexity = O(n), n is the number of elements in data.
  • Space complexity = O(1).

Solution: bit manipulation

class Solution(object):
    def validUtf8(self, data):
        """
        :type data: List[int]
        :rtype: bool
        """
        if not data:
            return False

        index = 0
        while index < len(data):
            curr_byte = data[index]
            num_of_prefix_ones = self.get_num_of_prefix_ones(curr_byte)
            if not num_of_prefix_ones:
                index += 1
                continue

            if num_of_prefix_ones == 1 or num_of_prefix_ones > 4 or\
               not self.after_prefix_ones_is_zero(curr_byte, num_of_prefix_ones):
                return False

            next_byte_index = index + 1
            last_byte_index = next_byte_index + (num_of_prefix_ones - 1) - 1

            if next_byte_index >= len(data) or last_byte_index >= len(data):
                return False
            else:
                for i in xrange(next_byte_index, last_byte_index + 1):
                    if not self.start_with_10(data[i]):
                        return False

                index = last_byte_index + 1

        return True

    def get_num_of_prefix_ones(self, num):
        counter = 0
        for i in xrange(7, -1, -1):
            if (num >> i) & 1:
                counter += 1
            else:
                break

        return counter

    def after_prefix_ones_is_zero(self, byte, num_of_prefix_ones):
        return (byte >> (7 - num_of_prefix_ones)) & 1 == 0 

    def start_with_10(self, num):
        return (num >> 7) == 1 and (num >> 6) & 1 == 0

Revelation:

  • If we want to check one of the bits in the number is 1 or 0, we should shift the number itself to right. For example, if we want to check the 194's 8th bit (right to left) is 1 or 0, we can do (194 >> 7 ) & 1 == 1. And please note 194 & (1 << 7) = 128.

Note:

  • Time complexity = O(n), n is the number of elements of the given array.
  • num << 0 or num >> 0 is equal to num itself.
  • 1 << 1 = 2, 1 << 2 = 4, 1 << 3 = 8, so 2^n = 1 << n.
  • For example: data = [A, B, C], if the A's num_of_prefix_ones = 0, so B's num_of_prefix_ones cannot be 1, because B is not the continuation byte for A, so B at least should have two prefix ones.

results matching ""

    No results matching ""