534.Design TinyURL

Tags: [system], [system_design], [url], [distributed_system], [qps], [queries_per_sencond], [knowledge]

Com: {fb}

Link: https://leetcode.com/problems/design-tinyurl/#/description

Note: For the coding companion problem, please see:

Encode and Decode TinyURL.

How would you design a URL shortening service that is similar toTinyURL?

Background:
TinyURL is a URL shortening service where you enter a URL such ashttps://leetcode.com/problems/design-tinyurland it returns a short URL such ashttp://tinyurl.com/4e9iAk.

Requirements:

  1. For instance, "http://tinyurl.com/4e9iAk" is the tiny url for the page"https://leetcode.com/problems/design-tinyurl"The identifier (the highlighted part) can be any string with 6 alphanumeric characters containing0-9,a-z,A-Z.
  2. Each shortened URL must be unique; that is, no two different URLs can be shortened to the same URL.

Note about Questions:
Below are just a small subset of questions to get you started. In real world, there could be many follow ups and questions possible and the discussion is open-ended (No one true or correct way to solve a problem). If you have more ideas or questions, please ask in Discuss and we may compile it here!

Questions:

  1. How many unique identifiers possible? Will you run out of unique URLs?
  2. Should the identifier be increment or not? Which is easier to design? Pros and cons?
  3. Mapping an identifier to an URL and its reversal - Does this problem ring a bell to you?
  4. How do you store the URLs? Does a simple flat file database work?
  5. What is the bottleneck of the system? Is it read-heavy or write-heavy?
  6. Estimate the maximum number of URLs a single machine can store.
  7. Estimate the maximum number of queries per second (QPS) for decoding a shortened URL in a single machine.
  8. How would you scale the service? For example, a viral link which is shared in social media could result in a peak QPS at a moment's notice.
  9. How could you handle redundancy? i,e, if a server is down, how could you ensure the service is still operational?
  10. Keep URLs forever or prune, pros/cons? How we do pruning? (Contributed by @alex_svetkin)
  11. What API would you provide to a third-party developer? (Contributed by @alex_svetkin)
  12. If you can enable caching, what would you cache and what's the expiry time? (Contributed by @Humandroid)

Solution:

# You can type code here and execute it

# Knowledge:
#     (1) QPS:
#         Reference: https://en.wikipedia.org/wiki/Queries_per_second
#         Number of queries per second
#     (2) 千级别的qps可以单台SSD MySQL搞定
#     (3) 1 GB = 1000 MB, 1 MB = 1 million Bytes, so 1 GB = 1 billion Bytes.
#     (4) Using 6-digit short url identifier, each digit is stored in one char, assume one char use 2 bytes, so each identifer will be 12 bytes. So if one single machine is 1 TB, it means we can store 1000 billion / 12 identifiers, which is about 83 billion identifiers. If each day there will be 1GB disk usage, 1TB machine can be used for 3 years. But some database need half of total disk to do optimization.
#     (5) NoSQL cannot do transaction, Relational DB can do transaction.
#     (6) Usually SQL gets more ability to describe the data than NoSQL, but today's NoSQL has become more and more better.
#     (7) If we use 6-digit(0-9,a-z,A-Z) to represent the idenifier, (10 + 26 + 26)^6 = 57 billion, 整个互联网大约有trillion级别的网页数, 1 trillion = 1000 billion, 从某个角度看57 billion不太够,但其实也挺大了,如果我们不需要考虑整个互联网这么大规模的系统的话, so 6 digit is enough.
#     (8) 如果不用6-digit(0-9, a-z, A-Z), 可以用hash,md5或sha1, 但所有的hash都难以避免冲突,如果遇到冲突可以用retry来解决冲突,如果总数据量小的话,冲突发生的不频繁,但如果数据量大的话,冲突会频繁发生,如果数据量超过 1 billion 后,冲突会非常频繁发生,大量的retry会是整个系统变得低效.
#     (9) 那么我们这里选择用6-digit的方法:有两种方式来产生6-digit identifier
#         (1) buff = [None for _ in xrange(6)], 然后随机的去填每一个slot,如果遇到冲突时做retry
#         (2) Maintain a global counter, we can convert this 10base counter to 62base 6-digit identifier, then after generating the identifier, we increase the counter. 
#         我们维护一个全局的counter,这个counter是10进制的,
#         我们利用(0-9, a-z, A-Z)把它转化成62进制,这个62进制就是我们的identifier,但这么做有两个缺点,如果分布式系统的话,就要在分布式系统的全局同步一个counter,缺点二是外部用户可以把这个62进制转化成10进制,然后就可以知道系统里已经有多少url啦。


class Codec(object):
    def __init__(self):
        self.counter = 0
        self.short_to_long_url = {}
        self.long_to_short_url = {}

    def encode(self, long_url):
        if not long_url:
            return ''
        if long_url in self.long_to_short_url:
            return self.long_to_short_url[long_url]

        identifier = self.convert_10base_to_62base(self.counter)
        self.counter += 1

        self.long_to_short_url[long_url] = identifier
        self.short_to_long_url[identifier] = long_url
        return identifier

    def decode(self, short_url):
        if not short_url:
            return ''
        if short_url in self.short_to_long_url:
            return self.short_to_long_url[short_url]
        else:
            raise Exception('Fail to decode')

    def convert_10base_to_62base(self, num):
        if not num:
            return '0'

        result = []
        while num > 0:
            mod = num % 62
            if 0 <= mod <= 9:
                result.append(str(mod))
            elif 10 <= mod <= 35:
                result.append(chr(ord('a') + (mod - 10)))
            else:
                # 36 <= mod <= 61
                result.append(chr(ord('A') + (mod - 36)))

            num /= 62

        result.reverse()
        return ''.join(result)

if __name__ == '__main__':
    codec = Codec()
    long_url = 'http://leetcode/tiny_url'
    short_url = codec.encode(long_url)
    print short_url

    long_url = codec.decode(short_url)
    print long_url

results matching ""

    No results matching ""