534.Design TinyURL
Tags: [system], [system_design], [url], [distributed_system], [qps], [queries_per_sencond], [knowledge]
Com: {fb}
Link: https://leetcode.com/problems/design-tinyurl/#/description
Note: For the coding companion problem, please see:
How would you design a URL shortening service that is similar toTinyURL?
Background:
TinyURL is a URL shortening service where you enter a URL such ashttps://leetcode.com/problems/design-tinyurl
and it returns a short URL such ashttp://tinyurl.com/4e9iAk
.
Requirements:
- For instance, "http://tinyurl.com/4e9iAk" is the tiny url for the page
"https://leetcode.com/problems/design-tinyurl"
The identifier (the highlighted part) can be any string with 6 alphanumeric characters containing0-9
,a-z
,A-Z
. - Each shortened URL must be unique; that is, no two different URLs can be shortened to the same URL.
Note about Questions:
Below are just a small subset of questions to get you started. In real world, there could be many follow ups and questions possible and the discussion is open-ended (No one true or correct way to solve a problem). If you have more ideas or questions, please ask in Discuss and we may compile it here!
Questions:
- How many unique identifiers possible? Will you run out of unique URLs?
- Should the identifier be increment or not? Which is easier to design? Pros and cons?
- Mapping an identifier to an URL and its reversal - Does this problem ring a bell to you?
- How do you store the URLs? Does a simple flat file database work?
- What is the bottleneck of the system? Is it read-heavy or write-heavy?
- Estimate the maximum number of URLs a single machine can store.
- Estimate the maximum number of queries per second (QPS) for decoding a shortened URL in a single machine.
- How would you scale the service? For example, a viral link which is shared in social media could result in a peak QPS at a moment's notice.
- How could you handle redundancy? i,e, if a server is down, how could you ensure the service is still operational?
- Keep URLs forever or prune, pros/cons? How we do pruning? (Contributed by @alex_svetkin)
- What API would you provide to a third-party developer? (Contributed by @alex_svetkin)
- If you can enable caching, what would you cache and what's the expiry time? (Contributed by @Humandroid)
Solution:
# You can type code here and execute it
# Knowledge:
# (1) QPS:
# Reference: https://en.wikipedia.org/wiki/Queries_per_second
# Number of queries per second
# (2) 千级别的qps可以单台SSD MySQL搞定
# (3) 1 GB = 1000 MB, 1 MB = 1 million Bytes, so 1 GB = 1 billion Bytes.
# (4) Using 6-digit short url identifier, each digit is stored in one char, assume one char use 2 bytes, so each identifer will be 12 bytes. So if one single machine is 1 TB, it means we can store 1000 billion / 12 identifiers, which is about 83 billion identifiers. If each day there will be 1GB disk usage, 1TB machine can be used for 3 years. But some database need half of total disk to do optimization.
# (5) NoSQL cannot do transaction, Relational DB can do transaction.
# (6) Usually SQL gets more ability to describe the data than NoSQL, but today's NoSQL has become more and more better.
# (7) If we use 6-digit(0-9,a-z,A-Z) to represent the idenifier, (10 + 26 + 26)^6 = 57 billion, 整个互联网大约有trillion级别的网页数, 1 trillion = 1000 billion, 从某个角度看57 billion不太够,但其实也挺大了,如果我们不需要考虑整个互联网这么大规模的系统的话, so 6 digit is enough.
# (8) 如果不用6-digit(0-9, a-z, A-Z), 可以用hash,md5或sha1, 但所有的hash都难以避免冲突,如果遇到冲突可以用retry来解决冲突,如果总数据量小的话,冲突发生的不频繁,但如果数据量大的话,冲突会频繁发生,如果数据量超过 1 billion 后,冲突会非常频繁发生,大量的retry会是整个系统变得低效.
# (9) 那么我们这里选择用6-digit的方法:有两种方式来产生6-digit identifier
# (1) buff = [None for _ in xrange(6)], 然后随机的去填每一个slot,如果遇到冲突时做retry
# (2) Maintain a global counter, we can convert this 10base counter to 62base 6-digit identifier, then after generating the identifier, we increase the counter.
# 我们维护一个全局的counter,这个counter是10进制的,
# 我们利用(0-9, a-z, A-Z)把它转化成62进制,这个62进制就是我们的identifier,但这么做有两个缺点,如果分布式系统的话,就要在分布式系统的全局同步一个counter,缺点二是外部用户可以把这个62进制转化成10进制,然后就可以知道系统里已经有多少url啦。
class Codec(object):
def __init__(self):
self.counter = 0
self.short_to_long_url = {}
self.long_to_short_url = {}
def encode(self, long_url):
if not long_url:
return ''
if long_url in self.long_to_short_url:
return self.long_to_short_url[long_url]
identifier = self.convert_10base_to_62base(self.counter)
self.counter += 1
self.long_to_short_url[long_url] = identifier
self.short_to_long_url[identifier] = long_url
return identifier
def decode(self, short_url):
if not short_url:
return ''
if short_url in self.short_to_long_url:
return self.short_to_long_url[short_url]
else:
raise Exception('Fail to decode')
def convert_10base_to_62base(self, num):
if not num:
return '0'
result = []
while num > 0:
mod = num % 62
if 0 <= mod <= 9:
result.append(str(mod))
elif 10 <= mod <= 35:
result.append(chr(ord('a') + (mod - 10)))
else:
# 36 <= mod <= 61
result.append(chr(ord('A') + (mod - 36)))
num /= 62
result.reverse()
return ''.join(result)
if __name__ == '__main__':
codec = Codec()
long_url = 'http://leetcode/tiny_url'
short_url = codec.encode(long_url)
print short_url
long_url = codec.decode(short_url)
print long_url