Designing a URL Shortener
Step 1: Requirement Justification
Functional Requirements
- Given a long URL, we want to shorten it into something like bit.ly/abcxyz.
- The shortened URL should redirect to the actual URL.
- The shortened URL must be unique and should only point to one long URL.
- Users should be able to select a custom alias.
- The shortlink should expire after some time.
Non-Functional Requirements
- The service should be highly available.
- The data should not be lost in any case.
- Shortened links should not be guessable.
Step 2: System Interface Definition
shortenURL(string api_key, string long_url, optional string custom_alias)
Returns a string: the shortened URL.
Step 3: Capacity Estimation
- Let's assume 1 million people use our service to generate shortlinks every day.
- We keep shortlinks for 10 years: 1M × 365 × 10 ≈ 3.65B shortlinks.
- If we use base64, 6 characters give us 64^6 ≈ 68.7B possible combinations, comfortably more than we need.
- Each 6-character shortlink is 6 bytes.
Storage
- 3.65B × 6 bytes ≈ 22 GB of shortlinks.
- Assume we have 500 Bytes of data per shortlink (long URL, timestamp, userID etc).
- We need 3.65B × 500 bytes ≈ 1.8 TB of storage.
Network
- Assume a 100:1 read-to-write ratio.
- That gives 100M read requests (egress) per day: 100M × 500 bytes ≈ 50 GB/day ≈ 500 KB/s.
- Write requests (ingress): 1M × 500 bytes ≈ 500 MB/day ≈ 5 KB/s.
Memory
- Following the 80-20 rule (20% of our URLs bring 80% of our traffic), we cache 20M of the 100M daily requested shortlinks.
- 20M × 500 bytes ≈ 10 GB of cache.
Summarising
- Storage: ~1.8 TB
- Network: ~500 KB/s egress, ~5 KB/s ingress
- Memory: ~10 GB of cache
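The arithmetic above can be checked in a few lines (all numbers are the assumptions from this step):

```python
SECONDS_PER_DAY = 86_400

writes_per_day = 1_000_000                     # 1M new shortlinks/day
total_links = writes_per_day * 365 * 10        # kept for 10 years
record_size = 500                              # bytes per mapping row

storage_bytes = total_links * record_size      # total storage needed
reads_per_day = 100 * writes_per_day           # 100:1 read ratio
egress_bps = reads_per_day * record_size / SECONDS_PER_DAY
ingress_bps = writes_per_day * record_size / SECONDS_PER_DAY
cache_bytes = 20_000_000 * record_size         # cache the 20M hot links

print(f"{total_links / 1e9:.2f}B links, {storage_bytes / 1e12:.2f} TB storage")
print(f"egress ~{egress_bps / 1e3:.0f} KB/s, ingress ~{ingress_bps / 1e3:.1f} KB/s")
print(f"cache ~{cache_bytes / 1e9:.0f} GB")
```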
Step 4: Data Model
We need two tables: one for user info and one for URL mappings.
URL Table
- longURL: string
- shortLink: string
- userID: int
- creationDate: timestamp
- expiryDate: timestamp
User Table
- userID: int
- userName: string
- emailID: string
- creationTime: timestamp
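Expressed as Python dataclasses (field names follow the tables above; the types are indicative, not a storage format):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class URLMapping:
    longURL: str
    shortLink: str        # the 6-character key we look up on redirect
    userID: int
    creationDate: datetime
    expiryDate: datetime

@dataclass
class User:
    userID: int
    userName: str
    emailID: str
    creationTime: datetime
```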
Algorithm
a. Encoding the actual URL
If we try to encode using base64, we will need 6 characters (64^6 ≈ 68.7B combinations). Note that standard base64 includes '+' and '/', so in practice the URL-safe variant (with '-' and '_') is used.
Assume there is a hash function that gives you the encoded string.
Issues with this approach
- The issue with this approach is that if multiple people give the same URL, they get the same shortlink.
- What if parts of the URL are URL-encoded? Percent-encoded variants of the same URL would hash to different shortlinks.
Solutions
- Append an increasing sequence number to the URL before generating the shortlink, then encode it.
- Append the userID to the URL. (login mandatory)
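Combining the hash approach with the fixes above, one possible sketch (MD5 and the `|` separator are arbitrary choices here, and a DB check is still needed for collisions in the truncated hash):

```python
import base64
import hashlib

def encode(long_url: str, user_id: int, seq: int) -> str:
    """Hash the URL plus a discriminator (user ID + sequence number) and
    take the first 6 characters of the URL-safe base64 digest."""
    payload = f"{long_url}|{user_id}|{seq}".encode()
    digest = hashlib.md5(payload).digest()
    return base64.urlsafe_b64encode(digest)[:6].decode()
```

The discriminator means the same URL submitted twice yields different shortlinks, addressing the duplicate-shortlink issue.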
b. Key Generation Service
Have a standalone service that pre-generates random 6-character strings and stores them in a DB (key-DB).
Whenever we need a new shortlink, we just pick one from the key-DB and mark it as used. No collisions, no duplicates.
Concurrency issue — two servers could grab the same key at the same time. Solution: each server gets a range of keys loaded into memory. Once exhausted, fetch a new range.
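The per-server range trick can be sketched like this (`fetch_range` stands in for a transactional call that reserves a disjoint batch of keys in the key-DB):

```python
import threading
from typing import Callable, List

class KeyRangeAllocator:
    """Hands out keys from an in-memory range; refills when exhausted."""

    def __init__(self, fetch_range: Callable[[], List[str]]):
        self._fetch_range = fetch_range
        self._keys: List[str] = []
        self._lock = threading.Lock()   # safe across request threads

    def next_key(self) -> str:
        with self._lock:
            if not self._keys:
                # Reserve a fresh range from the key-DB (marked used there).
                self._keys = self._fetch_range()
            return self._keys.pop()
```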
Key-DB size — 6 characters in base64 = ~68.7B keys. At 6 bytes each, that's ~412 GB. Totally manageable.
Step 5: High Level Design
Client
|
Load Balancer
|
App Servers ---- Key Generation Service ---- Key-DB
|
DB (URL mappings + User info)
|
Cache (hot URLs)
- Write path: client sends long URL → app server grabs a key from KGS → stores mapping in DB → returns shortlink.
- Read path: client hits shortlink → app server checks cache → if miss, checks DB → returns 302 redirect.
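The read path, sketched with plain dicts standing in for Redis and the mapping DB:

```python
def resolve(short_link: str, cache: dict, db: dict):
    """Cache-aside read: check cache, fall back to DB, warm cache on miss.
    Returns (status, url): 302 on hit, 404 if the link is unknown."""
    url = cache.get(short_link)
    if url is None:
        url = db.get(short_link)
        if url is None:
            return 404, None
        cache[short_link] = url   # populate cache for subsequent reads
    return 302, url
```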
Step 6: Detailed Design
DB Choice
We're looking at simple key-value lookups with no complex joins. A NoSQL store like DynamoDB or Cassandra works well here. We need high availability and horizontal scaling.
Partitioning
- Range-based (by first letter of shortlink): can lead to unbalanced partitions.
- Hash-based: distribute URLs uniformly across partitions. This is the better choice.
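Hash-based partitioning is conceptually a one-liner; using a stable hash (not Python's salted built-in `hash`) keeps the mapping consistent across processes:

```python
import hashlib

def partition_for(short_link: str, num_partitions: int) -> int:
    """Map a shortlink to a partition via a stable hash of the key."""
    digest = hashlib.sha256(short_link.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

One caveat: naive modulo remaps most keys when a partition is added; consistent hashing avoids that at the cost of some complexity.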
Cache
- Use an in-memory cache (Redis/Memcached) in front of the DB.
- LRU eviction policy works fine since we're caching the 20% hot URLs.
- Cache sits between app servers and DB.
Cleanup
A lightweight service that periodically scans for expired links and removes them. Freed keys can be recycled back into the key-DB.
Step 7: Bottlenecks
- Single point of failure: replicate KGS with a standby. Replicate DB across regions.
- High traffic: load balancer distributes reads across servers. Cache absorbs most read load.
- Data loss: DB replication + regular backups.
- Analytics: if we want click tracking, we can log redirects asynchronously to avoid slowing down reads.
That's it. A minimal but solid URL shortener design.