Deep Diving into Amazon S3

Hello Hello :D
Welcome to Part 2! Now that we understand the problems of massive scale, let's explore some of AWS's solutions. In this post, we'll dive into:

Decorrelated Systems
Shuffle Sharding
A Sneak Peek at Erasure Coding

03 - Designing Decorrelated systems

A core design principle of AWS S3 is decorrelation, which aims to ensure that failures are isolated and don't cascade through the system. This concept is applied on two critical levels:

Between Customers: The activity of one customer should be completely independent of another's. A massive spike in traffic from Customer A shouldn't slow down or affect Customer B.
Within a Single Workload: Even different requests from the same customer should be decorrelated from each other. For example, one PUT request to upload a file shouldn't fail just because a second, unrelated PUT request from that same customer ran into a problem.

Let's simplify this. From a high-level view, S3 has two main functions: PUT (to upload data) and GET (to retrieve it). When a customer initiates a PUT request, the system's fundamental challenge is to decide which hard drives will store that data.

A simplistic approach would be to store the entire file on one hard drive and simply assign a new one if the file is too large. However, as we've discussed, this creates a single point of failure and many other correlated risks.

this approach has 2 problems:

Single Point of Failure: If that one hard drive fails for any reason, the entire file is permanently lost. This offers zero durability, which is unacceptable for a service designed to protect data.
Storage Inefficiency: Imagine a file is slightly too large for the remaining space on a drive. The system would have to assign a brand-new 20 TB drive just to store a tiny amount of overflow data, like 1 GB. This is not a feasible or cost-effective solution.

This is why splitting a file into smaller shards and distributing them across multiple hard drives is a far more feasible and robust solution.

Shuffle Sharding

The second approach is to shuffle the shards and store them on random hard drives. This technique is called shuffle sharding.

Shuffle sharding is a concept that appears frequently across AWS services.

Shuffle Sharding in DNS with Route 53

Before your computer can send a request to S3, it has to find it on the internet. This crucial first step relies on another critical AWS service: Amazon Route 53.

Simply put, Route 53 is AWS's Domain Name System (DNS). Think of it as the internet's phonebook: it translates human-readable domain names (like my-bucket.s3.amazonaws.com) into the numeric IP addresses computers need to connect.

This means that every single request to S3 first passes through Route 53. As you might guess, this makes the DNS layer another critical place where AWS applies shuffling to ensure maximum resilience.

When your computer asks Route 53 for a bucket's address, it doesn't get just one IP address back. Instead, Route 53 returns a set of multiple IP addresses.

Each of these IPs points to a different entry point within the massive S3 fleet. This isn't an accident, it's shuffle sharding in action at the DNS layer.

Route 53's algorithm has a large pool of servers it can direct requests to. For each DNS query, it pseudo-randomly selects a small, unique subset of these servers and returns their IP addresses. The primary goal is to decorrelate requests by spreading them across different hardware.

This system is so dynamic that if you make a second DNS query for the exact same bucket moments later, Route 53 will likely return a completely different set of IP addresses, ensuring that your interactions are not tied to a fixed point of failure.

This shuffle sharding at the DNS level is also a powerful tool for fault tolerance.

Here's how it works when a request fails:

Let's say this request fails because the specific server it connected to is temporarily overloaded or has an issue.
The Route 53 is designed with built-in logic that automatically retries the failed request.

Crucially, this retry doesn't just go back to the same failed address. The SDK will often re-resolve the DNS name, and because of shuffling, Route 53 provides a brand-new set of IP addresses.

This means the new attempt is directed to a completely different set of healthy servers, effectively navigating around the original problem. This combination of client-side retries and shuffle-sharded DNS responses dramatically reduces the chance of a follow-up failure, making the entire system incredibly resilient.

The Placement Problem: Where Do the Shards Go?

Now that we understand the what (shuffle sharding) and the why (decorrelation), we get to the how: How does S3 actually place these shards onto its drives in the most optimal way?

Solution 1: Scan Every Drive

The most "optimal" solution would be to scan every single drive in the fleet, find the one with the most available space and the lowest current workload, and place the shard there. However, with millions of drives, a system-wide scan for every single write operation would be incredibly slow and completely impractical.

Solution 2: A Single Random Choice

The next logical approach is to simply pick one drive at random and place the shard there. It's fast and simple. When the AWS engineers modeled this approach, however, they discovered a problem. The resulting disk usage across the fleet looked something like this:

This graph shows a classic bell-curve distribution. While the load is spread out, it's far from even. Many drives are significantly underutilized (the short bars at the ends), while a smaller group takes on most of the load (the tall bars in the middle). This isn't the hyper-efficiency S3 is known for.

So, the engineers implemented a brilliantly simple alternative.

Solution 3: The "Power of Two Choices"

The solution is remarkably simple yet incredibly effective. Instead of picking just one random drive for a shard, the system picks two.

It then quickly compares them and places the shard on whichever of the two drives is less busy (i.e., has more free space). This tiny change from one random choice to choosing the better of two has a dramatic effect on the overall system balance. Refer the above image for reference

Sneak Peek: Erasure Coding

Alright, I'm saving one of the most interesting topics for last and will give you a quick preview here before we explore it thoroughly in the next post.

When S3 divides your files into shards, it doesn't just split the original data. Instead, it uses a process to create additional, redundant shards.

For example, if your file is broken into 8 data shards, the system might generate 4 extra "parity" shards. This means a total of 12 shards are then distributed across different drives.

"But why add extra data?" you might ask.

For those who don't know, AWS famously advertises that S3 is designed for "eleven nines" of durability (that's 99.999999999%). This mind-boggling level of data safety is no accident. How do they achieve it? One of the biggest players is a technique called Erasure Coding.

We'll unpack exactly how it works in the next blog post. Until then, ciao! 😁

Deep Diving into Amazon S3 - Part 2