Deep Diving into Amazon S3

We've all been there: mindlessly scrolling through YouTube, wondering if the suggestion algorithm has any idea what we actually like. Most of the time, the algorithm sucks. But every once in a while, it strikes gold. Recently, while deep in one of those YouTube rabbit holes, I was served an AWS re:Invent talk. The title piqued my interest, and what I discovered was completely unexpected: the incredible science that powers Amazon S3. So here I am, sharing the Gyaan no one asked for :D

S3 at scale

AWS is BIGG. It currently dominates the cloud business, holding 30% of the market. To give you an idea of its scale, here are some key statistics about its S3 service:

Data Volume: Amazon S3 stores more than 350 trillion objects.
Request Volume: It averages over 100 million requests per second.
Write Performance: It can handle at least 3,500 write requests per second per prefix.
Read Performance: It can handle at least 5,500 read requests per second per prefix.

At that scale, uploading or retrieving a file from S3 is far more complex than simply saving/reading a file on your computer. The engineering challenges are enormous, but here's the brilliant twist: AWS designed S3 to use its incredible scale as a feature, not a bug. In fact, the bigger it gets, the more efficient and reliable it becomes.

Scale to S3's Advantage:

01 - Physics of data

So, we all know how traditional disk-based hard drives work, right? Here's a recap:

The shiny, spinning disk is where your information is stored, like a song on a record. An arm with a tiny recording head on its tip moves back and forth over the disk. To save (write) a file, the head uses a tiny magnet to store the data on the disk. To open (read) a file, the arm moves the head to the exact right spot to sense the information that's already there.

There are two key terms used to measure how long it will take to retrieve or write a file:

Rotational Latency: The time it takes for the disk to rotate to the correct position where your file is stored.
Seek Time: The time it takes for the read/write head to move to the correct track on the disk.

Here’s the catch: S3 doesn't store your entire file on a single hard drive. Instead, it uses a much more robust strategy to ensure high availability, fault tolerance, and performance.

We'll dive deeper into those benefits later, but here's the key concept: when you upload an object to S3, the system breaks it into smaller pieces called shards. These shards are then distributed across numerous physical drives and servers. So, if your file is split into 10 shards, it doesn't exist in just one place; it's spread across 10 different hard drives.
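To make the idea concrete, here's a minimal Python sketch of splitting an object into shards and putting each shard on a different drive. The `split_into_shards` and `place_shards` helpers, the drive names, and the random placement are all made up for illustration. This is not how S3 actually picks drives, and the sketch ignores redundancy entirely.

```python
import random


def split_into_shards(data: bytes, shard_count: int) -> list[bytes]:
    """Split data into shard_count contiguous pieces (the last ones may be shorter)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    shard_size = -(-len(data) // shard_count)  # ceiling division
    return [data[i * shard_size:(i + 1) * shard_size] for i in range(shard_count)]


def place_shards(shards: list[bytes], drives: list[str], rng: random.Random) -> dict[str, bytes]:
    """Assign each shard to a distinct, randomly chosen drive (assumed policy)."""
    if len(shards) > len(drives):
        raise ValueError("need at least one drive per shard")
    chosen = rng.sample(drives, k=len(shards))
    return dict(zip(chosen, shards))


drives = [f"drive-{n:03d}" for n in range(100)]  # hypothetical fleet of 100 drives
obj = bytes(1_000_000)                            # a 1 MB object of zero bytes
shards = split_into_shards(obj, shard_count=10)
placement = place_shards(shards, drives, random.Random(42))

print(len(shards), "shards on", len(placement), "different drives")
```

Joining the shards back together in order gives you the original object, which is the property any real sharding scheme has to preserve.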

Every storage drive that powers AWS S3 runs a specialized piece of software. This component, sometimes referred to as ShardStore, is a custom-built, log-structured file system. This specific design is key to how S3 achieves its incredible speed and durability.

💡 The big advantage of this log-structured, "journal"-style design is for saving new data. Shards are written neatly in a row on the disk, one after the other. This is extremely fast because the drive's moving arm doesn't have to search for empty space; it just goes to the end of the line. But this speed boost is mainly for writing. When you need to read a file, the system still has to find all the different shards, which might be in various spots on the disk.
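Here's a toy Python sketch of that append-only idea. It is not ShardStore's actual implementation; the `AppendOnlyLog` class and its in-memory `bytearray` "disk" are assumptions, just enough to show why writes are cheap (always go to the tail) while reads need an index lookup and a jump.

```python
class AppendOnlyLog:
    """Toy log-structured store: writes append to the tail, an index maps keys to offsets."""

    def __init__(self) -> None:
        self._log = bytearray()                        # stands in for the disk surface
        self._index: dict[str, tuple[int, int]] = {}   # key -> (offset, length)

    def write(self, key: str, shard: bytes) -> int:
        offset = len(self._log)      # new data always lands at the current end of the log
        self._log.extend(shard)
        self._index[key] = (offset, len(shard))
        return offset

    def read(self, key: str) -> bytes:
        offset, length = self._index[key]   # look up where the shard lives, then jump there
        return bytes(self._log[offset:offset + length])


log = AppendOnlyLog()
offsets = [log.write(f"shard-{i}", bytes([i]) * 4) for i in range(3)]
print(offsets)  # [0, 4, 8]
```

Each write lands right after the previous one, so the offsets just keep growing. A read, on the other hand, can point anywhere in the log.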

Individual Workload:

Many workloads are "bursty," meaning they have periods of peak data processing for a few hours, followed by long periods of very low resource usage. This pattern creates a significant efficiency problem.

For example, I once worked on a project that processed gigabytes of employee data and stored it on S3. This entire job ran for only 4-5 hours each day. For the remaining 19-20 hours, the dedicated cloud resources were operating at less than 10% of their capacity.

Let's imagine a demanding scenario: your company needs to process 1 PB (petabyte) of data, with an average object size of 1 MB. The crucial requirement is that all of this data must be accessed within a single hour.

This works out to a required throughput of roughly 275 Gigabytes per second (GB/s).
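If you want to check that number yourself, here's a quick Python sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes); the exact figure comes out a little above the rounded 275 GB/s used in this post.

```python
PB = 10**15  # bytes in a petabyte (decimal units, an assumption)
GB = 10**9
MB = 10**6

total_bytes = 1 * PB
window_seconds = 60 * 60          # everything must be read within one hour

required_gb_per_s = total_bytes / window_seconds / GB
object_count = total_bytes // MB  # number of 1 MB objects in the dataset
objects_per_second = object_count / window_seconds

print(f"{required_gb_per_s:.1f} GB/s")  # 277.8 GB/s
```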
Let’s calculate how many disk drives we’ll need to store this data and to read it at 275 GB/s:

The Math: Calculating Single-Drive Performance 🧮

Let’s calculate the throughput of a 7200 RPM hard disk. We'll start by figuring out the two main delays: rotational latency and seek time.

Step 1: Determine the Latencies

Average Rotational Latency

This is the time we wait for the disk to spin to the correct spot. We can calculate it directly from the drive's speed.

Revolutions per Second: A 7200 RPM drive spins 7200 times every minute. 7200 rotations / 60 seconds = 120 rotations per second
Time per Full Rotation: The time for one full spin is the inverse of the above. 1 second / 120 rotations ≈ 8.3 milliseconds (ms)
Average Latency: On average, the data we need is halfway around the disk, so we only wait for half a rotation. 8.3 ms / 2 ≈ 4.15 ms

This gives us an average rotational latency of about 4 ms.
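The same steps as a small Python function, in case you want to try other drive speeds. The function name is my own; the logic is just "half of one full rotation".

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the platter: half of one full rotation, in milliseconds."""
    rotations_per_second = rpm / 60
    full_rotation_ms = 1000 / rotations_per_second
    return full_rotation_ms / 2


latency_7200 = avg_rotational_latency_ms(7200)
print(f"{latency_7200:.2f} ms")  # 4.17 ms
```

Without the intermediate rounding the result is 4.17 ms rather than 4.15 ms, which still rounds to about 4 ms.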

Average Seek Time (The Assumption)

This is the time it takes for the actuator arm to physically move the read/write head to the correct track on the disk.

Unlike rotational latency, this isn't a fixed calculation. However, a widely used rule of thumb is that an average seek operation involves moving the head across roughly one-third of the disk's total tracks.
For a standard 7200 RPM drive, we'll assume an average of about 4 ms for this calculation.

Step 2: Calculate the Total Throughput

Now that we have our latency numbers, we can calculate the drive's total data speed.

Total Time per Shard: First, we find the total time to read a single 0.5 MB shard (we assume each 1 MB object is divided into 2 shards, so 1 MB / 2 = 0.5 MB). This includes the latencies we just figured out, plus the time to actually transfer the data (estimated at 2 ms).

Time = Rotational Latency + Seek Time + Data Transfer Time
Time = 4 ms + 4 ms + 2 ms = **10 ms per shard**
Reads Per Second: Next, we see how many shards we can read in one second (1000 ms).

1000 ms / 10 ms per shard = **100 reads per second**
Final Throughput: Finally, we multiply the number of reads by the size of each shard to get our total data speed.

Throughput = 100 reads/sec × 0.5 MB/read = **50 MB/s**

So, after all the math, a single hard drive in our scenario can deliver data at a rate of about 50 MB per second.
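Here's the whole single-drive calculation as a short Python sketch, using the numbers assumed in this post (4 ms rotational latency, 4 ms seek, 2 ms transfer, 0.5 MB shards). The function name and parameters are my own.

```python
def random_read_throughput_mb_s(
    shard_mb: float, rotational_ms: float, seek_ms: float, transfer_ms: float
) -> float:
    """Throughput of a drive doing one random shard read after another."""
    time_per_read_ms = rotational_ms + seek_ms + transfer_ms
    reads_per_second = 1000 / time_per_read_ms
    return reads_per_second * shard_mb


throughput = random_read_throughput_mb_s(shard_mb=0.5, rotational_ms=4, seek_ms=4, transfer_ms=2)
print(f"{throughput:.0f} MB/s")  # 50 MB/s
```

Notice that 8 of the 10 ms per read are spent waiting on the mechanics, not moving data, which is why small random reads are so expensive on spinning disks.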

So, we've established that a single hard drive has a relatively low read speed due to its physical limitations. Now, let's see how this creates a massive challenge when we revisit our company's requirements: storing 1 PB of data with an access speed of 275 GB per second.

Number of disks required to store 1 PB of data (given that one hard disk can store 20 TB): 1,000 TB / 20 TB = 50 hard disks

Number of disks required to read data at 275 GB/s (at 50 MB/s per disk): 275,000 MB/s / 50 MB/s = 5,500 hard disks

50 hard disks to store the data, 5500 hard disks to read the data (Crazyyy)
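The same comparison in Python, using the figures from this post (20 TB per drive, 50 MB/s per drive, 275 GB/s required). The variable names are mine.

```python
import math

dataset_tb = 1_000            # 1 PB expressed in TB
drive_capacity_tb = 20        # per-drive capacity assumed in this post
required_mb_s = 275_000       # 275 GB/s expressed in MB/s
drive_throughput_mb_s = 50    # per-drive random-read throughput from the math above

drives_for_capacity = math.ceil(dataset_tb / drive_capacity_tb)
drives_for_throughput = math.ceil(required_mb_s / drive_throughput_mb_s)
ratio = drives_for_throughput / drives_for_capacity

print(drives_for_capacity, drives_for_throughput, ratio)  # 50 5500 110.0
```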

Don’t worry if the above math is a little too complicated (I suck at math too). Just know that this reveals the core problem: the company needs over 100 times more hardware to meet its performance requirements than it does for simple storage capacity. This forces them to pay for an enormous infrastructure that, due to the bursty nature of the workload, would sit almost completely idle for 23 hours a day.

Ohh but wait, S3 has millions of customers, right? So isn’t this problem actually 100x worse for AWS?
NO

Combining Workloads:

As we saw above, individual workloads are bursty. But when you combine millions of workloads, the aggregate throughput requirement isn’t bursty at all. It's surprisingly constant and predictable:

As more workloads are combined, the aggregate IOPS becomes more and more consistent.

Because this aggregate demand is so predictable, AWS can provision exactly the right amount of throughput and the precise number of disk drives required to meet it. This is the first key insight into how massive scale, which seems like a challenge, actually works in AWS's favor, allowing them to operate with incredible efficiency.
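You can see this smoothing effect with a tiny simulation. The sketch below is a made-up model, not AWS data: every workload bursts for 4 hours at a random time of day (demand 100) and idles the rest of the time (demand 5). It then compares the peak hour to the average hour as more workloads are stacked together.

```python
import random

HOURS = 24


def bursty_workload(rng: random.Random, burst_hours: int = 4,
                    peak: float = 100.0, idle: float = 5.0) -> list[float]:
    """Hourly demand for one workload: a short burst starting at a random hour."""
    start = rng.randrange(HOURS)
    active = {(start + k) % HOURS for k in range(burst_hours)}
    return [peak if hour in active else idle for hour in range(HOURS)]


def peak_to_mean(workload_count: int, seed: int = 0) -> float:
    """Ratio of the busiest hour to the average hour across all workloads combined."""
    rng = random.Random(seed)
    total = [0.0] * HOURS
    for _ in range(workload_count):
        for hour, demand in enumerate(bursty_workload(rng)):
            total[hour] += demand
    return max(total) / (sum(total) / HOURS)


for n in (1, 10, 1_000, 10_000):
    print(n, round(peak_to_mean(n), 2))
```

For a single workload the peak is 4.8 times the average. As the count grows, the ratio drifts toward 1, meaning the busiest hour looks almost like any other hour.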

02 - Thermodynamics - Balancing the aggregates

Another problem with storing shards on different drives is figuring out how traffic gets distributed across the drives. You may ask why?

First, we need to understand two key concepts: younger data and older data. In large-scale storage, a universal pattern emerges:

Younger Data (Hot 🔥): Data that was recently created or modified is accessed very frequently.
Older Data (Cold ❄️): As data ages, it's accessed far less often.

If a system were to group data naively, placing all new, "hot" data on one set of disks and all old, "cold" data on another, it would create two critical problems.

Problem 1: Unbalanced and Inefficient Workloads

This approach leads to a massive imbalance. The disks holding "young" data would be constantly overworked, handling a high volume of read and write requests. At the same time, the disks with "old" data would "cool off," sitting almost completely idle. This is incredibly inefficient, as a large portion of your hardware would be doing nothing.
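A small Python sketch shows how lopsided this gets. The numbers are invented for illustration: 1,000 objects sorted newest-first, the newest 10% receive 100 requests each and the rest receive 1. We compare grouping objects by age against interleaving them across disks.

```python
from typing import Callable


def per_disk_load(request_rates: list[float], disk_count: int,
                  placement: Callable[[int], int]) -> list[float]:
    """Sum request rates per disk; placement maps an object index to a disk index."""
    load = [0.0] * disk_count
    for i, rate in enumerate(request_rates):
        load[placement(i)] += rate
    return load


# Hypothetical fleet: objects sorted newest-first, the newest 10% are "hot".
object_count, disk_count = 1_000, 10
rates = [100.0 if i < object_count // 10 else 1.0 for i in range(object_count)]

per_disk = object_count // disk_count
naive = per_disk_load(rates, disk_count, lambda i: i // per_disk)    # grouped by age
spread = per_disk_load(rates, disk_count, lambda i: i % disk_count)  # interleaved

print("naive :", naive)
print("spread:", spread)
```

With age-based grouping, one disk absorbs 10,000 requests while the other nine see 100 each. Interleaved, every disk handles the same 1,090.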

Problem 2: Physical Hotspots and Hardware Risk

There's also a serious physical danger. At AWS's scale, millions of drives are organized into server racks. If you concentrated all the frequently accessed "hot" data onto the drives within a single rack, that entire rack would experience intense, non-stop activity.

This constant I/O generates an enormous amount of heat. The temperature inside the rack could skyrocket, creating a dangerous "hotspot" that risks equipment failure and, in a worst-case scenario, a server meltdown.

To solve this, S3 constantly rebalances data to ensure the "temperature" or read/write activity is evenly distributed across all its server racks. One opportunity for this rebalancing occurs when new hardware is added.

Seeding New Racks for Balance

When a new rack is added to the infrastructure, a basic approach would be to send all the new, "hot" data to it. However, this would just create a new hotspot and risk server meltdown.

Instead, S3 uses an intelligent "seeding" strategy:

~80% Cold Data: They transfer “old” data from across the servers to the new rack to fill 80% of its total capacity.
~20% Hot Data: The remaining space is reserved for the incoming stream of new, frequently accessed ("hot") data.
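As a back-of-the-envelope sketch, here's what that split looks like for a rack of a given size. The `seed_plan` helper and the 1 PB rack capacity are hypothetical; only the roughly 80/20 split comes from the talk.

```python
def seed_plan(rack_capacity_tb: float, cold_fraction: float = 0.8) -> dict[str, float]:
    """Split a new rack's capacity into cold data to migrate in and space kept for hot data."""
    if not 0 <= cold_fraction <= 1:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_tb = rack_capacity_tb * cold_fraction
    return {"cold_tb": cold_tb, "hot_reserved_tb": rack_capacity_tb - cold_tb}


plan = seed_plan(rack_capacity_tb=1_000)  # hypothetical 1 PB rack
print(plan)
```

For the assumed 1 PB rack, that means about 800 TB of old data is moved in up front and about 200 TB is left free for new writes.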

Whew, this ended up longer than I thought! Gonna hit pause for now, but I’ll pick things up in the next post. Stay tuned!

To infinity and beyond 🚀

Sources used in this post:

Deep Diving into Amazon S3 - Part 1