Inside 3FS: How DeepSeek’s Distributed File System Powers AI at 6.6 Terabytes per Second

Introduction: Engineering for the Future of AI

In the ever-accelerating world of artificial intelligence, DeepSeek has emerged as a trailblazer, not only for its powerful language models but also for its groundbreaking infrastructure. While much of the attention is focused on DeepSeek’s open-weight models like the 67B and 671B parameter LLMs, the real engineering marvel lies under the hood—its high-performance distributed file system, known as 3FS.

Designed specifically to support the training of trillion-parameter-scale language models, 3FS sustains an aggregate throughput of 6.6 terabytes per second, making it one of the highest-performing AI training file systems in the world.

But how does it work? What makes it so fast, fault-tolerant, and scalable? In this article, we’ll break down the inner workings of 3FS, its architecture, use of Apache Zookeeper and Apple’s FoundationDB, and why it represents a new frontier in distributed AI systems.

Table of Contents

  1. What is 3FS? A Primer

  2. The Problem of Scaling File I/O in AI Training

  3. DeepSeek’s AI Model Needs: Data-Hungry Giants

  4. 3FS at a Glance: Core Features

  5. Write-to-All, Read-by-Any: Explained

  6. The Role of Apache Zookeeper in 3FS

  7. Why FoundationDB Was Chosen for Metadata

  8. Chunking and Chained Replication Strategy

  9. Fault Tolerance and Node Failure Recovery

  10. Real-World Performance Metrics: 6.6 TB/s

  11. Bandwidth Optimization and Multi-Node Read

  12. Multi-Tenant Isolation and Access Control

  13. Write-Ahead Logging and Consistency Guarantees

  14. Data Placement and Hot File Detection

  15. 3FS vs Google’s Colossus vs Meta’s f4

  16. Integration with DeepSeek's Model Training Pipelines

  17. How 3FS Supports MoE Architectures

  18. Monitoring and Observability at Scale

  19. Limitations and Ongoing Challenges

  20. Future Directions: Could 3FS Be Open-Sourced?

1. What is 3FS? A Primer

3FS (Fire-Flyer File System) is a custom-designed distributed file system built by DeepSeek’s AI infrastructure team to support:

  • High-bandwidth AI training

  • Billions of small and large files

  • Petabyte to exabyte-scale storage

  • Near-instant access from thousands of training nodes

Unlike general-purpose file systems (NFS, HDFS), 3FS is optimized specifically for AI training workloads where throughput, parallelism, and redundancy are critical.

2. The Problem of Scaling File I/O in AI Training

Training models like DeepSeek-67B or 671B requires:

  • Billions of documents

  • Hundreds of terabytes of tokenized datasets

  • Micro-batch sharding across GPUs

  • Low-latency random access

Traditional file systems choke under such load, becoming bottlenecks during:

  • Pre-tokenized data streaming

  • Checkpointing

  • Gradient offloading

  • Evaluation loop caching

3FS solves these issues at the file system level.

3. DeepSeek’s AI Model Needs: Data-Hungry Giants

The training pipeline includes:

  • Training corpora from books, websites, codebases, and dialogue

  • Continual data augmentation and filtering

  • Token streams distributed to 1024–4096 GPUs

  • Frequent save and resume from training checkpoints

Without a high-throughput, resilient I/O layer, the entire system stalls.

4. 3FS at a Glance: Core Features

  • 6.6 TB/sec aggregate throughput

  • Write-to-all, read-by-any topology

  • Chained replication for bandwidth efficiency

  • Zookeeper for distributed coordination

  • FoundationDB for metadata management

  • Adaptive chunk-level routing

  • Supports both POSIX-like and object-store APIs

5. Write-to-All, Read-by-Any: Explained

This policy ensures:

  • Every write is synchronously replicated to multiple nodes

  • Any read can be served from any available replica

  • Read concurrency is maximized during training

  • Readers are load-balanced across replicas with no performance penalty

It’s designed for failure tolerance and data integrity, critical during massive-scale training runs.
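
As a minimal sketch of how a client library might implement this policy (not DeepSeek's actual code), the example below assumes a hypothetical transport object that performs the real RPCs:

  import random

  class WriteAllReadAnyClient:
      """Illustrative client: writes go to every replica, reads hit any one."""

      def __init__(self, replicas, transport):
          # 'replicas' is a list of node addresses; 'transport' is a
          # hypothetical object exposing send(node, key, data) and
          # fetch(node, key) -- stand-ins for the real RPC layer.
          self.replicas = replicas
          self.transport = transport

      def write(self, key, data):
          # Write-to-all: the operation succeeds only once every replica has
          # acknowledged the data, so any replica can later serve reads.
          acks = [self.transport.send(node, key, data) for node in self.replicas]
          if not all(acks):
              raise IOError(f"write of {key!r} not acknowledged by all replicas")

      def read(self, key):
          # Read-by-any: pick any live replica (here, at random) so read load
          # spreads evenly across the cluster during training.
          node = random.choice(self.replicas)
          return self.transport.fetch(node, key)

The trade-off is explicit: writes pay for full replication up front so that reads can go anywhere.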

6. The Role of Apache Zookeeper in 3FS

Zookeeper is used to:

  • Maintain cluster state across thousands of nodes

  • Coordinate leader elections, write permissions, and node liveness

  • Synchronize write quorum acknowledgments

  • Manage watchers for node updates and health

This helps 3FS remain consistent and available even through unexpected reboots, node losses, or network partitions.
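
As a rough illustration of this coordination pattern, here is how a storage node could advertise its liveness and watch cluster membership using the open-source kazoo ZooKeeper client; the /3fs/nodes path and node ID are assumptions, not DeepSeek's real layout:

  from kazoo.client import KazooClient

  # Connect to the ZooKeeper ensemble (addresses are illustrative).
  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()

  NODE_ID = "storage-node-042"  # hypothetical node identifier

  # Register liveness with an ephemeral znode: it vanishes automatically if
  # this node dies or its session expires, which is how failures surface.
  zk.ensure_path("/3fs/nodes")
  zk.create(f"/3fs/nodes/{NODE_ID}", b"alive", ephemeral=True)

  # React whenever a node joins or leaves; a real system would trigger
  # re-replication and routing-map updates here.
  @zk.ChildrenWatch("/3fs/nodes")
  def on_membership_change(children):
      print("live storage nodes:", children)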

7. Why FoundationDB Was Chosen for Metadata

FoundationDB is a distributed, transactional key-value store open-sourced by Apple. In 3FS, it manages:

  • File and chunk metadata

  • Replication sets and node mapping

  • Access permissions

  • Versioned file updates

  • Global namespace indexing

Its ACID-compliant transactional model ensures that even in failures, no write is left in an uncertain state.
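
To make the metadata role concrete, here is a small hypothetical example of storing a chunk-to-replica mapping transactionally with FoundationDB's Python bindings; the key layout is invented for illustration and is not 3FS's actual schema:

  import fdb
  import fdb.tuple

  fdb.api_version(630)
  db = fdb.open()  # uses the default cluster file

  @fdb.transactional
  def set_chunk_replicas(tr, file_id, chunk_index, replicas):
      # Record which storage nodes hold a given chunk. Because the update is
      # an ACID transaction, a crash mid-write never leaves the mapping
      # half-applied.
      key = fdb.tuple.pack(("chunk_map", file_id, chunk_index))
      tr[key] = ",".join(replicas).encode()

  @fdb.transactional
  def get_chunk_replicas(tr, file_id, chunk_index):
      value = tr[fdb.tuple.pack(("chunk_map", file_id, chunk_index))]
      return bytes(value).decode().split(",") if value.present() else None

  # Usage (requires a reachable FoundationDB cluster):
  # set_chunk_replicas(db, "corpus/shard-0001.bin", 17, ["node-12", "node-45", "node-78"])
  # print(get_chunk_replicas(db, "corpus/shard-0001.bin", 17))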

8. Chunking and Chained Replication Strategy

Files are split into chunks, typically 8MB to 64MB in size, and:

  • Each chunk is replicated across three or more nodes

  • The chained model passes each chunk from one replica to the next, so no single node has to fan the data out to every copy

  • The final node in the chain acknowledges back to the origin, confirming the write with minimal coordination overhead

This replication method provides:

  • High bandwidth efficiency

  • Low memory overhead

  • Streamlined recovery and redundancy
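
The simplified sketch below puts the two ideas together: splitting a file into fixed-size chunks and pushing each chunk down a replica chain so the writer transmits it only once (the send function is a stand-in for the real network hop):

  CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB, within the 8-64 MB range above

  def split_into_chunks(path, chunk_size=CHUNK_SIZE):
      # Yield fixed-size chunks of a file; the final chunk may be shorter.
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              yield chunk

  def chained_write(chunk, chain, send):
      # 'chain' is an ordered list of replica nodes; 'send' is a hypothetical
      # function send(src, dst, chunk) -> bool modelling one network hop.
      # The writer transmits the chunk only to the head of the chain; each
      # replica forwards it onward, so no node fans out to every copy.
      nodes = ["origin"] + list(chain)
      for src, dst in zip(nodes, nodes[1:]):
          if not send(src, dst, chunk):
              raise IOError(f"chain broken between {src} and {dst}")
      # In the real protocol the ACK flows back from the tail to the origin.
      return True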

9. Fault Tolerance and Node Failure Recovery

If a node fails:

  • Zookeeper detects and alerts the cluster

  • Chained replicas reconstruct from sibling nodes

  • FoundationDB updates routing maps

  • Reads and writes continue with only brief interruption

Even under rack-level failures, 3FS continues serving I/O—no data is lost.
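
A toy version of the recovery step might look like the following, assuming membership changes arrive from Zookeeper and the chunk-to-replica map is read from FoundationDB as in the earlier sketches:

  import random

  def repair_after_failure(dead_node, live_nodes, chunk_map):
      # 'chunk_map' maps chunk_id -> list of replica nodes. For every chunk
      # that lost a copy on the dead node, pick a surviving source replica
      # and a new target node so the replication factor is restored.
      repairs = []
      for chunk_id, replicas in chunk_map.items():
          if dead_node not in replicas:
              continue
          survivors = [n for n in replicas if n != dead_node and n in live_nodes]
          if not survivors:
              raise RuntimeError(f"chunk {chunk_id} has lost all replicas")
          candidates = [n for n in live_nodes if n not in replicas]
          target = random.choice(candidates)
          repairs.append((chunk_id, survivors[0], target))  # (what, from, to)
          chunk_map[chunk_id] = survivors + [target]        # updated routing map
      return repairs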

10. Real-World Performance Metrics: 6.6 TB/s

Tests at DeepSeek’s datacenter show:

  • Token streaming at ~300 MB/sec per GPU node

  • Gradient syncing via 3FS write-to-all buffers

  • Checkpoint load/recovery in under 3 minutes for full 67B models

This aggregate throughput beats systems like:

  • Google Colossus (GFS successor)

  • Meta’s f4

  • Amazon’s FSx for Lustre

11. Bandwidth Optimization and Multi-Node Read

By using data locality awareness, 3FS can:

  • Detect which chunks are hot

  • Replicate frequently-accessed data to edge nodes

  • Use least-cost-path read routing (e.g., prefer SSD local replicas over network)

This reduces network congestion and improves training efficiency by 10–20%.
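
One way the least-cost-path idea could look in code is shown below; the cost weights and replica metadata fields are illustrative assumptions:

  def read_cost(replica, local_host, local_rack):
      # Cheaper is better: a local SSD copy beats a same-rack copy, which in
      # turn beats cross-rack traffic. The weights are arbitrary.
      if replica["host"] == local_host:
          return 1
      if replica["rack"] == local_rack:
          return 10
      return 100

  def choose_replica(replicas, local_host, local_rack):
      # Pick the replica with the lowest access cost for this reader.
      return min(replicas, key=lambda r: read_cost(r, local_host, local_rack))

  replicas = [
      {"host": "node-12", "rack": "r1"},
      {"host": "node-45", "rack": "r2"},
      {"host": "node-78", "rack": "r1"},
  ]
  print(choose_replica(replicas, local_host="node-78", local_rack="r1"))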

12. Multi-Tenant Isolation and Access Control

3FS supports:

  • Training team-level namespaces

  • Read-write separation by GPU pool

  • Audit logs via FoundationDB

  • Access gating via tokenized credentials

This makes 3FS well suited to R&D environments running many model training jobs concurrently.

13. Write-Ahead Logging and Consistency Guarantees

All write operations are:

  • Logged in FoundationDB’s commit log

  • Validated by Zookeeper’s cluster quorum

  • Written to temporary chunk buffers before replication

This ensures strong consistency while minimizing write amplification.
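
A compressed sketch of this write path, with each stage reduced to a stub so the ordering (log first, buffer, replicate, then commit) is explicit; none of these names come from 3FS itself:

  def write_with_wal(key, data, wal, buffer_pool, replicate, commit):
      # 1. Record the intent durably before touching any chunk data, so a
      #    crash can always be resolved by replaying or discarding the entry.
      entry_id = wal.append({"key": key, "size": len(data)})

      # 2. Stage the bytes in a temporary chunk buffer.
      buffer_pool.put(key, data)

      # 3. Replicate the buffered chunk to its replica set (write-to-all).
      replicate(key, data)

      # 4. Only after replication succeeds is the entry marked committed, so
      #    readers never observe a partially replicated chunk.
      commit(entry_id)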

14. Data Placement and Hot File Detection

DeepSeek engineers use ML-powered detection systems to:

  • Track access frequency by chunk

  • Automatically re-replicate hot data

  • Evict or cold-store stale data

  • Improve chunk affinity for MoE layers, which often access specific slices repeatedly
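
The detection logic can be approximated with something as simple as a decayed access counter per chunk; the thresholds in this sketch are made up for illustration:

  class HotChunkTracker:
      """Track per-chunk access rates and flag chunks for re-replication."""

      def __init__(self, decay=0.9, hot_threshold=50.0, cold_threshold=0.5):
          self.decay = decay                  # how fast old accesses fade
          self.hot_threshold = hot_threshold  # score above which we add replicas
          self.cold_threshold = cold_threshold
          self.scores = {}                    # chunk_id -> decayed access score

      def record_access(self, chunk_id):
          self.scores[chunk_id] = self.scores.get(chunk_id, 0.0) + 1.0

      def tick(self):
          # Called periodically: decay all scores, then report which chunks
          # should gain replicas (hot) and which can move to cold storage.
          hot, cold = [], []
          for chunk_id in list(self.scores):
              self.scores[chunk_id] *= self.decay
              if self.scores[chunk_id] >= self.hot_threshold:
                  hot.append(chunk_id)
              elif self.scores[chunk_id] <= self.cold_threshold:
                  cold.append(chunk_id)
          return hot, cold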

15. 3FS vs Google’s Colossus vs Meta’s f4

Feature                | 3FS          | Colossus   | Meta f4
Peak Throughput        | 6.6 TB/s     | ~4.5 TB/s  | ~5 TB/s
Chunking / Replication | Chained      | Random     | Erasure coding
Metadata Store         | FoundationDB | Spanner    | RocksDB
Language Model Support | Native       | External   | Mixed
Open Source            | ❌ (not yet)  | ❌          | ❌

3FS outperforms these systems in training-specific scenarios, especially for multimodal workloads.

16. Integration with DeepSeek's Model Training Pipelines

3FS is integrated directly with:

  • PyTorch DDP loaders

  • DeepSpeed / Megatron cache systems

  • MoE expert data shards

  • Custom token buffering mechanisms

This tight integration makes 3FS effectively invisible to model engineers while remaining critical to training throughput and convergence speed.
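
To give a sense of what "invisible to the model engineers" means in practice, a token-shard dataset can read from a 3FS mount point like any local path; the mount location and file layout below are assumptions:

  import glob

  import numpy as np
  import torch
  from torch.utils.data import DataLoader, IterableDataset

  class TokenShardDataset(IterableDataset):
      """Stream fixed-length token sequences from pre-tokenized shards."""

      def __init__(self, shard_glob, seq_len=4096):
          self.shard_paths = sorted(glob.glob(shard_glob))
          self.seq_len = seq_len

      def __iter__(self):
          for path in self.shard_paths:
              # 3FS is mounted like an ordinary file system, so memory-mapping
              # works; replica choice and striping happen below this layer.
              tokens = np.memmap(path, dtype=np.uint16, mode="r")
              for start in range(0, len(tokens) - self.seq_len, self.seq_len):
                  yield torch.from_numpy(
                      tokens[start:start + self.seq_len].astype(np.int64))

  # "/mnt/3fs/corpus" is a hypothetical mount point used for illustration.
  loader = DataLoader(TokenShardDataset("/mnt/3fs/corpus/*.bin"), batch_size=8)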

17. How 3FS Supports MoE Architectures

MoE (Mixture of Experts) training requires:

  • Selective data access to certain "experts"

  • Rapid shard retrieval

  • Storage of dynamic routing data

3FS enables:

  • Sub-second load of expert model chunks

  • Efficient routing table reads

  • Balanced data flow across training nodes

Without this, DeepSeek 671B’s MoE setup wouldn’t be viable.
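
As an illustration of selective expert access, a training step only needs the parameter shards of the experts the router actually activated; the path scheme here is hypothetical:

  import torch

  def load_active_experts(layer, active_expert_ids, root="/mnt/3fs/moe-experts"):
      # Fetch only the expert shards the router selected for this batch,
      # instead of loading every expert in the layer. Fast random reads keep
      # these loads sub-second even for large expert weights.
      experts = {}
      for expert_id in active_expert_ids:
          shard_path = f"{root}/layer{layer:02d}/expert{expert_id:03d}.pt"
          experts[expert_id] = torch.load(shard_path, map_location="cpu")
      return experts

  # Example: the router activated experts 3 and 17 for layer 5 in this step.
  # weights = load_active_experts(layer=5, active_expert_ids=[3, 17])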

18. Monitoring and Observability at Scale

Monitoring tools include:

  • Prometheus-based node metrics

  • Grafana dashboards for throughput + latency

  • Chunk health logs in FoundationDB

  • Zookeeper cluster heartbeat graphs

  • Alerts for hot-spot detection and node failure

Admins can predict slowdowns before they happen.

19. Limitations and Ongoing Challenges

No system is perfect. Challenges include:

  • Storage cost for multi-replica data

  • Handling non-AI workloads with irregular file access

  • Scaling FoundationDB as metadata grows

  • Debugging live issues in real-time inference environments

But DeepSeek’s engineers are actively improving these.

20. Future Directions: Could 3FS Be Open-Sourced?

Rumors suggest:

  • A stripped-down version of 3FS may be open-sourced in late 2025

  • This would benefit academia and startup labs

  • Could compete with systems such as PetrelFS or Ray’s object store

  • Might offer FoundationDB-backed POSIX API wrapper

The future is exciting — and very fast.

Conclusion: A New Paradigm for AI File Systems

3FS isn’t just another storage backend — it’s a strategic enabler for DeepSeek’s AI ambitions. By delivering 6.6 terabytes per second of resilient, intelligent data access, it redefines what’s possible in large-scale model training.

As foundation models get larger and training data grows more diverse, we need infrastructure that scales with intelligence. 3FS is one such answer — and perhaps, a model for the next generation of AI infrastructure.