Inside 3FS: How DeepSeek’s Distributed File System Powers AI at 6.6 Terabytes per Second

Introduction: Engineering for the Future of AI

In the ever-accelerating world of artificial intelligence, DeepSeek has emerged as a trailblazer, not only for its powerful language models but also for its groundbreaking infrastructure. While much of the attention is focused on DeepSeek’s open-weight models like the 67B and 671B parameter LLMs, the real engineering marvel lies under the hood—its high-performance distributed file system, known as 3FS.

Designed specifically to support the training of trillion-parameter-scale language models, 3FS sustains an aggregate throughput of 6.6 terabytes per second, making it one of the highest-performing AI training file systems in the world.

But how does it work? What makes it so fast, fault-tolerant, and scalable? In this article, we’ll break down the inner workings of 3FS, its architecture, use of Apache Zookeeper and Apple’s FoundationDB, and why it represents a new frontier in distributed AI systems.

Table of Contents

  1. What is 3FS? A Primer

  2. The Problem of Scaling File I/O in AI Training

  3. DeepSeek’s AI Model Needs: Data-Hungry Giants

  4. 3FS at a Glance: Core Features

  5. Write-to-All, Read-by-Any: Explained

  6. The Role of Apache Zookeeper in 3FS

  7. Why FoundationDB Was Chosen for Metadata

  8. Chunking and Chained Replication Strategy

  9. Fault Tolerance and Node Failure Recovery

  10. Real-World Performance Metrics: 6.6 TB/s

  11. Bandwidth Optimization and Multi-Node Read

  12. Multi-Tenant Isolation and Access Control

  13. Write-Ahead Logging and Consistency Guarantees

  14. Data Placement and Hot File Detection

  15. 3FS vs Google’s Colossus vs Meta’s f4

  16. Integration with DeepSeek's Model Training Pipelines

  17. How 3FS Supports MoE Architectures

  18. Monitoring and Observability at Scale

  19. Limitations and Ongoing Challenges

  20. Future Directions: Could 3FS Be Open-Sourced?

1. What is 3FS? A Primer

3FS (Fire-Flyer File System) is a custom-designed distributed file system built by DeepSeek’s AI infrastructure team to support:

  • High-bandwidth AI training

  • Billions of small and large files

  • Petabyte to exabyte-scale storage

  • Near-instant access from thousands of training nodes

Unlike general-purpose file systems (NFS, HDFS), 3FS is optimized specifically for AI training workloads where throughput, parallelism, and redundancy are critical.

2. The Problem of Scaling File I/O in AI Training

Training models like DeepSeek-67B or 671B requires:

  • Billions of documents

  • Hundreds of terabytes of tokenized datasets

  • Micro-batch sharding across GPUs

  • Low-latency random access

Traditional file systems choke under such load, becoming bottlenecks during:

  • Pre-tokenized data streaming

  • Checkpointing

  • Gradient offloading

  • Evaluation loop caching

3FS solves these issues at the file system level.

3. DeepSeek’s AI Model Needs: Data-Hungry Giants

The training pipeline includes:

  • Training corpora from books, websites, codebases, and dialogue

  • Continual data augmentation and filtering

  • Token streams distributed to 1024–4096 GPUs

  • Frequent save and resume from training checkpoints

Without a high-throughput, resilient I/O layer, the entire system stalls.

4. 3FS at a Glance: Core Features

  • 6.6 TB/sec aggregate throughput

  • Write-to-all, read-by-any topology

  • Chained replication for bandwidth efficiency

  • Zookeeper for distributed coordination

  • FoundationDB for metadata management

  • Adaptive chunk-level routing

  • Supports both POSIX-like and object-store APIs

5. Write-to-All, Read-by-Any: Explained

This policy ensures:

  • Every write is synchronously replicated to multiple nodes

  • Any read can be served from any available replica

  • Read concurrency is maximized during training

  • Readers are load-balanced across replicas with no performance penalty

It’s designed for failure tolerance and data integrity, critical during massive-scale training runs.
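
As a minimal sketch of how a client library might implement this policy (not DeepSeek's actual code), the example below assumes a hypothetical transport object that performs the real RPCs:

  import random

  class WriteAllReadAnyClient:
      """Illustrative client: writes go to every replica, reads hit any one."""

      def __init__(self, replicas, transport):
          # 'replicas' is a list of node addresses; 'transport' is a
          # hypothetical object exposing send(node, key, data) and
          # fetch(node, key) -- stand-ins for the real RPC layer.
          self.replicas = replicas
          self.transport = transport

      def write(self, key, data):
          # Write-to-all: the operation succeeds only once every replica has
          # acknowledged the data, so any replica can later serve reads.
          acks = [self.transport.send(node, key, data) for node in self.replicas]
          if not all(acks):
              raise IOError(f"write of {key!r} not acknowledged by all replicas")

      def read(self, key):
          # Read-by-any: pick any live replica (here, at random) so read load
          # spreads evenly across the cluster during training.
          node = random.choice(self.replicas)
          return self.transport.fetch(node, key)

The trade-off is explicit: writes pay for full replication up front so that reads can go anywhere.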

6. The Role of Apache Zookeeper in 3FS

Zookeeper is used to:

  • Maintain cluster state across thousands of nodes

  • Coordinate leader elections, write permissions, and node liveness

  • Synchronize write quorum acknowledgments

  • Manage watchers for node updates and health

This helps 3FS remain consistent and available even through unexpected reboots, node losses, or network partitions.
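
As a rough illustration of this coordination pattern, here is how a storage node could advertise its liveness and watch cluster membership using the open-source kazoo ZooKeeper client; the /3fs/nodes path and node ID are assumptions, not DeepSeek's real layout:

  from kazoo.client import KazooClient

  # Connect to the ZooKeeper ensemble (addresses are illustrative).
  zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
  zk.start()

  NODE_ID = "storage-node-042"  # hypothetical node identifier

  # Register liveness with an ephemeral znode: it vanishes automatically if
  # this node dies or its session expires, which is how failures surface.
  zk.ensure_path("/3fs/nodes")
  zk.create(f"/3fs/nodes/{NODE_ID}", b"alive", ephemeral=True)

  # React whenever a node joins or leaves; a real system would trigger
  # re-replication and routing-map updates here.
  @zk.ChildrenWatch("/3fs/nodes")
  def on_membership_change(children):
      print("live storage nodes:", children)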

7. Why FoundationDB Was Chosen for Metadata

FoundationDB is a distributed, transactional key-value store open-sourced by Apple. In 3FS, it manages:

  • File and chunk metadata

  • Replication sets and node mapping

  • Access permissions

  • Versioned file updates

  • Global namespace indexing

Its ACID-compliant transactional model ensures that even in failures, no write is left in an uncertain state.
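
To make the metadata role concrete, here is a small hypothetical example of storing a chunk-to-replica mapping transactionally with FoundationDB's Python bindings; the key layout is invented for illustration and is not 3FS's actual schema:

  import fdb
  import fdb.tuple

  fdb.api_version(630)
  db = fdb.open()  # uses the default cluster file

  @fdb.transactional
  def set_chunk_replicas(tr, file_id, chunk_index, replicas):
      # Record which storage nodes hold a given chunk. Because the update is
      # an ACID transaction, a crash mid-write never leaves the mapping
      # half-applied.
      key = fdb.tuple.pack(("chunk_map", file_id, chunk_index))
      tr[key] = ",".join(replicas).encode()

  @fdb.transactional
  def get_chunk_replicas(tr, file_id, chunk_index):
      value = tr[fdb.tuple.pack(("chunk_map", file_id, chunk_index))]
      return bytes(value).decode().split(",") if value.present() else None

  # Usage (requires a reachable FoundationDB cluster):
  # set_chunk_replicas(db, "corpus/shard-0001.bin", 17, ["node-12", "node-45", "node-78"])
  # print(get_chunk_replicas(db, "corpus/shard-0001.bin", 17))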

8. Chunking and Chained Replication Strategy

Files are split into chunks, typically 8MB to 64MB in size, and:

  • Each chunk is replicated across three or more nodes

  • The chained model passes each chunk from one replica to the next, so no single node has to fan the data out to every copy

  • The final node in the chain acknowledges back to the origin, confirming the write with minimal coordination overhead

This replication method provides:

  • High bandwidth efficiency

  • Low memory overhead

  • Streamlined recovery and redundancy
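
The simplified sketch below puts the two ideas together: splitting a file into fixed-size chunks and pushing each chunk down a replica chain so the writer transmits it only once (the send function is a stand-in for the real network hop):

  CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB, within the 8-64 MB range above

  def split_into_chunks(path, chunk_size=CHUNK_SIZE):
      # Yield fixed-size chunks of a file; the final chunk may be shorter.
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              yield chunk

  def chained_write(chunk, chain, send):
      # 'chain' is an ordered list of replica nodes; 'send' is a hypothetical
      # function send(src, dst, chunk) -> bool modelling one network hop.
      # The writer transmits the chunk only to the head of the chain; each
      # replica forwards it onward, so no node fans out to every copy.
      nodes = ["origin"] + list(chain)
      for src, dst in zip(nodes, nodes[1:]):
          if not send(src, dst, chunk):
              raise IOError(f"chain broken between {src} and {dst}")
      # In the real protocol the ACK flows back from the tail to the origin.
      return True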

9. Fault Tolerance and Node Failure Recovery

If a node fails:

  • Zookeeper detects and alerts the cluster

  • Chained replicas reconstruct from sibling nodes

  • FoundationDB updates routing maps

  • Reads and writes continue with only brief interruption

Even under rack-level failures, 3FS continues serving I/O—no data is lost.
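
A toy version of the recovery step might look like the following, assuming membership changes arrive from Zookeeper and the chunk-to-replica map is read from FoundationDB as in the earlier sketches:

  import random

  def repair_after_failure(dead_node, live_nodes, chunk_map):
      # 'chunk_map' maps chunk_id -> list of replica nodes. For every chunk
      # that lost a copy on the dead node, pick a surviving source replica
      # and a new target node so the replication factor is restored.
      repairs = []
      for chunk_id, replicas in chunk_map.items():
          if dead_node not in replicas:
              continue
          survivors = [n for n in replicas if n != dead_node and n in live_nodes]
          if not survivors:
              raise RuntimeError(f"chunk {chunk_id} has lost all replicas")
          candidates = [n for n in live_nodes if n not in replicas]
          target = random.choice(candidates)
          repairs.append((chunk_id, survivors[0], target))  # (what, from, to)
          chunk_map[chunk_id] = survivors + [target]        # updated routing map
      return repairs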

10. Real-World Performance Metrics: 6.6 TB/s

Tests at DeepSeek’s datacenter show:

  • Token streaming at ~300 MB/sec per GPU node

  • Gradient syncing via 3FS write-to-all buffers

  • Checkpoint load/recovery in under 3 minutes for full 67B models

This aggregate throughput beats systems like:

  • Google Colossus (GFS successor)

  • Meta’s f4

  • Amazon’s FSx for Lustre

11. Bandwidth Optimization and Multi-Node Read

By using data locality awareness, 3FS can:

  • Detect which chunks are hot

  • Replicate frequently-accessed data to edge nodes

  • Use least-cost-path read routing (e.g., prefer SSD local replicas over network)

This reduces network congestion and improves training efficiency by 10–20%.
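
One way the least-cost-path idea could look in code is shown below; the cost weights and replica metadata fields are illustrative assumptions:

  def read_cost(replica, local_host, local_rack):
      # Cheaper is better: a local SSD copy beats a same-rack copy, which in
      # turn beats cross-rack traffic. The weights are arbitrary.
      if replica["host"] == local_host:
          return 1
      if replica["rack"] == local_rack:
          return 10
      return 100

  def choose_replica(replicas, local_host, local_rack):
      # Pick the replica with the lowest access cost for this reader.
      return min(replicas, key=lambda r: read_cost(r, local_host, local_rack))

  replicas = [
      {"host": "node-12", "rack": "r1"},
      {"host": "node-45", "rack": "r2"},
      {"host": "node-78", "rack": "r1"},
  ]
  print(choose_replica(replicas, local_host="node-78", local_rack="r1"))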

12. Multi-Tenant Isolation and Access Control

3FS supports:

  • Training team-level namespaces

  • Read-write separation by GPU pool

  • Audit logs via FoundationDB

  • Access gating via tokenized credentials

This makes 3FS well suited to R&D environments running many model training jobs concurrently.

13. Write-Ahead Logging and Consistency Guarantees

All write operations are:

  • Logged in FoundationDB’s commit log

  • Validated by Zookeeper’s cluster quorum

  • Written to temporary chunk buffers before replication

This ensures strong consistency while minimizing write amplification.
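
A compressed sketch of this write path, with each stage reduced to a stub so the ordering (log first, buffer, replicate, then commit) is explicit; none of these names come from 3FS itself:

  def write_with_wal(key, data, wal, buffer_pool, replicate, commit):
      # 1. Record the intent durably before touching any chunk data, so a
      #    crash can always be resolved by replaying or discarding the entry.
      entry_id = wal.append({"key": key, "size": len(data)})

      # 2. Stage the bytes in a temporary chunk buffer.
      buffer_pool.put(key, data)

      # 3. Replicate the buffered chunk to its replica set (write-to-all).
      replicate(key, data)

      # 4. Only after replication succeeds is the entry marked committed, so
      #    readers never observe a partially replicated chunk.
      commit(entry_id)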

14. Data Placement and Hot File Detection

DeepSeek engineers use ML-powered detection systems to:

  • Track access frequency by chunk

  • Automatically re-replicate hot data

  • Evict or cold-store stale data

  • Improve chunk affinity for MoE layers, which often access specific slices repeatedly
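
The detection logic can be approximated with something as simple as a decayed access counter per chunk; the thresholds in this sketch are made up for illustration:

  class HotChunkTracker:
      """Track per-chunk access rates and flag chunks for re-replication."""

      def __init__(self, decay=0.9, hot_threshold=50.0, cold_threshold=0.5):
          self.decay = decay                  # how fast old accesses fade
          self.hot_threshold = hot_threshold  # score above which we add replicas
          self.cold_threshold = cold_threshold
          self.scores = {}                    # chunk_id -> decayed access score

      def record_access(self, chunk_id):
          self.scores[chunk_id] = self.scores.get(chunk_id, 0.0) + 1.0

      def tick(self):
          # Called periodically: decay all scores, then report which chunks
          # should gain replicas (hot) and which can move to cold storage.
          hot, cold = [], []
          for chunk_id in list(self.scores):
              self.scores[chunk_id] *= self.decay
              if self.scores[chunk_id] >= self.hot_threshold:
                  hot.append(chunk_id)
              elif self.scores[chunk_id] <= self.cold_threshold:
                  cold.append(chunk_id)
          return hot, cold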

15. 3FS vs Google’s Colossus vs Meta’s f4

Feature                | 3FS          | Colossus   | Meta f4
Peak Throughput        | 6.6 TB/s     | ~4.5 TB/s  | ~5 TB/s
Chunking / Replication | Chained      | Random     | Erasure coding
Metadata Store         | FoundationDB | Spanner    | RocksDB
Language Model Support | Native       | External   | Mixed
Open Source            | ❌ (not yet)  | ❌          | ❌

3FS outperforms these systems in training-specific scenarios, especially for multimodal workloads.

16. Integration with DeepSeek's Model Training Pipelines

3FS is integrated directly with:

  • PyTorch DDP loaders

  • DeepSpeed / Megatron cache systems

  • MoE expert data shards

  • Custom token buffering mechanisms

This tight integration makes 3FS effectively invisible to model engineers while remaining critical to training throughput and convergence speed.
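
To give a sense of what "invisible to the model engineers" means in practice, a token-shard dataset can read from a 3FS mount point like any local path; the mount location and file layout below are assumptions:

  import glob

  import numpy as np
  import torch
  from torch.utils.data import DataLoader, IterableDataset

  class TokenShardDataset(IterableDataset):
      """Stream fixed-length token sequences from pre-tokenized shards."""

      def __init__(self, shard_glob, seq_len=4096):
          self.shard_paths = sorted(glob.glob(shard_glob))
          self.seq_len = seq_len

      def __iter__(self):
          for path in self.shard_paths:
              # 3FS is mounted like an ordinary file system, so memory-mapping
              # works; replica choice and striping happen below this layer.
              tokens = np.memmap(path, dtype=np.uint16, mode="r")
              for start in range(0, len(tokens) - self.seq_len, self.seq_len):
                  yield torch.from_numpy(
                      tokens[start:start + self.seq_len].astype(np.int64))

  # "/mnt/3fs/corpus" is a hypothetical mount point used for illustration.
  loader = DataLoader(TokenShardDataset("/mnt/3fs/corpus/*.bin"), batch_size=8)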

17. How 3FS Supports MoE Architectures

MoE (Mixture of Experts) training requires:

  • Selective data access to certain "experts"

  • Rapid shard retrieval

  • Storage of dynamic routing data

3FS enables:

  • Sub-second load of expert model chunks

  • Efficient routing table reads

  • Balanced data flow across training nodes

Without this, DeepSeek 671B’s MoE setup wouldn’t be viable.
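
As an illustration of selective expert access, a training step only needs the parameter shards of the experts the router actually activated; the path scheme here is hypothetical:

  import torch

  def load_active_experts(layer, active_expert_ids, root="/mnt/3fs/moe-experts"):
      # Fetch only the expert shards the router selected for this batch,
      # instead of loading every expert in the layer. Fast random reads keep
      # these loads sub-second even for large expert weights.
      experts = {}
      for expert_id in active_expert_ids:
          shard_path = f"{root}/layer{layer:02d}/expert{expert_id:03d}.pt"
          experts[expert_id] = torch.load(shard_path, map_location="cpu")
      return experts

  # Example: the router activated experts 3 and 17 for layer 5 in this step.
  # weights = load_active_experts(layer=5, active_expert_ids=[3, 17])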

18. Monitoring and Observability at Scale

Monitoring tools include:

  • Prometheus-based node metrics

  • Grafana dashboards for throughput + latency

  • Chunk health logs in FoundationDB

  • Zookeeper cluster heartbeat graphs

  • Alerts for hot-spot detection and node failure

Admins can predict slowdowns before they happen.

19. Limitations and Ongoing Challenges

No system is perfect. Challenges include:

  • Storage cost for multi-replica data

  • Handling non-AI workloads with irregular file access

  • Scaling FoundationDB as metadata grows

  • Debugging live issues in real-time inference environments

But DeepSeek’s engineers are actively improving these.

20. Future Directions: Could 3FS Be Open-Sourced?

Rumors suggest:

  • A stripped-down version of 3FS may be open-sourced in late 2025

  • This would benefit academia and startup labs

  • Could compete with systems such as PetrelFS or Ray’s object store

  • Might offer FoundationDB-backed POSIX API wrapper

The future is exciting — and very fast.

Conclusion: A New Paradigm for AI File Systems

3FS isn’t just another storage backend — it’s a strategic enabler for DeepSeek’s AI ambitions. By delivering 6.6 terabytes per second of resilient, intelligent data access, it redefines what’s possible in large-scale model training.

As foundation models get larger and training data grows more diverse, we need infrastructure that scales with intelligence. 3FS is one such answer — and perhaps, a model for the next generation of AI infrastructure.