Inside 3FS: How DeepSeek’s Distributed File System Powers AI at 6.6 Terabytes per Second
Introduction: Engineering for the Future of AI
In the ever-accelerating world of artificial intelligence, DeepSeek has emerged as a trailblazer, not only for its powerful language models but also for its groundbreaking infrastructure. While much of the attention is focused on DeepSeek’s open-weight models like the 67B and 671B parameter LLMs, the real engineering marvel lies under the hood—its high-performance distributed file system, known as 3FS.
Designed to support the training of trillion-parameter-scale language models, 3FS can sustain an astonishing 6.6 terabytes of data per second, making it one of the most performant AI training file systems in the world.
But how does it work? What makes it so fast, fault-tolerant, and scalable? In this article, we’ll break down the inner workings of 3FS, its architecture, use of Apache Zookeeper and Apple’s FoundationDB, and why it represents a new frontier in distributed AI systems.
Table of Contents
1. What is 3FS? A Primer
2. The Problem of Scaling File I/O in AI Training
3. DeepSeek’s AI Model Needs: Data-Hungry Giants
4. 3FS at a Glance: Core Features
5. Write-to-All, Read-by-Any: Explained
6. The Role of Apache Zookeeper in 3FS
7. Why FoundationDB Was Chosen for Metadata
8. Chunking and Chained Replication Strategy
9. Fault Tolerance and Node Failure Recovery
10. Real-World Performance Metrics: 6.6 TB/s
11. Bandwidth Optimization and Multi-Node Read
12. Multi-Tenant Isolation and Access Control
13. Write-Ahead Logging and Consistency Guarantees
14. Data Placement and Hot File Detection
15. 3FS vs Google’s Colossus vs Meta’s f4
16. Integration with DeepSeek's Model Training Pipelines
17. How 3FS Supports MoE Architectures
18. Monitoring and Observability at Scale
19. Limitations and Ongoing Challenges
20. Future Directions: Could 3FS Be Open-Sourced?
1. What is 3FS? A Primer
3FS (Fire-Flyer File System) is a custom-designed distributed file system built by DeepSeek’s AI infrastructure team to support:
- High-bandwidth AI training
- Billions of small and large files
- Petabyte to exabyte-scale storage
- Near-instant access from thousands of training nodes
Unlike general-purpose file systems (NFS, HDFS), 3FS is optimized specifically for AI training workloads where throughput, parallelism, and redundancy are critical.
2. The Problem of Scaling File I/O in AI Training
Training models like DeepSeek-67B or 671B requires:
- Billions of documents
- Hundreds of terabytes of tokenized datasets
- Micro-batch sharding across GPUs
- Low-latency random access
Traditional file systems choke under such load, becoming bottlenecks during:
- Pre-tokenized data streaming
- Checkpointing
- Gradient offloading
- Evaluation loop caching
3FS solves these issues at the file system level.
3. DeepSeek’s AI Model Needs: Data-Hungry Giants
The training pipeline includes:
- Training corpora from books, websites, codebases, and dialogue
- Continual data augmentation and filtering
- Token streams distributed to 1024–4096 GPUs
- Frequent save and resume from training checkpoints
Without a high-throughput, resilient I/O layer, the entire system stalls.
4. 3FS at a Glance: Core Features
- 6.6 TB/sec aggregate throughput
- Write-to-all, read-by-any topology
- Chained replication for bandwidth efficiency
- Zookeeper for distributed coordination
- FoundationDB for metadata management
- Adaptive chunk-level routing
- Support for both POSIX-like and object-store APIs
5. Write-to-All, Read-by-Any: Explained
This policy ensures:
- Every write is synchronously replicated to multiple nodes
- Any read can be served by any available replica
- Read concurrency is maximized during training
- Readers can be load-balanced without performance hits
It’s designed for failure tolerance and data integrity, critical during massive-scale training runs.
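To make the topology concrete, here is a minimal, self-contained Python sketch of the write-to-all, read-by-any idea. It is an illustration of the policy as described above, not DeepSeek's implementation; the in-memory `Replica` class and node names are invented for the example.

```python
import random

class Replica:
    """A toy in-memory replica standing in for one storage node."""
    def __init__(self, name):
        self.name = name
        self.store = {}          # chunk_id -> bytes
        self.alive = True

class WriteToAllReadAny:
    """Write-to-all, read-by-any: every write lands on all replicas,
    and any live replica can serve a read."""
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, chunk_id, data):
        # Synchronously replicate to every replica; abort if any is down,
        # so acknowledged data is always fully replicated.
        for r in self.replicas:
            if not r.alive:
                raise IOError(f"replica {r.name} unavailable, write aborted")
            r.store[chunk_id] = data

    def read(self, chunk_id):
        # Any live replica holding the chunk may answer, which spreads
        # read load across the cluster during training.
        candidates = [r for r in self.replicas if r.alive and chunk_id in r.store]
        if not candidates:
            raise KeyError(chunk_id)
        return random.choice(candidates).store[chunk_id]

fs = WriteToAllReadAny([Replica("node-a"), Replica("node-b"), Replica("node-c")])
fs.write("chunk-0001", b"tokenized shard bytes")
fs.replicas[0].alive = False          # lose one node
print(fs.read("chunk-0001"))          # reads still succeed from a surviving replica
```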
6. The Role of Apache Zookeeper in 3FS
Zookeeper is used to:
- Maintain cluster state across thousands of nodes
- Coordinate leader elections, write permissions, and node liveness
- Synchronize write quorum acknowledgments
- Manage watchers for node updates and health
This helps 3FS stay consistent and available even through unexpected reboots, node losses, or network splits.
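For readers who want to see what this style of coordination looks like in practice, the sketch below uses the `kazoo` Python client for ZooKeeper to register a storage node as an ephemeral znode, watch cluster membership, and run a leader election. The paths and node names are invented for illustration and are not 3FS internals; it assumes a ZooKeeper ensemble reachable at `zk:2181`.

```python
from kazoo.client import KazooClient

# Assumed ZooKeeper ensemble address; not a real 3FS endpoint.
zk = KazooClient(hosts="zk:2181")
zk.start()

NODE_ID = "storage-node-17"
zk.ensure_path("/3fs-demo/nodes")

# Register this node as an ephemeral znode: if the process dies or its
# session expires, ZooKeeper removes the znode and the cluster sees the failure.
zk.create(f"/3fs-demo/nodes/{NODE_ID}", b"rack=7,ssd=8", ephemeral=True)

# Watch cluster membership; the callback re-fires on every change.
@zk.ChildrenWatch("/3fs-demo/nodes")
def on_membership_change(children):
    print(f"live storage nodes: {sorted(children)}")

# Leader election for a coordinator role (e.g. deciding replica placement).
election = zk.Election("/3fs-demo/coordinator", identifier=NODE_ID)

def act_as_coordinator():
    print(f"{NODE_ID} won the election and is now coordinating writes")

election.run(act_as_coordinator)   # blocks until elected, then runs the callback
```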
7. Why FoundationDB Was Chosen for Metadata
FoundationDB is the open-source, distributed, transactional key-value store maintained by Apple. In 3FS, it manages:
- File and chunk metadata
- Replication sets and node mapping
- Access permissions
- Versioned file updates
- Global namespace indexing
Its ACID transactional model ensures that even during failures, no metadata write is left in an uncertain state.
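As a rough illustration of why a transactional store matters here, the sketch below uses FoundationDB's official Python bindings to record a chunk and its replica placement in a single ACID transaction, so a crash can never leave one without the other. The key layout (`meta/...`) and names are invented for this example and are not the real 3FS schema.

```python
import json
import fdb

fdb.api_version(710)     # select the client API version
db = fdb.open()          # uses the default cluster file

@fdb.transactional
def register_chunk(tr, file_id, chunk_index, chunk_id, replicas):
    """Record a new chunk and its replica set atomically.

    Either both keys commit or neither does, so metadata can never
    point at a chunk whose placement is unknown.
    """
    tr[f"meta/files/{file_id}/chunks/{chunk_index}".encode()] = chunk_id.encode()
    tr[f"meta/chunks/{chunk_id}/replicas".encode()] = json.dumps(replicas).encode()

@fdb.transactional
def chunk_is_registered(tr, chunk_id):
    # .present() is False when the key does not exist.
    return tr[f"meta/chunks/{chunk_id}/replicas".encode()].present()

register_chunk(db, "corpus/shard-0042.bin", 0, "chunk-9f31",
               ["node-a", "node-b", "node-c"])
print(chunk_is_registered(db, "chunk-9f31"))   # True once the transaction commits
```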
8. Chunking and Chained Replication Strategy
Files are split into chunks, typically 8 MB to 64 MB in size, and:
- Each chunk is replicated across three or more nodes
- The chained model passes the chunk from one replica to the next, reducing I/O load on the writer
- The final node in the chain acknowledges back to the origin, minimizing write latency
This replication method provides:
- High bandwidth efficiency
- Low memory overhead
- Streamlined recovery and redundancy
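The following sketch illustrates the chained idea in plain Python: the client sends a chunk only to the head of the chain, each replica forwards it to its successor, and the tail acknowledges back. It is a simplified, single-process model of the technique described above, not 3FS code; the chunk size and node names are made up.

```python
CHUNK_SIZE = 16 * 1024 * 1024   # example 16 MB chunk size, within the 8-64 MB range

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

class ChainNode:
    """One replica in a replication chain."""
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}

    def write(self, chunk_id, data):
        # Store locally, then forward to the next node in the chain.
        # The client only ever uploads the chunk once (to the head),
        # which is where the bandwidth saving comes from.
        self.store[chunk_id] = data
        if self.successor is not None:
            return self.successor.write(chunk_id, data)
        return f"ACK from tail {self.name}"   # the tail acknowledges the whole chain

# Build a 3-node chain: head -> middle -> tail.
tail = ChainNode("node-c")
middle = ChainNode("node-b", successor=tail)
head = ChainNode("node-a", successor=middle)

for i, chunk in enumerate(split_into_chunks(b"x" * (40 * 1024 * 1024))):
    ack = head.write(f"file-7/chunk-{i}", chunk)
    print(ack)   # each chunk is fully replicated once the tail ACKs
```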
9. Fault Tolerance and Node Failure Recovery
If a node fails:
- Zookeeper detects the failure and alerts the cluster
- Chained replicas are reconstructed from sibling nodes
- FoundationDB updates the routing maps
- Reads and writes are interrupted only briefly, if at all
Even under rack-level failures, 3FS continues serving I/O—no data is lost.
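Here is a minimal sketch of that recovery flow, under the assumption (for illustration only) that placement is tracked as a simple chunk-to-replicas map: when a node is declared dead, every chunk it held is re-copied from a surviving sibling onto a healthy node until the target replica count is restored.

```python
REPLICATION_FACTOR = 3

def recover_failed_node(placement, failed, healthy_nodes, stores):
    """Re-replicate every chunk that lost a copy on `failed`.

    placement: dict chunk_id -> list of node names holding the chunk
    stores:    dict node name -> dict chunk_id -> bytes (toy storage)
    """
    for chunk_id, replicas in placement.items():
        if failed not in replicas:
            continue
        replicas.remove(failed)                       # drop the dead copy
        source = replicas[0]                          # any surviving sibling
        candidates = [n for n in healthy_nodes if n not in replicas]
        while len(replicas) < REPLICATION_FACTOR and candidates:
            target = candidates.pop(0)
            stores[target][chunk_id] = stores[source][chunk_id]   # copy the data
            replicas.append(target)                   # update the routing map

stores = {n: {} for n in ["a", "b", "c", "d"]}
for n in ["a", "b", "c"]:
    stores[n]["chunk-1"] = b"payload"
placement = {"chunk-1": ["a", "b", "c"]}

recover_failed_node(placement, failed="b", healthy_nodes=["a", "c", "d"], stores=stores)
print(placement)   # {'chunk-1': ['a', 'c', 'd']} -- back to 3 replicas, no data lost
```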
10. Real-World Performance Metrics: 6.6 TB/s
Tests at DeepSeek’s datacenter show:
- Token streaming at ~300 MB/sec per GPU node
- Gradient syncing via 3FS write-to-all buffers
- Checkpoint load/recovery in under 3 minutes for full 67B models
This aggregate throughput beats systems like:
- Google Colossus (GFS successor)
- Meta’s f4
- Amazon’s FSx for Lustre
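To put the checkpoint figure in perspective, here is a rough back-of-envelope calculation. The byte counts are assumptions for illustration (bf16 weights plus fp32 master weights and Adam moments, roughly 14 bytes per parameter), not published DeepSeek numbers; the point is the sustained bandwidth a sub-3-minute load implies.

```python
# Rough, assumption-laden estimate of what "checkpoint load in under 3 minutes"
# implies for a 67B-parameter model. Bytes-per-parameter is a common rule of
# thumb (2 bytes bf16 weights + 4 bytes fp32 master copy + 8 bytes Adam
# moments), not a figure published by DeepSeek.
params = 67e9
bytes_per_param = 2 + 4 + 8            # ~14 bytes/param with full optimizer state
checkpoint_bytes = params * bytes_per_param

load_seconds = 180                     # "under 3 minutes"
required_bandwidth = checkpoint_bytes / load_seconds

print(f"checkpoint size  ~ {checkpoint_bytes / 1e12:.2f} TB")
print(f"needed bandwidth ~ {required_bandwidth / 1e9:.1f} GB/s sustained")
# ~0.94 TB checkpoint -> ~5.2 GB/s sustained, a small slice of the 6.6 TB/s aggregate.
```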
11. Bandwidth Optimization and Multi-Node Read
By using data locality awareness, 3FS can:
- Detect which chunks are hot
- Replicate frequently accessed data to edge nodes
- Use least-cost-path read routing (e.g., prefer local SSD replicas over the network)
This reduces network congestion and improves training efficiency by 10–20%.
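The sketch below shows one way least-cost-path replica selection can be expressed: each candidate replica gets a cost based on locality (local SSD, same rack, remote), and the reader picks the cheapest. The cost values and topology fields are invented for illustration; the article does not describe 3FS's actual cost function.

```python
from dataclasses import dataclass

@dataclass
class ReplicaLocation:
    node: str
    rack: str
    on_local_ssd: bool   # true if the replica sits on this reader's own SSD

def read_cost(replica, reader_node, reader_rack):
    """Lower is better. Prefer local SSD, then same-rack, then cross-rack."""
    if replica.on_local_ssd and replica.node == reader_node:
        return 0          # no network hop at all
    if replica.rack == reader_rack:
        return 1          # stays on the rack switch
    return 10             # crosses the datacenter fabric

def choose_replica(replicas, reader_node, reader_rack):
    return min(replicas, key=lambda r: read_cost(r, reader_node, reader_rack))

replicas = [
    ReplicaLocation(node="n12", rack="r3", on_local_ssd=False),
    ReplicaLocation(node="n07", rack="r1", on_local_ssd=False),
    ReplicaLocation(node="n01", rack="r1", on_local_ssd=True),
]
best = choose_replica(replicas, reader_node="n01", reader_rack="r1")
print(best)   # the local-SSD replica on n01 wins, avoiding the network entirely
```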
12. Multi-Tenant Isolation and Access Control
3FS supports:
- Per-team training namespaces
- Read-write separation by GPU pool
- Audit logs via FoundationDB
- Access gating via tokenized credentials
Perfect for R&D environments running multiple model training jobs concurrently.
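As a toy illustration of namespace isolation and tokenized access gating, the sketch below checks a caller's signed credential against its own namespace and permissions before allowing a read or write. The token format and permission model are invented for this example; they are not 3FS's actual access-control scheme.

```python
import hmac
import hashlib

SECRET = b"demo-signing-key"   # stand-in secret; a real system would use a KMS

def issue_token(team, permissions):
    """Issue a signed credential of the form 'team:perms:signature'."""
    payload = f"{team}:{permissions}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def check_access(token, namespace, want):
    """Allow access only to the caller's own namespace with the right permission."""
    team, perms, sig = token.rsplit(":", 2)
    expected = hmac.new(SECRET, f"{team}:{perms}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                      # tampered or forged token
    if not namespace.startswith(f"/{team}/"):
        return False                      # namespace isolation between teams
    return want in perms                  # 'r' and/or 'w'

token = issue_token("llm-pretrain", "rw")
print(check_access(token, "/llm-pretrain/corpus/shard-01", "r"))     # True
print(check_access(token, "/alignment-team/rm-data/shard-09", "r"))  # False
```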
13. Write-Ahead Logging and Consistency Guarantees
All write operations are:
- Logged in FoundationDB’s commit log
- Validated by Zookeeper’s cluster quorum
- Written to temporary chunk buffers before replication
This ensures strong consistency while minimizing write amplification.
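Here is a minimal write-ahead-logging sketch in the same spirit: the intent of a write is durably logged before any chunk buffer is touched, so a crash between the two steps can be replayed rather than lost. This is the generic WAL pattern, illustrated with an on-disk JSON-lines log, not the actual 3FS or FoundationDB commit-log format.

```python
import json
import os

LOG_PATH = "wal.log"          # illustrative local log file

def log_intent(entry):
    """Append the write intent and fsync before acknowledging anything."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())   # durability point: the intent survives a crash

def apply_write(buffers, entry):
    """Stage the data in a chunk buffer; replication would follow from here."""
    buffers[entry["chunk_id"]] = entry["data"]

def recover(buffers):
    """On restart, replay any logged intents that never reached the buffers."""
    if not os.path.exists(LOG_PATH):
        return
    with open(LOG_PATH) as f:
        for line in f:
            entry = json.loads(line)
            if entry["chunk_id"] not in buffers:
                apply_write(buffers, entry)

buffers = {}
entry = {"chunk_id": "ckpt-step-1200/chunk-3", "data": "weights..."}
log_intent(entry)             # 1. durable log record
apply_write(buffers, entry)   # 2. staged in a temporary chunk buffer
recover(buffers)              # idempotent replay after a simulated restart
print(sorted(buffers))
```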
14. Data Placement and Hot File Detection
DeepSeek engineers use ML-powered detection systems to:
- Track access frequency by chunk
- Automatically re-replicate hot data
- Evict or cold-store stale data
- Improve chunk affinity for MoE layers, which often access specific slices repeatedly
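The article does not say which signals the detection system uses, so the sketch below shows one common approach under that caveat: an exponentially decayed access counter per chunk, with chunks above a threshold promoted to extra replicas and long-idle chunks marked for cold storage. The thresholds and decay rate here are arbitrary.

```python
import time

DECAY_HALF_LIFE_S = 600.0     # arbitrary: interest halves every 10 minutes
HOT_THRESHOLD = 50.0          # arbitrary promotion threshold
COLD_THRESHOLD = 0.5          # arbitrary demotion threshold

class ChunkHeat:
    """Exponentially decayed access counter for one chunk."""
    def __init__(self):
        self.score = 0.0
        self.last_update = time.time()

    def record_access(self, now=None):
        now = now or time.time()
        elapsed = now - self.last_update
        self.score *= 0.5 ** (elapsed / DECAY_HALF_LIFE_S)   # decay old interest
        self.score += 1.0                                     # count this access
        self.last_update = now

def placement_decision(heat):
    if heat.score >= HOT_THRESHOLD:
        return "add-replicas"       # re-replicate toward the readers
    if heat.score <= COLD_THRESHOLD:
        return "cold-store"         # candidate for eviction to cheaper storage
    return "keep"

heat = ChunkHeat()
for _ in range(80):                 # simulate a burst of reads on an MoE expert shard
    heat.record_access()
print(placement_decision(heat))     # 'add-replicas'
```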
15. 3FS vs Google’s Colossus vs Meta’s f4
| Feature | 3FS | Colossus | Meta f4 |
|---|---|---|---|
| Peak Throughput | 6.6 TB/s | ~4.5 TB/s | ~5 TB/s |
| Chunking | Chained | Random | Erasure |
| Metadata | FoundationDB | Spanner | RocksDB |
| Language Model Support | Native | External | Mixed |
| Open Source | ❌ (not yet) | ❌ | ❌ |
3FS outperforms these systems in training-specific scenarios, especially where multimodal models are involved.
16. Integration with DeepSeek's Model Training Pipelines
3FS is integrated directly with:
- PyTorch DDP loaders
- DeepSpeed / Megatron cache systems
- MoE expert data shards
- Custom token buffering mechanisms
This tight integration makes 3FS invisible to model engineers, yet critical to how quickly models converge.
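Since 3FS exposes a POSIX-like interface, a training job can treat it like any mounted path. The sketch below is a generic PyTorch `Dataset` that memory-maps fixed-length token shards from a mount point and feeds a `DataLoader`; the mount path, shard layout, and sequence length are assumptions for the example, not DeepSeek's pipeline code.

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TokenShardDataset(Dataset):
    """Reads fixed-length token sequences from binary shards on a mounted FS."""
    def __init__(self, mount_point="/mnt/3fs/corpus", seq_len=4096):
        self.seq_len = seq_len
        # Each shard is assumed to be a flat array of uint16 token ids.
        self.shards = [np.memmap(p, dtype=np.uint16, mode="r")
                       for p in sorted(glob.glob(f"{mount_point}/*.bin"))]
        self.index = [(s, off) for s in range(len(self.shards))
                      for off in range(0, len(self.shards[s]) - seq_len, seq_len)]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        shard, off = self.index[i]
        tokens = np.asarray(self.shards[shard][off:off + self.seq_len], dtype=np.int64)
        return torch.from_numpy(tokens)

# Typical usage: many worker processes issue parallel reads against the mount,
# which is exactly the access pattern a write-to-all/read-by-any layout favors.
if __name__ == "__main__":
    loader = DataLoader(TokenShardDataset(), batch_size=8, num_workers=4, shuffle=True)
    for batch in loader:
        print(batch.shape)   # torch.Size([8, 4096])
        break
```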
17. How 3FS Supports MoE Architectures
MoE (Mixture of Experts) training requires:
- Selective data access to certain "experts"
- Rapid shard retrieval
- Storage of dynamic routing data
3FS enables:
- Sub-second loading of expert model chunks
- Efficient routing table reads
- Balanced data flow across training nodes
Without this, DeepSeek 671B’s MoE setup wouldn’t be viable.
18. Monitoring and Observability at Scale
Monitoring tools include:
- Prometheus-based node metrics
- Grafana dashboards for throughput and latency
- Chunk health logs in FoundationDB
- Zookeeper cluster heartbeat graphs
- Alerts for hot-spot detection and node failure
Admins can predict slowdowns before they happen.
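As a small illustration of the Prometheus side of such a stack, the sketch below uses the official `prometheus_client` Python library to expose per-node read throughput, chunk-health, and latency metrics that a Grafana dashboard could then graph. The metric names are invented for the example.

```python
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Invented metric names for illustration; not 3FS's real metric schema.
READ_BYTES = Counter("demo_fs_read_bytes_total", "Bytes served to readers", ["node"])
UNHEALTHY_CHUNKS = Gauge("demo_fs_unhealthy_chunks", "Chunks below target replication", ["node"])
READ_LATENCY = Histogram("demo_fs_read_latency_seconds", "Chunk read latency")

def record_read(node, nbytes, seconds):
    READ_BYTES.labels(node=node).inc(nbytes)
    READ_LATENCY.observe(seconds)

if __name__ == "__main__":
    start_http_server(9100)            # metrics exposed at http://localhost:9100/metrics
    while True:                        # simulate a storage node serving reads
        record_read("node-a", nbytes=16 * 1024 * 1024, seconds=random.uniform(0.002, 0.02))
        UNHEALTHY_CHUNKS.labels(node="node-a").set(random.choice([0, 0, 0, 1]))
        time.sleep(1)
```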
19. Limitations and Ongoing Challenges
No system is perfect. Challenges include:
- Storage cost for multi-replica data
- Handling non-AI workloads with irregular file access
- Scaling FoundationDB as metadata grows
- Debugging live issues in real-time inference environments
But DeepSeek’s engineers are actively working to address them.
20. Future Directions: Could 3FS Be Open-Sourced?
Rumors suggest:
- A stripped-down version of 3FS may be open-sourced in late 2025
- This would benefit academia and startup labs
- It could compete with PetrelFS or the Ray object store
- It might offer a FoundationDB-backed POSIX API wrapper
The future is exciting — and very fast.
Conclusion: A New Paradigm for AI File Systems
3FS isn’t just another storage backend — it’s a strategic enabler for DeepSeek’s AI ambitions. By delivering 6.6 terabytes per second of resilient, intelligent data access, it redefines what’s possible in large-scale model training.
As foundation models get larger and training data grows more diverse, we need infrastructure that scales with intelligence. 3FS is one such answer — and perhaps, a model for the next generation of AI infrastructure.