How DeepSeek Reads 6TB of Files Per Second: Architecture and Innovation Explained
Introduction
In the realm of large-scale artificial intelligence, one of the most remarkable engineering feats of 2025 is DeepSeek’s ability to read and process 6 terabytes (TB) of data per second during training and inference. While such throughput seems almost implausible to most developers, this benchmark is a result of years of research in distributed computing, next-generation hardware utilization, and optimized data pipelines.
This article provides an in-depth, technical look at how DeepSeek achieves this extraordinary data processing speed, including hardware configurations, software stack optimizations, architectural design choices, and implications for the future of high-performance AI.
Section 1: Why Reading Speed Matters
The Training Bottleneck
In large-scale language model training, I/O (input/output) bottlenecks are a major limiting factor. Models like DeepSeek V3 require:
Tens of trillions of tokens
Datasets that include massive multimodal corpora
Fast shuffle and sample pipelines for efficient generalization
If the I/O system can’t match the compute performance, even the most powerful GPUs sit idle waiting for data.
Real-World Impact
Faster training = lower cost
Enables real-time adaptation in inference
Supports complex tasks like multimodal reasoning and video frame-by-frame understanding
Section 2: Hardware Layer — Exa-Scale Data Infrastructure
1. NVMe Storage Arrays
DeepSeek uses petabytes of data stored across NVMe drives:
Read Speeds: Up to 7 GB/s per drive
Array Clusters: Dozens of drives in RAID-0/RAID-10 configurations
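As a rough sanity check on those numbers: sustaining 6 TB/s from drives that top out around 7 GB/s each implies on the order of 850 to 900 drives reading in parallel, which is why the arrays span many dozens of drives across many nodes rather than a single RAID group.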
2. PCIe 5.0 and CXL Interconnects
PCIe 5.0 runs at 32 GT/s per lane (roughly 4 GB/s of usable bandwidth per lane), so a 16-lane GPU link provides about 64 GB/s in each direction
CXL (Compute Express Link) enables memory pooling across CPU and GPU domains
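Put differently, at roughly 64 GB/s per x16 PCIe 5.0 link, moving 6 TB/s requires the aggregate bandwidth of on the order of 100 such links, which is only achievable by spreading the load across many GPUs and nodes.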
3. GPU Clusters
DeepSeek’s training runs on H800 and B100 NVIDIA GPUs:
Tensor cores for optimized AI computation
High-bandwidth memory (HBM3)
4. Distributed NVSwitch Fabric
A high-bandwidth NVSwitch fabric connects the GPUs within each node at multi-TB/s aggregate bandwidth, and high-speed network fabrics extend that connectivity between nodes.
Section 3: Software Optimization Layer
1. DeepSeekFS: A Custom File System
DeepSeek engineers built a custom file system (inspired by GFS and Ceph):
Metadata caching for file lookup
Striped reads for parallel access
Chunk deduplication to reduce I/O redundancy
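DeepSeekFS itself is not shown in code anywhere in this article, but the striped-read idea above can be illustrated with a minimal sketch: split a file into fixed-size stripes and read them concurrently with a thread pool. The path and stripe size below are placeholder assumptions, not values from DeepSeek.

```python
import os
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 8 * 1024 * 1024  # 8 MiB stripes (placeholder value)

def read_stripe(path: str, offset: int, length: int) -> bytes:
    """Read one stripe with a positioned read so workers never share a file cursor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def striped_read(path: str, workers: int = 16) -> bytes:
    """Read a whole file as parallel stripes and reassemble it in order."""
    size = os.path.getsize(path)
    offsets = range(0, size, STRIPE_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stripes = pool.map(
            lambda off: read_stripe(path, off, min(STRIPE_SIZE, size - off)),
            offsets,
        )
        return b"".join(stripes)

# data = striped_read("/data/shards/shard-000.bin")
```

In a real striped file system the stripes live on different drives or servers, so the parallel reads add up to far more bandwidth than any single device can deliver.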
2. Asynchronous Prefetching
Files are queued for read ahead of GPU demand
Uses reinforcement-learning heuristics to predict which data will be used next
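The article attributes prefetch decisions to learned heuristics; as a much simpler stand-in for the read-ahead idea, the sketch below keeps a bounded queue of already-loaded batches filled by a background thread, so the consumer (the GPU feed loop) rarely waits on disk. The shard paths and load function are placeholders.

```python
import queue
import threading

class Prefetcher:
    """Background read-ahead: keep up to `depth` batches loaded before they're needed."""

    def __init__(self, paths, load_fn, depth: int = 4):
        self.load_fn = load_fn            # e.g. reads and decodes one shard
        self.paths = iter(paths)
        self.ready = queue.Queue(maxsize=depth)
        self.worker = threading.Thread(target=self._fill, daemon=True)
        self.worker.start()

    def _fill(self):
        for path in self.paths:
            self.ready.put(self.load_fn(path))   # blocks while the queue is full
        self.ready.put(None)                     # sentinel: no more data

    def __iter__(self):
        while (item := self.ready.get()) is not None:
            yield item

# for batch in Prefetcher(shard_paths, load_fn=read_shard, depth=8):
#     train_step(batch)
```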
3. Memory Mapping (mmap++)
Huge pages used to map dataset slices directly into GPU-accessible memory
Avoids context switching and reduces latency
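"mmap++" is the article's own name; as a minimal, Linux-only sketch of the underlying idea, the snippet below maps a shard read-only and hints the kernel to back it with transparent huge pages. Getting the pages into GPU-accessible (pinned or managed) memory is beyond this sketch, and the shard path is a placeholder.

```python
import mmap
import os

path = "/data/shards/shard-000.bin"          # placeholder shard path
fd = os.open(path, os.O_RDONLY)
length = os.fstat(fd).st_size

# Map the whole shard read-only; pages are faulted in on demand.
buf = mmap.mmap(fd, length, prot=mmap.PROT_READ)

# Ask the kernel to back the mapping with transparent huge pages where available.
if hasattr(mmap, "MADV_HUGEPAGE"):
    buf.madvise(mmap.MADV_HUGEPAGE)

# Zero-copy view of the first record (record size is a placeholder).
record = memoryview(buf)[:4096]
```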
4. Compression and Decompression Pipelines
On-the-fly LZ4 and Zstandard (Zstd) decompression using GPU cores
Compressed training data is read from storage at up to 6 TB/s in its raw (compressed) form, then expanded in memory into usable formats
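The GPU-side decompression path is not reproduced here; as a CPU-side stand-in that shows the streaming idea, this sketch decompresses a Zstandard-compressed shard incrementally with the python-zstandard package. The file path and chunk size are placeholders.

```python
import zstandard as zstd   # pip install zstandard

CHUNK = 4 * 1024 * 1024    # 4 MiB of decompressed output per step (placeholder)

def stream_decompress(path: str):
    """Yield decompressed chunks without materializing the whole shard in memory."""
    dctx = zstd.ZstdDecompressor()
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        while chunk := reader.read(CHUNK):
            yield chunk

# for block in stream_decompress("/data/shards/shard-000.bin.zst"):
#     parse_records(block)
```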
Section 4: Data Sharding and Parallelization
Horizontal Sharding
Data is horizontally split across nodes
Each GPU reads its own subset to avoid contention
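To make the contention-free assignment concrete, a minimal sketch: hash each file path to exactly one data-parallel rank so no two workers ever open the same shard. The hashing scheme is an illustrative choice, not DeepSeek's published method.

```python
import zlib

def shard_for_rank(paths, rank: int, world_size: int):
    """Deterministically assign each file to exactly one data-parallel rank."""
    return [p for p in paths
            if zlib.crc32(p.encode()) % world_size == rank]

# Each worker reads only its own subset, so ranks never contend for the same files:
# my_files = shard_for_rank(all_shard_paths, rank=3, world_size=64)
```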
Token Bucketing
Sentences and data samples are bucketed by token length
Minimizes padding and optimizes memory usage
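A minimal sketch of the bucketing idea follows; the boundary values are illustrative assumptions.

```python
from collections import defaultdict

def bucket_by_length(samples, boundaries=(64, 128, 256, 512, 1024)):
    """Group tokenized samples so each batch pads only to its bucket's boundary."""
    buckets = defaultdict(list)
    for tokens in samples:
        # First boundary that fits, or the last bucket for anything longer.
        bucket = next((b for b in boundaries if len(tokens) <= b), boundaries[-1])
        buckets[bucket].append(tokens)
    return buckets

# Sequences in buckets[256] pad to 256 tokens instead of the global maximum length.
```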
Synchronized Batching
Uses NCCL2 for multi-node synchronization
Ensures each batch is evenly distributed without duplicated reads
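The article names NCCL but not the training-loop side; a common way to get "each rank sees a disjoint slice of every batch, with no duplicated reads" in PyTorch is sketched below under the assumption of a standard torchrun launch. The dataset and batch size are placeholders.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# NCCL handles the GPU-to-GPU collectives (all-reduce, broadcast) behind the scenes;
# rank and world size come from the launcher (e.g. torchrun) via environment variables.
dist.init_process_group(backend="nccl")

dataset = TensorDataset(torch.arange(1_000_000))      # stand-in for a real token dataset
sampler = DistributedSampler(dataset, shuffle=True)   # disjoint indices per rank
loader = DataLoader(dataset, batch_size=1024, sampler=sampler,
                    num_workers=8, pin_memory=True)

for epoch in range(3):
    sampler.set_epoch(epoch)   # reshuffle consistently across ranks each epoch
    for (batch,) in loader:
        pass  # forward/backward and gradient all-reduce would go here
```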
Section 5: Cloud-Native Deployment
DeepSeek Cloud
Built on a hybrid of Alibaba Cloud and on-premises supernodes
Files distributed via object storage and local NVMe cache
Elastic Scaling
Nodes can be spun up based on data availability
Horizontal autoscaling supports peak loads and spot interruptions
Smart Load Balancing
Each read request is routed to the nearest node with cached data
Load is distributed across availability zones
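As a highly simplified illustration of cache-aware routing (the node names and cache map are invented for the example): prefer a node that already holds the requested chunk, otherwise fall back to the least-loaded node.

```python
def route_read(chunk_id, cache_map, node_load):
    """Pick a node that already caches the chunk; otherwise the least-loaded node."""
    cached_nodes = cache_map.get(chunk_id, [])
    candidates = cached_nodes or list(node_load)
    return min(candidates, key=lambda node: node_load[node])

# cache_map = {"chunk-42": ["node-a", "node-c"]}
# node_load = {"node-a": 0.7, "node-b": 0.2, "node-c": 0.4}
# route_read("chunk-42", cache_map, node_load)  # -> "node-c"
```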
Section 6: Case Study — Ingesting the Internet
To pretrain DeepSeek V3, the team ingested:
Entire Common Crawl (60 TB)
GitHub, Wikipedia, Stack Overflow
Multilingual corpora (e.g., Chinese, Arabic, Hindi)
Web-scraped audio and video transcripts
To handle this:
8,000 GPUs across 16 superclusters
1.2 Petabytes of NVMe
450 TB of system RAM
The system maintained an average sustained throughput of 6 TB/s across 57 days of training.
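Taken at face value, those figures work out to roughly 6 TB/s x 86,400 s/day x 57 days, or about 30 exabytes read over the run, far more than the corpus itself, reflecting repeated passes and shuffled re-reads of the same data.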
Section 7: Challenges and Future Directions
Heat and Power
Sustaining 6 TB/s draws enormous power: roughly 6 megawatts per data center
Cooling challenges addressed with immersion cooling and liquid-cooled racks
Environmental Footprint
Compression cuts storage requirements by roughly 70%, which in turn lowers the environmental footprint
DeepSeek invests in carbon offsets and renewable-powered facilities
Beyond 2025: Optical Interconnects
Upcoming upgrades may include optical PCIe lanes
Could raise theoretical throughput to 25 TB/s
Section 8: Implications for Developers and Enterprises
Fine-Tuning on Local Machines
Techniques like LoRA and QLoRA allow users to fine-tune subsets of DeepSeek models without requiring TB/s I/O
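As a hedged sketch of what that looks like in practice with the Hugging Face peft library (the checkpoint name, target modules, and hyperparameters below are illustrative assumptions, not a published DeepSeek recipe):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative checkpoint name; substitute whichever open DeepSeek variant you use.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

lora = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()   # typically well under 1% of the full model
```

Because only the small adapter matrices are trained, the fine-tuning job streams a modest dataset rather than anything approaching TB/s I/O.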
On-Demand Inference
Streaming APIs allow real-time retrieval of data from high-throughput backends
Useful for medical, legal, and enterprise applications requiring real-time document parsing
Democratizing AI
Open-source DeepSeek variants are being optimized for consumer-grade GPUs
New formats like "torch.indexed.dataset" allow faster loading even on laptops
Conclusion
Reading 6 TB of data per second isn't science fiction — it's now a benchmark in high-performance AI engineering, thanks to DeepSeek's fusion of software and hardware mastery. This achievement represents a fundamental shift in how we scale large language models, allowing for faster, cheaper, and more intelligent systems.
As other AI labs begin to adopt similar architectures, one thing is clear: the future of AI won't just be about smarter models — it will be about smarter infrastructure.
"DeepSeek's data pipeline is the nervous system of modern AI — without it, the brain can’t think fast enough."