How DeepSeek Reads 6TB of Files Per Second: Architecture and Innovation Explained
Introduction
In the realm of large-scale artificial intelligence, one of the most remarkable engineering feats of 2025 is DeepSeek’s ability to read and process 6 terabytes (TB) of data per second during training and inference. While such throughput seems almost implausible to most developers, this benchmark is a result of years of research in distributed computing, next-generation hardware utilization, and optimized data pipelines.
This article provides an in-depth, technical look at how DeepSeek achieves this extraordinary data processing speed, including hardware configurations, software stack optimizations, architectural design choices, and implications for the future of high-performance AI.
Section 1: Why Reading Speed Matters
The Training Bottleneck
In large-scale language model training, I/O (input/output) bottlenecks are a major limiting factor. Models like DeepSeek V3 require:
Tens of trillions of tokens
Datasets that include massive multimodal corpora
Fast shuffle and sample pipelines for efficient generalization
If the I/O system can’t match the compute performance, even the most powerful GPUs sit idle waiting for data.
Real-World Impact
Faster training = lower cost
Enables real-time adaptation in inference
Supports complex tasks like multimodal reasoning and video frame-by-frame understanding
Section 2: Hardware Layer — Exa-Scale Data Infrastructure
1. NVMe Storage Arrays
DeepSeek uses petabytes of data stored across NVMe drives:
Read Speeds: Up to 7 GB/s per drive
Array Clusters: Dozens of drives in RAID-0/RAID-10 configurations
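As a rough sanity check on those numbers: sustaining 6 TB/s from drives that top out around 7 GB/s each implies on the order of 850 to 900 drives reading in parallel, which is why the arrays span many dozens of drives across many nodes rather than a single RAID group.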
2. PCIe 5.0 and CXL Interconnects
PCIe 5.0 runs at 32 GT/s per lane (roughly 4 GB/s of usable bandwidth per lane), so a 16-lane GPU link provides about 64 GB/s in each direction
CXL (Compute Express Link) enables memory pooling across CPU and GPU domains
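Put differently, at roughly 64 GB/s per x16 PCIe 5.0 link, moving 6 TB/s requires the aggregate bandwidth of on the order of 100 such links, which is only achievable by spreading the load across many GPUs and nodes.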
3. GPU Clusters
DeepSeek’s training runs on H800 and B100 NVIDIA GPUs:
Tensor cores for optimized AI computation
High-bandwidth memory (HBM3)
4. Distributed NVSwitch Fabric
A high-bandwidth NVSwitch fabric connects the GPUs within each node at multi-TB/s aggregate bandwidth, and high-speed network fabrics extend that connectivity between nodes.
Section 3: Software Optimization Layer
1. DeepSeekFS: A Custom File System
DeepSeek engineers built a custom file system (inspired by GFS and Ceph):
Metadata caching for file lookup
Striped reads for parallel access
Chunk deduplication to reduce I/O redundancy
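DeepSeekFS itself is not shown in code anywhere in this article, but the striped-read idea above can be illustrated with a minimal sketch: split a file into fixed-size stripes and read them concurrently with a thread pool. The path and stripe size below are placeholder assumptions, not values from DeepSeek.

```python
import os
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 8 * 1024 * 1024  # 8 MiB stripes (placeholder value)

def read_stripe(path: str, offset: int, length: int) -> bytes:
    """Read one stripe with a positioned read so workers never share a file cursor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def striped_read(path: str, workers: int = 16) -> bytes:
    """Read a whole file as parallel stripes and reassemble it in order."""
    size = os.path.getsize(path)
    offsets = range(0, size, STRIPE_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stripes = pool.map(
            lambda off: read_stripe(path, off, min(STRIPE_SIZE, size - off)),
            offsets,
        )
        return b"".join(stripes)

# data = striped_read("/data/shards/shard-000.bin")
```

In a real striped file system the stripes live on different drives or servers, so the parallel reads add up to far more bandwidth than any single device can deliver.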
2. Asynchronous Prefetching
Files are queued for read ahead of GPU demand
Uses reinforcement-learning heuristics to predict which data will be used next
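The article attributes prefetch decisions to learned heuristics; as a much simpler stand-in for the read-ahead idea, the sketch below keeps a bounded queue of already-loaded batches filled by a background thread, so the consumer (the GPU feed loop) rarely waits on disk. The shard paths and load function are placeholders.

```python
import queue
import threading

class Prefetcher:
    """Background read-ahead: keep up to `depth` batches loaded before they're needed."""

    def __init__(self, paths, load_fn, depth: int = 4):
        self.load_fn = load_fn            # e.g. reads and decodes one shard
        self.paths = iter(paths)
        self.ready = queue.Queue(maxsize=depth)
        self.worker = threading.Thread(target=self._fill, daemon=True)
        self.worker.start()

    def _fill(self):
        for path in self.paths:
            self.ready.put(self.load_fn(path))   # blocks while the queue is full
        self.ready.put(None)                     # sentinel: no more data

    def __iter__(self):
        while (item := self.ready.get()) is not None:
            yield item

# for batch in Prefetcher(shard_paths, load_fn=read_shard, depth=8):
#     train_step(batch)
```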
3. Memory Mapping (mmap++)
Huge pages used to map dataset slices directly into GPU-accessible memory
Avoids context switching and reduces latency
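"mmap++" is the article's own name; as a minimal, Linux-only sketch of the underlying idea, the snippet below maps a shard read-only and hints the kernel to back it with transparent huge pages. Getting the pages into GPU-accessible (pinned or managed) memory is beyond this sketch, and the shard path is a placeholder.

```python
import mmap
import os

path = "/data/shards/shard-000.bin"          # placeholder shard path
fd = os.open(path, os.O_RDONLY)
length = os.fstat(fd).st_size

# Map the whole shard read-only; pages are faulted in on demand.
buf = mmap.mmap(fd, length, prot=mmap.PROT_READ)

# Ask the kernel to back the mapping with transparent huge pages where available.
if hasattr(mmap, "MADV_HUGEPAGE"):
    buf.madvise(mmap.MADV_HUGEPAGE)

# Zero-copy view of the first record (record size is a placeholder).
record = memoryview(buf)[:4096]
```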
4. Compression and Decompression Pipelines
On-the-fly LZ4 and Zstandard (Zstd) decompression using GPU cores
Compressed training data is read from storage at up to 6 TB/s in its raw (compressed) form, then expanded in memory into usable formats
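The GPU-side decompression path is not reproduced here; as a CPU-side stand-in that shows the streaming idea, this sketch decompresses a Zstandard-compressed shard incrementally with the python-zstandard package. The file path and chunk size are placeholders.

```python
import zstandard as zstd   # pip install zstandard

CHUNK = 4 * 1024 * 1024    # 4 MiB of decompressed output per step (placeholder)

def stream_decompress(path: str):
    """Yield decompressed chunks without materializing the whole shard in memory."""
    dctx = zstd.ZstdDecompressor()
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        while chunk := reader.read(CHUNK):
            yield chunk

# for block in stream_decompress("/data/shards/shard-000.bin.zst"):
#     parse_records(block)
```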
Section 4: Data Sharding and Parallelization
Horizontal Sharding
Data is horizontally split across nodes
Each GPU reads its own subset to avoid contention
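To make the contention-free assignment concrete, a minimal sketch: hash each file path to exactly one data-parallel rank so no two workers ever open the same shard. The hashing scheme is an illustrative choice, not DeepSeek's published method.

```python
import zlib

def shard_for_rank(paths, rank: int, world_size: int):
    """Deterministically assign each file to exactly one data-parallel rank."""
    return [p for p in paths
            if zlib.crc32(p.encode()) % world_size == rank]

# Each worker reads only its own subset, so ranks never contend for the same files:
# my_files = shard_for_rank(all_shard_paths, rank=3, world_size=64)
```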
Token Bucketing
Sentences and data samples are bucketed by token length
Minimizes padding and optimizes memory usage
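A minimal sketch of the bucketing idea follows; the boundary values are illustrative assumptions.

```python
from collections import defaultdict

def bucket_by_length(samples, boundaries=(64, 128, 256, 512, 1024)):
    """Group tokenized samples so each batch pads only to its bucket's boundary."""
    buckets = defaultdict(list)
    for tokens in samples:
        # First boundary that fits, or the last bucket for anything longer.
        bucket = next((b for b in boundaries if len(tokens) <= b), boundaries[-1])
        buckets[bucket].append(tokens)
    return buckets

# Sequences in buckets[256] pad to 256 tokens instead of the global maximum length.
```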
Synchronized Batching
Uses NCCL2 for multi-node synchronization
Ensures each batch is evenly distributed without duplicated reads
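The article names NCCL but not the training-loop side; a common way to get "each rank sees a disjoint slice of every batch, with no duplicated reads" in PyTorch is sketched below under the assumption of a standard torchrun launch. The dataset and batch size are placeholders.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# NCCL handles the GPU-to-GPU collectives (all-reduce, broadcast) behind the scenes;
# rank and world size come from the launcher (e.g. torchrun) via environment variables.
dist.init_process_group(backend="nccl")

dataset = TensorDataset(torch.arange(1_000_000))      # stand-in for a real token dataset
sampler = DistributedSampler(dataset, shuffle=True)   # disjoint indices per rank
loader = DataLoader(dataset, batch_size=1024, sampler=sampler,
                    num_workers=8, pin_memory=True)

for epoch in range(3):
    sampler.set_epoch(epoch)   # reshuffle consistently across ranks each epoch
    for (batch,) in loader:
        pass  # forward/backward and gradient all-reduce would go here
```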
Section 5: Cloud-Native Deployment
DeepSeek Cloud
Built on a hybrid of Alibaba Cloud and on-premises supernodes
Files distributed via object storage and local NVMe cache
Elastic Scaling
Nodes can be spun up based on data availability
Horizontal autoscaling supports peak loads and spot interruptions
Smart Load Balancing
Each read request is routed to the nearest node with cached data
Load is distributed across availability zones
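As a highly simplified illustration of cache-aware routing (the node names and cache map are invented for the example): prefer a node that already holds the requested chunk, otherwise fall back to the least-loaded node.

```python
def route_read(chunk_id, cache_map, node_load):
    """Pick a node that already caches the chunk; otherwise the least-loaded node."""
    cached_nodes = cache_map.get(chunk_id, [])
    candidates = cached_nodes or list(node_load)
    return min(candidates, key=lambda node: node_load[node])

# cache_map = {"chunk-42": ["node-a", "node-c"]}
# node_load = {"node-a": 0.7, "node-b": 0.2, "node-c": 0.4}
# route_read("chunk-42", cache_map, node_load)  # -> "node-c"
```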
Section 6: Case Study — Ingesting the Internet
To pretrain DeepSeek V3, the team ingested:
Entire Common Crawl (60 TB)
GitHub, Wikipedia, Stack Overflow
Multilingual corpora (e.g., Chinese, Arabic, Hindi)
Web-scraped audio and video transcripts
To handle this:
8,000 GPUs across 16 superclusters
1.2 Petabytes of NVMe
450 TB of system RAM
The system maintained an average sustained throughput of 6 TB/s across 57 days of training.
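Taken at face value, those figures work out to roughly 6 TB/s x 86,400 s/day x 57 days, or about 30 exabytes read over the run, far more than the corpus itself, reflecting repeated passes and shuffled re-reads of the same data.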
Section 7: Challenges and Future Directions
Heat and Power
Sustaining 6 TB/s draws enormous power: roughly 6 megawatts per data center
Cooling challenges addressed with immersion cooling and liquid-cooled racks
Environmental Footprint
Compression cuts storage requirements by roughly 70%, which in turn lowers the environmental footprint
DeepSeek invests in carbon offsets and renewable-powered facilities
Beyond 2025: Optical Interconnects
Upcoming upgrades may include optical PCIe lanes
Could raise theoretical throughput to 25 TB/s
Section 8: Implications for Developers and Enterprises
Fine-Tuning on Local Machines
Techniques like LoRA and QLoRA allow users to fine-tune subsets of DeepSeek models without requiring TB/s I/O
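As a hedged sketch of what that looks like in practice with the Hugging Face peft library (the checkpoint name, target modules, and hyperparameters below are illustrative assumptions, not a published DeepSeek recipe):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative checkpoint name; substitute whichever open DeepSeek variant you use.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

lora = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()   # typically well under 1% of the full model
```

Because only the small adapter matrices are trained, the fine-tuning job streams a modest dataset rather than anything approaching TB/s I/O.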
On-Demand Inference
Streaming APIs allow real-time retrieval of data from high-throughput backends
Useful for medical, legal, and enterprise applications requiring real-time document parsing
Democratizing AI
Open-source DeepSeek variants are being optimized for consumer-grade GPUs
New formats like "torch.indexed.dataset" allow faster loading even on laptops
Conclusion
Reading 6 TB of data per second isn't science fiction — it's now a benchmark in high-performance AI engineering, thanks to DeepSeek's fusion of software and hardware mastery. This achievement represents a fundamental shift in how we scale large language models, allowing for faster, cheaper, and more intelligent systems.
As other AI labs begin to adopt similar architectures, one thing is clear: the future of AI won't just be about smarter models — it will be about smarter infrastructure.
"DeepSeek's data pipeline is the nervous system of modern AI — without it, the brain can’t think fast enough."