How DeepSeek Reads 6TB of Files Per Second: Architecture and Innovation Explained


Introduction

In the realm of large-scale artificial intelligence, one of the most remarkable engineering feats of 2025 is DeepSeek’s ability to read and process 6 terabytes (TB) of data per second during training and inference. Throughput on that scale can seem implausible, but it is the product of years of work in distributed computing, next-generation hardware utilization, and optimized data pipelines.


This article provides an in-depth, technical look at how DeepSeek achieves this extraordinary data processing speed, including hardware configurations, software stack optimizations, architectural design choices, and implications for the future of high-performance AI.

Section 1: Why Reading Speed Matters

The Training Bottleneck

In large-scale language model training, I/O (input/output) bottlenecks are a major limiting factor. Models like DeepSeek V3 require:

  • Tens of trillions of tokens

  • Datasets that include massive multimodal corpora

  • Fast shuffle and sample pipelines for efficient generalization

If the I/O system can’t match the compute performance, even the most powerful GPUs sit idle waiting for data.
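A quick back-of-envelope calculation shows why. The numbers below are illustrative assumptions, not DeepSeek's published figures:

```python
# Back-of-envelope: how much read bandwidth keeps a GPU cluster fed?
# All numbers below are illustrative assumptions, not published figures.

gpus = 8_000                    # cluster size
bytes_per_step = 96e6           # raw bytes each GPU consumes per step (assumed)
step_time_s = 0.125             # wall-clock time per training step (assumed)

required_bw = gpus * bytes_per_step / step_time_s
print(f"Required aggregate read bandwidth: {required_bw / 1e12:.2f} TB/s")
# -> 6.14 TB/s; anything less and the GPUs stall between batches.
```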

Real-World Impact

  • Faster training = lower cost

  • Enables real-time adaptation in inference

  • Supports complex tasks like multimodal reasoning and video frame-by-frame understanding

Section 2: Hardware Layer — Exa-Scale Data Infrastructure

1. NVMe Storage Arrays

DeepSeek uses petabytes of data stored across NVMe drives:

  • Read Speeds: Up to 7 GB/s per drive

  • Array Clusters: Dozens of drives in RAID-0/RAID-10 configurations
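To get a feel for what it takes to reach 6 TB/s from such an array, here is a rough sizing sketch. The per-drive speed comes from the text; the efficiency factor is an assumption:

```python
# Rough sizing of an NVMe array for a 6 TB/s aggregate read target.
# The 7 GB/s drive speed comes from the text; efficiency is assumed.

target_bw = 6e12      # 6 TB/s
per_drive = 7e9       # 7 GB/s per NVMe drive
efficiency = 0.7      # assumed real-world striping/RAID efficiency

drives = target_bw / (per_drive * efficiency)
print(f"Drives needed at {efficiency:.0%} efficiency: {drives:.0f}")
# -> ~1224 drives, spread across many RAID-0/RAID-10 arrays and nodes
```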

2. PCIe 5.0 and CXL Interconnects

  • PCIe 5.0 runs at 32 GT/s per lane (roughly 4 GB/s), giving a 16-lane GPU link about 64 GB/s in each direction

  • CXL (Compute Express Link) enables memory pooling across CPU and GPU domains

3. GPU Clusters

DeepSeek’s training runs on NVIDIA H800 and B100 GPUs:

  • Tensor cores for optimized AI computation

  • High-bandwidth memory (HBM3)

4. Distributed NVSwitch Fabric

A high-bandwidth NVSwitch fabric connects the GPUs within each node; high-speed inter-node links extend this into multi-TB/s aggregate throughput across the cluster.

Section 3: Software Optimization Layer

1. DeepSeekFS: A Custom File System

DeepSeek engineers built a custom file system (inspired by GFS and Ceph):

  • Metadata caching for file lookup

  • Striped reads for parallel access

  • Chunk deduplication to reduce I/O redundancy
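A minimal Python sketch of the striped-read idea; the chunk size, file layout, and names are hypothetical, not DeepSeekFS's actual on-disk format:

```python
# Minimal sketch of a striped read: one logical file is split into fixed-size
# chunks across several backing files, fetched in parallel.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 * 1024 * 1024  # 4 MiB stripe unit (assumed)

def read_chunk(path: str, offset: int) -> bytes:
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(CHUNK)

def striped_read(stripe_paths: list[str], logical_offset: int, n_chunks: int) -> bytes:
    """Read n_chunks, round-robining across the stripe files in parallel."""
    jobs = []
    with ThreadPoolExecutor(max_workers=len(stripe_paths)) as pool:
        for i in range(n_chunks):
            path = stripe_paths[i % len(stripe_paths)]
            offset = (i // len(stripe_paths)) * CHUNK + logical_offset
            jobs.append(pool.submit(read_chunk, path, offset))
    return b"".join(job.result() for job in jobs)
```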

2. Asynchronous Prefetching

  • Files are queued for read ahead of GPU demand

  • Uses reinforcement-learning heuristics to predict which data will be used next
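In miniature, a prefetcher is just a bounded queue filled by a background thread ahead of the consumer. The sketch below reads batches in order rather than predicting access patterns, which is a simplification:

```python
# Minimal prefetcher: a background thread keeps a bounded queue of batches
# staged ahead of GPU demand. This version reads sequentially; predicting
# the access order (as described above) is left out for brevity.
import queue
import threading

def prefetching_loader(batch_iter, depth: int = 8):
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for batch in batch_iter:
            q.put(batch)       # blocks once `depth` batches are staged
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not SENTINEL:
        yield item             # the training step sees pre-read batches
```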

3. Memory Mapping (mmap++)

  • Huge pages used to map dataset slices directly into GPU-accessible memory

  • Avoids per-read system calls and extra data copies, reducing latency
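A small illustration with numpy's memmap, which wraps mmap under the hood. The file name and token format below are hypothetical, and huge pages themselves are an OS-level setting not shown here:

```python
# Mapping a dataset slice instead of read()-ing it. The file name and
# uint16 token format are hypothetical.
import numpy as np

tokens = np.memmap("tokens.bin", dtype=np.uint16, mode="r")

# Slicing touches only the pages actually needed: no full-file copy and
# no per-batch read() system calls.
batch = np.asarray(tokens[0:8192])  # materialize one 8K-token window
```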

4. Compression and Decompression Pipelines

  • On-the-fly LZ4 and Zstandard (Zstd) decompression using GPU cores

  • Compressed shards are read from storage at the raw rate of 6 TB/s, then expanded in memory into usable formats
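For illustration, here is a CPU-side sketch of the same pipeline shape using the zstandard package. The article describes decompression on GPU cores, which this simplified version does not attempt:

```python
# CPU-side sketch of the decompression stage using the `zstandard` package.
import zstandard as zstd
from concurrent.futures import ThreadPoolExecutor

def decompress_shard(blob: bytes) -> bytes:
    # One decompressor per call: ZstdDecompressor is not thread-safe.
    return zstd.ZstdDecompressor().decompress(blob)

def decompress_stream(compressed_shards, workers: int = 16):
    # Overlap many shards so reads stay at raw (compressed) bandwidth
    # while expanded bytes are produced in memory.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(decompress_shard, compressed_shards)
```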

Section 4: Data Sharding and Parallelization

Horizontal Sharding

  • Data is horizontally split across nodes

  • Each GPU reads its own subset to avoid contention
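In code, this contention-free assignment can be as simple as a strided slice over the shard list:

```python
# Contention-free shard assignment: each rank claims a disjoint, strided
# slice of the shard list, so no two GPUs ever open the same file.
def shards_for_rank(all_shards: list[str], rank: int, world_size: int) -> list[str]:
    return all_shards[rank::world_size]

# Rank 1 of 4 over 8 shards reads shards 1 and 5 only:
print(shards_for_rank([f"shard_{i}" for i in range(8)], rank=1, world_size=4))
```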

Token Bucketing

  • Sentences and data samples are bucketed by token length

  • Minimizes padding and optimizes memory usage
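A minimal bucketing sketch; the bucket boundaries below are assumptions:

```python
# Token bucketing: group samples by length so each batch pads to a
# similar size.
from collections import defaultdict

BOUNDARIES = (128, 256, 512, 1024, 2048)

def bucket_key(n_tokens: int) -> int:
    for bound in BOUNDARIES:
        if n_tokens <= bound:
            return bound
    return BOUNDARIES[-1]  # overflow bucket for very long samples

def bucket_samples(samples):
    """samples: iterable of token lists -> {bucket size: samples}."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[bucket_key(len(s))].append(s)
    return buckets  # batches drawn from one bucket need minimal padding
```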

Synchronized Batching

  • Uses NCCL (NVIDIA’s collective communications library) for multi-node synchronization

  • Ensures each batch is evenly distributed without duplicated reads
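For illustration, PyTorch's DistributedSampler achieves the same duplication-free partitioning, with gradient synchronization then running over NCCL. PyTorch itself is an assumption here; the article names only NCCL:

```python
# Duplication-free batching with PyTorch's DistributedSampler.
from torch.utils.data import DataLoader, Dataset, DistributedSampler

def make_loader(dataset: Dataset, rank: int, world_size: int) -> DataLoader:
    # Each rank gets a disjoint 1/world_size slice of the dataset,
    # so every batch is evenly distributed with no duplicated reads.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=32, sampler=sampler)
```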

Section 5: Cloud-Native Deployment

DeepSeek Cloud

  • Built on a hybrid of Alibaba Cloud and on-premises supernodes

  • Files distributed via object storage and local NVMe cache

Elastic Scaling

  • Nodes can be spun up based on data availability

  • Horizontal autoscaling supports peak loads and spot interruptions

Smart Load Balancing

  • Each read request is routed to the nearest node with cached data

  • Load is distributed across availability zones
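A toy version of cache-aware routing hashes each file to a preferred node, so repeated reads land where the cached copy already lives. The node names are made up, and real balancers also weigh zone distance and current load:

```python
# Toy cache-aware routing: same file -> same cache node.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def route(file_id: str, nodes: list[str] = NODES) -> str:
    digest = int(hashlib.sha256(file_id.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

print(route("shard_000123.zst"))
```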

Section 6: Case Study — Ingesting the Internet

To pretrain DeepSeek V3, the team ingested:

  • A filtered Common Crawl snapshot (roughly 60 TB of text)

  • GitHub, Wikipedia, Stack Overflow

  • Multilingual corpora (e.g., Chinese, Arabic, Hindi)

  • Web-scraped audio and video transcripts

To handle this:

  • 8,000 GPUs across 16 superclusters

  • 1.2 Petabytes of NVMe

  • 450 TB of system RAM

The system maintained an average sustained throughput of 6 TB/s across 57 days of training.
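A quick sanity check on those numbers:

```python
# What does 6 TB/s mean per GPU in this cluster?
total_bw = 6e12   # sustained read throughput from the text
gpus = 8_000      # GPU count from the text

print(f"{total_bw / gpus / 1e9:.2f} GB/s per GPU")  # -> 0.75 GB/s per GPU
# Comfortably within a PCIe 5.0 x16 link (~64 GB/s), so the hard part is
# the storage and network fan-in, not the last hop to each GPU.
```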

Section 7: Challenges and Future Directions

Heat and Power

  • Sustaining 6 TB/s requires huge energy: ~6 megawatts per data center

  • Cooling challenges addressed with immersion cooling and liquid-cooled racks

Environmental Footprint

  • Compression cuts storage requirements by roughly 70%, shrinking the hardware and energy footprint

  • DeepSeek invests in carbon offsets and renewable-powered facilities

Beyond 2025: Optical Interconnects

  • Upcoming upgrades may include optical PCIe lanes

  • Could raise theoretical throughput to 25 TB/s

Section 8: Implications for Developers and Enterprises

Fine-Tuning on Local Machines

  • Techniques like LoRA and QLoRA allow users to fine-tune subsets of DeepSeek models without requiring TB/s I/O
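For example, with Hugging Face's peft library a LoRA adapter trains well under 1% of the weights. The checkpoint name below is a placeholder; substitute any DeepSeek variant that fits your GPU:

```python
# Minimal LoRA setup with Hugging Face's peft library.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/<your-checkpoint>")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the weights
```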

On-Demand Inference

  • Streaming APIs allow real-time retrieval of data from high-throughput backends

  • Useful for medical, legal, and enterprise applications requiring real-time document parsing

Democratizing AI

  • Open-source DeepSeek variants are being optimized for consumer-grade GPUs

  • New formats like "torch.indexed.dataset" allow faster loading even on laptops

Conclusion

Reading 6 TB of data per second isn't science fiction — it's now a benchmark in high-performance AI engineering, thanks to DeepSeek's fusion of software and hardware mastery. This achievement represents a fundamental shift in how we scale large language models, allowing for faster, cheaper, and more intelligent systems.

As other AI labs begin to adopt similar architectures, one thing is clear: the future of AI won't just be about smarter models — it will be about smarter infrastructure.

"DeepSeek's data pipeline is the nervous system of modern AI — without it, the brain can’t think fast enough."
