🧠 A Complete Flask-Based DeepSeek Chatbot API: Build Your Own Local AI Assistant
With the rise of open-weight models like DeepSeek R1 and DeepSeek-Coder, developers now have the opportunity to build powerful, private, and customizable chatbots without relying on commercial cloud APIs. In this guide, we’ll walk you through the process of creating a fully functional Flask-based chatbot API using DeepSeek running locally or on a server, accessible via HTTP requests.

This article is ideal for:
Developers building AI SaaS tools
Educators and researchers wanting a secure LLM
Startups looking to save on OpenAI costs
Hackers and tinkerers interested in privacy-preserving AI
✅ Table of Contents
Introduction to the Project
Why Use Flask + DeepSeek?
Requirements and Setup
Downloading and Running DeepSeek Locally
Building the Flask App
Testing the API via Curl & Postman
Adding Frontend (Optional)
Extending Features: Streaming, Chat History
Security: API Tokens, Rate Limits
Deploying to Cloud (or LAN)
Final Tips & Troubleshooting
What’s Next? Add-ons and Upgrades
1. 🧠 Introduction to the Project
You will learn how to:
Run DeepSeek R1 locally using llama-cpp-python
Create a lightweight chatbot backend using Flask
Accept user prompts and return LLM responses
Secure and extend the chatbot for production use
Result: a chat-style API accessible from a browser, mobile app, or other services.
2. 🔍 Why Use Flask + DeepSeek?
| Feature | Reason to Use |
|---|---|
| Flask | Lightweight, scalable, and easy to build APIs with |
| DeepSeek R1/Coder | Open-weight, powerful, 128K context, no API costs |
| llama-cpp-python | Efficient GGUF inference for CPUs/GPUs |
| Local-first | Secure and self-hosted, no cloud needed |
No API fees, unlimited usage, and full customization power.
3. ⚙️ Requirements and Setup
System Requirements:
macOS, Linux, or WSL on Windows
16GB+ RAM recommended (for quantized model)
Python 3.10+
Python Libraries:
```bash
pip install flask flask-cors llama-cpp-python
```
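If you want GPU offloading, llama-cpp-python is usually reinstalled with CMake flags for your backend. The flag below is only an example and varies by version and platform (CUDA, Metal, ROCm); check the llama-cpp-python README for the exact incantation:

```bash
# Example only: flag names differ between llama-cpp-python versions and backends.
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```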
4. 📥 Downloading and Running DeepSeek Locally
Step 1: Download the GGUF Model
Go to Hugging Face or TheBloke’s GGUF mirror.
Download a quantized model, e.g.:
deepseek-7b-chat.Q4_K_M.gguf
Place it in a directory: ./models/
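Before wiring this into Flask, it helps to confirm the model loads and generates text. Here is a minimal sketch, assuming the example file name and path above; adjust model_path to whatever quantization you actually downloaded:

```python
# Quick sanity check: load the GGUF file and generate a short completion.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/deepseek-7b-chat.Q4_K_M.gguf",  # example path from this guide
    n_ctx=4096,
    n_threads=8,
)
out = llm("[INST] Say hello in one sentence. [/INST]", max_tokens=64)
print(out["choices"][0]["text"].strip())
```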
5. 🛠️ Building the Flask App
```python
from flask import Flask, request, jsonify
from flask_cors import CORS
from llama_cpp import Llama

app = Flask(__name__)
CORS(app)

# Load the DeepSeek GGUF model once at startup
llm = Llama(model_path="./models/deepseek-7b-chat.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_prompt = data.get('prompt', '')
    full_prompt = f"[INST] {user_prompt} [/INST]"

    response = llm(
        prompt=full_prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.9,
        stop=["</s>", "[INST]"]
    )

    output = response['choices'][0]['text'].strip()
    return jsonify({'response': output})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
Save this file as chatbot_api.py
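Then start the server. The model is loaded once at startup, which can take a moment for larger GGUF files, after which the API listens on port 5000 on all interfaces:

```bash
python chatbot_api.py
```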
6. 🧪 Testing the API via Curl or Postman
Curl:
```bash
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Tell me a joke about AI."}'
```
Postman:
URL: http://localhost:5000/chat
Method: POST
Body (raw JSON):

```json
{
  "prompt": "Explain black holes in simple terms."
}
```
You’ll get a JSON response with the generated answer.
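The reply shape mirrors the jsonify call in chatbot_api.py; only the generated text itself varies:

```json
{
  "response": "<generated answer from the model>"
}
```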
7. 💻 Adding Frontend (Optional)
Here’s a basic HTML page to test it:
```html
<!DOCTYPE html>
<html>
<head>
  <title>DeepSeek Chatbot</title>
</head>
<body>
  <h1>Talk to DeepSeek</h1>
  <textarea id="prompt" rows="4" cols="60"></textarea><br>
  <button onclick="sendPrompt()">Ask</button>
  <pre id="response"></pre>
  <script>
    async function sendPrompt() {
      const prompt = document.getElementById("prompt").value;
      const res = await fetch("http://localhost:5000/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt })
      });
      const data = await res.json();
      document.getElementById("response").innerText = data.response;
    }
  </script>
</body>
</html>
```
Save as index.html and open it locally in your browser.
8. 🚀 Extending Features: Streaming & Memory
a. Enable Streaming:
In llama-cpp-python, you can pass stream=True:
```python
# Inside chat(): iterate over streamed chunks instead of waiting for the full completion
for chunk in llm.create_completion(prompt=full_prompt, stream=True):
    print(chunk['choices'][0]['text'], end='', flush=True)
```
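Printing only streams to the server's terminal. To push tokens to HTTP clients as they are generated, one option is a Flask streaming response. This is a minimal sketch, not part of the original app, and the /chat/stream route name is made up here:

```python
from flask import Response, stream_with_context

@app.route('/chat/stream', methods=['POST'])
def chat_stream():
    data = request.get_json()
    full_prompt = f"[INST] {data.get('prompt', '')} [/INST]"

    def generate():
        # Yield each text fragment as llama-cpp-python produces it
        for chunk in llm.create_completion(prompt=full_prompt, max_tokens=512, stream=True):
            yield chunk['choices'][0]['text']

    return Response(stream_with_context(generate()), mimetype='text/plain')
```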
b. Add Chat History (Memory):
Keep a running chat log (the snippet below uses a single global list for simplicity; a per-session sketch follows after it):
```python
history = []

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_input = data.get('prompt', '')
    history.append(f"User: {user_input}")

    full_prompt = "\n".join(history) + "\nAI:"
    response = llm(prompt=full_prompt, max_tokens=512)
    answer = response['choices'][0]['text'].strip()

    history.append(f"AI: {answer}")
    return jsonify({'response': answer})
```
9. 🔐 Security: API Keys, Rate Limits
Add basic key-based auth:
```python
@app.before_request
def check_auth():
    api_key = request.headers.get("Authorization")
    if api_key != "Bearer your-secret-key":
        return jsonify({"error": "Unauthorized"}), 401
```
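Clients then send the key in the Authorization header. The secret below is the placeholder from the snippet above; in practice, load it from an environment variable rather than hard-coding it:

```bash
curl -X POST http://localhost:5000/chat \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello"}'
```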
You can also integrate:
Flask-Limiter for rate limits (see the sketch after this list)
JWT tokens for multi-user systems
Cloudflare or Nginx for IP filtering and HTTPS
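As an example of the first item, here is a minimal Flask-Limiter sketch, assuming Flask-Limiter 3.x and per-IP limiting; the limit values are placeholders:

```python
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# Rate-limit clients by IP address; the default applies to every route
limiter = Limiter(get_remote_address, app=app, default_limits=["60 per hour"])
```

Individual routes such as /chat can then be tightened further with the @limiter.limit(...) decorator.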
10. ☁️ Deploying to the Cloud
Use platforms like:
Render (simple, free-tier friendly)
Fly.io (low latency global deploy)
Hetzner (GPU cloud for inference)
AWS EC2 (for GPU + Flask combo)
Docker:
```dockerfile
FROM python:3.11-slim
COPY . /app
WORKDIR /app
RUN pip install flask flask-cors llama-cpp-python
CMD ["python", "chatbot_api.py"]
```
Then:
```bash
docker build -t deepseek-chatbot .
docker run -p 5000:5000 deepseek-chatbot
```
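Two practical notes: pip install llama-cpp-python typically compiles from source, so the slim base image may need build tools (gcc, cmake) installed first; and COPY . /app pulls the entire build context into the image, including a multi-gigabyte GGUF file if it sits in ./models. One common alternative, an assumption rather than part of the original setup, is to keep the model out of the image and mount it at run time:

```bash
docker run -p 5000:5000 -v "$(pwd)/models:/app/models" deepseek-chatbot
```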
11. 🛠 Final Tips & Troubleshooting
| Problem | Fix |
|---|---|
| Out of memory | Use smaller GGUF (Q4_K_M or Q5_0) |
| Slow response | Reduce max_tokens, optimize quant |
| CORS issues | Use Flask-CORS |
| Long loading time | Pre-load model on server start |
| Bad answers | Improve prompt formatting ([INST]...[/INST]) |
12. 🧩 What’s Next? Add-ons and Upgrades
You can build on top of this base:
✅ Add support for LangChain agents
✅ Integrate with Telegram Bot API
✅ Create a Slack assistant
✅ Add voice support via Whisper + TTS
✅ Implement streaming WebSocket responses
✅ Conclusion
This Flask + DeepSeek chatbot is a foundation for a fully controllable, open, and monetizable AI application. Unlike cloud APIs, you:
Own your model
Pay no token fees
Can customize to your industry/domain
Can deploy on any hardware, even offline