🧠 A Complete Flask-Based DeepSeek Chatbot API: Build Your Own Local AI Assistant
With the rise of open-weight models like DeepSeek R1 and DeepSeek-Coder, developers now have the opportunity to build powerful, private, and customizable chatbots without relying on commercial cloud APIs. In this guide, we’ll walk you through the process of creating a fully functional Flask-based chatbot API using DeepSeek running locally or on a server, accessible via HTTP requests.
This article is ideal for:
Developers building AI SaaS tools
Educators and researchers wanting a secure LLM
Startups looking to save on OpenAI costs
Hackers and tinkerers interested in privacy-preserving AI
✅ Table of Contents
Introduction to the Project
Why Use Flask + DeepSeek?
Requirements and Setup
Downloading and Running DeepSeek Locally
Building the Flask App
Testing the API via Curl & Postman
Adding Frontend (Optional)
Extending Features: Streaming, Chat History
Security: API Tokens, Rate Limits
Deploying to Cloud (or LAN)
Final Tips & Troubleshooting
What’s Next? Add-ons and Upgrades
1. 🧠 Introduction to the Project
You will learn how to:
Run DeepSeek R1 locally using llama-cpp-python
Create a lightweight chatbot backend using Flask
Accept user prompts and return LLM responses
Secure and extend the chatbot for production use
Result: a chat-style API accessible from a browser, mobile app, or other services.
2. 🔍 Why Use Flask + DeepSeek?
| Feature | Reason to Use |
|---|---|
| Flask | Lightweight, easy to build APIs with, and simple to scale |
| DeepSeek R1/Coder | Open-weight, powerful, 128K context, no API costs |
| llama-cpp-python | Efficient GGUF inference on CPUs and GPUs |
| Local-first | Secure and self-hosted, no cloud needed |
No API fees, unlimited usage, and full customization power.
3. ⚙️ Requirements and Setup
System Requirements:
macOS, Linux, or WSL on Windows
16GB+ RAM recommended (for quantized model)
Python 3.10+
Python Libraries:
```bash
pip install flask flask-cors llama-cpp-python
```
4. 📥 Downloading and Running DeepSeek Locally
Step 1: Download the GGUF Model
Go to Hugging Face or TheBloke’s GGUF mirror.
Download a quantized model, e.g.:
deepseek-7b-chat.Q4_K_M.gguf
Place it in a directory: ./models/
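If you prefer the command line, huggingface-cli (from the huggingface_hub package) can fetch the file directly. The repo id and filename below are illustrative; substitute whichever GGUF repo you actually chose:

```bash
pip install -U huggingface_hub
# Repo id and filename are illustrative -- use the GGUF repo you picked,
# and make sure the final path matches what the Flask app loads
huggingface-cli download TheBloke/deepseek-llm-7B-chat-GGUF \
  deepseek-llm-7b-chat.Q4_K_M.gguf --local-dir ./models
```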
5. 🛠️ Building the Flask App
```python
from flask import Flask, request, jsonify
from flask_cors import CORS
from llama_cpp import Llama

app = Flask(__name__)
CORS(app)

# Load the DeepSeek GGUF model
llm = Llama(model_path="./models/deepseek-7b-chat.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_prompt = data.get('prompt', '')
    full_prompt = f"[INST] {user_prompt} [/INST]"
    response = llm(
        prompt=full_prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.9,
        stop=["</s>", "[INST]"]
    )
    output = response['choices'][0]['text'].strip()
    return jsonify({'response': output})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
Save this file as chatbot_api.py and start it with python chatbot_api.py; the API listens on port 5000.
6. 🧪 Testing the API via Curl or Postman
Curl:
```bash
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Tell me a joke about AI."}'
```
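The same request from Python, assuming the requests package is installed:

```python
import requests

# Post a prompt to the local chatbot API and print the reply
resp = requests.post(
    "http://localhost:5000/chat",
    json={"prompt": "Tell me a joke about AI."},
    timeout=120,  # generation can be slow on CPU
)
print(resp.json()["response"])
```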
Postman:
URL:
http://localhost:5000/chat
Method:
POST
Body:
```json
{ "prompt": "Explain black holes in simple terms." }
```
You’ll get a JSON response with the generated answer.
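Something like this, though the exact text will vary from run to run:

```json
{ "response": "A black hole is a region of space where gravity is so strong that nothing, not even light, can escape it." }
```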
7. 💻 Adding Frontend (Optional)
Here’s a basic HTML page to test it:
```html
<!DOCTYPE html>
<html>
<head>
  <title>DeepSeek Chatbot</title>
</head>
<body>
  <h1>Talk to DeepSeek</h1>
  <textarea id="prompt" rows="4" cols="60"></textarea><br>
  <button onclick="sendPrompt()">Ask</button>
  <pre id="response"></pre>
  <script>
    async function sendPrompt() {
      const prompt = document.getElementById("prompt").value;
      const res = await fetch("http://localhost:5000/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt })
      });
      const data = await res.json();
      document.getElementById("response").innerText = data.response;
    }
  </script>
</body>
</html>
```
Save as index.html and open it locally in your browser.
8. 🚀 Extending Features: Streaming & Memory
a. Enable Streaming:
In llama-cpp-python, you can pass stream=True:
```python
# Stream tokens as they are generated instead of waiting for the full reply.
# full_prompt is the formatted prompt from the /chat handler above.
for token in llm.create_completion(prompt=full_prompt, stream=True):
    print(token['choices'][0]['text'], end='', flush=True)
```
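To expose this over HTTP, one option is a chunked response built from a generator. A minimal sketch (the /chat-stream route name is my own, and true token-by-token delivery also depends on any proxy in front not buffering the response):

```python
from flask import Response, stream_with_context

@app.route('/chat-stream', methods=['POST'])
def chat_stream():
    data = request.get_json()
    full_prompt = f"[INST] {data.get('prompt', '')} [/INST]"

    def generate():
        # Yield each token's text as soon as llama-cpp-python produces it
        for chunk in llm.create_completion(prompt=full_prompt, max_tokens=512, stream=True):
            yield chunk['choices'][0]['text']

    return Response(stream_with_context(generate()), mimetype='text/plain')
```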
b. Add Chat History (Memory):
Keep a running log of the conversation and prepend it to each prompt:
```python
history = []

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_input = data.get('prompt', '')
    history.append(f"User: {user_input}")
    full_prompt = "\n".join(history) + "\nAI:"
    response = llm(prompt=full_prompt, max_tokens=512)
    answer = response['choices'][0]['text'].strip()
    history.append(f"AI: {answer}")
    return jsonify({'response': answer})
```
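Note that a single global list mixes every caller's messages together. A per-session variant of the same handler, keyed by a client-supplied session_id (a field name I'm inventing for this sketch):

```python
from collections import defaultdict

# One message log per session id; a real app would also cap or summarize old turns
sessions = defaultdict(list)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    session_id = data.get('session_id', 'default')  # hypothetical request field
    history = sessions[session_id]
    history.append(f"User: {data.get('prompt', '')}")
    full_prompt = "\n".join(history) + "\nAI:"
    response = llm(prompt=full_prompt, max_tokens=512)
    answer = response['choices'][0]['text'].strip()
    history.append(f"AI: {answer}")
    return jsonify({'response': answer})
```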
9. 🔐 Security: API Keys, Rate Limits
Add basic key-based auth:
```python
@app.before_request
def check_auth():
    api_key = request.headers.get("Authorization")
    # In production, load the expected key from an environment variable
    # rather than hardcoding it
    if api_key != "Bearer your-secret-key":
        return jsonify({"error": "Unauthorized"}), 401
```
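With the check in place, clients must send the key on every request:

```bash
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{"prompt":"Hello"}'
```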
You can also integrate:
Flask-Limiter for rate limits (see the sketch after this list)
JWT tokens for multi-user systems
Cloudflare or Nginx for IP filtering and HTTPS
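A minimal Flask-Limiter setup, assuming the package is installed (pip install Flask-Limiter) and version 3.x; the limits shown are arbitrary:

```python
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# Throttle clients by IP address; tune the limits to your hardware
limiter = Limiter(get_remote_address, app=app, default_limits=["30 per minute"])

@app.route('/chat', methods=['POST'])
@limiter.limit("10 per minute")  # stricter cap on the expensive endpoint
def chat():
    ...  # existing chat handler body from section 5
```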
10. ☁️ Deploying to the Cloud
Use platforms like:
Render (simple, free-tier friendly)
Fly.io (low latency global deploy)
Hetzner (GPU cloud for inference)
AWS EC2 (for GPU + Flask combo)
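Whichever host you pick, don't ship Flask's built-in development server. A common choice is gunicorn with a single worker, since each worker would load its own copy of the model (assuming the file is named chatbot_api.py as above):

```bash
pip install gunicorn
gunicorn -w 1 -b 0.0.0.0:5000 chatbot_api:app
```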
Docker:
```dockerfile
FROM python:3.11-slim
COPY . /app
WORKDIR /app
# Note: llama-cpp-python may compile from source; if the build fails,
# install build tools (gcc, cmake) in the image first
RUN pip install flask flask-cors llama-cpp-python
CMD ["python", "chatbot_api.py"]
```
Then:
```bash
docker build -t deepseek-chatbot .
docker run -p 5000:5000 deepseek-chatbot
```
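The Dockerfile above copies the whole project, including ./models/, into the image, which makes for a multi-gigabyte build. An alternative is to exclude the models from the build context (via .dockerignore) and mount them from the host at run time:

```bash
# Keep the image small: mount the models directory instead of baking it in
docker run -p 5000:5000 -v "$(pwd)/models:/app/models" deepseek-chatbot
```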
11. 🛠 Final Tips & Troubleshooting
| Problem | Fix |
|---|---|
| Out of memory | Use a smaller GGUF quant (Q4_K_M or Q5_0) |
| Slow response | Reduce max_tokens, use a lighter quantization |
| CORS issues | Use Flask-CORS |
| Long loading time | Pre-load the model on server start |
| Bad answers | Improve prompt formatting ([INST]...[/INST]) |
12. 🧩 What’s Next? Add-ons and Upgrades
You can build on top of this base:
✅ Add support for LangChain agents
✅ Integrate with Telegram Bot API
✅ Create a Slack assistant
✅ Add voice support via Whisper + TTS
✅ Implement streaming WebSocket responses
✅ Conclusion
This Flask + DeepSeek chatbot is a foundation for a fully controllable, open, and monetizable AI application. Unlike cloud APIs, you:
Own your model
Pay no token fees
Can customize to your industry/domain
Can deploy on any hardware, even offline