🧠 A Complete Flask-Based DeepSeek Chatbot API: Build Your Own Local AI Assistant
With the rise of open-weight models like DeepSeek R1 and DeepSeek-Coder, developers now have the opportunity to build powerful, private, and customizable chatbots without relying on commercial cloud APIs. In this guide, we’ll walk you through the process of creating a fully functional Flask-based chatbot API using DeepSeek running locally or on a server, accessible via HTTP requests.

This article is ideal for:
Developers building AI SaaS tools
Educators and researchers wanting a secure LLM
Startups looking to save on OpenAI costs
Hackers and tinkerers interested in privacy-preserving AI
✅ Table of Contents
Introduction to the Project
Why Use Flask + DeepSeek?
Requirements and Setup
Downloading and Running DeepSeek Locally
Building the Flask App
Testing the API via Curl & Postman
Adding Frontend (Optional)
Extending Features: Streaming, Chat History
Security: API Tokens, Rate Limits
Deploying to Cloud (or LAN)
Final Tips & Troubleshooting
What’s Next? Add-ons and Upgrades
1. 🧠 Introduction to the Project
You will learn how to:
Run DeepSeek R1 locally using llama-cpp-python
Create a lightweight chatbot backend using Flask
Accept user prompts and return LLM responses
Secure and extend the chatbot for production use
Result: a chat-style API accessible from a browser, mobile app, or other services.
2. 🔍 Why Use Flask + DeepSeek?
| Feature | Reason to Use |
|---|---|
| Flask | Lightweight, scalable, and easy to build APIs with |
| DeepSeek R1/Coder | Open-weight, powerful, 128K context, no API costs |
| llama-cpp-python | Efficient GGUF inference for CPUs/GPUs |
| Local-first | Secure and self-hosted, no cloud needed |
No API fees, unlimited usage, and full customization power.
3. ⚙️ Requirements and Setup
System Requirements:
macOS, Linux, or WSL on Windows
16GB+ RAM recommended (for quantized model)
Python 3.10+
Python Libraries:
```bash
pip install flask flask-cors llama-cpp-python
```
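If you want GPU offloading, llama-cpp-python is usually reinstalled with CMake flags for your backend. The flag below is only an example and varies by version and platform (CUDA, Metal, ROCm); check the llama-cpp-python README for the exact incantation:

```bash
# Example only: flag names differ between llama-cpp-python versions and backends.
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```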
4. 📥 Downloading and Running DeepSeek Locally
Step 1: Download the GGUF Model
Go to Hugging Face or TheBloke’s GGUF mirror.
Download a quantized model, e.g.:
deepseek-7b-chat.Q4_K_M.gguf
Place it in a directory: ./models/
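Before wiring this into Flask, it helps to confirm the model loads and generates text. Here is a minimal sketch, assuming the example file name and path above; adjust model_path to whatever quantization you actually downloaded:

```python
# Quick sanity check: load the GGUF file and generate a short completion.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/deepseek-7b-chat.Q4_K_M.gguf",  # example path from this guide
    n_ctx=4096,
    n_threads=8,
)
out = llm("[INST] Say hello in one sentence. [/INST]", max_tokens=64)
print(out["choices"][0]["text"].strip())
```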
5. 🛠️ Building the Flask App
```python
from flask import Flask, request, jsonify
from flask_cors import CORS
from llama_cpp import Llama

app = Flask(__name__)
CORS(app)

# Load the DeepSeek GGUF model once at startup
llm = Llama(model_path="./models/deepseek-7b-chat.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_prompt = data.get('prompt', '')
    full_prompt = f"[INST] {user_prompt} [/INST]"

    response = llm(
        prompt=full_prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.9,
        stop=["</s>", "[INST]"]
    )

    output = response['choices'][0]['text'].strip()
    return jsonify({'response': output})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
Save this file as chatbot_api.py
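Then start the server. The model is loaded once at startup, which can take a moment for larger GGUF files, after which the API listens on port 5000 on all interfaces:

```bash
python chatbot_api.py
```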
6. 🧪 Testing the API via Curl or Postman
Curl:
```bash
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Tell me a joke about AI."}'
```
Postman:
URL: http://localhost:5000/chat
Method: POST
Body (raw JSON):

```json
{
  "prompt": "Explain black holes in simple terms."
}
```
You’ll get a JSON response with the generated answer.
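The reply shape mirrors the jsonify call in chatbot_api.py; only the generated text itself varies:

```json
{
  "response": "<generated answer from the model>"
}
```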
7. 💻 Adding Frontend (Optional)
Here’s a basic HTML page to test it:
```html
<!DOCTYPE html>
<html>
<head>
  <title>DeepSeek Chatbot</title>
</head>
<body>
  <h1>Talk to DeepSeek</h1>
  <textarea id="prompt" rows="4" cols="60"></textarea><br>
  <button onclick="sendPrompt()">Ask</button>
  <pre id="response"></pre>
  <script>
    async function sendPrompt() {
      const prompt = document.getElementById("prompt").value;
      const res = await fetch("http://localhost:5000/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt })
      });
      const data = await res.json();
      document.getElementById("response").innerText = data.response;
    }
  </script>
</body>
</html>
```
Save as index.html and open it locally in your browser.
8. 🚀 Extending Features: Streaming & Memory
a. Enable Streaming:
In llama-cpp-python, you can pass stream=True:
```python
# Inside chat(): iterate over streamed chunks instead of waiting for the full completion
for chunk in llm.create_completion(prompt=full_prompt, stream=True):
    print(chunk['choices'][0]['text'], end='', flush=True)
```
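Printing only streams to the server's terminal. To push tokens to HTTP clients as they are generated, one option is a Flask streaming response. This is a minimal sketch, not part of the original app, and the /chat/stream route name is made up here:

```python
from flask import Response, stream_with_context

@app.route('/chat/stream', methods=['POST'])
def chat_stream():
    data = request.get_json()
    full_prompt = f"[INST] {data.get('prompt', '')} [/INST]"

    def generate():
        # Yield each text fragment as llama-cpp-python produces it
        for chunk in llm.create_completion(prompt=full_prompt, max_tokens=512, stream=True):
            yield chunk['choices'][0]['text']

    return Response(stream_with_context(generate()), mimetype='text/plain')
```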
b. Add Chat History (Memory):
Keep a running chat log (the snippet below uses a single global list for simplicity; a per-session sketch follows after it):
```python
history = []

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_input = data.get('prompt', '')
    history.append(f"User: {user_input}")

    full_prompt = "\n".join(history) + "\nAI:"
    response = llm(prompt=full_prompt, max_tokens=512)
    answer = response['choices'][0]['text'].strip()

    history.append(f"AI: {answer}")
    return jsonify({'response': answer})
```
9. 🔐 Security: API Keys, Rate Limits
Add basic key-based auth:
```python
@app.before_request
def check_auth():
    api_key = request.headers.get("Authorization")
    if api_key != "Bearer your-secret-key":
        return jsonify({"error": "Unauthorized"}), 401
```
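Clients then send the key in the Authorization header. The secret below is the placeholder from the snippet above; in practice, load it from an environment variable rather than hard-coding it:

```bash
curl -X POST http://localhost:5000/chat \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello"}'
```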
You can also integrate:
Flask-Limiter for rate limits (see the sketch after this list)
JWT tokens for multi-user systems
Cloudflare or Nginx for IP filtering and HTTPS
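As an example of the first item, here is a minimal Flask-Limiter sketch, assuming Flask-Limiter 3.x and per-IP limiting; the limit values are placeholders:

```python
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# Rate-limit clients by IP address; the default applies to every route
limiter = Limiter(get_remote_address, app=app, default_limits=["60 per hour"])
```

Individual routes such as /chat can then be tightened further with the @limiter.limit(...) decorator.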
10. ☁️ Deploying to the Cloud
Use platforms like:
Render (simple, free-tier friendly)
Fly.io (low latency global deploy)
Hetzner (GPU cloud for inference)
AWS EC2 (for GPU + Flask combo)
Docker:
```dockerfile
FROM python:3.11-slim
COPY . /app
WORKDIR /app
RUN pip install flask flask-cors llama-cpp-python
CMD ["python", "chatbot_api.py"]
```
Then:
```bash
docker build -t deepseek-chatbot .
docker run -p 5000:5000 deepseek-chatbot
```
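Two practical notes: pip install llama-cpp-python typically compiles from source, so the slim base image may need build tools (gcc, cmake) installed first; and COPY . /app pulls the entire build context into the image, including a multi-gigabyte GGUF file if it sits in ./models. One common alternative, an assumption rather than part of the original setup, is to keep the model out of the image and mount it at run time:

```bash
docker run -p 5000:5000 -v "$(pwd)/models:/app/models" deepseek-chatbot
```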
11. 🛠 Final Tips & Troubleshooting
| Problem | Fix |
|---|---|
| Out of memory | Use smaller GGUF (Q4_K_M or Q5_0) |
| Slow response | Reduce max_tokens, optimize quant |
| CORS issues | Use Flask-CORS |
| Long loading time | Pre-load model on server start |
| Bad answers | Improve prompt formatting ([INST]...[/INST]) |
12. 🧩 What’s Next? Add-ons and Upgrades
You can build on top of this base:
✅ Add support for LangChain agents
✅ Integrate with Telegram Bot API
✅ Create a Slack assistant
✅ Add voice support via Whisper + TTS
✅ Implement streaming WebSocket responses
✅ Conclusion
This Flask + DeepSeek chatbot is a foundation for a fully controllable, open, and monetizable AI application. Unlike cloud APIs, you:
Own your model
Pay no token fees
Can customize to your industry/domain
Can deploy on any hardware, even offline