🧠 A Complete Flask-Based DeepSeek Chatbot API: Build Your Own Local AI Assistant


With the rise of open-weight models like DeepSeek R1 and DeepSeek-Coder, developers now have the opportunity to build powerful, private, and customizable chatbots without relying on commercial cloud APIs. In this guide, we’ll walk you through the process of creating a fully functional Flask-based chatbot API using DeepSeek running locally or on a server, accessible via HTTP requests.


This article is ideal for:

  • Developers building AI SaaS tools

  • Educators and researchers wanting a secure LLM

  • Startups looking to save on OpenAI costs

  • Hackers and tinkerers interested in privacy-preserving AI

Table of Contents

  1. Introduction to the Project

  2. Why Use Flask + DeepSeek?

  3. Requirements and Setup

  4. Downloading and Running DeepSeek Locally

  5. Building the Flask App

  6. Testing the API via Curl & Postman

  7. Adding Frontend (Optional)

  8. Extending Features: Streaming, Chat History

  9. Security: API Tokens, Rate Limits

  10. Deploying to Cloud (or LAN)

  11. Final Tips & Troubleshooting

  12. What’s Next? Add-ons and Upgrades

1. 🧠 Introduction to the Project

You will learn how to:

  • Run a quantized DeepSeek chat model locally using llama-cpp-python

  • Create a lightweight chatbot backend using Flask

  • Accept user prompts and return LLM responses

  • Secure and extend the chatbot for production use

Result: a chat-style API accessible from a browser, mobile app, or other services.

2. 🔍 Why Use Flask + DeepSeek?

Feature           | Reason to Use
Flask             | Lightweight, scalable, and easy to build APIs with
DeepSeek R1/Coder | Open-weight, powerful, 128K context, no API costs
llama-cpp-python  | Efficient GGUF inference for CPUs/GPUs
Local-first       | Secure and self-hosted, no cloud needed

No API fees, unlimited usage, and full customization power.

3. ⚙️ Requirements and Setup

System Requirements:

  • macOS, Linux, or WSL on Windows

  • 16GB+ RAM recommended (for quantized model)

  • Python 3.10+

Python Libraries:

bash
pip install flask flask-cors llama-cpp-python
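
By default llama-cpp-python is built for the CPU. If you have a supported GPU, you can enable acceleration at install time via CMake flags. A hedged example for CUDA (the exact flag name varies by version; older releases used -DLLAMA_CUBLAS=on):

bash
# Rebuild llama-cpp-python with CUDA support (flag name depends on your version)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall llama-cpp-python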

4. 📥 Downloading and Running DeepSeek Locally

Step 1: Download the GGUF Model

Go to Hugging Face or TheBloke’s GGUF mirror.

Download a quantized model, e.g.:

deepseek-7b-chat.Q4_K_M.gguf

Place it in a directory: ./models/
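
Step 2: Verify the Model Runs

Before wiring up Flask, it's worth a quick smoke test that the model loads and generates. A minimal sketch (adjust n_threads to your CPU core count):

python
from llama_cpp import Llama

# Load the quantized model and generate a short reply as a smoke test
llm = Llama(model_path="./models/deepseek-7b-chat.Q4_K_M.gguf", n_ctx=4096, n_threads=8)
out = llm("[INST] Say hello in one sentence. [/INST]", max_tokens=64)
print(out['choices'][0]['text'].strip())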

5. 🛠️ Building the Flask App

python
from flask import Flask, request, jsonify
from flask_cors import CORS
from llama_cpp import Llama

app = Flask(__name__)
CORS(app)

# Load the DeepSeek GGUF model once at startup
llm = Llama(model_path="./models/deepseek-7b-chat.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_prompt = data.get('prompt', '')

    full_prompt = f"[INST] {user_prompt} [/INST]"

    response = llm(
        prompt=full_prompt,
        max_tokens=512,
        temperature=0.7,
        top_p=0.9,
        stop=["</s>", "[INST]"]
    )

    output = response['choices'][0]['text'].strip()
    return jsonify({'response': output})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Save this file as chatbot_api.py and start the server with python chatbot_api.py; the API will listen on port 5000.

6. 🧪 Testing the API via Curl or Postman

Curl:

bash
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Tell me a joke about AI."}'

Postman:

  • URL: http://localhost:5000/chat

  • Method: POST

  • Body:

json
{
  "prompt": "Explain black holes in simple terms."
}

You’ll get a JSON response with the generated answer.

7. 💻 Adding Frontend (Optional)

Here’s a basic HTML page to test it:

html
<!DOCTYPE html>
<html>
<head>
  <title>DeepSeek Chatbot</title>
</head>
<body>
  <h1>Talk to DeepSeek</h1>
  <textarea id="prompt" rows="4" cols="60"></textarea><br>
  <button onclick="sendPrompt()">Ask</button>
  <pre id="response"></pre>
  <script>
    async function sendPrompt() {
      const prompt = document.getElementById("prompt").value;
      const res = await fetch("http://localhost:5000/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt })
      });
      const data = await res.json();
      document.getElementById("response").innerText = data.response;
    }
  </script>
</body>
</html>

Save as index.html and open it locally in your browser.

8. 🚀 Extending Features: Streaming & Memory

a. Enable Streaming:

In llama-cpp-python, you can pass stream=True:

python
for chunk in llm.create_completion(prompt=..., stream=True):
    # Each chunk carries a text fragment in choices[0]['text']
    print(chunk['choices'][0]['text'], end='', flush=True)
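
To stream over HTTP, wrap a generator in a Flask Response. A minimal sketch that reuses the app and llm objects from chatbot_api.py (the /chat-stream route name is our own choice, not part of the API above):

python
from flask import Response

@app.route('/chat-stream', methods=['POST'])
def chat_stream():
    data = request.get_json()
    full_prompt = f"[INST] {data.get('prompt', '')} [/INST]"

    def generate():
        # Forward each text fragment to the client as soon as the model produces it
        for chunk in llm.create_completion(prompt=full_prompt, max_tokens=512,
                                           stop=["</s>", "[INST]"], stream=True):
            yield chunk['choices'][0]['text']

    return Response(generate(), mimetype='text/plain')

With curl's -N flag (no buffering), you can watch the tokens arrive incrementally instead of waiting for the full reply.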

b. Add Chat History (Memory):

Keep a running transcript and prepend it to each prompt. The simplest version uses one global list shared by all clients (a per-session variant follows after this block):

python
# NOTE: one global list means every client shares the same conversation.
history = []

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    user_input = data.get('prompt', '')

    history.append(f"User: {user_input}")
    full_prompt = "\n".join(history) + "\nAI:"

    response = llm(prompt=full_prompt, max_tokens=512)
    answer = response['choices'][0]['text'].strip()
    history.append(f"AI: {answer}")
    return jsonify({'response': answer})

9. 🔐 Security: API Keys, Rate Limits

Add basic key-based auth:

python
@app.before_request
def check_auth():
    # In production, load the secret from an environment variable instead of hard-coding it
    api_key = request.headers.get("Authorization")
    if api_key != "Bearer your-secret-key":
        return jsonify({"error": "Unauthorized"}), 401

You can also integrate:

  • Flask-Limiter for rate limits (see the sketch after this list)

  • JWT tokens for multi-user systems

  • Cloudflare or Nginx for IP filtering and HTTPS
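
As an example, a minimal Flask-Limiter setup, shown in the 3.x constructor style (2.x took the app as the first argument; the limits here are illustrative):

python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

# Throttle clients by IP address
limiter = Limiter(get_remote_address, app=app, default_limits=["30 per minute"])

@app.route('/chat', methods=['POST'])
@limiter.limit("10 per minute")  # tighter cap on the expensive LLM endpoint
def chat():
    ...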

10. ☁️ Deploying to the Cloud

Use platforms like:

  • Render (simple, free-tier friendly)

  • Fly.io (low latency global deploy)

  • Hetzner (GPU cloud for inference)

  • AWS EC2 (for GPU + Flask combo)

  • Docker:

dockerfile
FROM python:3.11-slim
# llama-cpp-python compiles from source, so install a build toolchain first
RUN apt-get update && apt-get install -y --no-install-recommends build-essential cmake \
    && rm -rf /var/lib/apt/lists/*
COPY . /app
WORKDIR /app
RUN pip install flask flask-cors llama-cpp-python
CMD ["python", "chatbot_api.py"]

Then:

bash
docker build -t deepseek-chatbot .
docker run -p 5000:5000 deepseek-chatbot
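
If you prefer to keep the multi-gigabyte model file out of the image, mount the models directory at runtime instead (paths assume the ./models/ layout from Section 4):

bash
# Mount the local models directory rather than baking it into the image
docker run -p 5000:5000 -v "$(pwd)/models:/app/models" deepseek-chatbot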

11. 🛠 Final Tips & Troubleshooting

Problem           | Fix
Out of memory     | Use a smaller GGUF quant (e.g. Q4_K_M or Q5_0)
Slow responses    | Reduce max_tokens, use a lighter quant
CORS issues       | Enable Flask-CORS
Long loading time | Pre-load the model on server start (as chatbot_api.py does)
Bad answers       | Improve prompt formatting ([INST] ... [/INST])

12. 🧩 What’s Next? Add-ons and Upgrades

You can build on top of this base:

  • ✅ Add support for LangChain agents

  • ✅ Integrate with Telegram Bot API

  • ✅ Create a Slack assistant

  • ✅ Add voice support via Whisper + TTS

  • ✅ Implement streaming WebSocket responses

Conclusion

This Flask + DeepSeek chatbot is a foundation for a fully controllable, open, and monetizable AI application. Unlike cloud APIs, you:

  • Own your model

  • Pay no token fees

  • Can customize to your industry/domain

  • Can deploy on any hardware, even offline