Benchmarking GPT‑4.0 vs DeepSeek‑V3 for Code-Smell Detection: Accuracy, Cost, and Practical Guidance
1. Introduction
Code smells—subtle structural indicators of deeper problems—undermine long-term maintainability, readability, and extensibility. Traditional static analysis tools like SonarQube, PMD, and ESLint detect many common issues, but they often miss complex contextual or semantic smells. With the rise of large language models (LLMs), especially GPT‑4.0 and DeepSeek‑V3, developers now have an opportunity to harness natural language understanding to analyze code more holistically.
This study rigorously compares GPT‑4.0 and DeepSeek‑V3 for code-smell detection across Java, Python, JavaScript, and C++. Beyond classification performance (precision, recall, F1), it evaluates cost efficiency against LLM pricing and static-analysis tools. Our structured methodology offers insights into what works, when to rely on LLMs, and where traditional tools still shine.
2. Understanding Code Smells
Types of code smells considered:
Long Method / Function: excessively long routines
God Class: monolithic classes with too many responsibilities
Duplicated Code: copy-paste redundancy
Feature Envy: methods overly dependent on other classes
Magic Numbers: use of hard-coded literals
Dead Code: unused or unreachable code
Complex Conditionals: over-nested if statements
Resource Leaks: missing cleanup for files, DB connections, sockets
Large Parameter List: methods taking too many arguments
These nine categories encapsulate a broad spectrum—from structural design to maintainability hazards. Our dataset provides 12,000 code snippets (3,000 per language), hand-annotated for these smells, with binary labels and source context.
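For concreteness, one annotated record might look like the sketch below; the field names and values are illustrative assumptions, since the dataset is described here only as binary labels plus source context.

```python
# Illustrative (hypothetical) structure for one annotated snippet.
# Field names are assumptions; the study specifies only binary labels plus source context.
example_record = {
    "id": "java-000123",
    "language": "Java",
    "source": "public int calc(int a, int b, int c, int d, int e, int f) { ... }",
    "labels": {                      # one binary label per smell category
        "long_method": 0,
        "god_class": 0,
        "duplicated_code": 0,
        "feature_envy": 0,
        "magic_numbers": 0,
        "dead_code": 0,
        "complex_conditionals": 0,
        "resource_leaks": 0,
        "large_parameter_list": 1,
    },
}
```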
3. Evaluation Methodology
3.1 Pre-processing & Prompt Construction
GPT‑4.0: Prompt-based detection; each prompt contains the code snippet and asks the model to list the smells it detects.
DeepSeek‑V3: Similar prompt structure, with the added option to insert lightweight examples to demonstrate desired output formats.
Example prompt style:
```
Analyze the following Java method and list all code smells present:

<code snippet>

Output: [smell1, smell2, ...]
```
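As a minimal sketch, such a prompt could be issued programmatically along the lines below, here using the OpenAI Python client; the model identifier and the output parsing are assumptions, not details taken from the study.

```python
# Minimal sketch (assumed client usage; model name and parsing are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def detect_smells(snippet: str, language: str = "Java") -> list[str]:
    prompt = (
        f"Analyze the following {language} method and list all code smells present:\n\n"
        f"{snippet}\n\n"
        "Output: [smell1, smell2, ...]"
    )
    response = client.chat.completions.create(
        model="gpt-4o",               # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                # deterministic output for benchmarking
    )
    raw = response.choices[0].message.content.strip()
    # Expect a bracketed, comma-separated list; fall back to an empty list otherwise.
    if raw.startswith("[") and raw.endswith("]"):
        return [s.strip() for s in raw[1:-1].split(",") if s.strip()]
    return []
```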
3.2 Metrics
Calculated per-snippet:
Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Macro‑averaging over categories provides balanced insight; micro‑averaging captures overall effectiveness across all instances.
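These metrics can be computed directly from the binary labels. The sketch below assumes a multilabel matrix of shape (snippets × smell categories) and leans on scikit-learn rather than the study's own tooling.

```python
# Sketch: per-category and averaged metrics from binary ground truth vs. predictions.
from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """y_true, y_pred: 2-D 0/1 arrays of shape (n_snippets, n_smell_categories)."""
    per_cat = precision_recall_fscore_support(y_true, y_pred, average=None, zero_division=0)
    macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
    return {"per_category": per_cat, "macro": macro[:3], "micro": micro[:3]}
```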
3.3 Cost Accounting
GPT‑4.0: Charged per input and output token at list prices.
DeepSeek‑V3: Free/open but incurs fixed infrastructure cost per invocation.
SonarQube (and classics): One-time installation and compute cost (RAM/CPU).
We evaluate cost per 1,000 snippets, factoring in model runtime and deployment.
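A back-of-the-envelope sketch of this cost model is shown below; the token counts and per-million-token prices are placeholders, not the rates used in the study.

```python
# Back-of-the-envelope cost model (all rates below are placeholder assumptions).
def api_cost_per_1000_snippets(avg_input_tokens: float,
                               avg_output_tokens: float,
                               price_in_per_1m: float,
                               price_out_per_1m: float) -> float:
    """Cost in USD for 1,000 snippets at the given per-million-token prices."""
    per_snippet = (avg_input_tokens * price_in_per_1m +
                   avg_output_tokens * price_out_per_1m) / 1_000_000
    return per_snippet * 1000

def amortized_infra_cost_per_1000_snippets(hourly_rate: float,
                                           snippets_per_hour: float) -> float:
    """Self-hosted model or SonarQube server: amortize compute over throughput."""
    return hourly_rate / snippets_per_hour * 1000

# Example with placeholder numbers: ~350 input + ~50 output tokens per snippet.
print(api_cost_per_1000_snippets(350, 50, 3.0, 10.0))
```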
4. Overall Performance
Model | Precision | Recall | F1-score |
---|---|---|---|
GPT‑4.0 | 0.82 | 0.79 | 0.805 |
DeepSeek‑V3 | 0.84 | 0.75 | 0.79 |
SonarQube (baseline) | 0.78 | 0.60 | 0.68 |
Insights:
Both LLMs outperform SonarQube in detecting contextual smells (God Class, Feature Envy, Long Method).
GPT‑4.0 trades some recall for fewer false positives.
DeepSeek‑V3 emphasizes coverage at slight cost to precision.
5. Category-Level Findings
5.1 Long Method
GPT‑4.0: Precision 0.88, Recall 0.85
DeepSeek‑V3: Precision 0.86, Recall 0.81
SonarQube: Precision 0.78, Recall 0.62
5.2 God Class
GPT‑4.0: 0.80 / 0.74
DeepSeek‑V3: 0.83 / 0.71
SonarQube: 0.76 / 0.58
5.3 Feature Envy
GPT‑4.0: 0.81 / 0.72
DeepSeek‑V3: 0.85 / 0.68
SonarQube: 0.79 / 0.55
5.4 Magic Numbers
GPT‑4.0: 0.75 / 0.80
DeepSeek‑V3: 0.79 / 0.83
SonarQube: 0.90 / 0.64
GPT‑4.0 excels where strict definitions exist; DeepSeek‑V3 outperforms on stylistic/magic-number smells, picking up unusual literals.
6. Language Breakdown
Performance across languages:
Java: GPT‑4.0 (F1 0.81), DeepSeek‑V3 (0.80), SonarQube (0.66)
Python: GPT‑4.0 (0.79), DeepSeek‑V3 (0.78), SonarQube (0.70)
JavaScript: GPT‑4.0 (0.78), DeepSeek‑V3 (0.76), baseline (0.65)
C++: GPT‑4.0 (0.82), DeepSeek‑V3 (0.81), baseline (0.67)
Strong cross-language performance confirms versatility—LLMs adapt well without language-specific tuning.
7. Smell Type Detail
A deeper dive into a few smells:
Dead Code: GPT‑4.0 caught unreachable branches; DeepSeek‑V3 sometimes flagged unused if stubs.
Duplicated Code: GPT‑4.0 flagged near-identical logic; DeepSeek‑V3 linked functionally similar blocks.
Resource Leaks: Both excel at file handle leaks, but GPT‑4.0 flagged missing exception handling better.
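For illustration, a hypothetical resource-leak snippet (not drawn from the benchmark dataset) looks like this: the handle leaks whenever parsing raises, the same missing-exception-handling pattern noted above.

```python
import json

# Hypothetical example of a resource-leak smell (not from the benchmark dataset).
def read_config(path):
    f = open(path)             # smell: handle is never closed if json.load() raises
    settings = json.load(f)    # no try/finally around the risky parse step
    f.close()
    return settings

# Smell-free version: the context manager closes the file even on exceptions.
def read_config_safe(path):
    with open(path) as f:
        return json.load(f)
```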
8. Error Analysis
Common misfires:
False Positives: DeepSeek‑V3 occasionally labeled long comments as "long method." GPT‑4.0 sometimes flagged benign conditional complexity unnecessarily.
False Negatives: Nested loops without comments caused both to miss depth-of-logic smells; ambiguous formatting masked duplicates.
Context-aware prompting helped reduce errors: minimal examples improved structured responses, especially for complex logic.
9. Cost-Effectiveness Analysis
Model costs (per 1,000 snippets):
GPT‑4.0: ~$1.60 (assuming ~400 tokens processed per snippet, i.e. roughly 400K tokens per 1,000 snippets, implying a blended rate of about $4 per million tokens)
DeepSeek‑V3: $0.50–$1.00 infrastructure
SonarQube: ~$0.05 on a cloud VM; effectively free when run locally
LLMs are costlier but deliver richer detection. SonarQube remains a cost-effective tool for basic use.
10. Hybrid Deployment Patterns
Best-practice workflow:
Run SonarQube for obvious smells.
Route ambiguous/large snippets to LLM for contextual analysis.
Combine results, using LLM feedback to decide developer review or refactor suggestions.
Use prompt logs to retrain lightweight classifiers for automation.
This balances cost, speed, and deep detection coverage.
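A sketch of this triage logic, under stated assumptions: run_sonarqube and run_llm_review stand in for a real SonarQube scan and an LLM call such as the one sketched in Section 3.1, and the routing thresholds are illustrative.

```python
# Hybrid triage sketch. run_sonarqube() and run_llm_review() are placeholders
# for a real SonarQube scan and an LLM call (e.g. the detect_smells() sketch above).
def hybrid_review(snippet: str, run_sonarqube, run_llm_review,
                  ambiguity_threshold: int = 1, size_threshold: int = 60) -> dict:
    static_findings = run_sonarqube(snippet)          # cheap first pass
    needs_llm = (
        len(static_findings) <= ambiguity_threshold   # weak static signal: contextual smells may be missed
        or snippet.count("\n") > size_threshold       # large snippet: likely design-level smells
    )
    llm_findings = run_llm_review(snippet) if needs_llm else []
    combined = sorted(set(static_findings) | set(llm_findings))
    return {
        "static": static_findings,
        "llm": llm_findings,
        "combined": combined,
        "recommend_review": bool(llm_findings),       # route LLM-flagged snippets to a developer
    }
```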
11. Prompting Strategy Matters
Prompt variant A (simple prompt) vs B (with example snippet):
GPT‑4.0 F1: A=0.78, B=0.82
DeepSeek‑V3 F1: A=0.75, B=0.79
A single exemplar improved DeepSeek‑V3's F1 by about four points (~5% relative). Continual learning could reduce reliance on prompt engineering over time.
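The two variants can be sketched as follows; the exemplar in variant B is a made-up illustration of a "lightweight example," not the exact exemplar used in the study.

```python
# Sketch of the two prompt variants; the exemplar in variant B is illustrative only.
EXEMPLAR = (
    "Example:\n"
    "public int pay(int a, int b, int c, int d, int e, int f) { return a * 86400 + b; }\n"
    "Output: [large parameter list, magic numbers]\n\n"
)

def build_prompt(snippet: str, language: str = "Java", with_exemplar: bool = False) -> str:
    instruction = (
        f"Analyze the following {language} method and list all code smells present:\n\n"
        f"{snippet}\n\n"
        "Output: [smell1, smell2, ...]"
    )
    # Variant A: instruction only.  Variant B: one lightweight exemplar prepended.
    return (EXEMPLAR + instruction) if with_exemplar else instruction
```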
12. Limitations & Threats
Dataset scope: only 12K code snippets; industry-scale validation is still required.
LLM versioning and token price changes may affect results.
Prompt sensitivity: vague instructions lead to inconsistent outputs.
Security concerns: snippet handling and code privacy require secure deployment.
Edge cases like autogenerated or minified code pose challenges for both LLMs and static tools.
13. Discussion
LLMs dramatically surpass rule-based static tools for nuanced smells.
GPT‑4.0 demonstrates higher precision, useful to minimize false alarms.
DeepSeek‑V3 excels when recall matters, catching a broader variety of smells.
Static analyzers remain efficient for straightforward issues.
Prompt-driven, interpretive detection is the next frontier in code-quality tooling.
14. How to Choose
Budget-constrained teams focused on routine maintainability checks: start with open-source LLMs like DeepSeek‑V3.
Enterprise-grade teams requiring precision: GPT‑4.0 is better suited.
Cost-sensitive, high-volume scanning: a hybrid workflow is recommended.
15. Future Directions
Prompt refinement via retrieval-augmented prompting
Fine-tuning DeepSeek‑V3 on extended smell-labeled corpora
Multimodal analysis combining code and documentation
IDE integration for real-time LLM review
Continuous retraining leveraging developer feedback loops
16. Takeaways
LLMs reliably detect design and contextual code smells.
GPT‑4.0 trades recall for precision; DeepSeek‑V3 trades precision for broader coverage.
Static tools remain valuable for foundational scans; LLMs fill the detection gap.
Infrastructure cost is higher but manageable within a hybrid pipeline.
Prompt strategy and model tuning significantly impact results.
17. Conclusion
Our benchmarking positions LLMs as powerful additions to code-quality ecosystems. GPT‑4.0 and DeepSeek‑V3 both significantly outperform static analyzers on nuanced smell detection, providing richer, context-aware insights. While cost remains higher, the strategic use of prompting and hybrid workflows offers an effective balance. Ultimately, LLMs represent the future of intelligent, cost-aware code quality assurance—but must be paired wisely with traditional techniques to deliver practical value.