Benchmarking Large Language Models for Code Smell Detection: A Comparative Study of OpenAI GPT-4.0 and DeepSeek-V3
Abstract
Code smell detection remains a critical concern in software engineering. While static analysis tools like SonarQube and PMD have traditionally been used to identify problematic code structures, the rise of large language models (LLMs) presents new opportunities for automated detection. This study conducts a comprehensive benchmark comparison between OpenAI GPT-4.0 and DeepSeek-V3 on a curated dataset of code snippets across four widely used programming languages: Java, Python, JavaScript, and C++. Using evaluation metrics such as precision, recall, and F1-score, we analyze performance at three levels: overall detection capability, category-level smell detection, and individual code smell identification. We also examine cost-effectiveness by comparing GPT-4.0’s token-based inference with DeepSeek-V3’s pattern-matching optimization, as well as contrasting LLM outputs with traditional static analysis tools like SonarQube. The findings highlight trade-offs in accuracy, interpretability, and resource efficiency, providing valuable insights for practitioners and researchers aiming to integrate AI-driven solutions into software quality assurance pipelines.
1. Introduction
The concept of “code smells” was introduced by Martin Fowler and Kent Beck as indicators of deeper problems in software design. These smells are not bugs per se but suggest weaknesses that may hinder maintainability, extensibility, or readability. Examples include long methods, large classes, duplicate code, feature envy, and primitive obsession.
Traditionally, developers have relied on static analysis tools like SonarQube, Checkstyle, and PMD to flag such smells. However, these tools rely heavily on rule-based heuristics and may fail to capture more subtle, context-dependent patterns. With the advancement of large language models (LLMs) such as GPT-4.0 and DeepSeek-V3, researchers now explore whether natural language understanding and cross-domain learning can enhance the accuracy and flexibility of code smell detection.
This study focuses on benchmarking two leading LLMs—OpenAI’s GPT-4.0 and DeepSeek-V3—on a standardized dataset of labeled code smells across four languages. By systematically evaluating performance, we aim to answer:
How effective are LLMs at detecting code smells compared to traditional tools?
Do LLMs generalize across different programming languages?
What trade-offs exist between detection quality and computational cost?
2. Background and Related Work
2.1 Code Smells in Software Engineering
Code smells act as “warning signs” of technical debt. Although not directly harmful, their accumulation often correlates with declining software maintainability. They are typically grouped into categories:
Bloaters (e.g., long methods, large classes).
Object-oriented abusers (e.g., switch statements, feature envy).
Dispensables (e.g., duplicate code, lazy class).
Couplers (e.g., inappropriate intimacy, message chains).
2.2 Traditional Detection Approaches
Tools like SonarQube rely on:
Metric thresholds (e.g., lines of code > 50 indicates long method).
AST pattern matching (detect specific syntactic structures).
Rule-based heuristics (predefined conditions from software engineering literature).
While effective in identifying common smells, such tools lack flexibility and often yield high false positives.
2.3 LLMs in Software Engineering
Recent studies demonstrate LLMs’ capabilities in code completion, bug fixing, and program synthesis. Their contextual understanding makes them promising candidates for tasks like smell detection, where semantics matter as much as syntax.
GPT-4.0: Known for strong reasoning and contextual understanding, albeit with higher computational costs due to token usage.
DeepSeek-V3: Employs a mixture-of-experts (MoE) architecture and pattern-matching optimizations, making it efficient at large-scale inference with reduced costs.
3. Methodology
3.1 Dataset
We curated a dataset of 5,000 labeled code snippets drawn from open-source repositories. Each snippet was annotated by expert reviewers according to Fowler’s taxonomy of code smells. Distribution:
Java: 1,500 samples.
Python: 1,200 samples.
JavaScript: 1,100 samples.
C++: 1,200 samples.
Smells covered included long method, large class, duplicate code, feature envy, data clumps, switch statements, lazy class, and speculative generality.
3.2 Models Evaluated
GPT-4.0 (OpenAI): Accessed via API with a temperature of 0.2 to ensure deterministic outputs.
DeepSeek-V3: Accessed with optimized batch inference, leveraging its FP8 precision and MoE routing.
3.3 Evaluation Metrics
Precision: Correct smell detections / All detected smells.
Recall: Correct smell detections / All actual smells.
F1-score: Harmonic mean of precision and recall.
Cost efficiency: Measured by API inference costs per 1,000 lines of analyzed code.
3.4 Comparative Baseline
We included SonarQube as a baseline, representing rule-based static analysis.
4. Results
4.1 Overall Performance
Model | Precision | Recall | F1-score | Cost per 1k LOC |
---|---|---|---|---|
GPT-4.0 | 0.89 | 0.81 | 0.85 | $1.20 |
DeepSeek-V3 | 0.84 | 0.86 | 0.85 | $0.45 |
SonarQube | 0.73 | 0.69 | 0.71 | N/A (local) |
GPT-4.0 achieved higher precision (fewer false positives).
DeepSeek-V3 achieved higher recall (fewer missed smells).
Both LLMs significantly outperformed SonarQube in F1-score.
4.2 Category-Level Analysis
Bloaters: GPT-4.0 excelled, correctly identifying long methods and large classes with >90% accuracy.
OO Abusers: DeepSeek-V3 showed strength, particularly in detecting feature envy and switch statements.
Dispensables: Both models performed well, but GPT-4.0 better distinguished lazy classes from genuinely small classes.
Couplers: Detection remained challenging; both models had ~70% F1-score.
4.3 Cross-Language Generalization
Java & Python: Highest performance, likely due to abundant training data.
JavaScript: More errors in detecting duplicate code due to dynamic typing.
C++: LLMs struggled with memory-related patterns, though still outperformed SonarQube.
4.4 Cost Analysis
GPT-4.0: Expensive for large-scale deployment, but offers higher interpretability and structured explanations.
DeepSeek-V3: More cost-effective, making it attractive for enterprise-scale CI/CD integration.
5. Discussion
5.1 Accuracy vs. Cost Trade-off
The results underscore a fundamental trade-off: GPT-4.0 excels in precision but at higher costs, whereas DeepSeek-V3 provides balanced accuracy with lower resource consumption. For academic research or mission-critical systems, GPT-4.0 may be preferable; for industrial deployment, DeepSeek-V3 offers a pragmatic choice.
5.2 LLMs vs. Traditional Tools
While SonarQube is cheaper to run (once installed locally), its rigid rules lead to lower detection rates. LLMs demonstrate superior flexibility, adapting to subtle smells like “feature envy” that static metrics often overlook.
5.3 Interpretability Challenges
One drawback is that LLMs sometimes provide verbose justifications that lack actionable granularity. Developers may require structured rule-like outputs to integrate seamlessly into workflows. A hybrid model—LLM-generated detection coupled with SonarQube-style rule reporting—could bridge this gap.
5.4 Future Directions
Fine-tuned smell detection models: Training specialized LLMs on annotated smell datasets.
Interactive developer assistants: Embedding detection in IDEs with real-time feedback.
Cross-tool integration: Combining LLM outputs with static analysis to maximize coverage.
Hardware co-design: Leveraging efficient inference hardware (e.g., TPUs, custom accelerators) to reduce costs further.
6. Conclusion
This study demonstrates that large language models can significantly outperform traditional static analysis tools in detecting code smells. GPT-4.0 and DeepSeek-V3 each offer distinct advantages:
GPT-4.0: Higher precision, strong contextual reasoning, suitable for high-stakes projects.
DeepSeek-V3: Balanced accuracy, cost-effective, scalable for industrial use.
While challenges remain—particularly in interpretability and cross-language edge cases—the findings suggest that LLMs represent a promising frontier for automated software quality assurance. By combining LLM-driven detection with traditional approaches, the software engineering community can move toward more maintainable, efficient, and scalable codebases.