Argument Mining with Large Language Models: An Extensive Evaluation from LLaMA to GPT-4o and DeepSeek-R1
1. Introduction: The Evolution of Argument Mining
Argument Mining (AM) is a vital interdisciplinary domain that focuses on the automatic detection, extraction, and classification of argumentative discourse in text. At its core, AM seeks to identify key argumentative elements such as:
Premises (supporting statements),
Claims (conclusions or standpoints),
Relations (support, attack, contradiction, etc.), sketched in code just after this list.
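A minimal sketch of how these elements might be represented in code (type and field names here are illustrative, not taken from the paper or any particular AM toolkit):

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """Possible relations between argument components."""
    SUPPORT = "support"
    ATTACK = "attack"
    CONTRADICTION = "contradiction"


@dataclass
class ArgumentComponent:
    """A claim or premise identified in the source text."""
    text: str
    component_type: str  # "claim" or "premise"


@dataclass
class ArgumentRelation:
    """A directed relation between two argument components."""
    source: ArgumentComponent   # typically a premise
    target: ArgumentComponent   # typically a claim
    relation: RelationType
```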
Applications of argument mining are vast—spanning law (legal reasoning), education (essay feedback), journalism (bias detection), and policy-making. Traditionally, AM relied on rule-based or supervised machine learning approaches using syntactic and discourse-level features. However, these systems often suffered from limited generalization and high manual annotation requirements.
The introduction of Large Language Models (LLMs) such as GPT-3.5, LLaMA, DeepSeek-R1, and GPT-4o has drastically transformed the capabilities of AM systems. This paper presents a comprehensive evaluation of how LLMs perform on argument classification tasks, especially across two benchmark datasets: Args.me and UKP.
2. Datasets Used: Args.me and UKP
2.1 Args.me
One of the largest online resources for argument mining.
Contains over 400,000 arguments extracted from debates and web discourse.
Each entry includes a claim, its premise, and an associated stance (support, attack, etc.).
Offers real-world noisy text, making it ideal for robustness testing.
2.2 UKP Dataset (Argument Annotated Essays)
Academic dataset containing well-structured essays manually annotated for argumentative structure.
Annotated components include Major Claims, Claims, and Premises.
Relations among components (support or attack) are explicitly labeled.
These datasets provide complementary perspectives: Args.me captures web-style informal arguments, while UKP reflects academic argumentative discourse.
3. Selected LLMs and Experimental Setup
This study investigates three families of large language models:
Model | Parameters / Type | Provider | Reasoning Support | Access |
---|---|---|---|---|
GPT-4o | Proprietary, multi-modal | OpenAI | Chain-of-Thought (CoT) built-in | API
LLaMA (3.1/3.3) | Open-source, 8B–70B | Meta AI | CoT-finetuned variants | Local/API
DeepSeek-R1 | 70B (plus distilled versions) | DeepSeek AI | Enhanced reasoning | Open-source
In addition to the base models, the authors tested CoT-enhanced variants, which use step-wise reasoning prompts ("Let's think step by step") to elicit more human-like inference.
3.1 Prompting Strategies
To standardize comparisons, three prompting strategies were used:
Zero-shot prompting: the model is asked to classify without any examples.
Few-shot prompting: three to five annotated examples are included in the prompt.
Chain-of-Thought prompting: the model is encouraged to generate intermediate reasoning before its answer (see the prompt sketch below).
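The sketch below illustrates the three strategies as simple prompt builders; the wording and the few-shot examples are hypothetical stand-ins, not the exact prompts used in the study:

```python
# Illustrative prompt builders for the three strategies.
FEW_SHOT_EXAMPLES = [
    ("We should ban plastic bags.", "Plastic waste harms marine life.", "support"),
    ("We should ban plastic bags.", "Bag bans hurt small retailers.", "attack"),
    ("We should ban plastic bags.", "Plastic bags come in many colors.", "neutral"),
]


def zero_shot_prompt(claim: str, premise: str) -> str:
    """No examples: just the instruction and the instance to classify."""
    return (
        "Classify the relation of the premise to the claim as "
        "support, attack, or neutral.\n"
        f"Claim: {claim}\nPremise: {premise}\nLabel:"
    )


def few_shot_prompt(claim: str, premise: str) -> str:
    """Prepend a handful of annotated examples to the zero-shot prompt."""
    demos = "\n\n".join(
        f"Claim: {c}\nPremise: {p}\nLabel: {l}" for c, p, l in FEW_SHOT_EXAMPLES
    )
    return demos + "\n\n" + zero_shot_prompt(claim, premise)


def cot_prompt(claim: str, premise: str) -> str:
    """Ask for intermediate reasoning before the final label."""
    return (
        "Classify the relation of the premise to the claim as "
        "support, attack, or neutral.\n"
        f"Claim: {claim}\nPremise: {premise}\n"
        "Let's think step by step, then state the final label."
    )
```

The same claim-premise pair can be run through each builder so that only the prompting strategy varies between runs.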
4. Key Findings
✅ 4.1 Overall Performance
Model | Accuracy (Args.me) | Accuracy (UKP) |
---|---|---|
GPT-4o | 86.3% | 89.1% |
DeepSeek-R1 | 84.7% | 88.4% |
LLaMA 3.3 (w/ CoT) | 81.4% | 83.5% |
LLaMA 3.1 (base) | 78.9% | 80.1% |
GPT-4o consistently outperformed all other models on both datasets. However, the reasoning-enhanced DeepSeek-R1 came very close, especially on UKP's structured academic texts. The gap between DeepSeek-R1 and the LLaMA models was significant, especially on complex relational tasks.
✅ 4.2 Chain-of-Thought: Reasoning vs. Performance
Interestingly, CoT prompting improved performance for DeepSeek-R1 and LLaMA, but marginally degraded GPT-4o’s accuracy. This suggests that:
GPT-4o likely performs the necessary reasoning internally,
whereas the other models benefit from explicit reasoning scaffolding.
5. Qualitative Error Analysis
Even the best-performing models made errors. The study categorized them as:
Error Type | Description | Common In |
---|---|---|
Label Ambiguity | Confusing support vs. neutral | LLaMA 3.1, GPT-4o |
Premise Confusion | Mistaking claims for premises | DeepSeek-R1 |
Over-reasoning | CoT responses drift from original text | LLaMA-CoT |
Negation Failure | Missing semantic cues in negated arguments | All models |
Flawed Textual Entailment | Generating unsupported relationships | GPT-4o, DeepSeek-R1
Models often struggled when:
The claim and premise shared overlapping vocabulary.
The attack-support distinction was subtle or merely implied.
The example was neutral; such cases were frequently misclassified as support, particularly in Args.me.
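For readers who want to tally such failures in their own experiments, the taxonomy above can be encoded as a small enumeration (an illustrative sketch, not code from the paper):

```python
from collections import Counter
from enum import Enum


class AMError(Enum):
    """The five error categories from the study, encoded for bookkeeping."""
    LABEL_AMBIGUITY = "support confused with neutral"
    PREMISE_CONFUSION = "claim mistaken for a premise"
    OVER_REASONING = "CoT drifts from the original text"
    NEGATION_FAILURE = "negation cues missed"
    FLAWED_ENTAILMENT = "unsupported relation generated"


def error_profile(annotated_errors: list[AMError]) -> Counter:
    """Tally manually annotated errors per category for one model."""
    return Counter(annotated_errors)
```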
6. Discussion: Strengths and Weaknesses of Each LLM
✅ GPT-4o
Strengths: Highest accuracy, robust across datasets, fluent explanations.
Weaknesses: Occasionally hallucinates in CoT mode, expensive, lacks reproducibility due to proprietary API.
✅ DeepSeek-R1
Strengths: Excellent on structured text, competitive even on noisy input, efficient open-source alternative.
Weaknesses: Confuses hierarchical structures (e.g., sub-premises vs. main claims), slight instability in long reasoning chains.
✅ LLaMA 3.3 (w/ CoT)
Strengths: CoT significantly improves performance, good on short passages.
Weaknesses: Struggles with relation classification, needs more memory and prompt engineering.
7. Contribution and Novelty
This study makes several novel contributions to the argument mining domain:
The first comparative study of GPT, LLaMA, and DeepSeek models on AM classification.
A systematic evaluation of prompt engineering (e.g., CoT) for argument tasks.
An error taxonomy for LLM failures on argument classification.
A discussion of dataset weaknesses, especially over-simplified or ambiguous annotations in Args.me.
8. Challenges and Limitations
Despite its strengths, the research points out several unresolved issues:
Noisy Dataset Problems: Args.me often lacks proper boundaries between arguments, confusing models.
Prompt Bias: Models can be influenced by the order of examples or leading language in prompts.
Lack of Standard Evaluation Protocols: Argument mining lacks the benchmarking rigor of fields like NLI or QA.
Reasoning Fatigue: Longer CoT prompts sometimes reduce performance, particularly in resource-constrained models.
9. Broader Implications
The findings have implications across several disciplines:
Legal Tech: LLMs can aid in extracting legal argument structures.
Education: Automated grading or feedback for argumentative essays.
Debate AI: Models like GPT-4o could power dynamic debate platforms.
Fact-Checking: AM pipelines could support automated claim verification.
The study also lays groundwork for multi-lingual argument mining, as many LLMs (e.g., DeepSeek) support Chinese and other languages.
10. Recommendations and Future Work
✅ Prompt Design Matters
Use neutral wording in prompts to avoid biasing the model.
Vary the claim-premise order to test generalization (see the sketch below).
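One way to act on both recommendations is to generate prompts in both orderings and compare results; the helper below is a hypothetical sketch, not part of the study's protocol:

```python
import random


def build_prompt(claim: str, premise: str, claim_first: bool = True) -> str:
    """Neutrally worded prompt with a configurable claim/premise order."""
    pair = (
        f"Claim: {claim}\nPremise: {premise}"
        if claim_first
        else f"Premise: {premise}\nClaim: {claim}"
    )
    return (
        "Label the relation between the two statements as "
        "support, attack, or neutral.\n" + pair + "\nLabel:"
    )


def order_variants(claim: str, premise: str, seed: int = 0) -> list[str]:
    """Return both orderings, shuffled, to probe order sensitivity."""
    prompts = [
        build_prompt(claim, premise, claim_first=True),
        build_prompt(claim, premise, claim_first=False),
    ]
    random.Random(seed).shuffle(prompts)
    return prompts
```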
✅ Hybrid Models
A promising direction is to combine:
CoT for reasoning clarity,
Retrieval-Augmented Generation (RAG) for grounding labels in evidence,
Classification heads for label consistency (a pipeline sketch follows this list).
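A minimal sketch of such a hybrid pipeline, assuming a generic `generate` function for the LLM and a `retrieve` function for the evidence store (both hypothetical placeholders):

```python
from typing import Callable, List

LABELS = ["support", "attack", "neutral"]


def hybrid_classify(
    claim: str,
    premise: str,
    generate: Callable[[str], str],        # hypothetical LLM completion call
    retrieve: Callable[[str], List[str]],  # hypothetical evidence retriever (RAG)
) -> str:
    """RAG supplies evidence, CoT structures the reasoning, and a simple
    label-matching step stands in for a classification head."""
    evidence = "\n".join(retrieve(claim + " " + premise))
    prompt = (
        "Evidence:\n" + evidence + "\n\n"
        f"Claim: {claim}\nPremise: {premise}\n"
        "Let's think step by step about whether the premise supports or "
        "attacks the claim, then answer with exactly one label: "
        "support, attack, or neutral."
    )
    answer = generate(prompt).lower()
    # Map free-form model output back onto the fixed label set.
    return next((label for label in LABELS if label in answer), "neutral")
```

In a production system, a learned classification head would replace the final string-matching step.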
✅ Dataset Refinement
More real-world, multi-domain argument datasets are needed.
Human-in-the-loop annotation can reduce label ambiguity.
✅ Multilingual Benchmarking
Extend to languages beyond English, leveraging DeepSeek’s Chinese capabilities and GPT’s multilingual coverage.
11. Conclusion
This paper marks a significant step forward in understanding how large language models handle argument classification tasks. GPT-4o sets a new benchmark, but open-source alternatives such as DeepSeek-R1 show considerable promise, especially when enhanced with reasoning strategies.
While LLMs outperform traditional approaches in both efficiency and accuracy, there is still room for improvement, especially in handling ambiguous or subtle argumentative nuances. This work not only benchmarks model capabilities but also exposes key bottlenecks in dataset quality, reasoning fidelity, and prompt design.
As LLMs continue to evolve, tools for interpretable and domain-specific argument analysis will become vital across law, journalism, education, and AI ethics. This study lays the groundwork for that future.