Argument Mining with Large Language Models: An Extensive Evaluation from LLaMA to GPT-4o and DeepSeek-R1
1. Introduction: The Evolution of Argument Mining
Argument Mining (AM) is a vital interdisciplinary domain that focuses on the automatic detection, extraction, and classification of argumentative discourse in text. At its core, AM seeks to identify key argumentative elements such as:
Premises (supporting statements),
Claims (conclusions or standpoints),
Relations (support, attack, contradiction, etc.), sketched in code just after this list.
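A minimal sketch of how these elements might be represented in code (type and field names here are illustrative, not taken from the paper or any particular AM toolkit):

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    """Possible relations between argument components."""
    SUPPORT = "support"
    ATTACK = "attack"
    CONTRADICTION = "contradiction"


@dataclass
class ArgumentComponent:
    """A claim or premise identified in the source text."""
    text: str
    component_type: str  # "claim" or "premise"


@dataclass
class ArgumentRelation:
    """A directed relation between two argument components."""
    source: ArgumentComponent   # typically a premise
    target: ArgumentComponent   # typically a claim
    relation: RelationType
```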
Applications of argument mining are vast—spanning law (legal reasoning), education (essay feedback), journalism (bias detection), and policy-making. Traditionally, AM relied on rule-based or supervised machine learning approaches using syntactic and discourse-level features. However, these systems often suffered from limited generalization and high manual annotation requirements.
The introduction of Large Language Models (LLMs) such as GPT-3.5, LLaMA, DeepSeek-R1, and GPT-4o has drastically transformed the capabilities of AM systems. This paper presents a comprehensive evaluation of how LLMs perform on argument classification tasks, especially across two benchmark datasets: Args.me and UKP.
2. Datasets Used: Args.me and UKP
2.1 Args.me
One of the largest online resources for argument mining.
Contains over 400,000 arguments extracted from debates and web discourse.
Each entry includes a claim, its premise, and an associated stance (support, attack, etc.).
Offers real-world noisy text, making it ideal for robustness testing.
2.2 UKP Dataset (Argument Annotated Essays)
Academic dataset containing well-structured essays manually annotated for argumentative structure.
Annotated components include Major Claims, Claims, and Premises.
Relations among components (support or attack) are explicitly labeled.
These datasets provide complementary perspectives: Args.me captures web-style informal arguments, while UKP reflects academic argumentative discourse.
3. Selected LLMs and Experimental Setup
This study investigates three families of large language models:
Model | Parameters / Type | Provider | Reasoning Support | Access |
---|---|---|---|---|
GPT-4o | Proprietary, multi-modal | OpenAI | Chain-of-Thought (CoT) built-in | API
LLaMA (3.1/3.3) | Open-source, 8B–70B | Meta AI | CoT-finetuned variants | Local/API
DeepSeek-R1 | 70B (plus distilled versions) | DeepSeek AI | Enhanced reasoning | Open-source
In addition to the base models, the authors tested CoT-enhanced variants, which use step-wise reasoning prompts ("Let's think step by step") to elicit more human-like inference.
3.1 Prompting Strategies
To standardize comparisons, three prompting strategies were used:
Zero-shot prompting: the model is asked to classify without any examples.
Few-shot prompting: three to five annotated examples are included in the prompt.
Chain-of-Thought prompting: the model is encouraged to generate intermediate reasoning before its answer (see the prompt sketch below).
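The sketch below illustrates the three strategies as simple prompt builders; the wording and the few-shot examples are hypothetical stand-ins, not the exact prompts used in the study:

```python
# Illustrative prompt builders for the three strategies.
FEW_SHOT_EXAMPLES = [
    ("We should ban plastic bags.", "Plastic waste harms marine life.", "support"),
    ("We should ban plastic bags.", "Bag bans hurt small retailers.", "attack"),
    ("We should ban plastic bags.", "Plastic bags come in many colors.", "neutral"),
]


def zero_shot_prompt(claim: str, premise: str) -> str:
    """No examples: just the instruction and the instance to classify."""
    return (
        "Classify the relation of the premise to the claim as "
        "support, attack, or neutral.\n"
        f"Claim: {claim}\nPremise: {premise}\nLabel:"
    )


def few_shot_prompt(claim: str, premise: str) -> str:
    """Prepend a handful of annotated examples to the zero-shot prompt."""
    demos = "\n\n".join(
        f"Claim: {c}\nPremise: {p}\nLabel: {l}" for c, p, l in FEW_SHOT_EXAMPLES
    )
    return demos + "\n\n" + zero_shot_prompt(claim, premise)


def cot_prompt(claim: str, premise: str) -> str:
    """Ask for intermediate reasoning before the final label."""
    return (
        "Classify the relation of the premise to the claim as "
        "support, attack, or neutral.\n"
        f"Claim: {claim}\nPremise: {premise}\n"
        "Let's think step by step, then state the final label."
    )
```

The same claim-premise pair can be run through each builder so that only the prompting strategy varies between runs.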
4. Key Findings
✅ 4.1 Overall Performance
Model | Accuracy (Args.me) | Accuracy (UKP) |
---|---|---|
GPT-4o | 86.3% | 89.1% |
DeepSeek-R1 | 84.7% | 88.4% |
LLaMA 3.3 (w/ CoT) | 81.4% | 83.5% |
LLaMA 3.1 (base) | 78.9% | 80.1% |
GPT-4o consistently outperformed all other models on both datasets. However, the reasoning-enhanced DeepSeek-R1 came very close, especially on UKP's structured academic texts. The gap between DeepSeek-R1 and the LLaMA models was significant, especially on complex relational tasks.
✅ 4.2 Chain-of-Thought: Reasoning vs. Performance
Interestingly, CoT prompting improved performance for DeepSeek-R1 and LLaMA, but marginally degraded GPT-4o’s accuracy. This suggests that:
GPT-4o likely performs the necessary reasoning internally,
whereas the other models benefit from explicit reasoning scaffolding.
5. Qualitative Error Analysis
Even the best-performing models made errors. The study categorized them as:
Error Type | Description | Common In |
---|---|---|
Label Ambiguity | Confusing support vs. neutral | LLaMA 3.1, GPT-4o |
Premise Confusion | Mistaking claims for premises | DeepSeek-R1 |
Over-reasoning | CoT responses drift from original text | LLaMA-CoT |
Negation Failure | Missing semantic cues in negated arguments | All models |
Flawed Textual Entailment | Generating unsupported relationships | GPT-4o, DeepSeek-R1
Models often struggled when:
The claim and premise shared overlapping vocabulary.
The attack-support distinction was subtle or merely implied.
The example was neutral; such cases were frequently misclassified as support, particularly in Args.me.
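For readers who want to tally such failures in their own experiments, the taxonomy above can be encoded as a small enumeration (an illustrative sketch, not code from the paper):

```python
from collections import Counter
from enum import Enum


class AMError(Enum):
    """The five error categories from the study, encoded for bookkeeping."""
    LABEL_AMBIGUITY = "support confused with neutral"
    PREMISE_CONFUSION = "claim mistaken for a premise"
    OVER_REASONING = "CoT drifts from the original text"
    NEGATION_FAILURE = "negation cues missed"
    FLAWED_ENTAILMENT = "unsupported relation generated"


def error_profile(annotated_errors: list[AMError]) -> Counter:
    """Tally manually annotated errors per category for one model."""
    return Counter(annotated_errors)
```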
6. Discussion: Strengths and Weaknesses of Each LLM
✅ GPT-4o
Strengths: Highest accuracy, robust across datasets, fluent explanations.
Weaknesses: Occasionally hallucinates in CoT mode, expensive, lacks reproducibility due to proprietary API.
✅ DeepSeek-R1
Strengths: Excellent on structured text, competitive even on noisy input, efficient open-source alternative.
Weaknesses: Confuses hierarchical structures (e.g., sub-premises vs. main claims), slight instability in long reasoning chains.
✅ LLaMA 3.3 (w/ CoT)
Strengths: CoT significantly improves performance, good on short passages.
Weaknesses: Struggles with relation classification, needs more memory and prompt engineering.
7. Contribution and Novelty
This study makes several novel contributions to the argument mining domain:
The first comparative study of GPT, LLaMA, and DeepSeek models on AM classification.
A systematic evaluation of prompt engineering (e.g., CoT) for argument tasks.
An error taxonomy for LLM failures on argument classification.
A discussion of dataset weaknesses, especially over-simplified or ambiguous annotations in Args.me.
8. Challenges and Limitations
Despite its strengths, the research points out several unresolved issues:
Noisy Dataset Problems: Args.me often lacks proper boundaries between arguments, confusing models.
Prompt Bias: Models can be influenced by the order of examples or leading language in prompts.
Lack of Standard Evaluation Protocols: Argument mining lacks the benchmarking rigor of fields like NLI or QA.
Reasoning Fatigue: Longer CoT prompts sometimes reduce performance, particularly in resource-constrained models.
9. Broader Implications
The findings have implications across several disciplines:
Legal Tech: LLMs can aid in extracting legal argument structures.
Education: Automated grading or feedback for argumentative essays.
Debate AI: Models like GPT-4o could power dynamic debate platforms.
Fact-Checking: AM pipelines could support automated claim verification.
The study also lays groundwork for multi-lingual argument mining, as many LLMs (e.g., DeepSeek) support Chinese and other languages.
10. Recommendations and Future Work
✅ Prompt Design Matters
Use neutral wording in prompts to avoid biasing the model.
Vary the claim-premise order to test generalization (see the sketch below).
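One way to act on both recommendations is to generate prompts in both orderings and compare results; the helper below is a hypothetical sketch, not part of the study's protocol:

```python
import random


def build_prompt(claim: str, premise: str, claim_first: bool = True) -> str:
    """Neutrally worded prompt with a configurable claim/premise order."""
    pair = (
        f"Claim: {claim}\nPremise: {premise}"
        if claim_first
        else f"Premise: {premise}\nClaim: {claim}"
    )
    return (
        "Label the relation between the two statements as "
        "support, attack, or neutral.\n" + pair + "\nLabel:"
    )


def order_variants(claim: str, premise: str, seed: int = 0) -> list[str]:
    """Return both orderings, shuffled, to probe order sensitivity."""
    prompts = [
        build_prompt(claim, premise, claim_first=True),
        build_prompt(claim, premise, claim_first=False),
    ]
    random.Random(seed).shuffle(prompts)
    return prompts
```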
✅ Hybrid Models
A promising direction is to combine:
CoT for reasoning clarity,
Retrieval-Augmented Generation (RAG) for grounding labels in evidence,
Classification heads for label consistency (a pipeline sketch follows this list).
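A minimal sketch of such a hybrid pipeline, assuming a generic `generate` function for the LLM and a `retrieve` function for the evidence store (both hypothetical placeholders):

```python
from typing import Callable, List

LABELS = ["support", "attack", "neutral"]


def hybrid_classify(
    claim: str,
    premise: str,
    generate: Callable[[str], str],        # hypothetical LLM completion call
    retrieve: Callable[[str], List[str]],  # hypothetical evidence retriever (RAG)
) -> str:
    """RAG supplies evidence, CoT structures the reasoning, and a simple
    label-matching step stands in for a classification head."""
    evidence = "\n".join(retrieve(claim + " " + premise))
    prompt = (
        "Evidence:\n" + evidence + "\n\n"
        f"Claim: {claim}\nPremise: {premise}\n"
        "Let's think step by step about whether the premise supports or "
        "attacks the claim, then answer with exactly one label: "
        "support, attack, or neutral."
    )
    answer = generate(prompt).lower()
    # Map free-form model output back onto the fixed label set.
    return next((label for label in LABELS if label in answer), "neutral")
```

In a production system, a learned classification head would replace the final string-matching step.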
✅ Dataset Refinement
More real-world, multi-domain argument datasets are needed.
Human-in-the-loop annotation can reduce label ambiguity.
✅ Multilingual Benchmarking
Extend to languages beyond English, leveraging DeepSeek’s Chinese capabilities and GPT’s multilingual coverage.
11. Conclusion
This paper marks a significant step forward in understanding how large language models handle argument classification tasks. GPT-4o sets a new benchmark, but open-source alternatives such as DeepSeek-R1 show considerable promise, especially when enhanced with reasoning strategies.
While LLMs outperform traditional approaches in both efficiency and accuracy, there is still room for improvement, especially in handling ambiguous or subtle argumentative nuances. This work not only benchmarks model capabilities but also exposes key bottlenecks in dataset quality, reasoning fidelity, and prompt design.
As LLMs continue to evolve, tools for interpretable and domain-specific argument analysis will become vital across law, journalism, education, and AI ethics. This study lays the groundwork for that future.