Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek‑R1, and Beyond
1. Introduction
Test-time scaling large language models (LLMs), such as OpenAI’s o1 series and DeepSeek‑R1, exemplify cutting-edge reasoning capabilities. These models shine on mathematical, coding, and general-intelligence benchmarks. However, their performance in high-stakes professional domains such as legal reasoning remains underexplored, especially when tasks involve nuanced legal logic, statutory interpretation, and multi-party judgments.
Our study bridges this gap by systematically evaluating nine prominent LLMs across 17 carefully selected legal reasoning tasks spanning both the Chinese and English legal systems. This provides critical insight into how today’s most capable reasoning models perform in a specialized, complex, real-world domain.
2. Background: What Makes Legal Reasoning Unique
Legal reasoning is a deeply demanding cognitive task, involving:
Statutory interpretation: deriving rules from formal legal texts.
Case analysis: applying law to diverse fact patterns.
Argumentation: constructing persuasive, legally grounded positions.
Multilingual competence: legal texts often straddle languages and jurisdictions.
Formal structure: multi-defendant cases, appeals, and hierarchical court precedence.
These competencies extend beyond everyday reasoning challenges, requiring both structured analysis and domain-specific sensitivity.
3. Models Evaluated
We evaluated nine LLMs grouped as follows:
Large reasoning models:
OpenAI o1-0614 (flagship reasoning model),
DeepSeek‑R1 (reasoning-optimized, open-source, 671B-parameter MoE).
Distilled counterparts (7–32B):
OpenAI o1-mini variants,
DeepSeek‑R1-32B,
Other high-reasoning models.
Non-test-time scaling LLMs:
GPT‑3.5,
LLaMA 3.1 (8B),
Qwen2.5-7B.
All models were evaluated in zero-shot mode, using identical prompts tailored to each legal task.
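To illustrate what “identical prompts tailored to each legal task” can look like in practice, here is a minimal sketch of a shared zero-shot prompt template. The wording, placeholder names, and the example statute excerpt are hypothetical, not the exact prompts used in the study.

```python
# Minimal sketch of a shared zero-shot prompt template (hypothetical wording,
# not the exact prompts used in the study).
PROMPT_TEMPLATE = """You are a legal reasoning assistant.

Task: {task_description}

Materials:
{materials}

Question: {question}

Answer with the option letter only (e.g., "A"), followed by a one-sentence justification.
"""

def build_prompt(task_description: str, materials: str, question: str) -> str:
    """Fill the shared template so every model receives an identical prompt."""
    return PROMPT_TEMPLATE.format(
        task_description=task_description,
        materials=materials,
        question=question,
    )

if __name__ == "__main__":
    print(build_prompt(
        task_description="Article Interpretation: choose the statute that best applies.",
        materials="Hypothetical excerpt: Article 264 covers theft of public or private property ...",
        question="Which article governs the defendant's conduct? A) Art. 264 B) Art. 266 C) Art. 270",
    ))
```

Because the template is filled identically for every model, score differences can be attributed to the models rather than to prompt variation.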
4. Legal Reasoning Task Overview
4.1 Chinese Legal Tasks
We selected the following seven tasks:
Article Interpretation: choosing the statute that most suitably applies to the facts.
Fact Matching in Multi-Defendant Cases: attributing behavior across defendants.
Case Outcome Prediction: binary or multi-outcome judgment.
Legal Argumentation Analysis: highlighting flawed reasoning.
Sentencing Recommendation: based on established legal guidelines.
Evidence Structuring: determining relevancy and admissibility.
Appeal Reasoning: drafting legal arguments for appeal.
4.2 English Legal Tasks
We selected the following ten tasks:
Statute vs. Case Application: selecting the correct source of law to apply.
Multi-Party Liability: reasoning across multiple defendants.
Contractual Dispute Resolution.
Precedent Analysis: citing relevant case law.
Jury Instruction Comprehension.
Legal Writing Assistance: drafting sections of an argument.
Defamation Threshold Assessment.
Sentencing Guidelines Recall.
Probation Condition Relevance.
Legal Strategy Evaluation.
These tasks were selected for both complexity and real-world relevance in litigation and counsel settings.
5. Experimental Setup
Zero-shot Prompt Templates: context-rich instructions plus an explicit answer format to ensure clarity.
Evaluation Metrics:
For objective tasks: Exact match accuracy.
For subjective tasks: rated by legal scholars based on structure, relevance, and completeness.
Grading Procedure:
Two independent legal experts rated each model output,
A third adjudicator resolved cases where the two ratings diverged by more than 10%,
Final evaluations were aggregated across domains (a minimal sketch of this scoring logic follows below).
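To make the grading procedure concrete, the sketch below implements the scoring rules stated above: exact-match accuracy for objective tasks, and averaging of two expert ratings with a third adjudicator taking over when the ratings diverge by more than 10%. The function names, the 0–1 rating scale, and the toy numbers are illustrative assumptions, not the study’s actual code or data.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Objective tasks: fraction of predictions that exactly match the reference answer."""
    assert len(predictions) == len(references)
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)


def subjective_score(expert_a: float, expert_b: float, adjudicator: float | None = None,
                     disagreement_threshold: float = 0.10) -> float:
    """Subjective tasks: average two expert ratings (assumed to be on a 0-1 scale);
    if they diverge by more than the threshold, defer to the adjudicator's rating."""
    if abs(expert_a - expert_b) > disagreement_threshold:
        if adjudicator is None:
            raise ValueError("Adjudicator rating required when experts disagree by more than 10%.")
        return adjudicator
    return (expert_a + expert_b) / 2


# Toy numbers, not the study's data:
print(exact_match_accuracy(["A", "C", "B"], ["A", "B", "B"]))  # 0.666...
print(subjective_score(0.80, 0.85))                            # 0.825 (averaged)
print(subjective_score(0.60, 0.90, adjudicator=0.75))          # 0.75 (adjudicated)
```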
6. Results Overview
6.1 CN Tasks Performance
| Model | Avg. Accuracy |
| --- | --- |
| OpenAI o1-0614 | 78% |
| DeepSeek-R1 | 73% |
| o1-mini (32B) | 70% |
| R1-32B distilled | 68% |
| GPT-3.5 | 61% |
| LLaMA 3.1 (8B) | 54% |
| Qwen2.5-7B | 52% |
6.2 EN Tasks Performance
| Model | Avg. Accuracy |
| --- | --- |
| OpenAI o1-0614 | 82% |
| DeepSeek-R1 | 76% |
| o1-mini (32B) | 74% |
| R1-32B distilled | 70% |
| GPT-3.5 | 68% |
| LLaMA 3.1 (8B) | 60% |
| Qwen2.5-7B | 58% |
Key Observation: Even state-of-the-art reasoning models struggle to consistently exceed 80% accuracy in professional legal tasks, signaling a notable gap.
7. Detailed Task Analysis
7.1 Multi-Defendant Allocation
OpenAI o1 performed best (~80% CN, ~85% EN),
DeepSeek-R1 lagged (~70% CN, ~75% EN),
GPT‑3.5 and smaller models performed below 60%.
Error analysis shows common mistakes in distinguishing between principal and accessory roles.
7.2 Legal Argument Analysis
Models frequently failed to identify major flaws in argumentative logic, even when they performed well on statute matching.
7.3 Sentencing Precision Tasks
While guideline recall was acceptable (80–90%), sentencing recommendations tailored to individual defendant profiles were often inconsistent.
7.4 Appeal Argument Drafting
Both o1 and R1 produced logically structured arguments, but they often failed to cite precedent or relied on irrelevant cases, which undermined the depth of their analyses.
8. Underlying Reasons for Performance Lags
Training Data Gaps: LLMs aren’t trained primarily on structured legal corpora.
Logical Complexity: Multi-party cases require layered reasoning.
Legal Domain Syntax: Formal vocabulary, modality, and citation norms.
Cross-Cultural & Language Issues: Chinese tasks suffer from sparser training data and the complexities of legal translation.
Limited Reasoning Chains: Default chain-of-thought traces are often too brief for legally complex outputs.
9. Implications and Suggestions
9.1 For LLM Builders
Integrate internal legal toolkits: statute lookups, case databases, etc.
Fine-tune on legal datasets, especially multi-defendant and multilingual sources.
Encourage longer reasoning chains with structured prompts (a sketch combining statute lookup with structured prompting follows this list).
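As a sketch of how statute lookup and structured prompting could be combined, the snippet below retrieves statute text from a placeholder corpus and asks the model to reason in explicit numbered steps. `lookup_statute`, `build_structured_prompt`, and the `generate` callable are hypothetical names, not an existing API.

```python
# Hypothetical sketch: retrieve statutes first, then prompt for an explicit
# multi-step reasoning chain. Names and the toy corpus are illustrative only.

def lookup_statute(statute_id: str) -> str:
    """Placeholder statute lookup; a real toolkit would query a legal database."""
    corpus = {"CN-Criminal-264": "Hypothetical excerpt: Article 264 covers theft of property ..."}
    return corpus.get(statute_id, "Statute not found.")


def build_structured_prompt(question: str, statute_ids: list[str]) -> str:
    """Inject retrieved statutes and explicitly request a longer, structured reasoning chain."""
    retrieved = "\n".join(f"- {sid}: {lookup_statute(sid)}" for sid in statute_ids)
    return (
        "Relevant statutes (retrieved):\n"
        f"{retrieved}\n\n"
        f"Question: {question}\n\n"
        "Reason in numbered steps: (1) identify the governing statute, "
        "(2) map each statutory element to the facts, "
        "(3) address each defendant separately, then state the final answer."
    )


def answer(question: str, statute_ids: list[str], generate) -> str:
    """`generate` is any callable mapping a prompt string to model output text."""
    return generate(build_structured_prompt(question, statute_ids))
```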
9.2 For Users and Law Professionals
Use reasoning LLMs as assistants, not replacements.
Apply firm-specific retraining or domain tuning before deployment.
Encourage human-in-the-loop review to verify outputs and confirm that cited precedents are valid (a minimal citation-check sketch follows this list).
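One lightweight form of human-in-the-loop review is an automatic gate that flags outputs whose citations cannot be verified, so a lawyer checks them before anything is relied on. The sketch below assumes a toy citation index and a rough "Party v. Party (Year)" pattern; a production system would use a real case-law database and a proper citation parser.

```python
import re

# Stand-in for a verified citation index (a real system would query a case-law database).
VERIFIED_CASES = {"Smith v. Jones (2015)", "R v. Brown (1994)"}


def extract_citations(text: str) -> list[str]:
    """Rough 'Party v. Party (Year)' pattern; real citations are far more varied."""
    return re.findall(r"[A-Z][A-Za-z]* v\. [A-Z][A-Za-z]+ \(\d{4}\)", text)


def needs_human_review(model_output: str) -> bool:
    """Flag the output if it cites no cases at all or cites any unverified case."""
    citations = extract_citations(model_output)
    return not citations or any(c not in VERIFIED_CASES for c in citations)


if __name__ == "__main__":
    draft = "Following Smith v. Jones (2015) and Doe v. Roe (2001), the appeal should succeed."
    print(needs_human_review(draft))  # True: 'Doe v. Roe (2001)' is not in the verified index
```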
10. Limitations and Future Research
The task sample could be expanded for deeper analysis (e.g., of appellate briefs).
Few-shot protocols were tested but showed only marginal gains.
Task coverage: ultra-high-stakes areas such as contract law and international treaties remain untested.
Contextual Agentic Models: evaluating multi-turn dialogues or recommendation systems.
Multimodal Legal Reasoning: combining text with financial data or scanned documents.
11. Conclusion
While LLMs such as DeepSeek‑R1 and OpenAI o1 demonstrate strong generalized reasoning, their performance on complex, domain-specific tasks such as legal reasoning remains suboptimal. A ceiling of roughly 80–85% accuracy suggests that further domain-specific data, reasoning support tools, and structured prompting are necessary. These models are valuable for augmenting legal workflows, but they are not yet ready for unsupervised use in critical or high-stakes legal scenarios.
Building hybrid legal AI systems that combine retrieval-augmented pipelines, domain fine-tuning, and human oversight will be the most productive path forward.