Phishing Detection in the Gen‑AI Era: Quantized LLMs vs Classical Models

Table of Contents

  1. Introduction

  2. The Evolving Threat of Phishing

  3. Classical Detection Methods: ML & Deep Learning

  4. Enter Generative AI: Quantized LLMs

  5. Comparative Study Design

  6. Dataset Preparation

  7. Model Architectures & Quantization

  8. Training & Prompting Strategies

  9. Raw Accuracy: ML vs DL vs Quantized LLMs

  10. Contextual Fluency: Detecting Subtle Phishing Cues

  11. Impact of Zero‑Shot & Few‑Shot Prompting

  12. Adversarial Rewriting: Email Rephrasing Attacks

  13. Adversarial Robustness Across Methods

  14. Efficiency: VRAM, Inference Time, and Cost

  15. Explainability: LLM‑Generated Justifications

  16. Cost‑Performance Trade‑Offs

  17. Integrating Quantized LLMs in Production Systems

  18. Future Trends and Research Opportunities

  19. Limitations & Ethical Considerations

  20. Conclusion

1. Introduction

Phishing remains one of the most prevalent cybersecurity threats, continually evolving in sophistication. Traditional defense systems—machine learning and deep neural networks—have improved detection capabilities but still struggle with context-aware deception. Meanwhile, generative AI, particularly quantized small-parameter LLMs, offers new promise: they can recognize subtle phishing patterns and generate interpretable classifications while running efficiently on modest hardware. This article systematically compares classical techniques and quantized LLMs, testing performance, robustness, cost, and usability.

2. The Evolving Threat of Phishing

Phishing starts with a misleading email, seemingly legitimate but malicious in intent. Over time, attacker techniques have evolved:

  • Basic attacks: simple misspellings, suspicious links

  • Content crafting: mimicking companies, personalizing emails

  • Language variation: paraphrasing to evade detection rules

  • Clean messages: sometimes hiding malicious links with benign language

Modern phishing attempts are context-rich, subtly manipulative, and tailored to bypass content-based filters. Security systems must balance high recall (catching real threats) with low false positives (avoiding alarm fatigue) while processing text in real time.

3. Classical Detection Methods: ML & Deep Learning

Traditional systems rely on:

  • Feature engineering: URL patterns, sender reputation, n-grams

  • Machine learning classifiers: Logistic Regression, Random Forests, SVM

  • Deep models: CNNs or RNNs processing email bodies

These models are accurate and lightweight. However, they often fail to catch context-sensitive phishing attempts—like emails that impersonate authority without obvious textual markers.

4. Enter Generative AI: Quantized LLMs

Small-parameter LLMs (e.g., 14B) can be quantized (e.g., to 8-bit) to run efficiently:

  • Rich text comprehension: understands intent, tone, rationale

  • Prompting flexibility: can classify with natural-language framing

  • Explanation ability: generates reasoning steps

Quantization reduces memory by 50–75%, enabling these LLMs to run on consumer-grade GPUs.
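
As a concrete sketch of what running such a model locally might look like, the snippet below loads an 8-bit GGUF build with llama-cpp-python; the model filename and prompt are illustrative, not exact artifact names from the study.

    from llama_cpp import Llama

    # Illustrative model path; substitute your own Q8_0 GGUF file.
    llm = Llama(
        model_path="models/deepseek-r1-distill-qwen-14b.Q8_0.gguf",
        n_ctx=4096,        # context window large enough for a full email
        n_gpu_layers=-1,   # offload all layers to the GPU if VRAM allows
    )

    email_text = "Your account is flagged. Verify your password here: http://login.example"
    out = llm(
        "Is the following email phishing? Answer yes or no.\n\n" + email_text,
        max_tokens=8,
        temperature=0.0,   # deterministic classification
    )
    print(out["choices"][0]["text"].strip())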

5. Comparative Study Design

We evaluated models across four axes:

  1. Detection accuracy on a labelled phishing dataset

  2. Contextual sensitivity in subtle or paraphrased attacks

  3. Adversarial robustness against smart transforms

  4. Efficiency & cost in resource-constrained deployment

Each model type was rigorously tuned for fairness. For the LLMs, we included zero-shot, few-shot, and fine-tuned baselines.

6. Dataset Preparation

We curated a dataset of 30,000 emails:

  • 15k confirmed phishing

  • 15k benign

Additional variants included:

  • Rewritten phishing using LLM paraphrasing

  • Adversarial transformations: URL obfuscation, grammar shifts

Splits: 70% train, 15% validation, 15% test—balanced by class.
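
A minimal sketch of that split with scikit-learn, assuming a frame with text and label columns (file and column names are illustrative):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("emails.csv")  # hypothetical file: 30k rows, label is 0/1

    # 70% train, then split the remaining 30% evenly into validation and test,
    # stratifying on the label so every split stays class-balanced.
    train, temp = train_test_split(df, test_size=0.30, stratify=df["label"], random_state=42)
    val, test = train_test_split(temp, test_size=0.50, stratify=temp["label"], random_state=42)
    print(len(train), len(val), len(test))  # ~21000, ~4500, ~4500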

7. Model Architectures & Quantization

Evaluated models included:

  • Classical ML: Logistic Regression, Random Forest with TF-IDF

  • Deep models: GRU and Transformer-based text classifiers (~20M and 100M params)

  • Quantized LLM: DeepSeek-R1 Distill Qwen 14B (Q8_0 mode)—8-bit quantization, ~17 GB VRAM

  • Baseline LLM: a fine-tuned GPT-3.5 for comparison

Each model was trained or calibrated to match resource constraints.
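
For reference, a minimal sketch of the classical baseline, reusing the train/test frames from the split above; the hyperparameters are illustrative, not the tuned values from the study:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
        ("rf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
    ])

    clf.fit(train["text"], train["label"])
    print(clf.score(test["text"], test["label"]))  # held-out accuracy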

8. Training & Prompting Strategies

ML/DL models were trained with cross-entropy loss on TF-IDF or token embeddings.
LLMs used:

  • Zero-shot prompts: single-question classification

  • Few-shot prompts: 3–5 examples inline

  • Soft prompt tuning for Qwen14B

  • MIP prompting: explain-then-classify, e.g. “Explain if this email is phishing. Then say yes/no.”
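
The two LLM framings can be sketched as templates like the following; the exact wording used in the study may differ:

    # Few-shot framing: a handful of labelled examples inline before the query.
    FEW_SHOT = """You are an email security classifier.

    Email: "Your invoice is attached, thanks for your business."
    Answer: benign

    Email: "Your account is flagged! Verify your password here: hxxp://secure-login.example"
    Answer: phishing

    Email: "{email}"
    Answer:"""

    # Explain-then-classify (MIP) framing: reasoning first, verdict on the last line.
    MIP = """Explain whether the following email is phishing, then answer yes or no
    on the final line.

    Email: "{email}"
    """

    prompt = FEW_SHOT.format(email=email_text)  # email_text as in section 4's sketch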

9. Raw Accuracy: ML vs DL vs Quantized LLMs

On the standard test set:

  • Random Forest: 92.4%

  • Transformer (100M): 93.8%

  • Qwen14B Q8_0: 90.5% (few-shot), 88.0% (zero-shot)

  • Fine-tuned GPT-3.5: 91.2%

Observations:

  • The LLM trailed slightly in overall accuracy, but within a narrow margin

  • Deep models held the edge on straightforward detection

  • The LLM excelled on nuanced or context-rich samples

10. Contextual Fluency: Detecting Subtle Phishing Cues

On context-based phishing samples:

  • Deep model: 78% accuracy

  • Qwen14B: 84% (few-shot)

  • GPT-3.5: 82%

These emails featured:

  • Authority appeals (“Your account flagged”)

  • Emotional cues

  • Slight language errors

LLMs leveraged semantic awareness to outperform classical models.

11. Impact of Zero-Shot & Few-Shot Prompting

Prompts dramatically influenced LLM performance:

  • Zero-shot: 88%

  • Few-shot: 90.5%

  • MIP prompting: 91.8%

Including short reasoning explanations helped guide decision-making, especially on ambiguous examples.

12. Adversarial Rewriting: Email Rephrasing Attacks

When phishing messages were paraphrased or obfuscated:

  • Random Forest dropped to 60%

  • Deep model fell to 65%

  • LLMs stayed above 80%

This resilience shows LLMs’ ability to maintain intent recognition even when the surface form changes.

13. Adversarial Robustness Across Methods

Against deliberate style changes:

  • Grammar disruption

  • Negation manipulation

  • Spacing or Unicode trickery

LLMs lost only ~10% in performance, while classical models degraded by ~35–40%.
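
Two of these surface-level evasions are easy to reproduce; a sketch, for illustration only:

    ZWSP = "\u200b"  # zero-width space: invisible, but breaks exact keyword matches
    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic lookalikes

    def obfuscate(text: str) -> str:
        # Split a trigger word with an invisible character, then swap
        # visually identical letters so character-level features shift.
        text = text.replace("password", "pass" + ZWSP + "word")
        return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

    print(obfuscate("Please verify your password now"))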

14. Efficiency: VRAM, Inference Time, and Cost

Per-email detection latency:

  • Random Forest: 2 ms (CPU)

  • Transformer: 15 ms (GPU)

  • Qwen14B Q8_0: 120 ms (GPU)

  • GPT-3.5 (API): ~300 ms, plus per-call cost

The quantized LLM runs in near real time within ~17 GB of VRAM; API latency and per-call costs make GPT-3.5 less viable at scale.
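
Per-email latency numbers like these can be reproduced with a simple harness; a sketch:

    import time

    def mean_latency_ms(classify, emails, warmup=5):
        for e in emails[:warmup]:          # warm caches / GPU kernels first
            classify(e)
        start = time.perf_counter()
        for e in emails:
            classify(e)
        return (time.perf_counter() - start) * 1000 / len(emails)

    # e.g. mean_latency_ms(lambda e: clf.predict([e]), test["text"].tolist())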

15. Explainability: LLM-Generated Justifications

A capability unique to LLMs: human-readable explanations for each verdict.

Example Reasoning:
“This email asks for personal credentials by falsifying account verification urgency—common phishing tactic.”

Security professionals appreciated these contextual cues because they speed triage and aid understanding.
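
In an explain-then-classify setup, the justification and the verdict arrive in one response; a sketch of splitting them, assuming the yes/no lands on the final line as the MIP prompt requests:

    def parse_response(text: str) -> tuple[str, bool]:
        # Assumes the MIP format: explanation first, yes/no on the last line.
        lines = [l for l in text.strip().splitlines() if l.strip()]
        verdict = "yes" in lines[-1].lower()
        explanation = " ".join(lines[:-1])
        return explanation, verdict

    expl, is_phish = parse_response(
        "This email asks for credentials under false urgency.\nyes"
    )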

16. Cost-Performance Trade-Offs

Cost per 1M email inferences:

  • Classical ML: ~$0.05

  • Deep model: ~$0.20

  • Qwen14B: ~$1.50

  • GPT-3.5: ~$10

Depending on risk appetite, organizations can right-size systems: ML as first filter, LLM for deeper analysis.
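
A back-of-envelope check using the figures above, together with the 5–10% flag rate described in the next section, shows how far tiering stretches the budget:

    # Hybrid cost per 1M emails: run classical ML on everything, escalate
    # ~7.5% (midpoint of the 5-10% flag rate) to the quantized LLM.
    ml_cost, llm_cost, escalation = 0.05, 1.50, 0.075
    hybrid = ml_cost + escalation * llm_cost
    print(f"${hybrid:.2f} per 1M emails")  # ~$0.16, vs $1.50 for LLM-only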

17. Integrating Quantized LLMs in Production Systems

An optimal architecture:

  1. ML/DL filter flags 5–10% of emails

  2. LLM inspector analyzes flagged emails with MIP prompting

  3. Analyst dashboard shows LLM explanation before quarantine

This strategy balances performance, cost, and interpretability.
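
A minimal sketch of that flow, reusing the pieces sketched earlier (clf from section 7, llm from section 4, MIP and parse_response from sections 8 and 15); the thresholds and the notify_analyst hook are hypothetical:

    def triage(email: str) -> str:
        p_phish = clf.predict_proba([email])[0, 1]    # stage 1: cheap filter
        if p_phish < 0.05:
            return "deliver"                          # confidently benign
        if p_phish > 0.95:
            return "quarantine"                       # confidently phishing
        # Stage 2: escalate the ambiguous slice to the quantized LLM.
        raw = llm(MIP.format(email=email), max_tokens=256)["choices"][0]["text"]
        explanation, is_phish = parse_response(raw)
        # Stage 3: surface the explanation to an analyst before quarantine.
        notify_analyst(email, explanation)            # hypothetical dashboard hook
        return "quarantine" if is_phish else "deliver"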

18. Future Trends and Research Opportunities

  • Prompt tuning to boost zero-shot performance

  • Distilling adversarial robustness into smaller models

  • Real-time prompt selection: only invoke the LLM when needed

  • Multimodal detection: extending LLMs to inspect image attachments

19. Limitations & Ethical Considerations

Current drawbacks:

  • Explanations may sound authoritative yet still be hallucinated

  • Potential bias in flagged emails

  • LLM inference cost still non-trivial

  • Privacy concerns when routing emails to API or model logs

20. Conclusion

While classical ML/DL models lead in raw detection accuracy, quantized LLMs offer distinct advantages—particularly in semantic understanding, adversarial resilience, and explainability. When deployed thoughtfully, LLMs can augment phishing detection systems, safeguarding users while providing valuable insights. Their ability to identify context-based threats and explain decisions marks a significant leap forward in cybersecurity.