Lights, Camera, Language Models: Evaluating GPT‑4o, Gemini‑2.0 & DeepSeek‑V3 for Movie Review Generation 🎬


1. Introduction: What Makes a Good Movie Review?

Movie reviews aren’t just summaries—they blend narrative insight, emotional resonance, and personal voice. Fans often rely on user reviews (like IMDb’s) for recommendations. With LLMs advancing in fluency, researchers are exploring whether AI can replicate—or even enhance—this nuanced form of critique. This study focuses on three cutting-edge models:

  • GPT‑4o, OpenAI’s multimodal LLM

  • Gemini‑2.0, Google's next-gen LLM

  • DeepSeek‑V3, a sparse Transformer from the open-source community

The study evaluates them across linguistic style, sentiment, semantic similarity, and human perception.


2. Framework: From Subtitles to Screenplays to Reviews

The evaluation follows a multi-stage workflow:

  1. Data Collection

  • Six Oscar-winning or nominated films

  • Sources: subtitles, screenplays, and IMDb reviews

  2. Prompting Strategy

  • Five distinct persona prompts (e.g., "young enthusiast," "professional critic")

  • Each prompt paired with positive, neutral, and negative review variants

  3. Review Generation

  • Each model receives the source material plus a persona-mode prompt, yielding ~15 reviews per film (5 personas × 3 sentiment variants; see the sketch after this list)

  4. Analyses & Human Survey

  • Quantitative evaluation (trigram frequencies, sentiment polarity, emotion distribution, semantic similarity)

  • Human survey: participants try to distinguish AI-generated from human-written reviews
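Below is a minimal Python sketch of how such a persona × sentiment prompt grid could be assembled. The persona names, template wording, and helper function are illustrative assumptions, not the study's exact prompts.

```python
# Hypothetical reconstruction of the persona x sentiment prompt grid.
# Persona names and template wording are assumptions, not the study's prompts.
from itertools import product

PERSONAS = [
    "young enthusiast", "professional critic",      # two personas named above
    "casual viewer", "film student", "genre fan",   # placeholders for the rest
]
SENTIMENTS = ["positive", "neutral", "negative"]    # the three review variants

def build_prompt(persona: str, sentiment: str, material: str) -> str:
    """Pair one persona with one sentiment mode over the source material."""
    return (
        f"You are a {persona}. Based on the material below, write a "
        f"{sentiment} review of the film in your own voice.\n\n{material}"
    )

# 5 personas x 3 sentiment modes = 15 prompts, hence ~15 reviews per film
prompts = [
    build_prompt(p, s, "<subtitles or screenplay text>")
    for p, s in product(PERSONAS, SENTIMENTS)
]
assert len(prompts) == 15
```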

3. Linguistic Analysis: N‑grams & Stylistic Patterns

Trigram Frequency

  • IMDb reviews mention character names frequently (e.g., “li mu bai”).

  • GPT‑4o and DeepSeek‑V3 also reference names; GPT‑4o blends in praise (“epic masterpiece”) while DeepSeek‑V3 remains objective.

  • Gemini‑2.0 uses more narrative-driven trigrams (“woman loses everything”); persona prompts affect variability.

Summary: AI reviews approximate human naming patterns and structure, but the nuance of human-written style remains distinguishable.
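As a rough illustration, a trigram count like the one above can be reproduced with scikit-learn's CountVectorizer; the two sample reviews below are stand-ins, not the study's corpora.

```python
# Sketch: count the most frequent trigrams in a review corpus.
# The sample reviews are illustrative stand-ins, not the study's data.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "li mu bai is the heart of this epic masterpiece",
    "a woman loses everything yet the film never flinches",
]

vectorizer = CountVectorizer(ngram_range=(3, 3))  # trigrams only
matrix = vectorizer.fit_transform(reviews)
totals = matrix.sum(axis=0).A1                    # total count per trigram

top = sorted(zip(vectorizer.get_feature_names_out(), totals),
             key=lambda pair: -pair[1])[:10]
for trigram, count in top:
    print(f"{count:3d}  {trigram}")
```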

4. Sentiment & Emotion Profiling

Using sentiment classifiers on review text:

  • GPT‑4o skews strongly positive, with a Joy score of ~0.38

  • DeepSeek‑V3 skews neutral, balanced across emotions (Surprise, Disgust, Sadness)

  • Gemini‑2.0 emphasizes negative tones (Disgust >0.3) 

Takeaway:

  • GPT‑4o is upbeat, Gemini‑2.0 is emotionally intense, and DeepSeek‑V3 strikes a balance, mirroring IMDb tone most closely.
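A minimal sketch of this profiling step, assuming a publicly available English emotion classifier (the Hugging Face model named below is a common choice, not necessarily the one the study used):

```python
# Sketch: per-review emotion distribution via a text-classification pipeline.
# The model name is an assumption; any emotion-labeled classifier would do.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # assumed model
    top_k=None,  # return scores for every emotion label, not just the top one
)

review = "An epic masterpiece: warm, generous, and quietly devastating."
results = classifier([review])  # list in -> one result (list of dicts) per review
profile = {d["label"]: round(d["score"], 3) for d in results[0]}
print(profile)  # e.g. {'joy': 0.41, 'surprise': 0.20, ...}
```

Averaging such per-review profiles over all reviews from one model yields aggregate scores like the Joy ≈ 0.38 reported for GPT‑4o.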

5. Semantic Similarity: How Close Are AI Reviews to IMDb?

Cosine similarity between LLM-generated and IMDb reviews shows:

  • GPT‑4o has the highest median similarity and consistent quality

  • DeepSeek‑V3 follows closely, with more stylistic diversity

  • Gemini‑2.0 trails, especially under simplified persona prompting 
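This comparison can be sketched with sentence embeddings; the encoder below (all-MiniLM-L6-v2) is an illustrative assumption, since the study's exact embedding model is not stated here.

```python
# Sketch: cosine similarity between LLM-generated and IMDb reviews.
# The embedding model is an assumption, not necessarily the study's choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

imdb_reviews = ["A human-written IMDb review goes here ..."]
llm_reviews = ["An LLM-generated review goes here ..."]

emb_imdb = model.encode(imdb_reviews, convert_to_tensor=True)
emb_llm = model.encode(llm_reviews, convert_to_tensor=True)

sims = util.cos_sim(emb_llm, emb_imdb)  # shape: (n_llm, n_imdb)
print(float(sims.median()))             # the study compares per-model medians
```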

6. IMDb vs. AI: Can People Tell the Difference?

In blind tests, participants were asked to pick whether a review was AI-generated or human-written:

  • Per-review detection accuracy ranged from 22% to 80%, averaging near chance (50%)

  • Reviews with coherent tone and moderated emotion felt more realistic 

Key insight: When done well (e.g., balanced style, neutral emotion), AI reviews can “pass” as human-written.
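To make “near chance” concrete, a two-sided binomial test can check whether an observed detection rate actually differs from 50%; this test is our illustration, and the counts below are hypothetical, not the study's raw data.

```python
# Sketch: is a detection accuracy distinguishable from coin-flipping?
# Counts are hypothetical; the study reports 22%-80% per-review accuracy.
from scipy.stats import binomtest

correct, total = 112, 200  # hypothetical: 56% of guesses were right
result = binomtest(correct, total, p=0.5, alternative="two-sided")
print(f"accuracy = {correct / total:.0%}, p-value = {result.pvalue:.3f}")
# A large p-value means the rate is statistically indistinguishable from chance.
```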

7. Input Sources: Subtitles vs. Screenplays

  • Reviews based on subtitles skew more sentimentally extreme

  • Screenplay inputs produce more balanced and controlled sentiment across models 

8. Model Comparison Summary

| Model | Emotional Style | Semantic Similarity | Human-Realism | Best Use Case |
|---|---|---|---|---|
| GPT‑4o | Very positive, warm, optimistic | 🥇 Highest | High | Introducing viewers to feel-good movies |
| DeepSeek‑V3 | Balanced, objective, neutral | 🥈 Strong | Highest | Balanced critical reviews, moderate sentiment |
| Gemini‑2.0 | Intense negative emotion | 🥉 Lower | Lower | Strong emotional expression in critiques |

GPT‑4o stands out for stylistic consistency; Gemini‑2.0 for emotional intensity; DeepSeek‑V3 for balance and credible neutrality.

9. Strengths and Remaining Gaps

✅ Strengths:

  • Excellent fluency and structural coherence

  • Realism high enough that human participants were often unsure which reviews were AI-generated

  • Persona prompting boosts expressiveness and variation

⚠️ Gaps:

  • Emotional richness and stylistic subtlety still lag behind human authors

  • AI tends to exaggerate positivity or negativity

  • Cultural and regional nuance needs improvement 

10. Practical Implications

🧠 For Filmmakers & Industry:

  • AI-generated reviews could complement user reviews, especially early in a film's release, for promotion or summaries.

  • Fine-tune models by genre to manage sentiment tone more precisely.

🛠️ For LLM Developers:

  • AI setups may require emotion calibration to avoid exaggerated tones.

  • Input choice (screenplay vs. subtitles) significantly affects output quality and tone.

✅ For Review Platforms:

  • Human-AI blending as a helper tool, not a replacement

  • Use emotion-tone detection to balance voices for curated feeds

11. Discussion & Future Research

Future directions include:

  • Collecting reviews across broader genres and non-English films

  • Exploring richer persona frameworks (e.g., critic vs. fan)

  • Incorporating multimodal inputs—audio tone, poster visuals

  • Evaluating cultural authenticity and bias detection

12. Conclusion

This in-depth study demonstrates that LLMs are now fluent enough to craft structurally coherent, sentiment-laced movie reviews, with GPT‑4o leading in realism and consistency. Nonetheless, emotional subtlety and stylistic depth still lag. Among the models, DeepSeek‑V3 strikes the best balance, making it ideal where neutrality and credibility matter most.

While not a perfect substitute for human reviews, AI-generated content is increasingly viable as a complementary tool, ready for applications in marketing, review augmentation, and curiosity-driven content creation. As LLMs evolve, so will their capacity for capturing artistic nuance and emotional authenticity in creative writing.