Lights, Camera, Language Models: Evaluating GPT‑4o, Gemini‑2.0 & DeepSeek‑V3 for Movie Review Generation 🎬


1. Introduction: What Makes a Good Movie Review?

Movie reviews aren’t just summaries—they blend narrative insight, emotional resonance, and personal voice. Fans often rely on user reviews (like IMDb’s) for recommendations. With LLMs advancing in fluency, researchers are exploring whether AI can replicate—or even enhance—this nuanced form of critique. This study focuses on three cutting-edge models:

  • GPT‑4o, OpenAI’s multimodal LLM

  • Gemini‑2.0, Google's next-gen LLM

  • DeepSeek‑V3, a sparse Transformer from the open-source community

The study evaluates them across linguistic style, sentiment, semantic similarity, and human perception.


2. Framework: From Subtitles to Screenplays to Reviews

The evaluation follows a multi-stage workflow:

  1. Data Collection

  • Six Oscar-winning or nominated films

  • Sources: subtitles, screenplays, and IMDb reviews

  2. Prompting Strategy

  • Five distinct persona prompts (e.g., "young enthusiast," "professional critic")

  • Each prompt paired with positive, neutral, and negative review variants

  3. Review Generation

  • Each model receives the source material plus a persona-mode prompt, yielding ~15 reviews per film (5 personas × 3 sentiment variants; see the sketch after this list)

  4. Analyses & Human Survey

  • Quantitative evaluation (trigram frequencies, sentiment polarity, emotion distribution, semantic similarity)

  • Human survey: participants try to distinguish AI-generated from human-written reviews
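Below is a minimal Python sketch of how such a persona × sentiment prompt grid could be assembled. The persona names, template wording, and helper function are illustrative assumptions, not the study's exact prompts.

```python
# Hypothetical reconstruction of the persona x sentiment prompt grid.
# Persona names and template wording are assumptions, not the study's prompts.
from itertools import product

PERSONAS = [
    "young enthusiast", "professional critic",      # two personas named above
    "casual viewer", "film student", "genre fan",   # placeholders for the rest
]
SENTIMENTS = ["positive", "neutral", "negative"]    # the three review variants

def build_prompt(persona: str, sentiment: str, material: str) -> str:
    """Pair one persona with one sentiment mode over the source material."""
    return (
        f"You are a {persona}. Based on the material below, write a "
        f"{sentiment} review of the film in your own voice.\n\n{material}"
    )

# 5 personas x 3 sentiment modes = 15 prompts, hence ~15 reviews per film
prompts = [
    build_prompt(p, s, "<subtitles or screenplay text>")
    for p, s in product(PERSONAS, SENTIMENTS)
]
assert len(prompts) == 15
```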

3. Linguistic Analysis: N‑grams & Stylistic Patterns

Trigram Frequency

  • IMDb reviews mention character names frequently (e.g., “li mu bai”).

  • GPT‑4o and DeepSeek‑V3 also reference names; GPT‑4o blends in praise (“epic masterpiece”) while DeepSeek‑V3 remains objective.

  • Gemini‑2.0 uses more narrative-driven trigrams (“woman loses everything”); persona prompts affect variability.

Summary: AI reviews approximate human naming patterns and structure, but the nuance of human-written style remains distinguishable.
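As a rough illustration, a trigram count like the one above can be reproduced with scikit-learn's CountVectorizer; the two sample reviews below are stand-ins, not the study's corpora.

```python
# Sketch: count the most frequent trigrams in a review corpus.
# The sample reviews are illustrative stand-ins, not the study's data.
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "li mu bai is the heart of this epic masterpiece",
    "a woman loses everything yet the film never flinches",
]

vectorizer = CountVectorizer(ngram_range=(3, 3))  # trigrams only
matrix = vectorizer.fit_transform(reviews)
totals = matrix.sum(axis=0).A1                    # total count per trigram

top = sorted(zip(vectorizer.get_feature_names_out(), totals),
             key=lambda pair: -pair[1])[:10]
for trigram, count in top:
    print(f"{count:3d}  {trigram}")
```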

4. Sentiment & Emotion Profiling

Using sentiment classifiers on review text:

  • GPT‑4o skews strongly positive, with a Joy score of ~0.38

  • DeepSeek‑V3 skews neutral, balanced across emotions (Surprise, Disgust, Sadness)

  • Gemini‑2.0 emphasizes negative tones (Disgust >0.3) 

Takeaway:

  • GPT‑4o is upbeat, Gemini‑2.0 is emotionally intense, and DeepSeek‑V3 strikes a balance, mirroring IMDb tone most closely.
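A minimal sketch of this profiling step, assuming a publicly available English emotion classifier (the Hugging Face model named below is a common choice, not necessarily the one the study used):

```python
# Sketch: per-review emotion distribution via a text-classification pipeline.
# The model name is an assumption; any emotion-labeled classifier would do.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # assumed model
    top_k=None,  # return scores for every emotion label, not just the top one
)

review = "An epic masterpiece: warm, generous, and quietly devastating."
results = classifier([review])  # list in -> one result (list of dicts) per review
profile = {d["label"]: round(d["score"], 3) for d in results[0]}
print(profile)  # e.g. {'joy': 0.41, 'surprise': 0.20, ...}
```

Averaging such per-review profiles over all reviews from one model yields aggregate scores like the Joy ≈ 0.38 reported for GPT‑4o.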

5. Semantic Similarity: How Close Are AI Reviews to IMDb?

Cosine similarity between LLM-generated and IMDb reviews shows:

  • GPT‑4o has the highest median similarity and consistent quality

  • DeepSeek‑V3 follows closely, with more stylistic diversity

  • Gemini‑2.0 trails, especially under simplified persona prompting 
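This comparison can be sketched with sentence embeddings; the encoder below (all-MiniLM-L6-v2) is an illustrative assumption, since the study's exact embedding model is not stated here.

```python
# Sketch: cosine similarity between LLM-generated and IMDb reviews.
# The embedding model is an assumption, not necessarily the study's choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

imdb_reviews = ["A human-written IMDb review goes here ..."]
llm_reviews = ["An LLM-generated review goes here ..."]

emb_imdb = model.encode(imdb_reviews, convert_to_tensor=True)
emb_llm = model.encode(llm_reviews, convert_to_tensor=True)

sims = util.cos_sim(emb_llm, emb_imdb)  # shape: (n_llm, n_imdb)
print(float(sims.median()))             # the study compares per-model medians
```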

6. IMDb vs. AI: Can People Tell the Difference?

In blind tests, participants were asked to pick whether a review was AI-generated or human-written:

  • Per-review detection accuracy ranged from 22% to 80%, averaging near chance (50%)

  • Reviews with coherent tone and moderated emotion felt more realistic 

Key insight: When done well (e.g., balanced style, neutral emotion), AI reviews can “pass” as human-written.
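To make “near chance” concrete, a two-sided binomial test can check whether an observed detection rate actually differs from 50%; this test is our illustration, and the counts below are hypothetical, not the study's raw data.

```python
# Sketch: is a detection accuracy distinguishable from coin-flipping?
# Counts are hypothetical; the study reports 22%-80% per-review accuracy.
from scipy.stats import binomtest

correct, total = 112, 200  # hypothetical: 56% of guesses were right
result = binomtest(correct, total, p=0.5, alternative="two-sided")
print(f"accuracy = {correct / total:.0%}, p-value = {result.pvalue:.3f}")
# A large p-value means the rate is statistically indistinguishable from chance.
```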

7. Input Sources: Subtitles vs. Screenplays

  • Reviews based on subtitles skew more sentimentally extreme

  • Screenplay inputs produce more balanced and controlled sentiment across models 

8. Model Comparison Summary

| Model | Emotional Style | Semantic Similarity | Human-Realism | Best Use Case |
|---|---|---|---|---|
| GPT‑4o | Very positive, warm, optimistic | 🥇 Highest | High | Introducing viewers to feel-good movies |
| DeepSeek‑V3 | Balanced, objective, neutral | 🥈 Strong | Highest | Balanced critical reviews, moderate sentiment |
| Gemini‑2.0 | Intense negative emotion | 🥉 Lower | Lower | Strong emotional expression in critiques |

GPT‑4o stands out for stylistic consistency; Gemini‑2.0 for emotional intensity; DeepSeek‑V3 for balance and credible neutrality.

9. Strengths and Remaining Gaps

✅ Strengths:

  • Excellent fluency and structural coherence

  • Realism high enough that human participants were often unsure which reviews were AI-generated

  • Persona prompting boosts expressiveness and variation

⚠️ Gaps:

  • Emotional richness and stylistic subtlety still lag behind human authors

  • AI tends to exaggerate positivity or negativity

  • Cultural and regional nuance needs improvement 

10. Practical Implications

🧠 For Filmmakers & Industry:

  • AI-generated reviews could complement user reviews, especially early in a film's release, for promotion or summaries.

  • Fine-tune models by genre to manage sentiment tone more precisely.

🛠️ For LLM Developers:

  • AI setups may require emotion calibration to avoid exaggerated tones.

  • Input choice (screenplay vs. subtitles) significantly affects output quality and tone.

✅ For Review Platforms:

  • Human-AI blending as a helper tool, not a replacement

  • Use emotion-tone detection to balance voices for curated feeds

11. Discussion & Future Research

Future directions include:

  • Collecting reviews across broader genres and non-English films

  • Exploring richer persona frameworks (e.g., critic vs. fan)

  • Incorporating multimodal inputs—audio tone, poster visuals

  • Evaluating cultural authenticity and bias detection

12. Conclusion

This in-depth study demonstrates that LLMs are now fluent enough to craft structurally coherent, sentiment-laced movie reviews, with GPT‑4o leading in realism and consistency. Nonetheless, emotional subtlety and stylistic depth still lag. Among the models, DeepSeek‑V3 strikes the best balance, making it ideal where neutrality and credibility matter most.

While not a perfect substitute for human reviews, AI-generated content is increasingly viable as a complementary tool, ready for applications in marketing, review augmentation, and curiosity-driven content creation. As LLMs evolve, so will their capacity for capturing artistic nuance and emotional authenticity in creative writing.