Evaluating Large Language Models for Movie Review Generation: A Comparative Study of GPT-4o, Gemini-2.0, and DeepSeek-V3

2025-09-10

Abstract

Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks such as text generation, summarization, and sentiment analysis. Their applications in consumer product reviews are expanding rapidly, paving the way for automated movie review generation. This study presents a comparative evaluation of three advanced LLMs—GPT-4o, DeepSeek-V3, and Gemini-2.0—for generating movie reviews. Using movie subtitles and scripts as inputs, we assess their performance against IMDb user reviews, examining lexical diversity, sentiment polarity, semantic similarity, and thematic coherence. Findings indicate that while LLM-generated reviews are grammatically fluent and structurally complete, they still differ from IMDb reviews in emotional richness and stylistic authenticity. A user-based survey further tested whether participants could distinguish between LLM-generated and IMDb reviews. Results showed that DeepSeek-V3 produced the most balanced reviews, closely resembling IMDb comments, while GPT-4o leaned toward excessive positivity and Gemini-2.0 overemphasized negative emotions. The study concludes with implications for refining LLMs for creative applications, highlighting challenges and opportunities in bridging the gap between machine-generated and human-authored reviews.


1. Introduction

The rise of generative AI has transformed how textual content is created, consumed, and evaluated. From product reviews to personalized recommendations, Large Language Models (LLMs) are increasingly applied in domains that require human-like expression and evaluative commentary. Among these, the generation of movie reviews is particularly significant, as film criticism combines objective description with subjective interpretation and emotional engagement.

IMDb, as one of the largest online movie databases, provides millions of user-generated reviews that reflect diverse perspectives and writing styles. Replicating this authenticity through LLMs poses a unique challenge: while models can generate coherent and fluent text, capturing the stylistic nuance and emotional variety of human-authored reviews remains difficult.

This study investigates how three state-of-the-art LLMs—GPT-4o, DeepSeek-V3, and Gemini-2.0—perform in generating movie reviews. We explore the following research questions:

  1. How do LLM-generated movie reviews compare to IMDb user reviews in terms of lexical diversity, sentiment polarity, semantic similarity, and thematic coherence?

  2. Can human participants reliably distinguish between LLM-generated reviews and authentic IMDb reviews?

  3. Which LLM most closely approximates the qualities of human-authored movie reviews?

2. Background

2.1 LLMs in Creative Text Generation

Early applications of LLMs focused on structured tasks such as summarization, translation, and question answering. More recently, their ability to generate free-form text has expanded into creative domains such as storytelling, poetry, and review writing. These applications test LLMs’ ability not only to produce grammatical sentences but also to emulate tone, style, and subjectivity.

2.2 Movie Reviews as a Test Case

Movie reviews represent a hybrid genre:

  • Descriptive elements (plot summary, character analysis, cinematography).

  • Evaluative elements (emotional impact, performance quality, originality).

  • Subjective style (personal voice, humor, rhetorical devices).

The richness of this genre makes it a suitable benchmark for assessing whether LLMs can capture both semantic accuracy and human-like expressiveness.

2.3 Selected Models

  • GPT-4o: OpenAI’s latest multimodal model optimized for conversational fluency and stylistic coherence. Known for generating optimistic and engaging outputs.

  • DeepSeek-V3: A Chinese-developed LLM that emphasizes balanced reasoning, contextual awareness, and stylistic adaptation, which helps it maintain neutrality while still producing engaging content.

  • Gemini-2.0: Google’s most advanced LLM, designed with multi-turn reasoning and contextual grounding, often excelling in capturing critical or negative tones, though at times with exaggerated intensity.

3. Methodology

3.1 Data Sources

  • Scripts and Subtitles: Used as structured input data for LLM prompts.

  • IMDb User Reviews: Served as the benchmark dataset, representing authentic human evaluations across genres.

3.2 Prompting Strategy

Each model received standardized prompts, such as:

  • “Using the provided script/subtitles, write a movie review in the style of an IMDb user. Include both strengths and weaknesses of the film, with attention to emotional tone.”
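To make the setup concrete, here is a minimal sketch of how this standardized prompt might be assembled and sent, using OpenAI’s chat-completions client for GPT-4o as an illustration. The generate_review helper is hypothetical, and the assumption that DeepSeek-V3 and Gemini-2.0 would be called analogously through their own SDKs is ours, not the study’s.

```python
# Minimal sketch: attach script/subtitle text to the standardized prompt
# and request a review from one model. Assumes the OpenAI Python client
# (pip install openai) with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Using the provided script/subtitles, write a movie review in the style "
    "of an IMDb user. Include both strengths and weaknesses of the film, "
    "with attention to emotional tone.\n\n"
    "SCRIPT/SUBTITLES:\n{source_text}"
)

def generate_review(source_text: str, model: str = "gpt-4o") -> str:
    """Send the standardized prompt with the script/subtitle text attached."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(source_text=source_text)}],
    )
    return response.choices[0].message.content
```

Keeping the prompt identical across models, with only the model endpoint swapped, is what makes the downstream comparisons attributable to the models rather than to prompt wording.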

3.3 Evaluation Metrics

  1. Lexical Diversity: Measured by type-token ratio (TTR), i.e., the number of unique word forms divided by the total word count (see the sketch after this list).

  2. Sentiment Polarity: Analyzed with the VADER sentiment analyzer, which scores polarity from −1 (most negative) to +1 (most positive).

  3. Semantic Similarity: Computed using BERTScore against IMDb references.

  4. Thematic Coherence: Human raters scored thematic alignment with provided scripts.
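As a sketch of how the three automatic metrics could be computed, the snippet below uses the vaderSentiment and bert-score Python packages. The whitespace tokenization for TTR and the use of VADER’s compound score are our assumptions about the pipeline; the study does not publish its code.

```python
# Sketch of the three automatic metrics
# (pip install vaderSentiment bert-score).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from bert_score import score as bert_score

def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique word forms divided by total word count."""
    tokens = text.lower().split()  # assumption: simple whitespace tokens
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def sentiment_polarity(text: str) -> float:
    """VADER compound polarity in [-1, 1]; > 0 positive, < 0 negative."""
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]

def semantic_similarity(candidates: list[str], references: list[str]) -> float:
    """Mean BERTScore F1 of generated reviews against IMDb references."""
    _, _, f1 = bert_score(candidates, references, lang="en")
    return f1.mean().item()
```

Thematic coherence, by contrast, is a human judgment and has no automatic counterpart here.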

3.4 User Study

  • Participants: 100 undergraduate students familiar with IMDb.

  • Task: Each participant read 20 reviews (10 LLM-generated, 10 IMDb-authored) and indicated whether they believed each review was human- or machine-written.

  • Metrics: Accuracy in distinguishing review sources and participants’ self-reported confidence levels (scoring sketched below).
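A small sketch of how the resulting judgments could be scored. Each record pairs a review’s true source with the participant’s guess; the field names (true_source, guess, model) are hypothetical, chosen for illustration.

```python
# Score user-study judgments: overall accuracy, plus how often each
# model's reviews were misclassified as human. Field names are illustrative.
from collections import defaultdict

def score_judgments(judgments: list[dict]) -> dict:
    correct = sum(j["guess"] == j["true_source"] for j in judgments)
    fooled = defaultdict(lambda: [0, 0])  # model -> [passed_as_human, total]
    for j in judgments:
        if j["true_source"] == "llm":
            fooled[j["model"]][1] += 1
            if j["guess"] == "human":
                fooled[j["model"]][0] += 1
    return {
        "accuracy": correct / len(judgments),
        "passed_as_human": {m: n / t for m, (n, t) in fooled.items()},
    }
```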

4. Results

4.1 Lexical Diversity

  • DeepSeek-V3: Highest lexical variety, comparable to that of IMDb reviews (TTR = 0.62).

  • GPT-4o: Moderately diverse but tended toward repetitive positive adjectives (“amazing,” “brilliant”).

  • Gemini-2.0: Lower variety, over-reliant on emotionally charged words (“terrible,” “awful,” “disappointing”).

4.2 Sentiment Polarity

  • GPT-4o: Skewed toward positivity, average polarity score +0.71.

  • DeepSeek-V3: Balanced sentiment, average polarity +0.18, closely aligned with IMDb’s +0.22.

  • Gemini-2.0: Strong negativity, average polarity −0.55, producing harsher critiques than IMDb reviews.

4.3 Semantic Similarity

Measured by BERTScore:

  • DeepSeek-V3: 0.89

  • GPT-4o: 0.85

  • Gemini-2.0: 0.83

  • IMDb baseline (human-to-human agreement): 0.91

4.4 Thematic Coherence

Human evaluators rated thematic coherence (scale 1–5):

  • DeepSeek-V3: 4.5

  • GPT-4o: 4.1

  • Gemini-2.0: 3.8

4.5 User Study Findings

  • Average accuracy in distinguishing LLM vs IMDb reviews: 56% (near chance).

  • DeepSeek-V3 reviews were misclassified as human 72% of the time.

  • GPT-4o reviews were correctly identified as AI 61% of the time due to overly polished tone.

  • Gemini-2.0 reviews were identified as AI 67% of the time because of excessive negativity and exaggerated phrasing.

5. Discussion

5.1 Strengths of LLMs in Movie Review Generation

  • Grammatical Fluency: All three models produced structurally coherent and grammatically accurate text.

  • Thematic Alignment: By using scripts and subtitles as input, models produced reviews relevant to the narrative and characters.

  • Near-human Indistinguishability: Survey results suggest that, at least for DeepSeek-V3, distinguishing machine-generated from human-authored reviews is increasingly difficult.

5.2 Weaknesses and Limitations

  • Emotional Richness: Human reviews often include nuanced feelings, sarcasm, and personal anecdotes—elements LLMs still struggle to emulate.

  • Stylistic Variety: LLMs tend toward formulaic sentence structures, lacking the idiosyncrasies of real reviewers.

  • Polarity Bias: GPT-4o’s “optimism bias” and Gemini-2.0’s “negativity bias” show that emotional calibration remains a challenge.

5.3 Comparative Insights

  • DeepSeek-V3 emerges as the most human-like, offering balanced sentiment and strong lexical diversity.

  • GPT-4o excels in readability but risks sounding overly sanitized.

  • Gemini-2.0 captures negative sentiment effectively but over-intensifies critique, reducing authenticity.

6. Implications

6.1 For Film Criticism

LLMs can assist in summarizing plotlines, structuring critiques, and generating baseline reviews. However, editorial oversight is essential to ensure emotional authenticity and stylistic variety.

6.2 For LLM Development

  • Improved sentiment calibration is needed to avoid polarity extremes.

  • Training on authentic user-generated datasets (forums, fan reviews) may enhance stylistic realism.

  • Incorporating humor, sarcasm, and narrative voice would bridge the gap between AI and human reviewers.

6.3 For Human-AI Collaboration

Rather than replacing human reviewers, LLMs can function as co-reviewers—providing first drafts, summarizing key themes, or generating contrasting perspectives for critics to refine.

7. Limitations of the Study

  1. Dataset Restriction: Focused only on IMDb reviews; broader platforms (Rotten Tomatoes, Letterboxd) might reveal different stylistic benchmarks.

  2. Participant Demographics: Undergraduate participants may differ from seasoned film critics in sensitivity to review style.

  3. Script/Subtitles Input: Reviews generated from official scripts and subtitles may differ from user reviews, which are shaped by subjective viewing experiences.

8. Conclusion

This study demonstrates that LLMs can generate movie reviews approaching human-like quality in fluency, coherence, and thematic relevance, but challenges persist in emotional richness and stylistic authenticity.

Among the three models tested, DeepSeek-V3 produced the most balanced and human-like reviews, GPT-4o leaned excessively positive, and Gemini-2.0 overemphasized negative emotions. The difficulty participants faced in distinguishing LLM from IMDb reviews suggests that AI is approaching a threshold of indistinguishability in consumer opinion writing.

Future research should focus on refining emotional calibration, stylistic diversity, and contextual nuance to fully realize the potential of LLMs in creative and evaluative domains such as film criticism.
