Why it matters to enterprise users of AI
As enterprises consider deploying large language models such as Google's Gemini or OpenAI's GPT, evaluating real-world accuracy and potential risks is crucial for responsible AI adoption. Even marginal performance differences can have outsized business impact across large-scale production systems, which makes independent benchmark studies that closely mimic enterprise application scenarios especially important in these high-stakes environments.
A new third-party benchmark study from Carnegie Mellon University and BerriAI provides tangible insights for enterprises navigating adoption of this rapidly evolving technology. By systematically comparing Google's Gemini Pro to OpenAI's GPT 3.5 Turbo and GPT 4 Turbo, the researchers establish a much-needed independent performance baseline covering key areas such as reasoning, knowledge application, content filtering and translation.
Read the full study here: https://arxiv.org/pdf/2312.11444.pdf
The Data and Scope
Researchers from Carnegie Mellon University and BerriAI systematically compared the capabilities of Gemini Pro against GPT 3.5 Turbo and GPT 4 Turbo across 10 diverse tasks testing language understanding, reasoning, translation and more.
Key Findings Relevant for Enterprise Deployments:
1. Report Generation and Research Aggregation
- Gemini Pro achieves strong but slightly lower accuracy than GPT 3.5 Turbo on average (64.12% vs 67.75% on the MMLU benchmark, Table 1). It struggles with mathematical reasoning and shows bias in multiple-choice answers (Figure 2).
- However, Gemini outperforms GPT 3.5 Turbo on longer reasoning chains of over 900 tokens on the MMLU dataset (Figure 5), suggesting a degree of robustness relevant to long-form report generation.
- Gemini also exhibits multiple-choice answer-order bias (Figure 2; a simple probe for this is sketched below), terminates tasks early in WebArena (Figure 23), and blocks roughly 28% of responses on sensitive MMLU topics (Section 3.2), leaving gaps in coverage.
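For teams that want to sanity-check this kind of answer-order bias on their own question sets, here is a minimal sketch. It assumes a hypothetical ask_model wrapper around whichever chat-completion client you use, and the rotate-and-count approach is illustrative rather than the study's exact methodology.

```python
import collections

def rotate(options, k):
    """Rotate the option list so the same content lands in different answer slots."""
    return options[k:] + options[:k]

def position_bias_check(ask_model, questions, labels=("A", "B", "C", "D")):
    """Count which answer slot the model picks while the option content is held
    fixed but its order is rotated. A heavy skew toward one slot suggests
    position bias rather than genuine knowledge.

    `questions` is a list of {"question": str, "options": [str, ...]} dicts;
    `ask_model` is a hypothetical callable that takes a prompt and returns text.
    """
    picks = collections.Counter()
    for q in questions:
        for k in range(len(labels)):
            opts = rotate(q["options"], k)
            prompt = (
                q["question"]
                + "\n"
                + "\n".join(f"{label}. {opt}" for label, opt in zip(labels, opts))
                + "\nAnswer with a single letter."
            )
            answer = ask_model(prompt).strip()[:1].upper()
            if answer in labels:
                picks[answer] += 1
    return picks  # e.g. a Counter dominated by one letter would flag a skew
```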
2. Reasoning and Complex Scenario Analysis
- Gemini matches GPT 3.5 Turbo on easier reasoning tasks but performs worse as input queries grow longer (Figure 7). It particularly struggles with object tracking and arithmetic, as shown in Figure 8.
- However, Gemini Pro outperforms GPT 3.5 Turbo on the most complex reasoning chains, those over 100 tokens, on the GSM8K math task (Figure 13), indicating strengths for complex analysis.
3. Creativity and Marketing
- The study did not explicitly evaluate creative generation or marketing applications, but Gemini's multimodal foundations may give it an edge here based on Google's own reported results.
4. Customer Engagement and Support
- Gemini's tendency toward shorter responses and premature task termination, seen in the WebArena agent experiments (Figure 23), indicates risks for customer engagement use cases.
- Its content filtering also blocked up to 28% of responses on sensitive topics (Section 3.2), gaps in coverage that could frustrate customers; a simple fallback pattern for handling such blocks is sketched below.
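As a starting point for mitigating blocked or refused replies in a support workflow, the following is a minimal sketch assuming BerriAI's LiteLLM completion() client and its OpenAI-style response objects. The refusal markers and model identifiers are illustrative placeholders, not values taken from the study.

```python
from litellm import completion

# Illustrative refusal phrases; tune these to the refusals you actually observe.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm not able to")

def looks_blocked(text):
    """Crude check for an empty reply or a boilerplate refusal."""
    if not text or not text.strip():
        return True
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def answer_with_fallback(user_message,
                         primary="gemini/gemini-pro",   # placeholder model names
                         fallback="gpt-3.5-turbo"):
    """Try the primary model; if the reply appears blocked, retry on the fallback."""
    messages = [{"role": "user", "content": user_message}]
    reply = completion(model=primary, messages=messages).choices[0].message.content
    if looks_blocked(reply):
        reply = completion(model=fallback, messages=messages).choices[0].message.content
    return reply
```

Even a heuristic filter like this can surface how often a given model declines your real customer queries before you commit to it in production.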
Unknowns and Limitations
While the study provides valuable insights, several significant questions remain unanswered for enterprises evaluating adoption of Gemini or other language models:
- Training Data and Methods. The study offers no details on Gemini's training methodology or data sources. Understanding model provenance is important for judging ethics and security.
- Benchmark vs. Real-World Performance. Evaluations used standardized datasets, and real-world accuracy on business applications may differ significantly. Rigorous piloting on internal data is still vital (a minimal pilot harness is sketched after this list).
- Prompt Engineering Effects. The results may depend heavily on the specific prompts used; further prompt tuning could improve Gemini's accuracy, as some prior studies have shown.
- Language Breadth. Apart from the translation task, evaluations were conducted in English; businesses operating globally require broader multilingual coverage than the study measures.
- Safety and Bias. Critical aspects such as output veracity, unfair bias and mutation testing were not examined, though they are crucial for enterprises.
- Future Trajectory. The Gemini Pro version evaluated will certainly evolve. Extrapolating long-term capabilities from this snapshot could underestimate progress.
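To make the piloting and prompt-sensitivity points above concrete, here is a minimal sketch of an internal pilot harness. The run_model and is_correct hooks are hypothetical stand-ins for whatever client and scoring metric you use, and the prompt templates are illustrative rather than drawn from the study.

```python
from itertools import product

# Illustrative prompt variants; add the templates your teams actually use.
PROMPT_TEMPLATES = {
    "plain": "{question}",
    "step_by_step": "Think step by step, then answer.\n\n{question}",
}

def pilot(run_model, is_correct, models, cases):
    """Return accuracy per (model, template) pair on internal test cases.

    `cases` is a list of {"question": ..., "expected": ...} dicts;
    `run_model(model, prompt)` returns the model's reply;
    `is_correct(reply, expected)` returns True/False for your own metric.
    """
    results = {}
    for model, (name, template) in product(models, PROMPT_TEMPLATES.items()):
        hits = 0
        for case in cases:
            reply = run_model(model, template.format(question=case["question"]))
            hits += int(is_correct(reply, case["expected"]))
        results[(model, name)] = hits / len(cases)
    return results
```

Crossing candidate models with a few prompt variants on even a few dozen labeled internal examples is often enough to show whether the benchmark gaps reported above carry over to your own workloads.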
In summary, this thorough third-party study, backed by detailed data tables, provides valuable insights for enterprises assessing Gemini. While Gemini Pro currently lags GPT 4 Turbo overall, it demonstrates strengths on long, complex scenarios and certain non-English use cases that enterprises should factor into their evaluations.