Introducing GPT-4o: a flagship model that can reason across audio, vision, and text in real time
OpenAI has unveiled GPT-4o ("o" for "omni"), a significant step towards much more natural human-computer interaction. GPT-4o accepts any combination of text, audio, and image inputs and generates corresponding outputs in text, audio, or image form. This multimodal capability enables more seamless and intuitive interaction between humans and machines.
Unprecedented Response Times
One of the most remarkable features of GPT-4o is its ability to respond to audio inputs in as little as 232 milliseconds, with an average response time of 320 milliseconds. This speed is comparable to human response times in conversation, making interactions with GPT-4o feel remarkably natural and fluid.
Improved Performance and Cost-Efficiency
While matching GPT-4 Turbo's performance on English text and code, GPT-4o shows significant improvements in text understanding for non-English languages. The model is also much faster and 50% cheaper to run in the OpenAI API, making it a more accessible and cost-effective option for a wide range of applications.
Multimodal Capabilities
GPT-4o excels at vision and audio understanding compared to existing models, enabling it to comprehend and process information in a multitude of modalities. This capability opens up new possibilities for applications such as visual narratives, interview preparation, real-time translation, customer service, and more.
Model Evaluations
OpenAI has conducted extensive evaluations of GPT-4o's performance across various benchmarks, and the results are impressive. The model achieves GPT-4 Turbo-level performance on text, reasoning, and coding tasks while setting new high watermarks for multilingual, audio, and vision capabilities.
In text evaluations, GPT-4o sets a new high score of 88.7% on zero-shot CoT MMLU (general knowledge questions) and 87.2% on the traditional 5-shot no-CoT MMLU.
For audio tasks, GPT-4o dramatically improves speech recognition performance over Whisper-v3 across all languages, particularly for lower-resourced languages. Additionally, it sets a new state-of-the-art in speech translation, outperforming Whisper-v3 on the MLS benchmark.
In vision understanding evaluations, GPT-4o achieves state-of-the-art performance on visual perception benchmarks such as MMMU, MathVista, and ChartQA.
Language Tokenization
OpenAI has also introduced a new tokenizer for GPT-4o, which significantly reduces the number of tokens required to represent text across various languages. For example, GPT-4o uses 4.4 times fewer tokens for Gujarati, 3.5 times fewer for Telugu, and 3.3 times fewer for Tamil, compared to previous models. This improvement in tokenization efficiency not only enhances the model's performance but also contributes to its cost-effectiveness.
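As a rough way to see this in practice, the sketch below compares token counts between GPT-4o's newer o200k_base encoding and the older cl100k_base encoding used by GPT-4 Turbo, via the open-source tiktoken library. The Gujarati sample sentence is only an illustrative example; exact counts will vary by text.

```python
# pip install tiktoken
import tiktoken

# GPT-4o ships with the newer o200k_base encoding;
# GPT-4 and GPT-4 Turbo use cl100k_base.
new_enc = tiktoken.get_encoding("o200k_base")
old_enc = tiktoken.get_encoding("cl100k_base")

sample = "નમસ્તે, તમે કેમ છો?"  # illustrative Gujarati sample ("Hello, how are you?")
print(len(old_enc.encode(sample)), "tokens with cl100k_base (GPT-4 Turbo)")
print(len(new_enc.encode(sample)), "tokens with o200k_base (GPT-4o)")
```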
Model Safety and Limitations
OpenAI has prioritized safety in the development of GPT-4o, incorporating techniques such as filtering training data and refining the model's behavior through post-training. Additionally, new safety systems have been created to provide guardrails on voice outputs.
GPT-4o has undergone extensive evaluations according to OpenAI's Preparedness Framework and in line with the company's voluntary commitments. Assessments of cybersecurity, CBRN (chemical, biological, radiological, and nuclear), persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories.
Furthermore, GPT-4o has been subjected to external red teaming with over 70 experts in various domains, including social psychology, bias and fairness, and misinformation, to identify potential risks introduced or amplified by the new modalities. OpenAI has used these learnings to build out safety interventions and improve the safety of interacting with GPT-4o.
While GPT-4o represents a significant advancement, OpenAI acknowledges that the model has limitations across all modalities. These limitations will be continuously addressed as new risks are discovered.
Model Availability
GPT-4o is OpenAI's latest step in pushing the boundaries of deep learning, with a focus on practical usability. After two years of research on efficiency improvements, OpenAI is now able to make a GPT-4-level model available much more broadly.
GPT-4o's text and image capabilities are currently rolling out in ChatGPT, available in the free tier and to Plus users with up to 5 times higher message limits. A new version of Voice Mode with GPT-4o will be released in alpha within ChatGPT Plus in the coming weeks.
Developers can also access GPT-4o as a text and vision model in the OpenAI API, where it is 2 times faster, half the price, and offers 5 times higher rate limits compared to GPT-4 Turbo. OpenAI plans to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
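For developers, a minimal sketch of a text-plus-vision request through the official OpenAI Python SDK might look like the following. The image URL is a hypothetical placeholder, and an OPENAI_API_KEY environment variable is assumed.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single request mixing text and an image input (URL is a placeholder).
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```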
With its groundbreaking multimodal capabilities, unprecedented response times, and improved performance and cost-efficiency, GPT-4o represents a significant step forward in the field of AI, paving the way for more natural and seamless human-computer interactions.
Google Introduces Gemini 1.5 Flash and Unveils Updates Across Its AI Model Family
In December 2023, Google launched its first natively multimodal model, Gemini 1.0, in three sizes: Ultra, Pro, and Nano. Just a few months later, they released Gemini 1.5 Pro, with enhanced performance and a breakthrough long context window of 1 million tokens.
Developers and enterprise customers have been leveraging Gemini 1.5 Pro's long context window, multimodal reasoning capabilities, and impressive overall performance for various applications. However, Google recognized that some use cases require lower latency and a lower cost to serve. To address this need, they have introduced Gemini 1.5 Flash, a lighter-weight model designed to be fast and efficient for large-scale deployment.
Both Gemini 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. Additionally, Gemini 1.5 Pro is now available with a 2 million token context window via waitlist to developers using the API and to Google Cloud customers.
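To try either model from the public preview, a minimal call through the google-generativeai Python SDK (the Gemini API client used with Google AI Studio keys) could look like the sketch below. The API key is a hypothetical placeholder, and model name strings may vary by release.

```python
# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Gemini 1.5 Flash: the lighter, faster model with the long context window.
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "Summarize the trade-offs between Gemini 1.5 Pro and Gemini 1.5 Flash "
    "in three bullet points."
)
print(response.text)
```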
Google has also introduced updates across its Gemini family of models, announced the next generation of open models, Gemma 2, and shared progress on the future of AI assistants with Project Astra.
Gemini 1.5 Flash: Optimized for Speed and Efficiency
Gemini 1.5 Flash is the newest addition to the Gemini model family and the fastest Gemini model served in the API. It is optimized for high-volume, high-frequency tasks at scale, offering cost-efficiency and featuring the breakthrough long context window.
While lighter weight than Gemini 1.5 Pro, Gemini 1.5 Flash is highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size. It excels at tasks such as summarization, chat applications, image and video captioning, and data extraction from long documents and tables.
Gemini 1.5 Flash has been trained through a process called "distillation," where the essential knowledge and skills from the larger Gemini 1.5 Pro model are transferred to a smaller, more efficient model.
Improvements to Gemini 1.5 Pro
Google has significantly improved Gemini 1.5 Pro, its best model for general performance across a wide range of tasks. Beyond extending its context window to 2 million tokens, they have enhanced its capabilities in code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding through data and algorithmic advances.
Gemini 1.5 Pro can now follow increasingly complex and nuanced instructions, including those that specify product-level behavior involving role, format, and style. Google has improved control over the model's responses for specific use cases, such as crafting the persona and response style of a chat agent or automating workflows through multiple function calls. Users can also steer model behavior by setting system instructions.
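A short sketch of steering behavior with a system instruction, again via the google-generativeai SDK, is shown below; the persona and prompt are invented for illustration.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# A system instruction fixes the persona and response style for every turn.
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    system_instruction=(
        "You are a concise support agent for a hypothetical camera store. "
        "Answer in two sentences or fewer."
    ),
)

chat = model.start_chat()
reply = chat.send_message("Do you sell tripods for mirrorless cameras?")
print(reply.text)
```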
Additionally, Gemini 1.5 Pro now features audio understanding in the Gemini API and Google AI Studio, enabling it to reason across images and audio for uploaded videos.
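Uploaded media can be passed alongside a prompt via the Gemini File API; the sketch below assumes a local file named team_meeting.mp4, which is a hypothetical example, and polls until the upload has been processed.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Upload a local recording via the File API (path is a hypothetical example).
media = genai.upload_file("team_meeting.mp4")

# Video and audio files are processed asynchronously; wait until ready.
while media.state.name == "PROCESSING":
    time.sleep(5)
    media = genai.get_file(media.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [media, "Summarize the key decisions made in this meeting."]
)
print(response.text)
```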
Gemini Nano Understands Multimodal Inputs
Gemini Nano is expanding beyond text-only inputs to include images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to understand the world the way people do — not just through text, but also through sight, sound, and spoken language.
Next Generation of Open Models: Gemma 2
Google is also sharing updates to Gemma, its family of open models built from the same research and technology used to create the Gemini models. They are announcing Gemma 2, the next generation of open models for responsible AI innovation, featuring a new architecture designed for breakthrough performance and efficiency, available in new sizes.
The Gemma family is also expanding with PaliGemma, its first vision-language model, inspired by PaLI-3. Additionally, Google has upgraded its Responsible Generative AI Toolkit with LLM Comparator for evaluating the quality of model responses.
Progress on the Future of AI Assistants: Project Astra
As part of Google DeepMind's mission to build AI responsibly to benefit humanity, they have been working on developing universal AI agents that can be helpful in everyday life. Today, they are sharing their progress in building the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).
To be truly useful, an AI agent needs to understand and respond to the complex and dynamic world just like people do — taking in and remembering what it sees and hears to understand context and take action. It also needs to be proactive, teachable, and personal, allowing users to talk to it naturally and without lag or delay.
Building on Gemini, Google has developed prototype agents that can process information faster by continuously encoding video frames, combining video and speech input into a timeline of events, and caching this information for efficient recall. They have also enhanced how these agents sound, giving them a wider range of intonations and enabling them to better understand the context they're being used in and respond quickly in conversation.
Google envisions a future where people could have an expert AI assistant by their side, through a phone or glasses, leveraging technology like Project Astra. Some of these capabilities are expected to come to Google products, such as the Gemini app and web experience, later this year.
Continued Exploration and Innovation
Google is committed to advancing the state-of-the-art in AI models and technologies. By investing in a "relentless production line of innovation," they aim to explore new ideas at the frontier while unlocking the possibility of new and exciting use cases for Gemini models.
Which one is actually performing better?
Comparing GPT-4o and Gemini: A Detailed Analysis
In the rapidly evolving landscape of artificial intelligence, two models have emerged as frontrunners in multimodal AI capabilities: OpenAI's GPT-4o and Google's Gemini. Both aim to enhance natural human-computer interactions by supporting multiple input and output modalities, such as text, audio, and images. This detailed analysis compares the features, performance, safety measures, availability, and cost-efficiency of these two cutting-edge AI models.
Multimodal Capabilities
GPT-4o
- Inputs and Outputs: GPT-4o accepts any combination of text, audio, and image inputs, generating corresponding outputs in text, audio, or image formats. This makes it highly versatile for various applications, from generating visual content based on audio descriptions to providing spoken answers to text queries.
- Applications: The multimodal capabilities of GPT-4o enable it to be used in diverse fields such as visual storytelling, real-time translation, interview preparation, and enhanced customer service interactions.
Gemini
- Inputs and Outputs: Gemini also supports multimodal reasoning, handling text, images, and audio inputs and outputs. Gemini Nano, specifically, has expanded to include image understanding, starting with Google's Pixel devices.
- Applications: Gemini models, particularly Gemini 1.5 Pro and Flash, are suited for summarization, chat applications, image and video captioning, and data extraction from extensive documents and tables. Google's Project Astra aims to further these capabilities into real-time, proactive AI assistants.
Performance
GPT-4o
- Response Times: GPT-4o excels in speed, with audio input response times as low as 232 milliseconds and an average of 320 milliseconds, comparable to human conversational speeds.
- Text and Multilingual Performance: It matches GPT-4 Turbo in English text and code tasks while surpassing it in non-English text understanding. It has set new benchmarks in multilingual capabilities and performs exceptionally well in vision and audio tasks.
- Evaluations: GPT-4o scores highly on benchmarks such as zero-shot CoT MMLU and the traditional 5-shot no-CoT MMLU, and outperforms Whisper-v3 in speech recognition and translation tasks, particularly for less-resourced languages.
Gemini
- Context Window: Gemini 1.5 Pro offers an impressive context window of up to 2 million tokens (via waitlist), which is crucial for tasks that require understanding and retaining large volumes of information over extended interactions (a quick token-count check is sketched after this list).
- Performance Enhancements: Gemini 1.5 Pro has shown significant improvements in code generation, logical reasoning, and planning, and can handle multi-turn conversations more effectively. The new Gemini 1.5 Flash is optimized for speed and efficiency, making it ideal for high-volume, high-frequency tasks.
- Audio and Visual Understanding: Gemini models, particularly through the API and Google AI Studio, have enhanced capabilities in audio understanding and multimodal reasoning, able to process and reason across images and audio for complex tasks.
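As a practical complement to the long context window, the google-generativeai SDK exposes a count_tokens call for checking how much of the window a document will consume before sending it. The file name below is a hypothetical example, as is the API key placeholder.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")

# Measure how much of the context window a document would consume
# before sending it (the file path is a hypothetical example).
with open("annual_report.txt") as f:
    document = f.read()

print(model.count_tokens(document).total_tokens, "tokens")
```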
Safety and Limitations
GPT-4o
- Safety Measures: OpenAI has prioritized safety by filtering training data and refining model behavior post-training. The model has undergone extensive evaluations by external experts in areas such as social psychology, bias, and misinformation to identify and mitigate potential risks.
- Preparedness Framework: GPT-4o has been evaluated against OpenAI's Preparedness Framework, ensuring it does not score above Medium risk in cybersecurity, CBRN, persuasion, and model autonomy assessments.
- Limitations: Despite its advancements, GPT-4o has acknowledged limitations in all modalities, with continuous efforts to address new risks as they are discovered.
Gemini
- Responsible AI: Google has implemented responsible AI practices, with models like Gemma 2 designed for breakthrough performance and efficiency while maintaining ethical standards.
- Project Astra: This initiative focuses on developing AI assistants that understand and respond to the dynamic world in real-time, ensuring proactive and teachable AI systems.
- Model Improvements: Google's ongoing improvements across the Gemini family include enhancing control over model responses, enabling more personalized and accurate interactions for specific use cases.
Availability and Accessibility
GPT-4o
- Integration: GPT-4o is being rolled out in ChatGPT, available to free tier and Plus users. It will soon support text, image, audio, and video capabilities more broadly.
- API Access: Developers can access GPT-4o through the OpenAI API, which is twice as fast and half the price of GPT-4 Turbo, with higher rate limits for more extensive use cases.
Gemini
- Google AI Studio and Vertex AI: Both Gemini 1.5 Pro and Flash are available in public preview, with extended context windows accessible via API and to Google Cloud customers.
- Project Astra: Google plans to integrate advanced capabilities from Project Astra into its products, such as the Gemini app and web experience, later this year, enhancing real-time, interactive AI assistant functionalities.
Cost Efficiency
GPT-4o
- Operational Costs: GPT-4o is designed to be 50% cheaper to run compared to GPT-4 Turbo, making it a cost-effective solution for various applications while maintaining high performance.
Gemini
- Efficiency: Gemini 1.5 Flash, optimized for speed and efficiency, offers a cost-effective solution for large-scale deployments, suitable for high-volume tasks without compromising on performance.
Conclusion
Both GPT-4o and Gemini represent significant advancements in the field of AI, particularly in multimodal capabilities that enhance natural human-computer interactions. GPT-4o excels in response times and multilingual understanding, offering a cost-effective solution with extensive safety measures. Meanwhile, Gemini offers remarkable context windows, optimized performance for high-volume tasks, and advanced reasoning capabilities, paving the way for future real-time AI assistants through Project Astra.
Each model brings unique strengths to the table, catering to different needs and applications in the AI landscape. As these technologies continue to evolve, they promise to unlock new possibilities and make interactions with AI more seamless, intuitive, and efficient.
For more information: