OpenAI has launched GPT-4o. GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts any combination of text, audio, and image as input and generates any combination of text, audio, and image outputs. OpenAI's website states that GPT-4o is being made available in the free tier, and to Plus users with up to 5x higher message limits.

OpenAI remarks that the new model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. GPT-4o is a new flagship model that can reason across audio, vision, and text in real time. It matches GPT-4 Turbo performance on English text and code, with significant improvement on text in non-English languages, and it is also much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

OpenAI is the company behind ChatGPT and is best known for its widely used GPT-4 model.

What makes GPT-4o different?

Before GPT-4o, users could use Voice Mode to talk to ChatGPT with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). To achieve this, Voice Mode is a pipeline of three separate models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the primary source of intelligence, GPT-4, loses a lot of information: it cannot directly observe tone, multiple speakers, or background noise, and it cannot output laughter, singing, or expressed emotion. A rough sketch of this kind of pipeline is shown below.
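The following is a minimal, illustrative sketch of such a transcribe-reason-synthesize pipeline using the OpenAI Python SDK. The specific model names (whisper-1, gpt-4, tts-1), voice, and file paths are assumptions chosen for illustration; the actual internals of ChatGPT's Voice Mode are not public.

```python
# Illustrative three-stage voice pipeline (an assumption, not the actual Voice Mode internals).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech-to-text: a simple model transcribes the audio to text.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Text reasoning: GPT-4 sees only the transcript, so tone, multiple speakers,
#    and background sounds are already lost at this point.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = chat.choices[0].message.content

# 3. Text-to-speech: a third model converts the text reply back to audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
speech.stream_to_file("reply.mp3")
```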

The developers trained GPT-4o as a single new model end-to-end across text, vision, and audio, meaning that the same neural network processes all inputs and outputs. Because GPT-4o is the company's first model combining all of these modalities, the team is still just scratching the surface of exploring what the model can do and what its limitations are. A minimal example of sending mixed text and image input to GPT-4o through the API is shown below.
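In the API, the released text and image capabilities of GPT-4o can already be exercised in a single chat request. The snippet below is a minimal sketch assuming the OpenAI Python SDK and a placeholder image URL; audio input and output are not yet exposed this way, as the safety section below notes.

```python
# Minimal sketch of a mixed text-and-image request to GPT-4o via the Chat Completions API.
# The image URL is a placeholder assumption for illustration.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/street-scene.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```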

Safety and limitations

GPT-4o has safety built in by design across modalities through techniques such as filtering training data and refining the model’s behaviour through post-training. The OpenAI team has also created new safety systems to provide guardrails on voice outputs.

The team evaluated GPT-4o according to their Preparedness Framework and in line with their voluntary commitments. Their evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. 

OpenAI recognizes that GPT-4o’s audio modalities present a variety of novel risks. For now, the company has publicly released only the text and image inputs and text outputs. Over the coming weeks and months, it will work on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities.

Sources of Article

Content and Images: OpenAI

Banner: Unsplash
