At I/O 2026, its annual developer conference, Google unveiled Gemini Omni, its most ambitious multimodal AI model to date. The new model is designed to generate content from any type of input, with video as its first output format, marking a significant leap in generative AI by combining the power of reasoning with the ability to create.
What is Gemini Omni?
Gemini Omni is a model that allows users to combine images, audio, video and text as input and generate high-quality videos grounded in Gemini's real-world knowledge. It also enables users to edit videos through conversational language. In other words, a user can simply describe what they want changed in a video and the model executes it, with each instruction building on the last.
Characters stay consistent, physics hold up and the scene remembers what came before, addressing a longstanding limitation in AI video generation where temporal coherence often breaks down across edits.
Nano Banana, the AI image generation tool introduced last year, set the stage for Omni. It brought Gemini's intelligence to image generation and editing, but it was limited to images. Gemini Omni extends that approach to video and represents a generational leap in performance.
Omni has an improved intuitive understanding of forces like gravity, kinetic energy and fluid dynamics, allowing users to create more realistic scenes. It can also draw on Gemini's broader knowledge to connect language, imagery and meaning.
Unlike standalone video models that focus on generation alone, Omni is distinguished by its conversational editing loop: users can refine a video across multiple turns, changing the environment, camera angle or visual style without losing continuity from the original scene.
The first model in the Omni family, Gemini Omni Flash, is rolling out to all Google AI Plus, Pro and Ultra subscribers globally through the Gemini app and Google Flow. It is also rolling out at no cost to users on YouTube Shorts and the YouTube Create app starting this week.
In the coming weeks, Google plans to roll out access to developers and enterprise customers via APIs. The company has also indicated that future versions of Omni will expand beyond video to support additional output modalities, including image and audio generation. For audio input, voice references are supported at launch, with other audio input types to follow soon.
With Gemini Omni, Google is positioning itself squarely at the frontier of multimodal AI, presenting the model not just as a tool for creation but as an intelligent collaborator that understands context, physics and narrative.