
Quick Guide to Multimodal AI: Images, Speech, and Video Capabilities in Large Language Models

Location: Online

Date & Time: Thursday 06 Nov 2025, 12:30 - 13:30

Delivered by Dominik Lukes

This session will provide a comprehensive overview of the rapidly advancing multimodal capabilities of generative AI, exploring how Large Language Models now process and generate content across text, images, speech, and video. It will examine what AI can and cannot do in each of these modalities and outline the key models, tools, and applications that have emerged in this fast-developing space. Multimodal capabilities represent one of the most significant sources of progress at the frontier of generative AI, with new models, products, and players introducing abilities that did not exist even a year ago.

Key points covered

  • What are the current capabilities and limitations of AI across different modalities (text, images, audio, video)?
  • Which key models and tools power multimodal AI applications and how do they differ from previous generations?
  • What new applications and use cases have been enabled by advances in multimodal AI?
  • How do you choose between different multimodal AI tools and platforms for specific tasks?
  • What are the most significant recent developments in speech recognition, synthesis, and voice cloning?
  • Which image generation and editing capabilities are available and how do you access them?
  • What video generation tools are emerging and what can they currently achieve?