From detecting objects to understanding the world — the evolution of computer vision
🎯 Object Detection & Segmentation
Find objects, draw boxes, classify each region
Output
📦📦📦
Bounding boxes + class labels + confidence
YOLO
Real-time, single-stage detection
Faster R-CNN
Two-stage with region proposals
DETR
Transformer-based, end-to-end
SAM
Segment Anything — zero-shot
🚗 Self-Driving
📹 Surveillance
🏭 Manufacturing QC
🛒 Retail Analytics
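Detectors in this family all end the same way: the network proposes many overlapping boxes with confidence scores, and non-maximum suppression (NMS) keeps one box per object. A minimal sketch of that post-processing step, with made-up boxes and labels:

```python
# Toy sketch of detector post-processing: single-stage detectors like
# YOLO emit many overlapping (box, score, label) candidates; greedy NMS
# keeps the highest-scoring box and drops near-duplicates of it.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(detections, iou_thresh=0.5):
    """detections: list of (box, score, label); suppress per class."""
    kept = []
    for det in sorted(detections, key=lambda d: -d[1]):
        if all(d[2] != det[2] or iou(d[0], det[0]) < iou_thresh
               for d in kept):
            kept.append(det)
    return kept

dets = [((10, 10, 50, 50), 0.9, "cat"),
        ((12, 12, 52, 52), 0.6, "cat"),     # near-duplicate, suppressed
        ((100, 100, 140, 140), 0.8, "dog")]
print(nms(dets))  # one cat box, one dog box survive
```

Real pipelines run this on GPU (e.g. via a library NMS op), but the logic is the same: sort by confidence, keep a box only if it doesn't heavily overlap an already-kept box of the same class.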
🎬 Video Understanding
Temporal reasoning across frames — what's happening over time?
Output
🏃‍♂️
"Person running across field"
Key Insight: Videos add the temporal dimension — understanding actions, events, and
causality requires modeling how things change over time.
VideoMAE
Masked autoencoder for video
TimeSformer
Divided space-time attention
Video-LLaVA
Video + language understanding
Gemini
Native multimodal, long video
🎥 Video Captioning
🔍 Action Recognition
⚽ Sports Analytics
🎬 Content Moderation
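Before any temporal reasoning happens, video transformers like TimeSformer or VideoMAE reduce a long video to a fixed-length clip of T frames. A minimal sketch of the standard uniform-sampling step (frame indices only; a real pipeline would decode the actual frames):

```python
# Hedged sketch: video models consume a fixed-length clip, so a long
# video is first reduced to clip_len frame indices spread evenly over
# its duration. clip_len=8 is illustrative; models vary.

def sample_clip(num_frames, clip_len=8):
    """Pick clip_len frame indices covering the whole video evenly."""
    if num_frames <= clip_len:
        return list(range(num_frames))
    step = num_frames / clip_len
    # take the middle frame of each of clip_len equal segments
    return [int(step * i + step / 2) for i in range(clip_len)]

print(sample_clip(120))  # 8 indices spanning a 120-frame video
```

Everything temporal (e.g. divided space-time attention) then operates on this short, evenly spaced clip rather than on every frame.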
💬 Vision-Language Models (VLMs)
See and talk — multimodal understanding and conversation
🏔️
What mountain is this?
This appears to be Mount Fuji in Japan, identifiable by its symmetric
volcanic cone shape.
What's the best time to visit?
The official climbing season is July-September. For photography, spring
(cherry blossoms) or winter (snow cap) are stunning!
The Breakthrough: VLMs combine vision encoders with LLMs, enabling open-ended
visual question answering, image description, and visual reasoning.
GPT
OpenAI's multimodal model
Gemini
Google's native multimodal
Claude
Anthropic's vision model
🌍 World Models
Learn the physics of the world — predict what happens next
What They Do: World models learn to simulate environments — predicting how actions
affect the world. Essential for planning in robotics and autonomous systems.
See 👁️
→
Learn Physics 📐
→
Predict Future 🔮
→
Act Intelligently 🎯
Sora
OpenAI video world simulator
Genie
DeepMind interactive worlds
Cosmos
NVIDIA world foundation
UniSim
Universal simulator
🦾 Vision-Language-Action (VLA)
See, understand, and act — bridging perception to manipulation
Output
🦾
Robot arm trajectories & gripper commands
Camera 📷
→
Vision Encoder 🧠
→
LLM Reasoning 💬
→
Action Tokens 🎯
RT
Google's Robotics Transformer
OpenVLA
Open-source, LLaVA-based
π₀
Physical Intelligence VLA
🤖 Physical AI & Embodied Intelligence
The convergence: AI that interacts with the physical world
🦾
🔍 Perceiving environment...
🧠 Planning action sequence...
⚡ Executing manipulation...
The Vision: Physical AI combines perception, reasoning, and action to create
systems that can navigate, manipulate, and interact with the real world — from household robots to
surgical assistants.
CV
+
NLP
+
Robotics
=
Physical AI 🚀
Tesla Optimus
Humanoid robot
Figure
General-purpose humanoid
Boston Dynamics
Atlas, Spot robots
1X NEO
Home assistant robot
🏠 Home Robots
🏭 Manufacturing
🏥 Healthcare
🚀 Space Exploration
🌾 Agriculture
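The perceive → plan → execute cycle shown above is, at its core, a bounded control loop. A minimal sketch with stub stages (in a real system, sense() would read cameras, plan() would query a VLA or planner, and act() would drive motors; the toy world and action names here are invented):

```python
# Hedged sketch of the sense -> plan -> act loop behind Physical AI.
# All three stages are stand-ins; only the loop structure is the point.

def sense(world):
    """Stub perception: observe where the object is."""
    return {"object_at": world["object_at"]}

def plan(observation, goal):
    """Stub planner: approach until at the goal, then grasp."""
    return "move_right" if observation["object_at"] < goal else "grasp"

def act(world, action):
    """Stub actuation: apply the chosen action to the world."""
    if action == "move_right":
        world["object_at"] += 1
    return world

world, goal = {"object_at": 0}, 3
for _ in range(10):                  # bounded loop: never run forever
    action = plan(sense(world), goal)
    if action == "grasp":
        break
    world = act(world, action)
print(world["object_at"], action)  # reaches the goal, then grasps
```

Closing this loop at high frequency, with learned perception and planning in place of the stubs, is what separates embodied systems from models that only look at pixels.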