🔮 Beyond Classification: The CV Landscape

From detecting objects to understanding the world — the evolution of computer vision

🎯 Detection → 🎬 Video → 💬 VLM → 🌍 World Model → 🦾 VLA → 🤖 Physical AI

🎯 Object Detection & Segmentation

Find objects, draw boxes, classify each region

Input 🖼️ (single image with multiple objects) → YOLO / DETR → Output 📦 (bounding boxes + class labels + confidence scores)
- YOLO: real-time, single-stage detection
- Faster R-CNN: two-stage detection with region proposals
- DETR: transformer-based, end-to-end detection
- SAM: Segment Anything, zero-shot segmentation
🚗 Self-Driving 📹 Surveillance 🏭 Manufacturing QC 🛒 Retail Analytics
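
To see how little code modern detectors need, here is a minimal inference sketch using the open-source ultralytics package; "street.jpg" is a placeholder path, and the yolov8n weights download automatically on first use:

```python
# Minimal YOLO inference sketch (pip install ultralytics).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # small, real-time, single-stage detector
results = model("street.jpg")     # accepts a path, URL, PIL image, or numpy array

for box in results[0].boxes:      # one Results object per input image
    cls_name = model.names[int(box.cls)]     # predicted class label
    conf = float(box.conf)                   # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # box corners in pixel coordinates
    print(f"{cls_name} {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```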

🎬 Video Understanding

Temporal reasoning across frames — what's happening over time?

Input 🏃🏃🏃🏃🏃 (video frames over time) → VideoMAE → Output 🏃‍♂️ "Person running across field"
Key Insight: Videos add the temporal dimension — understanding actions, events, and causality requires modeling how things change over time.
- VideoMAE: masked autoencoder pretraining for video
- TimeSformer: divided space-time attention
- Video-LLaVA: joint video + language understanding
- Gemini: natively multimodal, long-video context
🎥 Video Captioning 🔍 Action Recognition ⚽ Sports Analytics 🎬 Content Moderation
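
As a sketch of temporal modeling in practice, the snippet below runs a VideoMAE checkpoint fine-tuned on Kinetics-400 action recognition through Hugging Face transformers; the random frames stand in for a real 16-frame clip:

```python
# Action recognition sketch with VideoMAE (pip install transformers torch).
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# 16 frames sampled from a clip; random pixels stand in for real video here.
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
          for _ in range(16)]

inputs = processor(frames, return_tensors="pt")  # -> pixel_values tensor
with torch.no_grad():
    logits = model(**inputs).logits              # (1, 400) Kinetics classes
print(model.config.id2label[logits.argmax(-1).item()])  # predicted action
```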

💬 Vision-Language Models (VLMs)

See and talk — multimodal understanding and conversation

User: 🏔️ What mountain is this?
Assistant: This appears to be Mount Fuji in Japan, identifiable by its symmetric volcanic cone shape.
User: What's the best time to visit?
Assistant: The official climbing season is July-September. For photography, spring (cherry blossoms) or winter (snow cap) are stunning!
The Breakthrough: VLMs combine vision encoders with LLMs, enabling open-ended visual question answering, image description, and visual reasoning.
- GPT: OpenAI's multimodal model
- Gemini: Google's natively multimodal model
- Claude: Anthropic's vision-capable model
- LLaVA: open-source VLM
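
A minimal open-ended VQA sketch using LLaVA through Hugging Face transformers; the model id follows the llava-hf hub naming, "mountain.jpg" is a placeholder path, and fp16 on a GPU is assumed to keep the 7B model in memory:

```python
# Visual question answering sketch with LLaVA (pip install transformers torch pillow).
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import torch

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("mountain.jpg")                     # placeholder image
prompt = "USER: <image>\nWhat mountain is this? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)  # free-form text answer
print(processor.decode(output[0], skip_special_tokens=True))
```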

🌍 World Models

Learn the physics of the world — predict what happens next

📸 Current State → 🧠 World Model → 🔮 Future Prediction
What They Do: World models learn to simulate environments — predicting how actions affect the world. Essential for planning in robotics and autonomous systems.
See 👁️ → Learn Physics 📐 → Predict Future 🔮 → Act Intelligently 🎯
- Sora: OpenAI's video world simulator
- Genie: DeepMind's interactive world generator
- Cosmos: NVIDIA's world foundation models
- UniSim: universal simulator
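
There is no single API for world models, so the sketch below is a toy latent-dynamics network in PyTorch: encode the current observation, roll the latent state forward under a sequence of actions, and decode predicted future observations. All layer sizes are illustrative, not from any published system.

```python
# Toy latent world model: encode -> step dynamics per action -> decode.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)        # see
        self.dynamics = nn.Sequential(                       # learn physics
            nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Linear(latent_dim, obs_dim)        # predict future

    def rollout(self, obs, actions):
        z = self.encoder(obs)
        preds = []
        for a in actions:                 # imagine forward, one step per action
            z = self.dynamics(torch.cat([z, a], dim=-1))
            preds.append(self.decoder(z))
        return torch.stack(preds)         # predicted future observations

model = TinyWorldModel()
obs = torch.randn(1, 64)                  # current state
actions = [torch.randn(1, 4) for _ in range(5)]
future = model.rollout(obs, actions)      # (5, 1, 64): five imagined steps
```

Planning then amounts to searching over action sequences and scoring the imagined futures, without touching the real environment.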

🦾 Vision-Language-Action (VLA)

See, understand, and act — bridging perception to manipulation

Input 👁️ + 🗣️ ("Pick up the red cup and place it on the table") → RT-2 / OpenVLA → Output 🦾 (robot arm trajectories & gripper commands)
Camera 📷 → Vision Encoder 🧠 → LLM Reasoning 💬 → Action Tokens 🎯
- RT-2: Google's Robotics Transformer
- PaLM-E: 562B-parameter embodied LLM
- OpenVLA: open-source, LLaVA-based VLA
- π₀: Physical Intelligence's VLA
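
The "action tokens" step is worth making concrete. The sketch below shows the discretization idea RT-2 popularized: continuous robot actions are binned into a small integer vocabulary so the language model can emit them as ordinary tokens. The bin count and action ranges here are illustrative, not the exact values of any released model.

```python
# Sketch: turn continuous robot actions into discrete "action tokens".
import numpy as np

N_BINS = 256  # illustrative vocabulary size per action dimension

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map each continuous action dimension to an integer token in [0, 255]."""
    clipped = np.clip(action, low, high)
    return np.round((clipped - low) / (high - low) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert the binning back to (approximate) continuous values."""
    return low + tokens / (N_BINS - 1) * (high - low)

# Hypothetical 7-DoF command: xyz delta, 3 rotation deltas, gripper open/close.
action = np.array([0.12, -0.30, 0.05, 0.0, 0.1, -0.2, 1.0])
tokens = action_to_tokens(action)       # e.g. [143, 89, 134, 128, 140, 102, 255]
recovered = tokens_to_action(tokens)    # close to the original, within bin width
```

At inference time the model simply generates these integers as its next tokens, and the robot de-tokenizes them back into motor commands.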

🤖 Physical AI & Embodied Intelligence

The convergence: AI that interacts with the physical world

🦾 🔍 Perceiving environment → 🧠 Planning action sequence → ⚡ Executing manipulation
The Vision: Physical AI combines perception, reasoning, and action to create systems that can navigate, manipulate, and interact with the real world — from household robots to surgical assistants.
CV + NLP + Robotics = Physical AI 🚀
- Tesla Optimus: humanoid robot
- Figure: general-purpose humanoid
- Boston Dynamics: Atlas and Spot robots
- 1X NEO: home assistant robot
🏠 Home Robots 🏭 Manufacturing 🏥 Healthcare 🚀 Space Exploration 🌾 Agriculture
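
All of these systems share the same perceive-plan-act loop. The skeleton below sketches it; Camera, Planner, and Arm are hypothetical placeholders for a real sensor stack, a VLA policy, and a robot SDK:

```python
# Skeleton of the perceive-plan-act loop behind embodied agents.
# Camera, planner, and arm are hypothetical stand-ins, not a real SDK.
import time

class EmbodiedAgent:
    def __init__(self, camera, planner, arm):
        self.camera = camera      # perception source (e.g. RGB-D sensor)
        self.planner = planner    # e.g. a VLA producing action sequences
        self.arm = arm            # actuator interface

    def run(self, instruction, hz=10):
        while True:
            obs = self.camera.read()                    # 🔍 perceive environment
            plan = self.planner.plan(instruction, obs)  # 🧠 plan action sequence
            if plan.done:                               # task complete
                break
            self.arm.execute(plan.next_action)          # ⚡ execute manipulation
            time.sleep(1 / hz)                          # fixed control rate
```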

📊 The Evolution of Computer Vision

Task           | Input           | Output               | Key Advance
🎯 Detection    | Image           | Boxes + labels       | Localize multiple objects
🎬 Video        | Frames          | Actions/events       | Temporal reasoning
💬 VLM          | Image + text    | Text response        | Open-ended understanding
🌍 World Model  | State           | Future prediction    | Physics simulation
🦾 VLA          | Image + command | Robot actions        | Embodied AI
🤖 Physical AI  | Real world      | Physical interaction | Full autonomy