From detecting objects to understanding the world — the evolution of computer vision
🎯 Object Detection & Segmentation
Find objects, draw boxes, classify each region
Output
📦📦📦
Bounding boxes + class labels + confidence
YOLO
Real-time, single-stage detection
Faster R-CNN
Two-stage with region proposals
DETR
Transformer-based, end-to-end
SAM
Segment Anything — zero-shot
🚗 Self-Driving
📹 Surveillance
🏭 Manufacturing QC
🛒 Retail Analytics
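Detectors in this family all end the same way: the network proposes many overlapping boxes with confidence scores, and non-maximum suppression (NMS) keeps one box per object. A minimal sketch of that post-processing step, with made-up boxes and labels:

```python
# Toy sketch of detector post-processing: single-stage detectors like
# YOLO emit many overlapping (box, score, label) candidates; greedy NMS
# keeps the highest-scoring box and drops near-duplicates of it.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(detections, iou_thresh=0.5):
    """detections: list of (box, score, label); suppress per class."""
    kept = []
    for det in sorted(detections, key=lambda d: -d[1]):
        if all(d[2] != det[2] or iou(d[0], det[0]) < iou_thresh
               for d in kept):
            kept.append(det)
    return kept

dets = [((10, 10, 50, 50), 0.9, "cat"),
        ((12, 12, 52, 52), 0.6, "cat"),     # near-duplicate, suppressed
        ((100, 100, 140, 140), 0.8, "dog")]
print(nms(dets))  # one cat box, one dog box survive
```

Real pipelines run this on GPU (e.g. via a library NMS op), but the logic is the same: sort by confidence, keep a box only if it doesn't heavily overlap an already-kept box of the same class.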
🎬 Video Understanding
Temporal reasoning across frames — what's happening over time?
Output
🏃‍♂️
"Person running across field"
Key Insight: Videos add the temporal dimension — understanding actions, events, and
causality requires modeling how things change over time.
VideoMAE
Masked autoencoder for video
TimeSformer
Divided space-time attention
Video-LLaVA
Video + language understanding
Gemini
Native multimodal, long video
🎥 Video Captioning
🔍 Action Recognition
⚽ Sports Analytics
🎬 Content Moderation
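Before any temporal reasoning happens, video transformers like TimeSformer or VideoMAE reduce a long video to a fixed-length clip of T frames. A minimal sketch of the standard uniform-sampling step (frame indices only; a real pipeline would decode the actual frames):

```python
# Hedged sketch: video models consume a fixed-length clip, so a long
# video is first reduced to clip_len frame indices spread evenly over
# its duration. clip_len=8 is illustrative; models vary.

def sample_clip(num_frames, clip_len=8):
    """Pick clip_len frame indices covering the whole video evenly."""
    if num_frames <= clip_len:
        return list(range(num_frames))
    step = num_frames / clip_len
    # take the middle frame of each of clip_len equal segments
    return [int(step * i + step / 2) for i in range(clip_len)]

print(sample_clip(120))  # 8 indices spanning a 120-frame video
```

Everything temporal (e.g. divided space-time attention) then operates on this short, evenly spaced clip rather than on every frame.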
💬 Vision-Language Models (VLMs)
See and talk — multimodal understanding and conversation
🏔️
What mountain is this?
This appears to be Mount Fuji in Japan, identifiable by its symmetric
volcanic cone shape.
What's the best time to visit?
The official climbing season is July-September. For photography, spring
(cherry blossoms) or winter (snow cap) are stunning!
The Breakthrough: VLMs combine vision encoders with LLMs, enabling open-ended
visual question answering, image description, and visual reasoning.
GPT
OpenAI's multimodal model
Gemini
Google's native multimodal
Claude
Anthropic's vision model
🌍 World Models
Learn the physics of the world — predict what happens next
What They Do: World models learn to simulate environments — predicting how actions
affect the world. Essential for planning in robotics and autonomous systems.
See 👁️
→
Learn Physics 📐
→
Predict Future 🔮
→
Act Intelligently 🎯
Sora
OpenAI video world simulator
Genie
DeepMind interactive worlds
Cosmos
NVIDIA world foundation
UniSim
Universal simulator
🦾 Vision-Language-Action (VLA)
See, understand, and act — bridging perception to manipulation
Output
🦾
Robot arm trajectories & gripper commands
Camera 📷
→
Vision Encoder 🧠
→
LLM Reasoning 💬
→
Action Tokens 🎯
RT
Google's Robotics Transformer
OpenVLA
Open-source, LLaVA-based
π₀
Physical Intelligence VLA
🤖 Physical AI & Embodied Intelligence
The convergence: AI that interacts with the physical world
🦾
🔍 Perceiving environment...
🧠 Planning action sequence...
⚡ Executing manipulation...
The Vision: Physical AI combines perception, reasoning, and action to create
systems that can navigate, manipulate, and interact with the real world — from household robots to
surgical assistants.
CV
+
NLP
+
Robotics
=
Physical AI 🚀
Tesla Optimus
Humanoid robot
Figure
General-purpose humanoid
Boston Dynamics
Atlas, Spot robots
1X NEO
Home assistant robot
🏠 Home Robots
🏭 Manufacturing
🏥 Healthcare
🚀 Space Exploration
🌾 Agriculture
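The perceive → plan → execute cycle shown above is, at its core, a bounded control loop. A minimal sketch with stub stages (in a real system, sense() would read cameras, plan() would query a VLA or planner, and act() would drive motors; the toy world and action names here are invented):

```python
# Hedged sketch of the sense -> plan -> act loop behind Physical AI.
# All three stages are stand-ins; only the loop structure is the point.

def sense(world):
    """Stub perception: observe where the object is."""
    return {"object_at": world["object_at"]}

def plan(observation, goal):
    """Stub planner: approach until at the goal, then grasp."""
    return "move_right" if observation["object_at"] < goal else "grasp"

def act(world, action):
    """Stub actuation: apply the chosen action to the world."""
    if action == "move_right":
        world["object_at"] += 1
    return world

world, goal = {"object_at": 0}, 3
for _ in range(10):                  # bounded loop: never run forever
    action = plan(sense(world), goal)
    if action == "grasp":
        break
    world = act(world, action)
print(world["object_at"], action)  # reaches the goal, then grasps
```

Closing this loop at high frequency, with learned perception and planning in place of the stubs, is what separates embodied systems from models that only look at pixels.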