A rigorous explanation that anyone can follow
A directed graph (or digraph) has two things: a set of vertices and a set of edges, where each edge points one way. For example, a four-vertex graph with edges:
A → B, A → C, B → D, C → D
Two important terms we'll need: a path (a sequence of edges followed tip-to-tail) and a cycle (a path that returns to its starting vertex).
Imagine you're exploring a dark cave with a flashlight and a ball of string. Your strategy: walk down a passage as far as it goes; when you hit a dead end, follow the string back to the last junction and try the next unexplored passage.
That's DFS: go as deep as possible first, then backtrack.
DFS has two parts: a main loop and a recursive visit function.
Let's break down what happens when we call DFS-Visit(u): we stamp u's discovery time and paint it GRAY, recursively visit each undiscovered neighbor, then stamp u's finish time and paint it BLACK.
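The two parts — the main loop and DFS-Visit — can be sketched in Python. This is a minimal version (graph as an adjacency-list dict; the main loop visits vertices in insertion order, which is an assumption, not part of DFS itself):

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

def dfs(graph):
    """Run DFS over every vertex; return discovery and finish times."""
    color = {v: WHITE for v in graph}
    d, f = {}, {}
    time = 0  # global clock, ticked once per discovery and once per finish

    def visit(u):
        nonlocal time
        time += 1
        d[u] = time          # discover u: paint it GRAY ("open the door")
        color[u] = GRAY
        for v in graph[u]:   # look at each outgoing edge u -> v
            if color[v] == WHITE:
                visit(v)     # go as deep as possible before trying siblings
        time += 1
        f[u] = time          # finish u: paint it BLACK ("lock the door")
        color[u] = BLACK

    for u in graph:          # main loop: restart at any still-WHITE vertex
        if color[u] == WHITE:
            visit(u)
    return d, f

# The example graph: A -> B, A -> C, B -> D, C -> D
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
d, f = dfs(g)
```

Note how the recursion itself is the "ball of string": returning from `visit(v)` is the backtracking step.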
Every vertex is in exactly one of three states at any moment during DFS: WHITE (not yet discovered), GRAY (discovered but not yet finished — we're still exploring it), or BLACK (finished — everything reachable from it has been explored). Think of coloring rooms in a house as you explore: a room is WHITE before you've entered it, GRAY while you're inside it (or somewhere deeper), and BLACK once you've left it for good.
At any moment during DFS, the GRAY vertices form a path from the starting vertex down to the vertex currently being explored. This is exactly the recursion stack — the trail of rooms you've entered but haven't finished yet.
If you ever encounter a GRAY vertex while exploring, you've found a cycle (a back edge), because you've walked in a circle back to a room you're still inside.
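That observation gives a cycle detector almost for free. A sketch (the helper name `has_cycle` is mine, not from the text; same three-color scheme, stripped down to what the test needs):

```python
WHITE, GRAY, BLACK = 0, 1, 2

def has_cycle(graph):
    """Return True iff the directed graph contains a cycle."""
    color = {v: WHITE for v in graph}

    def visit(u):
        color[u] = GRAY              # u is on the current recursion path
        for v in graph[u]:
            if color[v] == GRAY:     # back edge: a room we're still inside
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK             # everything below u is cycle-free
        return False

    return any(color[u] == WHITE and visit(u) for u in graph)

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}  # no cycle
cyc = {"1": ["2"], "2": ["3"], "3": ["1"]}                # 1 -> 2 -> 3 -> 1
```

BLACK neighbors are deliberately skipped: an edge into a finished vertex can't close a cycle through the current path.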
We keep a global clock (a counter that starts at 0 and goes up by 1 each tick). We tick the clock at two moments for each vertex:
- d[u], the discovery time: the clock value when we first visit u (paint it GRAY). This is when we "open the door" to room u.
- f[u], the finish time: the clock value when we finish u (paint it BLACK). This is when we've explored everything reachable from u and "lock the door" behind us.
Since the clock ticks once for each discovery and once for each finish, and there are n vertices, the timestamps run from 1 up to 2n.
For any two vertices u and v, their discovery/finish intervals [d[u], f[u]] and [d[v], f[v]] are either completely disjoint, or one is nested entirely inside the other.
They never partially overlap. Think of it like matching parentheses: ( ( ) ) is OK, ( ) ( ) is OK, but a pattern like ( [ ) ] — one interval opening inside another yet closing outside it — is impossible.
Why? If we discover u first and then discover v before finishing u, that means we called DFS-Visit(v) from within the recursive call chain starting at u. So v must finish before u finishes. The interval for v is nested inside u's interval.
If there is an edge u → v in the graph, what can we say about finish times?
When we're exploring u (u is GRAY) and we look at the edge u → v, there are three cases: v is WHITE (we recurse into v, so v finishes before u does — f[v] < f[u]); v is GRAY (a back edge — the graph has a cycle); or v is BLACK (v already finished while u is still open, so again f[v] < f[u]).
Bottom line: In a DAG (no cycles), for every edge u → v, we always have f[v] < f[u].
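We can check this claim empirically on the earlier DAG. A bare-bones DFS that records only finish times (vertex order here is an assumption; the property holds regardless):

```python
def finish_times(graph):
    """Return each vertex's finish time from a DFS over the whole graph."""
    f, seen, time = {}, set(), 0

    def visit(u):
        nonlocal time
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                visit(v)
        time += 1
        f[u] = time  # u finishes only after all its descendants have

    for u in graph:
        if u not in seen:
            visit(u)
    return f

dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
f = finish_times(dag)
# For every edge u -> v in a DAG, v must finish strictly before u.
ok = all(f[v] < f[u] for u in dag for v in dag[u])
```

This is exactly why sorting a DAG's vertices by decreasing finish time yields a valid topological order.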
Let's trace DFS on this graph step by step, assuming the main loop and each adjacency list are scanned in alphabetical order. Watch how the timestamps nest!
Graph: A → B, A → C, B → D, C → D, D → E
| Vertex | A | B | C | D | E |
|---|---|---|---|---|---|
| d[v] | 1 | 2 | 8 | 3 | 4 |
| f[v] | 10 | 7 | 9 | 6 | 5 |
Imagine neighborhoods in a city with one-way streets. A strongly connected component is a neighborhood where, starting from any house, you can drive to any other house in that same neighborhood and drive back, following the one-way streets.
A Strongly Connected Component (SCC) of a directed graph G is a maximal set of vertices C ⊆ V such that for every pair of vertices u, v ∈ C, there is a directed path from u to v and a directed path from v to u.
Maximal means you can't add any more vertices and still have the property hold.
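The definition can be tested directly: u and v belong to the same SCC iff each can reach the other. A reachability-based sketch (quadratic in the worst case — fine for tiny graphs; Kosaraju's algorithm, covered later, does the whole job in linear time; helper names are mine):

```python
from collections import deque

def reachable(graph, s):
    """Set of vertices reachable from s, via BFS."""
    seen = {s}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

def same_scc(graph, u, v):
    """True iff u and v can each reach the other."""
    return v in reachable(graph, u) and u in reachable(graph, v)

# The example graph: 1->2, 2->3, 3->1, 2->4, 4->5, 5->4
g = {1: [2], 2: [3, 4], 3: [1], 4: [5], 5: [4]}
```

Here 2 can reach 4 (via 2→4) but 4 cannot get back, so they land in different SCCs.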
Let's see an example:
Edges: 1→2, 2→3, 3→1, 2→4, 4→5, 5→4. The SCCs are {1, 2, 3} and {4, 5}; the edge 2→4 is the only cross-SCC edge.
If you shrink each SCC into a single "super-vertex", the resulting graph (called the Component Graph) is always a DAG — it has no cycles.
Why? If there were a cycle among super-vertices, say SCC_A → SCC_B → ... → SCC_A, then every vertex in all those SCCs could reach every other vertex, so they should all be in the same SCC. That contradicts them being in separate SCCs.
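As a quick sanity check, we can build the component graph for the example above. The SCC labels are hardcoded from the example (two components, which I'll call "A" and "B"); only cross-SCC edges survive the shrinking:

```python
def condensation(edges, scc_of):
    """Shrink each SCC to one super-vertex; keep distinct cross-SCC edges."""
    super_edges = set()
    for u, v in edges:
        if scc_of[u] != scc_of[v]:           # edges inside an SCC vanish
            super_edges.add((scc_of[u], scc_of[v]))
    return super_edges

edges = [(1, 2), (2, 3), (3, 1), (2, 4), (4, 5), (5, 4)]
scc_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}  # labels from the example
comp = condensation(edges, scc_of)
```

The component graph here is a single edge A → B — and, as the argument above guarantees, no cycle.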
Kosaraju's algorithm finds all SCCs in a directed graph in O(V + E) time. Here are the three steps:
1. Run DFS on G, recording each vertex's finish time.
2. Compute the transpose GT by reversing every edge.
3. Run DFS on GT, starting new trees in decreasing order of the Step 1 finish times. Each DFS tree is exactly one SCC.
In city terms:
Step 1: Explore the whole city, noting what time you finish exploring each neighborhood. The neighborhoods you finish last are the ones you started from or that are "upstream" — they have roads leading out to other places.
Step 2: Magically reverse every one-way street in the city.
Step 3: Now explore again, but start from the neighborhood that finished last in Step 1. With the roads reversed, you can only reach vertices in the same SCC — the reversed roads that used to go OUT now go IN, so you can't "escape" to other SCCs. Each connected chunk you find is exactly one SCC.
GT has the same vertices as G, but every edge is reversed.
If G has edge u → v, then GT has edge v → u.
Crucial property: G and GT have exactly the same SCCs. If you can go from u to v and back in G, you can also go from u to v and back in GT (just using the reversed path).
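Computing the transpose is one pass over the edges (a sketch, using the same adjacency-list dicts as before):

```python
def transpose(graph):
    """Reverse every edge: u -> v in G becomes v -> u in GT."""
    gt = {u: [] for u in graph}   # keep the same vertex set
    for u in graph:
        for v in graph[u]:
            gt[v].append(u)       # flip the edge's direction
    return gt

g = {"A": ["B"], "B": ["C"], "C": ["A"]}  # a 3-cycle
gt = transpose(g)
```

A 3-cycle stays a 3-cycle after transposing — the same SCC, traversed the other way around.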
This is the most important chapter. We'll build the proof step by step, like stacking bricks.
We need to prove that in Step 3, each DFS tree in the forest is exactly one SCC. This means two things: every vertex of that SCC ends up in the tree (we miss no one), and no vertex from a different SCC sneaks in (we grab no extras).
For an SCC C, define f(C) = max { f[u] : u ∈ C }. That is, the finish time of the SCC is the largest finish time among all its vertices.
Claim: If there is an edge from a vertex in SCC C to a vertex in SCC C' (where C ≠ C'), then f(C) > f(C').
Why? Consider the DFS in Step 1. There are two sub-cases depending on which SCC gets discovered first:
Sub-case A: Some vertex in C is discovered before any vertex in C'. Then from C, DFS will follow the cross-SCC edge into C' and explore all of C' before coming back to finish C. So every vertex in C' finishes before the last vertex in C finishes. Therefore f(C) > f(C').
Sub-case B: Some vertex in C' is discovered first. But there's no edge from C' to C (if there were, combined with the edge from C to C', they'd be in the same SCC!). So DFS explores all of C' and finishes it entirely. Only later does DFS discover C. So again, f(C) > f(C').
In short: in the component graph (the DAG of SCCs), "upstream" SCCs have larger finish times.
In Step 3, we process vertices in decreasing order of their Step 1 finish times. By Lemma 1, this means we encounter the SCCs from "upstream" to "downstream" in G's component DAG.
We start Step 3's DFS from a vertex in the SCC with the highest finish time. By Lemma 1, this is an SCC that has no incoming edges from other SCCs in the component graph — it's a source in the component DAG.
Remember, Step 3 runs on the transpose graph GT. When we reverse all edges, every cross-SCC edge flips direction, so a source SCC in G's component DAG becomes a sink SCC in GT's component DAG. Precisely:
When we start a DFS from a vertex u in this "sink SCC" of GT, the DFS can reach all other vertices in u's SCC (since SCCs are the same in G and GT), but it cannot escape to any other SCC (because there are no outgoing cross-SCC edges from this SCC in GT).
Therefore, the DFS tree rooted at u contains exactly the vertices of u's SCC. Nothing more, nothing less.
Once we've identified the first SCC and colored all its vertices BLACK, the next unvisited vertex we try (the one with the next-highest finish time) belongs to the SCC with the next-highest f(C). By the same argument, that SCC is a sink among the SCCs still remaining in GT, so the DFS from it cannot escape and captures exactly its vertices.
This keeps repeating until all vertices are processed. Each DFS tree is exactly one SCC. ∎
"Why can't we just run DFS on G (without transposing) and call each tree an SCC?"
Because DFS on the original graph can escape from one SCC to another! If SCC C has an edge to SCC C', DFS starting in C will wander into C'. The transpose prevents this escape by flipping the cross-SCC edges.
"Why not just run DFS on GT in any order?"
Because if you start from a vertex in a non-sink SCC of GT, DFS could escape to other SCCs (following the transposed cross-SCC edges that now point outward). The decreasing-finish-time order ensures you always start from a sink of the remaining transposed component DAG.
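Putting the three steps together gives a compact implementation. This is a sketch in Python (recursive, so it assumes the graph fits within the recursion limit; the "stack of finish times" is just a list appended to at finish time):

```python
def kosaraju(graph):
    """Return the SCCs of a directed graph, each as a list of vertices."""
    seen = set()

    def visit(u, g, out):
        seen.add(u)
        for v in g[u]:
            if v not in seen:
                visit(v, g, out)
        out.append(u)  # appended exactly at u's finish time

    # Step 1: DFS on G, collecting vertices in increasing finish order.
    order = []
    for u in graph:
        if u not in seen:
            visit(u, graph, order)

    # Step 2: reverse every edge.
    gt = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            gt[v].append(u)

    # Step 3: DFS on GT in decreasing finish order; each tree is one SCC.
    seen.clear()
    sccs = []
    for u in reversed(order):
        if u not in seen:
            component = []
            visit(u, gt, component)
            sccs.append(component)
    return sccs

# The trace graph below: 1->2, 2->3, 3->1, 3->4, 4->5, 5->4
g = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
sccs = kosaraju(g)
```

Both phases reuse the same `visit` helper — only the graph and the output list change, which mirrors how little machinery the algorithm actually needs.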
Let's trace Kosaraju's algorithm on the graph 1→2, 2→3, 3→1, 3→4, 4→5, 5→4, assuming the main loop and each adjacency list are scanned in numeric order.
| Vertex | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| d[v] (Step 1) | 1 | 2 | 3 | 4 | 5 |
| f[v] (Step 1) | 10 | 9 | 8 | 7 | 6 |
| SCC | {1,2,3} | {1,2,3} | {1,2,3} | {4,5} | {4,5} |
In Step 3, the highest finisher is vertex 1; DFS from 1 on GT reaches exactly {1, 2, 3}. The next unvisited vertex is 4 (f = 7); DFS from 4 reaches exactly {4, 5}.
In Step 1, SCCs that can reach other SCCs get higher finish times (Lemma 1). In Step 3, we process high-finish-time vertices first on the reversed graph. On the reversed graph, the SCC with the highest finish time is a sink — it has no outgoing cross-SCC edges. So DFS from there can only reach vertices in that same SCC. After removing them, the next SCC becomes a sink, and the process repeats. Each DFS tree captures exactly one SCC.
The algorithm is just two DFS calls and one edge reversal. The magic lies in the ordering: Step 1's finish times tell us which SCC to peel off first, and the transpose ensures DFS can't wander across SCC boundaries. Together, they give us all SCCs in linear time.