🧭 DFS, Finish Times & Kosaraju's Algorithm

A rigorous explanation that anyone can follow

🗺️ Chapter 1: What Is a Directed Graph?

🌍 Real-Life Analogy
Imagine a city where some streets are one-way. You can drive from A to B, but that doesn't mean you can drive from B back to A. A directed graph is like a map of one-way streets.

A directed graph (or digraph) has two things: a set of vertices and a set of directed edges, each an arrow pointing from one vertex to another.

Example: vertices A, B, C, D with edges A → B, A → C, B → D, C → D
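In code, a digraph is commonly stored as an adjacency list: each vertex maps to the list of vertices its outgoing edges point to. A minimal sketch of the example above (the dict shape here is just one common convention):

```python
# The example digraph as an adjacency-list dict:
# each vertex maps to the targets of its one-way edges.
graph = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D'],
    'D': [],        # D has no outgoing edges
}
```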

Two important terms we'll need:

  • A path is a sequence of vertices where each consecutive pair is joined by an edge, followed in the arrow's direction.
  • A DAG (directed acyclic graph) is a digraph with no cycles: you can never follow arrows and return to where you started.

🔦 Chapter 2: DFS — Exploring a Maze

🌍 Real-Life Analogy

Imagine you're exploring a dark cave with a flashlight and a ball of string. Your strategy:

  1. Walk into a tunnel you haven't explored yet. Unroll string behind you.
  2. Keep going deeper into new tunnels whenever you can.
  3. When you hit a dead end (or only see tunnels you've already visited), backtrack along your string to the last fork, and try another tunnel.
  4. Repeat until you've explored everything reachable.

That's DFS: go as deep as possible first, then backtrack.

The DFS Procedure (Pseudocode)

DFS has two parts: a main loop and a recursive visit function.

// Global clock — starts at 0
time = 0
DFS(Graph G):
  for each vertex u in G:
    color[u] = WHITE
  for each vertex u in G:
    if color[u] == WHITE:
      DFS-Visit(u)
DFS-Visit(u):
  time = time + 1
  d[u] = time // discovery time
  color[u] = GRAY // "I'm working on u"
  for each neighbor v of u:
    if color[v] == WHITE:
      DFS-Visit(v) // go deeper!
  time = time + 1
  f[u] = time // finish time
  color[u] = BLACK // "I'm completely done with u"

Let's break down what happens when we call DFS-Visit(u):

  1. Tick the clock, record the discovery time d[u].
  2. Paint u GRAY — this means "I've started exploring u, but I'm not done yet."
  3. Look at each neighbor v of u. If v is WHITE (unexplored), dive into it by calling DFS-Visit(v). This is the recursion — we go deeper before finishing u.
  4. Once ALL of u's neighbors have been handled (either they were already visited or we recursed and came back), tick the clock again and record the finish time f[u].
  5. Paint u BLACK — this means "I'm completely finished with u and all its descendants."
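The pseudocode above translates almost line for line into runnable Python. A minimal sketch, assuming the graph is given as an adjacency-list dict (vertex to list of out-neighbors):

```python
# Minimal DFS with discovery/finish times, mirroring the pseudocode.
WHITE, GRAY, BLACK = 0, 1, 2

def dfs(graph):
    color = {u: WHITE for u in graph}
    d, f = {}, {}          # discovery and finish times
    time = 0

    def visit(u):
        nonlocal time
        time += 1
        d[u] = time        # discovery time: "open the door"
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == WHITE:
                visit(v)   # go deeper
        time += 1
        f[u] = time        # finish time: "lock the door"
        color[u] = BLACK

    for u in graph:
        if color[u] == WHITE:
            visit(u)
    return d, f
```

On the graph A → B, A → C, B → D, C → D, D → E with neighbors listed alphabetically, this returns d[A] = 1 and f[A] = 10.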

🎨 Chapter 3: The Three Colors

Every vertex is in exactly one of three states at any moment during DFS:

WHITE — "Haven't visited yet"
GRAY — "Currently exploring (on the stack)"
BLACK — "Completely finished"
🌍 Real-Life Analogy

Think of coloring rooms in a house as you explore:

  • WHITE room: You've never entered it. The door is closed.
  • GRAY room: You walked in, you're currently in there (or you left to explore a room deeper inside, but you haven't come back to close the door yet). Your string is still trailing through it.
  • BLACK room: You've explored everything accessible from that room and walked back out. Door is closed and locked. Done.
🔑 Key Fact: The Gray Path

At any moment during DFS, the GRAY vertices form a path from the starting vertex down to the vertex currently being explored. This is exactly the recursion stack — the trail of rooms you've entered but haven't finished yet.

If you ever encounter a GRAY vertex while exploring, you've found a cycle (a back edge), because you've walked in a circle back to a room you're still inside.
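This GRAY-vertex test turns directly into a cycle detector. A minimal sketch, assuming the graph is an adjacency-list dict:

```python
# Detect a cycle in a directed graph by watching for GRAY neighbors.
WHITE, GRAY, BLACK = 0, 1, 2

def has_cycle(graph):
    color = {u: WHITE for u in graph}

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:            # back edge: a cycle!
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in graph)
```

For example, has_cycle on A → B → C → A reports True, while a DAG reports False.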

⏱️ Chapter 4: Discovery Time & Finish Time

We keep a global clock (a counter that starts at 0 and goes up by 1 each tick). We tick the clock at two moments for each vertex:

📖 Discovery Time d[u]

The clock value when we first visit u (paint it GRAY). This is when we "open the door" to room u.

📖 Finish Time f[u]

The clock value when we finish u (paint it BLACK). This is when we've explored everything reachable from u and "lock the door" behind us.

Since the clock ticks once for each discovery and once for each finish, and there are n vertices, it ticks 2n times in total: timestamps run from 1 up to 2n.

🔑 The Parenthesis Theorem

For any two vertices u and v, their discovery/finish intervals [d[u], f[u]] and [d[v], f[v]] are either:

  • Completely nested: one interval is entirely inside the other, OR
  • Completely separate: they don't overlap at all.

They never partially overlap. Think of it like matching parentheses: ( ( ) ) is OK (nested), ( ) ( ) is OK (separate), but a partial overlap like ( [ ) ] is impossible.

Why? If we discover u first and then discover v before finishing u, that means we called DFS-Visit(v) from within the recursive call chain starting at u. So v must finish before u finishes. The interval for v is nested inside u's interval.
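We can check the theorem numerically. The (d, f) pairs below come from tracing a small five-vertex DFS by hand (a hypothetical example for illustration):

```python
# Parenthesis theorem check: every pair of [d, f] intervals
# must be either nested or completely disjoint.
intervals = {'A': (1, 10), 'B': (2, 7), 'C': (8, 9), 'D': (3, 6), 'E': (4, 5)}

def nested_or_disjoint(a, b):
    (d1, f1), (d2, f2) = a, b
    nested = (d1 < d2 and f2 < f1) or (d2 < d1 and f1 < f2)
    disjoint = f1 < d2 or f2 < d1
    return nested or disjoint

names = list(intervals)
ok = all(nested_or_disjoint(intervals[u], intervals[v])
         for i, u in enumerate(names) for v in names[i + 1:])
```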

🔑 The Key Rule for Edges

If there is an edge u → v in the graph, what can we say about finish times?

When we're exploring u (u is GRAY) and we look at the edge u → v, there are three cases:

  1. v is WHITE: We haven't visited v yet. We recurse into v. v will finish before u. So f[v] < f[u]. ✓
  2. v is BLACK: v is already completely done. Its finish time was recorded earlier. So f[v] < f[u]. ✓
  3. v is GRAY: v is an ancestor of u in the DFS tree — we're inside v's exploration! This edge u → v goes "back" to an ancestor. This is called a back edge, and it means there's a cycle v → ... → u → v.

Bottom line: In a DAG (no cycles), for every edge u → v, we always have f[v] < f[u].
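A quick sanity check of this bottom line: compute finish times on a small DAG (a hypothetical four-vertex example) and verify that every edge u → v satisfies f[v] < f[u]:

```python
# On a DAG, every edge u -> v has f[v] < f[u].
edges = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
f, seen, clock = {}, set(), [0]

def visit(u):
    seen.add(u)
    for v in edges[u]:
        if v not in seen:
            visit(v)
    clock[0] += 1
    f[u] = clock[0]       # only finish times are needed for this check

for u in edges:
    if u not in seen:
        visit(u)

assert all(f[v] < f[u] for u in edges for v in edges[u])
```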

🎮 Chapter 5: Interactive DFS Walkthrough

Let's trace DFS on this graph step by step. Watch the colors and timestamps change!

Graph: A → B, A → C, B → D, C → D, D → E

We run DFS from vertex A, processing neighbors in alphabetical order.

Timestamps Table

Vertex   A    B    C    D    E
d[v]     1    2    8    3    4
f[v]    10    7    9    6    5

🏘️ Chapter 6: Strongly Connected Components (SCCs)

🌍 Real-Life Analogy

Imagine neighborhoods in a city with one-way streets. A strongly connected component is a neighborhood where, starting from any house, you can drive to any other house in that same neighborhood and drive back, following the one-way streets.

📖 Formal Definition

A Strongly Connected Component (SCC) of a directed graph G is a maximal set of vertices C ⊆ V such that for every pair of vertices u, v ∈ C:

  • There exists a path from u to v, AND
  • There exists a path from v to u.

Maximal means you can't add any more vertices and still have the property hold.

Let's see an example:

SCC 1 = {1, 2, 3} and SCC 2 = {4, 5}.

Edges: 1→2, 2→3, 3→1, 2→4, 4→5, 5→4. The edge 2→4 is the cross-SCC edge.

🔑 The Component Graph Is a DAG

If you shrink each SCC into a single "super-vertex", the resulting graph (called the Component Graph) is always a DAG — it has no cycles.

Why? If there were a cycle among super-vertices, say SCC_A → SCC_B → ... → SCC_A, then every vertex in all those SCCs could reach every other vertex, so they should all be in the same SCC. That contradicts them being in separate SCCs.

🔧 Chapter 7: Kosaraju's Algorithm — The Steps

Kosaraju's algorithm finds all SCCs in a directed graph in O(V + E) time. Here are the three steps:

Kosaraju(Graph G):
  // STEP 1: Run DFS on original graph G.
  // Record finish times f[v] for all vertices.
  Run DFS(G)
  // STEP 2: Build the transpose graph GT.
  // (Reverse every edge: if G has u→v, GT has v→u)
  Compute GT
  // STEP 3: Run DFS on GT, but in the main loop,
  // process vertices in DECREASING f[v] order.
  // Each DFS tree = one SCC.
  Run DFS(GT) with vertices sorted by decreasing f[v]
  return each tree from Step 3 as an SCC
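The three steps above can be sketched in Python, assuming the graph is an adjacency-list dict; this is a compact illustration, not an optimized implementation:

```python
# Kosaraju's algorithm: two DFS passes plus one edge reversal.
def kosaraju(graph):
    # STEP 1: DFS on G, recording vertices in order of increasing finish time.
    finished, visited = [], set()

    def dfs1(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                dfs1(v)
        finished.append(u)

    for u in graph:
        if u not in visited:
            dfs1(u)

    # STEP 2: build the transpose GT (reverse every edge).
    gt = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            gt[v].append(u)

    # STEP 3: DFS on GT in decreasing finish-time order.
    # Each DFS tree is exactly one SCC.
    visited.clear()
    sccs = []

    def dfs2(u, comp):
        visited.add(u)
        comp.add(u)
        for v in gt[u]:
            if v not in visited:
                dfs2(v, comp)

    for u in reversed(finished):
        if u not in visited:
            comp = set()
            dfs2(u, comp)
            sccs.append(comp)
    return sccs
```

On the Chapter 9 graph (1→2, 2→3, 3→1, 3→4, 4→5, 5→4) this yields the SCCs {1, 2, 3} and {4, 5}.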
🌍 Step-by-Step Analogy

Step 1: Explore the whole city, noting what time you finish exploring each neighborhood. The neighborhoods you finish last are the ones you started from or that are "upstream" — they have roads leading out to other places.

Step 2: Magically reverse every one-way street in the city.

Step 3: Now explore again, but start from the neighborhood that finished last in Step 1. With the roads reversed, you can only reach vertices in the same SCC — the reversed roads that used to go OUT now go IN, so you can't "escape" to other SCCs. Each connected chunk you find is exactly one SCC.

📖 What is the Transpose Graph GT?

GT has the same vertices as G, but every edge is reversed.

If G has edge u → v, then GT has edge v → u.

Crucial property: G and GT have exactly the same SCCs. If you can go from u to v and back in G, you can also go from u to v and back in GT (just using the reversed path).

🧮 Chapter 8: WHY Kosaraju's Works — The Full Proof

This is the most important chapter. We'll build the proof step by step, like stacking bricks.

Brick 1: Define what we need to show

We need to prove that in Step 3, each DFS tree in the forest is exactly one SCC. This means two things:

  1. Every vertex in a DFS tree belongs to the same SCC (we don't mix vertices from different SCCs in one tree).
  2. Every vertex of an SCC ends up in the same DFS tree (we don't split an SCC across trees).

Brick 2: Finish times and SCCs in Step 1

📖 Definition: Finish time of an SCC

For an SCC C, define f(C) = max { f[u] : u ∈ C }. That is, the finish time of the SCC is the largest finish time among all its vertices.

🔑 Lemma 1: If there's an edge between SCCs, the "source" SCC finishes later

Claim: If there is an edge from a vertex in SCC C to a vertex in SCC C' (where C ≠ C'), then f(C) > f(C').

Why? Consider the DFS in Step 1. There are two sub-cases depending on which SCC gets discovered first:

Sub-case A: Some vertex in C is discovered before any vertex in C'. Then from C, DFS will follow the cross-SCC edge into C' and explore all of C' before coming back to finish C. So every vertex in C' finishes before the last vertex in C finishes. Therefore f(C) > f(C').

Sub-case B: Some vertex in C' is discovered first. But there's no edge from C' to C (if there were, combined with the edge from C to C', they'd be in the same SCC!). So DFS explores all of C' and finishes it entirely. Only later does DFS discover C. So again, f(C) > f(C').

In short: in the component graph (the DAG of SCCs), "upstream" SCCs have larger finish times.

Brick 3: Step 3 processes SCCs in the right order

In Step 3, we process vertices in decreasing order of their Step 1 finish times. By Lemma 1, this means:

🔑 Consequence

We start Step 3's DFS from a vertex in the SCC with the highest finish time. By Lemma 1, this is an SCC that has no incoming edges from other SCCs in the component graph — it's a source in the component DAG.

Brick 4: In GT, source SCCs become sinks

Remember, Step 3 runs on the transpose graph GT. When we reverse all edges, every cross-SCC edge in the component DAG flips direction too. So an SCC that was a source in G's component DAG (no incoming cross-SCC edges) becomes a sink in GT's component DAG (no outgoing cross-SCC edges). Let's spell out what that means for the DFS:

🔑 Critical Insight

When we start a DFS from a vertex u in this "sink SCC" of GT, the DFS can reach all other vertices in u's SCC (since SCCs are the same in G and GT), but it cannot escape to any other SCC (because there are no outgoing cross-SCC edges from this SCC in GT).

Therefore, the DFS tree rooted at u contains exactly the vertices of u's SCC. Nothing more, nothing less.

Brick 5: After peeling off one SCC, the argument repeats

Once we've identified the first SCC and colored all its vertices BLACK, the next unvisited vertex we try (the one with the next-highest finish time) belongs to the SCC with the next-highest f(C). By the same argument, every SCC with a larger finish time is already fully BLACK, so among the remaining vertices this SCC has no outgoing cross-SCC edges in GT: it is a sink of the remaining component DAG, and the DFS tree rooted there captures exactly its vertices.

This keeps repeating until all vertices are processed. Each DFS tree is exactly one SCC. ∎

Putting it all together

📜 Complete Proof Summary
  1. Step 1 gives us finish times. Lemma 1 tells us that if SCC C has an edge to SCC C', then f(C) > f(C'). So SCCs are "sorted" by finish time in the component DAG, with sources having the highest times.
  2. Step 2 reverses edges. The SCCs don't change (mutual reachability is symmetric to reversal). But in the transposed component DAG, all arrows between SCCs flip.
  3. Step 3 processes vertices in decreasing f[v] order on GT. The first vertex we pick is in the SCC with the highest finish time — a source in G's component DAG, hence a sink in GT's component DAG. DFS from here explores exactly that SCC and nothing else (can't escape a sink). We mark those vertices done.
  4. Induction: After removing the found SCC, the next-highest-finish-time SCC becomes a sink in the remaining transposed component DAG. Repeat. Each DFS tree = one SCC.
  5. Complexity: Two DFS calls + one graph transposition = O(V + E) + O(V + E) + O(V + E) = O(V + E).
⚠️ Common Mistake

"Why can't we just run DFS on G (without transposing) and call each tree an SCC?"

Because DFS on the original graph can escape from one SCC to another! If SCC C has an edge to SCC C', DFS starting in C will wander into C'. The transpose prevents this escape by flipping the cross-SCC edges.
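A two-vertex example makes this concrete: on the graph 1 → 2 the SCCs are {1} and {2}, yet a single DFS tree on the original graph swallows both.

```python
# Why plain DFS on G is not enough: the DFS escapes from SCC {1}
# into SCC {2}, so one tree mixes two different SCCs.
graph = {1: [2], 2: []}
tree = set()

def collect(u):
    tree.add(u)
    for v in graph[u]:
        if v not in tree:
            collect(v)

collect(1)
# tree is {1, 2}, even though 1 and 2 are in different SCCs
```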

⚠️ Another Common Mistake

"Why not just run DFS on GT in any order?"

Because if you start from a vertex in a non-sink SCC of GT, DFS could escape to other SCCs (following the transposed cross-SCC edges that now point outward). The decreasing-finish-time order ensures you always start from a sink of the remaining transposed component DAG.

🎮 Chapter 9: Interactive Kosaraju's Walkthrough

Let's trace Kosaraju's algorithm on the graph: 1→2, 2→3, 3→1, 3→4, 4→5, 5→4.

We run Step 1's DFS from vertex 1, processing neighbors in increasing numeric order.

Tracking Table

Vertex          1    2    3    4    5
d[v] (Step 1)   1    2    3    4    5
f[v] (Step 1)  10    9    8    7    6
SCC             1    1    1    2    2

🎯 Chapter 10: Summary

DFS Recap

  • Every vertex goes WHITE → GRAY → BLACK.
  • d[u] is the discovery time (paint GRAY), f[u] is the finish time (paint BLACK).
  • Discovery/finish intervals are always nested or disjoint (the Parenthesis Theorem).
  • An edge to a GRAY vertex is a back edge, which means the graph has a cycle.

Kosaraju's Algorithm Recap

  1. DFS on G → get finish times.
  2. Transpose G → reverse all edges → same SCCs.
  3. DFS on GT in decreasing finish time order → each tree = one SCC.

Why It Works (One-Paragraph Version)

In Step 1, SCCs that can reach other SCCs get higher finish times (Lemma 1). In Step 3, we process high-finish-time vertices first on the reversed graph. On the reversed graph, the SCC with the highest finish time is a sink — it has no outgoing cross-SCC edges. So DFS from there can only reach vertices in that same SCC. After removing them, the next SCC becomes a sink, and the process repeats. Each DFS tree captures exactly one SCC.

🔑 The Genius of Kosaraju's

The algorithm is just two DFS calls and one edge reversal. The magic lies in the ordering: Step 1's finish times tell us which SCC to peel off first, and the transpose ensures DFS can't wander across SCC boundaries. Together, they give us all SCCs in linear time.