Back to Series
Claude Code
Claude CodeFeatures Guide
Part 8 of the Claude Code Series

Recovering and Orchestrating Claude Code Sessions After a Reboot

Your Claude Code sessions did not crash — the machine rebooted under them. How to prove it, recover a whole fleet from disk, and orchestrate the survivors.

21 May 202613 min read

You are running a row of AI coding agents at once, and you step away from the desk. You come back, and every window is empty. No crash dialog. No error in any log. Your instinct says something killed your processes — a runaway script, an out-of-memory reaper, a rogue cleanup job.

Almost always, it did not. The whole machine rebooted underneath you. This is Part 8 of our Claude Code Features series, and it is the honest counterweight to Part 7, where we sold the promise that AI agents work while you sleep. They can. But the first time a fleet vanishes overnight, you need to know what happened, get everything back, and decide what is actually safe to run unattended. This post is that field guide.

Download

When Agents Vanish

Part 8: Session Recovery

A fleet of empty windows is rarely a kill — it is a reboot

Brand
JJM

The Empty Windows: When a Whole Fleet Disappears at Once

The scene is specific. You had seven or eight Claude Code sessions open across as many terminal tabs, each parked in a different project, each mid-task. You closed the laptop or walked off for an hour. When you return, the terminals are still open but the sessions are gone — back to a fresh shell prompt, no conversation, no context.

The emotional response is to assume sabotage. Something must have terminated the processes. That assumption sends you hunting through process logs and OOM-killer output, and you find nothing, which makes it worse.

Stop and reframe. A single event that takes out every session simultaneously, leaves no error, and resets the terminals to a clean prompt is not the signature of a process reaper. A reaper picks off the heaviest process, or the one that breached a limit. It does not silently and uniformly clear all of them at the same instant. What clears everything at once is the thing underneath all of them: the operating system itself restarting.

Diagnosing the Disappearance

1
Symptom
Every window empty
Seven sessions gone at once. No crash dialog, no error in any log.
2
Signal 1
System event 1074
SYSTEM / TrustedInstaller + 'Upgrade (Planned)' = a feature update forced the reboot.
3
Signal 2
One shared timestamp
A dozen transcripts last-modified in the exact same second. One event, not a reaper.
4
Verdict
Reboot, not a kill
Two agreeing signals close the case. Stop diagnosing — start recovering.

It Was Not a Crash: Reading Windows Event 1074

On Windows, the definitive tell is System event 1074. This event records who initiated a shutdown or restart, and it is the first place to look — not the application logs, not the terminal scrollback.

Open Event Viewer, go to Windows Logs then System, and filter for event ID 1074. Read the description carefully, because the wording tells you the cause:

  • If the "on behalf of user" field names your own account, a person or a process running as you triggered the restart. That is a deliberate or accidental human action.
  • If the initiator is SYSTEM or TrustedInstaller, and the reason reads "Operating System: Upgrade (Planned)", a Windows feature update forced the reboot.

That second case is the trap. Most people lock down auto-restart for quality updates and assume they are protected. But feature upgrades bypass that policy. The setting that defers a restart after a monthly patch does not apply to a feature update rollout, so the machine reboots even though you believed auto-restart was disabled. If you run agents overnight on Windows, this single distinction explains a large share of mysterious morning-after disappearances.

The fix is not to argue with the logs. It is to recognise the pattern instantly the next time, so you spend your energy on recovery rather than on a phantom hunt for a killer that never existed.

Read Event 1074 First

It names who initiated the restart. SYSTEM or TrustedInstaller with 'Operating System: Upgrade (Planned)' means a feature update forced the reboot — bypassing the policy that blocks auto-restart for quality updates.

The Second Signal: When Every Transcript Shares One Timestamp

Event 1074 is your primary evidence. Confirm it a second, independent way before you commit to the reboot theory, because two agreeing signals beat one.

Claude Code writes each conversation to a transcript file on disk. Those files carry a last-modified timestamp that updates as the session writes. Pull the modified time of every transcript that was live when things went dark.

If a dozen transcripts all share the exact same last-modified second, that is your confirmation. Independent agents doing independent work do not coincidentally stop writing at the same second. One external event froze all of them simultaneously — and a reboot is exactly such an event. A process reaper, by contrast, would stagger its kills across seconds or minutes as each target tripped its threshold.

So the diagnosis is a two-key turn: event 1074 says the OS restarted, and the synchronised transcript timestamps say every session died with it. Together they close the case. Now you can stop diagnosing and start recovering.

1074
Windows event ID to read first
1 sec
Shared timestamp = one event
100%
Conversations persisted to disk
1-2
Realistic concurrent agent ceiling

Your Sessions Are Not Gone: How Claude Code Persists Every Conversation

Here is the relief: the sessions are not lost. Claude Code persists every conversation to disk as it goes. The reboot ended the running processes, but the state — the full message history, the context, the work product captured in the transcript — survived on the filesystem.

That changes the recovery problem from "reconstruct everything from memory" to "point the tool back at what it already saved." You resume any single session by invoking the resume command against that session's id, and you run it from the directory the session was originally working in. The working directory matters: a session resumed in the wrong folder is a session that has lost its bearings, even though its conversation is intact.

For one session, that is the whole story. Find the session id, change into the right project directory, resume. The agent picks up with its full context restored, as if the reboot had been a long pause rather than a death.

Your Instinct

  • A reaper killed my processes
  • An OOM event picked them off
  • A cleanup job wiped them
  • Something is broken in the tool

What Happened

  • The OS restarted underneath them
  • Every session died in the same second
  • The terminals reset to a clean prompt
  • The transcripts are safe on disk

Resuming One Session, Then a Whole Fleet

One session is easy. A fleet of a dozen is a different job, and doing it by hand — finding each id, remembering each working directory, opening each tab — is slow and error-prone at exactly the moment you want to move fast.

Automate it. Each transcript's opening lines contain what you need to relaunch it: the working directory the session ran in, and enough of a label to identify which project it was. Read the head of every transcript, extract the directory and label for each, and then relaunch the whole set as named tabs in a single window with one script. You go from "everything is gone" to "everything is back, labelled, and parked in the right folders" in one command.

That is the core recovery loop, and it scales. Whether you lost three sessions or fifteen, the procedure is identical: read the transcripts, recover the coordinates, relaunch as a labelled fleet.

Download
100%
Recoverable

Claude Code persists every conversation to disk — resume from the session id

Brand
JJM

Two Caveats Before You Touch a File

Recovery is not finished the moment the tabs reappear. Two things must be true before any recovered agent is allowed to make changes.

First, re-registration. If you run any system that tracks which agents are "live" — a dashboard, an orchestration layer, a heartbeat registry — a recovered session is a stranger to it. The old process that registered is dead. Each resumed session has to re-register so your tracking reflects reality. Skip this and your live-agent view lies to you, which is dangerous precisely when you are coordinating many agents.

Second, the shared working tree. If several recovered sessions share one git working tree, they can step on each other the instant they start editing. Before any of them writes a file, run a branch-guard: confirm each session is on the branch it should be, and that two agents are not about to commit conflicting changes into the same tree. A reboot recovery that ends in a tangle of cross-committed work is not a recovery — it is a second incident.

From Empty Windows to a Restored Fleet

Read transcripts
head of each file
Recover coordinates
working dir + label
Relaunch fleet
named tabs, one script

Same procedure whether you lost three sessions or fifteen.

Putting Recovered Sessions Under One Orchestrator

The harder, more interesting question is whether you can take recovered sessions and put them under a single orchestrator to run autonomously, rather than babysitting each tab yourself.

You can, and the mechanism is worth understanding. You resume an existing session inside a controllable pseudo-terminal — a PTY the orchestrator owns — and then you drive it over a WebSocket. Driving means two directions. To act, the orchestrator sends text into the session, pauses, then sends a return to submit it. To observe, it reads the session's state by forcing the terminal to repaint and then stripping the formatting codes out of the captured screen, leaving clean text it can reason about.

With that control loop in place, the orchestrator does the one job that matters most: it gates consequential actions. Workers are free to investigate, read, search, and draft as much as they like. But a production merge is routed as a hard stop — the orchestrator must explicitly approve it before it proceeds. That separation, free investigation with a gated point of no return, is the difference between useful autonomy and a fleet of agents that can do real damage while you are not watching.

1

Read each transcript head

Pull the working directory and a label from the opening lines of every saved session.

2

Relaunch as named tabs

One script reopens the whole fleet in a single window, each parked in its own project.

3

Re-register every session

Recovered sessions are strangers to your live-agent tracker until they register again.

4

Run a branch-guard

Before any edit, confirm sessions sharing one working tree are not about to collide.

The Four Walls of Unattended Overnight Autonomy

This is where honesty matters more than salesmanship. The orchestration above is a partially-worked story, not a finished product, and the real lesson is the set of walls you must clear before you trust a fleet to run overnight on a personal machine. There are four, and each one bit hard.

Wall one: rate limits. Seven heavy sessions plus an orchestrator hit the provider's rate limits on nearly every turn. The arithmetic is simple and unforgiving — the more agents you drive in parallel, the sooner you stall. The real ceiling for sustained unattended work is one or two agents at a time, not seven. Driving fewer agents harder beats driving all of them at once.

Wall two: sleep. Modern-standby sleep froze every session and every watchdog timer the moment the machine dozed. Worse, on wake the watchdog instantly blew past its runtime cap, because from its perspective no time had passed during sleep and then suddenly hours had. Any timer-based safety net you build has to account for a machine that can be paused mid-count.

Wall three: the auto-approver. The component that auto-approved safe prompts silently broke when the tool's permission UI changed layout. It had been matching against the old screen; the new layout meant it classified every prompt as unknown, and unattended progress stalled behind prompts nobody was there to answer. Anything that reads a UI to make decisions is brittle against that UI changing.

Wall four: context limits. Sessions running near their context limit degrade in quality and risk auto-compacting mid-task — discarding or summarising history at the worst possible moment. A long unattended run is exactly the situation that pushes a session toward that edge.

Pro Tip

Two Caveats Before Any Edit

Recovered sessions must re-register with whatever tracks live agents, and sessions sharing one working tree need a branch-guard before they write a single file. A recovery that ends in cross-committed work is just a second incident.

What Actually Works: Build for the Reboot, Build for the Sleep

Put the four walls together and a clear operating posture emerges. None of them means autonomy is impossible. They mean you design for the failure modes instead of pretending they will not happen.

Build for the reboot. Assume the machine will restart under you, and make recovery a one-command routine rather than a panic. The transcripts persist; lean on that.

Build for the sleep. Assume the machine will doze, and do not trust any safety timer that a pause can corrupt. Prefer checks that read real wall-clock state over ones that count elapsed runtime.

And drive fewer agents harder rather than all of them at once. The fantasy is a wall of ten parallel workers churning through the backlog overnight. The reality that actually ships work is one or two well-fed sessions under an orchestrator that gates the dangerous moves. That is not a smaller ambition — it is the version that survives contact with rate limits, sleep, brittle UI matching, and context ceilings.

Driving a Resumed Session Over a WebSocket

act
Send text
act
Submit (return)
observe
Force repaint
observe
Strip formatting
Hard Stop
Production merge
Orchestrator must approve

What's Next

Session recovery and orchestration is the operational backbone behind everything else in this series — from spinning up parallel agent teams to running background tasks while you sleep. The promise of autonomous AI development is real, but it rests on knowing what happens when the machine restarts and being ready for it.

If you want this kind of resilient, agent-driven workflow built into your own product or internal tooling, that is exactly the work we do. Talk to us about AI application development or marketing automation, and we will help you build systems that recover instead of breaking.

Social Media Carousel

7 cards — Scroll to browse, click any card to download

1 / 7

When Agents Vanish

Part 8: Session Recovery

A fleet of empty windows is rarely a kill — it is a reboot

JJM
Download
2 / 7

Diagnose the Disappearance

1
Windows event 1074 names the initiator
2
Upgrade (Planned) bypasses auto-reboot policy
3
Identical transcript timestamps = one event
4
No staggered kills means no reaper
JJM
Download
3 / 7
100%
Recoverable

Claude Code persists every conversation to disk — resume from the session id

JJM
Download
4 / 7

Instinct vs Reality

Before

Something killed my processes

After

The machine rebooted underneath them

JJM
Download
5 / 7

Gate the Dangerous Move

Let workers investigate freely. Route the production merge as a hard stop the orchestrator must approve.

JJM
Download
6 / 7
Key Takeaway

Drive Fewer, Harder

Rate limits, sleep, brittle UI matching and context ceilings cap the real fleet at one or two agents.

JJM
Download
7 / 7

Build Systems That Recover

Resilient agent workflows for your product

Get Your Free AI Strategy
JJM
Download

Share This Article

Spread the knowledge

Build With Claude Code

We use Claude Code to ship faster, smarter, and at scale. Let us build your next project.