What changed
The old storage layout scattered session data across hash-based directories
(~/.agent-orchestrator/{12-hex-hash}-{projectId}/) with key=value flat-file metadata
and a storageKey indirection in global config. Four competing hash methods made
collision resolution fragile.
The new layout uses projects/{projectId}/ with JSON metadata files,
a single orchestrator record per project, and status computed from lifecycle state
rather than stored as a separate field.
| Before | After |
|---|---|
| ~/.agent-orchestrator/{hash}-{name}/sessions/{id} | ~/.agent-orchestrator/projects/{name}/sessions/{id}.json |
| Key=value flat files (status=working\nbranch=main) | JSON metadata with typed fields |
| storageKey in global config, 4 hash algorithms | Direct projectId, no hashing |
| status stored and read independently | status derived from lifecycle via deriveLegacyStatus() |
| Tmux names: {hash}-{prefix}-{num} | Tmux names: {prefix}-{num} |
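To make the flat-file to JSON change concrete, here is a minimal sketch of the conversion. The helper names (parseKeyValueMetadata, toJsonMetadata) are hypothetical and not taken from storage-v2.ts, and every value is kept as a string because the real typed SessionMetadata shape is not shown here.

```typescript
// Hypothetical helpers, not the real migration code.
type LegacyFields = Record<string, string>;

// Parse legacy key=value lines into a plain object.
function parseKeyValueMetadata(raw: string): LegacyFields {
  const fields: LegacyFields = {};
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const eq = trimmed.indexOf("=");
    if (eq <= 0) continue; // no "=" or empty key: skip the line
    // Split on the first "=" only, so values may contain "=" themselves.
    fields[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  }
  return fields;
}

// Serialize the parsed fields as pretty-printed JSON.
function toJsonMetadata(raw: string): string {
  return JSON.stringify(parseKeyValueMetadata(raw), null, 2);
}

console.log(toJsonMetadata("status=working\nbranch=main"));
```

The real migration presumably also coerces fields into their typed forms; this sketch only covers the format change.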
Migration & rollback safety
Three rounds of code review surfaced critical data-loss risks in the migration path. Each was verified against the actual codebase before fixing.
- Worktrees live at ~/.worktrees/{projectId}/{sessionId}/ (two levels deep). moveStrayWorktrees() only scanned top-level entries, treating {projectId} dirs as session names. All nested worktrees were orphaned.
- rmSync(projectDir, { recursive: true }) deleted the entire projects/{id}/ directory during rollback, including sessions created after migration that had no .migrated counterpart.
- Both a stored status field and a lifecycle-derived status existed. readMetadata preferred the stored value, meaning stale statuses overrode the canonical lifecycle state machine.
- Migration wrote orchestrator sessions to orchestrator.json, but the runtime reads all sessions from sessions/. Orchestrator sessions became invisible after migration.
- Stored worktree paths were blindly rewritten to the new V2 location without verifying the directory was actually moved there. Paths pointed nowhere.
- Some sessions had no statePayload. Migration dropped their status field, causing readMetadata to fall through to "unknown".
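The nested-worktree problem is easiest to see with a two-level walk. The sketch below uses an in-memory stand-in for the directory tree (an assumption; the real moveStrayWorktrees() reads the filesystem), and listNestedWorktrees is a hypothetical name.

```typescript
// In-memory stand-in for ~/.worktrees: projectId -> sessionIds.
// This shape is an assumption made to keep the example self-contained.
type WorktreeTree = Record<string, string[]>;

interface WorktreeRef {
  projectId: string;
  sessionId: string;
  path: string;
}

// Walk both levels so {projectId} dirs are never mistaken for sessions.
function listNestedWorktrees(root: string, tree: WorktreeTree): WorktreeRef[] {
  const refs: WorktreeRef[] = [];
  for (const [projectId, sessionIds] of Object.entries(tree)) {
    for (const sessionId of sessionIds) {
      refs.push({ projectId, sessionId, path: `${root}/${projectId}/${sessionId}` });
    }
  }
  return refs;
}
```

A scan that stops at the first level would return only the project directories, which is the behavior the first bullet describes.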
How rollback safety works now
// Before renaming .migrated dirs back, count sessions that
// exist only in projects/{id}/ (created after migration)
const postMigrationSessions = countPostMigrationSessions(
  projectDir,
  migratedDirs.filter(d => d.projectId === projectId),
);
if (postMigrationSessions > 0) {
  log(`Warning: ${postMigrationSessions} session(s) created after migration`);
  log(` Skipping deletion. Remove manually after verifying.`);
} else {
  rmSync(projectDir, { recursive: true });
}
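The counting step can be thought of as a set difference. This is a sketch under the assumption that both sides reduce to session IDs; the real countPostMigrationSessions takes a directory and the .migrated dirs, so the function below uses a different, hypothetical name and signature.

```typescript
// Hypothetical: count IDs present in the V2 project dir that have no
// counterpart in any .migrated directory (i.e. created after migration).
function countIdsMissingFromMigrated(
  v2SessionIds: Iterable<string>,
  migratedSessionIds: Iterable<string>,
): number {
  const migrated = new Set(migratedSessionIds);
  let count = 0;
  for (const id of v2SessionIds) {
    if (!migrated.has(id)) count += 1;
  }
  return count;
}
```

Any non-zero result means a recursive delete would destroy data that rollback cannot restore, which is why the deletion is skipped.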
Restore → instant re-kill
While investigating a live session (ao-90), we discovered that restoring a session whose PR was already merged would immediately re-terminate it. The terminal would disappear from the dashboard but the tmux process stayed alive.
The broken flow
The fix
// Reset terminal PR state so lifecycle manager doesn't
// immediately re-terminate the session
if (restoredLifecycle.pr.state === "merged" ||
    restoredLifecycle.pr.state === "closed") {
  restoredLifecycle.pr.state = "none";
  restoredLifecycle.pr.reason = "cleared_on_restore";
  restoredLifecycle.pr.number = null;
  restoredLifecycle.pr.url = null;
}
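For reviewers who want to exercise the reset in isolation, the same logic can be written as a pure function. The PrLifecycle type and the "open" state are assumptions for illustration; only "merged", "closed", "none", and "cleared_on_restore" come from the snippet above.

```typescript
// Assumed shape; the real lifecycle type lives in types.ts.
type PrState = "none" | "open" | "merged" | "closed";

interface PrLifecycle {
  state: PrState;
  reason: string;
  number: number | null;
  url: string | null;
}

// Return a cleared PR record for terminal states, otherwise the input unchanged.
function resetTerminalPrState(pr: PrLifecycle): PrLifecycle {
  if (pr.state === "merged" || pr.state === "closed") {
    return { state: "none", reason: "cleared_on_restore", number: null, url: null };
  }
  return pr;
}
```

Non-terminal states pass through untouched, so a restored session with a still-active PR keeps tracking it.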
Restored workers lose permissionless mode
getRestoreCommand() in the Claude Code plugin only added
--dangerously-skip-permissions for orchestrator sessions.
Worker sessions with permissionless config silently lost the flag on restore,
causing agents to stall on permission prompts mid-session.
// Before (broken): only orchestrators got the flag
const isOrchestrator = session.metadata?.["role"] === "orchestrator";
if (isOrchestrator && (permissionMode === "permissionless" || ...))
// After (fixed): matches getLaunchCommand behavior
if (permissionMode === "permissionless" || permissionMode === "auto-edit")
  parts.push("--dangerously-skip-permissions");
This was Claude Code-only. Other agent plugins (Codex, Aider, OpenCode) handled permissions correctly in their restore commands.
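One way to keep launch and restore from drifting again is to compute the flag in a single helper that ignores the session role. This is a sketch of that idea, not what the plugin does today; permissionFlags, withPermissionFlags, and the "default" mode are hypothetical.

```typescript
// "default" is an assumed placeholder for any non-permissionless mode.
type PermissionMode = "permissionless" | "auto-edit" | "default";

// The flag depends only on the permission mode, never on the session role.
function permissionFlags(mode: PermissionMode | undefined): string[] {
  return mode === "permissionless" || mode === "auto-edit"
    ? ["--dangerously-skip-permissions"]
    : [];
}

// Hypothetical builder: append the flags to whatever base parts the caller has.
function withPermissionFlags(baseParts: string[], mode: PermissionMode | undefined): string[] {
  return [...baseParts, ...permissionFlags(mode)];
}
```

If both getLaunchCommand and getRestoreCommand went through a helper like this, a worker and an orchestrator with the same config would always get the same flags.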
Everything else we fixed
- The pattern /^[a-zA-Z][a-zA-Z0-9_]*$/ didn't allow hyphens in the prefix, but sessionPrefix validation allows [a-zA-Z0-9_-]+. Names like my-app-1 failed to parse.
- Deletion reported true even when no directory existed. Now checks existsSync() before deletion.
- sessionPrefix was derived from the raw directory basename without sanitizing. Folder names like my.app produced invalid prefixes. Added the same sanitization as config-generator.ts.
- SessionStatus import left unused in metadata.ts.
- Two call sites in lifecycle-manager.ts still passed the removed previousStatus parameter.
- Migration used projects/{id}/archive/ but the runtime uses sessions/archive/. Aligned to sessions/archive/ everywhere.
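The two prefix fixes can be illustrated together. Both functions are hypothetical sketches: the parser assumes the new {prefix}-{num} tmux naming, and the sanitizer assumes a simple "replace anything outside [a-zA-Z0-9_-] with a hyphen" rule, which may not match config-generator.ts exactly.

```typescript
// Prefix uses the same character class as sessionPrefix validation.
const TMUX_NAME = /^([a-zA-Z0-9_-]+)-(\d+)$/;

// Split a tmux session name into prefix and number; null if it does not match.
function parseTmuxName(name: string): { prefix: string; num: number } | null {
  const m = TMUX_NAME.exec(name);
  if (m === null) return null;
  return { prefix: m[1], num: Number.parseInt(m[2], 10) };
}

// Assumed sanitization rule, for illustration only.
function sanitizePrefix(basename: string): string {
  return basename.replace(/[^a-zA-Z0-9_-]/g, "-");
}
```

Because the prefix group is greedy and the number is anchored at the end, only the final hyphen-separated numeric segment is treated as the session number.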
Full timeline
Every commit on the storage-redesign branch, from the initial
implementation through three rounds of review.
How to review this PR
Quick verification
- Pull the branch: git checkout storage-redesign
- Build: pnpm install && pnpm build
- Typecheck: pnpm typecheck (zero errors)
- Tests: pnpm test (869 tests across 46 files)
- Lint: pnpm lint (zero errors; warnings are pre-existing)
Migration dry run
- Run ao migrate-storage --dry-run to preview changes without modifying anything
- Verify it detects your legacy hash directories correctly
- Check the summary for session/archive/worktree counts
Critical files to focus on
| File | Why it matters |
|---|---|
| metadata.ts | All storage flows through here. JSON serialization, status derivation on read. |
| storage-v2.ts | 1065-line migration module. Conversion, rollback, stray worktree handling. |
| session-manager.ts | ~20 path/metadata call sites updated. Restore lifecycle reset. |
| lifecycle-state.ts | Status derivation from lifecycle state+reason. No more previousStatus. |
| global-config.ts | 305 lines removed. storageKey system completely stripped. |
| paths.ts | New V2 path functions alongside deprecated old ones. |
| types.ts | SessionMetadata restructured. CanonicalPRReason extended. |
What to test manually
- Spawn a session, verify metadata is JSON at ~/.agent-orchestrator/projects/{id}/sessions/
- Kill and restore a session: terminal should stay connected
- Merge a PR, then restore: session should survive (not re-killed)
- Check dashboard shows correct status for all lifecycle states
- Run ao migrate-storage on a real setup, then --rollback to verify round-trip
Why we did it this way
Status is computed, never stored
deriveLegacyStatus(lifecycle) is the single source of truth.
writeMetadata still includes status for initial writes (sessions
without a lifecycle yet), but readMetadata always prefers
lifecycle-derived status when a lifecycle exists.
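The read-side precedence can be sketched as follows. The types and the state-to-status mapping are invented for illustration (the real mapping lives in deriveLegacyStatus() and considers state plus reason); only the precedence rule and the "unknown" fallback reflect the description above.

```typescript
// Illustrative types; the real SessionStatus and lifecycle have more members.
type SessionStatus = "working" | "terminated" | "unknown";

interface Lifecycle {
  state: "active" | "terminated";
}

interface StoredMetadata {
  status?: SessionStatus;
  lifecycle?: Lifecycle;
}

// Stand-in for deriveLegacyStatus(); the mapping here is hypothetical.
function deriveStatusSketch(lifecycle: Lifecycle): SessionStatus {
  return lifecycle.state === "active" ? "working" : "terminated";
}

// Lifecycle wins when present; stored status is only a fallback.
function resolveStatus(meta: StoredMetadata): SessionStatus {
  if (meta.lifecycle !== undefined) return deriveStatusSketch(meta.lifecycle);
  return meta.status ?? "unknown";
}
```

With this ordering, a stale stored status can no longer override the lifecycle, while sessions that have no lifecycle yet still report what was written at creation.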
Migration is one-way with escape hatch
New code reads JSON only. No lazy dual-format reading. The migration command
converts old key=value to JSON. --rollback restores .migrated
directories but now checks for post-migration sessions before deleting V2 data.
PR state reset on restore, not grace period
We considered a time-based grace period after restore, but it was a band-aid. Instead, terminal PR states are cleared on restore: the old PR is done, and if the agent creates a new one, auto-detect picks it up.
detectingAttempts stays as string
A reviewer flagged this as a bug, but it's intentional.
buildTransitionMetadataPatch returns Record<string, string>,
and all consumers parse with Number.parseInt(). The string format
is consistent across the entire read/write pipeline.
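A small sketch of the string round-trip, assuming the counter is incremented on each detection attempt (that increment behavior and the helper name bumpDetectingAttempts are assumptions, not taken from buildTransitionMetadataPatch):

```typescript
// Patches are string-only, matching the Record<string, string> contract.
type MetadataPatch = Record<string, string>;

// Read the counter with Number.parseInt and write it back as a string.
function bumpDetectingAttempts(
  current: Record<string, string | undefined>,
): MetadataPatch {
  const previous = Number.parseInt(current["detectingAttempts"] ?? "0", 10);
  // Treat a missing or unparsable value as zero attempts so far.
  const next = Number.isNaN(previous) ? 1 : previous + 1;
  return { detectingAttempts: String(next) };
}
```

Switching the field to a number would require changing the patch type and every parseInt call site at once, which is why the string form stays.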