2026-06-30 · SpecKit Companion Strategic Assessment · grounded in the repo

What We're Doing vs What We Should Do

An honest self-assessment against the research — and, for every point, how it actually works in the code today and how to build the missing piece. Ends with a story map you can pull tasks from.

The verdict: Companion is strategically well-positioned but under-built and under-marketed on its own strengths. The research validates almost every foundational bet, and — good news from reading the code — several "gaps" are further along than expected (drift, coverage, adopt, auto already exist; users can already add their own commands). The real work is a small number of high-leverage additions, and one architectural unlock that gates the best of them.

The unlock: today there is one global AI provider at a time and no model selection, and dispatch is one-way (Companion hands off command text and never reads the response back). So the marquee idea — "a second model reviews your plan" — isn't a switch we can flip; it needs per-node provider/model selection plus a host-gated way to run the review inside the AI's own turn. Build that, and cross-provider stops being a convenience and becomes the quality wedge only we can ship.

How it actually works today (answers to your questions)

Grounded in the real files, so the plan below isn't guesswork.

How do the "nodes" work? (there are two systems)

The repo deliberately keeps two things apart (speckit-extension/docs/node-model.md):

1 · Workflow steps — the pipeline in speckit-extension/workflows/speckit-companion.workflow.yml: specify → classify → route(switch) → {plan → tasks → implement} → mark-complete. Step types are command, switch, gate, shell. This is data, meant to run on spec-kit's engine (specify workflow run/resume). Caveat: the shipped VS Code path does not actually run this YAML — it has its own hardcoded 5-step list (DEFAULT_WORKFLOW/COMPANION_WORKFLOW in workflowManager.ts), and resume dispatches the installed /speckit.* commands directly. So the YAML engine path is partly aspirational.

2 · Composable command nodes — each command (e.g. speckit.companion.specify) is assembled from markdown fragments under speckit-extension/nodes/<command>/, ordered by _order.yml, with reusable parts (timing, sizing, routing, self-advance) filled into fences by assemble-nodes.py. Node kinds: investigate, author, gate, control. Big caveat: v1 is a pure refactor — the assembled body is byte-identical to the old hand-written command. "Recipes" that actually add or drop a section (the real payoff) are planned, not built.

So "add a grill-me node" means one of two concrete things: (a) add a fragment under nodes/<command>/ + its _order.yml entry (once section-level composition ships), or (b) add a before/after node hook in .specify/companion.yml — which is AI-interpreted prose, "there is no engine" yet.
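To make the composition model concrete, here is a minimal sketch of ordered-fragment assembly and a v1-style recipe that may only reorder nodes. It is not the actual assemble-nodes.py: the function names, the fragment bodies, and the in-memory dict (standing in for files under nodes/<command>/ and their _order.yml) are all assumptions for illustration.

```python
def assemble(fragments: dict[str, str], order: list[str]) -> str:
    """Join fragment bodies in the given order, one blank line between them."""
    missing = [name for name in order if name not in fragments]
    if missing:
        raise KeyError(f"order lists unknown fragments: {missing}")
    return "\n\n".join(fragments[name].strip() for name in order)


def apply_recipe(order: list[str], recipe: list[str]) -> list[str]:
    """v1-style recipe: may reorder existing nodes, never add or drop one."""
    if sorted(recipe) != sorted(order):
        raise ValueError("a v1 recipe can only reorder existing nodes")
    return list(recipe)


# Hypothetical fragments, one per node kind mentioned in the text.
fragments = {
    "investigate": "## Investigate\nRead the relevant code.",
    "author": "## Author\nWrite the spec.",
    "gate": "## Gate\nPause for approval.",
}
default_order = ["investigate", "author", "gate"]
body = assemble(fragments, default_order)
```

Under this sketch, a recipe such as ["author", "investigate", "gate"] is accepted, while one that lists a new "grill-me" node is rejected. That rejection is exactly the limitation section-level composition would lift.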

Because we're a spec-kit extension, do we support every tool?

No — we support a fixed registry of 8 dispatch targets, via three styles (all one-way: we send command text, never read the response back):

1 · Six terminal CLIs: claude, gemini, copilot (ghcs), codex, qwen, opencode.

2 · IDE Chat: the host editor's built-in chat (Copilot in VS Code, Composer in Cursor, Cascade in Windsurf, Antigravity).

3 · Claude panel: the Claude Code GUI panel via URI handler.

Files: src/ai-providers/ (8 provider classes + factory).

The mechanism: we assemble a prompt and hand off the /speckit.* command text; the host CLI/chat resolves it. For IDE Chat and the Claude panel we prefill but can't auto-submit (the user presses Enter), and for all of them we never read the answer back. A new tool = a new registered provider (constructor + PROVIDER_PATHS entry) — not open-ended, but the breadth is the widest in the category and it's the #1 thing the market asks rivals for.
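The shape of that registry can be sketched in a few lines. This is an illustration only, not the code in src/ai-providers/: the registry keys "ide-chat" and "claude-panel", the Target class, and the hand_off function are hypothetical names, and the assumption that terminal CLIs submit without a keypress is inferred from the text rather than confirmed.

```python
from dataclasses import dataclass
from enum import Enum


class Style(Enum):
    TERMINAL = "terminal"
    IDE_CHAT = "ide-chat"
    CLAUDE_PANEL = "claude-panel"


@dataclass(frozen=True)
class Target:
    key: str      # hypothetical registry key
    style: Style

    @property
    def auto_submits(self) -> bool:
        # Assumption: only terminal targets submit without the user pressing Enter.
        return self.style is Style.TERMINAL


TERMINAL_CLIS = ("claude", "gemini", "copilot", "codex", "qwen", "opencode")
REGISTRY = {key: Target(key, Style.TERMINAL) for key in TERMINAL_CLIS}
REGISTRY["ide-chat"] = Target("ide-chat", Style.IDE_CHAT)
REGISTRY["claude-panel"] = Target("claude-panel", Style.CLAUDE_PANEL)


def hand_off(provider: str, command_text: str) -> dict:
    """Describe a one-way hand-off; nothing here ever reads a response back."""
    target = REGISTRY.get(provider)
    if target is None:
        raise ValueError(f"unregistered provider: {provider}")
    return {
        "target": target.key,
        "text": command_text,
        "user_must_press_enter": not target.auto_submits,
    }
```

The return value describes what was sent, never what came back, which mirrors the one-way constraint that shapes the review design later in this document.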

Can we configure nodes to use any model / AI provider?

Today: no, on both counts. The active provider is a single global machine-scoped setting (speckit.aiProvider) resolved once for the whole run — there is no per-spec, per-step, or per-node provider override. And there is no model selection anywhere — the model is whatever the underlying CLI defaults to (no model field in any setting, node frontmatter, or workflow step).

The only related knobs: workflow.yml passes one global inputs.integration uniformly to every step; VS Code custom workflows can gate a workflow to certain providers (supportedAiProviders, hides it if the active provider isn't listed) but cannot select one per step. There is a static provider capability profile (ProviderPaths: supportsHooks, autoApproveFlag, command format, …) but no canContinue/supportsSubagents flags — "continue if you can" is a prompt sentence, not a code flag.

This is the unlock. Per-node provider/model selection is the prerequisite for adversarial review, model routing (Opus for planning, Haiku for cheap steps), and cost control. It's a clean, additive build (a model/provider field on a step + resolving the right provider per dispatch).
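As a rough sketch of how small that build could be, the resolution rule might look like the following. The Step class, the resolve function, and the provider and model fields are proposals, not anything in the repo today; the fallback policy (drop the model when the requested provider is not registered) is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    name: str
    provider: Optional[str] = None  # proposed field, absent from the schema today
    model: Optional[str] = None     # proposed field, absent from the schema today


def resolve(step: Step, global_provider: str, registered: set[str]) -> tuple[str, Optional[str]]:
    """Pick the provider and model for one dispatch, defaulting to the global provider."""
    provider = step.provider or global_provider
    if provider not in registered:
        # Fall back cleanly: the step's model belongs to a provider we cannot
        # reach, so it is dropped along with the override.
        return global_provider, None
    return provider, step.model


registered = {"claude", "gemini"}
plan = resolve(Step("plan", provider="claude", model="opus"), "gemini", registered)
tasks = resolve(Step("tasks"), "gemini", registered)
```

Whatever resolve returns is also what should be recorded in the trace, so the log reflects what actually ran rather than what was requested.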

Can users use their own "superpowers" commands?

Yes, already — two ways (both shipped, both config-level, no visual editor):

1 · VS Code custom workflows (speckit.customWorkflows setting): define your own steps[], each pointing at your own slash command (e.g. /superpowers:brainstorm), plus custom command buttons per step. Validated in workflowManager.ts (names must match ^[a-z][a-z0-9-]*$; speckit/companion reserved).

2 · .specify/companion.yml node hooks + recipes: insert a before/after node (a command, a prompt, or a user-authored node file) around any node id, or reorder a command's nodes with a recipe. Caveat: it's AI-interpreted ("there is no engine") and recipes only change order in v1, not content.

Note: the thing called src/features/workflow-editor/ is a spec-document refiner (refine a section/line of spec.md), not a visual pipeline editor. A real visual workflow/recipe editor is a future item.
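The naming rule for custom workflows is simple enough to sketch. The pattern is the one quoted above; treating "speckit" and "companion" as two reserved literal names is an interpretation of the text, and validate_workflow_name is a hypothetical helper, not the function in workflowManager.ts.

```python
import re

NAME_PATTERN = re.compile(r"[a-z][a-z0-9-]*")
RESERVED_NAMES = {"speckit", "companion"}  # assumption: both are reserved literals


def validate_workflow_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name is acceptable."""
    errors = []
    # fullmatch anchors both ends, matching the intent of ^...$ in the source rule.
    if not NAME_PATTERN.fullmatch(name):
        errors.append("must start with a lowercase letter and use only a-z, 0-9, '-'")
    if name in RESERVED_NAMES:
        errors.append(f"'{name}' is reserved")
    return errors
```

For example, validate_workflow_name("superpowers-flow") returns an empty list, while "Superpowers" and "speckit" each return one error.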

How is clarify different from grill-me, and what do we need?

Clarify today = stock spec-kit /speckit.clarify: scans an ambiguity taxonomy, asks up to 5 one-at-a-time questions (multiple-choice with a recommended option), and writes a ## Clarifications session into spec.md. Companion only wraps it with the timing/capture parts. It's a non-advancing step (records a history finish, no status change).

"Grill-me" does not exist in the repo: there are zero references to it, and nothing labeled Wave-3 or W3·4 appears in the code. It is a vault-roadmap idea, not a built feature.

What we'd need to build grill-me (as a distinct, harder clarify): (1) a companion command that loops past 5 questions until the spec is complete ("keep going until no unknowns"); (2) each question carries a recommended default (clarify already does this — keep it); (3) it persists the Q&A + the decisions into .spec-context.json, not just spec.md (today the decisions[] field is a read-only passthrough that no capture script writes); (4) it runs as a node hook before plan. Bake in the ICE authoring checks (below) while you're there.
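The loop-and-capture behavior in points (1) to (3) can be sketched as below. Everything here is hypothetical: the grill function, the question dict shape (text, recommended, follow_ups), and the format of the clarifications and decisions entries are assumptions, and in the real feature the questions would come from the AI's turn rather than a prepared list.

```python
from typing import Callable


def grill(unknowns: list[dict], ask: Callable[[dict], str], context: dict) -> dict:
    """Ask until no unknowns remain, recording every answer in the context dict."""
    queue = list(unknowns)
    clarifications = context.setdefault("clarifications", [])
    decisions = context.setdefault("decisions", [])
    while queue:  # no fixed cap of 5 questions
        question = queue.pop(0)
        # An empty answer accepts the recommended default.
        answer = ask(question) or question["recommended"]
        clarifications.append({"question": question["text"], "answer": answer})
        decisions.append(f"{question['text']} -> {answer}")
        # An answer may open new unknowns, which is what keeps the loop going.
        queue.extend(question.get("follow_ups", {}).get(answer, []))
    return context


questions = [{"text": f"Q{i}?", "recommended": "yes"} for i in range(6)]
context = grill(questions, lambda q: "", {})
```

The returned dict is what a capture script would then persist into .spec-context.json, so the answers no longer live only in spec.md.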

What do we need to prevent context rot?

Today the extension does nothing about context rot — it's delegated to the underlying CLI. The prompt Companion builds is a .spec-context.json bookkeeping preamble + the slash command; it contains no fresh-context/context-window directives. Sub-agent parallelism exists only as optional prompt guidance in the implement command, host-gated ("if your host has a subagent tool… inline is the default").

The good part: the research says the winning defense is file-based state + fresh context, and our file-based trace is exactly that — the .spec-context.events.jsonl append-log is even engineered to be parallel-safe (workers append their own finish without contending). SLUMP's finding (a persistent spec file recovers 90% of lost faithfulness) is a direct endorsement of the trace.

What to add: (1) capture decisions/rationale deterministically so resume restores why, not just where; (2) ship true [P] fan-out (fresh context per task via subagent dispatch — ROADMAP step 8; the append-log foundation is ready); (3) keep the constitution/steering re-injected each step (already the model).
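For readers unfamiliar with why an append-log suits parallel workers, here is a minimal sketch of the pattern. It is not the repo's writer: append_event and the event fields are hypothetical, and the non-interleaving behavior relies on the assumption of a local POSIX filesystem, where a single small write to a descriptor opened with O_APPEND lands at the end of the file.

```python
import json
import os
import tempfile


def append_event(path: str, event: dict) -> None:
    """Append one JSON line using O_APPEND and a single write call."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)


# Usage: two workers each append their own finish; neither rewrites the file.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, ".spec-context.events.jsonl")
    append_event(log, {"task": "T001", "event": "finish"})
    append_event(log, {"task": "T002", "event": "finish"})
    with open(log, encoding="utf-8") as handle:
        events = [json.loads(row) for row in handle]
```

Because no worker reads, modifies, and rewrites shared state, there is nothing to lock, which is the property true [P] fan-out would build on.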

Do we need to be "more ICE"?

Adopt the mechanics, skip the rebrand. IDD/ICE is ~80% repositioning (its own author concedes "ICE is still a spec"). But three of its ideas are genuinely worth baking into our specify/clarify prompts:

(1) Separate directional constraints from binary failure conditions. Constraints ("<200ms", "99.99% uptime") stay human-readable in the spec; failure conditions (build fails, coverage <90%, secret in source, API contract change without a version bump) become the eval set the verifier runs — and the builder shouldn't see them ("can't teach a test it can't see"). This is the seed of our eval layer.

(2) The two-implementations goal test: if only one implementation can satisfy your "goal," it's a spec disguised as a goal — a cheap authoring check.

(3) Feed context progressively / code-first, don't dump one wall. We already stage via .spec-context.json; this just formalizes it.

You do not need to adopt IDD as a methodology or rename anything. These are three prompt-level checks in clarify/specify.
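Check (1) is the one with a data-model consequence, so a small sketch may help. All names here (FailureCondition, builder_prompt, verify, and the report keys) are hypothetical; the point is only that constraints travel in the builder's prompt while failure conditions stay with the verifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FailureCondition:
    name: str
    violated: Callable[[dict], bool]  # True when the report breaks the condition


# Directional constraints: human-readable, visible to the builder.
CONSTRAINTS = ["<200ms", "99.99% uptime"]

# Binary failure conditions: the eval set, held back from the builder.
FAILURE_CONDITIONS = [
    FailureCondition("build fails", lambda r: not r["build_ok"]),
    FailureCondition("coverage <90%", lambda r: r["coverage"] < 0.90),
    FailureCondition("secret in source", lambda r: r["secrets_found"] > 0),
]


def builder_prompt(goal: str) -> str:
    """Only the goal and the constraints reach the builder."""
    return goal + "\nConstraints: " + "; ".join(CONSTRAINTS)


def verify(report: dict) -> list[str]:
    """Return the names of every violated failure condition."""
    return [c.name for c in FAILURE_CONDITIONS if c.violated(report)]


failures = verify({"build_ok": True, "coverage": 0.84, "secrets_found": 0})
```

With that report, failures contains only "coverage <90%", and nothing in builder_prompt reveals that coverage was being checked.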

What else should we track in .spec-context.json?

Today it tracks the lifecycle spine (workflow, specName, branch, currentStep, status, history[]) + skill-authored passthroughs (decisions[], concerns[], approach, task_summaries, step_summaries, reviewComments[], last_action, livingSpecs). The by field on history is only a coarse role (extension|user|cli|ai|derive).

Missing — and each maps to a research recommendation:

provider + model per step — which provider/model actually ran it (today: unknown; needed for the compute-allocator + governance story).
tokens / cost per step — for the budget + auto-pause feature (Uber-budget problem).
verification — { passed: bool, checks: [{name, status}] } — so mark-complete can gate on a real signal, not the AI's say-so.
evals — acceptance-criteria scores over re-runs (the trend, not a pass/fail); productizes the bench behavioral-judge.
clarifications[] / real decisions[] — captured deterministically by a grill-me/clarify command, not passthrough.
review — the second-model verdict (approved/revise + findings) from the verifier recipe.
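Taken together, the missing fields could be typed roughly as follows. This is a proposal sketch, not the schema in src/core/types/specContext.ts: the class names, the status literals, the tokens and cost_usd field names, and the summarize helper are all assumptions.

```python
from typing import Literal, TypedDict


class Check(TypedDict):
    name: str
    status: Literal["pass", "fail", "skipped"]  # hypothetical status values


class Verification(TypedDict):
    passed: bool
    checks: list[Check]


class Review(TypedDict):
    verdict: Literal["approved", "revise"]
    findings: list[str]


class StepRun(TypedDict, total=False):
    step: str
    provider: str    # which provider actually ran the step
    model: str       # which model actually ran the step
    tokens: int
    cost_usd: float


def summarize(checks: list[Check]) -> Verification:
    """Derive passed from the checks; an empty list never counts as a pass."""
    passed = bool(checks) and all(c["status"] == "pass" for c in checks)
    return {"passed": passed, "checks": checks}


verification = summarize([
    {"name": "tests", "status": "pass"},
    {"name": "lint", "status": "fail"},
])
```

Deriving passed from the individual checks, rather than letting a writer set it freely, keeps the gate signal from drifting away from the evidence behind it.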

State of play — the scorecard

Legend: strong = validated · watch = right, under-exploited · partial = started, needs finishing · gap = missing

Dimension	Where we are (real code)	Status
Cross-provider breadth	8 dispatch targets; one-way; #1 market demand	strong
File-based `.spec-context.json` trace	Append-only `history[]`, parallel-safe events log	strong
Riding Spec Kit additively	Never overwrites core; assembles command bodies	strong
Fast-path / right-sizing	classify + switch on ≤5 files/≤10 tasks	strong
Custom commands / workflows	Shipped (settings JSON + `companion.yml`); no GUI editor	strong
Brownfield / drift	`drift`, `coverage`, `adopt` commands exist	partial
Auto / unattended mode	`auto` conductor command exists	partial
Composable recipes (add/drop content)	Node assembly is byte-identical refactor only	partial
Per-node provider / model selection	One global provider; no model field	gap
mark-complete = verified gate	Status flip only (guarded, but no test signal)	gap
Clarify → grill-me	Stock ≤5-Q clarify; grill-me not in repo	gap
Adversarial cross-model review	Not built; blocked on per-node provider + read-back	gap
Eval layer / verification capture	Dev-only bench judge; nothing written to trace	gap
Model/token/cost tracking + governance	Absent from schema	gap

✅ What we're doing right (with the real mechanism)

Cross-provider breadth

validated · #1 demand

How today: 8 provider classes in src/ai-providers/ behind one IAIProvider seam; dispatch is one-way command text (terminal / IDE chat / Claude panel). One global provider via speckit.aiProvider.

Lean in: lead all messaging with it, keep widening the registry. But note the ceiling — one provider at a time — which the story map fixes so this asset can power adversarial review.

The file-based, append-only trace

SLUMP: recovers 90% of lost faithfulness

How today: history[] is the single on-disk source of truth (start/complete pairs, ISO timestamps); per-task finishes stream to a parallel-safe .spec-context.events.jsonl; stepHistory/timing derived in-memory. Writers are guarded (append-only; last entry's step == currentStep).

Steal: add decision/rationale capture and the model/cost/verification fields (story map Epics A, B, and E) so the trace becomes both the context-durability surface and the governance surface.
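The writer guard described above can be illustrated with a short sketch. It is not the repo's writer: append_history and the entry fields kind and at are hypothetical, and reading the guard as "a new entry's step must equal currentStep" is an interpretation of the one-line description.

```python
def append_history(context: dict, entry: dict) -> dict:
    """Return a new context with the entry appended; earlier entries are never touched."""
    if entry["step"] != context["currentStep"]:
        raise ValueError(
            f"entry step {entry['step']!r} does not match currentStep {context['currentStep']!r}"
        )
    return {**context, "history": [*context["history"], entry]}


context = {"currentStep": "plan", "history": []}
context = append_history(
    context,
    {"step": "plan", "kind": "start", "at": "2026-06-30T09:00:00Z", "by": "extension"},
)
```

Any new per-step fields (provider, model, cost, verification) would ride along on these entries, so the same guard would cover them.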

Fast-path / right-sizing

ETH: ceremony can hurt

How today: a classify command emits size=small|normal|oversized (thresholds ≤5 files/≤10 tasks in presets/_parts/sizing.md); the workflow route switch folds small (drops gate pauses, still runs plan+tasks) and warns-then-runs-full on oversized.

Fix a real bug: two parallel fast-path impls disagree on the enum — classify emits small, but the specify node's persist-size writes simple. Unify the literal (story map T-01).
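One way to picture both the sizing rule and the fix is the sketch below. The small thresholds come from the text; the oversized thresholds, the choice to require both small limits at once, and the names classify and normalize_size are assumptions, since the actual rule lives in presets/_parts/sizing.md.

```python
SMALL_MAX_FILES = 5
SMALL_MAX_TASKS = 10
OVERSIZED_MIN_FILES = 30  # hypothetical threshold, not from the repo
OVERSIZED_MIN_TASKS = 40  # hypothetical threshold, not from the repo

SIZES = {"small", "normal", "oversized"}
ALIASES = {"simple": "small"}  # the stray literal written by persist-size


def classify(files: int, tasks: int) -> str:
    """Size a change; assumes 'small' needs both limits to hold."""
    if files <= SMALL_MAX_FILES and tasks <= SMALL_MAX_TASKS:
        return "small"
    if files >= OVERSIZED_MIN_FILES or tasks >= OVERSIZED_MIN_TASKS:
        return "oversized"
    return "normal"


def normalize_size(value: str) -> str:
    """Map legacy literals onto the single enum and reject anything unknown."""
    size = ALIASES.get(value, value)
    if size not in SIZES:
        raise ValueError(f"unknown size: {value!r}")
    return size
```

A normalizing reader like normalize_size is only a stopgap for traces already written with "simple"; the real fix is for both writers to emit the same literal.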

Custom commands + brownfield/auto already exist

further along than expected

How today: users add their own steps/commands via speckit.customWorkflows and .specify/companion.yml hooks; drift, coverage, adopt, and auto commands already ship.

Leverage, don't rebuild: the brownfield story (below) is mostly surfacing + finishing drift/coverage/adopt, not greenfield work.

⚠️ What we're missing (ranked, with how to build it)

0 · The unlock — per-node provider + model selection

prerequisite

How today: one global provider (speckit.aiProvider, machine scope), resolved once; no model field anywhere; workflows can only gate by provider, not select per step.

Build: add an optional provider/model on a workflow step (and custom-workflow step schema); resolve the right provider per dispatch instead of the global one; record what actually ran into the trace. This gates #1 (review), model routing, and cost control below.

1 · Adversarial cross-model plan review

highest leverage · blocked on #0

How today: not built. And a subtlety the code forces: dispatch is one-way — Companion can't send a plan to model B and read the verdict back to loop. So it can't orchestrate the review itself.

Build (honest path): ship it as host-gated prompt guidance in the plan command (like the existing subagent guidance) — "if your host has a second model/subagent, have it review this plan as a skeptic and return approved/revise" — and, where per-node providers land (#0), let a Claude/subagent host actually run model B. Persist the verdict to a new review field. On one-shot hosts, degrade gracefully.

2 · Verified completion (gate mark-complete on a check)

closes finger-guns

How today: write-context.py --mark-complete flips implemented → completed, guarded so it can't ship incomplete work — but the "check" is task-checkboxes, not tests. No verification result is stored.

Build: capture a verification object (tests/lint/type-check pass) into the trace at implement; have mark-complete require it (or a second-model verdict). Surface the AI's "key judgment calls" for confirmation.
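The proposed gate reduces to a small predicate. This sketch is not write-context.py: can_mark_complete is a hypothetical function, and the verification and review fields it reads are the proposed additions, not fields that exist in the trace today.

```python
def can_mark_complete(context: dict) -> tuple[bool, str]:
    """Allow implemented -> completed only with a passing check or an approved review."""
    if context.get("status") != "implemented":
        return False, "status is not 'implemented'"
    verification = context.get("verification") or {}
    if verification.get("passed") is True:
        return True, "verification passed"
    review = context.get("review") or {}
    if review.get("verdict") == "approved":
        return True, "second-model review approved"
    return False, "no passing verification or approved review"


allowed, reason = can_mark_complete({
    "status": "implemented",
    "verification": {"passed": True, "checks": [{"name": "tests", "status": "pass"}]},
})
```

Returning the reason alongside the decision gives the viewer something concrete to show when the gate refuses.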

3 · Grill-me (a real clarify that loops + captures)

most-validated pattern

How today: stock /speckit.clarify, ≤5 questions, written to spec.md only; decisions[] in the trace is passthrough (nothing writes it).

Build: a companion clarify/grill command that loops until complete, keeps recommended defaults, bakes in the ICE authoring checks, and writes clarifications + decisions into .spec-context.json so resume and the viewer can use them. Wire it as a before: plan node hook.

4 · Optional per-spec eval layer

the cross-camp mandate

How today: the bench behavioral-judge grades the pipeline in dev; nothing is written back to a spec.

Build: derive deterministic checks from acceptance criteria + a blind different-provider judge (needs #0); record scores into an evals[] field and show the trend across re-runs. Keep it optional/local/on-demand.

5 · Finish brownfield: surface drift/coverage + reverse-engineer

partly built

How today: drift, coverage, adopt commands exist but aren't a first-class GUI flow; Spec Kit's own #1 issue is "can't edit existing specs."

Build: surface drift/coverage as a pre-implement badge/gate in the viewer; make inline spec editing first-class (the spec-viewer refiner is a start); grow adopt into codebase→spec reverse-engineering.

6 · Cost/token budget + governance surface

emerging buying criterion

How today: no model/token/cost anywhere; the trace is a partial audit log.

Build (needs #0): capture model+tokens+cost per step; add a per-spec budget with auto-pause in auto mode; render the trace as an audit view. The in-editor GUI is the advantage here.
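The budget mechanic itself is tiny once per-step cost is captured. A minimal sketch, assuming cost arrives in US dollars and that auto mode checks the return value after every step; Budget, record, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, cost_usd: float) -> bool:
        """Add one step's cost; return True when auto mode should pause."""
        self.spent_usd += cost_usd
        return self.spent_usd >= self.limit_usd


budget = Budget(limit_usd=1.00)
pause_after_plan = budget.record(0.50)       # under budget, keep going
pause_after_implement = budget.record(0.75)  # over budget, pause for the user
```

The hard part is not this arithmetic but getting trustworthy token and cost numbers out of each provider, which is why the feature depends on #0.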

7 · Real composable recipes (finish the node system)

refactor done, composition not

How today: node assembly is byte-identical; recipes change order only, "there is no engine."

Build: move to section-level composition so a recipe can genuinely add (grill-me, verifier) or drop a node. This is what turns #1/#3 into insertable recipes instead of forks — the "recipes not toggles" vision.

🗺️ The story map — pull tasks from here

Grouped into epics, roughly in dependency order. S small · M medium · L large.

Epic A · Unlock per-node provider & model (gates everything below)

A-01 · M · Add optional provider + model to a workflow step + custom-workflow step schema (default = global).

A-02 · M · Resolve the per-step provider/model at dispatch instead of the single global one; fall back cleanly.

A-03 · S · Record provider + model actually used into each history[] entry (or step_summaries).

A-04 · M · Add a "model routing" recipe: Opus/strong for plan+review, cheap model for mechanical steps (keyed off classify).

Epic B · Verify "done" + evals

B-01 · S · Add a verification field to .spec-context.json: {passed, checks:[{name,status}]}.

B-02 · M · Capture tests/lint/type-check results into it at implement (via capture-implement).

B-03 · M · Gate mark-complete on verification.passed (or a second-model verdict); surface "judgment calls" for confirmation.

B-04 · L · Add an evals[] field; productize the bench behavioral-judge as an optional per-spec, blind different-provider judge over acceptance criteria; show the trend.

Epic C · Grill-me clarify

C-01 · M · New companion clarify/grill command: loops past 5 Qs until complete; multi-choice + recommended defaults.

C-02 · S · Persist Q&A + decisions into .spec-context.json (clarifications[], deterministic decisions[]).

C-03 · S · Bake in ICE authoring checks (constraint-vs-failure split; two-implementations goal test).

C-04 · S · Wire as a before: plan node hook; surface the clarify state in the viewer.

Epic D · Adversarial review (needs A)

D-01 · M · Add host-gated "second-model skeptic review" guidance to the plan command (approved/revise + findings).

D-02 · M · On per-node-provider hosts, actually dispatch the review to a different provider; persist a review field.

D-03 · M · Loop until approved (host-capable) or record the single verdict (one-shot).

Epic E · Context durability + governance

E-01 · S · Capture decision/rationale deterministically so resume restores the "why".

E-02 · M · Capture tokens/cost per step; add a per-spec budget + auto-pause in auto mode.

E-03 · L · True [P] fan-out (fresh context per task; ROADMAP step 8). The events log is already parallel-safe.

E-04 · M · Render the trace as an audit view (who/what/when/model/cost).

Epic F · Finish brownfield

F-01 · M · Surface drift/coverage as a pre-implement badge/gate in the viewer.

F-02 · M · Make inline spec editing first-class (build on the spec-viewer refiner).

F-03 · L · Grow adopt into codebase → structured-spec reverse-engineering.

Epic G · Finish the node system (recipes)

G-01 · L · Move node assembly from byte-identical to section-level composition (add/drop content).

G-02 · M · Turn grill-me (C) and verifier (D) into insertable recipes, not forks.

T-01 · S · Housekeeping: unify the fast-path enum (simple vs small) across the two impls.

🚫 Risks & what to avoid

Becoming a proprietary superset of Spec Kit — the vendor-lock-in analysis is explicit: the moment we add a format the user can't leave, we become the lock-in layer. Keep specs + trace as plain portable files. This is our moat.
Over-ceremony — the data says context files can reduce success; never force the full pipeline on small changes. Fast-path stays default.
Assuming we can orchestrate cross-model loops — we can't (one-way dispatch). Design review/eval to run inside the AI's turn (host-gated), not as an extension read-back loop.
Marketing with borrowed hype numbers — lean on the traceability artifact and honest, measured claims.
Letting the agent grade its own work — verification must come from a tool or a different model, designed in from the start.

The one-line strategy

Be the cross-provider layer that keeps the AI honest. Unlock per-node provider/model (Epic A), then spend it on the two features only we can ship — a second-model plan review and a verified "done" — captured in the trace we already own. Finish brownfield (mostly surfacing what exists) and turn grill-me + verifier into insertable recipes. That's the whole game; the story map is the order.

Grounding: Trends & Demand · Medium Debrief · Video Trends. Architecture facts pulled from src/ai-providers/, src/core/types/specContext.ts, speckit-extension/workflows/, speckit-extension/nodes/, and docs/.

This revision grounds every claim in the current repo (explored 2026-06-30). "Shipped vs planned" reflects the code, not the vault roadmap — e.g. grill-me and per-node providers are not in the codebase today; drift/coverage/adopt/auto and custom workflows are.