2026-06-30 · SpecKit Companion Research 44 videos

Trends → SpecKit Companion

What 44 talks on spec-driven development say about where the field is going — and what to build next.

The headline: the field has stopped arguing about whether to write specs and started arguing about how to keep the AI honest. Everyone has a planning step now, so the fight moved to three harder questions — did we actually agree on what to build (interrogation), can we trust a second opinion checked it (adversarial review), and how do we know it's really done (verification & evals).

Those three line up almost exactly with Companion's existing Wave 3 roadmap. The corpus reads like a stack of evidence for the recipes already planned — plus one reframing worth the whole exercise: cross-provider isn't a convenience, it's the substrate for the two highest-leverage features.

The scannable per-video map lives in the vault: the playlist index (44 entries, 16 with deep notes). This page is the "so what."

→ There's more: this video report is now part of a set.

• The unified Trends & Demand report — this + the audience side (GitHub issues, YouTube comments, ~35 Medium articles) in one, with expand/collapse.
• The Medium Articles, Debriefed — the full "is SDD obsolete?" article discourse, per-article.
• Part 2 — What the Community Wants — the standalone audience report. Refreshed weekly by the demand-radar skill.

The big trends

reinforces roadmap new idea gap to close competitive edge

1 · The AI should interrogate you before it builds

reinforces

Instead of writing the perfect prompt, you let the model interview you — multiple-choice questions, one batch at a time, each with a recommended answer, looping until the spec has no holes. The questions routinely catch real mismatches before any code exists.

Seen in: the single most common pattern in the corpus — Gui Ferreira's whole talk, Matt Pocock's "grill me," Spec Kit's /clarify, OpenSpec "explore," Karpathy "seed," Jellypod write-plan, AI LABS, Kiro, BMAD.

Companion move: this is the planned grill-me recipe (W3·4). Copy the mechanics everyone converges on — a recommended default per question, "keep going until complete" as the exit, fold answers back into spec.md, and persist the Q&A itself as a trace artifact. Ships as embedded prompt text; maps onto a webview wizard.

2 · A different model should review the plan

reinforcesedge

A model reviewing its own plan is an echo chamber. Hand the locked plan to a second, different model, tell it to be a skeptic — find race conditions, security holes, wrong assumptions — and return a verdict (approved / revise), looping until approved.

Seen in: grill-me + Codex (caught a plain-text-beacon bug a JSON check would reject), OpenSpec adversarial authoring, Waldemar's "separate validator agent."

Companion move: the verifier recipe (W3·2), and the single most Companion-native idea in the corpus — because Companion already has cross-provider plumbing. Everyone else bolts on a second model; Companion routes the locked plan to a second configured provider, captures verdict + findings into the trace, and loops. Multi-provider becomes a quality claim, not just flexibility.

"A model reviewing its own plan is an echo chamber." — The Gray Cat

3 · "Done" must be verified, not self-reported

gap

The most repeated quality warning in the corpus: don't trust the model when it says it finished. It rationalizes a premature "done," teaches to the test, or quietly skips running the tests.

Seen in: Kent Beck's "finger guns" ("11 of 19 tests pass" reported as done), Waldemar's "premature victory" + "never let the agent judge itself," Anthropic's verify-loop frontier, the honest A/B reviews on "teaching to the test."

Companion move: today mark-complete is a status promotion. The corpus says gate it on a signal that isn't the implementer's self-assessment — a tool-returned pass, a second-model verdict, or human confirmation of the AI's "key judgment calls." Surface the choices the AI silently made at each step. This is the bridge from the verifier recipe to the terminal node.

4 · Evals are arriving as their own layer

new

The corpus treats evals as distinct from tests — fuzzy scores you watch as a trend, not a pass/fail gate. Your specific question gets its own section below.

Seen in: the Evalite-vs-Langfuse deep dive, the GSD 9-category scorecard, GSD's ui-review scored audit, the Farley Index, the A/B framework reviews. → jump to the eval answer.

5 · The spec is the living memory

reinforces

Across frameworks, the durable artifact isn't the code — it's a persistent, AI-readable record of intent and decisions every future session and every test reads from.

Seen in: the Karpathy project/roadmap/state trio, Anthropic's managed agents (append-only log kept independent of the executor — a near-exact mirror of Companion's trace), OpenSpec's specs/ folder, Waldemar's "state" file.

Companion move: strong validation of the resumable, provider-agnostic trace. The one steal: capture decisions and rationale (the why), not just step/status, and re-inject the trace into the prompt on resume so the next step inherits accumulated reasoning.

6 · Brownfield is where the field is going

reinforcesgap

Everyone now says greenfield demos are the easy 1%; the real work is adopting SDD into an existing codebase, and tools are racing to support it.

Seen in: Spec Kit init --here, Den Delimarsky's brownfield worktree demo, Specvia's codebase import, GSD map-codebase/forensics, Kiro's "dozens of spec files in legacy apps."

Companion move: this is Wave 4 (living specs). Two affordances recur — an /analyze-style consistency gate (cross-check spec vs plan vs tasks, severity-rated, before implement) surfaced as a sidebar badge, and a codebase-import that reverse-engineers a repo into a structured spec.

7 · Fork the spec into parallel implementations

reinforces

Because the spec is detached from implementation, you can run several implementations of one spec in parallel git worktrees, vary one decision, and pick by comparison.

Seen in: Den Delimarsky's "multiple implementations" (the canonical demo), Superpowers/GSD worktrees for per-task isolation, Waldemar's four parallel sub-agents.

Companion move: Wave 5 fan-out. Everyone who demos it admits the worktree choreography is clunky — a "compare implementations" sidebar that spins N branches off one spec and shows them side-by-side turns a manual dance into a one-click affordance. Same mechanism supports parallel implement task-waves.

8 · Lean by default; escalate only when it pays

reinforcesnew

A surprising number of credible voices conclude the heavy frameworks rarely beat baseline Claude Code on ordinary tasks — and the time saved staying simple lets you out-iterate the "smarter" pipeline.

Seen in: "less is more... 99% of use cases just use the baseline," "his strategy is more about subtraction than addition," Erik Hanchett's token levers, BMAD's quick-spec fast path.

Companion move: validates fast-path-as-default and the classify/switch node. Add model/effort routing keyed off the same small-vs-oversized signal, and a per-spec token/cost budget with auto-pause — let the user be the "compute allocator."

9 · Make the plan something a human will actually read

newedge

As plans grow to thousands of lines, Markdown goes unread. Anthropic's Thariq Shihipar shifted to authoring specs as rich, scrollable HTML artifacts with throwaway micro-UIs to edit sections. Others render design choices visually instead of in prose.

Seen in: Thariq's HTML-as-spec talk, Superpowers' "visual companion" (live aesthetic options to pick), GSD's sketch.

Companion move: Companion already owns a webview, so it's uniquely placed. A rich interactive "HTML plan" mode with section-level inline editors that round-trip back into spec.md is a near-direct lift. Lighter version: a visual design-option picker at the plan gate.

10 · Smaller stealable mechanics

new

Glossary + traceability — a per-spec glossary.md read each step catches agent-invented jargon; Specvia's requirement→task→code links flag every dependent on change. Negative constraints + learnings file — a "what NOT to do" block closes the gap positive specs leave open. EARS requirements that map 1:1 to tests. Property-based / spec-derived tests as a tasks-step artifact. A "next action" advisor (BMAD's bmad-help) — the status/resume panel can make this explicit.

The eval question, answered

Should Companion add evals? Yes — but as a trend-watching layer, not a pass/fail gate. An eval isn't a unit test. It scores fuzzy qualities (did the output follow the spec's intent, avoid a regression) and the scores never settle — the speaker's own LLM-judged scores drifted 94 / 96 / 100 with no code change.

"A single score can lie to you. A trend is much harder to fool."

A unit test

Deterministic. Green or red. Free to run. Pinned forever.

An eval

Fuzzy 0–1 score with a reason. Drifts run to run. Costs a model call each time. Read as a trend.

The mechanics that transfer cleanly

An eval = cases + scorers. For Companion, the spec's acceptance criteria are the cases — they already exist as the contract.

Two scorer kinds, both in reach: deterministic code scorers (structural checks, and a tool-call assertion — did the agent call the right command with the right params? That maps directly onto verifying Companion's own pipeline dispatch); and an LLM-as-judge for qualitative criteria, which must be a different model than the one tested and blind to who wrote the answer — exactly what Companion's cross-provider support enables natively.

Store history, diff against a baseline. The value is the delta across re-runs, surfaced as a trend, not a green check.

Already half-built: the project's bench harness already contains an LLM-as-judge (behavioral-judge) and a scorecard — but it's a developer tool for grading the pipeline, not a user feature. The opportunity is to productize that idea downward into a per-spec eval layer.

#	Opportunity	Why it ranks here	Roadmap tie
1	Adversarial cross-model plan review	Uniquely enabled by cross-provider plumbing; turns multi-provider into a quality claim nobody else can make.	Wave 3 · W3·2
2	Clarify-first interrogation gate	The most-validated pattern across all 44 videos.	Wave 3 · W3·4
3	Verified completion, not self-reported	Closes the "finger guns" gap directly.	verifier → mark-complete
4	Optional per-spec eval layer	Answers the eval question; reuses cross-provider + the existing bench judge.	new (pairs w/ 1, 3)
5	Brownfield `/analyze` gate + codebase import	Where the field is heading; Companion's biggest gap.	Wave 4 · living specs
6	Rich interactive HTML plan view	Plays to the webview Companion already owns.	new
7	"Compare implementations" sidebar	Turns a clunky manual worktree dance into a GUI affordance.	Wave 5 · fan-out
8	Model/effort routing + token-budget pause	Cheap reliability/cost win off the existing classify node.	classify + auto-mode
9	Decision/rationale capture + trace re-injection	Small change, compounding value.	core trace
10	Glossary, negative-constraints, spec-derived tests, traceability	Bundle of small mechanics worth a sweep.	assorted

Wave 3 priorities are right

Interrogation and adversarial verification are the two most common patterns across the corpus — both already on the roadmap.

Lean-by-default was correct

The most credible voices are actively warning against framework ceremony. "Present but latent" is the posture the field is converging toward.

Cross-provider is underused

Framed today as flexibility. It's actually the substrate for the two highest-leverage features. That reframing may be the most valuable takeaway here.

The near-term move

The honest gap is brownfield/living specs and verified completion — where the field moves fastest and Companion has the most ground to cover. Both are already named in the roadmap (Wave 4, mark-complete). The corpus says treat them as near-term, not "later" — and lead with the one-two punch only Companion can ship cleanly: a clarify gate up front, and a second-model review before "done."

Continue to Part 2 — What the Community Wants · full per-video map: the playlist index.

1 · The AI should interrogate you before it builds

2 · A different model should review the plan

3 · "Done" must be verified, not self-reported

4 · Evals are arriving as their own layer

5 · The spec is the living memory

6 · Brownfield is where the field is going

7 · Fork the spec into parallel implementations

8 · Lean by default; escalate only when it pays

9 · Make the plan something a human will actually read

10 · Smaller stealable mechanics

The mechanics that transfer cleanly

Recommended posture — optional, additive, on a completed spec

Wave 3 priorities are right

Lean-by-default was correct

Cross-provider is underused

The near-term move