What 44 talks on spec-driven development say about where the field is going — and what to build next.
The headline: the field has stopped arguing about whether to write specs and started arguing about how to keep the AI honest. Everyone has a planning step now, so the fight moved to three harder questions — did we actually agree on what to build (interrogation), can we trust a second opinion checked it (adversarial review), and how do we know it's really done (verification & evals).
Those three line up almost exactly with Companion's existing Wave 3 roadmap. The corpus reads like a stack of evidence for the recipes already planned — plus one reframing worth the whole exercise: cross-provider isn't a convenience, it's the substrate for the two highest-leverage features.
The scannable per-video map lives in the vault: the playlist index (44 entries, 16 with deep notes). This page is the "so what."
→ There's more: this video report is now part of a set.
• The unified Trends & Demand report — this + the audience side (GitHub issues, YouTube comments, ~35 Medium articles) in one, with expand/collapse.
• The Medium Articles, Debriefed — the full "is SDD obsolete?" article discourse, per-article.
• Part 2 — What the Community Wants — the standalone audience report. Refreshed weekly by the demand-radar skill.
Instead of writing the perfect prompt, you let the model interview you — multiple-choice questions, one batch at a time, each with a recommended answer, looping until the spec has no holes. The questions routinely catch real mismatches before any code exists.
Seen in: the single most common pattern in the corpus — Gui Ferreira's whole talk, Matt Pocock's "grill me," Spec Kit's /clarify, OpenSpec "explore," Karpathy "seed," Jellypod write-plan, AI LABS, Kiro, BMAD.
Companion move: this is the planned grill-me recipe (W3·4). Copy the mechanics everyone converges on — a recommended default per question, "keep going until complete" as the exit, fold answers back into spec.md, and persist the Q&A itself as a trace artifact. Ships as embedded prompt text; maps onto a webview wizard.
A model reviewing its own plan is an echo chamber. Hand the locked plan to a second, different model, tell it to be a skeptic — find race conditions, security holes, wrong assumptions — and return a verdict (approved / revise), looping until approved.
Seen in: grill-me + Codex (caught a plain-text-beacon bug a JSON check would reject), OpenSpec adversarial authoring, Waldemar's "separate validator agent."
Companion move: the verifier recipe (W3·2), and the single most Companion-native idea in the corpus — because Companion already has cross-provider plumbing. Everyone else bolts on a second model; Companion routes the locked plan to a second configured provider, captures verdict + findings into the trace, and loops. Multi-provider becomes a quality claim, not just flexibility.
"A model reviewing its own plan is an echo chamber." — The Gray Cat
The most repeated quality warning in the corpus: don't trust the model when it says it finished. It rationalizes a premature "done," teaches to the test, or quietly skips running the tests.
Seen in: Kent Beck's "finger guns" ("11 of 19 tests pass" reported as done), Waldemar's "premature victory" + "never let the agent judge itself," Anthropic's verify-loop frontier, the honest A/B reviews on "teaching to the test."
Companion move: today mark-complete is a status promotion. The corpus says gate it on a signal that isn't the implementer's self-assessment — a tool-returned pass, a second-model verdict, or human confirmation of the AI's "key judgment calls." Surface the choices the AI silently made at each step. This is the bridge from the verifier recipe to the terminal node.
The corpus treats evals as distinct from tests — fuzzy scores you watch as a trend, not a pass/fail gate. Your specific question gets its own section below.
Seen in: the Evalite-vs-Langfuse deep dive, the GSD 9-category scorecard, GSD's ui-review scored audit, the Farley Index, the A/B framework reviews. → jump to the eval answer.
Across frameworks, the durable artifact isn't the code — it's a persistent, AI-readable record of intent and decisions every future session and every test reads from.
Seen in: the Karpathy project/roadmap/state trio, Anthropic's managed agents (append-only log kept independent of the executor — a near-exact mirror of Companion's trace), OpenSpec's specs/ folder, Waldemar's "state" file.
Companion move: strong validation of the resumable, provider-agnostic trace. The one steal: capture decisions and rationale (the why), not just step/status, and re-inject the trace into the prompt on resume so the next step inherits accumulated reasoning.
Everyone now says greenfield demos are the easy 1%; the real work is adopting SDD into an existing codebase, and tools are racing to support it.
Seen in: Spec Kit init --here, Den Delimarsky's brownfield worktree demo, Specvia's codebase import, GSD map-codebase/forensics, Kiro's "dozens of spec files in legacy apps."
Companion move: this is Wave 4 (living specs). Two affordances recur — an /analyze-style consistency gate (cross-check spec vs plan vs tasks, severity-rated, before implement) surfaced as a sidebar badge, and a codebase-import that reverse-engineers a repo into a structured spec.
Because the spec is detached from implementation, you can run several implementations of one spec in parallel git worktrees, vary one decision, and pick by comparison.
Seen in: Den Delimarsky's "multiple implementations" (the canonical demo), Superpowers/GSD worktrees for per-task isolation, Waldemar's four parallel sub-agents.
Companion move: Wave 5 fan-out. Everyone who demos it admits the worktree choreography is clunky — a "compare implementations" sidebar that spins N branches off one spec and shows them side-by-side turns a manual dance into a one-click affordance. Same mechanism supports parallel implement task-waves.
A surprising number of credible voices conclude the heavy frameworks rarely beat baseline Claude Code on ordinary tasks — and the time saved staying simple lets you out-iterate the "smarter" pipeline.
Seen in: "less is more... 99% of use cases just use the baseline," "his strategy is more about subtraction than addition," Erik Hanchett's token levers, BMAD's quick-spec fast path.
Companion move: validates fast-path-as-default and the classify/switch node. Add model/effort routing keyed off the same small-vs-oversized signal, and a per-spec token/cost budget with auto-pause — let the user be the "compute allocator."
As plans grow to thousands of lines, Markdown goes unread. Anthropic's Thariq Shihipar shifted to authoring specs as rich, scrollable HTML artifacts with throwaway micro-UIs to edit sections. Others render design choices visually instead of in prose.
Seen in: Thariq's HTML-as-spec talk, Superpowers' "visual companion" (live aesthetic options to pick), GSD's sketch.
Companion move: Companion already owns a webview, so it's uniquely placed. A rich interactive "HTML plan" mode with section-level inline editors that round-trip back into spec.md is a near-direct lift. Lighter version: a visual design-option picker at the plan gate.
Glossary + traceability — a per-spec glossary.md read each step catches agent-invented jargon; Specvia's requirement→task→code links flag every dependent on change. Negative constraints + learnings file — a "what NOT to do" block closes the gap positive specs leave open. EARS requirements that map 1:1 to tests. Property-based / spec-derived tests as a tasks-step artifact. A "next action" advisor (BMAD's bmad-help) — the status/resume panel can make this explicit.
Should Companion add evals? Yes — but as a trend-watching layer, not a pass/fail gate. An eval isn't a unit test. It scores fuzzy qualities (did the output follow the spec's intent, avoid a regression) and the scores never settle — the speaker's own LLM-judged scores drifted 94 / 96 / 100 with no code change.
"A single score can lie to you. A trend is much harder to fool."
Deterministic. Green or red. Free to run. Pinned forever.
Fuzzy 0–1 score with a reason. Drifts run to run. Costs a model call each time. Read as a trend.
An eval = cases + scorers. For Companion, the spec's acceptance criteria are the cases — they already exist as the contract.
Two scorer kinds, both in reach: deterministic code scorers (structural checks, and a tool-call assertion — did the agent call the right command with the right params? That maps directly onto verifying Companion's own pipeline dispatch); and an LLM-as-judge for qualitative criteria, which must be a different model than the one tested and blind to who wrote the answer — exactly what Companion's cross-provider support enables natively.
Store history, diff against a baseline. The value is the delta across re-runs, surfaced as a trend, not a green check.
behavioral-judge) and a scorecard — but it's a developer tool for grading the pipeline, not a user feature. The opportunity is to productize that idea downward into a per-spec eval layer..spec-context.json per run and show a trend across re-runs in the viewer.Keeps the "present but latent" philosophy (does nothing until you opt in), reuses cross-provider, and sidesteps the trap every reviewer warned about — turning a fuzzy score into a false gate. Cost reality: every case re-calls the tested model, and a self-hosted stack (Langfuse) runs ~$300/mo always-on while the local option (Evalite) is free — so v1 should be local and on-demand, not an always-on service.
By leverage (impact × fit with Companion's existing strengths).
| # | Opportunity | Why it ranks here | Roadmap tie |
|---|---|---|---|
| 1 | Adversarial cross-model plan review | Uniquely enabled by cross-provider plumbing; turns multi-provider into a quality claim nobody else can make. | Wave 3 · W3·2 |
| 2 | Clarify-first interrogation gate | The most-validated pattern across all 44 videos. | Wave 3 · W3·4 |
| 3 | Verified completion, not self-reported | Closes the "finger guns" gap directly. | verifier → mark-complete |
| 4 | Optional per-spec eval layer | Answers the eval question; reuses cross-provider + the existing bench judge. | new (pairs w/ 1, 3) |
| 5 | Brownfield /analyze gate + codebase import | Where the field is heading; Companion's biggest gap. | Wave 4 · living specs |
| 6 | Rich interactive HTML plan view | Plays to the webview Companion already owns. | new |
| 7 | "Compare implementations" sidebar | Turns a clunky manual worktree dance into a GUI affordance. | Wave 5 · fan-out |
| 8 | Model/effort routing + token-budget pause | Cheap reliability/cost win off the existing classify node. | classify + auto-mode |
| 9 | Decision/rationale capture + trace re-injection | Small change, compounding value. | core trace |
| 10 | Glossary, negative-constraints, spec-derived tests, traceability | Bundle of small mechanics worth a sweep. | assorted |
Interrogation and adversarial verification are the two most common patterns across the corpus — both already on the roadmap.
The most credible voices are actively warning against framework ceremony. "Present but latent" is the posture the field is converging toward.
Framed today as flexibility. It's actually the substrate for the two highest-leverage features. That reframing may be the most valuable takeaway here.
The honest gap is brownfield/living specs and verified completion — where the field moves fastest and Companion has the most ground to cover. Both are already named in the roadmap (Wave 4, mark-complete). The corpus says treat them as near-term, not "later" — and lead with the one-two punch only Companion can ship cleanly: a clarify gate up front, and a second-model review before "done."
Continue to Part 2 — What the Community Wants · full per-video map: the playlist index.