2026-06-30 · SpecKit Companion Medium debrief full text · 32 articles read

The Medium Articles, Debriefed

A deep read of the SDD article discourse, now from the full member text (paywall bridged) rather than previews.

Bottom line for decisions. The article world has split into a loud "is SDD already obsolete?" argument — but read in full, the "successor" (IDD/IDSD) is ~80% rebrand, 20% real insight. The genuine part is worth acting on; the disruption framing is not. Meanwhile the settled-infrastructure writers (backed by real surveys and benchmarks) point at the actual near-term battleground: governance, verification, cost, and context durability — exactly where SpecKit Companion already lives.

Three concrete takeaways: (1) the strongest critique names real contradictions in Spec-Kit's own command files that Companion could fix; (2) every camp converges on "the durable artifact is an executable check, not prose" — the mandate for an eval layer; (3) the data says don't force ceremony on small changes (context files can *hurt*), validating fast-path-as-default.

The decision summary (read this if nothing else)

Decision	What the articles say	Confidence
Build: an eval / acceptance-check layer	Every camp (facts, BDD, ICE, formal-verification, "70% problem") lands on the same primitive: verification is the durable artifact. ICE explicitly splits "binary failure-condition evals" from directional constraints.	High — cross-camp + benchmarks
Fix: audit Spec-Kit's command bodies	The sharpest critique cites verifiable contradictions in Spec-Kit: "make informed guesses," a "Maximum 3 [NEEDS CLARIFICATION]" cap, and three conflicting TDD orders across manifest/task-template/runner.	High if verified — file-level claims
Keep: fast-path-as-default	ETH Zurich: AGENTS.md context files reduced success in 5/8 settings at 20–23% higher cost. Spec-Kit = 8 files/1,300 lines for a date display. "SDD wins only when 3 factors are high."	High — controlled study
Position: context-durability + anti-lock-in	Context rot is the cross-cutting enemy (18-model Chroma study). File-based state + fresh context is the industry-standard defense. SLUMP: a persistent spec file recovered 90% of lost faithfulness.	High — named studies
Target gaps: memory, cost, crash-recovery, post-ship optimize	Papalini's 11-framework survey: 8/11 have no crash recovery; persistent memory and cost accounting are rare. None of the majors have an AutoResearch-style optimize-after-ship loop.	Medium — landscape mapping
Watch: governance is the next buying criterion	"Generation is solved; governance is the bottleneck." 48% of AI code carries vulns; "which tool won't torch my credits" is now a purchase driver (Uber burned its budget in ~4 months).	Medium-High — surveys + vendor data

The debate, resolved: is SDD obsolete?

Honest read: IDD/IDSD is ~80% rebrand, 20% real insight — not a structural threat

Kapil Ahuja's series ("SDD will collapse → IDSD/ICE") is the loudest "successor" narrative. Read in full, the author himself concedes "ICE is still a spec" and that he's "at Level 2.5, not 3" of his own model. It's separation-of-concerns applied to markdown files, heavily self-promotional (his Garura/Phoenix harness does the load-bearing work — relocating ceremony, not removing it).

The 20% that's real and worth adopting: (1) NFRs and architecture decisions get smuggled into specs and pre-locked too early → drift when upstream pivots (his Vercel→GCP example); (2) separate directional constraints ("<200ms") from binary failure conditions the validator runs as evals — and the builder must not see the failure evals ("can't teach a test it can't see"); (3) feed context progressively, don't dump one wall; (4) the "two implementations" goal test — if only one implementation can satisfy your "goal," you wrote a spec disguised as a goal. These are cheap authoring-quality checks Companion could bake into its specify prompt.

The allied critique (Wasowski) is stronger evidence than the IDSD one

"Stop Writing Specs, Start Writing Facts" makes the same point with harder data: one executable test passed unchanged through Sonnet 3.5→3.7→4→Opus 4.5+, while the equivalent 1,500-word spec needed four reinterpretations. Google Trends shows SDD interest cresting (0→100 by Mar 2026, down to 86 by May). The takeaway isn't "abandon specs" — it's "the test suite is the real behavioral contract," which is the eval-layer mandate again.

The data (now that we have the full text)

Finding	Source / study	Why it matters for Companion
Experienced devs 19% slower with AI on mature repos, while feeling 24% faster (43-pt perception gap)	METR 2025 RCT (16 devs, 246 tickets)	Argues for honest timing in the trace; don't sell a "you're fast" dashboard.
Persistent spec file recovered 90% of lost faithfulness (faithful components 118→181)	SLUMP benchmark / ProjectGuard	Direct evidence for the value of Companion's `.spec-context.json` vs chat history.
Context files (AGENTS.md) reduced task success in 5/8 settings at 20–23% higher cost; human-written only +4%	ETH Zurich "Evaluating AGENTS.md" (Feb 2026)	Honest counter-evidence; validates not forcing ceremony on small changes.
Every one of 18 frontier models degraded before its window limit ("context rot")	Chroma Research (2025)	File-based state + fresh context is the correct architecture — Companion already does this.
BDD/Gherkin scenarios as LLM input: +15.1% pass@1 vs unstructured NL; AutoUAT (BMW): 92% scripts useful, €0.12/script	Cited benchmark + Critical TechWorks case	A concrete feature angle: encourage Given/When/Then acceptance criteria + a self-verify step.
AI code: 1.7× more major issues, 2.74× more security vulns, 75% more misconfigs; 48% of AI code carries vulnerabilities	CodeRabbit (470 PRs, Dec 2025); Papalini	The "why verify" stat; governance is the next buying criterion.
Adoption: 95% use AI weekly, 55% use agents; Claude Code went zero→#1 in 8 months; ecosystem 11→30+ frameworks	Pragmatic Engineer survey (900+ engineers)	The market is real and fragmenting — cross-provider orchestration is the wedge.
Cost: Uber exhausted its 2026 AI budget in ~4 months ($500–2,000/eng/mo); a Microsoft division dropped Claude Code for cost	The Information / Fortune / The Verge	Token/cost budgeting + auto-pause is a real differentiator, not a nicety.

Camp A — "SDD is collapsing" (the post-SDD / IDSD case)

Kapil Ahuja's IDSD series + Wasowski's facts/BDD allies

The loudest "successor" narrative + the stronger evidence-based critique. Honest read above: refine, don't replace.

The Method That Replaces SDD: IDSD · Kapil Ahuja · opinion

Thesis

SDD leaves holes because no human specifies everything before it exists; the agent fills them (a human gap, not a tooling defect). IDSD's ICE loop — declare outcomes, let the machine determine implementation — is pitched as the fix.

Specifics

ICE = Intent, Context, Expectations, three human-owned crafts feeding one loop. Human owns intent + expectations and never leaves them; the harness pulls context, codes, validates against expectations, re-runs until met, then merges.
OpenAI's "Symphony" spec (2,169 lines, 18 sections) is the trap: it worked — but OpenAI wrote it last, reverse-engineered from running software, then reimplemented in 6 languages. "The industry sells that output as if it were the method."
Economics: author runs 150–200M tokens/day; 3 days of rework ≈ $985 at Opus rates; gap-filling agents "burn more tokens being confidently wrong," billed to the client who never read a spec.
Concession: "Some will say this is still spec by another name, and they are right about the files and wrong about what changed."

"SDD breaks because we asked humans to do the one thing humans cannot do: specify everything before it exists."

Decision: the adoptable bits are "expectations = definition-of-done owned by whoever owns intent" and progressive context feeding. The "replaces SDD" claim is marketing. Opinion + self-promo; one credible anchor (Symphony).

SDD Isn't Broken. It Will Collapse. · Kapil Ahuja · opinion

Thesis

Both SDD and vibe coding fail for one reason — they jam three distinct layers (Intent / Spec / Implementation) into one document. Separating them is what IDD does.

Specifics

Three-layer schematic: Intent (user-owned: goal, constraints, success/failure, NFRs like "1M concurrent users" — these drive architecture so they belong here); Spec (the evaluable contract — "it doesn't describe; it verifies"); Implementation (system-owned — microservice-vs-monolith is a system decision).
Substrate Stack maturity: L1 Vibe → L2 Spec-driven ("collapsing now") → L3 Intent-driven → L4 Autonomous → L5 "Dark Factory" (Dan Shapiro's term).
The "brutal math": a promised 50% delivery saving gives 30% back to drift recovery → net 20%.
Confessions: his own corpus drifted; he's "at Level 2.5–2.75, not 3."

"Vibe coding collapsed because it had no contract. Spec-driven development is collapsing because it has three contracts pretending to be one."

Decision: the real, familiar failure — architecture/NFRs pre-locked in the spec too early — is worth taking seriously (fast-path/right-sizing). Structurally it doesn't threaten SDD. Low evidentiary weight, high narrative influence.

The Anatomy of Intent (ICE) · Kapil Ahuja · opinion, but verifiable claims

Why this one matters most

The most actionable article in the set and a direct shot at Spec-Kit (Companion's foundation).

Specifics

Intent's three parts: Goal (one sentence, no "and"; test: can two different implementations both satisfy it? if only one → you wrote a spec); Constraints (directional NFRs in business language, 5–7 lines: "99.99% uptime" yes, "use Kubernetes with 3 replicas" no — that's context); Failure conditions (binary, observable post-output checks the validator runs as evals: build fails, coverage <90%, secret in source, API contract change without version bump).
Decision rule: "Would knowing this change how the builder writes code? Yes → constraint. No → failure condition."
Compartmented evaluation (reward-hacking defense): builder gets goal+constraints; validator gets failure conditions "compiled into encrypted evals" — "the builder cannot teach a test it cannot see."
The Spec-Kit attack: quotes Spec-Kit telling the model to "make informed guesses," "fill gaps," cap at "Maximum 3 [NEEDS CLARIFICATION]," and three contradictory TDD orders (NON-NEGOTIABLE / OPTIONAL / "follow the TDD approach") across manifest, task template, and runner.
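
The compartmented-evaluation idea above can be sketched in a few lines of Python. This is a minimal illustration, not the article's harness: the `Intent` type, `builder_view`, `validate`, and the shape of the build report are all hypothetical names, and the "encrypted evals" step is reduced here to simply never handing the failure conditions to the builder.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration; none of these names come from the article.
@dataclass(frozen=True)
class Intent:
    goal: str                      # one sentence, no "and"
    constraints: list[str]         # directional, visible to the builder
    # Binary checks; each returns True when the finished build FAILS.
    failure_conditions: dict[str, Callable[[dict], bool]] = field(default_factory=dict)

def builder_view(intent: Intent) -> dict:
    """What the builder is allowed to see: goal and constraints only."""
    return {"goal": intent.goal, "constraints": list(intent.constraints)}

def validate(intent: Intent, build_report: dict) -> list[str]:
    """Validator side: run every failure condition and return the ones that tripped."""
    return [name for name, failed in intent.failure_conditions.items()
            if failed(build_report)]

intent = Intent(
    goal="Users can export their invoices as PDF",
    constraints=["p95 export latency under 200ms"],
    failure_conditions={
        "build_fails": lambda r: not r["build_ok"],
        "coverage_below_90": lambda r: r["coverage"] < 90,
        "secret_in_source": lambda r: r["secrets_found"] > 0,
    },
)

# Assumed report shape, produced by whatever runs after the builder finishes.
report = {"build_ok": True, "coverage": 84.0, "secrets_found": 0}
prompt_payload = builder_view(intent)
failures = validate(intent, report)
```

The design point is the split itself: the builder's payload never contains the checks it will be graded on, so it cannot tune its output to them.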

"Taken together, ICE is still a spec. But it's an intent that lets the model do what it was built to do, not to restrict it."

Decision (highest value in the set): (1) verify the Spec-Kit contradictions in current Spec-Kit — if real, they're quality bugs Companion could fix in its command bodies. (2) Bake the "two-implementations" goal test and the constraint-vs-failure rule into /speckit.companion.specify. (3) Separate directional constraints from binary failure-evals — maps onto the eval story.

SDD Always Gets Context Wrong. ICE Fixes It. · Kapil Ahuja · opinion

Thesis

No spec is ever complete; something fills the gaps, and the costliest gap is context. Read context from where truth already lives (code first), don't guess it from a document.

Specifics

Fixed context-resolution order: code first → product memory (LTM) → knowledge base → and only last, the model's reasoning ("which is where a spec-driven tool starts on step one"). Every answer carries a confidence level; low confidence → stop and ask a human.
The "is this context or a spec?" test: could two different builds both draw on it and both be right? Yes → context; only one → you wrote a plan.
Explicitly "vectorless" — rejects RAG ("RAG with extra steps"); trusts a markdown file carrying a real decision over a nearest-neighbor embedding.
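
The fixed resolution order with a confidence gate can be sketched as follows. This is an assumption-laden illustration: the `resolve_context` function, the 0.7 threshold, and the rule that the first source with an answer wins are not from the article, which only specifies the order and the "low confidence, ask a human" behavior.

```python
from typing import Callable, Optional

# A lookup returns (answer, confidence in [0, 1]) or None when it has nothing.
Lookup = Callable[[str], Optional[tuple[str, float]]]

def resolve_context(question: str,
                    sources: list[tuple[str, Lookup]],
                    min_confidence: float = 0.7) -> dict:
    """Walk sources in fixed order; the first one with an answer decides.

    Assumption: a low-confidence answer stops the walk and escalates to a
    human instead of falling through to a weaker source.
    """
    for name, lookup in sources:
        hit = lookup(question)
        if hit is None:
            continue
        answer, confidence = hit
        status = "resolved" if confidence >= min_confidence else "ask_human"
        return {"status": status, "source": name,
                "answer": answer, "confidence": confidence}
    return {"status": "ask_human", "source": None, "answer": None, "confidence": 0.0}

# Order from the article: code, product memory, knowledge base, model reasoning last.
sources = [
    ("code", lambda q: None),
    ("product_memory", lambda q: ("deploy target is GCP", 0.9)),
    ("knowledge_base", lambda q: ("deploy target is Vercel", 0.95)),
    ("model_reasoning", lambda q: ("probably Vercel", 0.3)),
]
result = resolve_context("Where do we deploy?", sources)
guess_only = resolve_context("Where do we deploy?", sources[3:])
```

Note that the knowledge base is never consulted in this example even though it reports higher confidence: order, not score, decides which source is trusted.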

"The tools tell you the model can fill in the context for you. It can. It just fills it with someone else's world."

Decision: validates "context as durable artifact" (code-first, confidence-gated human-in-loop over .spec-context.json). Anecdotal evidence; directional.

From Specs to Intents: Why Maturity Stalled · Kapil Ahuja · evidence-backed

Thesis

Most "AI-native" teams are stuck at Level 2 (still reviewing every diff); the J-Curve is really a grief curve, and the resistance to stopping code review is psychological, not rational.

Specifics

Shapiro's 5 Levels: L0 spicy autocomplete → L2 junior dev (read all code) → L4 developer-as-PM (write specs, check tests) → L5 dark factory. Shapiro: 90% of "AI-native" devs are at L2.
Evidence: METR 19% slower / 24% believed faster (43-pt gap). Copilot: 20M users, 55% faster in labs but bigger PRs, higher review cost, more vulns — "cheaper to write, more expensive to own."
IDSD's 7 principles incl. principle 6 "verify understanding before execution" (agent restates the goal in codebase context) and a 70–90% checkpoint approval target — "100% means it's ceremony."
Scenarios over tests: behavioral specs live outside the codebase as a holdout the agent never sees during dev (anti "teaching to the test"). Three execution speeds: Fast / Planned / Strategic.

"The J-Curve isn't a productivity dip. It's a grief curve."

Decision: the "3 execution speeds" mirror Companion's complexity routing; "verify understanding before execution" and "external holdout scenarios" are directly implementable prompt patterns; the 70–90% target argues against over-gating.

Software Engineering Is Done. It Forked. · Kapil Ahuja · opinion (cites studies)

Thesis

SE isn't dying, it's forking into ~5% who define intent (with methodology, memory, drift detection) and 95% using AI as autocomplete who ship code they can't debug. The fork is about discipline, not tools.

Specifics

Anthropic Jan 2026: AI-delegators scored <40% on comprehension vs 65%+ for conceptual-only users. SMU: 24.2% of AI-introduced issues survive to latest revision (304k commits). Stanford/Boneh: AI-assisted devs write less secure code while more confident ("anesthesia machine").
"CTO Presence Score" = (PRs reviewed + commits)/working days, target >0.25; most leaders at 0.10.
Tell for Companion: "this whole harness is going to be about recipes, not spec-driven development" — aligns with Companion's recipes-not-toggles direction.
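
For reference, the "CTO Presence Score" formula is simple enough to compute directly. The function name and the example inputs below are illustrative assumptions; only the formula and the 0.25 target come from the article.

```python
def cto_presence_score(prs_reviewed: int, commits: int, working_days: int) -> float:
    """(PRs reviewed + commits) / working days, as stated in the article."""
    if working_days <= 0:
        raise ValueError("working_days must be positive")
    return (prs_reviewed + commits) / working_days

TARGET = 0.25  # the article's target is a score above this value

# Hypothetical month of 20 working days.
engaged = cto_presence_score(prs_reviewed=4, commits=2, working_days=20)
typical = cto_presence_score(prs_reviewed=1, commits=1, working_days=20)
engaged_meets_target = engaged > TARGET
typical_meets_target = typical > TARGET
```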

"The tool is neutral. The methodology is everything."

Decision: read for the narrative buyers absorb; the "recipes" framing validates Companion's composable-workflow direction. Pure opinion/anecdote but cites real studies.

SDD Is Breaking the Fifty-Year Iron Triangle · Kapil Ahuja · opinion + real cost data

Thesis

Agentic coding shattered the time/cost/quality triangle: speed became free, quality dropped to an eval-held floor, and cost split into tokens + human cognition. Front-loading the spec burns both levers.

Specifics

The $80k anecdote: 5 people × 3 months of specs (~$96k) vs a 10-day intent-first build (~$16k); one question — "why are we building this?" — deleted the specs.
Agentic Iron Triangle: Speed = table stakes (from parallelization, not faster models); Quality = a floor welded by evals/SonarQube; Cost = tokens + cognition.
Verifiable cost data: Uber exhausted its 2026 AI budget in ~4 months (Claude Code ~84% of engineers, $500–2,000/mo); a Microsoft division (Experiences & Devices) dropped Claude Code for Copilot CLI by Jun 30 2026; Peter Steinberger ~$1.3M / 603B tokens in a month.

"The spec was complete and detailed and wrong about the only thing that mattered, and its completeness is exactly what hid that from the room."

Decision: the strongest "over-ceremony" critique — frames fast-path/right-sizing and the "hold intent, don't over-spec" defense. Cost anecdotes are verifiable.

Stop Writing Specs. Start Writing Facts. · Jarek Wasowski · evidence-rich

Thesis

SDD won the tooling war but is losing the epistemological war: a prose spec is a prediction about the model that must be reinterpreted each upgrade; an executable assertion passes through an exit code, not interpretation. Migrate everything except audit/onboarding artifacts to a "fact-set."

Specifics

Anchor: one test written Jun 2025 passed CI unchanged through Sonnet 3.5→3.7→4→Opus 4.5+, while the 1,500-word spec needed 4 reinterpretations. Google Trends: 0→100 (Aug 2025–Mar 2026) → 86 by May.
Non-determinism persists at temp 0.0 (float non-associativity, batch scheduling): "80 unique completions across 1,000 identical prompts"; IBM: a 100B+ model reproduced outputs identically only 12.5% of the time.
Lineage: Hoare triples (1969) → Design by Contract (1992) → QuickCheck (2000). Production case: Quviq, 60,000 lines Erlang, 450 lines of PBT, 25 bugs.
SDD wins in exactly 3 domains where "ceremony is the product": compliance (DO-178C, ISO 26262, EU AI Act), cross-team B2B (Stripe's OpenAPI; Boost +80% stability after Pact), onboarding. Test: "is the artifact read by a human outside the team?"
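
To make "an executable assertion passes through an exit code" concrete, here is a hand-rolled property check in the QuickCheck lineage, using only the standard library. The function under test (`normalize_whitespace`) and the tiny `check_property` loop are hypothetical stand-ins; a real project would more likely reach for an established property-based testing library.

```python
import random

def normalize_whitespace(text: str) -> str:
    """Hypothetical function under test: collapse whitespace runs to single spaces."""
    return " ".join(text.split())

def check_property(prop, gen, runs=200, seed=0):
    """Tiny QuickCheck-style loop: return the first counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = gen(rng)
        if not prop(case):
            return case
    return None

def random_text(rng):
    """Generate short strings heavy on whitespace so the property is exercised."""
    alphabet = "ab \t\n"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 30)))

def idempotent(s):
    once = normalize_whitespace(s)
    return normalize_whitespace(once) == once

def no_whitespace_runs(s):
    out = normalize_whitespace(s)
    return "  " not in out and "\t" not in out and "\n" not in out

# The "fact": these properties hold for every generated input.
counterexample = check_property(
    lambda s: idempotent(s) and no_whitespace_runs(s), random_text
)
# A deliberately false property, to show the loop does find counterexamples.
false_claim = check_property(lambda s: len(s) < 5, random_text)
```

Unlike a prose requirement, the check does not need reinterpreting when the model that wrote `normalize_whitespace` changes: it either finds a counterexample or it does not.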

"Models are giant disgruntled interns, except their mood changes with every API call."

Decision: the durable artifact is the test — informs whether Companion surfaces/tracks executable acceptance criteria and reframes drift detection. Author flags it's from his own 6 projects, "not a controlled study."

SDD: BDD as the Missing Link · Jarek Wasowski · case study + benchmark

Thesis

The missing spec layer for AI is 22-year-old BDD: one Given/When/Then scenario doubles as business spec plus five test levels, and AI now removes BDD's old step-definition maintenance cost.

Specifics

One scenario → five test levels (unit → integration → E2E → UAT → regression), "five granularities, one source file."
BDD scenarios as LLM input: +15.1% pass@1 vs unstructured NL. AutoUAT (Critical TechWorks / BMW): 95% of scenarios helpful, 92% of scripts useful (60% zero-change), €0.12/script, est. 60–80% per-feature cost cut.
"Three Amigos" ritual (Discovery → Formulation → AI Generation → Verification [mandatory self-verify loop] → Living Documentation, breaking a scenario blocks merge). Heuristic: Three Amigos for clear business behavior; vibe coding OK for <50-line fixes.

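A Given/When/Then scenario with its step definitions can be sketched in plain Python, as below. The step registry, the cart example, and `run_scenario` are hypothetical and deliberately minimal; they are meant to show how one scenario text doubles as a readable spec and an executable check, not to mirror any particular BDD framework.

```python
import re

STEPS = []  # (compiled pattern, handler) pairs, filled by the decorator below

def step(pattern):
    """Register a step definition for lines whose body matches the pattern."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register

@step(r"an empty cart")
def empty_cart(ctx):
    ctx["cart"] = {}

@step(r"the user adds (\d+) units? of (\S+)")
def add_units(ctx, qty, sku):
    ctx["cart"][sku] = ctx["cart"].get(sku, 0) + int(qty)

@step(r"the cart contains (\d+) items?")
def cart_contains(ctx, expected):
    assert sum(ctx["cart"].values()) == int(expected)

def run_scenario(text):
    """Execute each line against the registered steps; raise if any step fails."""
    ctx = {}
    for line in text.strip().splitlines():
        keyword, _, body = line.strip().partition(" ")
        if keyword not in {"Given", "When", "Then", "And"}:
            raise ValueError(f"unknown keyword: {keyword}")
        for pattern, handler in STEPS:
            match = pattern.fullmatch(body)
            if match:
                handler(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {body}")
    return ctx

SCENARIO = """
Given an empty cart
When the user adds 2 units of SKU-1
Then the cart contains 2 items
"""

final_state = run_scenario(SCENARIO)
```

A self-verify step in this shape is just "run the scenario after generation and refuse to mark the task complete if it raises".
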
"BDD is not test automation — it's collaborative requirements analysis combined with TDD." — Aslak Hellesøy

Decision: concrete feature angle — encourage Given/When/Then acceptance criteria + a self-verification step (maps onto mark-complete). Compelling "5 levels from 1 scenario" docs narrative.

Camp B — "SDD won; now govern it" (settled infrastructure)

Enrico Papalini's series

The strongest evidence base of the set (surveys, Gartner, vendor data). The message: generation is solved; governance, cost, and reliability are the battleground — and no single framework wins, it's a layered stack.

The SDD Won. Now Comes the Hard Part. · Enrico Papalini · strongest evidence base

Specifics

Adoption (Pragmatic Engineer, 900+ engineers): 95% use AI weekly, 75% for ≥half their work, 55% use agents; staff+ lead at 63.5%. Claude Code zero→#1 in 8 months. 70% use 2–4 tools. Company size drives choice (10k+ favor Copilot 56%; startups Claude Code 75%).
Gartner: by 2027, 65%+ of agentic teams treat IDEs as optional. Ecosystem grew ~11→30+ frameworks.
Wins: Kiro — a 2-week feature in 2 days; Spec Kit — ~10× fewer "regenerate from scratch" cycles.
Governance gap: 48% of AI code carries vulns; code duplication rose 4×. Copilot moved to usage billing Jun 1 2026. Benchmarks: Codex/GPT-5.5 tops Terminal-Bench 2.1 (83.4%); Claude Opus 4.8 leads SWE-bench Verified (88.6%).

"June 2026 looks like the month the market stopped asking 'Are AI agents real?' and started asking 'Which part of my company gets agentized first?'"

Decision (highest strategic-priority signal): governance (permissions, sandboxes, audit logs, cost controls) is the named bottleneck and future buying criterion — points at Companion's capture/audit-trail + cost-aware execution as differentiators.

The Evolution of SDD (11-framework survey) · Enrico Papalini · well-cited map

Specifics

Context-rot foundation: CodeRabbit (470 PRs) 1.7× issues / 2.74× vulns; Chroma 18-model study (monotonic decay well below limits — a 200K model degrades at 50K); "Lost in the Middle" U-shape.
Maps 11 frameworks by their single defining innovation (Karpathy Skills, Agent Skills/Osmani, AI-RPI, BMAD, OpenSpec, Spec Kit, GSD v1/v2, Ralph, Spec Loop Engine, AIDD) across 14 dimensions into a 4-layer stack.
Biggest systemic weakness: 8 of 11 frameworks have no crash recovery. Persistent memory (only GSD-2's SQLite graph) and cost accounting are rare.
Three enforcement types: advisory (depend on compliance) / structural (default shape of work) / programmatic (controls outside the model's prose).

"A rule written in Markdown is a suggestion. A rule encoded in the runtime is a control."

Decision: the best competitive-landscape map — names the exact capability gaps (persistent memory, cost accounting, crash recovery) that are Companion's differentiation targets; positions Spec Kit as the most composable base ("layer, don't replace").

The Public SDD Stack Has Arrived (6 layers) · Enrico Papalini · synthesis

Specifics

17 tools decompose into 6 layers (behavioral, governance, workflow, runtime, memory, infrastructure) — the right architecture is a composition, not one framework.
Three convergences: specs as operational control surfaces; fresh context replacing long conversations; execution/verification separating ("don't let a model grade its own homework").
Three consistently-missing layers: persistent memory, cost accountability, crash recovery. Names Microsoft's APM (Agent Package Manager) as the emerging enterprise infra layer ("SDD is acquiring a supply chain").

"The harness protects the agent from its environment. The loop protects the workflow from the agent."

Decision: maps Companion onto the workflow/memory boundary over Spec Kit; the three missing layers are candidate roadmap bets; APM/supply-chain is an enterprise lane Companion doesn't touch yet.

From "It Runs" to "It Builds While You Sleep" (Ralph, 6 gates) · Enrico Papalini · reproducible build log

Specifics

A Python supervisor runs a local 27B model unattended, forcing every change through six verification gates before any git commit: syntax check, feature markers, change gate (declared file actually modified), wiring check, behavioral smoke test, optional LLM review (off by default — "same weak model grading its own homework").
Pac-Man run (Jun 20–25, 23 tasks/6 phases): prefix-cache HIT ~2s vs MISS ~242s (100× penalty); whole-file regeneration bloated prompts 22k→80–85k tokens.
Bug fixes: "done is dangerous" (treat any dirty finish as resume); truncation needs continuation not retry; never shrink turn budget on retry.
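
The gate-before-commit pattern can be sketched as an ordered, fail-fast pipeline. Everything here is an illustrative assumption: the shape of the `change` dict, the concrete predicates behind each gate, and the use of `exec` for the smoke test are stand-ins, and the optional LLM review gate is omitted because the article says it is off by default.

```python
def syntax_ok(change):
    try:
        compile(change["source"], change["path"], "exec")
        return True
    except SyntaxError:
        return False

def has_feature_marker(change):
    return change["marker"] in change["source"]

def declared_file_modified(change):
    return change["path"] in change["modified_files"]

def is_wired(change):
    # Assumption: "wired" means the entry point references the new symbol.
    return change["symbol"] in change["entrypoint_source"]

def smoke_test(change):
    # Demo only: executes the generated code in a throwaway namespace.
    namespace = {}
    exec(change["source"], namespace)
    return namespace[change["symbol"]]() == change["expected"]

GATES = [
    ("syntax", syntax_ok),
    ("feature_markers", has_feature_marker),
    ("change_gate", declared_file_modified),
    ("wiring", is_wired),
    ("smoke_test", smoke_test),
]

def run_gates(change, gates=GATES):
    """Run gates in order and stop at the first failure."""
    results = []
    for name, check in gates:
        passed = check(change)
        results.append((name, passed))
        if not passed:
            break
    return results

def may_commit(results, gates=GATES):
    return len(results) == len(gates) and all(passed for _, passed in results)

good_change = {
    "path": "greet.py",
    "source": "# FEATURE: greet\ndef greet():\n    return 'hello'\n",
    "marker": "# FEATURE: greet",
    "modified_files": ["greet.py"],
    "symbol": "greet",
    "entrypoint_source": "from greet import greet\nprint(greet())\n",
    "expected": "hello",
}
# The model claimed to edit greet.py but the working tree shows no such change.
phantom_change = {**good_change, "modified_files": []}

good_results = run_gates(good_change)
phantom_results = run_gates(phantom_change)
```

The checks are deterministic and live outside the model, which is the point: the commit decision never depends on the model's own account of what it did.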

"A local 27B is good enough to write the code. It is not good enough to be trusted."

Decision: strong validation of deterministic per-task gates + checkpoint/resume — parallels Companion's per-task capture + mark-complete + verification. Single-build evidence, rich mechanics.

Loop Engineering Is SDD in Motion · Enrico Papalini · synthesis

Specifics

Layer model: Specification (intent/constraints/acceptance = the loop's stopping condition) → Loop → Harness → Runtime.
Six structural elements of a reliable loop: trigger, durable state, skills+knowledge, isolation (worktrees/sandboxes), least-authority connectors, independent verification (maker-checker split).
Names competing loop packages: Spec Loop Engine (Avi Brahms — YAML phases, state.json + append-only journal.jsonl, resumable), Ralph Copilot (Planner→Coordinator→Executor→Reviewer, PRD.md + PROGRESS.md + git as memory). Claude Code /loop + /goal as primitives.
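
The "durable state" element, in the append-only journal style the article attributes to Spec Loop Engine, can be sketched like this. The event schema, function names, and phase list are assumptions for illustration and are not taken from that project; the sketch only shows why an append-only log makes a loop resumable after a crash.

```python
import json
import tempfile
from pathlib import Path

def append_event(journal: Path, event: dict) -> None:
    """Append one JSON line; earlier lines are never rewritten."""
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")

def completed_phases(journal: Path) -> set:
    """Replay the journal and collect phases that reached 'done'."""
    if not journal.exists():
        return set()
    done = set()
    for line in journal.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("status") == "done":
            done.add(event["phase"])
    return done

def run_loop(phases, journal: Path, work) -> list:
    """Run phases in order, skipping any the journal already records as done."""
    already = completed_phases(journal)
    ran = []
    for phase in phases:
        if phase in already:
            continue
        append_event(journal, {"phase": phase, "status": "started"})
        work(phase)  # may raise; a 'started' with no 'done' means redo on resume
        append_event(journal, {"phase": phase, "status": "done"})
        ran.append(phase)
    return ran

def crash_on_implement(phase):
    if phase == "implement":
        raise RuntimeError("simulated crash")

with tempfile.TemporaryDirectory() as tmp:
    journal = Path(tmp) / "journal.jsonl"
    phases = ["specify", "plan", "implement"]
    try:
        run_loop(phases, journal, crash_on_implement)
    except RuntimeError:
        pass
    resumed = run_loop(phases, journal, lambda phase: None)
    journal_lines = len(journal.read_text(encoding="utf-8").splitlines())
```

This also matches the "done is dangerous" lesson from the Ralph build log: a phase that started but never logged completion is treated as unfinished and rerun.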

"Autonomy should follow verifiability, not ambition."

Decision: frames Companion's pipeline + auto-mode + capture/resume as an "SDD control plane over a loop"; names competitors to position against.

How to Evolve an SDD Framework · Enrico Papalini · lived case study

Specifics

Frameworks die two ways: stagnation or accretion — and for agentic frameworks accretion is measurable in tokens (every instruction line competes with a line of source the agent could read).
5-step loop over 27 documented waves: Harvest (on release cadence) → Gap analysis (most candidates die here) → Gate against 9 inviolable constraints → Adopt additively (opt-in, old contracts keep working) → Reject explicitly and write it down ("What NOT to Adopt" table).
Signature move: "harvest the capability, reject the mechanism." Recurring rejections: community marketplaces (supply-chain risk), programmatic state machines, ungoverned autonomous loops, runtime/IDE/provider lock-in.

"The frameworks that survive aren't the ones that adopt the most — they're the ones that know what to refuse."

Decision: directly actionable governance model for how Companion's own preset/command surface should evolve — argues for a written rejection log + a token-cost health metric; validates the "never overwrite spec-kit / additive presets" stance.

Camp C — Framework maps & competitive verdicts

Rick Hightower, Wasowski (15-framework), Mysore, Mak

Where Companion sits and who it competes with. The consensus: frameworks are layers at different rigor levels — combine, don't pick. Spec Kit is the L1 baseline Companion rides.

Comparing 15 SDD Frameworks (the rigor taxonomy) · Jarek Wasowski · positioning gold

Specifics

The 15: Superpowers (166k★), Spec-Kit (~90k), GSD (~48k), BMAD (~45k), OpenSpec (5.8k), cc-sdd (1.5k), OWASP Security Skill, SpecSwarm, MUSUBI (28); + Don Cheli SDD, Agent OS v3, Shotgun CLI, WordPress SDD Constitution, CSDD (academic); + commercial Intent ($252M) and Tessl ($125M, only true spec-as-source).
Three-level rigor taxonomy (Piskala, arXiv:2602.00180): L1 Spec-First (spec dies after merge; promise = flexibility) / L2 Spec-Anchored (spec lives the lifecycle, every change propagates; promise = auditability) / L3 Spec-as-Source (code is `// GENERATED FROM SPEC — DO NOT EDIT`; promise = guarantee-by-construction).
Hidden dividing axis = TDD (mandatory in Superpowers/Don Cheli/MUSUBI, optional in Spec-Kit/BMAD/GSD). Overhead datapoint: Spec-Kit 33m30s vs 8m minimal baseline (cited as a 10× first-cycle slowdown, although these two timings alone work out to roughly 4.2×).

"You don't choose the 'best' one; you choose the 'right' one."

Decision: the most useful competitive map — slots Spec-Kit (Companion's base) as L1, and names what higher tiers add (traceability, EARS, living spec). Companion's presets sit in L1; the upgrade path buyers ask about is L2 auditability.

The Great Framework Showdown (the verdicts) · Rick Hightower · practitioner opinion

Per-framework "choose this if"

BMAD (40.2k★): heavyweight enterprise simulator, 12+ personas → enterprise with audit/compliance needs.
SpecKit (75.9k★, highest): gated process, ~800 lines/change → greenfield wanting the most rigorous spec. Weaknesses: 1–3 hrs/feature, static specs (drift), no multi-agent orchestration.
OpenSpec (29.5k★): brownfield delta specs, ~250 lines/change, /opsx:ff → iterative maintenance on existing code.
GSD (28.1k★): fresh 200k context per task, wave parallelism → complex features; "always the second best at everything — start with GSD." A 50-task feature ~$5 single-session vs $25–40 with GSD.
Superpowers: TDD iron law (deletes code written before tests) → when quality is primary.

"If another tool is best at something, GSD is always the second best."

Decision: names exactly where Spec-Kit is weak (1–3hr ceremony, static/drift, no multi-agent) — precisely Companion's differentiation surface (parallel implement, capture-driven state, mark-complete). Explicitly "subjective"; star counts concrete.

GSD vs Superpowers vs BMAD (Context Rot) · Rick Hightower · cited studies

Specifics

Context rot is architectural (causal masking), so bigger windows only delay it. GSD = aggressive atomicity (fresh context/task, waves, file-system message bus); Superpowers = temporal containment (2–5 min micro-tasks, 10–30k tokens); BMAD = persona switching as context boundaries.
Numbers: past 50% context fill, primacy bias collapses; every one of the 18 models tested degraded, even on trivial replication tasks.

"Every model tested showed degradation as context grew. Not some models. Every single one."

Decision: confirms Companion's file-based per-spec model is the correct defense against rot; fresh-context/wave fan-out is industry-standard. Good README/docs ammunition for *why* Companion externalizes state.

GSD vs Spec Kit vs OpenSpec vs Taskmaster · Rick Hightower · market map

Specifics

Stars/license (Feb 2026): GSD 16.7k MIT; Spec Kit 70.8k MIT (18+ agents); OpenSpec 24.9k MIT; Taskmaster 25.5k MIT+Commons-Clause (Cursor-first).
5 divergence axes: execution depth (GSD orchestrates ↔ Taskmaster delegates), context strategy, brownfield (OpenSpec leads), platform breadth vs depth, licensing.

"Specifications don't serve code; code serves specifications." (Spec Kit)

Decision: situates Companion as a GUI layer on Spec Kit's 70.8k-star base. Author sells a competing installer (disclosed) — treat as a market map, not evidence.

What Is GSD? SDD Without the Ceremony · Rick Hightower · product walkthrough

Why it matters

GSD's pitch — "SDD without the ceremony" — is nearly Companion's own; its .planning/ artifacts parallel Companion's .spec-context.json.

Specifics

npx get-shit-done-cc; hierarchy Project > Milestone > Phase > Plan > Task; four-step per-phase loop Discuss → Plan → Execute → Verify (maps 1:1 onto specify→plan→tasks→implement).
Multi-runtime (Claude Code, OpenCode, Gemini CLI); thin orchestrator held at 30–40% context; /gsd:quick fast path; /gsd:map-codebase brownfield; conversational verify; atomic commit per task.

"The complexity is in the system, not in your workflow."

Decision: closest philosophical competitor — signals features buyers expect: multi-runtime, a /quick fast path (Companion has fast-path-as-default), brownfield map, conversational verify, atomic commits. Worth a close read for the verify-step UX.

Superpowers, GSD, gstack: What Each Constrains · Ewan Mak · field notes

Specifics

Each constrains a different axis: Superpowers the process (7-phase, TDD deletes untested code), GSD the environment (context isolation, 50%/70% rot thresholds), gstack the decision-role (23+ role-mapped commands; its structural gap: no Build-phase skill — reverts to default Claude Code between plan and review).
Karpathy's AutoResearch (65k★/month): modify → run experiment in a time budget → measure one metric → keep if improved, revert if not. ~700 experiments/2 days → ~20 improvements, 11% efficiency gain. Shopify: 93 automated commits → 53% rendering improvement. Numeric metrics only. None of the majors have this post-ship optimize loop.
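
The modify, measure, keep-or-revert cycle described above is essentially greedy hill climbing on a single numeric metric. A minimal sketch follows; the `optimize` function, the toy metric, and the mutation rule are all hypothetical, and the per-experiment time budget is left out for brevity.

```python
import random

def optimize(start, mutate, measure, experiments, seed=0):
    """Keep a candidate only if the single metric improves; otherwise revert."""
    rng = random.Random(seed)
    best, best_score = start, measure(start)
    kept = 0
    for _ in range(experiments):
        candidate = mutate(best, rng)   # modify
        score = measure(candidate)      # run the experiment, read one number
        if score > best_score:          # keep if improved
            best, best_score = candidate, score
            kept += 1
        # Otherwise the candidate is discarded, which is the "revert" step.
    return best, best_score, kept

# Toy stand-in for a real metric: higher is better, with the optimum at x = 3.
def metric(x):
    return -(x - 3.0) ** 2

def nudge(x, rng):
    return x + rng.uniform(-1.0, 1.0)

start = 0.0
best, best_score, kept = optimize(start, nudge, metric, experiments=200)
```

The loop only works when the metric is numeric and cheap to measure, which matches the article's "numeric metrics only" caveat; most experiments are expected to be discarded.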

"I'm a solo developer. I don't write code — Claude Code does."

Decision: two roadmap ideas — (1) an unattended "optimize after ship" loop (AutoResearch) nobody has; (2) gstack's Build-phase gap shows Companion's strength is exactly build coverage (specify→implement→mark-complete). Thin evidence — trend-reading.

SDD Is Eating Software Engineering: Map of 30+ Frameworks · Vishal Mysore · listicle (short)

Note: short piece (~3-min read); despite the title it names ~22 tools with no per-tool detail. Its one useful contribution is a 4-layer mental model: (1) spec frameworks (Spec Kit, OpenSpec, BMAD, Intent, cc-sdd), (2) planning/task systems (Taskmaster, Agent OS, Beads), (3) execution agents (GSD, OpenDevin, CrewAI, LangGraph), (4) AI IDEs (Cursor, Windsurf, Kiro, Claude Code). Plus the spec-as-source movement (Tessl, Intent).

"Don't prompt AI to write code. Give it a specification and let agents implement it."

Decision: low-depth, but the 4-layer model is a clean "where we fit" one-liner — Companion sits at the boundary of layer 1 (spec framework wrapping Spec Kit) and layer 4 (VS Code extension).

Camp D — Durability, drift, verification & ROI

Wasowski's engineering series (the most evidence-forward)

How to make specs survive, when NOT to use them, how to prove ROI, and how far verification should go. The most citation-backed, honest material in the set.

Managing Agent Context: CDLC + SDD · Jarek Wasowski · strongest evidence

Specifics

Context decays two ways: Lost in the Middle (spatial, U-shaped: ~75% at position 1 → ~55% mid) and context rot (temporal). All 18 Chroma models degraded before their window limit.
Four context-debt failure modes: retrieval noise, context poisoning (untrusted PR comments as authoritative), dilution, leakage. Tool overload: >50 MCP tools → selection accuracy drops sharply (ETH Zurich).
Reality check (honest counter-evidence): Boris Cherny reports that the Claude Code team ships dozens of releases/day, 90% AI-written, with no PRDs. ETH Zurich (Feb 2026): AGENTS.md context files reduced task success in 5/8 settings at 20–23% higher cost; human-written files gained only +4%. Zaninotto: Spec Kit = 8 files/1,300 lines for a date display.
The rule: SDD wins only when all three are high — number of teams touching the code, cost of architectural drift, cost of regenerating code.

"Context debt accumulates while the developer sleeps — because the agent keeps working."

Decision: the "3-factors" tree and ETH/Zaninotto critiques directly justify fast-path/complexity routing (don't force 8-file ceremony on trivial changes). Honest counter-evidence Companion should cite, not hide.
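The three-factor rule and the routing it justifies can be sketched as a single gate. This is a minimal illustration, not Companion's implementation: `ChangeContext`, `route`, the 0.0 to 1.0 scoring scale, and the 0.7 cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeContext:
    # Each factor is scored 0.0 (low) to 1.0 (high); the scale is an assumption.
    teams_touching_code: float
    cost_of_architectural_drift: float
    cost_of_regenerating_code: float


def route(ctx: ChangeContext, high: float = 0.7) -> str:
    """Return 'full-sdd' only when all three factors are high, else 'fast-path'."""
    factors = (
        ctx.teams_touching_code,
        ctx.cost_of_architectural_drift,
        ctx.cost_of_regenerating_code,
    )
    return "full-sdd" if all(f >= high for f in factors) else "fast-path"


# Hypothetical inputs: a trivial UI tweak versus a shared platform change.
date_display_tweak = ChangeContext(0.1, 0.2, 0.1)
shared_auth_service = ChangeContext(0.9, 0.8, 0.9)
one_factor_low = ChangeContext(0.9, 0.9, 0.3)

routes = [route(c) for c in (date_display_tweak, shared_auth_service, one_factor_low)]
```

Using `all(...)` rather than an average is deliberate: under the rule as stated, one low factor is enough to fall back to the fast path.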

Three Maturity Levels (the SLUMP 90% finding) · Jarek Wasowski · citation-backed

Specifics

If you use CLAUDE.md/AGENTS.md you're already at Level 1 (spec-first) — but that spec rots after ship. Level 2 (spec-anchored) is the highest-ROI jump. Level 3 (spec-as-source) = code as artifact.
SLUMP benchmark (Faithfulness Loss Under Emergent Specification): 20 ML implementations, 371 components, 60 incremental requests. ProjectGuard (a persistent spec file outside chat history) recovered 90% of lost faithfulness — faithful components 118→181.
IFScale: ~100% adherence at 10 instructions → 68% at 500; a CLAUDE.md over ~200 lines already causes measurable adherence drop. Constitutional SDD in banking: 73% fewer security defects.
MDA warning: Model-Driven Architecture (2001) promised the same and failed (Fowler's "Night of the Living Case Tools").

"Writing before coding doesn't mean freezing before coding."

Decision (most product-relevant): Companion's living-spec model IS the "Level 1 → Level 2" jump the author calls highest-ROI; SLUMP's "pull the spec out of chat into a persistent file → recover 90% faithfulness" is a direct evidence-backed value prop for the trace.

Designing a Spec That Survives Code Generation (5 doc types) · Jarek Wasowski · design input

Specifics

Spec-in-the-drawer is an architecture problem: separate what must survive (Constitution) from what's meant to die (per-feature triad).
Five document types with different lifecycles: Constitution (years, 9–12 articles, maps to OWASP/CWE), Feature Spec (per-branch, EARS FR-001 numbering, Non-Goals section), Technical Plan (specific versions, becomes history), Task List (atomic, `[P]` parallel markers, one commit/task), Change Spec (delta ADDED/MODIFIED/REMOVED; bug-fix must have "Unchanged behavior" — "the most important section").
"The constitution is READ by the agent at the start of every session" — survives rot because it isn't in conversation history.

"Spec-in-the-drawer stops being a problem once you accept that the per-feature spec is MEANT to die."

Decision: validates Companion's constitution + per-feature spec/plan/tasks lifecycle and "mark-complete = the per-feature triad dies"; the change-spec/delta type maps onto living-specs.

A Spec Alone Isn't Enough (4 legs + template sync) · Jarek Wasowski · peer-reviewed cites

Specifics

Four artifacts must be wired into context together: spec, template repo, golden paths (Netflix "Paved Road," Spotify "golden path"), test patterns — plus a hidden fifth leg: template sync (Copier `copier update` 3-way merge; GitHub template repos are "clone and forget" anti-patterns).
Cited: Khojah (IEEE TSE 2025) chain-of-thought 47.1% vs signature+few-shot 57.5% pass@1 (+10.4pp); Xu (UCLA 2024) +5.70pp with 6 examples, saturation ~6; "Lost in the Middle" 20–50% mid-context degradation; DeepMind many-shot inverse scaling after ~125 examples.
Measure DORA + adoption + per-prompt pass@1 + "that's not our style" PR-comment rate — NOT lines of generated code.

"If you have only a spec, AI generates code from the internet. If you have four artifacts wired into context, AI generates code from your repository."

Decision: evidence that a steering-doc layer alone is insufficient — informs Companion's steering + golden-path/template story and a "template sync" content angle.

When SDD Loses to the Prototype · Jarek Wasowski · decision heuristic

Specifics

Five conditions where a prototype wins: low requirement certainty, short code lifespan, small team, low failure risk, discovery phase (Brooks' "plan to throw one away"; Snowden's Cynefin — complex problems need probe-sense-respond).
The graduation rule: start specifying when the cost of refining the spec drops below the cost of fixing code misunderstandings. Signals: main flow stabilizes, patterns repeat a third time, a second person joins, real risk appears (auth/payments), debt becomes visible.
Honest data: METR 19% slower (notes METR's 2nd experiment was halted Feb 2026 for selection errors; the 19% wasn't revised). The viral "Clean Code author dismissed SDD" quote is unconfirmed.
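The graduation rule and its signals can be written down as a small check. This is a sketch under stated assumptions: the function names, the signal identifiers, and the idea of expressing both costs in the same unit (for example hours per week) are illustrative and not from the article.

```python
# The five signals listed above, as hypothetical identifiers.
GRADUATION_SIGNALS = (
    "main_flow_stabilized",
    "pattern_repeated_third_time",
    "second_person_joined",
    "real_risk_appeared",
    "debt_became_visible",
)


def should_start_specifying(
    spec_refinement_cost: float, misunderstanding_fix_cost: float
) -> bool:
    """Graduate once refining the spec is cheaper than fixing code misunderstandings."""
    return spec_refinement_cost < misunderstanding_fix_cost


def observed_signals(flags: dict[str, bool]) -> list[str]:
    """Return the graduation signals currently observed, in the listed order."""
    return [name for name in GRADUATION_SIGNALS if flags.get(name, False)]


# Hypothetical project state: a second developer joined and payments work began.
signals = observed_signals({"second_person_joined": True, "real_risk_appeared": True})
graduate = should_start_specifying(spec_refinement_cost=4.0, misunderstanding_fix_cost=9.0)
```

The signals are kept separate from the cost comparison on purpose: the text presents them as hints that the threshold is near, not as a substitute for comparing the two costs.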

"The most expensive mistake of 2026 isn't choosing the wrong tool — it's using the right tool in the wrong phase."

Decision: Companion shouldn't present full ceremony as universally correct; it should offer a lightweight/discovery mode that ramps into full rigor (echoes fast-path-as-default). Well-reasoned heuristic, not measured proof.

ROI of SDD (defend it to the board) · Jarek Wasowski · honest, evidence-forward

Specifics

The famous numbers (Mercari +150%, "−40% bugs," "100x") are self-reports or sourceless and collapse under board scrutiny. The one durable argument is the defect-cost-by-phase curve — shift-left.
ROI = (rework_savings + defect_escape_savings + TTM_value − implementation_cost − learning_curve) / implementation_cost. Defect-escape is the biggest lever; baseline rework is 30–50% of team effort.
Decision threshold: if measured rework rate is below 15%, the ROI argument weakens — an honest model should say "don't deploy." Separate cost savings from revenue acceleration.
Honest evidence: METR 19% slower; SSRN 2026 found no spec/defect relationship after controlling for complexity; TDD still <35% adoption after 20 years.
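The ROI formula and the 15% rework threshold translate directly into a few lines. This is a minimal sketch: the function names and all input figures are hypothetical, and the `roi > 0` fallback in `recommendation` is an assumption added for the example (the article only states the rework-rate threshold).

```python
def sdd_roi(
    rework_savings: float,
    defect_escape_savings: float,
    ttm_value: float,
    implementation_cost: float,
    learning_curve: float,
) -> float:
    """ROI as defined above: net benefit divided by implementation cost."""
    if implementation_cost <= 0:
        raise ValueError("implementation_cost must be positive")
    net = (
        rework_savings
        + defect_escape_savings
        + ttm_value
        - implementation_cost
        - learning_curve
    )
    return net / implementation_cost


def recommendation(rework_rate: float, roi: float, min_rework_rate: float = 0.15) -> str:
    """Say "don't deploy" when measured rework is below the 15% threshold."""
    if rework_rate < min_rework_rate:
        return "don't deploy"
    # Assumption for this sketch: above the threshold, require a positive ROI.
    return "deploy" if roi > 0 else "don't deploy"


# Hypothetical annual figures, in the same currency unit.
roi = sdd_roi(
    rework_savings=120_000,
    defect_escape_savings=200_000,
    ttm_value=50_000,
    implementation_cost=150_000,
    learning_curve=40_000,
)
```

With these made-up inputs the net benefit is 180,000 against a 150,000 implementation cost. Keeping `ttm_value` as its own argument also makes it easy to follow the advice to report cost savings and revenue acceleration separately, by running the model once with it set to zero.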

"The real savings from SDD come from shift-left — moving defect detection to the cheapest phase, the specification phase."

Decision: don't market Companion with borrowed hype numbers — lean on the traceability artifact. Informs the bench/measurement work. High credibility (self-aware about weak data).

Formal Verification in SDD (Enterprise) · Jarek Wasowski · vision (enterprise lane)

Specifics

A test and a proof are different guarantees: tests sample, formal verification exhaustively searches the modeled state space. Because LLMs are trained to pass tests, they can satisfy a suite while hiding architectural flaws.
Rigor ladder (climb only as far as blast radius demands): assertions/types → Design by Contract → property-based testing → model checking (TLA+, Alloy) → proof assistants (Lean 4). EARS bridges prose to formal specs.
AWS has used TLA+ since 2011 (a data-loss bug appeared only in a 35-step trace and survived reviews+tests); regulatory pressure (EU DORA, EU AI Act, ISO/IEC 42001) demands a Compliance Traceability Matrix.
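The difference between sampling and exhaustive search can be shown on a toy state space. This is an illustration of the distinction only, not TLA+ or any real model checker: `midpoint_u8` is a hypothetical function with a deliberate 8-bit overflow bug, and the hand-picked samples are chosen to resemble a typical test suite.

```python
from itertools import product


def midpoint_u8(lo: int, hi: int) -> int:
    """Midpoint of two 8-bit values; the sum wraps at 256 (deliberate bug)."""
    return ((lo + hi) & 0xFF) // 2


def holds(lo: int, hi: int) -> bool:
    """Property under test: the midpoint lies between its two inputs."""
    return lo <= midpoint_u8(lo, hi) <= hi


# A sampled test suite: a handful of plausible cases, all of which pass.
sampled = [(0, 10), (3, 7), (20, 100), (50, 50)]
sampled_ok = all(holds(lo, hi) for lo, hi in sampled)

# Exhaustive search of the modeled state space: every ordered pair of 8-bit values.
counterexamples = [
    (lo, hi)
    for lo, hi in product(range(256), repeat=2)
    if lo <= hi and not holds(lo, hi)
]
```

The sampled suite passes while the exhaustive pass finds every pair whose sum overflows, which is the guarantee gap the rigor ladder is meant to close. Real systems have state spaces far too large to enumerate this way, which is where model checkers and proof assistants come in.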

"A descriptive document is a suggestion to an AI agent and evidence of negligence to a regulator."

Decision: a high-end enterprise positioning lane (traceability matrix, contract-as-artifact) — but vision, not adoption evidence (author admits "surprisingly few public case studies"). Weak near-term priority; useful if marketing to regulated buyers.

Camp E — Forecast & lock-in

Vendor Lock-In: Five Dependency Layers · Jarek Wasowski · opinion + illustration

The five layers (cheapest → most expensive to exit)

(1) API layer — trivial ("change the key, change the endpoint," OpenAI-compatible standard).
(2) Prompt vocabulary — model-specific syntax (Claude prefers XML), "weeks per workflow" to rewrite.
(3) Toolchain & spec-format — where tool differences stop being cosmetic.
(4) Eval & test layer — the author's estimate 20–50% of total migration cost, the single largest line item.
(5) Organizational — habits/knowledge, rebuilt over 3–12 months.

Pricing + evidence

Budget 15–25% of annual TCO as a migration reserve. Historical rhyme: 70% of specific CASE tools unused one year after deployment (Kemerer 1992); MDA/UML's XMI "theater of portability." Tessl ($125M) walked back "spec-centric" to "context/skills." One real defense: Markdown is plain text — but it inherits generator nondeterminism, so the test suite is the real behavioral contract.

"The cage doesn't open — it moves one floor up."

Decision: validates Companion's Markdown-in-your-repo + local .spec-context.json + bring-your-own-provider design; "test suite is the real contract" and the 15–25% TCO framing are anti-lock-in talking points. Confirms staying spec-anchored over spec-as-source.

SDD 2026–2030: Forecasts & Two Scenarios · Jarek Wasowski · cited forecast

Specifics

The trajectory is driven by economics (a16z ~10×/yr cheaper inference; Epoch ~50×/yr) + agent maturity, not an "AGI breakthrough" — which makes it falsifiable (if frontier cost stops falling, adoption flattens).
Three scale futures: solo (speed but security debt — Veracode: ~45% vuln rate in AI code vs 25–30% human), scale-up (debugging wall), enterprise (governance + portability).
Two measured risks over job loss: skill atrophy (Anthropic n=52: delegators scored ~2 grades lower on comprehension) and vendor lock-in (Zapier N=542: 81% worried, only 6% could switch without disruption).
The two 2030 scenarios: "universal intent interface" (one maintained spec generates everything, verification skill preserved) vs "waterfall comeback / new bureaucracy" (markdown nobody reads, false checklist confidence, lock-in as a cage). Four durable skills: read/evaluate specs, verify output, design intent, architectural thinking.

"You can outsource thinking, but you can't outsource understanding." — Karpathy

Decision (strongest strategic signal): if the 2030 bottleneck is human verification, Companion should lean into review/verification affordances and portability (anti-lock-in), not raw generation speed. Well-evidenced direction, speculative timeline.

The "It Works" Trap · Enrico Papalini · book teaser (short)

Note: short piece (~4-min read), essentially a promo for the author's book. Core framings worth keeping: the "70% Problem" (AI instantly gets 70% — boilerplate/logic; the final 30% — integration, edge cases, security context, architecture — "contains 100% of the risk"); the velocity trap (spikes, then grinds to a halt ~6 months later on "gobsmackingly weird" code); "treat AI output as an untrusted proposal, not a solution."

"'It works' is no longer the finish line... it is merely the starting line of a new, silent crisis."

Decision: messaging fuel, not evidence (the "180 companies" figure is unsourced here). The "verify behavior, don't just check it runs" theme aligns with Companion's verify/acceptance gates.

What this changes for Companion — the short list

Do: (1) verify & fix the Spec-Kit command-body contradictions the ICE piece names; (2) build the eval/acceptance layer (separate directional constraints from binary failure-evals); (3) keep fast-path-as-default (the data says ceremony can hurt); (4) position on context-durability + anti-lock-in (SLUMP 90%, the 5 lock-in layers).
Explore: the named gaps nobody fills well (persistent memory, cost accounting, crash recovery, and a post-ship AutoResearch-style optimize loop).
Watch: governance becoming the buying criterion.

→ The complete "what should we do" write-up: Strategic Assessment — What We're Doing vs What We Should Do (opportunities, gaps, what to adopt, ranked action plan). Also feeds the unified Trends & Demand report. Full articles in Knowledge/AI/Spec-Driven Development/articles/.

Sourcing: all 32 articles read in full from the member text (Cloudflare + paywall bridged via the process-medium-articles skill). Two pieces were genuinely short (~700 & ~500 words) and are marked. Stats are the authors' — cited studies (METR, Chroma, ETH, SLUMP, CodeRabbit) are verifiable; self-reported figures (funding, "73%", "$500M value") are flagged as reported, not independently verified.