A core tenet of the SpecOps methodology is that human experts are critical in the legacy modernization process. Domain specialists who understand the business rules embedded in decades-old systems, policy experts who can verify that a specification actually captures what the code does, and practitioners who know where the bodies are buried in a 40-year-old COBOL codebase — these people are not incidental to SpecOps. They are the point.

A recently published benchmark study adds a new dimension to that argument. The claim is not identical, but it is closely related: human experts are also the ones who have to author the procedural knowledge that makes AI agents effective in the first place. And the data on what happens when you try to skip that step is pretty clear.

The SkillsBench Findings

A research team published a paper in February 2026 called SkillsBench, the first systematic benchmark designed to measure how much agent Skills actually improve performance. They tested 84 tasks across 11 domains using seven different AI model and agent configurations, running over 7,300 trajectories. Each task was evaluated three ways: with no Skills, with curated human-authored Skills, and with Skills the agent generated for itself.

The headline finding: curated Skills improved average pass rates by 16.2 percentage points. That's a meaningful lift across a diverse set of tasks and models. But the finding that I keep coming back to is the one about self-generated Skills.

When agents were prompted to write their own procedural knowledge before attempting tasks, performance was essentially flat compared to having no Skills at all: negative 1.3 percentage points on average. In several configurations it was meaningfully worse. Only one model showed any improvement from self-generated Skills, and even that gain was modest.

The researchers identified two failure modes. First, models identify that domain-specific knowledge is needed but generate vague, imprecise procedures — they gesture at the right concepts without providing actionable guidance. Second, for tasks requiring highly specialized knowledge, models often fail to recognize they need specialized Skills at all and attempt solutions using general-purpose approaches. They don't know what they don't know.

This is the structural argument for human-curated instruction sets, now with empirical support.

Why Domain Expertise Can't Be Reconstructed on Demand

The domain-level breakdown in the paper is worth paying attention to. Healthcare tasks improved by 51.9 percentage points with curated Skills. Manufacturing improved by 41.9 points. Software engineering — a domain heavily represented in model training data — improved by only 4.5 points.

The pattern is clear: the more specialized and underrepresented a domain is in model training data, the more curated Skills matter. Models already know a lot about software engineering. They know comparatively little about clinical data harmonization workflows, manufacturing job shop scheduling, or USGS flood frequency analysis methodology using Log-Pearson Type III distributions.

Government legacy systems fit the high-benefit profile well. Benefits eligibility rules, unemployment insurance claim processing, tax calculation logic, professional licensing workflows — these are specialized domains with procedural knowledge that accumulated over decades in agencies and in the heads of people who worked there. That knowledge is not sitting in any training dataset waiting to be reconstructed on demand.

A COBOL programmer who has spent years maintaining a state unemployment system knows which patterns trip up anyone new to the codebase, which business rules are buried in places you wouldn't expect, which edge cases have caused problems before. Instruction sets that capture that knowledge are valuable precisely because the AI cannot reproduce it from scratch. The benchmark now gives us a quantitative sense of how valuable.

Less Is More

One finding in the paper has direct implications for how to build instruction sets: more content isn't better. Tasks with 2-3 Skills showed the largest improvement. Tasks with 4 or more Skills showed much smaller gains — and comprehensive Skills documentation actually hurt performance relative to having no Skills at all.

The explanation the researchers offer is that excessive Skills content creates cognitive overhead. Agents struggle to extract relevant information from lengthy documentation, and overly elaborate Skills consume context budget without providing actionable guidance.

This argues against the instinct to document everything you know. The goal isn't to write the most comprehensive instruction set possible — it's to distill what matters into focused, actionable guidance. Concise, stepwise procedures with at least one working example consistently outperformed exhaustive documentation.
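To make that concrete, here is a sketch of what a focused instruction set might look like in the SKILL.md convention used by the agent-skills ecosystem the paper surveyed: a short YAML frontmatter block naming the Skill, then a stepwise procedure with one worked example. Everything below is hypothetical and invented for illustration (the Skill name, the directory layout, the specific COBOL guidance); it shows the shape of "concise and focused," not the content of any real agency's instruction set.

```markdown
---
name: cobol-eligibility-rules
description: How to trace a benefits-eligibility business rule to its source
  in this agency's COBOL codebase. Use when asked where a rule is implemented.
---

# Tracing an eligibility rule

1. Search the copybooks first: shared record layouts (here assumed to live
   in COPYLIB/) define the fields the rule operates on.
2. Follow the 88-level condition names; they usually encode the policy
   thresholds, not the paragraphs that test them.
3. Check the PERFORM chains out of the main eligibility paragraph before
   concluding that a rule is dead code.

## Example

A threshold like `88 INCOME-OVER-LIMIT VALUE 2500 THRU 999999.` is the
rule itself; the paragraph that tests the condition is just plumbing.
```

Note what this sketch leaves out: no codebase history, no exhaustive catalog of every module, no restated COBOL reference material. It captures the three things a newcomer would not figure out quickly, which is exactly the kind of distillation the benchmark found effective.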

I've updated the SpecOps instruction sets guide on GitHub to reflect this. The maturity model in that document previously could have been read as suggesting that more comprehensive instruction sets are better instruction sets. That framing is now clarified: Level 3 maturity means refined and focused, not exhaustive.

The Community Repository Is Already Happening

Something else in the paper caught my attention. The researchers analyzed the existing public Skills ecosystem and found tens of thousands of Skills already available in public repositories tagged with agent-skills conventions, with growth surging in early 2026. The community repository model that SpecOps envisions for government instruction sets isn't speculative — it's a pattern that has already taken hold at scale in the broader developer community.

The government-specific version of this — a shared repository of instruction sets for COBOL comprehension, mainframe environments, benefits domain patterns, tax calculation logic, and the rest of the legacy system catalog — is still largely unbuilt. But the template exists and it works. Agencies starting SpecOps projects today should look for existing instruction sets in public repositories before building from scratch. Someone else may have already done the hard work of figuring out how to get AI agents to reason clearly about the platform you're working with.

A Note on the Research Itself

We need to be appropriately careful about how much weight to put on any single study, including this one. This is early empirical work in a fast-moving area. Model capabilities are improving. The Skills ecosystem is growing and the quality distribution of available Skills will shift. Future research may refine these findings, add nuance, or identify conditions where the pattern looks different.

What feels durable is the structural argument: human domain expertise encodes things that models cannot reconstruct on demand, and instruction sets are how that expertise gets made available to AI agents at inference time. The SkillsBench findings support that argument. They don't settle it permanently, but they give it empirical grounding it previously lacked.

For agencies considering whether to invest in instruction set development as part of a SpecOps modernization effort, that's a meaningful data point. The benefit is real, the domains most relevant to government show the largest gains, and the knowledge being captured genuinely cannot be replaced by asking the AI to figure it out for itself.

The book, The SpecOps Method: A New Approach to Modernizing Legacy Technology Systems, is now available on Amazon.