What Breaks When AI Agents Win?

The AI adoption story is starting to split in two.

On one side, almost everyone has tried the tools. Anthropic’s May 27 survey of quantitative social scientists found that 81% of respondents had used generative AI in their research process. That number feels familiar now. AI trial has become normal. People ask for drafts, code, summaries, comparisons, edits, and second opinions without needing a transformation program to tell them it is allowed.

On the other side, regular agent adoption is still much thinner. In the same survey, only 20% of respondents were using coding agents at least weekly. That gap matters because chat use and agent use are not the same operating behavior. A chatbot answers inside a conversation. An agent carries a piece of work. The first can spread through curiosity. The second requires trust.

The industry still talks as if the adoption gap is mostly a model problem, a training problem, or a change-management problem. Those are real, but they are not the deepest cause. The more practical diagnosis is simpler: most teams have not built reusable operating context.

They have access to AI. They do not yet have an agent-ready work environment.

The prompt is not the unit of delegation

Anthropic’s Claude Cowork product page points directly at the pressure knowledge workers are feeling. It says most AI assistants require users to break work into individual prompts, while Cowork “takes the outcome and handles the rest.” That pitch lands because it names the exhaustion underneath the current AI workflow. People do not want to babysit every step. They want to hand off a result.

That desire is reasonable. It is also where many agent rollouts get sloppy.

A person can survive a vague prompt because the person carries private context. They know the client, the spreadsheet, the boss’s preference, the weird exception from last quarter, the naming convention nobody wrote down, and the reason one source is more trusted than another. When they prompt a chatbot, they unconsciously fill those gaps. If the answer is wrong, they steer it. If the work is incomplete, they repair it. The missing context becomes human labor.

An agent exposes that hidden labor. Once the work is delegated as an outcome, the agent needs the job, the rules, the examples, the allowed tools, the stopping points, and the standard for done. If those are not packaged, the agent has to infer them from fragments. Sometimes it guesses correctly. Sometimes it produces a plausible artifact that still requires enough review to erase the time savings.

That is the context gap: the distance between the work humans know how to do implicitly and the work an agent can perform explicitly.

The gap does not close with more enthusiasm. It closes when the team turns implicit operating knowledge into reusable context assets.

Salesforce just named the production issue

Salesforce’s May 27 post on becoming “truly agentic” is useful because it is not written from the outside looking in. It describes an engineering organization that pushed agentic tools hard enough to learn where the real constraints show up.

The headline numbers are impressive: work items completed per developer up 50.8% year over year, PRs merged per developer up 79%, and a 33-endpoint migration compressed from an estimated 231 person-days to 13 days. Those are the numbers people will quote. The more important part is the operating substrate underneath them.

Salesforce says it built “the governance scaffolding, the measurement infrastructure, and the workflows to make it real.” That sentence is the article. The agents did not become useful in a vacuum. They became useful inside a system of rules, measurements, patterns, and feedback loops.

The same post names the fragile part directly. It says context management in long, complex agentic sessions remains a craft. The quality of CLAUDE.md files, described as persistent context configurations that orient Claude to a codebase, conventions, and constraints, “varies widely across teams,” and that variance “matters a lot for output quality.”

That is the production lesson hiding inside the adoption story. Context quality has become output quality.

A CLAUDE.md file is not just a note for a model. In a serious agent environment, it is closer to a work instruction, onboarding packet, policy surface, and quality-control artifact compressed into one place. It tells the agent what world it is operating in. If one team maintains that artifact carefully and another team treats it like a stale README, their agents will not perform the same way even if they use the same model.

The model is shared. The operating context is not.

That is why agent reliability will diverge between organizations faster than many buyers expect. The winners will not merely have better prompts. They will have better context infrastructure.

Reusable context is an operating asset

The useful move is to stop treating context as chat decoration.

Context is not everything the agent might need. That is just a bigger mess. Reusable operating context is the selected information that lets a specific agent produce a specific product under known constraints. It is narrow enough to guide action and stable enough to improve over time.

A good context asset answers practical questions before the agent begins:

What product is this agent expected to produce?
Which examples show good work and bad work?
Which sources, systems, or files are authoritative?
Which tools can the agent use?
What authority does it have?
What should it never do alone?
What counts as an exception?
What evidence should it leave for review?
Which metric proves the work improved?

Those questions sound basic because they are basic. That is the point. Agents fail in expensive ways when the basics stay private inside human heads.
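
One way to make those questions concrete is to treat the context asset as a structured record and check it for gaps before the agent runs. The sketch below is illustrative only. The ContextAsset class, its field names, and the missing_answers helper are assumptions made for this example, not part of any product or of the CLAUDE.md format.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ContextAsset:
    """Hypothetical record with one field per question in the list above."""
    product: str = ""  # what the agent is expected to produce
    good_examples: list[str] = field(default_factory=list)
    bad_examples: list[str] = field(default_factory=list)
    authoritative_sources: list[str] = field(default_factory=list)
    allowed_tools: list[str] = field(default_factory=list)
    authority: str = ""  # what the agent may decide on its own
    never_alone: list[str] = field(default_factory=list)
    exception_rules: list[str] = field(default_factory=list)
    review_evidence: list[str] = field(default_factory=list)
    success_metric: str = ""


def missing_answers(asset: ContextAsset) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(asset) if not getattr(asset, f.name)]


# Illustrative values only; a real asset would be written by the team that owns it.
asset = ContextAsset(
    product="sourced competitor brief with confidence labels",
    allowed_tools=["web_search"],
    success_metric="share of briefs accepted without rework",
)
gaps = missing_answers(asset)
print(gaps)
```

Here three of the ten fields are filled in, so the check reports the other seven as open questions. A list like that is a reasonable signal that the work is not yet packaged for delegation.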

For a coding agent, the context asset might include architecture notes, style rules, test commands, deployment boundaries, security constraints, and examples of accepted pull requests. For a professional-services research agent, it might include client voice, citation standards, confidentiality boundaries, reasoning format, review expectations, and a list of sources that are not acceptable. For a local-service booking agent, it might include qualification rules, service areas, pricing boundaries, escalation triggers, and how to handle callers who ask two things at once.

The form can vary. It may be a CLAUDE.md file, a skill, a runbook, a checklist, a folder of examples, a policy document, or a structured memory layer. The name matters less than the operating role.

If the agent needs it every time, it should not live only in someone’s head.

Adoption fails when context has no owner

The adoption gap is not only a user behavior problem. It is an ownership problem.

Most organizations can get individuals to experiment with AI. That is easy because experimentation is personal. A motivated person finds a tool, learns the quirks, builds their own private prompt rituals, and gets a productivity bump. The organization sees activity and mistakes it for adoption.

Real agent adoption starts when the work can be repeated by someone else, inspected by a manager, improved by a team, and trusted by the business. That requires ownership. Someone has to own the agent’s job definition. Someone has to own the context asset. Someone has to own the review gate. Someone has to decide when a correction becomes a permanent instruction, when it becomes a better example, and when it reveals that the agent should not have had that job yet.

Without that ownership, context rots quickly. The business changes a process, but the agent still sees the old rule. A manager corrects one output, but the correction never becomes part of the reusable pattern. A tool permission expands for a demo and never gets narrowed again. A workaround becomes normal. The agent keeps working, but trust quietly declines.

That is when teams say the agent is unreliable. Sometimes the model really is the issue. More often, the system around the model has no memory discipline.

A practical agent program needs a context maintenance loop. Every serious correction should be sorted into one of three buckets. If the agent lacked a rule, update the rule. If the agent lacked an example, add the example. If the agent lacked authority or should have escalated, change the route. The correction should improve the operating asset, not become another private warning someone repeats in meetings.
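
The three-bucket sort can be sketched in a few lines. This is a minimal illustration under stated assumptions: the Gap enum, the dictionary keys, and the apply_correction function are hypothetical names, and a real team would likely store the asset in files or a document system rather than an in-memory dictionary.

```python
from enum import Enum


class Gap(Enum):
    """Hypothetical labels for the three buckets described above."""
    RULE = "rule"        # the agent lacked a rule
    EXAMPLE = "example"  # the agent lacked an example
    ROUTE = "route"      # the agent lacked authority or should have escalated


# Where each kind of correction lands in the asset, and the action it implies.
TARGETS = {
    Gap.RULE: ("rules", "update the rule"),
    Gap.EXAMPLE: ("examples", "add the example"),
    Gap.ROUTE: ("escalation_routes", "change the route"),
}


def apply_correction(asset: dict, gap: Gap, note: str) -> str:
    """Record the correction in the matching part of the asset and return the action taken."""
    key, action = TARGETS[gap]
    asset.setdefault(key, []).append(note)
    return action


# Illustrative data only.
asset = {"rules": ["Quote prices in USD only"]}
action = apply_correction(asset, Gap.ROUTE, "Refund requests over the limit go to a human")
print(action, asset)
```

The design choice worth noting is that every correction has to land somewhere in the asset. A correction that cannot be assigned to a bucket is itself a prompt to ask whether the agent should hold that job yet.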

That loop is how agents compound their gains instead of merely repeating the same behavior.

What to build before scaling agents

The next useful question is not “Which agent platform should we buy?” The better question is “Which pieces of work are clear enough to package?”

Start with one workflow that already has volume, visible pain, and bounded risk. Write the agent’s job description in product language. Not “help with sales.” The product might be “qualified inbound lead packet with source, need, urgency, missing data, and recommended next action.” Not “support research.” The product might be “sourced competitor brief with three decision-useful implications and confidence labels.” Product language gives the agent a finish line.

Then create the context asset. Include the smallest set of rules, examples, tools, permissions, and stop conditions required to produce that product. Do not dump the whole company into the context window. The agent does not need a library. It needs a job environment.

Add the review gate before the agent goes live. Decide what a human must inspect, what the agent can complete alone, and what evidence must travel with the output. The review gate should become lighter as the agent proves itself, but it should never be imaginary. Trust is earned by accepted work, not by confident completion messages.
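
A review gate like this can be written down as an explicit decision rule rather than left to habit. The sketch below is one possible shape, not a prescribed design. The Output class, the three decision labels, and the review_decision function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """Hypothetical description of one piece of agent work."""
    kind: str
    evidence: list[str]
    touches_restricted_action: bool = False


def review_decision(output: Output, required_evidence: list[str],
                    auto_approve_kinds: set[str]) -> str:
    """Decide whether the work goes back, goes to a human, or completes alone."""
    missing = [item for item in required_evidence if item not in output.evidence]
    if missing:
        return "return_to_agent"  # the evidence must travel with the output
    if output.touches_restricted_action or output.kind not in auto_approve_kinds:
        return "human_review"     # a human must inspect this
    return "auto_complete"        # the agent has earned this one


# Illustrative values only.
draft = Output(kind="competitor_brief", evidence=["sources", "confidence_labels"])
decision = review_decision(
    draft,
    required_evidence=["sources", "confidence_labels"],
    auto_approve_kinds={"competitor_brief"},
)
print(decision)
```

Making the gate lighter over time then means adding a kind of work to auto_approve_kinds after it has a record of accepted outputs. The gate itself stays in place.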

Finally, measure the adoption stat that matters. Do not count prompts. Do not count generated artifacts. Count accepted outputs, reduced rework, faster cycle time, cleaner handoffs, fewer dropped exceptions, and decisions made with better evidence.
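
As a rough illustration of counting accepted work instead of activity, the sketch below summarizes a small log of agent outputs. The record fields (accepted, rework_minutes, cycle_hours) and the adoption_metrics function are assumptions made for this example, and the sample data is invented purely to show the arithmetic.

```python
from statistics import median


def adoption_metrics(records: list[dict]) -> dict:
    """Summarize accepted work rather than raw activity."""
    if not records:
        return {"accepted_rate": 0.0, "rework_rate": 0.0, "median_cycle_hours": 0.0}
    accepted = [r for r in records if r["accepted"]]
    # Rework is counted only on outputs that were eventually accepted.
    reworked = [r for r in accepted if r["rework_minutes"] > 0]
    return {
        "accepted_rate": len(accepted) / len(records),
        "rework_rate": len(reworked) / len(accepted) if accepted else 0.0,
        "median_cycle_hours": median(r["cycle_hours"] for r in records),
    }


# Invented sample log, for illustration only.
records = [
    {"accepted": True, "rework_minutes": 0, "cycle_hours": 2},
    {"accepted": True, "rework_minutes": 15, "cycle_hours": 4},
    {"accepted": True, "rework_minutes": 0, "cycle_hours": 6},
    {"accepted": False, "rework_minutes": 0, "cycle_hours": 8},
]
metrics = adoption_metrics(records)
print(metrics)
```

Nothing in this summary depends on how many prompts were sent or how many artifacts were generated. It only looks at what happened to the work after a human saw it.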

That is the difference between AI activity and operating capacity.

Bottom line

The market is moving from prompt assistance to outcome delegation. That is the right direction, but it raises the standard. If an agent is going to carry work, the business has to give it more than a goal and a tool login.

It needs reusable operating context.

The adoption gap will close first for teams that treat context as infrastructure: owned, maintained, tested, and improved from every correction. The teams that skip that layer will keep seeing the same pattern. Lots of AI trial. Occasional impressive demos. Thin agent adoption. Too much human rework.

Agents that actually work do not just have better models behind them. They have clearer work in front of them.

Sources

Anthropic, “Coding agents in the social sciences,” published May 27, 2026. https://www.anthropic.com/research/coding-agents-social-sciences
Anthropic, “Claude Cowork by Anthropic,” accessed June 2, 2026. https://www.anthropic.com/product/claude-cowork
Salesforce, “Pioneering the Agentic Shift Within Salesforce Engineering,” published May 27, 2026. https://www.salesforce.com/news/stories/how-engineering-became-agentic/

Stephen Nickerson.
Built for operators who need AI agents they can test, trust, and improve.