The Agony of the 'Helpful' Hand: Why AGENTS.md Might Be Hindering Your Coding Agents

We’ve all been told to do it. The developer ecosystem is currently awash with instructions telling us that if we want our shiny new LLM-driven coding agents to actually succeed in our codebases, we need to build them a bespoke map. Put an AGENTS.md or a CLAUDE.md file in your repository root, they say. Fill it with architectural overviews, directory explanations, and tooling requirements so the agent doesn't get lost. It sounds beautifully logical.

Except, as a rather sobering new study from researchers at ETH Zurich reveals, it happens to be mostly wrong. The paper, titled "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?", took a cold, hard look at this industry trend. By testing multiple state-of-the-art models – including Sonnet 4.5 and GPT-5.2 – across established benchmarks and a custom-built dataset called AGENTBENCH, they discovered that these context files often do more harm than good.

If you are a CTO or Engineering Manager trying to optimise your team's autonomous workflows, the reality is that providing these files typically reduces task success rates while inflating your inference costs by a cool 20% or more.

Let's look at what is actually happening under the hood when an agent reads your repository instructions – and how we can fix it.


The Reality of the Data: More Cost, Less Success

The researchers evaluated four prominent coding agents across two distinct datasets: SWE-BENCH LITE (comprising popular repositories) and AGENTBENCH (a curated set of niche repositories containing actual developer-written context files). They looked at three scenarios: providing absolutely no context files, using an LLM-generated context file, and using a human-developer-written context file.

The baseline assumption was that more context equals a straighter path to the solution. The actual results tell a completely different story.

 

Performance on Complex Repositories

The charts below illustrate the task success rates for two of the models. In most of the cases shown, automatically generating a context file using the agent's own recommended tooling dented performance; GPT-5.2 on AGENTBENCH is the exception.

 

                [ SWE-BENCH LITE SUCCESS RATES (%) ]
            Lower is worse ⬇️ (Data extrapolated from Figure 3)
─────────────────────────────────────────────────────────────────────────────
SONNET-4.5     None (No Context)  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  ~60.0%
              LLM-Generated      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  ~58.5%
─────────────────────────────────────────────────────────────────────────────
GPT-5.2        None (No Context)  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓      ~56.5%
              LLM-Generated      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓        ~54.0%

                [ AGENTBENCH SUCCESS RATES (%) ]
─────────────────────────────────────────────────────────────────────────────
SONNET-4.5     None (No Context)  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  ~73.0%
              Human-Written      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓    ~70.0%
              LLM-Generated      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓        ~65.0%
─────────────────────────────────────────────────────────────────────────────
GPT-5.2        None (No Context)  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓        ~65.0%
              Human-Written      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓      ~68.0%
              LLM-Generated      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓      ~68.0%

Averaged across the agents in the study, providing an LLM-generated context file on AGENTBENCH dropped the resolution rate by about 2%. Human-written context files did provide a modest 4% average bump over providing nothing, but that gain came at a significant operational premium.
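To make the arithmetic concrete, here is a small sketch that computes the average change against the no-context baseline using only the approximate figures in the charts above. Because the charts show just two of the agents, the results will not necessarily match the study-wide averages quoted in this article; the dictionary layout and the `mean_delta` helper are illustrative assumptions, not anything taken from the paper.

```python
# Approximate success rates (%) read off the two charts above.
# Only Sonnet 4.5 and GPT-5.2 are shown, so these averages cover two agents
# and need not match the averages reported for the full study.
SWE_BENCH_LITE = {
    "SONNET-4.5": {"none": 60.0, "llm": 58.5},
    "GPT-5.2": {"none": 56.5, "llm": 54.0},
}
AGENTBENCH = {
    "SONNET-4.5": {"none": 73.0, "human": 70.0, "llm": 65.0},
    "GPT-5.2": {"none": 65.0, "human": 68.0, "llm": 68.0},
}


def mean_delta(rates, setting):
    """Average change in percentage points versus the no-context baseline."""
    deltas = [r[setting] - r["none"] for r in rates.values() if setting in r]
    return sum(deltas) / len(deltas)


print(mean_delta(SWE_BENCH_LITE, "llm"))  # -2.0
print(mean_delta(AGENTBENCH, "llm"))      # -2.5
print(mean_delta(AGENTBENCH, "human"))    # 0.0
```

Note that these are differences in percentage points, and that for the two charted models the human-written files net out to no change at all: a three-point gain for GPT-5.2 cancels a three-point loss for Sonnet 4.5.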

The Financial Hangover

The real kicker isn't just that the success rates stagnated; it's that the agents took far longer to get there. Across the board, adding an AGENTS.md file increased the number of individual steps – interactions with the environment like running a command or editing a file – by nearly four per task in the worst case.

According to the study's precise cost calculations, this behavioural detour increased token consumption and overall inference costs by 20% to 23% on average. In other words, you are paying a premium for a worse outcome.
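A rough way to see what "a premium for a worse outcome" means is to look at spend per resolved task rather than per attempt. The sketch below assumes a hypothetical baseline of one unit of spend per task, and it applies the study's average 20% to 23% overhead to the approximate Sonnet 4.5 AGENTBENCH rates from the chart; that pairing is an assumption made for illustration, not a figure from the paper.

```python
def cost_with_context(baseline_cost, overhead):
    """Per-task cost after applying a fractional overhead (0.20 means +20%)."""
    return baseline_cost * (1.0 + overhead)


def cost_per_resolved_task(cost_per_task, success_rate):
    """Expected spend per successfully resolved task."""
    return cost_per_task / success_rate


BASELINE_COST = 1.00  # hypothetical: one unit of spend per task, no context file

for overhead in (0.20, 0.23):
    per_task = cost_with_context(BASELINE_COST, overhead)
    print(f"+{overhead:.0%} overhead -> {per_task:.2f} per task")

# Approximate chart values: 73% resolved with no file, 65% with an LLM-generated one.
without_file = cost_per_resolved_task(BASELINE_COST, 0.73)
with_file = cost_per_resolved_task(cost_with_context(BASELINE_COST, 0.20), 0.65)
print(f"per resolved task: {without_file:.2f} without, {with_file:.2f} with")
# per resolved task: 1.37 without, 1.85 with
```

Under these assumptions a 20% increase per attempt becomes roughly a 35% increase per resolved task, because the lower success rate compounds the higher cost.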

 

The Behavioural Paradox: Why Context Distracts

To understand why this happens, we have to look at how these instructions alter agent behaviour. LLMs are notoriously polite; if you give them instructions, they feel intensely compelled to follow them.

The trace analysis performed in the study mapped out the exact tool calls made by the agents when context files were introduced. The findings show a clear behavioural shift:

  • Over-Exploration: When handed a context file, agents suddenly engage in far broader exploration. They search more files (grep), read more files (cat), and run significantly more tests.
  • Tool Obsession: If a specific tool like uv or a bespoke project script is mentioned in the AGENTS.md, its usage skyrockets. The tool is often called multiple times per instance, compared with practically zero usage when it is not mentioned.
  • The Overviews Don't Help: You might think a high-level repository summary helps an agent find the right file faster. The data shows that the number of steps an agent takes before its very first interaction with a relevant file does not decrease with a context file; if anything, it slightly increases.

 

           [ INCREASE IN AVERAGE TOOL CALLS WITH CONTEXT FILES ]
          Higher means more redundant exploration (Data from Figure 6)
─────────────────────────────────────────────────────────────────────────────
pytest (Testing)   ████████████████████████████████████████ (Highest Increase)
Read/cat (Files)   █████████████████████████████
Grep/rg (Search)   ████████████████████
uv/repo_tool       ███████████████
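If you want to check whether your own agents show the same pattern, a trace comparison along these lines is a reasonable starting point. This is a minimal sketch, assuming a trace is simply a list of shell command strings; the category mapping, the two sample traces, and the file paths are all hypothetical, and the study's own trace analysis may be structured quite differently.

```python
import shlex
from collections import Counter

# Hypothetical mapping from a command's first token to a tool category.
CATEGORIES = {
    "pytest": "testing",
    "cat": "read",
    "grep": "search",
    "rg": "search",
    "uv": "repo_tool",
}


def count_tool_calls(commands):
    """Count calls per category, classifying each command by its first token."""
    counts = Counter()
    for command in commands:
        tokens = shlex.split(command)
        first = tokens[0] if tokens else ""
        counts[CATEGORIES.get(first, "other")] += 1
    return counts


def steps_before_first_hit(commands, relevant_paths):
    """Number of steps taken before the first command that names a relevant file."""
    for step, command in enumerate(commands):
        if any(path in command for path in relevant_paths):
            return step
    return None  # the agent never touched a relevant file


# Two invented traces for the same task, without and with a context file.
WITHOUT_CONTEXT = [
    "grep -rn parse_date src",
    "cat src/dates.py",
    "pytest tests/test_dates.py",
]
WITH_CONTEXT = [
    "cat AGENTS.md",
    "uv sync",
    "grep -rn parse_date src",
    "cat src/utils.py",
    "cat src/dates.py",
    "uv run pytest",
    "pytest tests/test_dates.py",
]

extra = count_tool_calls(WITH_CONTEXT) - count_tool_calls(WITHOUT_CONTEXT)
print(dict(extra))  # {'read': 2, 'repo_tool': 2}
print(steps_before_first_hit(WITHOUT_CONTEXT, ["src/dates.py"]))  # 1
print(steps_before_first_hit(WITH_CONTEXT, ["src/dates.py"]))     # 4
```

Classifying by first token means `uv run pytest` counts as a repository tool call rather than a test run; whether that is the right choice depends on what you want to measure.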

Essentially, by trying to be helpful and outlining your engineering standards, directory trees, and linting preferences, you are inadvertently handing the agent an existential crisis. 

Instead of directly diagnosing the bug, the agent spends its token budget meticulously reading through files it doesn't need to touch, verifying its environment setup, and running redundant quality checks just because the AGENTS.md file implied it should. 

It behaves like an over-eager junior developer who reads the entire company wiki before writing a single line of code.

 

The Documentation Redundancy Factor

There is a fascinating silver lining in the study's ablation tests. The researchers ran an experiment where they aggressively stripped out all existing project documentation – deleting all other .md files, example code, and the entire /docs folder.

In this artificially barren landscape, suddenly the LLM-generated context files shone, improving performance by an average of 2.7%.

What does this tell us? It suggests that for any reasonably mature codebase that already possesses standard documentation, an automated AGENTS.md file is largely redundant. The agent is already perfectly capable of pulling repository context from your existing files. Forcing it to ingest a dedicated, LLM-generated summary simply pollutes its context window with duplicate information, increasing the cognitive load – and the adaptive reasoning tokens required to sort through it.
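One crude way to gauge this redundancy in your own repository is to measure how much of the context file's vocabulary already appears in your existing documentation. The heuristic below is purely an illustrative assumption, not the study's method: the `redundancy_ratio` function, the four-character word cut-off, and the sample strings are all invented for this sketch.

```python
import re


def vocabulary(text):
    """Distinct lowercase words of four or more characters."""
    return set(re.findall(r"[a-z][a-z0-9_\-]{3,}", text.lower()))


def redundancy_ratio(context_file, existing_docs):
    """Share of the context file's vocabulary already present in existing docs."""
    context_words = vocabulary(context_file)
    if not context_words:
        return 0.0
    doc_words = set().union(*(vocabulary(doc) for doc in existing_docs))
    return len(context_words & doc_words) / len(context_words)


# Invented sample texts standing in for AGENTS.md and a README.
CONTEXT = "Use uv for dependencies. Tests live in tests and run with pytest."
DOCS = ["Install dependencies with uv. Run tests with pytest."]

print(redundancy_ratio(CONTEXT, DOCS))  # 0.8
```

A high ratio is only a hint that the file restates what the agent can already find; word overlap says nothing about whether the remaining content is useful.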

 

Post-Script: A Direct Blueprint – What to Do vs. What Not to Do

To leverage codebase context files successfully without tanking execution rates and blowing your compute budget, you must stop treating an AI like a human engineer. Humans need descriptive background prose; agents need strict, minimalist, executable guardrails.

 

❌ What Not to Do (The Conversational Essay)

Avoid high-level summaries, redundant directory layout definitions, or ambiguous stylistic guidance.

# Welcome to the Repository!
This platform is a complex, microservices-driven framework. The core logic is located inside the `/src` directory, while `/tests` houses our comprehensive test suites.

When you pick up a task, please ensure you look through the related architectural patterns in our middleware files before editing code. We like clean code and highly recommend you run formatting manually via our scripts.


Why this fails: The agent wastes steps searching the directory tree just because you mentioned it. Phrases like "look through the patterns" invite boundless, token-guzzling exploration that delays actual code modification.

 

✅ What to Do Instead (The Executable Command Rail)

Restrict your context files entirely to concrete configuration variables, rigid tool declarations, and precise verification loops.

# Coding Agent Instructions

## Environment & Tooling Constraints
- Package Manager: Always use `uv` for dependency resolution. Do not use standard `pip`.
- Active Interpreter: Use `.venv/bin/python`.

## Mandatory Quality Gates
Before finalizing any code modification, you must run exactly these verification gates:
1. Syntax & Style: `uv run ruff check path/to/modified_file.py`
2. Test Run: `uv run pytest path/to/test_for_modified_file.py`

Why this works: It provides zero conversational noise or fluff. It plays directly into an agent's strict instruction-following mechanics, allowing it to move smoothly to verification without losing time exploring irrelevant files.
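The contrast between the two styles can be turned into a simple pre-commit check. The sketch below flags open-ended phrases of the kind used in the conversational example and counts backticked commands; the phrase list, function names, and sample strings are hypothetical choices for illustration, and you would want to tune them to your own files.

```python
import re

# Hypothetical list of phrases that tend to invite open-ended exploration.
VAGUE_PHRASES = (
    "look through",
    "please ensure",
    "we like",
    "highly recommend",
    "welcome to",
)


def lint_context_file(text):
    """Return (line_number, phrase) pairs for each vague phrase found."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                findings.append((number, phrase))
    return findings


def count_inline_code(text):
    """Count backticked spans, a rough proxy for concrete, executable guidance."""
    return len(re.findall(r"`[^`\n]+`", text))


ESSAY = """# Welcome to the Repository!
Please ensure you look through the related architectural patterns before editing code.
We like clean code and highly recommend you run formatting manually."""

RAIL = """- Package Manager: Always use `uv`. Do not use `pip`.
- Test Run: `uv run pytest path/to/test_file.py`"""

print(len(lint_context_file(ESSAY)), count_inline_code(ESSAY))  # 5 0
print(len(lint_context_file(RAIL)), count_inline_code(RAIL))    # 0 3
```

Substring matching is deliberately naive here; it is enough to catch the obvious offenders without pretending to judge the quality of the instructions.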

Future-Proofing Beyond Code: The Infrastructure Layer

Solving the agent distraction problem at the codebase level is only half the battle. If your data scientists and engineering teams are manually wrestling with underlying cloud clusters, servers, or complex physical configurations just to get their code running, your overall time-to-market will stall anyway.

This is precisely why we developed Athena MLWorkspaces. Built on cutting-edge platform engineering principles, Athena acts as an intelligent intermediary that entirely abstracts away the infrastructure head-scratchers. 

Your data scientists simply declare what their model needs and provide their code; our intelligent automation smoothly provisions the correct components and specialised accelerators, saving you thousands of dollars in static resource costs.

Let's clear the clutter from your development lifecycle, from your AGENTS.md instructions all the way to your cloud environments.

 


Ready to turn your AI bottlenecks into lean opportunities? Contact Mesoform today to unlock your AI potential with Athena MLWorkspaces, or sign up for our platform engineering insights below.