How is this different from Smartlead or Instantly's A/B testing?

Those tools A/B at the message level - same campaign, two subject lines, pick the winner. cold.md autoresearch does it at the policy level - the artifact that defines your voice, sequence, and proof gets rewritten by its own results, with version-controlled diffs. The agent owns Subject + Opener + CTA. The human owns Voice + Proof + Banned. Earned trust unlocks more.

Why no open-rate tracking?

FoxReach intentionally doesn't track opens. Open-pixel tracking causes deliverability issues with modern inbox providers (Gmail, Outlook), so we optimize for interested-reply rate instead. It's a stronger signal anyway - opens are noise; intent in the reply is what matters.

What's the trust ladder?

When cold-learn declares a winner, it doesn't auto-edit your cold.md. It writes a proposed-diff.patch and prints git apply / rm to accept or reject. After three approved diffs in a row, auto-commit unlocks. Reject one and the streak resets. It's there so the agent earns control over your sender voice incrementally.

What's installed on my machine?

A Claude Code plugin (six skills + new ones for autoresearch) and the foxreach-cli (pip-installable). Skills shell out to the CLI for FoxReach calls; if the CLI doesn't cover an endpoint, they fall back to curl using a cached OpenAPI spec. State lives in .cold/ in your project directory - everything is git-versioned.

Back to Blog

Engineering

cold.md autoresearch: a self-improving cold-outreach loop on FoxReach

An open Claude Code plugin that runs Karpathy-style autoresearch on your cold email policy. Designs A/B tests, reads variant stats from FoxReach, applies a z-test, and proposes a cold.md diff for human review. Trust ladder unlocks auto-commit after 3 approved diffs.

Usama Navid

Founder, FoxReach

April 29, 20266 min read

cold.md autoresearch: a self-improving cold-outreach loop on FoxReach

Cold email tools have a leverage problem. You write a cold.md policy once - voice, sequence, proof, banned phrases - and the system uses it forever. Your replies, opens, bounces accumulate as data nobody reads. Six months later you're still sending the same opener that worked on day one and quietly stopped working in week three.

cold.md is an open spec for that policy file. The new release adds something specific: the policy edits itself.

This post walks through what we shipped, why it matters, and how to run it against your own FoxReach account.

What's new

A self-improving loop, structured as four stages:

Hypothesis. The agent picks the next variable to test (subject pattern → opener → CTA → cadence → tone).
Experiment. It writes a protocol with arm definitions, sample size, and a success criterion (frequentist z-test, p<0.05, |delta|>2pp).
Measure. After the minimum window, it pulls per-variant categorize-stats from FoxReach's new /api/v1/inbox/categorize-stats?groupBy=variant endpoint.
Update. If a winner is declared, it proposes a cold.md diff for human review. After three approved diffs in a row, auto-commit unlocks.

The agent never blows up your sender reputation chasing a local optimum. Bounce rate over 5% on either arm pauses the variant immediately and halts experiment progression.

Why interested-reply rate, not opens

FoxReach doesn't track opens. Open-pixel tracking is the single largest cause of deliverability problems with Gmail and Outlook in 2026 - the pixel itself is a spam signal. So the metric we optimize for is interested-reply rate: AI-categorized inbound replies tagged "interested" divided by total sends per variant.

It's a slower signal. It's also a better one. Opens measure curiosity. Interested replies measure intent.

The two new skills

Out of the ten skills in the suite, two run the loop:

/cold experiment reads .cold/config.json, picks the current tier (Tier 1 = subject lines for v0), drafts two arm specs, declares sample size and decision rule, and writes a protocol to .cold/experiments/<id>/protocol.md.

/cold learn runs after the minimum window. It pulls categorize-stats from FoxReach, runs a two-proportion z-test on interested-reply rate, checks the bounce-rate guard, decides winner / inconclusive / extend, and writes either an .cold/proposed-diff.patch (default) or applies it directly (after trust earned).

Two more skills support them: /cold offer refines the value prop via competitor + market web research, and /cold status prints a one-screen dashboard of beliefs, active experiments, and pending diffs.

The variable tier ladder

Tiers must be tested in order. Earlier tiers must reach a stable winner before later ones unlock - otherwise the agent is optimizing the CTA on top of a still-noisy subject baseline.

Tier	Variable	Min sample/arm	Time to read
1	Subject pattern	100	7 days
2	Opener template	150	10 days
3	CTA framing	200	14 days
4	Cadence	300	21 days
5	Voice tone	cohort	30+ days, manual unlock

For most v0 users, Tier 1 alone is the entire ROI: a +3 percentage-point lift in interested-reply rate over a 200-lead cohort compounds into measurable booked-call delta within a month.

What FoxReach added

Two backend changes shipped to support this:

/openapi-public.json - filtered to /api/v1/* paths only with Access-Control-Allow-Origin: *. The agent reads this on every cold start to ground itself in the live API surface. Internal routes (/api/auth, /api/admin, /api/billing, etc.) never appear.
GET /api/v1/inbox/categorize-stats - groups Reply categories (interested / not_interested / out_of_office / bounce / uncategorized) plus sent counts by variant, sequence, or day. Joins Reply.originalEmailLogId against EmailLog so each reply is correctly attributed to the variant that produced it.

Both are documented at docs.foxreach.io/api-reference. The Python CLI (pip install foxreach-cli, v0.3.0+) wraps the new endpoint as foxreach inbox categorize-stats.

The trust ladder

Auto-rewriting your sender voice is dangerous. Doing it via diffs you can git apply is fine.

When cold-learn declares a winner, it doesn't edit cold.md. It writes:

.cold/proposed-diff.patch

…and prints:

Review:  cat .cold/proposed-diff.patch
Accept:  git apply .cold/proposed-diff.patch && rm .cold/proposed-diff.patch
Reject:  rm .cold/proposed-diff.patch

A counter at .cold/trust.json tracks consecutive approvals. When you've accepted three diffs in a row, auto-commit unlocks. Reject one and the streak resets to zero.

This pattern matters for two reasons. First, you stay in the loop while the agent is learning your domain. Second, the audit trail is just git log - every policy change has a human approval and a measured experiment behind it.

Web research

Two surfaces use web search, both gated by config.

Policy-level (always-on): /cold icp validates your ICP by searching for competitor companies, target-title job postings, pain language on Reddit, and case-study patterns. /cold offer searches competitor pricing, recent funding, and differentiation gaps to refine the one-sentence value statement.

Per-lead (config flag): if you opt in at /cold init, /cold leads runs up to 2 searches per prospect (recent activity + company news) and saves findings to .cold/research/lead-personalization/<email-hash>.md. /cold draft reads these to inject specificity into the opener. Off by default - it's a real cost (one extra Claude call per lead).

Running it

# One-time setup
pip install foxreach-cli  # v0.3.0+
export FOXREACH_API_KEY=otr_...

# In your project root
mkdir my-outreach && cd my-outreach
claude  # Claude Code

> /cold init                              # 6-question wizard
> /cold icp https://your-company.com      # who
> /cold offer                             # what (with competitor research)
> /cold leads --csv ./prospects.csv       # ICP-score + import
> /cold experiment                        # design Tier 1 A/B
> /cold draft                             # variant pairs
> /cold send                              # ship via FoxReach (full pre-flight)

# Wait 7 days. Triage runs daily.
> /cold learn                             # propose cold.md diff
> /cold status                            # dashboard
> /cold report weekly                     # human digest

The full plugin install:

curl -fsSL https://cold.md/install | bash

The spec is open (CC-BY-4.0). The plugin is MIT. Source: github.com/concaption/cold-md.

What's next

Three things on the v0.3 roadmap:

Bayesian decision rule. Frequentist z-tests are honest with n≥100 per arm but punishing with n=50. Beta posteriors give you "probability A>B" at any sample size.
Multi-armed bandit mode. For workspaces happy to delegate more, replace the fixed 50/50 with adaptive weighting that shifts traffic toward the leading arm during the experiment, with exploration noise.
Per-lead deeper research. Right now per-lead search is light (2 queries). v0.3 will optionally run a deeper agent loop - LinkedIn activity, recent job change detection, public commits - for high-value leads only.

If you're running outbound and want a system that gets better instead of one that quietly degrades, give it a try. Your cold.md is portable - the agent that improves it doesn't have to be FoxReach's. We just happen to be the reference implementation.

What is an AI SDR? A 2026 Definition - the broader category cold.md autoresearch sits inside
Cold email to Google inbox in 2026 - the deliverability constraints that make interested-reply-rate the right metric
What is MCP - how Claude Code talks to FoxReach under the hood

For Agents

The complete guide to cold email for AI agents

Architectures, framework decision matrix, pattern library, and a 10-minute getting-started path. Free, no signup.

Read the pillar guide Explore integrations

Was this article helpful?

Your feedback helps us improve what we write.

Frequently asked questions

Autoresearch is a four-step loop borrowed from Karpathy's framing: hypothesis, experiment, measure, update. Applied to cold email, it means an agent picks a variable to test (subject line pattern, opener, CTA), runs an A/B against your real campaign, measures interested-reply rate from FoxReach's categorize-stats endpoint, and proposes an edit to your cold.md policy file. The cold.md plugin is the agent; FoxReach is the substrate.

Topics

cold.mdautoresearchA/B testingAI agentsClaude Codecold email

Written by

Usama Navid

Founder, FoxReach

Usama is the founder of FoxReach. He writes about cold email, AI agents, and the systems builders use to ship outbound at scale.

View all articles by Usama

Back to all posts

Share

cold.md autoresearch: a self-improving cold-outreach loop on FoxReach

What's new

Why interested-reply rate, not opens

The two new skills

The variable tier ladder

What FoxReach added

The trust ladder

Web research

Running it

What's next

The complete guide to cold email for AI agents

Was this article helpful?

Frequently asked questions

Topics

Related Posts

What is MCP (Model Context Protocol)? A 2026 Guide

FoxReach Plugin for Claude Code: Manage Outreach with Natural Language

Manage FoxReach from Your Terminal with the CLI

Stay ahead of the inbox

What's new

Why interested-reply rate, not opens

The two new skills

The variable tier ladder

What FoxReach added

The trust ladder

Web research

Running it

What's next

Related reading

The complete guide to cold email for AI agents

Was this article helpful?

Frequently asked questions

Topics

Related Posts

What is MCP (Model Context Protocol)? A 2026 Guide

FoxReach Plugin for Claude Code: Manage Outreach with Natural Language

Manage FoxReach from Your Terminal with the CLI

Stay ahead of the inbox