Inside the AutoResearch Lab: how we use it to continuously improve NataPulse

Every score you see on natapulse.com (event importance, cluster ranking, narrative confidence) comes from decision logic with tunable parameters. Until recently, tuning them meant a human changing a number and watching what happened. The AutoResearch Lab replaces that with something more disciplined: a private experimentation control plane that tests bounded configuration candidates against frozen evaluation sets built from real production data, and hands a winner to production only through a human-reviewed pull request. It went live in read-only mode on 2026-06-29. This is how NataPulse gets better without gambling on the live site.

What the Lab actually is

The Lab is not an autonomous agent rewriting NataPulse. It is a control plane, isolated from the public product, that does one narrow thing: given a piece of decision logic — how events are scored for importance, how clusters are ranked, how narratives are detected, which quant signals get promoted, how the data-source budget is allocated — it proposes small, bounded variations and measures them against evaluation sets frozen from real production history.

“Bounded” is structural, not aspirational. Each experiment campaign declares a fixed mutable surface: the only parameters a candidate may touch, each constrained to numbers, enumerations, or normalized weight vectors. A candidate cannot contain code, paths, URLs, or credentials, because the format does not allow them. Candidates are generated deterministically, one dimension at a time, so every result is reproducible and every difference is attributable.
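
To make the idea of a fixed mutable surface concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the Lab's actual code: the class names, the parameter names (importance_decay, rank_tiebreak), and every value are hypothetical, and normalized weight vectors are left out for brevity. It shows how a typed surface can rule out anything but bounded numbers and enumerations, and how varying one dimension at a time in a fixed order gives a reproducible candidate list.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple, Union


@dataclass(frozen=True)
class NumberParam:
    """A numeric parameter confined to a stepped range (hypothetical type)."""
    baseline: float
    low: float
    high: float
    step: float


@dataclass(frozen=True)
class EnumParam:
    """A parameter confined to a fixed set of choices (hypothetical type)."""
    baseline: str
    choices: Tuple[str, ...]


ParamSpec = Union[NumberParam, EnumParam]
Candidate = Dict[str, Union[float, str]]

# Hypothetical surface for one campaign; names and values are invented.
SURFACE: Dict[str, ParamSpec] = {
    "importance_decay": NumberParam(baseline=0.5, low=0.25, high=0.75, step=0.25),
    "rank_tiebreak": EnumParam(baseline="recency", choices=("recency", "source_count")),
}


def baseline(surface: Dict[str, ParamSpec]) -> Candidate:
    """The configuration in which every parameter keeps its baseline value."""
    return {name: spec.baseline for name, spec in surface.items()}


def allowed_values(spec: ParamSpec) -> List[Union[float, str]]:
    """Enumerate every value the spec permits, in a fixed order."""
    if isinstance(spec, NumberParam):
        steps = int(round((spec.high - spec.low) / spec.step))
        return [round(spec.low + i * spec.step, 10) for i in range(steps + 1)]
    return list(spec.choices)


def candidates(surface: Dict[str, ParamSpec]) -> Iterator[Candidate]:
    """Yield candidates that differ from the baseline in exactly one parameter."""
    base = baseline(surface)
    for name in sorted(surface):  # sorted names keep the order deterministic
        for value in allowed_values(surface[name]):
            if value != base[name]:
                yield {**base, name: value}


for candidate in candidates(SURFACE):
    print(candidate)
```

Because each candidate changes a single parameter, any metric difference from the baseline can be attributed to that one change.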

Nothing the Lab does touches the live site. Its adapters are pure, read-only overlays over the real production modules — at baseline they collapse to exactly the constants running in production — and all of its writes land in its own isolated tables.
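
One way to picture an overlay that collapses to production at baseline is sketched below. This is an assumption-laden illustration: PRODUCTION_CONSTANTS, effective_config, and the two constants are hypothetical stand-ins, not names from the NataPulse codebase.

```python
from types import MappingProxyType
from typing import Dict, Mapping

# Hypothetical stand-in for constants defined in a production module.
# MappingProxyType gives a read-only view, so the overlay cannot mutate it.
PRODUCTION_CONSTANTS: Mapping[str, float] = MappingProxyType({
    "importance_decay": 0.5,
    "min_cluster_size": 3.0,
})


def effective_config(candidate: Mapping[str, float]) -> Dict[str, float]:
    """Return the production constants with the candidate's overrides applied.

    A new dict is built on every call; the production mapping is never changed.
    """
    unknown = set(candidate) - set(PRODUCTION_CONSTANTS)
    if unknown:
        raise KeyError(f"not a known parameter: {sorted(unknown)}")
    return {**PRODUCTION_CONSTANTS, **candidate}


# With no overrides, the overlay is exactly the production configuration.
print(effective_config({}) == dict(PRODUCTION_CONSTANTS))
```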

Four campaign families

The Lab launched with four campaign families, each aimed at a surface users actually feel:

Cluster and narrative fidelity — the complete v1 campaign. It tunes event importance, cluster ranking, and narrative detection, measuring clustering quality (pairwise and B-cubed F1, ranking quality at the top of the feed), narrative precision and recall, and lead time. The false-merge rate — two unrelated stories fused into one cluster — is a hard gate, not a trade-off.
Deep-research quality versus cost — evaluates completed Deep Research runs on factuality, citation entailment, evidence coverage, contradiction handling, and confidence calibration, alongside cost and latency. One hard gate is absolute: zero financial instructions. NataPulse produces research evidence, never trade recommendations, and no efficiency gain can buy that constraint back.
Quant-signal promotion precision — replays the policy that decides which quantitative signals become public events, over historical market data. It measures promoted-signal precision, anomaly recall, false promotions per asset-day, and stability across five market regimes. Profit is explicitly never an objective: the campaign optimizes for whether a promoted signal was genuine evidence, not whether trading it would have paid.
Data-source portfolio allocation — replays how ingest budget is spread across source families (social, news, SEC filings, market, on-chain), measuring unique material events per dollar, lead time, and duplicate rate. Hard gates include zero scraping or paywall bypass — a rule the wider ingest pipeline already holds.
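
For readers unfamiliar with the B-cubed F1 named in the first campaign, the sketch below computes it in its standard textbook form. Whether the Lab uses exactly this variant is an assumption, and the item and cluster names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def bcubed(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the standard B-cubed definition.

    predicted maps each item to its system cluster; gold maps each item to its
    reference story. Both must cover the same items.
    """
    pred_members: Dict[str, Set[str]] = defaultdict(set)
    gold_members: Dict[str, Set[str]] = defaultdict(set)
    for item, cluster in predicted.items():
        pred_members[cluster].add(item)
    for item, story in gold.items():
        gold_members[story].add(item)

    precision_sum = 0.0
    recall_sum = 0.0
    for item in predicted:
        same_cluster = pred_members[predicted[item]]
        same_story = gold_members[gold[item]]
        overlap = len(same_cluster & same_story)  # always >= 1: the item itself
        precision_sum += overlap / len(same_cluster)
        recall_sum += overlap / len(same_story)

    precision = precision_sum / len(predicted)
    recall = recall_sum / len(predicted)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Item "c" is wrongly merged into the cluster holding story-1's items.
gold = {"a": "story-1", "b": "story-1", "c": "story-2", "d": "story-2"}
predicted = {"a": "k1", "b": "k1", "c": "k1", "d": "k2"}
print(bcubed(predicted, gold))
```

In this toy case the wrong merge lowers precision to 2/3 and the resulting split of story-2 lowers recall to 3/4, so the score penalizes both kinds of error.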

The discipline: gates first, no single score

Three design choices carry most of the weight.

Hard gates are evaluated first. Before any candidate is compared on quality or cost, it must pass every safety gate for its campaign. A configuration that merges unrelated clusters slightly less often but occasionally emits a directional trade instruction is rejected outright, regardless of its other numbers.
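
The gate-first ordering can be sketched as a filter that runs before any comparison. The gate names, metric keys, and the 0.02 cap below are hypothetical; the article does not state the actual thresholds.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Gate = Callable[[Metrics], bool]

# Hypothetical gates; the metric names and the threshold are invented.
GATES: Dict[str, Gate] = {
    "zero_financial_instructions": lambda m: m["financial_instructions"] == 0,
    "false_merge_rate_cap": lambda m: m["false_merge_rate"] <= 0.02,
}


def failed_gates(metrics: Metrics) -> List[str]:
    """Names of every gate the candidate fails (empty list means it passes)."""
    return [name for name, gate in GATES.items() if not gate(metrics)]


def partition(
    candidates: Dict[str, Metrics],
) -> Tuple[Dict[str, Metrics], Dict[str, List[str]]]:
    """Split candidates into gate-passing ones and rejected ones with reasons.

    Only the first group is ever compared on quality or cost.
    """
    eligible: Dict[str, Metrics] = {}
    rejected: Dict[str, List[str]] = {}
    for candidate_id, metrics in candidates.items():
        failures = failed_gates(metrics)
        if failures:
            rejected[candidate_id] = failures
        else:
            eligible[candidate_id] = metrics
    return eligible, rejected


runs = {
    "baseline": {"false_merge_rate": 0.015, "financial_instructions": 0},
    # Merges less often, but emitted trade instructions: rejected outright.
    "cand-1": {"false_merge_rate": 0.010, "financial_instructions": 2},
}
eligible, rejected = partition(runs)
print(sorted(eligible), rejected)
```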

Selection is multi-objective. There is no blended “overall score” to game. The Lab computes a vector of objectives and keeps the Pareto frontier: the candidates that no other candidate dominates, where one candidate dominates another if it is at least as good on every axis and strictly better on at least one. A human reviewer then sees the real trade-offs (precision versus lead time, quality versus cost) instead of a single number that hides them.
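
A Pareto frontier is simple to compute for a small candidate set. The sketch below assumes every objective has been oriented so that higher is better (a cost would be negated first); the candidate names and scores are made up.

```python
from typing import Dict, Set, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(candidates: Dict[str, Objectives]) -> Set[str]:
    """Ids of candidates that no other candidate dominates."""
    return {
        candidate_id
        for candidate_id, vector in candidates.items()
        if not any(
            dominates(other, vector)
            for other_id, other in candidates.items()
            if other_id != candidate_id
        )
    }


# Hypothetical (precision, lead-time score) pairs, both higher-is-better.
scores = {
    "cand-a": (0.90, 0.20),
    "cand-b": (0.70, 0.60),
    "cand-c": (0.60, 0.50),  # worse than cand-b on both axes
}
print(sorted(pareto_frontier(scores)))
```

Here cand-a and cand-b both survive because each wins on a different axis, which is exactly the trade-off a reviewer needs to see.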

Evaluation data is sealed and time-honest. Eval sets are built only from real production data — never fabricated — split into development, validation, and holdout, with content checksums. A campaign cannot seal until at least half its cases carry a human-reviewed label. Historical replay enforces a strict as-of boundary: a candidate scoring an event from March sees only what the system knew in March. Holdout cases are unreachable by the candidate generator entirely, reserved for a final, separate check.
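
Three of these rules (the as-of boundary, the half-labeled sealing threshold, and content checksums) can be sketched in a few lines. The EvalCase fields, the function names, and the choice of SHA-256 are assumptions made for illustration; the article does not specify the schema or the hash.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case (hypothetical schema)."""
    case_id: str
    observed_at: datetime
    payload: str
    human_label: Optional[str] = None


def visible_as_of(cases: List[EvalCase], as_of: datetime) -> List[EvalCase]:
    """Keep only what the system could have known at the as-of time."""
    return [case for case in cases if case.observed_at <= as_of]


def can_seal(cases: List[EvalCase]) -> bool:
    """True when at least half of the cases carry a human-reviewed label."""
    labeled = sum(1 for case in cases if case.human_label is not None)
    return len(cases) > 0 and 2 * labeled >= len(cases)


def checksum(cases: List[EvalCase]) -> str:
    """Order-independent SHA-256 over a canonical JSON encoding of the cases."""
    rows = [
        [case.case_id, case.observed_at.isoformat(), case.payload, case.human_label]
        for case in sorted(cases, key=lambda case: case.case_id)
    ]
    canonical = json.dumps(rows, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


utc = timezone.utc
cases = [
    EvalCase("a", datetime(2026, 2, 10, tzinfo=utc), "event text", "important"),
    EvalCase("b", datetime(2026, 3, 20, tzinfo=utc), "event text"),
    EvalCase("c", datetime(2026, 4, 2, tzinfo=utc), "event text"),
]
end_of_march = datetime(2026, 3, 31, 23, 59, 59, tzinfo=utc)
print([case.case_id for case in visible_as_of(cases, end_of_march)])
print(can_seal(cases), checksum(cases)[:12])
```

Any later edit to a sealed case changes the checksum, so tampering or accidental drift is detectable.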

Humans ship; the machine remembers

A winning candidate — a “Lab champion” — has no path to production except a normal, human-reviewed pull request. The Lab holds no deploy credentials by design. Champions are born unapplied, and every promotion step requires explicit human review, with an append-only rollback trail behind it.

Meanwhile, every finished experiment writes a redacted lesson — the hypothesis, the aggregate metric deltas, keep or reject — back into agent memory. Future candidate generation recalls those lessons, so each new hypothesis starts from what earlier experiments taught.

Where it stands today

Honest status: the Lab is live in read-only mode as of 2026-06-29, with all execution switched off by default. The four campaigns are seeded, the evaluation machinery is verified, and turning on the first pilot experiments will require its own explicit go-ahead. No experiment has yet changed a production number — and when one does, it will arrive the same way every other change does: through a pull request a human read.