No scraping, no paywall bypass: sourcing global financial news the legal way

NataPulse tracks global financial news through a source registry seeded with the world’s top 50 business outlets and merged with the connectors that predate it. Of the sources it holds, 39 actually feed the pipeline; another 21 sit in the registry as metadata, deliberately silent, because they offer no legal free feed. That gap is not a bug or a temporary limitation. It is the direct consequence of one rule the ingest layer will not break: NataPulse acquires content only from public RSS/Atom feeds, official APIs, or licensed feeds. No scraping. No paywall bypass. Ever.

A registry seeded with fifty, active where the law allows

The news source registry seeds the 50 most relevant financial outlets worldwide — US, UK, Europe, India, China, Japan, Canada, Australia and beyond — each with a region, a priority tier, a reliability tier, and an ingestion strategy. At rollout, activation was restricted to feeds that could be legally verified: 39 sources went live, composed of 36 official RSS feeds, the SEC EDGAR API, and two legacy connectors covering central-bank and company investor-relations sources.

Twenty-one sources stayed pending, in three honest categories. Nine are hard-paywalled or license-only outlets — Barron’s, The Economist, Nikkei Asia and peers — that require a paid credential before a single byte is fetched. Six publish no legal feed at all and would need either an RSS endpoint or a licensing deal. Six more nominally offer RSS but returned errors during verification, so they were not switched on; they can be reactivated the day a working legal feed is found. A pending source contributes exactly zero data. It is listed, tiered, and ready — and silent.
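
The counts above can be reconciled with a small sketch. The field names, enum values, and category keys are illustrative assumptions, not NataPulse's actual schema; only the numbers are taken from the rollout figures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    ACTIVE = "active"    # feeds the pipeline
    PENDING = "pending"  # kept as metadata only, never fetched


@dataclass(frozen=True)
class SourceEntry:
    # Hypothetical registry row; every field name here is an assumption.
    name: str
    region: str
    priority_tier: int
    reliability_tier: int
    strategy: str                       # e.g. "rss", "api", "legacy_connector"
    status: Status
    pending_reason: Optional[str] = None


def may_fetch(entry: SourceEntry) -> bool:
    """A pending source contributes zero data, so it is never fetched."""
    return entry.status is Status.ACTIVE


# Rollout counts as stated in the text.
ACTIVE_BY_STRATEGY = {"rss": 36, "api": 1, "legacy_connector": 2}
PENDING_BY_REASON = {
    "paywalled_or_license_only": 9,
    "no_legal_feed": 6,
    "rss_failed_verification": 6,
}

active_total = sum(ACTIVE_BY_STRATEGY.values())    # 36 + 1 + 2 = 39
pending_total = sum(PENDING_BY_REASON.values())    # 9 + 6 + 6 = 21
registry_total = active_total + pending_total      # 39 + 21 = 60
```

The total of 60 exceeds the 50 seeded outlets because the registry also absorbs the connectors that predate it.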

The golden rule, enforced at activation

Every source activation is a deliberate operator decision, and it starts with verification: the candidate feed is fetched with the pipeline’s declared user agent, and it must return well-formed RSS or Atom items. A 403, a 404, or an HTML page instead of a feed means the source does not activate — full stop. There is no fallback to fetching article pages, no headless browser, no rotating user agents. Licensed and API-gated sources follow the same discipline: without a valid credential obtained through an explicit procurement decision, the connector performs a clean no-call skip. No credential, no request, no cost, no gray area.
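
The verification gate can be sketched as a pure function over a response that has already been fetched with the declared user agent. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function name is hypothetical, only RSS 2.0 and Atom roots are recognized (RSS 1.0/RDF is left out), and a production check would likely use a hardened XML parser.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def is_valid_feed(status_code: int, body: bytes) -> bool:
    """Return True only for a 200 response carrying RSS or Atom items."""
    # A 403, a 404, or any other non-200 status means no activation.
    if status_code != 200:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML, for example a broken HTML error page.
        return False
    if root.tag == "rss":
        # RSS 2.0: at least one <item> under <channel>.
        return root.find("channel/item") is not None
    if root.tag == ATOM_NS + "feed":
        # Atom: at least one <entry> under <feed>.
        return root.find(ATOM_NS + "entry") is not None
    # Well-formed markup that is not a feed, such as an HTML page.
    return False
```

The function has no fallback branch on purpose: when it returns False, the only outcome is that the source stays pending.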

This also means the registry never pretends to have coverage it does not have. When a filing or a story exists only behind a paywall NataPulse has not licensed, the system simply does not know about it from that outlet. It often learns of it anyway from one of the 38 other channels or from the primary source itself, such as an SEC filing on EDGAR.

What a real ingest run looks like

The rule does not starve the pipeline. A verified end-to-end run at rollout read 1,396 raw items and inserted 726 after deduplication, with zero errors. Normalization turned all 726 into structured market events; the provenance pass built 1,816 story groups with 3,446 event-source links, confirming 219 of those events across multiple independent outlets. The scoring stage then evaluated 2,042 candidate events, promoted 1,849, and flagged 114 as rumors. In that single run, 26 distinct outlets produced real events — legal feeds alone deliver genuinely global, cross-confirmed coverage.

Cross-source confirmation is where the registry design pays off. Because every item carries its outlet, region, and reliability tier from the moment of ingest, the pipeline can tell the difference between one outlet’s claim and a story independently reported by CNBC, Il Sole 24 Ore, and the Economic Times. A single low-authority source stays a rumor; corroborated stories earn confidence.
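
One way to picture the rule is the sketch below. It assumes that tier 1 is the most authoritative, that tier 3 and above counts as low authority, and that two distinct outlet names are independent; the class, function, and label names are hypothetical, and the tiers in the usage lines are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class EventSourceLink:
    story_id: str
    outlet: str
    reliability_tier: int  # assumption: 1 = highest authority


LOW_AUTHORITY_TIER = 3  # assumption: tiers at or above this are low authority


def classify_stories(links: Iterable[EventSourceLink]) -> Dict[str, str]:
    """Label each story by how many distinct outlets reported it."""
    tiers_by_story: Dict[str, Dict[str, int]] = defaultdict(dict)
    for link in links:
        # Keying by outlet collapses repeat reports from the same outlet.
        tiers_by_story[link.story_id][link.outlet] = link.reliability_tier

    labels: Dict[str, str] = {}
    for story_id, tiers in tiers_by_story.items():
        if len(tiers) >= 2:
            labels[story_id] = "confirmed"
        elif min(tiers.values()) >= LOW_AUTHORITY_TIER:
            labels[story_id] = "rumor"
        else:
            labels[story_id] = "single_source"
    return labels


# Illustrative usage; the tier values are invented for the example.
example_links = [
    EventSourceLink("story-1", "CNBC", 1),
    EventSourceLink("story-1", "Il Sole 24 Ore", 1),
    EventSourceLink("story-1", "Economic Times", 2),
    EventSourceLink("story-2", "hypothetical-low-tier-outlet", 3),
]
example_labels = classify_stories(example_links)
```

Here story-1 is labeled confirmed and story-2 stays a rumor. Counting distinct outlet names is only a proxy for independence; syndicated wire copy, for instance, would need separate handling.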

The Stooq test

Principles are cheap until they cost something. NataPulse’s daily S&P 500 price ingest was originally built on Stooq’s free public CSV endpoint. At rollout, Stooq placed that endpoint behind a JavaScript proof-of-work anti-bot wall. The technically easy move — solve the challenge programmatically and keep pulling data — would have meant circumventing an access control, exactly what the golden rule forbids. Instead, the team switched the live source to Polygon’s official grouped-daily API, which returns every US stock’s end-of-day bar in one sanctioned call. Same data shape, clean terms of use, and the original adapter shelved rather than weaponized.
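
A hedged sketch of how such a connector might combine the grouped-daily call with the no-call skip described earlier. The function name and return shape are hypothetical, and the route and query parameters are written as an assumption about Polygon's published REST interface that should be checked against the current documentation before use.

```python
from datetime import date
from typing import Dict, Optional, Tuple

# Assumption: grouped-daily route as published in Polygon's REST docs.
GROUPED_DAILY_URL = (
    "https://api.polygon.io/v2/aggs/grouped/locale/us/market/stocks/{day}"
)


def build_grouped_daily_request(
    day: date, api_key: Optional[str]
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (url, params) for one trading day, or None for a no-call skip."""
    # No credential means no request is even constructed.
    if not api_key or not api_key.strip():
        return None
    url = GROUPED_DAILY_URL.format(day=day.isoformat())
    params = {"adjusted": "true", "apiKey": api_key.strip()}
    return url, params
```

Returning None instead of raising keeps the skip quiet: the caller can log that the source was skipped and move on without touching the network.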

Provenance-clean data is a feature

Treating legality as a hard constraint produces a pipeline where every event can answer the question “where did this come from?” with a specific feed, a fetch time, and an outlet identity — no laundered scraped text, no content of uncertain rights. That provenance is what makes downstream research defensible: reports cite their evidence, corroboration is computable, and rumors are labeled as rumors. NataPulse’s output is research evidence, never trade instructions — and evidence is only as good as its chain of custody. A registry that would rather stay silent on 21 outlets than scrape them is what keeps that chain intact.