Data sources & derived-data architecture
Schmatz is an analytics publisher, not a data reseller. The customer-facing surface is built on public-domain and freely-redistributable data sources. The backtest engine, shock classifier, and pattern-matching infrastructure run against that data on our own servers and publish only the derived results: equity curves, summary metrics, classified events, and narrative attributions. We never package or sell the underlying data feeds.
Source attribution
The sources behind the customer-facing Schmatz product are listed below. The same attribution appears in the site footer on every page.
- SEC EDGAR — the US Securities and Exchange Commission’s public-domain filing system. We use it for: corporate filings (8-K material events, Form 4 insider transactions, 13D activist filings, 10-K and 10-Q financial reports) and XBRL-tagged financial statements (used to compute fundamentals analytics like P/E, debt-to-equity, financial-health scores). All EDGAR data is in the public domain under US copyright law.
- Federal Reserve FRED — the Federal Reserve Economic Data system. Free for any use. We use it for macroeconomic indicators (VIX, 10-year treasury yields, dollar index, market breadth proxies, and other time series the V7 strategy reads as a regime gate).
- Freely-redistributable historical-price archives — the customer-facing chart, the user-run backtest engine in /lab, and the shock-detection pipeline all read from price archives whose licensing permits commercial display. Specific sources are documented in the project repository; the customer-facing UI shows derived values (charts, percentage returns, classified events), never bulk OHLCV exports.
For internal research only, the operator maintains separate paid-tier subscriptions to third-party market-data vendors. Those subscriptions are used as a gold-standard validation reference against the customer-facing public-data pipeline — if the two ever disagree by more than a small tolerance, we investigate before serving the output. Records from those paid sources are never returned by any customer-facing API endpoint. See docs/SCALING_PLAYBOOK.md and docs/DATA_LAYER_ARCHITECTURE.md in the project repository for the engineering implementation.
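The validation check described above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the function name, the `(ticker, date)` key shape, the 0.1% tolerance, and the sample prices for the placeholder ticker `XYZ` are all assumptions made up for the example.

```python
from math import isclose

# Assumed value: the text only says "a small tolerance".
REL_TOLERANCE = 0.001  # 0.1 %


def find_disagreements(public, reference, rel_tol=REL_TOLERANCE):
    """Return the keys where the public pipeline and the reference disagree.

    Both arguments map a (ticker, date) key to a closing price.
    A key missing from either side is also reported.
    """
    flagged = []
    for key in sorted(set(public) | set(reference)):
        a = public.get(key)
        b = reference.get(key)
        if a is None or b is None or not isclose(a, b, rel_tol=rel_tol):
            flagged.append(key)
    return flagged


# Made-up sample values for a placeholder ticker.
public = {("XYZ", "2021-04-12"): 100.00, ("XYZ", "2021-04-13"): 101.50}
reference = {("XYZ", "2021-04-12"): 100.00, ("XYZ", "2021-04-13"): 103.00}

flagged = find_disagreements(public, reference)
print(flagged)
```

Any key returned here would be investigated before the corresponding output is served; the reference values themselves never leave the internal check.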
The derived-data architecture
Schmatz is structured to produce and publish derived analytics, not raw data. Concretely:
The ingestion loop (server-side only)
A scheduled job pulls the day’s incremental data from the public sources above and writes it into a local SQLite database on our server. This is the only process that reaches outside the box; from the data sources’ perspective the traffic pattern looks like a single research workstation, not a multi-tenant SaaS.
The local database holds the multi-year corpus needed for backtesting: prices, classified shock events, news metadata, SEC filings, computed fundamentals, FRED macro series. It is the only data store any user-facing query ever reads from.
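As a rough sketch of the ingestion step, the snippet below writes a day's increment into SQLite using Python's standard `sqlite3` module. The `prices` table, its columns, and the function names are hypothetical; the real schema lives in the project repository. The primary key plus `INSERT OR REPLACE` is one way to make a re-run of the job idempotent.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the local store and create the (hypothetical) prices table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prices (
               ticker TEXT NOT NULL,
               date   TEXT NOT NULL,
               close  REAL NOT NULL,
               PRIMARY KEY (ticker, date)
           )"""
    )
    return conn


def ingest_increment(conn, rows):
    """Upsert one batch of (ticker, date, close) rows; return the row count."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO prices (ticker, date, close) VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]


conn = open_store()
ingest_increment(conn, [("XYZ", "2021-04-12", 100.0), ("XYZ", "2021-04-13", 101.5)])
# The second run overlaps the first by one row; the overlap is replaced, not duplicated.
total = ingest_increment(conn, [("XYZ", "2021-04-13", 101.5), ("XYZ", "2021-04-14", 99.8)])
print(total)
```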
The user-facing surface (derived only)
When a subscriber runs a backtest in /lab:
- The request lands at POST /api/backtest with the user’s strategy parameters (tickers, date range, cost model).
- The backtest engine reads the necessary price history from local storage, runs the strategy’s entry/exit rules, simulates trades against a hedged-portfolio framework, and computes summary statistics.
- The response returned to the browser contains only derived output: an equity-curve line (cumulative-return time series), summary metrics (Sharpe ratio, hit rate, max drawdown, annualized return), and a list of trade triggers (e.g., “Entered AAPL on 2021-04-12; exited 2021-04-19; net +1.4%”).
- The raw price bars used to compute the result are never sent to the browser. The user cannot reverse-engineer the underlying price history from an equity-curve line or a Sharpe ratio.
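To make the "derived only" response concrete, here is a minimal sketch of turning per-period strategy returns into the kind of payload described above. The function name and field names are assumptions, the Sharpe ratio assumes a zero risk-free rate, and hit rate is simplified to the share of positive periods rather than the share of winning trades.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(strategy_returns, periods_per_year=252):
    """Build a derived-only payload from per-period strategy returns."""
    level, peak, max_drawdown = 1.0, 1.0, 0.0
    equity_curve = []
    for r in strategy_returns:
        level *= 1.0 + r                      # compound the return
        peak = max(peak, level)               # running high-water mark
        max_drawdown = max(max_drawdown, 1.0 - level / peak)
        equity_curve.append(level - 1.0)      # cumulative return so far
    # Annualized Sharpe ratio with an assumed zero risk-free rate.
    sharpe = mean(strategy_returns) / stdev(strategy_returns) * sqrt(periods_per_year)
    hit_rate = sum(r > 0 for r in strategy_returns) / len(strategy_returns)
    return {
        "equity_curve": equity_curve,
        "sharpe": sharpe,
        "hit_rate": hit_rate,
        "max_drawdown": max_drawdown,
    }


# Made-up returns chosen so the drawdown is easy to check by hand.
result = summarize([0.10, -0.50, 0.20])
print(sorted(result))
```

The payload carries cumulative returns and summary statistics only; no price bars appear in it.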
The same pattern applies elsewhere
- Shock detection: the bot pre-computes classified shock events. Users see the classification (“stock-specific”, “sector move”, “market-wide”) and the magnitude, never the underlying volume + volatility distributions that drove the classification.
- Echoes similarity: the engine computes a feature vector from a ticker’s recent regime and searches the corpus for the top-K closest historical matches. Users see the matched dates + outcomes, not the underlying feature vectors or distance scores.
- Ask the bot: the engine pulls deterministic facts from the local database (P/E ratio, recent earnings beats, financial-health scores), then asks an LLM to compose editorial prose around them. The LLM never has direct access to the raw data; it sees only the facts we’ve already derived. The user reads the prose plus the same derived facts.
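The Echoes top-K search can be illustrated with a small nearest-neighbour sketch. The source does not say which distance metric or feature set the engine uses, so Euclidean distance, the two-dimensional feature vectors, and the record fields below are assumptions for illustration only.

```python
import heapq
from math import dist


def top_k_matches(query, corpus, k=3):
    """Return the k corpus entries closest to the query feature vector.

    Only the date and outcome are returned; the feature vectors and
    distance scores stay server-side.
    """
    nearest = heapq.nsmallest(k, corpus, key=lambda row: dist(query, row["features"]))
    return [{"date": row["date"], "outcome": row["outcome"]} for row in nearest]


# Made-up corpus entries.
corpus = [
    {"date": "2019-01-02", "features": [1.0, 0.0], "outcome": "+2.1%"},
    {"date": "2019-06-03", "features": [5.0, 5.0], "outcome": "-0.8%"},
    {"date": "2020-03-09", "features": [0.0, 0.5], "outcome": "+0.4%"},
]

matches = top_k_matches([0.0, 0.0], corpus, k=2)
print(matches)
```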
What we publish (allowed)
- Backtest performance metrics (Sharpe, hit rate, annualized return, max drawdown, t-statistics)
- Equity-curve charts (cumulative-return lines, not raw price series)
- Lists of simulated trade triggers (entry/exit dates + prices for a specific strategy run)
- Classified event tags (“stock-specific shock”, “sector move”, etc.) and their magnitudes
- Narrative attribution prose composed by AI from deterministic SQL facts
- Headlines and links to SEC filings (which are themselves public-domain government records)
What we do NOT publish (forbidden)
- Downloadable CSV of historical OHLCV bars for any ticker or date range
- A general-purpose “query the underlying database” endpoint that returns raw records
- Bulk export of the fundamentals corpus
- Real-time tick-by-tick streaming of price data
- API responses that contain enough adjacent price points to reconstruct the daily-bars history for arbitrary tickers
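One way to enforce the last rule is a pre-send guard that measures how many adjacent trading sessions a response exposes prices for. The sketch below is an assumption about how such a check could look, not a description of the deployed guard; the function names and the limit of five sessions are hypothetical.

```python
# Assumed limit; the source does not state a threshold.
MAX_ADJACENT_SESSIONS = 5


def longest_adjacent_run(exposed_dates, trading_calendar):
    """Length of the longest run of consecutive sessions with an exposed price.

    trading_calendar is the ordered list of sessions for the ticker, so
    weekends and holidays do not break a run.
    """
    exposed = set(exposed_dates)
    longest = current = 0
    for session in trading_calendar:
        if session in exposed:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_publishable(exposed_dates, trading_calendar, limit=MAX_ADJACENT_SESSIONS):
    """True if the response exposes no run of prices longer than the limit."""
    return longest_adjacent_run(exposed_dates, trading_calendar) <= limit


calendar = ["2021-04-12", "2021-04-13", "2021-04-14",
            "2021-04-15", "2021-04-16", "2021-04-19"]
exposed = ["2021-04-12", "2021-04-13", "2021-04-15", "2021-04-16", "2021-04-19"]

run = longest_adjacent_run(exposed, calendar)
print(run, is_publishable(exposed, calendar))
```

A list of trade triggers with scattered entry and exit prices passes this check, while a response carrying every session in a date range would not.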
A subscriber who needs raw historical data can pull it themselves directly from public sources or license it from a data vendor of their choice. The Schmatz product is the analytics; we don’t resell anything we ingest.
Bring-Your-Own-Key (future)
A planned future feature (not currently available) will let subscribers connect their own API keys from third-party market-data vendors. With a personal key connected, the user’s browser would render premium-vendor data directly under their own subscription terms — not under Schmatz’s. This pattern (sometimes called BYOK) lets us extend the product surface to include premium-data views without taking on the redistribution liability ourselves.
BYOK is on the roadmap and gated behind subscriber demand. Today, every Schmatz subscriber sees the same derived analytics built from public-domain data sources.
Why this matters
The separation between “raw underlying data” and “derived analytics” is the foundation of how a small publisher can legitimately offer research based on institutional-grade methodology without holding a commercial display license. Schmatz is built to that pattern deliberately. The same pattern is followed by traditional financial research publishers and academic groups who publish strategy backtests in journals: the publication is the derived work; the underlying data is referenced, not redistributed.
If you have questions about how a specific Schmatz output fits this framework, or if you believe something we publish crosses the line into raw-data redistribution, write to our legal contact and we’ll investigate promptly.