Real-Time Inference Engine

The Speed Problem

Most AI trading tools on the market today work the same way: they send raw information to an LLM API and wait for a response. Each call takes 1–5 seconds. When a major event breaks and 48 related signals arrive at once, a pipeline that handles them one call at a time chokes.

By the time the AI finishes thinking, the market has already moved.

Roma doesn't do this. We built a layered inference system where 90% of signals never touch a large language model.

Two-Layer Architecture

Layer 1 — Lightweight Domain Model

Layer 1 is a purpose-trained model, distilled from large language models and fine-tuned on crypto-specific data. It performs three tasks in under 100 milliseconds:

Task	Output
Entity Recognition	Which coins/projects are mentioned
Sentiment Classification	Bullish / Bearish / Neutral
Importance Scoring	0–100 importance rating

90% of incoming signals are fully processed at this layer. Duplicates are detected and merged. Low-importance noise is filtered out.

Only signals scoring above a configurable threshold advance to Layer 2.
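
The routing step above can be sketched in a few lines. This is a minimal illustration, not Roma's implementation: the `Layer1Result` type, the `triage` function, the default threshold of 80, and the rule of treating signals with the same entities and sentiment as duplicates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Layer1Result:
    """Hypothetical container for the three Layer 1 outputs."""
    text: str
    entities: Tuple[str, ...]  # coins/projects mentioned, e.g. ("BTC", "ETH")
    sentiment: str             # "bullish", "bearish", or "neutral"
    importance: int            # 0-100 importance rating


def triage(results: List[Layer1Result], threshold: int = 80) -> List[Layer1Result]:
    """Merge duplicates, then keep only signals scoring above the threshold."""
    # Assumed duplicate rule: same set of entities and same sentiment.
    merged: Dict[tuple, Layer1Result] = {}
    for r in results:
        key = (frozenset(r.entities), r.sentiment)
        best = merged.get(key)
        # Keep the highest-scoring signal for each duplicate group.
        if best is None or r.importance > best.importance:
            merged[key] = r
    # Everything at or below the threshold is treated as noise and dropped.
    return [r for r in merged.values() if r.importance > threshold]
```

A real duplicate detector would more likely compare text similarity or embeddings; the exact-key match here only keeps the sketch short.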

Layer 2 — Deep LLM Reasoning

For the small number of high-importance signals, a full LLM performs deep analysis in 2–3 seconds:

Task	Output
Event Attribution	What type of event is this?
Impact Analysis	Which assets are affected, and how?
Historical Comparison	What happened last time?
Strategy Suggestion	What trade makes sense?
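
One way to keep Layer 2 output machine-readable is to ask the LLM for JSON and validate it before it becomes a trade signal. The sketch below assumes a hypothetical schema with four keys mirroring the table above (`event_type`, `impact`, `historical`, `strategy`). The field names and the `parse_analysis` helper are illustrative and not part of Roma's documented interface.

```python
import json
from dataclasses import dataclass
from typing import Dict

# Hypothetical field names, one per Layer 2 task.
REQUIRED_FIELDS = ("event_type", "impact", "historical", "strategy")


@dataclass
class Layer2Analysis:
    event_type: str          # Event Attribution
    impact: Dict[str, str]   # Impact Analysis: asset -> expected effect
    historical: str          # Historical Comparison
    strategy: str            # Strategy Suggestion


def parse_analysis(raw: str) -> Layer2Analysis:
    """Parse an LLM response string and reject it if any field is missing."""
    data = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"LLM response missing fields: {missing}")
    return Layer2Analysis(**{f: data[f] for f in REQUIRED_FIELDS})
```

Rejecting incomplete responses up front means downstream consumers never see a half-filled signal.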

Example: Powell Speaks

At 19:30, the Fed Chair begins speaking. Within seconds, 48 related signals flood into Roma's pipeline.

Without Roma (typical AI tool):

48 signals × 1–5 seconds per API call = 48–240 seconds total (processed one at a time)
Result: 48 duplicate alerts, high cost, massive latency

With Roma:

19:30:05  48 signals arrive

          Layer 1 (80ms):
          → Entity: BTC, ETH, USD, Treasury
          → Sentiment: Strong bullish
          → Importance: 92/100
          → 47 duplicates detected and merged
          → 1 signal advances to Layer 2

19:30:05  Layer 2 (2.5s):
          → Event type: Fed policy pivot
          → Historical: 2024 rate cut cycle → BTC +15%
          → Impact: BTC (strong positive), ETH (strong positive),
                    AI tokens (moderate positive)
          → Strategy: Long BTC/ETH, watch AI sector rotation

19:30:08  → 1 structured trade signal output

3 seconds. 1 signal. Zero noise.
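
The arithmetic behind the two timelines can be checked directly. The sketch below uses only the figures from the example. It assumes the naive tool makes its API calls one after another, and that the 80 ms Layer 1 figure covers the whole batch of 48 signals. The function names are illustrative.

```python
def sequential_latency(n_signals: int, seconds_per_call: float) -> float:
    """Total wait when every signal gets its own LLM call, one after another."""
    return n_signals * seconds_per_call


def layered_latency(layer1_seconds: float, layer2_seconds: float,
                    n_advanced: int = 1) -> float:
    """One Layer 1 pass over the batch, then one LLM call per advanced signal."""
    return layer1_seconds + n_advanced * layer2_seconds


naive_best = sequential_latency(48, 1.0)   # lower bound of the 1-5 s range
naive_worst = sequential_latency(48, 5.0)  # upper bound of the 1-5 s range
roma = layered_latency(0.080, 2.5)         # 80 ms Layer 1 + one 2.5 s Layer 2 call

print(f"naive: {naive_best:.0f}-{naive_worst:.0f} s, layered: {roma:.2f} s")
```

The layered total comes to roughly 2.6 seconds, which matches the three-second window between 19:30:05 and 19:30:08 in the timeline.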

Training Data Flywheel

Layer 1's accuracy depends on its training data. This data comes from Roma's own pipeline:

30+ sources generate raw signals daily
    ↓
Layer 2 LLM processes high-score signals (labeled output)
    ↓
Labeled data feeds back to retrain Layer 1
    ↓
Layer 1 becomes more accurate → better filtering → better Layer 2 inputs

This creates a data flywheel: the more signals Roma processes, the better Layer 1 becomes at filtering, which improves the quality of signals reaching Layer 2, which produces better training data for Layer 1.
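
The feedback loop can be pictured as a buffer that collects Layer 2 labels and signals when enough have accumulated to retrain Layer 1. This is a sketch under stated assumptions: the `LabelBuffer` class, the `retrain_every` batch size, and the label dictionary shape are hypothetical, and the actual retraining step is left out.

```python
from typing import Dict, List


class LabelBuffer:
    """Collects Layer 2 labeled outputs as training examples for Layer 1."""

    def __init__(self, retrain_every: int = 1000) -> None:
        self.retrain_every = retrain_every  # assumed batch size, not a Roma setting
        self.examples: List[Dict] = []

    def add(self, text: str, labels: Dict) -> bool:
        """Store one labeled signal; return True once a retrain batch is ready."""
        self.examples.append({"text": text, **labels})
        return len(self.examples) >= self.retrain_every

    def drain(self) -> List[Dict]:
        """Hand the accumulated batch to the (not shown) retraining job."""
        batch, self.examples = self.examples, []
        return batch
```

For example, a call such as `buffer.add("Powell hints at cuts", {"sentiment": "bullish", "importance": 92})` would queue one example, and `drain()` would be called whenever `add` returns `True`.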

A competitor can build the same two-layer architecture. But without a pipeline generating domain-specific labeled data at scale, their Layer 1 model is likely to be significantly less accurate.

Cost Efficiency

The layered approach dramatically reduces LLM API costs:

Approach	API Calls / Day	Estimated Cost
Naive (every signal → LLM)	~10,000	High
Roma (layered filtering)	~500–1,000	~10–20x lower

Layer 1 runs on lightweight infrastructure. Only the signals that truly matter consume expensive LLM compute.
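
The reduction follows directly from the call counts in the table. The sketch below assumes every LLM call costs about the same and ignores the cost of running Layer 1 itself. The function name is illustrative.

```python
def reduction_factor(naive_calls_per_day: int, layered_calls_per_day: int) -> float:
    """How many times fewer LLM calls the layered pipeline makes per day."""
    return naive_calls_per_day / layered_calls_per_day


low = reduction_factor(10_000, 1_000)  # upper end of the 500-1,000 range
high = reduction_factor(10_000, 500)   # lower end of the 500-1,000 range

print(f"{low:.0f}x to {high:.0f}x fewer LLM calls per day")
```

At 1,000 calls per day the saving is 10x, and at 500 calls per day it is 20x.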
