
Data Source Quality

This guide covers critical differences in data quality between data providers when building volume profile and order flow indicators in Bookmap. Understanding these differences is essential for accurate overnight session metrics and historical data analysis.

The Problem

The Bookmap API documentation states that the price parameter in onTrade() represents "the number of increments" - implying integer values. However, historical data providers may deliver pre-aggregated data where prices are fractional VWAP (Volume-Weighted Average Price) values, not individual tick prices.

This behavior is not documented in the official API but critically impacts volume profile accuracy.

Data Provider Comparison

| Provider | Raw Price Format | Tick Aligned | Zero-Size Trades | Data Type |
|---|---|---|---|---|
| Rithmic (live/recorded) | 100% integer | 100% | ~30% (MBO updates) | True tick-by-tick |
| dxFeed CME Historical | 100% fractional | 0% | ~30% | Aggregated VWAP |

Rithmic Data (High Quality)

Raw prices: 25857.000000, 25856.000000, 25855.000000
All integers representing exact tick levels.

Rithmic delivers true Market-By-Order (MBO) tick-by-tick data:

  • Every individual order fill is a separate event
  • Prices are exact tick values (integers when viewed as Level 1)
  • High event density (~2,000+ trades/second during active markets)
  • Zero-size events are order book updates (modifications, cancellations)

dxFeed Historical Data (Aggregated)

Raw prices: 27930.555556, 27933.500000, 27930.142857
Fractional values indicating pre-aggregation.

dxFeed CME Historical Market Depth delivers pre-aggregated data:

  • Multiple trades collapsed into single VWAP records per time bucket
  • Prices are volume-weighted averages, not actual execution prices
  • Low event density (significantly fewer records per second)
  • Fractional patterns reveal the aggregation (e.g., 1/7 = 0.142857)
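To see where a value such as 27930.142857 can come from, the sketch below computes a VWAP over one bucket of fills expressed in tick units. The fills (six contracts at 27930 and one at 27931) and the class name are illustrative assumptions, not data recovered from dxFeed; any mix with the same volume-weighted mean would produce the same record.

```java
/** Illustrative only: how a per-bucket VWAP yields a fractional "tick" price. */
final class VwapBucketExample {

    /** Volume-weighted average price, in tick units, of parallel price/size arrays. */
    static double vwap(long[] priceTicks, int[] sizes) {
        if (priceTicks.length != sizes.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long notional = 0;
        long volume = 0;
        for (int i = 0; i < priceTicks.length; i++) {
            notional += priceTicks[i] * sizes[i];
            volume += sizes[i];
        }
        if (volume == 0) {
            throw new IllegalArgumentException("bucket has no volume");
        }
        return (double) notional / volume;
    }

    public static void main(String[] args) {
        // Assumed fills: 6 contracts at tick 27930 and 1 contract at tick 27931.
        double v = vwap(new long[] {27930, 27931}, new int[] {6, 1});
        System.out.println(v); // 27930 + 1/7, a price no single fill traded at
    }
}
```

The result is a price at which no contract actually traded, which is why such records cannot be placed on the price ladder without rounding.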

Detecting Aggregated Data

Fractional Pattern Analysis

Aggregated VWAP prices produce telltale fractional patterns:

| Raw Price | Fraction | Calculation |
|---|---|---|
| 27930.555556 | 5/9 | 9 contracts across multiple prices |
| 27930.142857 | 1/7 | 7 contracts averaged |
| 27933.571429 | 4/7 | 7 contracts averaged |
| 27947.476190 | 10/21 | 21 contracts averaged |

The denominator often matches the record's trade size, which is consistent with a VWAP computed over that many contracts.
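The pattern check can be automated with a small heuristic. The sketch below searches for the smallest denominator that turns the fractional part of a raw price into a near-whole number. The class name, the tolerance, and the denominator cap are assumptions chosen for prices printed to six decimals; this is a diagnostic aid, not part of the Bookmap API.

```java
/** Heuristic sketch: find the smallest denominator that explains a fractional raw price. */
final class FractionPattern {

    /** Tolerance for "close enough to a whole number"; an assumed value. */
    private static final double EPS = 1e-4;

    /**
     * Returns the smallest d in [1, maxDenominator] such that the fractional part
     * of price, multiplied by d, is within EPS of an integer. Returns -1 if none fits.
     * An integer price returns 1.
     */
    static int smallestDenominator(double price, int maxDenominator) {
        double frac = price - Math.floor(price);
        for (int d = 1; d <= maxDenominator; d++) {
            double scaled = frac * d;
            if (Math.abs(scaled - Math.rint(scaled)) < EPS) {
                return d;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] samples = {27930.555556, 27930.142857, 27933.571429, 27947.476190, 25857.0};
        for (double p : samples) {
            System.out.println(p + " -> denominator " + smallestDenominator(p, 100));
        }
    }
}
```

Comparing the returned denominator with the record's size is then a one-line check. A larger cap or looser tolerance raises the chance of a coincidental match, so treat a hit as supporting evidence only.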

Diagnostic Code Pattern

// Check if raw price is integer (as expected for tick data)
boolean isRawPriceInteger = (price == Math.floor(price));
if (isRawPriceInteger) {
    integerRawPrices.incrementAndGet();
} else {
    fractionalRawPrices.incrementAndGet();
    // This indicates aggregated data source
}

File Size as Quality Heuristic

| Source | Duration | File Size | Implication |
|---|---|---|---|
| Rithmic | 2 minutes | 3 MB | True tick-by-tick |
| dxFeed | 1 hour | 700 KB | Heavily aggregated |

A dramatically smaller file covering a longer duration suggests aggregation.
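Normalizing the two recordings to bytes per second makes the gap easier to compare. The sketch below uses the sizes quoted in the table and assumes decimal units (1 MB = 1,000,000 bytes). It ignores differences in file format, instrument count, and market activity between the two recordings, so it is only a rough indicator.

```java
/** Back-of-the-envelope density check using the file sizes quoted in the table. */
final class RecordingDensity {

    /** Average bytes written per second of recorded market time. */
    static double bytesPerSecond(long fileBytes, long durationSeconds) {
        if (durationSeconds <= 0) {
            throw new IllegalArgumentException("duration must be positive");
        }
        return (double) fileBytes / durationSeconds;
    }

    public static void main(String[] args) {
        double rithmic = bytesPerSecond(3_000_000L, 2 * 60);  // 3 MB over 2 minutes
        double dxFeed = bytesPerSecond(700_000L, 60 * 60);    // 700 KB over 1 hour
        System.out.println("Rithmic B/s: " + rithmic);
        System.out.println("dxFeed B/s:  " + dxFeed);
        System.out.println("Ratio:       " + (rithmic / dxFeed));
    }
}
```

Under these assumptions the Rithmic recording is roughly two orders of magnitude denser per second than the dxFeed file.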

Zero-Size Trade Handling

Both data sources produce ~30% zero-size "trades" but for different reasons:

Rithmic MBO Zero-Size Events

These represent order book updates, not executions:

  • Order modifications
  • Order cancellations
  • Quote updates without fills

Recommendation: Filter these events from volume calculations, but keep them available for order flow analysis, where they are valuable.

dxFeed Zero-Size Events

The origin of these events is unclear; they are likely artifacts of the aggregation process.

Recommendation: Filter from volume calculations.

Implementation

@Override
public void onTrade(double price, int size, TradeInfo tradeInfo) {
    // Skip zero-size trades for volume profile
    if (size == 0) {
        zeroSizeTrades.incrementAndGet();
        return; // Don't include in volume profile
    }

    // Process actual executions...
}

Handling Aggregated Data

The Approximation Strategy

For aggregated data sources, round prices to the nearest valid tick:

private static final double TICK_TOLERANCE = 0.0001;

// Returns true if a display price already sits on a tick boundary.
// pips is the instrument's tick size.
private boolean isValidTickPrice(double price) {
    double remainder = price % pips;
    return remainder < TICK_TOLERANCE || (pips - remainder) < TICK_TOLERANCE;
}

// Snaps a display price to the nearest multiple of the tick size.
private double roundToTick(double price) {
    return Math.round(price / pips) * pips;
}

@Override
public void onTrade(double price, int size, TradeInfo tradeInfo) {
    if (size == 0) return;

    // Raw price arrives in tick units; convert to a display price
    double displayPrice = price * pips;

    // Always round to nearest valid tick
    double profilePrice = roundToTick(displayPrice);

    volumeProfile.merge(profilePrice, size, Integer::sum);
}
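One variation on the approach above keys the profile by integer tick index instead of by a double display price, which avoids relying on floating-point equality for map keys. This is a sketch under assumptions: the class and method names are hypothetical, and the raw price is taken to be in tick units as described earlier in this guide.

```java
import java.util.TreeMap;

/** Sketch: volume profile keyed by integer tick index rather than by double price. */
final class TickVolumeProfile {

    private final TreeMap<Long, Long> volumeByTick = new TreeMap<>();

    /** Adds one trade record; rawPrice is in tick units and may be fractional. */
    void add(double rawPrice, int size) {
        if (size <= 0) {
            return; // zero-size records carry no traded volume
        }
        long tick = Math.round(rawPrice); // nearest tick index
        volumeByTick.merge(tick, (long) size, Long::sum);
    }

    /** Volume accumulated at a tick index, or 0 if nothing traded there. */
    long volumeAt(long tick) {
        return volumeByTick.getOrDefault(tick, 0L);
    }

    /** Sum of all volume in the profile. */
    long totalVolume() {
        long total = 0;
        for (long v : volumeByTick.values()) {
            total += v;
        }
        return total;
    }

    /** Converts a tick index to a display price for a given tick size. */
    static double displayPrice(long tick, double pips) {
        return tick * pips;
    }

    public static void main(String[] args) {
        TickVolumeProfile profile = new TickVolumeProfile();
        profile.add(27930.142857, 7); // aggregated record, rounds down to 27930
        profile.add(27930.555556, 9); // aggregated record, rounds up to 27931
        System.out.println("27930: " + profile.volumeAt(27930));
        System.out.println("27931: " + profile.volumeAt(27931));
    }
}
```

Conversion to a display price happens only at render time, so two records that round to the same tick always land in the same bucket.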

Accuracy Implications

| Metric | Tick-by-Tick Data | Aggregated Data (Rounded) |
|---|---|---|
| POC | Exact | ±1-2 ticks |
| VAH/VAL | Exact | ±1-2 ticks |
| Total Volume | Exact | Exact (sizes are correct) |
| Volume Distribution | Perfect | Slightly smeared |
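A small constructed example shows how total volume can stay exact while the distribution, and with it the POC, shifts. The fills and bucket boundaries below are hypothetical, chosen so that the POC moves by one tick; real feeds may bucket differently.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustration: rounding VWAP records preserves total volume but moves it between levels. */
final class SmearExample {

    /** Builds the exact profile from individual fills; each fill is {tick, size}. */
    static TreeMap<Long, Long> exactProfile(long[][][] buckets) {
        TreeMap<Long, Long> profile = new TreeMap<>();
        for (long[][] bucket : buckets) {
            for (long[] fill : bucket) {
                profile.merge(fill[0], fill[1], Long::sum);
            }
        }
        return profile;
    }

    /** Collapses each bucket to one VWAP record, then rounds it to the nearest tick. */
    static TreeMap<Long, Long> roundedProfile(long[][][] buckets) {
        TreeMap<Long, Long> profile = new TreeMap<>();
        for (long[][] bucket : buckets) {
            long notional = 0;
            long volume = 0;
            for (long[] fill : bucket) {
                notional += fill[0] * fill[1];
                volume += fill[1];
            }
            if (volume == 0) {
                continue;
            }
            long tick = Math.round((double) notional / volume);
            profile.merge(tick, volume, Long::sum);
        }
        return profile;
    }

    /** Point of control: tick with the highest volume (lowest tick wins ties). */
    static long pointOfControl(TreeMap<Long, Long> profile) {
        long bestTick = 0;
        long bestVolume = -1;
        for (Map.Entry<Long, Long> e : profile.entrySet()) {
            if (e.getValue() > bestVolume) {
                bestVolume = e.getValue();
                bestTick = e.getKey();
            }
        }
        return bestTick;
    }

    /** Hypothetical fills, grouped into the buckets an aggregating feed might use. */
    static long[][][] sampleBuckets() {
        return new long[][][] {
            { {27930, 5}, {27931, 4} },  // VWAP 27930.44, rounds to 27930
            { {27930, 1}, {27931, 5} },  // VWAP 27930.83, rounds to 27931
        };
    }

    public static void main(String[] args) {
        long[][][] buckets = sampleBuckets();
        System.out.println("exact:   " + exactProfile(buckets));
        System.out.println("rounded: " + roundedProfile(buckets));
    }
}
```

In this example the exact profile holds 6 contracts at 27930 and 9 at 27931, while the rounded profile holds 9 and 6, so the POC moves from 27931 to 27930 even though both profiles total 15 contracts.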

When Approximation is Acceptable

Acceptable Use Cases

  • Timeframe 1-minute or higher: ±1-2 ticks is noise
  • Reference zones: Overnight levels as areas, not precise lines
  • Non-HFT strategies: Latency already exceeds tick precision
  • Trend/swing trading: Key levels have natural width

Not Acceptable

  • Scalping/HFT: Every tick matters
  • Precise entry/exit: Requires exact price levels
  • Spread trading: Tick accuracy critical for edge calculation
  • Backtesting: Aggregated data produces unrealistic fills

Data Quality Indicator

Add a quality mode indicator to your diagnostics:

private void logDataQualityMode() {
    long total = totalTrades.get() - zeroSizeTrades.get();
    long fracRaw = fractionalRawPrices.get();

    String mode;
    if (fracRaw == 0) {
        mode = "PRECISE (tick-by-tick data)";
    } else if (fracRaw * 100.0 / Math.max(1, total) > 50) {
        mode = "APPROXIMATED (aggregated source, ±1-2 tick uncertainty)";
    } else {
        mode = "MIXED (some aggregated data)";
    }

    Log.info("Data Quality Mode: " + mode);
}

Practical Architecture

Hybrid Data Strategy

| Session | Data Source | Quality | Handling |
|---|---|---|---|
| RTH (9:30 AM - 4:00 PM ET) | Rithmic live | Precise | Use as-is |
| Overnight (6:00 PM - 9:30 AM ET) | dxFeed historical | Approximated | Round to tick |
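The session windows in the table can be turned into a small routing helper. The sketch below classifies a timestamp by Eastern wall-clock time using java.time. The class name, the CLOSED label for the 4:00 PM to 6:00 PM gap, and the half-open window boundaries are assumptions; weekends and exchange holidays are not handled.

```java
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Sketch: pick a handling mode from the session windows listed in the table. */
final class SessionRouter {

    enum Session { RTH, OVERNIGHT, CLOSED }

    private static final ZoneId EASTERN = ZoneId.of("America/New_York");
    private static final LocalTime RTH_OPEN = LocalTime.of(9, 30);
    private static final LocalTime RTH_CLOSE = LocalTime.of(16, 0);
    private static final LocalTime OVERNIGHT_OPEN = LocalTime.of(18, 0);

    /** Classifies an instant by Eastern wall-clock time; weekends and holidays are ignored. */
    static Session classify(ZonedDateTime when) {
        LocalTime t = when.withZoneSameInstant(EASTERN).toLocalTime();
        if (!t.isBefore(RTH_OPEN) && t.isBefore(RTH_CLOSE)) {
            return Session.RTH;          // [9:30 AM, 4:00 PM)
        }
        if (!t.isBefore(OVERNIGHT_OPEN) || t.isBefore(RTH_OPEN)) {
            return Session.OVERNIGHT;    // [6:00 PM, 9:30 AM next day)
        }
        return Session.CLOSED;           // assumed label for the 4:00 PM - 6:00 PM gap
    }

    /** Aggregated overnight data needs rounding; live RTH data is used as-is. */
    static boolean needsTickRounding(Session session) {
        return session == Session.OVERNIGHT;
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now();
        Session s = classify(now);
        System.out.println(s + ", round to tick: " + needsTickRounding(s));
    }
}
```

Using a named zone rather than a fixed UTC offset keeps the windows correct across daylight saving transitions.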

Alternative: Self-Recording

For highest overnight data quality without additional cost:

  1. Keep connection active overnight
  2. Record Rithmic live feed to local files
  3. Replay own recordings for backfill

This provides tick-by-tick quality for overnight sessions using your existing Rithmic subscription.

Summary

Key Takeaways

  1. Raw price format varies by provider - don't assume integers
  2. Detect aggregation by checking for fractional raw prices
  3. Filter zero-size trades from volume calculations
  4. Round to nearest tick when processing aggregated data
  5. Document data quality mode in output for transparency
  6. ±1-2 tick approximation is acceptable for non-HFT timeframes

Decision Matrix

| Your Timeframe | Data Quality Needed | Aggregated Data OK? |
|---|---|---|
| Microseconds | Exact | ❌ No |
| Seconds | Exact | ❌ No |
| 1+ Minutes | Approximate | ✅ Yes |
| 5+ Minutes | Reference zones | ✅ Yes |

See Also