Data Source Quality
This guide covers critical differences in data quality between data providers when building volume profile and order flow indicators in Bookmap. Understanding these differences is essential for accurate overnight session metrics and historical data analysis.
The Problem
The Bookmap API documentation states that the price parameter in onTrade() represents "the number of increments" - implying integer values. However, historical data providers may deliver pre-aggregated data where prices are fractional VWAP (Volume-Weighted Average Price) values, not individual tick prices.
This behavior is not documented in the official API but critically impacts volume profile accuracy.
Data Provider Comparison
| Provider | Raw Price Format | Tick Aligned | Zero-Size Trades | Data Type |
|---|---|---|---|---|
| Rithmic (live/recorded) | 100% integer | 100% | ~30% (MBO updates) | True tick-by-tick |
| dxFeed CME Historical | 100% fractional | 0% | ~30% | Aggregated VWAP |
Rithmic Data (High Quality)
Raw prices: 25857.000000, 25856.000000, 25855.000000
All integers representing exact tick levels.
Rithmic delivers true Market-By-Order (MBO) tick-by-tick data:
- Every individual order fill is a separate event
- Prices are exact tick values (integers when viewed as Level 1)
- High event density (~2,000+ trades/second during active markets)
- Zero-size events are order book updates (modifications, cancellations)
dxFeed Historical Data (Aggregated)
Raw prices: 27930.555556, 27933.500000, 27930.142857
Fractional values indicating pre-aggregation.
dxFeed CME Historical Market Depth delivers pre-aggregated data:
- Multiple trades collapsed into single VWAP records per time bucket
- Prices are volume-weighted averages, not actual execution prices
- Low event density (significantly fewer records per second)
- Fractional patterns reveal the aggregation (e.g., 1/7 = 0.142857)
Detecting Aggregated Data
Fractional Pattern Analysis
Aggregated VWAP prices produce telltale fractional patterns:
| Raw Price | Fraction | Calculation |
|---|---|---|
| 27930.555556 | 5/9 | 9 contracts across multiple prices |
| 27930.142857 | 1/7 | 7 contracts averaged |
| 27933.571429 | 4/7 | 7 contracts averaged |
| 27947.476190 | 10/21 | 21 contracts averaged |
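The denominator often matches the reported trade size, which is consistent with VWAP aggregation: a volume-weighted average over N contracts at integer tick prices is always a multiple of 1/N.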
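The denominator check can be automated. The following is a minimal sketch, not part of the Bookmap API: `VwapFractionProbe`, `smallestDenominator`, and the 64-denominator search limit are hypothetical names and values chosen for illustration. It assumes raw prices carry about six decimal places of precision, as in the samples above, and searches for the smallest denominator that explains the fractional part.

```java
/** Hypothetical helper: guesses the denominator behind a fractional raw price. */
final class VwapFractionProbe {

    // Sample prices are printed to 6 decimals, so allow that much slack per unit of denominator.
    private static final double PRINT_PRECISION = 1e-6;

    /**
     * Returns the smallest d in [1, maxDenominator] for which the fractional part
     * of rawPrice is within rounding error of k/d, or -1 if none fits.
     * A result of 1 means the price is already an integer.
     */
    static int smallestDenominator(double rawPrice, int maxDenominator) {
        double fraction = rawPrice - Math.floor(rawPrice);
        for (int d = 1; d <= maxDenominator; d++) {
            double scaled = fraction * d;
            double distance = Math.abs(scaled - Math.rint(scaled));
            if (distance <= PRINT_PRECISION * d) {
                return d;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] samples = {27930.555556, 27930.142857, 27933.571429, 27947.476190, 25857.0};
        for (double p : samples) {
            System.out.println(p + " -> denominator " + smallestDenominator(p, 64));
        }
    }
}
```

Comparing the returned denominator with the `size` argument of the same `onTrade()` call gives a quick sanity check. The denominator can be a divisor of the size rather than the size itself, because the fraction is reported in lowest terms.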
Diagnostic Code Pattern
// Check if raw price is integer (as expected for tick data)
boolean isRawPriceInteger = (price == Math.floor(price));
if (isRawPriceInteger) {
    integerRawPrices.incrementAndGet();
} else {
    fractionalRawPrices.incrementAndGet();
    // This indicates aggregated data source
}
File Size as Quality Heuristic
| Source | Duration | File Size | Implication |
|---|---|---|---|
| Rithmic | 2 minutes | 3 MB | True tick-by-tick |
| dxFeed | 1 hour | 700 KB | Heavily aggregated |
A dramatically smaller file covering a much longer duration suggests aggregation.
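To compare the two recordings on equal terms, normalize file size by duration. The sketch below does only that arithmetic, using the figures from the table and assuming decimal units (1 MB = 1,000,000 bytes). `FeedDensity` and `bytesPerSecond` are hypothetical names. File formats, compression, and market activity can differ between recordings, so treat the result as a rough heuristic.

```java
/** Hypothetical helper: normalizes a recording's size by its duration. */
final class FeedDensity {

    /** Average bytes recorded per second of market time. */
    static double bytesPerSecond(long fileBytes, long durationSeconds) {
        if (durationSeconds <= 0) {
            throw new IllegalArgumentException("duration must be positive");
        }
        return (double) fileBytes / durationSeconds;
    }

    public static void main(String[] args) {
        double rithmic = bytesPerSecond(3_000_000L, 2 * 60);  // 3 MB over 2 minutes
        double dxfeed = bytesPerSecond(700_000L, 60 * 60);    // 700 KB over 1 hour
        System.out.printf("Rithmic: %.0f B/s, dxFeed: %.0f B/s, ratio: %.0fx%n",
                rithmic, dxfeed, rithmic / dxfeed);
    }
}
```

With the table's figures this works out to about 25 KB/s for the Rithmic recording against less than 0.2 KB/s for dxFeed, a gap of roughly two orders of magnitude.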
Zero-Size Trade Handling
Both data sources produce ~30% zero-size "trades" but for different reasons:
Rithmic MBO Zero-Size Events
These represent order book updates, not executions:
- Order modifications
- Order cancellations
- Quote updates without fills
Recommendation: Filter these events from volume calculations, but keep them available for order flow analysis, where they are valuable.
dxFeed Zero-Size Events
Origin unclear; these are likely artifacts of the aggregation process.
Recommendation: Filter from volume calculations.
Implementation
@Override
public void onTrade(double price, int size, TradeInfo tradeInfo) {
    // Skip zero-size trades for volume profile
    if (size == 0) {
        zeroSizeTrades.incrementAndGet();
        return; // Don't include in volume profile
    }
    // Process actual executions...
}
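To check the ~30% figure against your own feed, the filter can be paired with a running tally. This is a standalone sketch rather than Bookmap API code: `ZeroSizeTally`, `record`, and `zeroSizePercent` are hypothetical names, and the class would be called from your own `onTrade()` implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: counts trade events and tracks the zero-size share. */
final class ZeroSizeTally {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong zeroSize = new AtomicLong();

    /** Call once per trade event; returns true if the event should feed the volume profile. */
    boolean record(int size) {
        total.incrementAndGet();
        if (size == 0) {
            zeroSize.incrementAndGet();
            return false;
        }
        return true;
    }

    /** Percentage of events seen so far that had size zero (0 when nothing was recorded). */
    double zeroSizePercent() {
        long seen = total.get();
        return seen == 0 ? 0.0 : 100.0 * zeroSize.get() / seen;
    }
}
```

A share far from the values in the provider table may mean the feed behaves differently from the ones measured here.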
Handling Aggregated Data
The Approximation Strategy
For aggregated data sources, round prices to the nearest valid tick:
private static final double TICK_TOLERANCE = 0.0001;
// pips is the instrument's price increment in display units
private boolean isValidTickPrice(double price) {
    double remainder = price % pips;
    return remainder < TICK_TOLERANCE || (pips - remainder) < TICK_TOLERANCE;
}
private double roundToTick(double price) {
    return Math.round(price / pips) * pips;
}
@Override
public void onTrade(double price, int size, TradeInfo tradeInfo) {
    if (size == 0) return;
    // Convert the raw price (in increments) to a display price
    double displayPrice = price * pips;
    // Always round to nearest valid tick
    double profilePrice = roundToTick(displayPrice);
    volumeProfile.merge(profilePrice, size, Integer::sum);
}
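One design alternative is to key the profile by integer tick index instead of by a `double` display price, which sidesteps floating-point keys that differ in the last bit. The sketch below assumes, per the API documentation quoted earlier, that the raw `price` is expressed in increments, so rounding it yields the tick index directly. `TickIndexedProfile` and its methods are hypothetical names, not Bookmap classes.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical volume profile keyed by integer tick index. */
final class TickIndexedProfile {
    private final TreeMap<Long, Long> volumeByTick = new TreeMap<>();

    /** rawPrice is assumed to be in increments (Level 1 units); size is the traded quantity. */
    void add(double rawPrice, int size) {
        if (size <= 0) return;               // ignore zero-size events
        long tick = Math.round(rawPrice);    // nearest tick index
        volumeByTick.merge(tick, (long) size, Long::sum);
    }

    /** Tick index with the highest volume (lowest tick wins ties), or null if empty. */
    Long pointOfControlTick() {
        Long best = null;
        long bestVolume = -1;
        for (Map.Entry<Long, Long> e : volumeByTick.entrySet()) {
            if (e.getValue() > bestVolume) {
                bestVolume = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    /** Volume accumulated at the given tick index (0 if none). */
    long volumeAt(long tick) {
        return volumeByTick.getOrDefault(tick, 0L);
    }

    /** Converts a tick index to a display price; pips is the instrument's price increment. */
    static double toDisplayPrice(long tick, double pips) {
        return tick * pips;
    }
}
```

Conversion to a display price then happens once, at render time, rather than on every trade.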
Accuracy Implications
| Metric | Tick-by-Tick Data | Aggregated Data (Rounded) |
|---|---|---|
| POC | Exact | ±1-2 ticks |
| VAH/VAL | Exact | ±1-2 ticks |
| Total Volume | Exact | Exact (sizes are correct) |
| Volume Distribution | Perfect | Slightly smeared |
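A small worked example shows where the smearing comes from. The breakdown used here (6 contracts at tick 27930 plus 1 contract at tick 27931) is a hypothetical composition chosen because it reproduces the 27930.142857 sample; the real fills behind a dxFeed record are not known. `VwapSmearDemo` is an illustrative name.

```java
/** Hypothetical demo: how rounding a VWAP record moves volume between ticks. */
final class VwapSmearDemo {

    /** Volume-weighted average of tick prices; ticks[i] traded sizes[i] contracts. */
    static double vwap(long[] ticks, int[] sizes) {
        long notional = 0;
        long volume = 0;
        for (int i = 0; i < ticks.length; i++) {
            notional += ticks[i] * sizes[i];
            volume += sizes[i];
        }
        if (volume == 0) {
            throw new IllegalArgumentException("no volume");
        }
        return (double) notional / volume;
    }

    public static void main(String[] args) {
        long[] ticks = {27930, 27931};
        int[] sizes = {6, 1};
        double aggregated = vwap(ticks, sizes);
        long roundedTick = Math.round(aggregated);
        System.out.println("VWAP " + aggregated + " -> all 7 contracts land on tick " + roundedTick);
    }
}
```

Under this assumed breakdown, the rounded profile credits all 7 contracts to tick 27930, while the true profile had 6 there and 1 at 27931. Total volume is preserved, but one contract is attributed to a neighboring level.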
When Approximation is Acceptable
Acceptable Use Cases
- Timeframe 1-minute or higher: ±1-2 ticks is noise
- Reference zones: Overnight levels as areas, not precise lines
- Non-HFT strategies: Latency already exceeds tick precision
- Trend/swing trading: Key levels have natural width
Not Acceptable
- Scalping/HFT: Every tick matters
- Precise entry/exit: Requires exact price levels
- Spread trading: Tick accuracy critical for edge calculation
- Backtesting: Aggregated data produces unrealistic fills
Data Quality Indicator
Add a quality mode indicator to your diagnostics:
private void logDataQualityMode() {
    long total = totalTrades.get() - zeroSizeTrades.get();
    long fracRaw = fractionalRawPrices.get();
    String mode;
    if (fracRaw == 0) {
        mode = "PRECISE (tick-by-tick data)";
    } else if (fracRaw * 100.0 / Math.max(1, total) > 50) {
        mode = "APPROXIMATED (aggregated source, ±1-2 tick uncertainty)";
    } else {
        mode = "MIXED (some aggregated data)";
    }
    Log.info("Data Quality Mode: " + mode);
}
Practical Architecture
Hybrid Data Strategy
| Session | Data Source | Quality | Handling |
|---|---|---|---|
| RTH (9:30 AM - 4:00 PM ET) | Rithmic live | Precise | Use as-is |
| Overnight (6:00 PM - 9:30 AM ET) | dxFeed historical | Approximated | Round to tick |
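The session boundaries in the table can be turned into a routing check. This is a minimal sketch using `java.time`; `SessionClassifier`, `Session`, and `needsTickRounding` are hypothetical names. It does not handle weekends, holidays, or exchange-specific maintenance windows, and it leaves the conversion from Bookmap timestamps to `Instant` to the caller. The 4:00 PM to 6:00 PM ET gap is not covered by the table, so the sketch reports it as unclassified.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;

/** Hypothetical helper: maps a timestamp to the session rows of the table above. */
final class SessionClassifier {
    enum Session { RTH, OVERNIGHT, UNCLASSIFIED }

    private static final ZoneId EASTERN = ZoneId.of("America/New_York");
    private static final LocalTime RTH_OPEN = LocalTime.of(9, 30);
    private static final LocalTime RTH_CLOSE = LocalTime.of(16, 0);
    private static final LocalTime OVERNIGHT_OPEN = LocalTime.of(18, 0);

    static Session classify(Instant timestamp) {
        LocalTime t = timestamp.atZone(EASTERN).toLocalTime();
        if (!t.isBefore(RTH_OPEN) && t.isBefore(RTH_CLOSE)) {
            return Session.RTH;
        }
        if (!t.isBefore(OVERNIGHT_OPEN) || t.isBefore(RTH_OPEN)) {
            return Session.OVERNIGHT;
        }
        return Session.UNCLASSIFIED; // 4:00 PM - 6:00 PM ET is not covered by the table
    }

    /** Overnight data comes from the aggregated source, so it needs rounding to tick. */
    static boolean needsTickRounding(Instant timestamp) {
        return classify(timestamp) == Session.OVERNIGHT;
    }
}
```

Using a named time zone rather than a fixed UTC offset keeps the boundaries correct across daylight saving changes.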
Alternative: Self-Recording
For highest overnight data quality without additional cost:
- Keep connection active overnight
- Record Rithmic live feed to local files
- Replay own recordings for backfill
This provides tick-by-tick quality for overnight sessions using your existing Rithmic subscription.
Summary
Key Takeaways
- Raw price format varies by provider - don't assume integers
- Detect aggregation by checking for fractional raw prices
- Filter zero-size trades from volume calculations
- Round to nearest tick when processing aggregated data
- Document data quality mode in output for transparency
- ±1-2 tick approximation is acceptable for non-HFT timeframes
Decision Matrix
| Your Timeframe | Data Quality Needed | Aggregated Data OK? |
|---|---|---|
| Microseconds | Exact | ❌ No |
| Seconds | Exact | ❌ No |
| 1+ Minutes | Approximate | ✅ Yes |
| 5+ Minutes | Reference zones | ✅ Yes |
See Also
- Price Conversion - Converting between Level 1 and display prices
- Data Listeners - TradeDataListener interface details
- Historical Data - Handling backfill and replay data
- Backfilled Data Listener - Session-based data processing