Skip to main content

Aligning Two Datasets Into One

Merging two different time series datasets into one can be like navigating a minefield — it is one of the trickiest challenges in data science.

Key Challenges

When aligning datasets, you need to consider several critical factors:
  1. Timestamp Availability — Even if your datasets are timestamped as Daily, you still need to know when the data was available to ensure it can line up.
  2. Symbol Consistency — Do both datasets contain identical symbols? How do you handle mismatches?
  3. Timeframe Misalignment — What happens when one dataset operates at 1-minute intervals and another at 5-minute intervals?
  4. Data Expansion Strategy — When expanding lower-frequency data, should you use first values, last values, or an alternative approach?
  5. Data Aggregation — When merging in the opposite direction, how do you summarize higher-frequency bars? The choice of average, max, min, or other metrics depends on column content.
The pandas reindex() method allows you to forward fill, back fill, use nearest, or fill with None.
df.reindex()
For detailed usage, see the pandas DataFrame.reindex documentation.

Alternative: SuperQuery

For datasets within CloudQuant Data Liberator, you can use the SuperQuery command to have the system perform the merge automatically. SuperQuery resamples multiple datasets onto a common time axis without manual alignment.
If both of your datasets are available in CloudQuant Data Liberator, SuperQuery is often the simplest approach — it handles the alignment for you. See the SuperQuery recipe for details.