Skip to main content

A Merge or Join

A merge or join operation expands the number of columns in a resulting DataFrame by combining data from two tables based on shared values. This differs from concatenation, which combines rows.

Merge vs. Concat

OperationWhat it does
CONCATCombines rows from two DataFrames (e.g., two 10-row DataFrames yield 20 rows). Generally requires identical columns.
MERGE/JOINCombines columns by matching rows on a shared column, typically increasing column count.

pd.merge()

Merging requires a left dataset, a right dataset, and a common column specified with the on parameter.

Join Types

The how parameter controls which rows appear in results:
Join TypeBehavior
inner(DEFAULT) Only rows where the merge column exists in both DataFrames
leftAll rows from the left DataFrame; NaN where right data is missing
rightAll rows from the right DataFrame; NaN where left data is missing
outerAll rows from both DataFrames; NaN where matches do not exist
Use indicator=True to add a column showing the merge source ('both', 'left_only', 'right_only') — especially useful when debugging outer joins.

Basic Examples

# Standard merge on a shared column
res = pd.merge(left, right, on='shared_column')

# Left join
res = pd.merge(left, right, on='shared_column', how='left')

# Merge with different column names
res = pd.merge(left, right, left_on='leftcol2',
               right_on='rightcol7', how='left')

# Select specific columns from right before merging
res = pd.merge(left, right[['col1', 'col2', 'col3']],
               on='shared_column')

# Check overlap before merging
left['datesym'].isin(right['datesym']).value_counts()

merge_asof()

merge_asof offers additional flexibility for time-based merging:
  • Merging on multiple values
  • Specifying tolerance thresholds

Example: Matching Quotes to Trades

pd.merge_asof(trades, quotes, on='time', by='ticker')

# With tolerance
pd.merge_asof(trades, quotes, on='time', by='ticker',
              tolerance=pd.Timedelta('2ms'))
If merge_asof() fails, try removing the tolerance parameter.

Reference