1. Support Center
  2. Python for Data Science Recipes

A MERGE or JOIN brings two tables together, typically creating a new table of MORE COLUMNS

working with python and pandas dataframes to merge or join tables

 

A MERGE or JOIN will typically expand the number of COLUMNS

CONCAT combines the ROWS of two dataframes (ie concat'ing two dataframes with 10 rows each will result in a dataframe with 20 rows). CONCAT 'generally' requires the columns to be the same

pd.merge

Merging requires a left dataset, a right dataset, and a common column to merge “on”.

You can also specify which columns to bring over from the right dataset

The result is rows from the left and right dataframes are matched up where there are common values of the merge column specified by “on”.

So if you are merging alternative data with market data, you would do best to use the market data as the left because if there is no market data there is nothing to analyze.

How

how = 'left', 'right', 'inner' (DEFAULT), 'outer'

Indicator

indicator = True will add an extra column when you use 'outer'. It’s values will be “both” or “left_only” or “right_only”.

Inner / Left / Right / Outer explained

  • Inner Merge / Inner join – (DEFAULT) Only keep rows where the merge “on” value exists in both the left and right dataframes.
  • Left Merge / Left outer join – Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.
  • Right Merge / Right outer join – Keep every row in the right dataframe. Where there are missing values of the “on” variable in the left column, add empty / NaN values in the result.
  • Outer Merge / Full outer join – A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere.

In short…

  • Inner will be only where both exist.
  • Left will be the length of the left dataframe.
  • Right will be the length of the Right dataframe.
  • Outer will be the length of all rows from both tables.

Examples of pd.merge

1.

res = pd.merge(left, right, on = 'shared_column')
res = pd.merge(left, right, on = 'shared_column', how = 'left')


# creates an extra column which indicates which method was used for the merges 

# pd.merge(left, right, on = 'shared_column', how = 'outer', indicator = True)


res['_merge'].value_counts() # gives a summary of the extra merge column
res = pd.merge(left, right[['col1', 'col1', 'col3']], on = 'shared_column')

# Merge when the columns have different names
res = pd.merge(left, right, left_on='leftcol2', right_on='rightcol7', how='left') 

 

2.

left = marketdata_df
right = altdata_df
shared_column = 'datesym'
# The next command will show us a True/False for how many of ‘one’ are in ‘the other’
# So the final table will be this size
left[shared_column].isin(right[shared_column]).value_counts
pd.merge(left,right,on = shared_column)

 

merge_asof can give you a little more flexibility

  • Merging on two values
  • Specifying a tolerance
    I’ve found that if a merge_asof fails it is often because of the tolerance, try removing it!

 

quotes time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

 

trades time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

 

pd.merge_asof(trades, quotes, on='time', by='ticker')

                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

 

pd.merge_asof(trades, quotes, on = 'time', by = 'ticker', tolerance = pd.Timedelta('2ms'))

                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

 

https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html