working with python and pandas dataframes to merge or join tables
A MERGE or JOIN will typically expand the number of COLUMNS
CONCAT combines the ROWS of two dataframes (ie concat'ing two dataframes with 10 rows each will result in a dataframe with 20 rows). CONCAT 'generally' requires the columns to be the same
pd.merge
Merging requires a left dataset, a right dataset, and a common column to merge “on”.
You can also specify which columns to bring over from the right dataset
The result is rows from the left and right dataframes are matched up where there are common values of the merge column specified by “on”.
So if you are merging alternative data with market data, you would do best to use the market data as the left because if there is no market data there is nothing to analyze.
How
how = 'left', 'right', 'inner' (DEFAULT), 'outer'
Indicator
indicator = True will add an extra column when you use 'outer'. It’s values will be “both” or “left_only” or “right_only”.
Inner / Left / Right / Outer explained
- Inner Merge / Inner join – (DEFAULT) Only keep rows where the merge “on” value exists in both the left and right dataframes.
- Left Merge / Left outer join – Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.
- Right Merge / Right outer join – Keep every row in the right dataframe. Where there are missing values of the “on” variable in the left column, add empty / NaN values in the result.
- Outer Merge / Full outer join – A full outer join returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with NaNs elsewhere.
In short…
- Inner will be only where both exist.
- Left will be the length of the left dataframe.
- Right will be the length of the Right dataframe.
- Outer will be the length of all rows from both tables.
Examples of pd.merge
1.
res = pd.merge(left, right, on = 'shared_column')
res = pd.merge(left, right, on = 'shared_column', how = 'left')
# creates an extra column which indicates which method was used for the merges
# pd.merge(left, right, on = 'shared_column', how = 'outer', indicator = True)
res['_merge'].value_counts() # gives a summary of the extra merge column
res = pd.merge(left, right[['col1', 'col1', 'col3']], on = 'shared_column')
# Merge when the columns have different names
res = pd.merge(left, right, left_on='leftcol2', right_on='rightcol7', how='left')
2.
left = marketdata_df
right = altdata_df
shared_column = 'datesym'
# The next command will show us a True/False for how many of ‘one’ are in ‘the other’
# So the final table will be this size
left[shared_column].isin(right[shared_column]).value_counts
pd.merge(left,right,on = shared_column)
merge_asof can give you a little more flexibility
- Merging on two values
- Specifying a tolerance
I’ve found that if a merge_asof fails it is often because of the tolerance, try removing it!
quotes time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
trades time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
pd.merge_asof(trades, quotes, on='time', by='ticker')
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
pd.merge_asof(trades, quotes, on = 'time', by = 'ticker', tolerance = pd.Timedelta('2ms'))
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html