A pivot table is a powerful tool used in data analysis to summarize, reorganize, and explore large datasets. Here are some key points about pivot tables and their uses:
What is a Pivot Table?
A pivot table allows you to:
- Summarize Data: It can aggregate data, such as calculating sums, averages, counts, and other statistics.
- Reorganize Data: You can pivot or rotate data to view it from different perspectives.
- Filter and Sort Data: Easily filter and sort data to focus on specific information.
- Group Data: Group data into categories for better analysis.
Why Use a Pivot Table?
- Simplify Complex Data: They help in breaking down large datasets into more manageable and understandable summaries.
- Identify Trends and Patterns: By summarizing data, pivot tables can reveal trends and patterns that might not be immediately obvious.
- Flexible Analysis: You can quickly change the layout and structure of the data to answer different questions.
- Save Time: Automate the process of data analysis, making it faster and more efficient.
Pivot Tables with Pandas
If you have ever used a Pivot Table in Excel you know how useful it can be to summarize/aggregate large amounts of data.
Pandas pivot_table command can do anything Excel can do and you can automate it to help you to produce regular reports.
import pandas as pd
# aggfunc = 'mean' or 'sum’ and others…
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
Example
Let's mix a few things we have learned together…
- Pull AAPL from dataset ‘Trades’ for 9:30 to 4pm, lazily try to avoid the open/close prints.
- Make a new column called odd_lots that contains a True/False if the shares column contains a value <100 or >=100.
- Make a new column called ‘Hour’ by plucking just the two digits for the hour out of the timestamp as a string.
- Make a pivot of these two columns and the shares column, counting how many shares per hour split by <100 or >-100…
- Calculate the percentage odd lots for each hour...
import liberator
%time df = liberator.get_dataframe(liberator.query(name = 'trades', symbols = 'AAPL', as_of = '2024-07-30 15:59:58', back_to = '2024-07-30 09:30:01')) # lazy attempt to skip the open/close trades.
df['odd_lots'] = df['shares'] < 100
df['Hour'] = df.apply(lambda row: str(row.timestamp)[11:13], axis = 1)
pivot_table = df.pivot_table(values = 'shares', index = 'Hour', columns = 'odd_lots', aggfunc = 'sum', fill_value = 0)
pivot_table['% Odd Lots'] = pivot_table[True] / (pivot_table[True] + pivot_table[False]) * 100
pivot_table
Dataframe Pivot Command Parameters…
- aggfunc = 'sum' - specifies that the 'sum' function should be used to aggregate the data in the values column.
- values = 'shares' - specifies that the 'shares' column is the Target of the aggfunc ie sum the shares.
- index = 'Hour' - specifies that the ROWS are grouped by the 'Hour' column’s values.
- columns = 'odd_lots' - specifies how the COLUMNS are split ie on the 'odd_lots' columns values - True/False.
- fill_value = 0 - what should the pivot do with missing values - in this case use 0