Summarizing data with a Pivot Table

A pivot table is a powerful tool used in data analysis to summarize, reorganize, and explore large datasets. Here are some key points about pivot tables and their uses:

What is a Pivot Table?

A pivot table allows you to:

Summarize Data: It can aggregate data, such as calculating sums, averages, counts, and other statistics.
Reorganize Data: You can pivot or rotate data to view it from different perspectives.
Filter and Sort Data: Easily filter and sort data to focus on specific information.
Group Data: Group data into categories for better analysis.

Why Use a Pivot Table?

Simplify Complex Data: They help in breaking down large datasets into more manageable and understandable summaries.
Identify Trends and Patterns: By summarizing data, pivot tables can reveal trends and patterns that might not be immediately obvious.
Flexible Analysis: You can quickly change the layout and structure of the data to answer different questions.
Save Time: Automate the process of data analysis, making it faster and more efficient.

Pivot Tables with Pandas

If you have ever used a Pivot Table in Excel you know how useful it can be to summarize/aggregate large amounts of data.

Pandas pivot_table command can do anything Excel can do and you can automate it to help you to produce regular reports.

import pandas as pd
# aggfunc = 'mean' or 'sum’ and others… 

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')

Example

Let's mix a few things we have learned together…

Pull AAPL from dataset ‘Trades’ for 9:30 to 4pm, lazily try to avoid the open/close prints.
Make a new column called odd_lots that contains a True/False if the shares column contains a value <100 or >=100.
Make a new column called ‘Hour’ by plucking just the two digits for the hour out of the timestamp as a string.
Make a pivot of these two columns and the shares column, counting how many shares per hour split by <100 or >-100…
Calculate the percentage odd lots for each hour...

import liberator

%time df = liberator.get_dataframe(liberator.query(name = 'trades', symbols = 'AAPL', as_of = '2024-07-30 15:59:58', back_to = '2024-07-30 09:30:01')) # lazy attempt to skip the open/close trades.

df['odd_lots'] = df['shares'] < 100

df['Hour'] = df.apply(lambda row: str(row.timestamp)[11:13], axis = 1)

pivot_table = df.pivot_table(values = 'shares', index = 'Hour', columns = 'odd_lots', aggfunc = 'sum', fill_value = 0)

pivot_table['% Odd Lots'] = pivot_table[True] / (pivot_table[True] + pivot_table[False]) * 100

pivot_table

pivot_table-2

Dataframe Pivot Command Parameters…

aggfunc = 'sum' - specifies that the 'sum' function should be used to aggregate the data in the values column.
values = 'shares' - specifies that the 'shares' column is the Target of the aggfunc ie sum the shares.
index = 'Hour' - specifies that the ROWS are grouped by the 'Hour' column’s values.
columns = 'odd_lots' - specifies how the COLUMNS are split ie on the 'odd_lots' columns values - True/False.
fill_value = 0 - what should the pivot do with missing values - in this case use 0