Batch Downloading data from Liberator (FAQ-p12)

Batch downloading is a useful method for API users who need to download very large datasets.

It provides a number of advantages:

The data is delivered as a stream, so you do not wait for the whole query to complete on the server before delivery begins.
Batching lets you write the query results to a file as they arrive, so if your connection is interrupted you can find out how far you got and restart from there.
Simple query-to-dataframe downloads are easy to use and work well for small downloads, but for larger downloads they often use double the memory, which can max out a user's PC and cause a major slowdown. Batching makes it possible to download even the largest datasets.
The code is no longer than a standard "query to dataframe" followed by "dataframe to csv".
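
The restart point mentioned above can be recovered from the partially written file itself. The following is a minimal sketch and not part of the Liberator API: `last_value` is a hypothetical helper, and the column name used in the usage note (`timestamp`) is an assumption about the CSV layout, so substitute whichever column orders your data.

```python
import csv

def last_value(path, column):
    """Return the last non-empty value found in `column` of a CSV file, or None.

    The file is read row by row, so memory use stays flat even for huge files.
    Rows that merely repeat the header (possible when every appended batch
    wrote its own header line) are ignored.
    """
    last = None
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = row.get(column)
            if value and value != column:
                last = value
    return last
```

For example, `last_value("minute_bar_data.csv", "timestamp")` would give the most recent value written before the interruption. The final row of an interrupted file may itself be truncated, so it is safer to restart from slightly before the returned value and drop duplicates afterwards.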

Small batch downloading processes

These two short scripts each submit a single large query and write the data to a file as it arrives in batches from Liberator. The second version also skips empty batches (so no file is written if no data is returned) and writes the header only once.

# Version 1: minimal. Appends every batch as it arrives; each batch writes its own header row.
import liberator, time

start_time = time.time()
for batch in liberator.query(name='minute_bars', symbols=None, as_of='2024-07-01', back_to='2024-01-01'):
    batch.to_pandas().to_csv("minute_bar_data.csv", mode='a')
print("The query + saving took", (time.time() - start_time)/60.0, "minutes to run")

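# Version 2: skips empty batches and writes the header only once.
import liberator, time

start_time = time.time()
first = True  # True until the first non-empty batch has been written
for batch in liberator.query(name='daily_bars', as_of='2024-07-01', back_to='2024-01-01', symbols=None):
    if not len(batch):
        continue
    # The first write creates (or overwrites) the file with a header; later writes append without one
    batch.to_pandas().to_csv('daily_bars.csv', mode='w' if first else 'a', header=first)
    first = False
print("The query + saving took", (time.time() - start_time)/60.0, "minutes to run")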
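
The skip-empty and header-once logic of the second script can be factored into a small generator that works with any iterable of batches. This is only a sketch: `first_flagged` is a hypothetical helper name, and plain Python lists stand in for Liberator batches so that the example runs without a connection.

```python
def first_flagged(batches):
    """Yield (is_first, batch) for each non-empty batch.

    is_first is True only for the first non-empty batch, which is the
    one that should create the file and write the header.
    """
    is_first = True
    for batch in batches:
        if not len(batch):
            continue  # empty batches never trigger a write
        yield is_first, batch
        is_first = False

# Stand-in data: lists play the role of batches here.
demo_batches = [[], [1, 2], [], [3]]
flagged = list(first_flagged(demo_batches))
```

With a real query, the loop would read `for first, batch in first_flagged(liberator.query(...)):` followed by `batch.to_pandas().to_csv(path, mode='w' if first else 'a', header=first)`. Keeping the flag separate from the batch index means a leading empty batch cannot suppress the header.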

Larger Batch Download Process

This code lets you download a very large dataset by splitting the download into calendar months.
Because the split is by calendar month, it does not use mcal.

# Define the start and end dates and other setup info for the loop
###########################################################################################
start_year = 2018
start_month = 3
end_year = 2024
end_month = 6  # plain integer; a leading zero (06) is a syntax error in Python 3
#symbols = ['AAPL','GOOG']
symbols = None
header = True # Put headers in files True or False
individual = 0 # save into individual files with date ranges in filenames
header_in_individual = 0 # Add a header in each individual file if individual files selected?
dataset = 'minute_bars'
###########################################################################################

import liberator, time

back_to_year = start_year
back_to_month = start_month
header_written = False  # set once the combined file (or the first individual file) has its header

while (back_to_year < end_year) or (back_to_year == end_year and back_to_month <= end_month):
    # as_of is the first day of the month after back_to, rolling into January after December
    if back_to_month == 12:
        as_of_month = 1
        as_of_year = back_to_year + 1
    else:
        as_of_month = back_to_month + 1
        as_of_year = back_to_year
    back_to = f"{back_to_year:04d}-{back_to_month:02d}-01"
    as_of = f"{as_of_year:04d}-{as_of_month:02d}-01"
    print(f"Downloading {dataset} {back_to} to {as_of}", end='')
    start_time = time.time()
    first_batch = True  # True until this month's first non-empty batch has been written
    for batch in liberator.query(name=dataset, as_of=as_of, back_to=back_to, symbols=symbols):
        if not len(batch):
            continue
        if individual:
            # One file per month: header on its first batch only, in every file if
            # header_in_individual is set, otherwise only in the first file written
            head = first_batch and bool(header_in_individual or (header and not header_written))
            batch.to_pandas().to_csv(dataset+'_'+back_to+'_'+as_of+'.csv', mode='a', header=head)
        else:
            # Single combined file: header once, before the very first batch
            head = bool(header) and not header_written
            batch.to_pandas().to_csv(dataset+'_all.csv', mode='a', header=head)
        first_batch = False
        header_written = True
    print(" Query + save took", (time.time() - start_time)/60.0, "minutes to run.")
    back_to_year = as_of_year
    back_to_month = as_of_month
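
The month stepping in the script above can also be isolated into a generator, which makes the date ranges easy to inspect before any query is sent. This is a sketch under the same conventions as the script (each range runs from the first of one month to the first of the next, and the end month is included); `month_ranges` is a hypothetical helper name, not a Liberator function.

```python
def month_ranges(start_year, start_month, end_year, end_month):
    """Yield (back_to, as_of) date strings, one pair per calendar month.

    The end month is included, so its as_of falls on the first day of
    the month after (end_year, end_month).
    """
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        # Roll over into January of the next year after December
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield f"{year:04d}-{month:02d}-01", f"{next_year:04d}-{next_month:02d}-01"
        year, month = next_year, next_month

# Four months spanning a year boundary
ranges = list(month_ranges(2023, 11, 2024, 2))
```

Each pair can then be passed to the query as `back_to` and `as_of`, replacing the manual year and month bookkeeping inside the `while` loop.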