Batch downloading is a useful method for API users who need to download very large datasets.
It provides several advantages:
- The data is delivered as a stream, so you do not have to wait for the whole query to complete on the server before delivery starts.
- Batching lets you write the results to a file as they arrive, so if your connection is interrupted you can find out how far you got and restart from there.
- Simple query-to-dataframe downloads are easy to use and work well for small downloads, but for larger downloads they often use double the memory, which can max out a user's PC and cause a major slowdown. Batching allows even the largest datasets to be downloaded.
- The code is no longer than a standard "query to dataframe" followed by "dataframe to csv".
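To make the "find out how far you got" point concrete, here is a minimal sketch of reading a partially written CSV to find the last complete row. It assumes the rows were written in order and that the file has a single header row. The function name `resume_point` and the column name `timestamp` used in the usage note are hypothetical and not part of Liberator; substitute whichever column your dataset actually uses.

```python
import csv


def resume_point(path, column):
    """Return the value of `column` in the last complete row of a CSV file.

    Returns None if the file is empty or contains only a header.
    A final row cut short by an interrupted write is ignored.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return None
        idx = header.index(column)
        last = None
        # Stream through the file row by row so memory use stays constant
        for row in reader:
            if len(row) == len(header):  # skip a truncated final row
                last = row[idx]
    return last
```

For example, `resume_point("minute_bar_data.csv", "timestamp")` would return the last value written in that column, which you could then use to choose the `back_to` date for the restarted query.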
Small batch downloading processes
The following two short scripts each submit a single large query and write to a file as the data arrives in batches from Liberator. The second version also skips empty batches (so no file is written if no data is returned) and writes the header only once.
import time

import liberator

start_time = time.time()
# Each batch is appended as soon as it arrives; the header row is repeated for every batch
for batch in liberator.query(name='minute_bars', symbols=None, as_of='2024-07-01', back_to='2024-01-01'):
    batch.to_pandas().to_csv("minute_bar_data.csv", mode='a')
print("The query + saving took", (time.time() - start_time) / 60.0, "minutes to run")
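import time

import liberator

start_time = time.time()
first = True  # becomes False once the first non-empty batch has been written
for batch in liberator.query(name='daily_bars', as_of='2024-07-01', back_to='2024-01-01', symbols=None):
    if not len(batch):
        continue  # skip empty batches so no file is created when nothing is returned
    # Overwrite and write the header for the first batch, then append without a header
    batch.to_pandas().to_csv('daily_bars.csv', mode='w' if first else 'a', header=first)
    first = False
print("The query + saving took", (time.time() - start_time) / 60.0, "minutes to run")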
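Because the first script above appends every batch with its own header row, its output file contains repeated header lines. The following is a minimal sketch of one way to clean such a file afterwards, using only the standard library. The function name `strip_repeated_headers` is hypothetical, and the sketch assumes that no genuine data row is identical to the header line.

```python
def strip_repeated_headers(src_path, dst_path):
    """Copy src_path to dst_path, dropping later lines identical to the first line.

    Returns the number of repeated header lines removed.
    """
    removed = 0
    # newline="" keeps the original line endings untouched
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        header = src.readline()
        dst.write(header)
        # Process one line at a time so large files do not have to fit in memory
        for line in src:
            if line == header:
                removed += 1
                continue
            dst.write(line)
    return removed
```

The file is processed line by line, so this works on files too large to load into a dataframe.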
Larger Batch Download Process
This code downloads a very large dataset by splitting the download into a series of one-month queries.
Because the splits fall on calendar months, it does not use mcal.
# Define the start and end dates and other setup info for the loop
###########################################################################################
start_year = 2018
start_month = 3
end_year = 2024
end_month = 6  # the last month to download (inclusive)
# symbols = ['AAPL', 'GOOG']
symbols = None
header = True  # Put headers in files: True or False
individual = 0  # Save into individual files with date ranges in the filenames
header_in_individual = 0  # Add a header to each individual file, if individual files are selected
dataset = 'minute_bars'
###########################################################################################
import time

import liberator

back_to_year = start_year
back_to_month = start_month
header_pending = bool(header)  # cleared once the first header has been written
while (back_to_year < end_year) or (back_to_year == end_year and back_to_month <= end_month):
    # Each query covers the 1st of this month up to the 1st of the following month
    if back_to_month == 12:
        as_of_month = 1
        as_of_year = back_to_year + 1
    else:
        as_of_month = back_to_month + 1
        as_of_year = back_to_year
    back_to = f"{back_to_year:04d}-{back_to_month:02d}-01"
    as_of = f"{as_of_year:04d}-{as_of_month:02d}-01"
    print(f"Downloading {dataset} {back_to} to {as_of}", end='')
    start_time = time.time()
    if individual:
        # One file per month; later files get a header only if header_in_individual is set
        filename = f"{dataset}_{back_to}_{as_of}.csv"
        write_header = bool(header_in_individual or header_pending)
    else:
        # One combined file; only the very first write carries the header
        filename = f"{dataset}_all.csv"
        write_header = header_pending
    for batch in liberator.query(name=dataset, as_of=as_of, back_to=back_to, symbols=symbols):
        if not len(batch):
            continue  # skip empty batches
        batch.to_pandas().to_csv(filename, mode='a', header=write_header)
        write_header = False  # write the header at most once per file
        header_pending = False
    print(f" Query + save took {(time.time() - start_time) / 60.0:.2f} minutes to run.")
    # Move on to the next month
    back_to_year = as_of_year
    back_to_month = as_of_month
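The month arithmetic in the loop above can also be factored into a small generator, which makes the date windows easy to inspect before any query is sent. This is a minimal sketch using only the standard library. The name `month_windows` is hypothetical and not part of Liberator; the date strings follow the same `YYYY-MM-01` format that the loop passes as `back_to` and `as_of`.

```python
def month_windows(start_year, start_month, end_year, end_month):
    """Yield (back_to, as_of) date strings, one pair per calendar month.

    The end month is included: its window runs from the 1st of that month
    to the 1st of the month after it.
    """
    year, month = start_year, start_month
    # Tuple comparison orders (year, month) pairs chronologically
    while (year, month) <= (end_year, end_month):
        # Roll over to January of the next year after December
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield f"{year:04d}-{month:02d}-01", f"{next_year:04d}-{next_month:02d}-01"
        year, month = next_year, next_month


if __name__ == "__main__":
    for back_to, as_of in month_windows(2023, 11, 2024, 1):
        print(back_to, "to", as_of)
```

Each yielded pair could then be passed to the query in place of the hand-computed `back_to` and `as_of` values.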