Batch Downloading Data
Batch downloading is a useful technique for API users who need to retrieve very large datasets. This approach offers several advantages over traditional query methods.
Key Advantages
- Streamed delivery — Data arrives as a stream rather than waiting for complete server-side processing
- Resumable downloads — Data is written to file as it arrives, providing restart points if the connection is interrupted
- Memory efficiency — Avoids double memory usage common with standard DataFrame downloads
- Simplicity — Code length is comparable to standard query-to-DataFrame approaches
Small Batch Download Example
The simplest implementation submits a single large query and writes batches to file:
import liberator, time

start_time = time.time()
# Stream the query results and append each batch to the CSV as it arrives
for batch in liberator.query(name='minute_bars', symbols=None,
                             as_of='2024-07-01', back_to='2024-01-01'):
    batch.to_pandas().to_csv("minute_bar_data.csv", mode='a')
print("The query + saving took", (time.time() - start_time)/60.0, "minutes to run")
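One caveat of this simple version: to_csv with mode='a' writes a header row by default on every call, so the output file ends up with one header per batch. A small pandas-only sketch (no liberator required; two tiny DataFrames stand in for query batches) demonstrates the problem:

```python
import os
import tempfile

import pandas as pd

# Two stand-in "batches", as if yielded by a streamed query
batches = [pd.DataFrame({"close": [1.0]}), pd.DataFrame({"close": [2.0]})]

path = os.path.join(tempfile.mkdtemp(), "bars.csv")
for batch in batches:
    batch.to_csv(path, mode='a')   # default header=True on every append

with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # [',close', '0,1.0', ',close', '0,2.0'] -- header appears twice
```

The enhanced version below avoids this by writing the header only for the first batch.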
An enhanced version prevents empty files and handles headers correctly:
import liberator, time
start_time = time.time()
wrote = False    # becomes True once the first non-empty batch has been written
for batch in liberator.query(name='daily_bars', as_of='2024-07-01',
                             back_to='2024-01-01', symbols=None):
    if not len(batch):   # skip empty batches so no header-only file is created
        continue
    # First non-empty batch overwrites ('w') and writes the header;
    # every later batch appends without one
    batch.to_pandas().to_csv('daily_bars.csv', mode='a' if wrote else 'w',
                             header=not wrote)
    wrote = True
print("The query + saving took", (time.time() - start_time)/60.0, "minutes to run")
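The first-batch-writes-the-header pattern can be factored into a small helper for reuse across scripts. This is a sketch; csv_write_args is a name introduced here for illustration, not part of liberator or pandas:

```python
def csv_write_args(first_batch):
    """Return to_csv keyword arguments for streamed batch writing.

    The first non-empty batch overwrites the file and writes the header;
    every later batch appends without a header.
    """
    return {"mode": "w" if first_batch else "a", "header": first_batch}

print(csv_write_args(True))   # {'mode': 'w', 'header': True}
print(csv_write_args(False))  # {'mode': 'a', 'header': False}
```

In the loop above, the to_csv call would become batch.to_pandas().to_csv('daily_bars.csv', **csv_write_args(not wrote)).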
Large Batch Download Process
For very large datasets, splitting downloads into monthly chunks prevents resource constraints:
# Define the start and end dates and other setup info for the loop
start_year = 2018
start_month = 3
end_year = 2024
end_month = 6                  # note: a leading zero (06) is a syntax error in Python 3
symbols = None                 # None requests all symbols
individual = False             # True: one CSV per month; False: one combined CSV
header_in_individual = False   # True: write a header row in each monthly file
dataset = 'minute_bars'

import liberator, time

back_to_year = start_year
back_to_month = start_month
wrote_combined_header = False
while (back_to_year < end_year) or (back_to_year == end_year and back_to_month <= end_month):
    # as_of is the first day of the month after back_to
    if back_to_month == 12:
        as_of_month = 1
        as_of_year = back_to_year + 1
    else:
        as_of_month = back_to_month + 1
        as_of_year = back_to_year
    back_to = f"{back_to_year:04d}-{back_to_month:02d}-01"
    as_of = f"{as_of_year:04d}-{as_of_month:02d}-01"
    print(f"Downloading {dataset} {back_to} to {as_of}", end='')
    start_time = time.time()
    wrote_monthly_header = False
    for batch in liberator.query(name=dataset, as_of=as_of,
                                 back_to=back_to, symbols=symbols):
        if not len(batch):
            continue
        if individual:
            # Write the header at most once per monthly file
            head = header_in_individual and not wrote_monthly_header
            batch.to_pandas().to_csv(f"{dataset}_{back_to}_{as_of}.csv",
                                     mode='a', header=head)
            wrote_monthly_header = True
        else:
            # Write the header only for the very first batch of the combined file
            head = not wrote_combined_header
            batch.to_pandas().to_csv(f"{dataset}_all.csv", mode='a', header=head)
            wrote_combined_header = True
    print(" Query + save took", (time.time() - start_time)/60.0, "minutes to run")
    back_to_year = as_of_year
    back_to_month = as_of_month
This monthly chunking approach allows downloading even the largest datasets without memory constraints, making it ideal for production data pipelines.
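Monthly chunking also makes interrupted runs easy to resume when each month is written to its own file: on restart, skip any month whose CSV already exists on disk. A sketch of that check, assuming the per-month file naming used above (months_to_download is a hypothetical helper, not part of liberator):

```python
import os
import tempfile

def months_to_download(months, out_dir, dataset):
    """Return the (back_to, as_of) pairs whose monthly CSV does not yet exist."""
    todo = []
    for back_to, as_of in months:
        path = os.path.join(out_dir, f"{dataset}_{back_to}_{as_of}.csv")
        if not os.path.exists(path):
            todo.append((back_to, as_of))
    return todo

out_dir = tempfile.mkdtemp()
months = [("2024-01-01", "2024-02-01"), ("2024-02-01", "2024-03-01")]
# Pretend the January download already completed before an interruption:
open(os.path.join(out_dir, "minute_bars_2024-01-01_2024-02-01.csv"), "w").close()
print(months_to_download(months, out_dir, "minute_bars"))
# [('2024-02-01', '2024-03-01')] -- only February remains
```

Note this treats any existing file as complete; a production pipeline might also compare row counts or write a sentinel file after each month finishes.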