Local File (CSV/TSV)
Local file datasources read CSV, TSV, or other delimited flat files from a directory on the CloudQuant Data Liberator server or a mounted filesystem. This is the simplest file-based connection type and serves as the foundation for understanding all other file-based sources.
Connection Configuration
Required Fields
| Field | Type | Description |
|---|---|---|
| connection_type | string | Must be "file" |
| behavior | string | Must be "file" |
| location | string | Absolute path to the directory containing data files |
The location field should point to a directory, not an individual file. CloudQuant Data Liberator will scan the directory for files matching the file_pattern in data_args.
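Conceptually, this scan behaves like a non-recursive glob over the directory. The sketch below is illustrative only (the `scan_location` helper is hypothetical, not Data Liberator's actual code):

```python
from pathlib import Path

def scan_location(location: str, file_pattern: str) -> list[str]:
    """Return files in `location` whose names match the glob `file_pattern`.

    Mirrors the documented behavior: `location` is a directory, and only
    files matching the pattern are considered. Non-recursive.
    """
    return sorted(str(p) for p in Path(location).glob(file_pattern) if p.is_file())
```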
Example Connection
```json
{
  "name": "local-trades-connection",
  "connection_type": "file",
  "behavior": "file",
  "location": "/data/trades"
}
```
Dataset Configuration (data_args)
All file-based datasources share the same data_args fields. These control how CloudQuant Data Liberator finds, parses, and interprets your files.
Required Fields
| Field | Type | Description |
|---|---|---|
| file_pattern | string | Glob pattern to match files, e.g., "*.csv", "prefix_*.tsv" |
| data_dt_column | string or list | Column(s) containing the datetime value |
| data_dt_format | string or list | strptime format string, or one of the special values "muts", "uts", "nuts", "datetime", "date" |
| data_key_column | string or list | Column(s) used as the symbol/key for query filtering |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| sep_override | string | "," | Delimiter character: "," (comma), "\t" (tab), "\|" (pipe), ";" (semicolon) |
| encoding | string | "utf-8" | File encoding (e.g., "utf-8", "latin-1", "ascii") |
| data_dt_timezone | string | "UTC" | Timezone of the source data, e.g., "UTC", "America/New_York" |
| fname_dt_regex | string | | Regex to extract a date from the filename |
| fname_dt_format | string | | strptime format for the date extracted by fname_dt_regex |
| fname_dt_timezone | string | | Timezone of the filename-derived date |
| fname_dt_nudge | int | 0 | Microsecond offset applied to filename-derived dates |
| fname_dt_approx_seconds | int | | Approximate seconds of data per file (used to skip files during queries) |
| arrow_sort | list | ["symbol", "muts"] | Sort order for the resulting Arrow table |
| arrow_timestamp | bool | true | Whether to generate the human-readable timestamp column |
Set fname_dt_approx_seconds to 86400 for daily files. This helps CloudQuant Data Liberator skip files outside the query’s time range, significantly improving performance for large directories.
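The pruning that fname_dt_regex, fname_dt_format, and fname_dt_approx_seconds enable can be sketched as an interval-overlap test: a file is read only if the window starting at its filename date and spanning approximately fname_dt_approx_seconds can intersect the query range. The `file_overlaps_query` helper below is a hypothetical illustration, not Data Liberator's actual implementation:

```python
import re
from datetime import datetime, timedelta

def file_overlaps_query(fname: str, dt_regex: str, dt_format: str,
                        approx_seconds: int,
                        query_start: datetime, query_end: datetime) -> bool:
    """Keep a file only if its window [file_start, file_start + approx_seconds)
    can overlap [query_start, query_end). Files whose names don't match the
    regex are kept, since they cannot be safely pruned."""
    m = re.search(dt_regex, fname)
    if not m:
        return True  # can't prune what we can't date
    file_start = datetime.strptime(m.group(1), dt_format)
    file_end = file_start + timedelta(seconds=approx_seconds)
    return file_start < query_end and query_start < file_end
```

With daily files and approx_seconds of 86400, a query for a single trading day touches only that day's file.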
Complete Example
Below is a full configuration showing both the connection and a dataset for daily trade CSV files.
Connection
```json
{
  "name": "local-daily-trades",
  "connection_type": "file",
  "behavior": "file",
  "location": "/data/daily-trades"
}
```
Dataset
```json
{
  "name": "us-equity-trades",
  "connection": "local-daily-trades",
  "data_args": {
    "file_pattern": "trades_*.csv",
    "sep_override": ",",
    "encoding": "utf-8",
    "data_dt_column": "trade_time",
    "data_dt_format": "%Y-%m-%d %H:%M:%S",
    "data_dt_timezone": "America/New_York",
    "data_key_column": "symbol",
    "fname_dt_regex": "trades_(\\d{4}-\\d{2}-\\d{2})\\.csv",
    "fname_dt_format": "%Y-%m-%d",
    "fname_dt_timezone": "America/New_York",
    "fname_dt_approx_seconds": 86400,
    "arrow_sort": ["symbol", "muts"],
    "arrow_timestamp": true
  },
  "schema": [
    { "name": "symbol", "type": "string", "group": "key", "description": "Ticker symbol" },
    { "name": "trade_time", "type": "string", "group": "time", "description": "Trade timestamp" },
    { "name": "price", "type": "double", "group": "value", "description": "Trade price" },
    { "name": "volume", "type": "int64", "group": "value", "description": "Trade volume" }
  ]
}
```
Ensure the CloudQuant Data Liberator process has read permissions on the location directory and all files within it. Permission errors will cause silent failures during query execution.
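Because these failures are silent, it can help to verify permissions up front. A minimal check, assuming a hypothetical `check_readable` helper run as the same user as the Data Liberator process:

```python
import os

def check_readable(location: str) -> list[str]:
    """Return paths under `location` that the current process cannot read.

    Checks read+execute on the directory (needed to list it) and read
    permission on each regular file inside. An empty list means all clear.
    """
    problems = []
    if not os.access(location, os.R_OK | os.X_OK):
        return [location]  # can't even list the directory
    for entry in os.scandir(location):
        if entry.is_file() and not os.access(entry.path, os.R_OK):
            problems.append(entry.path)
    return problems
```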
Tab-Separated Files (TSV)
For TSV files, set sep_override to "\t":
```json
{
  "data_args": {
    "file_pattern": "*.tsv",
    "sep_override": "\t",
    "data_dt_column": "date",
    "data_dt_format": "%Y%m%d",
    "data_dt_timezone": "UTC",
    "data_key_column": "ticker"
  }
}
```
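The effect of sep_override is the same as choosing the delimiter in any CSV parser. A minimal sketch with Python's standard csv module (the `read_delimited` helper is hypothetical):

```python
import csv
import io

def read_delimited(text: str, sep: str = ",") -> list[dict]:
    """Parse delimited text into a list of row dicts, honoring the
    delimiter the way sep_override does ("," for CSV, "\t" for TSV, etc.)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))
```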
Composite Key Example
When the symbol is constructed from multiple columns:
```json
{
  "data_key_column": [
    { "type": "column", "value": "exchange" },
    { "type": "literal", "value": "_" },
    { "type": "column", "value": "ticker" }
  ]
}
```
This produces keys like NYSE_AAPL, NASDAQ_MSFT, etc.
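The composite-key rule is a simple concatenation: column parts are replaced by the row's value, literal parts are inserted verbatim. A sketch of that logic (the `build_key` helper is hypothetical, not the product's code):

```python
def build_key(row: dict, key_spec) -> str:
    """Build the query key from a data_key_column spec: either a plain
    column name, or a list of {"type": "column"|"literal", "value": ...}
    parts concatenated in order."""
    if isinstance(key_spec, str):
        return str(row[key_spec])
    parts = []
    for part in key_spec:
        if part["type"] == "column":
            parts.append(str(row[part["value"]]))
        else:  # "literal": insert the value as-is
            parts.append(part["value"])
    return "".join(parts)
```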
Multiple Datetime Columns
When the date and time are in separate columns:
```json
{
  "data_dt_column": ["trade_date", "trade_time"],
  "data_dt_format": ["%Y-%m-%d", "%H:%M:%S.%f"]
}
```
CloudQuant Data Liberator concatenates the columns with a space before parsing, so the effective format becomes "%Y-%m-%d %H:%M:%S.%f".
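That concatenate-then-parse behavior can be reproduced with strptime directly. The `parse_dt` helper below is an illustrative sketch of the documented rule, not the actual implementation:

```python
from datetime import datetime

def parse_dt(row: dict, dt_columns, dt_formats) -> datetime:
    """Parse the row datetime per the documented rule: a single column is
    parsed with its format directly; multiple columns are joined with a
    space and parsed with the space-joined formats."""
    if isinstance(dt_columns, str):
        return datetime.strptime(row[dt_columns], dt_formats)
    value = " ".join(row[c] for c in dt_columns)
    fmt = " ".join(dt_formats)
    return datetime.strptime(value, fmt)
```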