Local File (CSV/TSV)

Local file datasources read CSV, TSV, or other delimited flat files from a directory on the CloudQuant Data Liberator server or a mounted filesystem. This is the simplest file-based connection type and serves as the foundation for understanding all other file-based sources.

Connection Configuration

Required Fields

Field            Type    Description
---------------  ------  ----------------------------------------------------
connection_type  string  Must be "file"
behavior         string  Must be "file"
location         string  Absolute path to the directory containing data files

The location field should point to a directory, not an individual file. CloudQuant Data Liberator will scan the directory for files matching the file_pattern in data_args.
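
To get a feel for which files a given file_pattern would pick up, the following minimal sketch applies glob-style matching to a list of filenames using Python's standard fnmatch module. The helper name select_files and the sample filenames are hypothetical, and the matching rules inside CloudQuant Data Liberator (case sensitivity, recursion into subdirectories) are not specified here, so treat this as an approximation.

```python
from fnmatch import fnmatchcase


def select_files(filenames, file_pattern):
    """Return the names that match a glob pattern, in sorted order."""
    # fnmatchcase compares case-sensitively on every platform.
    return sorted(name for name in filenames if fnmatchcase(name, file_pattern))


# Hypothetical directory listing for /data/trades
names = ["trades_2024-01-03.csv", "trades_2024-01-02.csv", "README.txt"]
print(select_files(names, "trades_*.csv"))
```

Running a check like this against a real directory listing is a quick way to confirm that a pattern is neither too broad nor too narrow before registering the dataset.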

Example Connection

{
  "name": "local-trades-connection",
  "connection_type": "file",
  "behavior": "file",
  "location": "/data/trades"
}
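
Before submitting a connection, it can help to check it against the required-field rules above. The sketch below is one possible pre-flight check; the function name connection_problems is hypothetical, it is not part of CloudQuant Data Liberator, and it assumes POSIX-style paths on the server.

```python
import json
from pathlib import PurePosixPath

# Fixed values required for a local file connection, per the table above.
REQUIRED_VALUES = {"connection_type": "file", "behavior": "file"}


def connection_problems(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field, expected in REQUIRED_VALUES.items():
        if config.get(field) != expected:
            problems.append(f'{field} must be "{expected}"')
    location = config.get("location")
    # Assumption: the server uses POSIX paths, so "absolute" means a leading "/".
    if not isinstance(location, str) or not PurePosixPath(location).is_absolute():
        problems.append("location must be an absolute directory path")
    return problems


example = json.loads("""
{
  "name": "local-trades-connection",
  "connection_type": "file",
  "behavior": "file",
  "location": "/data/trades"
}
""")
print(connection_problems(example))
```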

Dataset Configuration (data_args)

All file-based datasources share the same data_args fields. These control how CloudQuant Data Liberator finds, parses, and interprets your files.

Required Fields

Field            Type            Description
---------------  --------------  ------------------------------------------------------------
file_pattern     string          Glob pattern to match files, e.g., "*.csv", "prefix_*.tsv"
data_dt_column   string or list  Column(s) containing the datetime value
data_dt_format   string or list  strptime format string, or one of the special values "muts", "uts", "nuts", "datetime", "date"
data_key_column  string or list  Column(s) used as the symbol/key for query filtering

Optional Fields

Field                    Type    Default             Description
-----------------------  ------  ------------------  -----------
sep_override             string  ","                 Delimiter character: "," (comma), "\t" (tab), "|" (pipe), ";" (semicolon)
encoding                 string  "utf-8"             File encoding (e.g., "utf-8", "latin-1", "ascii")
data_dt_timezone         string  "UTC"               Timezone of source data, e.g., "UTC", "America/New_York"
fname_dt_regex           string  -                   Regex to extract a date from the filename
fname_dt_format          string  -                   strptime format for the date extracted by fname_dt_regex
fname_dt_timezone        string  -                   Timezone of the filename-derived date
fname_dt_nudge           int     0                   Microsecond offset applied to filename-derived dates
fname_dt_approx_seconds  int     -                   Approximate number of seconds of data per file (used for query optimization)
arrow_sort               list    ["symbol", "muts"]  Sort order for the resulting Arrow table
arrow_timestamp          bool    true                Whether to generate the human-readable timestamp column

Set fname_dt_approx_seconds to 86400 for daily files. This helps CloudQuant Data Liberator skip files outside the query’s time range, significantly improving performance for large directories.
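
The idea behind this optimization can be sketched as an interval-overlap test: each file is assumed to cover the span from its filename-derived date to that date plus fname_dt_approx_seconds, and a file is read only if that span intersects the query range. The function name file_may_overlap is hypothetical, and the sketch ignores timezones and fname_dt_nudge; the actual pruning logic in CloudQuant Data Liberator may differ.

```python
from datetime import datetime, timedelta


def file_may_overlap(file_start, approx_seconds, query_start, query_end):
    """True if [file_start, file_start + approx_seconds) intersects [query_start, query_end)."""
    file_end = file_start + timedelta(seconds=approx_seconds)
    return file_start < query_end and query_start < file_end


# Five daily files (86400 seconds each), queried for 3 January only.
daily_files = [datetime(2024, 1, day) for day in range(1, 6)]
query_start, query_end = datetime(2024, 1, 3), datetime(2024, 1, 4)
to_read = [d for d in daily_files if file_may_overlap(d, 86400, query_start, query_end)]
print(to_read)
```

Under these assumptions only one of the five files needs to be opened, which is where the saving comes from in directories holding years of daily files.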

Complete Example

Below is a full configuration showing both the connection and a dataset for daily trade CSV files.

Connection

{
  "name": "local-daily-trades",
  "connection_type": "file",
  "behavior": "file",
  "location": "/data/daily-trades"
}

Dataset

{
  "name": "us-equity-trades",
  "connection": "local-daily-trades",
  "data_args": {
    "file_pattern": "trades_*.csv",
    "sep_override": ",",
    "encoding": "utf-8",
    "data_dt_column": "trade_time",
    "data_dt_format": "%Y-%m-%d %H:%M:%S",
    "data_dt_timezone": "America/New_York",
    "data_key_column": "symbol",
    "fname_dt_regex": "trades_(\\d{4}-\\d{2}-\\d{2})\\.csv",
    "fname_dt_format": "%Y-%m-%d",
    "fname_dt_timezone": "America/New_York",
    "fname_dt_approx_seconds": 86400,
    "arrow_sort": ["symbol", "muts"],
    "arrow_timestamp": true
  },
  "schema": [
    { "name": "symbol", "type": "string", "group": "key", "description": "Ticker symbol" },
    { "name": "trade_time", "type": "string", "group": "time", "description": "Trade timestamp" },
    { "name": "price", "type": "double", "group": "value", "description": "Trade price" },
    { "name": "volume", "type": "int64", "group": "value", "description": "Trade volume" }
  ]
}
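
The fname_dt_regex and fname_dt_format values in the dataset above can be tried out locally before deploying. The sketch below assumes the date is taken from the first capture group and that the pattern is applied as a search over the filename; both are assumptions about the matching behavior, and filename_date is a hypothetical helper.

```python
import re
from datetime import datetime

# In JSON the pattern is written with doubled backslashes ("\\d");
# after JSON decoding it is the regex below.
FNAME_DT_REGEX = r"trades_(\d{4}-\d{2}-\d{2})\.csv"
FNAME_DT_FORMAT = "%Y-%m-%d"


def filename_date(filename):
    """Return the date embedded in the filename, or None if the regex does not match."""
    match = re.search(FNAME_DT_REGEX, filename)
    if match is None:
        return None
    # Assumption: capture group 1 holds the text parsed with fname_dt_format.
    return datetime.strptime(match.group(1), FNAME_DT_FORMAT)


print(filename_date("trades_2024-03-15.csv"))
print(filename_date("quotes_2024-03-15.csv"))
```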
Ensure the CloudQuant Data Liberator process has read permissions on the location directory and all files within it. Permission errors will cause silent failures during query execution.
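
Because permission problems may not surface as visible errors, a quick check run as the same OS user as the CloudQuant Data Liberator process can save debugging time. The following is a minimal sketch using only the Python standard library; unreadable_entries is a hypothetical helper, and it inspects only the top level of the directory.

```python
import os
import tempfile


def unreadable_entries(location):
    """Return paths under `location` that the current process cannot read."""
    # Listing a directory requires read and execute (search) permission on it.
    if not os.access(location, os.R_OK | os.X_OK):
        return [location]
    blocked = []
    for entry in sorted(os.listdir(location)):
        path = os.path.join(location, entry)
        if os.path.isfile(path) and not os.access(path, os.R_OK):
            blocked.append(path)
    return blocked


# Demonstration on a throwaway directory; point it at the real location in practice.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "trades_2024-01-02.csv"), "w").close()
    print(unreadable_entries(demo_dir))
```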

Tab-Separated Files (TSV)

For TSV files, set sep_override to "\t":
{
  "data_args": {
    "file_pattern": "*.tsv",
    "sep_override": "\t",
    "data_dt_column": "date",
    "data_dt_format": "%Y%m%d",
    "data_dt_timezone": "UTC",
    "data_key_column": "ticker"
  }
}
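
Note that "\t" in JSON is an escape sequence: once the configuration is decoded, sep_override holds a single tab character, not a backslash followed by "t". The sketch below illustrates this and parses a small made-up TSV sample with Python's csv module; the sample rows and the read_rows helper are hypothetical, and this is not how CloudQuant Data Liberator itself reads files.

```python
import csv
import io
import json

# Raw string so that Python passes the backslash through to the JSON decoder.
data_args = json.loads(r'{"sep_override": "\t"}')


def read_rows(text, sep):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))


# Made-up sample content for illustration only.
sample = "date\tticker\tclose\n20240102\tABC\t1.23\n"
rows = read_rows(sample, data_args["sep_override"])
print(rows)
```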

Composite Key Example

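When the symbol is constructed from multiple columns, give data_key_column as an ordered list of parts, each either a column reference or a literal string: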
{
  "data_key_column": [
    { "type": "column", "value": "exchange" },
    { "type": "literal", "value": "_" },
    { "type": "column", "value": "ticker" }
  ]
}
This produces keys like NYSE_AAPL, NASDAQ_MSFT, etc.
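
The way such a specification maps a row to a key can be sketched as follows. This is an illustration of the concatenation described above, not the actual implementation; build_key and the sample row are hypothetical.

```python
KEY_SPEC = [
    {"type": "column", "value": "exchange"},
    {"type": "literal", "value": "_"},
    {"type": "column", "value": "ticker"},
]


def build_key(row, key_spec):
    """Concatenate column values and literals in the order given by key_spec."""
    parts = []
    for part in key_spec:
        if part["type"] == "column":
            parts.append(str(row[part["value"]]))
        elif part["type"] == "literal":
            parts.append(part["value"])
        else:
            raise ValueError(f"unsupported key part type: {part['type']!r}")
    return "".join(parts)


print(build_key({"exchange": "NYSE", "ticker": "AAPL", "price": "1.0"}, KEY_SPEC))
```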

Multiple Datetime Columns

When the date and time are in separate columns, list the columns in order and give one format per column:
{
  "data_dt_column": ["trade_date", "trade_time"],
  "data_dt_format": ["%Y-%m-%d", "%H:%M:%S.%f"]
}
CloudQuant Data Liberator concatenates the columns with a space before parsing, so the effective format becomes "%Y-%m-%d %H:%M:%S.%f".
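
The concatenate-then-parse behavior described above can be reproduced with Python's datetime.strptime to test a pair of formats against sample values. The helper parse_split_datetime and the sample row are hypothetical; timezone handling (data_dt_timezone) is left out of this sketch.

```python
from datetime import datetime

DATA_DT_COLUMN = ["trade_date", "trade_time"]
DATA_DT_FORMAT = ["%Y-%m-%d", "%H:%M:%S.%f"]


def parse_split_datetime(row, columns, formats):
    """Join the column values and the formats with a space, then parse once."""
    text = " ".join(row[column] for column in columns)
    effective_format = " ".join(formats)  # "%Y-%m-%d %H:%M:%S.%f" here
    return datetime.strptime(text, effective_format)


row = {"trade_date": "2024-01-02", "trade_time": "09:30:00.250000"}
print(parse_split_datetime(row, DATA_DT_COLUMN, DATA_DT_FORMAT))
```

If the value in either column does not match its format, strptime raises ValueError, which makes this a convenient way to catch format mistakes early.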