Understanding Point-In-Time datasets
as_of and back_to represent the end and start points, respectively, of the date range of data that you wish to pull.
We deliberately chose not to use start and end because some datasets are 'POINT-IN-TIME'. For these datasets, you may get different data back depending upon your as_of date.
What are POINT-IN-TIME datasets?
The most common example of 'POINT-IN-TIME' data is a revised earnings report. If your as_of falls on the date of the original earnings report, or any date up to the revision date, you receive the values as initially reported; if your as_of is the day of the revision or any later date, you receive the revised data.
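The as_of behavior described above can be sketched in plain Python. This is a minimal illustration of the semantics, not the liberator API itself; the figures, dates, and the value_as_of helper are made up for the example.

```python
from datetime import date

# Hypothetical revision history for one earnings figure: each entry records
# the date a value was published and the value itself, in publication order.
revisions = [
    (date(2023, 2, 1), 1.50),   # original earnings report
    (date(2023, 5, 15), 1.32),  # revised figure published later
]

def value_as_of(history, as_of):
    """Return the most recent value published on or before as_of.

    This mirrors POINT-IN-TIME semantics: the answer depends on when you ask.
    """
    published = [value for pub_date, value in history if pub_date <= as_of]
    if not published:
        return None  # the figure had not been published yet
    return published[-1]

# Before the revision date we see the original value...
print(value_as_of(revisions, date(2023, 3, 1)))   # 1.5
# ...and on or after the revision date we see the revised value.
print(value_as_of(revisions, date(2023, 6, 1)))   # 1.32
```

The same query against the same history returns different answers for different as_of dates, which is exactly why a fixed start/end pair would be ambiguous for such datasets.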
However, with the advent of datasets created using machine learning, POINT-IN-TIME data has become more prevalent. It is not unusual for the provider of an ML dataset to 'retrain' their model and 'recalculate its history'. In that circumstance, most traders and quants want to be able to test with the original data that was available at the time of the trade, and only use the recalculated data when THEY choose to.
Any POINT-IN-TIME dataset is clearly labeled as such in its dataset description in liberator.datasets().
Source of Point-In-Time Data Problems
Point-In-Time data problems exist because most databases are designed to show only the most current information. That is inadequate for quantitative data science, because we often need to know what the numbers were at each point along the timeline so that we do not allow forward-looking or survivorship bias into our data analysis.
For example, if you ask the Securities and Exchange Commission (SEC) for the last 5 years of earnings for a given company, you will receive a report containing that information. However, the earnings for the previous 4 years may have been restated and republished due to accounting errors, misclassifications, regulatory scrutiny, valuation errors, or even fraud. This is perfect if you want to know what the values are now; but at the point in time when the earlier years were originally submitted (prior to restatement), the numbers may have differed.
A thoughtful data scientist or quantitative researcher will want to see the numbers as originally reported, along with any modifications to those numbers and correct timestamps for when the modifications were made.
Another example is any portfolio index. Take the Standard & Poor's(™) S&P 500 index. Over time, the constituents of the index change: companies are removed and new members are added based upon the judgment of the analysts at S&P. Usually, stocks are removed because they are the subject of a merger or are no longer performing well financially. Therefore, if you do quantitative analysis based on today's list of stocks in the S&P 500, you almost certainly have some degree of survivorship bias in your analysis. The astute analyst will want to know the exact makeup of the index, and all changes to it, over the period of the data analysis.
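Index membership over time can be modeled as intervals, and the membership on any given date recovered with a point-in-time lookup. The sketch below uses made-up symbols and dates, not real S&P 500 data, and the constituents_as_of helper is hypothetical.

```python
from datetime import date

# Hypothetical membership intervals for an index: (symbol, added, removed).
# removed is None while the stock is still a constituent today.
membership = [
    ("AAA", date(2015, 1, 1), date(2019, 6, 30)),  # dropped after a merger
    ("BBB", date(2015, 1, 1), None),
    ("CCC", date(2019, 7, 1), None),               # added to replace AAA
]

def constituents_as_of(intervals, as_of):
    """Return the set of symbols that were index members on as_of."""
    return {
        symbol for symbol, added, removed in intervals
        if added <= as_of and (removed is None or as_of <= removed)
    }

print(sorted(constituents_as_of(membership, date(2018, 1, 1))))  # ['AAA', 'BBB']
print(sorted(constituents_as_of(membership, date(2020, 1, 1))))  # ['BBB', 'CCC']
```

Backtesting against constituents_as_of for each trade date, rather than against today's member list, is what removes the survivorship bias described above.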