News Focus
Followers 50
Posts 6231
Boards Moderated 0
Alias Born 08/26/2002

Re: Joe Stocks post# 217807

Monday, 03/15/2004 9:21:48 AM

Post# of 704041
I think I've figured out why we're seeing increasing amounts of program trading. A lot of it has to do with better and more data along with increased computing power. What used to take somebody sitting in front of a terminal pushing buttons, now they just program up the computers and turn them loose. I've been trying to vet the accuracy of some of the indicators I use. Stumbled on this interesting site, http://www.datavendors.com/ And from there came across this interesting white paper, http://www.tickdata.com/FilteringWhitePaper.pdf Haven't had a chance to read all 20 pages, but the first part is interesting. Anybody wondering how good your streaming data is: if you aren't seeing MSFT update about 4 times per second, you ain't seeing the real action, just snapshots here and there. The median stock in the Russell 3000 changes about once every 11 seconds.
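A quick sanity check of those rates, assuming a standard 6.5-hour regular session (9:30 to 16:00 ET), which the paper does not state explicitly:

```python
# Rough arithmetic behind the update rates quoted above.
# The 6.5-hour session length is an assumption; the tick counts
# are the figures quoted from the white paper, not measured here.

SESSION_SECONDS = 6.5 * 60 * 60  # 23,400 seconds

msft_ticks_per_day = 90_000       # from the paper
msft_ticks_per_second = msft_ticks_per_day / SESSION_SECONDS

median_ticks_per_day = 2_100      # median Russell 3000 stock, per the paper
seconds_per_tick = SESSION_SECONDS / median_ticks_per_day

print(f"MSFT: {msft_ticks_per_second:.1f} ticks/sec")              # ~3.8
print(f"Median stock: one tick every {seconds_per_tick:.0f} sec")  # ~11
```

Both figures line up with the claims above: roughly four updates per second for MSFT, and one trade every eleven seconds for the median name.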

High Frequency Data Filtering
A review of the issues associated with maintaining and cleaning a high frequency financial database
Thomas Neal Falkenberry, CFA

Every day millions of data points flow out of the global financial markets, driving investor and trader decision logic. These data points, or ticks, represent the basic building blocks of analysis. Unfortunately, they are too often transmitted with erroneous prices that render unfiltered data unusable.

The importance of clean data, and hence an emphasis on filtering bad data, has risen in recent years. Advances in technology (bandwidth, computing power, and storage) have made analysis of the large datasets associated with higher frequency data more accessible to market participants. In response, the academic and professional community has made rapid advances in the fields of trading, microstructure theory, arbitrage, option pricing, and risk management, to name a few. We refer readers to Lequeux (1999) for an excellent overview of various subjects of high frequency research.

In turn, the increased usage of high frequency data has created the need for electronic execution platforms to act on the higher frequency of trade decisions. By electronic execution, we do not refer to the process of typing order specifications into a Web site and having the order electronically transmitted. We refer to the fully automated process of electronically receiving data, processing that data through decision logic, generating orders, communicating those orders electronically, and finally, receiving confirmation of transactions. A bad tick into the system means a possible bad order out of the system. The cost of exiting a trade generated on a bad tick becomes a new source of system slippage and a potentially huge source of risk via duplicate or unexpected orders.
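The fully automated loop described above can be sketched schematically. Every component name here (feed, strategy, broker, filter) is a hypothetical stand-in for illustration, not part of any real platform:

```python
# Schematic of the automated pipeline: receive data, filter it,
# run decision logic, send orders, confirm. All components are toys.

def run_pipeline(feed, strategy, broker, tick_filter):
    for tick in feed:                 # receive data electronically
        if not tick_filter(tick):     # a bad tick in...
            continue                  # ...means a bad order out
        order = strategy(tick)        # decision logic
        if order is not None:
            ack = broker(order)       # transmit order, await confirmation
            assert ack, "unconfirmed order"

# Toy components: one decimal-error tick slips into the feed.
ticks = [100.0, 100.5, 1005.0, 100.4]
good = lambda t: t < 500                           # crude sanity filter
buy_dips = lambda t: "BUY" if t < 100.2 else None  # toy decision logic
broker = lambda order: True                        # always confirms
run_pipeline(ticks, buy_dips, broker, good)
```

Without the filter, the 1005.0 print would reach the decision logic, which is exactly the "bad tick in, bad order out" risk the paper describes.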

Estimates for the frequency of bad ticks vary. Dacorogna et al. (1995) estimated that error rates on forex quote data are between 0.11% and 0.81%. Lundin et al. (1999) describe the use of filters in preprocessing forex, stock index, and implied forward interest rate returns whereby 2%–3% of all data points were identified as false outliers.

This paper will describe the issues associated with maintaining and cleaning a high frequency financial database. We will attempt to identify the problem, its origins, properties, and solutions. We will also outline the filters developed by Tick Data, Inc. to address the problem, although the outline is intentionally general. This paper will make frequent use of charts and tables to illustrate key points. These charts and tables include data provided from multiple sources, each of which is highly reputable. The errant data points illustrated in this paper are structural to the market information process and do not reflect problems, outages, or the lack of quality control on the part of any vendor.

I. The Problem

Intraday data, also referred to interchangeably as tick data and high frequency data, is characterized by issues that relate both to the structure of the market information process and to the statistical properties of the data itself.

At a basic level the problem is characterized by size. Microsoft (MSFT) has averaged 90,000 ticks per day over the past twelve months. That equates to 22.6 million data points for a single year. While the number of stocks with this high level of tick count is limited, the median stock in the Russell 3000 produces approximately 2,100 ticks per day or 530,000 per year. A reasonable research or buy list of 500 stocks, each with three to five years of data, can exceed two billion data points. Data storage requirements can easily reach several hundred gigabytes after storing date, time, and volume for each tick.
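The arithmetic behind those figures can be checked directly. The 252 trading days per year is an assumption; the tick counts are the paper's:

```python
# Back-of-envelope check of the dataset sizes quoted above.
# Assumes ~252 trading days per year.

TRADING_DAYS = 252

msft_year = 90_000 * TRADING_DAYS    # the "22.6 million" figure
median_year = 2_100 * TRADING_DAYS   # the "530,000 per year" figure

# Two billion data points across 500 names over five years implies an
# average tick rate well above the median -- active names dominate:
avg_ticks_per_day = 2_000_000_000 / (500 * 5 * TRADING_DAYS)

print(f"MSFT ticks/year:   {msft_year:,}")      # 22,680,000
print(f"Median ticks/year: {median_year:,}")    # 529,200
print(f"Implied avg ticks/day for the 500-name list: {avg_ticks_per_day:,.0f}")
```

The last figure (roughly 3,200 ticks per day) shows that a two-billion-point buy list must be weighted toward more active names than the Russell 3000 median.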

While advances in databases, database programming, and computing power have made the size issue easier to manage, the statistical characteristics of high frequency data leave plenty of challenges.

Specifically, problems arise due to:
• The asynchronous nature of tick data.
• The myriad of possible error types, including isolated bad ticks, multiple bad ticks in succession, decimal errors, transposition errors, and the loss of the decimal portion of a number.
• The treatment of time.
• Differences in tick frequency across securities.
• Intraday seasonal patterns in tick frequency.
• Bid-ask bounce.
• The inability to explain the cause of errant data.
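A few of the error types listed above can be made concrete. All prices here are invented for illustration:

```python
# Illustrative instances of the error types listed above, measured as
# percentage deviation from a hypothetical true price of 27.45.

true_price = 27.45

decimal_error = 274.5   # decimal point shifted one place
transposition = 24.75   # digits 7 and 4 transposed
lost_fraction = 27.0    # fractional portion of the number dropped

def pct_deviation(bad, good):
    """Percent deviation of a bad print from the true price."""
    return abs(bad - good) / good * 100

for name, bad in [("decimal error", decimal_error),
                  ("transposition", transposition),
                  ("lost fraction", lost_fraction)]:
    print(f"{name:14s} {pct_deviation(bad, true_price):6.1f}% off")
```

The decimal error is off by 900% and is trivially caught; the transposition (~10%) and lost fraction (~2%) are the borderline cases discussed next.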

Yet perhaps the most difficult aspect of cleaning intraday data is the inability to universally define what is “unclean.” You know it when you see it, but not everyone sees the same thing. There are obvious outliers, such as decimal errors, and there are borderline errors, such as losing the fractional portion of a number or a trade reported thirty seconds out of sequence. The removal of obvious outliers is a relatively easy problem to solve. The complexity lies in the handling of borderline, or marginal, errors.
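Removing the obvious outliers can be as simple as rejecting prints that stray far from a rolling median. This is a minimal sketch of that idea; the window size and rejection threshold are illustrative assumptions, not parameters from Tick Data's actual filters:

```python
# Minimal rolling-median filter for "obvious" outliers such as
# decimal errors. Window and threshold are illustrative only.
import statistics

def filter_obvious(prices, window=5, max_dev=0.10):
    """Drop ticks deviating more than max_dev from the rolling median."""
    clean = []
    for p in prices:
        # Reference price: median of the last `window` accepted ticks.
        ref = statistics.median(clean[-window:]) if clean else p
        if abs(p - ref) / ref <= max_dev:
            clean.append(p)
    return clean

ticks = [27.45, 27.46, 274.5, 27.44, 27.47]   # one decimal error
print(filter_obvious(ticks))  # [27.45, 27.46, 27.44, 27.47]
```

A filter this crude catches the 900% decimal error but would pass a transposition or lost-fraction print untouched, which is precisely why the marginal cases are the hard part.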

The filtering of marginal errors involves a tradeoff. Filter data too loosely and you are still left with unusable data for testing. Filter data too tightly and you increase the possibility of overscrubbing it, thereby taking reality out of the data and changing its statistical properties. Overscrubbing data is a serious form of risk. Models developed on overscrubbed data are likely to find real-time trading a chaotic experience. Entry and exit logic based on stop and limit orders will be routinely triggered by real-time data that exhibits considerably greater volatility than that experienced during simulation. Dunis et al. (1998) describe a methodology for tick filtering in which the authors state, “cleaning and filtering of an archived database will typically be far more rigorous than what can feasibly be achieved for incoming real-time data.” We reject this concept for the reason cited above: treating data differently in real time versus historical simulation can be risky.
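The overscrubbing effect can be demonstrated on a toy series: a filter that rejects any tick-to-tick move beyond a tight threshold deflates the measured volatility, so a model calibrated on the scrubbed series will meet larger swings live than it ever saw in simulation. All prices and thresholds here are invented:

```python
# Demonstrates the overscrubbing risk: rejecting every tick-to-tick
# move beyond a tight threshold lowers the measured volatility of
# the series. Data and threshold are synthetic.
import statistics

def scrub(prices, max_move):
    """Keep only ticks within max_move (fractional) of the last kept tick."""
    kept = [prices[0]]
    for p in prices[1:]:
        if abs(p - kept[-1]) / kept[-1] <= max_move:
            kept.append(p)
    return kept

ticks = [100.0, 100.5, 99.8, 101.2, 99.5, 100.9, 100.1]

raw_vol = statistics.stdev(ticks)
tight_vol = statistics.stdev(scrub(ticks, max_move=0.005))

print(f"raw stdev:       {raw_vol:.3f}")
print(f"scrubbed stdev:  {tight_vol:.3f}")   # lower: real moves removed
```

Every rejected tick here is a genuine (if sharp) price move, so the lower scrubbed volatility is an artifact of the filter, not of the market.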

Defining marginal errors is the crux of the tradeoff between underscrubbing and overscrubbing data. In our opinion, these errors are a function of the base data unit (tick, 1-minute, 60-minute, etc.) employed by the trader. What is a bad tick to a tick-based trader may be insignificant to a trader using 60-minute bars. That is not to say that the 60-minute trader cannot or should not filter data to the same degree as the tick trader, but the decision to do so may unnecessarily add to the level of sophistication required by the filter(s). This unconventional idea, that error definition is unique to the trader and hence that there is no single correct scrubbed time series applicable to all traders, has evolved through our work with high frequency data and traders over the past eighteen years. We believe it is more important to match the properties of historical and real-time data than it is to have “perfect” historical data and “imperfect” real-time data.
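The base-unit dependence is easy to see on synthetic data: a modestly bad print is a large event at tick resolution, yet it may leave a 60-minute bar's close untouched while quietly contaminating the bar's low:

```python
# Why error definition depends on the trader's base unit.
# 24.75 is a hypothetical transposed print of 27.45; data is synthetic.

ticks = [27.45, 27.46, 24.75, 27.44, 27.47]   # one bad print mid-bar

# Aggregate the whole sequence into a single bar.
bar_close = ticks[-1]
bar_low = min(ticks)

# Largest tick-to-tick move, as seen by a tick-based trader.
worst_tick_move = max(abs(b - a) / a for a, b in zip(ticks, ticks[1:]))

print(f"worst tick-to-tick move: {worst_tick_move:.1%}")   # ~10%
print(f"bar close: {bar_close} (unaffected by the bad print)")
print(f"bar low:   {bar_low} (contaminated by the bad print)")
```

A tick trader's stop would be blown through by the ~10% swing; a 60-minute trader keying off bar closes never notices, though anyone using bar lows (for stops or range measures) still inherits the error.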

The primary objective in developing a set of tick filters is to manage the overscrub/underscrub tradeoff in such a fashion as to produce a time series that removes false outliers in the trader’s base unit of analysis and can support historical backtesting without stripping the real-time properties of the data.
