Refusing to See the Obvious
You should reject any claim that an effect does not exist merely on the grounds that a statistical test fails to declare significance.
Such claims are false. Yet they are commonplace. I read such assertions from academics, and they demonstrate poor scholarship. This is a major league scandal.
What is even worse, and I see it often, is that people go out of their way to avoid seeing the obvious.
The Proper Statistical Setting
Statistics can never show that two things are the same. They can show that, for a given set of assumptions, it is unlikely that they are the same. Statistics identifies differences, not similarities.
A standard statistical test may declare significance when there is none. The chance of such a false alarm, or false positive, is the type 1 error. At a 95% confidence level, the chance of a type 1 error is 5%.
Statisticians can and do show similarities, but in a roundabout manner. After making a series of assumptions and setting up a test, a statistician can calculate how likely the test would be to reveal an actual difference of a specified magnitude. This tells us the likelihood of a missed detection, the type 2 error.
A plot of the probability of detecting an actual difference (one minus the probability of a type 2 error) versus the size of that difference is called the POWER of a test.
Unless you specify the power of a test, any failure to report statistical significance is meaningless.
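To make this concrete, here is a minimal power-calculation sketch in Python. The numbers (a true edge of 2% per year, 20% annual volatility of excess returns, a 14-year record, a one-sided test at the 95% confidence level) are illustrative assumptions, not figures from any particular study.

```python
# Minimal power-calculation sketch. All numbers are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

true_edge = 0.02     # assumed true outperformance per year
annual_sd = 0.20     # assumed volatility of annual excess returns
n_years = 14         # length of the track record
alpha = 0.05         # one-sided type 1 error rate (95% confidence)

std_error = annual_sd / sqrt(n_years)      # standard error of the mean excess return
z_critical = norm.ppf(1.0 - alpha)         # hurdle for declaring significance
power = 1.0 - norm.cdf(z_critical - true_edge / std_error)

print(f"Power to detect a {true_edge:.0%} edge: {power:.0%}")
print(f"Type 2 error (missed detection):    {1.0 - power:.0%}")
```

Under these assumptions the power comes out near 10%. Such a test misses a genuinely skilled manager roughly nine times out of ten, so its failure to declare significance tells us almost nothing.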
The Scandal in Financial Statistics
Here are quotes from pages 382-383 of David Dreman's Contrarian Investment Strategies: The Next Generation:
"..The statistics of the original mutual fund researchers in the sixties and early seventies failed to turn up such above-average performance by any investors.."
"..On closer examination, the efficient market victory vanished.."
"Even to be flagged on the screen, the manager had to outperform the market by 5.83% annually for 14 years. When we remember a top manager might beat the market by 1.5 or 2% a year over this length of time, the returns required by Jensen to pick up managers outperforming the averages were impossibly high. Only a manager in the league of a Buffett or Templeton might make the grade. One fund outperformed the market by 2.2% a year for 20 years, but according to Jensen's calculations, this superb performance was not statistically significant."
"In another study..it was not possible at a 95% statistical confidence level to say a portfolio up 90% over 10 years was better managed than another portfolio down 3%.."
Faulty Statistical Assumptions
The basic concept behind statistics is brilliant. You set up a hypothesis. It is called the null hypothesis. Then you try to reject it.
You cannot simply look at data and draw strong conclusions. You must introduce a set of assumptions. These assumptions include the details of the probability distribution. What you show when you declare statistical significance is that the observed data are atypical for the assumed probability distribution.
If an assumed probability distribution would have produced an observed result less than 5% of the time, you declare at a 95% confidence level that the differences are real, not simply the result of randomness.
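As a toy illustration of that logic, with made-up numbers throughout: assume annual excess returns are normal with zero true edge and a known volatility, then ask how often that assumed distribution would produce the observed average.

```python
# Toy illustration of the declare-significance logic. All numbers are made up.
from math import sqrt
from scipy.stats import norm

assumed_sd = 0.18        # assumed annual volatility under the null hypothesis
n_years = 20             # length of the observed record
observed_mean = 0.05     # observed average excess return

z = observed_mean / (assumed_sd / sqrt(n_years))
p_value = norm.sf(z)     # chance the assumed distribution produces a result this large

if p_value < 0.05:
    print(f"p = {p_value:.3f}: declare significance at the 95% level")
else:
    print(f"p = {p_value:.3f}: no significance declared, which is NOT proof of no effect")
```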
Many flaws related to probability distribution assumptions are well publicized.
Most often, the assumed probability distribution is the normal, Gaussian, bell-shaped curve. [More precisely, lognormal: the logarithms of the price changes are assumed to have a normal distribution, not the prices themselves.] Most often, it is based on monthly statistics. It is made better if you separate secular bull markets from secular bear markets (as does Ed Easterling of Crestmont Research).
This is an excellent approximation and it works exceedingly well AT THE 90% CONFIDENCE LEVEL (two-sided, 95% one-sided confidence level).
This is an exceedingly dangerous assumption when you try to assign high levels of confidence. The assumption has made Nassim Taleb, the author of Fooled by Randomness, rich. He bets (through out-of-the-money options) that highly unusual events (Black Swan events) occur more frequently than standard analysis suggests. He almost always loses his bets. But whenever he wins a bet, he wins big, really big.
Similarly, Benoit Mandelbrot, who wrote The (Mis)Behavior of Markets, points to the wildly inaccurate risk assessments of insurance and insurance-related products (such as commodity futures and cotton farms). A standard finance model declared the odds of ruin to be one chance in ten billion billion when the real-world odds were one in ten to one in thirty (pages 232-233).
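A quick way to see how much rides on the distributional assumption is to compare tail probabilities under a normal curve with those under a fat-tailed alternative. The Student-t distribution with three degrees of freedom below is merely a stand-in for fat tails, not a claim about the true distribution of returns.

```python
# Compare the odds of an extreme move under the usual normal assumption with a
# fat-tailed stand-in (Student-t, 3 degrees of freedom, rescaled to unit variance).
from math import sqrt
from scipy.stats import norm, t

DF = 3
t_scale = sqrt(DF / (DF - 2))            # rescale so both curves have unit variance

for k in (3, 5, 7):                      # size of the move, in standard deviations
    p_normal = norm.sf(k)                # tail probability under the normal curve
    p_fat = t.sf(k * t_scale, df=DF)     # tail probability under the fat-tailed stand-in
    print(f"{k}-sigma move: normal {p_normal:.1e} vs fat-tailed {p_fat:.1e}")
```

The normal curve treats large moves as essentially impossible; the fat-tailed curve does not. The further out you go, the wider the gap, which is why high-confidence claims are the ones most at risk.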
Refusing to See the Obvious
For some reason, which I do not understand, it is considered OK for a financial researcher to miss the obvious. It is one thing for researchers to reject an advertising claim of 30% returns far into the future. It is quite another to infer that nobody can beat the market from the failure of managers to beat the market by 5.83% annually over 14 years.
If researchers were to report the sensitivity of their tests accurately, that is, if they were to report the power of their tests and the type 2 errors, we would not run into this problem.
It is my opinion that the failure to see the obvious comes mostly from honest mistakes, often made through ignorance, with only a few people intentionally making misleading assertions. However, it is also my opinion that many of those making honest mistakes know better. But they are lazy.
Let’s take an easy example. If we plot the real (inflation-adjusted) value of the stock market versus time on a semilog graph, we see small variations about a straight line. The slope of the line corresponds to a real, annualized return of around 6.5% to 7.0%. It is obvious that the long-term return of the stock market has been predictable, and it is reasonable to expect it to continue to be predictable.
We don’t have to be demanding to get useful information. For those with the proper technical skills, it is straightforward to introduce formalism and to show whatever rigor is needed. For example, we might insist only that the long-term return is bounded between 4% and 8% and we can still get a lot of useful information.
In essence, if the price strays far from the long-term rate of return, it will correct. We do not insist that we know the exact details. We do insist that corrections will take place.
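Here is a minimal sketch of that fit, using a simulated 200-year series (a 6.7% real drift plus made-up annual noise) in place of actual market data. On a real inflation-adjusted total-return series, the same straight-line fit on a semilog scale is what produces the 6.5% to 7.0% figure quoted above.

```python
# Fit a straight line to the logarithm of a real (inflation-adjusted) stock index
# versus time. The slope is the long-term growth rate in log terms; exponentiate
# to get the annualized return. The series here is simulated, standing in for data.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(200)
log_index = np.cumsum(np.log(1.067) + rng.normal(0.0, 0.18, size=200))

slope, intercept = np.polyfit(years, log_index, 1)   # the straight line on a semilog plot
long_term_return = np.exp(slope) - 1.0

print(f"Fitted long-term real return: {long_term_return:.1%}")
```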
If someone simply wants to be contentious, he can demand all sorts of definitions and set criteria that we cannot meet. For example, he might insist that the long-term return have exactly one true value instead of being bounded within limits. Then, if he were to find a segment with a 6.4% return instead of 6.5% to 7.0%, he could claim to have proved that there is nothing of predictive value.
I have seen this kind of thing many times. It is not technically challenging to refute. Not by any stretch. But it can be exceedingly tedious.
The latest approach that I have seen for rejecting the obvious uses data overlap as a subtle excuse. We have two centuries of stock market returns (1800-2000), with better numbers in later periods, that vary slightly around a rate of 6.5% to 7.0%. If we use a 10-year (or 20-year) period as the minimum amount of time for reporting effects and IF WE INSIST ON NO DATA OVERLAP whatsoever, we have suddenly compressed 200 years of fluctuating stock prices into 20 (or 10) distinct data points (known as degrees of freedom).
It is more difficult to claim statistical significance with 20 data points than with 200. If you cut the time period to the twentieth century, you can cut the number of data points to 10. I have seen the time period reduced even more.
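The arithmetic of that compression is easy to lay out; here is a small counting sketch using the spans just mentioned.

```python
# How many data points survive when you insist on non-overlapping windows?
years_of_data = 200                                # roughly 1800-2000, as above

for window in (10, 20):
    overlapping = years_of_data - window + 1       # windows offset by one year
    non_overlapping = years_of_data // window      # windows sharing no years at all
    print(f"{window}-year windows: {overlapping} overlapping, {non_overlapping} non-overlapping")

# Cutting the record to the twentieth century (100 years) shrinks the counts again.
print(f"10-year windows in 100 years: {100 - 10 + 1} overlapping, {100 // 10} non-overlapping")
```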
Before too long, you won’t be able to conclude anything from the data. Best of all, you give the IMPRESSION, BUT NOT THE REALITY, of doing things right, of applying rigor.
Let us think through this statistical problem with the intent of extracting useful information. Does it make sense to talk about a long-term return of the stock market? Or is the obvious only an illusion?
Look at two 10-year sequences, offset by one year. Consider 1946-1956 and 1947-1957. Ten years overlap, but two do not. The first sequence has 1946. The second sequence has 1957. Both sequences share the years 1947-1956.
We could draw a line through the overlapping years of returns at the overall historical rate. [We can also draw lines with returns at the historical bounds.] Now the question becomes whether year-to-year random price fluctuations from the two years, 1946 and 1957, are big enough to account for the deviations from the overall historical returns.
The answer is that one year is not quite big enough, but two are. There remains a little overlap: most of the years at the end of one sequence are also years at the start of a later sequence. This reduces the number of independent comparisons (or degrees of freedom) by a factor between one and two. From our 200 years, we still have the information from more than 100 independent data points (degrees of freedom).
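For readers who want to reproduce the comparison, here is a rough sketch that uses simulated independent annual returns (made-up volatility, 6.7% long-run rate) in place of the historical record. It computes the two quantities compared above: how much a window's average moves when one boundary year is swapped, and how far window averages stray from the long-run rate.

```python
# Sketch of the comparison above, on simulated data standing in for the record.
# Consecutive 10-year windows offset by one year differ only through one dropped
# year and one added year; compare those swaps with the windows' deviations from
# the long-run rate.
import numpy as np

rng = np.random.default_rng(1)
long_run = 0.067                                    # assumed long-run annual real return
annual = long_run + rng.normal(0.0, 0.18, 200)      # 200 simulated annual returns
window = 10

# Average return of every overlapping 10-year window, offset by one year.
window_means = np.convolve(annual, np.ones(window) / window, mode="valid")

swap_effect = np.diff(window_means)   # change from swapping one boundary year: (new - dropped) / 10
deviation = window_means - long_run   # how far each window strays from the long-run rate

print(f"Typical effect of a one-year swap:      {np.std(swap_effect):.3f}")
print(f"Typical deviation from the long run:    {np.std(deviation):.3f}")
```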
The effect is strong. The effect is real. We do not claim to know the exact confidence level. But we would not have claimed a confidence level above 90% (two-sided, 95% one-sided confidence level) anyway.
Have fun.
John Walter Russell
December 3, 2005