Statistical Ideas: Sea of faulty polls

Short-term update: two polling articles here in the past week saw a combined readership of >1.5 million, and several thousand shares. Thank you! Latest article here.

In this article we cover the theoretical bases for two interconnected ideas that we've discussed recently: (a) that the empirical polling results are not as dire as current landslide mainstream media projections make them out to be, and (b) that many polls are oscillating about impossibly low probabilities right now for Donald Trump. This year is genuinely unique in merging several fundamental aspects: a largely disenfranchised voting base across the country (i.e., record undecideds), and pollsters unable or unwilling to properly assess the true probability for Mr. Trump (as their incoherent polls evidence). This is not a matter of apologizing for the ground-level odds currently shown by mainstream media, or of arguing that the average Hillary Clinton lead is merely unsustainably high. That would miss the forest for the trees, as we theoretically prove here.

Start by studying a sample of the general election polls below, taken in just the past couple of days. Do you see anything wrong there? If you don't, then you have no business being around polling data. The average margin of error on the 7 spreads shown is only 3%. Most polls should therefore be within a few percent of the 6% average spread that is advertised by media. But instead most are not! For example, the difference between the highest Ms. Clinton spread and the lowest Ms. Clinton spread is >14 percentage points! And the standard deviation among these mainstream polls is 5%. Both measures have to be taken together, and each is already higher than 3%! That's an impossible outcome through luck alone. Therefore something is misrepresented in the polls.

Also, right now 2 of the 7 polls favor Donald (you just don't hear about them), so double the 10-15% odds he is being given. In the final analysis of this trinomial data, on November 9 we'll look back and see that only one poll was correct and most were flat out wrong.
This evidence below is a breach of the probability theory behind proper polling, where most polls should see the correct spread within the margin of error interval (that's what the interval's definition must be!). If the margins are therefore completely busted, then so too are the egregious spreads that are seen to be all over the place (and mostly untrustworthy). Likely the correct expected spread right now is 4-5%, and the larger spreads are coming from pollsters that ironically also have the highest margins of error (casting further suspicion on how close the election really is for Americans). We stand by our long-running estimate that the current probability for a Donald Trump victory is in the 20% range, or twice what mainstream media is projecting. Of course that is low, but to some it's still a compelling chance of roughly 1 in 4 (and much different than some might expect given all the twists and turns this campaign season has brought us). It's also a better reflection of the true odds, versus those dished out by the same inane talking heads who recently gave you the Brexit "remain" prediction, or the NeverTrump prediction!

So there you have it, as clearly shown as possible. If these margins of error were correct, then most polls would have spreads located within a few percent of 6% (so 3% to 9%). Yet the majority of the polls are outside of this 3% to 9% interval. Probabilistically impossible. The idea is that whatever the correct spread is determined to be on November 9, we will ultimately prove (shockingly, we might add) that one poll was correct but also that most of these other polls were wrong. Those polls (we are unsure right now which) will be wrong because the correct spread will have been outside of their margin of error intervals.
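The dispersion argument above can be checked with a rough simulation. This is a minimal sketch, not the calculation used in this article: it assumes the advertised 3% margin is a 95% half-width on the spread itself, that every poll is an independent normal draw around the same 6% spread, and that sampling error is the only source of variation. The function name and its parameters are hypothetical. It uses only the summary figures quoted in the text (7 polls, 6% average spread, 3% margin, >14 point range, 5% standard deviation), not the individual poll values.

```python
import random
import statistics

Z95 = 1.96  # assumption: the quoted margin of error is a 95% half-width


def simulate_dispersion(n_polls=7, true_spread=6.0, moe=3.0,
                        range_seen=14.0, sd_seen=5.0,
                        trials=20000, seed=0):
    """Estimate how often pure sampling error reproduces the observed scatter.

    Returns two fractions of simulated poll sets: those whose high-low range
    exceeds range_seen, and those whose sample standard deviation reaches
    sd_seen.
    """
    rng = random.Random(seed)
    standard_error = moe / Z95  # implied per-poll standard error of the spread
    wide_range = 0
    high_sd = 0
    for _ in range(trials):
        polls = [rng.gauss(true_spread, standard_error) for _ in range(n_polls)]
        if max(polls) - min(polls) > range_seen:
            wide_range += 1
        if statistics.stdev(polls) >= sd_seen:
            high_sd += 1
    return wide_range / trials, high_sd / trials


if __name__ == "__main__":
    p_range, p_sd = simulate_dispersion()
    print("share of simulations with range > 14 points:", p_range)
    print("share of simulations with std dev >= 5 points:", p_sd)
```

Under these assumptions both shares come out at or extremely near zero, which is the sense in which the observed scatter cannot be explained by the advertised margins alone. Changing `moe` shows how much wider the margins would have to be before the observed scatter becomes plausible.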

The only correction anyone can make now to the failed margin of error is to enlarge it, in order to encapsulate most of the other intervals about the correct spread. Without these overlaps, we can't discuss spreads in the media, since the data is from an entirely corrupt polling system! The direction of unbiasing the data is also obvious.

To start with, the only correct expected value for the spread has to be reduced, since that is the direction of asymmetric bias. The largest polling spreads have become too extreme and must be brought in, and this must be combined with larger margins of error. The result of this combination is a correct spread that is lower, at about 4-5%, and a margin of error that is roughly double what's been advertised (5-6%). This implies that Donald Trump's chances of winning are nearly twice what the mainstream media has been floating around.

How do we arrive at a doubling of the margin of error, and what are its implications for where the expected spread should be? The likelihood of getting a result where 2/3 of the polling spreads are inside the margin of error interval and yet most don't fall outside of the interval is only about 1/4 or so (the other possible outcomes are that all or most spreads fall within all margin of error intervals, or that no spreads overlap at all). In order to rebalance that likelihood back to a majority, we need wider margins, so that the outcome we expect to see works out to be the maximum likelihood outcome.
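A simple way to see the effect of widening the margins is to ask what share of polls would be expected to land within a given margin of the central spread. The sketch below is a normal approximation, not the likelihood calculation described above: it assumes poll spreads scatter normally with the 5% standard deviation quoted earlier, and the function names are hypothetical.

```python
from statistics import NormalDist


def share_within(moe, observed_sd):
    """Expected share of polls within +/- moe of the central spread,
    assuming spreads are normal with standard deviation observed_sd."""
    z = moe / observed_sd
    standard_normal = NormalDist()
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


def moe_for_share(target_share, observed_sd):
    """Margin needed so that target_share of polls fall inside it,
    under the same normality assumption."""
    z = NormalDist().inv_cdf(0.5 + target_share / 2.0)
    return z * observed_sd


if __name__ == "__main__":
    print("advertised 3% margin:", round(share_within(3.0, 5.0), 2))
    print("doubled 6% margin:", round(share_within(6.0, 5.0), 2))
    print("margin for 95% coverage:", round(moe_for_share(0.95, 5.0), 1))
```

Under these assumptions, a 3% margin captures only about 45% of polls, a minority, while a 6% margin captures roughly 77%. That is consistent with the direction of the argument: the margin has to be about doubled before most polls sit inside it.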

Note that these topics were discussed in a recently viral article that last weekend was at the top of ZeroHedge and reddit, and that amassed 1 million reads and thousands of shares. For perspective on that number, it's equal to the print subscription/circulation of my cherished The New York Times (and a typical media article attains only several thousand reads). And we should note that a day after this article of ours, which noted the probability pricing arbitrage on gambling bets that Mr. Trump's spread would tighten, the largest bet ever was wagered for him.

Also, the effect of wider margins is that the probability of Donald leading in the actual election doubles from the 10-15% or so that the current pollsters show (a level from which he has not recently deteriorated). Hence we arrive at an actual probability for him that must be greater than 20% or so. In short: larger uncertainty, given the undecideds for this candidate, and a narrower spread. This is what we have been saying all along.
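The link between a narrower spread, wider uncertainty, and a higher chance for the trailing candidate can be illustrated with a one-line normal model. This is only a sketch under a loud assumption: the uncertainty figures are treated as standard deviations of the final spread (not 95% half-widths), and the inputs of 5 and 5.5 points are illustrative readings of the numbers in this article, not values the article derives this way. The function name is hypothetical.

```python
from statistics import NormalDist


def prob_trailing_candidate_leads(expected_spread, spread_sd):
    """Probability that the final spread is below zero, i.e. the trailing
    candidate ends up ahead, if the spread is normal with the given mean
    and standard deviation (both in percentage points)."""
    return NormalDist(mu=expected_spread, sigma=spread_sd).cdf(0.0)


if __name__ == "__main__":
    # Assumption: 6-point spread with the 5-point scatter seen across polls.
    print(round(prob_trailing_candidate_leads(6.0, 5.0), 3))
    # Assumption: 4.5-point spread (middle of 4-5%) with 5.5 points of
    # uncertainty (middle of 5-6%), treated here as a standard deviation.
    print(round(prob_trailing_candidate_leads(4.5, 5.5), 3))
```

With those inputs the first case gives a probability of about 0.12, inside the 10-15% range the pollsters show, and the second gives about 0.21. The sketch shows how lowering the expected spread and raising the uncertainty together push the probability up toward the 20% range.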

The last topic here is that the higher Hillary spreads are coming from pollsters that have the higher margins of error, though we also showed above that those error intervals are still not wide enough. It should be plain that between the highest spreads and the lowest spreads, the highest ones (those over 9% or so) should be the ones treated with the greatest reservation. They were completed by the same shamelessly ignorant and flawed pollsters who gave you #NeverTrump and the Brexit stay prediction, both not so long ago.

At this point it makes sense for Americans to ignore the capricious polls, and simply vote their conscience on Election Day. The numbers in the polls don't add up to the significance the polling conclusions convey. Both candidates have their strengths, and Americans are torn. The video leak for Donald Trump was regrettable for all Americans, especially in these final weeks. But it will not drive his support to zero. Hillary Clinton for her part has not shown herself to be that much more of a transparent and flawless candidate (a true Scorpion).


Thursday, October 20, 2016
