
Tuesday, July 4, 2017

Fat tail Zillow errors

In our previous article we discussed the size of the median absolute error in Zillow home price estimates (Z-ests) and why it is significantly larger than the error in estimates provided by human realtors.  A question arises as to why the median absolute error is the statistic that Zillow chooses to provide, since this value merely represents the error range that the middle ½ of the customer base would see.  A full ½ of customers would see errors larger than this median, without any stated bound.  Presumably, if Z-est errors were normally distributed (which they are not), the median error could be mapped to the typical error that, say, the outer decile of users would see: the 1 in 10 users whose Z-est was either extremely high or extremely low.  In this article, we’ll show that the Zillow price estimates have a challenging fat-tail distribution, with more extreme errors than a normal distribution would imply, and that these errors are exceptionally high in some counties (well beyond the feared 20% relative error level between the actual price and the Z-est).
The cumulative distribution function for the general (non-standard) normal distribution is $\frac{1}{2}\left[1+\operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$, from which we can derive the standard error σ by setting the expression equal to ¾ (with μ = 0) and x equal to the median absolute error.  We know the median absolute error (x) has been given to us as roughly 6%, and so σ comes to nearly 9%.  For the middle part of the distribution, Zillow states that their final national estimates are within 5% of the sales price 46% of the time.  Using these parameters, we would get a σ of nearly 8% (a σ of 9% would put fewer than 46% of estimates within 5%, so the observed peakedness in the middle of the distribution corresponds to fat tails at the ends).  In other words, the other 54% of the home price estimates (100%-46%) are slightly fatter-tailed than the middle part of the distribution, which brings the overall σ up from 8% to 9%.  The σ is the root of the typical squared error, and hence is always somewhat higher than the median error Zillow quotes.  The squared error weights the extreme errors disproportionately more.
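The two σ values above can be checked with a short sketch. It assumes the errors are centred on zero (μ = 0) and are exactly normal, which is the very assumption this article goes on to question; the helper name sigma_from_central_mass is ours, not Zillow's.

```python
from statistics import NormalDist

def sigma_from_central_mass(half_width, mass):
    """Sigma of a zero-mean normal for which P(|X| <= half_width) = mass.

    Assumption: errors are symmetric around zero, so the central mass
    `mass` leaves (1 - mass) / 2 in each tail.
    """
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)  # standard normal quantile
    return half_width / z

# Median absolute error of roughly 6%: half of all errors lie within +/-6%.
sigma_from_median = sigma_from_central_mass(6.0, 0.50)

# Zillow's statement: 46% of estimates lie within +/-5% of the sales price.
sigma_from_within5 = sigma_from_central_mass(5.0, 0.46)

print(f"sigma implied by the 6% median error:  {sigma_from_median:.1f}%")
print(f"sigma implied by 46% within 5%:        {sigma_from_within5:.1f}%")
```

The first figure comes out just under 9% and the second just over 8%, matching the rounded values quoted above.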
Now none of this may be useful in considering the errors for the 54% of Z-ests that are not within 5% of the sales price.  Let’s consider the other “extreme”: the 15% of estimates that are off by more than 20%.  Zillow doesn’t tell you how bad those errors are, just that 85% (100%-15%) of their estimates are within a palatable 20%.  If Zillow home price errors followed a normal distribution, then the σ of 8% from the middle of the distribution would place the 20% error level at 2.5 standard deviations (20%/8%).  For a normal distribution, only about 1% of all home estimates would fall beyond that.  Yet empirically we see that 15% of Zillow estimates are in this end of the tail (15 times as often!).  See how this maps out on a state-by-state basis below.
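As a rough check on the tail comparison, the sketch below computes the two-sided normal probability of an error beyond 20% when σ = 8%. The variable names are illustrative, and the normal model is again only the benchmark being tested. Unrounded, the normal tail is a little over 1%, so the observed 15% is about 12 times the normal figure, or 15 times if the normal tail is rounded down to 1%.

```python
from statistics import NormalDist

sigma = 8.0        # percent, from the middle of the distribution
threshold = 20.0   # percent, the "feared" relative error level

# Number of standard deviations the threshold sits from zero.
z = threshold / sigma

# Two-sided tail: estimates that are too high OR too low by more than 20%.
normal_tail = 2.0 * (1.0 - NormalDist().cdf(z))

observed_tail = 0.15  # share of Z-ests outside +/-20%, per Zillow
ratio = observed_tail / normal_tail

print(f"z = {z:.2f} standard deviations")
print(f"normal tail beyond 20%: {normal_tail:.2%}")
print(f"observed / normal:      {ratio:.0f}x")
```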
 
 
Clearly using a normal distribution (and the associated conditional Value-at-Risk analysis) to understand the extreme price errors within Zillow would give a falsely benign understanding of the problem.  Using generalized Pareto models, we explore all the U.S. county data and see what we learn about how fat the tail errors are across the country.  For example, we solve for the 95th percentile error in each of the counties using the “shapeless” distribution function $F(x) = 1 - e^{-x/\beta}$, where β can be seen as a proxy for the σ scaling parameter.  We also know this is from a fat-tail distribution family, where $f(x) \sim x^{-(1+\alpha)}$ for α > 0 as $x \to \infty$, which we can see below.
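$F(x) = 1 - e^{-x/\beta}$
$f(x) = e^{-x/\beta}/\beta$
equivalent ratio: $f(x) / x^{-(1+\alpha)} = \left(e^{-x/\beta}/\beta\right) / x^{-(1+\alpha)}$
as $x \to \infty$ the ratio behaves like $e^{-x/\beta}\, x^{1+\alpha}$
which $\to 0$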
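To make the tail model concrete, here is a minimal sketch of the quantities it yields: a tail percentile, the average error for a consumer already in the tail, and the density ratio from the lines above. The value β = 10 (percent) and the exponent α = 1 are hypothetical choices for illustration only; the article does not report fitted values, and the function names are ours.

```python
import math

def exp_quantile(p, beta):
    """Error level x_p solving F(x_p) = 1 - exp(-x_p / beta) = p."""
    return -beta * math.log(1.0 - p)

def exp_tail_mean(p, beta):
    """Average error given the error exceeds the p-quantile.

    The exponential is memoryless, so E[X | X > x_p] = x_p + beta.
    """
    return exp_quantile(p, beta) + beta

def density_ratio(x, beta, alpha):
    """f(x) / x^-(1 + alpha) with f(x) = exp(-x / beta) / beta."""
    return (math.exp(-x / beta) / beta) / x ** (-(1.0 + alpha))

beta = 10.0   # hypothetical scale, in percent (not a fitted value)
alpha = 1.0   # hypothetical power-law exponent for the comparison

q95 = exp_quantile(0.95, beta)          # 95th percentile error
tail_mean_90 = exp_tail_mean(0.90, beta)  # mean error in the outer 10%
ratios = [density_ratio(x, beta, alpha) for x in (100.0, 200.0, 400.0)]

print(f"95th percentile error:       {q95:.1f}%")
print(f"mean error in the outer 10%: {tail_mean_90:.1f}%")
print("density ratio at x = 100, 200, 400:", ratios)
```

The ratio shrinks toward zero as x grows, in line with the limit written out above. Under this model a larger β scales every tail figure up proportionally, which is one way to read the spread between counties.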
 
So, with our tail error model we have a rich understanding of the moments of the Zillow error eqμ(Sqσj) for all j → (0, ∞).  And importantly, we can solve for the tail error given the thousands of county error distributions Zillow provides.  It is shown below, and through the Pareto function above it equates to an average error (when a consumer is in the outer 10% of consumers) that is as little as 16% and as high as 80%!  The median among all the counties is 36% (an impracticable 6 times the median absolute error Zillow promotes!).
 
 
Last, we should note that the realty list price estimate has a median error of nearly 5%, and this squares with Zillow’s estimate that nearly ½ of listing estimates are within 5% of the final sales price.  But realtors have over 96% of their listing estimates within 20% of the sales price, versus 93% for Zillow.  So, there is some fat tail among realtors, but half as much as with Zillow (99%-96%=3%, versus 99%-93%=6%).  That’s a welcome respite for the 1 in 10 users who would otherwise suffer a rough Z-est error in the home buying experience.

5 comments:

  1. Hi,

    You say "...average error (when a consumer is in the outer 10% of consumers) that is as little as 16%, and as high as 80%!" Could you clarify what "outer 10% of consumers" means? Thanks. Great article.

  2. Thanks Farmer George, please look at the graphic provided in this case. In each county we look at the distribution of absolute errors, and mark the cut-off for the top 10%. Then the graphic shows the distribution of these cut-offs among all the counties. We see the low-end of this graphic is 16%, and the high-end is 80%. Believe you have a second question commenting on EVT, but don't see it here anymore; think it went on the tab "About Salil"

    Replies
    1. Thanks for that, appreciated. Can I ask why 10% cut-off, when we have 15% who are outside the +/-20% range?

      Also, are these misprints? "f(¥) = e-¥/¥-(1+α)" and
      "f(¥) à 0" and "eqμ(Sqσj) for all j à (0, ¥)". Thanks.

    2. I also meant to ask, how can one have a low-end average error of 18%, when the 10% outer consumers are already outside the +/-20% range and so will have at least an error of +/-20%, or have I missed the point? Thanks.

    3. Thanks much again Farmer George for the observations (also, the font translations in this particular article are unfortunately due to your computer: on three different PC and Mac computers, and a smartphone, on my end they show up fine; will consider the feedback for the future). As for the other comment, the first distribution is the distribution of errors within each county. We then collect the tail error in each case, and the second distribution is of those tail errors themselves. The chart clearly indicates that most of this latter distribution is beyond the 20% level used elsewhere in these articles, though technically the actual data comes in within 18% on the low end.
