
Thursday, June 13, 2013

Correlated trivariate distributions, and outliers

How does one analyze aberrant patterns across numerous variables?  Say one is looking at a hypothesis of a combined positive relationship among tree heights, sun exposure, and rainfall.  Or of a combined negative relationship in the policy-position dilution of governments pursuing the Impossible trinity of a fixed currency rate, free flow of international capital, and sovereign monetary policy.

To pursue this analysis, the enclosed video explores the trivariate standard normal distribution, with data that is color-separated based on Mahalanobis distance.  A random sample of 100 data points is simulated for each correlation shown, as we move from a perfect correlation of 1 down toward a perfect negative correlation of -1 (all pairwise correlations cannot simultaneously approach -1 once there are three or more variables: the correlation matrix must remain positive semi-definite, which for three variables caps a common pairwise correlation at no less than -1/2).  Notice that the black value shown atop each data point is its z-axis height, whereas the bubble size is scaled to the absolute value of that height.  For example, if a particular {x,y,z} value is {-1,-2,-3}, then this would show a black -3 label within a bubble of size 3.
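
As a rough illustration of this simulation step, here is a minimal Python sketch (not the code behind the video; the equicorrelation structure and all names here are assumptions for illustration):

    import numpy as np

    def simulate_trivariate(r, n=100, seed=42):
        """Draw n points from a trivariate standard normal whose three
        pairwise correlations all equal r."""
        # Equicorrelation matrix: unit variances, common off-diagonal r.
        cov = np.full((3, 3), r)
        np.fill_diagonal(cov, 1.0)
        # For three variables this matrix is positive semi-definite only
        # when r >= -1/2, which is why all pairs cannot approach -1 together
        # (small tolerance for floating-point noise at the boundary).
        if np.min(np.linalg.eigvalsh(cov)) < -1e-9:
            raise ValueError(f"r={r} gives an invalid correlation matrix")
        rng = np.random.default_rng(seed)
        return rng.multivariate_normal(np.zeros(3), cov, size=n)

    data = simulate_trivariate(r=0.7)  # 100 rows of {x, y, z} values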

The green-tiered data, beyond a distance of 3.3 as shown in the legend, reflects the third of the sample that deviates most from the cluster of correlated values.
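
One hedged way to reproduce that tiering, continuing the sketch above (the one-third split is taken from the legend description; using a sample quantile as the cutoff is an assumption):

    # Mahalanobis distance of each simulated point from the origin,
    # using the sample covariance of the data.
    cov = np.cov(data, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    d = np.sqrt(np.einsum('ij,jk,ik->i', data, inv_cov, data))

    # Green tier: the most deviant third of the sample.
    cutoff = np.quantile(d, 2 / 3)
    is_green = d > cutoff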


While the statistical build-up for a trivariate correlation requires a more complicated covariance formula, we show the simpler bivariate version below for your edification.  The one shown derives the correlation within a pair of simulated random variable vectors: X and Y.

Correlation(X,Y)
= r
= Covariance(X,Y)/(StDevX * StDevY)

The denominator drops out because our X and Y variables, despite being correlated, are each drawn from a standard normal distribution, with unit variance and unit standard deviation.  We continue now to solve for the covariance:

Covariance(X,Y)
= Average[(X - AverageX)(Y - AverageY)]
= Average[XY - AverageX*Y - X*AverageY + AverageX*AverageY]
= Average[XY] - AverageX*AverageY - AverageX*AverageY + AverageX*AverageY
= Average[XY] - AverageX*AverageY
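
The two forms agree exactly even as sample averages, as a quick check (a standalone sketch; the sample size and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(100_000)
    y = rng.standard_normal(100_000)

    expanded = np.mean((x - x.mean()) * (y - y.mean()))
    simplified = np.mean(x * y) - x.mean() * y.mean()
    print(np.isclose(expanded, simplified))  # True: the expansion holds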

Now the average of a standard normal variable (e.g., X, Y, or any other) is zero.  So the latter part of this equation also zeros out, and we are left with the following.  We introduce a new variable vector X' in order to decompose the simulated variable Y as Y = rX + X'(1-r^2)^0.5.  X' is identically distributed to, but independent of, X.  Recall that independence implies zero correlation, even though the reverse is not true.

Average(XY)
= Average[X(rX + X'(1-r^2)^0.5)]
= Average[rX^2 + XX'(1-r^2)^0.5]
= Average[rX^2] + Average[XX'(1-r^2)^0.5]

We can pull the correlation expressions, r and (1-r^2)^0.5, out of the two averages summed in the final line above, given that they are constants.  Also, given the independence of the identically distributed normal random variables X and X', the expected value of XX' factors into AverageX*AverageX' and is therefore zero.  And now we continue with what remains:
r*Average[X^2]
= r*[VarianceX + (AverageX)^2]
= r*[1 + 0^2]
= r
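
As a numerical sanity check of the whole derivation (a sketch with an arbitrary r; the construction of Y follows the decomposition above):

    import numpy as np

    rng = np.random.default_rng(0)
    r, n = 0.6, 1_000_000
    x = rng.standard_normal(n)          # X
    x_prime = rng.standard_normal(n)    # X': same distribution, independent of X
    y = r * x + np.sqrt(1 - r**2) * x_prime

    # Average(XY) should approach r, as should the sample correlation.
    print(np.mean(x * y))               # close to 0.6
    print(np.corrcoef(x, y)[0, 1])      # close to 0.6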
