
Monday, May 26, 2014

Serial correlation techniques


Short-term update: this note is now cited in Wikipedia.

This note shows the relationship between two of the methods, among a few, for obtaining a measure of autocorrelation (serial correlation): the covariability of sequential changes in data, above or below its overall average.  For example, the following short series of directions offer a non-parametric sense of essentially positive, zero (i.e., sequential changes are random), and negative autocorrelation, respectively (scored in the brief sketch after the examples):

- - - - - + + + + +
+ - + + - - + + - -
+ - + - + - + - + -
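
As a quick illustration, here is a minimal sketch in Python with NumPy (the helper name is my own, for illustration) that scores each sequence by the fraction of adjacent pairs sharing the same direction: near 1 suggests positive, near 0.5 random, and near 0 negative autocorrelation.

import numpy as np

def same_direction_fraction(signs):
    """Fraction of adjacent pairs that share the same sign (illustrative helper)."""
    s = np.asarray(signs)
    return np.mean(s[:-1] == s[1:])

pos   = [-1]*5 + [+1]*5                            # - - - - - + + + + +
mixed = [+1, -1, +1, +1, -1, -1, +1, +1, -1, -1]   # + - + + - - + + - -
neg   = [+1, -1]*5                                 # + - + - + - + - + -

for name, seq in [("positive", pos), ("near-zero", mixed), ("negative", neg)]:
    print(name, same_direction_fraction(seq))
# roughly 0.89, 0.44, and 0.0 for these ten-element examples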


A future article this summer will study the appearance of autocorrelation in an important, real-world data set (though it exists in many, such as daily weather, stock prices in recent years, and one's height during life).  Students of these empirical data sets should know that autocorrelation makes long-run forecasting more difficult, while it makes short-run forecasting, given an initial direction, easier.  For the current note, we will focus primarily on mathematically reconciling two approaches to computing a measure of autocorrelation.  One might think it would be as easy as taking the correlation between the last n-1 data from a sample of n and the first n-1 data.  So let's see what this looks like, assuming a substantial n coupled with a well-behaved (e.g., normally distributed) random variable.
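
To make this naive approach concrete, here is a minimal sketch, assuming a simulated random walk (the seed and sample size are arbitrary), that computes the lag-1 correlation exactly as just described: the last n-1 data against the first n-1 data.

import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, for reproducibility
x = np.cumsum(rng.normal(size=10_000))   # a random walk: strongly positively autocorrelated

# Pearson correlation between the first n-1 and the last n-1 observations
r = np.corrcoef(x[:-1], x[1:])[0, 1]
print(r)   # close to 1 for a random walk; near 0 if x were pure white noise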

For equation-writing ease, we will label the variables as follows:

X     = first n-1 data of n (in a time series this can be associated w/ time t-1)
Y     = last n-1 data of n (in a time series this can be associated w/ time t)
m_X   = average of random variable X
m_Y   = average of random variable Y
X'    = X - m_X (we will think of X' as a residual)
Y'    = Y - m_Y (we will think of Y' as a residual)


Over a lengthy time series, the univariate characteristics of X and Y are assumed equal, and this will help with the reconciliation of the autocorrelation formulae below.  The Pearson correlation (r) formula is shown here (where s substitutes for σ due to the web log software):

r_(X,Y)    = Cov(X,Y)/(s_X s_Y)
           ~ E[(X - m_X)(Y - m_Y)]/s_Y^2
           ~ [E(XY) - m_X m_Y]/s_Y^2
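
These three forms can be checked numerically.  Below is a sketch under the note's own assumption that X and Y share the same univariate characteristics (the 0.6/0.8 coefficients and the seed are arbitrary, chosen so that s_X ~ s_Y):

import numpy as np

rng = np.random.default_rng(1)                      # arbitrary seed
n = 100_000
x = rng.normal(size=n)
y = 0.6 * x + 0.8 * rng.normal(size=n)              # unit variance, like x, so s_X ~ s_Y

m_x, m_y = x.mean(), y.mean()
s_y2 = y.var()

r_pearson = np.corrcoef(x, y)[0, 1]                 # Cov(X,Y)/(s_X s_Y)
r_resid   = np.mean((x - m_x) * (y - m_y)) / s_y2   # E[(X - m_X)(Y - m_Y)]/s_Y^2
r_prod    = (np.mean(x * y) - m_x * m_y) / s_y2     # [E(XY) - m_X m_Y]/s_Y^2
print(r_pearson, r_resid, r_prod)                   # all three agree (~0.6) when s_X ~ s_Y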


Let's explore theoretical cases of r_(X,Y) in three limiting scenarios: perfect positive, zero, and perfect negative serial correlation, respectively.  Note that these are idealized scenarios only; in practice we rarely observe any of them exactly.

Where X = Y,
[E(XY) - m_X m_Y]/s_Y^2    = [E(Y^2) - m_Y^2]/[E(Y^2) - m_Y^2]     = 1

Where X is independent of Y,
[E(XY) - m_X m_Y]/s_Y^2    = [m_X m_Y - m_X m_Y]/s_Y^2             = 0

Where X = -Y,
[E(XY) - m_X m_Y]/s_Y^2    = [-E(Y^2) + m_Y^2]/[E(Y^2) - m_Y^2]    = -1
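
These three limiting values can also be verified by simulation; a brief sketch (seed and sample size arbitrary):

import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed
y = rng.normal(size=100_000)

def r_formula(x, y):
    # [E(XY) - m_X m_Y]/s_Y^2, per the cases above
    return (np.mean(x * y) - x.mean() * y.mean()) / y.var()

print(r_formula(y, y))                          # X = Y          -> ~ 1
print(r_formula(rng.normal(size=100_000), y))   # X independent  -> ~ 0
print(r_formula(-y, y))                         # X = -Y         -> ~ -1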


Now recall that the Durbin-Watson (d) statistic looks at residuals, so let's see how r_(X,Y) would look in a similar limiting set-up.  Then we'll reconcile this set-up with d itself.

Cov(X',Y')/(s_X' s_Y')    ~ E(X'Y')/s_Y'^2

Where X' = Y',
Cov(X',Y')/(s_X' s_Y')    ~ [E(Y'^2) - m_Y'^2]/[E(Y'^2) - m_Y'^2]    = 1

Where X' is independent of Y',
Cov(X',Y')/(s_X' s_Y')    ~ E(X'Y')/s_Y'^2                           = 0

Where X' = -Y',
Cov(X',Y')/(s_X' s_Y')    ~ [-E(Y'^2) + m_Y'^2]/[E(Y'^2) - m_Y'^2]   = -1


What we learn from the formulas above is that we can essentially collapse the r_(X,Y) formula to this expression using residuals: E(X'Y')/s_Y'^2.  However, the theoretical d statistic inversely maps serial correlation's range of -1 to 1 onto a range of 4 to 0.  We can see this if we multiply the E(X'Y')/s_Y'^2 expression by -2, which maps the range to 2 through -2.  We then add 2, and we get the desired mapped range of 4 to 0:

-2[E(X'Y')/s_Y'^2] + 2    = -2E(X'Y')/[E(Y'^2) - m_Y'^2] + 2
                          = -2E(X'Y')/E(Y'^2) + 2                        (residuals have m_Y' = 0)
                          = [-2E(X'Y') + 2E(Y'^2)]/E(Y'^2)
                          = [-2E(X'Y') + E(Y'^2) + E(X'^2)]/E(Y'^2)      (since E(X'^2) ~ E(Y'^2))
                          = E[(Y' - X')^2]/E(Y'^2)
                          ~ Σ(Y' - X')^2 / ΣY'^2
                          ~ d
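
The algebra above can be checked with a short simulation.  The sketch below (residuals are simulated white noise; the seed is arbitrary) computes d directly from the sum-of-squares form and compares it to -2r + 2:

import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
e = rng.normal(size=10_000)      # simulated regression residuals

x_res, y_res = e[:-1], e[1:]     # X' and Y' in the note's notation
d = np.sum((y_res - x_res) ** 2) / np.sum(y_res ** 2)   # Durbin-Watson form
r = np.mean(x_res * y_res) / y_res.var()                # E(X'Y')/s_Y'^2
print(d, -2 * r + 2)             # the two agree closely: d ~ 2 - 2r, near 2 for white noise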


So while Pearson and Durbin-Watson look very different, what we have just shown is that we can easily, and linearly, map back and forth between these two mathematical techniques for serial correlation!  The method one would employ in practice depends slightly on how one's data is presented, how one wants to describe the autocorrelation, what analytical software formula libraries are available, and what level of down-stream analysis may also need to be performed.

The application of correlation also has interesting connections to least-squares regression properties, which are used for factor analysis in many fields.  For example, it should be noted for econometricians generally that the interpretation of market risk beta (β) has its roots in r_(X,M), where we can think of M as the concurrent market variable.  The square of r_(X,M) is Cov(X,M)^2/(s_X s_M)^2, which is bounded in a proportional range between 0% and 100%.  We can follow the squared formula below to see how these components are connected with one another:

Cov(X,M)^2/(s_X s_M)^2    = Cov(X,M)^2/(s_X^2 s_M^2)
                          = [Cov(X,M)/s_M^2] * [Cov(X,M)/s_X^2]
                          = β * Cov(X,M)/s_X^2
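
A quick numerical check of this decomposition, with a simulated market variable M and a hypothetical asset series X (the coefficients and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(4)             # arbitrary seed
n = 100_000
m = rng.normal(size=n)                     # market variable M
x = 0.8 * m + 0.5 * rng.normal(size=n)     # hypothetical asset series X

cov_xm = np.cov(x, m, bias=True)[0, 1]
beta   = cov_xm / m.var()                  # market-risk beta: Cov(X,M)/s_M^2
r2     = cov_xm ** 2 / (x.var() * m.var()) # r_(X,M)^2
print(r2, beta * cov_xm / x.var())         # the two match: r^2 = β * Cov(X,M)/s_X^2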


As a reminder, other mathematical formula descriptions can be easily found using the search bar (on the top of this web log).  
