Quote:
Originally Posted by Bonecrusher Smith
why do we divide by the d.o.f. in the sample standard deviation
This is actually two questions. The first is: why do we divide by n - 1 in the sample variance? The second is: why is n - 1 the number of degrees of freedom for the residuals, X_i - X¯? I do not believe these questions have much to do with one another. I will address them here, but if your background in probability theory is weak, the answers will be difficult to understand.
Strictly speaking, variance is a property of a random variable, not of a data set. A random variable is a function from a probability space to the reals. When we speak of a data set as a "population," we usually mean that the data set itself is the sample space, and the probability measure on the sample space is typically taken to be the uniform measure. The identity function on this probability space is then a random variable which represents a randomly chosen element from this "population." If we calculate the variance of this random variable, then we must divide by n, because that is the definition of variance. This is what is meant by the "population variance."
When we speak of a data set as a "sample," we usually mean that we are modeling this data set as a particular realization of a finite sequence of independent random variables, each one described as above. In this case, when we talk about the sample "variance," it is not really a variance at all. It is an estimate of the population variance. If we divide by n, we get one possible estimate. If we divide by n - 1, we get another possible estimate. We often choose to divide by n - 1, because then the resulting estimate is "unbiased."
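The unbiasedness claim can be checked numerically. Here is a small simulation (my own illustration, not part of the original post, using NumPy) that repeatedly draws small samples from a distribution whose true variance is known, and averages the divide-by-n and divide-by-(n - 1) estimates. The divide-by-n estimate systematically undershoots the true variance; the divide-by-(n - 1) estimate centers on it.

```python
import numpy as np

# Compare the divide-by-n and divide-by-(n-1) variance estimates.
# We draw many samples of size n from N(0, 2^2), so true_var = 4.0.
rng = np.random.default_rng(0)
true_var = 4.0
n, trials = 5, 200_000  # small n makes the bias easy to see

samples = rng.normal(0.0, 2.0, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)  # sum of squared residuals

biased = (ss / n).mean()          # averages near true_var * (n-1)/n = 3.2
unbiased = (ss / (n - 1)).mean()  # averages near true_var = 4.0

print(biased, unbiased)
```

The factor (n - 1)/n in the biased estimate's expectation is exactly what dividing by n - 1 instead of n corrects for.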
Now consider degrees of freedom. Suppose I have a sample of size n. Then my data set is a particular realization of n independent random variables, X_1, ..., X_n. I then define the sample mean to be
X¯ = (X_1 + ... + X_n)/n.
Note that this is also a random variable. The residuals are X_i - X¯, which are also random variables. Random variables are functions, and real-valued functions form a vector space, so we can talk about the dimension of the space spanned by these residuals, which is at most n, since there are n of them.
These residuals are linearly dependent, since they add up to the 0 random variable. So they span a space of dimension at most n - 1. In most cases, in fact, they span a space of dimension exactly n - 1, and this is what we mean when we say there are n - 1 degrees of freedom.
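Both facts are easy to see numerically. The sketch below (my own illustration, not part of the original post) generates many realizations of X_1, ..., X_n, forms the residuals for each, and checks that (a) every residual vector sums to zero, and (b) the residual vectors collectively span a space of dimension n - 1, not n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
draws = rng.normal(size=(100, n))  # 100 realizations of X_1, ..., X_n

# Residuals: subtract each row's sample mean from that row.
residuals = draws - draws.mean(axis=1, keepdims=True)

sums_to_zero = np.allclose(residuals.sum(axis=1), 0.0)  # True
rank = np.linalg.matrix_rank(residuals)                 # n - 1 = 5

print(sums_to_zero, rank)
```

The single linear constraint (the residuals add up to 0) removes exactly one dimension, which is the "n - 1 degrees of freedom."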
Quote:
Originally Posted by Bonecrusher Smith
I understand that d.o.f. is the number of independent quantities that can vary. The argument in most textbooks seems to be that since the sample mean is known, only n-1 of the quantities can vary, so we divide the sum of the squared deviations by the d.o.f.
This is a (hand-wavy) explanation for why there are n - 1 degrees of freedom. But it does not explain why we should divide by n - 1 in the sample variance.
Quote:
Originally Posted by Bonecrusher Smith
It seems to me that this explanation would also apply to the population. If we knew n-1 values, and the population mean, shouldn't we be dividing by the d.o.f. too?
As mentioned above, when we speak of a data set as a "population," we are thinking of the data set as being the sample space itself. It is then just a collection of numbers, and not a sequence of random variables. The real line has dimension 1, so the linear span of any set of real numbers has dimension either 0 or 1. Strictly speaking, it is therefore not correct to say that a population of size n has n - 1 degrees of freedom.
Quote:
Originally Posted by Bonecrusher Smith
As a sub question, why do we need to divide by the d.o.f. at all? It is my understanding that in calculating the variance, we are getting an average squared deviation. To get an average don't we need to divide by n?
Correct. By the definition of variance, we must divide by n. As mentioned above, when we calculate the so-called sample "variance," we are not actually calculating a variance. We are calculating an estimate of the population variance.
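The distinction shows up directly in NumPy's `np.var`, whose `ddof` ("delta degrees of freedom") parameter selects the divisor n - ddof. This short example (mine, not part of the original post) computes both quantities on the same data:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean is 5.0
n = len(data)

pop_var = np.var(data, ddof=0)     # divide by n: the population variance
sample_var = np.var(data, ddof=1)  # divide by n - 1: the unbiased estimate

print(pop_var, sample_var)  # 4.0 and 32/7 ≈ 4.5714
```

Treating the data as the whole population, its variance is pop_var; treating the same numbers as a sample from a larger population, sample_var is the usual unbiased estimate of that population's variance.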