On Friday, in the statistics course, we started the section on *confidence intervals*, and as always, I got a bit confused about the degrees of freedom of the Student distribution (should it be *n*-1 or *n*?) and about which empirical variance to use (the one where we divide by *n* or the one with *n*-1?).

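And each time I start to get confused, the students obviously see it and start asking tricky questions… So let us make it clear now. The *correct* formula is the following: let

$$\widehat{\sigma}^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline{X}\right)^2,$$

then

$$\left[\overline{X}-t_{n-1}\left(1-\frac{\alpha}{2}\right)\frac{\widehat{\sigma}}{\sqrt{n}},\ \overline{X}+t_{n-1}\left(1-\frac{\alpha}{2}\right)\frac{\widehat{\sigma}}{\sqrt{n}}\right],$$

where $t_{n-1}(p)$ is the quantile of level $p$ of the Student *t* distribution with *n*-1 degrees of freedom, is a confidence interval of level $1-\alpha$ for the mean of a Gaussian i.i.d. sample.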
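The interval described here (empirical variance with *n*-1 in the denominator, Student quantile with *n*-1 degrees of freedom) can be sketched in R as follows. The helper name `student_ci` and the seed are assumptions made for the illustration; the result is compared with what base R's `t.test()` returns.

```r
# Hypothetical helper (the name is an assumption): two-sided Student
# interval for the mean, with n-1 in the variance and n-1 degrees of freedom.
student_ci <- function(x, level = 0.90) {
  n <- length(x)
  alpha <- 1 - level
  s <- sd(x)                                  # sd() divides by n-1
  half <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

set.seed(1)
x <- rnorm(20)                                # a truly Gaussian sample
ci <- student_ci(x, level = 0.90)
ci

# Base R computes the same interval
ci_ref <- as.numeric(t.test(x, conf.level = 0.90)$conf.int)
ci_ref
```

The two intervals should coincide up to numerical precision, which is a simple way to settle the *n* versus *n*-1 question.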

But the important thing is neither the *n*-1 that appears as the degrees of freedom nor the *n*-1 that appears in the estimation of the standard error. As always with a mathematical result, the most important part is not mentioned here: the observations have to be i.i.d. and normally distributed. And not “*almost*” normally distributed…

Consider the following case: we have *n*=20 observations that are *almost* normally distributed. Hence, I consider a Student *t* distribution with 3 degrees of freedom (`df=3` in the code below).

An Anderson-Darling normality test accepts the normality assumption in 2 cases out of 3.

With a *true* normal distribution it would be 95% of the cases, so in some sense, I can pretend that I generate *almost* normal samples.
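The acceptance rate of a normality test can be estimated by simulation. The sketch below is only indicative: it uses `shapiro.test()` (Shapiro-Wilk, available in base R) as a stand-in for the Anderson-Darling test, which is not part of base R (the `nortest` package provides `ad.test()`), so the proportions will not match the "2 cases out of 3" figure exactly. The helper name `accept_rate`, the number of replications and the seed are assumptions.

```r
# Acceptance rate of a normality test at the 5% level, by simulation.
# Assumption: Shapiro-Wilk is used here instead of Anderson-Darling.
set.seed(1)
n  <- 20
ns <- 2000

accept_rate <- function(rgen) {
  pvals <- replicate(ns, shapiro.test(rgen(n))$p.value)
  mean(pvals > 0.05)                  # proportion of non-rejections
}

acc_t3   <- accept_rate(function(k) rt(k, df = 3))  # "almost" normal samples
acc_norm <- accept_rate(rnorm)                      # truly normal samples
c(student_t3 = acc_t3, normal = acc_norm)
```

With truly Gaussian samples the acceptance rate should be close to 95%, while with the Student samples it should be visibly lower.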

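For those samples, we can look at the lower bound of the 90% confidence interval for the mean, with three different formulas,

$$\overline{X}-t_{n-1}(0.95)\frac{\widehat{\sigma}}{\sqrt{n}},$$

i.e. the *correct* one, or the one where I considered *n* degrees of freedom instead of *n*-1,

$$\overline{X}-t_{n}(0.95)\frac{\widehat{\sigma}}{\sqrt{n}},$$

and the one where we considered a Gaussian quantile instead of a Student *t* one,

$$\overline{X}-\Phi^{-1}(0.95)\frac{\widehat{\sigma}}{\sqrt{n}},$$

where $t_k$ and $\Phi^{-1}$ are the quantile functions of the Student distribution with $k$ degrees of freedom and of the standard Gaussian distribution, and $\widehat{\sigma}$ is the empirical standard deviation: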

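    n=20
    m=IC1=IC2=IC3=rep(NA,10000)
    for(s in 1:10000){
      X=rt(n,df=3)                              # "almost" normal sample
      m[s]=mean(X)
      sd=sqrt(var(X))                           # var() divides by n-1
      IC1[s]=m[s]-qt(.95,df=n-1)*sd/sqrt(n)     # Student, n-1 degrees of freedom
      IC2[s]=m[s]-qt(.95,df=n)*sd/sqrt(n)       # Student, n degrees of freedom
      IC3[s]=m[s]-qnorm(.95)*sd/sqrt(n)         # Gaussian quantile
    }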

On the graph below are plotted the distributions of the values obtained as the lower bound of the 90% confidence interval,

(the curves with *n* and *n*-1 degrees of freedom in the quantiles are the same, here).
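The reason the two Student curves are indistinguishable is that the quantiles themselves barely differ when *n*=20. A quick check (the vector name `q` and its labels are chosen for the illustration):

```r
n <- 20
q <- c(student_n_minus_1 = qt(0.95, df = n - 1),
       student_n         = qt(0.95, df = n),
       gaussian          = qnorm(0.95))
round(q, 4)
# approximately 1.7291, 1.7247 and 1.6449
```

Going from *n*-1 to *n* degrees of freedom changes the quantile in the third decimal place, while the gap to the Gaussian quantile is about twenty times larger.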

The dotted vertical line is the *true* lower bound of the 90%-confidence interval, given the *true* distribution (which was not a Gaussian one).

If I get back to the standard procedure in any statistical textbook, since the sample is almost Gaussian, the lower bound of the confidence interval should be (since we have a Student *t* distribution)

    > mean(IC1)
    [1] -0.605381

instead of

    > mean(IC3)
    [1] -0.5759391

(obtained with a Gaussian distribution instead of a Student one). Actually, both of them are quite different from the correct one which was

    > quantile(m,.05)
           5% 
    -0.623578
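The benchmark above is simply the empirical 5% quantile of the simulated sample means. As a self-contained sketch (variable names and seed are assumptions), it can be recomputed and set against the Gaussian approximation based on the true variance of a Student *t* distribution with 3 degrees of freedom, which is 3/(3-2)=3. That second comparison is an extra illustration, not something computed in the simulation above.

```r
set.seed(1)
n  <- 20
ns <- 10000

# One simulated sample per row, then the mean of each sample
X <- matrix(rt(ns * n, df = 3), nrow = ns)
m <- rowMeans(X)

q_mc  <- quantile(m, 0.05, names = FALSE)   # Monte Carlo 5% quantile of the mean
q_clt <- qnorm(0.05) * sqrt(3 / n)          # Gaussian approximation, true variance 3
c(monte_carlo = q_mc, gaussian_approx = q_clt)
```

Neither value is available in practice, since both rely on knowing the true distribution; they only serve as reference points for the three estimated bounds.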

As I mentioned in a previous post (here), an important issue is that if we do not know a parameter and substitute an estimator, there is usually a cost (which means usually that the confidence interval should be larger). And this is what we observe here. From a teacher’s point of view, it is an important issue that should be mentioned in statistical courses….

But another important point is also that the confidence interval is valid *only* if the underlying distribution is Gaussian. And not *almost* Gaussian, but really a Gaussian one. So since with *n*=20 observations everything might look Gaussian, I was wondering what should be done in practice… Because in some sense, using a Student quantile based confidence interval on some almost Gaussian sample is as wrong as using a Gaussian quantile based confidence interval on some Gaussian sample…

OpenEdition suggests that you cite this post as follows:

Arthur Charpentier (February 20, 2011). Does the Student based confidence interval have any interest in practice ? *Freakonometrics*. Retrieved November 10, 2024 from https://doi.org/10.58079/ouh1

Thanks! I only found it confusing because I wasn’t sure I understood at first, but it seems like I do now. I’m going to take a closer look at the R code. I think your example will help me have a better understanding of the Goldberger quote. (If you are interested, it was from page 124, Chapter 11.)

RESPONSE: thanks for the reference! I put a copy of the paragraph below.

This is very complicated, and I’m not sure I’m following your original post accurately. But, with regard to the decision to use Student’s t vs. the standard normal table for confidence intervals using small sample sizes, we were taught not to use ‘t’ unless we could be sure the data was exactly normal. The reason being that t is an exact distribution that relies on assumptions of exact normality. My econometrics textbook (Goldberger, A Course in Econometrics) states:

“ There is no good reason to rely routinely on a t-table rather than a normal table unless Y itself is normally distributed” ( Goldberger, 1991).

I’ve made a post on this, but it is centered around asymptotic results and the Slutsky theorems. I guess my understanding hinges on how small n can be for these results to be valid.

http://econometricsense.blogspot.co…

RESPONSE: I guess my post was just meant to stress that the normal assumption is crucial here! I think I should just quote Goldberger, since his statement was exactly the objective of my post, and I wanted to illustrate that point numerically. Sorry for being so confusing…

I had missed the restriction to small samples. I agree that the normality assumption takes on more importance in this case…

I had a free moment so I computed coverage probabilities for your example with n=20. Here is my code and the results:

To get the full confidence interval:

    n=20
    m=rep(NA,10000)
    IC1=IC2=IC3=matrix(NA,10000,2)   # one row per simulation: lower, upper
    for(s in 1:10000){
      X=rt(n,df=3)
      m[s]=mean(X)
      sd=sqrt(var(X))
      IC1[s,]=c(m[s]-qt(.95,df=n-1)*sd/sqrt(n),m[s]+qt(.95,df=n-1)*sd/sqrt(n))
      IC2[s,]=c(m[s]-qt(.95,df=n)*sd/sqrt(n),m[s]+qt(.95,df=n)*sd/sqrt(n))
      IC3[s,]=c(m[s]-qnorm(.95)*sd/sqrt(n),m[s]+qnorm(.95)*sd/sqrt(n))
    }

To compute the coverage probabilities:

    cov1=cov2=cov3=rep(NA,10000)
    for(i in 1:10000) {
      # 0 is the true mean: is it inside the interval?
      if(IC1[i,1]<0 && IC1[i,2]>0) {cov1[i] = 1} else {cov1[i] = 0}
      if(IC2[i,1]<0 && IC2[i,2]>0) {cov2[i] = 1} else {cov2[i] = 0}
      if(IC3[i,1]<0 && IC3[i,2]>0) {cov3[i] = 1} else {cov3[i] = 0}
    }

And finally, the results:

    > mean(cov1)
    [1] 0.9015
    > mean(cov2)
    [1] 0.9002
    > mean(cov3)
    [1] 0.8837

The 2 Student confidence intervals are rather close to the nominal 90% coverage probability, while the Normal interval does slightly worse. I expect the results would be significantly worse if the true distribution was severely skewed, or the departure from normality was much more extreme than the “almost normal” Student used here.

RESPONSE: I got exactly the same output, and I was also surprised… I also tried the case of a mixture, so that we accept the Gaussian assumption in 3 cases out of 4, and again, the Student confidence interval performs extremely well…
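The coverage computation discussed in this exchange can also be written in vectorized form, which makes it easy to swap the underlying distribution. The sketch below is only an illustration: the helper name `coverage`, the seed, and the choice of an exponential distribution as the skewed example are assumptions, not something taken from the comments.

```r
# Hypothetical helper: Monte Carlo coverage of the two-sided Student interval.
# rgen(k) must return k i.i.d. draws from a distribution with mean mu.
coverage <- function(rgen, mu, n = 20, ns = 10000, level = 0.90) {
  X <- matrix(rgen(ns * n), nrow = ns)        # one sample per row
  m <- rowMeans(X)
  s <- apply(X, 1, sd)                        # sd() divides by n-1
  half <- qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)
  mean(m - half < mu & mu < m + half)         # proportion of intervals containing mu
}

set.seed(1)
cov_norm <- coverage(rnorm, mu = 0)                        # exactly Gaussian
cov_t3   <- coverage(function(k) rt(k, df = 3), mu = 0)    # "almost" Gaussian
cov_exp  <- coverage(rexp, mu = 1)                         # skewed (rate 1, mean 1)
c(normal = cov_norm, student_t3 = cov_t3, exponential = cov_exp)
```

Only the Gaussian case is guaranteed to hit the nominal 90%; the other two show how far the actual coverage moves when the normality assumption fails in different ways.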

I was curious and ran the n=300 example very quickly and this is what I got using your notation:

    > mean(IC1)
    [1] -0.1609844
    > mean(IC2)
    [1] -0.1609827
    > quantile(m,.05)
            5% 
    -0.1621999

Which seems rather close to me… Again, I think the more interesting statistic would be the coverage probability…

RESPONSE: again, I agree about the coverage probability.

A question and an idea:

1. Should we not be looking at the coverage probabilities instead of the distribution of the lower bound? My experience with confidence intervals has always been that coverage probabilities are more interesting than either the lower or the upper bound individually, though this might be wrong. What happens if n=300 and the distribution is almost normal?

2. Wouldn’t bootstrap confidence intervals be more interesting in the case of some departure from the normality assumption ? How reliable would bootstrap samples be with n=20 ? n=300 ?

RESPONSE: about the coverage, this is, indeed, a very good idea… About n=300, the Student distribution is then of no interest, since asymptotically it is equivalent to the Gaussian one (my post was specifically about that idea, that we keep telling our students “be aware that if n is small, since you do not know the variance, the distribution is no longer Gaussian, but Student”). Anyway, for the first comment, I can look at the proportion of cases where 0 (the true value of the mean) is outside the confidence interval…
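On the bootstrap idea raised in the question above, a percentile bootstrap interval for the mean takes only a few lines of base R. This is a basic sketch under stated assumptions (the helper name `boot_ci`, the number of resamples `B` and the seed are all chosen for the illustration); it says nothing about how reliable the bootstrap is with *n*=20, which would require a coverage study of its own.

```r
# Hypothetical helper: percentile bootstrap interval for the mean
boot_ci <- function(x, level = 0.90, B = 2000) {
  n <- length(x)
  # Resample the data with replacement and recompute the mean each time
  boot_means <- replicate(B, mean(sample(x, n, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)
x   <- rt(20, df = 3)        # one "almost" normal sample
bci <- boot_ci(x)
bci
```

The same function could be plugged into a simulation loop to estimate its coverage for n=20 and n=300.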