Saturday, February 12, 2011

Jack-knife and Boot-strap

Jack-knife and boot-strap are non-parametric methods of data analysis. They do not assume that your data follows a Gaussian distribution. They are useful in many cases where you need to estimate the error-bars on an indirect quantity. Lots of examples come to mind. Consider, for example, the error estimation of a ratio (shall we say the Binder cumulant?). The distribution of this quantity is certainly not Gaussian, and it's a further pain if you only have a few measurements. Or suppose you are doing a (complicated) fit on a data set, and need to estimate errors on the fit parameters. In each case we could come up with some specialized way of accurately determining the errors, but a simple universal answer is to use the bootstrap and/or the jack-knife. It's not my aim here to describe why these work; rather, I'll just state what to do.

Consider the jack-knife method first. Suppose you have independent data/measurements. (If you have auto-correlations, blocking the data is an effective way to make the items independent.) Now do whatever calculation you have to do on the whole data set to get the mean values of the parameters you want to determine. That's it! For the error, repeat the analysis on the whole data set but with one block removed at a time. You thus get N estimates of the quantity you are looking for, where N is the number of blocks you have divided your data into. Now quote the error as:

$$\sigma^2 = \frac{N-1}{N} \sum_{n=1}^{N} \left(\theta_n - \Theta\right)^2$$

where θn are the estimates of the parameter on each of the jackknife samples and Θ is the corresponding value on the whole data set. The error is then simply σ. Note that the variance is calculated about the full-sample value Θ, if that is the value being quoted. The average of all the jackknife estimates, ξ, can also be used to construct a bias-corrected estimator, Θ - (N-1)(ξ - Θ), which is sometimes quoted as the estimate of the mean.
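
To make this concrete, here is a minimal sketch in Python. The function name, the NumPy-based implementation, and the toy ratio estimator in the usage lines are my own illustration, not part of the recipe above:

```python
import numpy as np

def jackknife(data, estimator):
    """Leave-one-out jackknife for an arbitrary estimator.

    data      : 1-D array of independent (blocked) measurements
    estimator : function mapping an array of measurements to a single number
    Returns (full-sample estimate, jackknife error, bias-corrected estimate).
    """
    data = np.asarray(data)
    N = len(data)
    theta_full = estimator(data)  # Theta: the estimate on the whole data set
    # theta_n: the estimate with block n removed, for each of the N blocks
    theta_jk = np.array([estimator(np.delete(data, n)) for n in range(N)])
    # jackknife variance, taken about the full-sample value Theta
    sigma2 = (N - 1) / N * np.sum((theta_jk - theta_full) ** 2)
    xi = theta_jk.mean()          # average of the jackknife estimates
    theta_corrected = theta_full - (N - 1) * (xi - theta_full)
    return theta_full, np.sqrt(sigma2), theta_corrected

# usage: error on a non-Gaussian ratio such as <x^4>/<x^2>^2
blocks = np.random.default_rng(0).normal(size=200)  # stand-in for blocked data
ratio = lambda d: np.mean(d**4) / np.mean(d**2) ** 2
theta, sigma, theta_bc = jackknife(blocks, ratio)
```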

What about the boot-strap? This relies on the re-usability of the data. Make up your mind on the number of boot-strap samples, N. For each sample, select ndat points from the whole data set, where ndat is your total number of data points (this choice, I guess, is not absolute, but it is convenient). Do the selection using a uniform random number generator, so that within each boot-strap sample some data points are, in principle, reselected and some are not used at all. Now estimate the parameters you want on each sample. The mean of the estimates is your desired mean, and their standard deviation is the estimate of the error on the mean.
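
And a corresponding sketch of the boot-strap in the same spirit. Again, the function name and the default of 1000 samples are my choices for illustration:

```python
import numpy as np

def bootstrap(data, estimator, n_boot=1000, seed=None):
    """Bootstrap with replacement for an arbitrary estimator.

    data      : 1-D array of independent (blocked) measurements
    estimator : function mapping an array of measurements to a single number
    n_boot    : number of boot-strap samples to draw
    Returns (mean of the bootstrap estimates, their standard deviation).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    ndat = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # ndat indices drawn uniformly WITH replacement: some points are
        # reselected, others are not used at all, in each sample
        sample = data[rng.integers(0, ndat, size=ndat)]
        estimates[b] = estimator(sample)
    return estimates.mean(), estimates.std(ddof=1)

# usage, with the same kind of ratio estimator as before
blocks = np.random.default_rng(0).normal(size=200)  # stand-in for blocked data
ratio = lambda d: np.mean(d**4) / np.mean(d**2) ** 2
mean, err = bootstrap(blocks, ratio, seed=1)
```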
Some people argue for the "absoluteness" of the boot-strap in the sense that, if you did more and more measurements, you would come up with the same values more than once, and that is what the boot-strap mimics by doing the random selection with replacement. But a little thought will tell you that if you are measuring an operator that has a continuous distribution, then further measurements would pretty likely generate values that are not contained within your original data set.