Determining sample size necessary for bootstrap method / Proposed Method

I know this is a rather hot topic for which no one can really give a simple answer. Nevertheless, I am wondering whether the following approach could be useful.

The bootstrap method is only useful if your sample follows more or less (read: exactly) the same distribution as the original population. To be certain this is the case, you need to make your sample size large enough. But what is large enough?

If my premise is correct, you have the same problem when using the central limit theorem to estimate the population mean. Only when your sample size is large enough can you be certain that your sample means are normally distributed (around the population mean). In other words, your samples need to represent your population (distribution) well enough. But again, what is large enough?

In my case (administrative processes: time needed to finish a demand vs. number of demands) I have a population with a multimodal distribution (all the demands that were finished in 2011). I am 99% certain that it is even less normally distributed than the population I want to research (all the demands that are finished between the present day and a day in the past; ideally this timespan is as small as possible).

My 2011 population consists of enough units to make $x$ samples of size $n$. I choose a value of $x$, say $x=10$. Now I use trial and error to determine a good sample size. I take $n=50$ and check whether my sample means are normally distributed, using Kolmogorov-Smirnov. If so, I repeat the same steps with a sample size of $40$; if not, I repeat with a sample size of $60$ (etc.).

After a while I conclude that $n=45$ is the absolute minimum sample size to get a more or less good representation of my 2011 population. Since I know my population of interest (all the demands that are finished between the present day and a day in the past) has less variance, I can safely use a sample size of $n=45$ to bootstrap.
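
To make the trial-and-error step concrete, here is a minimal sketch of one round of it in plain Python. Everything specific in it is an assumption for illustration: the synthetic bimodal `population` stands in for the 2011 data, the helper names are made up, the normal distribution is fitted to the sample means themselves (which makes the plain Kolmogorov-Smirnov comparison only approximate), and the $1.36/\sqrt{x}$ threshold is the asymptotic 5% critical value, which is rough for an $x$ as small as $10$.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_statistic_vs_normal(values):
    """KS distance between the empirical CDF of `values` and a normal
    distribution whose mean and sd are estimated from the same values."""
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    m = len(ordered)
    d = 0.0
    for i, v in enumerate(ordered):
        cdf = fitted.cdf(v)
        # Compare with the empirical CDF just after and just before the jump at v.
        d = max(d, (i + 1) / m - cdf, cdf - i / m)
    return d


def disjoint_sample_means(population, n, x, rng):
    """Means of x non-overlapping samples of size n from the population."""
    drawn = rng.sample(population, n * x)  # drawn without replacement
    return [mean(drawn[i * n:(i + 1) * n]) for i in range(x)]


def looks_normal(means):
    """Rough check: asymptotic 5% KS critical value (an approximation)."""
    return ks_statistic_vs_normal(means) < 1.36 / math.sqrt(len(means))


rng = random.Random(0)
# Hypothetical stand-in for the 2011 population: bimodal handling times.
population = ([rng.gauss(5, 1) for _ in range(4000)]
              + [rng.gauss(30, 5) for _ in range(1000)])

for n in (20, 40, 60):
    means = disjoint_sample_means(population, n, 10, rng)
    print(n, round(ks_statistic_vs_normal(means), 3), looks_normal(means))
```

Each pass of the loop corresponds to one "take an $n$, test the sample means" step; the up-or-down adjustment of $n$ would be done by hand or wrapped around `looks_normal`.
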
(Indirectly, $n=45$ determines the size of my timespan: the time needed to finish $45$ demands.)

This is, in short, my idea. But since I am not a statistician but an engineer whose statistics lessons took place in the days of yore, I cannot exclude the possibility that I just generated a lot of rubbish :-). What do you guys think? If my premise makes sense, do I need to choose an $x$ larger than $10$, or smaller? Depending on your answers (do I need to feel embarrassed or not? :-) I'll be posting some more discussion ideas.

Response to the first answer

Thanks for replying. Your answer was very useful to me, especially the book links.
But I am afraid that in my attempt to give information I completely clouded my question. I know that the bootstrap samples inherit the distribution of the original sample. I follow you completely, but your original sample needs to be large enough to be moderately certain that its distribution corresponds to the 'real' distribution of the population. My proposal is merely an idea for determining how large the original sample needs to be in order to be reasonably certain that the sample distribution corresponds to the population distribution.

Suppose you have a bimodal population distribution and one peak is a lot larger than the other. If your sample size is 5, there is a large chance that all 5 units have a value very close to the large peak (the chance of randomly drawing a unit there is the largest). In this case your sample distribution will look unimodal. With a sample size of a hundred, the chance that your sample distribution is also bimodal is a lot larger!

The trouble with bootstrapping is that you only have one sample (and you build further on that sample). If the sample distribution really does not correspond to the population distribution, you are in trouble. This is just an idea to make the chance of having 'a bad sample distribution' as low as possible without having to make your sample size infinitely large.
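
The bimodal example can be put into numbers with a small sketch. It assumes, purely for illustration, that the smaller peak holds 10% of the population (`p_minor = 0.1`, a made-up figure) and it uses "the sample contains no unit at all from the smaller peak" as a simplified stand-in for "the sample looks unimodal". Under that assumption the chance of missing the smaller peak entirely is $(1-p)^n$, which is about 59% for $n=5$ and about 0.003% for $n=100$.

```python
import random


def prob_missing_minor_mode(p_minor, n):
    """Probability that none of n independent draws comes from the minor peak."""
    return (1.0 - p_minor) ** n


def simulate_missing_minor_mode(p_minor, n, trials, rng):
    """Monte Carlo estimate of the same probability."""
    misses = 0
    for _ in range(trials):
        # A draw belongs to the minor peak when rng.random() < p_minor.
        if all(rng.random() >= p_minor for _ in range(n)):
            misses += 1
    return misses / trials


rng = random.Random(42)
p_minor = 0.1  # assumed share of the smaller peak (hypothetical)
for n in (5, 100):
    exact = prob_missing_minor_mode(p_minor, n)
    approx = simulate_missing_minor_mode(p_minor, n, 5000, rng)
    print(f"n={n}: exact={exact:.5f}, simulated={approx:.5f}")
```

A sample that contains only one or two units from the smaller peak may still look unimodal, so these figures are a lower bound on the chance of a misleading sample, not an estimate of it.
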

asked Jul 29, 2012 at 14:02

Check out Bayesian bootstrap sampling, which might cope with a small sample size. See sumsar.net/blog/2015/04/… for more details.

Commented Apr 29, 2019 at 9:15