Four sampling methods are presented: simple random sampling (SRS), systematic sampling, stratified sampling, and cluster sampling. By default, sampling is done WITHOUT replacement. The package also includes a sample size estimation function.
- SRS (Simple Random Sampling)
- Systematic Sampling
- Stratified Sampling
- Cluster Sampling
- Sample Size Estimation
Required Packages
- pandas
- SciPy
This package is inspired by Sharon L. Lohr, Sampling: Design and Analysis, 2nd edition, 2009, Routledge. Wikipedia is another reference.
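A sample of size n is chosen so that every possible subset of n units from the population has the same chance of being selected.
Denote n as the sample size, N as the population size, and $\mathcal{S}$ as the set of sampled units.
The sample mean is defined as $\bar{y} = \frac{1}{n}\sum_{i \in \mathcal{S}} y_i$
The variance of the sample mean is $V(\bar{y}) = \left(1-\frac{n}{N}\right)\frac{S^2}{n}$, where $S^2$ is the population variance. It is estimated by replacing $S^2$ with the sample variance $s^2$.
The coefficient of variation (CV) is defined as $CV(\bar{y}) = \frac{\sqrt{V(\bar{y})}}{E(\bar{y})} = \sqrt{1-\frac{n}{N}}\,\frac{S}{\sqrt{n}\,\bar{y}_U}$, where $\bar{y}_U$ is the population mean.
The sampling weight of each sampled unit is $w_i = \frac{1}{\pi_i} = \frac{N}{n}$, where $\pi_i = \frac{n}{N}$ is the inclusion probability.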
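As a rough illustration of the SRS quantities above, here is a minimal sketch using pandas. The function name `srs_summary`, its parameters, and the toy population are hypothetical and are not part of this package's API. The sketch assumes the standard finite population correction $(1 - n/N)$ and uses the sample variance $s^2$ in place of $S^2$.

```python
import math

import pandas as pd


def srs_summary(df, column, n, seed=None):
    """Draw an SRS without replacement and summarise one numeric column."""
    N = len(df)
    sample = df.sample(n=n, replace=False, random_state=seed)
    y_bar = sample[column].mean()
    s2 = sample[column].var(ddof=1)      # sample variance s^2
    var_y_bar = (1 - n / N) * s2 / n     # estimated variance of the sample mean
    se = math.sqrt(var_y_bar)
    return {
        "sample": sample,
        "mean": y_bar,
        "se": se,
        "cv": se / y_bar,                # estimated coefficient of variation
        "weight": N / n,                 # each sampled unit represents N/n units
    }


# Toy population (illustrative data only)
population = pd.DataFrame({"y": range(1, 101)})
result = srs_summary(population, "y", n=10, seed=0)
print(result["mean"], result["se"], result["weight"])
```

With N = 100 and n = 10, every sampled unit gets weight 10, so the weights sum to the population size.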
Sample every k-th unit, where k is the sampling interval. For example, if the starting number is 34 and the interval is 1000, the sampled unit numbers are 34, 1034, 2034, and so on.
The data are divided into odd-numbered and even-numbered units, and a sample of size n is drawn from them. This approach is simple but not optimal.
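The interval rule can be sketched in a few lines of plain Python. The helper `systematic_indices` is a hypothetical name used only for illustration. It assumes units are numbered from 1 to N and that, when no start is given, the start is drawn uniformly from 1 to k.

```python
import random


def systematic_indices(N, k, start=None, seed=None):
    """Return the 1-based unit numbers start, start + k, start + 2k, ... up to N."""
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start in 1..k
    return list(range(start, N + 1, k))


picked = systematic_indices(N=5000, k=1000, start=34)
print(picked)  # [34, 1034, 2034, 3034, 4034]
```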
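Sample the data from partitioned sub-populations (strata).
First, divide the population of size N into H strata, where stratum h has size $N_h$, and draw a sample of size $n_h$ from each stratum.
The sample mean of each stratum is defined as $\bar{y}_h = \frac{1}{n_h}\sum_{j \in \mathcal{S}_h} y_{hj}$, where $\mathcal{S}_h$ is the set of units sampled from stratum h.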
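A minimal sketch of stratified sampling with pandas follows. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one possible allocation rule. The function `stratified_sample`, the column names, and the toy data are hypothetical.

```python
import pandas as pd


def stratified_sample(df, stratum_col, frac, seed=None):
    """Draw an SRS without replacement inside each stratum, same fraction everywhere."""
    parts = []
    for _, group in df.groupby(stratum_col):
        n_h = max(1, round(len(group) * frac))  # at least one unit per stratum
        parts.append(group.sample(n=n_h, replace=False, random_state=seed))
    return pd.concat(parts)


# Toy population with three strata of sizes 60, 30 and 10 (illustrative data only)
population = pd.DataFrame({
    "region": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
    "y": range(100),
})
sample = stratified_sample(population, "region", frac=0.1, seed=0)
stratum_means = sample.groupby("region")["y"].mean()  # sample mean of each stratum
print(sample["region"].value_counts().to_dict())
```

With a sampling fraction of 0.1, the strata contribute 6, 3 and 1 units respectively.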
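Source | df | Sum of Squares |
---|---|---|
Between strata | H-1 | $SSB = \sum_{h=1}^{H} N_h(\bar{y}_{hU}-\bar{y}_{U})^2$ |
Within strata | N-H | $SSW = \sum_{h=1}^{H} \sum_{j=1}^{N_h}(y_{hj}-\bar{y}_{hU})^2$ |
Total | N-1 | $SST = \sum_{h=1}^{H} \sum_{j=1}^{N_h}(y_{hj}-\bar{y}_{U})^2$ |

Here $N_h$ is the size of stratum h, $\bar{y}_{hU}$ is the population mean of stratum h, and $\bar{y}_{U}$ is the overall population mean.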
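If $SSB < \sum_{h=1}^{H}\left(1-\frac{N_h}{N}\right)S_h^2$, where $S_h^2$ is the population variance within stratum h, stratified sampling with proportional allocation is less efficient than SRS. Otherwise it is at least as efficient.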
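The between/within decomposition of the total sum of squares can be checked numerically. The sketch below is plain Python, and the function `anova_by_group` and the toy strata are hypothetical. It treats the given values as the whole population, so that the between and within sums of squares add up to the total.

```python
def anova_by_group(values_by_group):
    """Return (SSB, SSW, SST) for a dict mapping group label -> list of values."""
    all_values = [y for ys in values_by_group.values() for y in ys]
    grand_mean = sum(all_values) / len(all_values)
    ssb = 0.0
    ssw = 0.0
    for ys in values_by_group.values():
        group_mean = sum(ys) / len(ys)
        ssb += len(ys) * (group_mean - grand_mean) ** 2   # between groups
        ssw += sum((y - group_mean) ** 2 for y in ys)     # within groups
    sst = sum((y - grand_mean) ** 2 for y in all_values)  # total
    return ssb, ssw, sst


# Toy strata (illustrative data only)
strata = {"A": [1, 2, 3], "B": [7, 8, 9], "C": [4, 6]}
ssb, ssw, sst = anova_by_group(strata)
print(ssb, ssw, sst)  # 54.0 6.0 60.0
```

Most of the variation here lies between the strata, which is the situation in which stratification helps the most.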
The population is divided into several clusters. We then choose a few clusters and either use all the units in the chosen clusters as the sample, or sample again within the chosen clusters. The first procedure is called a 'one-stage' cluster sampling plan, and the second is called a 'two-stage' cluster sampling plan. The clusters are also called primary sampling units (psus), and the units within each cluster are called secondary sampling units (ssus).
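Let n be the number of psus in the sample, N the number of psus in the population, $M_i$ the number of ssus in psu i, $m_i$ the number of ssus sampled from psu i, and $y_{ij}$ the value of the j-th ssu in psu i. $\mathcal{S}$ denotes the set of sampled psus and $\mathcal{S}_i$ the set of ssus sampled from psu i.
Sample mean of psu (cluster): $\bar{y}_i = \frac{1}{m_i}\sum_{j \in \mathcal{S}_i} y_{ij}$
Sample variance in psu: $s_i^2 = \frac{1}{m_i-1}\sum_{j \in \mathcal{S}_i}(y_{ij}-\bar{y}_i)^2$
Estimated total for psu: $\hat{t}_i = M_i\bar{y}_i$
The unbiased estimator of the population total is defined as $\hat{t}_{unb} = \frac{N}{n}\sum_{i \in \mathcal{S}} \hat{t}_i$
The sample variance of the psu totals is defined as $s_t^2 = \frac{1}{n-1}\sum_{i \in \mathcal{S}}\left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2$
In one-stage cluster sampling, every ssu in a sampled psu is observed ($m_i = M_i$), so $\hat{t}_i = t_i$ and the standard error of $\hat{t}_{unb}$ is $SE(\hat{t}_{unb}) = N\sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}$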
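The one-stage estimator can be sketched as follows, assuming the psus are drawn by SRS without replacement and the total of each sampled psu is known exactly. The function `one_stage_total` and the numbers are hypothetical and only illustrate the arithmetic.

```python
import math


def one_stage_total(cluster_totals, N):
    """Estimate the population total and its SE from the totals of n sampled psus.

    cluster_totals: totals t_i of the sampled psus; N: number of psus in the population.
    """
    n = len(cluster_totals)
    t_hat = N / n * sum(cluster_totals)        # unbiased estimator of the total
    mean_total = sum(cluster_totals) / n       # equals t_hat / N
    s_t2 = sum((t - mean_total) ** 2 for t in cluster_totals) / (n - 1)
    se = N * math.sqrt((1 - n / N) * s_t2 / n)
    return t_hat, se


# Four sampled psus out of N = 20 (illustrative data only)
t_hat, se = one_stage_total([10, 14, 12, 16], N=20)
print(t_hat, round(se, 3))  # 260.0 23.094
```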
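Source | df | Sum of Squares |
---|---|---|
Between psus | N-1 | $SSB = \sum_{i=1}^{N} \sum_{j=1}^{M}(\bar{y}_{iU}-\bar{y}_{U})^2$ |
Within psus | N(M-1) | $SSW = \sum_{i=1}^{N} \sum_{j=1}^{M}(y_{ij}-\bar{y}_{iU})^2$ |
Total | NM-1 | $SST = \sum_{i=1}^{N} \sum_{j=1}^{M}(y_{ij}-\bar{y}_{U})^2$ |

This table assumes that every psu contains the same number M of ssus. $\bar{y}_{iU}$ is the population mean of psu i and $\bar{y}_{U}$ is the overall population mean.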
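The intraclass (or intracluster) correlation coefficient (ICC) tells us how similar elements in the same cluster are. It provides a measure of homogeneity within the clusters.
$ICC = 1 - \frac{M}{M-1}\cdot\frac{SSW}{SST}$, where M is the common cluster size, SSW is the within-cluster sum of squares, and SST is the total sum of squares.
If the elements in each cluster are similar, the within-cluster sum of squares is small and the ICC is close to 1. If the elements within a cluster are dissimilar, the ICC becomes smaller and can be negative, with a lower bound of $-\frac{1}{M-1}$. If the ICC is negative, cluster sampling is more efficient than SRS.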
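A small numeric check makes the direction of the ICC concrete. The sketch assumes equal-sized clusters and the formula $ICC = 1 - \frac{M}{M-1}\cdot\frac{SSW}{SST}$. The function name `icc` and the toy clusters are hypothetical.

```python
def icc(clusters):
    """ICC = 1 - M/(M-1) * SSW/SST for equal-sized clusters given as a list of lists."""
    M = len(clusters[0])
    if any(len(c) != M for c in clusters):
        raise ValueError("all clusters must have the same size M")
    values = [y for c in clusters for y in c]
    grand_mean = sum(values) / len(values)
    sst = sum((y - grand_mean) ** 2 for y in values)                # total
    ssw = sum((y - sum(c) / M) ** 2 for c in clusters for y in c)   # within clusters
    return 1 - M / (M - 1) * ssw / sst


homogeneous = [[1, 1, 1], [5, 5, 5], [9, 9, 9]]     # identical values inside each cluster
heterogeneous = [[1, 5, 9], [1, 5, 9], [1, 5, 9]]   # each cluster mirrors the population
print(icc(homogeneous))    # 1.0
print(icc(heterogeneous))  # -0.5
```

The second case reaches the lower bound $-\frac{1}{M-1} = -0.5$ for M = 3, where each cluster is as informative as the whole population.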
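In two-stage cluster sampling, an SRS of $m_i$ ssus is drawn from each sampled psu i, which contains $M_i$ ssus.
Estimated total for psu: $\hat{t}_i = \frac{M_i}{m_i}\sum_{j \in \mathcal{S}_i} y_{ij} = M_i\bar{y}_i$
Hence, the sample weight for each element is $w_{ij} = \frac{N}{n}\cdot\frac{M_i}{m_i}$, where n of the N psus are sampled.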
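The two-stage weights can be illustrated with a short sketch. It assumes an SRS of n psus out of N, followed by an SRS of $m_i$ ssus out of $M_i$ within each sampled psu, so that each sampled element has weight $\frac{N}{n}\cdot\frac{M_i}{m_i}$. The function `two_stage_total` and the data are hypothetical.

```python
def two_stage_total(sampled_psus, N):
    """Estimate the population total from a two-stage cluster sample.

    sampled_psus: list of (M_i, [sampled y values]) pairs, one per sampled psu.
    N: number of psus in the population.
    """
    n = len(sampled_psus)
    t_hat = 0.0
    weights = []
    for M_i, ys in sampled_psus:
        m_i = len(ys)
        w = (N / n) * (M_i / m_i)   # weight shared by every sampled ssu in psu i
        weights.append(w)
        t_hat += w * sum(ys)        # weighted sum of the observed values
    return t_hat, weights


# Two sampled psus out of N = 6 (illustrative data only)
t_hat, weights = two_stage_total([(10, [2, 4]), (20, [1, 3, 5, 7])], N=6)
print(t_hat, weights)  # 330.0 [15.0, 15.0]
```

The weighted sum agrees with $\frac{N}{n}\sum_i M_i\bar{y}_i = 3 \times (30 + 80) = 330$.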
- Specify the Tolerable Error
$P(|\bar{y}-\bar{y}_U|\leq e) = 1 - \alpha$, where $\bar{y}_U$ is the population mean, $e$ is the margin of error of the survey (usually 0.03), and $\alpha$ is the significance level (usually 0.05).
- Find an Equation
Here we get the equation: $n = \frac{z_{\alpha/2}^2 S^2}{e^2+\frac{z_{\alpha/2}^2 S^2}{N}}$.
Here $S^2 = \hat{p}(1-\hat{p})$, where $\hat{p}$ is the estimated proportion. Since the maximum value of $\hat{p}(1-\hat{p})$ is 1/4, we can substitute 1/4 for $S^2$ when no estimate of the proportion is available. Furthermore, if the population size N is unknown or very large, the equation becomes $n = \frac{z_{\alpha/2}^2 S^2}{e^2}$, that is, $n = \frac{z_{\alpha/2}^2 \hat{p}(1-\hat{p})}{e^2}$.
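The sample size equations can be evaluated with a short helper. This sketch is not the package's own function: the name `sample_size` and its defaults are assumptions, and it uses `statistics.NormalDist` from the standard library for $z_{\alpha/2}$ (`scipy.stats.norm.ppf(1 - alpha / 2)` should give the same quantile). The result is rounded up to the next integer.

```python
import math
from statistics import NormalDist


def sample_size(e=0.03, alpha=0.05, p=0.5, N=None):
    """Sample size for estimating a proportion with margin of error e.

    p=0.5 gives the conservative choice S^2 = 1/4; N=None drops the
    finite-population term.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    z2s2 = z ** 2 * p * (1 - p)               # z^2 * S^2 with S^2 = p(1 - p)
    if N is None:
        n = z2s2 / e ** 2
    else:
        n = z2s2 / (e ** 2 + z2s2 / N)
    return math.ceil(n)


print(sample_size())          # 1068
print(sample_size(N=10000))   # 965
```

With the usual 3% margin of error and 5% significance level, about 1,068 respondents are needed when the population size is unknown, and fewer once a finite N is taken into account.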