ISSN 1735-1294
Statistical Research and Training Center - Statistical Centre of Iran

Article 178 · General
Calibration Weighting to Compensate for Extreme Values, Non-response and Non-coverage in Labor Force Survey
Arman Bidarbakht-nia, Reza Navvabpour
Vol. 4, No. 1 (1 September 2007), pp. 1-14 · Online: 21 February 2016
Frame imperfection, non-response and unequal selection probabilities always affect survey results. To compensate for the effects of these problems, Deville and Särndal (1992) introduced a family of estimators called calibration estimators. These estimators seek weights that are as close as possible to the design weights, according to a chosen distance function, while satisfying the calibration equations.
In this paper, after introducing the generalized regression estimator, we explain the general form of calibration estimators. We then present the special cases of calibration estimators obtained from different distance functions, their practical aspects, and the results of comparing the methods. …
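The idea in the abstract — weights minimally distant from the design weights subject to calibration equations — can be sketched numerically. Below is a minimal illustration using the chi-square (linear) distance, which has a closed-form solution and reproduces the GREG weights; the data, design weights, and population totals are all hypothetical, not from the paper.

```python
import numpy as np

def calibrate_linear(d, X, t_x):
    """Chi-square (linear) calibration: minimize sum_i (w_i - d_i)^2 / d_i
    subject to the calibration equations X.T @ w = t_x."""
    T = (X * d[:, None]).T @ X                 # sum_i d_i x_i x_i'
    lam = np.linalg.solve(T, t_x - X.T @ d)    # Lagrange multipliers
    return d * (1.0 + X @ lam)                 # calibrated weights

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.gamma(2.0, 3.0, n)])  # intercept + a size variable
d = np.full(n, 5.0)               # design weights, e.g. an SRS of 200 from N = 1000
t_x = np.array([1000.0, 6200.0])  # assumed known population totals of the x-variables
w = calibrate_linear(d, X, t_x)
print(bool(np.allclose(X.T @ w, t_x)))   # the calibration equations hold exactly
```

Other distance functions (e.g. the raking distance) lead to the same calibration equations but require iterative solution; the linear case above is the one with an explicit formula.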
Article 183 · General
Spatiotemporal Kriging with External Drift
Mohsen Mohammadzadeh, Maryam Sharafi
Vol. 4, No. 1 (1 September 2007), pp. 15-28 · Online: 21 February 2016
In statistics it is often assumed that sample observations are independent, but in practice observations are sometimes dependent on each other. Spatiotemporal data are dependent data whose correlation arises from their locations in space and time. Spatiotemporal models arise whenever data are collected across both time and space, so such models have to be analyzed in terms of their spatial and temporal structure. Usually a spatiotemporal random field …
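To make the "kriging with external drift" idea concrete, here is a purely spatial 1-D toy (the spatiotemporal case in the paper adds a time coordinate to the covariance): the trend is an unknown linear function of a known covariate (the drift), the residual field has an exponential covariance, and the prediction weights solve the universal-kriging system. All locations, values, and covariance parameters are hypothetical.

```python
import numpy as np

def exp_cov(h, sill=1.0, scale=0.3):
    # exponential covariance for the residual field (hypothetical parameters)
    return sill * np.exp(-np.abs(h) / scale)

s = np.array([0.1, 0.3, 0.5, 0.8])      # observation locations (1-D for brevity)
z = np.array([2.0, 2.6, 3.1, 4.0])      # observed values (hypothetical)
drift = np.array([1.0, 1.4, 1.9, 2.8])  # external drift known at the data sites
s0, drift0 = 0.6, 2.2                   # prediction site and its drift value

# Kriging-with-external-drift system: the weights w solve
#   [C  F] [w ]   [c0]
#   [F' 0] [mu] = [f0]
# where F holds the drift basis (intercept + drift) and mu are Lagrange multipliers
F = np.column_stack([np.ones_like(s), drift])
C = exp_cov(s[:, None] - s[None, :])
c0 = exp_cov(s - s0)
f0 = np.array([1.0, drift0])
A = np.block([[C, F], [F.T, np.zeros((2, 2))]])
sol = np.linalg.solve(A, np.concatenate([c0, f0]))
w = sol[:4]
print(round(float(w @ z), 4))           # KED prediction at s0
```

The constraint block `F.T @ w = f0` is what makes the predictor unbiased for any linear-in-drift trend; the covariance block does the spatial interpolation of the residuals.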
Article 184 · General
The Structure of the Bhattacharyya Matrix in the Natural Exponential Family and Its Role in Approximating the Variance of a Statistic
Mohammad Khorashadizadeh, Reza Mohtashami Borzadaran
Vol. 4, No. 1 (1 September 2007), pp. 29-46 · Online: 21 February 2016
In most situations the best estimator of a function of the parameter exists, but sometimes it has a complex form and its variance cannot be computed explicitly. A lower bound for the variance of an estimator is therefore one of the fundamentals of estimation theory, because it gives an idea of the accuracy of the estimator.
It is well known in statistical inference that the Cramér-Rao inequality establishes a lower bound for the variance of an unbiased estimator: under regularity conditions, the variance of any unbiased estimator cannot be smaller than a certain quantity. The inequality alone, however, gives no idea of how sharp it is, i.e., how close the variance is to the lower bound.
An important inequality following the Cramér-Rao inequality is that of Bhattacharyya (1946, 1947).
We introduce Bhattacharyya lower bounds for the variance of an estimator and show that the Bhattacharyya inequality achieves a greater lower bound for the variance of an unbiased estimator of a parametric function, and that it becomes sharper and sharper as the order of the Bhattacharyya matrix increases. …
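The sharpening of the bound with the order of the Bhattacharyya matrix can be checked in a standard textbook example (not the paper's general exponential-family derivation): for the Poisson family the Bhattacharyya matrix is known to be diagonal with entries $B_{kk} = k!/\theta^k$, so the bounds have a closed form.

```python
import math

# Classic check: X ~ Poisson(theta), g(theta) = exp(-theta) = P(X = 0).
# The UMVUE from one observation is the indicator I(X = 0), with exact
# variance exp(-theta) * (1 - exp(-theta)).  Since the Poisson Bhattacharyya
# matrix is diagonal with B_kk = k!/theta^k, the order-k bound is
# exp(-2*theta) * sum_{j=1}^{k} theta^j / j!, increasing to the exact variance.
theta = 1.5
exact_var = math.exp(-theta) * (1.0 - math.exp(-theta))

def bhattacharyya_bound(theta, k):
    # |g^(j)(theta)| = exp(-theta) for every j, hence this closed form
    return math.exp(-2.0 * theta) * sum(theta ** j / math.factorial(j)
                                        for j in range(1, k + 1))

bounds = [bhattacharyya_bound(theta, k) for k in range(1, 6)]
print(abs(bounds[0] - math.exp(-2.0 * theta) * theta) < 1e-12)  # order 1 = Cramer-Rao
print(all(b < exact_var for b in bounds))        # every bound sits below the variance
print(all(a < b for a, b in zip(bounds, bounds[1:])))  # and the bounds increase with k
```

The order-1 bound is exactly the Cramér-Rao bound, and the partial sums converge to the true variance, illustrating how higher-order Bhattacharyya bounds approximate the variance of the statistic.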
Article 182 · General
Cut-off Sampling Design: Take All, Take Some, and Take None
Mohammad Jafari Jozani, Farshid Jamshidi
Vol. 4, No. 1 (1 September 2007), pp. 47-70 · Online: 21 February 2016
Extended Abstract. Sampling is the process of selecting units (e.g., people, organizations) from a population of interest so that by studying the sample we may fairly generalize our results back to the population from which they were chosen. To draw a sample from the underlying population, a variety of sampling methods can be employed, individually or in combination.
Cut-off sampling is a procedure commonly used by national statistical institutes to select samples. There are different types of cut-off sampling methods employed in practice. In its simplest case, part of the target population is deliberately excluded from selection. For example, in business statistics it is not unusual to cut off (very) small enterprises from the sampling frame. Indeed, it may be tempting not to use resources on enterprises that contribute little to the overall results of the survey. So, in this case, the frame and the sample are typically restricted to enterprises of at least a given size, e.g. a certain number of employees. It is assumed that the contribution of this part of the population is, if not negligible, at least small in comparison with the remaining population.
In particular, cut-off sampling is used when the distribution of the values Y1, ..., YN is highly skewed, and no reliable frame exists for the small elements. As explained above, such populations are often found in business surveys. A considerable portion of the population may consist of small business enterprises whose contribution to the total of a variable of interest (for example, sales) is modest or negligible. At the other extreme, such a population often contains some giant enterprises whose inclusion in the sample is virtually mandatory in order not to risk large error in an estimated total. One may decide in such a case to cut off (exclude from the frame, thus from sample selection) the enterprises with few employees, say five or less. The procedure is not recommended if a good frame for the whole population can be constructed without excessive cost.
This method may reduce the response burden for these small enterprises. On the other hand, this elementary form of cut-off sampling, which we refer to as type I cut-off sampling, may be considered a dirty method, simply because (i) the sampling probability is set equal to zero for some sampling units and so it can be considered as a type of non-probability sampling design, and (ii) it leads to biased estimates.
However, the use of cut-off sampling and its modified versions can be justified by several arguments. Among other things, one can argue that:
- it would cost too much, relative to a small gain in accuracy, to construct and maintain a reliable frame for the entire population;
- excluding the units of the population that contribute little to the aggregates to be estimated usually implies a large decrease in the number of units that have to be surveyed in order to reach a predefined accuracy level of the estimates;
- constraining the frame population and, as a consequence, the sample makes it possible to reduce the problem of empty strata;
- the bias caused by the cut-off is deemed negligible.
In this paper we discuss different types of cut-off sampling methods, with more emphasis on analyzing type III cut-off sampling, which consists of take-all, take-some, and take-none criteria. Roughly speaking, in the methods we discuss, the population is partitioned into two or three strata such that the units in each stratum are treated differently; in particular, a part of the target population is usually excluded a priori from sample selection. We discuss when cut-off sampling should be considered a permitted method and how to deal with it when estimating the population mean or total using model-based, model-assisted, and design-based strategies. Theoretical results are given to show how the cut-off thresholds and the sample size should be chosen. Different error sources and their effects on the overall accuracy of the presented estimators are also addressed.
The outline of the paper is as follows. In Section 2, we briefly discuss different types of cut-off sampling designs and some of their properties. In Section 3, we first introduce our notation and motivate the use of type III cut-off sampling. We further discuss estimation of the population mean (or total), either ignoring the population units in the "take none" stratum or modeling them using auxiliary information. We study the problem of ratio estimation of the population mean and type III sample size determination (for a given precision of estimation) using design-based, model-based, and model-assisted strategies. In this section, we also study the problem of threshold calculation and its approximation using different methods and under different conditions. Finally, in Section 4, we present a simulation study and compare our results with those under the commonly used cut-off sampling of type I and its modification.
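The two effects the abstract describes — the downward bias of naively ignoring the cut-off part, and its repair via auxiliary information — can be illustrated with a small simulation. This is only a sketch of type I cut-off on a skewed population with a ratio adjustment; the population model, threshold, and sample size are all hypothetical and not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5000
x = rng.lognormal(mean=2.0, sigma=1.2, size=N)  # auxiliary size measure (e.g. employees)
y = 2.0 * x * rng.lognormal(0.0, 0.3, size=N)   # study variable, roughly proportional to x
true_total = y.sum()

cut = np.quantile(x, 0.6)        # hypothetical cut-off threshold
frame = x >= cut                 # sampling frame: small units are cut off (type I)

n = 300                          # SRS from the frame part only
idx = rng.choice(np.flatnonzero(frame), size=n, replace=False)
w = frame.sum() / n              # design weight within the frame

naive = w * y[idx].sum()         # ignores the excluded stratum entirely
# Auxiliary-information adjustment: a ratio estimator using the known
# x-total of the WHOLE population, in the spirit of modeling the cut part
ratio = (y[idx].sum() / x[idx].sum()) * x.sum()

print(round(y[~frame].sum() / true_total, 3))          # share lost by the cut-off
print(round(abs(naive - true_total) / true_total, 3))  # naive relative error
print(round(abs(ratio - true_total) / true_total, 3))  # ratio-adjusted relative error
```

In skewed populations like this one, the cut-off units are 60% of the frame by count but a small share of the total, which is precisely the trade-off the paper analyzes when choosing thresholds.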
Article 180 · General
Small Area Estimation of the Mean of Household Income in Selected Provinces of Iran with a Hierarchical Bayes Approach
Shaho Zarei, Abbas Gerami, Majid Jafari Khaledi
Vol. 4, No. 1 (1 September 2007), pp. 71-90 · Online: 21 February 2016
Extended Abstract. Small area estimation has received a lot of attention in recent years due to the growing demand for reliable small area statistics. Direct estimators may not provide adequate precision, because sample sizes in small areas are seldom large enough. Hence, by employing models that use auxiliary information and area effects, one can increase the precision of direct estimators. Because area-level auxiliary information is more readily available, and because area-level models are simple and their assumptions can be checked against survey data, these models have become widely used. Therefore, basic area-level models are studied extensively in this paper to derive the empirical best linear unbiased predictor (EBLUP), empirical Bayes (EB), and hierarchical Bayes (HB) estimators under several different assumptions on the parameters. These models are used to obtain the small area estimators, i.e., the mean of household income, in several provinces of Iran, including Khorasan-e-Razavi, Hamedan, Lorestan, and Tehran. To assess the small area estimators, we used 1700 urban households living in those provinces from the data set of the 2006-2007 Household Income and Expenditure Survey, to which a sampling scheme was applied. The optimal total sample size would have been more than 400 units, but only 212 units were available. Owing to this shortage of sample, we face large MSEs and are thus confronted with a small area problem.
Three measures are used to compare the small area methods: average square error (ASE), average absolute relative bias (AARB), and average absolute bias (AAB).
We have used two types of transformations, the logarithm transformation and the Box-Cox transformation, because of the non-normality and heterogeneity of variances.
Our data analysis shows that the Box-Cox transformation is preferable to the logarithm transformation, i.e., the test statistic is more significant under this transformation; but the Box-Cox transformation causes large sampling variances, which in some cases result in non-convergence of the Gibbs algorithm.
Likewise, the HB approach gives better results than EBLUP and EB. All of these approaches are better than the direct estimator, i.e., they have smaller values of ASE, AARB, and AAB.
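The area-level machinery the abstract refers to can be sketched in its simplest frequentist form. Below is a minimal EBLUP under the basic area-level (Fay-Herriot) model with a Prasad-Rao moment estimator of the model variance; the paper's HB approach instead samples the same model with a Gibbs algorithm. All data here are simulated and the dimensions are hypothetical.

```python
import numpy as np

def fay_herriot_eblup(y, X, D):
    """EBLUP under the area-level model y_i = x_i'beta + v_i + e_i,
    with v_i ~ N(0, s2v) and e_i ~ N(0, D_i), D_i known."""
    m, p = X.shape
    # Prasad-Rao moment estimator of the model variance s2v from OLS residuals
    H = X @ np.linalg.solve(X.T @ X, X.T)
    r = y - H @ y
    s2v = max(0.0, (r @ r - np.sum(D * (1.0 - np.diag(H)))) / (m - p))
    # GLS estimate of beta, then shrink each direct estimate toward X @ beta
    W = 1.0 / (s2v + D)
    beta = np.linalg.solve((X * W[:, None]).T @ X, X.T @ (W * y))
    gamma = s2v / (s2v + D)
    return gamma * y + (1.0 - gamma) * (X @ beta)

# Toy check on simulated areas (all numbers hypothetical)
rng = np.random.default_rng(1)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=m)])
theta = X @ np.array([10.0, 2.0]) + rng.normal(0.0, 1.0, m)  # true area means
D = rng.uniform(1.0, 6.0, m)                                 # known sampling variances
y = theta + rng.normal(0.0, np.sqrt(D))                      # direct estimates
eblup = fay_herriot_eblup(y, X, D)
print(np.mean((eblup - theta) ** 2) < np.mean((y - theta) ** 2))
```

The shrinkage factor `gamma` pulls imprecise direct estimates (large `D_i`) toward the synthetic regression part, which is exactly why model-based estimators beat the direct estimator on ASE when area samples are small.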
Article 179 · General
Probabilistic Linkage of Persian Records with Missing Data
Afshin Fallah, Mohsen Mohammadzadeh
Vol. 4, No. 1 (1 September 2007), pp. 91-108 · Online: 21 February 2016
Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of them loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplications within a data set. The identification of duplications in a data set, or of the same entities in different data sets, is called record linkage. Linking data sets whose information is recorded in Persian presents special difficulties due to particular characteristics of Persian writing, such as the connectedness of letters within words, the existence of different written forms for some letters, and the dependence of a letter's written shape on its position in the word.
In this paper, the usual difficulties in linking data sets recorded in Persian are studied and some solutions are presented. We introduce compatible methods for preparing and preprocessing files through standardization, blocking, and the selection of identifier variables. A new method is proposed for dealing with missing data, a major problem in real-world applications of record linkage theory; the proposed method takes into account the probability of occurrence of missing data. We also propose an algorithm for increasing the number of comparable fields, based on partitioning composite fields such as addresses. Finally, the proposed methods are used to link records of establishment censuses in a geographical region of Iran. The results show that taking into account the probability of occurrence of missing data increases the efficiency of the record linkage process. In addition, using different codes and notations for data registration at different times leads to information loss. In particular, it is necessary to design a general pattern for writing addresses in Iran, considering geographical and environmental conditions.
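For readers unfamiliar with probabilistic record linkage, the basic scoring the paper builds on can be sketched as follows. This is the standard Fellegi-Sunter agreement weighting with the common zero-weight convention for missing fields — the paper's contribution is precisely to refine that convention by modeling the probability of missingness, which is not reproduced here. All field names and m-/u-probabilities are hypothetical.

```python
import math

# Hypothetical m-probabilities (agreement given a true match) and
# u-probabilities (agreement given a non-match) for three identifier fields
fields = {
    "first_name": {"m": 0.95, "u": 0.02},
    "last_name":  {"m": 0.97, "u": 0.01},
    "birth_year": {"m": 0.90, "u": 0.05},
}

def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over the compared fields; a missing
    field contributes zero weight (no evidence either way)."""
    total = 0.0
    for f, p in fields.items():
        a, b = rec_a.get(f), rec_b.get(f)
        if a is None or b is None:
            continue                      # missing field: zero contribution
        if a == b:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total

a = {"first_name": "Maryam", "last_name": "Ahmadi", "birth_year": 1360}
b = {"first_name": "Maryam", "last_name": "Ahmadi", "birth_year": None}
print(match_weight(a, b) > 0)   # two agreements, one missing field: positive evidence
```

Record pairs are then classified as links, possible links, or non-links by comparing the total weight to two thresholds; treating a missing comparison as zero evidence is what the paper improves upon.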
Article 181 · General
Functional Analysis of Iranian Temperature and Precipitation by Using Functional Principal Components Analysis
Norallah Tazikeh Miyandarreh, Ebrahim Hosseini-nasab
Vol. 4, No. 1 (1 September 2007), pp. 109-128 · Online: 21 February 2016
Extended Abstract. When data are in the form of continuous functions, they may challenge classical methods of data analysis based on arguments in finite-dimensional spaces, and therefore need theoretical justification. The infinite dimensionality of the spaces the data belong to leads to new statistical methodology and new insights for analyzing them, which is called functional data analysis (FDA).
Dimension reduction in FDA is essential, and is partly achieved by principal components analysis (PCA). Similar to classical PCA, functional principal components analysis (FPCA) produces a small number of constructed variables from the original data that are uncorrelated and account for most of the variation in the original data set. It therefore helps us understand the underlying structure of the data.
Temperature and amount of precipitation are functions of time, so they can be analyzed by FDA. In this paper, we treat Iranian temperature and precipitation in 2005, extract patterns of variation, and explore the structure of the data and of the correlation between the two phenomena. The data, collected from weather stations across the country, were discrete and consisted of the monthly means of temperature and precipitation recorded at each station. We therefore first fitted appropriate curves to them, taking smoothing methods into account. We then analyzed the data using FPCA and interpreted the results. When estimating the eigenvalues, we found that the first estimated eigenvalue $\hat{\theta}_1$ shows a strong domination of its associated mode of variation over all others. Furthermore, the first two eigenvalues explain more than 98% of the total variation, with individual contributions of 93.7 and 4.3 percent, respectively; the contributions of the others were less than 2 percent. Thus, we considered only the first two components.
The first estimated principal component (PC) shows that the majority of variability among the data can be attributed to differences between summer and winter temperatures. The second PC reflects the regularity of temperature when moving from winter to summer; in other words, it reflects the variation around the average of the difference between winter and summer temperatures. Furthermore, bootstrap confidence bands for the eigenvalues and eigenfunctions of the real data were obtained, including both individual and simultaneous confidence intervals for the eigenvalues. We also obtained single and double bootstrap bands for the first two eigenfunctions, and found them extremely close to each other, reflecting the high degree of accuracy of the bands obtained by the single bootstrap method.
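The FPCA pipeline described above — curves on a monthly grid, centering, eigendecomposition, proportion of variance explained — can be sketched on synthetic station curves. This discretized version (SVD of the centered curve matrix) stands in for the smoothed functional version in the paper; the number of stations, the two planted modes of variation, and all variances are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 12                                    # monthly grid, as for the station data
t = np.linspace(0.0, 1.0, T)
n = 40                                    # hypothetical number of stations
mean_curve = 15.0 + 10.0 * np.sin(2 * np.pi * t)
phi1 = np.sqrt(2.0) * np.sin(2 * np.pi * t)   # two fixed modes of variation
phi2 = np.sqrt(2.0) * np.cos(2 * np.pi * t)
curves = (mean_curve
          + rng.normal(0.0, 4.0, (n, 1)) * phi1   # dominant seasonal-contrast mode
          + rng.normal(0.0, 0.6, (n, 1)) * phi2   # weaker second mode
          + rng.normal(0.0, 0.1, (n, T)))         # small residual noise

Xc = curves - curves.mean(axis=0)                 # center the discretized curves
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s ** 2 / (n - 1)                        # estimated FPCA eigenvalues
explained = eigvals / eigvals.sum()
print(bool(explained[0] > 0.7))                   # first mode dominates
print(bool(explained[:2].sum() > 0.95))           # first two explain nearly everything
```

The rows of `Vt` are the discretized eigenfunctions; interpreting their sign patterns over the year is what leads to readings like "winter-summer contrast" for the first PC in the paper.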