Journal of Statistical Research of Iran

Probabilistic Linkage of Persian Record with Missing Data

General. Research.

Abstract. Record linkage for identifying identical units in one or more Latin-script data sets has been studied in numerous papers, and suitable methods have been proposed. Linking records whose information is recorded in Persian, however, faces particular problems because of the specific characteristics of Persian writing and the lack of a standard for recording information. This paper introduces record linkage based on a probabilistic model and presents methods for preparing files through standardization, blocking, and selection of identifier variables, so that probabilistic linkage of Persian records becomes feasible. To deal with missing data, one of the important practical problems in record linkage, a new method is proposed that incorporates the probability of missing data into the record linkage model. Estimation of the parameters of this model with the EM algorithm is then presented. To increase the number of comparable fields, an algorithm based on partitioning composite fields is also presented. Finally, the proposed methods are applied to the probabilistic linkage of records from establishment censuses in a geographical region of Iran.

Extended Abstract. When comprehensive information about a topic is scattered across two or more data sets, using only one of them loses the information held in the others. It is therefore necessary to integrate the scattered information into a single comprehensive data set. In other cases, the interest lies in recognizing duplicates within one data set. Identifying duplicates within a data set, or the same entities across different data sets, is called record linkage.
Linking data sets whose information is recorded in Persian poses special difficulties because of particular characteristics of Persian writing, such as the connectedness of letters within words, the existence of different written forms for some letters, and the dependence of a letter's shape on its position in the word. In this paper, the usual difficulties in linking data sets recorded in Persian are studied and some solutions are presented. We introduce compatible methods for preparing and preprocessing files through standardization, blocking, and selection of identifier variables. A new method is proposed for dealing with missing data, which is a major problem in real-world applications of record linkage theory. The proposed method takes into account the probability of occurrence of missing data. We also propose an algorithm that increases the number of comparable fields by partitioning composite fields such as the address. Finally, the proposed methods are used to link records of establishment censuses in a geographical region of Iran. The results show that taking the probability of occurrence of missing data into account increases the efficiency of the record linkage process. In addition, using different codes and notations for data registration at different times leads to information loss. In particular, it is necessary to design a general pattern for writing addresses in Iran that considers geographical and environmental conditions.

Keywords: record, field, matching, record linkage, likelihood ratio, EM algorithm.

Pages 91-108

http://jsri.srtc.ac.ir/browse.php?a_code=A-10-1-148&slc_lang=en&sid=1

Afshin Fallah (افشین فلاح), fallahaf@modares.ac.ir
Mohsen Mohammadzadeh (محسن محمدزاده), mohsen_m@modares.ac.ir
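As an illustration of the missing-data idea described in the extended abstract, the following minimal sketch treats "missing" as a third comparison outcome alongside "agree" and "disagree" when computing a likelihood-ratio match weight in the usual Fellegi-Sunter style. It is not the authors' model: the field names, the probability values in `FIELD_PARAMS`, and the helper functions `compare` and `match_weight` are all hypothetical, and fields are assumed to be conditionally independent. In practice the probabilities would be estimated, for example with the EM algorithm, rather than fixed by hand.

```python
import math

# Hypothetical probabilities for illustration only; not estimates from the paper.
# For each field, "m" is P(outcome | records match) and
# "u" is P(outcome | records do not match). Each distribution sums to 1.
FIELD_PARAMS = {
    "name": {
        "m": {"agree": 0.85, "disagree": 0.05, "missing": 0.10},
        "u": {"agree": 0.01, "disagree": 0.89, "missing": 0.10},
    },
    "address": {
        "m": {"agree": 0.70, "disagree": 0.10, "missing": 0.20},
        "u": {"agree": 0.05, "disagree": 0.65, "missing": 0.30},
    },
}


def compare(value_a, value_b):
    """Return the comparison outcome for one field of a record pair."""
    # None or an empty string on either side counts as missing.
    if not value_a or not value_b:
        return "missing"
    return "agree" if value_a == value_b else "disagree"


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum of log2 likelihood ratios over fields (conditional independence assumed)."""
    weight = 0.0
    for field, probs in params.items():
        outcome = compare(rec_a.get(field), rec_b.get(field))
        weight += math.log2(probs["m"][outcome] / probs["u"][outcome])
    return weight


# Example pair: the names agree, and one address is missing.
rec_a = {"name": "فلاح", "address": None}
rec_b = {"name": "فلاح", "address": "تهران"}
print(match_weight(rec_a, rec_b))
```

With this setup a missing field contributes a weight of zero only when missingness is equally likely among matches and non-matches; otherwise the missing outcome itself carries some evidence, which is one way to read the abstract's statement that the probability of missing data enters the model.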