Abstract:
One of the primary tools used in text processing tasks such as information retrieval, text extraction, and text mining is a corpus that is enhanced by linguistic tags. In a corpus development effort, the role of a POS-tagger is to assign a linguistic tag to every textual token. POS annotation relies heavily on a tagset based on a linguistic theory. Text processing in Persian, too, follows this common practice. Several tagsets have been introduced so far to annotate Persian corpora. However, each tagset has followed a specific standard and linguistic theory. The resulting tagsets contain a limited number of tags, which renders them inadequate for a larger scope of research. This study is inspired by the EAGLES and MULTEXT-East positional tagset standards to produce a comprehensive standard positional tagset for Persian. The proposed tagset is also informed by the existing Persian tagsets. The proposed Persian Positional Tagset (PPT) is designed to be used for morphological, lexical, and syntactic annotations of Persian corpora.
Machine-generated summary:
The proposed Persian Positional Tagset (PPT) is designed to be used for morphological, lexical, and syntactic annotations of Persian corpora.
The additional motivation is to produce a comprehensive set of part-of-speech categories and their respective features for Persian along with the proposed positional tagging scheme.
This study, therefore, intends to propose a comprehensive positional tagset that can be used for morphological, lexical, and syntactic annotation of Persian corpora.
Pioneering work on corpus development for the English language includes the Brown corpus (Greene & Rubin, 1971), the Lancaster/Oslo-Bergen corpus (LOB) (Johansson, 1986), the Spoken English Corpus (SEC) (Taylor & Knowles, 1988), the Polytechnic of Wales corpus (PoW) (Souter, 1989), the University of Pennsylvania corpus (UPenn) (Santorini, 1990), the London-Lund Corpus (LLC) (Eeg-Olofsson, 1991), the International Corpus of English (ICE) (Greenbaum, 1992, 1993), the British National Corpus (BNC) (Burnard, 2000), and the Spoken Corpus Recordings in British English (SCRIBE) (Huckvale, 2004), among others.
The Persian Dependency Treebank uses a tagset that, in addition to morphosyntactic annotation, introduces 43 categories for dependency relations (Rasooli, Moloodi, Kouhestani, & Minaei-Bidgoli, 2011).
Another initiative that intends to introduce a consistent morphological tagset for Indo-European languages is MULTEXT-East, which is informed by MULTEXT (Derzhanski & Kotsyba, 2013; Erjavec, 2012).
Introducing Persian Positional Tagset (PPT): The proposed tagset in this study is intended to cover a wide range of annotations, from morphological analysis to treebank, syntactic, and lexical analysis.
Similar to EAGLES and MULTEXT-East, the Persian Positional Tagset (here we adopt the PPT label for the proposed tagset) reserves the first position for specifying the main categories.
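
The idea of a positional tag, in which the first character names the main category and each later position carries one feature of that category, can be sketched in Python as follows. This is only an illustrative sketch: the category letters, feature slots, and value codes below are hypothetical and are not the actual PPT inventory, and the use of "-" for an unspecified position is an assumption modeled on MULTEXT-East practice.

```python
# Hypothetical category codes, for illustration only (NOT the real PPT set).
MAIN_CATEGORIES = {"N": "noun", "V": "verb", "A": "adjective"}

# Hypothetical feature slots per category: position 2 onward in the tag.
FEATURE_SLOTS = {
    "N": [
        ("type", {"c": "common", "p": "proper"}),
        ("number", {"s": "singular", "p": "plural"}),
    ],
    "V": [
        ("tense", {"p": "present", "s": "past"}),
        ("person", {"1": "first", "2": "second", "3": "third"}),
    ],
    "A": [
        ("degree", {"p": "positive", "c": "comparative", "s": "superlative"}),
    ],
}

# Assumed placeholder for a position whose feature is left unspecified.
UNSPECIFIED = "-"


def decode_tag(tag):
    """Decode a positional tag into a dict of category and feature values."""
    if not tag:
        raise ValueError("empty tag")
    head, rest = tag[0], tag[1:]
    if head not in MAIN_CATEGORIES:
        raise ValueError(f"unknown main category: {head!r}")
    slots = FEATURE_SLOTS[head]
    if len(rest) > len(slots):
        raise ValueError(f"too many positions for category {head!r}: {tag!r}")

    # The first position always fixes the main category.
    decoded = {"category": MAIN_CATEGORIES[head]}
    # Each remaining position is read against the slot defined for it.
    for (name, values), code in zip(slots, rest):
        if code == UNSPECIFIED:
            continue  # feature not specified at this position
        if code not in values:
            raise ValueError(f"invalid code {code!r} for feature {name!r}")
        decoded[name] = values[code]
    return decoded
```

Under these assumed codes, `decode_tag("Ncp")` reads the first position as the noun category and the next two positions as its type and number. Because the meaning of every later position depends on the first one, the same letter can encode different features in different categories.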