Abstract:
So far, various Romanization schemes have been proposed for capturing Persian
text using the Latin alphabet. However, each has served a very specific and yet
limited function. This paper proposes an extended Romanization scheme that
can facilitate a wide range of encodings needed in the field of Natural Language
Processing. The proposed scheme endeavors to preserve both orthographic and
phonological phenomena in the language. It also accounts for encoding handwritten
manuscripts, in which glyph ambiguity is a salient feature. It is
particularly relevant to Romanizing the Kufi script, in which diacritical marks
are omitted. The current work also recommends orthographic rules in an effort to
standardize future Romanization tasks.
Machine summary:
So far, various Romanization schemes have been proposed for capturing Persian text using the Latin alphabet.
Transcription, on the other hand, is another form of Romanization that captures speech utterances in the form of written text using Latin characters.
[…] 19d0), the American Library Association - Library of Congress (ALA-LC: 1997), the United Nations (UNGEGN: 1972), Deutsche Morgenländische Gesellschaft (DMG: 1969), Deutsches Institut für Normung standard (DIN 31635:1982), Board on Geographic Names (BGN/PCGN: 1946, 1958), International Civil Aviation Organization (NTWG: 2008), the British Standard (BS 4280:1968), Buckwalter (Xerox), FarsiTeX, UniPers, EuroFarsi, Dehdari, Maleki (Dabire), The CJK Dictionary Institute (CJKI), Standard Arabic Technical Transliteration System (SATTS), ASMO 449, ECMA, and International Organization for Standardization (ISO 233-3:1999).
The selection criteria are described as follows: (P1) Every letter in the Romanization scheme should be captured by a single Unicode code point representing a unique character.
(P5) If several Persian letters have the same phonological realization, the alternative glyph forms should be distinguished using Latin diacritical marks.
(P11) If a Persian letter is written but not pronounced, a unique Latin character must be assigned to mark the non-vocal property of the glyph.
For instance, one realization of the letter WAW in Persian is silent: it is written but not pronounced, as in the word خواهر (meaning sister).
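The criteria quoted above can be read as checkable properties of a Romanization table. The following is a minimal sketch under stated assumptions, not the scheme proposed in the paper: the Latin symbols in `TABLE` are hypothetical placeholders, the function names are invented for illustration, and the homophone criterion is interpreted here as "letters with the same pronunciation share a base Latin letter and differ only by a diacritic".

```python
import unicodedata

# Hypothetical entries, keyed by (Persian letter, realization).
# The Latin symbols are placeholders, NOT the symbols used in the paper.
TABLE = {
    ("\u0633", "s"): "s",            # SEEN
    ("\u0635", "s"): "\u1E63",       # SAD  -> s with dot below
    ("\u062B", "s"): "\u015B",       # THEH -> s with acute
    ("\u0648", "v"): "v",            # WAW pronounced as a consonant
    ("\u0648", "silent"): "\u1E87",  # WAW written but not pronounced
}


def base_letter(symbol):
    """Strip combining marks to recover the undecorated Latin letter."""
    decomposed = unicodedata.normalize("NFD", symbol)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def is_single_codepoint_and_unique(table):
    """Every symbol is one code point (in NFC) and no symbol is reused."""
    symbols = [unicodedata.normalize("NFC", s) for s in table.values()]
    return all(len(s) == 1 for s in symbols) and len(set(symbols)) == len(symbols)


def homophones_share_base(table):
    """Letters with the same pronunciation share one base Latin letter."""
    bases = {}
    for (_, realization), symbol in table.items():
        if realization != "silent":
            bases.setdefault(realization, set()).add(base_letter(symbol))
    return all(len(group) == 1 for group in bases.values())


def silent_marker_is_unique(table):
    """Symbols for silent letters are never used for pronounced letters."""
    silent = {s for (_, r), s in table.items() if r == "silent"}
    voiced = {s for (_, r), s in table.items() if r != "silent"}
    return bool(silent) and silent.isdisjoint(voiced)
```

Keying the table by letter and realization together is a design choice of this sketch: it lets the same written letter, such as WAW, receive different Latin symbols depending on whether it is pronounced.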
In other words, in addition to providing an unambiguous method to produce an extensive tagged corpus for Persian NLP, the current transliteration scheme also provides a range of encodings for capturing handwritten manuscripts.