Obstinacy
Several years back, when language computing was based on the ISCII standard, and Unicode was known only to some developers in the Free Software community, Microsoft issued a set of specifications for their fonts.
Called the OpenType specifications (the current specifications for Indic languages are available here), they were what the entire software community, not just the Free Software community, had to rely on for guidance, not just on fonts but also on language encoding.
Those standards issued by Microsoft had a serious little problem (കാര്യം നിസ്സാരം, പ്രശ്നം ഗുരുതരം, "the matter is trivial, the problem is serious", as they say in Malayalam): they (or rather, the OpenType specifications as available then) were as precise or as ambiguous, depending on your point of view, about how the chillus were to be encoded. To be fair, it was not Microsoft's fault: they had simply copied and pasted some paragraphs from the then-latest Unicode standard into the OT specifications.
One reading meant that a consonant + chandrakkala + ZWJ was to render a chillaksharam, or chillu. The other meant that the chillu was to be formed with the consonant + chandrakkala sequence alone. For some reason, the Free Software community decided to stick to Microsoft's definition / interpretation of the Unicode standard about chillus.
And the publisher of the OpenType specifications, the hallowed Microsoft Corporation, chose to adopt the other interpretation.
But the Free Software community, blissfully unaware of what was happening in the Microsoft world, went on to create a huge pile of software and data (mostly user interface translations, personal web pages and blogs) on a standard which nobody implemented.
When the chillus were finally encoded, we raised a hue and cry about "incompatible legacy data".
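To make the competing encodings concrete, here is a minimal Python sketch for one chillu. The choice of NA / chillu N as the example and all the variable names are mine, not taken from any of the specifications; the code points are the standard Unicode ones. It prints the code points behind each candidate and checks that no Unicode normalization form folds one candidate into another, which is where the "incompatible legacy data" complaint comes from.

```python
import unicodedata

NA = "\u0d28"            # MALAYALAM LETTER NA
CHANDRAKKALA = "\u0d4d"  # MALAYALAM SIGN VIRAMA (the chandrakkala)
ZWJ = "\u200d"           # ZERO WIDTH JOINER
CHILLU_N = "\u0d7b"      # MALAYALAM LETTER CHILLU N (the atomic chillu)

# The three candidate encodings discussed in the text, for one chillu.
candidates = {
    "consonant + chandrakkala": NA + CHANDRAKKALA,
    "consonant + chandrakkala + ZWJ": NA + CHANDRAKKALA + ZWJ,
    "atomic chillu": CHILLU_N,
}


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in candidates.items():
    names = [unicodedata.name(ch) for ch in text]
    print(label, code_points(text), names)

# Normalize every candidate under every normalization form and collect
# the distinct results. Three distinct results per form means that
# normalization does not turn any candidate into another.
forms = ("NFC", "NFD", "NFKC", "NFKD")
normalized = {
    form: {unicodedata.normalize(form, text) for text in candidates.values()}
    for form in forms
}
for form, results in normalized.items():
    print(form, "distinct results:", len(results))
```

Since the three sequences stay distinct as far as normalization is concerned, data written in one form does not become another form on its own.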
And, not learning from our mistakes, the community of developers for Malayalam Free Software became probably the first community in the Free Software world to openly declare "we will not implement a standard".
And here are a couple of reasons why that stand is obstinate and idiotic.
Quote from the above link:
1. The atomic chillu's are unacceptable because it destroys the link of a chillu with its base character.
...
Here , the fundamental problem lies in Unicode's way of treating only representational forms without checking linguistic correctness.
The above quotes demonstrate a fundamental [mis|refusal]understanding about the Unicode standards.
The concerns are addressed in the Unicode FAQs - see the Indic FAQ, and for the bigger picture, the Unicode FAQ index; and here is another.
Here is what a "character" is, according to Unicode - from the glossary.
Character.
(1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.
(2) Synonym for abstract character.
(3) The basic unit of encoding for the Unicode character encoding.
(4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
(Emphasis is mine.)
You should read that again. Especially, the "The smallest component of written language that has semantic value" part.
So the developer community, which claims to represent the Malayalam population, should rethink its stand when it says:
2. The examples used to justify semantic difference between words only separated by ZWJ are non-existent in dictionary , not in are grammatically wrong or meaningless without proper context.
And finally, the security | spoofing part.
Did anybody in the "Swatantra Malayalam" community at least attempt to read RFC 3454 ("stringprep")?
If you are bandwidth or time challenged, you can avoid going to the RFC and read the excerpts below.
RFC 3454, Preparation of Internationalized Strings (Hoffman & Blanchet, Standards Track, December 2002):

5. Prohibited Output
Before the text can be emitted, it MUST be checked for prohibited code points. There are a variety of prohibited code points, as described in this section. A profile of this document MAY use all or some of the tables in appendix C.
The stringprep process never emits both an error and a string. If an error is detected during the checking for prohibited code points, only an error is returned.
Note that the subsections below describe how the tables in appendix C were formed. They are here for people who want to understand more, but they should be ignored by implementors. Implementations that use tables MUST map based on the tables themselves, not based on the descriptions in this section of how the tables were created.
The lists in appendix C MUST be used by implementations of this specification. If there are any discrepancies between the lists in appendix C and subsections below, the lists in appendix C always take precedence.
Some code points listed in one section may also appear in other sections.
It is important to note that a profile of this document MAY prohibit additional characters.
Each subsection of this section has a matching subsection in appendix C. For example, the characters listed in section 5.1 are listed in appendix C.1.

5.1 Space characters
Space characters can make accurate visual transcription of strings nearly impossible and could lead to user entry errors in many ways. Note that the list below is split into two tables in appendix C: Table C.1.1 contains the ASCII code points, while Table C.1.2 contains the non-ASCII code points. Most profiles of this document that want to prohibit space characters will want to include both tables.
0020; SPACE
00A0; NO-BREAK SPACE
1680; OGHAM SPACE MARK
2000; EN QUAD
2001; EM QUAD
2002; EN SPACE
2003; EM SPACE
2004; THREE-PER-EM SPACE
2005; FOUR-PER-EM SPACE
2006; SIX-PER-EM SPACE
2007; FIGURE SPACE
2008; PUNCTUATION SPACE
2009; THIN SPACE
200A; HAIR SPACE
200B; ZERO WIDTH SPACE
202F; NARROW NO-BREAK SPACE
205F; MEDIUM MATHEMATICAL SPACE
3000; IDEOGRAPHIC SPACE

5.2 Control characters
Control characters (or characters with control function) cannot be seen and can cause unpredictable results when displayed. Note that the list below is split into two tables in appendix C: Table C.2.1 contains the ASCII code points, while Table C.2.2 contains the non-ASCII code points. Most profiles of this document that want to prohibit control characters will want to include both tables.
0000-001F; [CONTROL CHARACTERS]
007F; DELETE
0080-009F; [CONTROL CHARACTERS]
06DD; ARABIC END OF AYAH
070F; SYRIAC ABBREVIATION MARK
180E; MONGOLIAN VOWEL SEPARATOR
200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER
2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR
2060; WORD JOINER
2061; FUNCTION APPLICATION
2062; INVISIBLE TIMES
2063; INVISIBLE SEPARATOR
206A-206F; [CONTROL CHARACTERS]
FEFF; ZERO WIDTH NO-BREAK SPACE
FFF9-FFFC; [CONTROL CHARACTERS]
1D173-1D17A; [MUSICAL CONTROL CHARACTERS]

C.2.2 Non-ASCII control characters
----- Start Table C.2.2 -----
0080-009F; [CONTROL CHARACTERS]
06DD; ARABIC END OF AYAH
070F; SYRIAC ABBREVIATION MARK
180E; MONGOLIAN VOWEL SEPARATOR
200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER
2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR
2060; WORD JOINER
2061; FUNCTION APPLICATION
2062; INVISIBLE TIMES
2063; INVISIBLE SEPARATOR
206A-206F; [CONTROL CHARACTERS]
FEFF; ZERO WIDTH NO-BREAK SPACE
FFF9-FFFC; [CONTROL CHARACTERS]
In simple words, the above means: no zero width joiners or zero width non-joiners in the address bar of your browser. Or, more accurately, browsers are not supposed to send strings / addresses containing them.
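If you would rather check than take my word for it, Python's standard library ships the RFC 3454 tables in its `stringprep` module. The sketch below is only an illustration: the helper name `prohibited_code_points` and the sample strings are my own, and it looks only at the space and control tables (C.1 and C.2) quoted above, not at a complete stringprep profile.

```python
import stringprep

NA = "\u0d28"            # MALAYALAM LETTER NA
CHANDRAKKALA = "\u0d4d"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200d"           # ZERO WIDTH JOINER
ZWNJ = "\u200c"          # ZERO WIDTH NON-JOINER


def prohibited_code_points(text):
    """List the code points of text that fall in tables C.1.x or C.2.x.

    Hypothetical helper for illustration: it consults only the space
    tables (C.1.1, C.1.2) and the control tables (C.2.1, C.2.2).
    """
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if stringprep.in_table_c11_c12(ch) or stringprep.in_table_c21_c22(ch)
    ]


# Both joiners are listed in Table C.2.2 (non-ASCII control characters).
print(stringprep.in_table_c22(ZWJ), stringprep.in_table_c22(ZWNJ))

# A chillu written with the joiner contains a prohibited code point;
# the same sequence without the joiner does not.
print(prohibited_code_points(NA + CHANDRAKKALA + ZWJ))
print(prohibited_code_points(NA + CHANDRAKKALA))
```

The table lookups take one character at a time, which is why the helper walks the string character by character.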
Just for more technical accuracy, here is some more info.
The IDNA standard is RFC 3490. That is the basic document developers and applications should conform to, to enable Unicode-compliant domain names. The Unicode characters are first processed through a "ToASCII" filter, which entails checking that the string conforms to the requirements of nameprep (RFC 3491), the profile of RFC 3454 (stringprep) used for domain names.
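A rough way to watch ToASCII at work is Python's built-in `idna` codec, which implements RFC 3490. Take the sketch below as an illustration under that assumption, not as a statement about what any particular browser does; the helper name `to_ascii` and the sample names are mine. One detail worth knowing: in the nameprep profile the two joiners are also in the "mapped to nothing" table (B.1), so this codec drops them during the mapping step, and names that differ only by a joiner come out as the same ASCII name.

```python
ZWJ = "\u200d"           # ZERO WIDTH JOINER
NA = "\u0d28"            # MALAYALAM LETTER NA
CHANDRAKKALA = "\u0d4d"  # MALAYALAM SIGN VIRAMA


def to_ascii(name):
    """Run a domain name through Python's RFC 3490 (IDNA) codec."""
    return name.encode("idna")


# An ASCII name with a joiner hidden in the middle collapses to the
# plain name once nameprep has been applied.
print(to_ascii("exam" + ZWJ + "ple.com"))

# The same holds for a Malayalam label: with or without the joiner,
# the label that would go on the wire is identical.
with_zwj = to_ascii(NA + CHANDRAKKALA + ZWJ)
without_zwj = to_ascii(NA + CHANDRAKKALA)
print(with_zwj == without_zwj)
```

Either way, a conforming application never looks up two different names that differ only by a joiner.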
So, the question of spoofing does not arise, unless applications do not conform to the IDNA standards (and there are plenty of them out there - see here for conforming applications). In the case of applications which do not conform to standards, it is within our power to uninstall that application (in the case of proprietary software), or to file bugs / fix them.
If you are a Malayalam developer, and if the RFCs above do not convince you that spoofing is not possible with chillus, I have just one thing to tell you: stop pretending to represent the Malayalam Free Software Community.
I know that I sound rude, but there is no choice left for me.
And look at the danger we are heading into. The latest version has specified that we should use the chillu na for the /nta/ stacked conjunct. However dunderheaded that specification may be, the specification is a specification, which means that it is a specification. In other words, it is a specification. And because it is a specification, we are obliged to implement it.
The danger is that the stacked /nta/ requires the atomic chillu, not the "canonical equivalent" one. True, it is going to introduce plenty of ambiguities and legacy data problems, but that is the pain Microsoft is. The user community does not want another strawman-based pain from the Free Software world which says "we will not implement that standard".