Me on Random things like Technology.: October 2009

Obstinacy

Several years back, when language computing was based on the ISCIIstandard, and Unicode was known only to some developers in the Free Software community, Microsoft had issued a set of specifications for their fonts.

Called the OpenType specifications, (current specifications for Indic languages available here) -. the entire software community - not just the Free Software community -- had to rely on those specifications for guidance for not just the fonts, but also for language encoding.

Those standards issued by Microsoft had a serious little problem - (കാര്യം നിസ്സാരം, പ്രശ്നം ഗുരുതരം, as they say in Malayalam) - they (or rather, the OpenType specifications as available then), were as precise or ambiguous, depending on your point of view, about how the chillus were to be encoded. To be fair, it was not Microsoft's fault - they had simply copied and pasted some paragraphs from the then latest standards from Unicode into the OT specifications.

One reading meant that a consonant + chandrakkala + zwj was to render a chillakshram, or a chillu. The other meant that the chillu was to be formed only with the consonant + chandrakkala sequence. For some reason, the Free software community decided to stick to Microsoft's definition / interpretation of the Unicode standards, about Chillus.

And the publisher of the OpenType specifications, the hallowed Microsoft Corporation, choose to adopt the other interpretation.

But the Free Software community, blissfully unaware of what was happening in the Microsoft world, went on to create a huge pile of software and data (mostly user interface translations, personal web pages and blogs) on a standard which no-body implemented.

When the Chillus were finally encoded, we raised a hue and cry aout "incompatible legacy data".

And not learning from our mistakes, the community of developers for Malayalam Free Software, became probably, the first community in the Free Software world to openly declare "we will not implement a standard".

And here are a couple of reasons why that stand is obstinate and idiotic.
quote from the above link

1. The atomic chillu's are unacceptable because it destroys the link of a chillu with its base character.
...
Here , the fundamental problem lies in Unicode's way of treating only representational forms without checking linguistic correctness.

The above quotes demonstrate a fundamental [mis|refusal]understanding about the Unicode standards.

The concerns are addressed in the Unicode FAQs - see the Indic FAQ, and for the bigger picture, the the Unicode FAQ index; and here is another.

Here is what a "character" is, according to Unicode - from the glossary.

Character.
(1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.
(2) Synonym for abstract character.
(3) The basic unit of encoding for the Unicode character encoding.
(4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).

(Emphasis, is mine).

You should read that again. Especially, the "The smallest component of written language that has semantic value" part.

So, when the developer community, which claims to represent the Malayalam population should rethink their stand when they say:-

2. The examples used to justify semantic difference between words only separated by ZWJ are non-existent in dictionary , not in are grammatically wrong or meaningless without proper context.

And finally, the security | spoofing part.

Did anybody in the "Swatantra Malayalam" community at least attempt to read RFC3454 ("stringprep").

If you are bandwidth or time challenged, you can avoid going to the RFC and read the excerpts below.

 5. Prohibited Output

   Before the text can be emitted, it MUST be checked for prohibited
   code points.  There are a variety of prohibited code points, as
   described in this section.  A profile of this document MAY use all or
   some of the tables in appendix C.

   The stringprep process never emits both an error and a string.  If an
   error is detected during the checking for prohibited code points,
   only an error is returned.

   Note that the subsections below describe how the tables in appendix C
   were formed.  They are here for people who want to understand more,
   but they should be ignored by implementors.  Implementations that use
   tables MUST map based on the tables themselves, not based on the
   descriptions in this section of how the tables were created.

   The lists in appendix C MUST be used by implementations of this
   specification.  If there are any discrepancies between the lists in
   appendix C and subsections below, the lists in appendix C always take
   precedence.

   Some code points listed in one section may also appear in other
   sections.

   It is important to note that a profile of this document MAY prohibit
   additional characters.






Hoffman & Blanchet          Standards Track                    [Page 10]

RFC 3454        Preparation of Internationalized Strings   December 2002


   Each subsection of this section has a matching subsection in appendix
   C.  For example, the characters listed in section 5.1 are listed in
   appendix C.1.

5.1 Space characters

   Space characters can make accurate visual transcription of strings
   nearly impossible and could lead to user entry errors in many ways.
   Note that the list below is split into two tables in appendix C:
   Table C.1.1 contains the ASCII code points, while Table C.1.2
   contains the non-ASCII code points.  Most profiles of this document
   that want to prohibit space characters will want to include both
   tables.

   0020; SPACE
   00A0; NO-BREAK SPACE
   1680; OGHAM SPACE MARK
   2000; EN QUAD
   2001; EM QUAD
   2002; EN SPACE
   2003; EM SPACE
   2004; THREE-PER-EM SPACE
   2005; FOUR-PER-EM SPACE
   2006; SIX-PER-EM SPACE
   2007; FIGURE SPACE
   2008; PUNCTUATION SPACE
   2009; THIN SPACE
   200A; HAIR SPACE
   200B; ZERO WIDTH SPACE
   202F; NARROW NO-BREAK SPACE
   205F; MEDIUM MATHEMATICAL SPACE
   3000; IDEOGRAPHIC SPACE

5.2 Control characters

   Control characters (or characters with control function) cannot be
   seen and can cause unpredictable results when displayed.  Note that
   the list below is split into two tables in appendix C: Table C.2.1
   contains the ASCII code points, while Table C.2.2 contains the non-
   ASCII code points.  Most profiles of this document that want to
   prohibit control characters will want to include both tables.

   0000-001F; [CONTROL CHARACTERS]
   007F; DELETE
   0080-009F; [CONTROL CHARACTERS]
   06DD; ARABIC END OF AYAH
   070F; SYRIAC ABBREVIATION MARK
   180E; MONGOLIAN VOWEL SEPARATOR



Hoffman & Blanchet          Standards Track                    [Page 11]

RFC 3454        Preparation of Internationalized Strings   December 2002


   200C; ZERO WIDTH NON-JOINER
   200D; ZERO WIDTH JOINER
   2028; LINE SEPARATOR
   2029; PARAGRAPH SEPARATOR
   2060; WORD JOINER
   2061; FUNCTION APPLICATION
   2062; INVISIBLE TIMES
   2063; INVISIBLE SEPARATOR
   206A-206F; [CONTROL CHARACTERS]
   FEFF; ZERO WIDTH NO-BREAK SPACE
   FFF9-FFFC; [CONTROL CHARACTERS]
   1D173-1D17A; [MUSICAL CONTROL CHARACTERS]

C.2.2 Non-ASCII control characters

   ----- Start Table C.2.2 -----
   0080-009F; [CONTROL CHARACTERS]
   06DD; ARABIC END OF AYAH
   070F; SYRIAC ABBREVIATION MARK
   180E; MONGOLIAN VOWEL SEPARATOR
   200C; ZERO WIDTH NON-JOINER
   200D; ZERO WIDTH JOINER
   2028; LINE SEPARATOR
   2029; PARAGRAPH SEPARATOR
   2060; WORD JOINER
   2061; FUNCTION APPLICATION
   2062; INVISIBLE TIMES
   2063; INVISIBLE SEPARATOR
   206A-206F; [CONTROL CHARACTERS]
   FEFF; ZERO WIDTH NO-BREAK SPACE
   FFF9-FFFC; [CONTROL CHARACTERS]

In simple words, the above means - no zero width joiners or zero width non-joiners in the address bar of your browser. Or more accurately, browsers are not supposed to send strings / addressesses.

Just for more technical accuracy, here is some more info.

The IDNA RFC 3490. That is the basic documents developers and applicaions should conform to, to enable unicode compliant domain names. The unicode characters are first process through a "toASCII" filter, which entails checking to ensure that the string conforms to requirements of RFC 3454 (nameprep) .

So, the question of spoofing does not arise; unless applications do not conform to the IDNA standards (and there are plenty of them out there - see here for conforming applications). In case of applications which do nto conform to standards, it is within our power to uninstall that application (in case of proprietary software), or file bugs / fix them.

If you are a Malayalam developer, and if the RFCs above do not convince you that spoofing is not possible with chillus, I have just one thing to tell you =- stop pretending to represent the Malayalam Free Software Community.

I know that I sound rude, but there is no choice left for me.

And look at the danger we are heading into. The latest version has specifiecd that we should the chillu na for the /nta/ stacked conjunct. However dunderheaded that specification may be, the specification is a specification, which means that it is a specification. In other words, it is a specification. and because it is a specification, we are obliged to implement it.

The danger is the stacked /nta/ requires the atomic chillu. Not the "canonical equivalent" one. True, it going to introduce plenty of ambiguities and legacy data problems, but that is the pain Microsoft is. The user community does not want another strawman based pain from the Free Software world which says "we will not implement that standards".

Me on Random things like Technology.

Wednesday, October 28, 2009

How swatantra is this?

Obstinacy

Blog Archive

About Me