Technical Notes

Characters

Web browser technology in 1998, when I completed the first pass of this tutorial, was in a very different state than in 2013. Although the Unicode standard existed with the goal of providing a code for every character in every human language (and already provided a block of definitions for Devanagari), Microsoft's Windows operating system was still several years away from supporting it and there was no chance of the typical browser even having the fonts to support Devanagari characters. (For more background, read All About Unicode, UTF8 & Character Sets.) I therefore initially used GIF images for Devanagari letters, painstakingly modifying those I found at Ukindia.

By 2013 Unicode support has changed dramatically. All major operating systems and all major browsers support Unicode at a fundamental level. Browsers can read HTML pages encoded in UTF-8, and rendering engines algorithmically combine vowels, conjuncts, and anusvār. There is no longer any need to use embedded images to represent Devanagari characters; in fact, doing so greatly restricts the accessibility of the content, and the practice should be abandoned.

Unicode Tips

In storing Devanagari as a series of Unicode code points, I learned a few tips that may prove helpful to others. Unicode rendering algorithms of browsers know how to automatically combine a consonant, a vowel mātrā, and anusvār. So if I present म (U+092E), ै (U+0948), and ं (U+0902) as a sequence of Unicode code points, the browser automatically knows how to combine them into the single glyph मैं (see The Letter ऐ (ai) for more information on what this means in terms of Devanagari). But conjunct formation does not happen automatically. If I want to join स (U+0938) and त (U+0924), for instance, putting them together merely yields सत (after all, consonants do not form conjuncts in all contexts). To join the letters I must place a virama ् (U+094D) between them: स (U+0938), ् (U+094D), and त (U+0924) in a sequence yield the desired conjunct स्त.
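The sequences described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the constant names and the describe helper are made up for this example, and whether the result appears as a single combined glyph still depends on the font and rendering engine, not on the string itself.

```python
import unicodedata

# Code points taken from the text above; the constant names are illustrative.
MA = "\u092E"        # म
AI_MATRA = "\u0948"  # ै
ANUSVARA = "\u0902"  # ं
SA = "\u0938"        # स
TA = "\u0924"        # त
VIRAMA = "\u094D"    # ्

# Consonant + vowel mātrā + anusvār: three code points, rendered as one unit.
main = MA + AI_MATRA + ANUSVARA

# Two consonants side by side stay separate letters.
plain = SA + TA

# Inserting a virama between them requests the conjunct.
conjunct = SA + VIRAMA + TA


def describe(text):
    """Hypothetical helper: list each code point with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in (("main", main), ("plain", plain), ("conjunct", conjunct)):
    print(label, text, describe(text))
```

Printing the code point names alongside the text is a convenient way to check what is actually stored when the rendered output looks wrong, since a missing virama is otherwise hard to spot.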

Transliteration

Standardization of transliteration is somewhat less complete than that of bit-level representation, which has settled on the Unicode character set stored in UTF-8 or some similar but equivalent encoding. For representing Devanagari characters in a Latin script there remains a variety of options, none universal. In India the Hunterian transliteration system was officially adopted by the government, but it has been criticized for lacking differentiation between certain letters. The International Organization for Standardization (ISO) in 2001 introduced ISO 15919, a systematic and largely consistent approach for transliteration of Hindi and other languages that use Devanagari and related scripts. ISO 15919 is little used inside India, but it does provide an unambiguous, international standard to follow.

ISO 15919 can be pedantic at times. To be consistent with its representation of आ as a long ā (to distinguish it from the short a used to represent अ), ISO 15919 prefers to use ō and ē to represent ओ and ए, respectively, as these are long vowel sounds, even though normally there are no corresponding short values in Hindi. (The ē symbol could confuse English speakers as well, as they are taught that a long e is pronounced as in me, rather than as the first part of the vowel sound in may.) ISO 15919 allows the exception of using o and e instead if there are no short vowels to cause confusion, which is what I preferred to use in this tutorial.

There also exists a very popular set of conventions referred to as the International Alphabet of Sanskrit Transliteration (IAST). This is largely a subset of ISO 15919, with the simplification of using o and e to represent ओ and ए, as mentioned above, and the use of ṃ for all nasalization. IAST has several shortcomings for my purposes, however. First, IAST is not an official standard and to my knowledge there exists no IAST specification. (ISO 15919, on the other hand, can be downloaded from ISO for a fee.) Furthermore, the use of ṃ for all nasalization is in my mind overkill: the nasalization of vowels is better indicated in some manner that does not create a separate character, which may confuse a newcomer into believing that an extra m sound should be pronounced. There also exist other specialized approaches to transliterating Devanagari, such as the ITRANS system popular for entering Bollywood song lyrics on the web.

For this tutorial I settled on the use of ISO 15919. This has the benefit of being an international standard with a published specification. It allows using o and e to represent ओ and ए in contexts in which there would be no confusion with a short o or short e, as is the case in Hindi. And although it does allow a simplified nasalization representation using ṁ throughout, its standard prescription is to represent vowel nasalization by placing a tilde over the vowel, which I believe more closely mirrors the use of candrabindu/anusvār in Devanagari and will come as more natural to new learners.

Transliteration Tips

Deciding on a tilde for vowel nasalization raises the question of how to combine a tilde with existing letters. For the Latin lowercase letter a (U+0061), for instance, Unicode already provides a Latin lowercase letter a with tilde ã (U+00E3). But for the Latin lowercase letter a with macron ā (U+0101), there exists no code point representing the corresponding letter with tilde. There is, however, a separate combining tilde character ̃ (U+0303) which can follow any letter and, if the rendering engine supports it, be combined with the previous letter by appearing above it. This in turn raises the question of whether the Latin lowercase letter a with macron ā (U+0101) should itself be decomposed into the Latin lowercase letter a (U+0061) followed by a combining macron ̄ (U+0304). In the end I opted to consider the Latin lowercase letter a with macron ā (U+0101) as a semantic unit (equivalent to e.g. आ) and to consider the combining tilde ̃ (U+0303) semantically equivalent to the Devanagari nasalization mark in question. Hopefully this has the side benefit of reducing the risk of confusing the various rendering systems, which may not be able to handle multiple combining glyphs.
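The behavior of these code points can be checked with Python's standard unicodedata module. The following is a minimal sketch, with variable names chosen only for illustration, showing that Unicode normalization composes a plus combining tilde into the single code point ã, but leaves ā plus combining tilde as two code points because no precomposed character exists.

```python
import unicodedata

A_MACRON = "\u0101"          # ā, Latin small letter a with macron
COMBINING_TILDE = "\u0303"   # combining tilde
COMBINING_MACRON = "\u0304"  # combining macron

# Short nasalized vowel: NFC composes a + tilde into the precomposed ã (U+00E3).
nasal_short = unicodedata.normalize("NFC", "a" + COMBINING_TILDE)

# Long nasalized vowel: no precomposed "a with macron and tilde" exists,
# so NFC leaves the macron letter followed by the combining tilde.
nasal_long = unicodedata.normalize("NFC", A_MACRON + COMBINING_TILDE)

# Full decomposition (NFD) splits ā as well, giving three code points.
decomposed = unicodedata.normalize("NFD", nasal_long)

print([f"U+{ord(c):04X}" for c in nasal_short])  # ['U+00E3']
print([f"U+{ord(c):04X}" for c in nasal_long])   # ['U+0101', 'U+0303']
print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0061', 'U+0304', 'U+0303']
```

Storing the text in NFC form therefore happens to match the choice described above: the macron letter stays a single unit and only the nasalization mark remains a combining character.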

Audio

Native audio support in browsers has historically been horrible. In my original tutorials, I sidestepped the browser completely through the use of a custom Java applet. Java audio support was pretty dismal at the time, but at least it provided consistent support across platforms. I was forced to use low-quality Au audio files sampled at only 8 kHz.

It was not until the advent of HTML5 that browsers began to standardize on how to invoke audio clips purely from HTML and JavaScript. Unfortunately browsers did not standardize on the actual audio codecs, with the result that until recently there was no single audio format supported across all major browsers. After Mozilla's change of heart, Firefox 21 was released on May 14, 2013 with the ability to play MP3 audio if the underlying operating system supports it (as is the case in the most recent versions of Microsoft Windows). This makes MP3 the only audio format supported on all the major browsers.