blog.darrenstruthers.net: Phonetic Transliteration of English into Tengwar

One of the things I love about The Lord of the Rings is that J. R. R. Tolkien originally created the setting of Middle-earth as a place where the characters could speak the various invented languages that he spent most of his life working on. Tolkien's languages have interesting grammatical properties, but the focus of this post is merely on one of the alphabets that Tolkien devised for writing his languages. Specifically, I'd like to share the experience I've had so far working on a Web-based transliterator for automatically converting English text from Latin letters to Tolkien's Tengwar alphabet.

If you couldn't care less about all the technical details and just want to mess around with the tool (and help me find bugs), you can go check it out here.

If you're a real nerd, read on...

Some Challenges of Creating a Tengwar Transliterator

Note: If you happen to know a thing or two about writing English in Tengwar, then I should clarify up front that what I have done so far is implement a phonetic transliterator. The so-called Common Mode is my ultimate goal, but as it introduces some additional difficult challenges, I haven't gotten there yet. I can't find an example on the Internet of anyone actually having pulled this off, so it may be a bit of a lofty goal.

Having essentially dived into this project on a whim, I ended up running into more issues than I initially anticipated. They break down into the following problem areas, each of which I'll touch on below.

1. Orthographic differences between the Latin and Tengwar alphabets
2. Programmatically determining pronunciations of English words
3. Digital encoding of Tengwar characters and font selection
4. Displaying Tengwar characters on the Web and browser compatibility issues

To implement reverse transliteration of Tengwar back into Latin text, the following additional problem areas emerge:

5. Keyboard (or other) input of Tengwar characters
6. A less easily overcome version of problem 1

Four annoying problems seemed like enough, so I haven't implemented reverse transliteration yet, although I intend to in the future. I've been enjoying pondering how to deal with problems 5 and 6.

Orthographic Differences Between the Latin and Tengwar Alphabets

I'm not going to provide anything in the way of a tutorial for how to read and write Tengwar. If you're interested in that, here are some useful links:

Tengwar and Latin orthography are very different when it comes to writing English, whose spelling is a very irregular guide to its pronunciation. As one of countless examples, consider words like 'daughter', 'laughter', and 'aghast'. In each of these words we encounter a <gh> digraph with a different pronunciation. There are historical and etymological explanations for irregularities such as this, but the bottom line is that they create headaches for the aspiring Latin-to-Tengwar transcriber.

The primary cause of these headaches is that Tengwar is a phonetic alphabet. Its characters are organized into four series, or témar, and six grades, or tyeller, which represent the place and manner of articulation for the sound represented by each character. In less technical terms, that means that similar-looking Tengwar characters tend to represent similar phonetic sounds, which is pretty cool.

The simplest solution to this orthographic discrepancy was to just aim for a phonetic transliteration for now. That means we need the computer to be able to take these unpredictably spelled English words and convert them to phonetic representations in Tengwar. This led me to my next problem.

Programmatically Determining Pronunciations of English Words

How can a computer tell how an English word is pronounced? I wasn't quite sure at first. Very similar words like 'laughter' and 'daughter' with different pronunciations caused me to seriously doubt the efficacy of trying to determine a word's pronunciation by algorithm. Instead, I assumed there must be some kind of database or Web service that I could use.

While assessing my options, I discovered The CMU Pronouncing Dictionary from Carnegie Mellon University. It is no surprise that CMU came to my aid, as they are a research leader in text/speech analysis. The CMU Pronouncing Dictionary contains North American English pronunciations for 125,000 words (including common proper nouns), and it is machine-readable. This was my missing link.

I wrote a thin object-oriented wrapper around this dictionary file in PHP which allowed me to access pronunciation information in my programs. Here is a succinct usage example:

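<?php
// Pronunciation is the thin wrapper class around the CMU dictionary file.
$words = array('daughter', 'laughter', 'aghast');

foreach ($words as $word) {
    // Look up the word; the phoneme string is read from a public property.
    $p = new Pronunciation($word);
    echo "$word is pronounced / $p->pronunciation /\n";
}
?>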

The above outputs:

daughter is pronounced / D AO1 T ER0 /
laughter is pronounced / L AE1 F T ER0 /
aghast is pronounced / AH0 G AE1 S T /
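
For readers curious what sits underneath a wrapper like this, here is a minimal sketch of reading dictionary lines into a lookup table. It is not the actual Pronunciation class: the function names parseCmuLine and buildIndex are hypothetical, and the sketch assumes the classic dictionary layout, where each line is a headword followed by whitespace and its phonemes, comment lines begin with ";;;", and alternate pronunciations are written like "WORD(1)".

```php
<?php
// Hypothetical helpers, not the author's Pronunciation class.
// Assumes lines shaped like "WORD  PH1 PH2 ...", comments starting
// with ";;;", and alternates marked as "WORD(1)".
function parseCmuLine(string $line): ?array
{
    $line = trim($line);
    if ($line === '' || strncmp($line, ';;;', 3) === 0) {
        return null; // blank line or comment
    }
    // Split the headword from its phonemes at the first run of whitespace.
    $parts = preg_split('/\s+/', $line, 2);
    if (count($parts) < 2) {
        return null; // no pronunciation on this line
    }
    // Drop an alternate-pronunciation marker such as "(1)".
    $word = strtolower(preg_replace('/\(\d+\)$/', '', $parts[0]));
    return [$word, $parts[1]];
}

function buildIndex(array $lines): array
{
    $index = [];
    foreach ($lines as $line) {
        $entry = parseCmuLine($line);
        if ($entry !== null && !isset($index[$entry[0]])) {
            $index[$entry[0]] = $entry[1]; // keep the first pronunciation only
        }
    }
    return $index;
}

$index = buildIndex([
    ';;; sample lines in the dictionary format',
    'DAUGHTER  D AO1 T ER0',
    'LAUGHTER  L AE1 F T ER0',
    'AGHAST  AH0 G AE1 S T',
]);
echo $index['laughter'], "\n"; // L AE1 F T ER0
```

Keeping only the first pronunciation per word is a simplification; a real wrapper might want to expose the alternates too.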

As you can see, pronunciations are represented as a space-delimited sequence of phonemes. The CMU dictionary defines 39 phonemes for North American English. The exact number of phonemes in the English language is dialect-dependent, but I had to pick something and go with it. The numbers at the end of vowel phonemes represent the relative amount of stress that is placed on each syllable. For my purposes, I disregard the numbers, because Tengwar does not denote stress.
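
Disregarding the stress numbers can be sketched in a few lines. The function name stripStress is my own for illustration; the only assumption is that stress is always a single trailing digit (0, 1, or 2) on a vowel phoneme, as in the output above.

```php
<?php
// Illustrative sketch: split a pronunciation string into phonemes and
// drop the trailing stress digit (0, 1, or 2) from each vowel.
function stripStress(string $pronunciation): array
{
    $phonemes = preg_split('/\s+/', trim($pronunciation), -1, PREG_SPLIT_NO_EMPTY);
    return array_map(
        function (string $p): string {
            return preg_replace('/[0-2]$/', '', $p);
        },
        $phonemes
    );
}

echo implode(' ', stripStress('L AE1 F T ER0')), "\n"; // L AE F T ER
```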

Note: One obvious flaw with this solution is that words missing from the dictionary, non-English words in particular, cannot currently be transliterated. I'm considering writing in some code to fall back to a simple orthographic transliteration for words that aren't in the dictionary. We'll see.

But aside from this caveat, I consider the problem solved. At this point, I was able to move on to defining a mapping between the CMU phoneme set and the Tengwar characters, although this introduced more questions.

Digital Encoding of Tengwar Characters and Font Selection

Just what is the best way to digitally represent these weird Tengwar characters? You may be intuitively thinking something along the lines of, well, can't you just use a special Tengwar font that maps the glyphs we need over the A-Z character set? Back in the 1990s when Tolkien enthusiasts everywhere were discovering the Internet, this actually wasn't a bad solution, and there are still a lot of old sites on the Internet where this is how they do it. There are some substantial problems with this approach, though.

First of all, a true one-to-one mapping between Latin and Tengwar letters is impossible. The Latin alphabet has 26 letters, while the Tengwar alphabet has 36 letters (tengwar) and no fewer than 8 diacritics (tehtar), depending on the writing mode in use. Like the writing systems of Arabic and Hindi, Tengwar typically denotes consonant sounds with characters and adjacent vowel sounds using diacritics over or under them. There are also several digraphs in English such as <th>, <sh>, and <ch> that are represented by only one letter in Tengwar. A simple example is mellon, the password used by the Fellowship to open the Moria gate. Here it is in Tengwar, requiring only three characters to write.
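
To make the consonant-plus-diacritic idea concrete, here is a rough sketch of grouping phonemes into written units. Everything in it is an assumption for illustration: vowels are attached to the following consonant (modes differ on this), 'carrier' is a hypothetical placeholder for a vowel with no consonant after it, the CMU-style vowel list treats ER as a vowel for simplicity, and the phoneme sequence for mellon is my own guess rather than a dictionary lookup.

```php
<?php
// Sketch only: vowels are assumed to attach to the FOLLOWING consonant,
// and 'carrier' is a hypothetical stand-in for a vowel that has no
// consonant after it. Vowel list: CMU vowels with stress digits removed.
const VOWELS = ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'EH', 'ER',
                'EY', 'IH', 'IY', 'OW', 'OY', 'UH', 'UW'];

function groupPhonemes(array $phonemes): array
{
    $units = [];
    $pending = null; // vowel waiting for a consonant to sit on
    foreach ($phonemes as $p) {
        if (in_array($p, VOWELS, true)) {
            if ($pending !== null) {
                // Two vowels in a row: the earlier one gets a carrier.
                $units[] = ['base' => 'carrier', 'vowel' => $pending];
            }
            $pending = $p;
        } else {
            $units[] = ['base' => $p, 'vowel' => $pending];
            $pending = null;
        }
    }
    if ($pending !== null) {
        $units[] = ['base' => 'carrier', 'vowel' => $pending];
    }
    return $units;
}

// Assumed phonemes for "mellon": five sounds collapse into three units.
echo count(groupPhonemes(['M', 'EH', 'L', 'AH', 'N'])), "\n"; // 3
```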

The Unicode character encoding standard helps us deal with this more elegantly by specifying in fairly precise terms how non-Latin characters should be encoded. While the Unicode standard is silent on how to specifically encode the fictional Tengwar alphabet, it does provide what is called the Private Use Area, which is basically a range of code points that are left undefined so that they can be used for very specialized applications.
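
As a small sketch of what working with the Private Use Area looks like in code: the main range in the Basic Multilingual Plane runs from U+E000 through U+F8FF. The function names below are hypothetical, and 0xE000 is used purely as an example value, not as a claim about which Tengwar character lives there.

```php
<?php
// Sketch: the Basic Multilingual Plane's Private Use Area spans
// U+E000 through U+F8FF. Which code point means which tengwa depends
// on the font's mapping; 0xE000 below is only an example value.
function isPrivateUse(int $codePoint): bool
{
    return $codePoint >= 0xE000 && $codePoint <= 0xF8FF;
}

// Encode a code point from the three-byte UTF-8 range (U+0800..U+FFFF).
// Not valid for code points outside that range.
function utf8FromBmp(int $codePoint): string
{
    return chr(0xE0 | ($codePoint >> 12))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

// HTML numeric character reference, handy for Web output.
function htmlEntity(int $codePoint): string
{
    return sprintf('&#x%04X;', $codePoint);
}

var_dump(isPrivateUse(0xE000));          // bool(true)
echo bin2hex(utf8FromBmp(0xE000)), "\n"; // ee8080
echo htmlEntity(0xE000), "\n";           // &#xE000;
```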

The Free Tengwar Project has come up with a Unicode character mapping using the Private Use Area, and this seemed like the most valid approach to me. They provide three different free fonts, each with different pros and cons. They also provide a keyboard layout for typing Tengwar, if you're a huge enough nerd to need one.

Having opted to use the Free Tengwar Project's fonts and encoding rules, all I had to do was create a set of rules for mapping the CMU phonetic pronunciations into Unicode Tengwar. This was not difficult; just a bit tedious. The only slight issue is that there isn't a completely straightforward mapping for all of CMU's vowel phonemes with the available Tengwar diacritics. I largely based my mapping on this phonetic mode for English.
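
The rule set is essentially a lookup table. Here is a minimal sketch of the idea using tengwa names instead of code points. The handful of assignments shown follow one common convention for writing English phonetically and are my assumption, not the actual table used by the tool; the real thing would map each name to the code point given in the Free Tengwar Project's encoding and would also need rules for vowels.

```php
<?php
// Illustrative subset only. The tengwa names follow one common
// assignment for English phonetic writing and are an assumption here,
// not the tool's actual table. Vowels are deliberately left out.
const CONSONANT_TENGWAR = [
    'T'  => 'tinco', 'D' => 'ando',   'P' => 'parma', 'B' => 'umbar',
    'TH' => 'thule', 'F' => 'formen', 'N' => 'numen', 'M' => 'malta',
    'L'  => 'lambe', 'S' => 'silme',
];

function tengwarNames(array $phonemes): array
{
    $names = [];
    foreach ($phonemes as $p) {
        // Unmapped phonemes are kept visible rather than silently dropped.
        $names[] = CONSONANT_TENGWAR[$p] ?? "?$p";
    }
    return $names;
}

echo implode(' ', tengwarNames(['T', 'TH', 'M', 'AE'])), "\n";
// tinco thule malta ?AE
```

Marking unmapped phonemes with a question mark makes gaps in the table easy to spot while the mapping is still being filled in.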

At this point, only one small problem remained.

Displaying Tengwar Characters on the Web

What are the chances of most people having the Free Tengwar Project's fonts installed on their computer? Not good, to say the least. Fortunately, I knew going into this project that it is possible to embed fonts on Web pages. All I had to do was learn how to do it. Like every other step of this project, the Internet provided me with a few alternatives for doing the font embedding, but in the end, the easiest method by far was to use the Font Squirrel Web Font Generator, which, given a font file, produces a .zip archive containing converted font files and an example CSS stylesheet.

Note: A word of caution to anyone else who may be interested in using the Font Squirrel tool to create Tengwar Web fonts: In order for things to work properly, you must use "Expert mode" and change the Subsetting setting to "No Subsetting" or "Custom Subsetting" with the Private Use Area range specified.

I've mostly just tested this technique in Firefox and Chrome, but it seems to work reliably for displaying Tengwar on computers without any of the fonts installed. If you try it out in Internet Explorer (or any browser on Windows) and run into issues, please let me know.

There are still some issues I'm running into with some of these fonts, and I'll continue to play with the code. Specifically, I would love to switch over to the Tengwar Telcontar font, but it seems too finicky for general use.

Conclusion and Next Steps

If you've stuck with me until now, I commend you for probably being a pretty big nerd. I'm releasing what I have so far and calling it alpha software. Here's another link if you're too lazy to scroll back up to the top. I have every intention of continuing to add features. Here are some ideas for future additions:

Probably tweak vowel and diphthong output a bit more
Numeral support
Possibly implement orthographic transliteration for non-dictionary words
"Reverse" transliteration from Tengwar into IPA (I doubt the existence of a reliable way to take phonetic Tengwar back to English)
Transliteration from Tengwar to Latin characters using Sindarin and Quenya modes
    This will require implementation of a "virtual keyboard" for inputting the Tengwar characters, unless anyone has a better idea
Support for other Tolkien alphabets (Cirth, possibly Black Speech/One Ring inscription style)
Possibly a phonetic German mode
Ultimately, transliteration from English into the Common Mode

As always, I have also released my code on GitHub for those of you who are interested.


Tuesday, June 10, 2014
