Tuesday, June 17, 2014

Sour Shchi (Russian Cabbage Soup)

I wasn't planning on a food post for this,
so this is the only photo you get...
What do you do when you have an abundance of homemade lacto-fermented sauerkraut at home? Apparently one option is to make sauerkraut soup. When I learned about shchi, a Russian cabbage soup, and how you can make sour shchi by using sauerkraut in the recipe, I decided to go for it. There really aren't a lot of foods I wouldn't try at least once. Since the end product was surprisingly delicious, with only a subtle sour flavor that melded excellently with the sweet chicken broth, I thought I'd go ahead and share my process for making sour shchi from scratch.

This recipe works well as a template. Change up the vegetables to whatever you think would be good.

Sour Shchi (Russian Cabbage Soup)

Makes 4 servings

INGREDIENTS

Broth

  • 2-3 lbs chicken parts (I prefer wings - an 8 pack works well)
  • Approximately 2 tbsp evaporated milk powder (flour can be substituted here)
  • 2 tbsp vegetable oil
  • 8 cups cold water
  • 1 large yellow onion, peeled and quartered
  • 4 cloves garlic, smashed and peeled
  • 1 bay leaf
  • 2 tsp Kosher salt

Soup

  • 2 large baking potatoes, peeled and chopped
  • 2 cups German-style sauerkraut (or shredded fresh cabbage if you insist)
  • 1 large carrot, peeled and chopped into 1 cm cubes
  • 1 medium onion, peeled and chopped
  • 1 cup cooked chicken, chopped

Garnish

  • Sour cream
  • Fresh dill
  • Chopped green onions

PROCEDURE

Broth

If you don't have some good, homemade chicken stock on hand, here is a method you can use to prepare a tasty broth from scratch relatively quickly (about 45 minutes). While it is not necessary to make your own stock or broth, I find that doing so enhances the quality of homemade soups so much that I rarely consider making soup worthwhile if I don't have time for it. If you want to skip this phase, you can use 64 oz. of store-bought chicken or vegetable broth.

  1. Heat vegetable oil in a medium stainless steel stock pot on medium high-ish heat (I set the dial to the line between medium and medium high heat)
  2. While oil is heating, coat chicken pieces lightly with milk powder
  3. Brown chicken pieces for 5 minutes per side. Do so in two batches if necessary. Don't get freaked out when chicken pieces stick and the skin is blackening in spots.
  4. Remove chicken pieces from stock pot and pour in cold water to deglaze. Scrape and loosen bits of chicken that are still stuck to the stock pot, but don't worry about clearing up the entire surface.
  5. Bring liquid to a simmer and add browned chicken pieces, onion, garlic, bay leaf, and Kosher salt.
  6. Simmer, partially covered, for 20 minutes
  7. Remove chicken pieces from the broth and take the broth off the heat.
  8. Strain broth through a fine sieve and then discard the leftover bits of onion, garlic and bay leaf.
  9. Let cooked chicken cool before removing skin and chopping up 1 cup of the meat for the soup. The rest you can save for chicken salad or whatever else you want to do with it. When I use an 8-pack of wings, my rule of thumb is to use the meat from 4 of them for the soup and save the rest for future use.

Soup

If you made your own broth, you may wish to wash your stock pot to eliminate any lingering bits of chicken that are still stuck to the bottom.

  1. Return chicken broth to the stock pot and heat back up to a simmer
  2. Add potatoes and sauerkraut to the pot and simmer for 20 minutes, or until potatoes are cooked
  3. About 7 minutes before the potatoes are due to finish (roughly 13 minutes in), add carrot and onion to the pot
  4. Add chopped chicken and let it reheat for 30 seconds, and immediately dish soup into bowls
  5. Garnish each bowl with a dollop of sour cream, green onions, and fresh dill

Tuesday, June 10, 2014

Phonetic Transliteration of English into Tengwar

One of the things I love about The Lord of the Rings is that J. R. R. Tolkien originally created the setting of Middle-earth as a place where the characters could speak the various invented languages that he spent most of his life working on. Tolkien's languages have interesting grammatical properties, but the focus of this post is merely on one of the alphabets that Tolkien devised for writing his languages. Specifically, I'd like to share the experience I've had so far working on a Web-based transliterator for automatically converting English text from Latin letters to Tolkien's Tengwar alphabet.

If you couldn't care less about all the technical details and just want to mess around with the tool (and help me find bugs), you can go check it out here.

If you're a real nerd, read on...

Some Challenges of Creating a Tengwar Transliterator

Note: If you happen to know a thing or two about writing English in Tengwar, then I should clarify up front that what I have done so far is implement a phonetic transliterator. The so-called Common Mode is my ultimate goal, but as it introduces some additional difficult challenges, I haven't gotten there yet. I can't find an example on the Internet of anyone actually having pulled this off, so it may be a bit of a lofty goal.

Having essentially dived into this project on a whim, I ended up running into more issues than I initially anticipated. They break down into the following problem areas, each of which will be touched on subsequently.
  1. Orthographic differences between the Latin and Tengwar alphabets
  2. Programmatically determining pronunciations of English words
  3. Digital encoding of Tengwar characters and font selection
  4. Displaying Tengwar characters on the Web and browser compatibility issues
To implement reverse transliteration of Tengwar back into Latin text, the following additional problem areas emerge:
  5. Keyboard (or other) input of Tengwar characters
  6. A less easily overcome version of problem 1
Four annoying problems seemed like enough, so I haven't implemented reverse transliteration yet, although I intend to in the future. I've been enjoying pondering how to deal with problems 5 and 6.

Orthographic Differences Between the Latin and Tengwar Alphabets

I'm not going to provide anything in the way of a tutorial for how to read and write Tengwar. If you're interested in that, here are some useful links:

Tengwar and Latin orthography are very different when it comes to writing English, whose spelling is a notoriously unreliable guide to pronunciation. As one of countless examples, consider words like 'daughter', 'laughter', and 'aghast'. In each of these words we encounter a <gh> digraph with a different pronunciation. There are historical and etymological explanations for irregularities such as this, but the bottom line is that they create headaches for the aspiring Latin-to-Tengwar transcriber.

The primary cause of these headaches is that Tengwar is a phonetic alphabet. Its characters are organized into four series, or témar, and six grades, or tyeller, which represent the place and manner of articulation for the sound represented by each character. In less technical terms, that means that similar-looking Tengwar characters tend to represent similar phonetic sounds, which is pretty cool.
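
To make the grid idea a little more concrete, here is a rough sketch of the primary letters laid out as a PHP array, one row per grade and one column per series. The letter and series names are the ones I know from Appendix E of The Lord of the Rings, and the locate_tengwa() helper is purely hypothetical, so treat this as an illustration and not as part of the transliterator.

```php
<?php
// Column labels: the four series (témar).
$series = array('tincotéma', 'parmatéma', 'calmatéma', 'quessetéma');

// Rows: the six grades (tyeller). Names as given in Appendix E (assumption).
$grid = array(
  1 => array('tinco', 'parma', 'calma', 'quesse'),
  2 => array('ando', 'umbar', 'anga', 'ungwe'),
  3 => array('thúle', 'formen', 'harma', 'hwesta'),
  4 => array('anto', 'ampa', 'anca', 'unque'),
  5 => array('númen', 'malta', 'noldo', 'nwalme'),
  6 => array('óre', 'vala', 'anna', 'vilya'),
);

// Hypothetical helper: report the grade and series of a letter, or null.
function locate_tengwa(array $grid, array $series, $name) {
  foreach ($grid as $grade => $row) {
    $column = array_search($name, $row, true);
    if ($column !== false) {
      return array('grade' => $grade, 'series' => $series[$column]);
    }
  }
  return null;
}

$pos = locate_tengwa($grid, $series, 'parma');
echo "parma: grade {$pos['grade']}, {$pos['series']}\n";
```

Letters that share a row or a column are the ones that look alike and sound alike, which is exactly the property that makes a phonetic mapping attractive.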

The simplest solution to this orthographic discrepancy was to aim for a phonetic transliteration for now. That means the computer needs to take these unpredictably spelled English words and convert them to phonetic representations in Tengwar. This led me to my next problem.

Programmatically Determining Pronunciations of English Words

How can a computer tell how an English word is pronounced? I wasn't quite sure at first. Very similar words like 'laughter' and 'daughter' with different pronunciations caused me to seriously doubt the efficacy of trying to determine a word's pronunciation by algorithm. Instead, I assumed there must be some kind of database or Web service that I could use.

While assessing my options, I discovered The CMU Pronouncing Dictionary from Carnegie Mellon University. It is no surprise that CMU came to my aid, as they are a research leader in text/speech analysis. The CMU Pronouncing Dictionary contains North American English pronunciations for 125,000 words (including common proper nouns), and it is machine-readable. This was my missing link.

I wrote a thin object-oriented wrapper around this dictionary file in PHP which allowed me to access pronunciation information in my programs. Here is a succinct usage example:

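<?php
// Pronunciation is my wrapper class around the CMU dictionary file.
$words = array('daughter', 'laughter', 'aghast');

foreach ($words as $word) {
  // Look the word up and expose its phonemes via the pronunciation property.
  $p = new Pronunciation($word);
  echo "$word is pronounced / $p->pronunciation /\n";
}
?>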

The above outputs:

daughter is pronounced / D AO1 T ER0 /
laughter is pronounced / L AE1 F T ER0 /
aghast is pronounced / AH0 G AE1 S T /

As you can see, pronunciations are represented as a space-delimited sequence of phonemes. The CMU dictionary defines 39 phonemes for North American English. The exact number of phonemes in the English language is dialect-dependent, but I had to pick something and go with it. The numbers at the end of vowel phonemes represent the relative amount of stress that is placed on each syllable. For my purposes, I disregard the numbers, because Tengwar does not denote stress.
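
For the curious, here is a minimal sketch of how one dictionary line could be split into a word and its phonemes, and how the stress digits could be dropped. It is not the code inside my wrapper class. The function names are made up for illustration, and the line layout (a word, whitespace, then the phonemes, with ";;;" marking comments and "(1)" marking alternate pronunciations) is an assumption about the dictionary file that you should check against the copy you download.

```php
<?php
// Split one dictionary line into a lowercase word and a list of phonemes.
// Returns null for blank lines and for comment lines starting with ";;;".
function parse_dict_line($line) {
  $line = trim($line);
  if ($line === '' || strpos($line, ';;;') === 0) {
    return null;
  }
  $parts = preg_split('/\s+/', $line);
  $word = strtolower(array_shift($parts));
  // Alternate pronunciations look like WORD(1); drop the marker.
  $word = preg_replace('/\(\d+\)$/', '', $word);
  return array('word' => $word, 'phonemes' => $parts);
}

// Remove the trailing stress digit (0, 1 or 2) from each vowel phoneme.
function strip_stress(array $phonemes) {
  return array_map(function ($p) {
    return rtrim($p, '012');
  }, $phonemes);
}

$entry = parse_dict_line('LAUGHTER  L AE1 F T ER0');
echo $entry['word'] . ': ' . implode(' ', strip_stress($entry['phonemes'])) . "\n";
```

Consonant phonemes carry no digit, so strip_stress() leaves them untouched.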

Note: One obvious flaw with this solution is that non-English words cannot currently be transliterated. I'm considering writing in some code to fall back to a simple orthographic transliteration for words that aren't in the dictionary. We'll see.

But aside from this caveat, I consider the problem solved. At this point, I was able to move on to defining a mapping between the CMU phoneme set and the Tengwar characters, although this introduced more questions.

Digital Encoding of Tengwar Characters and Font Selection

Just what is the best way to digitally represent these weird Tengwar characters? You may be intuitively thinking something along the lines of, well, can't you just use a special Tengwar font that maps the glyphs we need over the A-Z character set? Back in the 1990s when Tolkien enthusiasts everywhere were discovering the Internet, this actually wasn't a bad solution, and there are still a lot of old sites on the Internet where this is how they do it. There are some substantial problems with this approach, though.

First of all, a true one-to-one mapping between Latin and Tengwar letters is impossible. The Latin alphabet has 26 letters, while the Tengwar alphabet has 36 letters (tengwar) and no fewer than 8 diacritics (tehtar), depending on the writing mode in use. Like the writing systems of Arabic and Hindi, Tengwar typically denotes consonant sounds with characters and adjacent vowel sounds using diacritics over or under them. There are also several digraphs in English such as <th>, <sh>, and <ch> that are represented by only one letter in Tengwar. A simple example is mellon, the password used by the Fellowship to open the Moria gate. Here it is in Tengwar, requiring three characters to write.

The Unicode character encoding standard helps us deal with this more elegantly by specifying how non-Latin characters should be encoded in fairly precise terms. While the Unicode standard is silent on how to specifically encode the fictional Tengwar alphabet, it does provide what is called the Private Use Area, which is basically a range of characters that are left undefined so that they can be used for very specialized applications.
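
If you want to see what "a range of characters" means in practice, the following sketch checks whether a code point falls in one of Unicode's private use ranges. The function name is hypothetical. The ranges themselves come from the Unicode standard: one in the Basic Multilingual Plane and two supplementary ones.

```php
<?php
// True if the code point lies in one of Unicode's three private use ranges.
function is_private_use($cp) {
  return ($cp >= 0xE000 && $cp <= 0xF8FF)       // Private Use Area (BMP)
      || ($cp >= 0xF0000 && $cp <= 0xFFFFD)     // Supplementary Private Use Area-A
      || ($cp >= 0x100000 && $cp <= 0x10FFFD);  // Supplementary Private Use Area-B
}

var_dump(is_private_use(0xE000)); // first code point of the BMP range
var_dump(is_private_use(0x0041)); // Latin capital letter A
```

Nothing stops two projects from assigning different meanings to the same private use code point, which is why text encoded this way only makes sense alongside a font that follows the same convention.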

The Free Tengwar Project has come up with a Unicode character mapping using the Private Use Area, and this seemed like the most valid approach to me. They provide three different free fonts, each with different pros and cons. They also provide a keyboard layout for typing Tengwar, if you're a huge enough nerd to need one.

Having opted to use the Free Tengwar Project's fonts and encoding rules, all I had to do was create a set of rules for mapping the CMU phonetic pronunciations into Unicode Tengwar. This was not difficult; just a bit tedious. The only slight issue is that there isn't a completely straightforward mapping for all of CMU's vowel phonemes with the available Tengwar diacritics. I largely based my mapping on this phonetic mode for English.
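
The general shape of such a rule set can be sketched as a lookup table from phoneme to code point. Everything in the example below is illustrative: the function names are invented, the table covers only a handful of consonants, and the hex values are placeholders picked from the Private Use Area, not the Free Tengwar Project's actual assignments. Vowels, which would have to become diacritics attached to a neighboring letter, are left out entirely.

```php
<?php
// Placeholder phoneme-to-code-point table (NOT the real Free Tengwar mapping).
$map = array(
  'T'  => 0xE000,
  'P'  => 0xE001,
  'CH' => 0xE002,
  'K'  => 0xE003,
  'D'  => 0xE004,
);

// Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
function utf8_bmp($cp) {
  return chr(0xE0 | ($cp >> 12))
       . chr(0x80 | (($cp >> 6) & 0x3F))
       . chr(0x80 | ($cp & 0x3F));
}

// Turn a list of CMU phonemes into a UTF-8 string using the table above.
function phonemes_to_text(array $phonemes, array $map) {
  $out = '';
  foreach ($phonemes as $p) {
    $p = rtrim($p, '012'); // ignore stress digits
    if (!isset($map[$p])) {
      throw new InvalidArgumentException("No mapping for phoneme $p");
    }
    $out .= utf8_bmp($map[$p]);
  }
  return $out;
}

$text = phonemes_to_text(array('CH', 'T'), $map);
echo bin2hex($text) . "\n"; // show the raw UTF-8 bytes
```

Note that <ch>, a two-letter digraph in Latin script, arrives from the dictionary as the single phoneme CH and leaves as a single character, which is the whole point of going through pronunciations first.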

At this point, only one small problem remained.

Displaying Tengwar Characters on the Web

What are the chances of most people having the Free Tengwar Project's fonts installed on their computer? Not good, to say the least. Fortunately, I knew going into this project that it is possible to embed fonts on Web pages. All I had to do was learn how to do it. Like every other step of this project, the Internet provided me with a few alternatives for doing the font embedding, but in the end, the easiest method by far was to use the Font Squirrel Web Font Generator, which, given a font file, produces a .zip archive containing converted font files and an example CSS stylesheet.

Note: A word of caution to anyone else who may be interested in using the Font Squirrel tool to create Tengwar Web fonts: In order for things to work properly, you must use "Expert mode" and change the Subsetting setting to "No Subsetting" or "Custom Subsetting" with the Private Use Area range specified.

I've mostly just tested this technique in Firefox and Chrome, but it seems to work reliably for displaying Tengwar on computers without any of the fonts installed. If you try it out in Internet Explorer (or any browser on Windows) and run into issues, please let me know.

There are still some issues I'm running into with some of these fonts, and I'll continue to play with the code. Specifically, I would love to switch over to the Tengwar Telcontar font, but it seems too finicky for general use.

Conclusion and Next Steps

If you've stuck with me until now, I commend you for probably being a pretty big nerd. I'm releasing what I have so far and calling it alpha software. Here's another link if you're too lazy to scroll back up to the top. I have every intention of continuing to add features. Here are some ideas for future additions:
  • Probably tweak vowel and diphthong output a bit more
  • Numeral support
  • Possibly implement orthographic transliteration for non-dictionary words
  • "Reverse" transliteration from Tengwar into IPA (I doubt the existence of a reliable way to take phonetic Tengwar back to English)
  • Transliteration from Tengwar to Latin characters using Sindarin and Quenya modes
    • This will require implementation of a "virtual keyboard" for inputting the Tengwar characters, unless anyone has a better idea
  • Support for other Tolkien alphabets (Cirth, possibly Black Speech/One Ring inscription style)
  • Possibly a phonetic German mode
  • Ultimately, transliteration from English into the Common Mode
As always, I have also released my code on GitHub for those of you who are interested.