August 19th, 2004


Thoughts on translation for the Internet

This is an idea I have been thinking about for a while now. My summary is based on a reply I made to a recent post by Larry Lessig on how the language-based balkanization of the Internet can be resolved.

I've actually given this problem a lot of thought in the past, and I think I have a solution that, although brute force, is arguably the best one available.

Machine translation is, generally speaking, proprietary and woefully inadequate. Approximating language is an inadequate solution for where we should all want the Internet -- and human knowledge of each other -- to go in the future. What I think would be ideal is if there were a major open source translation project, which uses applications designed to leverage the power of many thousands of users in order to create a massive translation database.

The application would need to start from some sort of baseline of usefulness for those who would use it, but it also needs to learn from and be taught by its users. Think of it as something like CDDB -- it would be an application designed to create an enormous database, in this case, not of correct, user-verified album information, but of language itself... a database that could never be built by one person alone. It would make us both the students and the teachers, and our whole world would be all the richer for it.

So, why would people offer up their assistance to build this database? Well, imagine if, when you registered with the program/site, you filled out your language proficiencies. You might rate yourself a 10 (native) in English and a 7 in Spanish. You want to improve your Spanish skills, however, so the application could do several things to help you, such as allow you to exchange correspondence with others who have better skills in Spanish than you do -- ideally native speakers who want to improve their English skills. You could write in Spanish, they could write in English. You correspond via the application, and when you encounter a sentence that doesn't make sense as written, you can either correct the text or refer it to one or more suitably skilled users to fix. Alternatively, you can approve the text. All this information could be added to the database, thereby making the program learn.
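To make the matching idea concrete, here is a minimal sketch of how the application might pair a learner with correspondents. Everything in it is an assumption for illustration: the `User` class, the `find_partners` function, the `min_gap` parameter, and the two-letter language codes are hypothetical names, and the only rule taken from the text above is the 1-to-10 self-rating.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    # Self-rated proficiency per language, 1 (beginner) to 10 (native).
    skills: dict = field(default_factory=dict)


def find_partners(learner, target, offer, users, min_gap=1):
    """Find users who are stronger than `learner` in `target`
    and weaker than `learner` in `offer`, so both sides benefit."""
    my_target = learner.skills.get(target, 0)
    my_offer = learner.skills.get(offer, 0)
    matches = []
    for other in users:
        if other is learner:
            continue
        their_target = other.skills.get(target, 0)
        their_offer = other.skills.get(offer, 0)
        helps_me = their_target - my_target >= min_gap
        i_help_them = my_offer - their_offer >= min_gap
        if helps_me and i_help_them:
            matches.append(other)
    # Strongest speakers of the target language come first.
    matches.sort(key=lambda u: u.skills.get(target, 0), reverse=True)
    return matches


alice = User("alice", {"en": 10, "es": 7})
bob = User("bob", {"es": 10, "en": 6})
carol = User("carol", {"es": 8, "en": 10})
partners = find_partners(alice, "es", "en", [alice, bob, carol])
```

In this example only bob qualifies: carol's Spanish is better than alice's, but her English is already native, so alice has nothing to offer her in return.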

Likewise, you could use the program in a "solitaire mode", where you could, for instance, learn vocabulary, possibly with words in a pictogram-flashcard kind of way. (Audio could be added to the project, too. The goal, as always, is for users to teach and add to the system, in their own words.) If you are moderately skilled, you could also be given sentences to translate in order to improve your language skills -- these sentences could be ones that other users requested translations for. The translations would then be sent on to the people in question, in order to improve their ability to properly read and translate the language. If all flagged blocks of text are used up, the application could even pull text in that language off of the internet and offer it up for translation too.

Now, this is really just one example of a piece of software that could be part of the same project. The data collected is the goal, and it could be used in many, many different ways with different applications. Many people might experience the project by using the software to translate a website -- such a task could be triggered with a bookmarklet or plugin in someone's browser, for instance. Even that information, however, can be returned in a format where people can flag or correct bad elements within the translation, thereby increasing the application's knowledge. Likewise, you could use such a system to power many other applications and widgets. I, for one, would love to be able to convert foreign websites into English RSS feeds on the fly.

One way you could further improve both translations and the educational aspect of such an application would be to not only have a somewhat arbitrary numerical rating on language skills, but also have a kind of computer reputation system, where people are prompted to review other people's translations and rate their quality. Good translators would earn higher grades from the computer. Technically, the software could even be used to evaluate student proficiency and learning in languages, or evaluate the language skills of those seeking professional work.
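As a rough illustration of how such a reputation system could work, the sketch below weights each review by the reviewer's own reputation and then nudges the translator's grade toward the result. The function names, the moving-average update, and the `rate` parameter are my own assumptions, not a worked-out design.

```python
def weighted_quality(reviews):
    """Reputation-weighted mean of review scores.

    `reviews` is a list of (reviewer_reputation, score) pairs.
    Returns None when there is nothing to average.
    """
    total_weight = sum(rep for rep, _ in reviews)
    if total_weight <= 0:
        return None
    return sum(rep * score for rep, score in reviews) / total_weight


def update_reputation(current, quality, rate=0.1):
    """Move a translator's reputation toward the quality of their
    latest translation (a simple exponential moving average)."""
    return (1 - rate) * current + rate * quality


# A trusted reviewer (reputation 3) counts three times as much
# as a newcomer (reputation 1).
quality = weighted_quality([(1, 2.0), (3, 4.0)])
new_reputation = update_reputation(5.0, 10.0, rate=0.5)
```

The moving average means a single bad (or lucky) translation shifts the grade only a little, while a consistent record moves it steadily.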

Obviously, some consideration must be given to funding such a massive undertaking. Such a system may require the financial assistance of its users to survive and grow, so a reasonable cost could be charged for people to use the system. However, this expense could be waived -- in part or entirely -- if a person helps "teach" the system or volunteers on some level. In addition, grants could be sought from governments, universities, individuals, businesses, organizations, etc.

Now, my concern on all this is technical. How big of a database would something like this be, and would it run fast enough on the web? Would it be centralized, or distributed? Should it be a desktop app, or should it be on the web... or both?! Is there, from a technical perspective, a "sweet spot" that balances translation quality and translation speed, or is it better to create translations which are as accurate as possible, based on future expectations of speed improvements? Does creating a huge translation database slow a search for a proper translation, or will blocks of text be more easily translated than through mechanical translation methods, as identical blocks of text had already been translated in the past? Could the wealth of text on the web or available through search engines with open APIs be of value to an application that translates text? For instance, if you were to translate "I love to plant flowers." into Spanish, would a Google search of the sentence or its fragments hint towards a preferred or alternate translation? Could it suggest these alternative translations when it's not so certain how to translate a block of text, so that readers could choose the most appropriate one and help the database learn as it goes?
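One of these questions, whether previously translated blocks could simply be reused and alternatives offered when the match is uncertain, can at least be sketched. The toy below is an assumption-laden illustration rather than an answer: the `TranslationMemory` class and its methods are hypothetical, it holds everything in memory, and it leans on Python's standard `difflib` for fuzzy matching, which would not scale to a database of the size I am imagining.

```python
import difflib
from collections import defaultdict


class TranslationMemory:
    """Toy store of user-approved translations for one language pair."""

    def __init__(self):
        # normalized source sentence -> {translation: approval count}
        self._entries = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _norm(text):
        # Lowercase and collapse whitespace so trivial variants match.
        return " ".join(text.lower().split())

    def approve(self, source, translation):
        """Record that a user approved `translation` for `source`."""
        self._entries[self._norm(source)][translation] += 1

    def lookup(self, source, cutoff=0.8):
        """Return candidate translations, most-approved first.

        Tries an exact match, then the closest stored sentence whose
        similarity is at least `cutoff`. Returns [] if nothing fits.
        """
        key = self._norm(source)
        if key not in self._entries:
            close = difflib.get_close_matches(
                key, list(self._entries), n=1, cutoff=cutoff
            )
            if not close:
                return []
            key = close[0]
        candidates = self._entries[key]
        return sorted(candidates, key=lambda t: (-candidates[t], t))


tm = TranslationMemory()
tm.approve("I love to plant flowers.", "Me encanta plantar flores.")
tm.approve("I love to plant flowers.", "Me encanta plantar flores.")
tm.approve("I love to plant flowers.", "Amo plantar flores.")
suggestions = tm.lookup("I love to plant flower.")  # near match, not exact
```

Returning every candidate, ranked by approvals, is what would let a reader pick the most appropriate one and feed that choice back in through `approve`.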

I don't know the answers to these questions, but the technical constraints must be lessening every year. I would love to find others who want to make this idea -- or an even better solution -- a reality, and I think it should be a major goal for the open source movement to bring about a serious, united initiative for translation. Frankly, commercial software initiatives are poorly suited for this task, but I believe that open source is up to it.

Najaf, negotiations, and no end in sight.

First things first... the reason that Najaf is being attacked right now is that the Iraqi interim government refused to honor the peace terms that Sadr accepted from the Iraqi National Conference. The interim government ignored the conference -- whose sole purpose, incidentally, is to have oversight and veto power over the interim government -- and issued numerous new conditions on top of the conference's peace plan. They also waived the previously agreed-upon promise of amnesty. Listen to this NPR story for full details.

And so, numerous U.S. soldiers and hundreds of Iraqis on both sides are going to die. Earlier today, the Sadrists got revenge on their previous "dueling partners" the Najaf police, in a mortar attack that killed 8 of them and wounded another 31. Scratch one police force. That they successfully executed this attack while "surrounded" and under attack from U.S. forces is pretty impressive. After the attack, Iraqi police raided a local hotel where foreign journalists were staying, claiming they suspected some of the reporters helped the attackers locate the police station. Yes, more angry abuse of the media in Najaf... as if they're responsible for this clusterfuck.

As for negotiations, this site on the subject of surrender is interesting, although given the current predicament in Najaf, it seems like a relic of the "good old days", if such days ever applied to war.

I particularly appreciated this bit:
"The popular impression that, for example, a besieger may summon a city or fortress to surrender and declare that no quarter will be given if it is taken by storm, is quite wrong and reflects the comparative savagery of earlier days, especially of the religious wars from the Crusades through the Thirty Years' War, as did the former rules that quarter could be refused to a weak garrison that obstinately and pointlessly persisted in defending a fortified place..."

Comparative savagery of the Crusades, eh?! I wonder what Allawi would think of that? He's not viewed very favorably anywhere right now, even in Great Britain, where Tony Blair has dropped his plans to invite him.

Contrary to what the Iraqi defense minister says, if Sadr wants to negotiate terms, he should be able to do so by sending someone out under a white flag, *any time he wants*. Unconditional surrender may be demanded, but that is ultimately what the conditional government is offering anyway.

"Surrender . . . may be unconditional . . . or upon terms . . . A surrender upon terms naturally follows upon negotiations, customarily initiated by sending out a party under a white flag."

Whether Sadr will be granted that right -- or will even need it -- remains to be seen. Alex Berenson of The New York Times recently visited Sadr's forces and says that "the mood in the shrine is not one of resignation... morale is quite high amongst Sadr's fighters . . . who have a *VERY* good defensive position . . . The Iraqi Defense Minister can talk all he likes about how it will be Iraqi forces that fight their way to the shrine but from a practical point of view, that's simply impossible."

And so, people die while we all wait. It might be a long wait too.

Bush milks Iraqi soccer team with Olympic ad...

...but the Iraqi soccer team -- united against the ad -- has a few choice words for Bush.

"How will he meet his god having slaughtered so many men and women? He has committed so many crimes." - Ahmed Manajid, midfielder

"My problems are not with the American people. They are with what America has done in Iraq: destroy everything. The American army has killed so many people in Iraq. What is freedom when I go to the stadium and there are shootings on the road? . . . The war is not secure. Many people hate America now. The Americans have lost many people around the world--and that is what is happening in America also." - Adnan Hamad, soccer coach

"I want the violence and the war to go away from [Najaf] . . . We don't wish for the presence of Americans in our country. We want them to go away." -Salih Sadir, scorer of the winning goal against Portugal.

"I want to defend my home. If a stranger invades America and the people resist, does that mean they are terrorists? Everyone [in Fallujah] has been labeled a terrorist. These are all lies. Fallujah people are some of the best people in Iraq." - Ahmed Manajid, midfielder