Insomnia (insomnia) wrote,

Thoughts on translation for the Internet

This is an idea I have been thinking of for a while now. My summary is based on a reply I made to a recent post by Larry Lessig on how the language-based balkanization of the Internet can be resolved.
-------------------------------------

I've actually given this problem a lot of thought in the past, and I think I have a solution that, although brute force, is arguably the best one available.

Machine translation is, generally speaking, proprietary and woefully inadequate. Approximating language is an inadequate solution for where we should all want the Internet -- and human knowledge of each other -- to go in the future. What I think would be ideal is if there were a major open source translation project, which uses applications designed to leverage the power of many thousands of users in order to create a massive translation database.

The application would need to start from some sort of baseline of usefulness for those who would use it, but it also needs to learn from and be taught by its users. Think of it from the perspective of something like CDDB -- it would be an application designed to create an enormous database, in this case, not of correct, user-verified album information, but of language itself... a database that could never be built by one person alone. It would make us both the students and the teachers, and our whole world would be all the richer for it.

So, why would people offer up their assistance to build this database? Well, imagine if, when you registered with the program/site, you filled out your language proficiencies. You might rate yourself a 10 (native) in English and a 7 in Spanish. You want to improve your Spanish skills, however, so the application could do several things to help you, such as allow you to exchange correspondence with others who have better skills in Spanish than you do -- ideally native speakers who want to improve their English skills. You could write in Spanish, they could write in English. You correspond via the application, and when you encounter a sentence that doesn't make sense as written, you can either correct the text or refer it to one or more suitably skilled people to fix. Alternately, you can approve the text. All this information could be added to the database, thereby making the program learn.
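To make the matching step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `User` class, the `find_partners` function, the 1-10 self-rating scale stored per language code, and the sample users are hypothetical, not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    # Hypothetical self-rated proficiency per language, 1-10 (10 = native).
    skills: dict[str, int] = field(default_factory=dict)


def find_partners(learner, target, offered, users):
    """Find correspondence partners for `learner`.

    A partner must be stronger than the learner in `target` (the language
    the learner wants to improve) and weaker than the learner in `offered`
    (the language the learner can help with). Strongest speakers of the
    target language come first.
    """
    mine_target = learner.skills.get(target, 0)
    mine_offered = learner.skills.get(offered, 0)
    matches = [
        u for u in users
        if u is not learner
        and u.skills.get(target, 0) > mine_target
        and u.skills.get(offered, 0) < mine_offered
    ]
    return sorted(matches, key=lambda u: u.skills.get(target, 0), reverse=True)


# Sample data, invented for this example.
you = User("you", {"en": 10, "es": 7})
ana = User("ana", {"es": 10, "en": 6})   # native Spanish, learning English
bob = User("bob", {"es": 8, "en": 10})   # already native in English

partners = find_partners(you, "es", "en", [you, ana, bob])
print([u.name for u in partners])  # prints ['ana']
```

Here bob is skipped because he has nothing to gain from an English exchange, which is the "both students and teachers" idea: each side gets something out of the correspondence.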

Likewise, you could use the program in a "solitaire mode", where you could, for instance, learn vocabulary, possibly with words in a pictogram-flashcard kind of way. (Audio could also be added to the project. The goal, as always, is for users to teach and add to the system, in their own words.) If you are moderately skilled, you could also be given sentences to translate in order to improve your language skills -- these sentences could be ones that other users requested translations for. The translations would then be sent on to the people in question, in order to improve their ability to properly read and translate the language. If all flagged blocks of text are used up, the application could even pull text off of the internet in that language and offer it up for translation too.

Now, this is really just one example of a piece of software that could be part of the same project. The data collected is the goal, and it could be used in many, many different ways with different applications. Many people might experience the project by using the software to translate a website -- such a task could be triggered with a bookmarklet or plugin in someone's browser, for instance. Even that information, however, can be returned in a format where people can flag or correct bad elements within the translation, thereby increasing the application's knowledge. Likewise, you could use such a system to power many other applications and widgets. I, for one, would love to be able to convert foreign websites into English RSS feeds on the fly.

One way you could further improve both translations and the educational aspect of such an application would be to not only have a somewhat arbitrary numerical rating on language skills, but also have a kind of computer reputation system, where people are prompted to review other people's translations and rate their quality. Good translators would earn higher grades from the computer. Technically, the software could even be used to evaluate student proficiency and learning in languages, or evaluate the language skills of those seeking professional work.
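One possible way to combine peer reviews with reputation, sketched in Python: each rating counts in proportion to the reviewer's own reputation, so a trusted reviewer moves a translation's score more than a newcomer does. The function name `weighted_scores`, the 1-5 rating scale, and the default weight of 1.0 for unknown reviewers are all assumptions made for this sketch.

```python
from collections import defaultdict


def weighted_scores(reviews, reputation, default_rep=1.0):
    """Average the ratings for each translation, weighted by reviewer reputation.

    reviews:    iterable of (reviewer, translation_id, rating) tuples,
                with rating on an assumed 1-5 scale.
    reputation: mapping of reviewer -> positive weight; reviewers missing
                from the mapping get `default_rep`.
    """
    totals = defaultdict(float)
    weights = defaultdict(float)
    for reviewer, translation_id, rating in reviews:
        w = reputation.get(reviewer, default_rep)
        totals[translation_id] += w * rating
        weights[translation_id] += w
    return {
        tid: totals[tid] / weights[tid]
        for tid in totals
        if weights[tid] > 0
    }


# Sample data, invented for this example.
reviews = [("alice", "t1", 5), ("bob", "t1", 1), ("bob", "t2", 4)]
reputation = {"alice": 3.0, "bob": 1.0}
print(weighted_scores(reviews, reputation))  # prints {'t1': 4.0, 't2': 4.0}
```

A translator's own grade could then be derived from the scores of their translations, which in turn would feed back into how much their reviews count. That feedback loop is left out here to keep the sketch short.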

Obviously, some consideration must be given to funding such a massive undertaking. Such a system may require the financial assistance of its users to survive and grow, so a reasonable cost could be charged for people to use the system. However, this expense could be waived -- in part or entirely -- if a person helps "teach" the system or volunteers on some level. In addition, grants could be sought from governments, universities, individuals, businesses, organizations, etc.

Now, my concern on all this is technical. How big of a database would something like this be, and would it run fast enough on the web? Would it be centralized, or distributed? Should it be a desktop app, or should it be on the web... or both?! Is there, from a technical perspective, a "sweet spot" that balances translation quality and translation speed, or is it better to create translations which are as accurate as possible, based on future expectations of speed improvements? Does creating a huge translation database slow a search for a proper translation, or will blocks of text be more easily translated than through mechanical translation methods, as identical blocks of text had already been translated in the past? Could the wealth of text on the web or available through search engines with open APIs be of value to an application that translates text? For instance, if you were to translate "I love to plant flowers." into Spanish, would a Google search of the sentence or its fragments hint towards a preferred or alternate translation? Could it suggest these alternative translations when it's not so certain how to translate a block of text, so that readers could choose the most appropriate one and help the database learn as it goes?
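On the question of whether previously translated blocks could simply be reused, here is a rough Python sketch of an exact-match lookup with vote-ranked alternatives. The `TranslationMemory` class, its methods, and the sample Spanish sentences are hypothetical and only illustrate the idea. It uses an in-memory hash table, so finding an exact match takes roughly the same time however many entries are stored. A real system would also need fuzzy matching and persistent storage, which this sketch leaves out.

```python
from collections import Counter, defaultdict


class TranslationMemory:
    """Stores user-approved translations and counts how often each was chosen."""

    def __init__(self):
        # (src, dst, normalized text) -> Counter of translation -> votes
        self._entries = defaultdict(Counter)

    @staticmethod
    def _key(text, src, dst):
        # Normalize case and whitespace so trivially different inputs match.
        return (src, dst, " ".join(text.lower().split()))

    def add(self, text, src, dst, translation):
        """Record one user's vote for `translation` of `text`."""
        self._entries[self._key(text, src, dst)][translation] += 1

    def lookup(self, text, src, dst, limit=3):
        """Return up to `limit` known translations, most-voted first.

        An empty list means the block has not been translated before and
        would have to go to another method or to human volunteers.
        """
        key = self._key(text, src, dst)
        if key not in self._entries:
            return []
        return [t for t, _ in self._entries[key].most_common(limit)]


# Sample data, invented for this example.
tm = TranslationMemory()
tm.add("I love to plant flowers.", "en", "es", "Me encanta plantar flores.")
tm.add("I love to plant flowers.", "en", "es", "Amo plantar flores.")
tm.add("I love to plant flowers.", "en", "es", "Me encanta plantar flores.")

for candidate in tm.lookup("i love  to plant flowers.", "en", "es"):
    print(candidate)
```

When more than one candidate comes back, the application could show them all and let the reader pick the most appropriate one, recording that choice with another call to `add`.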

I don't know the answers to these questions, but the technical constraints must be lessening every year. I would love to find others who want to make this idea -- or an even better solution -- a reality, and I think it should be a major goal for the open source movement to bring about a serious, united initiative for translation. Frankly, commercial software initiatives are poorly suited for this task, but I believe that open source is up to it.