I made a natural language classifier. Check it out:

Unamundita has the 57 languages that have a good amount of training data.

Unamunda has many more languages, some of which have very little training data. It is no longer supported.

Check out the results of training.

Presentation (ppt) (odp)

Source code

What is Unamunda?

Unamunda is an applet that, given a short text sample, will try to determine what language that text is written in. For instance, given the phrase "Yo soy Miguel y voy a la biblioteca", Unamunda will indicate that it is Spanish.

How do I use Unamunda?

Check some boxes, enter some text, and then click "Go!"

But wait, there's more interface than that!

True. The "check all" button is self-explanatory. The controls under the text area are for fiddling with internal parameters, and they're not for novices.

I'm not a novice! Tell me how it works.

The "Universal" and "Specific" radio buttons control which miss rate is used. See the slideshow for more information.

And the other thing?

Ahh, the weights. You might want to read how Unamunda works first, because this won't make much sense otherwise. Traditionally the metric weights all N-grams equally, but there's no compelling reason for this. In principle 2-grams could be twice as important as 4-grams, and 1-grams could be entirely irrelevant. The weights field lets you control this. Enter exactly four digits (leading zeros count), and then hit "enter" ("return"). This changes the metric: 1-grams are weighted by the first digit, 2-grams by the second, 3-grams by the third, and 4-grams by the fourth. So for the example given above, with 3-grams left at 1, you'd enter "0211".
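
If it helps to see the weights idea as code, here is a minimal Java sketch. It is not Unamunda's actual source: the class name, the method names, and the idea of a per-N-gram "penalty" map are all assumptions made up for this illustration. It only shows how a four-digit string could be turned into one weight per N-gram length and applied to a sum.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Unamunda itself.
class WeightSketch {

    // Turn a four-digit string such as "0211" into per-length weights:
    // index 0 holds the 1-gram weight, index 3 the 4-gram weight.
    static int[] parseWeights(String field) {
        if (field == null || !field.matches("[0-9]{4}")) {
            throw new IllegalArgumentException("expected exactly four digits");
        }
        int[] weights = new int[4];
        for (int i = 0; i < 4; i++) {
            weights[i] = field.charAt(i) - '0';
        }
        return weights;
    }

    // Sum the per-N-gram penalties (an assumed input), scaling each one by
    // the weight for that N-gram's length. N-grams longer than 4 are ignored.
    // Length is counted in UTF-16 code units, which is fine for this sketch.
    static long weightedDistance(Map<String, Integer> penalties, int[] weights) {
        long total = 0;
        for (Map.Entry<String, Integer> e : penalties.entrySet()) {
            int n = e.getKey().length();
            if (n >= 1 && n <= weights.length) {
                total += (long) weights[n - 1] * e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] w = parseWeights("0211");
        Map<String, Integer> penalties = new HashMap<>();
        penalties.put("e", 5);    // 1-gram, weight 0
        penalties.put("el", 3);   // 2-gram, weight 2
        penalties.put("que", 4);  // 3-gram, weight 1
        penalties.put("ente", 2); // 4-gram, weight 1
        // 0*5 + 2*3 + 1*4 + 1*2 = 12
        System.out.println(weightedDistance(penalties, w));
    }
}
```

With "1111" every N-gram counts equally, which is the traditional behaviour described above.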

What's the difference between Unamunda and Unamundita?

Unamunda has many, many more languages than Unamundita, but is several versions behind.

Why have Unamundita, why not just Unamunda?

First, Unamunda is getting to the size where most computers run out of memory before loading all the languages. This is a pain, because it forces you to check only some of the boxes. Second, just in terms of the GUI, Unamunda is far too cluttered to be easily used.

Alright, I buy that, but why don't you even support Unamunda anymore?

Supporting two versions at this early, proof-of-concept stage is a pain.

Where does the name "Unamunda" come from?

David Ives, "The Universal Language." Read it - it's not worth knowing if I have to explain it to you.

Why don't you include language X?

Because I hate language X and everyone who speaks it. Seriously though, it's because that language didn't have very long articles in Wikipedia on the subjects whence I drew training data.

Well then can you add in language X?

Sure thing. Just send me a representative corpus of the language, at least 40 KB in length, in Unicode. Then I'll work it into the next version if I agree that it's a language worth having.

Hey! Your thingie classified something incorrectly!

You don't say.

It doesn't seem to be working at all. . . nothing happens when I click "Go!"

First, make sure you have text entered and at least one language checked (2 if you want sensible results).

If you've got that, then you're probably out of memory. Try checking fewer languages, and start by unchecking the ones you're reasonably certain your sample is not written in (e.g. Thai if your sample is in Roman characters).

If you've checked the console and it's not an OutOfMemoryError, please do let me know what's wrong, and I'll do my best to fix it.

I tried reading your source code, but there are hardly any comments!

Yeah. . . sorry about that.

Is Unamunda free?

As in beer, yes it is free to use. As in speech, well. . . yes, the source code is available. I ask that you drop me a line if you're using it and cut me in on any profit you make.

Something else!

Email me about it. You can frequently reach the owner of a web page by putting the part of the URL after the ~ before an @ and then following that up with the part after the www.

How does Unamunda work?

That would take more space to explain than is appropriate for a FAQ. Watch the presentation, read Cavnar and Trenkle's paper on N-gram-based text categorization, and if you have further questions, do inquire.
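
For the impatient, here is a rough Java sketch of the general rank-order ("out-of-place") technique from Cavnar and Trenkle's paper. It is not Unamunda's code, and it simplifies the paper: it does no tokenizing or padding, it uses N-gram lengths 1 to 4 to match the four weights above, and the class name, method names, and profile size of 300 are assumptions chosen for this example. It does not model the "Universal" and "Specific" miss rates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of rank-order N-gram classification.
class RankProfileSketch {

    // List the N-grams (lengths 1..4) of a text, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> profile(String text, int maxSize) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int n = 1; n <= 4; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.size() > maxSize
                ? new ArrayList<>(ranked.subList(0, maxSize))
                : ranked;
    }

    // Out-of-place distance: for each N-gram in the sample profile, add how
    // far its rank is from its rank in the language profile. An N-gram that
    // the language profile lacks costs the language profile's full size.
    static int outOfPlace(List<String> sample, List<String> language) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < language.size(); i++) {
            rank.put(language.get(i), i);
        }
        int missPenalty = language.size();
        int total = 0;
        for (int i = 0; i < sample.size(); i++) {
            Integer r = rank.get(sample.get(i));
            total += (r == null) ? missPenalty : Math.abs(r - i);
        }
        return total;
    }

    // Return the key of the training text whose profile is closest to the sample.
    static String closest(String sample, Map<String, String> corpora, int maxSize) {
        List<String> sampleProfile = profile(sample, maxSize);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : corpora.entrySet()) {
            int d = outOfPlace(sampleProfile, profile(e.getValue(), maxSize));
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training texts; real profiles need far more data than this.
        Map<String, String> corpora = new HashMap<>();
        corpora.put("Spanish", "yo soy miguel y voy a la biblioteca con mis amigos");
        corpora.put("English", "the quick brown fox jumps over the lazy dog");
        System.out.println(closest("voy a la biblioteca", corpora, 300));
    }
}
```

The smaller the distance, the better the match, so the language with the smallest total wins. With training texts this tiny the answer is not reliable, which is why a decent-sized corpus per language matters.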