Thursday, February 16, 2006

Googling for Hungarians

When I was in Kiev a few months back, I realized that all web addresses are in Roman script. Now this may not be much of a revelation, I’ll grant you, but it intrigued me. I’m guessing much of the older generation in countries that don’t use the same alphabet as I do have very little understanding or recognition of Roman letters, whereas the younger, Internet-savvy generation probably have to. And not just the ones who have learned English either.

I was reminded of this the other day when I was searching for something on Google, but unusually (in fact maybe uniquely) I was looking up something in Hungarian. It was then that I realized that the way Google works in English may not be quite as successful in Hungarian. If I type an English word into it, I know Google will find all the instances of that word in its database. Exactly that word. But in Hungarian, a word will vary in its spelling depending on its role in a sentence and whether it has suffixes stuck on it. If I type in “tojás” (egg) for example, it will presumably return all instances of the word tojás. But, if tojás is the direct object of a verb (as in “I boiled an egg”) it will be “tojást”, or, as far as a search engine is concerned, a completely different word. And that’s just one possibility. For place names the range of possibilities is endless. Off the top of my head the name of this town could be rendered as Csikszereda, Csikszeredán, Csikszeredában, Csikszeredára, Csikszeredába, Csikszeredát, Csikszeredához, Csikszeredával, Csikszeredábol, Csikszeredárol, and almost certainly loads of others depending on whether you’re in the town, going to the town, coming from the town, or just hanging around in the general vicinity of the town.
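To make the problem concrete, here is a toy sketch in Python of how a program might fold those place-name variants back into one search term. Everything in it is an assumption made up for illustration: the list `CASE_ENDINGS` is hand-picked from the examples above (spelled as I typed them), `crude_stem` is a hypothetical helper, and it only copes with names ending in "a", whose final vowel lengthens to "á" before an ending. It is not how any real search engine does it.

```python
# Hand-picked endings taken from the Csikszereda examples above (illustrative only).
CASE_ENDINGS = ["ban", "ból", "bol", "hoz", "val", "ról", "rol", "ba", "ra", "n", "t"]


def crude_stem(word):
    """Strip one case ending, but only from a stem whose final 'a' became 'á'."""
    # Try longer endings first so "ban" wins over "n".
    for ending in sorted(CASE_ENDINGS, key=len, reverse=True):
        stem = word[: -len(ending)]
        if word.endswith(ending) and stem.endswith("á"):
            # Undo the vowel lengthening: "Csikszeredá" -> "Csikszereda".
            return stem[:-1] + "a"
    return word


variants = ["Csikszereda", "Csikszeredán", "Csikszeredában", "Csikszeredához"]
print({crude_stem(v) for v in variants})
```

Note that this toy leaves "tojást" untouched, because "tojás" does not end in "á". A real stemmer would need far more rules than this.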

I checked this out on Google, as I suspected that they may have worked something out for this – after all even in English you get plurals which are in essence different words – and it seems they have. They claim to use something they call “Stemming technology” (isn’t that what George Bush wants to ban?) to ensure that different variants of the same root word are recognized when you search. I wonder if this only works with English or it somehow crosses languages. Or if google.hu uses a different Magyar version of stemming technology? If not I fear there are a lot of searches that may miss their targets. But how does stemming technology work – is it a piece of software that guesses which words have the same roots? So if you type in station you might get hits for both stationary and stationery? And if not, then presumably the groups of related words have been programmed by someone.
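The two possibilities I'm wondering about can be sketched roughly as follows. This is only an illustration under my own assumptions, not a description of Google's stemming: `guess_stem` is a made-up rule-based guesser with an invented suffix list, and `lookup_stem` uses a tiny hand-written table (`ROOTS`) of the kind someone would have to program by hand.

```python
# Approach 1: software that guesses, by chopping off a known suffix (invented list).
GUESS_SUFFIXES = ["ary", "ery", "s"]


def guess_stem(word):
    for suffix in GUESS_SUFFIXES:
        # Keep at least four letters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


# Approach 2: related words listed by hand (a tiny made-up table).
ROOTS = {"stations": "station", "eggs": "egg"}


def lookup_stem(word):
    return ROOTS.get(word, word)


for w in ["station", "stations", "stationary", "stationery"]:
    print(w, guess_stem(w), lookup_stem(w))
```

With these made-up rules the guesser lumps stationary and stationery in with station, which is exactly the kind of false match I was worried about, while the hand-built table only knows the words somebody bothered to put in it.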

If not (or before the miracles of stemming technology) I’m guessing use of a search engine is/was quite a different skill for a Hungarian than it is for me, for example. Thinking about it occupied my brain for a few minutes anyway, and now, thanks to the miracles of the internet, I've shared that inner monologue with all of you. My generosity knows no bounds.

Some new favourite Hungarian words: Kinel, which is a question word meaning (something like) “at whose place?”, and which is amusing because, well, it sounds like “kinel”. It may be only British readers who see why that’s remotely amusing to my puerile mind, but if you are really interested I’ll explain it in the comments. And Prezli, which means “breadcrumbs” and is amusing basically because it is pronounced exactly the same as the surname of the singer of Heartbreak Hotel, and one or two other songs. It amuses me to think of Elvis Breadcrumbs. Not sure why, but there you go. Hungary even has its own Elvis figure, a bloke called Fenyö Miklos (Nicholas Pine-Tree), who is very big on the circuit of rubbish variety programmes shown on New Year’s Eve.

9 comments:

Anonymous said...

I read Google's page on stemming and I think it only works in English and perhaps some of the bigger langs (like German).

Both Google.ro and Google.hu have searches in Hungarian but as I don't speak the language I can't tell you how well the stemming works in Magyar.

Congrats on your AFOE noms, I voted for ya.

Pax

dumneazu said...

Have to correct you - Fenyo Miklos may be stuck in the Hungarian fantasy of the 1950s, but the real Fake Elvis of Hungary is Komar Laszlo. Ask anyone.

I think Prezli for crumbs is a Transylvanian dialect item, like "pityoka" for spuds or "murok" for carrots.

I've enjoyed your blog for a year now from up here in Budapest. As soon as weather gets nicer I'll have to head down to Erdely for some fishing... you guys have trout out there, ya know.

Anonymous said...

Hm.. Google.hu appears to be owned by a local company ("Hirek Media"), which probably grabbed it before Google.

--Bogdan

Anonymous said...

Search engine companies tend to employ lots of linguists. In order to keep up with the competition they have to make their engines understand as much as possible about the material that their engines are indexing. The search company I know most about (www.fast.no) definitely supports lemmatization and stemming for Hungarian.

Expect not just stemming, but also searching for synonyms, spelling correction, automatic recognition of names (of persons, places, organizations etc.) to show up in your favorite search engine.

For example, a search for your name would bring up a side bar that lets you drill down into Palestine, Vermont, teacher training, Transylvania, palinka etc. because the search engine understands that these are important concepts related to you.

Automatic translation into other languages is almost a side effect of all these functions.

(Tinfoil hat mode ON) -- A lot of this functionality is being driven by the need of (American) intelligence agencies to record and be able to search through every email and phone conversation that they are interested in (and I believe they are interested in _every_ conversation). I'm sure it's not just those text ads that are making Google all that money.

An interesting demonstration of some of this technology is the blinkx.tv search engine... this is a search engine that watches lots of TV stations and allows you to search for phrases that have been uttered in particular TV shows. And if it can listen to TV, it can listen to our phones.

Sorry for geeking out... but I work with search technology every day.

Hey, and congrats on the nominations. I voted for you. I must have been one of the first people to suggest you start a blog, so I'm anxiously awaiting your acceptance speech!

Andy said...

Dumneazu: I just asked my Hungarian cultural consultant and she confirmed that you are probably right on Komar Laszlo. You don't have trout in Hungary?

My anonymous tin-foil-hatted friend - wow. That comment has sent me reeling. I still have to digest most of it.

Anonymous said...

On my list of Hungarian words to avoid is 'sajt', because I always hear it being said by an Irish friend, in whose accent it sounds like "shoit" - not very appetising.

Pesticide also points to the restaurant called 'Fakanál' (wooden spoon), which when said by a scouser approaches the F-word.

For the Hungarians you have to be careful how you pronounce 'Bus' as this can sound like 'Basz' to them! More innocently, 'Cookie' is a Hungarian child's word for the part of their body they soon learn not to refer to in polite company. Finally, very innocently, 'chalk' sounds like 'csók'.

So be careful out there!

Catherine said...

If Google doesn't even stem Croatian, can't see it doing it with Hungarian any time soon...

On the other hand, probably ensures Hungarians have much more developed mind-reading capabilities than the rest of us!

Anonymous said...

Google is stemming Hungarian these days

Danelyn from search engine company said...

This is very informative as it stated the different information that we need to know. Thank you so much.