Making sense of search queries

When someone is asked a question, a whole range of mental processes comes into play before an answer can be produced. Similarly, when a user enters a search query into the search bar, a search engine performs a whole series of actions to deliver a response.
The first thing the search engine does is identify the language it has to deal with. Next, it analyses the search words to determine which words and word forms it should look for on the internet. Looking for the exact word combination [Drugstores Toronto] may not be enough, as pages with the words ‘Toronto Drug stores’ or ‘Drugstores in Toronto’ are very likely to be relevant as well. It also has to figure out which words, and which forms of them, can be disregarded.
The whole procedure – language detection, word analysis, synonym matching, etc. – takes only fractions of a second.

Understanding the language

Query analysis starts with language identification.
The English ‘gift’ might just as well turn out to be the German ‘poison’ – ‘Gift’. To make sure it knows which language is being used, a search engine looks for distinctive letters, letter combinations or specific words in the search query. A search query [Gift der Scorpione] retrieves information about scorpion poison in German, while [gift for a scorpio] fetches links to websites with birthday present ideas for people whose astrological sign is Scorpio.
The German word ‘ein’ looks exactly like the American abbreviation [EIN], which the search engine interprets as ‘employer identification number’ and accordingly answers with information about US tax identification, while [ein team ein Ziel!] brings up German-language results about the slogan of Germany’s national football team: ‘one team, one goal’.
While the search engine is expected to classify [San José de Arimatea] as a Spanish query and respond with information about Joseph of Arimathea in Spanish, [San Jose Sharks] should trigger results about a Californian ice hockey team.
Knowing the user’s region also helps in determining their language. If a search query comes from, say, Seville, the search engine can assume that the language of the search is likely to be Spanish.
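The combination of clues described above (distinctive words, distinctive letters and the user's region) can be illustrated with a minimal sketch. It is not Yandex's actual method: the marker tables, the weights and the function name guess_language are all hypothetical and chosen only to fit the examples in this section.

```python
# Hypothetical marker tables; a real system would rely on far richer statistics.
MARKER_WORDS = {
    "de": {"der", "die", "das", "ein", "und", "ziel"},
    "en": {"the", "for", "a", "of", "and"},
    "es": {"de", "el", "la", "san", "y"},
}
MARKER_CHARS = {
    "de": set("äöüß"),
    "es": set("áéíóúñ¿¡"),
}
# Assumed mapping from a user's region to its most likely language.
REGION_PRIOR = {"Seville": "es", "Berlin": "de", "Toronto": "en"}


def guess_language(query, region=None):
    """Return the best-scoring language code, or None if there is no evidence."""
    scores = {lang: 0.0 for lang in MARKER_WORDS}
    lowered = query.lower()

    # One point for every word that is typical of a language.
    for word in lowered.split():
        word = word.strip("!?.,")
        for lang, words in MARKER_WORDS.items():
            if word in words:
                scores[lang] += 1.0

    # Distinctive letters are stronger evidence, so they weigh more.
    for lang, chars in MARKER_CHARS.items():
        scores[lang] += 2.0 * sum(ch in chars for ch in lowered)

    # The region only nudges the decision; it never outweighs a marker word.
    prior = REGION_PRIOR.get(region)
    if prior is not None:
        scores[prior] += 0.5

    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With these assumed tables, guess_language("Gift der Scorpione") yields "de" and guess_language("gift for a scorpio") yields "en", while a query with no markers at all, such as "hotel", is resolved only when a region is supplied.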

Widening the search

Having identified the language of a query, the search engine looks at the structure of each search term to expand the pool of potentially matching search results. Instead of looking only for the exact match, the search engine uses its knowledge of the rules according to which words are formed, so that it can find all pages that contain various forms of a search term. If someone looking for Gone with the Wind types [go with the wind] in the search field, they will find references to the film anyway. In addition to exact matches for the query [steel knife], the search results also include links to pages with ‘steel knives’, ‘knives’ and even ‘knife’ and ‘steel knifes’.
When analysing a search query, Yandex compiles a list of all possible grammatical forms for each word.
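As a rough illustration of listing the forms of each query word, the sketch below handles English nouns only. The tiny irregular-plural dictionary and the function names noun_forms and expand_query are assumptions made for this example; real morphological analysis covers every part of speech and relies on much larger dictionaries.

```python
# Hypothetical mini-dictionary of irregular plurals.
IRREGULAR_PLURALS = {"knife": "knives", "man": "men"}


def noun_forms(word):
    """Return the singular and plural forms of an English noun.

    Naive on purpose: the input is assumed to be a singular noun
    or one of the irregular plurals listed above.
    """
    word = word.lower()
    singular_of = {plural: singular for singular, plural in IRREGULAR_PLURALS.items()}
    base = singular_of.get(word, word)

    forms = {base}
    if base in IRREGULAR_PLURALS:
        forms.add(IRREGULAR_PLURALS[base])
    elif base.endswith(("s", "x", "ch", "sh")):
        forms.add(base + "es")
    else:
        forms.add(base + "s")
    return forms


def expand_query(query):
    """Return one set of acceptable forms per query word, in query order."""
    return [noun_forms(word) for word in query.split()]
```

For the query [steel knife], expand_query returns one set per word, so a page containing ‘steel knives’ matches just as well as one containing the exact phrase.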
Some words may sound and look the same, but mean different things. When processing a search query with such words, a search engine offers results for all possible meanings. Users looking for [Castle], for instance, will see both the images of and links to pages about fortified, predominantly medieval, buildings, and information about a popular TV series, together with links to pages where it can be streamed.
If a search engine limited its search exclusively to finding pages with the words that matched the search terms exactly, a lot of useful information relevant to the user’s search query would be missed. There is often more than one way to refer to something. Talking about one and the same thing, for instance, different sources can use either an abbreviated or a full version of a term or name. Responding to a search query, Yandex adds to the original search terms all possible versions of these terms. Delivering search results to the search query [Massachusetts Institute of Technology], Yandex also adds pages that contain ‘MIT’, and the other way around.
In the same way, a search engine has to look for different ways of writing numbers (e.g., ‘Charles the First’ and ‘Charles I’), closely related single-root words, alternative spellings and synonyms. To a search query for [Latvian], the system adds ‘Latvia’, and for [linguistics] it will include ‘linguistically’ and ‘linguistic’.
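Adding equivalent terms to a query can be sketched as a lookup in groups of interchangeable phrases. The groups below are hypothetical and hold only the examples from this section, and query_variants is an illustrative name rather than anything Yandex has published.

```python
import re

# Hypothetical groups of phrases that refer to the same thing.
EQUIVALENTS = [
    {"massachusetts institute of technology", "mit"},
    {"charles the first", "charles i"},
    {"latvian", "latvia"},
]


def query_variants(query):
    """Return the normalized query plus every variant with an equivalent phrase swapped in."""
    normalized = " ".join(query.lower().split())
    variants = {normalized}
    for group in EQUIVALENTS:
        for phrase in group:
            # Word boundaries stop 'mit' from matching inside 'summit'.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, normalized):
                for alternative in group - {phrase}:
                    variants.add(re.sub(pattern, alternative, normalized))
    return variants
```

The lookup works in both directions: a query containing ‘MIT’ gains a variant with the full name, and a query with the full name gains a variant with ‘MIT’.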
In choosing which words to add and which to omit, Yandex looks at how often each word in a query co-occurs with other words – both in users’ queries, and in the general pool of documents. Word-pairing statistics tell the search engine that a user looking for ‘bow tie’ really wants to find sites about neckties in the form of a bow, and not information about bending over or projectile contraptions for firing arrows.
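The idea behind word-pairing statistics can be shown with a toy example. The sketch below counts how often two words appear in the same text and then picks the meaning whose typical words co-occur most with the query words. The sample texts, the sense lists and the function names are all invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(texts):
    """Count, for each unordered word pair, the number of texts containing both words."""
    counts = Counter()
    for text in texts:
        words = sorted(set(text.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


def best_sense(query, senses, counts):
    """Pick the sense whose typical words co-occur most often with the query words."""
    query_words = set(query.lower().split())

    def score(sense_words):
        return sum(
            counts[tuple(sorted((q, s)))]
            for q in query_words
            for s in sense_words
            if q != s
        )

    return max(senses, key=lambda name: score(senses[name]))


# Hypothetical miniature corpus and sense inventory.
TEXTS = [
    "silk bow tie for a tuxedo",
    "how to tie a bow tie",
    "bow and arrow hunting",
    "archery bow arrow set",
]
SENSES = {
    "necktie": {"tuxedo", "silk"},
    "weapon": {"arrow", "archery"},
}
COUNTS = cooccurrence_counts(TEXTS)
```

On this toy data, best_sense("bow tie", SENSES, COUNTS) selects "necktie", because ‘tie’ appears alongside ‘silk’ and ‘tuxedo’ but never alongside ‘arrow’, whereas the single word "bow" leans towards "weapon".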
Single-root words and synonyms are taken from dictionaries and reference resources, some of which Yandex produces especially for such situations.

Working on mistakes

When analysing a query, the search system always checks whether it is written correctly. According to Yandex’s statistics, about 12 percent of queries contain errors – typos, spelling mistakes and gibberish caused by an incorrect keyboard layout. If a search is limited to just what is written in the search field, the user will not get the answer they are looking for, because in most cases site content is written correctly. In the case of words that are frequently spelled incorrectly or queries for which there is no good answer, the search engine immediately corrects the query and shows answers to the corrected version, while also telling the user that the query was corrected. [Asassin’s creed] is thus automatically corrected and the search engine will be looking for ‘Assassin’s Creed’.
In some cases, it’s hard to tell whether the user has made an error. In these situations, Yandex asks whether the user has made a mistake and whether they want to see answers to the corrected version of the query. The search engine knows that [Tokio Hotel] is a musical group, while [hotels in Tokyo] relates to accommodation in the capital of Japan. If the query is [tokyo hotel], the search engine shows answers to both versions of the query on a single page, and offers the option of narrowing it down to just one or the other by clicking on the intended spelling.
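The difference between correcting a query outright and merely offering a correction can be sketched with a similarity score and two thresholds. This is only an illustration: it uses string similarity from Python's standard difflib module as a stand-in for whatever statistics a real search engine uses, and the list of known terms, the threshold values and the name check_spelling are all hypothetical.

```python
import difflib

# Hypothetical list of well-formed queries known to the system.
KNOWN_TERMS = ["assassin's creed", "tokio hotel", "gone with the wind"]


def check_spelling(query, auto_cutoff=0.95, suggest_cutoff=0.75):
    """Return (action, text), where action is 'keep', 'suggest' or 'correct'."""
    normalized = " ".join(query.lower().replace("’", "'").split())
    if normalized in KNOWN_TERMS:
        return ("keep", normalized)

    # Closest known term, if any is at least suggest_cutoff similar.
    matches = difflib.get_close_matches(normalized, KNOWN_TERMS, n=1, cutoff=suggest_cutoff)
    if not matches:
        return ("keep", normalized)

    candidate = matches[0]
    similarity = difflib.SequenceMatcher(None, normalized, candidate).ratio()
    # Near-certain typos are fixed silently; borderline cases are only suggested.
    action = "correct" if similarity >= auto_cutoff else "suggest"
    return (action, candidate)
```

Under these assumed thresholds, [Asassin’s creed] is close enough to a known term to be corrected automatically, while [tokyo hotel] falls into the middle band, where both readings are plausible and the user is asked to choose.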
This work with mistakes and the entire process of linguistic analysis takes place in a split second. In that time, the system manages to determine the language of the query, analyse each word, find synonyms and common combinations, and then finally decide exactly which words need to be found.