Friday, December 28, 2007

How Does a Search Engine Know the Language of a Query? Google Explores Character Mapping

Google Patent Applications on Languages and Queries


Google published four patent applications recently that delve into
these areas, covering the “handling of language uncertainty in
processing search queries and searches over the web, where the queries
and documents can be expressed in any one of a number of different
languages.”


A search engine is called upon to index and search documents written
in a wide variety of languages, including a number of documents that
are expressed in more than one language.


Keyboards without Non-Latin Characters


Another challenge is that some devices used to create content and
display web pages can have difficulty producing some of the characters
used in different languages.


People searching on a handheld device, or on a keyboard limited to
Latin characters, may type close substitutes for the characters they
actually want to use, such as an unaccented character in place of an
accented one.


A search engine could process the content that it has indexed to
remove accents and convert special characters into a standard set of
characters, but this would throw away information from the search
index, making it impossible to retrieve content when a searcher does
write a query in their own language, using non-Latin or accented
characters.
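
As a rough illustration of that trade-off, here's a minimal Python sketch of my own (not from the patent filings) showing how stripping accents collapses distinct words into a single index term:


import unicodedata

def strip_accents(text):
    # Decompose each character into a base letter plus combining
    # marks, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "résumé" and "resume" become the same index term, so the
# distinction is lost from the search index.
print(strip_accents("résumé"))  # resume
print(strip_accents("über"))    # uber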


The Query Language Patent Filings


The patent applications were published on December 13, 2007, and were originally filed on April 19, 2006.



Search Engine Learning the Language of a Document


Under the approach in these patent filings, a training model is
created to identify the language used in documents to be searched. The
model is trained upon a specific body of documents, which can be a mix
of different types, such as:


  • HTML,
  • PDF,
  • Text documents,
  • Word processing documents,
  • Usenet articles, or
  • Any other kind of document with text content, including metadata content.

These documents should ideally represent what might be found on the
Web, and might be the Web itself, or a snapshot or extract from the
Web.


That body of documents should include all of the languages represented
on the Web, with enough documents from each language that they contain
a significant portion of the words found within all of that language's
documents on the Web.


The Role of Character Encoding


A system like this might work best if each of the training documents
and each document to be searched were encoded in a known and
consistent character encoding, such as 8-bit Unicode Transformation
Format (UTF-8).
Of course, that isn’t what you’ll find on the Web, where many pages
don’t declare a character set at all, or declare an entirely different
one. Here’s what the code would look like in the HTML for a page using
UTF-8:


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


If a page doesn’t use UTF-8 and this language determination process
does, then documents using some other encoding might be converted into
UTF-8. That conversion might result in some funny-looking characters
ending up in results.
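
To see where those funny-looking characters come from, here's a small Python sketch (mine, purely illustrative): bytes from a page served in one encoding, decoded as if they were UTF-8, turn into replacement characters:


# A page served as Latin-1 (ISO-8859-1), decoded as if it were UTF-8:
# bytes that aren't valid UTF-8 become replacement characters.
raw = "café".encode("latin-1")

correct = raw.decode("latin-1")                  # café
garbled = raw.decode("utf-8", errors="replace")  # caf�

print(correct, "->", garbled)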


Language Detection on Pages, Using Probabilities


The document language detection process uses statistical learning theory and classification models.


The most likely class or classes for a page of text may be determined
from the text on the page, and possibly from the page's URL as well.


This could be done by breaking the text down into words, and
computing the probabilities of those words appearing together on the
page in different languages, to predict the most likely language for
that text.


So, on a page where the word “Hello” occurs frequently, and in the
training model that word appears most often on English pages and then
on German pages, the page is most likely to be in English, with German
as the next candidate.
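
As a toy version of that word-probability idea, here's a Python sketch of my own; the per-language word frequencies are made up for illustration, where a real model would be trained over enormous document collections:


import math

# Hypothetical per-language word frequencies from a training corpus.
word_probs = {
    "english": {"hello": 0.02, "the": 0.06, "haus": 0.0001},
    "german":  {"hello": 0.005, "the": 0.0002, "haus": 0.01},
}

def score_language(words, lang, smoothing=1e-6):
    # Sum log probabilities so a product of many small numbers
    # doesn't underflow; unseen words get a small smoothing value.
    probs = word_probs[lang]
    return sum(math.log(probs.get(w, smoothing)) for w in words)

words = ["hello", "the"]
ranked = sorted(word_probs, key=lambda lang: score_language(words, lang),
                reverse=True)
print(ranked)  # ['english', 'german'] -- English scores highest here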


Looking at certain characters can be helpful, too. If certain
characters rarely, if ever, appear in some languages, then pages
containing words with those characters are less likely to be in those
languages.
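
That character signal could work as a simple filter. Here's an illustrative Python sketch (the character-to-language mappings are my own examples, not from the filings):


# Characters common in one language but rare in others can narrow
# down the candidate languages for a page.
distinctive_chars = {
    "ß": {"german"},
    "ñ": {"spanish"},
    "ø": {"danish", "norwegian"},
}

def plausible_languages(text, candidates):
    plausible = set(candidates)
    for ch, langs in distinctive_chars.items():
        if ch in text:
            # Keep only languages in which this character is common.
            plausible &= langs
    return plausible

print(plausible_languages("straße", {"english", "german", "spanish"}))
# {'german'}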


The Use of Character Mapping


One of the keys to this process is the creation of character maps
that may be more distinctive of one language than of others. A common
form of a word in a specific language may contain accented characters,
for instance.


The patent applications go into a great deal of detail on how these character mappings can be used in a few different ways.


One is to help identify languages for some queries.


Another is to identify when certain queries might be simplified
versions of a word, entered because the device the searcher is using,
such as a smart phone, can't produce certain characters. The patent
applications give a number of examples of how this might work.
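
Along those lines, here's a small Python sketch of my own showing one way a character map might expand an unaccented query into accented candidates (the variant map and the expand_query helper are hypothetical, not from the patent filings):


from itertools import product

# Hypothetical map from an unaccented character to the accented
# variants a searcher on a limited keyboard may have meant.
variants = {
    "e": ["e", "é", "è", "ê"],
    "u": ["u", "ü"],
    "a": ["a", "à", "â"],
}

def expand_query(term, limit=20):
    # Build the accented candidates; in practice only candidates
    # that actually occur in the search index would be kept.
    options = [variants.get(ch, [ch]) for ch in term]
    candidates = ("".join(combo) for combo in product(*options))
    return [c for _, c in zip(range(limit), candidates)]

print(expand_query("uber"))  # includes 'uber', 'über', ...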



