
April 23, 2011

Google LocalRank

On February 25, 2003, Google was granted a patent for a new page ranking algorithm called LocalRank. It is based on the idea that pages should be ranked not by their global link citations, but by how they are cited among pages that deal with topics related to the particular query. The LocalRank algorithm is not used in practice (at least, not in the form described in the patent). However, the patent contains several interesting innovations that any SEO specialist should know about. Nearly all search engines already take into account the topics of referring pages. They appear to use algorithms rather different from LocalRank, but studying the patent gives us a general idea of how such topic-sensitive ranking may be implemented.

While reading this section, please bear in mind that it contains theoretical information rather than practical guidelines.

The following three items comprise the main idea of the LocalRank algorithm:

1. An algorithm is used to select a certain number of documents relevant to the search query (call this number N). These documents are initially sorted according to some criterion (this may be PageRank, relevance, or a group of other criteria). Let us call the numeric value of this criterion OldScore.

2. Each of the N selected pages then goes through a new ranking procedure and receives a new rank. Let us call it LocalScore.

3. The OldScore and LocalScore values for each page are multiplied to yield a new value, NewScore. The pages are finally ranked based on NewScore.
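
To make these three steps concrete, here is a minimal Python sketch of the re-ranking flow. The function and field names are our own illustration, not anything specified in the patent, and step 3 is simplified to a plain product; the exact combination formula from the patent is given later in this article.

```python
# A minimal sketch of the re-ranking flow; the function and field names
# are our own illustration, not anything specified in the patent.
def rerank(pages, compute_local_score):
    # Step 1 has already happened: every page carries an 'old_score'
    # assigned by the initial ranking algorithm.
    for page in pages:
        page["local_score"] = compute_local_score(page, pages)  # step 2
    for page in pages:
        # Step 3, simplified here to a plain product; the exact combination
        # formula from the patent is given later in this article.
        page["new_score"] = page["old_score"] * page["local_score"]
    return sorted(pages, key=lambda p: p["new_score"], reverse=True)
```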

The key procedure in this algorithm is the new ranking procedure, which gives each page a new LocalScore rank. Let us examine this new procedure in more detail:

0. An initial ranking algorithm is used to select the N pages relevant to the search query. Each of the N pages is assigned an OldScore value by this algorithm. The new ranking algorithm only needs to work on these N selected pages.

1. While calculating LocalScore for a given page, the system selects those pages from N that link to this page. Let their number be M. At the same time, any pages from the same host (as determined by IP address) and pages that are mirrors of the given page are excluded from M.

2. The set M is divided into subsets Li. Pages are grouped into these subsets according to the following criteria:

- Belonging to the same (or similar) host. Pages whose IP addresses share the same first three octets fall into one group; that is, pages with IP addresses in the range xxx.xxx.xxx.0 to xxx.xxx.xxx.255 are considered to belong to one group.

- Having the same or similar content (mirrors).

- Belonging to the same site (domain).

3. Each page in each subset Li has an OldScore rank. The page with the largest OldScore is taken from each subset and the rest are excluded from the analysis. This gives us a subset K of pages that link to the page of interest.

4. Pages in the subset K are sorted by the OldScore parameter; then only the first k pages (k is some predefined number) are kept in K and the rest are excluded from the analysis.

5. In this step, LocalScore is calculated by combining the OldScore values of the remaining k pages:

   LocalScore = OldScore(1)^m + OldScore(2)^m + ... + OldScore(k)^m

   Here m is some predefined parameter that may vary from one to three. Unfortunately, the patent does not describe this parameter in detail.
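
Since the patent leaves several details open, the following Python sketch fills them in with assumptions of our own: pages are plain dictionaries with hypothetical url, ip, outlinks and old_score fields, mirror and same-domain detection are omitted, and grouping uses only the first three IP octets.

```python
# A simplified LocalScore computation following steps 1-5 above.
# Mirror and same-domain grouping are omitted for brevity; grouping by
# C-class network (first three IP octets) stands in for all three criteria.
def local_score(target, pages, k=10, m=2):
    # Step 1: pages from N that link to the target, excluding the target's host.
    linking = [p for p in pages
               if target["url"] in p["outlinks"] and p["ip"] != target["ip"]]

    # Step 2: group the linking pages by xxx.xxx.xxx.* network.
    groups = {}
    for p in linking:
        key = ".".join(p["ip"].split(".")[:3])
        groups.setdefault(key, []).append(p)

    # Step 3: keep only the highest-OldScore page of each group (the set K).
    best = [max(g, key=lambda p: p["old_score"]) for g in groups.values()]

    # Step 4: sort K by OldScore and keep only the top k pages.
    best.sort(key=lambda p: p["old_score"], reverse=True)

    # Step 5: combine the remaining OldScore values; m lies between 1 and 3.
    return sum(p["old_score"] ** m for p in best[:k])
```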

After LocalScore has been calculated for each page in the set N, NewScore values are calculated and the pages are re-sorted according to the new criterion. The following formula is used to calculate NewScore:

NewScore(i) = (a + LocalScore(i)/MaxLS) * (b + OldScore(i)/MaxOS)

Here i is the page for which the new rank is calculated;

a and b are numeric constants (the patent gives no more detailed information about these parameters);

MaxLS is the maximum LocalScore among those calculated;

MaxOS is the maximum among the OldScore values.
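
In code, the final combination might look like the sketch below. Since the patent does not disclose a and b, the values here are arbitrary placeholders.

```python
# NewScore combination as given by the formula above; a and b are
# placeholders, since the patent does not specify their values.
def new_scores(pages, a=1.0, b=1.0):
    max_ls = max(p["local_score"] for p in pages) or 1.0  # MaxLS
    max_os = max(p["old_score"] for p in pages) or 1.0    # MaxOS
    for p in pages:
        p["new_score"] = ((a + p["local_score"] / max_ls)
                          * (b + p["old_score"] / max_os))
    return sorted(pages, key=lambda p: p["new_score"], reverse=True)
```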

Now let us put the math aside and explain these steps in plain words.

In step 0), pages relevant to the query are selected using algorithms that do not take link text into account, such as relevance and overall link popularity measures. We now have a set of OldScore values: OldScore is the rating of each page based on relevance, overall link popularity and other factors.

In step 1), pages that link to the page of interest are selected from the group obtained in step 0). That group is whittled down in steps 2), 3) and 4) by removing mirrors and pages from the same hosts and sites, so that we are left with a set of genuinely unique pages that share a common topic with the page under analysis. By analyzing inbound links from pages in this group (and ignoring all other pages on the Internet), we obtain the local (thematic) link popularity.
 
LocalScore values are then calculated in step 5). LocalScore is the rating of a page among the set of pages that are related by topic. Finally, pages are rated and ranked using a combination of LocalScore and OldScore.

February 2, 2011

Common search engine principles

To understand SEO, you need to be aware of the architecture of search engines. They all contain the following main components:

Spider – a browser-like program that downloads web pages.

Crawler – a program that automatically follows all of the links on each web page.

Indexer – a program that analyzes web pages downloaded by the spider and the crawler.

Database – storage for downloaded and processed pages.

Results engine – extracts search results from the database.

Web server – a server that is responsible for interaction between the user and other search engine components.

Specific implementations of search mechanisms may differ. For example, the Spider+Crawler+Indexer component group might be implemented as a single program that downloads web pages, analyzes them and then uses their links to find new resources. However, the components listed are inherent to all search engines, and the SEO principles are the same.
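
As a toy illustration of such a single-program implementation, here is a minimal spider + crawler + indexer loop using only the Python standard library. Real systems add politeness delays, robots.txt handling, parallelism and far more sophisticated parsing.

```python
# Toy spider + crawler + indexer in one loop (standard library only).
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Indexer stand-in: parses a page, here only extracting its links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    index, queue, seen = {}, [seed], {seed}
    while queue and len(index) < limit:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")  # spider
        except (OSError, ValueError):
            continue
        index[url] = html  # database: store the downloaded page
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:  # crawler: queue links to unknown documents
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index
```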

Spider. This program downloads web pages just like a web browser. The difference is that a browser displays the information presented on each page (text, graphics, etc.) while a spider does not have any visual components and works directly with the underlying HTML code of the page. You may already know that there is an option in standard web browsers to view source HTML code. 

Crawler. This program finds all links on each page. Its task is to determine where the spider should go either by evaluating the links or according to a predefined list of addresses. The crawler follows these links and tries to find documents not already known to the search engine. 

Indexer. This component parses each page and analyzes the various elements, such as text, headers, structural or stylistic features, special HTML tags, etc.

Database. This is the storage area for the data that the search engine downloads and analyzes. Sometimes it is called the index of the search engine.

Results Engine. The results engine ranks pages. It determines which pages best match a user's query and in what order they should be listed, according to the ranking algorithms of the search engine. Ranking is therefore what any SEO specialist is most interested in when trying to improve a site's search results. In this article, we will discuss the SEO factors that influence ranking in some detail.
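
As an extremely simplified illustration, the snippet below ranks the pages stored by the crawl sketch above by raw query-term frequency; real results engines combine hundreds of signals.

```python
# Toy results engine: rank indexed pages by how often the query terms occur.
def search(index, query):
    terms = query.lower().split()
    scores = {url: sum(text.lower().count(t) for t in terms)
              for url, text in index.items()}
    return sorted((url for url in scores if scores[url] > 0),
                  key=lambda url: scores[url], reverse=True)
```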

Web server. The search engine web server usually hosts an HTML page with an input field where the user can specify the search query he or she is interested in. The web server is also responsible for displaying search results to the user in the form of an HTML page.

History of Search Engines

In the early days of Internet development, its users were a privileged minority and the amount of available information was relatively small. Access was mainly restricted to employees of various universities and laboratories who used it to access scientific information. In those days, the problem of finding information on the Internet was not nearly as critical as it is now.

Site directories were one of the first methods used to facilitate access to information resources on the network. Links to these resources were grouped by topic. Yahoo, launched in April 1994, was the first project of this kind. As the number of sites in the Yahoo directory inexorably grew, its developers made the directory searchable. Of course, this was not a search engine in its true form, because searching was limited to the resources whose listings had been put into the directory. It did not actively seek out resources, and the concept of SEO was yet to arrive.

Such link directories have been used extensively in the past, but nowadays they have lost much of their popularity. The reason is simple – even modern directories with lots of resources only provide information on a tiny fraction of the Internet. For example, the largest directory on the network is currently DMOZ (or Open Directory Project). It contains information on about five million resources. Compare this with the Google search engine database containing more than eight billion documents.

The WebCrawler project, started in 1994, was the first full-featured search engine. The Lycos and AltaVista search engines appeared in 1995, and for many years AltaVista was the major player in this field.

In 1997 Sergey Brin and Larry Page created Google as a research project at Stanford University. Google is now the most popular search engine in the world.

Currently, there are three leading international search engines – Google, Yahoo and MSN Search. They each have their own databases and search algorithms. Many other search engines use results originating from these three, and the same SEO expertise can be applied to all of them. For example, the AOL search engine (search.aol.com) uses the Google database, while AltaVista, Lycos and AllTheWeb all use the Yahoo database.