Saturday, May 13, 2006

A word on convergence

Convergence is an important mathematical aspect of PageRank, which allows
Google to provide unprecedented search quality at comparatively low cost. This is
a semi-complex topic but it is important to your understanding of how and why
PageRank works. We've tried to make it as simple as possible but, unless you're
Sergey Brin or Larry Page, you'll still need to concentrate!



Whilst it will take some concentration, it isn't that hard to understand.



We've shown that the outlet values (final values) of one stage of the
calculation become the inlet values (starting values) of the next stage, and
that we keep on doing this (it's known as a recursive procedure). But the really
big question is how and when does the recursive procedure stop?



The answer is "convergence". Provided the dampening factor (d in our equation) is less than one, then convergence will occur. Nominally we set it to 0.85
(because that's the value mentioned in the Stanford papers).



This convergence basically means that whatever values we start at, after running
the calculation a number of times we will end up with the same final values and
that these values will no longer change if we do further iterations of the
calculation. These final values are known as limiting values.



Once the limiting values have been reached, Google no longer needs to expend
processing power on calculating the PageRank. They can finish there!
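To see the "whatever values we start at" claim in action, here is a minimal Python sketch. It assumes the four-page link structure from the worked example (Page A links to B, C and D; Page B links to C; Page C links to A; Page D links to C) and d = 0.85. The function name, the starting values of 0, 1 and 40, and the fixed count of 200 iterations are our own choices for illustration, not anything Google has published.

```python
# Link structure from the worked example: page -> pages it links to.
LINKS = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}


def limiting_values(start, links=LINKS, d=0.85, iterations=200):
    """Run the calculation a fixed number of times from one starting value."""
    ranks = {page: start for page in links}
    for _ in range(iterations):
        # Every page begins each round with the base amount (1 - d).
        new_ranks = {page: 1 - d for page in links}
        for page, targets in links.items():
            # A page's damped PageRank is split evenly between its links.
            share = d * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        # Outlet values of this stage become inlet values of the next.
        ranks = new_ranks
    return ranks


# Three arbitrary starting values (our assumption, purely for illustration).
from_zero = limiting_values(0.0)
from_one = limiting_values(1.0)
from_forty = limiting_values(40.0)

for page in LINKS:
    print(page, round(from_zero[page], 6), round(from_one[page], 6),
          round(from_forty[page], 6))
```

Each row should show the same figure three times, which is the point: the starting value changes how far the calculation has to travel, not where it ends up.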


This is easier to understand with an example. Let's take a look at the following
structure:




PageRank for pages A, B, C, D at various stages of iteration


How PageRank is calculated

On a simple level, we can tell quite a lot about how PageRank is calculated. This is because when Google was just a university research project, the creators of PageRank published a paper that detailed a formula for calculating it. This formula is now more than a handful of years old, and we suspect that it has changed somewhat since then. However, for detailing the over-riding principles of how PageRank works, it is as accurate today as the day it was written.

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Where PR(A) is the PageRank of Page A (the one we want to work out).
d is a dampening factor. Nominally this is set to 0.85.
PR(T1) is the PageRank of a page pointing to Page A.
C(T1) is the number of links off that page.
PR(Tn)/C(Tn) means we do that for each page pointing to Page A.

Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page, http://www-db.stanford.edu/~backrub/google.html
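For readers who prefer code to notation, the published formula can be written as a short Python function. This is only a sketch of the formula as quoted above. The function name, the dictionary layout and the two example pages T1 and T2 (with their PageRanks and link counts) are made up for demonstration.

```python
def pagerank_of(page, inbound, outlinks, ranks, d=0.85):
    """PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))."""
    # Sum PR(T)/C(T) over every page T that points to `page`.
    votes = sum(ranks[t] / outlinks[t] for t in inbound[page])
    return (1 - d) + d * votes


# Hypothetical figures: T1 has PageRank 2 and 4 links, T2 has PageRank 1 and 1 link.
ranks = {"T1": 2.0, "T2": 1.0}
outlinks = {"T1": 4, "T2": 1}
inbound = {"A": ["T1", "T2"]}

pr_a = pagerank_of("A", inbound, outlinks, ranks)
print(round(pr_a, 4))  # 0.15 + 0.85 * (2/4 + 1/1) = 1.425
```

Note that the function needs the PageRank of T1 and T2 as inputs, and this is exactly why a single pass is not enough in practice.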

Couldn't be simpler, right? Depending on your mathematical competence, this is either remarkably easy to understand, or remarkably complex. So here's the deal: this isn't a one-time calculation. It looks pretty above, but you can't just do one simple calculation and get an answer with the PageRank of a page! If you look at it, to calculate the PageRank of Page A, we need to know the PageRank of all the pages pointing to it. To calculate the PageRank of those pages we need to know the PageRank of all the pages pointing to them (one of which could be, and is quite likely to be, Page A!). If this seems like it could go around in circles forever, that's because it nearly does. We have to do a whole lot of little calculations many times over before we actually get the answer. What this formula tells us is that however you slice it, and however the formula may have changed since it was written:

The PageRank given to Page A by a Page B pointing to it is decreased with each link to anywhere that exists on Page B. This means that a page's PageRank is essentially a measure of its vote; it can split that vote between one link or two or many more, but its overall voting power will always remain the same.


This really is important. So to be totally clear, let's put some numbers to it (these numbers are made up for demonstration purposes only, and have no relevancy to any particular page). Say Page B has a PageRank value of 5 and has a single link on it pointing to Page A. Page A's PageRank is improved by a proportion of Page B's value of 5 (Page B doesn't lose anything, but Page A gains). If Page B has two links, that PageRank improvement would be split, and Page A would only gain half the PageRank that it did before.
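The same made-up numbers can be run through the formula's d * PR(T)/C(T) term. The sketch below assumes d = 0.85, and the function name is ours.

```python
def vote_per_link(page_rank, links_on_page, d=0.85):
    """Amount one page passes along each of its links: d * PR / C."""
    return d * page_rank / links_on_page


one_link = vote_per_link(5, 1)   # Page B has a single link, pointing to Page A
two_links = vote_per_link(5, 2)  # Page B now links to Page A and one other page

print(one_link, two_links)
```

With two links Page A receives half of what it received with one, while the total that Page B hands out stays the same.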

Now put the formula out of your mind for a moment, as it's easier to understand how it works using a diagram. Let's say we have a hypothetical set of pages imaginatively titled Page A, Page B, Page C and Page D. They link to each other as shown below:


To begin with, in our example at least, we don't know what the pages' starting PageRanks are. There's nothing special about the number that we pick to start with (in fact, if you read forward to the section on convergence, you can see that we can start at any number we want). Since in the last version of this paper we performed the calculations by setting these values to 1, we're going to set them to zero this time around in order to prove that it doesn't matter what the starting values are.

Next, we perform the necessary calculation to obtain the PageRank for each page. The rules are:

1. We take 0.85 * a page's PageRank, and divide it by the number of links on the page.
2. We add that amount on to a new total for each page it's being passed to.
3. We add 0.15 to each of those totals.

The first calculation is easy. Because we've started at zero, 0 * 0.85 is always 0. So each page gets just 0.15 + 0, meaning each page now has a PageRank of 0.15. Clearly we're not done: we want to show the importance of each page based upon links, and they're all the same, so we need to run the calculation again.

Page A links to pages B, C and D. Page A's PageRank is 0.15 so it will add 0.85 * 0.15 = 0.1275 to the new PageRank scores of the pages it links to. There are three of them so they each get 0.0425.

Page B links to page C. Page B's PageRank is 0.15 so it will add 0.85 * 0.15 = 0.1275 to the new PageRank score of the page it links to. Since it only links to page C, page C will get it all.

Page C links to Page A, so all 0.1275 passes to page A.

Page D links to Page C. Again all 0.1275 passes to page C.

The new totals for each page then become:

Page A: 0.15 (base) + 0.1275 (from Page C) = 0.2775
Page B: 0.15 (base) + 0.0425 (from Page A) = 0.1925
Page C: 0.15 (base) + 0.0425 (from Page A) + 0.1275 (from Page B) + 0.1275 (from Page D) = 0.4475
Page D: 0.15 (base) + 0.0425 (from Page A) = 0.1925
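The three rules and the two rounds of arithmetic above can be checked with a few lines of Python. This is a sketch of the procedure as described here, using the link structure from the walkthrough. The names LINKS and iterate_once are ours.

```python
# Link structure from the walkthrough: page -> pages it links to.
LINKS = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}


def iterate_once(ranks, links=LINKS, d=0.85):
    """Apply the three rules once and return the new PageRank of every page."""
    # Rule 3: every page's new total includes the base 0.15 (that is, 1 - d).
    new_ranks = {page: 1 - d for page in links}
    for page, targets in links.items():
        # Rule 1: take 0.85 * PageRank and divide by the number of links.
        share = d * ranks[page] / len(targets)
        # Rule 2: add that amount to each page it is being passed to.
        for target in targets:
            new_ranks[target] += share
    return new_ranks


start = {page: 0.0 for page in LINKS}
first = iterate_once(start)    # every page gets 0.15
second = iterate_once(first)   # the totals listed above

for page in LINKS:
    print(page, round(first[page], 4), round(second[page], 4))
```

The second column should match the totals worked out by hand: 0.2775, 0.1925, 0.4475 and 0.1925.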


So we've got:



Pretty neat, huh? Already we're beginning to see that Page C is probably the most important page in the system (but we can't be sure yet; it could well change). We carry on doing these calculations until the value for each page no longer changes (this is called convergence; there's more about this in the next section). In practice, Google probably doesn't wait for this convergence, but instead runs a number of iterations of the calculation which is likely to give them fairly accurate values (more on this later as well). If we carried out all the calculations for the example given, it would take us 143 calculations [Excel Example 1] and we'd reach final values of:



As suspected, Page C is the most important. If we take a quick look at these raw values we can see something about the number of links pointing out from a page. Look at Page A, which has a link from a high PageRank page (Page C), which has only one outbound link. Then look at Page B and D; both share links from a high PageRank page (Page A), with three outbound links. The number of links significantly alters the way PageRank is distributed.
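"Carrying on until the values no longer change" can be sketched in Python as a loop with a stopping rule. The tolerance of 0.000001 and the cap of 1000 iterations are our assumptions. The number of rounds needed depends on how strictly "no longer changes" is defined, so the count printed here need not match the figure from the spreadsheet.

```python
# Link structure from the walkthrough: page -> pages it links to.
LINKS = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}


def iterate_once(ranks, links, d):
    """One round of the calculation for every page."""
    new_ranks = {page: 1 - d for page in links}
    for page, targets in links.items():
        share = d * ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


def run_until_stable(links, start=0.0, d=0.85, tolerance=1e-6,
                     max_iterations=1000):
    """Repeat the calculation until no page moves by more than `tolerance`."""
    ranks = {page: start for page in links}
    for iteration in range(1, max_iterations + 1):
        new_ranks = iterate_once(ranks, links, d)
        biggest_change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if biggest_change < tolerance:
            return ranks, iteration
    # Gave up: return the latest values and the cap we hit.
    return ranks, max_iterations


final, iterations = run_until_stable(LINKS)
print(iterations, {page: round(value, 4) for page, value in final.items()})
```

Loosening the tolerance gives an answer sooner at the cost of accuracy, which is the trade-off behind running a fixed number of iterations rather than waiting for full convergence.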

Really heavy competition

No explanation of PageRank strategies would be complete without a final statement regarding heavy keyword competition. There are some queries where competition is so intense that you must do everything possible to maximize your ranking score (for example "web hosting"). In such situations it is impossible to rank highly through Non-PageRank factors alone (as you will not initially be listed high enough to be noticed and linked to). That is not to say that Non-PageRank factors aren't important. Consider what your final rank score is:

Final Rank Score = (score for all Non-PageRank factors) x (actual PageRank score).

Improving either side of the equation can have a positive effect. However, because the Non-PageRank factors have a restricted maximum benefit, the actual PageRank score must be improved in order to compete successfully. Under really heavy competition, it holds true that you cannot rank well unless your actual PageRank score is above a certain level. In other words:

There exists a query specific "Minimum PageRank level". For queries that do not have heavy competition, this level is easy to achieve without even trying. However, where heavy competition exists, Non-PageRank factors are just as important (and easier to get) until they reach the Non-PageRank factor threshold. This is why careful keyword choice can help you avoid the extensive work associated with highly competitive search phrases.
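If we take the Final Rank Score equation at face value, the "Minimum PageRank level" falls straight out of it. Once the Non-PageRank side has hit its maximum, the only way to match a competitor's score is for PageRank to be at least that score divided by the maximum. The Python sketch below works through this with invented numbers. The cap of 1000, the competitor's scores and both function names are hypothetical.

```python
def final_rank_score(non_pagerank_score, pagerank):
    """Final Rank Score = (Non-PageRank factors) x (actual PageRank)."""
    return non_pagerank_score * pagerank


def minimum_pagerank(competitor_score, max_non_pagerank_score):
    """PageRank needed to match a competitor once other factors are maxed out."""
    return competitor_score / max_non_pagerank_score


MAX_NON_PR = 1000  # hypothetical ceiling on the Non-PageRank factors

# Hypothetical competitor: Non-PageRank score 900, actual PageRank 3.0.
competitor = final_rank_score(900, 3.0)
needed = minimum_pagerank(competitor, MAX_NON_PR)

print(competitor, needed)
```

For a low-competition query the competitor's score is small, so the PageRank needed is tiny. For something like "web hosting" it is not.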

Using the threshold to derive the worth of two ranking strategies

The threshold explains principles and the different ways that search engine marketers work. It also demonstrates why some of the misunderstandings about PageRank occur. Let's consider the strategies of two people: Person A considers PageRank to be unimportant, and Person B considers PageRank to be very important.

Person A says "PageRank" is unimportant. They have optimised pages for years and know how to use "on the page" factors very successfully. They understand the basics of anchor text but they couldn't care less about PageRank.

What's happening: person A is reaching the Non-PageRank Factor Threshold very quickly because they are maximising the "on the page" factors. Through carefully choosing keywords they jump-start themselves up the SERPs. As long as their content is good, high-ranking sites (over time) tend to get linked to. Whilst they didn't directly ask for it, a slow trickle of sites will begin to link to them and give them PageRank, which helps consolidate their position.

Person B says "PageRank" is important. We've all seen those pages in the results that have no content, but great rankings (With big brands, this can often occur naturally even when they have no idea what PageRank is. This would be Person C who is not relevant to the discussion at hand.) Person B understands lots about PageRank and concentrates heavily on it.

What's happening: person B is doing the reverse of person A. Whilst person A concentrated on the Non-PageRank factors and found herself getting PageRank anyway, person B concentrates on the PageRank Factor and finds himself getting Non-PageRank factors. The reason for this is that increasing PageRank requires links, and links have anchor text. Thus, through carefully choosing the anchor text linking to his page, person B automatically increases his Non-PageRank factor scores whilst obtaining his high PageRank score. Obviously, these are two extremes, but we can use these to extrapolate the advantages and disadvantages of each approach:



It is clear that both strategies can and do work. Both strategies are using PageRank as part of the mix of factors that will ultimately improve their ranking in the SERPs. Because there is such a mix, we can use them to different degrees, depending on the strategy that best suits your style. My personal strategy is to use a combination, but to save some of the "on the page" factors for later, in case I need a quick boost if the competition heats up at a later time.

Non-PageRank Factor Threshold

Having written about the difference between PageRank and other factors, and how PageRank is harder to get, it should now be clear that whilst we could use many methods to get good rankings, there is a threshold which defines when high PageRank is worth striving for and when it is not.

With ranking factors other than PageRank, there is a score beyond which the rate at which any factor adds to the score slows so much that further effort is not worthwhile. This is the Non-PageRank Factor Threshold. To illustrate this, let's put an example figure on this of 1000.

If we have a query where the results are Page A and Page B, then Page A and B have scores for that query which are the total scores for all ranking factors (including PageRank). Let's say Page A's score is 900 and page B's score is 500. Obviously Page A will be listed first. These are both below our hypothetical Non-PageRank Factor Threshold, thus without any change in PageRank, it is possible for page B to improve its optimization to beat Page A for this particular query. There are lots of queries like this on Google; they're more commonly thought of as less competitive queries!

Now assume Page A raises its score to 1100. Suddenly page B cannot compete in the SERPs (search engine results pages) without increasing its PageRank. In all probability, page B must also improve for all the other ranking factors, but an increase in PageRank is almost certainly necessary. There are also lots of queries like this on Google, which are more commonly thought of as more competitive queries!

Generally, when querying Google, the group of pages in the SERPs will contain some pages that have a score above the Non-PageRank Factor Threshold, and some that do not.

There is an important point to be made here:

To be competitive you must raise your page's search engine ranking score beyond the Non-PageRank Factor Threshold. To fail to do so means that you can easily be beaten in the search results for your query terms. The quickest way to approach the Non-PageRank Factor Threshold is through "on the page factors"; however, you cannot move above the Non-PageRank Factor Threshold without PageRank.

The obvious question is: what's the numerical value of the Non-PageRank Factor Threshold, and how much work do you need to do to get past it? The answer is that it has no value; it is a hypothetical line. Google could put a value on it, but that would not help us unless we know what the page's individual scores are. We need only be aware that the threshold exists, and that it gives us information about principles.

The difference between PageRank and other factors

To assess when PageRank is important and when it is not, we need to understand how PageRank is different from all other ranking factors. To do this, here's a quick table that lists a few other factors, and how they add to the ranking score:



All other ranking factors have cut off points beyond which they will no longer add to your ranking score, or will not add significantly enough for it to be worthwhile. PageRank has no cut off point.

Google's First 1000 Results

Remember, PageRank alone cannot get you high rankings. We've mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worthwhile.

If you perform any broad search on Google, it will appear as if you've found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so explains why you should always concentrate on "on the page" factors and anchor text first, and PageRank last.

Assume that you perform a search on Google and it returns 200,000 results. If we were to calculate every factor for each of the 200,000 pages, do you think it would really take just 0.34 seconds to search? The answer to speeding up the search is to get a subset of documents that are most likely to be related to the query. This subset of documents needs to be larger than the number of search results. For example, let's say that number is 2000. What the search engine does is query the whole database using 2 or 3 factors, finding the 2000 documents that rank highest for them. (Remember, there were 200,000 possible documents, and that's the number that actually gets shown). Then the engine applies all the factors to those 2000 and ranks them accordingly. Because there's a drop in the quality of the results (not the pages) at the bottom of this subset, the engine just shows the first 1000. PageRank is almost certainly not one of those 2 or 3 initial factors. Notice how before, we highlighted the word "related," in creating the subset of 2000 pages. The search engine is looking for pages that are on-topic. If we included PageRank in that list we'd get a lot of high PageRank pages with topics that are only slightly related (because of the second factor), but that's not what we want.

Why this is critical:
You must do enough "on the page" work and/or anchor text work to get into that subset of 2000 pages for your chosen key phrase, otherwise your high PageRank will be completely in vain. PageRank means nothing if you do not have enough ranking from other factors to make it into the first subset.
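The two-stage process described above can be sketched in Python. This is only our reading of the idea, not Google's actual code. The single "relevance" number stands in for the 2 or 3 cheap factors, the full score is simplified to relevance x PageRank, and the three pages with their figures are invented.

```python
import heapq


def search(pages, subset_size=2000, shown=1000):
    """Two-stage ranking: cheap relevance pass first, full scoring second."""
    # Stage 1: pick the most on-topic pages using the cheap factors only.
    subset = heapq.nlargest(subset_size, pages, key=lambda p: p["relevance"])
    # Stage 2: apply the full score (PageRank as a multiplier) to the subset.
    ranked = sorted(subset, key=lambda p: p["relevance"] * p["pagerank"],
                    reverse=True)
    # Only the top of the subset is ever displayed.
    return ranked[:shown]


# Invented pages; the subset is shrunk to 2 so the effect is visible.
pages = [
    {"url": "on-topic-low-pr", "relevance": 0.9, "pagerank": 1.0},
    {"url": "on-topic-high-pr", "relevance": 0.8, "pagerank": 5.0},
    {"url": "off-topic-huge-pr", "relevance": 0.1, "pagerank": 1000.0},
]

results = search(pages, subset_size=2, shown=2)
urls = [page["url"] for page in results]
print(urls)  # ['on-topic-high-pr', 'on-topic-low-pr']
```

The page with the enormous PageRank never appears, because it was dropped before PageRank was even looked at. Among the pages that did make the subset, PageRank then decides the order.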

Is PageRank a good determination of the quality of a page?

To examine the worth of PageRank, we need to first look at its premise, and how accurate it is. Basically PageRank says:

1. If a page links to another page, it is casting a vote, which indicates that the other page is good.
2. If lots of pages link to a page, then it has more votes and its worth should be higher.

The basic implication here is: People only link to pages they think are good.

It shouldn't be hard to convince you that this premise is wrong. A few of the reasons people link to pages other than ones they think are good are:

1) Reciprocal links - "Link to me and I'll link to you."
2) Link Requirements - "Using our script requires you to put a link to our page." or "We'll give you an award solely because you link to our page."
3) Friends and Family - "This is my friend Pete's site." or "My mum's site is here, my dad's site is here. My dog's site is here."
4) Free Page Add-ons - "This counter was provided by www.linktocountersite.com."

Furthermore, anybody who has a top-ranking site will tell you that it tends to get links from new sites. This is not necessarily because the site is good (although such sites generally are). Assume a Webmaster is setting up a new site and they are looking for some outbound links. Nowadays, one of the first things they do is a Google search for similar sites. The links they end up with may not necessarily be the best sites, but merely the easiest ones to find. If PageRank influences rankings, and if they subsequently link to those pages, the new Webmaster will be adding to the inaccuracies in the judging of the quality of a page. The same is true when these new Webmasters use the Google Toolbar PageRank indicator to choose whom to link to.

To put this another way:

Thursday, May 04, 2006

How significant is PageRank?

The significance of any one factor in search engine algorithms depends on the quality of the information it supplies. A factor's importance is known as its weight. To demonstrate how weighting is arrived at, it's easiest to move away from
PageRank for a second and look at Meta tags. Originally, when the Meta keyword tag was new, you could write something like this in your document:

webranking meta

In theory, the Meta keyword tag was a very good indicator of what the page was about. However, as most are well aware - the weighting for the keywords tag is fast approaching nothing. Two things have contributed to this:

1. The ease with which Webmasters can manipulate it.
2. The level of manipulation by Webmasters.

These two things are separate factors, but with human nature being what it is, the easier something is to influence - the more it is manipulated. The combination of these factors determines the "weighting" - i.e., how much we trust the information provided by that factor.

So it makes sense to look at these factors in relation to PageRank first.

PageRank is, without doubt, one of the hardest things for a Webmaster to manipulate ethically. However, it is possible to generate links to your site from other sites fairly simply through the use of link farms and guestbooks. Google frowns upon this kind of abuse, and many sites that have tried this have had their PageRank influence blocked. But it must be said that the abuse is still rampant, and that it can have an influence on PageRank. So, whilst not easy to do, PageRank is still subject to manipulation.

The extent to which PageRank is manipulated has also changed. Most people no longer believe Google's old line of people not being able to influence PageRank and the results based on it. However, there is more information about PageRank available than ever, and people are more aware of manipulation techniques.

So whilst PageRank is valuable, you should be careful not to over-estimate its usage and capabilities. Your final ranking in Google is due to a mix of factors, of which PageRank is only one. We'll get into more details later by discussing how PageRank is different than the other ranking factors, and thus, when it applies and when it doesn't. Ironically enough, PageRank's weighting factor is undeniably declining. Since the original version of this paper gave out detailed information about PageRank, in all likelihood it may also have contributed in some small way to the decline in weighting of the very subject it talks about!

How accurate is the Google toolbar?

The Google toolbar is not very accurate in showing you the actual PageRank of a site, but it's the only thing right now that can give you any idea. As long as you know the toolbar's limitations, then at least you know what you are viewing.

There are two limitations to the Google toolbar:
1. The toolbar sometimes guesses. If you enter a page, which is not in its index, but where there is a page that is very close to it in Google's index, then it will provide a guesstimate of the PageRank. This guesstimate is worthless for our purposes because it isn't featured in any of the PageRank calculations. The only way to tell if the toolbar is a guesstimate is to type the URL into the Google search box and see if the page shows up in the SERPS. If it doesn't, then the toolbar is guessing!

2. The toolbar is just a representation of actual PageRank. Whilst PageRank is linear, Google has chosen to use a non-linear graph to portray it. So on the toolbar, to move from a PageRank of 2 to a PageRank of 3 takes less of an increase than to move from a PageRank of 3 to a PageRank of 4. A comparison table best illustrates this phenomenon. The actual figures are kept secret so we'll just use any figures for demonstration purposes:






The PageRank shown in the Google directory (http://directory.google.com) suffers from the same problems. The PageRank shown in the directory is also on a different scale. There have been attempts to cross-reference these two scales but because they are non-linear, the results really do not tell you anything more than you already know.
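To get a feel for what a non-linear toolbar scale means, here is a Python sketch that assumes each toolbar step needs a fixed multiple more actual PageRank than the step before. Both that assumption and the multiple of 6 are ours, chosen purely for demonstration. The real scale is not public.

```python
def toolbar_value(actual_pagerank, base=6.0, top=10):
    """Map an actual PageRank to a 0-10 toolbar value on a made-up scale."""
    value = 0
    threshold = base
    # Each toolbar step needs `base` times more PageRank than the last one.
    while value < top and actual_pagerank >= threshold:
        value += 1
        threshold *= base
    return value


def increase_needed(level, base=6.0):
    """Actual PageRank separating the start of `level` from `level + 1`."""
    return base ** (level + 1) - base ** level


for level in range(5):
    print(level, "->", level + 1, "needs", increase_needed(level))
```

On this made-up scale, going from 2 to 3 needs an increase of 180, whilst going from 3 to 4 needs 1080. Two pages showing the same toolbar number can therefore sit a long way apart in actual PageRank.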

Also of note is that a programmer managed to generate a tool to look up PageRank without using Internet Explorer. This tool has since been withdrawn, but whilst originally the numbers given by this software and Google's toolbar matched - presently querying with such software sometimes produces different numbers than querying with the toolbar. This is Google's right to protect their data, but is the strongest indication that:

How can you tell what a page's PageRank is?

To learn what a page's PageRank is, you can download a toolbar for Internet Explorer from http://toolbar.google.com. Once installed, there will be a bar graph at the top of the browser showing a version of PageRank for the page you're browsing. When you hold the mouse over the bar, you see a number from zero to ten. (If you don't see the number, you may have an older version of the toolbar installed. You will need to completely uninstall it, reboot your computer and reinstall the latest version. Once this is done, you should be able to see the PageRank number.)

How is PageRank determined?

The Google theory goes that if Page A links to Page B, then Page A is saying that Page B is an important page. PageRank also factors in the importance of the links pointing to a page. If a page has important links pointing to it, then its links to other pages also become important. The actual text of the link is irrelevant when discussing PageRank.

What is PageRank?

PageRank is Google's method of measuring a page's "importance." When all other factors such as Title tag and keywords are taken into account, Google uses PageRank to adjust results so that sites that are deemed more "important" will move up in the results page of a user's search accordingly.

A basic overview of how Google ranks pages in their search engine results pages (SERPS) follows:
1) Find all pages matching the keywords of the search.
2) Rank accordingly using "on the page factors" such as keywords.
3) Calculate in the inbound anchor text.
4) Adjust the results by PageRank scores.

In reality, it's slightly more complex and we'll discuss this in more depth later, but for now the above description serves our purposes. It's worth noting that PageRank is a multiplier and is not just simply added to the score. Thus, if your page had a PageRank of zero, it would rank at the very end of the SERPS.
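The four-step overview can be turned into a toy Python ranking function. Everything about the scoring here is an assumption made for illustration: counting keyword occurrences for the "on the page" and anchor text scores is our stand-in for factors Google does not disclose, and the three example pages are invented.

```python
def rank_results(pages, query):
    """Toy version of the four-step overview; the scoring rules are assumed."""
    words = query.lower().split()
    # 1) Find all pages matching the keywords of the search.
    matches = [p for p in pages
               if all(word in p["text"].lower() for word in words)]
    scored = []
    for page in matches:
        # 2) "On the page" score: here, simply how often the keywords occur.
        on_page = sum(page["text"].lower().count(word) for word in words)
        # 3) Add in the inbound anchor text, counted the same way.
        anchor = sum(text.lower().count(word)
                     for text in page["anchors"] for word in words)
        # 4) PageRank multiplies the score; it is not simply added on.
        scored.append(((on_page + anchor) * page["pagerank"], page["url"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


pages = [
    {"url": "a.example", "text": "web hosting plans and web hosting tips",
     "anchors": ["web hosting"], "pagerank": 2.0},
    {"url": "b.example", "text": "cheap web hosting",
     "anchors": [], "pagerank": 0.0},
    {"url": "c.example", "text": "gardening tips",
     "anchors": ["web hosting"], "pagerank": 9.0},
]

order = rank_results(pages, "web hosting")
print(order)  # ['a.example', 'b.example']
```

The page with a PageRank of zero still matches the query but lands at the very end, and the page with the highest PageRank is never considered because it fails step 1.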