
How Search Engines Index Your Web Content


Search Engine Indexing and Web Crawling

Search engine indexing uses automated software—known as web crawlers, robots, or spiders—to traverse the web and index pages, which are then analyzed and ranked. For web crawlers to read your content, it needs to be visible to them, and crawlers essentially cannot see anything that is not text. Content embedded in formats such as Flash, JavaScript, and images cannot be read, so that text is lost for the purpose of boosting your search rankings. This chapter contains some basic rules that will help you avoid this problem by enhancing your site’s visibility to web crawlers.


JavaScript

JavaScript is a client-side scripting language used for creating dynamic web pages. When correctly applied, it can enhance a website and help you achieve many effects that HTML cannot. However, links in JavaScript are not visible to crawlers, and therefore will not be followed. If you have JavaScript menus that you cannot do without, make sure there are alternative HTML links to those destinations, so that all your links will be crawlable.
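As a sketch (the menu markup, function name, and URLs are invented for illustration), a JavaScript-driven menu can be paired with plain HTML links to the same destinations so crawlers can still follow them:

```html
<!-- JavaScript-driven menu: crawlers will not follow links generated here -->
<div id="menu" onclick="openPage(event)">Browse the site</div>

<!-- Plain HTML links to the same destinations: fully crawlable -->
<ul class="nav-fallback">
  <li><a href="/about/">About</a></li>
  <li><a href="/archives/">Archives</a></li>
  <li><a href="/contact/">Contact</a></li>
</ul>
```

The fallback list can be styled out of sight of visitors who have the scripted menu, but it keeps every destination reachable through an ordinary link.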


Flash

Flash is a popular method for adding animation and interactivity to web pages. When used in small amounts, it can enhance a website without damaging its search engine rankings. However, search engine crawlers cannot see the content inside a Flash file or follow Flash links. Therefore, links and content should reside outside your Flash files.

As a general rule, keep Flash to a minimum. If you feel strongly about using an all-Flash page, be sure to create an HTML version of that page as well and block the Flash version of your pages from the crawlers with a robots meta tag.* Use Flash where it counts and avoid it whenever there is a reasonable alternative using HTML, CSS, or JavaScript.
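The robots meta tag mentioned above goes in the head of the Flash-only page. A minimal sketch (the page itself is a placeholder):

```html
<!-- On the Flash-only version of the page: keep it out of the index,
     but still let the crawler follow any links it can find -->
<head>
  <meta name="robots" content="noindex, follow">
</head>
```

With this in place, only the HTML version of the page competes in the search results, so the two versions never count as duplicates.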


Audio, Video, and Graphical Text

Audio and video are both elements that can enhance a website when used appropriately. For ranking purposes, however, make sure to also create a text-only version of your multimedia, such as a transcript, whenever possible. As for graphical text, it should be avoided in most situations. Use CSS to style real text so that it counts toward increasing your search rankings.
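For example (the class name and styles here are invented), a CSS-styled HTML heading keeps the words readable by crawlers, where an image of the same text would not be:

```html
<!-- Real text, styled with CSS: search engines can read every word -->
<h1 class="site-title">Example Site Title</h1>
<style>
  .site-title {
    font-family: Georgia, serif;
    font-size: 2.5em;
    letter-spacing: 0.05em;
    color: #8b0000;
  }
</style>
```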

Other Crawler Blocks

Web crawlers may not fully see a web page or its links if it has one of the characteristics listed below. However, they can make an exception if the page has a significant amount of reputation.

  • Parameters – Pages with more than two dynamic parameters may not be indexed – for example, “page.php?post=102&cat=5&action=view”.

  • Link quantity – Crawlers may not follow all links from a page containing more than 100 of them.

  • Deep links – Internal pages more than three links away from the front page may not be followed.
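The parameter rule above is easy to check with a short script. This is a sketch using Python’s standard library; the URL is the example from the list:

```python
from urllib.parse import urlparse, parse_qs

def dynamic_param_count(url):
    """Count the dynamic (query-string) parameters in a URL."""
    return len(parse_qs(urlparse(url).query))

url = "http://example.com/page.php?post=102&cat=5&action=view"
count = dynamic_param_count(url)
print(count)  # 3 parameters: post, cat, action
if count > 2:
    print("More than two dynamic parameters; crawlers may skip this page.")
```

The same idea extends to the other two rules: a script can count the links on a page or measure click depth from the front page and flag anything over the thresholds.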

Some pages may be completely inaccessible to web crawlers, particularly if they have one of the attributes below.

  • Login – Pages requiring a login or cookie to access.

  • Form – Pages accessible only through an HTML form.

  • Robots – Pages blocked with the robots meta tag.

Note   IndexSpy-WP* – This WP plug-in provides a list showing which of your pages Google has indexed.

Canonical URL

Each page should have only one possible URL. Otherwise you may hurt those pages’ rankings, because the value that comes from links will be split across multiple versions. This can occur when the CMS you use has several URL paths that all point to the same page, accessed through separate forms of navigation on your site. WordPress, with its permalink structure, does not suffer much from this problem.

Most often, the problem of multiple versions can be found on a site’s front page. For example, http://www.example.com/, http://example.com/, and http://www.example.com/index.php all point to the same page, but are different URLs to search engines.

In WordPress, the index.php version is automatically redirected to the root URL. Likewise, the www and non-www versions are also redirected to the version specified under WordPress address on the Settings -> General administration page. The following SEO plug-in will also take care of some other common duplication issues.

Note   All in one SEO pack* – This WP plug-in avoids typical duplicate content issues and allows the creation of canonical URLs.

Note that pointing to the front page with or without a trailing “/” does not matter to search engines. Similarly, leaving out the “http://” protocol will not cause the page to be registered as a separate version. http://www.example.com, http://www.example.com/, and www.example.com thus all refer to the same version of the front page.

A solution to the issue of multiple page versions is to use a 301 permanent redirect rule† to point all duplicate versions to a single “canonical” version of the web page. This is done by adding a couple of lines to an .htaccess file that redirect requests for the duplicate URLs to the canonical one. You can simplify the task using the plug-in below.
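As a sketch, assuming an Apache server with mod_rewrite enabled and example.com as a placeholder domain, the non-www version of a site can be 301-redirected to the www version with lines like these in .htaccess:

```apache
# Permanently redirect all non-www requests to the canonical www version
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The R=301 flag marks the redirect as permanent, which tells search engines to transfer the old URL’s link value to the canonical version rather than splitting it.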

Note   Quick Page/Post Redirect – This WP plug-in lets you redirect pages and posts to a canonical URL.

Duplicate Content

Google removes pages with duplicate content from its search results. What this means is that if there are multiple pages on the web with virtually the same content, only one of them will be displayed on the search result pages—the one with the highest reputation. Therefore, you should make sure that your content appears only once on your site, and not on any other sites. A tool you can use to discover online plagiarism is Copyscape.* If you do republish some of your own content, or, with permission, someone else’s content, it is a good idea to rewrite it a bit first.

Duplicate content is defined rather vaguely by Google as a substantial block of text that is a complete match or appreciably similar. Smaller chunks, snippets, and translated content are not considered duplicates.
