In this third and final segment of our guide to basic SEO concepts, we’ll touch on the more advanced SEO definitions and concepts of website “crawlability,” including basic search engine directives, common client and server errors, best practices and web developer resources. You can access Part 1 of our series, covering on-page SEO concepts, and Part 2, explaining essential linking terms and related search engine directives.
In simple terms, “crawlability” refers to the ease with which search engine robots (or “bots” like Googlebot) can “crawl” a website in doing their work of indexing pages to build the search engine results pages (SERPs). There are several tools and best practices webmasters and developers can employ to optimize a website for search engine bots – in other words, maximize its crawlability – as well as minimize the usual crawling errors. The ultimate goal behind site crawlability is to improve the speed and accuracy of the bot’s crawling and indexing of individual pages.
Basic Search Engine Directives
User-agents: A user-agent is software that acts on behalf of a user or a program, such as a Web browser (e.g., Chrome, Internet Explorer, Firefox) or a search engine robot. Each one identifies itself to the Web server with a user-agent string. Web developers and SEOs employ user-agent switchers, which change the user-agent string a browser sends, to test how a site responds and renders when it is requested by different search engine robots. Changing the user-agent of a browser is an advanced method generally reserved for a brand’s “geek squad.” A short list of search engine robots and their corresponding search engines would include:
- Googlebot for Google
- Bingbot for Bing
- Slurp for Yahoo!
- MSNbot for MSN/Live
There is also an extensive range of user-agents and switchers that Web developers can apply to see how different browsers work on mobile devices. Web developers have the option of using custom user-agents; this is a more common practice for larger, more complex websites.
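For readers who prefer a script to a browser extension, here is a minimal Python sketch of the same idea: requesting a page while identifying as a search engine robot. The user-agent strings are the publicly documented ones at the time of writing and may change; the helper names `build_request` and `fetch_as` are our own, chosen for illustration.

```python
import urllib.request

# Publicly documented crawler user-agent strings. Treat these as examples,
# since search engines revise them over time.
CRAWLER_USER_AGENTS = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "bingbot": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
}


def build_request(url: str, crawler: str) -> urllib.request.Request:
    """Return a request that identifies itself with the chosen crawler's user-agent."""
    headers = {"User-Agent": CRAWLER_USER_AGENTS[crawler]}
    return urllib.request.Request(url, headers=headers)


def fetch_as(url: str, crawler: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch the URL (requires network access) and return (status code, body text)."""
    request = build_request(url, crawler)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

Calling `fetch_as("https://example.com/", "googlebot")` would return the status code and HTML the server sends to a client announcing itself as Googlebot. Note that this only changes the announced identity; it does not reproduce how a search engine renders the page.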
Redirects: Another set of search engine directives centers on redirects, which forward a Web page URL to a new Web page address, directing both site visitors and search engine robots to a different Web page. There are two redirects commonly used: permanent (301) and temporary (302).
- 301: 301 is the HTTP (hypertext transfer protocol) status code for a permanent redirect (HTTP status codes are further discussed below). It is the recommended method for Web page redirects, as it passes most of the PageRank status of the original page to the new page.
- 302: 302 designates a temporary redirect. It has traditionally been treated as not passing PageRank, and is generally not recommended unless the move really is temporary.
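To make the difference concrete, the sketch below shows how a server might choose between the two codes. It is an illustration only: the `REDIRECTS` table, its paths and the `resolve` function are hypothetical, and a real site would normally configure redirects in its Web server or CMS.

```python
from http import HTTPStatus

# Hypothetical redirect rules: old path -> (new path, is the move permanent?)
REDIRECTS = {
    "/old-pricing": ("/pricing", True),      # page moved for good -> 301
    "/spring-sale": ("/sale-ended", False),  # short-lived detour -> 302
}


def resolve(path: str) -> tuple[int, dict[str, str]]:
    """Return the status code and headers a server would send for this path."""
    if path in REDIRECTS:
        target, permanent = REDIRECTS[path]
        status = HTTPStatus.MOVED_PERMANENTLY if permanent else HTTPStatus.FOUND
        # The Location header tells browsers and bots where to go next.
        return int(status), {"Location": target}
    return int(HTTPStatus.OK), {}
```

In both cases the visitor lands on the new page; the status code is what tells a search engine robot whether to treat the move as lasting.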
Errors and Best Practices
There are several common and seemingly persistent issues that compromise the performance of websites, resulting in a poor user experience. Among the more typical problems are Web server glitches, faulty redirects, broken links, slow page speeds, duplicate content and multiple URLs for the same page. Fortunately, there are counter-measures that webmasters and developers can adopt to address these issues. Here, we define the problems most often encountered and best practices for preventing them.
Errors: HTTP response status codes range from 1xx to 5xx, indicating five classes of standardized responses to requests for a Web page. Only the 4xx and 5xx classes signal errors. The classes most relevant here are the 3xx redirection (described previously), 4xx client error (a problem with the request, such as a URL that does not exist) and 5xx server error, along with the nonstandard 444 no response.
404 Not Found: You’re most likely familiar with the 404 not found error message, which simply indicates the page URL requested could not be located. This is usually the result of a broken or defunct link. A best practice is to develop a custom 404 page to display to the (likely frustrated) searcher, offering help or guidance in non-technical language. A second 4xx code is 444 no response, a nonstandard code used by the nginx Web server to indicate that the server has returned no information and closed the connection. This is often used to fend off malware.
5xx Server Error: 5xx server error response codes signify the server is aware of an error, and cannot execute the user’s request. The most common 5xx responses are 500 internal server error, 502 bad gateway, 503 service unavailable and 504 gateway timeout.
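Because the first digit of a status code identifies its class, a few lines of Python can sort any code into the right bucket. The sketch below uses the standard library's `http.HTTPStatus` for the official names; the `describe_status` helper is our own illustration, and nonstandard codes such as 444 are labeled as such rather than given an official name.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return a readable label such as '404 Not Found (client error)'."""
    status_class = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes like nginx's 444 are not part of the standard registry.
        phrase = "nonstandard code"
    return f"{code} {phrase} ({status_class})"
```

A crawl report could run every response code through a helper like this to separate broken links (4xx) from server trouble (5xx).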
Canonical link element and canonical HTTP headers: In cases where Web page content may be accessed through multiple URLs, is syndicated and published elsewhere, or is otherwise duplicated, canonicalization is recommended. Canonicalization means defining the single, preferred Web page URL for your content, which consolidates and strengthens both link and ranking signals for greater search visibility. There are several ways to do this, such as adding a canonical link element to a page’s HTML or specifying a canonical link in the HTTP header for downloadable white papers and PDFs, all of which you can find via Google’s Webmaster Help forum. Learn how our own ContentIQ can crawl your site to detect 4xx and 5xx errors and direct you towards getting them fixed.
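As a rough illustration of what a canonical link element looks like to a program, the Python sketch below reads a page's HTML and reports the canonical URL it declares. The class and function names are our own, and the sample URLs are placeholders.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, in any letter case.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

For files that have no HTML to hold a link element, such as PDFs, the same preference is expressed as an HTTP response header of the form `Link: <https://example.com/white-paper.pdf>; rel="canonical"`.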
Site speed: Site speed is a signal in Google’s search ranking algorithm, and the search giant continues to push for a faster internet experience with its mobile-friendly initiative, which encourages webmasters to improve page load time. While rich media is a medium to embrace, it’s important to pay attention to the size of images and “bulkiness” of videos, as they may significantly slow page load time.
XML Sitemaps: An XML sitemap lists a site’s Web pages in a file of XML (“extensible markup language”) tags that detail the organization of your website. Submitting an XML Sitemap to the search engines is a recommended best practice to help search engine bots crawl and index your site’s pages quickly and accurately. You can learn more about XML Sitemaps from our article on the BrightEdge blog.
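To show what such a file contains, here is a minimal Python sketch that builds a sitemap from a list of pages. It follows the public sitemaps.org format (a `urlset` element holding one `url` entry per page); the `build_sitemap` function and the example URLs and dates are placeholders of our own.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, str]]) -> str:
    """Build sitemap XML from (URL, last-modified date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc          # the page address
        ET.SubElement(url, "lastmod").text = lastmod  # date in YYYY-MM-DD form
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

The resulting text would typically be saved as `sitemap.xml` at the root of the site and then submitted to the search engines.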
Resources and Tools
For SEO glossary purposes, we’ve only scratched the surface of how to optimize your website for crawlability. There are several resources that go into much greater depth, including BrightEdge’s ContentIQ, Google Webmaster Tools and Webmaster Central Help Forum. We hope you’ve found our introduction to basic SEO concepts helpful!