XML Sitemaps

dniska
Posted 5 years 4 months ago · 9 min read

Arguably one of the more straightforward technical elements of SEO, XML sitemaps are often misunderstood. To get a better understanding of XML sitemaps and how to use them efficiently, it helps to know what they are and what they are not.

What are XML sitemaps?

In its simplest form, a sitemap serves as a road map for search engines to discover your website’s most important content and get further context on your website’s overall structure. In addition to providing search engines with a list of URLs, sitemaps can help search engines find newer content, or content located deep within the website’s architecture, which helps websites with a poor internal linking structure.
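
For reference, here is a minimal sitemap file following the standard sitemaps.org protocol (the URL shown is just a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/important-page/</loc>
  </url>
</urlset>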

Common myths about XML sitemaps:

  1. A sitemap is a list of every page on your site. In reality, there's no need to include every page in the sitemap. Most websites have sensitive content, like investor information; content that does not provide a great user experience through search, like login or account pages; content located behind paywalls; or pages returning non-200 response codes. These are examples of pages that should not be made available to search engines and can be left out of a sitemap.
  2. Sitemaps aren’t needed if my site is well laid out. While a good infrastructure is always important, an XML sitemap is meant to serve as an indicator of the most important content that you want to be crawled and considered for indexation. If you have an enterprise-level site, relying on your infrastructure alone probably isn’t the safest bet to ensure crawling and indexation. Setting up your sitemap to feature your most important pages will assist search engines’ ability to understand what you consider your most important content to be. Since search engines operate on crawl budgets, this can be an advantageous approach for larger sites. If your website has more than 50,000 URLs of important content, creating a sitemap index that contains multiple sitemaps may be the way to go.
  3. Sitemaps tell Google what to index. An XML sitemap does not guarantee that a page will be indexed, just that it will be considered for indexation.

Now that we know the myths and what sitemaps aren’t, how can we use them to improve our site organically?

Using XML sitemaps to your advantage

Ignore ‘priority’ and ‘change frequency’ tags:

Two popular pieces of markup found in XML sitemaps are the ‘priority’ and ‘change frequency’ tags. Many webmasters utilize this markup in an attempt to improve crawl efficiency and highlight a website’s priority content. However, John Mueller of Google has stated that Google ignores these two signals, while the ‘lastmod’ markup is used when Google analyzes a sitemap. Focusing on that tag and making sure that you are including the right URLs will go a long way toward ensuring that your sitemap is crawled efficiently and has the greatest impact.
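
As a sketch (placeholder URL and date), a sitemap entry might look like the following; the last two tags are shown only to illustrate the markup Google has said it ignores:

<url>
  <loc>https://www.example.com/blog/post/</loc>
  <lastmod>2020-03-02</lastmod>
  <changefreq>weekly</changefreq> <!-- ignored by Google -->
  <priority>0.8</priority> <!-- ignored by Google -->
</url>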

Improve your chances of content being indexed

Make your sitemap available to the search engines:

A big first step in making sure that your most important content is discovered is to learn how to create a sitemap and place it in the root directory of your server.

https://www.example.com/sitemap.xml

Next, be sure to provide a link to your XML sitemap in your robots.txt file. This file is one of the first places a search engine bot will visit when it hits a website. There it will find directives on what content to crawl and what content to avoid. By including a link to your sitemap, you help ensure that search engines are discovering and crawling your content.
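
For illustration, a robots.txt file that references the sitemap might look like this (the disallowed path is a hypothetical example):

User-agent: *
Disallow: /account/

Sitemap: https://www.example.com/sitemap.xml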

A final step is to submit your sitemap directly to Google Search Console and Bing Webmaster Tools. According to Google’s webmaster forum, Google doesn’t check your sitemap every time it is updated, only the first time it notices it; after that, it will check your sitemap only when notified that it has changed. You can notify Google through Google Search Console’s sitemap tool, or by using the “ping” functionality to ask Google to crawl your sitemap by sending an HTTP GET request:

http://www.google.com/ping?sitemap=<complete_url_of_sitemap>

For example:

http://www.google.com/ping?sitemap=https://example.com/sitemap.xml

Only include valid URLs:

It’s imperative that your sitemap references URLs that are indexable and return a 200 OK response code. Webmasters, SEOs or dev teams should routinely audit their website’s sitemap to remove pages returning 404 errors, 3xx redirect responses and 500-level server errors. This can be done manually by crawling the sitemap or by utilizing Google Search Console’s XML Sitemap report to identify invalid URLs. Remember, search engines operate on a crawl budget, so every non-indexable URL increases the chance a valid one won’t get crawled.

Use consistent, qualified URLs:

Consistency is important to a properly formatted XML sitemap. Make sure to use consistent protocols: if your website is secure (uses HTTPS), then make sure that the sitemap and all URLs use the secure protocol. Otherwise, your sitemap will contain redirects, which can affect your crawl efficiency and indexation.

Utilize consistent subdomains. Since the XML sitemap provides insight into website architecture and organization, each subdomain should have its own sitemap. This will also help keep your sitemaps as condensed as possible.

Include unique URLs:

Be sure to only include canonical versions of URLs. URLs that include parameters or session IDs can be considered duplicative and should be excluded; otherwise, crawl efficiency and overall indexation could suffer. When conducting regular sitemap audits, be sure to look for any non-canonical URLs and remove them. Again, utilizing Google Search Console’s sitemap report can help you easily identify non-canonical URLs, and checking this report regularly is a good best practice. In addition to Google’s tools in Search Console, leveraging BrightEdge's ContentIQ site audit tools can help SEOs and webmasters identify non-canonical URLs and pages returning non-200 response codes to further audit your XML sitemaps.

Escape non-alphanumeric characters:

A sitemap needs to be UTF-8 encoded. URLs must use entity escape codes for characters like ampersands (&), single quotes (‘), double quotes (“), less than (<), and greater than (>). Also, URLs should only contain ASCII characters.
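
For example, a raw ampersand in a URL must be escaped as &amp; inside the <loc> element (placeholder URL):

<loc>https://www.example.com/search?q=shoes&amp;color=red</loc>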

Limit the size of the sitemap:

The size of an XML sitemap can quickly get out of hand, especially for larger websites like e-commerce sites. When a sitemap gets too big, it can negatively impact the number of URLs that are crawled and indexed, and it can contribute to your web server getting bogged down if it needs to serve large files. To combat this, XML sitemaps should be limited to 50,000 URLs and no larger than 50 MB. This means that larger sites may need to use multiple sitemaps in a sitemap index file.
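
A sitemap index file (the file names here are placeholders) simply lists the individual sitemaps, each of which follows the normal sitemap format:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
    <lastmod>2020-02-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2020-02-10</lastmod>
  </sitemap>
</sitemapindex>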

For larger sitemaps, breaking out sections of content into their own sitemaps can help keep content organized and help avoid sitemap bloat. Creating separate sitemaps for videos, images, and blogs may be a good idea.

Use XML sitemap creation tools:

There are many tools that can assist in XML sitemap creation. Many CMSs have dynamic sitemap creation options that you can use to help manage what content is published in your sitemap file. A CMS like WordPress has several plugins to help manage sitemaps.

Now that you know how to create, format, set up and edit a sitemap, it’s time to prepare the list of your most important content to include and get it submitted to the search engines. Get started today!

Guide to SEO Basic Concepts: Part 3

A BrightEdger
Posted 7 years 7 months ago · 9 min read

In this third and final segment of our guide to SEO basic concepts, we’ll touch on the more advanced SEO definitions and concepts of website “crawlability,” including basic search engine directives, common client and server errors, best practices and web developer resources.

You can access Part 1 of our series, covering on-page SEO concepts, and Part 2, explaining essential linking terms and related search engine directives.


Crawlability

In simple terms, “crawlability” refers to the ease with which search engine robots (or “bots” like Googlebot) can “crawl” a website in doing their work of indexing pages to build the search engine results pages (SERPs).

There are several tools and best practices webmasters and developers can employ to optimize a website for search engine bots – in other words, maximize its crawlability – as well as minimize the usual crawling errors.

The ultimate goal behind site crawlability is to expedite the speed and accuracy of the bot’s crawling and indexing of individual pages.

SEO basic search engine directives

“User-agent” is a general term for software that acts on behalf of a user or a program and identifies itself to the server, such as a Web browser (e.g., Chrome, Internet Explorer, Firefox) or a search engine bot.

Web developers and SEOs employ user-agent switchers, which change the user-agent of a browser, when testing how a site renders when crawled by different search engine robots. Changing the user-agent of a browser is an advanced method generally reserved for a brand’s “geek squad.”

A short list of search engine robots and corresponding browsers would include:

  • Googlebot for Google
  • Bingbot for Bing
  • Slurp for Yahoo!
  • MSNbot for MSN/Live

There is also an extensive range of user-agents and switchers that Web developers can apply to see how different browsers work on mobile devices. Web developers have the option of using custom user-agents; this is a more common practice for larger, more complex websites.
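
For reference, the desktop Googlebot commonly identifies itself with a user-agent string along these lines:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)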

Another set of search engine directives centers on redirects, which forward a Web page URL to a new Web page address, directing both site visitors and search engine robots to a different Web page. There are two redirects commonly used: permanent (301) and temporary (302).

  • 301 indicates a permanent redirect, reflecting the HTTP (hypertext transfer protocol) status code of a Web page (HTTP status codes are further discussed below). It is the recommended method for Web page redirects, as it passes most of the PageRank status of the original page to the new page.
  • 302 designates a temporary redirect. It does not pass PageRank and is generally not recommended.
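
At the protocol level, a 301 is simply an HTTP response (the destination URL here is a placeholder) that points both browsers and bots to the new address:

HTTP/1.1 301 Moved Permanently
Location: https://www.example.com/new-page/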

Errors and Best Practices

There are several common and seemingly persistent issues that compromise the performance of websites, resulting in a poor user experience. Among the more typical problems are Web server glitches, faulty redirects, broken links, slow page speeds, duplicate content and multiple URLs. Fortunately, there are counter-measures that webmasters and developers can adopt to address these issues. Here, we define the problems most often encountered and best practices for deterring them.

HTTP response codes

HTTP response status codes range from 1xx to 5xx, indicating five classes of standardized responses to HTTP requests. The most common problem classes for SEO are the 3xx redirections (described previously), 4xx client errors, 5xx server errors and the non-standard 444 no response.

You’re most likely familiar with the 404 not found error message, which simply indicates the page URL requested could not be located. This is usually the result of a broken or defunct link. A best practice is to develop a custom 404 page to display to the (likely frustrated) searcher, offering help or guidance in non-technical language. Another common code is 444 no response, a non-standard code indicating that the server has returned no information and shut down the connection. This is often used to fend off malware.

5xx server error response codes signify the server is aware of an error and cannot execute the user’s request. These responses range from 500 internal server error to 504 gateway timeout.

Canonical link element and canonical HTTP headers: In cases where Web page content may be accessed through multiple URLs, has syndicated content that is published elsewhere, or is otherwise duplicated, canonicalization is recommended.

Canonicalization means defining the single, preferred Web page URL for your content, which consolidates and strengthens both link and ranking signals for greater search visibility. There are several ways to do this, such as specifying a canonical link in your HTTP header for downloadable white papers and PDFs; details on each method can be found via Google’s Webmaster Help forum. Learn how our own ContentIQ can crawl your site to detect 4xx and 5xx errors and direct you toward getting them fixed.
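
As a sketch (placeholder URL), the canonical link element is placed in the <head> of each duplicate or alternate page:

<link rel="canonical" href="https://www.example.com/preferred-page/" />

For non-HTML resources such as PDFs, the same signal can be sent as an HTTP header on the duplicate resource: Link: <https://www.example.com/preferred-page/>; rel="canonical"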

Site speed

Site speed is a major signal in Google’s search ranking algorithm, and the search giant continues to push for a faster internet experience with its mobile-friendly initiative, which encourages webmasters to improve page load time. While rich media is worth embracing, it’s important to pay attention to the size of images and the “bulkiness” of videos, as they may significantly slow page load time.

Sitemaps

An XML Sitemap lists a site’s Web pages in a file with XML tags that detail the organization of your website using “extensible markup language” (i.e., XML) schema. Submitting an XML Sitemap to the search engines is a recommended best practice to help search engine bots crawl and index your site’s pages quickly and accurately. You can learn more about XML Sitemaps from our article on the BrightEdge blog.

Resources and Tools

For SEO glossary purposes, we’ve only scratched the surface of how to optimize your website for crawlability.

There are several resources that go into much greater depth, including BrightEdge’s ContentIQ, Google Webmaster Tools and the Webmaster Central Help Forum.

We hope you’ve found our introduction to basic SEO concepts helpful!


HREFlang Tags: What Are They?

Yulia Kronrod
Posted 8 years 5 months ago · 9 min read

Global companies must ensure that international customers enjoy the same quality user experience when interacting with their brand as their domestic counterparts. Localization helps create targeted content tailored to the needs of customers in different countries. But how can we ensure that the right content actually reaches those customers? The answer is HREFlang tags.

To help search engines identify URLs to be served to regional users, webmasters can use rel="alternate" hreflang="x" attributes. In this blog, we’ll focus on the technical side of hreflang tag implementation and cover some common mistakes and misconceptions. In a follow-up blog, we’ll share a case study of how we implemented hreflang sitemaps for our company, how we tracked rankings and what the results were. Stay tuned!

Which search engines recognize hreflang annotations?

At this time, hreflang annotations are recognized by Google and Yandex.

  • Google: HTML link element in header, HTTP header, or XML sitemap
  • Yandex: HTML link element and HTTP header but not XML sitemaps (details here)

Bing does not recognize hreflang tags and recommends using “content-language” meta tags instead.
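
As an illustration (placeholder URLs), the HTML link element version places one annotation per language/country version in the <head> of every page in the set; the x-default value designates the page for users who match none of the listed versions:

<link rel="alternate" hreflang="en-us" href="https://www.example.com/en-us/" />
<link rel="alternate" hreflang="de-de" href="https://www.example.com/de-de/" />
<link rel="alternate" hreflang="x-default" href="https://www.example.com/" />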

XML sitemaps vs HTML link element

For sites targeting a large number of geos, XML sitemaps are probably the best option as they are easier to generate, maintain and, importantly, QA as opposed to on-page code that can change without SEOs’ knowledge. Additionally, including hreflang tags in the page code adds to HTML size and slows the page load speed. While Google says that the location of a URL within your sitemap files does not matter and you can structure the sitemaps in any way that makes sense for you, please note the limits for an XML sitemap file: 10MB or 50,000 URLs (<loc> elements). You can choose to include hreflang annotations for a larger set of URLs in individual geo-specific XML sitemaps or create line-of-business sitemaps with a smaller number of URLs but spanning all regions.
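
In an XML sitemap, the annotations use the xhtml:link element (URLs are placeholders). Note that every <url> entry repeats the full set of alternates, including a reference to itself:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en/page.html</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page.html"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page.html"/>
  </url>
  <url>
    <loc>https://www.example.com/fr/page.html</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/en/page.html"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page.html"/>
  </url>
</urlset>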

5 Common Mistakes and Misconceptions about HREFLang

  1. Use of wrong country and/or language codes. Values for the hreflang attribute should either be a language code in ISO 639-1 format ("en" for English) or a combination of language and country codes, where the country code is in ISO 3166-1 Alpha 2 format ("en-US" for English for the US; non-capitalized “en-us” is also acceptable). UK versus Ukraine: “en-uk” would not mean English for the UK (correct: “en-gb”, Great Britain) but rather English for Ukraine. Latin America may also present a challenge for hreflang values: many sites create a single site to target these countries – /la/ or /latam/. “es-la” would unfortunately mean that you’re targeting Laos (!), and “es-latam” is not an acceptable ISO code. So, to target Latin America, one needs to either include multiple country annotations (“es-ar” for Argentina, “es-cl” for Chile, “es-pe” for Peru, etc.) with the same Latin American URL (www.site.com/latam/), as shown in the sketch after this list, or use “es-es” for the Spanish site while targeting the /latam/ site to global Spanish speakers by using the language “es” annotation only. We’ve also seen sites use “es-br” for their Brazilian sites in Portuguese (correct: “pt-br”), “uk-ua” (Ukrainian for Ukraine) for a site in Russian, and “en-be” (English for Belgium) for a site in French.
  2. Lack of hreflang tags for the page itself. Hreflang annotations should be bi-directional, and each language page must identify all language versions, including itself. For example, if your site provides content in English, French and German, all three language versions must include the same references to the English, French and German pages.
  3. Use of non-canonical or redirecting URLs. An example would be the use of a URL with parameters (www.site.com/page.html?trackingid=123) instead of the canonical URL version (www.site.com/page.html) or use of an old URL that got redirected. John Mueller recently confirmed this in his Google+ post: “Make sure any rel=canonical you specify matches one of the URLs you use for the hreflang pairs. If the specified canonical URL is not a part of the hreflang pairs, then the hreflang markup will be ignored.”
  4. You have to implement hreflang for the entire site, otherwise it will not work. Wrong. Hreflang tags will work for any set of URLs you choose, as long as you cover all language/country variations. Clearly, you’d want to focus on top traffic and revenue drivers. An interesting observation: after we implemented hreflang for all 60 regional home pages and a small set of priority URLs, we saw a collateral effect in which other, non-covered regional URLs started to rank in place of US URLs.
  5. Using rel=canonical across different language or country versions. Google recommends not using rel=canonical across different language or country versions; use it only on a per-country/language basis.

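To make the Latin America approach from the first mistake concrete, the multiple-country option annotates the same /latam/ URL once per target country (using the example site from the list above):

<link rel="alternate" hreflang="es-ar" href="https://www.site.com/latam/" />
<link rel="alternate" hreflang="es-cl" href="https://www.site.com/latam/" />
<link rel="alternate" hreflang="es-pe" href="https://www.site.com/latam/" />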