We often hear in the SEO world that ‘content is king’, and to a point that statement has some truth. SEOs and content marketers spend a lot of time creating high-quality content and building links – both external and internal – into these key pages to help them rank. And those efforts can help to improve the general authority of a whole domain.
But for medium-to-large websites, and particularly those that have been around for some years, too much (low-quality) content can actually hold back the ranking potential of those key pages. In cases like this, content is no longer king. Ensuring Googlebot is not wasting time on legacy content, being forced through redirect chains or encountering crawl errors can bring marginal SEO gains that shouldn’t be ignored.
So this post is not about optimising your ‘key’ content or possibly even pages currently linked to on your site, but finding those legacy URLs that Google continue to visit and ensuring they are being handled effectively.
This is all about crawl efficiency and making the most of the crawl budget that is assigned to a given site. Instead of crawling every page of a site each time it visits, Googlebot assigns a ‘crawl budget’ which is dependent on a number of factors. You can read more about crawl budget in this Google post. As Googlebot is not going to visit each page of a site every time it crawls, we want to make the most of the crawl budget assigned and ensure we’re not wasting any on old or weak pages.
Optimising for crawl efficiency
So to optimise for crawl efficiency, we’re looking to identify crawl budget waste and any obstacles we’re putting in the way of Googlebot crawling the pages we want it to crawl. This waste can include:
URLs that are going through redirect chains
A redirect chain happens when an original URL goes through more than one redirect before reaching the destination URL. Ideally, URLs should only have one redirect in place. This is because equity *may* be lost through a redirect, so the more redirects in place the more equity that may be lost. I say *may* because despite Google saying there is no equity loss, there are plenty of case studies proving otherwise. Here’s one for example. In SEO it’s not worth leaving anything to chance, so my preference would always be to remove redirect chains. It also speeds up the site for both search bots and visitors.
Why do redirect chains occur? Typically they surface on large websites where pages have been removed over time. A page gets deleted and redirected elsewhere, but a year later the page it was redirected to also gets deleted and redirected. If the original redirect is not updated to point to the new final page, a redirect chain is in place.
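To make that concrete, here is a minimal Python sketch of how a set of redirect rules could be flattened so that every old URL points straight at its final destination. The `rules` dictionary and both function names are hypothetical; in practice the rules would come from wherever your redirects are stored.

```python
def resolve_final(redirects, url, max_hops=10):
    """Follow a source -> target redirect map until a URL with no further redirect."""
    seen = [url]
    while url in redirects:
        url = redirects[url]
        if url in seen or len(seen) > max_hops:
            raise ValueError(f"redirect loop or overly long chain: {seen + [url]}")
        seen.append(url)
    return url, len(seen) - 1  # final destination, number of redirect hops


def flatten_redirects(redirects):
    """Return a new map where every source points directly at its final URL."""
    return {src: resolve_final(redirects, src)[0] for src in redirects}


# Hypothetical example: /a was redirected to /b, then /b was later redirected to /c.
rules = {"/a": "/b", "/b": "/c"}
flat = flatten_redirects(rules)
```

Here `flat` maps both `/a` and `/b` directly to `/c`, so the original URL goes through one redirect rather than two.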
Old pages that 404 that could be redirected elsewhere
Whilst 404s don’t technically harm your rankings, if we can see Google are regularly crawling pages that 404, we can try and address this so that they spend more time on the pages we want them to be crawling. A 404 page could be redirected to another relevant page to help Google on its way, or if a page has no relevance, a 410 (Gone) status code could be used instead of a 404 (Not found). This should stop Google attempting to crawl the URL once they’ve seen the 410.
Deleted pages with external links that aren’t being redirected elsewhere
For large websites that have been around for some time, it’s not uncommon to find old pages with external links that have been deleted and left to 404. Assuming these are good links, the pages should be redirected to a relevant page on the site that resolves with a 200, to ensure as much link equity as possible is retained.
Using either Ahrefs or Majestic, you can identify all pages on the site that have ever received an external link. Run these pages through a crawling tool like Screaming Frog to identify any that neither return a 200 nor redirect elsewhere. If any are found, redirect them to a relevant page to retain the link equity from the external links.
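If you would rather script the comparison, the check can be sketched in a few lines of Python. This assumes you have already exported the backlinked URLs from your link tool and the status codes from your crawler; the function name and the input shapes are my own illustration, not any tool’s export format.

```python
def backlinked_urls_to_fix(backlinked_urls, status_by_url):
    """Return backlinked URLs that neither resolve 200 nor redirect elsewhere.

    URLs missing from the crawl data have no known status, so they are
    flagged too and can be checked by hand.
    """
    ok_or_redirect = {200, 301, 302, 307, 308}
    return sorted(
        url for url in backlinked_urls
        if status_by_url.get(url) not in ok_or_redirect
    )
```

Anything this returns is a candidate for a redirect to a relevant live page.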
URLs that are being blanket redirected to the homepage which could be considered soft 404s
It’s also not uncommon on old websites to discover a large number of old URLs that have been blanket redirected to the homepage. These could well be considered soft 404s by Googlebot, and therefore the redirects could well be ignored. Instead, if you have a large number of homepage redirects, identify whether these pages could be redirected somewhere more relevant. If not, consider allowing them to 404, or preferably 410.
URLs that are being 302 redirected rather than 301
When redirecting a page, it’s preferable to use a 301 (permanent) redirect over a 302 (temporary) as long as you don’t intend to bring the page back. 301s have been proven to retain more equity than 302s, so if you find old pages being redirected via 302s, updating them to 301s would definitely be advised.
Gathering enough historical URLs to make a difference
So we know what we want to try and identify and then fix, but one issue we have is finding enough historical URLs to make a difference.
Finding historical URLs on medium-to-large sites that have been around for years – let’s say sites of at least 1,000 pages, but likely many, many more – is often not an easy task. Pages get deleted and forgotten about, site migrations take place, and previous SEOs or developers move on, so there is often a lack of information available.
Whilst a crawl of the current site will uncover any issues with current pages that are linked to within the site’s structure, it won’t identify old pages that are no longer linked to. The trouble is, Googlebot likes to revisit pages it’s visited before time and time again. If these ‘old’ pages are not being redirected or handled efficiently, we’ve got some crawl budget waste that we can address.
Ask the existing team
Whilst you’ll be lucky if this one works, the first place to start is to ask the current development team if they have a list / database of historical URLs that you can access. Even in the unlikely event that a list can be provided, you’ll want to use other sources too, to ensure that you cover as many bases as possible.
Wayback Machine
Hat tip to Oli Mason for this one; you can use the Internet Archive’s Wayback Machine to identify all URLs it has archived since it’s been running. If the Wayback Machine has archived a URL, it’s highly likely that Googlebot will have visited it at some point too.
All you need to do is pop the below into your browser, changing ‘DOMAIN’ for the domain you want to view.
You should then see something like this (example for bbc.co.uk):
Simply copy and paste the URLs into Excel and remove the port numbers and other cruft, and you’ve got a list of all URLs that the Wayback Machine has ever archived.
You may have noticed the URL limit appended at the end – if you’re working with sites with potentially hundreds of thousands or millions of historical URLs, you can just amend the limit to grab more. It will take longer to load of course.
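If you would rather script this than copy and paste, a rough Python sketch follows. It assumes the Wayback Machine’s CDX server endpoint and its `url`, `output`, `fl`, `collapse` and `limit` query parameters; the exact URL from this post is not reproduced here, so treat this as one way to build a similar query rather than the same one. The function names are hypothetical, and fetching the URL (for example with `urllib.request`) is left out so the sketch stays offline.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

# Assumption: the Internet Archive's CDX server endpoint.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=10000):
    """Build a query listing archived URLs under a domain, capped at `limit` rows."""
    params = {
        "url": f"{domain}/*",   # trailing /* asks for everything under the domain
        "output": "json",
        "fl": "original",       # only return the originally archived URL
        "collapse": "urlkey",   # one row per distinct URL
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params, safe='/*')}"


def strip_port(url):
    """Remove the port number (e.g. :80) from a URL; any userinfo is dropped too."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.hostname or "", parts.path,
                       parts.query, parts.fragment))


def urls_from_cdx_rows(rows):
    """Turn the JSON rows into a de-duplicated list, skipping the header row."""
    return sorted({strip_port(row[0]) for row in rows[1:]})
```

Raising `limit` works the same way as amending the limit in the browser: more rows, and a longer wait.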
Server logs
Of course, your server logs will show you all URLs Googlebot has attempted to visit over a particular time-frame. To get the most out of this data, you’ll want to try and get hold of as many logs as possible.
If you can get a few months’ worth or even up to a year’s worth, there will likely be a lot of historical URLs contained in the logs that you’ve never heard of, or have long forgotten about.
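As a rough illustration of what you are pulling out of the logs, here is a small Python sketch that counts Googlebot requests per URL and status code. It assumes your server writes the common ‘combined’ log format, so the regular expression will need adjusting for anything else. It also trusts the user-agent string, which can be spoofed, so treat the counts as indicative only.

```python
import re
from collections import Counter

# Assumption: 'combined' log format, i.e. "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requests per (path, status) where the user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[(match.group("path"), int(match.group("status")))] += 1
    return hits
```

Passing an open log file to `googlebot_hits` works, since it only needs something it can iterate over line by line. Sorting the result by count quickly shows which old URLs Googlebot keeps coming back to.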
Using a combination of Screaming Frog’s SEO Spider and Log File Analyser, you can easily identify orphan pages – these are pages that search engines are crawling but are no longer linked to internally on your site. There’s a great guide on how to use the Log File Analyser and section 17 is all about finding orphan pages. To do this you’ll need to import crawl data alongside your log file data, and from there you can view all of those pages directly within the Log File Analyser, and export to Excel for further analysis.
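The underlying idea is simply a comparison of two sets of URLs. As a minimal sketch, assuming you have the URLs seen in your logs and the URLs found by a crawl as plain Python lists (the function name is hypothetical, and this is not how the Log File Analyser itself is implemented):

```python
def orphan_urls(logged_urls, crawled_urls):
    """URLs that search engines requested but that the site crawl never reached."""
    return sorted(set(logged_urls) - set(crawled_urls))
```

Both lists need to be in the same form (for example, both full URLs or both paths) for the comparison to be meaningful.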
Search Console – crawl errors
Whilst limited, the crawl errors report in Google Search Console also often contains URL errors that Google have encountered on a recent crawl.
This is limited to 1,000 URLs at a time, but can still be useful and show you URLs Googlebot has been having trouble with recently.
Combine gathered URLs and remove duplicates
Once you’ve gathered as many historical URLs as possible, merge all of these into Excel and remove duplicates. Now you hopefully have a long list of URLs that have at one stage existed on site.
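If the lists are too big for Excel to handle comfortably, the same merge can be sketched in Python. The normalisation rules below (lowercasing the host, dropping default ports and fragments, treating an empty path as `/`) are my own assumptions about what counts as a duplicate, so adjust them to suit your site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url.strip())
    host = parts.hostname or ""          # hostname is returned lowercased
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"    # keep only non-default ports
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))


def merge_url_lists(*sources):
    """Combine any number of URL lists into one sorted, de-duplicated list."""
    return sorted({normalise(u) for source in sources for u in source if u.strip()})
```

For example, `merge_url_lists(wayback_urls, log_urls, search_console_urls)` would give one clean list to feed into the next step, where those three names stand for whatever lists you have gathered.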
Identifying response codes and redirect paths of historical URLs
With your – hopefully long – list of historical URLs collected, you now want to identify the response codes of these URLs. Remember, some of what we’re looking to fix includes:
- URLs going through redirect chains
- Old pages that 404
- Deleted pages with external links
- URLs that are being redirected to the homepage
- 302 redirects that should be 301s
I like to use the Screaming Frog SEO Spider for this. Set it to ‘List’ mode, ensure you select the ‘Always Follow Redirects’ option, and then just paste in your list and let the spider run.
Once the spider has finished, you want to export the ‘Response Codes’ tab as well as the ‘Redirect Chains’ report.
The ‘Redirect Chains’ report will show you – you’ve guessed it – all redirect chains that your URLs are going through.
The ‘Response Codes’ tab will show you the response code of each URL. Any that don’t resolve 200 can be investigated to ensure we’re cleaning up as much crawl budget waste as possible.
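To show how the exported data maps back to the issues listed earlier, here is a small Python sketch that labels a single URL based on its redirect path. The input shape, a list of `(url, status)` hops from the requested URL to the final response, is an assumption for illustration and not Screaming Frog’s export format. Deleted pages with external links are not covered here, as that check needs backlink data as well.

```python
from urllib.parse import urlsplit


def classify(hops):
    """Label the issues for one URL, given its (url, status) hops in order."""
    issues = set()
    redirects = hops[:-1]                 # every hop before the final response
    final_url, final_status = hops[-1]
    if len(redirects) > 1:
        issues.add("redirect chain")      # more than one redirect before the destination
    if any(status == 302 for _, status in redirects):
        issues.add("302 redirect")        # temporary redirect that may need to be a 301
    if final_status == 404:
        issues.add("404")
    if redirects and urlsplit(final_url).path in ("", "/"):
        issues.add("homepage redirect")   # possible soft 404
    return issues
```

A URL that returns an empty set resolves cleanly. Anything else goes on the list to investigate.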
Fixing as many historical issues as you can find is the aim of the game here.
So to summarise, this post may not be on the most exciting of subjects – although I personally love it – but trawling through the history of your sites and fixing old mistakes can bring marginal gains that can help the content you care about to rank better.
Of course, we can and should optimise for crawl efficiency by ensuring we have a well mapped internal link structure, removing broken links, utilising robots.txt and optimising site speed amongst a number of other things. But uncovering those historical issues that wouldn’t be picked up on a current crawl can really help Google prioritise the content we want them to be looking at.
So if you have an old site, I’d encourage you to follow the process above. You may be surprised by what you find.