A very important part of an SEO’s job is to ensure content gets indexed by search engines, as a page that isn’t indexed can’t drive organic traffic to a website. But this doesn’t mean the approach should be to let everything on a site get indexed. Too many low-quality URLs can bloat the index and hold back the ranking potential of the pages you want to perform well.
Many sites with index bloat issues have fallen victim to Google’s Panda algorithm in the past, and more recently Google’s regular ‘quality’ updates now that Panda has been rolled into the core algorithm. Therefore, to protect your site from running into trouble because of index bloat, you need a firm view of the pages Google – and other search engines – are including in their index for your site. This post aims to cover what index bloat is, how to identify it, and how to rectify it if you believe there is a problem.
What is index bloat?
Index bloat occurs when a search engine indexes significantly more URLs for a site than it should have in its index.
A good way of looking at it is your sitemaps should ideally contain all URLs that you want to be indexed. On the basis that your sitemaps are up-to-date and contain everything you want to be indexed, if the number of the pages in the index significantly exceeds the number of pages in your sitemaps, you’ve likely got a case of index bloat.
You can also crawl your site with a tool like Screaming Frog and compare the number of URLs picked up in the crawl vs what’s listed in the index. If the index has lots more URLs than the crawl, then there’s likely an index bloat issue.
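The sitemap-vs-index comparison above boils down to simple set arithmetic. Here’s a minimal sketch in Python – the URL lists are placeholders, and in practice you’d export them from your sitemap files and a Screaming Frog crawl:

```python
# Indexed URLs that don't appear in your sitemaps are likely bloat
# candidates worth investigating.

def find_bloat_candidates(sitemap_urls, indexed_urls):
    """Return indexed URLs that aren't in the sitemaps, sorted."""
    return sorted(set(indexed_urls) - set(sitemap_urls))

sitemap_urls = [
    "https://example.com/",
    "https://example.com/products/widget",
]
indexed_urls = [
    "https://example.com/",
    "https://example.com/products/widget",
    "https://example.com/search?q=widget",  # internal search result
    "https://example.com/tag/widgets",      # thin tag page
]

for url in find_bloat_candidates(sitemap_urls, indexed_urls):
    print(url)
```

The output here would be the internal search and tag URLs – exactly the sort of pages the rest of this post covers.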
Whatever Google have in their index for your site, that’s what they are using to determine your site quality. If you want to maintain your rankings and benefit from core algorithm updates, then you want to ensure that you have as much high-quality content in the index as possible, and ensure that as much low-quality content as possible is removed. Otherwise you risk Google taking a dim view of your site and suffering from said updates. And once that happens, it can take a long time for Google to change that view.
How to check how many pages you have in the index
There are two main ways to view how many pages you have in the index. Google Search Console has a handy ‘Index coverage’ report which will give you a current number and also show you if that has changed over time. If you notice a spike in the graph and the number of URLs has suddenly increased it’s usually worth looking into.
The other way to get the number of indexed pages for a site is performing a simple site:domain query in the search engine itself.
This isn’t quite as accurate as using Search Console but still gives you a good idea of the number of indexed pages. In the case above asos.com has approximately 2.7m URLs in Google’s index. Using advanced search operators we can drill down further to the types of pages in the index, but as a first look a site:domain query will give you a top-down view of the number of pages in the index.
Typical pages that cause index bloat
So what pages typically cause index bloat? At a basic level any URL that isn’t adding value to the user shouldn’t really be in the index. But although it varies from site to site, there are some common URL types that you should be looking to restrict from the index.
Internal search results
As they typically wouldn’t add any value to a user coming from a search engine, internal search results should be blocked from crawling by search engines and also not linked to, to prevent them from being indexed.
Tag pages
Whilst tag URLs can be useful for some sites, they can also create a significant amount of weak pages on a site that offer little unique content.
Pagination
Paginated URLs typically offer little value to users coming from search engines. As a user you would almost always want to land on the first pages of a paginated series when coming from a search engine, so my preference is to always reduce the amount of paginated content in the index. The use of rel="next" and rel="prev" can help Google understand your pagination and therefore reduce the number of indexed paginated URLs.
Filters / facets
Whilst filters and facets can be really useful for users when searching within a site, it’s not advisable to allow all the different combinations of results to be indexed by search engines. You can quickly create a duplicate content issue if you’re not careful.
Weak pages such as ‘out of stock’ pages
Weak pages – those that add little unique value to the user and/or would likely result in the user bouncing back to the search results – are another type of page that can cause index bloat. As an example, asos.com have 278,000 products in Google’s index that are out of stock, which is around 10% of their total indexed pages. It’s likely a significant number of these could be removed from the index, especially if they are long-term out of stock.
Multiple URL versions and development sites
It’s not uncommon for some sites to have more than one version of a URL indexed (e.g. the www URL and non-www URL). This creates obvious duplication and index bloat but can be easily solved by using canonical tags, implementing redirects and only linking to one URL version. Staging sites for testing can also create duplicate content issues if indexed, so measures should always be taken to prevent them being crawled and indexed.
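The www/non-www duplication above is easy to picture as a normalisation step. Here’s a small sketch using the standard library – the choice of the non-www host as canonical is an assumption; pick whichever version your site actually redirects to:

```python
# Collapse www and non-www versions of a URL to one canonical form.
from urllib.parse import urlparse, urlunparse

def canonicalise(url):
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # strip the www. prefix (assumed canonical choice)
    return urlunparse((parts.scheme, host, parts.path, "", "", ""))

urls = [
    "https://example.com/page",
    "https://www.example.com/page",
]
print({canonicalise(u) for u in urls})  # collapses to a single URL
```

In practice the same collapsing is what your canonical tags and redirects should achieve for search engines.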
Old URLs that are now redundant
Most sites have historical URLs that drive little to no organic traffic and aren’t ever accessed by users. A content audit is a good place to identify these types of pages, and if they can’t be improved then it’s a good option to remove them from your site and the index.
Identifying pages that you want to remove
When dealing with index bloat, one of the trickiest parts to tackle is getting a list of URLs that you want to remove. Google don’t make it easy to get a full list of the URLs in their index for a particular site, especially when you’re working with a site that has thousands or even millions of URLs.
If you suspect index bloat, one way to narrow down finding URLs is to use advanced search operators within Google itself. You can search for a particular URL string or sub-folder to look at different sections of a site at a time. And sometimes you have to look in detail, as Google don’t put it on a plate for you.
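When there are many sections to check, it can help to generate the operator queries up front. These query strings are just examples of the `site:` and `inurl:` operators to paste into Google, not an API call – the domain and sub-folders are placeholders:

```python
# Build a site: query per sub-folder so each section can be checked
# in Google one at a time.

def site_queries(domain, subfolders):
    return [f"site:{domain} inurl:{folder}" for folder in subfolders]

for q in site_queries("example.com", ["/tag/", "/search/", "/page/"]):
    print(q)
```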
I had a recent example of this when an SEO friend of mine asked me to take a quick look at a food and drink site they were pitching to do some work for. The site had slowly been losing organic traffic for a sustained amount of time but they weren’t too sure why. Tag URLs were added to the site around the time of the decline but this wasn’t believed to be the cause of the drop.
As with any site that has lost a significant amount of traffic over time, I always start by identifying if there is any index bloat. Whilst I didn’t have access to Search Console for this site, a quick site:domain query showed they had just over 5,000 pages in the index. This matched up pretty well with a crawl of the site, so on the face of it index bloat didn’t appear to be an issue.
However, sometimes you have to dig a little deeper to find a case of index bloat. I did some further index checks on sub-folder URLs of the domain, including the tag URLs that had been mentioned previously. Again, this didn’t show any major problem with just seven tag URLs in the index.
But when I scrolled down to the bottom of the search results after performing the query above, I found something that immediately confirmed to me that this site had an index bloat problem with their tag URLs:
If you see this message after performing a site: query, it’s a pretty sure way to know that Google have pages in their index that even they don’t want to display and therefore have chosen to omit from the initial view.
I then hit the ‘repeat the search with omitted results included’ link and got the below:
So seven tag URLs in the index was actually more like 10,400. And for a site with around 5,000 ‘standard’ pages, that’s a lot of cruft that needs to be dealt with.
There are other tools you can use to help you get a list of pages in the index. URL Profiler has a great ‘Google Indexation’ checker which you can run a list of URLs through to see if they are indexed. Of course, in this case you need a list of URLs to check in the first place.
The new Google Search Console also has a useful section in the ‘Index coverage’ report which shows you indexed URLs that are not included in sitemaps. From experience this isn’t extensive, but can help give you a steer of where to start when looking for index bloat.
How to remove index bloat
Once you have a list of pages that you want to remove from the index, there are a number of methods you can use to start the process of getting them removed.
If the pages return a 200 status code, are still linked to on the site and not blocked in robots.txt, setting the meta robots tag to ‘noindex, follow’ will eventually see Google drop them from their index.
If the pages you want to remove no longer return a 200 status code, then using a 410 would be preferable over a 404. A 410 status code informs a search engine that the page has gone, whereas a 404 means not found. From experience pages with a 410 status code drop from the index more quickly than pages with a 404.
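The decision logic in the two paragraphs above can be sketched as a simple function. The inputs are deliberately simplified assumptions – a real audit would pull the status code and robots.txt state from a crawl:

```python
# Suggest a removal method for a URL based on its current state,
# following the logic described above: noindex for live pages,
# 410 over 404 for pages that are gone.

def removal_method(status_code, blocked_in_robots):
    """Return a suggested way to get a URL dropped from the index."""
    if status_code == 200:
        if blocked_in_robots:
            # Google can't crawl a blocked page, so it can't see a noindex tag
            return "unblock in robots.txt first, then noindex"
        return "meta robots 'noindex, follow'"
    # Page is gone: a 410 drops from the index faster than a 404
    return "serve a 410 (gone) rather than a 404"

print(removal_method(200, False))
print(removal_method(404, False))
```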
Of course if you are removing pages you can also redirect them elsewhere, or if you have similar pages that need to remain you can use the canonical tag to reduce the amount of duplicate URLs in the index.
Whilst not officially supported by Google, I’ve also had success using noindex within robots.txt to remove unwanted URLs from the index at scale.
There is also the URL Removal Tool within Google Search Console, which temporarily takes pages out of the index for 90 days. Ideally I prefer to use other methods than the URL Removal Tool, as it’s a temporary solution and I’m not convinced it actually removes the page from the index. It’s just a theory, but I suspect it prevents a URL from showing in search results for 90 days, rather than actually removing it from the index.
Whichever method you use to remove URLs, it’s usually not a quick process. Whilst the URL Removal Tool method will work within a few hours, the other methods require Google to re-crawl the URLs to take the changes you have made into account. And it often takes more than one crawl for Google to reflect the changes you have made. I’ve worked on sites where it’s taken several months to clear up index bloat, so you need to stay patient.
To speed things up I’d recommend creating sitemaps that contain the URLs you want to remove and submitting these in Search Console. It might sound counter-intuitive, but doing this can help push Google to crawl the URLs more quickly. You can also monitor the number of indexed URLs within the sitemaps themselves, which should start dropping over time.
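One way to build that ‘removal’ sitemap is with the standard library alone. The URLs below are placeholders for the pages you want Google to re-crawl and drop:

```python
# Build a minimal XML sitemap of URLs you want re-crawled and removed.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")

to_remove = [
    "https://example.com/tag/old-topic",
    "https://example.com/search?q=widgets",
]
print(build_sitemap(to_remove))
```

Save the output as something like removal-sitemap.xml, submit it in Search Console, and watch the indexed count for that sitemap fall as the removals take effect.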
The rewards of fixing index bloat
As mentioned previously, many sites have suffered from algorithm updates because of significant index bloat issues. From my experience, fixing these issues can result in very positive gains when core algorithm updates take place.
I’ve personally worked on a site in a competitive niche which was suffering from a significant and sustained loss of organic traffic. After taking a drastic approach of removing approximately 90% of indexed pages, we saw a fantastic uplift which greatly surpassed previous organic traffic levels.
Similar URLs can cause issues on a page-level basis
Whilst sites with index bloat issues are at risk from core algorithm updates, that’s not the only time when rankings can suffer because of index bloat. I recently had an interesting case of index bloat for a freelance client of mine who operates in the cosmetic surgery space.
When I took them on as a client their service pages were often struggling to rank, with the homepage often ranking instead of the service pages, and usually not on page 1. After some initial optimisation things started moving in the right direction, especially when their most lucrative service started ranking with the correct page. But a combination of adding new images to that page and a Yoast SEO bug that pushed those image attachment URLs into the index resulted in an interesting case study of index bloat.
Suddenly Google had multiple URLs in their index for my client’s site that were very similar, and the result was that the ranking URL changed a number of times. First unrelated service pages started ranking, then one of the image URLs itself ranked for a short while, and then the homepage started ranking again.
After I noticed the Yoast issue and ensured the image URLs redirected to the parent page, eventually they dropped from the index. And only then did the correct URL start ranking again.
If you have a WordPress site and are using Yoast SEO – which is a great free plugin by the way – make sure you enable the setting that redirects attachment URLs to the parent page. Yoast have apologised for the bug – which is now fixed – and also created a Search Index Purge plugin which will help those affected by the bug to remove image URLs from the index as quickly as possible.
Hopefully this post has explained what index bloat is, how you can identify it, and why fixing it can bring impressive results. If you have a site that has been losing organic traffic and struggling when core algorithm updates hit, then I’d encourage you to look into whether index bloat could be a cause.
Even if your site is flying and gaining traffic, I’d still encourage you to keep an eye on what you have in the index. Sites that don’t keep an eye on the index risk Google indexing low-quality pages that may count against them in future updates.