We use our own spidering software to automatically crawl each website on the platform once a week (although customers can request manual re-crawls at any time).
If you're seeing the above message and not seeing any crawl-related data in modules like the Meta Data modules, then there could be a number of causes for this:
- Off domain re-directs. Our spider naturally includes a check to see whether a link is taking it "off domain"; if we didn't have this in place, it would follow every link (including outbound ones) and crawl the web indefinitely. Obviously, this would be a poor use of our resources and make a mess of your reports. So, if you added a site as www.example.com and it re-directs to www.example.co.uk, the spider will treat that as an "off domain re-direct" and stop crawling as soon as it has followed the 301 or 302. **If you want to check this quickly, open the Outbound Links module (Pages/Outbound Links) and take a look at the Target page column.**
- IP blocking - your server could simply be blocking our crawler servers. If this is the case, the only fix is to remove the block (our crawler servers have IPs of 188.8.131.52 and 184.108.40.206).
- Robots.txt blocking - our crawler is called **Curious George** (hence the picture). If your robots.txt includes a line blocking this user agent, then we will follow good practice, obey that instruction and not crawl your site. You would need to get this restriction removed if you want us to be able to crawl the site. Pages Crawled will display a message explaining that this is the case. Please bear in mind that our spider runs every week, whereas the robots.txt analysis is done every day, so if you fix the issue with robots.txt, please let us know and we will restart both jobs.
- You've entered a page instead of a domain! - if your homepage re-directs to a specific URL, we will now notify you of this when you try to add a site (using the Add Site button). If you've accepted this change, we will ignore the .html part of the URL when determining what to crawl. So you may have entered the site as www.example.com/home.html, and it may show on the platform as such, but we will look to crawl links from www.example.com.
- You've got a Meta No Index tag on your homepage: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> This tells robots not to index the content of the page and/or not to scan it for links to follow. We obey this because we follow 'white hat' principles, so you would need to remove the tag (or comment it out) to allow us to crawl the site.
- You're hosting multiple sites on your account on the same server. Our spider will deliberately postpone a spidering job if it notices another one also going through a site on the same IP. This is obviously to prevent us from hammering a server where multiple sites are hosted. It automatically suspends the job and then retries an hour later. If the other job is still ongoing, it will postpone again.
- Time Constraints: we will only crawl one site for a maximum of 24 hours before automatically stopping. If we've not covered all of the site which you wanted crawling in this time, please get in touch and ask us to recrawl the site with more spidering threads. Please bear in mind, though, that this will mean a slightly higher demand on your web servers while we have multiple threads requesting pages.
- Odd Site Structure: we do see instances of sites being entered as www.example.com/GB/store/ where the links from that page actually point to other pages with a URL structure of www.example.com/store/GB/shirts/, so the spider doesn't follow these links. We can resolve such issues manually - just raise a ticket and we'll set things up so that the spider crawls such sites.
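Two of the causes above, the robots.txt block and the Meta No Index tag, can be checked against your own files before raising a ticket. The following is a minimal sketch using only the Python standard library, not a description of how our spider is implemented. The function names are hypothetical, and the user-agent token `Curious George` is an assumption based on the crawler's name as given in this article; the exact token sent in real requests may differ.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumption: user-agent token taken from the crawler's name in this article.
CRAWLER_AGENT = "Curious George"


def robots_txt_allows(robots_txt, url, agent=CRAWLER_AGENT):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() works on lines of text, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class _MetaRobotsParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for part in attributes.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def meta_robots_directives(html):
    """Return the set of robots directives found in the page's meta tags."""
    parser = _MetaRobotsParser()
    parser.feed(html)
    parser.close()
    return parser.directives


robots = "User-agent: Curious George\nDisallow: /\n"
print(robots_txt_allows(robots, "http://www.example.com/"))  # False

page = '<head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>'
print(sorted(meta_robots_directives(page)))  # ['nofollow', 'noindex']
```

If `robots_txt_allows` returns False for your homepage, or `meta_robots_directives` returns `noindex` or `nofollow`, one of the two restrictions described above is likely in place.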
NB: if you're comparing our Pages Crawled number with Google's number of pages indexed (using the site: Google 'hack'), then there are a number of reasons why these numbers can vary considerably. Apart from the points already outlined above, you also need to consider the following:
- 404s - we won't count 404s as pages crawled, but will report those in the Deadlinks module. However, Google may include these in its total number of indexed pages.
- Orphaned pages - Google may index these if it has another way of finding them (from external backlinks from other unique domains). Remember, we won't see these as we're spidering internal links only.
- The site: hack may include sub domains, which we will ignore because our spidering software considers all sub domains to be off domain links and will not follow them; Google, however, may be giving you a number of pages indexed for the whole domain. You can easily verify this by running a site: hack and flicking through the sample results Google returns.
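To see how much of the difference comes from sub domains, you can group a sample of the URLs Google returns by hostname. The sketch below assumes that "off domain" means any hostname that is not exactly the one the site was added with; that is an illustration of the rule described above rather than our spider's actual code, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased hostname of a URL, with or without a scheme."""
    if "//" not in url:
        # urlparse only recognises the host when the URL has a "//" prefix.
        url = "//" + url
    return (urlparse(url).hostname or "").lower()


def count_by_host(urls):
    """Count how many of the given URLs fall under each hostname."""
    return Counter(host_of(url) for url in urls)


def off_domain_hosts(site, urls):
    """Return {hostname: count} for URLs whose host differs from the site's."""
    site_host = host_of(site)
    return {
        host: count
        for host, count in count_by_host(urls).items()
        if host != site_host
    }


sample = [
    "https://www.example.com/store/GB/shirts/",
    "https://www.example.com/about.html",
    "https://blog.example.com/news/",
    "http://shop.example.com/",
]
print(off_domain_hosts("www.example.com/home.html", sample))
```

Any hostname listed in the result is one that Google may be counting but that would not appear in Pages Crawled under this assumption.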