
How an errant 404 page caused 33,000 crawl errors when Googlebot came a calling to our site. And how you can avoid the same issue when using WordPress as a CMS.
A few posts ago, we told you about some issues with using WordPress as a content management system, mostly about using pretty permalinks and their strain on your server resources. Here’s another one, though this little hiccup was caused by my sloppiness rather than a built-in faux pas of WordPress itself.
Google Webmaster Tools
Google Webmaster Tools is a great resource for anyone running a website or blog. It’s free to sign up (all you need is a Google account) and it gives you all sorts of diagnostic tools on how your site is performing, how people are finding it, how it’s ranking for this keyword and that. For the sake of this discussion, Google Webmaster Tools also shows you how Googlebot (big ‘G’s search bot or spider) sees your site, which links are broken, what code is pooched, etc. The theory goes as follows: the cleaner your code and the more efficient your site map, the better Google will ‘like’ your site. When it comes to showing up in search rankings, Google ‘liking’ your site is pretty important stuff. All fair enough. Now look at the screen cap above, taken from The Logo Factory’s dashboard. That’s not a mistake.
On a recent trip to our site, Googlebot found 33,252 errors. To be more accurate, Google couldn’t find 33,252 pages on our site that links from other pages told it were there. Crikey. While I imagine that’s some sort of error record, it’s probably not too good in the grand scheme of things, especially when it comes to SEO for logo design, something that’s relatively important around here. How the hell can a site with about 2,000 pages throw 33,000 crawl errors? It has to do with a messed up 301 redirect, an errant 404 page, a relative link, and WordPress’ almost uncanny ability to find pages. Or try to create pages when they don’t exist. And oh yeah, my rank stupidity when setting up a WordPress footer template.
Types of links: relative vs. absolute
When it comes to links, there are two types: relative and absolute. An absolute link tells a browser exactly where a page is. Something like “http://www.thelogofactory.com/logo-design-portfolio/”. A relative link tells a browser where a resource is relative to the page you’re currently sitting on. Something like “../logo-design-portfolio/”, the two dots and the forward slash indicating that our logo design portfolio is one directory above the page that the link is on. Simple enough stuff. When coding HTML pages, programs like Dreamweaver use relative links by default, while CMS and blog software like WordPress use absolute links. There are arguments for both types, but the main difference is this: an absolute link will work regardless of where the page is located, while a relative link will only work when the page sits at the directory level it was written for.
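To make the difference concrete, here’s a small sketch in Python that resolves links the way a browser or spider does. The portfolio URL comes from the examples above; the two page URLs are made up for illustration and aren’t real pages on our site.

```python
from urllib.parse import urljoin

RELATIVE = "../logo-design-portfolio/"
ABSOLUTE = "http://www.thelogofactory.com/logo-design-portfolio/"

# Hypothetical page locations, one level deep and two levels deep.
SHALLOW_PAGE = "http://www.thelogofactory.com/some-page/"
DEEP_PAGE = "http://www.thelogofactory.com/some-directory/some-page/"


def resolve(page_url, href):
    """Return the URL a browser would request for `href` found on `page_url`."""
    return urljoin(page_url, href)


for page in (SHALLOW_PAGE, DEEP_PAGE):
    print(page)
    print("  relative ->", resolve(page, RELATIVE))
    print("  absolute ->", resolve(page, ABSOLUTE))
```

The absolute link resolves to the same address from both pages. The relative link only lands on the portfolio from the shallow page; from the deeper one it points at a ‘logo-design-portfolio’ folder inside ‘some-directory’, which is a different address altogether.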

Keeping that in mind, when I was creating a 404 page (a page that is served up when a visitor requests a page that doesn’t exist) I wanted to make it nice and pretty, rather than the default page created by our server. Accordingly, I used a couple of site templates, a header and a footer, to assemble a 404 page that looked like the rest of our site (above). I kept all the links at the bottom of the page active, so that a visitor could have a wide selection of alternate destinations to choose from. They were all absolute links so there wouldn’t be an issue regardless of what level of our site the 404 page was published at. Well, all the links save one. A button to our ‘how we work’ page was a relative link, placed there when I set the footer template up in Dreamweaver. Unfortunately, that link would only work when the page was published in a second level of our site. “http://www.thelogofactory.com/directory/404/” kind of thing. Anywhere else, the link would lead to its own 404 page. You can see where this is going, right?
301s and 404s
When we were setting up our new site, we 301d a ton of pages (a 301 is a server redirect, an HTTP status code that tells Google and other search engines that a page has moved permanently, to index the new page, and delete the old). When writing hundreds of 301 redirects, invariably a few won’t work (there are twitchy naming conventions when using WordPress, and the order of the 301s is critical when redirecting pages with similar names). Sure enough, one of our 301s led nowhere, and threw up a 404 page when Googlebot tried to follow the link. And then Googlebot tried to follow the links on the 404 page. Which worked out fine and dandy. Until it hit the ‘how we work’ link. Which produced a 404 page. With another broken link. Which produced another 404. And on. And on. And on. Approximately 33,000 times. Mess doesn’t quite cover it. That’s why, when I logged into our Webmaster Tools dashboard, I saw the freakish crawl error tally above. Took us a bit to figure out what had happened. The fix was simple once we did. Here’s the thing. If we hadn’t logged in, we never would have been aware of the massive errors that were being recorded by Google. And we would have been left scratching our heads as our site sank in search engine rankings.
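For anyone who wants to see the runaway effect in miniature, here’s a rough simulation in Python. It’s a sketch under stated assumptions, not a record of what Googlebot actually did: the site is a made-up two-page example.com, the footer link ‘how-we-work/’ is a hypothetical relative href (not the exact one from our template), and the crawler is capped at a fixed number of fetches.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical site: only these two URLs really exist.
SITE = "http://www.example.com/"
REAL_PAGES = {SITE, SITE + "how-we-work/"}


def links_on(url, footer_href):
    """Return the links a crawler would find at `url`."""
    if url in REAL_PAGES:
        return []  # links on real pages are ignored to keep the sketch small
    # Every other URL serves the custom 404 page. Its footer link is
    # resolved against whatever URL was requested.
    return [urljoin(url, footer_href)]


def count_crawl_errors(start_url, footer_href, max_fetches):
    """Crawl from `start_url` and count the distinct URLs that return a 404."""
    seen = set()
    queue = deque([start_url])
    errors = 0
    while queue and len(seen) < max_fetches:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if url not in REAL_PAGES:
            errors += 1
        queue.extend(links_on(url, footer_href))
    return errors


start = SITE + "moved-page/"  # stands in for the 301 that led nowhere
print(count_crawl_errors(start, "how-we-work/", 1000))         # relative footer link
print(count_crawl_errors(start, SITE + "how-we-work/", 1000))  # absolute footer link
```

With the relative link, every 404 page points at yet another address that doesn’t exist, so the error count runs all the way up to whatever cap you set. With the absolute link, the spider hits the one dead redirect, follows the footer to a real page, and stops at a single error.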
Lessons to be learned
The takeaway on this is manifold. 1) If you don’t have an account on Google Webmaster Central, get one. Now. 2) When you do get one, log in often, just to see what’s what. 3) If you’re going to get clever with your 404 pages, make sure ALL the links are absolute. 4) It’s probably better to set your 404 pages to ‘nofollow’. That’s a little bit of code, a robots meta tag added to the HEAD portion of your page, that tells search engine spiders not to follow the links on it. Perhaps the most important lesson is how much damage a pooched line of code can do to a website. And that a little bit of knowledge is a very dangerous thing.
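Lessons 3 and 4 are easy to check by machine. The sketch below uses Python’s built-in HTML parser to look over a 404 page for links without a scheme (relative links) and for a robots meta tag, along the lines of <meta name="robots" content="nofollow">. The sample page, its example.com address, and the ‘../how-we-work/’ link are stand-ins I made up for the demo, not our actual template.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageAudit(HTMLParser):
    """Collect relative hrefs and note whether a robots nofollow tag is present."""

    def __init__(self):
        super().__init__()
        self.relative_links = []
        self.has_nofollow = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            # No scheme (http:, https:, mailto: ...) means the link is not
            # absolute. Note this also flags root-relative links like "/page/".
            if href and not href.startswith("#") and not urlparse(href).scheme:
                self.relative_links.append(href)
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            directives = [d.strip() for d in content.split(",")]
            if "nofollow" in directives:
                self.has_nofollow = True


# Made-up 404 page for the demo.
SAMPLE_404 = """
<html><head>
<meta name="robots" content="noindex, nofollow">
</head><body>
<a href="http://www.example.com/logo-design-portfolio/">Portfolio</a>
<a href="../how-we-work/">How we work</a>
</body></html>
"""

audit = PageAudit()
audit.feed(SAMPLE_404)
print("Relative links:", audit.relative_links)
print("Has nofollow:", audit.has_nofollow)
```

Run against a real template, an empty list of relative links and a nofollow tag in place would have saved me this whole mess.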
Googlebot & hazmat suits
In any case, as for the FUBAR mess upstairs, we’ve set up some 301 redirects that will hopefully clean things up whenever Googlebot returns to our site. Though, after being so badly abused by my screwed up 404 page, I wouldn’t be surprised if that wasn’t for a while.
Or without the little spider donning a hazmat suit before diving in.
very nice logo post.
So, your name is ‘logo design’, it’s linked to logochefs.com and your comment on a post about WordPress CMS and Google crawl errors is “very nice logo post?” That might be interpreted as comment spamming by some. Not me. But some.
Ah Steve, Looks to me like L.D. up there was only pinching a log….Ooh!
Sounds similar to pinching a loaf but it was a log. Oh. definitely a log…
Ooh!
Where’s the eX-lax when you need it?
Informative post Steve.
I actually logged into Webmaster Tools today and was surprised to see that I had a few errors – hence my searching around for some answers. I set up WordPress as a CMS, but for some reason I am getting sitemap errors for the /blog/.
No idea why this is as all of the links are absolutes, 404 is simple and I have no 301s in place.
I have a similar problem with thousands of crawl errors but the urls are working fine. I tried to include the link to the google help forum but it was detected as spam.
Now, you mentioned relative vs. absolute links. I assume my links are relative and you suggested to change it to absolute. If i do that, would i have to go through all the 47000 url errors that are already relative links?
i am having same problem and im having trouble now in indexing in Google Search
Thank God i am able to find this post. I have been searching for months on how to fix not found errors on my blog and this post has come to my aid.
Me too, went to “webmasters” and found a boatload of crawl errors.
I’d cleaned up my category structure a couple weeks earlier, “I thought!” ugh…
Now another thing to learn about blogging and google. sheez…
sandy
hi, i have 2 situations with the crawling errors: First I imported on my WP site one of my blogspot blogs (using import blog plugin) and after the import all the keyword tags were interpreted in WP as categories. Of course, did not want that and i used “convert category to tags” plugin. The problem is that after a while (3 weeks) i got a lot of crawl 404 errors where my tags are correlated in category-related links (therefore looked like googlebot “knew” that those tags were imported as categories initially). Now i do not know how to remove those 404 errors. SECOND problem is that on the 404 error list i have a URL related to feed error (and containing some page links). When i check all those links all of them seem to work fine but do not know why they are displayed as 404 error. Is anyone aware of such situations? Thanks!
Came here looking how to avoid crawl errors and now looking at having a logo done. Great post and site!