33,000 Google Crawl Errors

This one has nothing to do with logo design per se. It’s about how an errant 404 page caused 33,000 crawl errors when Googlebot came a-calling at our site, and how you can avoid the same issue when using WordPress as a CMS.

A few posts ago, we told you about some issues with using WordPress as a content management system, mostly about using pretty permalinks and their strain on your server resources. Here’s another one, though this little hiccup was caused by my sloppiness rather than a built-in faux pas of WordPress itself.

Google Webmaster Tools

Google Webmaster Tools is a great resource for anyone running a website or blog. It’s free to sign up (all you need is a Google account) and it gives you all sorts of diagnostic tools showing how your site is performing, how people are finding it, and how it’s ranking for this keyword and that. For the sake of this discussion, Google Webmaster Tools also shows you how Googlebot (big ‘G’s search bot or spider) sees your site, which links are broken, what code is pooched, etc. The theory goes as follows: the cleaner your code and the more efficient your site map, the better Google will ‘like’ your site. When it comes to showing up in search rankings, Google ‘liking’ your site is pretty important stuff. All fair enough. Now look at the screen cap above, taken from The Logo Factory’s dashboard. That’s not a mistake.

33,000 pages not found.

On a recent trip to our site, Googlebot found 33,252 errors. To be more accurate, Google couldn’t find 33,252 pages on our site that links from other pages told it were there. Crikey. While I imagine that’s some sort of error record, it’s probably not too good in the grand scheme of things, especially when it comes to SEO for logo design, something that’s relatively important around here. How the hell can a site with about 2,000 pages throw 33,000 crawl errors? It has to do with a messed up 301 redirect, an errant 404 page, a relative link, and WordPress’ almost uncanny ability to find pages. Or try to create pages when they don’t exist. And oh yeah, my rank stupidity when setting up a WordPress footer template.

Types of links: relative vs. absolute.

When it comes to links, there are two types: relative and absolute. An absolute link tells a browser exactly where a page is. Something like “http://www.thelogofactory.com/logo-design-gallery/”. A relative link tells a browser where a resource is relative to the page you’re currently sitting on. Something like “../logo-design-gallery/”, the two dots and the forward slash indicating that our logo design gallery is one directory above the page that the link is on. Simple enough stuff. When coding HTML pages, programs like Dreamweaver use relative links by default, while CMS and blog software like WordPress use absolute links. There are arguments for both types, but the main difference is this: an absolute link will work regardless of where the page is located, while a relative link will only work in the directory that it was originally intended to sit in.
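
To make the difference concrete, here is a minimal Python sketch that resolves the same two kinds of link from two different page locations, the way a browser or crawler would. The page URLs and the example.com domain are hypothetical stand-ins, not actual pages on our site.

```python
from urllib.parse import urljoin

# Hypothetical links; example.com stands in for a real site.
ABSOLUTE = "http://www.example.com/logo-design-gallery/"
RELATIVE = "../logo-design-gallery/"


def resolve(page_url, href):
    """Return the URL a browser would request for href found on page_url."""
    return urljoin(page_url, href)


# Two hypothetical pages sitting at different depths of the site.
pages = [
    "http://www.example.com/company/about/",
    "http://www.example.com/blog/2010/some-post/",
]

for page in pages:
    print(page)
    print("  absolute ->", resolve(page, ABSOLUTE))
    print("  relative ->", resolve(page, RELATIVE))
```

The absolute link resolves to the same address from both pages. The relative link resolves to a different address each time, because it depends on the directory of the page it sits on.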

Custom 404 Page

Keeping that in mind, when I was creating a 404 page (a page that is served up when a visitor requests a page that doesn’t exist) I wanted to make it nice and pretty, rather than the default page created by our server. Accordingly, I used a couple of site templates, a header and a footer, to assemble a 404 page that looked like the rest of our site (above). I kept all the links at the bottom of the page active, so that a visitor could have a wide selection of alternate destinations to choose from. They were all absolute links so there wouldn’t be an issue regardless of what level of our site the 404 page was published at. Well, all the links save one. A button to our company page was a relative link, placed there when I set the footer template up in Dreamweaver. Unfortunately, that link would only work when the page was published in a second level of our site. “http://www.thelogofactory.com/directory/404/” kind of thing. Anywhere else, the link would lead to its own 404 page. You can see where this is going, right?
404 image
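
A stray relative link like that one is easy to miss by eye. As a rough sketch of how it could be caught, the Python below scans a template for links that have no scheme (no “http://” or similar). The footer markup and its URLs are invented for illustration and are not our actual template. Note that the check also flags root-relative paths such as “/contact/”.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_links(html):
    """Return every href that has no URL scheme, i.e. is not absolute."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if not urlparse(href).scheme]


# Hypothetical footer markup with one relative link hiding among absolute ones.
FOOTER = """
<a href="http://www.example.com/logo-design-gallery/">Gallery</a>
<a href="http://www.example.com/contact/">Contact</a>
<a href="../how-we-work/">How we work</a>
"""

print(relative_links(FOOTER))
```

Running something like this against a 404 template before publishing would list any link that depends on where the page happens to be served from.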

301s and 404s.

When we were setting up our new site, we 301d a ton of pages (a 301 is a server-side redirect that tells Google and other search engines that a page has moved permanently, so they should index the new page and drop the old one). When writing hundreds of 301 redirects, invariably a few won’t work (there are twitchy naming conventions when using WordPress, and the order of the 301s is critical when redirecting pages with similar names). Sure enough, one of our 301s led nowhere, and threw up a 404 page when Googlebot tried to follow the link. And then Googlebot tried to follow the links on the 404 page. Which worked out fine and dandy. Until it hit the ‘how we work’ link. Which produced a 404 page. With another broken link. Which produced another 404.

And on.

And on.

And on. Approximately 33,000 times. Mess doesn’t quite cover it. That’s why, when I logged into our Webmaster Tools dashboard, I saw the freakish crawl error tally above. Took us a bit to figure out what had happened. The fix was simple once we did. Here’s the thing. If we hadn’t logged on, we never would have been aware of the massive errors that were being recorded by Google. And we would have been left scratching our heads as our site sank in search engine rankings.
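
For anyone wondering how one bad link turns into tens of thousands of distinct errors, here is a small Python simulation of one plausible mechanism. It is a sketch under assumptions: the exact href in our footer isn’t recorded in this post, so the relative link “how-we-work/”, the starting URL, and the function name are all hypothetical. The idea is that every 404 response carries the same relative link, and that link resolves to a brand new missing URL each time.

```python
from urllib.parse import urljoin

# Hypothetical relative link sitting in the footer of the custom 404 page.
BROKEN_HREF = "how-we-work/"


def crawl_404_chain(start_url, existing_pages, max_requests):
    """Follow the footer link from each 404 response until the crawler gives up.

    Returns the list of URLs that would be logged as 'not found'.
    """
    errors = []
    url = start_url
    while len(errors) < max_requests and url not in existing_pages:
        errors.append(url)               # server answers with the custom 404 page
        url = urljoin(url, BROKEN_HREF)  # its footer link resolves to yet another URL
    return errors


# A dead redirect target is the first miss; none of the URLs exist on the site.
errors = crawl_404_chain("http://www.example.com/moved-page/", set(), 5)
for url in errors:
    print(url)
```

Each pass produces a longer URL that has never been seen before, so a crawler has no reason to stop on its own. Raise the request cap and the error count climbs with it.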

Lessons to be learned.

The takeaways from this are manifold. 1) If you don’t have an account on Google Webmaster Central, get one. Now. 2) When you do get one, log in often, just to see what’s what. 3) If you’re going to get clever with your 404 pages, make sure ALL the links are absolute. 4) It’s probably better to set your 404 pages to ‘No Follow’. That’s a little bit of code, added to the HEAD portion of your page, that tells search engine spiders not to follow the links on it. Perhaps the most important lesson is how much damage a pooched line of code can do to a website. And that a little bit of knowledge is a very dangerous thing.
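
The ‘No Follow’ code in point 4 is usually written as a robots meta tag. The Python sketch below holds that tag as a string and checks whether a page’s markup contains it. The sample page and the helper names are made up for illustration. How you actually get the tag into a WordPress 404 template depends on your theme.

```python
from html.parser import HTMLParser

# The robots meta tag that asks spiders not to follow links on the page.
NOFOLLOW_META = '<meta name="robots" content="nofollow">'


class RobotsMetaParser(HTMLParser):
    """Gather the directives listed in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_nofollow(html):
    """Return True if the page carries a robots meta tag with 'nofollow'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "nofollow" in parser.directives


# Hypothetical 404 page with the tag placed in its HEAD section.
PAGE = (
    "<html><head><title>Page not found</title>"
    + NOFOLLOW_META
    + "</head><body>Sorry, that page is missing.</body></html>"
)

print(has_nofollow(PAGE))
```

A check like this could run against the published 404 page after every template change, so a missing tag gets noticed before a spider does the noticing for you.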

Googlebot & hazmat suits.

In any case, as for the FUBAR mess upstairs, we’ve set up some 301 redirects that will hopefully clean things up whenever Googlebot returns to our site. Though, after being so badly abused by my screwed up 404 page, I wouldn’t be surprised if that wasn’t for a while.

Or without the little spider donning a hazmat suit before diving in.

[Footnote: This article was originally posted on our (now) Legacy Blog and moved to its current location for consistency and database functionality. While it was accurate at the time of publication, it is currently posted as part of our historical record and details may have changed.]