Ultimately, I had to make a compromise between feasibility and correctness, but I’m pretty satisfied with the results. In fact, SiteSucker was able to confirm that, besides images, we did manage a 100% conversion for all the existing URLs.
Google Analytics & Full Realization
At first, I spent about a day going through the top 1000 URLs according to the Google Analytics tracker. This was horribly redundant, mind-numbing work, but I didn’t see a way to export all 5 MILLION URLs that Google Analytics had on record for the last month.
I calculated that, at the rate it took me to go through the top 1000, I’d need almost 7 years to double-check the rest. And, of course, I knew we didn’t have that many valid pages on our site.
SiteSucker to the Rescue
I’d used SiteSucker a few times in recent months to double check the health of our site’s link structure. It’s extremely fast and the user interface is very lean (making use of Mac OS’s Console logging application). What I wondered was how I could execute a web crawl not from a site, but from a saved file? Turns out, it’s very easy!
First, make sure SiteSucker’s settings are configured to log the download history (and to save that log).
Then, enter the original site name in the Web URL input and hit Enter. You’re off to the races!
For ~20k links, it took almost 20 minutes for SiteSucker to grab them all. Take the finished log output and snip away the unneeded text to the left and right of each URL. Go ahead and take the extra time to wrap each one in a nice anchor tag (this will help in the next step).
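If trimming ~20k lines by hand in an editor sounds tedious, the same cleanup can be scripted. The following is only a rough sketch: it assumes each log line contains one plain http(s) URL surrounded by other text, which may not match the exact layout of your SiteSucker log. The names `extract_urls` and `to_anchor` are made up for illustration.

```python
import re
from html import escape

# Assumption: URLs in the log are plain http(s) links delimited by whitespace.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(log_text):
    """Return the unique URLs found in the log text, in first-seen order."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(log_text):
        # Drop punctuation that often trails a URL inside a log message.
        url = match.group(0).rstrip(".,;)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def to_anchor(url):
    """Wrap a URL in an anchor tag, escaping characters such as '&'."""
    safe = escape(url, quote=True)
    return f'<a href="{safe}">{safe}</a>'
```

Calling `extract_urls` on the contents of the saved log and mapping `to_anchor` over the result gives one anchor tag per unique link.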
Copy this file and rename it to reflect the new site where you want to test your redirects. In your favorite editor, search/replace the old domain with the test (or new) domain. Wrap the links in basic html and body tags and change the extension to .html. Now we have a very basic HTML page containing all the links from the old site.
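For anyone who would rather script the search/replace and the wrapping, here is a minimal sketch. The function name `build_link_page` and the example domains are placeholders, and it assumes you already have the old site’s URLs as a list of strings.

```python
from html import escape


def build_link_page(urls, old_domain, new_domain):
    """Build a bare-bones HTML page linking to each URL on the new domain."""
    lines = ["<html>", "<body>"]
    for url in urls:
        # Swap only the first occurrence so paths that happen to contain
        # the old domain name are left alone.
        target = escape(url.replace(old_domain, new_domain, 1), quote=True)
        lines.append(f'<a href="{target}">{target}</a><br>')
    lines += ["</body>", "</html>"]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file such as `test-site-links.html` (a hypothetical name) gives you the page to hand to SiteSucker.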
In SiteSucker, go back into the settings and check the Limits. We want to enforce a maximum level of 1 now. This is because we already have all the relevant links in our file – no point in asking SiteSucker to recrawl the entire site for every original link (this would take days).
Finally, drag the html file you created (containing the test or new domain name and all links) into the Web URL bar and let go. SiteSucker dutifully follows every link and reports its findings. Hopefully, you won’t have too many ERRORs, but if you do, it’s quite easy now to rectify them as you have a log showing exactly which redirects failed!
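With thousands of lines in the log, it can help to pull out just the failures. This sketch assumes that failed requests show up as lines containing the word ERROR along with the URL; check that against your own log before relying on it. `failed_urls` is an illustrative name, not anything SiteSucker provides.

```python
import re

# Assumption: URLs in the log are plain http(s) links delimited by whitespace.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def failed_urls(log_text, marker="ERROR"):
    """Collect the URL (or the whole line, if no URL is found) for each error line."""
    failures = []
    for line in log_text.splitlines():
        if marker in line:
            match = URL_PATTERN.search(line)
            failures.append(match.group(0) if match else line.strip())
    return failures
```

The resulting list is a to-do list of redirects that still need fixing.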
Don’t let huge numbers of links frighten you into thinking you can’t make your site better. There are plenty of web crawlers available on many platforms; I have also used and recommend Xenu on Wine. Thinking a bit outside of the box can make it easy to turn that mountain into a molehill again.
How do you maintain your sites’ link “healthiness”? Let us know in the comments.