Easily finding cached versions of a site/page when it's down or gone

Have you ever had a web site "go dark" on you? Or found that a given page on a site somehow disappeared? Maybe it's only temporary (there may even be a "we're down" message, though the site or server may just fail to respond at all), or maybe the failure of the page or site will be permanent.

The good news is that there are at least two easy ways that you may well still be able to see that content you may be missing: the Google cache (to at least see the last version which Google may have cached), and the internet archive "wayback machine", which often lets you see YEARS back in the history of a page or entire site, including one that may be long-gone.

In this post I share tips (and gotchas) on using both tools.

They aren't GUARANTEED to have the page you're looking for, but I find that they do about 99% of the time I try them (and I use them a lot, because I'm often mining gold in old blog posts or articles which have gone away across many sites I have visited).

[Updated June 9 in a variety of ways, mostly minor, but with some additions in the "trip down memory lane" discussion.]

Google's cache (and cache: keyword)

So first, if the content you're interested in is indexed by Google, note you will have two ways to see the last version of the page as they indexed it (and this is true of many of the search engines, like Bing and so on).

Available link in Google searches

First, when you search for content, Google always shows the page title as a link and under that the page's URL. To the right of that URL, you may never have noticed but there is a small down-pointing triangle, which if you click it will generally offer "cached" as an option. See an example below, where I took the screenshot after clicking on it:

Image of Google search result showing cached option

And if you click that "cached" option, it will show the page as it looked when Google last visited the site (to grab its content to store in its db). Often that will be exactly what you were looking for, even though the "real" page is failing.

There are a couple of potential gotchas:

  • sometimes the images on the cached form of the page will be broken. There can be many reasons for that, but there's really nothing you can do about it
  • it could be possible that by the time Google visits the site again, it will see new content that replaces what WAS there before (what you WERE looking for), so that you may not be able to see the content you hoped to
  • related to the above, of course if you're searching for a page that you know used to exist, and you use some keywords that you KNEW were on that page in the past, yet Google doesn't find it, it could be that Google HAS found new content on the page, and so no longer thinks it should be found based on the search keywords you used

All three of these issues may be solvable with the archive.org option I discuss below.

Google's "cache:" search keyword

Before moving on to that, I wanted to share another tip related to using the Google cache to find prior versions of some page. Did you know that there is a "cache:" keyword, which you can use to find any cached version of a given url? Of course this is only useful when you DO KNOW the URL of the page, and just want to find if there's a cached version of it.

But that does have value. Consider the case where the site you're trying to visit is just *temporarily* unavailable. You may follow a link to it, and get an error that the page was unavailable. But you have the URL in your browser (or can get it from the link you clicked). In that case, just use the Google cache: keyword in front of the URL to see if Google might show you what IT saw as the last good version of the page. So for instance, with that page I offered above, I could get to the cached version using:

cache:http://blogs.coldfusion.com/post.cfm/coldfusion-2016-installer-refreshed

Again, to be clear: I'm saying you would use THAT as your Google search criteria. But there are some tips that can help make that easier to do.

Tips on the Google cache: keyword

Note first that you could also take off the protocol (http:// in that case) and it would still work. This is useful because these days many browsers DO remove that protocol from the display, so it's nice to know that if you just have the domain/page part of the url, you don't need to manually guess at and type in the protocol. So I'm saying this would work just as well compared to the above:

cache:blogs.coldfusion.com/post.cfm/coldfusion-2016-installer-refreshed
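If you ever want to build that search text in a script rather than by hand, here is a minimal sketch in Python. The function name `cache_query` is my own invention for illustration, not part of any Google tool; all it does is drop a leading http:// or https:// (since the keyword works either way) and put "cache:" in front.

```python
def cache_query(url: str) -> str:
    """Build a 'cache:' search query for a URL (hypothetical helper).

    A leading http:// or https:// is dropped, since the post notes the
    keyword works with or without the protocol.
    """
    for prefix in ("http://", "https://"):
        # Compare case-insensitively, but keep the rest of the URL as typed.
        if url.lower().startswith(prefix):
            url = url[len(prefix):]
            break
    return "cache:" + url


if __name__ == "__main__":
    print(cache_query(
        "http://blogs.coldfusion.com/post.cfm/coldfusion-2016-installer-refreshed"
    ))
```

The result is just the text you would paste into the search box or the browser's URL field.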

Also useful is that most modern browsers will let you do searching from the URL toolbar itself (the field in which you type URLs), so you need not visit the google site or even use a browser-provided google search box. So not only can you just type your search keywords there where you'd enter your URLs, you may often find that when you get such a failed page (for a URL you entered which you know or assume is a valid URL), again you can just add the "cache:" keyword in front of it and hit enter and bang, you're taken to the cached version by way of your search engine of choice (if one exists).

(And yes, you can change what search engine is used by default in the browser, but that's beyond the scope of this article.)

I'll add also that many browsers offer plugins to more easily access a page cached in a search engine, whereby while you're viewing some page you can just right-click on the page content or click some button, and the plugin would automatically take you to the cached version of the page.

The internet archive (archive.org) "wayback machine"

Finally, while the Google cache (or the cache of your favored search engine) is cool, a problem is that in nearly all cases you only get to see up to one version back. So again if the site has changed dramatically, or is gone permanently, soon Google (and other engines) will no longer find the page, or what they find will soon not be based on the content that USED to be there.

In such cases, or just for fun or research, you can use the "internet wayback machine" at the Internet Archive project (where the main site, archive.org, has a whole lot MORE than just archived web pages including millions of other archived documents, audio, and video).

Just drop your URL into the field offered there and you'll see (if the page has been archived by them) that it will show the latest archived version of the page. More important, it will also show a calendar-based indicator of when previous versions of the page were archived. You can select any to see pages from that time in the past, which is just so cool when it works.

For some reason, the Adobe blogs site stopped being archived in Nov 2016, so I can't show the URL of the blog post above (more on such gotchas below). But here is the last archive of the FRONT page of the blog from then: https://web.archive.org/web/20161122113518/http://blogs.coldfusion.com/. And you can see how you can look at previous months and years, back to when archive.org started archiving the site in Jan 2012.
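Notice the long number in that archive URL: 20161122113518 reads as a 14-digit year-month-day-hour-minute-second stamp (here, Nov 22, 2016). As a rough sketch, assuming archive URLs keep that shape, the Python below pulls the capture time and the original address back out of such a link. The name `split_wayback_url` and the regular expression are mine, for illustration only.

```python
import re
from datetime import datetime

# Assumed shape: https://web.archive.org/web/<14 digits>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_wayback_url(archive_url: str):
    """Return (capture_time, original_url) for a dated archive link.

    Hypothetical helper: raises ValueError if the link does not match
    the assumed pattern above.
    """
    match = WAYBACK_RE.match(archive_url)
    if match is None:
        raise ValueError("not a dated web.archive.org link: " + archive_url)
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)


if __name__ == "__main__":
    when, original = split_wayback_url(
        "https://web.archive.org/web/20161122113518/http://blogs.coldfusion.com/"
    )
    print(when.isoformat(), original)
```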

Tips on the wayback machine:

And note that in most cases, you can click links within an archived page, and you will be taken to an archive version of that link. Indeed, I've used this as a way to get access to old download files or media, since even such binary files do generally get archived.

Here's another tip: if you just want to see what archive exists for a given URL, you can preface it with "https://web.archive.org/web/", so for instance to see the old archived versions of the front page of Google, use https://web.archive.org/web/www.google.com. (A commenter kindly reported that they needed to put a /* after /web in that example for it to work. I recall now seeing that in the past, too. Maybe it's a browser- or OS-specific variation. I wasn't seeing it when I prepared the above.)

Note that I didn't need to use the protocol (such as http:// or https://), though if you used it that would work also.
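To make that prefix tip concrete, here is a small Python sketch that builds both forms of the link: the plain prefix I described, and the variation with /* after /web that the commenter needed. The function name `wayback_url` and its `list_all` flag are hypothetical names I chose for this example.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(url: str, list_all: bool = False) -> str:
    """Prefix a URL (with or without its protocol) for the wayback machine.

    With list_all=True, a "*/" is added after /web/, which is the
    variation a commenter reported needing.
    """
    middle = "*/" if list_all else ""
    return WAYBACK_PREFIX + middle + url


if __name__ == "__main__":
    print(wayback_url("www.google.com"))
    print(wayback_url("www.google.com", list_all=True))
```

If the first form doesn't work in your browser, try the second.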

Tripping down memory lane:

And for fun and a trip down memory lane, check out what the Google web site used to look like in 1998. Talk about old school. :-)

But hey, I need to watch out about the pot calling the kettle black. Here's one of the first archived versions of my site from about 10 years ago. Not much has changed, I have to admit...except the age of that face in the picture...and of course hundreds of resources created since then hidden under all those links and buttons. :-)

Or here's the domain I used to use (the company name was one I used from 1995-2003) and how it looked when first cached 19 years ago. Eew. :-) But you can see I was creating CF content then, including the link on the left "CF Tips" which was my form of an early blog (before blogs were a thing). And be patient when you click the link for the tips: because I changed the URLs back then, for some reason it redirects to a version of the page from later in the year (more on still worse potential gotchas in a moment).

Finally I had wanted to show the original Allaire.com site, the company that created CF. It used to be archived and it was fun looking at that from 1995-ish. Sadly, it's gone from archive.org. Seems that when the company held a 20-year reunion in 2015, they put in a redirect to a new reunion.allaire.com domain, and now that takes precedence when looking for the site in archive.org. So that leads to the next section.

Some wayback machine gotchas:

As much as I love the archive.org site, it's clear from the above that it doesn't always work out for a specific site or page you may seek. It may not be there, or a redirect on the site may confuse it if it gets cached.

The reason that a site or page may not be there could be for technical or legal reasons. Or it may just be a mistake.

Note that on the main web archive page there are buttons and resources down the page and on linked pages for trying to get archive.org to archive a site or page, and in the FAQ are discussions on how one can get a page or site removed.

One of the more tragic things that can happen, wiping out all past archive.org content for a site, is when a given domain (for an old site) is given up by the original owner and acquired by someone else (perhaps a squatter/ransom-holder, or someone creating a new site with that old domain). If they happen to put in a robots.txt file to block access, the archive.org site will block access to past archived files. :-(

(I had posted a discussion/complaint about this in their forum last year if you're interested in that topic and want to add your voice.)
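For those who haven't seen one, a blocking robots.txt can be as short as two lines. The Python sketch below uses the standard library's robots.txt parser to show how such a file reads to any crawler that honors it. To be clear about the assumptions: the robots.txt text, the "examplebot" name, and the example.com URL are all made up, and this shows only how the file format is interpreted, not how archive.org itself applies it.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that tells every crawler to stay out of the whole site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "examplebot" and the URL are placeholders for illustration.
    print(is_blocked(BLOCK_EVERYTHING, "examplebot", "http://example.com/old-post"))
    print(is_blocked("", "examplebot", "http://example.com/old-post"))
```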

So bottom line: you can't always rely on the wayback machine to have the page you seek.

Still, the archive.org and the Google cache have both saved my bacon many times, and I hope they may help you also.

What prompted this post?

I've been meaning to share this info for years, as I find I am also helping people learn about it in my CF server troubleshooting. I've also mentioned it in passing in some of my past blog posts here, such as to find some old page or file I wanted to highlight.

But some readers will connect the dots that I have created this post this week especially in the wake of the sudden, unexpected, and prolonged disappearance about 2 weeks ago of the ColdFusion team blog, which for years has been a great resource. (I did a post just recently on the 100 most interesting posts from the CF team blog from the past 3 years.)

I have chosen NOT to open this post with that mention, as I am sincerely hoping that this will be a temporary problem and that we may soon be able to see all the content again.

But until then, and if sadly somehow the content is lost, we will have the means above to see much of the site's content, and maybe they will help if you hit a similar problem with another site.

[An update on June 9: CF product manager Rakshith Naresh confirmed in a comment that the blog's disappearance is indeed only temporary, as I had hoped, and also that "every single post" will be preserved. Let's hope that covers comments, too, which are often as valuable. But if not, again the wayback machine will have them for any who need them.]

Comments
Thanks Charlie!

I had to add a star.

Example: http://web.archive.o...*/houseoffusion.com
# Posted By Phillip Senn | 6/8/17 2:35 PM
Thanks, Phillip. You know, I had seen that in the past also, but I confirmed what I was seeing did work. Perhaps it's somehow a browser- or OS-specific variation. I have tweaked the entry to help future readers.
# Posted By Charlie Arehart | 6/8/17 2:45 PM
Thanks for the post, Charlie.

I would like to clarify that the blog is down only temporarily and we will back with every single post that was earlier available.
# Posted By Rakshith Naresh | 6/8/17 7:36 PM
Thanks for that clarification, Rakshith. Glad to hear you confirm my hopes on both points (I figured it would be back. I hoped it would include all old posts.)

Can you confirm also if you will also be preserving all the previous comments to each? Sometimes they were as valuable as the blog post itself.
# Posted By Charlie Arehart | 6/9/17 11:02 AM
archive.org has proved useful many times in recovering clients website content after it has been hacked or lost due to host problems.
Just had one this week in fact, the client was with TalkTalk who built and hosted the site, they let it get hacked, wouldn't fix it, so the site been down since last year, impossible to get FTP access. Used the Wayback machine to find an old version from 2014, which it seems is the last time the site was actually working properly.
# Posted By Snake | 6/30/17 5:21 AM
Copyright ©2017 Charlie Arehart