An Elephant (and the Internet) Never Forgets

August 17, 2016 | Article | Search Engine Optimization

Bruce Lee once said: “Mistakes are always forgivable, if one has the courage to admit them.” While that may be true, Bruce Lee didn’t have access to the “world wide web,” and internet trolls don’t seem forgiving these days. And certainly, thanks to the internet, they’re also never going to forget.

Or will they? It’s a tricky question to answer and a feat that is rarely attempted without consequence. Melania Trump experienced the internet’s ire after her team attempted to remove inaccurate content from her site by redirecting it to Trump.com. Not only did this cause the Twittersphere to explode with people commenting on the topic, but the removed page was resurrected from the Wayback Machine and reposted on who knows how many major news sites and blogs, quite possibly cementing her mistake on the web for all time.

But You Can Mess With Its Memory

While we expect this for our public figures, it’s a bit more complicated for companies who may want more control over the information shared about them online. Many brands have legitimate business reasons for keeping old content out of archives; American Express's elusive 100k bonus point offer is one example. If everyone could negotiate these types of rewards, other offers would be less appealing.

Unfortunately, the Wayback Machine (archive.org) is far from the only archive that stores old pages. Here’s a list of six archiving sites, most of them lesser known:

https://www.screenshots.com/
https://archive.org/
https://webcitation.org/query
https://www.competitorscreenshots.com/
https://www.webarchive.org.uk/
https://timetravel.mementoweb.org/ - There’s even a Chrome extension for this one

In today’s post, we will discuss different ways to block crawlers like the Wayback Machine (archive.org) from archiving your content. Note that if your site is already live, it has most likely been archived somewhere, and the actions below will only help prevent future content from being archived. Please refer to my next installment on how to remove content from these types of sites if your content has already been archived. That being said, here are some key actions you can take immediately to keep new content out of the archives:

1. User Registration

Put all personal content behind a registration page. That’s right, force your users to register with your site before they can see your content. This will prevent Google and other crawlers from accessing, indexing or archiving your content.
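To make the idea concrete, here is a minimal sketch of a registration gate written as a plain Python WSGI app. It is not tied to any particular CMS or framework; the `ACTIVE_SESSIONS` set, the `session_id` cookie name, and the `/register` path are all hypothetical names chosen for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical in-memory session store; a real site would rely on
# its framework's own session handling instead.
ACTIVE_SESSIONS = {"abc123"}


def gated_app(environ, start_response):
    """Serve content only to requests that carry a known session cookie."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    session = cookie.get("session_id")
    if session is None or session.value not in ACTIVE_SESSIONS:
        # Anonymous visitors, crawlers included, are sent to registration.
        start_response("302 Found", [("Location", "/register")])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Members-only content"]
```

A crawler that never registers only ever sees the redirect, so there is nothing for it to archive beyond the registration page itself.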

2. Block IPs

An .htaccess file can be used to block the IPs of crawlers and other bots from accessing your site. You can start by reading this handy-dandy resource that lists the IPs of all sorts of bad sites out there. If you’re using WordPress, there’s a plugin called WordFence that can automate this for you. This is a temporary fix, because IP addresses are very cheap these days and crawlers change them all the time.
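If you would rather see the logic than the server config, the check behind an IP block list looks roughly like this. It is a Python sketch of the same idea, not an .htaccess rule, and the two ranges are placeholders taken from the reserved documentation address blocks rather than real crawler IPs.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks (RFC 5737);
# swap in the crawler ranges you actually want to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

Matching on whole ranges rather than single addresses buys a little more time when a crawler rotates IPs within the same block, but the list still needs regular upkeep.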

3. Disallow with Robots.txt

Use robots.txt to block the Wayback Machine. You can find instructions on how to do this here, but keep in mind these instructions only work for the Wayback Machine. Other archive sites may have their own disallow commands, or won’t honor robots.txt at all, so this is a temporary solution at best.
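Before you upload a rule like this, you can sanity-check it with Python's standard `urllib.robotparser` module. The sketch below assumes the Wayback Machine crawler identifies itself as `ia_archiver`, the user agent historically documented for this purpose; confirm the current name against the archive's own instructions before relying on it. The example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" is the user agent the Wayback Machine honors.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Report whether ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ia_archiver", "https://www.example.com/offers/"))
print(allowed("Googlebot", "https://www.example.com/offers/"))
```

The check only tells you what a well-behaved crawler would do with the file. It cannot tell you whether a given archive actually honors robots.txt.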

4. Don’t Forget About Search Engines

Search engines love to index content, especially unique content that you may not want to share. You can use the noindex meta tag to prevent most search engines from indexing your content. Follow this link to learn more. Remember to use this option only if you do not want your page to show up in search engines. Just like options 2 and 3, this is a partial fix at best, since not all search engines and crawlers honor the noindex meta tag.
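Once the tag is in place, it is worth verifying that it really appears in the HTML your server sends. Here is a small sketch using Python's built-in `html.parser`; the `PAGE` string is a made-up example page, and `has_noindex` is a helper name invented for this post.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The content attribute is a comma-separated list, e.g. "noindex, nofollow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Made-up example page containing the tag.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body>Private offer</body></html>"
)
```

Calling `has_noindex(PAGE)` on the sample page returns `True`, while a page without the tag returns `False`.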

5. Reserve Your Rights

Adding "All Rights Reserved" to your web page and registering a copyright for your content is, at best, a deterrent. However, taking this step will help immensely when exercising your right to remove content that has been archived by other sites (which I will explain further in my next post).

As you can see, trying to avoid having your content archived is a challenge, but it can still go a long way in preventing more work down the line. If your site has already been archived, keep an eye on our blog—we’ll be posting about how to remove archived sites in the future.

Well, that’s all folks. If you enjoyed reading my article, don’t forget to hit me up via Twitter. See all of you soon at PubCon Las Vegas 2016!