Practical Technology

for practical people.

The Rotting Web

People have this delusion that the Web is a fount of all information. That, if you look hard enough with Google and other search engines, you can find all the answers. That’s crap.

Let’s look closely, shall we? Google, in honor of its 10th birthday, has put up a site that lets you search the Internet as it was in 2001. Why not 1998? Because January 2001 was as far back as they had an archived copy of their index.

This time-machine search engine is fun. For example, the top site for Paris Hilton in 2001 was for three Hilton hotels in Paris. Altogether, there were 79,900 records for the two words. Today, it’s all about Paris the pseudo-celebrity and there are 2,880,000 Web page hits.

But, the site does more than just show how things change. It also shows how things disappear on the Web.

Take, for example, yours truly. I have a very rare last name and I write a lot so I’m easy to Google. If you wanted to find out what I had been writing about in December 2000 or January 2001, you’d find, thanks to the old Google index, that over 90% of my work from then has disappeared, rotted away.

So, if someone were to tell you that I actually hadn’t started writing about Linux until 2001, you’d have trouble finding proof that I’d actually been writing about my favorite operating system since 1997. That’s a trivial example though.

Let’s say Microsoft claims, which it has, that it had changed its business practices because it wanted to do right by its customers and that these changes had nothing to do with the Department of Justice beating it in court. Now, you can find records of this because newer stories state that, yes, the DoJ did beat Microsoft into giving up some of its illegal monopolistic practices.

Let’s go one more step, though. Let’s say you want to know, specifically, what happened in the trial according to ace Microsoft reporter, and I’m pleased to say my friend, Mary Jo Foley, who covered the Microsoft/DoJ case like paint. In January 2001, a search on Microsoft DOJ and Mary Jo Foley would have found 1,820 records.

None of the first twenty, the most popular links, are still active in 2008. And, as far as Google 2008 is concerned those stories don’t exist. That means, for most people, those stories never happened. It means, in short, that if you rely on the Web for “The Truth” you’re relying on something that’s constantly rotting away and falling apart.

In theory, the Internet Archive Wayback Machine has copies of some of those stories. In practice, it doesn’t. Even with its claimed 85 billion Web pages from 1996 until a few months ago, the Wayback Machine only offers a sampling of the Web.

This happens, in part, because people don’t really value information. Many of the sites of 2001 are still around, but their owners don’t feel it’s worth their time to keep old pages online. Often, when a site changes its Web design or CMS (content management system), rather than try to import old pages, the site management just dumps them.

That’s accidental Web rot. Then, there’s the deliberate destruction of records. For example, Jimmy Wales, Wikipedia’s co-founder, tried and failed to cover up using Wikipedia money for personal expenses and tried to rewrite Wikipedia’s other co-founder Larry Sanger out of Wikipedia history. More recently, Wikipedia tried, and failed, to burn away a Wikipedia posting detailing a site, Deletionpedia, which archives articles that Wikipedia deletes.

Those are only examples of attempts to delete or modify Wikipedia records that we know about. I have no doubt that there have been many more successful examples of this so-called encyclopedia being edited to the Wikipedia editors’ own agendas without any discussion or oversight.

Of course, there’s nothing unusual about what’s happening at Wikipedia. Everyone does it. I can’t begin to count the number of e-mails I’ve gotten making various scurrilous claims about Barack Obama based on this or that “news” site. When I bother to check them, I always find these sites’ “news” to be dubious at best and outright lies at worst. The difference between them is only that more people trust Wikipedia than trust “J. Random News Site.”

The real truth is that, both because of neglect and deliberate deletion and editing, you can’t trust the Web. The old saying used to be “Question Authority.” Today, it should be “Question Web Authority.”

2 Comments

  1. I have often complained that the internet is, at best, how things ARE. Finding how things WERE doesn’t work on the internet. Disregarding intentional obliteration, the past is lost through continual updates. A good example is a Google map of a growing area. Try finding maps of a town from 20 years ago.

  2. “People have this delusion that the Web is a fount of all information”. Is that ALL people, MANY, SOME, FEW??? In any case, have you empirical evidence to support this assertion? You might be right, but I’d need some proof.

    “over 90% of my work from then has disappeared,” What! Are you saying that their index, that begins and ends in January 2001, does NOT have your articles written before or after that date? I’m shocked!

    “And, as far as Google 2008 is concerned those stories don’t exist.” This has nothing to do with Google. They don’t exist because the sites that had stored them have pulled them down. It ain’t Google’s fault for that happening. So you should have said “And, as far as today’s search engines are concerned those stories don’t exist.”

    I agree that “web rot” occurs, but don’t imply that Google is the cause. And yes, I know you clarify this later in your article, but mud sticks so be careful.

    “I have no doubt that there have been many more successful examples” However, why were these attempts doomed to failure? Because other people, like yourself, care and are vigilant.

    “The real truth is that … you can’t trust the Web.” Thank the Lord we have you to tell us the Real Truth ™. But what does “can’t trust” mean? If it means you cannot have 100% absolute certainty of completeness and accuracy, then what media form can you trust? None, I’d say. The source of information (i.e. authors) is a guide (and only a guide) to its trustworthiness. And when applied to the Web (blogs, news, ‘pedias, articles) we must be more wary, but only because there is a lower barrier to publishing nowadays than in the past. I trust my local newspaper about as much as I trust the Web news.