An Exploration of Information Validation Methods on the WWW

or "Your eyes can deceive you, Luke".

Introduction

The World Wide Web (WWW, or web) is an enabling technology. It allows any vaguely computer-savvy person to publish information on any subject, with any perspective, for a relatively small amount of money (compared to traditional press methods). The WWW has enabled grassroots organizations from all over the political spectrum to organize and flourish. The WWW has allowed news that would normally be suppressed to get out. It has made available views that were previously "unmarketable" and hence unheard.

Easy access to freely publish brings with it a price: easy access to freely lie, subvert, blur, or propagandize. Before the WWW, it was a relatively expensive proposition to get your opinions out to a large number of people; you had to be fairly serious about it to get it done. Plenty of low-tech organizations have published blatantly false or purposely blurring propaganda pieces throughout history, but usually these could be easily identified by the quality of the material or the reputation of the publisher. The "skill" of determining the validity of a published work is practically unconscious (probably too much so, even in the print world): we trust works that look trustworthy.

Now our society has to relearn the unconscious mechanisms of information validation. We have to adapt them to a new medium and invent new standards. In many ways this effort feeds back into the traditional media channels of print, television, radio, and so on: once we learn when to trust electronic words and when not to, we learn to be critical and conscious of all the information we receive.

As a consumer of web information I have begun building my own criteria for judging the validity and quality of web content. In this document, I will explore the techniques I use to form an opinion of web content, as well as the tactics providers (including myself) use to affect the judgement of the reader, and I will apply those techniques to a couple of specific examples.

Layout Details

The first thing a reader registers when accessing a web page is the layout. The reader sees the layout before anything else. If there is an unconscious process of validation this is probably where it should start.

There are probably three major styles of web pages: well done glitzy pages, well done functional pages, and poorly done pages. Putting a page into one of these categories is not necessarily relevant to verifying the content, but it does help put the rest of the opinions in context: it gives some indication of the intent and character of the content provider.

I usually make a quick visual scan for a number of common elements:

Broken images are particularly easy to spot. If images on a page are broken, it implies that the page is under construction (correlate this with the age of the page, described below, to find abandoned pages), that it was put together by someone with very few computer skills and little familiarity with the web, or perhaps that there are simply server problems beyond the control of the provider.
The number of images on the page is a good indicator of the glitz factor. Pages with a lot of large image files are a hassle to download with a modem, and they usually weren't designed with functionality (getting the word out) in mind. Also, time spent processing and uploading images is time that was not spent on content creation. Images are another very subjective judgement, though: a tasteful, aesthetic page with a moderate number of images shows that work and thought were put into the page's creation, and it gives the impression of permanence. Plain unformatted text gives the impression of a memo or news item that is impermanent.
Many sites now have ad banners. These are images that a company pays the provider to display. Often these images are accompanied by "cookies" that stick in your browser and help advertisers track your interests. Most sites that display ad banners are commercial sites, and users should beware of the perspective that comes with that. Commercial sites exist to make a profit by selling their ads, and the perspective of their advertisers may affect the content they decide to post (to be in line with the political/economic agenda of their advertisers). Of course the ads may be for products that are directly in line with the information the user is seeking; moderate use of advertisement is not an immediate sign of invalidity.
Use of backgrounds, text color, frames, Java, and JavaScript is another aesthetic judgement that can give some insight into the content provider's background and perspective. Each of these features can be (and more often than not is) abused to the extent that the page is annoying or impossible to read. These features tend to be browser specific, and without a very good graphic design sense it is very difficult to make a page look good when they are used. Proper use of these features, or restraint from using them, shows a dedication on the part of the content provider to provide useful information rather than glitz. Overuse of these features suggests the provider is probably a newcomer to the web who is in too much of a hurry to publish glitzy (likely for-profit) content to learn HTML style.
Many servers send information to the browser about the page's last modification time. This information is usually buried under a "Document Information" menu item. Correlating the last update with the nature of the content helps pinpoint the timeliness of the information and the attention the content provider pays to it. Some well written content does not need to change, but most dedicated sites will provide new information on a regular basis.
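
The last-modified check in the final item above can also be done outside the browser. What follows is only a minimal sketch in Python, assuming the server reports the date in the usual HTTP "Last-Modified" header format. The function name page_age_in_days and the sample header value are my own hypothetical illustrations, not part of any tool mentioned here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def page_age_in_days(last_modified_header, now=None):
    """Return how many whole days have passed since the Last-Modified date."""
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Assumption: a header without zone information is treated as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - modified).days


# Hypothetical header value, in the format a server would typically send.
header = "Tue, 15 Nov 1994 12:45:26 GMT"
# A fixed reference time keeps the example reproducible.
reference = datetime(1995, 11, 15, 12, 45, 26, tzinfo=timezone.utc)
print(page_age_in_days(header, reference))  # prints 365
```

A large age is not damning by itself; as noted above, it has to be weighed against the nature of the content.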

Check the Source

After the initial layout scan (for a veteran web user the layout scan phase only takes a few minutes, and happens mostly unconsciously while skimming the actual content), there are a number of more technical things that can be checked. Given some meta information about a document (its URL for instance) and some knowledge of the organization of the Internet, a user can track down a significant amount of information about the source of the document.

The URL contains several pieces of information; the most important of these is the server name, which can tell you a lot about the source.

If the document in question is not the root document on the server, the user can check the root document to determine what the business of the server is and how it relates to the business of the document. Is the server a general service provider? Is it owned by a commercial company or a university? Is it dedicated to a related topic? Is it a professional hosting/web authoring service?

The last two fields of the server name make up the domain name for the network the server is on. The user can use a service like "whois" (maintained by InterNIC) to determine who the site contact and owner of that domain name are. Similarly, the user can use InterNIC services to search for similar domain names (sometimes a provider will "spoof" an existing site by using a similar domain name). The user can learn a lot about the permanence of the server and the investment that went into it by comparing the domain of the email addresses mentioned in the document, the domain of the URL, and the domain of the email addresses mentioned in the "whois" entry.
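
The domain comparison described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a definitive tool: the function names and the example.org addresses are hypothetical, the "whois" addresses are assumed to have been copied out by hand, and the "last two fields" rule is the heuristic from this essay (it does not hold for server names such as www.example.co.uk).

```python
from urllib.parse import urlsplit


def domain_of_host(host):
    """Return the last two dot-separated fields of a server name."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def domain_of_url(url):
    """Return the domain (last two fields) of the server named in a URL."""
    host = urlsplit(url).hostname or ""
    return domain_of_host(host)


def domain_of_email(address):
    """Return the domain (last two fields) of an email address."""
    return domain_of_host(address.rsplit("@", 1)[-1])


def domains_agree(url, page_emails, whois_emails):
    """True when every given address shares the domain of the URL.

    Note: with no addresses at all this returns True, which says nothing.
    """
    expected = domain_of_url(url)
    addresses = list(page_emails) + list(whois_emails)
    return all(domain_of_email(a) == expected for a in addresses)


# Hypothetical values for illustration only.
url = "http://www.example.org/docs/report.html"
print(domains_agree(url, ["author@example.org"], ["hostmaster@example.org"]))  # True
print(domains_agree(url, ["author@example.org"], ["admin@example.net"]))       # False
```

A mismatch is not proof of anything; it is simply a prompt to look more closely at who actually runs the server.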

It costs about $100 per year to own a domain name. For this reason many providers host their websites on someone else's server (their university's, a commercial hosting service's, etc.). The first path info field of the URL tells something about the provider's relationship with their server. If the URL contains a tilde, as in my page http://www.imsa.edu/~wolfgang, it means that the page is a user home page on the server; these can be set up without any intervention from the administrators of the server. It is likely (though not certain) that any information provided by such URLs is fairly informal. In many cases this may be preferable (when looking for a "fan page" for a musical artist, I look for the tilde because it indicates the page was not created as an ad by the artist's recording label). If the tilde is not present, then either the server is run by the provider or, at the very least, the server administrator has approved the creation of the pages. This information must be correlated with knowledge of the size of the server (that is, how many users it serves), which can be guessed at from some of the domain information gathered above.
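
The tilde test is simple enough to automate. The sketch below, in Python, checks whether the first path field of a URL begins with a tilde. The function name is_user_home_page is a hypothetical name of my own, and the handling of "%7E" (an encoded tilde) is an added assumption about how some URLs are written.

```python
from urllib.parse import unquote, urlsplit


def is_user_home_page(url):
    """True when the first path field of the URL starts with a tilde."""
    # unquote() turns an encoded "%7E" back into "~" before the check.
    path = unquote(urlsplit(url).path)
    segments = [s for s in path.split("/") if s]
    return bool(segments) and segments[0].startswith("~")


print(is_user_home_page("http://www.imsa.edu/~wolfgang"))    # True
print(is_user_home_page("http://www.imsa.edu/%7Ewolfgang"))  # True
print(is_user_home_page("http://www.imsa.edu/"))             # False
```

As with the other checks, the result only hints at how formal the page is likely to be; it says nothing about whether the content is correct.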

What do others think

The real judgement of a website comes from the collective consciousness of the rest of the net. Most major topics have an agreed upon "primary home site" or "official home page" or whatnot. For any major site you should be able to find other people linking to the site, references to the site from official or credible sites, and/or reviews of the site in netnews.

Using search engines to check which pages link to the page can be very revealing (and it can help you find related sites). There are also search engines for netnews into which the user can enter the URL in question and find out if anyone has had anything to say.

The "web of trust" rule on the Internet tends to be that while you can rarely trust any one person, you can usually trust a group of opinions found in separate forums. Similarly, you can trust the resources that a group of (otherwise unrelated) people have endorsed (newsgroup FAQs, for instance).

A topic's presence in newsgroup discussions, or in links on others' pages, should be proportional to how "big" the topic seems. For instance, if a particular conspiracy theory seems really important, but no one seems to know about it, it is not because it is covered up; it is because it is bunk and people are ignoring it.

Content

Finally, for certain areas of interest the task of verification is a little easier. If the user is knowledgeable in the subject area, then some information can be verified, lending credibility to the rest. This should not be a standalone method, however, because many false sites hide behind a mask of "seeming" reasonable and knowledgeable.

Even if the user is not as fluent in the subject, using a collection of unrelated sites on the subject should be able to establish a baseline of agreed upon fact. A site's consistency with this baseline helps to establish its credibility.
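
One way to make the baseline idea concrete is sketched below in Python. It is only an illustration under loud assumptions: claims are reduced to plain strings collected by hand, a claim counts as "agreed upon" when more than half of the sites make it (the min_share threshold is my own arbitrary choice), and the code cannot tell whether the sites are truly unrelated. All names and data here are hypothetical.

```python
from collections import Counter


def baseline_claims(claims_by_site, min_share=0.5):
    """Return the claims made by more than min_share of the sites."""
    counts = Counter()
    for claims in claims_by_site.values():
        counts.update(set(claims))  # count each claim once per site
    threshold = min_share * len(claims_by_site)
    return {claim for claim, n in counts.items() if n > threshold}


def consistency(site_claims, baseline):
    """Return the fraction of the baseline that a site also asserts."""
    if not baseline:
        return 0.0
    return len(set(site_claims) & baseline) / len(baseline)


# Hypothetical data: three sites and the claims each one makes.
sites = {
    "site-a": {"claim-1", "claim-2", "claim-3"},
    "site-b": {"claim-1", "claim-2"},
    "site-c": {"claim-1", "claim-4"},
}
baseline = baseline_claims(sites)
print(sorted(baseline))                        # ['claim-1', 'claim-2']
print(consistency(sites["site-c"], baseline))  # 0.5
```

A low score does not show that a site is wrong, only that it departs from what the others agree on and deserves a closer look.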

Examples

I chose two topics to research using these guidelines. The first topic was a familiar one, the folk singer Ani DiFranco. The second topic was something I only recently became aware of, Internet Censorware.

Ani DiFranco

Ani DiFranco is a modern independent folk artist with a very large fan base. Because of her independence (she owns her own record label), information on her is not usually available through normal channels (very few of her concerts are on Ticketmaster, and there aren't as many interviews with her in the mainstream media as with other artists). Because of her large, fiercely loyal fan base, there are a great number of fan pages devoted to her.

For a musical artist it is often good to read a few fan pages, as well as the official page for the artist. The official page is usually maintained by the record company, while the fan pages tend to be unauthorized and amateur. The official page usually has the best information about the tour schedule and upcoming albums (that's their business, after all), while the unofficial pages usually have interesting interviews and random trivia.

The first step is to gather as many sites as possible, to get an idea of what is out there and where. Surfing these sites should show some patterns. A good starting point for information about any musician is Yahoo (Top:Entertainment:Music:Artists:By Genre:Rock and Pop:DiFranco, Ani). Surfing random sites from Yahoo uncovered two web rings devoted to Ani. The Annie DeFazio Webring contains 10 sites. The Ani DiFranco Webring contains 72 sites.

I looked into the policies of these webrings, and it seems the DeFazio ring (the name is a joke referring to people frequently misunderstanding DiFranco's name) is a moderated group attempting to allow only fresh new sites into the ring, while the DiFranco webring is completely unmoderated. The DeFazio ring seems to have the sanction of the "Secret Garden" mailing list, which further surfing revealed to be a splinter list off of the official Ani DiFranco mailing list. Surfing shows the sites on the DeFazio ring are indeed of higher quality than those on the DiFranco ring.

In skimming the lists of Ani pages, it became pretty easy to tell the unique ones from the repeats. Most pages were simply collections of lyrics, lists of bootlegs, lists of concert data, and random praise (uninspired praise at that). Any page that used too many exclamation marks, or acknowledged that it was simply another in a series of fan pages, was uninteresting. The following sites really stood out as worthwhile:

Miki's Random Ani Page: This is a very well done fan site. It has links to useful resources, and some random fan things (song parodies, funny images, concert stories) that are actually interesting and well presented.
Flemming and Tamulevich Tour Dates: Although Ani's record company does not have an official page, her booking agent has this tour dates page. This seems to be the most up to date and complete tour information available. All the other pages seem to copy this information (incompletely).
Alan's Page: This is more traditional than Miki's page, but it's straightforward and presents the useful information without a lot of glitzy flashing and animation.

Unless I was looking for bootlegs or random pictures of Ani, none of the other pages really had anything to offer that could not be found at one of these three sites. There were a few sites with concert reviews that were interesting to read. Some of the more glitzy sites crashed the Macintosh I was using, and I had to switch to a more stable UNIX browser to complete my research. This is a general webwide problem, and another reason to avoid glitzy sites.

Censorware

Censorware has been marketed towards parents and schools that want to provide their children with Internet access but want to protect them from pornography. The censorware business initially began to boom with the introduction of the Communications Decency Act (which has since been found unconstitutional); censorware was a method by which many sites were able to legally continue to operate during the time the act was in place. Recently a friend of mine on an electronic forum pointed me to a site that reported on censorware censoring more than pornography. I decided this would be interesting to investigate.

My starting point was given to me by my friend. Peacefire claims to be an organization of teenagers acting as a watchdog against the censorware industry. My initial impression was that this site might just be a bunch of kids whining that they should have a right to view porn, but since my friend recommended it I explored the site anyway. Upon exploration I found the site very informative. It was free of advertisement and seemed to be dedicated and serious about its cause. There was a separate article about each major censorware product, and each article contained links to supporting information. All supporting information was found on different servers. It would be hard to mount a conspiracy of lies across so many servers, many of which were major reputable sources like eff.org and wired.com. In the cases where claims of sites being censored were verifiable, links were provided to documentation, from the companies that wrote the software, that verified the claims.

The overall claim was that censorware products censor much more than pornography. They often censor important (clean) political information such as freedom of speech pages, the National Organization for Women, gay rights pages, HIV related pages, breast cancer research pages, and civil liberties pages; the list goes on and on. In every instance the accusation was backed by a link to a report or story from another site. Links to verifiable news items related to federal legislation supporting censorware were also provided.

I was able to verify enough of the information on this site that my faith in its validity is fairly strong. Technical validation through whois shows that the Peacefire organization is based out of Vanderbilt University in TN; their domain was established in 1996. The webserver has up to the minute information (including news for today). It seems to be a very active effort, rather than something someone threw up and let stand. The page has the "modern web page" feel, with frames and tables, but they are not overdone, and it is still navigable in Lynx (this actually takes real effort).

During my research I got some strange behavior from peacefire.org as though it had dropped off the net. I suspect that perhaps it is the target of frequent denial of service attacks.

To get some further perspective on the matter I consulted Yahoo. Past experience has shown Yahoo to be mostly neutral on political matters. I turned up the following categories:

The first provided lists of censorware vendors and information about their technology. This list is useful for cursory verification that the software Peacefire targets really represents the major players in the censoring arena. Further study of the individual sites reveals the attitudes of the individual businesses, and shows how slick they sound when one doesn't know the "dirt".

The second category included a number of sites with debate on the issues of censorship on the net. These sites seemed to back up Peacefire's claims.

I could not find a strong third-party counterargument to the Peacefire claims. Most censorware sites claim to be in active support of the First Amendment.

Conclusion

Forming an opinion about the validity of web sites is a complex process. It is not perfect; rather, it is somewhat stochastic. It involves looking at the information from as many different sources as possible and looking at each document in its context of the webs of knowledge and trust. Having access to technical knowledge about the Internet may provide an edge in some cases of spoofing.

Tools

The following sites are of general use:

DejaNews Netnews search engine
AltaVista WWW search engine
InterNIC Internet database
Yahoo WWW directory