Summer 2011
An Internet for All Time
– The Wilson Quarterly
The Web’s size defies comprehension. That, in turn, defies easy archiving.
Ever since humans learned to write, they have collected their works into archives, seeking to convey their wisdom and history to future generations. The rise of the Internet poses a daunting new challenge, and not only because of the huge quantity of information it contains.
The very nature of the Web poses a problem, notes Ariel Bleicher, a writer living in New York City. Anything published online “exists in a perpetual state of being updated, and it cannot be considered complete in the absence of everything else it’s hyperlinked to.” As many as two billion people regularly go online, and many of them do a lot more than passively absorb content: They comment, create their own videos, play games, interact with friends.
The number of URLs indexed by search engines has exploded from 50 million in 1997 to about three trillion now, but that is only a small part of the entire Web. By some estimates, the total “surface” Web accessible to archivists’ tools may be six times as large as the indexed areas, and the “deep” Web, which includes password-protected sites and certain types of databases, may be 500 times larger still.
Beyond the sheer magnitude of the archival task there is the thorny question of legality. Only a few countries have laws that permit archivists to copy and save virtual documents. In the United States, much of what appears online is copyright protected. The Library of Congress archives only government Web sites and several thousand other sites whose administrators have voluntarily consented.
And there are technological hurdles. Digital archivists track content with the help of “crawlers”—computer programs that scour the Web. Crawlers cannot see the “hidden” Web: password-protected sites, isolated pages not connected to the broader Web, and “form-fronted” databases that require users to enter search terms in order to pull up information. Existing crawlers have difficulty recognizing “rich media”—anything that moves when a user interacts with it—and other new forms of content.
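The reachability limit described above can be illustrated with a toy breadth-first crawler. This is a minimal sketch, not the software archivists actually use; the `TOY_WEB` mapping and the `crawl` function are hypothetical stand-ins for a real link graph and fetcher. A crawler can only discover pages that some already-visited page links to, so an isolated page never enters its frontier:

```python
from collections import deque

# A toy "web": each URL maps to the list of URLs it links to.
# "orphan" links out but nothing links to it, so it is invisible
# to a crawler -- a miniature version of the "hidden" Web.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["archive"],
    "archive": [],
    "orphan": ["home"],
}

def crawl(web, seeds):
    """Breadth-first crawl: visit each page reachable from the seeds once."""
    seen = set(seeds)
    frontier = deque(seeds)
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:       # follow each hyperlink only once
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(TOY_WEB, ["home"]))  # 'orphan' is never discovered
```

Password-protected and form-fronted pages fail at a different step (the fetch rather than the link discovery), but the effect is the same: if the crawler cannot reach or read a page, that page is absent from the archive.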
Finally, who is going to do all of the work of archiving? Google hasn’t made digital archiving a priority, and many of the nonprofit foundations and government offices that have sprung up to fill the void are too small and too resource-strapped for such a large project.
Part of the difficulty is knowing what will be of interest to future historians. Indexes of goods that have been sold on eBay may seem trivial today, but they’re just the sort of data that can help illuminate our culture in future centuries.
THE SOURCE: “A Memory of Webs Past” by Ariel Bleicher, in IEEE Spectrum, March 2011.