Brewster Kahle: Universal access to all human knowledge

This is an essay of sorts based on Brewster Kahle's speech at the NotCon '04 conference. The speech is available at NotCon's site in MP3 and Ogg audio formats and in QuickTime video format. By all means, listen to the audio or watch the video, because this essay pales in comparison to Brewster's enthusiasm.

This essay is a free-form transcription of the speech, abusing and re-using Brewster's original phrases, and adding some misunderstandings of mine. The facts are based on the speech; I haven't done any research of my own. I'm open to suggestions for improving the text and correcting any mistakes it contains. You'll find my email address at the bottom of this page.

Brewster Kahle and the people at The Internet Archive (http://archive.org) have an ambitious dream: to make an archive of all human knowledge and to make the archive available to everyone on the globe. (Of course, some of the stuff will be for a fee, some of it for free.) Sounds like an ambitious project, and indeed it is, but in his speech Brewster makes a compelling argument that such a project is actually plausible. Not only that, but he can show a pretty convincing track record so far. Furthermore, he invites all of us to participate in the project.

[Clay tablet is a rather inaccessible technology, but great for preservation.]

Around 300 BC, the Library of Alexandria was founded for a similar purpose: to store all human knowledge. It was based on a technological advance, namely papyrus scrolls. Before that, knowledge was stored on clay tablets, which weren't exactly what you would call a compact way to store stuff. But with scrolls, you could actually store hundreds of thousands of pieces of information in a fixed place. We have had a similar technological breakthrough in recent decades with computers and the Internet. Computers can store vast amounts of data and the Internet can make it available.

We also have the political will to complete such a project. We want to live in an open society, where the idea of education and access to knowledge is supported. Sure, there are forces trying to restrict openness, but, "by and large the idea is supported."

So, for the first time in history, we've got all three things required: the storage, the network as a distribution mechanism, and the political will. It's uncertain how long this opportunity will last, so we've got to act fast.

To answer the question of whether such a project is indeed doable, we need to find answers to four questions: Should we? Can we? May we? Will we?

So, should we?

The hand-waving answer: Yes, we should.

You could dwell on the issue for as long as you like, weighing the pros and cons, never coming to a conclusion. But it's a project that might be up there with man's greatest achievements, achieving or even surpassing the dream of the Library of Alexandria. Right now we have a window of opportunity to complete the project, and the window might close in a few years. So let's give this project a shot; if it turns out to be a bad idea, let's find that out.

The other open question is "Will we?" Brewster left the answer to that as "an exercise to the reader." He and a bunch of other people will devote their time and money to the project, but their resources are limited. Volunteers are needed.

As for the questions "can we?" and "may we?": well, there are obstacles, but they are not insurmountable, as we shall see.

We divide these questions further by the kinds of data we need to archive. The categories are: text, audio, moving images and the web. Let's go through these one at a time.

Text, text, text

So, how many books are there?

The Library of Congress, which has the largest collection of books, has around 24 million books. One book takes about one megabyte of storage. So to digitally store all the books in the Library of Congress would take a whopping 24 terabytes of storage space. (If you needed to store images of the pages, it would take a bit more, but that's still within our reach.)

You can store 24 terabytes of data in a stack of Linux boxes that takes about a square meter of space and costs approximately $60,000. All in all a doable number. Nothing out of our reach.
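The back-of-the-envelope arithmetic above can be checked with a few lines of Python. This is only a sketch using the round figures from the speech (24 million books, about one megabyte per book, about $60,000 for the hardware). The decimal units (1 TB = 1,000,000 MB) and all the variable names are assumptions made for illustration.

```python
# Rough storage estimate for the Library of Congress as plain text.
# Figures are the round numbers quoted in the speech; decimal units
# (1 TB = 1,000,000 MB) are an assumption made here for simplicity.
BOOKS = 24_000_000          # books in the Library of Congress
MB_PER_BOOK = 1             # approximate size of one book as text
HARDWARE_COST_USD = 60_000  # quoted price of the stack of Linux boxes

MB_PER_TB = 1_000_000

total_tb = BOOKS * MB_PER_BOOK / MB_PER_TB
cost_per_tb = HARDWARE_COST_USD / total_tb

print(f"Total storage: {total_tb:.0f} TB")
print(f"Hardware cost per terabyte: ${cost_per_tb:,.0f}")
```

With these inputs the hardware works out to roughly $2,500 per terabyte, which is the number to watch if the per-book size grows (for example, when storing page images).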

Then there's the trouble of converting all those books to a digitized format. You need to scan the books. There are projects already doing this, some of which The Internet Archive is involved in, like the Million Book Project or Project Gutenberg.

There are a couple of ways to do the scanning. One is to do it by hand; a person flips through the pages and "takes a picture" of the pages. This has been going on in India, where it costs about ten dollars a book.

In western countries, where labor is more expensive, they have set up so-called in-library scanning systems, where the local people do the scanning, and automated systems, which consist of robots and custom-made arrangements. The price per book is still a bit higher, but they are making progress and drawing on the experience from India.

[A bookmobile is a small, handy way to print out books.]

What about making the books available? Reading books on computer screens kind of sucks, so The Internet Archive (TIA) has this project called Bookmobile. Bookmobiles are trucks that carry a satellite dish on top and printing facilities inside. One bookmobile costs around $15,000, including the car, which is not that much. A Bookmobile downloads the book from the archive through a satellite connection, then prints and binds it. The whole thing costs about $1 per book (assuming the book is black and white and around 100 pages). The local people can do the whole thing with just a week's training. Also, they found out that lending a book from the Harvard library costs, all in all, around $2, so it might actually be cheaper to print and bind a book and give it away. (However, see Chris Lightfoot's notes on the subject.) So, we can conclude that making the books available is also well within our grasp.

May we scan the books and make them available?

Imaging, i.e. the actual scanning of the books, is allowed as long as you own a copy of the book.

But making them available is a bit more complicated. First, there are the in-print books, hundreds of thousands of them, which are rather strictly regulated. But there is also a huge body of copyrighted books that are permanently out of print, which are called the "orphans." Here the librarians ran into the tragedy of screwed-up laws. Under the current legislation, you just can't make them available, even though the books are never going to be in print again. (Libraries live by giving access to these out-of-print books.)

To clarify the copyright issue with orphans, TIA filed a lawsuit. (That's the way things are clarified in the USA.) So, there's an ongoing Kahle vs. Ashcroft case, called "Free The Orphans", which will hopefully bring some sense to the issue.

But we can conclude that text and books are doable.

Audio and music

How much is there?

Almost all published audio is music, amounting to 2-3 million titles (78s, LPs, CDs). That's storage-wise a doable number, but published music is right now a heavily litigated and restricted area, so we just can't go and give access to it.

Instead, TIA has started to work on other areas of music. They've been working with bootleg recorders. There's an ongoing tradition among jam bands, started by the Grateful Dead, that you are allowed to tape and trade recordings of live performances as long as you don't make a profit out of it. The trading has moved to the Internet, but the traders have had problems with bandwidth. One concert takes about a gigabyte, losslessly compressed, so it's not so light on downloading or uploading.

So the people at The Internet Archive offered the live recorders "unlimited storage, unlimited bandwidth, forever, for free", which is just what the traders wanted. Now they have archived over 12,000 concerts, mostly from the "guys with guitars genre" and lately also from the "guys with mandolins genre." The only requirement is that the recordings are under a Creative Commons license. If you are in a band, or know people who are in bands, join the action; uploading gigabytes of concerts is no sweat for The Internet Archive. Upload away.

Legislation concerning recorded performances in the US is really hard to figure out. But in Europe there's a 50 (or 70) year rule: the works enter the public domain 50 years after the performer has died. This basically implies that the contents of 78 rpm records are in the public domain, so if you have any 78s, rip them and send the contents to The Internet Archive. It would be kind of fun to have hip-hop songs built out of scratches of 78s.

Archive.org also hosts around 50 netlabels. These netlabels have their content on the archive and they point straight to the downloadable files, meaning that you don't have to go through archive.org's branding or anything.

So, all in all, archiving music and making it available is also possible.

Moving images

Films

Most people first think of the big theatrical releases, of which there are around 100 000 - 200 000. (Half of them are estimated to be Indian.) These are heavily litigated and restricted. However, there are some public domain feature films too. There's a guy who has collected around 600 public domain feature films, and The Internet Archive is in the process of digitizing them. The legendary Night of the Living Dead, for example, is a public domain film!

But there are also other interesting films that are less restricted. It used to be that you had to put an explicit copyright notice on the film, otherwise the film would automatically be in the public domain. (This is what Kahle called the "Ben Franklin way of viewing the world.") This means that there are tons of old films that are in the public domain.

Such films include government films, educational films, social guidance stuff, home art flicks, etc., and clips of them are now used in various interesting ways. Brewster invites artists to use the public domain stuff to create new art. And if you have your own videos and want to share them under a free license (for example, a Creative Commons license), The Internet Archive offers you unlimited storage, unlimited bandwidth, forever, for free; just upload away.

Television

The Internet Archive and the Television Archive started archiving TV broadcasts a little late, in 1999. They record 20 channels (some Russian channel, some Japanese channel, CNN, BBC, etc.) 24 hours a day at DVD quality. Broadcasts are stored on offline hard drives. The recording takes around 20 terabytes of storage space a month.
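As a sanity check on those television figures, the sketch below works out what 20 terabytes a month across 20 channels implies per channel. Only the channel count and the monthly total come from the speech; the 30-day month, the decimal units and the variable names are assumptions made for illustration.

```python
# Sanity check on the television archive figures from the speech.
# Assumptions (not from the speech): a 30-day month and decimal units
# (1 TB = 10**12 bytes).
CHANNELS = 20
TB_PER_MONTH = 20
DAYS_PER_MONTH = 30

BYTES_PER_TB = 10**12
seconds_per_month = DAYS_PER_MONTH * 24 * 60 * 60

tb_per_channel_month = TB_PER_MONTH / CHANNELS
bytes_per_second = tb_per_channel_month * BYTES_PER_TB / seconds_per_month
megabits_per_second = bytes_per_second * 8 / 1_000_000

print(f"Per channel: {tb_per_channel_month:.0f} TB a month")
print(f"Implied bitrate: {megabits_per_second:.1f} Mbit/s per channel")
```

Under these assumptions each channel takes about a terabyte a month, or roughly 3 Mbit/s of continuous recording, which is the same order of magnitude as DVD-style video bitrates. So the quoted figures are at least consistent with each other.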

Due to legal issues, these TV archives are not, by default, publicly available. However, they once made the archives available for a short period of time. On October 11, 2001, they showed the TV archives spanning from September 11, 2001 to September 18, 2001.

The idea was to observe in retrospect what the world actually saw, because 9/11 was basically a television event. They wanted to show how people perceived it. The showcase made evident that the media is just a player, not the truth, just one point of view.

But as far as archiving is concerned, moving images are very much doable. We can archive the whole dang thing and we could make it available.

Software, or, run Lotus 1-2-3 as it was originally written

It's estimated that there have been around 50,000 packaged and released software titles. Technically, archiving the software is doable. We can rip the stuff and we can run it in emulators. You can replay all that amazing stuff from the early days: software for the early Mac, Sol, Commodore 64, etc.

However, they ran into a major obstacle: the Digital Millennium Copyright Act (DMCA), which, as Kahle puts it, is an "amazing piece of Soviet-era legislation where 'everything is illegal unless we give permission'," and which "is an antithesis of what the United States used to stand for."

The DMCA implies that ripping (i.e. making a digital copy of) the software is illegal. You are not allowed to break the copy protection, which a lot of the early software had, which effectively means that you can't archive it.

The librarians went to the copyright office with a lot of preparation, briefs etc. They waved a floppy in their hands and said, paraphrased: "Here's Lotus 1-2-3 on a floppy, rotting away. We are ready to make digital copies of this stuff and we want to be allowed to do this."

The librarians won. They got a copyright exemption for two years. What The Internet Archive now needs is donations of the physical objects, for example floppies, and digital copies of the software. People need to act fast with this. The exemption is for two years and the window of opportunity may shut. When the legal hearings come around again, we can show that "the world didn't crater as the software industry thought."

The exemption says that you are allowed to rip old software if you are "an agent for archive.org." Only ripping is allowed; you can't publish the stuff on the net.

The web

The Internet Archive is perhaps best known for archiving the web. They have been archiving the web since 1996, taking a full snapshot of the publicly available web, all the text and all the pictures, every two months. Each snapshot takes about 20 terabytes of storage space compressed, around 50-60 terabytes uncompressed.

[The Yahoo website looked a lot different in 1996.]

All that data is available through the Wayback Machine. With the Wayback Machine you can (try to) "surf the web as it was". It is visited by 150 000 people a day, and it gets 8 million hits a day. The database is on the order of 300-400 terabytes compressed, over a petabyte uncompressed.

It might be one of the largest databases out there, even though it doesn't run on Oracle. Instead, it runs on a cluster of Linux machines and the data is stored as flat files on the filesystem, which is the only way they've found to scale.

Archiving the web is doable, at least as well as The Internet Archive is doing it right now. And they are getting better at it all the time.

Preservation and access

"Don't just have one copy of the data"

The storage has to be very cost-effective. Tapes are too expensive, so they're storing the data on hard drives. They put four 300 gigabyte disks in a Linux box, yielding around a terabyte of storage space per machine, and stack lots of these Linux boxes in a rack.
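To get a feel for the scale, here is a minimal sketch of how many such boxes a petabyte would take. The disk count and disk size come from the speech, and the petabyte is the figure quoted for the full archive. Decimal units, zero overhead for filesystems or redundancy, and the variable names are assumptions made for illustration.

```python
import math

# Rough sizing of the storage cluster described in the speech.
# The one-petabyte target is the figure mentioned for the full archive;
# decimal units and zero redundancy overhead are assumed here.
DISKS_PER_BOX = 4
GB_PER_DISK = 300
TARGET_TB = 1000  # one petabyte

raw_tb_per_box = DISKS_PER_BOX * GB_PER_DISK / 1000
boxes_needed = math.ceil(TARGET_TB / raw_tb_per_box)

print(f"Raw capacity per box: {raw_tb_per_box} TB")
print(f"Boxes needed for a petabyte: {boxes_needed}")
```

The raw capacity comes to 1.2 terabytes per box, which matches the "around a terabyte" figure once some space is lost to formatting. Under these assumptions a petabyte needs somewhere in the region of 800 to 1,000 machines.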

The only full copy of the data is in San Francisco, known also for its massive earthquakes, which might not prove to be a good idea. The lesson to learn from the original Library of Alexandria is, of course, that it burned down. What's now left of all that ancient knowledge is allegedly eight pieces of papyrus. So, don't just have one copy of the data.

That's why they've been swapping data with the New Library of Alexandria. Anything that The Internet Archive gets goes to the New Library of Alexandria and vice versa. The New Library of Alexandria has a few hundred terabytes now, and by the end of the year they will have the other 700 terabytes.

Also, The Internet Archive is kicking off the European archive in Amsterdam, Netherlands. (Brewster himself and his family are living in Amsterdam right now.) The first installment in Amsterdam is a 100 terabyte machine, which is to show that they are serious about the project. They are planning to roll out the full petabyte in 12 months. Internet Access For All (XS4ALL) provides free hosting and free bandwidth for the European archive. The Internet Archive is looking for a technical director for the European Archive.

Access

The Internet Archive now provides several gigabytes of bandwidth, hoping to extend it to 10 gigabytes, which would be a meaningful percentage of the public web.

But the most important thing that libraries do is to provide an interface to the knowledge, i.e. "search engines". One search engine for the Internet Archive data was built by a gal in two years as a part-time job. Her search engine indexes 11 billion web pages, four times the size of Google in terms of pages. So the idea is to make it possible to "build a Google in a month."

Public access to the Internet is quite horrible in the States. The bandwidth isn't growing and the phone companies want to raise the prices. The Internet Archive is providing wireless access in the San Francisco area. They provide unlimited bandwidth; bandwidth is not the problem. The problem is primarily the cost of technical support.

We've got to start doing a better job as far as bandwidth is concerned. Right now, most of the videos on the Internet are pathetically small. In the future, people just won't turn to the Internet if we can't serve proper videos.

So, will we do this?

Finally, Brewster returns to the question of whether we are going to make this project happen. There are forces that are making it not happen, but, on the other hand, there are a lot of people already doing it. They've got the funding. TIA's budget is around 5 million dollars a year, and they are trying to raise that to 10 million dollars a year.

In his speech, Brewster argued that universal access to all human knowledge is possible. If it succeeds, the project might be one of humanity's greatest achievements, up there with the myth of the Library of Alexandria. Not only to store all human knowledge, but to provide access to it, even to a child in rural Uganda. All we need is a few years of really hard-core work, and we might actually pull this off.

Let's get on with it.


Jarno Virtanen. (email: jajvirta@hole.fi)
All text above by me is in the public domain.