|
The Wayback Machine -- www.Archive.org
-- frequently asked questions (FAQ)
(Note: I researched and included the below excerpt from The Wayback
Machine because of its contact information: The Presidio in San
Francisco, California, the current haunt of former USSR president
Mikhail Gorbachev, and his 'environmental' website and organization,
Green Cross International http://www.gci.ch/ and
http://www.greencrossinternational.net/index.asp,
as well as The Gorbachev Foundation USA http://www.gfna.net/index.php.
Please visit the article located here: http://www.supervirtual.com.br/acervo/KRPN-Gorbachev-PresidiumtoPresidio.htm.
It is a very important read. The fact that the archive for the
Internet is located in such a place should concern many. What may
perhaps at first seem unrelated -- but really is related -- is this: http://www.bdt.fat.org.br/iRead?28+biodiv-l+24.
Please read it; it is not long, but please consider the players
involved.)

Internet Archive, The Presidio of San Francisco, P.O. Box 29244,
San Francisco, CA 94129. 415-561-6767 info@archive.org

The Presidio in San Francisco, California, is also home to "ERPANET"
(Electronic Resource Preservation and Access Network) http://www.erpanet.org,
whose Mission Statement reads: "The European
Commission funded ERPANET Project will establish an expandable
European Consortium, which will make viable and visible information,
best practice and skills development in the area of digital
preservation of cultural heritage and scientific objects.
ERPANET will bring together memory organisations (museums, libraries
and archives), ICT and software industry, research institutions,
government organisations (including local ones), entertainment and
creative industries, and commercial sectors (including for example
pharmaceuticals, petro-chemical, and financial). The dominant feature
of ERPANET will be the provision of a virtual clearinghouse and
knowledge-base on state-of-the-art developments in digital
preservation and the transfer of that expertise among individuals and
institutions." http://www.erpanet.org/about.php

Okay, after you've digested THAT, try ERPANET's "Aims and
Purpose" statement: "The ERPANET project aims to
establish an expandable and self-sustaining European Initiative, which
will serve as a virtual clearinghouse and knowledge-base in the area
of preservation of cultural heritage and scientific digital objects.
The dominant feature of ERPANET will be the exchanging of knowledge
on state-of-the-art developments in digital preservation and the
transfer of expertise among individuals and institutions. More
specifically ERPANET will deliver a range of services (e.g. content
creation, advisory service, training and thematic workshops and fora),
both to information creation and user community. It will make
accessible tools, knowledge, and experience. ERPANET will
not directly carry out new research to develop such tools, but it will
create a coherent platform for proactive co-operation, collaboration,
exchange and dissemination of research results and
experience in the preservation of digital objects. It will bring
together research institutions, memory organisations, ICT industry,
entertainment and creative (e.g. broadcasting) industries and provide
effective, multidisciplinary, knowledge and resource-sharing
infrastructure." http://www.erpanet.org/about.php

How can I get my site included in the Archive?
Alexa Internet has been crawling the web since 1996, which has resulted in a massive archive. If you have a web site, would like to ensure that it is saved for posterity in the Internet Archive, and have searched Wayback and found no results, you can visit Alexa's "Webmasters" page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.

How can I remove my site's pages from the Wayback Machine?
The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine. Internet Archive uses the exclusion policy intended for use by both academic and non-academic digital repositories and archivists. See our exclusion policy. You can find exclusion directions at exclude.php. If you cannot place the robots.txt file, opt not to, or have further questions, email us.

What is the Internet Archive Wayback Machine?
The Internet Archive Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older version of your favorite Web site. The Internet Archive Wayback Machine can make all of this possible. See our original press release at http://www.archive.org/about/press_release.php.

Can I link to old pages on the Wayback Machine?
Yes! The Wayback Machine is built so that it can be used and referenced. If you find an archived page that you would like to reference on your Web page or in an article, you can copy the URL. You can even use fuzzy URL matching and date specification... but that's a bit more advanced (check out our advanced search page at http://web.archive.org/collections/web/advanced.html).

Why isn't the site I'm looking for in the archive?
Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl.
It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners might have also requested that their sites be excluded from the Wayback Machine. When this has occurred, you will see a "blocked site error" message. When a site is excluded because of robots.txt you will see a "robots.txt query exclusion error" message.

What does it mean when a site's archive data has been "updated"?
When our automated systems crawl the web every few months or so, we find that only about 50% of all pages on the web have changed from our previous visit. This means that much of the content in our archive is duplicate material. If you don't see "*" next to an archived document, then the content on the archived page is identical to the previously archived copy.

Who was involved in the creation of the Internet Archive Wayback Machine?
"The original idea for the Internet Archive Wayback Machine began in 1996, when the Internet Archive first began archiving the web. Now, five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive has made the Internet Archive Wayback Machine available to the public. The Internet Archive has relied on donations of web crawls, technology, and expertise from Alexa Internet and others. The Internet Archive Wayback Machine is owned and operated by the Internet Archive."

How was the Wayback Machine made?
Alexa Internet, in cooperation with the Internet Archive, has designed a three-dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine. The Internet Archive Wayback Machine contains approximately 1 petabyte of data and is currently growing at a rate of 20 terabytes per month. This eclipses the amount of text contained in the world's largest libraries, including the Library of Congress.
If you tried to place the entire contents of the archive onto floppy disks (we don't recommend this!) and laid them end to end, it would stretch from New York, past Los Angeles, and halfway to Hawaii.

What type of machinery is used in this Internet Archive?
Much of the Internet Archive is stored on hundreds of slightly modified x86 servers. The computers run the Linux operating system. Each computer has 512 MB of memory and can hold just over 1 terabyte of data on ATA disks. However, we are developing a new way of storing our data on a smaller machine. Each machine will store 1 terabyte. For more information go to www.petabox.org.

How do you archive dynamic pages?
There are many different kinds of dynamic pages, some of which are easily stored in an archive and some of which fall apart completely. When a dynamic page renders standard HTML, the archive works beautifully. When a dynamic page contains forms, JavaScript, or other elements that require interaction with the originating host, the archive will not contain the original site's functionality.

Why are some sites harder to archive than others?
If you look at our collection of archived sites, you will find some broken pages, missing graphics, and some sites that aren't archived at all. Here are some things that make it difficult to archive a web site:
As a general rule of thumb, simple HTML is the easiest to archive.
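
Several answers in this FAQ depend on robots.txt exclusion. As a rough sketch of how a crawler might evaluate such a file, the Python below uses the standard library's urllib.robotparser. The user-agent name "ia_archiver", the example.com URLs, and the helper name is_allowed are assumptions made for illustration; they do not appear in the FAQ text, and this is not the Archive's own crawler code.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed example: exclude one named crawler from the whole site.
EXCLUDE_ALL = """\
User-agent: ia_archiver
Disallow: /
"""

if __name__ == "__main__":
    page = "http://www.example.com/page.html"
    print(is_allowed(EXCLUDE_ALL, "ia_archiver", page))
    print(is_allowed(EXCLUDE_ALL, "otherbot", page))
```

A crawler that honors the file would skip every page for which is_allowed returns False. Rules naming one crawler do not affect other crawlers unless a "User-agent: *" entry is also present.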

Some sites are not available because of robots.txt or other exclusions. What does that mean?
The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with robots.txt is strictly voluntary. In fact, most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey the instructions anyway. However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If web site owners decide they prefer not to have a web crawler visiting their files and set up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt (you will see a "robots.txt query exclusion error" message). Sometimes a web site owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a "blocked site error" message, that means that a site owner has made such a request and it has been honored.

How can I help the Internet Archive and the Wayback Machine?
The Internet Archive actively seeks donations of digital materials for preservation. If you have digital materials that may be of interest to future generations, please let us know by submitting a proposal at http://www.archive.org/web/researcher/proposal.php.
The Internet Archive is also seeking additional funding to continue this important mission. You may make a donation through the Amazon.com Honor System at http://www.amazon.com/paypage/PFW9L3HMJTPIQ.

Using the Internet Archive Wayback Machine, it is possible to search for the names of sites contained in the Archive (URLs) and to specify date ranges for your search. We hope to implement a full text search engine at some point in the future.

Why am I getting broken or gray images on a site?
Broken images (when there is a small red "x" where the image should be) occur when the images are not available on our servers. Usually this means that we did not archive them. Gray images are the result of robots.txt exclusions. The site in question may have blocked robot access to their images directory.

How do I contact the Internet Archive?
All questions about the Wayback Machine, or other Internet Archive projects, should be addressed to info@archive.org.

What is the Wayback Machine's Copyright Policy?
The Internet Archive respects the intellectual property rights and other proprietary rights of others. The Internet Archive may, in appropriate circumstances and at its discretion, remove certain content or disable access to content that appears to infringe the copyright or other intellectual property rights of others. If you believe that your copyright has been violated by material available through the Internet Archive, please provide the Internet Archive Copyright Agent with the following information:
Identification of the copyrighted work that you claim has been
infringed;
An exact description of where the material about which you complain
is located within the Internet Archive collections;
Your address, telephone number, and email address;
A statement by you that you have a good-faith belief that the
disputed use is not authorized by the copyright owner, its agent, or
the law;
A statement by you, made under penalty of perjury, that the above
information in your notice is accurate and that you are the owner of
the copyright interest involved or are authorized to act on behalf
of that owner;
Your electronic or physical signature.
Internet Archive uses the exclusion policy intended for use by both academic and non-academic digital repositories and archivists. See our full exclusion policy. The Internet Archive Copyright Agent can be reached as follows: Internet Archive Copyright Agent

Why is the Internet Archive collecting sites from the Internet? What makes the information useful?
Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars. The Archive collaborates with institutions including the Library of Congress and the Smithsonian.

No, we do not collect or archive chat systems or personal email messages that have not been posted to Usenet bulletin boards or publicly accessible online message boards.

Do you collect all the sites on the Web?
No, we collect only publicly accessible Web pages. We do not archive pages that require a password to access, pages tagged for "robot exclusion" by their owners, pages that are only accessible when a person types into and sends a form, or pages on secure servers. If a site owner properly requests removal of a Web site through http://www.archive.org/about/exclude.php, we will exclude that site from the Wayback Machine.

Is there any personal information in these collections?
We collect Web pages that are publicly accessible. These may include pages with personal information.

Who has access to the collections? What about the public?
The Archive makes the collections available at no cost to researchers, historians, and scholars. At present, it takes someone with a certain level of technical knowledge to access them, but there is no requirement that a user be affiliated with any particular organization.

How can I get a copy of the pages on my Web site? If my site got hacked or damaged, could I get a backup from the Archive?
Our terms of use do not cover backups for the general public. However, you may use the Internet Archive Wayback Machine to locate and access archived versions of your web site. We can't guarantee that your site has been or will be archived. For site owners only, we offer limited backup capabilities. Send your request to info@archive.org for more information.

Can people download sites from the collections?
Our terms of use specify that users of the collections are not to copy data from the collections. If there are special circumstances that you think the Archive should consider, please contact info@archive.org.

How do you protect my privacy if you archive my site?
The Archive collects Web
pages that are publicly available -- the same ones that you might
find as you surfed around the Web. We do not archive pages that
require a password to access, pages tagged for "robot
exclusion" by their owners, pages that are only accessible when a
person types into and sends a form, or pages on secure servers. We
also provide information on removing a site from the collections.
Those who use the collections must agree to certain terms of use.

What do 'failed connection' and other error messages mean?
These are the main error messages you will see while searching the Wayback Machine:
Failed Connection: The server that the particular piece of information lives on is down. Generally these clear up within two weeks.
Robots.txt Query Exclusion: A robots.txt is a file that site owners put on their site to keep crawlers like our own from crawling it. The Internet Archive retroactively respects all robots.txt files.
Blocked Site Error: Site owners and/or copyright holders have requested that the site be excluded from the Wayback Machine. For exclusion criteria, please see our exclusion policy (we use the same one used and developed by other digital repositories and archivists, both academic and non-academic).
Path Index Error: A path index error message refers to a problem in our database wherein the information requested is not available (generally because of a machine or software issue, though each case can be different). We cannot always completely fix these errors in a timely manner.

Why are there no recent archives in Wayback?
Wayback does not add pages until at least 6 months after they are collected. Updates can take up to 12 months in some cases.

How does the Wayback Machine behave with JavaScript turned off?
If you have JavaScript turned off, images and links will be from the live web, not from our archive of old Web files.

How did I end up on the live version of a site? or I clicked on X date, but now I am on Y date, how is that possible?
Not every date for every site archived is 100% complete. When you are surfing an incomplete archived site, the Wayback Machine will grab the closest available date to the one you are in for the links that are missing. In the event that we do not have the link archived at all, the Wayback Machine will look for the link on the live web and grab it if available. Pay attention to the date code embedded in the archived URL. This is the string of digits in the middle; it translates as yyyymmddhhmmss. For example, in this URL http://web.archive.org/web/20000229123340/http://www.yahoo.com/ the date the site was crawled was February 29, 2000 at 12:33 and 40 seconds. |
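
The date code described above can be decoded mechanically. The Python below is a minimal sketch that assumes archived URLs always follow the shape of the example in this FAQ (a 14-digit yyyymmddhhmmss stamp after /web/). The names WAYBACK_RE, parse_wayback_url, and closest_capture are hypothetical, and the "closest available date" helper is only one plausible reading of the behavior the FAQ describes, not the Wayback Machine's actual logic.

```python
import re
from datetime import datetime

# Assumed URL shape, taken from the single example URL in the FAQ.
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url):
    """Split an archived URL into (capture time, original URL)."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a recognized archived URL: %r" % url)
    stamp, original = match.groups()
    # yyyymmddhhmmss maps directly onto these strptime format codes.
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


def closest_capture(target, available):
    """Pick the capture time nearest to target (illustrative guess only)."""
    return min(available, key=lambda ts: abs(ts - target))


if __name__ == "__main__":
    when, site = parse_wayback_url(
        "http://web.archive.org/web/20000229123340/http://www.yahoo.com/"
    )
    print(when.isoformat(), site)
```

Comparing the decoded time of the page you asked for with the time in the URL you landed on shows how far the Wayback Machine had to move to find an archived copy.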