Some URLs Are Immortal, Most Are Ephemeral

Abstract

"How long does a web page last?" is often answered with "44 to 100 days," but the web has changed since those numbers were first given in 1996. We examined how web page lifespans have evolved using a sample of 27.3 million URLs archived from 1996 to 2021 by the Internet Archive (IA). We found that only 35% of URLs remained active in 2023, so roughly two-thirds of the sample had gone inactive. Our preliminary analysis suggests that these numbers are not inflated by soft 404s and other phenomena. We encountered DNS failures for 30% of the 7 million unique domains. Surprisingly, almost half of the URLs first archived in 1996–2000 were still active in 2023, which shows how long-lived some early URLs are; sites like nasa.gov continue to exist. Conversely, some URLs had lifespans that could not be measured. Nearly 30% had short lifespans, with only one archived page or no "200 OK" mementos, indicating brief existence or limited archival interest. The average lifespan of a web page in our dataset is 1.6 years. However, this average conceals the bimodal nature of root URLs: 10% persist for less than a year and nearly 20% survive for over 20 years, resulting in an average root URL lifespan of 3.9 years. Deep links have an average lifespan of 1.3 years. We also examined web page half-life, the time it takes for half of the pages to disappear. Root URLs had a half-life of nine years, compared to one year for deep links. URLs from different periods exhibited different half-lives: 15–20 years for 1990s URLs, 6–7 years for early 2000s URLs, and 6 months to 3 years for URLs from 2003 to 2021. Using the IA as a source for sample URLs provides the only realistic, public option to study the evolution of the web at this scale and duration. However, there are well-known classes of pages that are not present in the Wayback Machine, and our findings apply only to publicly archivable pages.
Our findings provide a nuanced understanding of web page longevity, emphasizing that while some URLs survive a long time, most have an ephemeral lifespan.
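
To make the difference between average lifespan and half-life concrete, here is a minimal Python sketch. It is not the authors' method: the function name `half_life`, the use of `math.inf` to mark pages that are still alive, and the toy sample values are all assumptions made for illustration, and none of the numbers come from the study. It treats the half-life as the lifespan by which at least half of the pages have disappeared.

```python
import math
from statistics import mean


def half_life(lifespans):
    """Return the time by which at least half of the pages have disappeared.

    `lifespans` holds one lifespan per page, in years. math.inf marks a
    page that was still alive when observation ended (an assumption of
    this sketch). Returns None when fewer than half of the pages have
    disappeared, because the half-life has not been reached yet.
    """
    if not lifespans:
        raise ValueError("need at least one lifespan")
    ordered = sorted(lifespans)
    # Position of the page whose disappearance brings the dead share to 50%.
    k = math.ceil(len(ordered) / 2) - 1
    t = ordered[k]
    return None if math.isinf(t) else t


# Hypothetical toy sample: many short-lived pages plus a few long-lived ones.
sample = [0.2, 0.5, 0.5, 1.0, 1.0, 1.5, 2.0, 20.0, 22.0, 25.0]
sample_half_life = half_life(sample)
sample_mean = mean(sample)
```

In this invented sample, a handful of long-lived pages pulls the mean far above the half-life, which is why a skewed or bimodal population is better described by both figures than by the mean alone.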

Details

Creators
KRITIKA GARG
Date
2024-09-17 13:30:00 +0100
Keywords
approaches to preservation; from document to data
Publication Type
poster
License
Creative Commons Attribution 4.0 (CC-BY-4.0)