Link rot
Link rot (also called link death, link breaking, or reference rot) is the tendency of hyperlinks, over time, to stop pointing to their originally targeted file, web page, or server because that resource has been relocated to a new address or has become permanently unavailable. A link that no longer points to its target, often called a broken, dead, or orphaned link, is a specific form of dangling pointer.
The rate of link rot is a subject of study and research due to its significance to the internet's ability to preserve information. Estimates of that rate vary dramatically between studies. Information professionals have warned that link rot could make important archival data disappear, potentially impacting the legal system and scholarship.
Prevalence
A number of studies have examined the prevalence of link rot within the World Wide Web, in academic literature that uses URLs to cite web content, and within digital libraries.
A 2002 study suggested that link rot within digital libraries is considerably slower than on the web, finding that about 3% of the objects were no longer accessible after one year<ref name=Nelson2002>Nelson, Michael L.; Allen, B. Danette (2002). "Object Persistence and Availability in Digital Libraries". D-Lib Magazine. 8 (1). doi:10.1045/january2002-nelson. Archived from the original on 2020-07-19. Retrieved 2019-09-24.</ref> (equating to a half-life of nearly 23 years).
A 2003 study found that on the Web, about one link out of every 200 broke each week,<ref name=Fetterly2003>Fetterly, Dennis; Manasse, Mark; Najork, Marc; Wiener, Janet (2003). "A large-scale study of the evolution of web pages". Proceedings of the 12th international conference on World Wide Web. Archived from the original on 9 July 2011. Retrieved 14 September 2010.</ref> suggesting a half-life of 138 weeks. This rate was largely confirmed by a 2016–2017 study of links in Yahoo! Directory (which had stopped updating in 2014 after 21 years of development) that found the half-life of the directory's links to be two years.<ref>van der Graaf, Hans. "The half-life of a link is two year". ZOMDir's blog. Archived from the original on 2017-10-17. Retrieved 2019-01-31.</ref>
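The half-lives quoted above can be reproduced by assuming that a constant fraction of links breaks in each period (exponential decay); this is a simplifying assumption, not necessarily the model used by the cited studies. Under it, a per-period loss rate p gives a half-life of ln(0.5) / ln(1 − p) periods. A minimal Python sketch, with illustrative function names:

```python
import math


def half_life(loss_per_period: float) -> float:
    """Number of periods until half of the links survive.

    Assumes a constant fraction `loss_per_period` of the remaining
    links breaks in every period (exponential decay).
    """
    if not 0.0 < loss_per_period < 1.0:
        raise ValueError("loss_per_period must be strictly between 0 and 1")
    return math.log(0.5) / math.log(1.0 - loss_per_period)


def surviving_fraction(loss_per_period: float, periods: float) -> float:
    """Fraction of links still working after `periods` periods."""
    return (1.0 - loss_per_period) ** periods


# 3% of digital-library objects lost per year
print(round(half_life(0.03), 1))     # about 22.8 years
# one Web link in 200 broken per week
print(round(half_life(1 / 200), 1))  # about 138.3 weeks
```

The same function works for any period length, as long as the loss rate and the resulting half-life are read in the same unit (years in the first call, weeks in the second).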
A 2004 study showed that subsets of Web links (such as those targeting specific file types or those hosted by academic institutions) could have dramatically different half-lives.<ref name=Koehler2004>Koehler, Wallace (2004). "A longitudinal study of web pages continued: a consideration of document persistence". Information Research. 9 (2). Archived from the original on 2017-09-11. Retrieved 2019-01-31.</ref> The URLs selected for publication appear to have greater longevity than the average URL. A 2015 study by Weblock analyzed more than 180,000 links from references in the full-text corpora of three major open access publishers and found a half-life of about 14 years,<ref>"All-Time Weblock Report". August 2015. Archived from the original on 4 March 2016. Retrieved 12 January 2016.</ref> generally confirming a 2005 study that found that half of the URLs cited in D-Lib Magazine articles were active 10 years after publication.<ref name=McCown2005>McCown, Frank; Chan, Sheffan; Nelson, Michael L.; Bollen, Johan (2005). "The Availability and Persistence of Web References in D-Lib Magazine" (PDF). Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05). Archived from the original (PDF) on 2012-07-17. Retrieved 2005-10-12.</ref> Other studies have found higher rates of link rot in academic literature but typically suggest a half-life of four years or greater.<ref name=Spinellis2003>Spinellis, Diomidis (2003). "The Decay and Failures of Web References". Communications of the ACM. 46 (1): 71–77. CiteSeerX 10.1.1.12.9599. doi:10.1145/602421.602422. S2CID 17750450. Archived from the original on 2020-07-23. Retrieved 2007-09-29.</ref> A 2013 study in BMC Bioinformatics analyzed nearly 15,000 links in abstracts from Thomson Reuters's Web of Science citation index and found that the median lifespan of web pages was 9.3 years, and just 62% were archived.<ref>Hennessey, Jason; Xijin Ge, Steven (2013). "A Cross Disciplinary Study of Link Decay and the Effectiveness of Mitigation Techniques". BMC Bioinformatics. 14 (Suppl 14): S5. doi:10.1186/1471-2105-14-S14-S5. PMC 3851533. PMID 24266891.</ref> A 2021 study of external links in New York Times articles published between 1996 and 2019 found a half-life of about 15 years (with significant variance among content topics) but noted that 13% of functional links no longer led to the original content, a phenomenon called content drift.<ref>"What the ephemerality of the Web means for your hyperlinks". Columbia Journalism Review. Archived from the original on 2021-08-02. Retrieved 2021-08-02.</ref>
A 2013 study found that 49% of links in U.S. Supreme Court opinions were dead.<ref>Garber, Megan (2013-09-23). "49% of the Links Cited in Supreme Court Decisions Are Broken". The Atlantic. Retrieved 2024-01-10.</ref>
A 2023 study looking at United States COVID-19 dashboards found that 23% of the state dashboards available in February of 2021 were no longer available at the previous URLs in April of 2023.<ref name="Adams1">Adams, Aaron M.; Chen, Xiang; Li, Weidong; Chuanrong, Zhang (27 July 2023). "Normalizing the pandemic: exploring the cartographic issues in state government COVID-19 dashboards". Journal of Maps. 19 (5): 1–9. doi:10.1080/17445647.2023.2235385.</ref>
Causes
Link rot has several causes. A target web page may be removed. The server that hosts the target page could fail, be removed from service, or relocate to a new domain name. As far back as 1999, it was noted that with the amount of material that can be stored on a hard drive, "a single disk failure could be like the burning of the library at Alexandria."<ref name="McGranaghan1999">McGranaghan, Matthew (1999). "The Web, Cartography and Trust". Cartographic Perspectives (32): 3–5. doi:10.14714/CP32.624.</ref> A domain name's registration may lapse or be transferred to another party. Some causes result in the link failing to find any target and returning an error such as HTTP 404. Others cause a link to target content other than what the link's author intended.
Other reasons for broken links include:
- the restructuring of websites that causes changes in URLs (e.g. <syntaxhighlight lang="text" class="" id="" style="" inline="1">domain.net/pine_tree</syntaxhighlight> might be moved to <syntaxhighlight lang="text" class="" id="" style="" inline="1">domain.net/tree/pine</syntaxhighlight>)
- relocation of formerly free content to behind a paywall<ref name="Adams1" />
- a change in server architecture that results in code such as PHP functioning differently
- dynamic page content such as search results that changes by design
- deletion of the target page and/or its content
- the presence of user-specific information (such as a login name) within the link
- deliberate blocking by content filters or firewalls
- the expiration of a domain name registration
Prevention and detection
Strategies for preventing link rot can focus on placing content where its likelihood of persisting is higher, authoring links that are less likely to be broken, taking steps to preserve existing links, or repairing links whose targets have been relocated or removed.
The creation of URLs that will not change with time is the fundamental method of preventing link rot. Preventive planning has been championed by Tim Berners-Lee and other web pioneers.<ref name=Berners-Lee1998>Berners-Lee, Tim (1998). "Cool URIs Don't Change". Archived from the original on 2000-03-02. Retrieved 2019-01-31.</ref>
Strategies pertaining to the authorship of links include:
- linking to primary rather than secondary sources and prioritizing stable sites<ref name="Koehler2004" />
- avoiding links that point to resources on researchers' personal pages<ref name=McCown2005/>
- using clean URLs or otherwise employing URL normalization or URL canonicalization<ref name=Kille2014>Kille, Leighton Walter (8 November 2014). "The Growing Problem of Internet "Link Rot" and Best Practices for Media and Online Publishers". Journalist's Resource, Harvard Kennedy School. Archived from the original on 12 January 2015. Retrieved 16 January 2015.</ref>
- using permalinks and persistent identifiers such as ARKs, DOIs, Handle System references, PURLs, or content addressing<ref>Sicilia, Miguel-Angel, et al. (2019). "Decentralized Persistent Identifiers: a basic model for immutable handlers". Procedia Computer Science. 146: 123–130. Archived 2023-05-10 at the Wayback Machine.</ref>
- avoiding linking to documents other than web pages<ref name=Kille2014/>
- avoiding deep linking
- linking to web archives such as the Internet Archive,<ref>"Internet Archive: Digital Library of Free Books, Movies, Music & Wayback Machine". 2001-03-10. Archived from the original on 26 January 1997. Retrieved 7 October 2013.</ref> WebCite,<ref name=Eysenbach2005>Eysenbach, Gunther; Trudel, Mathieu (2005). "Going, going, still there: Using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research. 7 (5): e60. doi:10.2196/jmir.7.5.e60. PMC 1550686. PMID 16403724.</ref> archive.today, Perma.cc,<ref name=permacc>Zittrain, Jonathan; Albert, Kendra; Lessig, Lawrence (12 June 2014). "Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations" (PDF). Legal Information Management. 14 (2): 88–99. doi:10.1017/S1472669614000255. S2CID 232390360. Archived (PDF) from the original on 1 November 2020. Retrieved 10 June 2020.</ref> Amber,<ref>"Harvard University's Berkman Center Releases Amber, a "Mutual Aid" Tool for Bloggers & Website Owners to Help Keep the Web Available | Berkman Center". cyber.law.harvard.edu. Archived from the original on 2016-02-02. Retrieved 2016-01-28.</ref> or Arweave<ref>"Arweave - A community-driven ecosystem". arweave.org. Archived from the original on 2023-03-15. Retrieved 2023-03-15.</ref>
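As an illustration of the URL normalization mentioned in the list above, the following Python sketch reduces trivially different spellings of a URL to one form before it is published as a link. The function name and the particular rules chosen (lower-casing scheme and host, dropping default ports, fragments and embedded user information, and using "/" for an empty path) are assumptions made for this example; real sites may need a different or larger rule set.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be omitted without changing the target.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a normalized form of an http(s) URL (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # .hostname is already lower-cased and excludes any "user:password@" part.
    host = parts.hostname or ""
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url("HTTP://Domain.NET:80/tree/pine#needles"))
# http://domain.net/tree/pine
```

Dropping the user-information part also avoids publishing links that contain user-specific data such as a login name. The sketch does not handle IPv6 literal hosts, which would need their square brackets restored.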
Strategies pertaining to the protection of existing links include:
- using redirection mechanisms such as HTTP 301 to automatically refer browsers and crawlers to relocated content
- using content management systems which can automatically update links when content within the same site is relocated or automatically replace links with canonical URLs<ref name="Justaddwater 2007">Rønn-Jensen, Jesper (2007-10-05). "Software Eliminates User Errors And Linkrot". Justaddwater.dk. Archived from the original on 11 October 2007. Retrieved 5 October 2007.</ref>
- integrating search resources into HTTP 404 pages<ref name="GoogleToolbar">Mueller, John (2007-12-14). "FYI on Google Toolbar's Latest Features". Google Webmaster Central Blog. Archived from the original on 13 September 2008. Retrieved 9 July 2008.</ref>
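The redirection strategy in the list above can be sketched with Python's standard `http.server` module. The mapping below reuses the example paths from the Causes section; the handler and helper names are hypothetical, and a production site would normally configure redirects in its web server or content management system rather than in application code like this.

```python
from http.server import BaseHTTPRequestHandler
from typing import Optional

# Hypothetical table of retired paths and their new locations.
MOVED = {"/pine_tree": "/tree/pine"}


def redirect_target(path: str) -> Optional[str]:
    """Return the new location for a retired path, or None if it never moved."""
    return MOVED.get(path)


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers requests for moved pages with 301 and everything else with 404."""

    def do_GET(self) -> None:
        target = redirect_target(self.path)
        if target is not None:
            # 301 tells browsers and crawlers that the move is permanent.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404, "Not Found")
```

To try it locally, the handler can be served with `http.server.HTTPServer(("127.0.0.1", 8000), RedirectHandler).serve_forever()`. Note that `self.path` includes any query string, so this exact-match lookup only covers requests without one.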
The detection of broken links may be done manually or automatically. Automated methods include plug-ins for content management systems as well as standalone broken-link checkers such as Xenu's Link Sleuth. Automatic checking may not detect links that return a soft 404 or links that return a 200 OK response but point to content that has changed.<ref name=Bar-Yossef2004>Bar-Yossef, Ziv; Broder, Andrei Z.; Kumar, Ravi; Tomkins, Andrew (2004). "Sic transit gloria telae: towards an understanding of the Web's decay". Proceedings of the 13th international conference on World Wide Web – WWW '04. pp. 328–337. CiteSeerX 10.1.1.1.9406. doi:10.1145/988672.988716. ISBN 978-1581138443.</ref>
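A rough idea of what an automated check involves is given by the Python sketch below. It is not how any particular checker works: the function names, the status labels, and especially the phrase list used to guess at soft 404 pages are assumptions chosen for illustration, and the phrase match is a crude heuristic that will produce both false positives and false negatives.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A fetcher maps a URL to (HTTP status, page text); status 0 means no response.
Fetch = Callable[[str], Tuple[int, str]]

# Illustrative phrases that often appear on error pages served with 200 OK.
SOFT_404_PHRASES = ("page not found", "no longer available")


def fetch_with_urllib(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL with the standard library, following redirects."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        return error.code, ""  # server answered with 4xx or 5xx
    except OSError:
        return 0, ""  # DNS failure, refused connection, timeout, ...


def classify_link(url: str, fetch: Fetch = fetch_with_urllib) -> str:
    """Label a link as 'ok', 'broken', 'unreachable' or 'possible soft 404'."""
    status, body = fetch(url)
    if status == 0:
        return "unreachable"
    if status >= 400:
        return "broken"
    lowered = body.lower()
    if any(phrase in lowered for phrase in SOFT_404_PHRASES):
        return "possible soft 404"
    return "ok"
```

Passing the fetcher as a parameter lets the classification be tried without network access, for example `classify_link("http://domain.net/pine_tree", lambda url: (404, ""))`. A link labelled "ok" can still suffer from content drift; detecting that requires comparing the page with an earlier copy, which this sketch does not attempt.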
See also
- Archive Team, web archiving team
- Dead Internet theory
- Deletionism and inclusionism in Wikipedia
- Digital preservation
- Infodemic
- Software rot
- Info - Cern
Further reading
- Markwell, John; Brooks, David W. (2002). "Broken Links: The Ephemeral Nature of Educational WWW Hyperlinks". Journal of Science Education and Technology. 11 (2): 105–108. doi:10.1023/A:1014627511641. S2CID 60802264.
- Gomes, Daniel; Silva, Mário J. (2006). "Modelling Information Persistence on the Web" (PDF). Proceedings of the 6th International Conference on Web Engineering. ICWE'06. Archived from the original (PDF) on 2011-07-16. Retrieved 14 September 2010.
- Dellavalle, Robert P.; Hester, Eric J.; Heilig, Lauren F.; Drake, Amanda L.; Kuntzman, Jeff W.; Graber, Marla; Schilling, Lisa M. (2003). "Going, Going, Gone: Lost Internet References". Science. 302 (5646): 787–788. doi:10.1126/science.1088234. PMID 14593153. S2CID 154604929.
- Koehler, Wallace (1999). "An Analysis of Web Page and Web Site Constancy and Permanence". Journal of the American Society for Information Science. 50 (2): 162–180. doi:10.1002/(SICI)1097-4571(1999)50:2<162::AID-ASI7>3.0.CO;2-B.
- Sellitto, Carmine (2005). "The impact of impermanent Web-located citations: A study of 123 scholarly conference publications" (PDF). Journal of the American Society for Information Science and Technology. 56 (7): 695–703. CiteSeerX 10.1.1.473.2732. doi:10.1002/asi.20159.
References
External links
- Future-Proofing Your URIs
- Nielsen, Jakob (14 June 1998). "Fighting Linkrot". Archived from the original on 23 December 2012.