Study finds that we could lose science if publishers go bankrupt


Back when scientific publications came in paper form, libraries played a key role in ensuring that knowledge didn’t disappear. Copies went out to so many libraries that any failure—a publisher going bankrupt, a library getting closed—wouldn’t put us at risk of losing information. But, as with anything else, scientific content has gone digital, which has changed what’s involved with preservation.

Organizations have devised systems that should provide options for preserving digital material. But, according to a recently published survey, many digital documents aren’t consistently showing up in the archives that are meant to preserve them. And that puts us at risk of losing academic research, including science paid for with taxpayer money.

Tracking down references

The work was done by Martin Eve, a developer at Crossref. That’s the organization that manages the DOI system, which provides a permanent pointer toward digital documents, including almost every scientific publication. If updates are done properly, a DOI will always resolve to a document, even if that document gets shifted to a new URL.

But the DOI system also has a way of handling documents disappearing from their expected location, as might happen if a publisher went bankrupt. There is a set of what are called “dark archives” that the public doesn’t have access to but that should contain copies of anything that’s had a DOI assigned. If anything goes wrong with a DOI, it should trigger the dark archives to open access, and the DOI should be updated to point to the copy in the dark archive.
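To make the mechanism concrete, here is a minimal in-memory sketch of a DOI that follows a document to a new URL and falls back to a dark-archive copy when the publisher disappears. It is a toy model, not Crossref's actual system: the `DoiRegistry` class, its method names, and the example DOIs and URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DoiRegistry:
    """Toy stand-in for a DOI resolver (hypothetical, not Crossref's real system)."""
    live: dict = field(default_factory=dict)  # DOI -> current public URL
    dark: dict = field(default_factory=dict)  # DOI -> dark-archive copy

    def register(self, doi, url, dark_copy=None):
        self.live[doi] = url
        if dark_copy is not None:
            self.dark[doi] = dark_copy

    def move(self, doi, new_url):
        # The document changes location; the DOI itself stays the same.
        self.live[doi] = new_url

    def publisher_failed(self, doi):
        # Trigger event: repoint the DOI at the dark-archive copy, if one exists.
        self.live.pop(doi, None)
        if doi in self.dark:
            self.live[doi] = self.dark[doi]

    def resolve(self, doi):
        # Returns None when the document is gone and was never archived.
        return self.live.get(doi)


registry = DoiRegistry()
registry.register("10.1234/archived", "https://publisher.example/a",
                  dark_copy="https://archive.example/a")
registry.register("10.1234/unarchived", "https://publisher.example/b")

registry.publisher_failed("10.1234/archived")
registry.publisher_failed("10.1234/unarchived")
print(registry.resolve("10.1234/archived"))    # the dark-archive copy
print(registry.resolve("10.1234/unarchived"))  # None: nothing left to point to
```

The second DOI is the case the study is concerned with: the fallback only helps if a copy reached an archive in the first place.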

For that to work, however, copies of everything published have to be in the archives. So Eve decided to check whether that’s the case.

Using the Crossref database, Eve got a list of over 7 million DOIs and then checked whether the documents could be found in archives. He included well-known ones, like the Internet Archive at archive.org, as well as some dedicated to academic works, like LOCKSS (Lots of Copies Keeps Stuff Safe) and CLOCKSS (Controlled Lots of Copies Keeps Stuff Safe).
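In outline, that check amounts to counting, for each DOI, how many archives hold a copy. The sketch below shows the idea with made-up data. It is not Eve's code: the function names, the archive labels, and the DOIs are assumptions for illustration, and a real survey would query each archive's holdings rather than use in-memory sets.

```python
def archive_counts(dois, archives):
    """Map each DOI to the number of archives that hold it.

    `archives` maps an archive name to the set of DOIs it holds.
    """
    return {doi: sum(doi in held for held in archives.values()) for doi in dois}


def coverage_summary(counts):
    """Fraction of DOIs found in at least one archive, and in none."""
    total = len(counts)
    if total == 0:
        return {"archived": 0.0, "unarchived": 0.0}
    archived = sum(1 for n in counts.values() if n >= 1)
    return {"archived": archived / total, "unarchived": (total - archived) / total}


# Hypothetical holdings for three archives and four DOIs.
holdings = {
    "archive_a": {"10.1234/one", "10.1234/two"},
    "archive_b": {"10.1234/one"},
    "archive_c": {"10.1234/one", "10.1234/three"},
}
sample = ["10.1234/one", "10.1234/two", "10.1234/three", "10.1234/four"]

counts = archive_counts(sample, holdings)
print(counts)
print(coverage_summary(counts))
```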

Not well-preserved

The results were… not great.

When Eve broke down the results by publisher, less than 1 percent of the 204 publishers had put the majority of their content into multiple archives. (The cutoff was 75 percent of their content in three or more archives.) Fewer than 10 percent had put more than half their content in at least two archives. And a full third seemed to be doing no organized archiving at all.
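The two cutoffs mentioned above can be written as a small classifier. This is a sketch under stated assumptions rather than the study's actual method: the tier labels are invented here, treating the 75 percent cutoff as "at least 75 percent" is a guess, and the article doesn't say how "no organized archiving" was defined, so everything below the second cutoff is simply lumped together as "other."

```python
def share_with_at_least(counts, min_archives):
    """Fraction of a publisher's DOIs held in at least `min_archives` archives."""
    if not counts:
        return 0.0
    return sum(1 for n in counts if n >= min_archives) / len(counts)


def classify_publisher(counts):
    """Classify a publisher from a list of per-DOI archive counts.

    Tier names are hypothetical labels, not the study's terminology.
    """
    if share_with_at_least(counts, 3) >= 0.75:
        return "top"     # 75 percent of content in three or more archives
    if share_with_at_least(counts, 2) > 0.50:
        return "middle"  # more than half of content in at least two archives
    return "other"       # the article gives no further cutoffs


print(classify_publisher([3, 3, 4, 0]))  # three of four DOIs in 3+ archives
print(classify_publisher([2, 2, 2, 0]))  # three of four DOIs in 2+ archives
print(classify_publisher([1, 0, 0, 0]))  # almost nothing archived
```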

At the individual publication level, under 60 percent were present in at least one archive, and over a quarter didn’t appear to be in any of the archives at all. (Another 14 percent were published too recently to have been archived or had incomplete records.)

The good news is that large academic publishers appear to be reasonably good about getting things into archives; most of the unarchived issues stem from smaller publishers.

Eve acknowledges that the study has limits, primarily in that there may be additional archives he hasn’t checked. There are some prominent dark archives that he didn’t have access to, as well as things like Sci-Hub, which violates copyright in order to make material from for-profit publishers available to the public. Finally, individual publishers may have their own archiving system in place that could keep publications from disappearing.

Should we be worried?

The risk here is that, ultimately, we may lose access to some academic research. As Eve phrases it, knowledge gets expanded because we’re able to build upon a foundation of facts that we can trace back through a chain of references. If we start losing those links, then the foundation gets shakier. Archiving comes with its own set of challenges: It costs money, it has to be organized, consistent means of accessing the archived material need to be established, and so on.

But, to an extent, we’re failing at the first step. “An important point to make,” Eve writes, “is that there is no consensus over who should be responsible for archiving scholarship in the digital age.”

A somewhat related issue is ensuring that people can find the archived material, which is the problem DOIs were designed to solve. In many cases, the authors of the manuscript place copies in places like the arXiv/bioRxiv or the NIH’s PubMed Central (this sort of archiving is increasingly being made a requirement by funding bodies). The problem here is that the archived copies may not include the DOI that’s meant to ensure they can be located. That doesn’t mean they can’t be identified through other means, but it definitely makes finding the right document much more difficult.

Put differently, if you can’t find a paper or can’t be certain you’re looking at the right version of it, it can be just as bad as not having a copy of the paper at all.

None of this is to say that we’ve already lost important research documents. But Eve’s paper serves a valuable function by highlighting that the risk is real. We’re well into the era where print copies of journals are irrelevant to most academics, and digital-only academic journals have proliferated. It’s long past time for us to have clear standards in place to ensure that digital versions of research have the endurance that print works have enjoyed.

Journal of Librarianship and Scholarly Communication, 2024. DOI: 10.31274/jlsc.16288

https://arstechnica.com/?p=2009083