Tearing Down Science's Citation Paywall, One Link at a Time

A new initiative makes scientific paper citations open to everyone, for free.

To scientists, citations are currency. No, you can’t use them to put gas in your car or food on your table. But surviving in academia means publishing papers people want to read and, more to the point, cite in their own research. Citations establish credibility and determine the impact of a given paper, researcher, and institution. Simply put, they shape what scientists, and eventually the rest of us, come to believe.

The problem lies in tracking who’s citing whom. For the last few decades, only researchers with subscriptions to two proprietary databases, Web of Science and Scopus, have been able to follow citation records and measure the influence of a given article or scientific idea. That isn’t just a problem for scientists trying to get their résumés noticed; a citation trail tells the general public how it knows what it knows, each link a breadcrumb back to a foundational idea about how the world works.

On Thursday, a coalition of open data advocates, universities, and 29 journal publishers announced the Initiative for Open Citations with a commitment to make citation data easily available to anyone at no cost. “This is the first time we have something at this scale open to the public with no copyright restrictions,” says Dario Taraborelli, head of research at the Wikimedia Foundation, a founding member of the initiative. “Our long-term vision is to create a clearinghouse of data that can be used by anyone, not just scientists, and not just institutions that can afford licenses.”

Here’s how it works: When a researcher publishes a paper, the journal registers it with Crossref, a nonprofit you can think of as a database linking millions of articles. The journal also bundles those links with unique identifying metadata, like author, title, print page numbers, and who funded the research. All of the major publishers started doing this when Crossref launched in 2000. But most of them held the reference data, the part detailing who cited whom and where, under strict copyright restrictions. Accessing it meant paying tens of thousands of dollars in subscription fees to the companies that own Web of Science or Scopus.
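Once a publisher flips its references open, anyone can pull them straight from Crossref’s public REST API. Here’s a minimal sketch in Python; the DOI, the User-Agent contact string, and the field handling are illustrative assumptions for this article, not official I4OC tooling:

```python
import requests

# A DOI to look up. This one is a placeholder; substitute any real,
# registered DOI you care about.
DOI = "10.1371/journal.pone.0000000"

# Crossref asks (but does not require) that callers identify
# themselves with a mailto address in the User-Agent header.
HEADERS = {"User-Agent": "citation-demo/0.1 (mailto:you@example.org)"}

resp = requests.get(f"https://api.crossref.org/works/{DOI}",
                    headers=HEADERS, timeout=30)
resp.raise_for_status()
work = resp.json()["message"]

# Identifying metadata deposited alongside the DOI.
print(work.get("title"))      # list of title strings
print(work.get("publisher"))
print(work.get("funder"))     # funding bodies, when deposited

# The reference list appears only if the publisher deposited it and
# made it publicly visible, which is exactly what I4OC asks for.
for ref in work.get("reference", []):
    print(ref.get("DOI") or ref.get("unstructured"))
```

If the reference field comes back empty, the publisher either never deposited its reference list or is still keeping it closed.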

Historically, just 1 percent of publications using Crossref made references freely available. Six months after the Initiative for Open Citations started convincing publishers to open up their licensing agreements, that figure is approaching 40 percent, with around 14 million citation links already indexed and ready for anyone to use. The group hopes to maintain a similar trajectory through the year.
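You can get a rough sense of that trajectory yourself by comparing the number of Crossref records that carry a deposited reference list against the corpus as a whole. A hedged sketch, assuming Crossref’s has-references filter still behaves as documented (and noting that deposited references aren’t always openly visible, so this is an upper bound on the open share):

```python
import requests

BASE = "https://api.crossref.org/works"
HEADERS = {"User-Agent": "citation-demo/0.1 (mailto:you@example.org)"}

def total_results(params):
    """Return the total-results count for a Crossref /works query."""
    resp = requests.get(BASE, params={**params, "rows": 0},
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]["total-results"]

all_works = total_results({})
with_refs = total_results({"filter": "has-references:true"})

print(f"{with_refs:,} of {all_works:,} registered works deposit "
      f"references ({100 * with_refs / all_works:.1f}%)")
```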

“It’s not that much actual work to do it; it’s just about flipping a switch and getting publishers to agree to releasing this data,” says Taraborelli. He called on the open-access movement to focus its efforts on citation data in September. In the months since, heavyweights like Springer Nature, Taylor & Francis, and Wiley, three outfits that together publish nearly 25 percent of all peer-reviewed journals, have become some of the 29 publishers making their references freely available.

“It will make our customers’ lives easier by helping data scientists to mine a large body of references in one go,” says Steven Inchcoombe, chief publishing officer at Springer Nature. The company signed on in February and has already provided reference lists for more than 6 million articles in about one-third of its 3,000 journals, with the rest coming later this year. Elsevier, which owns Cell and The Lancet, not to mention Scopus and some 30 percent of Crossref’s citation data, is sitting things out for now. The initiative can’t hit its goal of 100 percent coverage without bringing the Dutch publisher aboard. But that’s not stopping Taraborelli from thinking about how to divine deeper truths beyond citation metadata.

“We really want to have the ability to mine the contents of the entire paper,” he says. “Because then we’re talking about enabling the provenance of specific facts.” He points to a Wikimedia project that examined every published paper on the Zika virus, only about 1,000 studies. Aided by machine learning, the team built a map connecting the dots between statements that get shared as facts online (think Wikipedia, Britannica.com, and the like) and the specific papers behind them. The idea: with enough data, you could determine how common truths take shape and trace them back to the primary source. Rigorous analysis like this, he says, would advance fields much more quickly than their current pace. And it would make it easier to figure out where the general public is getting its information.

That could come in handy when trying to combat the rising tide of alternative facts. And while science is more about narrowing the window of uncertainty than claiming truth with an upper-case “T,” there’s always room to improve the methods for doing that, one footnote at a time.