In a project that could unlock the world’s research papers for easier computerized analysis, an American technologist has released online a gigantic index of the words and short phrases contained in more than 100 million journal articles — including many paywalled papers.
The catalogue, which was released on 7 October and is free to use, holds tables of more than 355 billion words and sentence fragments listed next to the articles in which they appear. It is an effort to help scientists use software to glean insights from published work even if they have no legal access to the underlying papers, says its creator, Carl Malamud. He released the files under the auspices of Public Resource, a non-profit corporation in Sebastopol, California that he founded.
Malamud says that because his index doesn’t contain the full text of articles, but only sentence snippets up to five words long, releasing it does not breach publishers’ copyright restrictions on the re-use of paywalled articles. However, one legal expert says that publishers might question the legality of how Malamud created the index in the first place.
Some researchers who have had early access to the index say it’s a major development in helping them to search the literature with software — a procedure known as text mining. Gitanjali Yadav, a computational biologist at the University of Cambridge, UK, who studies volatile organic compounds emitted by plants, says she aims to comb through Malamud’s index to produce analyses of the plant chemicals described in the world’s research papers. “There is no way for me — or anyone else — to experimentally analyse or measure the chemical fingerprint of each and every plant species on Earth. Much of the information we seek already exists, in published literature,” she says. But researchers are restricted by lack of access to many papers, Yadav adds.
Read it all.