III: Medium: Collaborative Research: Citing Structured and Evolving Data
The National Science Foundation has funded this project (NSF IIS - 1302212) to investigate the computational issues involved with data citation. There is increasing demand for the provenance, authorship and ownership of data to be recognized through some form of citation in much the same way that conventional scholarship is served by citations. However, the specification and generation of citations for large, complex data sets poses non-trivial computational challenges.
At the same time, new forms of publication, such as executable papers and large derived data sets, require new forms of citation, and in the case of linked open data we do not even know how to organise our information so that citation is possible. Our hope is to make some headway with these challenges as well. If you are interested in these problems, please get in touch with one of the investigators:
- Susan Davidson, PI (University of Pennsylvania)
- Val Tannen, co-PI (University of Pennsylvania)
- Peter Buneman, co-PI (University of Edinburgh/University of Pennsylvania)
- Wenfei Fan, Senior Personnel (University of Edinburgh)
- James Frew, PI (UCSB)
Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through a database rather than conventional journals, how is some part of a database to be cited? More generally, how should data stored in a repository that has complex internal structure and that is subject to change be cited?
The goal of this research is to develop a framework for data citation which takes into account the increasingly large number of possible citations; the need for citations to be both human and machine readable; and the need for citations to conform to various specifications and standards. A basic assumption is that citations must be generated, on the fly, from the database. The framework is validated by a prototype system in which citations conforming to pre-specified standards are automatically generated from the data, and tested on operational databases of pharmacological (IUPHAR) and Earth science data (ES3).
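The assumption that citations are generated on the fly from the database can be illustrated with a small sketch. It is not the project's prototype: the schema (`family`, `contributor`), the database name `ExampleDB`, the version string and the citation layout are all hypothetical, chosen only to show a citation being assembled from the data at request time rather than stored in advance.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE contributor (family_id INTEGER, name TEXT);
        INSERT INTO family VALUES (1, 'Example receptors');
        INSERT INTO contributor VALUES (1, 'A. Author');
        INSERT INTO contributor VALUES (1, 'B. Writer');
        """
    )
    return conn


def cite_family(conn, family_id, db_name="ExampleDB", version="2014.1"):
    """Assemble a citation for one family from the current database contents.

    The citation format is an assumption made for illustration only.
    """
    row = conn.execute(
        "SELECT name FROM family WHERE id = ?", (family_id,)
    ).fetchone()
    if row is None:
        raise KeyError(family_id)
    # Contributors are looked up at citation time, so the citation
    # reflects whatever the database holds when it is requested.
    authors = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM contributor WHERE family_id = ? ORDER BY name",
            (family_id,),
        )
    ]
    return f"{', '.join(authors)}. {row[0]}. {db_name}, version {version}."


if __name__ == "__main__":
    print(cite_family(build_demo_db(), 1))
```

Because the citation is computed by a query, adding a contributor or releasing a new version changes the generated citation without any stored citation text having to be edited.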
The broader impact of this research is on scientists who publish their findings in organized data collections or databases; data centers that publish and preserve data; businesses and government agencies that provide on-line reference works; and the various organizations that formulate data citation principles. The research also tackles the issue of how to enrich linked data so that it can be properly cited.
In addition to IUPHAR and ES3, we are working with the following data sources:
- Eagle-i, a resource discovery dataset for translational science research. Eagle-i has clearly specified data citation requirements and automatically serves up persistent identifiers (Eagle-i IDs) for resources, but it does not automatically generate the citation. We have downloaded the RDF dataset and created an interface which, given an Eagle-i ID, renders the citation in human-readable format, with optional XML/BibTeX/RIS exports. We have hosted this on AWS and are testing it with Eagle-i developers.
- Reactome, a curated and peer reviewed pathway database whose goal is to support basic research, genome analysis, modeling, systems biology and education. Reactome also has clearly specified data citation requirements, but does not automatically generate the citation. We have downloaded XML versions of the dataset, and have developed citation rules reflecting these requirements.
- Hetionet, a public biological database being developed within collaborator Dr. Casey Greene's group.
- GENCODE, a high quality reference for human and mouse genomes which also provides summary information.
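Rendering one record in several citation formats, as described for the Eagle-i interface above, can be sketched as follows. This is a minimal illustration and not the project's code: the `ResourceRecord` fields, the example values, and the choice of BibTeX and RIS fields are assumptions, and the real citation requirements of each data source would dictate the actual fields.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical, simplified metadata for one cited resource."""
    resource_id: str
    title: str
    provider: str
    url: str
    accessed: date


def to_text(r: ResourceRecord) -> str:
    """Human-readable citation (layout is an assumption)."""
    return f"{r.provider}. {r.title}. {r.url} (accessed {r.accessed.isoformat()})."


def to_bibtex(r: ResourceRecord) -> str:
    """BibTeX @misc entry; the key is the ID with punctuation removed."""
    key = "".join(ch for ch in r.resource_id if ch.isalnum())
    return (
        f"@misc{{{key},\n"
        f"  title = {{{r.title}}},\n"
        f"  howpublished = {{{r.provider}}},\n"
        f"  url = {{{r.url}}},\n"
        f"  note = {{Accessed {r.accessed.isoformat()}}}\n"
        f"}}"
    )


def to_ris(r: ResourceRecord) -> str:
    """RIS record using the DATA reference type (field choice is an assumption)."""
    lines = [
        "TY  - DATA",
        f"TI  - {r.title}",
        f"PB  - {r.provider}",
        f"UR  - {r.url}",
        f"Y2  - {r.accessed.strftime('%Y/%m/%d')}",
        "ER  - ",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    record = ResourceRecord(
        resource_id="EX-0001",  # placeholder, not a real identifier
        title="Example Antibody",
        provider="Example University",
        url="http://example.org/EX-0001",
        accessed=date(2014, 4, 17),
    )
    for render in (to_text, to_bibtex, to_ris):
        print(render(record), end="\n\n")
```

Keeping the record separate from the renderers means the same data can serve both human readers and reference managers, which matches the requirement that citations be both human and machine readable.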
Our code is published on GitHub at https://github.com/thuwuyinjun/Data_citation_demo ; if you are interested, please fork the repository. It reuses some code from a GitHub project on query rewriting using views (https://github.com/chenlica/alu01-corecover).
Besides existing scientific databases such as IUPHAR, we have also developed the following datasets for validation and experiments:
- DBLP-NSF datasets, an integration of publication data extracted from the DBLP dataset and the corresponding grant data extracted from the National Science Foundation grant dataset.
Workshop on Computational Challenges in Data Citation
Event Date: April 17th and 18th, 2014
Event Venue: Singh Center for Nanotechnology
Event Wiki Page: Computational Challenges in Data Citation
Workshop Report: Computational Challenges in Data Citation - Workshop Report
The purpose of this workshop was to bring together people representing Information Science, Data Science, and Computer Science to enumerate the computational challenges and opportunities associated with data citation. The workshop was organized around three sessions – Citation Principles and Standards, Citation and Linked Open Data, and Executable Papers and Reproducibility – each of which opened with an overview talk followed by perspectives from participants. Participants then split into breakout groups, each containing people from different disciplines, and brainstormed what they believed to be the most important computational challenges for data citation. During a plenary session the next day, the challenges were revisited and refined.