Why not publish data?

I try to avoid writing things that may make me sound stupid, but this post falls in that category.

Recently I was reading about efforts related to data sharing: technological infrastructure, curation, educating researchers, and the like. I was struck by the thought that most of the advocacy for data sharing boils down to an exhortation to stick it in a digital repository.

This seems a bit odd considering that much of what propels science is the pressure to publish (written) results (in journals, conferences, monographs, etc.). There is a hierarchy of venues in terms of prestige, which is in turn linked to research funding, promotion, public attention (media coverage, policy influence), etc.

Might the best way to get researchers to share data be to create a similar system for datasets? It might provide a compelling incentive.

Moreover, publishing might provide a compelling incentive for the related task of data curation (making data understandable and usable to others, e.g. through formatting and annotation). Currently, much data sees little use outside the lab where it was generated, so researchers have little incentive to spend time “prettying it up” for others (who may find the way it was recorded inscrutable). Even if they are persuaded to “share” their data by posting it online, spending time making it useful to others may seem a low priority. If there were pressure to publish the dataset, though, researchers would have an incentive to make the data as intuitively useful to others as practicable, so that reviewers could quickly identify its novelty.

This doesn’t seem so outlandish to me. There are similar efforts to provide publication fora for materials that traditionally went unpublished (we might say undersupplied), such as negative results and experimental techniques.

If you think of it in terms of a CV, the difference is between these lines:

  • Created and shared large, valuable dataset which is highly regarded by peers
  • Publication in J. Big Useful Datasets, impact factor X

It may be hard for a reviewer to quantify or validate the former; the latter demonstrates that the researcher’s contribution has already been validated and provides built-in metrics to quantify the contribution.

There are other ways to skin this cat. One option would be to build alternative systems for conferring recognition (e.g. awards, metrics for contributions to shared datasets). Another would be to make data sharing an enforceable part of other scientific endeavors, e.g. mandatory as a condition of research funding, or as a condition of publishing written results in a journal. I think multiple approaches will yield the best result, and that creating “journals” (or some other name) for “publishing” datasets could be a useful way to spur participation.

Has this been done already? What are the drawbacks to this approach?

This entry was posted in Academia, Open access, Science.

9 Responses to Why not publish data?

  1. See the JISC-funded CLADDIER project (led by BADC etc) for ideas on data publishing, along very similar lines to what you suggest…

  2. I think it depends a lot on what you mean by a “journal”. I would be against creating a formal journal (with peer review, page numbers, indexes) for data because I think it would be a retrograde step. The web was built to house datasets, and we are now building an infrastructure that will let this be done properly. I think it is much more effective to build purpose-built systems designed to hold, aggregate, and process data in sensible, web-native ways.

    If, however, you mean that there is value in making datasets citeable objects and providing incentives for people to “publish” datasets in a useful way, then I am all for it. The crystallography community have been most effective at creating “data journals” (the Acta Cryst set), and it is because of these journals that crystallographers are amongst the most cited and highly published scientists.

    The flip side of this is the tendency of people to try and push “dataset-shaped pegs into paper-sized holes”. Many of the papers that describe both biological structures and genome sequences seem to be a dataset in a desperate search for some sort of interesting statement to hang off it. Many interesting and important structures and sequences would have been better published as “here it is, make of it what you will” rather than trying to craft a somewhat nebulous paper around them.

    The same could be said of the publication of software tools (and indeed databases). People are desperate to publish a paper so they can add it to their CV and so that people have something to cite. Not because a paper is a good way of describing or documenting the work, but because it is the only way people can currently get credit.

    So my argument would be that we need to expand the way we use and apply citations, and make much more serious efforts to make databases, web pages, tools, etc all solid citeable objects. Otherwise we are going to end up with a lot more journals (don’t we have enough already?) and all the additional overheads that go with it. If the purpose of creating a journal is only to make the dataset citeable then I think there are better ways of going about it.

  3. Not a stupid question at all, Gavin. Indeed, valuing new kinds of knowledge production is the key question just now in all sorts of fields, not just chemistry!

    I can’t add much to what Cameron said; he is of course much closer to the problem than I am. All I can suggest is that perhaps the only things that change researcher praxis are money and prestige. Funder data-retention mandates hit at the money angle; journal data-retention mandates hit at prestige.

    Organizations that want to move this agenda forward and have money to spend should perhaps consider awards; it seems a bit too early for grant programs at this juncture. But awards are both prestige AND money, so they get attention!

  4. Simon Coles says:

    I have spent considerable time trying to provide and promote repository systems for data that are linked into the publication process. Most researchers shy away from open data for fear that it will jeopardise their publication possibilities, so joining up with publishers seems like a logical approach. This has been a very hard sell indeed: most publishers don’t see a business model and are therefore not inclined to invest. The issue is not investing in technology to enable the process, but the vast amount of effort it would take to alter the publication workflow, to which both the publisher and the author would have to adapt. A publisher will essentially only act in response to demand from its customers.

    Technologists are busy selling the benefits of open (and thence linked) data and the potential for new research via mash-ups, but many researchers see this as others *taking* their data and getting credit for building on it, and are against it. One way of getting round this is to educate and advise the upcoming generation of researchers, i.e. graduate students. One selling point is the advantage of considered data management: storing a well-described dataset in a structured system will save a lot of time trying to find it later on (e.g. when writing a thesis). This might sound basic, but I assure you that a massive amount of data is lost on laptop and instrument hard drives simply because it is not in a discoverable form, having not been described properly at the time of generation. Basic authoring tools that enable data to be easily incorporated into documents (e.g. theses!) and/or changes in practice that allow one to simply point at a dataset from a document in a formal way are obvious wins here. The valuable side effect of this approach is that datasets can then easily be made public in a stand-alone way at an appropriate time, with little further effort on the part of the researcher. In this respect I wholly agree with Cameron that we need to promote ways of making datasets citeable (both protocols and types of resource).
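    To make the habit concrete: “describe it at the time of generation” can be as simple as writing a small metadata sidecar next to each data file. A minimal sketch in Python — the helper and its field names are illustrative, not any particular metadata standard:

```python
import datetime
import json
import pathlib


def describe(data_file: str, title: str, creator: str, keywords: list[str]) -> pathlib.Path:
    """Write a minimal metadata sidecar next to a data file so the dataset
    remains discoverable later. Field names are illustrative, loosely
    Dublin-Core-flavoured."""
    meta = {
        "title": title,
        "creator": creator,
        "keywords": keywords,
        "file": pathlib.Path(data_file).name,
        "created": datetime.date.today().isoformat(),
    }
    # Place the record alongside the data, e.g. run42.csv -> run42.meta.json
    sidecar = pathlib.Path(data_file).with_suffix(".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```

    A dataset described this way can later be found with a simple search over the `.meta.json` files, and the same record can be pointed at from a thesis or paper when the data is made public.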

    The next generation of researchers will understand the benefits of discoverable data!

    Increased visibility and rewards for publication are valuable incentives and definite matters to address, but I fear they are not the whole answer – the problem needs to be approached from the ground up before adding this icing to the cake.

    I draw mainly from my field of chemistry in making these comments, but after several years working in the area of open data publication I am aware of practice in other disciplines and see a considerable number of parallels…

  5. Wow, this is a great write-up, and not a stupid question at all. It’s also great to see so many other scientific fields engaged in the same discourse and having the same issues, because it really speaks to the universal nature of this problem.

    Myself, I’m a hydrologist, and I see the disconnection of stream data, weather data and other forms of time series as a serious problem in the science of water and climate research. I have spent considerable time trying to find a way to make such data universally accessible and, most importantly, discoverable – as Simon Coles suggests.

    Currently, I’m founding a start-up focused specifically on this issue of data publication and discovery. My goal is to create an environment in which we can develop standards for the storage, presentation and discovery of scientific data while supporting and promoting the concept of data as open.

    Sadly, over the last 10 years, I’ve been involved in three university projects that tried to tackle this, and all either had their funding cut or died with the whims of student turnover; hence my plans for a private startup more similar to Netscape and the Mozilla Foundation than to a university project. But I digress.

    I do think your comment is spot on the mark, but that the critique of a paper-publishing schema is also spot on. My thought is that we need an accepted and easy-to-use system that promotes the storage and linkage of data, just as Simon Coles described, along with the transmission of data across platforms (and technology sectors).

    Perhaps more importantly, we need linkage of data to its source provider. That way, as a data creator, my reputation follows the data (to a certain extent), which would give me an incentive to open its use – much as people who use a paper cite my name, which is an incentive for me to write.

    There are issues surrounding that, of course- not the least of which is misuse of data. But until we accept that something is possible, we can never understand how much we are able to change and grow in order to make it work.

    Excellent discussion!

  6. By the way, apologies for using my personal blog as my OpenID, which means that my name is not listed. I didn’t realize that would happen.

    -John Metta

  7. We’re doing it – or some of it. Dryad is a data repository for ecology, evolutionary biology & related fields – see http://www.datadryad.org
    Dryad is developed by the National Evolutionary Synthesis Center http://www.nescent.org and the UNC Metadata Research Center http://ils.unc.edu/mrc/

    Dryad aims to provide one-stop data deposition upon publication by interfacing with specialized repositories that are already required for publication, such as GenBank and Treebase. However, for many publications, these repositories do not capture the whole data package. As a result, many important datasets are not being preserved and are no longer available, or usable, at the time that they are sought by later investigators.

    To staunch this loss, Dryad is being designed as a “catch-all” repository for numerical tables and all other kinds of published data that do not currently have a home. A major design consideration with these data is to avoid placing an undue burden of metadata generation on individual researchers while at the same time capturing sufficient metadata to enable data discovery and reuse.

    We are working with stakeholders from a consortium of journals and scientific societies to develop data sharing policies and ensure the long-term sustainability of the repository. Dryad is built upon the DSpace repository software developed by the Massachusetts Institute of Technology and Hewlett-Packard, with moderate customizations. All Dryad-specific customizations are available from the Dryad code repository.

    Stay tuned…

  8. Heather Morrison says:

    What will drive open data is its use, or more accurately, re-use as this is the point of open. Or at least this is the impression that I am getting from looking at some of the discussions, and thinking about the kind of data that I like to reuse (library metadata). My impression is that what people like is the actual raw data, on which can be built pretty interfaces on the fly. This suggests databases rather than journals. Makes sense to me – just give me the metadata, properly tagged so that it can be manipulated.

    The most interesting discussion on this that I’m aware of is the civicaccess.ca group. These folks talk a lot about open data from the perspective of what they want to do with it (or what other people are doing with it). It all sounds very exciting, even if I don’t always know what they are talking about. Looking forward to the Democracy and Technology preconference at the Canadian Library Association conference in Montreal this year, which will be a gathering of these and like-minded folks (e.g. open government types).

  9. Publication of research data and access to research data are big issues in the Helmholtz Association of German Research Centres. As part of its Open Access project, the Helmholtz Association promotes activities to make research data accessible. Making research data citeable as publications is part of this strategy.

    Other German research organisations have since joined the Helmholtz Association in this matter. More information is available on the website of the OA Information Platform.

    As a first step towards formalising the publication of data as an independent work (alongside data as a supplement to peer-reviewed publications and data from environmental monitoring systems), the Helmholtz Association has funded the launch of an Open Access journal for data publication: Earth System Science Data (ESSD).

    On the technical side, the German Science Foundation (DFG) has funded the project “Publication and Citation of Scientific Primary Data” (STD-DOI), which has developed a technical and organisational infrastructure to assign Digital Object Identifiers (DOIs) to scientific and technical data publications. This service is now part of the infrastructure of the German National Library of Science and Technology (TIB Hannover).

    An example of a data publication as a supplement to a peer-reviewed paper (Scientific Drilling Database): doi:10.1594/GFZ.SDDB.1043, which supplements doi:10.1016/j.gloplacha.2004.11.011.

    An example of a data publication as an independent work (PANGAEA/WDC-MARE): doi:10.1594/PANGAEA.547983 with reference to ESSD.
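    Mechanically, a dataset DOI behaves exactly like an article DOI: prefixed with the doi.org resolver, it dereferences to the dataset’s landing page. A minimal sketch (the helper function is illustrative):

```python
def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI. Dataset DOIs (e.g. PANGAEA, GFZ)
    resolve through the same doi.org handle service as article DOIs."""
    # Strip an optional "doi:" display prefix before building the URL.
    doi = doi[4:] if doi.startswith("doi:") else doi
    return "https://doi.org/" + doi


print(doi_url("doi:10.1594/PANGAEA.547983"))
# https://doi.org/10.1594/PANGAEA.547983
```

    This uniformity is the point: a reference manager or citation index needs no special handling to treat a dataset as a first-class citeable object.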

    For reference see also: Klump, J., et al. (2006), Data publication in the Open Access Initiative, Data Science Journal, 5, 79-83, doi:10.2481/dsj.5.79.

    The technical solutions are there. The question is now how we can make data publication part of a common scientific practice.

    Jens Klump
