I’m liveblogging the first meeting of the new Board on Research Data and Information today and yesterday. Standard liveblogging disclaimers apply. The presentation slides are on the meeting site. Because some of the slides are online, I’ll focus on what’s not on the slides.
International Workshop on the Socioeconomic Effects of Access and Reuse Policies for Online Public Sector Information, with OECD
Julia Lane, NSF
What is the value?
Can we develop a microdata infrastructure to study the economic impacts of science?
Q: I was impressed by ads by MIT about companies started by MIT alumni. I was told then by NSF it wasn’t possible to identify everyone who got NSF funding. Can we do that now?
A: Funding agencies' systems are built for proposal management, not as management information systems. NIH and DOE, but not NSF, have contact info on postdocs and grad students. There’s not sufficient information on sub-awards.
Q: We scrape screens to track who gets funding to build databases.
A: I hope that’s going to change.
Q: Impact has always been important, but hard to measure. What’s the stumbling block to make this information available and usable?
A: Inertia, but also confidentiality.
Q: General question of impact of science and science investment. For data information, we’re concerned about the impact on science as well as societal decision making. How can we quantify the value of data?
A: That would be one of the issues. But there are silos, each with their own issues.
Q: What could the board do to help you?
A: This was just informative. We need a data infrastructure to make evidence-based science policy. (Science of Science Policy)
Overview of USNC/CODATA
BRDI is now US National Committee for CODATA
Robert Chen is CODATA Secretary General
CODATA is a committee of ICSU, but with its own membership and own organization
National members — mostly science academies
Union members — mostly hard science
Other ICSU bodies
What should BRDI do (in general)?
We could operate in two spaces, long- and short-term. We’ve heard a variety of interesting things — maybe we can articulate that in a way that’s actionable.
What would be helpful to me is to have a list of policy issues and functional issues, then select the ones to work on.
The big question is, what should I do with my data? There’s no obvious thing to do.
We could break it into short-, medium- and long-term. We could decide where we want to put our effort. We are at a very unusual moment in time. One thing we could do is to let new appointees know that this group exists, e.g. Obama’s science advisor, new FCC chair. We should identify their expected needs.
Who decides what I do with my data — the researcher or the funder? Why are we having this conversation — because technology has kicked up a new set of issues. We need to focus on the fact that there’s something new. We should focus on interoperability.
Researchers operate in a much more complex environment than “what should I do about my data?” For example: what does my provost want, my lab director, etc.? We should make some recommendations to clarify so that everyone involved at least has something to poke at. What should the American research community do with its data?
We have to do our thinking at a different kind of level. The White House science office came up with a plan for what should happen to data.
This group can communicate to Congress. An agenda is being formed — this needs to be communicated.
We need to communicate the value proposition to the public at large.
Does the board need to communicate to scholars, to develop literacy here? Does the board need to look at “data and information” beyond numerical data — journal publications, grey literature, science blog posts, mailing lists and discussion forums, lab notebooks…?
We can’t yet give policy answers to policymakers — but can we at least give questions? If we could say there’s not yet a well-thought-out data policy in the U.S., that could be a starting point.
Looking at role of funders — if we start enacting data management policies, it exerts some influence. What can we do to help the administration look at this? What can we do to help grassroots efforts look at this?
CTO — may be in OMB or OSTP — will have a great title but no authority and no budget. But will Obama’s CTO go viral? Maybe not in the federal government. There are CTOs in many federal agencies but without much influence.
Foreign policy — information sharing as an instrument of diplomacy. We should talk to State.
Our prime constituency is scientists — we need to go between the government and what’s good for science. We need to be the guardians of trying to do new things, not just do the same thing a little better.
We can translate reports into an agenda for science — focus on what it means for science.
We need to focus on making it work for science — not an abstract discussion of policy or bureaucracy.
(1) Could be useful to do economic impact — try to measure value of data. (2) Try to design data system that others could use. It would help the sciences in general to develop principles for how systems should work.
What is the dynamic between National Academies and executive entities? How do we get involved?
Ex.: NRC recommended science in diplomacy — resulted in science advisor to SecState.
If we want to do recommendations, it takes a lot of time and money, but it does carry some weight. If we want to influence the debate, we should hold some high-level fora with different constituencies to explore well-thought-out questions, and disseminate the results.
Unless we have personal contacts, we’ll never get high-level appointees’ attention.
All new appointees are being briefed by transition team. I think they’d welcome the opportunity to sit with us and discuss these issues.
It’d be good to connect these issues with value propositions for specific areas — energy, education, etc.
On the agenda was the issue of long-term data management and access. We need to focus on communication with libraries, as well as universities. We need to bring this discussion to high levels in universities.
I think most libraries have thought about it but have reacted with “we don’t have the budget, the skills, and don’t know why we specifically should pay for it”.
Key problem is one of money. We should take the discussion to university presidents, e.g. at AAU. Stuff is happening here, but the key challenge is institutional backing. Scientists need to stand up and demand it.
Can this board make a contribution there?
Yes — we can take this to universities as well as scientists.
What happens if PLoS wins, and we have to budget for author-pays publishing? How much more do we need to budget for data?
We found we’d need $2m/year to handle the data created by Stanford.
We budget this every week — we could do some estimates of this. We spend about $12m/year — probably around $20m/year for all social sciences in U.S.
The concept of open access has been perverted. NIH policy is neutral as regards funding/pricing models. They’ve used the author-pays model as an argument against it. But NLM has been doing this since 1836: we provide the umbrella for preserving and curating the information that you have. PMC is designed not just to capture articles but also datasets, to provide dissemination but also permanent archiving. Maybe GenBank is the model for all science.
This board should try to encourage the idea of universities taking distributed responsibility for stewardship of this data. Talking to AAU is a good start, but we have to talk to funding agencies, too. DataNet was designed to avoid a fear that NSF was taking responsibility for everything.
ADS — I don’t care where the data is, but they have to defend their grant — it all depends on some review process.
Could be useful to do a functional analysis — there’s this much data being produced, it’d cost this much to store it (as a baseline activity), then layer what else we want to do with it. Then we figure out who does it. Having that information would be a place to start. This board could be uniquely positioned to convene the people who know.
There are various numbers. What I’ve heard is, nobody seems interested in the economic value measurement. Everybody wants to design the storage system.
If you’re going to do cost-benefit, you have to identify the unit of analysis.
We could get those numbers. We could also compare strategies of other countries.
If you look at the IDC report — we have no equivalent of what’s stored, how long it’s stored — I’d be shocked if we could do a data census.
You may be expecting more than people will settle for. Rough numbers can be of use to people. This is not about terabytes — even at the Internet Archive, the budget is more about people than silicon. The question is really, what level of curation. We can have too high standards and let perfect be the enemy of good.
Funding is not necessarily a zero-sum game; there are other potential sources. Creative Commons demonstrates that we can change the way we think about storage, preservation, and access.
Humanities stuff is so heavily paper that it pales in comparison to sensors.
But only 5% of faculty does big science — the rest are using much smaller data sets.
Education: videos recorded for education research are, in terms of storage, an enormous quantity.
A new paper showed that big science gets by far the most citations — it’s what everything else is based on. Everybody needs access to these.
We should focus on the criteria for defining data, measuring data, and evaluating data. That’s the expertise this board could bring.