The Digital Public Domain

8. Science Commons: Building the Research Web

Kaitlin Thaney

Science Commons, a project of Creative Commons (CC), works to encourage the sharing of scientific and academic knowledge. This chapter will look at the technology and infrastructure designed and used at Science Commons to better share knowledge, an approach contextualised here as “building the research Web”, in the hope of utilising the power of current Internet technologies to accelerate scientific research.1 There are three main tenets to consider: open access to the content; access to the physical research materials; and an open source knowledge management system.

This approach requires redesigning information that is already digital into a format that works better for research. This process needs structure, standardised agreements, access to the content and data, metadata that dictates under what terms information is available, common naming systems, and links to repositories, to name just a few. Only then can one start to bring the efficiencies commonly associated with the Internet and a network approach to the world of scientific research.

1. The problem

Printing, delivery and research are rapidly moving into the digital domain. Even with this shift, however, scientific research still largely relies on the “paper metaphor”: the idea that knowledge is transmitted by an individual, on paper, rather than published in a form that machines can read.

Science Commons has identified four key problems. First, there is the issue of cognitive overload, especially as information is translated into digital form or created that way. We are beginning to know more than our brains can process, and so face a data deluge. Secondly, most of what we know is poorly fitted for use and reuse (a design problem), which makes it difficult or impossible to, say, text mine the information. Even the simple act of publishing a document as a PDF adds a barrier to fully utilising the information it contains. Documents are poorly linked or annotated, making it increasingly difficult to connect information. Thirdly, there is a licensing problem: knowledge is licensed in such a way that it is not legally available for reuse (an issue routinely faced in data integration and text mining). Lastly, the physical materials on which the research is based, the non-digital objects such as lab mice, DNA, gene snippets and plasmids, are not always freely available in practice.

The first three points—cognitive overload, the design, and licensing problems—all describe problems of the regular Internet, but in order to have “open science” or a “research Web”, one must include in this discussion an additional dimension: access to the physical materials.

Current ways of conducting this research are imperfect. Take, for example, the following research question, which could be asked of a “research Web”: based on what has been published in journals and databases, what signal transduction genes may be active in pyramidal neurons? This question would serve as a lead to find drug targets in Alzheimer’s disease, since signal transduction genes tend to make for good drug targets and pyramidal neurons are implicated in the disease. A simple Google search returns approximately 189,000 results. Conducting this search in other information warehouses such as the US National Institutes of Health’s PubMed or PubMed Central provides an enormous number of articles, references, and citations. Sorting through all of this knowledge would take far longer than the grant period of any normal researcher; it is an example of the aforementioned data deluge and cognitive overload problem. What you should be able to get using the power of the Internet is a list of genes that meet the conditions specified in the original research question.

It is currently very difficult to use the network to build on and validate research. There is no technical barrier to doing this, no creative breakthrough nor “eureka moment” needed. It is a matter of reformatting what we already know into a way that works better. Three steps need to be taken to achieve this: first, one has to address the legal issues around accessing the content (be it raw data or scholarly literature); secondly, one has to address the legal, social, and technical issues that surround the physical tools; and thirdly, one has to begin some sort of Open Access knowledge management process. The goal: to go from the old way of collaborating—which was based on the idea of transmitting knowledge through paper, of reading the canon on paper and querying single-access databases—to a new way of collaborating using machines and standardised distribution. Those three areas are critical to building a research Web.

2. Open Access content

Our Scholar’s Copyright project began with the promotion of CC licenses to peer-reviewed journals. The most notable adopters are the Public Library of Science, BioMed Central and Hindawi. To date, there are more than 350 peer-reviewed journals using the CC Attribution license for their content. Other adopters include Nature Precedings, the preprint server run by the Nature Publishing Group, in conjunction with the Wellcome Trust, the British Library and Science Commons.

The second part of this project supports the self-archiving route to making scholarly literature freely available. In early 2007, Science Commons released a set of “author addenda” that could be printed, filled in and submitted along with the author’s manuscript. This allowed authors to retain certain rights set out in the text of the addenda and to mark their research for reuse. We took this one step further and created a Web tool, the Scholar’s Copyright Addendum Engine (SCAE), that allows authors to fill in the form online, choose the addendum that best suits their needs, and auto-generate the form. The tool can easily be dropped into a university’s website and is currently running on the sites of Carnegie Mellon University, MIT and the Association of Research Libraries. Since its launch in mid-May 2007, over 900 addenda have been generated.

The SCAE allows a user to enter very basic publication information and generate a document that can be attached to a copyright transfer agreement in order to reserve a number of rights over their work. All versions reserve the basic right for authors to reuse their work in their own teaching and professional activities as well as in future works. Beyond that baseline, each addendum reserves a different set of additional rights: for example, the ability to place a copy of the final PDF version of the work on the Internet upon publication, or to do so only after a six-month delay or other specified embargo period (the “Delayed Access” addendum).
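
The generation step described above can be illustrated with a minimal sketch. It is not the SCAE's actual implementation: the template wording, the `Publication` fields, the addendum keys `immediate` and `delayed`, and the function name `generate_addendum` are all assumptions made for illustration, and the real addenda carry their own legal text.

```python
from dataclasses import dataclass
from string import Template

# Placeholder wording invented for this sketch; it is NOT the legal text
# of the Science Commons addenda.
BASE_RIGHTS = (
    "The author retains the right to reuse the work in teaching, "
    "professional activities and future works."
)
HEADER = (
    "Addendum to the publication agreement for '$title' ($journal).\n"
    "Author: $author. Publisher: $publisher.\n"
)

# Hypothetical addendum types: every version shares the base rights and
# differs only in when the final version may be posted online.
ADDENDUM_TEMPLATES = {
    "immediate": Template(
        HEADER + BASE_RIGHTS
        + " The author may post the final version online upon publication."
    ),
    "delayed": Template(
        HEADER + BASE_RIGHTS
        + " The author may post the final version online"
        + " $embargo_months months after publication."
    ),
}


@dataclass
class Publication:
    """Basic publication details a user would type into the form."""
    title: str
    journal: str
    author: str
    publisher: str


def generate_addendum(pub: Publication, kind: str, embargo_months: int = 6) -> str:
    """Fill the chosen addendum template with the publication details."""
    if kind not in ADDENDUM_TEMPLATES:
        raise ValueError(f"unknown addendum type: {kind!r}")
    return ADDENDUM_TEMPLATES[kind].substitute(
        title=pub.title,
        journal=pub.journal,
        author=pub.author,
        publisher=pub.publisher,
        embargo_months=embargo_months,  # ignored by templates that do not use it
    )


if __name__ == "__main__":
    example = Publication(
        "An Example Article", "Journal of Examples", "A. Author", "Example Press"
    )
    print(generate_addendum(example, "delayed"))
```

Keeping the shared rights in one constant mirrors the structure described in the text, where every addendum reserves the same baseline and differs only in the additional rights it reserves.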

Our most recent work in Scholar’s Copyright revolves around the question of licensing data and databases. We carried out extensive research, held exploratory conversations and convened a number of private workshops to gain a better grasp of the complexity of this issue. On 15 December 2007, Science Commons released the outcome of this work: the Protocol for Open Access Data, which, along with the CC Zero Project, does for data what CC licenses do for literature. The idea is to allow databases to be freely integrated with one another, reconstructing the public domain for data through contract, and creating zones of certainty. The protocol incorporates a number of recommendations based on established scientific norms, such as attribution and citation. The CC Zero tool identifies which rights need to be waived (for example, copyright in databases and sui generis rights under the European Union database directive) in order to put data back into the public domain.

3. Open Access to physical materials

The Biological Materials Transfer (MTA) Project addresses the accessibility issues surrounding most research materials in biology, the physical research tools upon which the research Web is built. DNA, cell lines, lab mice and other physical tools are more often than not subject to deliberate withholding, legal slowdowns, difficulties in fulfilling orders and many other kinds of delays that add to the drag on scientific discovery and the research cycle. Our MTA work aims to build an application that applies the principles of an “e-commerce” transaction system to biological materials; we are working towards “one-click” access to these materials wherever possible.

To achieve this, our legal experts worked to create a suite of contracts, known as Materials Transfer Agreements (MTAs). There are pre-existing standard MTAs, two of which are included in the suite: the National Institutes of Health’s Uniform Biological Materials Transfer Agreement (UBMTA) and the Simple Letter Agreement (SLA). These two agreements already cover a significant amount of materials. Each MTA follows the CC “methodology” and design, consisting of a human-readable deed with iconographic representations of rights and obligations, together with metadata.

Included in this suite is a set of contracts developed in-house at Science Commons. This follows a two-tiered approach intended to allow for transfer among non-profit institutions as well as for transfers from non-profit institutions to for-profit companies for internal research uses (non-commercial use). For the former, we standardised the existing UBMTA and SLA. For the latter, we developed a suite of standard MTAs with modular options, guided by principles derived from the NIH Principles and Guidelines relating to the sharing of biomedical resources. In particular, we implemented the NIH Guidelines with respect to defining “non-commercial use” in this space.

4. Open source knowledge management

The last component needed to achieve a research Web is a way to manage all of this knowledge. Everything that we do at Science Commons takes an open source knowledge management approach. With access to the content, the data, and the physical materials, what remains is a method for fully utilising all of the information available. Science Commons is building its work using the Semantic Web as its platform. We are firm believers that the Semantic Web offers great potential for exploiting the legal access to digital knowledge and research materials through open source data integration and knowledge management.

The work previously discussed with regard to content, data, and physical materials comes together in a single proof-of-concept project: the Neurocommons. The project brings together the tools and techniques from each of these projects, serving as a proving ground for commons-based “e-science”, or the research Web as we envision it.

The Neurocommons serves as our pilot knowledge management project, with a specific focus on the brain sciences. The goal is to enable scientists to ask very complex questions and receive precise answers: for the aforementioned question about potential drug targets for Alzheimer’s disease, a list of genes rather than 250,000 web pages that may be loosely associated with the topic area. This method is not new. Pharmaceutical companies have utilised such systems, in a proprietary and closed manner, for quite some time. However, to our knowledge the Neurocommons is the first such system that is open source, providing a data-integration platform for the life sciences that gives researchers easy access to open content.

By reformatting the literature, the data, images, classification systems and ontologies into a common Semantic Web framework, it is possible to write a single query that asks a question over all of the information. To make this technically tractable, the proof-of-concept we have created integrates a series of databases, including the content from PubMed Central, gene data, mouse brain images, ontologies about molecular functions and a number of others, all pulled into a local system to demonstrate the power of open digital knowledge.
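
To make the idea of a single query over integrated sources more concrete, here is a toy sketch in plain Python. It is not the Neurocommons implementation: the triples, the gene identifiers, the predicate names and the helper functions are invented placeholders, and a real system would use Semantic Web tooling rather than a Python list. The sketch only shows how facts from different sources, once expressed in one common form, can answer the pyramidal neuron question with a list of genes.

```python
# Toy knowledge base of (subject, predicate, object) triples.
# Every identifier below is an invented placeholder, not real biological data.
TRIPLES = [
    ("gene:G1", "has_function", "signal_transduction"),  # e.g. from an ontology
    ("gene:G1", "expressed_in", "pyramidal_neuron"),     # e.g. from image data
    ("gene:G2", "has_function", "signal_transduction"),
    ("gene:G2", "expressed_in", "astrocyte"),
    ("gene:G3", "has_function", "dna_repair"),
    ("gene:G3", "expressed_in", "pyramidal_neuron"),
]


def match(pattern, triples=TRIPLES):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def genes_with(predicate, obj):
    """Return the set of subjects that have the given predicate and object."""
    return {subject for subject, _, _ in match((None, predicate, obj))}


def candidate_genes():
    """Signal transduction genes that are also expressed in pyramidal neurons."""
    return sorted(
        genes_with("has_function", "signal_transduction")
        & genes_with("expressed_in", "pyramidal_neuron")
    )


if __name__ == "__main__":
    print(candidate_genes())
```

The answer is a short list of gene identifiers rather than a ranked list of documents, which is the difference between a query over structured knowledge and a keyword search.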

The knowledge base also contains the digital descriptions of the physical research materials through our MTA work, showing the value of using these methods on physical tools. When a scientist gets a precise list of genes they can, with a single click, order those materials directly from a third party, thanks to the metadata. This is one of many opportunities and benefits of building this system on an open, commons-based foundation.
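
A rough sketch of how such metadata might connect a list of genes to orderable materials follows. Everything in it is an assumption made for illustration: the `MaterialOffer` record, the provider names, the agreement label `SC-MTA (hypothetical label)` and the function `orderable` are not part of any Science Commons system. The only detail taken from the text is that the UBMTA and SLA cover transfers among non-profit institutions, while other agreements in the suite also cover transfers to for-profit companies.

```python
from dataclasses import dataclass

NON_PROFIT = "non-profit"
FOR_PROFIT = "for-profit"


@dataclass(frozen=True)
class MaterialOffer:
    """Hypothetical machine-readable description of one physical material."""
    gene: str              # placeholder identifier linking the offer to a gene
    provider: str          # who ships the material (invented names below)
    agreement: str         # label of the standard agreement attached to it
    recipients: frozenset  # kinds of institution the agreement allows


# Invented catalogue entries, for illustration only.
OFFERS = [
    MaterialOffer("gene:G1", "Example Repository", "UBMTA",
                  frozenset({NON_PROFIT})),
    MaterialOffer("gene:G1", "Example University", "SC-MTA (hypothetical label)",
                  frozenset({NON_PROFIT, FOR_PROFIT})),
    MaterialOffer("gene:G2", "Example Repository", "SLA",
                  frozenset({NON_PROFIT})),
]


def orderable(genes, recipient_kind, offers=OFFERS):
    """Return the offers for the given genes that this recipient may accept."""
    wanted = set(genes)
    return [
        offer for offer in offers
        if offer.gene in wanted and recipient_kind in offer.recipients
    ]


if __name__ == "__main__":
    for offer in orderable(["gene:G1"], FOR_PROFIT):
        print(offer.provider, "under", offer.agreement)
```

Because the agreement terms travel with the description of the material, the filtering step needs no case-by-case negotiation, which is what makes a “one-click” order plausible.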

 

1. This contribution was written following a Communia meeting in September 2007 in Turin, Italy.