Enabling Easier Collaboration on Open Data for AI and ML with CDLA-Permissive-2.0

  • Hannah
  • June 24, 2021

The Linux Foundation is pleased to announce the release of the CDLA-Permissive-2.0 license agreement, which is now available on the CDLA website at https://cdla.dev/permissive-2-0/. We believe that CDLA-Permissive-2.0 will meet a genuine need for a short, simple, and broadly permissive license agreement to enable wider sharing and usage of open data, particularly to bring clarity to the use of open data for artificial intelligence and machine learning models. 

We’re happy to announce that IBM and Microsoft are making data sets available today using CDLA-Permissive-2.0.

In this blog post, we’ll share some background about the original versions of the Community Data License Agreement (CDLA), why we worked with the community to develop the new CDLA-Permissive-2.0 agreement, and why we think it will benefit producers, users, and redistributors of open data sets.

Background: Why would you need an open data license agreement?

Licenses and license agreements are legal documents that define how content can be used, modified, and shared. They operate within the legal frameworks for copyrights, patents, and other rights that are established by laws and regulations around the world. These laws and regulations are not always clear and are not always in sync with one another.

Decades of practice have established a collection of open source software licenses and open content licenses that are widely used. These licenses typically work within the frameworks established by the laws and regulations mentioned above to permit broad use, modification, and sharing of software and other copyrightable content in exchange for following the license requirements.

Open data is different. Various laws and regulations treat data differently from software or other creative content. Depending on what the data is and which country’s laws you’re looking at, the data often may not be subject to copyright protection, or it might be subject to different laws specific to databases, e.g., sui generis database rights in the European Union.

Additionally, data may be consumed, transformed, and incorporated into Artificial Intelligence (AI) and Machine Learning (ML) models in ways that are different from how software and other creative content are used. Because of all of this, assumptions made in commonly used licenses for software and creative content might not apply in expected ways to open data.

Choice is often a good thing, but too many choices can be problematic. To be clear, there are other licenses in use today for open data use cases. In particular, licenses and instruments from Creative Commons (such as CC-BY-4.0 and CC0-1.0) are used to share data sets and creative content. It was also important in drafting the CDLA agreements to enable collaboration with similar licenses. The CDLA agreements are in no way meant as a criticism of those alternatives; rather, they focus on addressing newer concerns born out of AI and ML use cases. AI and ML models generated from open data are the primary use case organizations have struggled with, and the CDLA was designed to address those concerns. Our goal was to strike a balance between updated choices and too many options.

First steps: CDLA version 1.0

Several years ago, in talking with members of the Linux Foundation member counsel community, we began collaborating to develop a license agreement that would clearly enable use, modification, and sharing of open data, with a particular eye to AI and ML applications.

In October 2017, The Linux Foundation launched version 1.0 of the CDLA. The CDLA was intended to provide clear and explicit rights for recipients of data under CDLA to use, share and modify the data for any purpose. Importantly, it also explicitly permitted using the results from analyzed data to create AI and ML models, without any of the obligations that apply under the CDLA to sharing the data itself. It was launched with two initial types: a Permissive variant, with attribution-style obligations, and a Sharing variant, with a “copyleft”-style reciprocal commitment when resharing the raw data.

The CDLA-Permissive-1.0 agreement saw some amount of uptake and use. However, subsequent feedback revealed that some potential licensors and users of data under the CDLA-Permissive-1.0 agreement found it to be overly complex for non-lawyers to use. Many of its provisions were targeted at addressing specific and nuanced considerations for open data under various legal frameworks. While these considerations were worthwhile, we saw that communities may weigh that specificity and clarity against the value of a concise set of terms that lawyers and non-lawyers alike can easily comprehend.

Partly in response to this, in 2019, Microsoft launched the Open Use of Data Agreement (O-UDA-1.0) to provide a more concise and simplified set of terms around the sharing and use of data for similar purposes. Microsoft graciously contributed stewardship of the O-UDA-1.0 to the CDLA effort. Given the overlapping scope of the O-UDA-1.0 and the CDLA-Permissive-1.0, we saw an opportunity to converge on a new draft for a CDLA-Permissive-2.0. 

Moving to version 2.0: Simplifying, clarifying, and making it easier

Following conversations with various stakeholders and after a review and feedback period with the Linux Foundation Member Counsel community, we have prepared and released CDLA-Permissive-2.0. 

In response to perceptions of CDLA-Permissive-1.0 as overly complex, CDLA-Permissive-2.0 is short and uses plain language to express the grant of permissions and requirements. Like version 1.0, the version 2.0 agreement maintains the clear rights to use, share and modify the data, as well as to use without restriction any “Results” generated through computational analysis of the data.

Unlike version 1.0, the new CDLA-Permissive-2.0 is less than a page in length.

  • The only obligation it imposes when sharing data is to “make available the text of this agreement with the shared Data,” including the disclaimer of warranties and liability. 

In a sense, you might compare its general “character” to that of the simpler permissive open source licenses, such as the MIT or BSD-2-Clause licenses, albeit specific to data (and with even more limited obligations).
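As an illustration of how a data publisher might satisfy that single obligation in practice, here is a minimal Python sketch that bundles a data set into a zip archive together with the agreement text. It is an assumption-laden example and not legal guidance: the function name `package_dataset`, the `LICENSE.txt` file name, and the zip layout are all hypothetical choices, and the full agreement text must be supplied by the caller (for example, copied from https://cdla.dev/permissive-2-0/).

```python
import tempfile
import zipfile
from pathlib import Path


def package_dataset(data_dir: Path, agreement_text: str, out_zip: Path) -> Path:
    """Zip every file under data_dir and ship the agreement text alongside it.

    Hypothetical helper: the LICENSE.txt name and the archive layout are
    illustrative assumptions, not something the agreement prescribes.
    """
    # Collect the data files first, skipping the output archive itself and any
    # existing top-level LICENSE.txt so the supplied agreement text is used.
    files = [
        p
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
        and p.resolve() != out_zip.resolve()
        and p.relative_to(data_dir).as_posix() != "LICENSE.txt"
    ]
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, arcname=path.relative_to(data_dir).as_posix())
        # Make the agreement text available with the shared data.
        zf.writestr("LICENSE.txt", agreement_text)
    return out_zip


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data"
        data.mkdir()
        (data / "samples.csv").write_text("id,value\n1,2\n")
        # Placeholder only: substitute the complete agreement text here.
        archive = package_dataset(data, "<full agreement text>", root / "dataset.zip")
        with zipfile.ZipFile(archive) as zf:
            print(sorted(zf.namelist()))
```

Keeping the agreement inside the same archive as the data means the two travel together however the file is redistributed; a publisher could just as reasonably place the text next to the data in a repository or on a download page.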

One key point of feedback from users of the license and from lawyers at organizations involved in Open Data was the challenge of associating attribution information with data (or versions of data sets).

Although “attribution-style” provisions may be common in permissive open source software licenses, there was feedback that:

  • As data technologies continue to evolve beyond what the CDLA drafters might anticipate today, it is unclear whether typical ways of sharing attributions for open source software will fit well with open data sharing. 
  • Removing this as a mandated requirement was seen as preferable.

Recipients of Data under CDLA-Permissive-2.0 may still choose to provide attribution about the data sources. Attribution will often be important for appropriate norms in communities, and understanding where a data set originated is often a key aspect of why it has value. The CDLA-Permissive-2.0 simply does not make attribution a condition of sharing data.

CDLA-Permissive-2.0 also removes some of the more confusing terms that we’ve learned were simply unnecessary or not useful in the context of an open data collaboration. Removing these terms enables the CDLA-Permissive-2.0 to present its terms in a concise, easy-to-read format that we believe will be appreciated by data scientists, AI/ML users, lawyers, and users around the world for whom English is not a first language.

We hope and anticipate that open data communities will find it easy to adopt for releases of their own data sets.

Voices from the Community

“The open source licensing and collaboration model has made AI accessible to everyone, and formalized a two-way street for organizations to use and contribute to projects with others helping accelerate applied AI research. CDLA-Permissive-2.0 is a major milestone in achieving that type of success in the Data domain, providing an open source license specific to data that enables access, sharing and using data among individuals and organizations. The LF AI & Data community appreciates the clarity and simplicity CDLA-Permissive-2.0 provides.” Dr. Ibrahim Haddad, Executive Director of LF AI & Data 

“We appreciate the simplicity of the CDLA-Permissive-2.0, and we appreciate the community ensuring compatibility with Creative Commons licensed data sets.” Catherine Stihler, CEO of Creative Commons

“IBM has been at the forefront of innovation in open data sets for some time and as a founding member of the Community Data License Agreement. We have created a rich collection of open data sets on our Data Asset eXchange that will now utilize the new CDLAv2, including the recent addition of CodeNet – a 14-million-sample dataset to develop machine learning models that can help in programming tasks.” Ruchir Puri, IBM Fellow, Chief Scientist, IBM Research

“Sharing and collaborating with open data should be painless – and sharing agreements should be easy to understand and apply. We applaud the clear and understandable approach in the new CDLA-Permissive-2.0 agreement.” Jennifer Yokoyama, Vice President and Chief IP Counsel, Microsoft

“It’s exciting to see communities of legal and AI/ML experts come together to work on cross-organizational challenges to develop a framework to support data collaboration and sharing.” Nithya Ruff, Chair of the Board, The Linux Foundation and Executive Director, Open Source Program Office, Comcast

“Data is an essential component of how companies build their operations today, particularly around Open Data sets that are available for public use. At OpenUK, we welcome the CDLA-Permissive-2.0 license as a tool to make Open Data more available and more manageable over time, which will be key to addressing the challenges that organisations have coming up. This new approach will make it easier to collaborate around Open Data and we hope to use it in our upcoming work in this space.” Amanda Brock, CEO of OpenUK

“Verizon supports community efforts to develop clear and scalable solutions to legal issues around building artificial intelligence and machine learning, and we welcome the CDLA-Permissive-2.0 as a mechanism for data providers and software developers to work together in building new technology.” Meghna Sinha, VP – AI Center, Verizon

“Sony believes that the spread of clear and simple Open Data licenses like CDLA-2.0 activates Open Data ecosystem and contributes to innovation with AI. We support CDLA’s effort and hope CDLA will be used widely.” Hisashi Tamai, SVP, Sony Group Corporation

Data Sets Available under CDLA-Permissive-2.0

With today’s release of CDLA-Permissive-2.0, we are also pleased to announce several data sets that are now available under the new agreement. 

The IBM Center for Open Source Data and AI Technologies (CODAIT) will begin to re-license its public datasets hosted here using the CDLA-Permissive-2.0, starting with Project CodeNet, a large-scale dataset with 14 million code samples developed to drive algorithmic innovations in AI for code tasks like code translation, code similarity, code classification, and code search.

Microsoft Research is announcing that the following data sets are now being made available under CDLA-Permissive-2.0:

  • The Hippocorpus dataset, which comprises diary-like short stories about recalled and imagined events to help examine the cognitive processes of remembering and imagining and their traces in language;
  • The Public Perception of Artificial Intelligence data set, comprising analyses of text corpora over time to reveal trends in beliefs, interest, and sentiment about a topic;
  • The Xbox Avatars Descriptions data set, a corpus of descriptions of Xbox avatars created by actual gamers;
  • A Dual Word Embeddings data set, trained on Bing queries, to facilitate information retrieval about documents; and
  • A GPS Trajectory data set, containing 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours.

Next Steps and Resources

If you’re interested in learning more, please check out the following resources: