Can We Move Forward with the Open Source AI Definition?
December 04, 2024

Ben Cotton
Kusari

You can't observe what you can't see. It's why application developers and DevOps engineers collect performance metrics, and it's why developers and security engineers like open source software. The ability to inspect code increases trust in what the code does. With the increased prevalence of generative AI, there's a desire to have the same ability to inspect the AI models. Most generative AI models are black boxes, so some vendors are using the term "open source" to set their offerings apart.

But what does "open source AI" mean? There's no generally-accepted definition.

The Open Source Initiative (OSI) defines what "open source" means for software. Its Open Source Definition (OSD) is a broadly-accepted set of criteria for what makes a software license open source. New licenses are reviewed in public by a panel of experts to evaluate their compliance with the OSD. OSI focuses on software, leaving "open" in other domains to other bodies, but since AI models are, at the simplest level, software plus configuration and data, OSI is a natural home for a definition of open source AI.


What's Wrong with the OSAID?

The OSI attempted to do just that. The Open Source AI Definition (OSAID), released at the end of October, represents a collaborative attempt to craft a set of rules for what makes an AI model "open source." But while the OSD is generally accepted as an appropriate definition of "open source" for software, the OSAID has received mixed reviews.

The crux of the criticism is that the OSAID does not require that the training data be available, only "data information." Version 1.0 and its accompanying FAQ require that training data be made available where possible, but still permit providing only a description when the data is "unshareable." OSI's argument is that the laws covering data are more complex, and vary more by jurisdiction, than the laws governing copyrightable works like software. There's merit to this, of course. The data used to train AI models includes copyrightable works like blog posts, paintings, and books, but it can also include sensitive and protected information like medical histories and other personal data. Model vendors that train on sensitive data couldn't legally share their training data, OSI argues, so a definition that requires it is pointless.

I appreciate the merits of that argument, but I — and others — don't find it compelling enough to craft the definition of "open source" around it, especially since model vendors can find plausible reasons to claim they can't share training data. The OSD is not defined by what is convenient; it is defined by what protects certain rights for the consumers of software. The same should be true for a definition of open source AI. If some models cannot meet the definition, then those models are not open source; the definition should not be loosened for their convenience. If no models could possibly meet a definition of open, that would be one thing. But many existing models do, and more could if their developers chose.

Any Definition Is a Starting Point

Despite the criticisms of the OSI's definition, a flawed definition is better than no definition. Companies use AI in many ways: from screening job applicants to writing code to creating social media images to powering customer service chatbots. Any of these uses poses the risk of reputational, financial, and legal harm. Companies that use AI need to know exactly what they're getting — and not getting. A definition of "open source AI" doesn't eliminate the need to carefully examine an AI model, but it does give a starting point.

The current OSD has evolved over the last two decades; it is currently on version 1.9. It stands to reason that the OSAID will evolve as people use it to evaluate real-world AI models. The criticisms of the initial version may inform future changes that result in a more broadly-accepted definition. In the meantime, other organizations have announced their own efforts to address deficiencies in the OSAID. The Digital Public Goods Alliance — a UN-endorsed initiative — will continue to require published training data in order to grant Digital Public Good status to AI systems.

It is also possible that we'll change how we speak about openness. Just as OSD-noncompliant movements like Ethical Source have introduced a new vocabulary, open source AI may force us to recognize that openness is a spectrum on several axes, not simply a binary attribute.

Ben Cotton is Head of Community at Kusari
