Can We Move Forward with the Open Source AI Definition?
December 04, 2024

Ben Cotton
Kusari

You can't understand what you can't observe. That's why application developers and DevOps engineers collect performance metrics, and it's why developers and security engineers like open source software. The ability to inspect code increases trust in what the code does. With the increased prevalence of generative AI, there's a desire to have the same ability to inspect AI models. Most generative AI models are black boxes, so some vendors are using the term "open source" to set their offerings apart.

But what does "open source AI" mean? There's no generally-accepted definition.

The Open Source Initiative (OSI) defines what "open source" means for software. Its Open Source Definition (OSD) is a broadly accepted set of criteria for determining whether a software license is open source. New licenses are reviewed in public by a panel of experts to evaluate their compliance with the OSD. The OSI focuses on software, leaving "open" in other domains to other bodies, but since AI models are, at the simplest level, software plus configuration and data, the OSI is a natural home for a definition of open source AI.


What's Wrong with the OSAID?

The OSI attempted to do just that. The Open Source AI Definition (OSAID), released at the end of October 2024, represents a collaborative attempt to craft a set of rules for what makes an AI model "open source." But while the OSD is generally accepted as an appropriate definition of "open source" for software, the OSAID has received mixed reviews.

The crux of the criticism is that the OSAID does not require that the training data be available, only "data information." Version 1.0 and its accompanying FAQ require that training data be made available when possible, but still permit providing only a description when the data is "unshareable." OSI's argument is that the laws covering data are more complex, and vary more by jurisdiction, than the laws governing copyrightable works like software. There's merit to this, of course. The data used to train AI models includes copyrightable works like blog posts, paintings, and books, but it can also include sensitive and protected information like medical histories and other personal data. Model vendors that train on sensitive data couldn't legally share their training data, OSI argues, so a definition that requires sharing is pointless.

I appreciate the merits of that argument, but, like others, I don't find it compelling enough to craft the definition of "open source" around it, especially since model vendors can always find plausible reasons to claim they can't share training data. The OSD is not defined by what is convenient; it is defined by what protects certain rights for the consumers of software. The same should be true of a definition of open source AI. The fact that some models cannot meet the definition should mean that those models are not open source; it should not mean that the definition is changed to be more convenient. If no models could possibly meet a definition of open, that would be one thing. But many existing models do, and more could if their developers chose.

Any Definition Is a Starting Point

Despite the criticisms of the OSI's definition, a flawed definition is better than no definition. Companies use AI in many ways: from screening job applicants to writing code to creating social media images to powering customer service chatbots. Any of these uses poses a risk of reputational, financial, and legal harm. Companies that use AI need to know exactly what they're getting, and what they're not getting. A definition of "open source AI" doesn't eliminate the need to carefully examine an AI model, but it does give a starting point.

The OSD itself has evolved over the last two decades; it is currently on version 1.9. It stands to reason that the OSAID will likewise evolve as people use it to evaluate real-world AI models. The criticisms of the initial version may inform future changes that result in a more broadly accepted definition. In the meantime, other organizations have announced their own efforts to address deficiencies in the OSAID. The Digital Public Goods Alliance, a UN-endorsed initiative, will continue to require published training data in order to grant Digital Public Good status to AI systems.

It is also possible that we'll change how we speak about openness. Just as OSD-noncompliant movements like Ethical Source have introduced new vocabulary, open source AI may force us to recognize that openness is a spectrum on several axes, not simply a binary attribute.
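To make that idea concrete, here is a minimal sketch of what a multi-axis openness assessment might look like. The axes and level names below are my own illustrative assumptions; they do not come from the OSAID or any published framework.

```python
from dataclasses import dataclass

# Hypothetical openness levels, from fully closed to fully open.
# These names and axes are illustrative assumptions, not part of
# the OSAID or any published framework.
LEVELS = ("closed", "described", "restricted", "open")


@dataclass
class ModelOpenness:
    """Rates one AI model along several independent axes."""

    code: str           # training/inference source code
    weights: str        # the trained model parameters
    training_data: str  # the data the model was trained on
    documentation: str  # papers, model cards, "data information"

    def summary(self) -> str:
        return ", ".join(f"{axis}={level}" for axis, level in vars(self).items())


# A model can be "open" on some axes and not others -- which is
# exactly why a single binary label is a poor fit.
example = ModelOpenness(
    code="open",
    weights="open",
    training_data="described",  # only "data information" is published
    documentation="open",
)
print(example.summary())
```

The point of the sketch is only that "is it open?" becomes several smaller questions, one per artifact, each with its own answer.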

Ben Cotton is Head of Community at Kusari