In today's business ecosystem, obtaining data is rarely the problem. We generate and share data daily as consumers and business leaders. But in healthcare, where data is sensitive, private, and highly regulated, access to quality, representative datasets remains a challenge.
This is where synthetic data is emerging — not as a perfect solution but as a vital tool in the AI development toolbox.
The Challenge: Limited Access to Real-World Data
AI systems require vast amounts of data to learn, adapt, and make accurate predictions. However, in industries such as healthcare, finance, and insurance, data is subject to additional protections (consider regulations like HIPAA, GDPR, and CCPA) that limit what can be used to train and inform models. The challenge lies in responsibly using and applying the large data pools organizations already have at their disposal.
The Promising, Yet Unclear Solution: Synthetic Data
Synthetic data — artificially generated datasets that mirror real-world patterns — is one answer to limited data access, particularly in sensitive industries. Generated using algorithms, machine learning models, and business rules, synthetic data aims to replicate the statistical properties of real-world data without containing actual patient or customer information. This allows AI models to be trained without exposing private or proprietary data.
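To make that idea concrete, here is a minimal, hypothetical sketch of the statistical-replication approach: it fits simple per-column distributions to a made-up "real" table and samples new rows from them. The column names, the toy data, and the use of independent per-column sampling are illustrative assumptions; production generators model correlations between fields and are far more sophisticated.

```python
# Toy sketch only: real synthetic-data generators (GAN-, copula-, or
# rule-based) are far more sophisticated. The column names and the
# "real_df" table are hypothetical stand-ins for a private dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

def generate_synthetic(real_df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Sample new rows that mimic the per-column statistics of real_df.

    Columns are sampled independently, so cross-column correlations in
    the original data are deliberately (and unrealistically) ignored.
    """
    synthetic = {}
    for col in real_df.columns:
        series = real_df[col]
        if pd.api.types.is_numeric_dtype(series):
            # Numeric columns: draw from a normal distribution fit to the column.
            synthetic[col] = rng.normal(series.mean(), series.std(), n_rows)
        else:
            # Categorical columns: resample according to observed frequencies.
            freqs = series.value_counts(normalize=True)
            synthetic[col] = rng.choice(freqs.index.to_numpy(), size=n_rows,
                                        p=freqs.to_numpy())
    return pd.DataFrame(synthetic)

# Hypothetical "real" records that never leave the secure environment.
real_df = pd.DataFrame({
    "age": rng.integers(20, 80, size=500),
    "blood_pressure": rng.normal(120, 15, size=500),
    "diagnosis": rng.choice(["A", "B", "C"], size=500, p=[0.6, 0.3, 0.1]),
})

synthetic_df = generate_synthetic(real_df, n_rows=1000)
print(synthetic_df.describe(include="all"))
```

Even this toy version shows the core trade-off: the synthetic rows carry the broad statistics of the original without copying any individual record, but anything the generator fails to model — correlations, rare cases, outliers — is silently lost.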
However, it's important to point out that synthetic data exists in a legal and ethical gray area. There is no universal benchmark or definition for what qualifies as synthetic data versus data modification versus de-identified data.
The success of synthetic data depends on its ability to reliably reflect real-world scenarios without introducing errors or biases. Done well, it gives models a steady supply of scenarios from which to learn clear, well-defined outcomes. Training strategies with synthetic data include:
1. Benchmarking Against Real Data: AI models trained on synthetic data should be tested against real-world datasets to ensure they capture genuine patterns and behaviors (see the sketch after this list).
2. Probabilistic vs. Deterministic Testing: AI needs to recognize minority cases, not just the dominant trends. For example, if 9 out of 10 customer service agents perform well and one does poorly, an AI trained solely on synthetic data may assume all agents perform at the same level — missing critical performance gaps.
3. Diversity and Bias Checks: Synthetic data can accidentally reinforce biases if not carefully curated. If a dataset overrepresents certain demographics or behaviors, it may skew AI predictions in unintended ways.
4. Human-in-the-Loop Oversight: AI should not be left to validate itself. Human experts need to assess whether synthetic data aligns with real-world expectations and correct deviations.
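As a hedged illustration of points 1 and 3 above, the sketch below trains a model on the synthetic table from the earlier snippet, scores it against a held-out slice of the "real" data, and then compares category frequencies to spot over- or under-representation. The scikit-learn usage is standard, but the features, target column, and data are hypothetical.

```python
# Minimal sketch of points 1 and 3. Assumes scikit-learn plus the
# hypothetical real_df / synthetic_df tables from the earlier snippet;
# "diagnosis" is treated as the prediction target purely for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

features = ["age", "blood_pressure"]
target = "diagnosis"

# Real data is held out for evaluation only; it is never used to train
# the synthetic-data model.
real_train, real_test = train_test_split(real_df, test_size=0.3, random_state=0)

# One model trained on synthetic rows, one on real rows for comparison.
model_synth = RandomForestClassifier(random_state=0).fit(
    synthetic_df[features], synthetic_df[target])
model_real = RandomForestClassifier(random_state=0).fit(
    real_train[features], real_train[target])

# Point 1: score both models against the same real-world holdout.
for name, model in [("synthetic-trained", model_synth),
                    ("real-trained", model_real)]:
    acc = accuracy_score(real_test[target], model.predict(real_test[features]))
    print(f"{name} accuracy on real holdout: {acc:.3f}")

# Point 3: compare category frequencies to spot over- or
# under-representation of any group in the synthetic data.
print(pd.concat(
    {"real": real_df[target].value_counts(normalize=True),
     "synthetic": synthetic_df[target].value_counts(normalize=True)},
    axis=1,
))
```

A large gap between the synthetic-trained and real-trained scores, or a visible skew in the frequency table, is exactly the kind of signal that human reviewers (point 4) should investigate before the model goes anywhere near production.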
The goal should be to develop AI that reliably recognizes meaningful trends — similar to the difference between memorizing school textbook content and genuinely understanding the subject.
Synthetic data is just one of many methods for addressing data scarcity while preserving privacy. Other approaches include data redaction, retrieval-augmented generation (RAG), and federated learning. Each approach has trade-offs. Synthetic data offers a promising balance of privacy and utility, but its reliability depends on the robustness of its generation methods.
The Bottom Line: Synthetic Data as a Toolbox Essential, Not a Silver Bullet
Synthetic data is not the ultimate solution to AI's data challenges, but it is an indispensable tool in the broader AI ecosystem. It allows for AI training without exposing sensitive data, helps mitigate privacy concerns, and enables innovation in fields where data access is restricted or limited.
However, organizations must remain cautious. Without proper validation, synthetic data can introduce biases, reinforce misinformation, and contribute to AI hallucinations. The key is to approach synthetic data as a complementary resource — not a silver-bullet solution for training AI models.
As AI evolves, so will the application and definition of synthetic data. In the meantime, synthetic data remains a promising, yet enigmatic asset in the pursuit of responsible and effective AI development.
Industry News
GitLab announced the general availability of GitLab Duo with Amazon Q.
Perforce Software and Liquibase announced a strategic partnership to enhance secure and compliant database change management for DevOps teams.
Spacelift announced the launch of Saturnhead AI — an enterprise-grade AI assistant that slashes DevOps troubleshooting time by transforming complex infrastructure logs into clear, actionable explanations.
CodeSecure and FOSSA announced a strategic partnership and native product integration that enables organizations to eliminate security blindspots associated with both third party and open source code.
Bauplan, a Python-first serverless data platform that transforms complex infrastructure processes into a few lines of code over data lakes, announced its launch with $7.5 million in seed funding.
Perforce Software announced the launch of the Kafka Service Bundle, a new offering that provides enterprises with managed open source Apache Kafka at a fraction of the cost of traditional managed providers.
LambdaTest announced the launch of the HyperExecute MCP Server, an enhancement to its AI-native test orchestration platform, HyperExecute.
Cloudflare announced Workers VPC and Workers VPC Private Link, new solutions that enable developers to build secure, global cross-cloud applications on Cloudflare Workers.
Nutrient announced a significant expansion of its cloud-based services, as well as a series of updates to its SDK products, aimed at enhancing the developer experience by allowing developers to build, scale, and innovate with less friction.
Check Point® Software Technologies Ltd. announced that its Infinity Platform has been named the top-ranked AI-powered cyber security platform in the 2025 Miercom Assessment.
Orca Security announced the Orca Bitbucket App, a seamless cloud-native integration for scanning Bitbucket repositories.
The Live API for Gemini models is now in Preview, enabling developers to start building and testing more robust, scalable applications with significantly higher rate limits.
Backslash Security announced significant adoption of the Backslash App Graph, the industry’s first dynamic digital twin for application code.
SmartBear launched API Hub for Test, a new capability within the company’s API Hub, powered by Swagger.
Akamai Technologies introduced App & API Protector Hybrid.