In a recent article published by the Communications of the ACM — the flagship publication of the Association for Computing Machinery — OpenMined’s Executive Director, Andrew Trask, was featured as a key voice in the growing conversation around synthetic data, AI training, and the critical importance of controlling how data shapes model behavior.
The Growing Role of Synthetic Data in AI
The article, titled “AI Goes Synthetic to Get Real,” explores how synthetic data — data created by humans or algorithms to simulate real-world information — is rapidly becoming a cornerstone of AI development. With high-quality human-generated data increasingly scarce, AI developers are turning to synthetic datasets to train large language models across fields including finance, medicine, criminal justice, and engineering.
While synthetic data offers significant benefits, such as enabling organizations to build more equitable and resilient AI models without navigating privacy constraints, the article highlights a crucial concern: the risk of data manipulation and degraded model quality. As synthetic and real data increasingly blend together, subtle errors can compound into a process researchers describe as “model collapse.”
Who Controls the Data Controls the Model
The article presents Andrew’s perspective on the value of AI training data. As Trask explains in the piece:
“Whoever controls an AI’s training data gets to decide how that model will behave.”
This insight underscores a central challenge in AI development: without proper governance and transparency mechanisms, training data can be manipulated, whether inadvertently or intentionally, to produce deceptive or biased results. Andrew’s remarks highlight the need for technical infrastructure that gives stakeholders meaningful control over how data influences AI systems.
Attribution-Based Control: A Path Forward
The article also spotlights OpenMined’s work on Attribution-Based Control as a promising remedy for these challenges. As described in the piece, Attribution-Based Control uses cryptographic and deep learning techniques to allow AI users to choose which sources influence each prediction or model, while also enabling data owners to decide how their data will be used.
A secondary benefit of this approach is improved management of hallucinations, which remains a persistent challenge in large language models. As Andrew notes, if you can choose the sources that inform a model’s outputs, you can also determine whether those sources are appropriate for the task at hand.
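The core idea described above, that users select which sources may inform an output while data owners set the terms of use, can be illustrated with a small sketch. This is purely hypothetical pseudocode for the concept, not OpenMined’s Attribution-Based Control implementation: every name and structure here is invented for illustration.

```python
# Illustrative sketch only: a toy "source gating" check, NOT OpenMined's
# Attribution-Based Control. All names and structures here are invented.
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    owner_allows: set  # tasks the data owner has permitted, e.g. {"medical"}


@dataclass
class GatedRetriever:
    sources: list

    def eligible(self, task: str, user_selection: set) -> list:
        """Return only sources that (a) the user chose for this query and
        (b) whose owners permit use for this task."""
        return [
            s for s in self.sources
            if s.name in user_selection and task in s.owner_allows
        ]


sources = [
    Source("clinical_notes", {"medical"}),
    Source("market_data", {"finance"}),
    Source("web_scrape", {"finance", "medical"}),
]
retriever = GatedRetriever(sources)

# The user selects two sources, but only one is permitted for a medical task.
chosen = retriever.eligible("medical", {"clinical_notes", "market_data"})
print([s.name for s in chosen])  # → ['clinical_notes']
```

In a real system the gating would operate on a model’s predictions via cryptographic and deep learning techniques, as the article describes; the sketch only shows the two-sided control structure, where both the user’s selection and the owner’s permissions must agree before a source can influence an output.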
Why This Matters
This recognition in Communications of the ACM places OpenMined’s work at the center of one of AI’s most pressing challenges: ensuring that the data powering AI systems is governed transparently and responsibly. As synthetic data grows from roughly 60% of all AI training data in 2024 to potentially surpassing real data by 2030, the need for tools like Attribution-Based Control will only intensify.
OpenMined remains committed to building the technical infrastructure that ensures data governance serves the public interest so that the future of AI is shaped by accountability, not opacity.
Read the full article on the Communications of the ACM website.
