Gen AI Needs Synthetic Data. We Need to Be Able to Trust It
Today's generative AI models, like the ones behind ChatGPT and Gemini, are trained on reams of real-world data, but even all the content on the internet is not enough to prepare a model for every possible situation.
To continue to grow, these models need to be trained on simulated, or synthetic, data: scenarios that are plausible but not real. AI developers need to do this responsibly, experts said on a panel at South by Southwest, or things could quickly go wrong.
Synthetic data has drawn new attention since the launch of DeepSeek AI, a new model produced in China that was trained using more synthetic data than other models, saving money and processing power.
But experts say it's about more than saving on data collection and processing. Synthetic data, often generated by AI itself, can teach a model about scenarios that don't exist in the real-world information it has been given but that it could face in the future. A one-in-a-million possibility doesn't have to come as a surprise to an AI model if it has seen a simulation of it.
"With simulated data, you can get rid of the idea of edge cases, assuming you can trust it," said Oji Udezue, who has led product teams at Twitter, Atlassian, Microsoft and other companies. He and the other panelists spoke Sunday at the SXSW conference in Austin, Texas. "In theory, as long as we can trust it, we can build a product that works for 8 billion people."
The hard part is making sure you can trust it.
The problem with simulated data
Simulated data has plenty of benefits. For one, it costs less to produce. You can crash-test thousands of simulated cars using software, but to get the same results in real life, you have to actually smash cars, which costs a lot of money.
If you're training a self-driving car, you'd want to capture some less common situations a vehicle might encounter, even if they aren't in the training data, said Tahir Ekin, a professor of business analytics at Texas State University. He used the example of the bats that make spectacular emergences from Austin's Congress Avenue Bridge. That may not show up in training data, but a self-driving car will need some sense of how to respond to a swarm of bats.
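To make the idea concrete, here is a minimal Python sketch of how a training pipeline might blend synthetic edge cases, like Ekin's bat swarm, into real driving logs. The event names, fields and the synthesize_rare_scenarios function are hypothetical illustrations, not any actual self-driving stack.

```python
import random

# Hypothetical sketch: blend synthetic rare events (like a swarm of bats)
# into a scenario-based training set. Event names, fields and ranges are
# illustrative assumptions, not a real pipeline.

REAL_SCENARIOS = [
    {"event": "pedestrian_crossing", "source": "real"},
    {"event": "lane_merge", "source": "real"},
]

RARE_EVENTS = ["bat_swarm", "deer_on_highway", "sudden_hailstorm"]

def synthesize_rare_scenarios(n: int) -> list[dict]:
    """Generate n synthetic edge-case scenarios with randomized conditions."""
    return [
        {
            "event": random.choice(RARE_EVENTS),
            "visibility_m": round(random.uniform(5, 50), 1),
            "speed_kph": round(random.uniform(20, 100), 1),
            "source": "synthetic",  # provenance label, so the mix is auditable
        }
        for _ in range(n)
    ]

# A share of synthetic edge cases joins the real logs, with provenance kept.
training_set = REAL_SCENARIOS + synthesize_rare_scenarios(100)
synthetic_count = sum(s["source"] == "synthetic" for s in training_set)
print(len(training_set), "scenarios,", synthetic_count, "synthetic")
```

Keeping the source label on every record is one simple way to audit how much of a training set was simulated, which becomes relevant to the transparency questions discussed later in this piece.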
The risk lies in how a machine trained on synthetic data responds to real-world changes. The system can't exist in an alternate reality, Ekin said, or it becomes less useful, even dangerous. "How would you feel," he asked, "getting into a self-driving car that had been trained only on simulated data?" Any system that uses simulated data needs to be "grounded in the real world," he said, including feedback on how its simulated reasoning aligns with what is actually happening.
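Ekin's call for grounding suggests a simple feedback check: compare what a model predicted in simulation against what actually happened, and flag the scenarios that diverge. The sketch below assumes a hypothetical record format and tolerance; it illustrates the principle rather than any real validation system.

```python
# Hypothetical grounding check: flag scenarios where simulated reasoning
# diverges from real-world outcomes. Record fields and the tolerance value
# are illustrative assumptions.

def grounding_report(records: list[dict], tolerance: float = 0.1) -> list[dict]:
    """Return scenarios whose simulated outcome diverges from reality."""
    flagged = []
    for rec in records:
        error = abs(rec["simulated_outcome"] - rec["real_outcome"])
        if error > tolerance:
            flagged.append({"scenario": rec["scenario"], "error": round(error, 3)})
    return flagged

logs = [
    {"scenario": "bat_swarm", "simulated_outcome": 0.9, "real_outcome": 0.4},
    {"scenario": "lane_merge", "simulated_outcome": 0.8, "real_outcome": 0.78},
]
print(grounding_report(logs))  # [{'scenario': 'bat_swarm', 'error': 0.5}]
```

Scenarios that fail the check are exactly the ones where the simulation needs to be retuned against real observations, which is the feedback loop Ekin describes.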
Udezue compared the problem to the creation of social media, which began as a way to scale communication around the world, a goal it achieved. But social media has also been misused, he said, noting: "Now, authoritarians use it to control people, and people use it to tell jokes at the same time."
As AI tools grow in scale and popularity, a scenario that synthetic training data makes easier, the potential real-world impact of untrustworthy training, and of models drifting away from reality, becomes more significant. "The burden is on us builders, scientists, to be double, triple sure it is reliable," Udezue said. "It's not a fantasy."
How to keep simulated data in check
One way to ensure models are trustworthy is to make their training transparent, so users can choose which model to use based on how they evaluate that information. The panelists repeatedly used the analogy of a nutrition label, which is easy for users to understand.
Some transparency exists, such as the model cards available through the developer platform Hugging Face, which break down the details of the different systems. Mike Hollinger, director of product management for enterprise generative AI at chipmaker Nvidia, said that information needs to be as clear and transparent as possible. "Those types of things have to be in place," he said.
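As a rough illustration of the nutrition-label analogy, a machine-readable "label" might record where training data came from and how much of it was synthetic. The schema below is hypothetical; actual model cards on Hugging Face are free-form documents, not this structure.

```python
from dataclasses import dataclass, asdict

# Hypothetical "nutrition label" for a model, echoing the panel's analogy.
# All fields and values are illustrative assumptions.

@dataclass
class ModelCard:
    name: str
    real_data_sources: list[str]
    synthetic_data_fraction: float  # share of training data that was generated
    synthetic_data_generator: str   # which model produced the synthetic portion
    known_limitations: list[str]

card = ModelCard(
    name="example-model-7b",
    real_data_sources=["web crawl", "licensed corpora"],
    synthetic_data_fraction=0.35,
    synthetic_data_generator="example-teacher-model",
    known_limitations=["edge cases simulated, not observed"],
)
print(asdict(card))
```

A structured label like this would let a user compare models on synthetic-data provenance the same way a shopper compares sugar content, which is the point of the analogy.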
Hollinger said that ultimately it will be not just AI developers but also AI users who define the industry's best practices.
Udezue said the industry also needs to keep ethics and risks in mind. "Synthetic data will make a lot of things easier to do," he said. "It will bring down the cost of building things. But some of those things will change society."
Observability, transparency and trust must be built into models to ensure their reliability, Udezue said. That includes updating the training models so that they reflect accurate data and don't magnify the errors in synthetic data. One concern is model collapse: when an AI model is trained on data produced by other AI models, it drifts further and further from reality, until it becomes useless.
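Model collapse can be shown with a toy experiment: fit a simple distribution to a small dataset, sample the next "training set" entirely from the fit, then refit and repeat. This Python sketch is a statistical caricature of the dynamic, not how production models are trained.

```python
import random
import statistics

# Toy illustration of model collapse: each generation is fit only to the
# previous generation's synthetic output. The estimated spread fluctuates
# but drifts toward zero over many generations, moving ever further from
# the original real-world distribution (mean 0, std 1).

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(10)]  # small "real" dataset

for generation in range(1, 51):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    data = [random.gauss(mu, sigma) for _ in range(10)]  # purely synthetic
    if generation % 10 == 0:
        print(f"generation {generation:2d}: std = {sigma:.3f}")
```

The small sample size exaggerates the effect for demonstration purposes; with real models the same feedback loop plays out through lost diversity in generated text and images rather than a single shrinking parameter.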
"The more you shy away from capturing the diversity of the real world, the more unhealthy the responses may be," Udezue said. The solution is error correction, he said. "If you combine the ideas of trust, transparency and error correction, these problems don't feel inevitable."
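In the spirit of that error correction, one common mitigation is to anchor each training generation with a share of real data rather than training purely on synthetic output. Continuing the toy example above, with the same hypothetical setup:

```python
import random
import statistics

# Same toy setup as above, but each generation keeps half of the original
# real samples as an anchor. The real half keeps the fitted spread from
# drifting toward zero: a crude form of error correction.

random.seed(42)
real = [random.gauss(0.0, 1.0) for _ in range(10)]
data = list(real)

for generation in range(1, 51):
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    synthetic = [random.gauss(mu, sigma) for _ in range(5)]
    data = real[:5] + synthetic  # half real, half synthetic each generation
    if generation % 10 == 0:
        print(f"generation {generation:2d}: std = {sigma:.3f}")
```

The fixed real samples act like the real-world grounding Ekin called for earlier: the model can never drift arbitrarily far, because part of every generation's diet is observation rather than imitation.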