SyntheticData

ThoughtStorms Wiki

(ReadWith) DataPollution

I'm increasingly concerned about synthetic data of the type discussed in this video:

https://www.youtube.com/watch?v=7G7noECab4A

My comment:

I think you've missed a big issue if Sora is trained on synthetic game engine content.

Computer graphics engines are NOT accurate simulations of real world physics. Instead they use a bunch of heuristics that produce content that is close enough that WE DON'T NOTICE THE DIFFERENCE, while economizing on the computing power needed to run a detailed simulation.

In other words, "game engine physics" is inaccurate in a way which is capable of fooling us into thinking it's accurate when we see it. It's "fake news" by design.
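A minimal sketch of the kind of shortcut meant here. It assumes a fixed-step explicit Euler update, one step per rendered frame, which is a common textbook simplification and not a claim about how any particular engine works. The names `fall_exact` and `fall_euler` are made up for illustration. It compares the distance a dropped object falls in one second against the closed-form answer.

```python
G = 9.81  # gravitational acceleration in m/s^2


def fall_exact(t):
    """Closed-form distance fallen from rest after t seconds (no drag)."""
    return 0.5 * G * t * t


def fall_euler(t, fps=30):
    """Distance fallen using a fixed-step explicit Euler update, one step per frame."""
    dt = 1.0 / fps
    steps = round(t * fps)
    velocity = 0.0
    distance = 0.0
    for _ in range(steps):
        distance += velocity * dt  # position is advanced with the OLD velocity
        velocity += G * dt         # then velocity is advanced
    return distance


if __name__ == "__main__":
    exact = fall_exact(1.0)
    for fps in (30, 60, 240):
        approx = fall_euler(1.0, fps)
        print(f"{fps:>3} fps: exact={exact:.3f} m, "
              f"euler={approx:.3f} m, error={exact - approx:.3f} m")
```

At 30 frames per second the stepped version ends up roughly 16 cm short after one second of falling. That is hard to spot on screen, but it is a systematic error that anything trained on the output would inherit.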

So I think it's very plausible that text-to-video AIs are building some kind of internal approximation of a "physics model". But if they are approximating a physics model which is, itself, inaccurate in ways we don't notice, it will be a very bad model to start relying on when we want good models of the world.

You wouldn't dream of training real-world car driving algorithms using GTA, because those AI-driven cars are going to do a bunch of things that are wholly unsuitable in real cities full of actual people you don't want to hurt. That's kind of a no-brainer.

But the same problem will turn up in subtler ways, as companies see opportunities to, say, grab an off-the-shelf open-source model to generate synthetic data, which then ends up in another product, which is used to create synthetic data for another AI which ... at some point does influence robot behaviour. It's going to be really hard to keep track of when some model or synthetic data-set has been poisoned somewhere upstream by heuristic hacks that we can't see (because they were designed to look OK to us when we glance at them) but that bear no resemblance to how real objects move and collide in the physical world.

Had some push-back.

Yea, but you are assuming OpenAI is using the "junk" physics for training everything. I'd think they are smart enough to train their best models with the best physics possible. Sure, things won't be perfect. But I think you are being a bit melodramatic.

Here was my reply.

The point is I'm hearing a lot of people explicitly talking about the fact that the text2video models have developed a viable physics understanding. At the same time, many people (including Wes in this video) are saying that the text2video generators are probably trained on synthetic data from game engines, and he's implying this is a good thing.

What I'm saying is that these can't both be true at the same time. Either we're training video generators on synthetic data from game engines like Unity etc. OR our models are learning a viable physics.

But a physics model derived from a game engine is NOT a good physical model. It's an approximation of physics that's optimised for fooling the human visual system.

What I would hope is that people who are currently praising text2video generators for having internalized a physics model, and who are enthusiastic about synthetic data, should ALSO be pointing out the contradictions and dangers in this.

I'm sure OpenAI, as the flagship AI company, who are very responsible, know this, and aren't trying to pass off a model trained on data from Unity as a serious physics model. But you can bet your bottom dollar that someone will probably do that sooner or later.

And I'd say that right now, in this phase when everyone's hyped about both text2video and synthetic data for training it, is exactly the right time for the leaders in the community to be alerting everyone to this potential issue, and planting the seeds of awareness in people's minds.

Think of it this way. NVIDIA are training real-world robots in simulation and making amazing breakthroughs. I'm sure their physics simulation is a good one. I'm sure, if they are training robots with vision, they are also giving the simulated robots a good visual rendering of that world.

But you just know that someone is going to want to try to do this cheaper, and will start simulation training for real robots in a world made with off-the-shelf games technology. It's almost too tempting not to.

But those robots will be potentially very dangerous if let out into the real world. People should be talking about that now.