A synthetic solution for AI as chatbots run out of road
2024-11-29T09:00:00+11:00
A leading UNSW computer scientist says a touted solution to a big problem for generative AI is better suited for other forms of artificial intelligence.
AI chatbots, like ChatGPT and Google Gemini, are running out of data to eat.
Generative AI models are running short of data that they’re legally allowed to process and that is of high enough quality to improve their function.
Chatbots may only have until 2032 before the good data runs out.
Even the low-quality data (taken from sources that aren’t that reliable, for example webpages instead of published books) is expected to run out, at the very most, a couple of decades after that.
AI companies are looking at what else they can use to keep from stalling in the fiercely competitive race to provide the best artificial assistant.
Industry leaders are pointing to ‘synthetic data’ as a potential solution.
Synthetic data can mean two things. One is data generated by AI, based on real-world information. Give a chatbot a spreadsheet filled with numbers and ask it to make another one just like it with different numbers. That’s synthetic data.
It can also mean information that may have been edited or manipulated by humans, but more on that later.
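To make that first sense concrete, here is a rough sketch in code: fit some simple statistics to a real table of numbers, then sample a new table that resembles it. The column names, values and the normal-distribution assumption are purely illustrative, a crude stand-in for what a generative model would learn.

```python
import numpy as np

# A small "real" table: two numeric columns (illustrative values only).
real = {
    "age":    np.array([34, 45, 29, 52, 41, 38, 60, 27]),
    "income": np.array([52_000, 61_000, 48_000, 75_000, 58_000, 55_000, 80_000, 45_000]),
}

def make_synthetic(real_cols, n_rows, seed=0):
    """Sample a new table whose columns share each real column's mean and spread.

    Treating every column as an independent normal distribution is a
    deliberately crude stand-in for what a generative model would learn.
    """
    rng = np.random.default_rng(seed)
    synthetic = {}
    for name, values in real_cols.items():
        synthetic[name] = rng.normal(values.mean(), values.std(), size=n_rows)
    return synthetic

fake = make_synthetic(real, n_rows=8)
print({name: np.round(column, 1) for name, column in fake.items()})
```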
The CEO of ChatGPT creator OpenAI, Sam Altman, says chatbots will one day be smart enough to train themselves purely on synthetic data.
“As long as you can get over the event horizon where that model is smart enough to make good synthetic data, I think it should be all right,” he has said.
UNSW Computer Science Professor Claude Sammut doesn’t agree.
“If all they’re doing is feeding on themselves, they’re not really producing anything that new. Unless there’s significantly different mechanisms [for training] that we don’t know about yet, I don’t think it’s going to keep scaling like that.”
Patterns and logic
The way AI models learn is by looking at many pieces of content that real people have labelled as certain things. So, for an AI ‘vision system’ like a self-driving car to learn what a traffic cone is, someone has to manually file many pictures under the label ‘traffic cone’ and then feed them to the program.
A generative AI model, like ChatGPT, is different. Think of it as a very sophisticated version of the predictive text on your phone. Based on all the texts it’s seen before, it learns to predict what follows your prompt. But where your phone only learns to predict the next few words, a large language model can be trained on large pieces of text so that it can generate whole documents.
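As a rough illustration of that ‘predict what comes next’ idea, here is a toy model that simply counts which word follows which in a tiny made-up corpus. Real large language models work at a vastly bigger scale and predict sub-word tokens with neural networks, so this is only a sketch of the principle.

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for "all the texts it's seen before" (illustrative only).
corpus = "the cat sat on the mat and the cat slept on the rug".split()

# Count which word tends to follow which: the simplest possible version
# of learning to predict the next word from past text.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequently observed continuation of `word`."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))   # -> 'cat' (seen twice, more often than 'mat' or 'rug')
print(predict_next("cat"))   # -> 'sat' (first of two equally frequent continuations)
```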
Prof. Sammut says generative AI systems have big limitations because chatbots lack critical thinking.
“These systems are based on doing pattern matching, they are very good at that, but they can’t do any sort of logical sequential reasoning.”
A chatbot can only tell you 1+1=2 because someone told it so, not because it learned how to do arithmetic.
“These systems can even write computer programs, which is like a sequential plan, but they all do it on patterns they’ve seen before and assemble it all together,” Prof. Sammut says.
“Usually, they manage to assemble it correctly, but not always.”
One experiment showed what can happen: researchers fed a photo into an image generator and asked it to make a copy of the picture. They fed the copy back in and asked it to do the task again, then repeated the process several times. It didn’t take long until the generated image looked like a blurry mess. The model had ‘collapsed’ by eating and regurgitating its own product.
Something like this is why Prof. Sammut says, for the foreseeable future, there’s always going to be a need for an intervening force with generative AI.
“That’s why combining it with classical AI systems, that do this logical reasoning, I think that’s really necessary.”
Below is a photo of UNSW’s Library Lawn uploaded to Adobe’s AI generator Firefly with the prompt to create “an exact replica of the photo uploaded for reference”.
The results were fed back to it and the process was repeated several times. The AI may have done a good job of avoiding a pixelated disaster, but the photo has morphed into something fairly different.
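The feedback loop behind this kind of collapse can be sketched with a toy numerical example (not the researchers’ actual setup): repeatedly fit a simple model to the previous generation’s output and sample fresh ‘data’ from it, and the statistics drift away from the real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Real" data: 50 samples from a population with mean 0 and spread 1.
data = rng.normal(0.0, 1.0, size=50)

# Each generation fits a simple model (mean and spread) to the previous
# generation's output, then generates fresh samples from that model:
# a numerical stand-in for a model feeding on its own output.
for generation in range(1, 16):
    mean, spread = data.mean(), data.std()
    data = rng.normal(mean, spread, size=50)
    print(f"generation {generation:2d}: mean={mean:+.2f}, spread={spread:.2f}")

# Over many generations the spread tends to shrink and the mean wanders:
# the successive models describe each other rather than the original data.
```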
Classic and synthetic
‘Classical AI’ is AI that represents knowledge in the form of symbols, like a set of rules. Such systems are often slower than generative AI systems, but they can have guarantees of correctness. Think of playing chess against a computer, or an early version of a robot vacuum cleaner bumping into a wall, reversing and correcting course.
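A minimal sketch of what ‘a set of rules’ means in practice, using an invented robot-vacuum example that loosely echoes the one above: the behaviour is written out explicitly by a person, so you can read exactly why the system does what it does.

```python
# Classical, symbolic AI in miniature: behaviour as explicit, human-readable
# rules rather than learned patterns. The scenario and rule set are illustrative.

def vacuum_policy(percepts: dict) -> str:
    """Choose an action from a fixed rule set, checked in priority order."""
    if percepts.get("bumped_wall"):
        return "reverse_and_turn"
    if percepts.get("dirt_detected"):
        return "vacuum_here"
    if percepts.get("battery_low"):
        return "return_to_dock"
    return "move_forward"

print(vacuum_policy({"bumped_wall": True}))    # -> reverse_and_turn
print(vacuum_policy({"dirt_detected": True}))  # -> vacuum_here
print(vacuum_policy({}))                       # -> move_forward
```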
Synthetic data may not be the best fix for generative AI’s data shortfall, but classical AI has plenty of uses for it.
Prof. Sammut’s robot soccer team, the rUNSWifts, uses synthetic data to train its players.
“We collect a lot of sample images, and then we do things to them, like we invert them, we rotate them, we do various transformations,” he says. “If you try and teach a robot to recognize an object, and all your data shows the object in one orientation, you take the image and you rotate it, and then you help it train to recognize in different orientations. All that stuff does work.”
This is the other version of synthetic data mentioned at the beginning of this story: alterations to real data that create more data, rather than data generated by AI.
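A small sketch of that kind of augmentation, assuming the image is just an array of pixel values: flips and rotations turn one labelled example into several, each showing the object in a different orientation. The tiny 3x3 ‘image’ is invented for illustration.

```python
import numpy as np

# A stand-in "sample image": a tiny 3x3 grayscale array. In practice this
# would be a camera frame from the robot.
image = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
])

# Simple transformations of the kind described: each variant keeps the same
# label but shows the object in a new orientation.
augmented = [
    np.fliplr(image),        # mirror left-right ("invert")
    np.flipud(image),        # mirror top-bottom
    np.rot90(image, k=1),    # rotate 90 degrees
    np.rot90(image, k=2),    # rotate 180 degrees
]

for variant in augmented:
    print(variant, end="\n\n")
```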
The rUNSWift team has taken five world titles since 2000, so there must be something in it.
Data in a different light
One AI industry that needs as much data as possible, regardless of where it comes from, is self-driving cars.
A comparison between robot cars and ones driven by humans found that self-driving cars are ‘generally’ safer, but have a big problem with safety at sunrise and sunset.
One crash in the US happened during one of these low-light conditions, when a truck pulled out in front of a car. The Tesla hit the truck’s carriage and kept driving even though it had been split in half.
“And that was because the vision system thought they saw this big white thing and thought it was a bridge you could drive under, and that's the sort of thing that obviously [the Autopilot had] never seen that particular configuration of the truck,” Prof. Sammut says.
“The accidents that you see happening with self-driving cars is because they've collected lots and lots of data, but they can never collect everything.”
Self-driving vehicle companies operating in the US are using synthetic data to train their cars, and some businesses now specialise in providing synthetic, virtual worlds for companies to train their autonomous devices.
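Very roughly, a synthetic virtual world lets engineers generate rare situations, such as a crossing truck at low sun, on demand instead of waiting to encounter them on the road. The parameters below are invented for illustration; real simulators model full 3D scenes, sensors and physics.

```python
import random

# A toy scenario generator: each call produces a randomised "situation"
# a simulated car could be trained or tested against. All fields are illustrative.
def sample_scenario(rng: random.Random) -> dict:
    return {
        "sun_elevation_deg": rng.uniform(-5, 10),          # low sun at dawn or dusk
        "crossing_vehicle": rng.choice(["none", "car", "truck"]),
        "vehicle_colour": rng.choice(["white", "grey", "red"]),
        "road_glare": rng.random() < 0.5,
    }

rng = random.Random(7)
for _ in range(3):
    print(sample_scenario(rng))
```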
“There will always be a certain amount of uncertainty,” Prof. Sammut says.
“Do you want driverless cars to be 100% perfect? Because people aren't either. It's just they've got to be better than, or at least as good as, the reliability of people. But maybe that's not going to be enough for everybody to accept that.”
Media enquiries
For enquiries about this story, please contact Jacob Gillard, News & Content Coordinator (Engineering)
Tel: +61 2 9348 2511
Email: jacob.gillard@unsw.edu.au