As we deploy Generative AI models to synthesize information, generate content, and serve as our copilots, it is important to understand how and why these models may produce errors. More importantly, we must be prepared for entirely different categories of errors than those produced by Classical AI models. For the purposes of this article, by Classical AI (often referred to as Traditional or Symbolic AI) we mean AI systems that use rules and logic to mimic human intelligence. Humans perceive data, think about and process it, then act on those thoughts.
Classical AI models are primarily used to analyze data and make predictions. What is a customer likely to consume next? What is the best action a salesperson should take with a customer at that moment? Which customers are at risk of churn, why, and how can we mitigate such risks? Classical AI algorithms process data and return expected results such as analyses or predictions. They are trained on data that helps them perform that specific task exceedingly well.
Generative AI models, on the other hand, are prebuilt to serve as a foundation for a range of applications. Generative AI entails training one model on a huge amount of data, then adapting it to many applications. The data can be formatted as text, images, speech, and more. The tasks that such models can perform include, but are not limited to, answering questions, analyzing sentiment, extracting information from text, labeling images, and recognizing objects. These models work because they can effectively apply knowledge learned in one task to another. Generative AI models create new data, such as answers to questions, summaries of texts, and new images, all based on pre-existing training data.
Predictably, Classical AI and Generative AI use different paradigms for training. Classical AI focuses on a narrow task and is therefore trained on a well-defined and curated dataset, whereas Generative AI is trained on large volumes of public data to learn language and the information embedded within it. Because of these differences in data curation, and other differences in how the models are trained and how their outputs are consumed, the points of failure differ between Classical and Generative AI. Let us consider three areas of error across AI models: input data; model training and fine-tuning; and output generation and consumption.
Errors with input data
In the case of Classical AI, altering the datasets in any way will likely lead to errors in output. Imagine a factory that produces electronic components, such as microchips, using a Classical AI-based quality control system. This system has been trained on a highly curated dataset of images of defect-free and defective microchips. The dataset includes various types of defects like cracks, missing components, and soldering issues, and the AI model has been trained to identify these defects accurately. Now, suppose the manufacturing process undergoes some changes, perhaps using different materials or a slightly modified production line. As a result, the appearance of the microchips might change subtly, even though they are still fully functional. When the updated production line starts producing microchips and the AI quality control system is deployed to inspect them, the model might start producing errors in its output. It might incorrectly identify some of the newly manufactured microchips as defective because it’s not familiar with the variations introduced by the new production process. This discrepancy between old training data and new manufacturing conditions can lead to false positives and negatives in the quality control process.
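This failure mode can be sketched with a toy classifier. The snippet below is purely illustrative: two synthetic numeric features stand in for image-derived measurements, and the modified production line is simulated as a shift in those features.

```python
# Toy sketch of data drift in a quality control classifier.
# Two synthetic features stand in for image-derived measurements of a chip;
# the "modified production line" is simulated as a shift in those features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_chips(n, shift=0.0):
    """Return n chips with two features and a defective/non-defective label."""
    defective = rng.integers(0, 2, size=n)
    features = rng.normal(loc=defective[:, None] * 1.5 + shift, scale=0.5, size=(n, 2))
    return features, defective

# Train the inspection model on the original production line.
X_train, y_train = make_chips(5000)
model = LogisticRegression().fit(X_train, y_train)

# Evaluate on chips from the original line and from a subtly modified line.
X_old, y_old = make_chips(2000, shift=0.0)
X_new, y_new = make_chips(2000, shift=1.0)  # same labels, shifted appearance

print("accuracy on original line:", model.score(X_old, y_old))
print("accuracy on modified line:", model.score(X_new, y_new))
# The second number is typically far lower: many defect-free chips from the
# new line are flagged as defective because the input distribution drifted
# away from what the model saw during training.
```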
Meanwhile, Generative AI models are trained on a diverse range of data. But if they are given an input that deviates significantly from what they were trained on, they might produce inaccurate or nonsensical outputs. You might also get inaccurate information if you ask such a model about a topic that emerged after its training data was collected. And, of course, biased or ambiguous questions can lead Generative AI models astray. If I ask an LLM, “How does the sun rise in the west?”, the model may respond with an explanation of this fake phenomenon. In this scenario, the LLM has misinterpreted the context and generated a response that attempts to provide a coherent explanation for an illogical statement.
Errors with models
In the case of Classical AI, there are principally three types of errors to consider. Because these models are trained for narrow and well-defined tasks, problem formulation is very important. If you are recommending which movie to watch, should you optimize for long-term customer satisfaction, or for the probability that the user picks the first movie you surface, which may not lead to customer satisfaction in the long run? Another type of error arises from estimating the wrong functional form. For example, home prices depend on their features in a complex, non-linear way, so a linear regression would produce errors; validating on data held out from training (out-of-sample validation) can expose such mistakes quickly. Lastly, overfitting may cause errors. Suppose one is trying to predict whether a patient has a disease based on medical tests. If one keeps increasing the model’s complexity to fit all the training data, the model may achieve high training accuracy but generalize very poorly.
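Both the wrong-functional-form error and the overfitting error can be illustrated with a short, self-contained sketch on synthetic data; the feature names and numbers below are made up for illustration.

```python
# A minimal sketch, on synthetic data, of two of the model errors above:
# a wrong functional form and overfitting, both exposed by held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Wrong functional form: home prices grow non-linearly with size.
sqft = rng.uniform(0.5, 4.0, size=300)                   # thousands of square feet
price = 50 + 60 * sqft**2 + rng.normal(0, 25, size=300)  # thousands of dollars
X = sqft.reshape(-1, 1)
X_tr, X_va, y_tr, y_va = train_test_split(X, price, random_state=0)
for name, reg in [("linear fit", LinearRegression()),
                  ("quadratic fit", make_pipeline(PolynomialFeatures(2), LinearRegression()))]:
    reg.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_va, reg.predict(X_va)) ** 0.5
    print(f"{name:13s} held-out RMSE: {rmse:.1f}")
# The linear model's held-out error stays well above the noise level (25),
# while the quadratic model's error approaches it: the wrong functional
# form shows up clearly on data the model never saw.

# Overfitting: an unconstrained model memorizes noisy medical test data.
X_med = rng.normal(size=(600, 20))                        # 20 test results per patient
y_med = (X_med[:, 0] + 0.7 * rng.normal(size=600) > 0)    # driven mostly by one test
X_tr, X_te, y_tr, y_te = train_test_split(X_med, y_med, random_state=0)
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)        # unlimited depth
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
print(f"deep tree    train={deep.score(X_tr, y_tr):.2f}  test={deep.score(X_te, y_te):.2f}")
print(f"shallow tree train={shallow.score(X_tr, y_tr):.2f}  test={shallow.score(X_te, y_te):.2f}")
# The deep tree fits the training data essentially perfectly but tends to
# generalize worse than the much simpler tree on held-out patients.
```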
In the case of Generative AI, given that LLMs are stochastic parrots, models need to be evaluated on how informative they are and how honest they are. If a model is fine-tuned with a small dataset, for the wrong problem, or on unreliable data with incorrect labels, it will produce inaccuracies. Consider a generative model that transforms input images into the style of famous artists, trained on a large dataset spanning many artistic styles. If I were to fine-tune this model with a small dataset of cartoon images, it may not perform as intended due to the lack of diversity in the fine-tuning data.
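The "fine-tuned for the wrong problem" case can be sketched with a toy stand-in: a simple linear classifier is "pretrained" on plenty of data for its intended task and then fine-tuned heavily on a tiny dataset labeled for a different task. Everything here is synthetic and illustrative; a real generative model is far larger, but the failure pattern is analogous.

```python
# Hedged sketch of fine-tuning for the wrong problem, using a small
# scikit-learn classifier as a toy stand-in for a large generative model.
# The data, labels, and numbers are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(2)

def broad_data(n):
    """Diverse pretraining data: labels follow the intended rule x0 + x1 > 0."""
    X = rng.normal(0, 2, size=(n, 2))
    return X, (X[:, 0] + X[:, 1] > 0).astype(int)

def narrow_finetune_data(n):
    """A tiny fine-tuning set labeled for a *different* problem (x0 - x1 > 0)."""
    X = rng.normal(0, 2, size=(n, 2))
    return X, (X[:, 0] - X[:, 1] > 0).astype(int)

# "Pretrain" on plenty of data for the intended task.
X_pre, y_pre = broad_data(5000)
model = SGDClassifier(random_state=0)
model.partial_fit(X_pre, y_pre, classes=[0, 1])
for _ in range(20):
    model.partial_fit(X_pre, y_pre)

X_test, y_test = broad_data(2000)
print("intended-task accuracy before fine-tuning:", model.score(X_test, y_test))

# "Fine-tune" for many passes on a small dataset labeled for the wrong problem.
X_ft, y_ft = narrow_finetune_data(30)
for _ in range(500):
    model.partial_fit(X_ft, y_ft)

print("intended-task accuracy after fine-tuning: ", model.score(X_test, y_test))
# The second score typically collapses toward chance: heavy fine-tuning on a
# small, wrongly labeled dataset overwrites what the model had learned for
# its original task.
```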
Errors in consumption
In the case of Classical AI, this error occurs when a model is trained with a specific task in mind, but ultimately utilized for a different task. For example, a customer segmentation algorithm may not perform as well in detecting fraud.
In the case of Generative AI, consumption errors come in a few flavors. Hallucination errors occur when a model generates inaccurate responses due to shortcomings in its training process or architecture; a model expected to generate images of real animals may instead produce fantastical creatures that do not exist, such as a unicorn or a dragon. A second source of error is infringement. Recall that LLMs are stochastic parrots: such a model may inadvertently reproduce passages of copyrighted text from its training data, leading to plagiarism or copyright violations. A third source of error is obsolescence: a response may be out of date because the model's knowledge is limited to the time frame of its training data. A final source of error relates to repeatability: LLMs are well suited to questions with No One Right Answer (NORA), but different runs may produce different responses, and with them different inaccuracies.
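The repeatability point can be illustrated without invoking a real LLM. The toy decoder below samples from a fixed, made-up next-token distribution: greedy decoding is deterministic, while temperature sampling can return different answers on different runs.

```python
# Toy illustration of repeatability: sampling-based decoding gives different
# outputs on different runs, while greedy decoding is deterministic.
# The "model" here is just a fixed next-token distribution, not a real LLM.
import numpy as np

tokens = ["Paris", "Lyon", "Marseille", "Nice"]
logits = np.array([3.0, 1.5, 1.2, 0.8])   # hypothetical model scores

def decode(temperature, rng=None):
    if temperature == 0:                   # greedy: always the top-scoring token
        return tokens[int(np.argmax(logits))]
    probs = np.exp(logits / temperature)   # softmax with temperature
    probs /= probs.sum()
    return rng.choice(tokens, p=probs)

rng = np.random.default_rng()              # unseeded: varies from run to run
print("greedy: ", [decode(0) for _ in range(5)])
print("sampled:", [decode(1.0, rng) for _ in range(5)])
# The greedy output is identical every time; the sampled output can differ
# between runs, which suits open-ended questions but means two users asking
# the same question may receive different (and differently flawed) answers.
```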
It is important to understand the source of the errors so as to mitigate them in the most appropriate manner.