As generative AI (genAI) continues to move into broad use by the public and various enterprises, its adoption is sometimes plagued by errors, copyright infringement issues and outright hallucinations, undermining trust in its accuracy.
One study from Stanford University found genAI makes mistakes when answering legal questions 75% of the time. “For instance,” the study found, “in a task measuring the precedential relationship between two different [court] cases, most LLMs do no better than random guessing.”
The problem is that the large language models (LLMs) behind genAI technology, like OpenAI’s GPT-4, Meta’s Llama 2 and Google’s PaLM 2, are not only amorphous with nonspecific parameters, but they’re also trained by fallible human beings who have innate biases.
LLMs have been characterized as stochastic parrots: as they get larger, their answers become more conjectural and random. These “next-word prediction engines” continue parroting what they’ve been taught, but without a logic framework.
One method of reducing hallucinations and other genAI-related errors is Retrieval Augmented Generation, or “RAG,” which grounds a genAI model in a curated set of documents so it can return more accurate and specific responses to queries.
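In outline, the pattern is simple: retrieve relevant passages first, then ask the model to answer from that retrieved material. The sketch below is a simplified illustration of RAG, not any vendor's implementation; `search_documents` and `llm_complete` are hypothetical stand-ins for a vector-store lookup and a chat-completion call.

```python
def answer_with_rag(question: str, top_k: int = 3) -> str:
    # 1. Retrieval: pull the passages most relevant to the question
    #    from a curated, domain-specific document store.
    passages = search_documents(question, limit=top_k)   # hypothetical retriever

    # 2. Augmentation: place the retrieved text in the prompt so the model
    #    grounds its answer in that material instead of guessing.
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generation: the LLM produces an answer constrained by that context.
    return llm_complete(prompt)                           # hypothetical LLM call
```

Because the model is instructed to answer only from the retrieved context, a well-built RAG pipeline narrows the room for hallucination, but it doesn't close it.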
But RAG doesn’t clean up the genAI mess because there are still no logical rules for its reasoning.
In other words, genAI’s natural language processing has no transparent rules of inference for reliable conclusions (outputs). What’s needed, some argue, is a “formal language” or a sequence of statements — rules or guardrails — to ensure reliable conclusions at each step of the way toward the final answer genAI provides. Natural language processing, absent a formal system for precise semantics, produces meanings that are subjective and lack a solid foundation.
But with monitoring and evaluation, genAI can produce vastly more accurate responses.
“Put plainly, it’s akin to the straightforward agreement that 2+2 equals 4. There is no ambiguity with that final answer of 4,” David Ferrucci, founder and CEO of Elemental Cognition, wrote in a recent blog post.
Ferrucci is a computer scientist who worked as the lead researcher for IBM’s Watson supercomputer, the natural language processor that won the television quiz show Jeopardy! in 2011.
A recent example of genAI going wildly astray involves Google’s new Gemini tool, which took user text prompts and created images that were clearly biased toward a certain sociopolitical view. User text prompts requesting images of Nazis generated Black and Asian Nazis. When asked to draw a picture of the Pope, Gemini responded by creating an Asian, female Pope and a Black Pope.
Google was forced to take the platform offline to address the issues. But Gemini’s problems are not unique.
Elemental Cognition developed something called a “neuro-symbolic reasoner.” The reasoner, named Braid, builds a logical model of the language it is reading from an LLM based on interviews performed by Ferrucci’s employees.
“We interview the business analysts and say, ‘Let me make sure I understand your problem. Let’s go through the various business rules and relation constraints and authorizations that are important to you,’” Ferrucci said. “Then what you end up with is a formal knowledge model executed by this formal logical reasoner that knows how to solve these problems.
“To put it simply, we use neural networks for what they’re good at, then add logic, transparency, explicability, and collaborative learning,” Ferrucci said. “If you tried to do this end-to-end with an LLM, it will make mistakes, and it will not know that it’s made mistakes. Our architecture is not an LLM-alone architecture.”
Subodha Kumar, a professor of statistics, operations, and data science at Temple University, said no genAI platform will be without biases, “at least in the near future.”
“More general-purpose platforms will have more biases,” Kumar said. “We may see the emergence of many specialized platforms that are trained on specialized data and models, with fewer biases. For example, we may have a separate model for oncology in healthcare and a separate model for manufacturing.”
Prompt engineering, the practice of shaping prompts so an LLM provides business-specific answers, is replaced with a set of logical rules; those rules can ensure a precise and unambiguous interactive conversation, run by the general-purpose reasoner and driven through an LLM, according to Ferrucci.
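The sketch below is a toy illustration of that guardrail idea, not Braid or Elemental Cognition's actual system: explicit, checkable rules sit outside the model and validate whatever the model proposes before it is acted on. The booking scenario and rule names are invented for the example.

```python
# Illustrative only: a toy "formal rules" layer that validates an LLM's
# proposed travel booking against explicit business constraints, rather
# than trusting the model's free-form output.

RULES = [
    ("budget", lambda b: b["total_cost"] <= b["budget_limit"]),
    ("dates",  lambda b: b["checkout_day"] > b["checkin_day"]),
    ("policy", lambda b: b["hotel_class"] <= 4 or b["approved_by_manager"]),
]

def validate(booking: dict) -> list[str]:
    """Return the names of any rules the proposed booking violates."""
    return [name for name, check in RULES if not check(booking)]

proposed = {
    "total_cost": 1800, "budget_limit": 1500,
    "checkin_day": 10, "checkout_day": 12,
    "hotel_class": 5, "approved_by_manager": False,
}
violations = validate(proposed)  # -> ["budget", "policy"]
```

A violated rule can be sent back to the model or to a human, so the final answer is constrained by explicit logic rather than by the model's own judgment.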
Elemental Cognition is among a series of startups and established cloud service providers, including IBM, creating genAI monitoring, evaluation and observability tools that act as a type of checksum against genAI outputs. In some cases, those checksum technologies are themselves AI engines; in other words, one AI platform monitors another to help ensure the monitored platform isn’t spewing erroneous answers or content.
Along with Elemental Cognition, companies providing these kinds of genAI tools include Arize, TruEra, and Humanloop. A variety of machine-learning platforms such as DataRobot are also moving into the AI-monitoring arena, according to Kathy Lang, research director for IDC’s AI and Automation practice.
Monitoring genAI outputs has so far generally required keeping a human in the loop, especially within enterprise deployments. While that will likely remain the case for the foreseeable future, monitoring and evaluation technology can drastically reduce the number of AI errors.
“You can have humans judge the output and responses of LLMs and then incorporate that feedback into the models, but that practice isn’t scalable. You can also use evaluation functions or other LLMs to judge the output of other LLMs,” Lang said. “It is definitely becoming a trend.”
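A bare-bones version of that LLM-as-judge pattern might look like the following; `llm_complete` is again a hypothetical chat-completion call rather than any particular vendor's API.

```python
# Sketch of the "LLM as judge" pattern: a second model grades the first
# model's answer instead of (or before) a human reviewer.

JUDGE_PROMPT = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Reference material: {reference}

Reply with a single word, CORRECT or INCORRECT, followed by one sentence
explaining your judgment."""

def judge_answer(question: str, answer: str, reference: str) -> bool:
    verdict = llm_complete(  # hypothetical chat-completion call
        JUDGE_PROMPT.format(question=question, answer=answer, reference=reference)
    )
    return verdict.strip().upper().startswith("CORRECT")
```

Answers the judge flags as incorrect can then be routed to a human reviewer, keeping people in the loop only for the cases that need them.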
Lang places LLM monitoring software in the category of Large Language Model Operations (LLMOps), which covers tools that evaluate and debug LLM-based applications. More generally, it’s called Foundation Model Ops, or FMOps.
“FMOps is…explicitly used for automating and streamlining the genAI lifecycle,” Lang said. “The subjective nature of genAI models requires some new FMOps tools, processes, and best practices. FMOps capabilities include testing, evaluating, tracking, and comparing foundation models; adapting and tuning them with new data; developing custom derivative models; debugging and optimizing performance; and deploying and monitoring FM-based applications in production.
“It’s literally machine learning operations for LLMs…that focuses on new sets of tools, architectural principles and best practices to operationalize the lifecycle of LLM-based applications,” Lang said.
For example, Arize’s Phoenix tool uses one LLM to evaluate another for relevance, toxicity, and quality of responses. The tool uses “Traces” to record the paths taken by LLM requests (made by an application or end user) as they propagate through multiple steps. An accompanying OpenInference specification uses telemetry data to understand the execution of LLMs and the surrounding application context. In short, it’s possible to figure out where an LLM workflow broke or troubleshoot problems related to retrieval and tool execution.
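Stripped of any particular product's API, the tracing idea looks something like the sketch below, which records each step of a request so a bad answer can be traced back to the stage that produced it. This is generic Python, not the Phoenix or OpenInference interface.

```python
import time
import uuid

def traced(step_name, trace):
    """Decorator that records a step's output, timing, and errors into `trace`."""
    def wrap(fn):
        def inner(*args, **kwargs):
            span = {"step": step_name, "id": str(uuid.uuid4()), "start": time.time()}
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            except Exception as exc:      # the failing step is captured, not lost
                span["error"] = repr(exc)
                raise
            finally:
                span["elapsed"] = time.time() - span["start"]
                trace.append(span)
        return inner
    return wrap

# Usage: wrap each stage of a pipeline, e.g. @traced("retrieval", trace) and
# @traced("generation", trace), then inspect `trace` when an answer looks wrong
# to see whether retrieval, a tool call, or generation broke.
```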
Avivah Litan, a distinguished vice president analyst with Gartner Research, said the LLM monitoring and evaluation technologies work in different ways. Some, she said, check the source of the data and try to check the provenance of the response from the LLM, “and if they can’t find one, then they assume it’s a hallucination.”
Other technologies look for contradictions between the input and the output embeddings, and if they don’t match or “add up,” it’s flagged as a hallucination. Otherwise, it’s cleared as an appropriate response.
Other vendors’ technologies look for “outliers” or responses that are out of the ordinary.
In the same way Google search operates, information in the database is transformed into numerical data, a practice known as “embedding.” For example, a hotel in a region may be given a five-digit designation because of its price, amenities and location. If you’re searching Google for hotels in an area with similar pricing and amenities, the search engine will feed back all hotels with similar numbers.
In the same way, LLM evaluation software looks for answers that are close to the query’s embedding, that is, the data that most closely resembles the query. “If it’s something [that’s] far away from that embedding, then that indicates an outlier, and then you can look up why it’s an outlier. You can then determine that it’s not a correct source of data,” Litan said. “Google likes that method because they have all the search data and search capabilities.”
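In code, that outlier check reduces to measuring the distance between embeddings. The sketch below assumes a hypothetical `embed` function standing in for whatever embedding model the evaluation tool uses, and an illustrative similarity threshold.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_outlier(response: str, sources: list[str], threshold: float = 0.7) -> bool:
    """Flag a response whose embedding sits far from every known source."""
    response_vec = embed(response)            # hypothetical embedding model
    source_vecs = [embed(s) for s in sources]
    best_match = max(cosine_similarity(response_vec, v) for v in source_vecs)
    # If no source is sufficiently similar, treat the response as an outlier
    # that may not be grounded in the underlying data.
    return best_match < threshold
```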
Another way LLM evaluation tools can minimize hallucinations and erroneous outputs is to look for the source of the response that’s given. If there’s no credible source for it, that means it’s a hallucination.
“All the major cloud vendors are also working on similar types of technology that helps to tune and evaluate LLM applications,” Lang said.