Using generative AI to solve problems

Alexander Hudek (PhD ’10) shares his expert opinion on the future of large language models (LLMs) and guidance on how best to use them.

Waterloo alum Alexander Hudek (PhD ’10) co-founded Kira Systems and spent two and a half years fine-tuning Kira’s custom AI algorithms. Currently, he’s an advisor to Zuva and Vigilant AI and is the author of AI for Lawyers.

AI has captured the world's attention and imagination, fueling fears and excitement alike. Will it take our jobs? Bury us in fake content? Take over the world? What can individuals do in the face of such questions? Much has been written on these topics by experts, and rather than add another voice, I’d like instead to focus on where AI will go in the immediate future and give guidance on issues you’ll face when using AI to solve your own problems.

But first, AI is a big topic encompassing many technologies. The focus of this article is the class of machine learning models known as instruction-tuned large language models (LLMs). Examples include ChatGPT, LLaMA and PaLM 2.

Popular media often implies that these models can be used easily for any task, like replacing Alexa and Siri or replacing internet search. However, it’s crucial to remember that LLMs by themselves are simply text generators. They can’t turn on your lights, tell you the weather, or search the internet. If you want to use an LLM to solve a problem of your own, you’ll need to combine it with both existing and new technology. And when doing that, you’ll need to understand and account for their limitations.

Many articles discuss the limitations of LLMs, and rather than repeat them all, I’d like to focus on some lesser-discussed problems. But first, let’s quickly recap three of the most common limitations.

Hallucinations and a lack of trusted information. LLMs like ChatGPT will often give you made-up information, and they will do so with complete confidence.

Expensive to train and operate. By some estimates, ChatGPT costs $700,000 per day to run. This is because LLMs require expensive GPU hardware to run, and even more expensive hardware to train.

Bias and inappropriate responses. All LLMs are trained on vast quantities of information that has been scraped from the internet. This information contains inappropriate language, vulgarity, and human biases, and these can come out in LLM responses.

While these are well discussed, there are other, more often overlooked limitations that are just as important to understand when working with LLMs.

They don’t actively learn. This can be confusing, because when you chat with an LLM it certainly appears to remember things you’ve said previously. Unfortunately, there is a limit to how far back it will remember during a chat, and once the chat ends, it forgets everything. It won’t remember yesterday’s news, or even what you said an hour ago. This means LLMs fall out of date very quickly when not explicitly updated. You can update them via fine-tuning, but this is costly and popular APIs don’t let you do it on all models.
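
To make this concrete, here is a minimal sketch of how chat “memory” typically works: the application simply resends the conversation history with every request, trimmed to fit a fixed context budget. The function names (call_llm, trim_history), the whitespace-based token count, and the 4,000-token budget are illustrative assumptions, not any particular vendor’s API.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model provider's API call here.
    return "(model response)"

def count_tokens(text: str) -> int:
    # Crude approximation; real systems use the model's own tokenizer.
    return len(text.split())

def trim_history(history: list[str], budget: int) -> list[str]:
    """Keep only the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for message in reversed(history):
        used += count_tokens(message)
        if used > budget:
            break  # older messages are silently dropped -- the model never sees them
        kept.append(message)
    return list(reversed(kept))

def chat_turn(history: list[str], user_message: str, budget: int = 4000) -> str:
    history.append(f"User: {user_message}")
    prompt = "\n".join(trim_history(history, budget))
    reply = call_llm(prompt)  # the model only "remembers" what is inside `prompt`
    history.append(f"Assistant: {reply}")
    return reply
```

Everything outside the trimmed prompt, including anything said before the budget runs out, is simply invisible to the model on that turn.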

Limited reasoning abilities. LLMs can often appear to be very intelligent, even to the point of laying out what look like logical arguments. However, this is often more illusion than reality. An easy test to see this for yourself is to ask ChatGPT to add some very large numbers. While it will give correct answers for small numbers, it will usually fail on very large ones. It won’t tell you that it’s unable to do it; it will just give you a wrong answer. If you tell it that it’s wrong and ask it to correct itself, it will give you another wrong answer! This is an area where further work is being done. In the meantime, be cautious about asking an LLM to do complex reasoning tasks. Large models like GPT-4 will appear to do well because they have seen many reasoning problems, but on new or niche reasoning tasks they can still fail.
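
If you want to check this programmatically rather than by eye, a small harness like the sketch below compares the model’s answer against exact integer arithmetic. Here ask_llm is a hypothetical stand-in for whichever chat API you use.

```python
import random

def ask_llm(prompt: str) -> str:
    # Placeholder: route this to whichever LLM you are testing.
    return "(model's reply)"

a = random.randrange(10**17, 10**18)
b = random.randrange(10**17, 10**18)
claimed = ask_llm(f"What is {a} + {b}? Reply with only the number.")
actual = a + b

cleaned = claimed.strip().replace(",", "")
print(f"model said:   {claimed}")
print(f"exact answer: {actual}")
print("correct" if cleaned == str(actual) else "wrong, and the model gave no warning")
```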

Lack of information on model accuracy. The accuracy of LLMs is judged against standard task benchmarks such as summarization, question answering, or reasoning tasks. However, these won’t exactly match your particular use case and domain. In particular, if the domain you are applying LLM technology to lacks public examples, as with some medical or legal topics, the reported accuracy on standard benchmarks is very unlikely to reflect actual accuracy in your domain.

So where do we go from here? You think LLMs can enhance your product and you want to integrate one into your software stack, but how do you handle these issues? Fortunately, the near future will see continued rapid advances in AI, and not just through bigger models.

Tool use. One thing you will increasingly see, and which you can try in your own organization, is integrating LLMs with more specialized tools. For example, LLMs can be taught to use search tools that have access to verified, trusted, and up-to-date information, rather than trying to answer themselves and risk hallucinating. Similarly, teaching LLMs to use calculators and logical reasoners can give them more reliable, and even human-verifiable, reasoning abilities. Today, people use integration frameworks like LangChain and LMQL to achieve this, and new AI research will further improve tool use.
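
Here is a minimal sketch of the pattern, assuming a made-up “CALC:” convention and a placeholder call_llm function rather than any real framework’s API; libraries such as LangChain provide more robust versions of this plumbing.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model provider's API.
    return "CALC: 123456789 * 987654321" if "CALC" in prompt else "(final answer)"

def safe_calculator(expression: str) -> str:
    # A real system should use a proper expression parser, not eval().
    if not set(expression) <= set("0123456789+-*/(). "):
        raise ValueError("unsupported expression")
    return str(eval(expression))

def answer(question: str) -> str:
    prompt = (
        "Answer the question. If it requires arithmetic, reply only with "
        f"'CALC: <expression>'.\nQuestion: {question}"
    )
    reply = call_llm(prompt)
    if reply.startswith("CALC:"):
        # The arithmetic is done by a reliable tool, not by the model.
        result = safe_calculator(reply[len("CALC:"):].strip())
        # Feed the verified result back so the model can phrase the final answer.
        reply = call_llm(f"The calculator returned {result}. Answer: {question}")
    return reply

print(answer("What is 123456789 times 987654321?"))
```

The same dispatch structure works for search: the model asks for a query, your code runs it against a trusted index, and the retrieved text is passed back for the model to summarize.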

Measure accuracy. A crucial part of any production machine learning system is measuring its accuracy, both in terms of correctness and in terms of bias. Unfortunately, the ad-hoc “prompt engineering” common in the industry does not address this. Even if you feel that ad-hoc results are “good enough,” coming AI regulation like the EU AI Act will require more formal accuracy and bias testing. Fortunately, existing techniques such as holdout test data still work, especially if you pair them with fine-tuning on a training set rather than relying on prompt engineering alone.
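
As a starting point, here is a minimal holdout-evaluation sketch. The task, the classify_with_llm stand-in, the field names, and the toy data are assumptions for illustration, but the pattern of reporting accuracy overall and per group carries over to real systems.

```python
from collections import defaultdict

def classify_with_llm(text: str) -> str:
    # Placeholder: your prompt template or fine-tuned model goes here.
    return "approve"

# A labelled holdout set that was never used while developing prompts or fine-tuning.
holdout = [
    {"text": "example input 1", "label": "approve", "group": "A"},
    {"text": "example input 2", "label": "deny", "group": "B"},
]

correct = defaultdict(int)
total = defaultdict(int)
for example in holdout:
    prediction = classify_with_llm(example["text"])
    total[example["group"]] += 1
    correct[example["group"]] += int(prediction == example["label"])

overall = sum(correct.values()) / sum(total.values())
print(f"overall accuracy: {overall:.1%}")
for group in total:
    # Large gaps between groups are a first signal of bias worth investigating.
    print(f"group {group}: {correct[group] / total[group]:.1%}")
```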

Be ahead of regulation. AI regulation is coming, and it’s worth getting ahead of it. Adding bias or accuracy testing to your product after the fact can be more work than doing it up front. Regulation will likely include standardization around best practices for building and operating models, requirements for transparency in how AI systems are built and maintained, clear measures of accuracy and bias, formal risk assessments, and, for more sensitive applications, requirements that humans be involved in decision making, filtering, or correction of outputs. The EU’s AI Act is a good place to start, and there are good summaries of it online. You can also look at Canada’s Bill C-27.

Finally, keep up with the ongoing work in the area. There are many good sources, from blogs to academic publications and even code examples. The field is fast-moving, and staying up to date requires constant attention!