Achieving alignment: How U of T researchers are working to keep AI on track

Michael Zhang focuses on AI safety as a graduate researcher at U of T's Schwartz Reisman Institute for Technology and Society

Michael Zhang, a PhD student in computer science, says there are myriad reasons why AI models may not respect the intentions of their human creators (supplied image)

In the year since OpenAI released ChatGPT, what once seemed like an esoteric question among researchers has pushed its way to the forefront of public discourse: As artificial intelligence becomes more capable, how do we ensure AI systems act in the best interests of humans and – crucially – not turn against us? 
 
This dilemma could determine the fate of humanity in the eyes of some researchers, including University of Toronto University Professor emeritus Geoffrey Hinton, known as the “godfather of AI,” who is warning that the technology he helped create could evolve into an existential threat. Others have raised alarm about nearer-term risks such as job losses, disinformation and AI-powered warfare. 
 
Michael Zhang, a PhD student in computer science in U of T’s Faculty of Arts & Science, is focused on AI safety and interdisciplinary thinking about the technology as a graduate fellow at the Schwartz Reisman Institute for Technology and Society – and co-authored an article on the subject earlier this year. 
 
He recently spoke with U of T News about the alignment problem and what is being done to try to solve it.

What, exactly, is meant by AI alignment? 

 
In the research sense, it means trying to make sure that AI does what we intended it to do – so it follows the objectives that we try to give it. But there are lots of problems that can arise, some of which we’re already seeing in today’s models. 
 
One is called reward misspecification. It’s tricky to specify the reward function, or objective, you want in the form of a number that an AI model can optimize. For example, if you’re a company, you might try to maximize profits – that’s a relatively simple objective. But in pursuing it, there can be unintended consequences in the real world: the model might make or recommend decisions that are harmful to employees or the environment. Rewards can be underspecified in even simpler settings. If we ask a robot to bring us coffee, we are also implicitly asking it to do so without breaking anything in the kitchen. 
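
To make the idea concrete, here is a minimal Python sketch with invented reward functions and outcomes (none of this comes from the article): a reward that only counts whether the coffee arrived cannot tell a careful robot apart from one that smashes the kitchen along the way.

```python
# Minimal sketch with invented reward functions and outcomes (illustration only).
def misspecified_reward(outcome: dict) -> float:
    # Rewards only the stated goal; "broken_items" never enters the score.
    return 1.0 if outcome["coffee_delivered"] else 0.0

def intended_reward(outcome: dict) -> float:
    # What we actually wanted: deliver coffee without breaking anything.
    penalty = 0.5 * outcome["broken_items"]
    return (1.0 if outcome["coffee_delivered"] else 0.0) - penalty

careless_run = {"coffee_delivered": True, "broken_items": 3}
careful_run = {"coffee_delivered": True, "broken_items": 0}

# Under the misspecified reward, both behaviours look equally good.
print(misspecified_reward(careless_run), misspecified_reward(careful_run))  # 1.0 1.0
# Under the intended reward, the careless behaviour is clearly worse.
print(intended_reward(careless_run), intended_reward(careful_run))          # -0.5 1.0
```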
 
Another problem is bias. The AI model doesn’t have a mind of its own – it’s given a very strict mathematical objective. But we’re biased, and we generate data that’s biased, and that’s what we give our models. If there exists some underlying bias in the training dataset, the model will “learn” the bias because it best accomplishes that mathematical objective. We’ve already seen how this can lead to issues when we ask AI systems to make decisions such as whether someone should receive bail, or to do the first round of resume screening.  
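
As a rough illustration – with entirely fictional data – a model that simply fits its training labels will reproduce whatever skew those labels contain:

```python
# Rough illustration with fictional data: fitting biased labels reproduces the bias.
from collections import defaultdict

# Invented historical decisions, skewed against group "B".
training_data = [
    ("A", "approve"), ("A", "approve"), ("A", "deny"),
    ("B", "deny"), ("B", "deny"), ("B", "approve"),
]

# "Training": record how often each decision was made for each group.
counts = defaultdict(lambda: defaultdict(int))
for group, decision in training_data:
    counts[group][decision] += 1

def predict(group: str) -> str:
    # The mathematically "best" prediction is the majority label,
    # which simply replays the historical skew.
    return max(counts[group], key=counts[group].get)

print(predict("A"))  # approve
print(predict("B"))  # deny
```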
 
If we built the AI models, how is it they learn to do things we didn’t foresee? 
 
When we talk about emergent behaviours – abilities that are present in larger models but not in smaller ones – it’s useful to think about large language models (LLMs) such as ChatGPT. If given an incomplete sentence, ChatGPT’s objective is to predict what the next word is going to be. But if you’re giving it a bunch of different training data – from the works of Shakespeare to mathematical textbooks – the model is going to gain some level of understanding in order to get better at predicting what word comes next. 
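
A toy sketch of that objective – a simple word-counting model, nothing like the neural networks inside ChatGPT, but the same next-word prediction goal:

```python
# Toy next-word predictor built by counting word pairs (illustration only).
from collections import defaultdict

corpus = "to be or not to be that is the question".split()

# "Train" by counting which word follows which in the corpus.
next_word_counts = defaultdict(lambda: defaultdict(int))
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    # Pick the most frequent continuation seen during training.
    candidates = next_word_counts[word]
    return max(candidates, key=candidates.get) if candidates else "<unknown>"

print(predict_next("to"))    # be
print(predict_next("that"))  # is
```

Real LLMs replace this counting with billions of learned parameters and vastly more data, which is where the surprising capabilities come from.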
 
We don’t specify hard-coded rules for what these models are supposed to learn, so we don’t have that much control over what the model generates. One example of this is hallucinations, where models such as ChatGPT create plausible but false claims. 
 
What is artificial general intelligence (AGI) and what are some of the existential concerns about it? 
 
There are many definitions, but in a general sense, AGI refers to the potential that we develop an AI system that performs most tasks that require intelligence better than or at the same level as humans. 
 
People who believe this might happen are concerned about whether these models are going to be aligned with human values. In other words, if they’re more intelligent than the average human, it’s not clear that they’ll actually help us. 
 
Some sci-fi ideas about AIs taking over the world or hurting a lot of humans are getting a lot of media attention. One reason people think this might happen is that an AI can often pursue its objectives more effectively if it has more resources. Hypothetically, an AI system might decide that manipulating humans, or hurting them in some way, would make it easier to acquire those resources. This scenario is not going to happen today, but the potential risk is why luminaries such as Geoffrey Hinton emphasize the importance of studying and better understanding the models we are training. 
 
How are U of T researchers working to tackle the short- and long-term risks of AI? 
 
There are five key areas of AI alignment research: specification, interpretability, monitoring, robustness and governance. The Schwartz Reisman Institute is at the forefront of bringing together people from different disciplines to try to steer this technology in a positive direction.  
 
In the case of specification, a common approach to the problem of reward misspecification is reinforcement learning from human feedback, a technique that lets models learn from human judgments of their outputs. This is already being put into practice in training LLMs like ChatGPT. Going forward, some researchers are looking for ways to encode a set of human principles for future advanced models to follow. An important question that we can all think about is: alignment to whom? What sort of guidelines do we want these models to follow?  
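
As a hedged illustration of the core idea – not how any production system is actually trained – a tiny reward model can be nudged, using a handful of invented numbers, to score a human-preferred answer above a rejected one:

```python
# Hedged illustration with invented numbers: nudge a reward model to prefer
# the answer a human preferred (a Bradley-Terry-style pairwise loss).
import math

preferred = [0.9, 0.2]   # hypothetical feature scores of the human-preferred answer
rejected  = [0.4, 0.8]   # hypothetical feature scores of the rejected answer

weights = [0.0, 0.0]     # reward-model parameters to be learned
lr = 0.5

def score(features):
    return sum(w * f for w, f in zip(weights, features))

for _ in range(100):
    # Gradient step that pushes score(preferred) above score(rejected).
    margin = score(preferred) - score(rejected)
    p_correct = 1 / (1 + math.exp(-margin))
    for i in range(len(weights)):
        weights[i] += lr * (1 - p_correct) * (preferred[i] - rejected[i])

print(score(preferred) > score(rejected))  # True: the model now agrees with the human
```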
 
Then there’s interpretability. A lot of these giant models, like ChatGPT, might have millions or even billions of parameters. The model takes an input and computes a complicated mathematical function – shaped by all those parameters – to give us the output, but we’re not always sure what happens in this “black box” in the middle. The goal of interpretability is to better understand how a model arrives at a given decision. For example, Roger Grosse, an associate professor in the department of computer science in the Faculty of Arts & Science and a faculty affiliate at SRI, and his students are researching influence functions, which aim to identify which training examples are most responsible for producing a certain output. 
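
As a heavily simplified stand-in – not the influence-function method itself – one crude way to ask which training example mattered for a prediction is to compare per-example gradients with the test example's gradient, here on a toy one-parameter model:

```python
# Simplified stand-in for "which training point mattered?" (not influence
# functions proper): compare per-example gradients with the test gradient.
# Toy model y = w * x with squared-error loss, so gradients are analytic.

w = 0.8  # hypothetical trained parameter

def grad(x, y):
    # d/dw of 0.5 * (w * x - y) ** 2  =  (w * x - y) * x
    return (w * x - y) * x

training_set = [(1.0, 1.0), (2.0, 1.5), (3.0, 3.2)]
test_point = (2.5, 1.0)

test_grad = grad(*test_point)
scores = [(x, y, grad(x, y) * test_grad) for x, y in training_set]

# Larger magnitude = the training point's gradient is more aligned (or
# anti-aligned) with the test gradient, a rough proxy for responsibility.
for x, y, s in sorted(scores, key=lambda t: -abs(t[2])):
    print(f"training point ({x}, {y}): score {s:+.2f}")
```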
 
Another area is monitoring. Due to the presence of emergent behaviours, sometimes we don’t actually know what a new model is capable of until a bunch of different researchers and practitioners poke around and figure it out. This area of research aims to create systematic ways to understand how capable a model actually is. For example, PhD students Yangjun Ruan and Honghua Dong are among the U of T researchers who co-authored a paper that used simulation testing to evaluate the safety risks that could arise from giving current LLMs access to tools such as email and bank accounts. 
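
A hypothetical sketch of the general approach – not the cited paper's actual framework – is to hand the agent a simulated tool, record every call it makes, and check the trace against a safety rule:

```python
# Hypothetical monitoring harness (not the cited paper's framework): give the
# agent a simulated tool, log every call, and check the trace against a rule.

class SimulatedBank:
    def __init__(self, balance: float):
        self.balance = balance
        self.calls = []          # full trace of what the agent tried to do

    def transfer(self, amount: float, to: str):
        self.calls.append(("transfer", amount, to))
        self.balance -= amount   # simulated only; no real money moves

def violates_policy(calls, limit: float = 100.0) -> bool:
    # Safety rule: flag any transfer above the user's stated limit.
    return any(op == "transfer" and amount > limit for op, amount, _ in calls)

bank = SimulatedBank(balance=500.0)
# Stand-in for a risky agent action when it was only asked to pay a $40 bill.
bank.transfer(400.0, "unknown-recipient")

print(violates_policy(bank.calls))  # True: caught in simulation, not in the real world
```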
 
Robustness broadly refers to making sure that AI models are resistant to unusual events or manipulation by bad actors. That means models shouldn’t be sensitive to small changes in their inputs and should behave consistently across a variety of circumstances. SRI Faculty Affiliate Nicolas Papernot, an assistant professor in the Edward S. Rogers Sr. department of electrical and computer engineering in the Faculty of Applied Science & Engineering and the department of computer science, has been working on trustworthy machine learning, which seeks to address some of these challenges. 
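
A minimal sketch of the underlying question, using a toy decision rule rather than a real model: does a tiny perturbation of the input flip the decision?

```python
# Toy robustness check (illustration only): does a tiny input perturbation
# flip the model's decision?
import random

def toy_classifier(x: float) -> str:
    # Hypothetical decision rule with a hard threshold.
    return "approve" if x >= 0.50 else "deny"

def is_locally_stable(x: float, eps: float = 0.01, trials: int = 200) -> bool:
    base = toy_classifier(x)
    return all(toy_classifier(x + random.uniform(-eps, eps)) == base
               for _ in range(trials))

print(is_locally_stable(0.80))   # True: far from the boundary, small noise is harmless
print(is_locally_stable(0.505))  # almost certainly False: near the boundary, decisions are fragile
```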
 
Finally, there’s governance. Many countries are trying to develop rules for how AI should be regulated. For example, SRI Chair Gillian Hadfield has been influential in pushing for policies to curb the dangers of highly capable frontier AI models. There’s also research on the technical side into tools that hold AI developers accountable. PhD student Dami Choi and Associate Professor David Duvenaud recently co-authored a paper developing a method to check that a model was actually trained on the data its developer claims.