When it comes to training AI models, bigger datasets may not always be better: U of T study

“We need to pay attention to the information richness, rather than just gathering as much data as we can” 

A new study by researchers at U of T Engineering suggests that models trained on relatively small datasets can perform well if the data is of high enough quality (photo by Jasmin Merdan/Getty Images)

A new study by researchers at the University of Toronto suggests that one of the fundamental assumptions of deep learning artificial intelligence models – that they require enormous amounts of training data to make accurate predictions – may not be as solid as once thought.   

Jason Hattrick-Simpers, a professor in the department of materials science and engineering in the Faculty of Applied Science & Engineering, and his team are focused on the design of next-generation materials – from catalysts that convert captured carbon into fuels to non-stick surfaces that keep airplane wings ice-free.  

Their findings, recently published in the journal Nature Communications, stemmed from efforts to navigate a key challenge in the field: the enormous potential search space. The Open Catalyst Project, for instance, contains more than 200 million data points for potential catalyst materials – yet that still covers only a tiny portion of the vast chemical space that could yield, for example, the right catalyst to help us address climate change.

“AI models can help us efficiently search this space and narrow our choices down to those families of materials that will be most promising,” says Hattrick-Simpers.  

“Traditionally, a significant amount of data is considered necessary to train accurate AI models. But a dataset like the one from the Open Catalyst Project is so large that you need very powerful supercomputers to be able to tackle it. So, there’s a question of equity – we need to find a way to identify smaller datasets that folks without access to huge amounts of computing power can train their models on.”  

This leads to a second challenge: many of the smaller materials datasets currently available have been developed for a specific domain – for example, improving the performance of battery electrodes. In other words, the data tend to cluster around a few chemical compositions similar to those already in use while missing more promising possibilities that may be less obvious.  

“Imagine if you wanted to build a model to predict students’ final grades based on previous test scores,” says Kangming Li, a postdoctoral researcher in Hattrick-Simpers’ lab.  

“If you trained it only on students from Canada, it might do perfectly well in that context, but it might fail to accurately predict grades for students from France or Japan. That’s the situation we are up against in the world of materials.”  

One possible solution is to identify subsets of data from within very large datasets that are easier to process, but which nevertheless retain the full range of information and diversity present in the original.  

To better understand how the qualities of datasets affect the models they are used to train, Li designed methods to identify high-quality subsets of data from previously published materials datasets, such as JARVIS, The Materials Project, and the Open Quantum Materials Database (OQMD). Together, these databases contain information on more than a million different materials.  

Li built a computer model that predicted material properties and trained it in two ways: once on the original dataset and once on a subset of that same data that was approximately 95 per cent smaller.

“What we found was that when trying to predict the properties of a material that was contained within the domain of the dataset, the model that had been trained on only 5 per cent of the data performed about the same as the one that had been trained on all the data,” Li says.  

“Conversely, when trying to predict the properties of a material that was outside the domain of the dataset, both of them did similarly poorly.”  
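The comparison Li describes can be illustrated with a toy experiment. The sketch below is not the study's code or data: it assumes a synthetic one-feature "property" (a noisy sine curve), a cubic polynomial as the model, and a random 5 per cent subset in place of the subset-selection methods Li designed, which the article does not detail. All names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, low, high):
    """Toy stand-in for a materials dataset: one feature x, one noisy property y."""
    x = rng.uniform(low, high, n)
    y = np.sin(x) + rng.normal(0.0, 0.1, n)
    return x, y

def rmse(coeffs, x, y):
    """Root-mean-square error of a fitted polynomial on the points (x, y)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

# The training data covers only the "domain" 0 <= x <= 3.
x_train, y_train = make_data(20_000, 0.0, 3.0)
x_in, y_in = make_data(2_000, 0.0, 3.0)    # test points inside the domain
x_out, y_out = make_data(2_000, 4.0, 6.0)  # test points outside the domain

# Assumption: a random 5 per cent subset stands in for a curated subset.
subset = rng.choice(x_train.size, size=x_train.size // 20, replace=False)

models = {
    "full": np.polyfit(x_train, y_train, deg=3),
    "subset": np.polyfit(x_train[subset], y_train[subset], deg=3),
}

results = {
    name: {
        "in_domain": rmse(coeffs, x_in, y_in),
        "out_of_domain": rmse(coeffs, x_out, y_out),
    }
    for name, coeffs in models.items()
}

for name, scores in results.items():
    print(name, scores)
```

In this toy setting the two models should score almost identically inside the training domain, and both should degrade sharply outside it. That mirrors the pattern Li reports, though it says nothing about how large the effect is on real materials data.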

Li says that the findings suggest a way of measuring the amount of redundancy in a given dataset: if more data does not improve model performance, it could be an indicator that those additional data are redundant and do not provide new information for the models to learn.   

“Our results also reveal a concerning degree of redundancy hidden within these highly sought-after large datasets,” Li adds.    

The study underscores what AI experts from many fields are now discovering: even models trained on relatively small datasets can perform well if the data is of high enough quality.

“All this grew out of the fact that in terms of using AI to speed up materials discovery, we’re just getting started,” says Hattrick-Simpers.  

“What it suggests is that as we go forward, we need to be really thoughtful about how we build our datasets. That’s true whether it’s done from the top down, as in selecting a subset of data from a much larger dataset, or from the bottom up, as in sampling new materials to include.  

“We need to pay attention to the information richness, rather than just gathering as much data as we can.” 
