There is more to AI than data-hungry AI

February 23, 2022

The European Union is regulating artificial intelligence (AI) systems under incorrect assumptions that could result in undesirable outcomes. The EU assumes that more data is always useful to AI, and it is incentivising a “big data” approach.

AI is not restricted to data-hungry techniques. There are several alternative techniques that rely on less data and require less computation and less memory. They are more in line with the principle of data minimisation in the GDPR. The Commission should take a smarter approach and incentivise AI techniques that require less data, that benefit society and that assist with climate change mitigation.

More data is not always useful for AI

The European Commission believes “AI can only thrive when there is smooth access to data [emphasis added].”¹ It believes that it is in “the nature of AI” to rely on “large and varied datasets [emphasis added].”² It believes more data automatically produces more insight and understanding. The AI Act, the Data Governance Act and the Data Act assume that more data is needed for AI.³ These beliefs are wrong.

What matters is not how large the datasets are, but the statistical properties of datasets such as how varied the data are. The context in which data is collected is also important. However, the Commission seems to believe that all data is interchangeable. Instead of focusing on “re-use, sharing and pooling of data,”⁴ the Commission should recognise that more data is not always useful to AI.

“The more the data, the surer we fool ourselves.”⁵ Quantity of data is not a metric of AI innovation. Many AI techniques do not require large amounts of data. In fact, innovations in mathematical techniques enable powerful AI models based on small datasets.⁶
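The point that data quality can outweigh data quantity can be illustrated with a small simulation. The sketch below is not taken from any of the cited works. The population, the 55% versus 45% recording rates and the function name `simulate` are all hypothetical, chosen only to show how a large but slightly biased collection can be further from the truth than a small random sample.

```python
import random


def simulate(population_size=200_000, small_n=2_000, seed=0):
    """Compare a large biased sample with a small random sample.

    All numbers are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)

    # Hypothetical population: each person answers yes (True) or no (False),
    # and roughly half of them answer yes.
    population = [rng.random() < 0.5 for _ in range(population_size)]
    true_rate = sum(population) / population_size

    # Large but biased collection: people who answer yes are slightly more
    # likely to end up in the dataset (assumed 55% versus 45%).
    big_sample = [y for y in population if rng.random() < (0.55 if y else 0.45)]
    big_estimate = sum(big_sample) / len(big_sample)

    # Small simple random sample: everyone is equally likely to be included.
    small_sample = rng.sample(population, small_n)
    small_estimate = sum(small_sample) / small_n

    return {
        "true_rate": true_rate,
        "big_n": len(big_sample),
        "big_error": big_estimate - true_rate,
        "small_n": small_n,
        "small_error": small_estimate - true_rate,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value}")
```

Under these assumptions the biased collection is many times larger than the random sample, yet its estimate stays off by a roughly constant amount that more data of the same kind would not remove.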

Data-hungry AI techniques are expensive

It is true that some AI techniques are data hungry. These techniques, such as deep neural networks, demand vast computation and memory, which are expensive. They need an infrastructure to collect, store and process lots of data. Despite all these resources, these data-hungry AI techniques produce diminishing returns with more data.⁷ Exorbitant amounts of data are needed to make expensive marginal improvements.
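The diminishing returns described above can be made concrete with a toy model. Footnote 7 describes performance improving linearly when training examples increase exponentially, so the sketch below assumes accuracy grows with the logarithm of the dataset size. The starting accuracy, the gain per tenfold increase and the function names are hypothetical values chosen for illustration, not figures from the cited paper.

```python
import math


def accuracy(n_examples, base=60.0, gain_per_tenfold=5.0, base_examples=1e6):
    """Toy log-linear scaling: each tenfold increase in data adds a fixed gain.

    The default values are illustrative assumptions only.
    """
    return base + gain_per_tenfold * math.log10(n_examples / base_examples)


def examples_needed(target, base=60.0, gain_per_tenfold=5.0, base_examples=1e6):
    """Invert the toy model: how many examples reach a target accuracy."""
    return base_examples * 10 ** ((target - base) / gain_per_tenfold)


if __name__ == "__main__":
    for target in (60.0, 65.0, 70.0, 75.0):
        print(f"{target:.0f}% accuracy needs about {examples_needed(target):,.0f} examples")
```

In this toy model every additional five points of accuracy requires ten times as much data as the previous five, which is why marginal improvements become so costly.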

The process of discovering a large model that performs well involves a range of steps, including searching for the optimal parameters, testing them, and training multiple prototypes. Merely training a large model with a set of parameters that has already been optimised by another team could cost more than $2 million. A recently published study on large models trained more than 1000 variations to find one optimised model.⁸

Furthermore, large models require large memory. Limited memory limits the model size. Specialised hardware with large memory can accelerate the training process. However, the specialised hardware required to train the models could cost hundreds of thousands of dollars. Even worse, specialised hardware does not generalise to different computation tasks, thus requiring further investment for newer models.

This means that the actual cost of discovering just one large model that performs well could be in the millions or even billions of dollars. Currently, such costs are affordable only to a small set of companies such as Facebook, Google and OpenAI. Such costs may create barriers for small and medium-sized enterprises (SMEs).
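The range from millions to billions can be reproduced with back-of-the-envelope arithmetic. The sketch below combines the two figures mentioned earlier, a training run of about $2 million and a search over about 1000 variations, under the simplifying assumption that every variation costs the same. That assumption and the cheaper scenario are illustrative only, and the function name `search_cost` is hypothetical.

```python
def search_cost(cost_per_run_usd, num_variations, hardware_usd=0):
    """Total cost of a model search, assuming every run costs the same.

    This is a rough upper-level estimate, not an accounting of any real project.
    """
    return cost_per_run_usd * num_variations + hardware_usd


if __name__ == "__main__":
    # Assumed worst case: each of 1000 variations costs as much as a full run.
    print(f"Full-cost variations: ${search_cost(2_000_000, 1000):,}")
    # Assumed cheaper case: each variation costs 1% of a full run.
    print(f"Cheap variations:     ${search_cost(20_000, 1000):,}")
```

Depending on how expensive each variation is assumed to be, the total lands anywhere between tens of millions and billions of dollars.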

We should also consider the carbon footprint of training a large model, which is estimated to be about five times the carbon footprint of a car over its lifetime. This, along with large amounts of data being used to surveil humans, is an immense cost for our society to pay for marginal gains in performance of data-hungry AI models. Thus, we should incentivise the use of AI techniques that require less data.

¹ Communication on Fostering a European approach to Artificial Intelligence, 21 April 2021, p. 2.
² Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, 21 April 2021, p. 6.
³ Ibid., p. 5: “the promotion of AI-driven innovation is closely linked to the Data Governance Act, the Open Data Directive and other initiatives under the EU strategy for data, which will establish trusted mechanisms and services for the re-use, sharing and pooling of data that are essential for the development of data-driven AI models of high quality.”
⁴ Ibid., p. 5.
⁵ Meng, Xiao-Li. “Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.” The Annals of Applied Statistics 12.2 (2018): 685-726.
⁶ See Ilia Sucholutsky and Matthias Schonlau. ‘Less Than One’-Shot Learning: Learning N Classes From M < N Samples. AAAI 2021: 9739-974; Husanjot Chahal, et al. Small Data’s Big AI Potential (Center for Security and Emerging Technology, September 2021).
⁷ Also see Fig 2 in Dhruv Mahajan, et al. Exploring the Limits of Weakly Supervised Pretraining. ECCV (2) 2018: 185-201, which shows that performance improves linearly when the training examples are increased exponentially.
⁸ See Vincent J. Hellendoorn, et al. Global Relational Models of Source Code. ICLR 2020.