The ‘Unsolved’ Problems in Machine Learning

Artificial intelligence and machine learning are solving many real-world problems, but a number of “unsolved” problems in these fields stem from fundamental limitations that have yet to be resolved. Developers dig deep into individual domains of machine learning and deliver small, incremental improvements, yet the obstacles to further progress persist.

A recent discussion on Reddit brought together several developers from across the AI/ML landscape to talk about some of these “important” and “unsolved” problems which, once solved, are likely to pave the way for significant improvements in these fields.

Uncertainty prediction 

Arguably, the most important aspect of creating a machine learning model is gathering information from reliable and abundant sources. Newcomers to machine learning, particularly those coming from computer science and software engineering, often struggle with imperfect or incomplete information, which is inevitable in the field.

“Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory,” said Andyk Maulana in his book series, ‘Adaptive Computation and Machine Learning’.

Three major sources of uncertainty in machine learning are:

  • Presence of noise in data: Observations in machine learning, referred to as “samples” or “instances”, often contain variability and randomness that ultimately affect the output.
  • Incomplete coverage of the domain: Models are trained on observations that are incomplete by default, as they are only a “sample” of a larger, unattainable dataset.
  • Imperfect models: “All models are wrong but some are useful,” said George Box. There is always some error in every model.

Check out a research paper by Francesca Tavazza on uncertainty prediction for machine learning models here.
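
To make the sources of uncertainty listed above concrete, here is a minimal sketch of one common way to estimate predictive uncertainty: a bootstrap ensemble. It is an illustration only and is not taken from the paper mentioned above. The data is synthetic, and the helper names (fit_line, bootstrap_predict) are hypothetical. Each model is fitted to a resampled copy of the noisy data, and the spread of the predictions serves as a rough uncertainty estimate.

```python
import random
import statistics


def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # degenerate resample: every x identical
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def bootstrap_predict(points, x_new, n_models=200, seed=0):
    """Return (mean, standard deviation) of predictions at x_new
    from models fitted to bootstrap resamples of the data."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in points]
        slope, intercept = fit_line(resample)
        predictions.append(slope * x_new + intercept)
    return statistics.fmean(predictions), statistics.stdev(predictions)


# Synthetic noisy observations of y = 2x + 1 for x between 0 and 10
# (assumed purely for illustration).
data_rng = random.Random(42)
data = [(i * 0.5, 2 * (i * 0.5) + 1 + data_rng.gauss(0, 1.0)) for i in range(21)]

mean_in, std_in = bootstrap_predict(data, x_new=5.0)     # inside the sampled range
mean_out, std_out = bootstrap_predict(data, x_new=50.0)  # far outside it

print(f"x=5:  prediction {mean_in:.2f}, spread {std_in:.2f}")
print(f"x=50: prediction {mean_out:.2f}, spread {std_out:.2f}")
```

The spread is larger at x=50 than at x=5, which reflects the second source of uncertainty: the sample covers only part of the domain, so the models disagree more the further they are asked to extrapolate.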

Convergence time and low-resource learning systems

Training a model and then running inference with it requires a large amount of resources. Reducing the convergence time of neural networks and keeping resource requirements low are goals that pull against each other. Developers might be able to build technology that is groundbreaking in its applications but that demands huge amounts of hardware, storage, and electricity.

For example, language models require vast amounts of data. The ultimate goal of human-level interaction requires training on a massive scale, which means longer convergence times and greater resource requirements for training.

A key factor in the development of machine learning algorithms is scaling up the amount of input data, which arguably increases the accuracy of a model. But the recent success of deep learning models shows that achieving this takes more powerful processors and more resources, so developers end up continuously juggling the two problems.
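
As a rough illustration of what “convergence time” means in practice, the sketch below counts how many update steps gradient descent needs on a toy two-parameter problem, with and without momentum. The loss function, learning rate, and momentum value are assumptions chosen for the example and do not come from the discussion above. Real networks behave far less predictably.

```python
import math

# Toy loss f(w) = 0.5 * (1 * w0^2 + 100 * w1^2). The very different
# curvatures make plain gradient descent slow along the flat direction.
CURVATURES = (1.0, 100.0)


def gradient(w):
    return [c * wi for c, wi in zip(CURVATURES, w)]


def steps_to_converge(lr, momentum=0.0, tol=1e-6, max_steps=100_000):
    """Count updates until the gradient norm drops below tol."""
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = gradient(w)
        # Heavy-ball update; momentum=0.0 gives plain gradient descent.
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
        if math.hypot(*gradient(w)) < tol:
            return step
    return max_steps


plain_steps = steps_to_converge(lr=0.01)
momentum_steps = steps_to_converge(lr=0.01, momentum=0.9)

print("plain gradient descent:", plain_steps, "steps")
print("with momentum:         ", momentum_steps, "steps")
```

On this toy problem the momentum variant reaches the tolerance in fewer steps at the same learning rate. Each saved step is compute that does not have to be spent, which is why convergence time and resource use are so closely linked.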


Overfitting

Recent text-to-image generators like DALL-E or Midjourney showcase what overfitting to input and training data can look like.

Overfitting, which can also result from noise in the data, occurs when a learning model picks up random fluctuations in the training data and treats them as genuine concepts. This leads to errors and harms the model’s ability to generalise.

To counter this problem, most non-parametric and non-linear models include parameters or techniques that limit how much detail the model learns. Even then, fitting a model well to a dataset is difficult in practice. Two suggested techniques to limit overfitting are:

  • Using resampling techniques to gauge model accuracy: ‘K-fold cross validation’ is the most popular resampling technique. It allows developers to train and test models several times on different subsets of the training data.
  • Holding back a validation dataset: After tuning the machine learning algorithm on the initial dataset, developers run the model on a held-back validation dataset to check how it would perform on previously unseen data.
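
The two techniques above can be sketched together in a few lines. This is a minimal illustration under stated assumptions: the data is synthetic, the model is a simple k-nearest-neighbours regressor picked because it is non-parametric, and the helper names (knn_predict, mse, k_fold_mse) are hypothetical rather than part of any library.

```python
import math
import random
import statistics


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def mse(train, test, k):
    """Mean squared error of k-NN predictions on the test points."""
    return statistics.fmean([(knn_predict(train, x, k) - y) ** 2 for x, y in test])


def k_fold_mse(data, k, n_folds=5):
    """Average test error over n_folds different train/test splits."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test_fold in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(train, test_fold, k))
    return statistics.fmean(scores)


# Synthetic noisy samples of sin(x) (assumed purely for illustration).
rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(120)]
points = [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

# Hold back the last 20 points; tune only on the first 100.
tuning_set, validation_set = points[:100], points[100:]

# With k=1 the model memorises the tuning data, noise included.
train_error_k1 = mse(tuning_set, tuning_set, 1)

# Cross-validation scores each candidate k on data it was not fitted to.
cv_error = {k: k_fold_mse(tuning_set, k) for k in (1, 3, 7, 15)}
best_k = min(cv_error, key=cv_error.get)

# Final check on the held-back data, used once after tuning.
validation_error = mse(tuning_set, validation_set, best_k)

print("training error with k=1:", train_error_k1)
print("cross-validated error per k:", cv_error)
print("chosen k:", best_k, "validation error:", validation_error)
```

The training error for k=1 is zero because every point is its own nearest neighbour, while the cross-validated error for the same setting is not. That gap is the signature of overfitting, and it is invisible when a model is scored only on the data it was trained on.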

Estimating causality instead of correlations

Causal inference comes naturally to humans. Machine learning algorithms like deep neural networks are great at analysing patterns in huge datasets but struggle to make causal inferences. This shows up in fields like computer vision, robotics, and self-driving cars, where models, though capable of recognising patterns, do not comprehend the physical properties of objects in their environment. As a result, they make predictions about situations they have seen rather than actively dealing with novel ones.

Researchers from the Max Planck Institute for Intelligent Systems, along with Google Research, published a paper, ‘Towards Causal Representation Learning’, which discusses the challenges machine learning algorithms face due to the lack of causal representation. According to the researchers, to compensate for the absence of causality in machine learning models, developers try to increase the amount of data on which the models are trained, but fail to see that this eventually leads to models that recognise patterns rather than independently “think”.

The introduction of “inductive bias” into models is believed to be a step towards building causality into machines. But that, arguably, can be counterproductive in building AI that is free of bias.
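
The gap between correlation and causation can be shown with a small simulation. The sketch below is a toy example, not taken from the paper: a hidden variable Z drives both X and Y, while X has no effect on Y at all. The variable names and noise levels are assumptions made for illustration.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n, rng, intervene=False):
    """Z causes both X and Y; X never influences Y."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden common cause
        if intervene:
            x = rng.gauss(0, 1)  # X is set by the experimenter, ignoring Z
        else:
            x = z + rng.gauss(0, 0.3)  # X merely follows Z
        y = 2 * z + rng.gauss(0, 0.3)  # Y depends on Z only
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(1)
observed_corr = pearson(*simulate(5000, rng))
intervened_corr = pearson(*simulate(5000, rng, intervene=True))

print(f"correlation in observed data:   {observed_corr:.2f}")
print(f"correlation under intervention: {intervened_corr:.2f}")
```

A pattern-matching model trained on the observed data would happily predict Y from X, and would be right as long as nothing changes. Once X is set independently of Z, the relationship vanishes, which is the kind of novel situation that more data of the same kind does not prepare a model for.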


Reproducibility

Because AI/ML is seen as the most promising tool in almost every field, many newcomers dive straight into it without fully grasping the intricacies of the subject. While poor reproducibility, or replicability, is a combined outcome of the problems mentioned above, it still poses great challenges for newly developed models.

Due to a lack of resources and a reluctance to conduct extensive trials, many algorithms fail when tested and implemented by other expert researchers. Big companies offering hi-tech solutions do not always publicly release their code, leaving new researchers to experiment on their own and propose solutions to large problems without rigorous testing, and so these solutions lack reliability.
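
One small, concrete piece of the reproducibility puzzle is controlling randomness. The sketch below uses a hypothetical toy training routine (train_toy_model is not from any library) to show that a run can be repeated exactly when its random seed is fixed and reported, and that changing the seed alone changes the result. It does not address the harder issues of unreleased code or insufficient trials.

```python
import random


def train_toy_model(seed, n_steps=1000, lr=0.05):
    """Fit a single weight to y = 3x with stochastic gradient descent.

    All randomness (initial weight, sampled data, noise) comes from one
    generator seeded with `seed`, so the run is fully determined by it.
    """
    rng = random.Random(seed)
    weight = rng.uniform(-1, 1)  # random initialisation
    for _ in range(n_steps):
        x = rng.uniform(-1, 1)
        y = 3 * x + rng.gauss(0, 0.1)  # noisy training target
        weight -= lr * 2 * (weight * x - y) * x  # gradient of squared error
    return weight


run_a = train_toy_model(seed=123)
run_b = train_toy_model(seed=123)  # same seed: identical result
run_c = train_toy_model(seed=456)  # different seed: slightly different result

print("seed 123:", run_a)
print("seed 123 again:", run_b)
print("seed 456:", run_c)
```

Reporting the seed alongside the hyperparameters lets another researcher rerun exactly the same experiment. Reporting results across several seeds shows how much of a claimed improvement is down to chance.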
