The AI gap: from good accuracy to bad decisions

Intelligent agents rely on AI/ML functionalities to predict the consequences of possible actions and to optimise the policy. However, the effort of the research community in addressing prediction accuracy has been so intense (and successful) that it has created the illusion that the more accurate the learner's prediction (or classification), the better the final decision would be. Such an assumption, however, is valid only if the (human or artificial) decision maker has complete knowledge of the utility of the possible actions.

This paper argues that the AI/ML community has so far taken a too unbalanced approach, devoting excessive attention to the estimation of the state (or target) probability to the detriment of an accurate and reliable estimation of the utility. In particular, little evidence exists about the impact of a wrong utility assessment on the resulting expected utility of the decision strategy. This situation is creating a substantial gap between the expectations and the effective impact of AI solutions, as witnessed by recent criticisms and emphasised by regulatory and legislative efforts.

This paper aims to study this gap by quantifying the sensitivity of the expected utility to utility uncertainty and comparing it to the sensitivity due to probability estimation. Theoretical and simulated results show that an inaccurate utility assessment may be as harmful as (and sometimes more harmful than) a poor probability estimation. The final recommendation to the community is to shift focus from a purely accuracy-driven (or accuracy-obsessed) approach to a more utility-aware methodology.
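As a toy illustration of the kind of comparison meant here (a minimal sketch, not the paper's actual experiments), the snippet below confronts a two-action, two-state decision maker with noise injected either in the estimated probability or in the assessed utilities, and measures the resulting expected utility. The payoff matrix and noise scales are hypothetical choices of mine.

```python
# Toy sketch (not the paper's experiments): how noise in the probability
# estimate vs. noise in the utility assessment degrades the expected utility
# of a simple two-action, two-state decision. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)

p_true = 0.3                       # true probability of the "bad" state
U_true = np.array([[10.0, -50.0],  # utility[action, state]: action 0 = act
                   [ 0.0,   0.0]]) # action 1 = abstain

def true_eu(action):
    """Expected utility of an action under the true probability and utilities."""
    return (1 - p_true) * U_true[action, 0] + p_true * U_true[action, 1]

def decide_and_score(p_hat, U_hat):
    """Pick the action that looks best under the (possibly wrong) estimates,
    then score it with the true probability and the true utilities."""
    scores = (1 - p_hat) * U_hat[:, 0] + p_hat * U_hat[:, 1]
    return true_eu(int(np.argmax(scores)))

n, sigma = 10_000, 0.2
eu_noisy_p = np.mean([decide_and_score(np.clip(p_true + sigma * rng.standard_normal(), 0, 1), U_true)
                      for _ in range(n)])
eu_noisy_U = np.mean([decide_and_score(p_true, U_true + sigma * np.abs(U_true) * rng.standard_normal(U_true.shape))
                      for _ in range(n)])
print(f"EU with exact knowledge   : {decide_and_score(p_true, U_true):.2f}")
print(f"EU with noisy probability : {eu_noisy_p:.2f}")
print(f"EU with noisy utilities   : {eu_noisy_U:.2f}")
```

The relative harm obviously depends on the chosen noise scales; the point is simply that utility uncertainty enters the decision in exactly the same way as probability uncertainty and therefore deserves the same attention.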

Inductive strategies and non-inductive bias

The justification of induction is probably the most discussed topic in epistemology: there is no way a few-lines blog post can solve it, but the idea is that some nice insight may be obtained with a simple derivation based on the notion of mean-squared estimation error. In particular, we address here the (somewhat puzzling) no-free-lunch statement (by Wolpert, here) about the non-superiority of cross-validation with respect to anti-cross-validation strategies if no assumption about the target distribution is made.

The derivation shows that, while it may indeed happen that an anti-cross-validation (or non-inductive) strategy outperforms a cross-validation (or inductive) strategy, this requires a sort of favourable “non-inductive” bias.

Suppose that the target is the quantity \theta_{ts} and that we are in an off-training setting, i.e. one where the target distribution is completely outside the researcher’s control. The only information accessible to the researcher is a dataset D_{N} sampled from a parametric distribution with parameter \theta_{tr} \neq \theta_{ts}. In an estimation setting, the problem of induction can be formalised in terms of the estimation error that an inductive approach based on D_{N} makes when targeting \theta_{ts}. More particularly, is an inductive approach always better than a non-inductive one?

So the first issue is to define properly what we mean by an inductive approach. I will call inductive any learning strategy whose goal is to minimise the mean-squared error on the basis of the training set, i.e. the quantity

MSE_{tr}(\theta_{tr})= E_{D_{N}}[ (\theta_{tr} - \hat{\boldsymbol \theta})^2]

where \hat{\boldsymbol \theta} is random (bold notation), since it is a function of the training set D_N, and \theta_{tr} is considered fixed.

In the same vein, I will call non-inductive any strategy which does not target the minimisation of the mean-squared error in the training setting. According to this definition, cross-validation is an inductive strategy while anti-cross-validation (defined by Wolpert here) is non-inductive. According to Wolpert, for any off-training scenario where cross-validation is superior to anti-cross-validation, it is possible to define another scenario where the reverse is true.

Let us analyse this statement from an MSE perspective where both {\boldsymbol \theta}_{tr} and {\boldsymbol \theta}_{ts} are random.

Let us write the off-training MSE as

MSE_{off}= E_{D_{N}, \theta_{ts}, \theta_{tr}}[ ({\boldsymbol \theta}_{ts} -\hat{\boldsymbol \theta})^2]

It follows

MSE_{off}= E_{D_{N}, \theta_{ts}, \theta_{tr}}[ ({\boldsymbol \theta}_{ts} - {\boldsymbol \theta}_{tr} + {\boldsymbol \theta}_{tr} - \hat{\boldsymbol \theta})^2] = E_{\theta_{ts}, \theta_{tr}}[ ({\boldsymbol \theta}_{ts} - {\boldsymbol \theta}_{tr})^2] + E_{\theta_{tr}}[ MSE_{tr}({\boldsymbol \theta}_{tr})] - 2C

where the first term represents the amount of drift from the training setting to the test setting (which inevitably deteriorates the accuracy), the second term is the average training MSE, and the third term involves the covariance term

C= E_{D_{N}, \theta_{ts}, \theta_{tr}} [ ({\boldsymbol \theta}_{ts} - {\boldsymbol \theta}_{tr}) (\hat{\boldsymbol \theta}- {\boldsymbol \theta}_{tr})  ]

This derivation shows that inductive approaches outperform non-inductive approaches (i.e. approaches which do not aim at minimising MSE_{tr}) whenever the covariance term C is null: in that case MSE_{off} reduces to the drift term (which does not depend on the estimator) plus the average training MSE, so minimising MSE_{tr} is the best one can do.

Now, such a covariance term is indeed related to the alignment between the estimator (or hypothesis) and the test target. As stated by the no-free-lunch theorem, it may indeed happen that non-inductive strategies are more accurate than inductive ones. However, this would necessarily imply that the covariance term is positive, and that can only happen if the non-inductive approach has some proper “non-inductive” bias. It is then interesting to see that, just as the superiority of one learning approach over another relies on the choice of a proper (or lucky) inductive bias, the superiority of a non-inductive approach over an inductive one requires a similarly “strong” assumption. For instance, if you knew in advance that there will be a significant downward shift of {\boldsymbol \theta}_{ts}, it would be more convenient to have a downward-biased estimator of {\boldsymbol \theta}_{tr} than an unbiased one.
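A minimal numerical sketch of this argument, for a hypothetical Gaussian-mean estimation task with a known downward drift: an estimator deliberately biased in the direction of the drift (a favourable “non-inductive” bias) obtains a positive covariance term C and beats the unbiased sample mean, while the decomposition above is verified numerically. All distributions and sample sizes are arbitrary choices of mine.

```python
# Sketch: verify MSE_off = E[(ts - tr)^2] + E[MSE_tr] - 2C on a toy
# Gaussian-mean task, and show that a known downward drift favours a
# deliberately (downward-)biased, i.e. "non-inductive", estimator.
import numpy as np

rng = np.random.default_rng(1)
n_runs, N, drift = 100_000, 20, -1.0                  # theta_ts = theta_tr + drift

theta_tr = rng.normal(0.0, 1.0, n_runs)               # random training parameter
theta_ts = theta_tr + drift                           # shifted test parameter
D = rng.normal(theta_tr[:, None], 1.0, (n_runs, N))   # one training set per run

estimators = {
    "inductive (sample mean)     ": D.mean(axis=1),           # unbiased for theta_tr
    "non-inductive (shifted mean)": D.mean(axis=1) + drift,    # deliberately biased
}
for name, th in estimators.items():
    mse_off = np.mean((theta_ts - th) ** 2)
    mse_tr  = np.mean((theta_tr - th) ** 2)
    C       = np.mean((theta_ts - theta_tr) * (th - theta_tr))
    rhs     = np.mean((theta_ts - theta_tr) ** 2) + mse_tr - 2 * C
    print(f"{name}  MSE_off = {mse_off:.3f}  (decomposition: {rhs:.3f},  C = {C:.3f})")
```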

My 5-cent (reassuring) conclusion is then that, if no assumption about the test distribution is made and an MSE derivation is considered, inductive approaches necessarily outperform non-inductive ones.

Transfer learning and Stein’s estimator

I often have the feeling that amazing ML novelties are just rephrasings of older statistical concepts. I have this feeling in particular when I read about the notion of indirect evidence in Stein’s estimator (1956). The fact that the estimation of the means of three or more independent variables may take advantage of the other estimates is probably the first (and sound) example of transfer learning. Take a look at this paper or the amazing book by Efron and Hastie, Computer Age Statistical Inference.
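For the record, here is a minimal sketch of the (positive-part) James–Stein shrinkage idea on simulated Gaussian means; the dimension, noise level and true means are arbitrary and chosen only to show the joint estimate “borrowing strength” across dimensions.

```python
# Sketch of James-Stein shrinkage: when estimating p >= 3 Gaussian means
# jointly, shrinking the raw sample means towards a common point lowers the
# total squared error -- arguably an early form of transfer learning.
import numpy as np

rng = np.random.default_rng(2)
p, sigma2, n_runs = 10, 1.0, 20_000
theta = rng.normal(0.0, 2.0, p)                  # hypothetical true means

mse_mle = mse_js = 0.0
for _ in range(n_runs):
    x = rng.normal(theta, np.sqrt(sigma2))       # one noisy observation per mean
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / np.sum(x ** 2))
    js = shrink * x                              # positive-part James-Stein (shrink towards 0)
    mse_mle += np.mean((x  - theta) ** 2)
    mse_js  += np.mean((js - theta) ** 2)

print(f"raw sample means  MSE: {mse_mle / n_runs:.3f}")
print(f"James-Stein       MSE: {mse_js  / n_runs:.3f}")
```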

Interpretability and models

Given the increasing impact of models on human daily lives, interpretability is a hot keyword in the Machine Learning community. Issues like the “right to explanation” (which a person targeted by the decision or prediction of an automated decision-maker could invoke) make clear that the question of interpretability goes beyond the purely academic or technical community.

A large part of the work on interpretability deals with the interpretability of an ML model. The motivation of this post is that I personally find it quite odd to argue about the interpretability of a model, and a fortiori of a data-driven ML algorithmic model. To motivate my reasoning I will make a rapid digression about what, though almost forgotten nowadays, used to be an important field of AI: qualitative physics.

Qualitative physics (by the way, my first research topic 🙂 ) was based on the idea that most models used in physics are of no use for explaining physics to laymen. The rationale is that human common sense is quite distant from the complex mathematical formalisms (e.g. differential equations) used to model and predict the behaviour of a physical system. Think about the motion of a pendulum: if you want to explain its behaviour to your 4-year-old child, you will probably be much more successful using some qualitative (and visual) notion of swinging rather than a second-order differential equation including a sinusoidal term…

The model, in this case the differential equation, says very little about the phenomenon to a person with no sophisticated mathematical background: nevertheless, people who know (almost) nothing about mathematics may easily and successfully reason about physical phenomena.

The research approaches discussing interpretable models seem to forget this aspect: by attacking the interpretability issue through the interpretability of the models, they neglect that models have often been conceived for the sake of accuracy/optimisation and address mathematical minds and/or computational engines, not human common sense.

There is also another disturbing aspect in dissecting models to make automatic predictions (or decisions) interpretable: the fact that (good) models are not unique. The famous saying “all models are wrong, some are useful” stresses that several different models may be used to describe a phenomenon and that many of them (though very different) may be chosen for very disparate reasons.

If you want to explain to a bank customer why he did not get the credit (and your predictive ML model is the combination of a neural network and a logistic model), it will be of no use to enter into the details (and the inductive biases) of those two algorithms. ML models are estimators: specifically, in a supervised learning problem they estimate conditional means (or conditional probabilities) describing the relation between input and output variables. Describing the estimation machinery behind the prediction process won’t be of any use to the final user (or victim) of the prediction process.

Interpretability should focus on the phenomenon (i.e. the relation between the descriptors used by the predictor and the target variable) and not on the multiple (heterogeneous) ways of describing it. The higher the dimensionality of the phenomenon, the more the interpretation should focus on representing (e.g. graphically) the relations of (in)dependence between the variables.
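As a hedged sketch of what such a phenomenon-centred (rather than model-centred) summary could look like, the snippet below reports simple model-agnostic dependence measures (correlation and a crude binned mutual-information estimate) between each descriptor and the target on synthetic data; the data-generating choices are mine and purely illustrative.

```python
# Sketch: describe the phenomenon (feature-target dependence), not the model.
# Synthetic data where y depends on x1 only; note that the linear correlation
# of x1 with y is ~0, while the mutual information reveals the dependence.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                     # irrelevant descriptor
y  = x1 ** 2 + 0.3 * rng.normal(size=n)     # phenomenon driven by x1 only

def binned_mi(a, b, bins=20):
    """Crude plug-in estimate of mutual information (in nats) from a 2D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

for name, x in [("x1", x1), ("x2", x2)]:
    print(f"{name}: corr(x, y) = {np.corrcoef(x, y)[0, 1]: .3f}   MI(x; y) ~ {binned_mi(x, y):.3f} nats")
```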

My conclusion is that, since humans may reason about complex behaviours and typically do so without the use of any mathematical formalism, dissecting complex algorithms to explain phenomena is of very little use. So what? How can we provide insight to humans confronted with black-box-driven decisions?

My suggestion is to forget the model(s) and instead target the object of the modelling effort, i.e. the phenomenon. In supervised learning the phenomenon is a probabilistic (since uncertain) relation between observed variables, which (most of the time) is the consequence of a causal relationship between (some of) the observed variables. And causality is a notion that human beings seem to capture much better than algorithmic or mathematical details. Observations, experiments, statistical and machine learning models should have a single common objective: to shed light on the dependence and (better) causal relationships underlying the data. Once discovered (or, more humbly, estimated), this is the information to use to explain the decision to the client. “Dear client, our R&D department reached with high confidence the evidence that the features x and y are causally related to the non-repayment of loans and that the higher those values the lower the probability of a default: therefore, given your status…. we regret to inform you …”. Would such an explanation convince the customer? Probably not, but this is the most honest interpretation that a machine learner should give to a final user. The causal mechanism is supposed to be robust and model-independent.

Of course, the use of features x and y to justify the decision might sometimes be unethical, but this is another domain where causality might help…

Non-falsifiability, conspiracy theories and eternity

The evolutionary advantage of non-falsifiable statements is that they will never be falsified: as such they will never disappear but will periodically resurge. Like dogmatic thinking throughout human history, conspiracy statements resist all denials, all disprovals, all data, all empirical evidence. How can you convince a dogmatic believer that God does not exist? How can you convince a QAnon activist that Satan is not acting inside Bill Gates?

If all scientific statements are caducous, since they are inevitably approximate and meant to be improved or refuted sooner or later, conspiracists have found their own way to immortality. Science submits itself to the Darwinian selection of reality; conspiracy lives in the immortal space of unfalsifiable, and therefore sempiternal, illusions…

Read for negationists

Inspired by this article, I went through “Epistemic Dependence” by John Hardwig. The quotes below could be an interesting read for negationists, flat-earthers, anti-vaxxers, anti-maskers (or should we better call them epistemic individualists?)…

the expert-layman relationship is essential to the scientific and scholarly pursuit of knowledge…

Kant’s statement that one of the three basic rules or maxims for avoiding error in thinking is to “think for oneself”…..this model provides us with a romantic ideal which is thoroughly unrealistic and which, in practice, results in less rational belief and judgment. But if I were to pursue epistemic autonomy across the board, I would succeed only in holding relatively uninformed, unreliable, crude, untested, and therefore irrational beliefs

the layman cannot rationally refuse to defer to the views of the expert or experts he acknowledges. ..

The rational layman recognizes that his own judgment, uninformed by training and inquiry as it is, is rationally inferior to that of the expert (and the community of experts for whom the expert usually speaks) and consequently can always be rationally overruled

We must also either agree that one can know without possessing the supporting evidence or accept the idea that there is knowledge that is known by the community, not by any individual knower.

Goodhart’s law, overfitting and AI

I discovered today, in a very interesting TR article on model short-termism, what is known as Goodhart’s law: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”, which has also been rephrased as “When a measure becomes a target, it ceases to be a good measure“.

This seems to be a very pertinent adage for machine learners, who strive every day against overfitting. In particular, this law reminds me of the concepts of internal and external validation, for instance in model selection, automatic hyper-parameter selection (grid search) or feature selection. After you have run millions of iterations of your preferred grid-search algorithm to minimise a cross-validation function, this (internal) cross-validation measure no longer has much to do with the generalisation error. You then need to measure another cost function (the external cross-validation). In more formal terms, the caducity (and the bias) of the optimised cost function (on which you probably spent tons of gigawatts) is made explicit by statisticians with the following inequality:

E[\min_i z_i] \le \min_i E[z_i]

where z_i could be the (cross-validation) cost function of the i-th model candidate.

This formula means that the quantity that ML minimises (e.g. by model selection) can be a very biased (i.e. optimistic) measure of what it was meant to address (the generalisation error).
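A toy simulation (hypothetical numbers, not tied to any specific experiment) makes the inequality tangible: the internal cross-validation score of the selected model is optimistically biased with respect to both the best true error and the true error of the model actually selected.

```python
# Toy illustration of E[min_i z_i] <= min_i E[z_i]: the internal (optimised)
# cross-validation score of the winning model is an optimistic estimate of its
# true error, hence the need for an external validation.
import numpy as np

rng = np.random.default_rng(4)
n_models, n_repeats, noise = 50, 20_000, 0.05
true_err = rng.uniform(0.20, 0.30, n_models)       # hypothetical true generalisation errors

# z[r, i] = noisy internal cross-validation estimate of model i's error in repeat r
z = true_err + noise * rng.standard_normal((n_repeats, n_models))

internal_score = z.min(axis=1).mean()              # E[min_i z_i]: score reported by grid search
selected = z.argmin(axis=1)                        # model picked in each repeat
external_error = true_err[selected].mean()         # its actual error (what external validation sees)

print(f"E[min_i z_i] (internal score of the winner): {internal_score:.3f}")
print(f"min_i E[z_i] (best achievable true error)  : {true_err.min():.3f}")
print(f"true error of the selected model           : {external_error:.3f}")
```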

But we can see this law as a more general warning for AI practitioners: every time we transform a measure of intelligence into a cost function, it stops being a measure of intelligence. It is probably this “lost in minimisation” effect that explains the sense of inadequacy that humans perceive when they are in front of some AI semblance of human intelligence. What the AI application strove so hard to minimise is no longer what humans consider intelligence…

About the lack of a common-good data strategy to cope with a crisis

My recent experience, trying to preach the importance of data-driven support to crisis decision-making, taught me that the open-data philosophy is still wishful thinking and that a more pragmatic approach should be considered.

As of today, GDPR and confidentiality issues largely reduce the scope, recency and usefulness of open data. At the same time, many companies and institutions (telecoms, payment providers, pharmacies, e-commerce, water/electricity distribution, post, banks) own, at different levels (national, regional, city), data essential to characterise the state and evolution of an epidemic. My feeling is that those same companies/institutions are more than willing to share data for the common good, albeit over a short, limited horizon and under well-defined institutional control. What is still time-intensive is taking advantage of those data, even after agreement with their owners. Most of the time, no strategy for sharing and accessing data is available, and everything relies on personal goodwill.

My feeling is that data should not necessarily be open, but that the owners of (common-good) data should be asked (by policy makers) to organize and structure them in an interoperable manner allowing rapid usage if needed.

Crisis managers should be kept informed about the up-to-date structure of those data and should organize their deployment if needed. Ideally, they could call for volunteer data scientists to assess the value of those data according to contingent forecasting and decision-making needs.

At the end of the day, all those data are simply observations of social behaviours which are lent free of charge to companies and institutions. Although such entities made an effort to collect them (for their own profit), we all produced them. What would make more sense than asking (or requiring) those data owners to at least be ready to deploy them for the common good?

If you are a data scientist then act as a scientist

The No-Free-Lunch theorem stated several years ago already that there is no best ML algorithm. Nevertheless, ML is still too often a self-referential arena of competing algorithms, with trends, influencers and hypes… No need to mention that DL has been the hype winner of recent years.

The mission of a data scientist should not be the promotion of a specific algorithm (or family of algorithms) but acting as a scientist. This means that (s)he should use his/her know-how NOT to return information about the merits of his/her preferred algorithm BUT about the nature of the data-generating distribution (s)he is dealing with. The outcome of an ML research activity (e.g. a publication) should be additional insight about the observed reality (or Nature) and not a contingent statement about the temporary superiority of an algorithm. Newton’s aim was to use differential calculus to model and explain the dynamics of Nature, not to promote a differential-equation tool and make a company out of it!

Consider also that every ML algorithm (even the least fashionable and the least performing) might return some information (e.g. about the degree of noise or nonlinearity) about the phenomenon we are observing. For instance, a poorly accurate linear model tells us a lot about the lack of validity of the (embedded) linearity assumption in the observed phenomenon. In that sense, even wrong models might play a relevant role, since they could return important information about the phenomenon under observation, notably its (non)linearity, (non)stationarity, degree of stochasticity, relevance of features, and nature of noise.
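As a small hedged illustration of this last point, the sketch below fits a linear model to a synthetic nonlinear phenomenon: the low R² and the structure left in the residuals are exactly the kind of information that a “wrong” model returns about the data-generating process. The synthetic function and noise level are arbitrary choices of mine.

```python
# Sketch: even a "wrong" linear model is informative. Its poor fit and
# structured residuals signal that the linearity assumption does not hold
# for this (synthetic) phenomenon.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 2_000)
y = x * np.sin(x) + 0.2 * rng.normal(size=2_000)   # nonlinear phenomenon

a, b = np.polyfit(x, y, 1)                         # least-squares linear fit y ~ a*x + b
residuals = y - (a * x + b)
r2 = 1 - residuals.var() / y.var()

# leftover structure: residuals still correlate with a nonlinear transform of x
leftover = np.corrcoef(residuals, x ** 2)[0, 1]
print(f"linear fit R^2       : {r2:.2f}")          # low -> linearity is a poor assumption here
print(f"corr(residuals, x^2) : {leftover:.2f}")    # far from 0 -> systematic nonlinearity remains
```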