I love the field of interpretability, but one question faces everyone who dips their toes into it: “What is Interpretability?”  There never seems to be a universally agreed-upon definition.  Much like in philosophy, this leads to disagreements over definitions, fights over contexts, and arguments over objectives.  Interpretability often ends up as a “you know it when you see it” phenomenon.  In many ways there is no single definition, since interpretability depends on the context, the goals, the target audience, the application, and more.  Nonetheless, my feeling is that the field continues to make progress by figuring out what interpretability is not, slowly refining the collective definition and slowly improving the available tools.  Much like the moral philosophy of Plato’s “Republic”, we must keep the conversation going to keep getting closer to the truth.

Although many researchers in interpretability are already aware of the distinction between “Interpretability” and “Explainability” since its seminal treatment in [1], these too are not completely universal definitions.  As a reminder, we will call something interpretable if it is ‘intrinsically interpretable’ or ‘interpretable by design’, meaning we can directly understand the decisions of the model.  The classical examples are linear regression and decision trees.  This is contrasted with something explainable, meaning that the model itself is a blackbox, but we provide a post-hoc explanation or a subsequent justification for the model’s decision.

Like many definitions in interpretability, this might raise more questions than it answers.  Although the distinction between “explanations beforehand” and “explanations afterward” is a fairly straightforward characterization of the difference, the question of “How can I tell if my model is intrinsically interpretable?” is much less so.  In this blog post, I will attempt to answer that question, focusing on the task of supervised machine learning.  We will look at several quintessential examples of interpretable models and how they generalize and extend to cover a wide class of machine learning algorithms.

The Basic Methods

The most commonly cited examples of “interpretable” machine learning methods are linear regression and decision trees.  Although these models are not free from interpretation issues (e.g., correlated features for linear models and depth requirements for decision trees), they are widely accepted as simple models where the reasoning process is easily understood.  For the linear model, one can simply add up the individual contributions to get the final prediction.  For the decision tree, one can follow the flowchart leading to the final decision.
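
As a minimal sketch of what “reading off” these two models looks like (the data and feature names below are made up purely for illustration):

```python
# A toy illustration of the two classic readings: summing a linear model's
# per-feature contributions, and printing a decision tree's flowchart.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_reg = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Linear model: the prediction is literally a sum of per-feature contributions.
lin = LinearRegression().fit(X, y_reg)
x_test = X[0]
contributions = lin.coef_ * x_test
print("per-feature contributions:", contributions)
print("prediction = intercept + sum:", lin.intercept_ + contributions.sum())

# Decision tree: the prediction is a walk through an explicit flowchart.
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y_cls)
print(export_text(tree, feature_names=["x0", "x1", "x2"]))
```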

A third method which often gets left out is the nearest neighbor algorithm.  I will argue here that the nearest neighbor also makes decisions in a way which is easy to interpret and that it represents a third type of understandable logic.  A test sample is compared, under a fixed distance metric, to all of the training samples (or better yet, to learned prototypes), and it is assigned the label of the representative to which it is closest.  Although the distance itself is not necessarily well understood, the classification into archetypes provides a sufficient and contrastive rationale (X is closer to A than to B).
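
A small sketch of this contrastive rationale, again with invented numbers:

```python
# Nearest-prototype reasoning: "x is closer to A than to B", under a fixed
# (here Euclidean) distance metric. Prototypes and the query are made up.
import numpy as np

prototypes = {"A": np.array([0.0, 0.0]), "B": np.array([3.0, 3.0])}
x = np.array([0.8, 1.1])

distances = {label: np.linalg.norm(x - p) for label, p in prototypes.items()}
winner = min(distances, key=distances.get)
other = max(distances, key=distances.get)
print(f"Predict {winner}: x is closer to {winner} ({distances[winner]:.2f}) "
      f"than to {other} ({distances[other]:.2f})")
```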

In this blog post, I will argue that each of these methods embodies a quintessential reasoning approach which defines one of the “pillars” of interpretability.  The linear model will develop into the additive approach, which accumulates evidence over multiple input factors.  The decision tree will develop into the logical approach, which uses deduction over a given set of input literals.  Finally, the nearest neighbor model will develop into the categorical approach, which codifies each set of inputs into its correct type.  Note that additivity most naturally maps continuous inputs to continuous outputs, logic most naturally maps discrete to discrete, and classification most naturally maps continuous to discrete.

The Pillar of Accumulation

The Pillar of Evidence Accumulation, or the Additive Pillar, is about a reasoning process which additively incorporates different pieces of evidence to come to a final conclusion.  In the case of the linear model with independent input variables, this corresponds to the additive influence given by each of the linear coefficients.  It is straightforward to generalize these independent influences from linear functions to nonlinear functions of the input variables, leading to additive models [2, 3, 4].

These models generally remain interpretable by using interactions of size at most two between the input variables.  This allows the additive contributions to be plotted as 1D dependence curves or 2D dependence heatmaps.  Across many tabular datasets, these models even achieve state-of-the-art performance, matching blackbox methods like XGBoost and multilayer perceptrons.  This is especially true when allowing three-way and higher-order interaction terms; however, this begins to push the boundary of what can be considered fully interpretable.
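
To make the additive structure concrete, here is a crude backfitting sketch of my own (not the actual training procedures of [2, 3, 4]), where each feature gets a piecewise-constant shape function that can be plotted as a 1D dependence curve:

```python
# A toy additive model: y is approximated as intercept + f1(x1) + f2(x2),
# with each f_j a binned shape function fit to the residuals of the others.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=1000)

n_bins = 20
bins = np.linspace(-3, 3, n_bins + 1)
intercept = y.mean()
residual = y - intercept
shape = np.zeros((2, n_bins))        # one 1D shape function per feature

for _ in range(10):                  # cyclic backfitting
    for j in range(2):
        idx = np.clip(np.digitize(X[:, j], bins) - 1, 0, n_bins - 1)
        residual += shape[j, idx]            # remove this feature's old contribution
        for b in range(n_bins):              # refit f_j as bin-wise residual means
            mask = idx == b
            if mask.any():
                shape[j, b] = residual[mask].mean()
        residual -= shape[j, idx]            # put the updated contribution back

# Each shape[j] can be plotted as a 1D dependence curve; a prediction is
# just intercept + shape[0][bin(x1)] + shape[1][bin(x2)].
```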

For additive models, we would like the individual terms to be as sparse as possible (considering as few factors as possible) and as simple as possible (considering only factors which are themselves easily understood).  With these criteria met, one obtains an additive model which transparently expresses its predictions in terms of the input variables.  Another key requirement for ease of interpretation, however, is the independence of the different terms.  When the factors are independent, we can interpret the evidence provided by each of them as a separate additive contribution to the prediction.  This interpretation becomes more difficult in the presence of heavily correlated features, raising the question of which feature of the pair is actually causing the output.  How to train and interpret additive models in these heavily correlated settings is an active area of current research.
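
A toy illustration of the correlation problem: when two features are nearly identical, an ordinary linear fit can split their shared effect in an unstable way, so neither coefficient alone answers the question of which feature is driving the output.

```python
# Two nearly duplicate features; the true effect (about 3) gets divided
# between them in a way driven mostly by noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.001, size=500)    # nearly identical to x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=500)

coefs = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
print(coefs)  # the estimates can land far from the true (3, 0), even though
              # the data-generating process used only x1
```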

The Pillar of Deduction

The Pillar of Logical Deduction, or the Reasoning Pillar, is about a process which carefully follows a deductive argument to come to a final conclusion.  In the case of the decision tree, this is a simple flowchart logic which follows a sequence of logical steps to arrive at a final decision.  The necessary and sufficient variables are easy to see through the chain of logic.  Extensions in interpretable machine learning include optimal rule lists and optimal sparse decision trees [5, 6].  These provide simple decision-making processes which achieve good performance on tabular datasets.  They have been further extended to ‘Rashomon sets’ which, instead of providing a single optimal tree, provide a large set of near-optimal trees [7].
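
For intuition, a rule list reads like an ordered set of if-then clauses.  The sketch below is hand-written purely for illustration, whereas the methods of [5, 6] learn such structures from data with certificates of optimality:

```python
# A hand-written rule list: rules are checked in order and the first rule
# that fires determines the prediction. All thresholds here are invented.
def rule_list_predict(patient):
    if patient["age"] > 65 and patient["has_asthma"]:
        return "high risk"
    if patient["blood_pressure"] > 140:
        return "high risk"
    return "low risk"

print(rule_list_predict({"age": 70, "has_asthma": True, "blood_pressure": 120}))
```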

The key advantage of a Rashomon set is the ability to reason across many candidate solutions and choose one which obeys a secondary criterion like minimizing unfairness, aligning with domain expertise, or maximizing robustness.  Recent approaches also investigate how to make inferences over the entire set of models.  Interestingly, this is not the same as an ensemble of decision trees, although there are many similarities.  Most glaringly, we never directly add the outputs of the individual decision trees.  This slight nuance in interpretation makes the difference between an interpretable model and an uninterpretable one.  In general, ensembles of models are considered uninterpretable (unless the base model is itself an additive model).  This is because the naive combination of the two approaches does not respect what makes each simple in the first place.  A random forest is a sum of factors which are no longer simple or independent of one another, and it combines the decision tree’s cold, calculating logic with a heuristic vote in the final step.  The simplicity and sparsity of the decision logic is another key aspect of keeping a deduction interpretable.
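
Returning to the Rashomon set, a schematic sketch (my own framing, not the algorithm of [7]) of selecting a model by a secondary criterion might look like:

```python
# Among all candidate models within epsilon of the best loss, choose the one
# that best satisfies a secondary criterion (sparsity, a fairness proxy,
# agreement with domain expertise, ...).
def choose_from_rashomon_set(models, loss, secondary, epsilon):
    """models: candidate trees; loss and secondary: functions model -> float."""
    best = min(loss(m) for m in models)
    near_optimal = [m for m in models if loss(m) <= best + epsilon]
    return min(near_optimal, key=secondary)
```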

A perhaps more natural generalization of a decision process would be a general logical function or a boolean circuit; however, these approaches have received little attention in interpretable machine learning.  This is largely because those branches of study are typically not concerned with keeping the learned logical functions sufficiently simple when applied as an ML model.  Doing so would involve learning circuits for which ‘necessity’ and ‘sufficiency’ queries are easily computable and which minimize the variable context required to complete a computation.  This is likely an important direction for future research.  For generative applications, circuits have recently attracted some attention, such as the interpretability of “Probabilistic Circuit” models and the explainable ‘circuit finding’ of “Mechanistic Interpretability”.
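
As a hint of what such queries look like, here is a brute-force sketch of ‘sufficiency’ and ‘necessity’ checks on a tiny boolean function; the challenge for circuit-based interpretable models would be keeping such queries tractable at scale:

```python
# Brute-force sufficiency/necessity checks on f(a, b, c) = (a and b) or c.
from itertools import product

def f(a, b, c):
    return (a and b) or c

def is_sufficient(var_idx, value):
    """Does fixing this input to this value force f to be True?"""
    return all(f(*inp) for inp in product([False, True], repeat=3)
               if inp[var_idx] == value)

def is_necessary(var_idx, value):
    """Does every input with f True have this input at this value?"""
    return all(inp[var_idx] == value for inp in product([False, True], repeat=3)
               if f(*inp))

print(is_sufficient(2, True))   # True: c alone forces f to be True
print(is_necessary(2, True))    # False: f is also True when a=b=True, c=False
```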

The Pillar of Categorization

The Pillar of Inductive Categorization, or the Concept Pillar, is about recognizing an example as similar to examples you have seen before.  In the case of nearest neighbors, this simply means matching a new query against all existing examples and prototypes, comparing them under a distance metric, and assigning the new query the label of the closest one.  Extensions in interpretable machine learning include both prototype-based approaches [8] and concept-based approaches [9].  Importantly, this covers both unknown, machine-learned concepts and human-known, alignment-learned concepts.  Although I think this might be the most debatable pillar, it is quite natural in many ways.  Moreover, it gives a clear reason why we call the two models above, ProtoPNet and CBMs, interpretable models in the first place.
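
A structural sketch of the concept-bottleneck idea of [9] (not its actual training setup; the data, shapes, and models below are invented) is roughly:

```python
# A (possibly blackbox) encoder predicts human-named concepts, and only a
# simple model over those concepts produces the final label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                  # raw inputs (e.g. image features)
concepts = (X[:, :3] > 0).astype(float)         # "true" concept annotations
y = (concepts.sum(axis=1) >= 2).astype(int)     # label depends only on concepts

concept_encoder = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500).fit(X, concepts)
concept_preds = concept_encoder.predict(X)

label_head = LogisticRegression().fit(concept_preds, y)
# The label head is a small linear model over named concepts, so the final
# decision can be read as "concepts 1 and 2 are present, hence class 1".
```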

If you will believe me that there is always an implicit distance metric whenever we try to categorize objects, then maybe you will also believe me that the kernel machine is the natural combination of the prototype method and the additive method.  In particular, the kernel machine generalizes the k-NN approach by replacing the vote over the nearest prototypes with a weighted sum over all prototypes, with weights determined by the distance metric.  It is easy to see why this naive combination of two pillars again leads to a model which is itself uninterpretable.  The simple influence of individual data points is blurred by the additive voting across many samples, and the simple sum is over factors which are completely dependent on one another and weighted by the distance metric.  Nevertheless, ProtoPNet balances simple prototypes corresponding to parts of an image with a simple scoring procedure, making progress towards an application-specific combination which remains interpretable.
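
A minimal sketch of this combination, with an RBF kernel standing in for the implicit distance metric:

```python
# Kernel machine prediction as a weighted sum over prototypes.
import numpy as np

def rbf(x, z, gamma=1.0):
    # Kernel similarity plays the role of the (implicit) distance metric.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_predict(x, prototypes, alphas, bias=0.0):
    # Unlike 1-NN, *every* prototype contributes, weighted both by a learned
    # alpha and by its kernel similarity to the query; the additive structure
    # and the distance metric are now entangled.
    return bias + sum(a * rbf(x, p) for a, p in zip(alphas, prototypes))

# e.g. kernel_predict(np.array([0.5, 0.5]),
#                     prototypes=[np.array([0.0, 0.0]), np.array([1.0, 1.0])],
#                     alphas=[1.0, -1.0])
```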

As a final combination, let us consider the concept pillar together with the logic pillar.  I believe it is clear that this results in the class of neurosymbolic approaches, where neural networks or other blackboxes handle the categorization into symbolic concepts, and symbolic methods, often applied at the end, reason across the learned concepts.  Once again, the naive combination of these approaches will not always lead to an interpretable model.  The blackbox component may produce concepts which do not map cleanly onto human concepts, reducing the validity of the symbolic interpretation, while the symbolic component’s simple logic can obfuscate the prototypes learned by the concept encoder.  Even for neurosymbolic approaches designed for interpretability, like CBMs, there are already concerns about concept leakage (especially with soft concepts) limiting the interpretability guarantees.  Moreover, the symbolic reasoning is in practice often done with a second blackbox model, greatly reducing the end-to-end interpretability.  In general applications, these concerns seem challenging to measure.
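
Schematically, and without referring to any specific system, the neurosymbolic pattern looks like:

```python
# A neural "perception" module predicts concept scores, and an explicit
# logical rule reasons over the thresholded symbols. Leakage arises when
# downstream decisions rely on the soft scores rather than the intended
# symbols. Concept names and the rule here are invented.
def neurosymbolic_predict(x, concept_model, threshold=0.5):
    scores = concept_model(x)               # e.g. {"has_wings": 0.93, "has_beak": 0.88}
    symbols = {k: v > threshold for k, v in scores.items()}
    # Symbolic stage: an explicit, human-readable rule over the concepts.
    if symbols.get("has_wings") and symbols.get("has_beak"):
        return "bird"
    return "not bird"
```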

Nonetheless, I think it can be seen how the naive combination of these pillars does not automatically lead to interpretability, and how these two approaches leverage concepts to make progress towards interpretable models which can still handle computer vision tasks.  It seems that concept approaches are specifically needed for such tasks.  In particular, the first two pillars cannot really handle computer vision applications at all, and a growing number of theoretical results are proving this to be the case.  More generally, the fields which have been solved exclusively by deep learning (and not by classical machine learning before it) appear to be exactly the fields where the first two pillars are insufficient.  Accordingly, I think a third reasoning type, here framed as concept-based approaches, is truly necessary for expanding the capabilities of interpretable machine learning.  And yet, exactly how to define this approach rigorously and how to integrate it into existing pipelines and applications remains a mystery.  I think this pillar provides some of the key opportunities for developing the future of interpretability.

Conclusion

We have seen how additive models, decision models, and prototype models correspond to the cornerstones of interpretability.  We have also seen how the naive combination of these interpretable approaches leads to uninterpretable models.  Additive plus decisions gives random forests and ensemble approaches; additive plus prototypes gives kernel machines and support vector machines; and decisions plus prototypes gives neurosymbolic approaches.  Importantly, exactly how these building blocks are combined has a serious impact on the interpretability of the final model.

I hope I have explained in sufficient detail why these different interpretability logics do not always mix well with each other and have given some first thoughts on how people are trying to address these concerns.  I believe there may be other simple reasoning styles not represented by the three pillars presented here, but I do feel that a large number of existing approaches clearly fall under these categories or combinations thereof.  There are also opportunities to delve deeper into questions beyond supervised machine learning, but I feel that would fall outside the scope of this post.  For instance, interpretability for generative models seems to be a different beast entirely, one which has received far less rigorous study up to this point but offers just as many research opportunities.

Finally, I hope I have explained in sufficient detail what these three pillars stand for and why they belong at the base of making interpretable decisions and predictions.  I hope I have also highlighted their key flaws and why they benefit from being complemented by the other approaches.  Composing these approaches to build larger and more capable interpretable systems is a natural next step, but one with no shortage of its own challenges.  In total, I hope this consolidation of interpretable ML methods can help researchers build future approaches through the careful combination of the additive pillar, the reasoning pillar, and the concept pillar, getting us all one step closer to answering: “What is Interpretability?”

Citations

[1] Rudin, Cynthia. (2019). “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead”.

[2] Lou, Yin et al. (2012). “Intelligible Models for Classification and Regression”.

[3] Lou, Yin et al. (2013). “Accurate Intelligible Models with Pairwise Interactions”.

[4] Caruana, Rich et al. (2015). “Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission”.

[5] Angelino, Elaine et al. (2018). “Learning Certifiably Optimal Rule Lists for Categorical Data”.

[6] Hu, Xiyang et al. (2019). “Optimal Sparse Decision Trees”.

[7] Xin, Rui et al. (2022). “Exploring the Whole Rashomon Set of Sparse Decision Trees”.

[8] Chen, Chaofan et al. (2019). “This Looks Like That: Deep Learning for Interpretable Image Recognition”.

[9] Koh, Pang Wei et al. (2020). “Concept Bottleneck Models”.
