Abstract
We present a novel technique for interpreting the neurons in CLIP-ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the subsequent attention heads of CLIP's attention-pooling layer. We find that the contributions of these neuron-head pairs can be approximated by a single direction in CLIP-ResNet's image-text embedding space. Leveraging this insight, we interpret each neuron-head pair by associating it with text. Additionally, we find that only a sparse set of the neuron-head pairs has a significant contribution to the output value, and that some neuron-head pairs, while polysemantic, represent sub-concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training-free semantic segmentation, outperforming previous methods for CLIP-ResNet. Second, we utilize the contributions of neuron-head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.
Neuron-Attention Decomposition
Because a nonlinearity follows CLIP-ResNet's residual connections and because CLIP-ResNet's neuron contributions cannot be approximated by a single direction in the joint embedding space, existing CLIP-ViT interpretability methods do not readily extend to their ResNet counterparts. To address this gap, we introduce a new approach that provides a fine-grained decomposition of CLIP-ResNet's output into the individual contributions of neurons in the last layers and the subsequent attention-pooling heads. As each such neuron-head contribution lives in the joint image-text space, it can be compared and interpreted via text.
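To make the decomposition concrete, below is a minimal sketch (our paraphrase, not the authors' released code) of how an attention-pooling output can be split into per-(neuron, head) contributions. Tensor names and shapes are assumptions, and biases and normalization are omitted.

import torch

def neuron_head_contributions(feats, attn_probs, W_v, W_o):
    # Assumed shapes (hypothetical layout):
    #   feats:      (P, C)    spatial features entering attention pooling
    #                         (P positions, C neurons)
    #   attn_probs: (H, P)    attention of the pooling query over positions, per head
    #   W_v:        (H, d, C) per-head value projection
    #   W_o:        (H, D, d) per-head slice of the output projection into the
    #                         joint image-text space (dimension D)
    # Returns a (C, H, D) tensor: one embedding-space vector per neuron-head pair.
    C, H, D = feats.shape[1], attn_probs.shape[0], W_o.shape[1]
    contrib = torch.zeros(C, H, D)
    pooled = attn_probs @ feats  # (H, C): attention-weighted activation of each neuron, per head
    for h in range(H):
        dirs = W_o[h] @ W_v[h]   # (D, C): column c is the direction contributed by neuron c
        contrib[:, h, :] = (dirs * pooled[h]).T  # scale each direction by its pooled activation
    return contrib

# Summing contrib over neurons and heads recovers the attention-pooled image
# embedding (up to the omitted bias terms).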

Pictured: a neuron-head pair whose retrieved images represent the fine-grained concept 'butterfly clothing' (a subconcept of its parent 'butterfly' neuron).
An ongoing research question is identifying the most useful unit of analysis for interpretability and its downstream applications.
Decomposition into neuron-head pairs retains more information compared to neuron-only decomposition, and we qualitatively observe that this can lead to the encoding of more granular and focused concepts (see paper for more examples).
Furthermore, by construction, a neuron's contribution is simply the sum of the contributions of its corresponding neuron-head pairs (see paper for details), and
we observe that some neuron-head pairs represent subconcepts of their parent neurons, consistent with existing literature on the additivity of concepts in multimodal embedding space.
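Written out (our notation, not the paper's exact symbols): if c_{n,h}(x) denotes the contribution of neuron n through attention head h on image x, then the neuron's total contribution is

c_n(x) = \sum_{h} c_{n,h}(x),

and summing over all neurons and heads recovers the attention-pooling output.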
The next slide discusses the quantitative advantages of neuron-attention decomposition.

Pictured: ImageNet classification accuracy for 1) reconstruction from principal components and for 2) mean-ablation of less-contributing components.
Neuron-head contributions are approximately rank-1 in the embedding space (table rows 1-2), meaning they can be approximated by a single direction, unlike neuron contributions (table rows 3-6).
This single direction allows us to label each neuron-head pair with text, which is applied successfully to semantic segmentation (next slide) and monitoring distribution shift (paper).
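As an illustration, a pair's direction can be estimated from its contribution vectors over a set of images and then matched against candidate text embeddings. The sketch below assumes such precomputed contributions; the names and the SVD-plus-cosine-similarity procedure are our choices, not necessarily the paper's exact recipe.

import torch

def label_pair_with_text(pair_contribs, text_embeds, text_labels):
    # pair_contribs: (N, D) contribution vectors of one neuron-head pair over N images
    # text_embeds:   (T, D) CLIP text embeddings of T candidate descriptions
    # text_labels:   list of T strings
    # The top right-singular vector is the single direction that best
    # approximates the pair's contributions in a least-squares sense.
    _, _, Vh = torch.linalg.svd(pair_contribs, full_matrices=False)
    direction = Vh[0] / Vh[0].norm()
    texts = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    sims = texts @ direction            # (T,) cosine similarities
    best = sims.abs().argmax().item()   # singular vectors have arbitrary sign
    return text_labels[best], sims[best].item()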
Additionally, neuron-head contributions are sparse: keeping the contributions of a fixed set of only the top 20% of neuron-head pairs (while mean-ablating the other contributions)
results in only an ~5% decrease in classification accuracy. The same does not hold for neuron contributions, whose analogous mean-ablation leads to an ~25% decrease in classification performance.
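A minimal sketch of this mean-ablation, assuming per-pair contributions have already been computed (tensor names and the norm-based selection criterion are assumptions):

import torch

def mean_ablate_pairs(contribs, mean_contribs, keep_frac=0.2):
    # contribs:      (C, H, D) neuron-head contributions for one image
    # mean_contribs: (C, H, D) per-pair mean contributions over a reference set
    # Keep a fixed set of top pairs (here ranked by mean-contribution norm) and
    # replace every other pair's contribution with its mean.
    norms = mean_contribs.norm(dim=-1)                 # (C, H)
    k = max(1, int(keep_frac * norms.numel()))
    thresh = norms.flatten().topk(k).values.min()
    keep_mask = norms >= thresh                        # (C, H) boolean
    ablated = torch.where(keep_mask[..., None], contribs, mean_contribs)
    return ablated.sum(dim=(0, 1))                     # reconstructed image embedding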

Pictured: class segmentation maps for 'dog'.
Given the approximation of each neuron-head pair by a single direction,
we can rank each pair by its similarity to text representations.
We apply this to associate each class name with a set of k neuron-head pairs, and use this association for dense semantic segmentation.
Specifically, we collect feature maps from the associated neurons and similarity maps from the associated attention heads, then multiply the two together to reduce noise, mitigating the failure modes of either individual map.
In the example above, those failure modes are 1) the focus on the cat shown by the aggregated neuron maps and 2) the negative localization shown by the aggregated similarity maps.
Altogether, this segmentation method outperforms previous training-free methods for CLIP-ResNet, and also outperforms segmentation using neurons alone or attention heads alone.
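A sketch of the map-combination step, assuming the per-neuron activation maps and per-head text-similarity maps have already been extracted (names and the mean aggregation are illustrative, not the paper's exact implementation):

import torch

def class_segmentation_map(neuron_maps, head_sim_maps, pair_ids):
    # neuron_maps:   (C, Hs, Ws) spatial activation map of each neuron
    # head_sim_maps: (A, Hs, Ws) per-head similarity map between spatial tokens
    #                and the class-name text embedding
    # pair_ids:      list of (neuron_index, head_index) tuples associated with the class
    neuron_agg = torch.stack([neuron_maps[n] for n, _ in pair_ids]).mean(dim=0)
    head_agg = torch.stack([head_sim_maps[h] for _, h in pair_ids]).mean(dim=0)
    # The element-wise product suppresses regions highlighted by only one of the two maps.
    return neuron_agg * head_agg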
BibTeX
@article{bu2025interpretingresnetbasedclipneuronattention,
title={Interpreting ResNet-based CLIP via Neuron-Attention Decomposition},
author={Edmund Bu and Yossi Gandelsman},
journal={NeurIPS 2025 Workshop on Mechanistic Interpretability},
year={2025},
url={https://arxiv.org/abs/2509.19943}
}