research
Interests and projects
AI interpretability
I am currently working on AI interpretability, with a particular focus on transformer-based large language models (LLMs). I am actively pursuing two projects:
Geometry of LLMs’ latent representations and emergence of compositional features.
Previous work points to a nontrivial intrinsic geometry of token embeddings: they can be seen as points of a stratified space (Robinson et al., 2024), and different models (even trained on different modalities) have comparable embeddings (Roads & Love, 2020; Luo et al., 2024; Jha et al., 2025). I have found that in language models such as GPT-2, k-means or hierarchical clustering of the token embeddings reveals some degree of organization that combines semantics with potential syntactic roles. For the purpose of this investigation, I propose that a primitive feature is a characteristic of an embedded token (possibly shared by the members of a cluster) that is exploited by some attention or MLP block. Instead of taking the matrices of these blocks as the main objects of interest, I focus on the coordinate-independent geometric entities (e.g. linear subspaces) that the matrices define in residual space, for instance via their kernels, images, and singular vectors.

Rather than interpreting the singular vectors directly in terms of input tokens, as in (Elhage et al., 2021) or (Dar et al., 2023), I interpret them as providing incremental modifications to the initial embeddings, “constructing” new context-dependent features from the primitives (thus obtaining compositional features); the interaction with layer norm might then induce discrete classes via clustering, as in the toy examples of (Winsor, 2022). Since deeper layers read not only the initial token embeddings but also the modifications introduced by previous blocks, we need new tools to quantify the relevance of the modifications introduced at each layer. A combination of hierarchical clustering and matrix-based, input-independent analysis of heads will give us a hierarchy of coarse-grained descriptions of the action of each block on tokens, and a coarse-grained view of the interaction between different blocks.
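A minimal sketch of these two ingredients, assuming the Hugging Face transformers library together with scikit-learn and NumPy: hierarchical clustering of GPT-2 token embeddings, and the singular value decomposition of one attention head's OV matrix viewed as a linear map on residual space. The subsample size, cluster count, layer, and head index are illustrative choices, not fixed parameters of the project.

```python
# Sketch: hierarchical clustering of GPT-2 token embeddings, plus an
# input-independent (SVD) view of one attention head's OV matrix.
# Assumes transformers, scikit-learn, and numpy are installed; the
# subsample size, cluster count, layer, and head are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from transformers import GPT2Model, GPT2TokenizerFast

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Token embedding matrix: one point of residual space per vocabulary item.
E = model.wte.weight.detach().numpy()                  # (50257, 768)

# Cluster a subsample (the full vocabulary is costly). Embeddings are
# L2-normalized so Euclidean/Ward clustering tracks cosine similarity.
rng = np.random.default_rng(0)
idx = rng.choice(E.shape[0], size=5000, replace=False)
E_sub = E[idx] / np.linalg.norm(E[idx], axis=1, keepdims=True)
labels = AgglomerativeClustering(n_clusters=50).fit_predict(E_sub)
print("cluster 0 sample:",
      [tokenizer.decode([int(i)]) for i in idx[labels == 0]][:20])

# OV matrix of layer 0, head 0, as a linear map residual -> residual.
d_model, n_heads = model.config.n_embd, model.config.n_head
d_head = d_model // n_heads
W_V = model.h[0].attn.c_attn.weight[:, 2 * d_model:].detach()  # value projection (768, 768)
W_O = model.h[0].attn.c_proj.weight.detach()                   # output projection (768, 768)
head = 0
OV = (W_V[:, head * d_head:(head + 1) * d_head]
      @ W_O[head * d_head:(head + 1) * d_head, :]).numpy()     # rank <= d_head

# Singular subspaces of the OV map. GPT-2's Conv1D layers act on row
# vectors (x @ W), so the columns of U are the "read" directions and
# the rows of Vt the "write" directions in residual space.
U, S, Vt = np.linalg.svd(OV)
print("leading singular values:", np.round(S[:5], 3))

# How strongly each cluster centroid excites the head's top read directions:
centroids = np.stack([E[idx][labels == k].mean(axis=0) for k in range(50)])
read_strength = np.abs(centroids @ U[:, :5])                   # (50, 5)
print("read strengths (first 5 clusters):\n", np.round(read_strength[:5], 3))
```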
Understanding internal representations of syntax in LLMs via mechanistic interpretability
Joint project with Aman Burman (undergraduate student) and Matilde Marcolli. Transformer-based language models are able to produce text that follows syntactic rules; this ability emerges suddenly, during a brief period of training (Chen et al., 2024). By analogy with some toy models (see e.g. (Li et al., 2023)), we can hypothesize that a transformer encodes syntax as part of an internal “world representation”. We are using tools from mechanistic interpretability on small language models to extract encoded syntactic features, understand how they are distributed across layers, and identify which subnetworks or circuits use this information. We are investigating whether syntactic processing in language models is hierarchical and analogous to syntactic trees in linguistics. Here we have in mind, in particular, the mathematical formalization of Chomsky’s minimalist program in the language of Hopf algebras (Marcolli et al., 2023a; Marcolli et al., 2023b). We want to refine the analysis of (Manning et al., 2020), which showed that the activations of attention heads are correlated with syntactic binary relations but did not explore the mechanisms by which these relations are assembled into more complex trees.
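A minimal sketch of the kind of layer-wise probing we use as a starting point, assuming the Hugging Face transformers library, spaCy (for reference part-of-speech tags), and scikit-learn; the two example sentences and the choice of a linear POS probe are placeholders for the actual corpora and syntactic features under study.

```python
# Sketch: probe GPT-2 hidden states for syntactic (POS) information,
# layer by layer. Assumes transformers, spacy (with en_core_web_sm),
# scikit-learn, numpy, and torch are installed.
import numpy as np
import spacy
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import GPT2Model, GPT2TokenizerFast

nlp = spacy.load("en_core_web_sm")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

sentences = [
    "The cat that the dog chased ran away.",
    "Colorless green ideas sleep furiously.",
    # a much larger corpus would be used in practice
]

features = {layer: [] for layer in range(model.config.n_layer + 1)}
pos_tags = []

for sent in sentences:
    doc = nlp(sent)
    enc = tokenizer(sent, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        out = model(**enc)
    # hidden_states: (embedding output, layer 1, ..., layer 12)
    for tok in doc:
        w_s, w_e = tok.idx, tok.idx + len(tok.text)
        # BPE subtokens whose character span overlaps this word
        sub_idx = [i for i, (s, e) in enumerate(offsets.tolist())
                   if max(s, w_s) < min(e, w_e)]
        if not sub_idx:
            continue
        pos_tags.append(tok.pos_)
        for layer, h in enumerate(out.hidden_states):
            # represent the word by its last subtoken's hidden state
            features[layer].append(h[0, sub_idx[-1]].numpy())

# Fit one linear probe per layer; the accuracy profile across layers
# indicates where POS information is most linearly accessible.
y = np.array(pos_tags)
for layer, X in features.items():
    X = np.stack(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy {probe.score(X_te, y_te):.2f}")
```

Attention-based analyses in the style of Manning et al. (2020) would instead compare attention weights against dependency edges; the probing setup above is only one entry point into how syntactic information is distributed across layers.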
Mathematics of information
More broadly, I am interested in mathematical aspects of information theory, particularly in connection with category theory and geometry (metric geometry, geometric measure theory, …). My work in this regard can be organized around three axes:
- Topological characterization of information measures
- Information dimension and measures with geometric structure
- Magnitude and diversity
Follow the links for descriptions, videos, and slides of relevant presentations, along with other resources.