CAI researchers discover fundamental structures in modern AI models (ICML 2025)
In collaboration with the GreweLab of the Institute of Neuroinformatics (ETH Zurich/University of Zurich), CAI researchers Pascal Sager and Thilo Stadelmann co-authored a paper presented at the International Conference on Machine Learning (ICML) 2025. Their work uncovers symmetry and directionality in Transformer self-attention, advancing both theoretical understanding and training efficiency of AI models.

Transformers have become essential in artificial intelligence, powering applications in language, vision, and audio processing. At the core of these models is self-attention, a mechanism that helps the model determine how different parts of the input relate to each other. However, understanding exactly how self-attention learns and organizes information during training remains a challenge.
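To make the mechanism concrete, here is a minimal NumPy sketch of standard single-head scaled dot-product self-attention. It follows the generic textbook formulation rather than any code from the paper; all names and sizes are chosen purely for illustration.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Standard scaled dot-product self-attention (single head).

    X: (seq_len, d_model) input token embeddings
    W_Q, W_K, W_V: (d_model, d_head) learned projection matrices
    """
    Q = X @ W_Q                                  # queries
    K = X @ W_K                                  # keys
    V = X @ W_V                                  # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how strongly each token relates to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                           # attention-weighted mixture of values

# Toy example: 4 tokens, 8-dimensional embeddings and head.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)    # (4, 8)
```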
Recent research by CAI researchers Pascal Sager and Thilo Stadelmann, in collaboration with Matteo Saponati, Pau Aceituno and Benjamin Grewe from the GreweLab at ETH Zurich and the University of Zurich, sheds new light on this question. Their work reveals that the way Transformer models are trained leads to distinct structural patterns in the self-attention matrices. Bidirectional training, used in models like BERT, creates symmetric attention patterns, while autoregressive training, common in models like GPT, produces directional patterns that emphasize specific parts of the input.
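Concretely, the raw attention scores of a head are governed by the bilinear form W_Q W_K^T built from its query and key projections, so "symmetric attention" can be read as symmetry of that matrix. The sketch below computes one illustrative symmetry score for such a matrix; the exact metric used in the paper may differ.

```python
import numpy as np

def symmetry_score(M):
    """Fraction of a square matrix's energy in its symmetric part.

    Any matrix decomposes as M = S + A with S = (M + M.T)/2 (symmetric)
    and A = (M - M.T)/2 (antisymmetric). A score of 1 means a perfectly
    symmetric score matrix; a random matrix sits near 0.5.
    """
    S = 0.5 * (M + M.T)
    A = 0.5 * (M - M.T)
    return np.linalg.norm(S) ** 2 / (np.linalg.norm(S) ** 2 + np.linalg.norm(A) ** 2)

# Combined query-key map of one attention head (random, untrained weights).
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(64, 16))
W_K = rng.normal(size=(64, 16))
M = W_Q @ W_K.T                # (d_model, d_model) bilinear form behind the scores
print(symmetry_score(M))       # roughly 0.5 at random init; bidirectional (BERT-style)
                               # training is reported to drive the learned scores toward symmetry
```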
These discoveries are confirmed through extensive experiments on a variety of Transformer models and input types, including text, images, and audio. Building on this understanding, the researchers developed new methods for initializing models that take advantage of these patterns, improving training speed and model performance for encoder-only Transformers.
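As a simplified illustration of how such structure can be built in at initialization (a sketch of the general idea, not necessarily the exact scheme proposed in the paper), the query and key projections can share the same random draw so that their product is symmetric from the very first training step:

```python
import numpy as np

def symmetric_qk_init(d_model, d_head, scale=0.02, seed=None):
    """Illustrative initialization making W_Q W_K^T symmetric at step 0.

    Sharing one random draw for queries and keys gives
    M = W_Q W_K^T = W W^T, which is symmetric (and positive semi-definite).
    This mirrors the symmetric structure that bidirectional training is
    reported to converge to; it is a sketch, not the paper's exact method.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=scale, size=(d_model, d_head))
    return W.copy(), W.copy()   # W_Q, W_K

W_Q, W_K = symmetric_qk_init(d_model=64, d_head=16, seed=0)
M = W_Q @ W_K.T
print(np.allclose(M, M.T))      # True: attention scores start from a symmetric bilinear form
```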
Their findings were presented at the International Conference on Machine Learning (ICML) 2025, one of the leading conferences in the field, underscoring the importance of this contribution to both AI theory and practice.
This work was supported by fellowships from ETH Zurich, University of Zurich, and ZHAW digital, as well as computational resources from The Swiss AI Initiative.
The full preprint is available here: arxiv.org/pdf/2502.10927