TAILOR – Trustworthy and sample efficient vision transformers

Description

Following their breakthrough in natural language processing, transformers are now being adapted for computer vision and image classification tasks. Transformer-based models have shown descriptive power at least on par with convolutional models; however, early variants required more data to generalize than convolutional models did. Furthermore, these models did not produce translation-, rotation-, and scale-equivariant features. A recent study introduced rotation-equivariant transformer models, but the generalization properties and sample efficiency of these models have not yet been well investigated. During this visit, we aim to extend the concept of rotation equivariance to affine-transformation equivariance by adding translation and scaling equivariance to the previous methodology, in order to improve the trustworthiness of the decisions and the robustness of vision transformer models. Computer vision models have demonstrated vulnerability to variations in the angle and scale of the input images, and this weakness reduces trust in the models' decisions in some circumstances. In this proposal, we approach trustworthiness by improving reliability and robustness through affine-transformation equivariance. We also expect gains in sample efficiency and generalization, because the features remain consistent under variations in translation, rotation, and scale of the original image.
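To make the notion of equivariance concrete: a feature map f is equivariant to a transformation T if f(T(x)) = T(f(x)) for every input x. The following is a minimal sketch of that check, not the method proposed here. It uses NumPy, two toy feature maps (`local_mean` and `horizontal_gradient`, both hypothetical names chosen for illustration), and only 90-degree rotations and circular shifts on a square image; scaling and arbitrary affine transformations are not covered.

```python
import numpy as np


def local_mean(image):
    """Toy feature map: mean of each pixel and its 4 neighbours (wrap-around borders)."""
    return (
        image
        + np.roll(image, 1, axis=0)
        + np.roll(image, -1, axis=0)
        + np.roll(image, 1, axis=1)
        + np.roll(image, -1, axis=1)
    ) / 5.0


def horizontal_gradient(image):
    """Toy feature map with a preferred direction: difference to the left neighbour."""
    return image - np.roll(image, 1, axis=1)


def rotate90(image):
    """Rotate the image by 90 degrees counter-clockwise."""
    return np.rot90(image)


def shift(image):
    """Circularly translate the image by 2 rows and 3 columns."""
    return np.roll(image, shift=(2, 3), axis=(0, 1))


def is_equivariant(feature_map, transform, image):
    """Check f(T(x)) == T(f(x)) up to floating-point tolerance for one input."""
    return np.allclose(feature_map(transform(image)), transform(feature_map(image)))


# Assumption: a random square image stands in for real data.
rng = np.random.default_rng(0)
image = rng.random((8, 8))

print("local_mean, rotation:         ", is_equivariant(local_mean, rotate90, image))
print("local_mean, translation:      ", is_equivariant(local_mean, shift, image))
print("horizontal_gradient, rotation:", is_equivariant(horizontal_gradient, rotate90, image))
print("horizontal_gradient, shift:   ", is_equivariant(horizontal_gradient, shift, image))
```

In this sketch the isotropic `local_mean` passes both checks, while `horizontal_gradient` is translation equivariant but fails the rotation check because it has a preferred direction. A check on a single random input can only refute equivariance, not prove it in general.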