CLS token pooling is a strategy used in Vision Transformer (ViT) models, where a special classification token (CLS token) is added to the input sequence. The output representation of this token is then used for the final classification task, aggregating information from all patches to provide a global feature representation of the image.
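The mechanics can be sketched in a few lines of numpy. This is a minimal illustration, not a real ViT: the sizes are hypothetical, the CLS token would be a learnable parameter in practice, and the transformer itself is elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes for illustration: a 14x14 patch grid
num_patches, hidden_dim = 196, 64

patch_embeddings = rng.normal(size=(num_patches, hidden_dim))
cls_token = rng.normal(size=(1, hidden_dim))  # learnable in a real ViT

# prepend the CLS token so the transformer layers can let it
# attend to (and aggregate from) every patch embedding
sequence = np.concatenate([cls_token, patch_embeddings], axis=0)
assert sequence.shape == (num_patches + 1, hidden_dim)

# after the transformer layers, the output at position 0 serves as
# the global image representation fed to the classification head:
# global_feature = transformer(sequence)[0]
```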
6 answers
Stefano
Mon Nov 18 2024
To reduce a model's sequence of token embeddings to a single fixed-size vector, various techniques known as "pooling" are employed.
EchoWave
Mon Nov 18 2024
One commonly used method is [CLS] pooling. In this technique, the embedding of the [CLS] token is taken as the representation for the entire sequence.
GeishaWhisper
Mon Nov 18 2024
The [CLS] token is typically inserted at the beginning of the input sequence in models like BERT, and its embedding is trained to capture the overall context of the sequence.
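Since the [CLS] token sits at position 0, the pooling step itself is just an index into the model's output. A minimal sketch with a toy embedding matrix (the numbers are made up for illustration):

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """[CLS] pooling: take the embedding at position 0 as the
    representation of the whole sequence (BERT-style layout)."""
    # token_embeddings: (seq_len, hidden_dim)
    return token_embeddings[0]

# toy example: 4 tokens, hidden size 3; row 0 is the [CLS] token
emb = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0],
                [7.0, 8.0, 9.0],
                [0.0, 1.0, 0.0]])
print(cls_pool(emb))  # → [1. 2. 3.]
```

In libraries like Hugging Face Transformers, `token_embeddings` would correspond to the model's `last_hidden_state` for a single sequence.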
Giulia
Mon Nov 18 2024
Another popular pooling technique is mean pooling.
CryptoTamer
Sun Nov 17 2024
In mean pooling, the average of all token embeddings in the sequence is calculated to obtain a single vector representation; in practice, padding tokens are usually excluded from the average via the attention mask.
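Mean pooling can be sketched the same way. This version masks out padding tokens before averaging, which is how it is typically done for batched sequences of unequal length (the toy numbers are illustrative only):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray,
              attention_mask: np.ndarray) -> np.ndarray:
    """Mean pooling: average token embeddings, counting only
    real tokens (mask == 1) and ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)     # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)   # sum of real tokens
    counts = np.clip(mask.sum(axis=0), 1e-9, None)   # avoid divide-by-zero
    return summed / counts

# toy example: 3 positions, hidden size 2; last row is padding
emb = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [0.0, 0.0]])
mask = np.array([1, 1, 0])
print(mean_pool(emb, mask))  # → [2. 3.]
```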