CLS token pooling is a strategy used in Vision Transformer (ViT) models, where a special classification token (CLS token) is added to the input sequence. The output representation of this token is then used for the final classification task, aggregating information from all patches to provide a global feature representation of the image.
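For intuition, here is a rough, library-agnostic sketch (the tensor shapes below are illustrative assumptions, not taken from any particular checkpoint): after the encoder runs, CLS pooling simply reads off the output at position 0, where the classification token was prepended.

    import torch

    # Final encoder output of a ViT-style model:
    # shape (batch, 1 + num_patches, dim), with the CLS token prepended at position 0.
    hidden_states = torch.randn(8, 197, 768)   # e.g. 196 patches (14 x 14) plus one CLS token

    # CLS pooling: the embedding at position 0 serves as the global image representation.
    cls_pooled = hidden_states[:, 0]           # shape (batch, dim)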

6 answers

Stefano
Mon Nov 18 2024
    To represent the sequence of embeddings as a single vector, various techniques known as "pooling" are employed.

EchoWave
Mon Nov 18 2024
    One commonly used method is [CLS] pooling. In this technique, the embedding of the [CLS] token is taken as the representation for the entire sequence.

GeishaWhisper
Mon Nov 18 2024
    The [CLS] token is typically inserted at the beginning of the input sequence in models like BERT, and its embedding is trained to capture the overall context of the sequence.
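As a minimal sketch, assuming the Hugging Face transformers API and the bert-base-uncased checkpoint (any BERT-style encoder would do), [CLS] pooling takes the last hidden state at position 0, which is where the tokenizer places the [CLS] token:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    sentences = ["Pooling turns a sequence of token embeddings into one vector."]
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # [CLS] pooling: the first token's embedding represents the whole sequence.
    cls_embedding = outputs.last_hidden_state[:, 0]   # shape (batch, hidden_size)

BERT also exposes a pooler_output, which is this same [CLS] embedding passed through an extra dense layer with a tanh activation; whether the raw [CLS] vector or the pooled output works better depends on the task.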

Giulia
Mon Nov 18 2024
    Another popular pooling technique is mean pooling.

CryptoTamer
Sun Nov 17 2024
In mean pooling, the average of all token embeddings in the sequence is calculated to obtain a single vector representation. When sequences in a batch are padded, the attention mask is typically used so that padding positions do not contribute to the average.
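A minimal sketch of mask-aware mean pooling (the helper name and shapes here are illustrative, not taken from a specific library):

    import torch

    def mean_pool(last_hidden_state, attention_mask):
        # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.
        mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
        summed = (last_hidden_state * mask).sum(dim=1)    # sum embeddings of non-padding tokens
        counts = mask.sum(dim=1).clamp(min=1e-9)          # number of non-padding tokens per sequence
        return summed / counts                            # (batch, dim)

    # Toy example: 2 sequences, 4 positions, 8-dimensional embeddings.
    hidden = torch.randn(2, 4, 8)
    mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
    sentence_vectors = mean_pool(hidden, mask)            # shape (2, 8)

Compared with [CLS] pooling, mean pooling draws on every token's embedding, which is one reason it is widely used in sentence-embedding models.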