Feature aggregation using VLAD and netVLAD

Before the age of deep learning, hand-crafted features such as SIFT ruled the computer vision world. Combined with Bag of Features (BOF) methods, these features were very popular in indexing and categorization tasks. Their main advantage was producing a highly discriminative, high-dimensional representation of the image. The strongest of these was the Fisher Vector (FV) kernel, which fits a Gaussian Mixture Model (GMM) to the local features and uses the gradients of the log-likelihood with respect to the GMM parameters to approximate the feature distribution in the image. Fisher Vectors performed excellently on various tasks, but their main drawback was a very (sometimes too) high dimensionality.

Inspired by BOF and FV, VLAD (Vector of Locally Aggregated Descriptors) was born. VLAD achieves performance similar to FV while having a much lower dimensionality.

The main idea behind VLAD is:

  • Compute k feature centers (using k-means).
  • Assign each feature to its closest center.
  • Sum the residuals between the features and their corresponding center.
  • Flatten the resulting matrix into a vector and L2-normalize it.

Mathematically, it looks like this:

{{v}_{ij}}=\sum\limits_{x\,such\,that\,NN(x)={{c}_{i}}}{\left( {{x}_{j}}-{{c}_{ij}} \right)}

Here, v_{ij} is the j-th component of the aggregated residual for center i (the rows of this matrix form the VLAD representation), x ranges over the input descriptors, NN(x) is the nearest center to x, and c_i are the centers.
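The steps and the equation above can be sketched in NumPy as follows. This is my own minimal illustration (the function name and the small epsilon for numerical stability are not from any reference implementation):

```python
import numpy as np

def vlad(descriptors, centers):
    """Compute a VLAD vector from local descriptors.

    descriptors: (N, D) array of local features (e.g. SIFT).
    centers:     (K, D) array of k-means centers (the visual vocabulary).
    Returns a flattened, L2-normalized vector of shape (K * D,).
    """
    # Hard-assign each descriptor to its nearest center.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
    assignments = np.argmin(dists, axis=1)                                         # (N,)

    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members) > 0:
            # Sum of residuals between member descriptors and their center.
            v[k] = (members - centers[k]).sum(axis=0)

    # Flatten and L2-normalize.
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)
```

Note that the output dimensionality is K * D, e.g. 64 centers with 128-dimensional SIFT descriptors give an 8192-dimensional vector, far smaller than a typical Fisher Vector with the same vocabulary size.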


In the age of deep learning, the question arises – can we learn this representation?

The short answer is – YES.

I found the netVLAD paper (the original code is available on GitHub and uses MATLAB and MatConvNet). The general idea is the same as the original VLAD, with two main differences:

  1. netVLAD uses a learned softmax for the assignment (because it is differentiable).
  2. netVLAD performs intra-normalization (presented in the paper All About VLAD) before the L2 normalization over the entire VLAD vector.

{{v}_{kj}}=\sum\limits_{i=1}^{N}{\frac{{{e}^{w_{k}^{T}{{x}_{i}}+{{b}_{k}}}}}{\sum\limits_{k'}{{{e}^{w_{k'}^{T}{{x}_{i}}+{{b}_{k'}}}}}}}\left( {{x}_{ij}}-{{c}_{kj}} \right)

Here, w_k, b_k, c_k are parameters learned by the network, and the softmax fraction is the soft-assignment weight of descriptor x_i to center k.
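A forward pass of this soft-assignment version can be sketched in NumPy (my own illustration, not the paper's code; the parameters would normally be learned by backprop rather than passed in):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def netvlad(x, w, b, c):
    """Forward pass of a netVLAD-style layer (no learning).

    x: (N, D) local descriptors (e.g. columns of a conv feature map).
    w: (K, D), b: (K,) soft-assignment parameters.
    c: (K, D) learned cluster centers.
    Returns an intra-normalized, then globally L2-normalized (K * D,) vector.
    """
    # Soft assignment: a[i, k] = softmax over k of (w_k . x_i + b_k).
    a = softmax(x @ w.T + b, axis=1)             # (N, K)

    # Weighted residuals: V[k] = sum_i a[i, k] * (x_i - c_k).
    residuals = x[:, None, :] - c[None, :, :]    # (N, K, D)
    V = (a[:, :, None] * residuals).sum(axis=0)  # (K, D)

    # Intra-normalization: L2-normalize each center's residual sum first...
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    # ...then L2-normalize the flattened vector.
    v = V.flatten()
    return v / (np.linalg.norm(v) + 1e-12)
```

Because the hard argmin of VLAD is replaced by a softmax, every operation here is differentiable, which is exactly what allows the layer to be trained end-to-end.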

The netVLAD architecture is presented below


netVLAD architecture as presented in the original netVLAD paper by R. Arandjelovic et al.


In the original paper, they stated that the centers and weights should be initialized to pre-trained values (using k-means on features learned without the VLAD layer). However, I saw no change in performance when simply using standard truncated normal or Xavier initialization.


And now for my contribution to the community (and the world) – I implemented the VLAD orderless pooling layer in TensorFlow 1.0. The code is available on my GitHub.