Pre-trained CNNs as Feature-Extraction Modules for Image Captioning
An Experimental Study
In this work, we present a thorough experimental study of feature extraction using Convolutional Neural
Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72
experiments on 12 image classification CNNs pre-trained on the ImageNet dataset. The features are
extracted from the last layer that remains after the fully connected layer is removed, and are fed into the
captioning model. We use a unified captioning model with a fixed vocabulary size across all the experiments
to isolate the effect of changing the CNN feature extractor on image captioning quality. The scores are
calculated using the standard metrics in image captioning. We find a strong relationship between the model
structure and the image captioning dataset, and show that VGG models yield the lowest quality for image
captioning feature extraction among the tested CNNs. Finally, we recommend a set of pre-trained CNNs for
each image captioning evaluation metric to be optimised, and show the connection between our results and
previous work. To our knowledge, this work is the most comprehensive comparison of feature extractors for
image captioning.
Keywords: Convolutional Neural Network, Feature Extraction, Image Captioning, Deep Learning
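The abstract says that captions are scored with the standard image captioning metrics. BLEU is one such metric, and its core ingredient is a clipped n-gram precision. The sketch below is only a minimal illustration of that ingredient, not the evaluation code used in this study: the function names and example captions are hypothetical, and full BLEU also combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate caption against several references.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Highest count of each n-gram over all reference captions.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical captions, invented for illustration only.
candidate = "the dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "a brown dog runs across a field".split(),
]
p1 = clipped_precision(candidate, references, n=1)  # unigram precision
p2 = clipped_precision(candidate, references, n=2)  # bigram precision
```

In this toy example the word "the" appears twice in the candidate but at most once in any reference, so only one occurrence is credited. This clipping is what stops a caption from scoring well by repeating common words.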
Copyright (c) 2022 Muhammad Abdelhadie Al-Malla, Assef Jafar, Nada Ghneim
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.