Pre-trained CNNs as Feature-Extraction Modules for Image Captioning: An Experimental Study

Muhammad Abdelhadie Al-Malla; Assef Jafar; Nada Ghneim

doi:10.5565/rev/elcvia.1436

Authors

Muhammad Abdelhadie Al-Malla, Higher Institute of Applied Science and Technology
Assef Jafar, Higher Institute for Applied Sciences and Technology (HIAST)
Nada Ghneim, Higher Institute for Applied Sciences and Technology (HIAST)

Abstract

In this work, we present a thorough experimental study of feature extraction using Convolutional Neural
Networks (CNNs) for the task of image captioning in the context of deep learning. We perform a set of 72
experiments on 12 image classification CNNs pre-trained on the ImageNet [29] dataset. The features are
extracted from the last layer after removing the fully connected layer and are fed into the captioning model. We use
a unified captioning model with a fixed vocabulary size across all experiments to study the effect of changing
the CNN feature extractor on image captioning quality. The scores are calculated using the standard image
captioning metrics. We find a strong relationship between the model structure and the image captioning dataset
and show that VGG models yield the lowest quality for image captioning feature extraction among the tested
CNNs. Finally, we recommend a set of pre-trained CNNs for each image captioning evaluation metric
that one may want to optimise, and show the connection between our results and previous works. To our knowledge, this
work is the most comprehensive comparison of feature extractors for image captioning.

Keywords

Convolutional Neural Network, Feature Extraction, Image Captioning, Deep Learning

References

[1] Lowe, David G. "Object recognition from local scale-invariant features." Proceedings of the seventh IEEE international conference on computer vision. Vol. 2. IEEE, 1999.

[2] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "Surf: Speeded up robust features." European conference on computer vision. Springer, Berlin, Heidelberg, 2006.

[3] Hodosh, Micah, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics." Journal of Artificial Intelligence Research 47 (2013): 853-899.

[4] Plummer, Bryan A., et al. "Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." Proceedings of the IEEE international conference on computer vision. 2015.

[5] Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." European conference on computer vision. Springer, Cham, 2014.

[6] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. PMLR, 2015.

[7] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

[8] Sünderhauf, Niko, et al. "On the performance of convnet features for place recognition." 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2015.

[9] Babenko, Artem, et al. "Neural codes for image retrieval." European conference on computer vision. Springer, Cham, 2014.

[10] Chen, Zetao, et al. "Convolutional neural network-based place recognition." arXiv preprint arXiv:1411.1509 (2014).

[11] Holliday, Andrew, and Gregory Dudek. "Pre-trained CNNs as Visual Feature Extractors: A Broad Evaluation." 2020 17th Conference on Computer and Robot Vision (CRV). IEEE, 2020.

[12] Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002.

[13] Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." Text summarization branches out. 2004.

[14] Banerjee, Satanjeev, and Alon Lavie. "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments." Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 2005.

[15] Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. "Cider: Consensus-based image description evaluation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[16] Anderson, Peter, et al. "Spice: Semantic propositional image caption evaluation." European conference on computer vision. Springer, Cham, 2016.

[17] Tan, Mingxing, and Quoc Le. "Efficientnet: Rethinking model scaling for convolutional neural networks." International Conference on Machine Learning. PMLR, 2019.

[18] Xie, Qizhe, et al. "Self-training with noisy student improves imagenet classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[19] Valev, Krassimir, et al. "A systematic evaluation of recent deep learning architectures for fine-grained vehicle classification." Pattern Recognition and Tracking XXIX. Vol. 10649. International Society for Optics and Photonics, 2018.

[20] Krause, Jonathan, et al. "3d object representations for fine-grained categorization." Proceedings of the IEEE international conference on computer vision workshops. 2013.

[21] Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[22] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[23] Szegedy, Christian, et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31. No. 1. 2017.

[24] Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[25] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[26] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[27] Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[28] Ke, Alexander, et al. "CheXtransfer: Performance and Parameter Efficiency of ImageNet Models for Chest X-Ray Interpretation." arXiv preprint arXiv:2101.06871 (2021).

[29] Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009.

[30] Irvin, Jeremy, et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.

[31] Rajpurkar, Pranav, et al. "CheXpedition: investigating generalization challenges for translation of chest x-ray algorithms to the clinical setting." arXiv preprint arXiv:2002.11379 (2020).

[32] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

[33] Zhang, Jing, Kangkang Li, and Zhe Wang. "Parallel-fusion LSTM with synchronous semantic and visual information for image captioning." Journal of Visual Communication and Image Representation (2021): 103044.

[34] Zhang, Yu, et al. "Image captioning with transformer and knowledge graph." Pattern Recognition Letters (2021): 43-49.

[35] Yi, Yanzhi, Hangyu Deng, and Jinglu Hu. "Improving image captioning evaluation by considering inter references variance." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

[36] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

[37] Tran, Kenneth, et al. "Rich image captioning in the wild." Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2016.

[38] Sharif, N., M. A. Jalwana, M. Bennamoun, W. Liu, and S. A. Shah. "Leveraging Linguistically-aware Object Relations and NASNet for Image Captioning." 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ). IEEE, 2020, pp. 1-6.
