2017 IEEE International Conference on Multimedia and Expo (ICME)
Download PDF

Abstract

Image captioning is a fundamental task which requires semantic understanding of images and the ability of generating description sentences with proper and correct structure. In consideration of the problem that language models are always shallow in modern image caption frameworks, a deep residual recurrent neural network is proposed in this work with the following two contributions. First, an easy-to-train deep stacked Long Short Term Memory (LSTM) language model is designed to learn the residual function of output distributions by adding identity mappings to multi-layer LSTMs. Second, in order to overcome the over-fitting problem caused by larger-scale parameters in deeper LSTM networks, a novel temporal Dropout method is proposed into LSTM. The experimental results on the benchmark MSCOCO and Flickr30K datasets demonstrate that the proposed model achieves the state-of-the-art performances with 101.1 in CIDEr on MSCOCO and 22.9 in B-4 on Flickr30K, respectively.
Like what you’re reading?
Already a member?Sign In
Member Price
$11
Non-Member Price
$21
Add to CartSign In
Get this article FREE with a new membership!

Related Articles