
    Efficient Image Captioning Based on Vision Transformer Models

Computers, Materials & Continua, 2022, Issue 10

Samar Elbedwehy, T. Medhat, Taher Hamza and Mohammed F. Alrahmawy

1 Department of Data Science, Faculty of Artificial Intelligence, Kafrelsheikh University, Egypt

2 Department of Electrical Engineering, Faculty of Engineering, Kafrelsheikh University, Egypt

3 Department of Computer Science, Faculty of Computer and Information Science, Mansoura, Egypt

Abstract: Image captioning is an emerging field in machine learning. It refers to the ability to automatically generate a syntactically and semantically meaningful sentence that describes the content of an image. Image captioning requires a complex machine learning process, as it involves two sub-models: a vision sub-model for extracting object features and a language sub-model that uses the extracted features to generate meaningful captions. Attention-based vision transformer models have recently had a great impact on the vision field. In this paper, we study the effect of using vision transformers on the image captioning process by evaluating four different vision transformer models as the vision sub-model of the image captioning system. The first vision transformer used is DINO (self-distillation with no labels). The second is PVT (Pyramid Vision Transformer), a vision transformer that does not use convolutional layers. The third is XCIT (Cross-Covariance Image Transformer), which changes the self-attention operation by focusing on the feature dimension instead of the token dimension. The last one is SWIN (Shifted Windows), a vision transformer which, unlike the other transformers, uses shifted windows when splitting the image. For a deeper evaluation, the four mentioned vision transformers have been tested with their different versions and configurations: we evaluate DINO with five different backbones, PVT with two versions (PVT_v1 and PVT_v2), one model of XCIT, and the SWIN transformer. The results show the high effectiveness of using the SWIN transformer within the proposed image captioning model compared with the other models.

Keywords: Image captioning; sequence-to-sequence; self-distillation; transformer; convolutional layer

    1 Introduction

One of the most difficult problems in artificial intelligence is automatic caption synthesis for images, i.e., image captioning. Image captioning models, as shown in Fig. 1, not only handle the computer vision difficulties of object recognition, but also describe the relationships between objects in plain language. Automatic generation of image captions has a huge impact in the fields of information retrieval, accessibility for the vision impaired, categorization of images, and image indexing. Some image captioning applications do not have enough data and require methods that can cope with this limited amount of data and help produce a smaller model with fewer parameters; one such method uses transfer learning, as in [1]. Advances in research in both language modeling and object identification have a direct influence on image captioning. Many studies in image classification, object detection and segmentation have achieved promising results using deep convolutional neural networks. Image captioning can even be applied to low-resolution images, which can be improved using modern methods such as the one in [2].

Figure 1: Example of image captioning

Modern convolutional neural networks require massive computation and massive storage capacity to achieve good performance. This challenge has been thoroughly researched in recent years, and one possible solution is using attention. Attention methods can be divided into two categories: global (or soft) attention, which is placed on all positions in the image, and local (or hard) attention, which is placed only on a few positions in the image. Transformer designs are built on a self-attention mechanism that learns the connections between sequence parts and uses attention to improve the speed of training. The transformer architecture can be used to detect objects, as it enables the model to distinguish between foreground and background objects in the encoder part when captioning an image. It can also predict locations and categories for the objects in the image, which aids image captioning models in predicting the bounding boxes and category labels for each object. Vision models with self-attention are classified into two categories [3]: models that use a single-head self-attention Transformer and models that employ a multi-head self-attention Transformer in their architectures. Self-supervised learning, which has achieved significant success for CNN (convolutional neural network)-based vision tasks, has also been investigated for ViTs (vision transformers) [4]; that work was among the first to use Transformers instead of standard convolutions in deep neural networks on large image datasets, applying the original Transformer model, with some changes, to a sequence of image 'patches' flattened as vectors [5], and it has been extended by subsequent works. We consider solving the image captioning problem with four newly proposed methods. We concentrate on the image phase of the captioning pipeline, as it is the first phase and should extract the features from the images with high accuracy. This paper is organized as follows: Section 2 provides the background, Section 3 covers the related work, and Section 4 discusses the proposed framework. Section 5 presents the evaluation stage, where we compare the results of the four proposed frameworks with the ViT, ResNet50 (residual neural network) and VGG16-LSTM (Visual Geometry Group-Long Short-Term Memory) models. Finally, Section 6 presents the conclusion and future work.

    2 Background

Common machine translation tasks include classical operations such as translating words, aligning words, reordering, etc. Image captioning using deep learning models is now considered part of the current machine translation breakthroughs, as it translates the visual image into its corresponding textual description. The goal of any machine translation process is to maximize p(T|S), which is used for estimating the parameters of an assumed probability distribution given some observed data, in order to translate a sentence S in the source language into a sentence T in the target language. Machine captioning is very similar to machine translation, except that the encoder commonly used in machine translation is replaced in image captioning by a CNN instead of an RNN (recurrent neural network) [6]. This is because recent research has shown that a rich representation of an input image can be obtained by embedding its contents in a fixed-length vector using a CNN, which is very useful for many computer vision applications.

The Vision Transformer is now the most popular model in computer vision; it uses the attention mechanism to build enhanced vision models. In this paper, we study the use of such transformers as sub-models for the vision part of the image captioning model.

    2.1 Vision Transformer

The Transformer model was first created to help solve natural language processing problems. It enables modeling long dependencies between input sequence elements and supports parallel processing of sequences. The transformer model has recently been applied effectively in the field of computer vision. Transformers, unlike convolutional networks, are designed with minimal inductive biases and are naturally suited as set-functions. Transformer architectures are based on a self-attention mechanism that learns the relationships between elements of a sequence [5], which we will go over in the next section.

    2.1.1 Attention Mechanism on Vision Transformer

The attention mechanism is increasingly used in deep learning models for neural machine translation applications to improve performance. Attention is how we focus on different parts of an image or on related words in a sentence. Fig. 2 is an example: human visual attention allows us to concentrate on a specific part of an image by seeing it at two resolutions, high resolution inside the yellow box and low resolution in the background. This attention lets us interpret the whole image correctly: the region in the yellow box, from the ear to the nose, indicates an animal, so the blanket and T-shirt carry no meaning for the task, and this judgement can be made from a small patch of the image.

Figure 2: An example of human visual attention [7]

Sneha et al. [8] proposed four categories of attention depending on the number of sequences, abstraction levels, positions and representations, as follows:

a) Number of sequences has three types: distinctive, when the key and query states belong to two distinct single input and single output sequences; co-attention, when multiple input sequences are presented simultaneously and attention weights are learned to find the interactions between these inputs; and self-attention or intra-attention, when the key and query states come from the same input sequence. The last type is the most widely used in recent research [8] (see the sketch after this list).

b) Number of abstraction levels has two types: single-level, when the attention weights are calculated only for the original input, and multi-level, when attention is applied sequentially on multiple abstraction levels of the input by using the output of one level as the query state for the next higher level, either top-down or bottom-up [8].

c) Number of positions: this category defines types of attention depending on the positions of the input sequence where attention is computed. There are three types: soft or global attention [8], which is computed using all positions in the input sequence; local attention, which uses soft shading to focus on a window in the image so that the attention calculation is more computationally efficient; and hard attention, proposed by Xu et al. [9], which computes attention using stochastically sampled hidden states of the input sequence, i.e., on certain predicted positions of the input sequence.

d) Number of representations: this category has two types. Multi-representation, in which different aspects of the input are captured through multiple feature representations and attention is used to assign importance weights to these representations, and multi-dimensional, which computes the relevance of each dimension of the input embedding vector and thereby extracts the contextual meaning of the input dimensions.
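To make the self-attention type concrete, the following minimal NumPy sketch implements single-head scaled dot-product self-attention, where queries, keys and values are all projected from the same token sequence. The projection matrices and token values are random placeholders for illustration only, not weights of any model discussed here.

```python
# A minimal sketch of single-head scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """tokens: (n_tokens, d_model); queries, keys and values come from the same sequence."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_tokens, n_tokens) token-to-token affinities
    weights = softmax(scores, axis=-1)        # attention weights over all positions
    return weights @ V                        # attended representation per token

rng = np.random.default_rng(0)
n_tokens, d_model = 16, 64                    # e.g., 16 image patches embedded in 64 dimensions
tokens = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(tokens, Wq, Wk, Wv).shape)  # (16, 64)
```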

    2.1.2 Image Captioning with Attention

The most popular architectures used in recent image captioning research are based on Encoder-Decoder and Transformer models. In the Encoder-Decoder model, attention is used after the input is first converted to a single fixed-length vector to reduce the length of the input, and this vector is then passed to the decoding step. The most commonly used Encoder-Decoder is CNN-LSTM, in which the CNN is the encoder and the LSTM is the decoder. It uses attention not only to select relevant regions and locations within the input image but also to reduce the complexity that grows with the size of the image, by dealing with selected regions at high resolution only. In contrast, the Transformer model is built using multiple Encoder and Decoder layers stacked together and connected to each other. The Transformer relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution; that is, it relates tokens to one another and to their positions within the same input.

    2.1.3 Transformer Models

Several vision transformer models have been presented in the literature; in this paper we focus on four of these models. In the following, we present the Transformer models that we used in this paper.

    a)ViT

ViT was proposed by Dosovitskiy et al. [10]. Instead of using a CNN in its architecture, it directly applies the original Transformer architecture to image patches, along with positional embeddings, for the image classification task. It has outperformed a comparable state-of-the-art CNN while using four times fewer computational resources, and most classification tasks in computer vision have adopted it [11-13].
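For illustration, the following sketch shows the patch-embedding step that ViT applies before its Transformer layers: the image is split into non-overlapping patches, each patch is flattened and linearly projected, and positional embeddings are added. The patch size and embedding dimension are assumptions chosen for the example, not the exact configuration used in our experiments.

```python
# A minimal sketch of ViT-style patch embedding (random projection stands in for the learned one).
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=384):
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_patches = (h // patch_size) * (w // patch_size)
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, -1)  # flatten each patch to a vector
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(patches.shape[1], d_model))                # linear projection (placeholder weights)
    pos = rng.normal(size=(n_patches, d_model))                        # positional embeddings (learned in practice)
    return patches @ proj + pos                                        # (n_patches, d_model) token sequence

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 384): 14 x 14 patches of a 224 x 224 image
```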

    b)PVT

PVT [14] inherits the advantages of both CNNs and Transformers and tries to solve their problems by combining the strengths of CNN and ViT, without convolution. It employs a variant of self-attention called SRA (Spatial-Reduction Attention) to overcome the quadratic complexity of the attention mechanism used in ViT, as sketched below. Also, unlike ViT, PVT can be used for dense prediction, as it can be trained on dense partitions of an image to achieve high output resolution. Moreover, PVT uses a progressive shrinking pyramid to reduce the computations on large feature maps. The second version, PVT_v2 [15], solves the problem of PVT_v1's fixed-size position encoding, reduces the complexity that comes with PVT_v1, and fixes, to a certain extent, the loss of local image continuity caused by the non-overlapping patches used in PVT_v1.
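The sketch below illustrates the spatial-reduction idea behind SRA: keys and values are spatially downsampled by a reduction ratio before attention, so the attention matrix shrinks from N x N to N x (N/R^2). Average pooling is used here as a stand-in for the strided linear reduction of the actual model, so this is only an approximation of SRA.

```python
# An illustrative approximation of PVT's spatial-reduction attention (SRA).
import numpy as np

def spatial_reduction_attention(tokens, hw, reduction=4):
    n, d = tokens.shape
    h, w = hw
    # reduce keys/values spatially by average pooling over (reduction x reduction) blocks
    grid = tokens.reshape(h, w, d)
    kv = grid.reshape(h // reduction, reduction, w // reduction, reduction, d).mean(axis=(1, 3))
    kv = kv.reshape(-1, d)                        # (N / R^2, d) reduced keys/values
    scores = tokens @ kv.T / np.sqrt(d)           # (N, N / R^2) instead of (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv                           # attended tokens, same shape as the input

x = np.random.default_rng(0).normal(size=(56 * 56, 64))   # stage-1 tokens of a 224 x 224 image
print(spatial_reduction_attention(x, (56, 56)).shape)      # (3136, 64)
```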

    c)DINO

DINO [16] is a self-supervised vision transformer that applies knowledge distillation with no labels, where a student model is trained to match the output of a supervising teacher model. In this transformer, the teacher is distilled from the student during training, which is called dynamic training. Knowledge distillation means training a student network to match the output of a teacher network; here a single trained model with an identical architecture is used for both roles. Self-training means propagating a small set of annotations to a large set of unlabeled instances; this helps improve the quality of the features and can work with soft labels, in which case it is referred to as knowledge distillation.
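The following sketch illustrates the self-distillation objective behind DINO: the student's softened output distribution is trained to match the centered and sharpened teacher output on another view of the same image, while the teacher's weights track the student through an exponential moving average. The temperatures, momentum and output dimensionality are illustrative values, not the exact training recipe.

```python
# A minimal sketch of the DINO self-distillation loss and the teacher EMA update.
import numpy as np

def softmax(x, t):
    x = (x - x.max()) / t                          # temperature-scaled softmax
    e = np.exp(x)
    return e / e.sum()

def dino_loss(student_logits, teacher_logits, center, t_s=0.1, t_t=0.04):
    p_teacher = softmax(teacher_logits - center, t_t)   # centered + sharpened teacher (no gradient in practice)
    p_student = softmax(student_logits, t_s)
    return -(p_teacher * np.log(p_student + 1e-12)).sum()  # cross-entropy H(teacher, student)

def ema_update(teacher_w, student_w, momentum=0.996):
    return momentum * teacher_w + (1 - momentum) * student_w  # teacher slowly follows the student

rng = np.random.default_rng(0)
s, t, c = rng.normal(size=1024), rng.normal(size=1024), np.zeros(1024)  # output dim is illustrative
print(dino_loss(s, t, c))
print(ema_update(np.zeros(3), np.ones(3)))        # teacher weights drift slowly toward the student's
```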

    d)XCIT

XCIT [17] replaces self-attention with a transposed attention model called cross-covariance attention (XCA). This is done because the core self-attention operations have relatively high time and memory complexity that grows with the number of input tokens, or patches. The proposed cross-covariance attention modifies the transformer model by adding a transposed attention that operates on feature dimensions instead of token dimensions. This reduces the computational complexity below that of self-attention and provides flexibility with high-resolution images, as XCIT works with a fixed number of channels instead of the number of tokens. The policy of this model is to take the features, divide them into heads, and apply cross-covariance attention to each head separately; the attention weights obtained per head are then applied to the corresponding value tensors.
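The sketch below illustrates the cross-covariance attention idea: the attention map is a d x d matrix computed between feature channels rather than an N x N matrix between tokens, so its cost grows linearly with the number of tokens. The normalization and temperature only loosely follow the published description; this is an illustration, not the reference implementation.

```python
# An illustrative single-head cross-covariance attention (XCA) in NumPy.
import numpy as np

def xca(tokens, Wq, Wk, Wv, tau=1.0):
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv         # each (N, d)
    Qn = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-8)   # normalize along the token dimension
    Kn = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-8)
    scores = (Kn.T @ Qn) / tau                              # (d, d) channel-to-channel attention map
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return V @ weights                                      # (N, d); N never appears squared

rng = np.random.default_rng(0)
N, d = 3136, 64                                             # many tokens, few channels
x = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(xca(x, Wq, Wk, Wv).shape)                             # (3136, 64)
```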

    e)Swin Transformer

Feature maps in this model [18] are built with a hierarchical Transformer by merging the image patches in deeper layers. The representation is computed with shifted windows (SWIN). The model has linear computational complexity with respect to the input image size, because self-attention is computed only within each local window. This strategy brings greater efficiency by restricting the self-attention computation to non-overlapping local windows while still allowing cross-window connections. The model uses patch-merging layers with a linear layer to reduce the number of tokens, and the Swin transformer blocks are applied to obtain the features; these steps are repeated to produce the hierarchical structure. The model also replaces the standard multi-head self-attention (MSA) with shifted-window attention, and this is the core change that improves the results: plain window self-attention lacks connections among windows, which limits its modelling power. The advantage of this model lies in the shifted-window scheme, as it cures the limitation of the window-based attention in the previous layer and thereby enhances the modelling power. The shifted-window scheme also has much lower latency than a sliding window, as it creates connections between the neighbouring non-overlapping windows of the previous layer. The model takes the image as input and divides it into non-overlapping patches using a patch-splitting module, as in ViT [18]. It treats every patch as a token and applies the Swin transformer, a modification of self-attention, within each window of patches.
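The following sketch illustrates window partitioning and the cyclic shift used by the Swin transformer: self-attention is restricted to each local window, and in the next block the windows are shifted so that neighbouring windows exchange information. The window size and feature-map shape are illustrative assumptions.

```python
# A minimal sketch of Swin-style window partitioning and cyclic shifting.
import numpy as np

def window_partition(feature_map, window=7):
    h, w, c = feature_map.shape
    x = feature_map.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)  # (num_windows, tokens_per_window, c)

def shifted_windows(feature_map, window=7):
    shift = window // 2
    shifted = np.roll(feature_map, shift=(-shift, -shift), axis=(0, 1))  # cyclic shift before partitioning
    return window_partition(shifted, window)

fm = np.random.default_rng(0).normal(size=(56, 56, 96))     # stage-1 feature map of the hierarchy
print(window_partition(fm).shape)   # (64, 49, 96): attention cost is per 7 x 7 window, not 56 x 56
print(shifted_windows(fm).shape)    # same shape, but window boundaries have moved
```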

3 Related Work

Image captioning has been the subject of many research papers. For example, Lu et al. [19] found that most attention methods ignore the fact that words such as "the" or "an" in the caption text cannot match any image region, yet force each word to match an image part. The authors therefore proposed an adaptive attention model that solves this problem and improves the matching between words and image regions. Huang et al. [20] proposed the AoA (Attention on Attention) mechanism, which uses conventional attention mechanisms to determine the relationship between queries and attention results. The model generates an information vector and an attention gate, both of which use the attention results and the current context; it then applies element-wise multiplication to them by adding another attention step, obtaining the attended information and the expected useful knowledge. AoA was applied in that work to both the encoder and the decoder of the image captioning model. The main advantage of the model is that it counts objects of the same type accurately. Another mechanism, proposed by He et al. [21], uses top-down soft attention to simulate human attention in captioning tasks; the authors show that human attention behaves differently when seeing an image and when describing it, and that there is strong relevance between described and attended objects. This mechanism uses a CNN as the feature encoder and integrates the soft attention model with the salience of the image through a top-down soft attention mechanism for automatic captioning systems. In [22], Wang et al. proposed a novel method to model the relevance between important parts of the image using a graph neural network: features are first extracted from the image using a deep CNN; then a GNN (Graph Neural Network) model is used to learn the relationships between visual objects. After that, the important, relevant objects are selected using a visual context-aware attention model. Finally, sentences are generated using an LSTM-based language model. This mechanism guides attention selection by memorizing previously attended visual content, taking into consideration the historical context of previous attention, and it can learn relation-aware visual representations for image captioning. The work of Biswas et al. [23] is concerned with improving the level of image features. The authors proposed a novel image captioning system with a bottom-up attention mechanism. The model combines low-level features such as sharpness, contrast and colorfulness with high-level features such as motion classification or face recognition to detect the attention regions of the image that suit the bottom-up attention mechanism. The weights of the impact factors for each region are then adjusted using a Gaussian filter, and a Faster RCNN (Region-based Convolutional Neural Network) is used for detecting objects. Compared with the "CNN+Transformer" paradigm, Liu et al. introduced CPTR (CaPtion TransformeR) in [24], a simple and effective method that avoids convolution operations entirely. Due to the local nature of the convolution operator, a CNN encoder is limited in global context modeling, which can only be achieved by gradually enlarging the receptive field as the convolution layers go deeper. The encoder of CPTR, however, can exploit long-range dependencies among the sequentialized patches from the very beginning via the self-attention mechanism. During word generation, CPTR models "words-to-patches" attention in the cross-attention layer of the decoder, which is shown to be effective.

    4 The Proposed Framework

The task of captioning images can be separated into two sub-models: the first is a vision-based sub-model, which acts as a vision encoder that uses a computer vision model to extract features from input images, and the second is a language-based sub-model, which acts as a decoder that translates the features and objects given by the vision sub-model into natural sentences. Fig. 3 shows our proposed image captioning model, where the block labelled "Attention-based Transformer" refers to the vision transformers that extract the feature vectors from the input images and feed them to the language decoder that generates the captions.

In our work we experiment with vision transformer encoders to find the most efficient transformer for image captioning. We experimented with four different attention-based vision Transformers for extracting the features from the images, and for each Transformer we applied its known versions; in the following subsections we give more details on how these transformers have been used in our proposed model. Regarding the language-based decoder, we used only an LSTM (Long Short-Term Memory) model to predict the sequences of the generated captions from the feature vectors obtained by applying the vision transformer.

    4.1 Vision-Transformer Encoder Model

We present here the different attention-based vision transformer models that are tested in our proposed model. These are: the DINO transformer [16], which has been used with different backbones in our proposed captioning model (ResNet50, ViT-S/8, ViT-S/16, XCIT-medium-24/p8 and ViT-B/8); PVT, for which we tested the two versions presented in [14,15] with different configurations, namely PVT_v1_Small, PVT_v1_Large, PVT_v2_b5 and PVT_v2_b2_Linear; the XCIT model [17], for which we tested the XCIT-Large version only; and finally the SWIN transformer presented in [18], for which we tested the SWIN-Large version.
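As a sketch of this encoder stage, the snippet below extracts one feature vector per image with a pretrained DINO ViT-S/8 backbone loaded through torch.hub. The hub repository and model identifier (facebookresearch/dino, dino_vits8) are assumptions about the publicly released DINO checkpoints, and the preprocessing shown is a generic recipe rather than the exact pipeline used in our experiments.

```python
# A sketch of extracting per-image features with a pretrained vision-transformer backbone.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),               # input resized per Tab. 1 for the chosen backbone
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumed hub entry point for the released DINO ViT-S/8 checkpoint.
encoder = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
encoder.eval()

@torch.no_grad()
def extract_features(image_path):
    img = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    return encoder(img).squeeze(0)               # e.g. a 384-dim feature vector for ViT-S

# features = extract_features('example.jpg')     # cached and later fed to the LSTM decoder
```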

    4.2 The Proposed Language Model

For our proposed image captioning model to predict the word sequences corresponding to the image contents, i.e., the captions, we used a single language decoder model, as in this paper we focus only on evaluating the attention-based vision transformers. The fixed language decoder is an LSTM-based model that uses the feature vectors obtained from the preceding vision transformer to generate the captions. In Fig. 4, the blue arrows correspond to the recurrent connections between the LSTM memories. All LSTMs share the same parameters. After receiving the image and all preceding words, as defined by P(St | I, S0, S1, ..., St-1), each word in the sentence is predicted by the trained LSTM model, where I is an image, S its correct transcription, and We the word embedding. We denote by S = (S0, S1, ..., SN) a true sentence describing this image, where S0 is a special start word (such as "startseq") and SN a special stop word (such as "endseq"). For the image and each sentence word, a copy of the LSTM memory is produced so that all LSTMs have identical settings, and the output of the LSTM at time t-1 is supplied to the LSTM at time t. In the unrolled version, all recurrent connections are converted to feed-forward connections [25].
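A minimal sketch of greedy caption generation with such a decoder is shown below: starting from the start word, the model predicts one word at a time conditioned on the image features and all previously generated words, until the stop word or the maximum length is reached. The tokenizer and maximum length come from dataset preprocessing and are assumed to exist here.

```python
# A sketch of greedy decoding with a trained merge-style captioning model (assumed Keras Tokenizer).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, image_features, max_length):
    caption = 'startseq'                                     # S0: special start word
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([image_features[np.newaxis, :], seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':                 # SN: special stop word
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()
```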

    5 Evaluation and Configurations

We evaluated our proposed model on the MS COCO (Microsoft Common Objects in Context) dataset [26], which is the most widely used image captioning benchmark. To be consistent with previous work, we used 30,000 images for training and 5,000 images for testing. We trained our model end-to-end as a Keras model using a laptop with one GPU (RTX 2060). Fig. 5 shows the plot of the caption generation deep learning model when using ViT-S/16 and ViT-S/8 with DINO, where input1 is the input of image features, input2 is the text sequences (captions), and the 384-element feature vector is processed by a dense layer to produce a 256-element representation of the image. The same settings are used for the five DINO backbones, ResNet50, VGG16, PVT_v1, PVT_v2, ViT, XCIT and SWIN with their different variants, except that the input image shape changes according to the model. We used the Netron site [27] to plot the model by uploading the model file.

The compared models are: the VGG16 model with LSTM; PVT_v1 with its variants (Small and Large) with LSTM, and PVT_v2 with its variants (b5 and b2_Linear); the ResNet50-LSTM model; DINO with five different backbones (ResNet50, ViT-S/8, ViT-S/16, ViT-B/8 and DINO-XCIT-m24_p8) with LSTM; and XCIT, SWIN and ViT with LSTM.

    The hyper parameter settings for our model are as follows:

Language model layers: 1-layer LSTM; word embedding dimensionality: 512; hidden layer dimensionality: 256; maximum epochs: 20; LSTM dropout: 0.5; learning rate: 4e-4; optimizer: Adam; batch size: 16.
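The following Keras sketch approximates a merge-style captioning model with these hyperparameters: the image feature vector and the partial caption are encoded separately, added together, and passed to a softmax over the vocabulary. The feature dimension corresponds to the DINO ViT-S case, and the vocabulary size and maximum caption length are placeholders; this is an approximation of the described model, not the authors' exact code.

```python
# A sketch of the LSTM-based language decoder in Keras with the hyperparameters listed above.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def build_caption_model(vocab_size, max_length, feature_dim=384):
    # image branch: transformer feature vector -> 256-d representation
    inputs1 = Input(shape=(feature_dim,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # text branch: partial caption -> 512-d embedding -> 256-d LSTM state
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 512, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge both branches and predict the next word
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=4e-4))
    return model

model = build_caption_model(vocab_size=8000, max_length=34)   # placeholder vocabulary/length
```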

Vision transformer: we compared different transformer models with different versions and configurations, as follows:

- DINO with five different backbones (ResNet50, ViT-S/16, ViT-S/8, ViT-B/8 and XCIT-m24_p8), together with the following transformer models:

1- The ViT model, which was proposed as a replacement for CNNs and achieved better results than convolutional networks, as it is applied directly to the patches of the image treated as a sequence. Self-attention allows ViT to aggregate information across the whole image even in the lowest layers.

Figure 5: Plot of the caption generation deep learning model for DINO (ViT-S/8 and ViT-S/16)

2- The PVT model in its two versions, PVT_v1 and PVT_v2:

a- PVT_v1 was introduced in [14] to overcome the difficulty of porting the transformer to various dense prediction tasks. Unlike convolutional networks, PVT_v1 controls the scale of feature maps with a progressive shrinking strategy based on patch embedding layers instead of different convolutional strides, but PVT_v1 achieved only a small improvement.

b- PVT_v2 [15] was proposed to solve PVT_v1's problem of fixed-size position encoding, which prevents images from being processed flexibly; it also reduces the complexity of PVT_v1 and fixes, to a certain extent, the loss of local image continuity caused by PVT_v1's non-overlapping patches. We also compared this model in its two variants (PVT_v2-b5 and PVT_v2-b2-Linear).

3- The XCIT model proposes a solution for global interactions among all tokens by modifying the self-attention operation with a transposition: the interactions occur over feature channels instead of tokens, which gives the model flexibility, reduces the computational complexity to linear, and also achieves good results on high-resolution images.

4- The SWIN model uses shifted windows to split the image instead of the traditional sliding-window splitting. This new scheme affects performance: sliding windows lack connections between windows, whereas shifted windows operate on non-overlapping windows while still connecting neighbouring ones.

For evaluation purposes, we compare the results generated by the tested attention-based transformer models with non-transformer models, namely ResNet50 and VGG16, which are convolutional networks. In total, 14 different vision models are tested in our experiments. We compared and evaluated each of these vision models as a feature extractor in the proposed image captioning model in several ways, as explained next in this section.

The input images to the vision encoder are resized to square images with a different resolution in each experiment, as required by the vision model used. The size used for each tested model is shown in Tab. 1.

Table 1: Shape of the input image for the different models

In our proposed process for evaluating the use of different attention-based vision transformers for image captioning, we considered several criteria. In the following, we define each of these criteria and present the evaluation metric(s) used to evaluate it:

    A.Efficiency of Image Captioning

This criterion measures how good the model is at producing captions for the input image. For this criterion, we used a set of common metrics including BLEU (Bilingual Evaluation Understudy)-1, 2, 3, 4, METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), CIDEr (Consensus-based Image Description Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation) scores, denoted B1, B2, B3, B4, M, R, C and S, respectively. BLEU scores [30] are used in text translation to evaluate translated text against one or more reference translations; we compare each generated caption against all of the reference captions for the image, which is very popular for captioning tasks. BLEU scores for 1, 2, 3 and 4 are calculated over cumulative n-grams. The SPICE metric [31] gives a more meaningful evaluation of the semantics of the generated captions. ROUGE [32] was originally designed for text summarization. METEOR [32] is another metric for evaluating machine translation output, slightly younger than BLEU. CIDEr and SPICE are specially designed for image captioning: CIDEr [33] uses TF-IDF weighting, computing the frequency of a word in a given corpus as an indicator of the semantic importance of the word, whereas SPICE uses word tuples to compute the intersection of the candidate and ground-truth captions.
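As an illustration of the BLEU computation, the snippet below scores generated captions against their reference captions with NLTK's corpus_bleu using cumulative n-gram weights; the dedicated COCO evaluation toolkit used for METEOR, ROUGE, CIDEr and SPICE is not reproduced here, and the example captions are made up.

```python
# A sketch of corpus-level BLEU-1..4 scoring of generated captions with NLTK.
from nltk.translate.bleu_score import corpus_bleu

def bleu_scores(references, hypotheses):
    """references: per-image lists of reference token lists; hypotheses: one token list per image."""
    return {
        'BLEU-1': corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)),
        'BLEU-2': corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)),
        'BLEU-3': corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0)),
        'BLEU-4': corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)),
    }

refs = [[['a', 'dog', 'runs', 'on', 'the', 'beach']]]        # one image with one reference caption
hyps = [['a', 'dog', 'running', 'on', 'the', 'beach']]       # the generated caption for that image
print(bleu_scores(refs, hyps))
```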

The evaluation results of the image captioning model using these metrics for all the tested vision models are shown in Tab. 2. The scores of our tested models with DINO, PVT_v1 and PVT_v2 indicate that using them for extracting image features improves the performance of the captioning model, as the scores of the DINO backbones with LSTM are better than those of the VGG16 and ResNet50-LSTM models, which are CNN-based. The ViT-B/8 backbone with LSTM and the smaller-patch ViT-S variants further enhance the quality of the generated features. XCIT also achieves better results than the previous models, thanks to its transposed attention that operates on feature dimensions instead of token dimensions. But the most effective model is the SWIN transformer, which outperforms all the other models because it splits the image into patches using window shifting instead of window sliding, which also reduces the computational complexity and time compared with the other models.

Table 2: Image captioning efficiency measurements compared on MS COCO. All models are fine-tuned with self-critical training


Fig. 6 shows the comparison of BLEU scores for the tested models, and Fig. 7 shows the comparison of METEOR, ROUGE, CIDEr and SPICE for the same models.

As shown in Fig. 7, the performance of the five DINO backbones is better than that of the VGG16 and ResNet50-LSTM models and slightly better than PVT_v1 and PVT_v2. XCIT is better than all the above models, but again the SWIN transformer is the best model among all of them.

    B.Space Requirements

This criterion is particularly important when the image captioning model is used on memory-constrained devices. In our evaluation, we used two metrics for this criterion: the number of parameters and the size of the trained model. The number of parameters in the model structure is a common metric for evaluating the space requirements of a model, as the smaller this number the smaller the model. In the proposed captioning process, the language decoder is fixed in all experiments and only the vision model changes. Tab. 3 shows the number of parameters of both the vision model and the corresponding image captioning model for all 14 tested models; DINO with ViT-S/16 and ViT-S/8 has the smallest number of parameters compared with the others, while VGG16 has the largest.

Table 3: Number of parameters for each model used in the captioning model

We also measured the size of each trained image captioning model for the 14 tested vision models. As shown in Tab. 4, the image captioning models using the DINO transformer with ViT-S/8 and ViT-S/16 have the smallest size, while the largest model is the one using VGG16; the SWIN-based model, the most efficient in image captioning, is about 3% larger than the smallest ones. These results agree with the results obtained using the number-of-parameters metric.

Table 4: Sizes of the tested image captioning models


    C.Time Evaluation

For each of the 14 tested vision models, we compared the training time required to reach the best epoch for captioning. As shown in Fig. 8, PVT_v2-b5 and PVT_v2-b2-Linear were the fastest to train, taking the least training time (7.4 h), while PVT_v1-Large was the slowest, finishing training in 17.5 h; SWIN, the most efficient in producing captions, took 10 h.

    D.Performance Evaluation

Performance evaluation is an important criterion, as it reflects how fast the image captioning model is in generating captions for an input image. We used the FLOPs metric to evaluate the performance of the 14 tested image captioning models; FLOPs describe how many operations are required to run a single instance of a given model. The more FLOPs, the more time the model will take for inference, i.e., better models have a smaller number of FLOPs. Tab. 5 shows the number of FLOPs of each of the 14 tested image captioning models. VGG16 was the worst model, while DINO-ViT-S/8 and DINO-ViT-S/16 were the fastest models in generating the captions, with the SWIN model being slightly slower (by 2.4%).
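As a sketch of how such FLOP counts can be obtained, the snippet below measures one forward pass of a vision backbone with fvcore's FlopCountAnalysis; the choice of counting tool and the ResNet50 stand-in are assumptions for illustration, since the paper does not state which counter was used.

```python
# A sketch of estimating forward-pass FLOPs for a vision backbone (assumed fvcore tooling).
import torch
from fvcore.nn import FlopCountAnalysis
from torchvision.models import resnet50

model = resnet50()                               # stand-in backbone; swap in the evaluated encoders
dummy = torch.randn(1, 3, 224, 224)              # input shape per Tab. 1
print(f"{FlopCountAnalysis(model, dummy).total() / 1e9:.2f} GFLOPs")
```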

Table 5: Number of FLOPs for the tested image captioning models

Different samples of the captions produced by the tested models are shown in Fig. 9: on the left are the given images and on the right the corresponding captions, with the captions in each box coming from the same model. We show the captions from all the tested models. Most of the captioning sentences are more accurate than those from the ResNet50-LSTM and VGG16 models, especially when using the SWIN transformer.

    6 Conclusion

In this paper, we focused on increasing image captioning performance by extracting image features with an attention-based vision transformer acting as an encoder, followed by an LSTM-based decoder acting as a language model that builds the corresponding caption. To find the best vision transformer encoder for this model, we tested four relatively new transformers: the self-distillation transformer DINO with five different backbones, the PVT transformer with two versions (and two variants of each), the XCIT transformer, and finally the SWIN model. In our experiments, we built different image captioning models using each of these transformers with different configurations/versions and compared them against image captioning models that use non-transformer convolutional networks, namely ResNet50 and VGG16. We found that an image captioning model using the SWIN transformer as the vision encoder is significantly more efficient at generating image captions than all the other models, based on the BLEU, METEOR, ROUGE, CIDEr and SPICE metrics. This efficiency is due to the way the SWIN transformer splits the image using shifted windows instead of the traditional manner. In terms of caption-generation speed, the DINO-based image captioning model is the fastest, as its structure has the smallest number of FLOPs. On the other hand, the PVT-based image captioning model is the fastest to train, especially when using PVT_v2_b5. Finally, the image captioning model based on the XCIT transformer shows very good captioning efficiency, coming right after the SWIN model, while having a slightly smaller size, slightly faster inference, and shorter training time.

    Funding Statement:The authors received no specific funding for this study.

    Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
