An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

专业资料 > IT/计算机 > 计算机视觉 > 文档预览

9 页 0 下载 1681 浏览 0 评论 0 收藏 3.0分

温馨提示：如果当前文档出现乱码或未能正常浏览，请先下载原文档进行浏览。

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 第 1 页

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 第 2 页

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 第 3 页

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 第 4 页

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 第 5 页

下载文档到电脑，方便使用

下载文档

还有 4 页可预览，继续阅读

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition内容摘要：

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition Baoguang Shi, Xiang Bai and Cong Yao School of Electronic Information and Communications Huazhong University of Science and Technology, Wuhan, China arXiv:1507.05717v1 [cs.CV] 21 Jul 2015 {shibaoguang,xbai}@hust.edu.cn, yaocong2010@gmail.com Abstract sual objects, such as scene text, handwriting and musical score, tend to occur in the form of sequence, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”. Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence. Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). It turns out a large trained model with a huge number of classes, which is difficult to be generalized to other types of sequencelike objects, such as Chinese texts, musical scores, etc., because the numbers of basic combinations of such kind of sequences can be greater than 1 million. In summary, current systems based on DCNN can not be directly used for image-based sequence recognition. Recurrent neural networks (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in both training and testing. However, a preprocessing step that converts Image-based sequence recognition has been a longstanding research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it. 1. Introduction Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, majority of the recent works related to deep neural networks have devoted to detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: imagebased sequence recognition. In real world, a stable of vi1 2. The Proposed

本文档由 sddwt 于 2021-05-26 02:58:17上传分享

下载原文档(1.01 MB)

收藏分享

给文档打分

评论列表

暂时还没有评论，期待您的金玉良言