Using Visual Features to Generate Language Context
Publisher:The Ohio State University
Series/Report no.:The Ohio State University. Department of Computer Science and Engineering Honors Theses; 2021
Abstract:Current multimodal data processing methods use deep learning to combine complementary visual and textual information. Although large neural networks can be useful, they are often computationally intensive and require very large training datasets. This thesis explores a novel and direct method for predicting useful and understandable language context from the visual information within a video. Each video is represented by a scene-weighted sum of deep feature vectors across the video. Then the method predicts a novel cluster-weighted term frequency-inverse document frequency (TF–IDF) vector for a test video by averaging the TF–IDF vectors of the K training videos having the most similar visual features. The predicted vector provides information on what words are likely to be the most important in the video. We demonstrate that this method provides reliable textual information for a collection of instructional YouTube videos. The predicted TF–IDF vectors could be used to aid in speech-to-text or machine translation applications by providing a rich semantic context.
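The two steps the abstract describes, pooling per-frame features into one video vector and averaging the TF–IDF vectors of the K visually nearest training videos, can be sketched as follows. This is an illustrative sketch, not the thesis's exact implementation: the function names are hypothetical, cosine similarity is assumed as the visual-similarity measure, and the scene-weighting scheme here is a simple normalized weighting rather than the thesis's specific formulation.

```python
import numpy as np

def pool_video_features(frame_feats, scene_weights):
    """Represent a video as a scene-weighted sum of its per-frame
    deep feature vectors (the weighting scheme is an assumption)."""
    w = np.asarray(scene_weights, dtype=float)
    w = w / w.sum()                       # normalize weights to sum to 1
    return (w[:, None] * np.asarray(frame_feats)).sum(axis=0)

def predict_tfidf(test_feat, train_feats, train_tfidf, k=3):
    """Predict a TF-IDF vector for a test video by averaging the
    TF-IDF vectors of the k training videos whose pooled visual
    features are most similar (cosine similarity assumed)."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feat / np.linalg.norm(test_feat)
    sims = a @ b                          # cosine similarity to each training video
    nearest = np.argsort(sims)[::-1][:k]  # indices of the k most similar videos
    return train_tfidf[nearest].mean(axis=0)
```

For example, with three training videos whose TF–IDF vectors are the rows of an identity matrix, a test video visually close to videos 0 and 2 would receive a predicted vector that splits its weight between those two videos' terms.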
Academic Major: Electrical and Computer Engineering
Advisor:Dr. Timothy Anderson, Air Force Research Laboratory
Items in Knowledge Bank are protected by copyright, with all rights reserved, unless otherwise indicated.