Show simple item record

dc.contributor.advisor	Davis, Jim
dc.creator	Zachmann, Isaac
dc.description.abstract	Current multimodal data processing methods use deep learning to combine complementary visual and textual information. Although large neural networks can be useful, they are often computationally intensive and require very large training datasets. This thesis explores a novel and direct method for predicting useful and understandable language context from the visual information within a video. Each video is represented by a scene-weighted sum of deep feature vectors across the video. Then the method predicts a novel cluster-weighted term frequency-inverse document frequency (TF-IDF) vector for a test video by averaging the TF-IDF vectors of the K training videos having the most similar visual features. The predicted vector provides information on what words are likely to be the most important in the video. We demonstrate that this method provides reliable textual information for a collection of instructional YouTube videos. The predicted TF-IDF vectors could be used to aid in speech-to-text or machine translation applications by providing a rich semantic context.	en_US
dc.description.sponsorship	Dr. Timothy Anderson, Air Force Research Laboratory	en_US
dc.publisher	The Ohio State University	en_US
dc.relation.ispartofseries	The Ohio State University. Department of Computer Science and Engineering Honors Theses; 2021	en_US
dc.subject	video processing	en_US
dc.subject	visual features	en_US
dc.subject	text processing	en_US
dc.subject	language processing	en_US
dc.title	Using Visual Features to Generate Language Context	en_US
dc.description.embargo	No embargo	en_US
dc.description.academicmajor	Academic Major: Electrical and Computer Engineering	en_US
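The prediction step the abstract describes (average the TF-IDF vectors of the K training videos with the most similar visual features) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the function name `predict_tfidf` and the choice of cosine similarity are assumptions, and the scene-weighted pooled feature vectors are taken as given inputs.

```python
import numpy as np

def predict_tfidf(query_feat, train_feats, train_tfidf, k=5):
    """Predict a TF-IDF vector for a test video by averaging the TF-IDF
    vectors of the K training videos whose pooled visual features are
    most similar to the query's (cosine similarity assumed here)."""
    # Normalize features so dot products equal cosine similarities.
    q = query_feat / np.linalg.norm(query_feat)
    T = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = T @ q                     # similarity of each training video to the query
    nearest = np.argsort(sims)[-k:]  # indices of the K most similar videos
    return train_tfidf[nearest].mean(axis=0)
```

With k=1 this reduces to copying the TF-IDF vector of the single most visually similar training video; larger K smooths the prediction across several related videos.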

