INSERT from Reality: A Schema-driven Approach to Image Capture of Structured Information
Publisher:
The Ohio State University
Series/Report no.:
The Ohio State University. Department of Computer Science and Engineering Honors Theses; 2018
Abstract:
There is a tremendous amount of structured information locked away on document images, e.g., receipts, invoices, medical testing documents, and banking statements. However, the document images that retain this structured information are often ad hoc and vary between businesses, organizations, or time periods.
Although optical character recognition (OCR) allows us to digitize document images into sequences of words, there is still no means to identify schema attributes in the words of these ad hoc images and extract them into a database. In this thesis, we push beyond OCR: whereas current information extraction techniques rely only on the text that OCR recovers from structured images, we infer the visual structure and combine it with the textual information on the document image to create a highly structured INSERT statement, ready to be executed against a database. We call this approach INSERT from Reality (IFR). We use OCR to obtain the textual contents of the image. Our natural language processing steps annotate this text with relevant information, such as data type. We also prune irrelevant words to improve performance in subsequent steps. In parallel with the textual analysis, we visually segment the input document image, with no a priori information, to create a visual context window around each textual token. We merge the two analyses to augment the textual information with context from the visual context windows. Using analyst-defined heuristic functions, we score each of these context-enabled entities to probabilistically construct the final INSERT statement. We evaluated IFR on three real-world datasets and achieved F1 scores of over 83% in INSERT generation on these datasets, spending approximately 2 seconds per image on average. Comparing IFR to natural language processing approaches, such as regular expressions and conditional random fields, we found that IFR performs better at detecting the correct schema attributes. To compare IFR to a human baseline, we conducted a user study measuring INSERT quality on our datasets and found that IFR produced INSERT statements comparable to or better than that baseline.
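The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the `Entity` class, the `infer_dtype` rules, the two heuristic functions, the `receipts` table, and the sample tokens are all hypothetical, chosen only to show how analyst-defined heuristics might score context-enabled entities and select one value per schema attribute.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A textual token plus the words found in its visual context window (hypothetical)."""
    text: str
    dtype: str
    context: list = field(default_factory=list)


def infer_dtype(text):
    # Toy data-type annotation; the thesis does not specify its rules.
    if re.fullmatch(r"\$?\d+\.\d{2}", text):
        return "money"
    if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", text):
        return "date"
    return "text"


def score_total(entity):
    # Hypothetical heuristic: a monetary value near the word "total".
    score = 0.5 if entity.dtype == "money" else 0.0
    if any(word.lower() == "total" for word in entity.context):
        score += 0.5
    return score


def score_purchase_date(entity):
    # Hypothetical heuristic: any token annotated as a date.
    return 1.0 if entity.dtype == "date" else 0.0


HEURISTICS = {"total": score_total, "purchase_date": score_purchase_date}


def build_insert(table, entities, heuristics):
    """Pick the best-scoring entity per attribute and emit a parameterized INSERT."""
    columns, values = [], []
    for attribute, score_fn in heuristics.items():
        if not entities:
            break
        best = max(entities, key=score_fn)
        if score_fn(best) > 0:  # skip attributes with no plausible candidate
            columns.append(attribute)
            values.append(best.text)
    if not columns:
        return None, []
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, values


tokens = [("4.99", ["Coffee"]), ("12.48", ["Total"]), ("03/14/2018", ["Date"])]
entities = [Entity(text, infer_dtype(text), ctx) for text, ctx in tokens]
sql, params = build_insert("receipts", entities, HEURISTICS)
print(sql, params)
```

The sketch returns a parameterized statement and a separate value list rather than interpolating the extracted text into the SQL string, so OCR output is never executed as SQL. Taking the single highest score per attribute is a simplification; the abstract describes the construction as probabilistic.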
Description:
3rd place in the undergraduate 3-Minute Thesis Competition
Academic Major:
Computer Science and Engineering
Sponsors:
National Science Foundation
Embargo:
No embargo
Type:
Thesis
Items in Knowledge Bank are protected by copyright, with all rights reserved, unless otherwise indicated.