Fonduer: Knowledge Base Construction from Richly Formatted Data
Published in Proceedings of the 2018 ACM SIGMOD/PODS Conference on the Management of Data, June 2018.
Abstract
We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base—and in some cases produces up to 1.87× the number of correct entries—compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer’s new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.
BibTeX entry
@inproceedings{fonduer-sigmod18,
  author    = {Sen Wu and Luke Hsiao and Xiao Cheng and Braden Hancock and Theodoros Rekatsinas and Philip Levis and Christopher Ré},
  title     = {{Fonduer: Knowledge Base Construction from Richly Formatted Data}},
  booktitle = {{Proceedings of 2018 ACM SIGMOD/PODS Conference on the Management of Data}},
  year      = {2018},
  month     = {June}
}