Google develops VRDU AI dataset benchmark to scan and understand documents

At the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2023 in Long Beach, CA, Google's Athena team presented the Visually Rich Document Understanding (VRDU) dataset. The dataset is meant to help build and evaluate systems that automatically extract structured data from documents such as receipts, insurance quotes, and financial statements.

While large models like PaLM 2 achieve impressive accuracy, their real-world usefulness depends on the datasets used to train and evaluate them. VRDU aims to bridge the gap between these models and complex real-world applications. To do this, the Athena team set out five requirements for a benchmark:

Rich Schema: In practice, we see a wide variety of rich schemas for structured extraction. Entities have different data types (numeric, strings, dates, etc.) that may be required, optional, or repeated in a single document or may even be nested. Extraction tasks over simple flat schemas like (header, question, answer) do not reflect typical problems encountered in practice.
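
To make the idea of a rich schema concrete, here is a minimal Python sketch of a receipt schema with required, optional, repeated, and nested fields. The class and field names are assumptions chosen for illustration and are not taken from the VRDU schema itself.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for illustration only; these field names are
# assumptions, not the actual VRDU schema.

@dataclass
class LineItem:
    # A nested entity: each line item groups several leaf fields.
    description: str
    amount: float

@dataclass
class Receipt:
    merchant: str                  # required string
    date: str                      # required date, kept as ISO text here
    total: float                   # required numeric
    tax: Optional[float] = None    # optional numeric
    # Repeated and nested: zero or more line items per document.
    line_items: list[LineItem] = field(default_factory=list)

receipt = Receipt(
    merchant="Example Store",
    date="2023-08-01",
    total=12.50,
    line_items=[LineItem("Coffee", 4.50), LineItem("Sandwich", 8.00)],
)
```

A flat (header, question, answer) schema cannot express the grouping of `description` and `amount` into one line item, which is the kind of structure this requirement is about.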

Layout-Rich Documents: The documents should have complex layout elements. Challenges in practical settings come from the fact that documents may contain tables, key-value pairs, switch between single-column and double-column layout, have varying font sizes for different sections, include pictures with captions and even footnotes. Contrast this with datasets where most documents are organized in sentences, paragraphs, and chapters with section headers, the kinds of documents that are typically the focus of classic natural language processing literature on long inputs.

Diverse Templates: A benchmark should include different structural layouts or templates. It is trivial for a high-capacity model to extract from a particular template by memorizing the structure. However, in practice, one needs to be able to generalize to new templates/layouts, an ability that the train-test split in a benchmark should measure.
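
One way to picture a split that measures generalization to new templates is to hold out whole templates rather than individual documents. The sketch below assumes each document carries a `template` label; that field and the helper `split_by_template` are hypothetical and not part of VRDU.

```python
def split_by_template(docs, held_out_templates):
    """Split documents so that held-out templates never appear in training."""
    train = [d for d in docs if d["template"] not in held_out_templates]
    test = [d for d in docs if d["template"] in held_out_templates]
    return train, test

# Toy documents; the "template" label is an assumption for illustration.
docs = [
    {"id": 1, "template": "A"},
    {"id": 2, "template": "A"},
    {"id": 3, "template": "B"},
    {"id": 4, "template": "C"},
]

train, test = split_by_template(docs, held_out_templates={"C"})
```

Because no test document shares a template with the training set, a model cannot score well simply by memorizing one layout.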

High-Quality OCR: Documents should have high-quality Optical Character Recognition (OCR) results. Our aim with this benchmark is to focus on the VRDU task itself and to exclude the variability brought on by the choice of OCR engine.

Token-Level Annotation: Documents should contain ground-truth annotations that can be mapped back to corresponding input text, so that each token can be annotated as part of the corresponding entity. This is in contrast with simply providing the text of the value to be extracted for the entity. This is key to generating clean training data where we do not have to worry about incidental matches to the given value. For instance, in some receipts, the ‘total-before-tax’ field may have the same value as the ‘total’ field if the tax amount is zero. Having token-level annotations prevents us from generating training data where both instances of the matching value are marked as ground-truth for the ‘total’ field, thus producing noisy examples.
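
The zero-tax receipt example can be sketched in a few lines of Python. The token list and both helper functions are hypothetical; they only illustrate the difference between matching on value text and pointing at exact token positions.

```python
# OCR tokens for a hypothetical receipt where the tax amount is zero.
tokens = ["Subtotal", "10.00", "Tax", "0.00", "Total", "10.00"]

def label_by_value(tokens, value):
    """Value-only annotation: mark every token whose text equals the value."""
    return [i for i, tok in enumerate(tokens) if tok == value]

def label_by_span(span):
    """Token-level annotation: the ground truth names exact token indices."""
    start, end = span  # end index is exclusive
    return list(range(start, end))

# Labeling the 'total' field both ways.
noisy = label_by_value(tokens, "10.00")  # also matches the subtotal token
clean = label_by_span((5, 6))            # only the token after "Total"
```

With value matching, the subtotal at index 1 is wrongly marked as 'total' alongside the real one at index 5; the token-level span marks only index 5.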

VRDU combines two publicly available datasets: Registration Forms and Ad-Buy Forms. The benchmark covers three tasks: Single Template Learning, Mixed Template Learning, and Unseen Template Learning. It is meant to measure how well models identify and categorize information in structured and unstructured documents, and to help researchers track progress on document understanding tasks. Results on these tasks are reported in the team's published paper.

Source: Google Research
