When you create a training dataset for document classification or data extraction, you need to split your documents into two subsets: one for training the model and one for evaluating it. The training subset teaches the model to recognize the patterns and features of your document types and fields. The evaluation subset measures the model's performance and accuracy on unseen data. Documents in the evaluation subset should not also be used for training: the model would then be scored on documents it has already seen, which inflates the measured accuracy and hides overfitting [1].
The recommended split of documents for training and evaluation depends on the size and diversity of your data. However, a general guideline is to use a 70/30 or 80/20 ratio, where 70% or 80% of the documents are used for training and the remaining 30% or 20% are used for evaluation. This leaves the model enough data to learn from while keeping enough held-out data for a meaningful test. For example, if you have 15 documents per vendor, you can use 10 documents for training and 5 documents for evaluation. This gives a 67/33 split, which is close to the 70/30 ratio. You can also use the Data Manager tool to create and manage your training and evaluation datasets [2].
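The per-vendor arithmetic above can be sketched in a few lines of Python. This is only an illustration of the split logic, not how Data Manager works internally; the function name `split_by_vendor`, the `(vendor, doc_id)` tuple layout, and the vendor names are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_vendor(docs, train_fraction=0.7, seed=0):
    """Split (vendor, doc_id) pairs into training and evaluation lists.

    Hypothetical helper: each vendor is split separately so that every
    vendor appears in both subsets, and no document lands in both.
    """
    by_vendor = defaultdict(list)
    for vendor, doc_id in docs:
        by_vendor[vendor].append(doc_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation = [], []
    for vendor in sorted(by_vendor):
        ids = sorted(by_vendor[vendor])
        rng.shuffle(ids)
        # Truncate toward zero: 15 documents at 0.7 gives 10 for training.
        cut = int(len(ids) * train_fraction)
        train += [(vendor, d) for d in ids[:cut]]
        evaluation += [(vendor, d) for d in ids[cut:]]
    return train, evaluation


# Assumed sample data: two vendors with 15 documents each.
docs = [(v, f"{v}-{i:02d}") for v in ("vendor_a", "vendor_b") for i in range(15)]
train_docs, eval_docs = split_by_vendor(docs)
print(len(train_docs), len(eval_docs))  # 20 10
```

Splitting within each vendor, rather than across the whole pool, keeps the 10/5 proportion for every vendor, so no vendor ends up missing from the evaluation subset.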
References: [1] Document Understanding - Training High Performing Models. [2] Data Manager - Creating a Dataset.