Add option to skip data by adding a flag instead of removing them #566
Description
This PR implements the feature to add skip labels to filtered entries in the json/parquet outputs instead of completely removing filtered entries. When this feature is enabled, it will also log which filter discarded an entry by adding its class name to a field ("reason" by default).
This allows easy tracking/book-keeping in some scenarios, for example:
The feature can be applied to all filters without extra code change.
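To illustrate the book-keeping this enables, here is a minimal sketch that tallies discarded entries per filter from a labeled JSON-lines output. The sample records, the filter class names, and the `skip` field name are made up for illustration; only the `reason` field name comes from the description above.

```python
import json
from collections import Counter

# Made-up JSON-lines output. "skip" is an assumed name for the skip label;
# "reason" is the default field name mentioned in the description.
jsonl_output = """\
{"text": "hello world"}
{"text": "", "skip": true, "reason": "NonEmptyFilter"}
{"text": "12345", "skip": true, "reason": "NoDigitsFilter"}
{"text": "", "skip": true, "reason": "NonEmptyFilter"}
"""

records = [json.loads(line) for line in jsonl_output.splitlines() if line]

# Count how many entries each filter discarded.
discarded_by = Counter(r["reason"] for r in records if r.get("skip"))

# Recover the entries that passed every filter.
kept = [r for r in records if not r.get("skip")]
```

Because the skipped entries stay in the output, both the per-filter counts and the clean subset can be derived from the same file.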
Although all entries are preserved in the dataset, when filters are chained in the form g(f(x)), g is still run only on entries that were not filtered out by f.
Usage
Simply add the extra flag add_skip_label_only=True to any filter definition. For example:
This feature won't work with the plain bitext output format because extra flags can't be added there. Make sure the json or parquet format is used.
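As a rough sketch of the behaviour described in this PR (not its actual implementation), the following shows two chained filters that label entries instead of removing them. `SketchFilter`, `MinLengthFilter`, `NoDigitsFilter`, and the `skip` field name are hypothetical stand-ins; only the `add_skip_label_only` flag and the `reason` field come from the description.

```python
from typing import Dict, List

Entry = Dict[str, object]


class SketchFilter:
    """Hypothetical stand-in for a filter; not the library's real class."""

    def __init__(self, add_skip_label_only: bool = False):
        self.add_skip_label_only = add_skip_label_only
        self.seen: List[object] = []  # texts that keep() was evaluated on

    def keep(self, entry: Entry) -> bool:
        raise NotImplementedError

    def __call__(self, entries: List[Entry]) -> List[Entry]:
        out = []
        for entry in entries:
            if entry.get("skip"):
                # Already discarded by an earlier filter: pass through untouched.
                out.append(entry)
                continue
            self.seen.append(entry["text"])
            if self.keep(entry):
                out.append(entry)
            elif self.add_skip_label_only:
                # Label instead of removing; record the filter's class name.
                out.append({**entry, "skip": True, "reason": type(self).__name__})
            # Without the flag, the rejected entry is simply dropped.
        return out


class MinLengthFilter(SketchFilter):
    def keep(self, entry: Entry) -> bool:
        return len(str(entry["text"])) >= 3


class NoDigitsFilter(SketchFilter):
    def keep(self, entry: Entry) -> bool:
        return not any(ch.isdigit() for ch in str(entry["text"]))


f = MinLengthFilter(add_skip_label_only=True)
g = NoDigitsFilter(add_skip_label_only=True)

docs = [{"text": "hi"}, {"text": "abc123"}, {"text": "clean text"}]
result = g(f(docs))
kept = [e for e in result if not e.get("skip")]
```

In this sketch all three entries survive in `result`, but `g` never evaluates the entry that `f` already labeled, which mirrors the g(f(x)) guarantee stated above.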
A working example is in the updated tutorials/bitext_cleaning.
Checklist
(Consider this as an initial draft. I'll write tests and docs if this is deemed merge-worthy.)