Add option to skip data by adding a flag instead of removing them #566

Open
wants to merge 2 commits into
base: main


shuoyangd
Contributor

Description

This PR adds an option to mark filtered entries with a skip label in the JSON/Parquet outputs instead of removing them entirely. When the option is enabled, it also records which filter discarded an entry by writing that filter's class name to a field ("reason" by default).

This allows easy tracking/book-keeping in some scenarios, for example:

  • When there is another modality (e.g. speech) and the data files are not self-contained
  • When someone is experimenting with new filters and needs to know how many entries each filter throws out
  • When someone is running other pipelines outside NeMo Curator
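For the second scenario, per-filter discard counts can be tallied directly from the labeled output. The following is only a minimal sketch, assuming JSON Lines output in which each skipped entry carries a boolean flag and a "reason" field holding the filter class name. The flag field name `skip`, the `src`/`tgt` keys, and the helper `count_discards` are hypothetical and not taken from this PR.

```python
import json
from collections import Counter


def count_discards(lines, skip_field="skip", reason_field="reason"):
    """Tally how many entries each filter marked as skipped.

    `skip_field` is an assumed name for the skip flag; "reason" is the
    default reason field described in this PR.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if record.get(skip_field):
            counts[record.get(reason_field, "unknown")] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    '{"src": "a", "tgt": "b", "skip": true, "reason": "LengthRatioFilter"}',
    '{"src": "c", "tgt": "d"}',
    '{"src": "e", "tgt": "f", "skip": true, "reason": "LengthRatioFilter"}',
    '{"src": "g", "tgt": "h", "skip": true, "reason": "SomeOtherFilter"}',
]

discards = count_discards(sample)
for reason, n in discards.most_common():
    print(f"{reason}: {n}")
```

With real output, `lines` could simply be an open file handle over a JSON Lines file.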

The feature can be applied to all filters without any extra code changes.

Although all entries are preserved in the dataset, we ensure that when filters are chained in the form g(f(x)), g is still only run on entries that were not filtered out by f.
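The chaining behavior can be illustrated with a small stand-alone sketch. It is not the implementation in this PR: `apply_filter`, the `skip` field name, and the two toy filters are hypothetical, and plain dictionaries stand in for dataset entries. Only the "reason" field name follows the default described above.

```python
def apply_filter(entries, name, keep_fn, skip_field="skip", reason_field="reason"):
    """Label entries rejected by keep_fn instead of dropping them.

    Entries already labeled by an earlier filter are not evaluated again.
    Returns the number of entries keep_fn was actually called on.
    """
    calls = 0
    for entry in entries:
        if entry.get(skip_field):
            continue  # rejected by an earlier filter; leave its label intact
        calls += 1
        if not keep_fn(entry):
            entry[skip_field] = True
            entry[reason_field] = name
    return calls


entries = [
    {"text": "hi"},
    {"text": "hello world"},
    {"text": "a much longer sentence here"},
]

# f: hypothetical minimum-length filter; g: hypothetical maximum-length filter.
calls_f = apply_filter(entries, "MinLengthFilter", lambda e: len(e["text"]) >= 3)
calls_g = apply_filter(entries, "MaxLengthFilter", lambda e: len(e["text"]) <= 15)

print(calls_f, calls_g)
for entry in entries:
    print(entry)
```

Here g is evaluated on two entries rather than three, and the entry rejected by f keeps f's name as its reason, while the dataset still contains all three entries.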

Usage

Simply add the extra flag add_skip_label_only=True to any filter definition. For example:

            LengthRatioFilter(
                max_ratio=2,
                src_lang=SRC_LANG,
                tgt_lang=TGT_LANG,
                score_field="length_ratio",
                score_type=float,
                add_skip_label_only=True,
            )

This feature does not work with the plain bitext output format, because extra fields cannot be added there. Make sure the JSON or Parquet format is used.

A working example is in the updated tutorials/bitext_cleaning.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

(Consider this as an initial draft. I'll write tests and docs if this is deemed merge-worthy.)

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>