Adding Masked Language Modelling #1030
Conversation
jiant/tasks/lm.py
Outdated
@register_task("wikipedia_corpus_mlm", rel_path="wikipedia_corpus_small/") | ||
class MaskedLanguageModelingTask(Task): | ||
""" | ||
Masked language modeling task on Wikipedia dataset | ||
Attributes: | ||
max_seq_len: (int) maximum sequence length | ||
min_seq_len: (int) minimum sequence length | ||
files_by_split: (dict) files for three data split (train, val, test) | ||
We are currently using an unpreprocessed version of the Wikipedia corpus | ||
that consists of 5% of the data. Please reach out to jiant admin if you | ||
would like access to this dataset. |
Is the task registered as wikipedia_corpus_mlm (using the dataset under relative path /wikipedia_corpus_small) an experiment with some research significance that users would want to reproduce, or is it perhaps a toy dataset being used to demo MaskedLanguageModelingTask?
It's not toy data. The full Wikipedia corpus is very large and jiant currently does not have the ability to load the entire thing, so I extracted a subset with the same size as WikiText-103, which is around 5% of full Wikipedia.
Thanks for clarifying, @phu-pmh. Maybe relevant: there was a recent discussion and change to another LM task to reduce its memory footprint. It may be relatively easy to make a similar change here to allow you to use the full dataset (if memory, not time, is the concern).
But if you want to introduce a modified dataset with this task, please also submit the code you used to construct the dataset. There's an example of a data preprocessing script under jiant/scripts. All your script would need to do is document/link the data it takes as input, perform your preprocessing, and save the output in the format expected by your task.
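For reference, the memory-footprint change alluded to above mostly comes down to streaming examples from disk instead of materializing a whole split in memory. Below is a minimal sketch of that pattern (the stream_mlm_examples helper is hypothetical, not jiant's actual loader):

from typing import Iterator, List

def stream_mlm_examples(path: str, max_seq_len: int = 128) -> Iterator[List[str]]:
    """Yield whitespace-tokenized chunks of at most max_seq_len tokens, one at a time."""
    buffer: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buffer.extend(line.split())
            while len(buffer) >= max_seq_len:
                yield buffer[:max_seq_len]
                buffer = buffer[max_seq_len:]
    if buffer:  # flush the final partial chunk
        yield buffer

# Usage: only one chunk is held in memory at a time.
# for tokens in stream_mlm_examples("train.txt"):
#     ...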
The preprocessing code we used was a slight modification of what is here: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/data (@phu-pmh can speak more to this), so it might not be the best idea to copy-paste it into scripts. Perhaps we could put instructions (maybe in the documentation in the Task class?) about how to generate the data instead.
For English Wikipedia, I think the NVIDIA code is ready to use as-is. I only added modifications so that we can do cross-lingual experiments.
jiant/tasks/lm.py
Outdated
"val": os.path.join(path, "valid.txt"), | ||
"test": os.path.join(path, "test.txt"), | ||
} | ||
self.examples_by_split = {} |
I'll revert this in the next commit.
Thanks for your recent changes, @pruksmhc. It looks like there are only a few open comments remaining:
Finally, as you suggested, we'll want to re-run your validation experiments after the changes are in.
@pyeres, for 3, I realized that (1) the masking code is a little hard to test because it masks randomly, which makes exact asserts difficult, and (2) the code comes from the well-maintained Transformers library. I've therefore decided to hold off on unit tests for that part of the code. I'll still extract it out into its own function, though; just a heads up.
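For context, the dynamic masking being discussed follows the standard BERT recipe used in the Transformers library (15% of positions selected; of those, 80% replaced with the mask token, 10% with a random token, 10% left unchanged). The sketch below mirrors that logic rather than jiant's exact function, and shows how a seeded torch.Generator makes the output deterministic enough to assert on:

import torch

def mask_tokens(inputs, mask_idx, vocab_size, mlm_probability=0.15, generator=None):
    """Return (masked_inputs, labels); labels are -100 on positions that are not predicted."""
    inputs = inputs.clone()
    labels = inputs.clone()

    # Select ~15% of positions to predict.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix, generator=generator).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked positions

    # Of the selected positions: 80% -> mask token.
    indices_replaced = (
        torch.bernoulli(torch.full(labels.shape, 0.8), generator=generator).bool() & masked_indices
    )
    inputs[indices_replaced] = mask_idx

    # 10% -> random token; the remaining 10% are left unchanged.
    indices_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5), generator=generator).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_tokens = torch.randint(vocab_size, labels.shape, generator=generator)
    inputs[indices_random] = random_tokens[indices_random]
    return inputs, labels

# With a fixed seed the call is reproducible, so a unit test can assert exact
# outputs or at least statistical properties (roughly 15% of labels != -100).
g = torch.Generator().manual_seed(0)
x = torch.randint(5, 1000, (16, 128))
masked, labels = mask_tokens(x, mask_idx=4, vocab_size=1000, generator=g)
assert abs((labels != -100).float().mean().item() - 0.15) < 0.05

This is roughly what DataCollatorForLanguageModeling does in Transformers; the version above is only a sketch for illustration.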
jiant/models.py
Outdated
    device=inputs.device, dtype=torch.uint8
)
tokenizer_name = self.sent_encoder._text_field_embedder.tokenizer_required
labels, _ = self.sent_encoder._text_field_embedder.correct_sent_indexing(
It looks like the labels will be modified (as a result of correct_sent_indexing()), but it looks like the inputs aren't getting the same adjustment here. Is that intentional/correct?
Yes, because we do it here for the inputs: https://github.com/nyu-mll/jiant/blob/master/jiant/huggingface_transformers_interface/modules.py#L401
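To make the asymmetry explicit with a self-contained toy (this is not jiant's code; ToyEmbedder and the OFFSET remap are invented for illustration): when the embedder remaps raw token ids inside its own forward(), only tensors that bypass the embedder, such as the MLM labels, still need an explicit call to the remapping function.

import torch

OFFSET = 2  # pretend the indexer's raw ids are offset by 2 from the model's vocab ids

def correct_sent_indexing(ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the embedder's id remapping."""
    return ids - OFFSET

class ToyEmbedder(torch.nn.Module):
    def __init__(self, vocab_size: int, dim: int = 8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)

    def forward(self, raw_ids: torch.Tensor) -> torch.Tensor:
        # inputs are remapped *inside* forward, as in the linked modules.py code
        return self.emb(correct_sent_indexing(raw_ids))

embedder = ToyEmbedder(vocab_size=10)
raw_inputs = torch.tensor([[5, 6, 7]])
hidden = embedder(raw_inputs)               # inputs: remapped implicitly by the embedder
labels = correct_sent_indexing(raw_inputs)  # labels: must be remapped explicitly
print(hidden.shape, labels)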
@pruksmhc — thanks for these changes, and thanks especially for the creative testing work.
There are only a few open items to tie up before this is mergeable into master:
- A few open comments (most or all are on this most recent round of changes).
- Providing data/script/instructions for the new task.
- Re-running your validation experiments after the final changes are in.
After that I think it's ready for approval.
jiant/tasks/lm.py
Outdated
        files_by_split: (dict) files for the three data splits (train, val, test)
    We are currently using an unpreprocessed version of the Wikipedia corpus
    that consists of 5% of the data. You can generate the data using code from
    https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/data.
@pyeres here are the instructions for data generation (point 2 you raised).
I think my request got buried a ways back in the thread, so resurfacing it here:
We need to make this Task reproducible, and to do that I don't know of a way to get around making the task's data dependencies reproducible. This can be done with a script or instructions (if the instructions are involved, they should live in a script).
A script doesn't need to copy the functionality of other open-source scripts involved in generating the data; our script can simply document that those other steps/scripts are used at some step. The goal is that, using our script/instructions, a user should be able to exactly reproduce the task's data dependencies. (@phu-pmh for visibility.)
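A hedged sketch of what such a script could look like (the paths, flags, and upstream entry point are assumptions; only the NVIDIA repository is taken from this thread). The point is that it documents the external dependency and the subsampling step rather than reimplementing them:

#!/usr/bin/env python
# Hypothetical reproduction script for the wikipedia_corpus_small MLM data.
# It delegates the heavy lifting to the NVIDIA DeepLearningExamples
# preprocessing referenced above instead of copying that code; the exact
# invocation would need to match what was actually run.
import argparse
import subprocess

NVIDIA_REPO = "https://github.com/NVIDIA/DeepLearningExamples"
NVIDIA_SUBDIR = "PyTorch/LanguageModeling/BERT/data"

def main():
    parser = argparse.ArgumentParser(
        description="Reproduce train.txt/valid.txt/test.txt for wikipedia_corpus_mlm."
    )
    parser.add_argument("--work-dir", required=True, help="scratch directory for the raw dump")
    parser.add_argument("--output-dir", required=True, help="where the three split files are written")
    args = parser.parse_args()

    # Step 1: fetch the upstream preprocessing code (the documented dependency).
    subprocess.run(
        ["git", "clone", NVIDIA_REPO, args.work_dir + "/DeepLearningExamples"], check=True
    )

    # Step 2 (assumed): run the download/formatting scripts under NVIDIA_SUBDIR
    # in the cloned repo, producing raw per-article text.

    # Step 3 (assumed): subsample to ~5% of English Wikipedia (WikiText-103-sized)
    # and write the train/valid/test splits expected by the task to args.output_dir.

if __name__ == "__main__":
    main()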
Here are the current validation checks. We do uniform mixing between MLM and the intermediate task, which means that the MLM perplexity here will be higher than what was reported in the description (which didn't use uniform mixing). We did uniform mixing so that the runs would finish faster.
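For intuition about why uniform mixing raises the reported MLM perplexity: under uniform mixing the (much larger) MLM task is sampled as often as the intermediate task, so it receives far fewer updates than it would under size-proportional mixing. A toy sampler illustrating the difference (this is not jiant's sampler; the MLM size below is illustrative, and QQP's roughly 364k training examples is the only real number):

import random

task_sizes = {"wikipedia_corpus_mlm": 4_000_000, "qqp": 364_000}  # illustrative sizes

def sample_task(weighting: str = "uniform") -> str:
    tasks = list(task_sizes)
    if weighting == "uniform":
        weights = [1.0] * len(tasks)  # every task equally likely at each step
    else:  # "proportional": sample in proportion to task size
        weights = [float(task_sizes[t]) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

draws = [sample_task("uniform") for _ in range(10_000)]
print(draws.count("wikipedia_corpus_mlm") / len(draws))  # ~0.5, versus ~0.92 under proportional mixing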
* misc run scripts
* sbatch
* sweep scripts
* update
* qa
* update
* update
* update
* update
* update
* sb file
* moving update_metrics to outside scope of dataparallel
* fixing micro_avg calculation
* undo debugging
* Fixing tests, moving update_metrics out of other tasks
* remove extraneous change
* MLM task
* Added MLM task
* update
* fix multiple choice dataparallel forward
* update
* add _mask_id to transformers
* Update
* MLM update
* adding update_metrics abstraction
* delete update_metrics_ notation
* fixed wrong index problem
* removed unrelated files
* removed unrelated files
* removed unrelated files
* fix PEP8
* Fixed get_pretained_lm_head for BERT and ALBERT
* spelling check
* black formatting
* fixing tests
* bug fix
* Adding batch_size constraints to multi-GPU setting
* adding documentation
* adding batch size test
* black correct version
* Fixing batch size assertion
* generalize batch size assertion for more than 2 GPU setting
* reducing label loops in code
* fixing span forward
* Fixing span prediction forward for multi-GPU
* fix commonsenseQA forward
* MLM
* adding function documentation
* resolving nits, fixing seq_gen forward
* remove nit
* fixing batch_size assert and SpanPrediction task
* Remove debugging
* Fix batch size mismatch multi-GPU test
* Fix order of assert checking for batch size mismatch
* mlm training
* update
* sbatch
* update
* data parallel
* update data parallel stuffs
* using sequencelabel, using 1 paragraph per example
* update label mapping
* adding exmaples-porportion-mixing
* changing dataloader to work with wikitext103
* weight sampling
* add early stopping only onb one task
* commit
* Cleaning up code
* Removing unecessarily tracked git folders
* Removing unnecesary changes
* revert README
* revert README.md again
* Making more general for Transformer-based embedders
* torch.uint8 -> torch.bool
* Fixing indexing issues
* get rid of unecessary changes
* black cleanup
* update
* Prevent updating update_metrics twice in one step
* update
* update
* add base_roberta
* update
* reverting CCG edit added for debugging
* refactor defaults.conf
* black formatting
* merge
* removed SOP task and mlm_manual_scaling
* Fixing label namespace vocabulary creation, mergeing from master
* Deleting MLM weight
* black formatting
* Adding early_stopping_method to defaults.conf
* Fixing MLM with preprocessed wikitext103
* Deleting intermediate class hierarchy for MLM
* Correcting black
* LanguageModelingTask -> AutoregressiveModelingTask
* code style
* fixing MaskedLanguageModelTask
* Fixing typo
* Fixing label namespace
* extracting out masking portion
* Revert "extracting out masking portion" This reverts commit f21165c.
* Code cleanup
* Adding tests for early_stpping_method
* Adding pretrain_stop_metric
* Reverting get_data_iter
* Reverting to get_data_iter
* Fixing get_pretrained_lm_head for all embedder types
* Extracting out MLM probability masking
* Move dynamic masking function to Task for easier testing
* Adding unit tests for MLM
* Adding change to MLM forward function to expose more intermediate steps for testing
* Fixing code style
* Adding more detailed instructions of how to generate Wikipedia data
* Adding rest of MLM data generation code
* Black style and remove comment
* black style
* updating repro code for MLM data
Co-authored-by: phu-pmh <phumon91@gmail.com>
Co-authored-by: Haokun Liu <haokunliu412@gmail.com>
Co-authored-by: pruksmhc <pruks22y@mtholyoke.edu>
Co-authored-by: DeepLearning VM <google-dl-platform@googlegroups.com>
Adding Masked Language Modeling Task for RoBERTa and ALBERT
What this version of MLM supports: RoBERTa and ALBERT embedders.
Additionally, we fix the get_pretrained_lm_head function for Transformer-based embedders.
Performance Tests:
We tested this on CCG and QQP, making sure that both RoBERTa MLM + QQP -> target task and RoBERTa MLM + CCG -> target task get reasonable numbers (meaning that MLM perplexity decreases and that QQP/CCG performance stays close to RoBERTa without MLM training).
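As a reminder of what the perplexity number in these checks means (a minimal sketch using the usual -100 ignore-index convention, not jiant's exact metric code): it is the exponential of the average cross-entropy over the masked positions, so "perplexity decreasing" is equivalent to the masked-token loss decreasing.

import torch
import torch.nn.functional as F

vocab_size = 100
logits = torch.randn(2, 6, vocab_size)   # (batch, seq_len, vocab)
labels = torch.randint(0, vocab_size, (2, 6))
labels[:, ::2] = -100                    # pretend only the odd positions were masked

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
perplexity = torch.exp(loss)
print(float(perplexity))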