Adding Masked Language Modelling #1030
Conversation
jiant/tasks/lm.py
Outdated
@register_task("wikipedia_corpus_mlm", rel_path="wikipedia_corpus_small/") | ||
class MaskedLanguageModelingTask(Task): | ||
""" | ||
Masked language modeling task on Wikipedia dataset | ||
Attributes: | ||
max_seq_len: (int) maximum sequence length | ||
min_seq_len: (int) minimum sequence length | ||
files_by_split: (dict) files for three data split (train, val, test) | ||
We are currently using an unpreprocessed version of the Wikipedia corpus | ||
that consists of 5% of the data. Please reach out to jiant admin if you | ||
would like access to this dataset. |
Is the task registered as wikipedia_corpus_mlm (using the dataset under relative path /wikipedia_corpus_small) an experiment with some research significance that users would want to reproduce, or is it perhaps a toy dataset being used to demo MaskedLanguageModelingTask?
It's not toy data. The full Wikipedia corpus is very large and jiant currently does not have the ability to load the entire thing, so I extracted a subset with the same size as WikiText-103, which is around 5% of full Wikipedia.
Thanks for clarifying, @phu-pmh. Maybe relevant: there was a recent discussion and change to another LM task to reduce its memory footprint. It may be relatively easy to make a similar change here to allow you to use the full dataset (if memory, not time, is the concern).
But if you want to introduce a modified dataset with this task, please also submit the code you used to construct the dataset. There's an example of a data preprocessing script under jiant/scripts. All your script would need to do is document/link the data it takes as input, perform your preprocessing, and save the output in the format expected by your task.
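For reference, the memory-footprint change alluded to above mostly comes down to streaming examples from disk instead of materializing a whole split in memory. Below is a minimal sketch of that pattern (the stream_mlm_examples helper is hypothetical, not jiant's actual loader):

from typing import Iterator, List

def stream_mlm_examples(path: str, max_seq_len: int = 128) -> Iterator[List[str]]:
    """Yield whitespace-tokenized chunks of at most max_seq_len tokens, one at a time."""
    buffer: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buffer.extend(line.split())
            while len(buffer) >= max_seq_len:
                yield buffer[:max_seq_len]
                buffer = buffer[max_seq_len:]
    if buffer:  # flush the final partial chunk
        yield buffer

# Usage: only one chunk is held in memory at a time.
# for tokens in stream_mlm_examples("train.txt"):
#     ...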
The preprocessing code we used was a slight modification of what is here: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/data (@phu-pmh can speak more to this), so it might not be the best idea to copy-paste it into scripts. Perhaps we could put instructions (maybe in the documentation in the Task class?) about how to generate the data instead.
For English Wikipedia, I think the NVIDIA code is ready to use as-is. I only added modifications so that we can do cross-lingual experiments.
jiant/tasks/lm.py
Outdated
"val": os.path.join(path, "valid.txt"), | ||
"test": os.path.join(path, "test.txt"), | ||
} | ||
self.examples_by_split = {} |
I'll revert this in the next commit.
Thanks for your recent changes, @pruksmhc. It looks like there are only a few open comments remaining:
Finally, as you suggested, we'll want to re-run your validation experiments after the changes are in.
@pyeres, for 3, I realized that (1) the masking code is a little hard to test because it masks randomly, which makes exact asserts difficult, and (2) the code comes from the well-maintained Transformers library. I've therefore decided to hold off on unit tests for that part of the code. I'll still extract it out into its own function, though; just a heads up.
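For context, the dynamic masking being discussed follows the standard BERT recipe used in the Transformers library (15% of positions selected; of those, 80% replaced with the mask token, 10% with a random token, 10% left unchanged). The sketch below mirrors that logic rather than jiant's exact function, and shows how a seeded torch.Generator makes the output deterministic enough to assert on:

import torch

def mask_tokens(inputs, mask_idx, vocab_size, mlm_probability=0.15, generator=None):
    """Return (masked_inputs, labels); labels are -100 on positions that are not predicted."""
    inputs = inputs.clone()
    labels = inputs.clone()

    # Select ~15% of positions to predict.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix, generator=generator).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked positions

    # Of the selected positions: 80% -> mask token.
    indices_replaced = (
        torch.bernoulli(torch.full(labels.shape, 0.8), generator=generator).bool() & masked_indices
    )
    inputs[indices_replaced] = mask_idx

    # 10% -> random token; the remaining 10% are left unchanged.
    indices_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5), generator=generator).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_tokens = torch.randint(vocab_size, labels.shape, generator=generator)
    inputs[indices_random] = random_tokens[indices_random]
    return inputs, labels

# With a fixed seed the call is reproducible, so a unit test can assert exact
# outputs or at least statistical properties (roughly 15% of labels != -100).
g = torch.Generator().manual_seed(0)
x = torch.randint(5, 1000, (16, 128))
masked, labels = mask_tokens(x, mask_idx=4, vocab_size=1000, generator=g)
assert abs((labels != -100).float().mean().item() - 0.15) < 0.05

This is roughly what DataCollatorForLanguageModeling does in Transformers; the version above is only a sketch for illustration.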
jiant/models.py
Outdated
    device=inputs.device, dtype=torch.uint8
)
tokenizer_name = self.sent_encoder._text_field_embedder.tokenizer_required
labels, _ = self.sent_encoder._text_field_embedder.correct_sent_indexing(
It looks like the labels will be modified (as a result of correct_sent_indexing()), but it looks like the inputs aren't getting the same adjustment here. Is that intentional/correct?
Yes, because we do it here for the inputs: https://github.com/nyu-mll/jiant/blob/master/jiant/huggingface_transformers_interface/modules.py#L401
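To make the asymmetry explicit with a self-contained toy (this is not jiant's code; ToyEmbedder and the OFFSET remap are invented for illustration): when the embedder remaps raw token ids inside its own forward(), only tensors that bypass the embedder, such as the MLM labels, still need an explicit call to the remapping function.

import torch

OFFSET = 2  # pretend the indexer's raw ids are offset by 2 from the model's vocab ids

def correct_sent_indexing(ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the embedder's id remapping."""
    return ids - OFFSET

class ToyEmbedder(torch.nn.Module):
    def __init__(self, vocab_size: int, dim: int = 8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)

    def forward(self, raw_ids: torch.Tensor) -> torch.Tensor:
        # inputs are remapped *inside* forward, as in the linked modules.py code
        return self.emb(correct_sent_indexing(raw_ids))

embedder = ToyEmbedder(vocab_size=10)
raw_inputs = torch.tensor([[5, 6, 7]])
hidden = embedder(raw_inputs)               # inputs: remapped implicitly by the embedder
labels = correct_sent_indexing(raw_inputs)  # labels: must be remapped explicitly
print(hidden.shape, labels)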
@pruksmhc — thanks for these changes, and thanks especially for the creative testing work.
There are only a few open items to tie up before this is mergeable into master:
- A few open comments (most or all are on this most recent round of changes).
- Providing data/script/instructions for the new task.
- Re-running your validation experiments after the final changes are in.
After that I think it's ready for approval.
jiant/tasks/lm.py
Outdated
        files_by_split: (dict) files for the three data splits (train, val, test)
    We are currently using an unpreprocessed version of the Wikipedia corpus
    that consists of 5% of the data. You can generate the data using code from
    https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/data.
@pyeres here are the instructions for data generation (point 2 you raised).
I think my request got buried a ways back in the thread, so resurfacing it here:
We need to make this Task reproducible, and to do that I don't know of a way to get around making the task's data dependencies reproducible. This can be done with a script or instructions (if the instructions are involved, they should live in a script).
A script doesn't need to copy the functionality of other open-source scripts involved in generating the data; our script can simply document that those other steps/scripts are used at some step. The goal is that, using our script/instructions, a user should be able to exactly reproduce the task's data dependencies. (@phu-pmh for visibility.)
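A hedged sketch of what such a script could look like (the paths, flags, and upstream entry point are assumptions; only the NVIDIA repository is taken from this thread). The point is that it documents the external dependency and the subsampling step rather than reimplementing them:

#!/usr/bin/env python
# Hypothetical reproduction script for the wikipedia_corpus_small MLM data.
# It delegates the heavy lifting to the NVIDIA DeepLearningExamples
# preprocessing referenced above instead of copying that code; the exact
# invocation would need to match what was actually run.
import argparse
import subprocess

NVIDIA_REPO = "https://github.com/NVIDIA/DeepLearningExamples"
NVIDIA_SUBDIR = "PyTorch/LanguageModeling/BERT/data"

def main():
    parser = argparse.ArgumentParser(
        description="Reproduce train.txt/valid.txt/test.txt for wikipedia_corpus_mlm."
    )
    parser.add_argument("--work-dir", required=True, help="scratch directory for the raw dump")
    parser.add_argument("--output-dir", required=True, help="where the three split files are written")
    args = parser.parse_args()

    # Step 1: fetch the upstream preprocessing code (the documented dependency).
    subprocess.run(
        ["git", "clone", NVIDIA_REPO, args.work_dir + "/DeepLearningExamples"], check=True
    )

    # Step 2 (assumed): run the download/formatting scripts under NVIDIA_SUBDIR
    # in the cloned repo, producing raw per-article text.

    # Step 3 (assumed): subsample to ~5% of English Wikipedia (WikiText-103-sized)
    # and write the train/valid/test splits expected by the task to args.output_dir.

if __name__ == "__main__":
    main()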
Here are the current validation checks. We do uniform mixing between MLM and the intermediate task, which means that the MLM perplexity here will be higher than what was reported in the description (which didn't use uniform mixing). We did uniform mixing so that the runs would finish faster.
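For intuition about why uniform mixing raises the reported MLM perplexity: under uniform mixing the (much larger) MLM task is sampled as often as the intermediate task, so it receives far fewer updates than it would under size-proportional mixing. A toy sampler illustrating the difference (this is not jiant's sampler; the MLM size below is illustrative, and QQP's roughly 364k training examples is the only real number):

import random

task_sizes = {"wikipedia_corpus_mlm": 4_000_000, "qqp": 364_000}  # illustrative sizes

def sample_task(weighting: str = "uniform") -> str:
    tasks = list(task_sizes)
    if weighting == "uniform":
        weights = [1.0] * len(tasks)  # every task equally likely at each step
    else:  # "proportional": sample in proportion to task size
        weights = [float(task_sizes[t]) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

draws = [sample_task("uniform") for _ in range(10_000)]
print(draws.count("wikipedia_corpus_mlm") / len(draws))  # ~0.5, versus ~0.92 under proportional mixing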
* misc run scripts
* sbatch
* sweep scripts
* update
* qa
* update
* update
* update
* update
* update
* sb file
* moving update_metrics to outside scope of dataparallel
* fixing micro_avg calculation
* undo debugging
* Fixing tests, moving update_metrics out of other tasks
* remove extraneous change
* MLM task
* Added MLM task
* update
* fix multiple choice dataparallel forward
* update
* add _mask_id to transformers
* Update
* MLM update
* adding update_metrics abstraction
* delete update_metrics_ notation
* fixed wrong index problem
* removed unrelated files
* removed unrelated files
* removed unrelated files
* fix PEP8
* Fixed get_pretained_lm_head for BERT and ALBERT
* spelling check
* black formatting
* fixing tests
* bug fix
* Adding batch_size constraints to multi-GPU setting
* adding documentation
* adding batch size test
* black correct version
* Fixing batch size assertion
* generalize batch size assertion for more than 2 GPU setting
* reducing label loops in code
* fixing span forward
* Fixing span prediction forward for multi-GPU
* fix commonsenseQA forward
* MLM
* adding function documentation
* resolving nits, fixing seq_gen forward
* remove nit
* fixing batch_size assert and SpanPrediction task
* Remove debugging
* Fix batch size mismatch multi-GPU test
* Fix order of assert checking for batch size mismatch
* mlm training
* update
* sbatch
* update
* data parallel
* update data parallel stuffs
* using sequencelabel, using 1 paragraph per example
* update label mapping
* adding exmaples-porportion-mixing
* changing dataloader to work with wikitext103
* weight sampling
* add early stopping only onb one task
* commit
* Cleaning up code
* Removing unecessarily tracked git folders
* Removing unnecesary changes
* revert README
* revert README.md again
* Making more general for Transformer-based embedders
* torch.uint8 -> torch.bool
* Fixing indexing issues
* get rid of unecessary changes
* black cleanup
* update
* Prevent updating update_metrics twice in one step
* update
* update
* add base_roberta
* update
* reverting CCG edit added for debugging
* refactor defaults.conf
* black formatting
* merge
* removed SOP task and mlm_manual_scaling
* Fixing label namespace vocabulary creation, mergeing from master
* Deleting MLM weight
* black formatting
* Adding early_stopping_method to defaults.conf
* Fixing MLM with preprocessed wikitext103
* Deleting intermediate class hierarchy for MLM
* Correcting black
* LanguageModelingTask -> AutoregressiveModelingTask
* code style
* fixing MaskedLanguageModelTask
* Fixing typo
* Fixing label namespace
* extracting out masking portion
* Revert "extracting out masking portion" This reverts commit f21165c.
* Code cleanup
* Adding tests for early_stpping_method
* Adding pretrain_stop_metric
* Reverting get_data_iter
* Reverting to get_data_iter
* Fixing get_pretrained_lm_head for all embedder types
* Extracting out MLM probability masking
* Move dynamic masking function to Task for easier testing
* Adding unit tests for MLM
* Adding change to MLM forward function to expose more intermediate steps for testing
* Fixing code style
* Adding more detailed instructions of how to generate Wikipedia data
* Adding rest of MLM data generation code
* Black style and remove comment
* black style
* updating repro code for MLM data
Co-authored-by: phu-pmh <phumon91@gmail.com>
Co-authored-by: Haokun Liu <haokunliu412@gmail.com>
Co-authored-by: pruksmhc <pruks22y@mtholyoke.edu>
Co-authored-by: DeepLearning VM <google-dl-platform@googlegroups.com>
Adding Masked Language Modeling Task for RoBERTa and ALBERT
What this version of MLM supports: RoBERTa and ALBERT embedders.
Additionally, we fix the get_pretrained_lm_head function for Transformer-based embedders.
Performance Tests:
We tested this on CCG and QQP, making sure that both RoBERTa MLM + QQP -> target task and RoBERTa MLM + CCG -> target task get reasonable numbers (meaning that MLM perplexity decreases and that QQP/CCG performance stays close to RoBERTa without MLM training).
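As a reminder of what the perplexity number in these checks means (a minimal sketch using the usual -100 ignore-index convention, not jiant's exact metric code): it is the exponential of the average cross-entropy over the masked positions, so "perplexity decreasing" is equivalent to the masked-token loss decreasing.

import torch
import torch.nn.functional as F

vocab_size = 100
logits = torch.randn(2, 6, vocab_size)   # (batch, seq_len, vocab)
labels = torch.randint(0, vocab_size, (2, 6))
labels[:, ::2] = -100                    # pretend only the odd positions were masked

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
perplexity = torch.exp(loss)
print(float(perplexity))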