
[Question] What environment did you use for fetching large data set like dialog_mixture #66

Open
quq99 opened this issue Apr 28, 2023 · 5 comments


@quq99

quq99 commented Apr 28, 2023

Hi, I am trying to fetch FLAN v2 by running

PYTHONPATH=. python flan/v2/run_example.py

I could successfully run cot_submix, but I ran into an out-of-memory issue when trying to fetch dialog_submix on a single AWS p4d instance. The logs showed it downloading the wiki_dialog data and (apparently) doing some processing with Apache Beam.

Warning: The dataset you're trying to generate is using Apache Beam,
yet no `beam_runner` nor `beam_options` was explicitly provided.

Some Beam datasets take weeks to generate, so are usually not suited
for single machine generation. Please have a look at the instructions
to setup distributed generation:

https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset

What I did was follow the README in the flan/v2 directory: run bash setup.sh, then PYTHONPATH=. python flan/v2/run_example.py. The only entry point I found was the seqio.get_mixture_or_task('dialog_submix').get_dataset() call in that run_example script, and it is not clear to me how seqio's get_dataset() interacts with or invokes Apache Beam. Apart from pip install apache-beam, are there any other setup steps, e.g. environment settings? How can we pass a runner type to beam_runner?

Also, I assume dialog_submix is not the largest of these five categories. Could you explain what environment you used when running the script to generate the data? For example, do you use multiple machines, e.g. Google Cloud, or AWS EC2/EMR? Are there further settings or configs needed before running run_example.py? Thanks a lot!

@lehougoogle
Collaborator

Hi quq99,

There's nothing special about the dialog submix; it should work on a single machine, unless we missed something...

@quq99
Author

quq99 commented Apr 30, 2023

Hi @lehougoogle, thanks so much for the reply. Could you give me more information about how much memory is needed to run it on a single machine? I noticed issue #44 mentioned it could be run on a machine with 300G of memory.

Another question: how long does it typically take to fetch all five categories (cot, t0, dialog, flan ..)? Thanks :)

@quq99
Author

quq99 commented May 3, 2023

Hi @lehougoogle, some more context: when I run the script for "dialog_submix", I hit this error

python: malloc.c:4615: _int_realloc: Assertion `ncopies >= 3' failed.

I thought it was an out-of-memory issue, but after switching to another machine (500G memory) I still saw the error, and running free -g showed only around 70G in use, so I assume it is not a memory issue. Have you seen this before? Any thoughts would be helpful, thanks a lot!

@shayne-longpre
Collaborator

@quq99 it is quite memory intensive. We ran it a while ago internally on Google infrastructure, so I don't have specific numbers unfortunately, but in terms of compute it should roughly follow this order (least to most): cot, dialog, niv2, flan, t0, with fsopt using much more than zsopt.

If your compute is constrained, you can make it more tractable by splitting the task configs into smaller submixtures (e.g. splitting t0 into 10), running them separately, and joining the results at the end. I hope this helps!
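The splitting step described above can be sketched generically; the task names and chunk count below are illustrative placeholders, not the actual FLAN task registry:

```python
# Hedged sketch: partition a long list of task names into smaller groups,
# so each group can be registered as its own submixture and generated
# independently, with the resulting datasets merged afterwards.
def split_into_submixtures(task_names, num_splits):
    """Round-robin partition of task_names into num_splits groups."""
    return [task_names[i::num_splits] for i in range(num_splits)]

# Illustrative placeholder names, not real FLAN task identifiers.
tasks = [f"t0_task_{i:03d}" for i in range(100)]
groups = split_into_submixtures(tasks, 10)
print(len(groups), [len(g) for g in groups])
```

Each group could then be registered as its own seqio mixture (e.g. via seqio.MixtureRegistry.add) and generated in a separate run.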

@shayne-longpre
Collaborator

You can also now manually download the Dialog submixture (and the others) -- see the new README! :)
