
[Question] What environment did you use for fetching large data set like dialog_mixture #66

Open
quq99 opened this issue Apr 28, 2023 · 5 comments


@quq99

quq99 commented Apr 28, 2023

Hi, I am trying to fetch FLAN v2 by running

PYTHONPATH=. python flan/v2/run_example.py

I could successfully run cot_submix, but I ran into an out-of-memory issue when trying to fetch dialog_submix on a single AWS p4d instance. The logs showed it downloading the wiki_dialog data and (apparently) doing some processing with Apache Beam.

Warning: The dataset you're trying to generate is using Apache Beam,
yet no `beam_runner` nor `beam_options` was explicitly provided.

Some Beam datasets take weeks to generate, so are usually not suited
for single machine generation. Please have a look at the instructions
to setup distributed generation:

https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset

What I did was follow the README in the flan/v2 directory: run bash setup.sh, then PYTHONPATH=. python flan/v2/run_example.py. The only entry point I found was the seqio.get_mixture_or_task('dialog_submix').get_dataset() call in that run_example script, and it is not clear to me how seqio's get_dataset() interacts with or invokes Apache Beam. Apart from pip install apache-beam, are there any other setup steps, e.g. environment settings? How can we pass a runner type to beam_runner?

Also, I assume dialog_submix is not the largest of these five categories. Could you explain what environment you used when running the script to generate the data? For example, do you use multiple machines, e.g. Google Cloud, or AWS EC2/EMR? Are there further settings or configs needed before running run_example.py? Thanks a lot!

@lehougoogle
Collaborator

Hi quq99,

There's nothing special about the dialog submix; it should work on a single machine, unless we missed something...

@quq99
Author

quq99 commented Apr 30, 2023

Hi @lehougoogle, thanks so much for the reply. Could you give me more information about how much memory is needed to run it on a single machine? I noticed issue #44 mentioned it could be run on a machine with 300G of memory.

Another question: how long does it typically take to fetch all five categories (cot, t0, dialog, flan ..)? Thanks :)

@quq99
Author

quq99 commented May 3, 2023

Hi @lehougoogle, some more context: when I run the script for "dialog_submix", I hit this error

python: malloc.c:4615: _int_realloc: Assertion `ncopies >= 3' failed.

I thought it was an out-of-memory issue, but after switching to another machine (500G memory) I still saw the error, and running free -g showed only around 70G in use, so I assume it is not a memory issue. Have you seen this before? Any thoughts would be helpful, thanks a lot!

@shayne-longpre
Collaborator

@quq99 it is quite memory intensive. We ran it a while ago internally on Google infrastructure, so I don't have specific numbers unfortunately, but in terms of compute it should roughly follow this order (least to most): cot, dialog, niv2, flan, t0, with fsopt using much more than zsopt.

If your compute is constrained, you can make it more tractable by splitting the task configs into smaller submixtures (e.g. splitting t0 into 10), running them separately, and joining the results at the end. I hope this helps!
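The splitting step described above can be sketched generically; the task names and chunk count below are illustrative placeholders, not the actual FLAN task registry:

```python
# Hedged sketch: partition a long list of task names into smaller groups,
# so each group can be registered as its own submixture and generated
# independently, with the resulting datasets merged afterwards.
def split_into_submixtures(task_names, num_splits):
    """Round-robin partition of task_names into num_splits groups."""
    return [task_names[i::num_splits] for i in range(num_splits)]

# Illustrative placeholder names, not real FLAN task identifiers.
tasks = [f"t0_task_{i:03d}" for i in range(100)]
groups = split_into_submixtures(tasks, 10)
print(len(groups), [len(g) for g in groups])
```

Each group could then be registered as its own seqio mixture (e.g. via seqio.MixtureRegistry.add) and generated in a separate run.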

@shayne-longpre
Collaborator

You can also now manually download the Dialog submixture (and the others) -- see the new README! :)
