[Question] What environment did you use for fetching large data set like dialog_mixture #66
Comments
Hi quq99, there's nothing special about the dialog submix; it should work on a single machine, unless we missed something...
Hi @lehougoogle, thanks so much for the reply. Could you give me more info about how much memory is needed when I run it on a single machine? I noticed issue #44, where someone mentioned being able to run it on a 300 GB memory machine. Another question: how long does it typically take if I want to fetch all five categories (cot, t0, dialog, flan, ...)? Thanks :)
Hi @lehougoogle, for more context: when I run the script for "dialog_submix" I hit an error.
I thought it was an out-of-memory issue, but when I switched to another machine (500 GB memory) I still saw the error, and `free -g` showed it was only using around 70 GB of memory, so I assume it was not a memory issue. Have you faced this issue before? Any thoughts would be helpful, thanks a lot!
@quq99 it is quite memory intensive. We ran it a while ago internally on Google infrastructure, so I don't have specific numbers unfortunately, but in terms of compute it should roughly be in this order (least to most): cot, dialog, niv2, flan, t0, with fsopt using much more than zsopt. If your compute is constrained, you can make it more efficient by splitting the task configs into different submixtures (e.g. splitting t0 into 10), running them separately, and then joining the results at the end. I hope this helps!
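The splitting suggestion above can be sketched generically: partition the mixture's task list into N smaller groups, materialize each group separately, and join the outputs at the end. The helper below and the placeholder task names are hypothetical illustrations, not code from the repo; the commented `seqio.MixtureRegistry.add` call shows where each chunk would become its own registered mixture.

```python
# Sketch: split a large submixture's task list into N roughly equal chunks
# so each chunk can be processed as its own (smaller) mixture.

def chunk_tasks(task_names, n_chunks):
    """Split a list of task names into n_chunks roughly equal groups."""
    size, rem = divmod(len(task_names), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(task_names[start:end])
        start = end
    return chunks

if __name__ == "__main__":
    t0_tasks = [f"t0_task_{i}" for i in range(25)]  # placeholder names
    parts = chunk_tasks(t0_tasks, 10)
    # Each part would then be registered as its own mixture, e.g.:
    # seqio.MixtureRegistry.add("t0_part_0",
    #                           [(name, 1.0) for name in parts[0]])
    # and materialized in a separate run, with outputs joined afterwards.
    print([len(p) for p in parts])  # → [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

This keeps each run's working set bounded by the size of one chunk rather than the whole submixture.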
You can also now manually download the Dialog submixture (and the others) -- see the new README! :)
Hi, I am trying to fetch FLAN v2 by running `PYTHONPATH=. python flan/v2/run_example.py`. I could successfully run `cot_submix`, but I faced an out-of-memory issue when trying to fetch `dialog_submix` on a single AWS p4d instance. Some of the logs showed it downloading wiki_dialog data and also doing some processing (maybe) using Apache Beam.

What I did was follow the README file in the flan/v2 directory: `bash setup.sh`, then `PYTHONPATH=. python flan/v2/run_example.py`. The only entry point I could find was the `seqio.get_mixture_or_task('dialog_submix').get_dataset()` call in that run_example script. I am not clear on how `seqio.get_dataset()` interacts with or calls Apache Beam. Apart from `pip install apache-beam`, are there any other setup steps, e.g. environment settings? How can we pass the runner type to the Beam runner? And I assume dialog_submix is not the largest of these five categories, so could you give me some help explaining what environment you used when running the script to generate the data? For example, do you use multiple machines, e.g. Google Cloud, or AWS EC2/EMR? Are there further settings or configs needed before running the `run_example.py` code? Thanks a lot!
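Since the out-of-memory error happens while materializing the mixture, one generic pattern (not the repo's own code) is to stream examples to disk in shards instead of collecting them in memory. Everything here is a hypothetical sketch: `write_sharded`, the shard naming, and the generator standing in for the example iterator that `get_dataset()` would yield are all illustrative, not the project's API.

```python
# Sketch: stream examples to sharded JSONL files so memory use is bounded
# by one example at a time, not the whole submixture.
import json
import os

def write_sharded(examples, out_dir, shard_size=1000):
    """Write an iterable of dict examples to JSONL shards of at most
    shard_size records each, returning the list of shard paths."""
    shard, count, paths, f = 0, 0, [], None
    for ex in examples:
        if count % shard_size == 0:
            if f is not None:
                f.close()
            path = os.path.join(out_dir, f"part-{shard:05d}.jsonl")
            paths.append(path)
            f = open(path, "w")
            shard += 1
        f.write(json.dumps(ex) + "\n")
        count += 1
    if f is not None:
        f.close()
    return paths
```

In this pattern the iterator over the mixture is consumed lazily, so the peak memory is one example plus file buffers, and the shards can be joined (or loaded selectively) afterwards.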