Train, Test and Val TFRecords files partitions always created #16

jmarrietar · 2020-08-14T03:40:17Z

Describe the bug
Train, test, and validation TFRecord files are always created, even if I only specify TRAIN in the CSV file.

To Reproduce
Change the split column so that every row is TRAIN:

split,image_uri,label
TRAIN,../tfrecorder/test_data/images/cat/cat-640x853-1.jpg,cat
TRAIN,../tfrecorder/test_data/images/cat/cat-800x600-2.jpg,cat
TRAIN,../tfrecorder/test_data/images/cat/cat-800x600-3.jpg,cat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x640-1.jpg,goat
TRAIN,../tfrecorder/test_data/images/goat/goat-320x320-2.jpg,goat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x427-3.jpg,goat

Expected behavior
Only create partitions for the TRAIN TFRecords.

Screenshots

System (please complete the following information):

OS: [iOS]
Python Version: [3.7.4]
TensorFlow Version: [2.2.0]

Additional context
Not sure if this is a bug or if it is indeed expected behavior. But as a user, if my CSV file only specifies TRAIN, it is strange that the other files (test and val) are also created, just without images.

mbernico · 2020-08-14T14:31:47Z

This behavior makes sense, knowing the code, but it does seem unexpected/weird to a user. I think we should probably address it. We are getting ready to add some features for 0.2's input schema parsing, and I think it would be good to address this as part of that change. @cfezequiel for awareness.

cfezequiel · 2020-09-18T16:55:55Z

Hi @jmarrietar, thanks for the feedback. Yes, it makes sense to write TFRecord files only for splits with data. I've created a fix in #25.

One caveat is that Apache Beam doesn't really support conditional pipeline transforms (e.g. if dataset is not empty, run transform), so the check for train/validation/test counts is done at the input (DataFrame) level. The issue is that when the pipeline is run, it may flag some erroneous samples in a split as 'discarded', which may still result in an empty TFRecord file.

For instance, let's say all samples in the test dataset are erroneous. The pipeline may still generate a test*.tfrecords.gz file because the test dataset had rows when it was counted, even though all of those samples were later marked as 'discarded'. This is probably not a common case, however, and it is up to the user to ensure that their data is correct.
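To make the input-level check described above concrete, here is a minimal sketch. It is not the code from #25: the comment says the check is done on the DataFrame, while this sketch swaps in the standard-library `csv` module and a `Counter` to stay dependency-free. The names `KNOWN_SPLITS` and `splits_with_data`, and the split labels `VALIDATION` and `TEST`, are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical split labels; only TRAIN appears in the report above.
KNOWN_SPLITS = ("TRAIN", "VALIDATION", "TEST")


def splits_with_data(csv_text, split_column="split"):
    """Return the known splits that have at least one row in the input CSV.

    Hypothetical helper: it counts input rows only, so it cannot know
    whether a row will later be discarded by the pipeline.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[split_column] for row in reader)
    # Counter returns 0 for labels that never occur.
    return [split for split in KNOWN_SPLITS if counts[split] > 0]


train_only_csv = """split,image_uri,label
TRAIN,images/cat/cat-1.jpg,cat
TRAIN,images/goat/goat-1.jpg,goat
"""

# The TEST row is counted even if its image later turns out to be unreadable.
mixed_csv = """split,image_uri,label
TRAIN,images/cat/cat-1.jpg,cat
TEST,images/cat/unreadable.jpg,cat
"""

print(splits_with_data(train_only_csv))
print(splits_with_data(mixed_csv))
```

The second call shows the caveat: a split is selected for writing as soon as it has input rows, so a split whose rows are all discarded later could still end up with an empty output file.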

jmarrietar added the bug ("Something isn't working") label Aug 14, 2020

mbernico self-assigned this Aug 14, 2020

cfezequiel self-assigned this Sep 17, 2020

cfezequiel mentioned this issue Sep 18, 2020

Generate TFRecords only if data exists in a split. #25

Merged

mbernico closed this as completed in #25 Sep 18, 2020
