This repository was archived by the owner on Jul 31, 2023. It is now read-only.

Train, Test and Val TFRecords files partitions always created #16

Closed
jmarrietar opened this issue Aug 14, 2020 · 2 comments · Fixed by #25
Labels
bug Something isn't working

Comments

@jmarrietar
Contributor

jmarrietar commented Aug 14, 2020

Describe the bug
Train, test, and validation TFRecord files are always created, even if the CSV file only specifies TRAIN.

To Reproduce
Change the split column so that every row is TRAIN:

split,image_uri,label
TRAIN,../tfrecorder/test_data/images/cat/cat-640x853-1.jpg,cat
TRAIN,../tfrecorder/test_data/images/cat/cat-800x600-2.jpg,cat
TRAIN,../tfrecorder/test_data/images/cat/cat-800x600-3.jpg,cat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x640-1.jpg,goat
TRAIN,../tfrecorder/test_data/images/goat/goat-320x320-2.jpg,goat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x427-3.jpg,goat

Expected behavior
Only create a partition for the TRAIN TFRecords.

Screenshots
Screen Shot 2020-08-13 at 10 31 50 PM

System (please complete the following information):

  • OS: [iOS]
  • Python Version: [3.7.4]
  • TensorFlow Version: [2.2.0]

Additional context
Not sure if this is a bug or is indeed the expected behavior. But as a user, if my CSV file only specifies TRAIN, it is strange that the other files (test and val) are also created, just without any images.

@jmarrietar jmarrietar added the bug Something isn't working label Aug 14, 2020
@mbernico
Contributor

This behavior makes sense given how the code works, but it does seem unexpected to a user, so I think we should address it. We are getting ready to add some features for 0.2's input schema parsing, and it would be good to address this as part of that change. @cfezequiel for awareness.

@cfezequiel
Contributor

Hi @jmarrietar, thanks for the feedback. Yes, it makes sense to write TFRecord files only for splits that have data. I've created a fix in #25.

One caveat is that Apache Beam doesn't really support conditional pipeline transforms (e.g. if dataset is not empty, run transform), so the check for train/validation/test counts is done at the input (DataFrame) level. The issue is that when the pipeline is run, it may flag some erroneous samples in a split as 'discarded', which may still result in an empty TFRecord file.

For instance, let's say all samples in the test dataset are erroneous. The pipeline may still generate a test*.tfrecords.gz file as there were valid samples in the test dataset when it was counted, even though all samples were marked as 'discarded'. This would probably not be a common case however, and it would be up to the user to ensure that their data is correct.
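
To illustrate the input-level check and its caveat, here is a minimal sketch. It is not the code from #25: it uses only the Python standard library in place of a DataFrame, and the function name `splits_with_data` and the split names in `SPLITS` are assumptions made for illustration. It counts rows per split before any pipeline would run, so a row that would later be discarded still counts toward its split.

```python
import csv
import io
from collections import Counter

# Assumed split names for illustration; only TRAIN appears in the CSV above.
SPLITS = ("TRAIN", "VALIDATION", "TEST")


def splits_with_data(csv_text):
    """Return the split names that have at least one row in the input CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["split"].strip().upper() for row in rows)
    # Keep only splits with a non-zero row count, in a fixed order.
    return [name for name in SPLITS if counts[name] > 0]


CSV_TEXT = """split,image_uri,label
TRAIN,../tfrecorder/test_data/images/cat/cat-640x853-1.jpg,cat
TRAIN,../tfrecorder/test_data/images/goat/goat-640x640-1.jpg,goat
"""

# Only TRAIN has rows, so only TRAIN output would be written.
print(splits_with_data(CSV_TEXT))  # ['TRAIN']

# Caveat: a TEST row pointing at a hypothetical missing image is still counted
# here, because validity is only known once the pipeline tries to read it.
CSV_BAD_TEST = CSV_TEXT + "TEST,does/not/exist.jpg,cat\n"
print(splits_with_data(CSV_BAD_TEST))  # ['TRAIN', 'TEST']
```

In the second case the check reports TEST as non-empty, which is how an empty test file can still appear if every TEST sample ends up discarded at run time.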
