Skip to content
This repository was archived by the owner on Nov 11, 2022. It is now read-only.

Backporting FileBasedSink changes from Beam #482

Closed
wants to merge 1 commit into from

Conversation

jkff
Copy link
Contributor

@jkff jkff commented Nov 10, 2016

Original PRs:
apache/beam#1050
apache/beam#1278

Changes simply copied over. No modifications.

R: @lukecwik

LOG.debug("{} temporary files matched {}", matches.size(), pattern);
LOG.debug("Removing {} files.", matches.size());
fileOperations.remove(matches);
protected final void removeTemporaryFiles(List<String> knownFiles, PipelineOptions options)
Copy link
Contributor

@dhalperi dhalperi Nov 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not backwards-compatible. Please be on-guard for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the current change is technically also not backward-compatible (e.g. the set of protected member variables has changed) for someone who was overriding FileBasedSink/FileBasedWriteOperation/... in a highly non-trivial way (e.g. trying to construct filenames manually using the protected members).

Note that, of the file-based sinks I've seen, none would be broken by this change, because they don't peek into these protected variables.

However, as for those that potentially do, I don't think there's a way to make the current change backward-compatible for them. There's a choice:

  • Make the current change, with a (I think) very small chance of breaking somebody, but in return, fix deletion of files and data loss/duplication issues.
  • Abandon the current change and say that the problem is only fixed in the Beam SDK
  • Do something much weaker but backward-compatible instead: e.g., simply add code for removing all known files, but do not place temp files in a subdirectory. As an unfortunate side effect, this will cause a divergence in code and behavior between Dataflow and Beam SDKs.

Which would you prefer? Are there other options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3 is the default choice, unless very strong guarantees can be made regarding #1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Beam and Dataflow SDKs have diverged on several points already and are expected to keep diverging. Maintaining consistency between the two is not a goal, providing the best experience with Beam SDK is, maintaining backwards compatibility in Dataflow SDK is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukecwik, you said it so nicely ;-)

@jkff
Copy link
Contributor Author

jkff commented Nov 14, 2016

OK, I'm closing this PR in favor of #484.

@jkff jkff closed this Nov 14, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants