-
Notifications
You must be signed in to change notification settings - Fork 323
Backporting FileBasedSink changes from Beam #482
Conversation
LOG.debug("{} temporary files matched {}", matches.size(), pattern); | ||
LOG.debug("Removing {} files.", matches.size()); | ||
fileOperations.remove(matches); | ||
protected final void removeTemporaryFiles(List<String> knownFiles, PipelineOptions options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not backwards-compatible. Please be on-guard for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the current change is technically also not backward-compatible (e.g. the set of protected member variables has changed) for someone who was overriding FileBasedSink/FileBasedWriteOperation/... in a highly non-trivial way (e.g. trying to construct filenames manually using the protected members).
Note that, of the file-based sinks I've seen, none would be broken by this change, because they don't peek into these protected variables.
However, as for those that potentially do, I don't think there's a way to make the current change backward-compatible for them. There's a choice:
- Make the current change, with a (I think) very small chance of breaking somebody, but in return, fix deletion of files and data loss/duplication issues.
- Abandon the current change and say that the problem is only fixed in the Beam SDK
- Do something much weaker but backward-compatible instead: e.g., simply add code for removing all known files, but do not place temp files in a subdirectory. As an unfortunate side effect, this will cause a divergence in code and behavior between Dataflow and Beam SDKs.
Which would you prefer? Are there other options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Beam and Dataflow SDKs have diverged on several points already and are expected to keep diverging. Maintaining consistency between the two is not a goal, providing the best experience with Beam SDK is, maintaining backwards compatibility in Dataflow SDK is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukecwik, you said it so nicely ;-)
OK, I'm closing this PR in favor of #484. |
Original PRs:
apache/beam#1050
apache/beam#1278
Changes simply copied over. No modifications.
R: @lukecwik