Get object sizes based on S3's ListObjects output (just with scan_object_sizes() to start with) #2248
base: master
Conversation
    }
    return result;
    folly::QueuedImmediateExecutor inline_executor;
    return folly::collect(sizes_futs).via(&inline_executor).get();
Is this how we're meant to be collecting in general?
I think it makes more sense than worrying about which executors the futures themselves use and whether deadlocks are possible. Happy to change to the normal idiom and add some tests with single-threaded IO and CPU pools if you feel strongly, though.
Actually I'll add those tests anyway.
I don't think there actually is any risk of deadlock with the CPU and IO executors. This pattern seems fine, but I can't see a reason to go around changing all the existing usages; we should settle on one pattern and use it everywhere.
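For reference, a minimal sketch (not from the PR) contrasting the two idioms under discussion; the function names and the `size_t` payload are illustrative:

```cpp
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/executors/QueuedImmediateExecutor.h>
#include <folly/futures/Future.h>
#include <vector>

std::vector<size_t> collect_inline(std::vector<folly::Future<size_t>>&& futs) {
    // The PR's pattern: block on an inline executor, so completion does not
    // depend on spare capacity in any shared thread pool.
    folly::QueuedImmediateExecutor inline_executor;
    return folly::collect(std::move(futs)).via(&inline_executor).get();
}

std::vector<size_t> collect_on_pool(std::vector<folly::Future<size_t>>&& futs,
                                    folly::CPUThreadPoolExecutor& cpu_pool) {
    // The "normal idiom": chain onto a shared CPU pool. If that same pool is
    // needed for the futures to make progress (e.g. a single-threaded pool),
    // blocking here can deadlock - the concern raised in this thread.
    return folly::collect(std::move(futs)).via(&cpu_pool).get();
}
```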
@@ -184,6 +188,16 @@ class Storages {
        }
    }

    ObjectSizes get_object_sizes(KeyType key_type, const std::string& prefix) {
        ObjectSizes res{key_type, 0, 0};
        for (const auto& storage : storages_) {
I see what you're going for here, but AFAIK none of our other methods check anything other than the first storage
I'll add a check that there is exactly one storage, as there are a few different semantics you might want for this with multiple storages, and it doesn't seem worth guessing while multiple-storage support is still theoretical.
There's a case for ripping out this multiple-storages machinery until we implement it properly, as I imagine most of the implementations here would change anyway when we do.
Why not make it obey `primary_only`, as the other read methods do?
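A hypothetical sketch of the proposed single-storage guard; `util::check` and `primary()` mirror the surrounding codebase, but this is not the PR's actual implementation:

```cpp
// Sketch only: fail loudly unless exactly one storage is configured, then
// delegate to it. Member names here are assumptions based on the diff above.
ObjectSizes get_object_sizes(KeyType key_type, const std::string& prefix) {
    util::check(storages_.size() == 1,
                "get_object_sizes expects exactly one storage, got {}",
                storages_.size());
    return primary().get_object_sizes(key_type, prefix);
}
```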
cpp/arcticdb/storage/storage.hpp (Outdated)
    constexpr auto parse(ParseContext &ctx) { return ctx.begin(); }

    template<typename FormatContext>
    auto format(const ObjectSizes &srv, FormatContext &ctx) const {
srv?
Cut and paste fail
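A sketch of the formatter with the copy-pasted parameter renamed (`srv` to `sizes`); the `ObjectSizes` member names here are assumptions, not the PR's actual fields:

```cpp
// Sketch only: fmt formatter for ObjectSizes with a descriptive argument name.
template<>
struct fmt::formatter<arcticdb::storage::ObjectSizes> {
    template<typename ParseContext>
    constexpr auto parse(ParseContext& ctx) { return ctx.begin(); }

    template<typename FormatContext>
    auto format(const arcticdb::storage::ObjectSizes& sizes, FormatContext& ctx) const {
        // Field names (key_type_, count_, compressed_size_) are assumed.
        return fmt::format_to(ctx.out(), "ObjectSizes[key_type={}, count={}, compressed_size={}]",
                              sizes.key_type_, sizes.count_, sizes.compressed_size_);
    }
};
```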
    // Ignore some exceptions; someone might be deleting while we scan
    res.push_back(std::move(fut)
        .thenValue([](auto&&) { return folly::Unit{}; })
        .thenError(folly::tag_t<storage::KeyNotFoundException>{}, [](auto&&) { return folly::Unit{}; }));
Can use `collectAll` for this.
I'm not sure it would simplify things - we could `collectAll` on the `batch_read_compressed` result, but we would still need a for loop to chain the continuations on to the results. And short-circuiting if there is an unexpected exception seems desirable.
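A sketch (not the PR's code) of the trade-off: per-future `thenError` keeps short-circuiting on unexpected exceptions, while `collectAll` defers all error handling until the whole batch completes. `storage::KeyNotFoundException` follows the diff above; the helper names are illustrative:

```cpp
#include <folly/futures/Future.h>
#include <vector>

// Option A, as in the PR: tolerate KeyNotFoundException per future, so a
// later collect() on the results still short-circuits on anything unexpected.
std::vector<folly::Future<folly::Unit>> tolerate_missing(
        std::vector<folly::Future<folly::Unit>>&& futures) {
    std::vector<folly::Future<folly::Unit>> res;
    for (auto&& fut : futures) {
        res.push_back(std::move(fut)
            .thenValue([](auto&&) { return folly::Unit{}; })
            .thenError(folly::tag_t<storage::KeyNotFoundException>{},
                       [](auto&&) { return folly::Unit{}; }));
    }
    return res;
}

// Option B: collectAll waits for every future regardless of failures, so
// unexpected exceptions only surface after the entire batch has finished.
void wait_all(std::vector<folly::Future<folly::Unit>>&& futures) {
    auto tries = folly::collectAll(std::move(futures)).get();
    for (auto& t : tries) {
        if (t.hasException() && !t.hasException<storage::KeyNotFoundException>()) {
            t.exception().throw_exception();  // rethrow anything unexpected
        }
    }
}
```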
    }

-    ListObjectsOutput output = {s3_object_names, next_continuation_token};
+    ListObjectsOutput output = {s3_object_names, s3_object_sizes, next_continuation_token};
Looks like the first two arguments should be moved in.
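Sketched, the suggested change would look like this (names taken from the diff above):

```cpp
// Move the two vectors into the aggregate instead of copying them.
ListObjectsOutput output = {std::move(s3_object_names),
                            std::move(s3_object_sizes),
                            next_continuation_token};
```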
    do {
        auto list_objects_result = s3_client.list_objects(path_info.key_prefix_, bucket_name, continuation_token);
        if (list_objects_result.is_success()) {
            auto& output = list_objects_result.get_output();
const?
Limited to just `scan_object_sizes` to start with, to show the idea.

Calculate object sizes using functionality in the storage backend itself when possible. This PR calculates compressed sizes on S3 based on the `ListObjectsV2` output, so we don't need to read all the keys. For storages where we haven't implemented anything fancy, just read all the (compressed) keys and check the compressed size on their header.
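As an illustration of the idea, using the AWS SDK for C++ directly rather than ArcticDB's wrapper (the bucket and prefix are placeholders, and `Aws::InitAPI` is assumed to have been called), object counts and compressed sizes can be summed straight from the listing without fetching any object:

```cpp
#include <aws/s3/S3Client.h>
#include <aws/s3/model/ListObjectsV2Request.h>
#include <cstdint>

void sum_sizes(Aws::S3::S3Client& client) {
    std::uint64_t count = 0, compressed_bytes = 0;
    Aws::S3::Model::ListObjectsV2Request request;
    request.WithBucket("my-bucket").WithPrefix("my/prefix");  // placeholders
    bool more = true;
    while (more) {
        auto outcome = client.ListObjectsV2(request);
        if (!outcome.IsSuccess()) {
            break;  // real code would surface the error
        }
        const auto& result = outcome.GetResult();
        for (const auto& obj : result.GetContents()) {
            compressed_bytes += obj.GetSize();  // size reported by S3, no GET needed
            ++count;
        }
        // Page through the listing until S3 reports no more results.
        more = result.GetIsTruncated();
        request.SetContinuationToken(result.GetNextContinuationToken());
    }
}
```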
Remaining work:
- `scan_object_sizes_by_stream`
- `scan_object_sizes_for_stream`
- `AdminTools` Python API on the v2 API, including search by regex
- Docs page to explain the output