Get object sizes based on S3's ListObjects output (just with scan_object_sizes() to start with) #2248

poodlewars · 2025-03-14T16:24:13Z

Monday: 8560764974

Limited to just scan_object_sizes to start with, to show the idea.

Calculate object sizes using functionality in the storage backend itself when possible. This PR calculates compressed sizes on S3 based on the ListObjectsV2 output, so we don't need to read all the keys.

For storages where we haven't implemented anything fancy, just read all the (compressed) keys and check the compressed size in their headers.
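
A minimal sketch of the idea described above, not the code in this PR: a backend that can report sizes from its listing call is used directly, and any other backend falls back to reading each key and taking the compressed size from its header. Every name here (`Backend`, `list_with_sizes`, `read_header_sizes`, `SizeSummary`, the two toy backends) is hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; none of these names come from the PR.
struct SizeSummary {
    std::size_t count = 0;              // number of objects seen
    std::uint64_t compressed_bytes = 0; // total compressed size
};

struct ListedObject {
    std::string key;
    std::uint64_t size; // size as reported by the backend's listing call
};

struct Backend {
    virtual ~Backend() = default;
    // Backends that can report sizes while listing override this.
    virtual std::optional<std::vector<ListedObject>> list_with_sizes(const std::string& /*prefix*/) const {
        return std::nullopt;
    }
    // Fallback: read every key and return the compressed size from each header.
    virtual std::vector<std::uint64_t> read_header_sizes(const std::string& prefix) const = 0;
};

inline SizeSummary object_sizes(const Backend& backend, const std::string& prefix) {
    SizeSummary out;
    if (auto listed = backend.list_with_sizes(prefix)) {
        // Native path: no object bodies are read.
        for (const auto& obj : *listed) {
            ++out.count;
            out.compressed_bytes += obj.size;
        }
        return out;
    }
    // Generic path: one read per key.
    for (std::uint64_t size : backend.read_header_sizes(prefix)) {
        ++out.count;
        out.compressed_bytes += size;
    }
    return out;
}

// Toy backend that reports sizes from its listing (stands in for an S3-like store).
struct ListingBackend : Backend {
    std::optional<std::vector<ListedObject>> list_with_sizes(const std::string&) const override {
        return std::vector<ListedObject>{{"a", 10}, {"b", 20}};
    }
    std::vector<std::uint64_t> read_header_sizes(const std::string&) const override { return {}; }
};

// Toy backend with no native sizing, so the fallback path is used.
struct HeaderOnlyBackend : Backend {
    std::vector<std::uint64_t> read_header_sizes(const std::string&) const override { return {1, 2, 3}; }
};
```

The caller sees the same result shape from both paths; only the cost differs, since the fallback has to read every key.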

Remaining work:

Use this approach for scan_object_sizes_by_stream
Make sure we have testing against all backends, especially NFS
Implement scan_object_sizes_for_stream
Native size calculations for LMDB and Azure
Native size calculation for library size (LMDB has an API to do this "in one")
AdminTools Python API on the v2 API, including search by regex. Docs page to explain the output.

alexowens90 · 2025-03-17T09:21:34Z

cpp/arcticdb/version/local_versioned_engine.cpp

-    }
-    return result;
+    folly::QueuedImmediateExecutor inline_executor;
+    return folly::collect(sizes_futs).via(&inline_executor).get();


Is this how we're meant to be collecting in general?

I think it makes more sense, rather than worrying about which executors the futures themselves use and whether deadlocks are possible. Happy to change to the normal idiom and add some tests with single thread IO and CPU pools if you feel strongly though.

Actually I'll add those tests anyway.

I don't think there is actually any risk of deadlock with the CPU and IO executors. This pattern seems fine, but I can't see a reason to go around changing all the existing usages - we should settle on one pattern and use it everywhere.

alexowens90 · 2025-03-17T09:26:13Z

cpp/arcticdb/storage/storages.hpp

@@ -184,6 +188,16 @@ class Storages {
        }
    }

+    ObjectSizes get_object_sizes(KeyType key_type, const std::string& prefix) {
+        ObjectSizes res{key_type, 0, 0};
+        for (const auto& storage : storages_) {


I see what you're going for here, but AFAIK none of our other methods check anything other than the first storage

I'll add a check that there is exactly one storage, as there are a few different semantics you might want for this with multiple storages, and it doesn't seem worth guessing while the multiple storages thing is theoretical.

There's a case for ripping out this multiple storages stuff until we implement it properly, as I imagine most of the implementations here would change when we do that anyway.

Why not make it obey the primary_only as the other read methods do?
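
For illustration, the "exactly one storage" check proposed in this thread could look roughly like the sketch below. `ToyStorage`, `ToyStorages` and `SizeTotals` are hypothetical stand-ins, not the real `Storages` class, and the exception type is an assumption. Obeying a primary-only flag, as suggested in the last comment, would be an alternative that this sketch does not show.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real classes in the PR differ.
struct SizeTotals {
    std::size_t count;
    std::uint64_t compressed_bytes;
};

struct ToyStorage {
    SizeTotals totals;
    SizeTotals get_object_sizes(const std::string& /*prefix*/) const { return totals; }
};

class ToyStorages {
public:
    explicit ToyStorages(std::vector<ToyStorage> storages) : storages_(std::move(storages)) {}

    SizeTotals get_object_sizes(const std::string& prefix) const {
        // Refuse to guess the semantics (sum? primary only?) while
        // multiple storages remain theoretical.
        if (storages_.size() != 1) {
            throw std::logic_error("get_object_sizes expects exactly one storage");
        }
        return storages_.front().get_object_sizes(prefix);
    }

private:
    std::vector<ToyStorage> storages_;
};
```

Failing loudly keeps the behaviour unambiguous until a multi-storage semantic is actually chosen.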

alexowens90 · 2025-03-17T09:27:24Z

cpp/arcticdb/storage/storage.hpp

+    constexpr auto parse(ParseContext &ctx) { return ctx.begin(); }
+
+    template<typename FormatContext>
+    auto format(const ObjectSizes &srv, FormatContext &ctx) const {


Cut and paste fail

alexowens90 · 2025-03-17T10:14:42Z

cpp/arcticdb/stream/stream_source.hpp

+            // Ignore some exceptions, someone might be deleting while we scan
+            res.push_back(std::move(fut)
+                              .thenValue([](auto&&) {return folly::Unit{};})
+                              .thenError(folly::tag_t<storage::KeyNotFoundException>{}, [](auto&&) { return folly::Unit{}; }));


Can use collectAll for this

I'm not sure it would simplify things - we could collectAll on the batch_read_compressed result but we would still need a for loop to chain the continuations on to the results. And short-circuiting if there is an unexpected exception seems desirable.
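
The behaviour discussed here (tolerate a key that disappears mid-scan, but stop on any other error) can be sketched with standard-library futures. This is an analogue using `std::future`, not folly, and `KeyNotFound`, `ready_value`, `failed_with` and `wait_ignoring_missing` are hypothetical names invented for the example.

```cpp
#include <cstddef>
#include <exception>
#include <future>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical exception type standing in for the storage's "key not found" error.
struct KeyNotFound : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Helper: a future that already holds a value.
inline std::future<int> ready_value(int v) {
    std::promise<int> p;
    p.set_value(v);
    return p.get_future();
}

// Helper: a future that already holds an exception of type E.
template <typename E>
std::future<int> failed_with(E e) {
    std::promise<int> p;
    p.set_exception(std::make_exception_ptr(std::move(e)));
    return p.get_future();
}

// Waits on each read in order. A KeyNotFound is swallowed (someone may be
// deleting while we scan); any other exception propagates from the first
// failing read, so later reads are not waited on.
inline std::size_t wait_ignoring_missing(std::vector<std::future<int>>& reads) {
    std::size_t succeeded = 0;
    for (auto& read : reads) {
        try {
            read.get();
            ++succeeded;
        } catch (const KeyNotFound&) {
            // Key was deleted concurrently; treat as a no-op.
        }
    }
    return succeeded;
}
```

Unlike a collect-everything approach, the loop stops at the first unexpected failure, which matches the short-circuiting preference stated in the reply.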

alexowens90 · 2025-03-17T10:18:51Z

cpp/arcticdb/storage/s3/s3_client_impl.cpp

    }

-    ListObjectsOutput output = {s3_object_names, next_continuation_token};
+    ListObjectsOutput output = {s3_object_names, s3_object_sizes, next_continuation_token};


Looks like the first 2 arguments should be moved in (std::move) rather than copied
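
As a small illustration of the suggestion, moving the vectors into the aggregate avoids copying every name and size. `ListOutput` and `make_output` below are hypothetical stand-ins; the member types of the real `ListObjectsOutput` are assumed, not taken from the PR.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the listing result; member types are assumptions.
struct ListOutput {
    std::vector<std::string> names;
    std::vector<std::uint64_t> sizes;
    std::optional<std::string> next_token;
};

// Takes the vectors by value and moves them into the result, so a caller
// that passes rvalues pays for no element copies.
inline ListOutput make_output(std::vector<std::string> names,
                              std::vector<std::uint64_t> sizes,
                              std::optional<std::string> next_token) {
    return ListOutput{std::move(names), std::move(sizes), std::move(next_token)};
}
```

A caller would write `make_output(std::move(names), std::move(sizes), token)` and must not rely on the contents of `names` and `sizes` afterwards.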

alexowens90 · 2025-03-17T10:21:04Z

cpp/arcticdb/storage/s3/detail-inl.hpp

+    do {
+        auto list_objects_result = s3_client.list_objects(path_info.key_prefix_, bucket_name, continuation_token);
+        if (list_objects_result.is_success()) {
+            auto& output = list_objects_result.get_output();


poodlewars added the "patch" label (Small change, should increase patch version) Mar 14, 2025

Get object sizes based on S3's ListObjects output (just with scan_obj…

7739abc

…ect_sizes() to start with) 8560764974

poodlewars force-pushed the aseaton/8560764974/library-size-api-storage-native branch from 02760e6 to 7739abc Compare March 14, 2025 16:31

poodlewars marked this pull request as ready for review March 14, 2025 16:52

poodlewars requested review from alexowens90 and willdealtry as code owners March 14, 2025 16:52

alexowens90 reviewed Mar 17, 2025

alexowens90 approved these changes Mar 17, 2025

Get object sizes based on S3's ListObjects output - implement PR revi…

9322077

…ew comments 8560764974
