Foundation of API for reading Variant data and metadata #7535


Open

wants to merge 44 commits into main

Conversation


@mkarbo mkarbo commented May 21, 2025

Which issue does this PR close?

Closes #7423

Rationale for this change

We need to agree on an API for reading Variant metadata. Based on the work and discussions in #7452, this PR proposes an API plus an implementation (WIP while in draft) for reading Variant metadata in the parquet-variant crate.
Much of this builds on the work in #7452 by @PinkCrow007 and feedback from @alamb, @scovich, and @Weijun-H.

What changes are included in this PR?

  • Adds Variant enum (and associated structs)
  • Adds an API for parsing and reading metadata
  • Adds an API for parsing and reading Variant values of various types

We aim to be result- and validation-driven while ensuring zero allocations, and we do so without relying on serde_json.

We tried to keep the Variant API similar to the serde_json::Value API.
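
For illustration, here is a hypothetical usage sketch (not code from this PR -- the exact enum shape is the PR's to define, though the Variant::ShortString and Variant::Object variants appear later in this review):

use parquet_variant::Variant;

// Hypothetical sketch: summarize an already-decoded Variant value.
fn summarize(variant: &Variant<'_, '_>) -> String {
    match variant {
        Variant::ShortString(s) => format!("short string: {s}"),
        Variant::Object(_) => "an object".to_string(),
        _ => "another primitive value".to_string(),
    }
}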

Are there any user-facing changes?

The new APIs added in parquet-variant will be user-facing.

Author

mkarbo commented May 21, 2025

@alamb @PinkCrow007 fyi

For the first draft, it only contains a small subset of the primitive types, so we can get feedback on that before implementing decoding and conversion for all primitive types.

Tried to incorporate and address as many comments from alamb and scovich in #7452 as possible, but I can't promise that all have been addressed -- please let me know.

Contributor

@alamb alamb left a comment

This is great, thank you so much @mkarbo -- I left some comments but I think this is a super strong start.

Let's see if we get any more API suggestions

I also made a PR for your consideration:

I'll also go and try to fix some of the examples upstream in parquet-data

}

#[test]
fn test_string() -> Result<(), ArrowError> {
Contributor

FWIW I don't think these unit tests are necessary given we have code that is reading the values from parquet-testing

Author

Yeah, let me write these out a little more. I think having some hand-crafted tests (e.g., also for object) where we see how the bytes look will be helpful for readers as documentation, not just in validating. We can add test variants using both the builder & manual process.

I personally learnt a lot by writing out the details with pen and paper for the metadata and value bytes of various variant data such as 42, <short string>, <longer string>, {'a': 1, 'b': 2, 'c': <uuid>} and {'a': 1, 'b': 'hello 💯', 'c': {'x': 99, 'y': true, 'z': <uuid>}}, and I think that could help future maintainers get onboarded (I would update the tests with more in-depth docs).

However, we can also skip them if you guys prefer, lmk.
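
For example, a sketch of such a hand-worked layout (the byte values below are my reading of the Variant spec's encoding of the int8 value 42 and would need to be double-checked against the spec; they are not taken from this PR's tests):

#[test]
fn test_int8_42_byte_layout() {
    // Metadata for an empty dictionary: header 0x01 (version = 1, unsorted,
    // 1-byte offsets), dictionary_size = 0, and the single trailing offset 0.
    let metadata: [u8; 3] = [0x01, 0x00, 0x00];
    // Value for the int8 42: header 0x0C (basic_type = 0 primitive,
    // primitive type id = 3 int8), followed by the payload byte 42.
    let value: [u8; 2] = [0x0C, 0x2A];

    assert_eq!(metadata[0] & 0x0F, 1); // version
    assert_eq!(metadata[1], 0); // dictionary_size
    assert_eq!(value[0] & 0x03, 0); // basic type: primitive
    assert_eq!(value[0] >> 2, 3); // primitive type id: int8
    assert_eq!(value[1], 42);
}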

Contributor

I think they are ok, maybe we just shouldn't go too crazy.

The other thing we could do is to make them examples in the documentation -- which would help both users and maintainers.

use strum_macros::EnumDiscriminants;

#[derive(Clone, Copy, Debug, PartialEq)]
/// Encodes the Variant Metadata, see the Variant spec file for more information
Contributor

I think this API looks pretty good to me.

One thing that we might want to do is to make a VariantMetadata::try_new function that checks and parses out the sorted_strings, dict_len, etc. fields. To do anything with the metadata those fields will need to be decoded, so it seems like it would be a nicer API to compute and validate them once on construction rather than on each access

Contributor

+1

Also, pay the cost to construct it once and pass references to everyone else that needs it.

Contributor

Something like this:

struct VariantMetadataHeader {
    version: u8,
    sorted: bool,
    offset_size_bytes: OffsetSizeBytes,
}

impl<'m> VariantMetadata<'m> {
    pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
        let Some(header) = bytes.get(0) else {
            return Err(...);
        };
        let header = VariantMetadataHeader::try_new(bytes)?;
        let dictionary_size = header.offset_size_bytes.unpack_usize(bytes, 1)?;
        
        // verify the offsets are monotonically increasing
        let valid = (0..=dictionary_size)
            .map(|i| header.offset_size_bytes.unpack_usize(bytes, 1, i)?)
            .scan(0, |prev, cur| cur.is_ok_and(|i| prev <= i))
            .all(|valid| valid);
        if !valid {
            return Err(...);
        }
        Ok(Self { ... })
    }
    
    pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> {
        // TODO: Should we memoize the offsets? There could be thousands of them...
        let start = self.header.offset_size_bytes.unpack_usize(self.bytes, self.first_offset, i)?;
        let end = self.header.offset_size_bytes.unpack_usize(self.bytes, self.first_offset, i+1)?;
        let bytes = slice_from_slice(self.value_bytes, self.first_value+start..self.first_value+end)?;
        Ok(str::from_utf8(bytes).map_err(|_| invalid_utf8_err())?)
    }
}

The important thing, I suspect, is to verify that the offsets are monotonically increasing, so we don't risk panics due to a case where end < begin. Tho I suppose we could also check that in the individual accessors instead.
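
A standalone, compiling sketch of that monotonicity check (simplified; `get_offset` stands in for the offset-unpacking helper above, and ArrowError is assumed to come from arrow_schema):

use arrow_schema::ArrowError;

// Returns an error if any offset is smaller than its predecessor.
// `get_offset(i)` is a stand-in for unpacking the i-th offset from the buffer.
fn check_monotonic_offsets(
    num_offsets: usize,
    get_offset: impl Fn(usize) -> Result<usize, ArrowError>,
) -> Result<(), ArrowError> {
    let mut prev = 0usize;
    for i in 0..num_offsets {
        let cur = get_offset(i)?;
        if cur < prev {
            return Err(ArrowError::InvalidArgumentError(
                "offsets are not monotonically increasing".to_string(),
            ));
        }
        prev = cur;
    }
    Ok(())
}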

Contributor

Actually, there are four different XX_size_minus_one situations in the variant spec (metadata dictionary offsets, array value offsets, object field ids, and object value offsets). We should probably factor out a helper class for working with them, since it's ~always the same:

enum PackedUsizeArrayElementSize {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl PackedUsizeArrayElementSize {
    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> { 
        use PackedUsizeArrayElementSize::*; 
        let result = match offset_size_minus_one { 
            0 => One, 
            1 => Two, 
            2 => Three, 
            3 => Four, 
            _ => return Err(...),
         }; 
         Ok(result)
     }
}

struct PackedUsizeArray<'a> {
    element_size: PackedUsizeArrayElementSize,
    num_elements: usize,
    elements: &'a [u8],
}

impl<'a> PackedUsizeArray<'a> {
    /// NOTE: Only looks at the lower 2 bits of the element size!
    fn try_new(
        bytes: &'a [u8],
        byte_offset: usize,
        element_size: PackedUsizeArrayElementSize,
        num_elements: usize,
    ) -> Result<Self, ArrowError> {
        let len = (element_size as usize)*num_elements;
        Ok(Self {
            elements: slice_from_slice(bytes, byte_offset..byte_offset+len)?,
            element_size,
            num_elements,
        })
    }

    fn get(&self, i: usize) -> Result<usize, ArrowError> {
        self.element_size.unpack_usize(self.elements, 0, i)
    }
}

Contributor

Actually, there are four different XX_size_minus_one situations in the variant spec (metadata dictionary offsets, array value offsets, object field ids, and object value offsets). We should probably factor out a helper class for working with them, since it's ~always the same:

If we really want to obsess about performance, we could potentially make VariantMetadata templated on the offset size. I am not sure how much that will matter so we probably shouldn't do it at this point

Contributor

Yeah... and the use sites in object and array variant would anyway be the more likely performance bottleneck. Unless we want to define a different enum variant for each of those as well... 4*4 (object) + 4 (array) = 20 total combinations... seems excessive. A middle ground would be to special-case just the common ones (e.g. object where both sizes are 1 or both are 4; array where the size is 1 or 4). But it's still extra complexity for likely a very small gain. We'd be trading off more match arms in one place, in order to eliminate match arms somewhere nearby.

Author

Should we open a ticket to explore this after this one? I would like to get the API ready so my team can work in parallel on the rest of the primitives and the optimisations needed.

self.buffer[(self.seen ) * self.offset_size + 1 // +1 to skip header
..(self.seen ) * self.offset_size + 1]
.try_into()
.ok()?,
Contributor

This is interesting -- what should we do when we hit an offset that is an invalid usize? I think this code will basically return None (stop iterating) rather than erroring in this situation.

Maybe when someone calls iter() it would make sense to validate all the offsets then 🤔

Contributor

I don't think the offsets are ever valid usizes? They occupy 1, 2, 3, or 4 bytes, depending on the metadata header's (2-bit) offset_size_minus_one field? We'll need to extract the correct number of bytes, pad it out to u32 (or usize) and then try_into + from_le_bytes?

Also, I think this code misses length checks needed for those subslice ranges to not panic?
Or did I miss some earlier global check that covers the whole iteration?
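
A small sketch of that padding approach for the 3-byte case (a hypothetical helper, not code from this PR):

// Unpack a 3-byte little-endian offset by padding it out to a u32.
// Returns None if the slice is too short (the length check the caller needs anyway).
fn unpack_u24_le(bytes: &[u8], byte_offset: usize) -> Option<usize> {
    let b = bytes.get(byte_offset..byte_offset + 3)?;
    let padded = [b[0], b[1], b[2], 0];
    Some(u32::from_le_bytes(padded) as usize)
}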

Contributor

We'd want some helper like this:

enum OffsetSizeBytes {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl OffsetSizeBytes {
    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> {
        use OffsetSizeBytes::*;
        let result = match offset_size_minus_one {
            0 => One,
            1 => Two,
            2 => Three,
            3 => Four,
            _ => return Err(...),
        };
        Ok(result)
    }

    fn unpack_usize(
        &self,
        bytes: &[u8],
        byte_offset: usize, // how many bytes to skip
        offset_index: usize, // which offset in an array of offsets
    ) -> Result<usize, ArrowError> {
        use OffsetSizeBytes::*;
        let offset = byte_offset + (self as usize)*offset_index;
        let result = match self {
            One => u8::from_le_bytes(array_from_slice(bytes, offset)?).into(),
            Two => u16::from_le_bytes(array_from_slice(bytes, offset)?).into(),
            Three => todo!(), // ugh, endianness
            Four => u32::from_le_bytes(array_from_slice(bytes, offset)?).try_into()?,
        };
        Ok(result)
    }
}

For simplicity (and IIRC also speed), I believe this kind of enum-based dispatch is generally preferred in rust, instead of function pointers?

Contributor

@scovich scovich left a comment

Flushing comments at EOD...


Variant::ShortString(Cow::Borrowed(decoder::decode_short_string(value)?))
}

VariantType::Object => Variant::Object(VariantObject {
Contributor

Probably? VariantObject::try_new and VariantArray::try_new should probably do basic header and offset monotonicity validation, similar to what I suggested for VariantMetadata::try_new (see comment above)?

impl<'m, 'v> Index<usize> for VariantArray<'m, 'v> {
type Output = Variant<'m, 'v>;

fn index(&self, _index: usize) -> &Self::Output {
Contributor

I don't think this trait works, if it forces us to return a reference?
We need to return a Variant by value, which wraps the referenced bytes.

Contributor

I agree -- that will likely become obvious if we try to implement it.


}

/// Extracts the primitive type from a header byte
pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
Contributor

@scovich's comments got me thinking we could implement this as impl TryFrom<u8> for VariantPrimitiveType

Which I think would be more "idiomatic" Rust and would read better.

Similarly for other APIs in this module
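
A sketch of that shape (the variant names and type-id values below are illustrative placeholders, not the PR's definitions):

use arrow_schema::ArrowError;

#[derive(Debug, Clone, Copy)]
enum VariantPrimitiveType {
    Null,
    BooleanTrue,
    BooleanFalse,
    Int8,
}

impl TryFrom<u8> for VariantPrimitiveType {
    type Error = ArrowError;

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Self::Null),
            1 => Ok(Self::BooleanTrue),
            2 => Ok(Self::BooleanFalse),
            3 => Ok(Self::Int8),
            other => Err(ArrowError::InvalidArgumentError(format!(
                "unknown primitive type id: {other}"
            ))),
        }
    }
}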


("primitive_string", Variant::from("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")),
// Using the From<String> trait
("short_string", Variant::from("Less than 64 bytes (❤\u{fe0f} with utf8)")),
// TODO Reenable when https://github.com/apache/parquet-testing/issues/81 is fixed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Comment on lines +91 to +94
/// To be used in `map_err` when unpacking an integer from a slice of bytes.
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
ArrowError::InvalidArgumentError(e.to_string())
}
Contributor

I think it would actually be better to improve this conversion a bit to explain that the error comes from trying to parse variants. If we just add a blanket From impl the code will just work, but users will see messages like Invalid slice, which will be pretty unspecific
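
For example, a sketch of a context-adding version of the helper quoted above (the error wording is illustrative):

use std::array::TryFromSliceError;
use arrow_schema::ArrowError;

/// To be used in `map_err` when unpacking an integer from a slice of bytes.
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
    ArrowError::InvalidArgumentError(format!(
        "error parsing Variant bytes: invalid slice length: {e}"
    ))
}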

Author

mkarbo commented May 23, 2025

I am refactoring the Array to have try_new and will try to consolidate to avoid double validation of metadata.

Contributor

alamb commented May 23, 2025

I took the liberty of merging up from main and adding headers to the files to get CI to pass

@alamb alamb changed the title API for reading Variant data and metadata Foundation of API for reading Variant data and metadata May 23, 2025
@alamb alamb marked this pull request as ready for review May 23, 2025 21:33
Contributor

@alamb alamb left a comment

TLDR: I think this PR is good enough to merge so we can begin making improvements (like running checks in CI, adding docs, tests, other support, etc.)

Thank you so much for spearheading it @mkarbo and for all the detailed feedback @scovich and @mapleFU .

In my opinion the only outstanding questions are some of the details of handling Objects and arrays but I think we can iterate on that in subsequent PRs as we haven't really committed to an API

What do you think @mkarbo and @scovich ? Is this good enough to merge and make follow on PRs?

FYI @PinkCrow007 -- if you have time to review this it would be most appreciated

"First offset is non-zero".to_string(),
));
}
if prev.is_some_and(|prev| prev >= offset) {
Member

I just found an interesting question: is an empty key valid here?

Contributor

This code can probably be simplified, but meanwhile -- empty prev means this is the first entry, and there is no previous entry it could disagree with.

Member

I mean, what if prev == offset, which means the key is empty -- could this scenario exist?

Contributor

Yipes, good catch! The metadata dictionary encodes field names, and JSON definitely allows empty field names. Plus, the spec allows duplicates in an unordered dictionary (**), so we could even encounter more than one empty string.

(**) The spec says the following:

If sorted_strings is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.

The spec also says:

An object may not contain duplicate keys

... which would be "not fun" to enforce if the dictionary contained duplicate entries -- you could end up with two different field ids whose values compare equal.

I guess this opens yet another question about how much validation we should be performing? If it's sorted, do we need to verify that each string is lexicographically greater-than its predecessor?

Member

Currently in the C++ implementation, I just do some O(1) checks to prevent memory issues and trust that the user passes correct data. Validating whether a string dictionary is sorted is really tricky (maybe it's also tricky for field_ids in objects). I think a Validate function can be added but not run by default

For the empty key, I don't know how shredding handles this. I've sent a mail to the mailing list; maybe we can clarify this in the coming days: https://lists.apache.org/thread/1shsf89w9dxgjp0d8kgr7dom6rh5sk9t

Contributor

I think a Validate function can be added but not run by default

I agree with this design idea -- as long as invalid data supplied by the user can not cause undefined behavior (crashes or memory unsafety) I think it is fine to elide expensive checks by default and provide a more expensive validation routine

If you agree @scovich I will file a ticket to track the idea

Contributor

@scovich scovich May 27, 2025

Agree -- a separate in-depth validate method sounds like a great way to balance efficiency with safety.

So for this PR, we just add basic O(1) validation and sanity checks (**), and leave the in-depth validation as a tracking ticket?

(**) I guess such checks must include the header, and probably also would check that the data section of bytes is not obviously too small? As in, verify that the first offset is 0 and that last offset makes sense vs. the byte slice we have to work with?
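
A sketch of what those O(1) checks might look like (names such as first_value_byte follow snippets quoted elsewhere in this review; this is illustrative, not the PR's code):

use arrow_schema::ArrowError;

// Cheap sanity checks: the first offset must be 0, and the buffer must be long
// enough to contain the string data implied by the last offset.
fn basic_offset_checks(
    bytes: &[u8],
    first_offset: usize,
    last_offset: usize,
    first_value_byte: usize,
) -> Result<(), ArrowError> {
    if first_offset != 0 {
        return Err(ArrowError::InvalidArgumentError(
            "First offset is non-zero".to_string(),
        ));
    }
    let required_len = first_value_byte
        .checked_add(last_offset)
        .ok_or_else(|| ArrowError::InvalidArgumentError("offset overflow".to_string()))?;
    if required_len > bytes.len() {
        return Err(ArrowError::InvalidArgumentError(
            "Buffer is shorter than the last offset implies".to_string(),
        ));
    }
    Ok(())
}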

Author

Did we file a ticket already?

Contributor

@scovich scovich left a comment

Finished a review pass. I think more changes landed while I was reviewing tho.

Comment on lines +489 to +490
impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
fn from(value: &'v str) -> Self {
Contributor

Hmm. I'm not sure these From conversions are actually helpful in the decoder path? Strings, in particular, can't just be any &'v str -- they have to reside in the correct location inside the value byte slice that comprises the overall variant. Even in the builder path, turning a string to a variant would be a bytewise copy.

Contributor

scovich commented May 23, 2025

In my opinion the only outstanding questions are some of the details of handling Objects and arrays but I think we can iterate on that in subsequent PRs as we haven't really committed to an API

What do you think @mkarbo and @scovich ? Is this good enough to merge and make follow on PRs?

We're getting very close. I agree the specifics of objects and arrays can be follow-on work.

A few remaining items on my mind (some of which are comments that github oh-so-helpfully hides):

  • How much "extra" validation (= not immediately needed by the function that performs it) should we be doing? See Foundation of API for reading Variant data and metadata #7535 (comment)
  • The VariantMetadata API seems overly wide and complex, I wonder if we can simplify it? See Foundation of API for reading Variant data and metadata #7535 (comment)
  • Significant code simplifications still seem to be possible. A good chunk of this is IMO due to the previous two points. I'm interested in this because we're still finding ways to factor out redundant code to helper functions that will likely make the remaining pathfinding easier. Also because this is very complex stuff; the less code we use to express it, the smaller the cognitive burden and bug surface will be. Plus, it's something that gets harder and harder to fix later as the code grows and matures.

I suspect one more pass will address the code simplifications. Not sure how to resolve the other two most efficiently? Maybe others could chime in so we have more voices to consider?

Contributor

scovich commented May 23, 2025

Thank you so much for spearheading it @mkarbo

+100 -- to both you and PinkCrow007 (who github won't let me mention for some reason)

Author

mkarbo commented May 26, 2025

Thanks all, I can give it another go tomorrow I think with all the comments.

Contributor

@scovich scovich left a comment

LGTM, except a few older review comments might have been missed? Also a bunch of nits to consider either before merge or as follow-up work.

  • Potential to have a VariantMetadata::unpack method -- https://github.com/apache/arrow-rs/pull/7535/files#r2105478634

  • Questionable utility of impl<'v> From<&'v str> for Variant -- https://github.com/apache/arrow-rs/pull/7535/files#r2105513563

  • Better for VariantMetadata to present a (narrow) Vec<&str>-like interface? -- #7535 (comment)

}

/// Extracts the basic type from a header byte
pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
Contributor

nit: Do we ever expect to return Err? Seems infallible?

Also, out of curiosity, why is one a function and the other an impl [Try]From with a function as well?
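
For what it's worth, a sketch of why it could be infallible: the basic type is only the low two bits of the header byte, and all four 2-bit values are defined by the spec (the variant names below are illustrative, not the PR's definitions):

#[derive(Debug, Clone, Copy)]
enum VariantBasicType {
    Primitive,
    ShortString,
    Object,
    Array,
}

impl From<u8> for VariantBasicType {
    fn from(header: u8) -> Self {
        // Only the low two bits encode the basic type, so this never fails.
        match header & 0x03 {
            0 => Self::Primitive,
            1 => Self::ShortString,
            2 => Self::Object,
            _ => Self::Array, // 3 is the only remaining 2-bit value
        }
    }
}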


/// Decodes a long string from the value section of a variant.
pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> {
let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
Contributor

nit:

Suggested change
let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
let len = OffsetSizeBytes::Four.unpack_usize(value, 1, 0)?;

Comment on lines +102 to +104
let len = (first_byte_from_slice(value)? >> 2) as usize;

let string = string_from_slice(value, 1..1 + len)?;
Contributor

nit:

Suggested change
let len = (first_byte_from_slice(value)? >> 2) as usize;

let string = string_from_slice(value, 1..1 + len)?;
let len = (first_byte_from_slice(value)? >> 2) as usize;
let string = string_from_slice(value, 1..1 + len)?;

Comment on lines +25 to +26

pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(
Contributor

nit: doc comment?

Suggested change
pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(
// Extracts a subslice from a slice without risk of panic
pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(

(more below)

use crate::utils::{array_from_slice, first_byte_from_slice, string_from_slice};

#[derive(Debug, Clone, Copy)]
pub enum VariantBasicType {
Contributor

These seem pretty basic, should they be in the top-level lib.rs instead of here in decoder.rs?

}

/// Get a single offset by index
pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
Contributor

Based on previous discussions, should this be pub(crate)?

Comment on lines +363 to +374
// 1. num_elements + 1
let n_offsets = num_elements.checked_add(1).ok_or_else(overflow)?;

// 2. (num_elements + 1) * offset_size
let value_bytes = n_offsets
.checked_mul(offset_size as usize)
.ok_or_else(overflow)?;

// 3. first_offset_byte + ...
let first_value_byte = first_offset_byte
.checked_add(value_bytes)
.ok_or_else(overflow)?;
Contributor

nit: Any particular reason this is spread out, instead of one chain like Metadata::try_new did?

Suggested change
// 1. num_elements + 1
let n_offsets = num_elements.checked_add(1).ok_or_else(overflow)?;
// 2. (num_elements + 1) * offset_size
let value_bytes = n_offsets
.checked_mul(offset_size as usize)
.ok_or_else(overflow)?;
// 3. first_offset_byte + ...
let first_value_byte = first_offset_byte
.checked_add(value_bytes)
.ok_or_else(overflow)?;
// first_offset_byte + (num_elements + 1) * offset_size
let first_value_byte = num_elements
.checked_add(1)
.and_then(|n| n.checked_mul(offset_size as usize))
.and_then(|n| n.checked_add(first_offset_byte))
.ok_or_else(overflow)?;

Comment on lines +385 to +386
first_value_byte + start_field_offset_from_first_value_byte
..first_value_byte + end_field_offset_from_first_value_byte,
Contributor

Is there any risk a malicious input could cause either of these to overflow on a 32-bit arch?
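
If so, one way to guard it, in the spirit of the checked arithmetic used in try_new above (a sketch only; the parameter names follow the quoted snippet):

use arrow_schema::ArrowError;

// Compute the field's value range with checked additions so a malicious
// offset cannot overflow usize on 32-bit targets.
fn field_value_range(
    first_value_byte: usize,
    start_offset: usize,
    end_offset: usize,
) -> Result<std::ops::Range<usize>, ArrowError> {
    let overflow = || ArrowError::InvalidArgumentError("offset overflow".to_string());
    let start = first_value_byte.checked_add(start_offset).ok_or_else(overflow)?;
    let end = first_value_byte.checked_add(end_offset).ok_or_else(overflow)?;
    Ok(start..end)
}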

}
}

pub fn metadata(&self) -> Option<&'m VariantMetadata> {
Contributor

I can't find my previous github comment now... but it's still not clear to me when/how this method could be useful in practice?

}
}

impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
Contributor

nit: I'm surprised clippy didn't warn about unnecessary named lifetimes:

Suggested change
impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
impl<'v> From<&'v str> for Variant<'_, 'v> {

Author

mkarbo commented May 30, 2025

You're too fast @scovich! I am not done yet 😉

Contributor

scovich commented May 30, 2025

You're too fast @scovich! I am not done yet 😉

Hehe sorry... this is just too exciting.
