Support reading from and writing to Arrow files #415

Merged
merged 6 commits into from
Apr 30, 2025

Conversation

nalimilan
Member

This requires overriding `Arrow.DictEncoding` so that an `Arrow.DictEncoded` is created whose dictionary is a `CategoricalArray` with one entry per level. This is the only way to ensure that indexing the Arrow column gives `CategoricalValue` objects. In practice, such columns will most often be used after conversion to `CategoricalArray` via `copy`, `DataFrame`, etc.

Apparently, pandas cannot read the resulting file if the array allows for missing values, as it does not accept `missing` in the dictionary. Instead, it would need missing entries to be coded via null indices, which is less efficient.
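
To make the two ways of coding missing entries concrete, here is a minimal conceptual sketch in plain Julia. It does not call Arrow.jl or CategoricalArrays.jl; the variable names are made up for illustration, `nothing` stands in for a null index, and indices are 1-based here whereas Arrow's on-disk indices are 0-based.

```julia
# Conceptual sketch only: plain vectors, not the actual Arrow.jl layout.
data = ["a", missing, "b", "a"]

# Option 1: `missing` is itself a dictionary entry, so every index is valid.
dict_with_missing = ["a", "b", missing]
idx_all_valid = [findfirst(d -> isequal(d, v), dict_with_missing) for v in data]

# Option 2: the dictionary holds only non-missing levels and missing entries
# get a null index (modeled as `nothing`).
dict_levels_only = ["a", "b"]
idx_with_nulls = [ismissing(v) ? nothing : findfirst(==(v), dict_levels_only) for v in data]

# Both encodings decode back to the same data.
decoded1 = [dict_with_missing[i] for i in idx_all_valid]
decoded2 = [i === nothing ? missing : dict_levels_only[i] for i in idx_with_nulls]
```

The first option is the one described in this PR; the second is the form pandas would reportedly need.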

@nalimilan nalimilan requested a review from quinnj February 11, 2025 22:32
@nalimilan
Member Author

The failure on 32-bit seems to be due to an `Int32` vs `Int64` bug inside Arrow.jl. Since its CI doesn't run on 32-bit, I assume it's not supported. I'll disable tests there.

@palday

palday commented Apr 30, 2025

@nalimilan feel like adding some arch checks to disable testing and to keep the extension from defining any methods on 32-bit?

@nalimilan nalimilan closed this Apr 30, 2025
@nalimilan nalimilan reopened this Apr 30, 2025
@nalimilan
Member Author

nalimilan commented Apr 30, 2025

Done! I also bumped the minimum Julia version to 1.6, as packages fail to install on 1.0 (which is super old anyway).
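
For readers wondering what such an architecture check might look like, here is a minimal sketch based on `Sys.WORD_SIZE`. The flag name and the commented-out test file are hypothetical and are not taken from the merged commits, which may implement the check differently.

```julia
# Hypothetical guard: treat Arrow support as available only on 64-bit platforms.
const ARROW_SUPPORTED = Sys.WORD_SIZE == 64

if ARROW_SUPPORTED
    @info "64-bit platform: Arrow tests would run here"
    # include("arrow_tests.jl")  # hypothetical test file name
else
    @info "Skipping Arrow tests on $(Sys.WORD_SIZE)-bit platform"
end
```

The same condition could wrap the method definitions inside the extension, so that nothing is defined on 32-bit.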

@palday

palday commented Apr 30, 2025

I'm not an org member so I can't approve but LGTM!

nalimilan and others added 2 commits May 1, 2025 00:19
Co-authored-by: Phillip Alday <palday@users.noreply.github.com>
@nalimilan
Member Author

Thanks!

@nalimilan nalimilan merged commit a7ccfd5 into master Apr 30, 2025
14 of 16 checks passed
@nalimilan nalimilan deleted the nl/arrow branch April 30, 2025 22:50

2 participants