Make it possible to exclude pyarrow dep #276

Open: scosman wants to merge 1 commit into main

@scosman scosman commented Mar 19, 2025

Fixes #274

pyarrow has a few issues:

  • It's huge: about 100MB uncompressed
  • It's not compatible with all systems (e.g. Intel Macs)

This change allows clients to exclude the pyarrow dep if they don't need it. It's only used for parquet file validation, which isn't needed by all users.

Note: I'm not removing the dependency, just making it a run-time import. It still works as expected for all users, unless they go out of their way to manually exclude this dependency.
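
The run-time import described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `check_parquet_sketch` and the shape of the returned report are assumptions; only the `from pyarrow import ArrowInvalid, parquet` line comes from the diff in this PR.

```python
from pathlib import Path
from typing import Any, Dict


def check_parquet_sketch(file: Path) -> Dict[str, Any]:
    """Hypothetical stand-in for _check_parquet (name and report shape are assumed)."""
    # Deferred import: pyarrow is loaded only when a parquet file is checked,
    # so importing this module succeeds even when pyarrow is not installed.
    from pyarrow import ArrowInvalid, parquet

    report: Dict[str, Any] = {"is_check_passed": False}  # assumed report shape
    try:
        # Opening the file reads the parquet footer and raises ArrowInvalid
        # if the file is not valid parquet.
        parquet.ParquetFile(file)
    except ArrowInvalid as exc:
        report["message"] = f"Not a valid parquet file: {exc}"
        return report

    report["is_check_passed"] = True
    return report
```

With this layout, code paths that never validate parquet files never touch pyarrow; a missing pyarrow only surfaces as an `ImportError` when the function is actually called.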

Have you read the Contributing Guidelines? Yes

Issue: #274

This allows clients to exclude the pyarrow dep if they don't need it. Saves ~80MB and is more compatible with older systems.

Clients will still get a runtime error if they exclude it and then try to use it.

Still works as expected unless users go out of their way to manually exclude this dependency (I'm not removing the dep; you need to manually exclude it).

@orangetin (Member)

@azahed98 @artek0chumak could you review this?

@scosman (Author) commented May 5, 2025

@orangetin I'd love to get this reviewed and integrated (or hear it's not going to make it so I can maintain my fork). Should be a quick 2 min review if you know the right folks.

@orangetin (Member) left a comment

thanks for the PR! i'd like some changes before we can merge this:

  1. Move pyarrow to an optional dependency in a new group in the pyproject.toml file so it doesn't get installed by default
  2. Add the try/except wrapper (see comment below)
  3. Add a small note in the README about this

@@ -372,6 +371,8 @@ def _check_jsonl(file: Path) -> Dict[str, Any]:


def _check_parquet(file: Path) -> Dict[str, Any]:
    # in method import - this allows client to exclude the pyarrow dep if they don't need it. Saved ~80MB and more compatible with older systems.
    from pyarrow import ArrowInvalid, parquet

@orangetin (Member) commented on this line:

can you wrap this in a try/except with details on how to install pyarrow via the dependency group? something like pip install together[parquet]
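
A rough sketch of the requested wrapper might look like the following. The function name `check_parquet_guarded`, the report shape, and the error wording are hypothetical, and the `together[parquet]` extra is only the reviewer's suggested name; it would have to be defined in pyproject.toml before the hint is accurate.

```python
from pathlib import Path
from typing import Any, Dict

# Assumed extra name, taken from the reviewer's suggestion above.
INSTALL_HINT = "pip install together[parquet]"


def check_parquet_guarded(file: Path) -> Dict[str, Any]:
    """Hypothetical variant of _check_parquet with a guarded import."""
    try:
        from pyarrow import ArrowInvalid, parquet
    except ImportError as exc:
        # Re-raise with an actionable message, keeping the original error chained.
        raise ImportError(
            "pyarrow is required to validate parquet files. "
            f"Install it with: {INSTALL_HINT}"
        ) from exc

    report: Dict[str, Any] = {"is_check_passed": False}  # assumed report shape
    try:
        parquet.ParquetFile(file)
    except ArrowInvalid as exc:
        report["message"] = f"Not a valid parquet file: {exc}"
        return report

    report["is_check_passed"] = True
    return report
```

Chaining with `from exc` keeps the original import failure visible in the traceback, which helps when pyarrow is installed but fails to load on a given platform.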

Successfully merging this pull request may close these issues.

Make pyarrow dependency optional