
[jvm-packages] Support spark connect #11381

Draft · wants to merge 1 commit into base: master

Conversation

Contributor

@wbo4958 wbo4958 commented Mar 31, 2025

This PR makes the xgboost JVM package run on Spark Connect. Currently, I put the xgboost4j Python wrapper into jvm-packages/xgboost4j-spark/python, but I can move it to another place. @WeichenXu123 @trivialfis let's discuss it on this thread.

To do:

  • add read/write
  • add XGBoostRegressor
  • add XGBoostRanker

Member

@trivialfis trivialfis left a comment


I think the Python code should be part of the Python package; we can reuse a lot of code there, and it also helps unify the interfaces. In addition, Python packaging is not trivial: @hcho3 has been working on related projects, and I can feel the difficulty there. Let's not do it for another package.

Comment on lines +31 to +44
"Environment :: GPU :: NVIDIA CUDA :: 11",
"Environment :: GPU :: NVIDIA CUDA :: 11.4",
"Environment :: GPU :: NVIDIA CUDA :: 11.5",
"Environment :: GPU :: NVIDIA CUDA :: 11.6",
"Environment :: GPU :: NVIDIA CUDA :: 11.7",
"Environment :: GPU :: NVIDIA CUDA :: 11.8",
"Environment :: GPU :: NVIDIA CUDA :: 12",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.0",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.1",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.2",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.3",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.4",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.5",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.6",
Member


These can be removed; we dropped support for all previous CUDA versions in 3.0.
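For illustration, a sketch of what the trimmed `classifiers` block in `pyproject.toml` could look like, assuming only the CUDA 12.x classifiers remain after 3.0 drops the earlier versions (the exact set to keep is up to the maintainers):

```toml
classifiers = [
    # CUDA 11.x entries removed: 3.0 dropped support for those versions.
    "Environment :: GPU :: NVIDIA CUDA :: 12",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.0",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.1",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.2",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.3",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.4",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.5",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.6",
]
```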

from .params import XGBoostParams


class XGBoostClassifier(_JavaProbabilisticClassifier["XGBoostClassificationModel"], XGBoostParams):


Will this be a new class that needs to be used for both normal and Spark Connect invocation? Could we instead modify the _fit method in SparkXGBClassifier to use the try_remote_fit decorator, or would that be a big change?

Contributor Author


Thanks @jjayadeep06 for your reply. When Connect gets involved, things become complicated. We could make the existing xgboost-pyspark package support Spark Connect by changing the RDD operations to DataFrame operations, without using any try_remote_xxxx decorators; yes, we have a plan to do that.

This PR, however, makes the xgboost JVM package support Connect by introducing a lightweight Python wrapper. If we add the Python wrapper over the xgboost JVM package into the existing xgboost Python/PySpark package, it raises the question of which backend (the xgboost JVM package or the Python package) should be chosen when running XGBoost over Connect.

3 participants