
[jvm-packages] Support spark connect #11381

Draft · wants to merge 1 commit into base: master

Conversation

Contributor

@wbo4958 wbo4958 commented Mar 31, 2025

This PR makes the xgboost JVM package run on Spark Connect. Currently, I put the xgboost4j Python wrapper into jvm-packages/xgboost4j-spark/python, but I can move it to another place. @WeichenXu123 @trivialfis let's discuss it on this thread.

To do:

  • add read/write
  • add XGBoostRegressor
  • add XGBoostRanker

Member

@trivialfis trivialfis left a comment


I think the Python code should be part of the Python package; we can reuse a lot of code there, and it also helps unify the interfaces. In addition, Python packaging is not trivial: @hcho3 has been working on related projects, and I can feel the difficulty there. Let's not do it for another package.

Comment on lines +31 to +44
"Environment :: GPU :: NVIDIA CUDA :: 11",
"Environment :: GPU :: NVIDIA CUDA :: 11.4",
"Environment :: GPU :: NVIDIA CUDA :: 11.5",
"Environment :: GPU :: NVIDIA CUDA :: 11.6",
"Environment :: GPU :: NVIDIA CUDA :: 11.7",
"Environment :: GPU :: NVIDIA CUDA :: 11.8",
"Environment :: GPU :: NVIDIA CUDA :: 12",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.0",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.1",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.2",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.3",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.4",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.5",
"Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.6",
Member


These can be removed; we dropped support for all previous CUDA versions in 3.0.
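For illustration, a sketch of what the trimmed `classifiers` block in `pyproject.toml` could look like, assuming only the CUDA 12.x classifiers remain after 3.0 drops the earlier versions (the exact set to keep is up to the maintainers):

```toml
classifiers = [
    # CUDA 11.x entries removed: 3.0 dropped support for those versions.
    "Environment :: GPU :: NVIDIA CUDA :: 12",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.0",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.1",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.2",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.3",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.4",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.5",
    "Environment :: GPU :: NVIDIA CUDA :: 12 :: 12.6",
]
```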

from .params import XGBoostParams


class XGBoostClassifier(_JavaProbabilisticClassifier["XGBoostClassificationModel"], XGBoostParams):


Will this be a new class that needs to be used for both normal and Spark Connect invocation? Could we instead modify the _fit method in SparkXGBClassifier to use the try_remote_fit decorator, or would that be a big change?

Contributor Author


Thanks @jjayadeep06 for your reply. When Connect gets involved, things become complicated. We could make the existing xgboost-pyspark package support Spark Connect by changing the RDD operations to DataFrame operations, without using any try_remote_xxxx decorators; yes, we have a plan to do that.

This PR, however, makes the xgboost JVM package support Connect by introducing a lightweight Python wrapper. If we add the Python wrapper over the xgboost JVM package into the existing xgboost Python/PySpark package, it raises the question of which backend (the xgboost JVM package or the Python package) should be chosen when running XGBoost over Connect.

3 participants