Add Hugging Face Chat Completion support to Inference Plugin #127254

Conversation

Contributor

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic commented Apr 23, 2025

This change extends the existing Hugging Face provider integration so that completion (both streaming and non-streaming) and chat_completion (streaming only) tasks can be executed through the Inference API.
Examples of requests (RQ) and responses (RS) from local testing:

Non-streaming

Create Completion Endpoint:

RQ:
curl --location --request PUT 'localhost:9200/_inference/completion/hugging-face-completion' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic %auth_token%' \
--data '{
    "service": "hugging_face",
    "service_settings": {
        "api_key": "%hf_token%",
        "model_id": "tgi",
        "max_input_tokens" : 128,
        "url": "%hf_url%"
    }
}'

RS:
{
    "inference_id": "hugging-face-completion",
    "task_type": "completion",
    "service": "hugging_face",
    "service_settings": {
        "model_id": "tgi",
        "url": "%hf_url%",
        "max_input_tokens": 128,
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}

Perform Non-Streaming Completion:

RQ:
curl --location 'localhost:9200/_inference/completion/hugging-face-completion' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic %auth_token%' \
--data '{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}'

RS:
{
    "completion": [
        {
            "result": "This instruction describes an imagery, specifically a scene that creates a visual analogy between the sky and a television screen. The analogy implies that the sky is empty, dull, and possibly grim, as a television set with no signal would be bare and desolate.\n\n\nFor a rephrased instruction that maintains the same core meaning but changes the details, one could say:\n\n\n\"The firmament above the harbor mirrored the static hum of an unlit screen.\"\n\n\nThis sentence still captures the essence of an uneventful sky while using different words and imagery. It evokes a sense of abandonment and stillness by comparing it to the static seen on an unused television."
        }
    ]
}
Streaming

Create Chat Completion Endpoint:

RQ:
curl --location --request PUT 'localhost:9200/_inference/chat_completion/hugging-face-chat-completion' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic %auth_token%' \
--data '{
    "service": "hugging_face",
    "service_settings": {
        "api_key": "%hf_token%",
        "model_id": "tgi",
        "max_input_tokens" : 128,
        "url": "%hf_url%"
    }
}'

RS:
{
    "inference_id": "hugging-face-completion",
    "task_type": "completion",
    "service": "hugging_face",
    "service_settings": {
        "model_id": "tgi",
        "url": "%hf_url%",
        "max_input_tokens": 128,
        "rate_limit": {
            "requests_per_minute": 3000
        }
    }
}

Perform Streaming Completion:

RQ:
curl --location 'localhost:9200/_inference/completion/hugging-face-completion/_stream' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic %auth_token%' \
--data '{
    "input": "The sky above the port was the color of television tuned to a dead channel."
}'

Perform Streaming Chat Completion:

RQ:
curl --location 'localhost:9200/_inference/chat_completion/hugging-face-chat-completion/_stream' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic %auth_token%' \
--data '{
    "model": "tgi",
    "messages": [
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "max_completion_tokens": 150
}'

Tested on models:
https://huggingface.co/Qwen/QwQ-32B
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

  • Have you signed the contributor license agreement? - YES
  • Have you followed the contributor guidelines? - YES
  • If submitting code, have you built your formula locally prior to submission with gradle check? - YES
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed. - YES
  • If submitting code, have you checked that your submission is for an OS and architecture that we support? - YES
  • If you are submitting this code for a class then read our policy for that. - YES

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v9.1.0 labels Apr 23, 2025
@jonathan-buttner jonathan-buttner added :ml Machine learning Team:ML Meta label for the ML team >enhancement v8.19.0 labels Apr 23, 2025
@@ -361,6 +362,7 @@ public void loadExtensions(ExtensionLoader loader) {
public List<InferenceServiceExtension.Factory> getInferenceServiceFactories() {
return List.of(
context -> new HuggingFaceElserService(httpFactory.get(), serviceComponents.get()),
context -> new HuggingFaceChatCompletionService(httpFactory.get(), serviceComponents.get()),
Contributor

I haven't looked through the entire PR but just wanted to check. We should try to add the chat completion functionality to the existing HuggingFaceService logic.

For example the OpenAI service supports many task types: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/openai/OpenAiService.java#L175-L197

Contributor Author

The change is done. The completion logic is now in a single HuggingFaceService class.

import java.util.Objects;
import java.util.function.Supplier;

public class HuggingFaceCompletionRequestManager extends HuggingFaceRequestManager {
Contributor

We're trying to move away from the request manager pattern because it adds duplicate code. Could you look into following the pattern we started here (we haven't refactored all the services yet but if it's possible to do for hugging face it'd be great if we could do it now)?

#124144

One option would be to leave the other hugging face request managers as they are (if possible, it may not be though) and then use one of the generic request managers like shown in the PR above for this new functionality.

Contributor Author

Sure thing. I will adopt the approach from the shared PR. Thanks Jonathan!

Contributor Author

I made the change that moves the chat_completion and completion tasks away from the request manager pattern.

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic marked this pull request as ready for review April 29, 2025 20:15
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jonathan-buttner jonathan-buttner added the auto-backport Automatically create backport pull requests when merged label Apr 30, 2025
Contributor

@jonathan-buttner jonathan-buttner left a comment

Looking good, left a few comments

I was testing streaming chat completions and ran into an issue with how Hugging Face returns an error for a request that includes tools:

I provisioned an HF inference endpoint for the model Qwen/QwQ-32B

This is the endpoint name: jon-qwq-32b-qkm

PUT _inference/chat_completion/test-chat
{
    "service": "hugging_face",
    "service_settings": {
        "api_key": "<api_key>",
        "max_input_tokens" : 128,
        "model_id": "tgi",
        "url": "<url>"
    }
}

The following request fails:

POST _inference/chat_completion/test-chat/_stream
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's the price of a scarf?"
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_price",
                "description": "Get the current price of a item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "123"
                        }
                    }
                }
            }
        }
    ],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_price"
        }
    }
}

Response

event: error
data: {"error":{"code":"bad_request","message":"Required [id, choices, model, object]","type":"illegal_argument_exception"}}

The logs show the underlying issue is that hugging face returns:

{"error":{"message":"Input validation error: cannot compile regex from schema: Unsupported JSON Schema structure {\"id\":\"123\"} \nMake sure it is valid to the JSON Schema specification and check if it's supported by Outlines.\nIf it should be supported, please open an issue.","http_status_code":422}}

{"error": {"message" ...}, ...}

Our OpenAI code requires there to be a type field in the error object. To fix this we can do the following:

We'll create a new response handler similar to what we're doing here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceServiceUnifiedChatCompletionResponseHandler.java

The new response handler can extend OpenAiUnifiedChatCompletionResponseHandler

We'll do the same thing as in the file I linked, but we'll pass in a different lambda, like we're doing here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceServiceUnifiedChatCompletionResponseHandler.java#L37

Our lambda will use an error parser that can extract the fields from the error I mentioned above. I think my suggestion would be to create a new class similar to ErrorMessageResponseEntity which can extract the message field and maybe the http_status_code.

We'll then use that code like what's being done here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceServiceUnifiedChatCompletionResponseHandler.java#L62

Finally we'll use our new response handler here:

private static final ResponseHandler UNIFIED_CHAT_COMPLETION_HANDLER = new OpenAiUnifiedChatCompletionResponseHandler(

It'll look something like this:

    private static final ResponseHandler UNIFIED_CHAT_COMPLETION_HANDLER = new HuggingFaceChatCompletionResponseHandler(
        "hugging face chat completion",
        OpenAiChatCompletionResponseEntity::fromResponse
    );

We're actually trying to move away from including "unified" in the names but we haven't gotten around to cleaning up the rest of the code base yet.
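
To make the shape of that parsing concrete, here is a minimal, hypothetical sketch of the error-extraction step; the class name, the fromParsedBody helper, and the Map-based input are illustrative assumptions, not the actual Elasticsearch implementation:

    import java.util.Map;
    import java.util.Optional;

    // Hypothetical parser for the Hugging Face error shape shown above:
    // {"error":{"message":"...","http_status_code":422}}
    final class HuggingFaceErrorResponse {
        private final String message;
        private final Integer httpStatusCode;

        HuggingFaceErrorResponse(String message, Integer httpStatusCode) {
            this.message = message;
            this.httpStatusCode = httpStatusCode;
        }

        // `body` is the response body already parsed into a Map; how it gets parsed is out of scope here.
        static Optional<HuggingFaceErrorResponse> fromParsedBody(Map<String, Object> body) {
            if (body.get("error") instanceof Map<?, ?> error && error.get("message") instanceof String message) {
                Integer status = null;
                if (error.get("http_status_code") instanceof Number statusCode) {
                    status = statusCode.intValue();
                }
                return Optional.of(new HuggingFaceErrorResponse(message, status));
            }
            return Optional.empty(); // not an error payload we recognize
        }

        String message() {
            return message;
        }

        Integer httpStatusCode() {
            return httpStatusCode;
        }
    }

The new handler extending OpenAiUnifiedChatCompletionResponseHandler would then surface message (and, when present, http_status_code) instead of requiring the OpenAI-style type field.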

private static final LazyInitializable<InferenceServiceConfiguration, RuntimeException> configuration = new LazyInitializable<>(
() -> {
var configurationMap = new HashMap<String, SettingsConfiguration>();

configurationMap.put(
URL,
new SettingsConfiguration.Builder(supportedTaskTypes).setDefaultValue("https://api.openai.com/v1/embeddings")
new SettingsConfiguration.Builder(SUPPORTED_TASK_TYPES).setDefaultValue("https://api.openai.com/v1/embeddings")
Contributor

Oops, looks like we have an existing bug here (unrelated to your changes). Can you remove the setDefaultValue? It shouldn't be pointing to OpenAI 😅

Contributor Author

I initially assumed it was there for some internal configuration and didn't want to introduce any risk by changing it. Removed.

unifiedChatInput -> new HuggingFaceUnifiedChatCompletionRequest(unifiedChatInput, overriddenModel),
UnifiedChatInput.class
);
var errorMessage = format(FAILED_TO_SEND_REQUEST_ERROR_MESSAGE, "CHAT COMPLETION", model.getInferenceEntityId());
Contributor

nit: How about we move this into a function something like:

private static String errorMessage(String requestDescription, String inferenceId) {
  return format("Failed to send Hugging Face %s request from inference entity id [%s]", requestDescription, inferenceId);
}

It might be a little easier to see how the string is being formatted if the raw string is included in the format call.

Contributor Author

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic May 2, 2025

Suggestion:
Maybe we should use a TaskType taskType parameter instead of String requestDescription? That would restrict the values to the defined set of tasks and rule out inconsistent formatting; in the current implementation the values are "text embeddings" and "ELSER", which is a bit messy. That approach would change "ELSER" to sparse_embedding and make the other values lowercase as well (see the sketch below).

P.S. Having "ELSER" and "sparse embedding" used interchangeably might also be worth unifying, to keep the vocabulary consistent.
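
For illustration, a rough sketch of the suggested signature, assuming format behaves as in the snippet above and that TaskType renders as the lowercase task name when formatted:

    private static String errorMessage(TaskType taskType, String inferenceId) {
        // taskType is assumed to format as e.g. "completion", "chat_completion", "sparse_embedding"
        return format("Failed to send Hugging Face %s request from inference entity id [%s]", taskType, inferenceId);
    }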

Contributor Author

Added the version described above. Please let me know if you'd prefer to stick with the version you proposed initially.

"text embeddings",
model.getInferenceEntityId()
);
var errorMessage = format(FAILED_TO_SEND_REQUEST_ERROR_MESSAGE, "text embeddings", model.getInferenceEntityId());
Contributor

nit: Same comment as above suggesting making this a function.

Contributor Author

Did the change described in my comment above.


@Override
public TransportVersion getMinimalSupportedVersion() {
return TransportVersions.V_8_14_0;
Contributor

We'll need to create a new version number instead of using an old one here. Here's an example: https://github.com/elastic/elasticsearch/pull/122218/files#diff-85e782e9e33a0f8ca8e99b41c17f9d04e3a7981d435abf44a3aa5d954a47cd8f

Basically we need to create a version for the 8.x branch, in the linked PR that was ML_INFERENCE_DEEPSEEK_8_19 = def(8_841_0_09);

And we need to create a version for the 9.x branch: ML_INFERENCE_DEEPSEEK = def(9_029_00_0);

When creating the variables we'll want to increment the value so it is the newest for both 8.x and 9.x. So as of writing this the latest value is here: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/TransportVersions.java#L231

So our value for the 9.x branch should be def(9_0656_0_00);, for the 8.x branch it should be: def(8_842_0_20);

The value we put here on line 181 should be the 9.x version. When we backport this PR to 8.x branch we'll switch it to the 8.x variable name. Here's an example of the backport for deepseek: https://github.com/elastic/elasticsearch/pull/124796/files#diff-85e782e9e33a0f8ca8e99b41c17f9d04e3a7981d435abf44a3aa5d954a47cd8f

This will change as other people in the organization add their own transport versions, which will cause merge conflicts, so as you update from the main branch we'll just need to keep bumping the value until we merge the PR.

Contributor Author

for the 8.x branch it should be: def(8_842_0_20);

842 would be a new server version part and 20 a new patch part. According to the documentation in TransportVersions.java, only one of them should be incremented: either the server part or the patch part, not both.

To determine the id of the next TransportVersion constant, do the following:

  • Use the same major version, unless bumping majors
  • Bump the server version part by 1, unless creating a patch version
  • Leave the subsidiary part as 0
  • Bump the patch part if creating a patch version

The last 23 8.x versions are patch updates, and the 9.x versions are all server part updates. I would assume that the new 8.x version should be another patch update, and the new 9.x version another server part update.

Added the versions.
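
For illustration only, the two constants might end up looking roughly like this; the names are placeholders and the numeric ids (derived from the DeepSeek examples above) would need to be re-bumped to the latest values in TransportVersions.java at merge time:

    // main (9.x): bump the server part of the latest 9.x id, leave subsidiary/patch at 0 (placeholder id)
    public static final TransportVersion ML_INFERENCE_HUGGING_FACE_CHAT_COMPLETION_ADDED = def(9_030_0_00);

    // 8.x backport: bump only the patch part of the latest 8.x id (placeholder id)
    public static final TransportVersion ML_INFERENCE_HUGGING_FACE_CHAT_COMPLETION_ADDED_8_19 = def(8_841_0_10);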

Contributor

Ah you're right, good catch 👍

this.uri = createUri(in.readString());
this.maxInputTokens = in.readOptionalVInt();

if (in.getTransportVersion().onOrAfter(TransportVersions.V_8_15_0)) {
Contributor

We only need to do the onOrAfter() call if we're adding a new field to an existing setting that was introduced in a previous version. Since this entire file is new we can remove these if blocks as this code is guaranteed to be introduced after v8.15.0
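
For example, based on the snippet above, the read path can simply become the following, with any remaining fields read the same way, unconditionally:

    // No transport-version guard is needed: these settings are new, so every node
    // that can deserialize them already understands all of their fields.
    this.uri = createUri(in.readString());
    this.maxInputTokens = in.readOptionalVInt();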

Contributor Author

Removed redundant check.


String modelId = extractOptionalString(map, MODEL_ID, ModelConfigurations.SERVICE_SETTINGS, validationException);

var uri = extractUri(map, URL, validationException);
Contributor

Just a note: it seems like for both dedicated inference endpoints and serverless, Hugging Face requires v1/chat/completions to be in the path. We'll want to make sure we include this in the documentation: users need to include that segment or they'll get an error.

Contributor Author

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic May 2, 2025

Sure thing. I'll include this in the documentation, along with the fact that the model must support the OpenAI interface; apparently not all of them do.

out.writeString(uri.toString());
out.writeOptionalVInt(maxInputTokens);

if (out.getTransportVersion().onOrAfter(TransportVersions.V_8_15_0)) {
Contributor

Let's remove the if-block

Contributor Author

Removed.

@@ -112,14 +115,14 @@ public void testExecute_ReturnsSuccessfulResponse_ForElserAction() throws IOExce
);

assertThat(webServer.requests(), hasSize(1));
assertNull(webServer.requests().get(0).getUri().getQuery());
assertNull(webServer.requests().getFirst().getUri().getQuery());
Contributor

I would actually leave these as get(0), because when we backport this, the 8.19 branch uses an older JDK where this method (getFirst) won't exist, I believe.

Contributor Author

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic May 2, 2025

Yes, that is true. It wouldn't exist. Reverted the changes.

import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.is;

public class HuggingFaceChatCompletionServiceSettingsTests extends AbstractWireSerializingTestCase<
Contributor

Let's extend AbstractBWCWireSerializationTestCase instead (it helps with future testing when we add new fields to the serialization).
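
A rough sketch of what that typically adds, assuming the base class exposes a mutateInstanceForVersion hook (the exact signature may differ):

    @Override
    protected HuggingFaceChatCompletionServiceSettings mutateInstanceForVersion(
        HuggingFaceChatCompletionServiceSettings instance,
        TransportVersion version
    ) {
        // Nothing is version-dependent yet; once a field is added in a later transport
        // version, this method would strip it when testing against older versions.
        return instance;
    }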

Contributor Author

Done.

serviceSettings.toXContent(builder, null);
String xContentResult = Strings.toString(builder);

assertThat(xContentResult, is("""
Contributor

You can leave this as is but for future tests let's use this utility class so we can create the expected json in a more readable fashion (I realize most of our tests use the method you have here).

        var expected = XContentHelper.stripWhitespace("""
            {
                "secret_parameters": {
                    "test_key": "test_value"
                }
            }
            """);

        assertThat(xContentResult, is(expected));

Contributor Author

Done.


/**
* This class is responsible for managing the Hugging Face inference service.
* It handles the creation of models, chunked inference, and unified completion inference.
Member

nit: The class also handles non-chunked inference which should be included in the javadoc.

Contributor Author

I rephrased it so it is more specific. Thanks.

@@ -167,13 +220,15 @@ public static InferenceServiceConfiguration get() {
return configuration.getOrCompute();
}

private Configuration() {}
Member

Why is this line needed?

Contributor Author

In short, to prevent this class from being instantiated.

Since this class only has static members, there is no reason to allow instantiating it. To prevent that, we can hide the default constructor that every object gets by declaring a private one.
It is optional, so if you want, I can remove it.
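
A generic illustration of the idiom (unrelated to the actual contents of the Configuration class):

    // A holder with only static members: the private constructor hides the implicit
    // public no-arg constructor, so `new Defaults()` does not compile outside the class.
    final class Defaults {
        static final int MAX_INPUT_TOKENS = 128;

        private Defaults() {}
    }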

public static final String NAME = "hugging_face_completion_service_settings";
// At the time of writing HuggingFace hasn't posted the default rate limit for inference endpoints so the value here is only a guess
// 3000 requests per minute
private static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(3000);
Member

How did you arrive at the 3000 default? Is there somewhere that confirms that 3000 is a viable number (even if it is different than a recommended default from Hugging Face)?

Contributor Author

To be honest, this is taken as-is from the Hugging Face service settings for the other tasks (sparse and text embeddings). I haven't seen any publication from Hugging Face stating a different rate limit, so I decided to go with what we have for the other operations.
The original was committed by @jonathan-buttner, so please let me know if this is something to reconsider.

Contributor

Yeah, I haven't seen a rate limit in Hugging Face's docs, and it'll probably depend on how the model is deployed (via serverless or a dedicated endpoint). I'm OK with 3000; we can document that users should use an appropriate value for their environment.

public static HuggingFaceChatCompletionServiceSettings fromMap(Map<String, Object> map, ConfigurationParseContext context) {
ValidationException validationException = new ValidationException();

String modelId = extractOptionalString(map, MODEL_ID, ModelConfigurations.SERVICE_SETTINGS, validationException);
Member

What is model ID used for? My understanding for other task types for hugging face is that we take in the URI instead of model ID. What does it mean if both URI and model ID are set?

Contributor Author

For the text generation task on the Hugging Face side, the model ID is sent to the Hugging Face API as part of the OpenAI-like request schema. While it appears to be optional for dedicated endpoints, it is mandatory for serverless HF Inference Endpoints, so in order to receive a successful response from those we need to provide it as part of the request.

For the other tasks we integrate with, it is not sent because it is not defined in the request schema on the HF side.
The URI is always mandatory and is provided by the customer because HF doesn't have a default URL.

What is model ID used for?

It is defined by HF as a field of the OpenAI-like request sent to HF; we send it as part of the payload.

What does it mean if both URI and model ID are set?

When both the model ID and the URI are set, it means we know the URI to call and can send a request with the field HF expects to be present if it is a serverless endpoint.

@@ -24,25 +24,35 @@

import static org.elasticsearch.xpack.inference.external.request.RequestUtils.createAuthBearerHeader;

public class HuggingFaceInferenceRequest implements Request {
/**
* This class is responsible for creating a request to the Hugging Face API for embeddings.
Member

nit: The class doesn't really create the request so much as it is the request. Maybe you can say it represents the request?

Contributor Author

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic May 2, 2025

Sadly, I tend to disagree. Although it is named HuggingFaceInferenceRequest and implements the Request interface, looking at this class's behavior we can see that it literally creates the request in the createHttpRequest method. The object itself can hardly be called a request, because it is not what is sent to Hugging Face; it just creates the actual request that goes out to HF. Though I agree that it represents the request in some way.
@jonathan-buttner What do you think?

Contributor

Yeah our naming convention here is confusing.

How about we change the class doc to:

This class is responsible for creating a Hugging Face embeddings HTTP request.

Contributor Author

Did the change.

import static org.elasticsearch.xpack.inference.external.request.RequestUtils.createAuthBearerHeader;

/**
* This class is responsible for creating a request to the Hugging Face API for chat completions.
Member

Same comment as the other request class. I think this more represents a request than creates it.

Contributor Author

Replied above. Would like to hear Jonathan's take on it.

Contributor Author

Change is done.

builder.startObject();
unifiedRequestEntity.toXContent(builder, params);

builder.field(MODEL_FIELD, model.getServiceSettings().modelId());
Member

This is related to the question from the Model class, but if the model ID is optional, would it be better to use the URI here instead? Or maybe we want to identify it with a combination of both?

Contributor Author

@Jan-Kazlouski-elastic Jan-Kazlouski-elastic May 2, 2025

The URI is mandatory.
The model ID is optional, depending on the endpoint type (dedicated/serverless).
Identification can be performed by the URI alone for a dedicated endpoint, or by the combination of both for a serverless endpoint.
Using the URI alone would cause an error response from a serverless endpoint.

Member

Got it, thanks for clarifying. It seems like in the updated code we include the model ID if it is provided, which makes sense, but we do not include the URI as part of toXContent. Is there a reason we are not including the URI if it is mandatory in all cases?

Member

I spoke with Jonathan and he explained that this is serializing the request body, so it makes sense that we aren't serializing the URI here. The most recent updates look good to me.
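
For illustration, the request-body serialization being discussed might look roughly like this, based on the diff lines above; the null check and the method framing are assumptions:

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject();
        unifiedRequestEntity.toXContent(builder, params);
        // The model id goes into the request body because serverless HF endpoints require it;
        // the URI only addresses the endpoint and is never part of the body.
        if (model.getServiceSettings().modelId() != null) {
            builder.field(MODEL_FIELD, model.getServiceSettings().modelId());
        }
        builder.endObject();
        return builder;
    }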

var initialInputs = initialRequestAsMap.get("inputs");
assertThat(initialInputs, is(List.of("123")));

}
}

public void testExecute_ReturnsSuccessfulResponse_ForChatCompletionAction() throws IOException {
Member

Seems like this test and the next one have very similar processes. Can we reduce code duplication by moving the shared code into a shared function that takes in the values that are generated differently?

Contributor Author

Done. They are quite different so I extracted most of the repeated code into methods.

public class HuggingFaceUnifiedChatCompletionRequestTests extends ESTestCase {

public void testCreateRequest_WithStreaming() throws IOException {
var request = createRequest("url", "secret", "abcd", "model", true);
Member

For random inputs you can use randomAlphaOfLength(x);
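
For example, assuming the existing createRequest(url, secret, input, model, stream) test helper accepts arbitrary strings for these parameters:

    // randomAlphaOfLength(n) is an ESTestCase helper returning a random alphabetic string of length n
    var request = createRequest(
        randomAlphaOfLength(10), // url
        randomAlphaOfLength(10), // secret
        randomAlphaOfLength(5),  // input
        randomAlphaOfLength(8),  // model
        true                     // streaming enabled
    );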

Contributor Author

Done.

…chat-completion-integration

# Conflicts:
#	x-pack/plugin/inference/qa/inference-service-tests/src/javaRestTest/java/org/elasticsearch/xpack/inference/InferenceGetServicesIT.java
@jonathan-buttner
Contributor

jonathan-buttner commented May 9, 2025

@Jan-Kazlouski-elastic I was doing some testing with the Qwen model and noticed that the function name field was coming back as null, causing our parsing logic to fail. I put up a PR to fix that here: #127976

Would you mind cherrypicking that commit into this PR?

Contributor

@jonathan-buttner jonathan-buttner left a comment

Changes look great! Thanks for addressing my concerns. I left one note about the function name change; you're welcome to pull that in if you like, or I can merge that PR later. Can you merge in the latest main to resolve the transport version issue? After that we're good to merge 👍

Jan-Kazlouski-elastic and others added 2 commits May 11, 2025 16:15
…chat-completion-integration

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
@Jan-Kazlouski-elastic
Contributor Author

@jonathan-buttner
I've

  • cherry-picked your commit for the null function name
  • resolved the conflict in TransportVersions
  • merged the latest changes from the main branch
  • executed the "gradle check" task

…icsearch into feature/hugging-face-chat-completion-integration

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
…icsearch into feature/hugging-face-chat-completion-integration
…icsearch into feature/hugging-face-chat-completion-integration
…icsearch into feature/hugging-face-chat-completion-integration
…icsearch into feature/hugging-face-chat-completion-integration
Contributor

@jonathan-buttner jonathan-buttner left a comment

Great work!

@jonathan-buttner jonathan-buttner enabled auto-merge (squash) May 19, 2025 13:58
…icsearch into feature/hugging-face-chat-completion-integration

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
auto-merge was automatically disabled May 19, 2025 14:30

Head branch was pushed to by a user without write access

@jonathan-buttner jonathan-buttner enabled auto-merge (squash) May 19, 2025 15:05
@jonathan-buttner jonathan-buttner merged commit d1ad917 into elastic:main May 19, 2025
19 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127254

@jonathan-buttner
Contributor

💚 All backports created successfully

Status Branch Result
8.19

Questions ?

Please refer to the Backport tool documentation

jonathan-buttner pushed a commit to jonathan-buttner/elasticsearch that referenced this pull request May 19, 2025
…#127254)

* Add Hugging Face Chat Completion support to Inference Plugin

* Add support for streaming chat completion task for HuggingFace

* [CI] Auto commit changes from spotless

* Add support for non-streaming completion task for HuggingFace

* Remove RequestManager for HF Chat Completion Task

* Refactored Hugging Face Completion Service Settings, removed Request Manager, added Unit Tests

* Refactored Hugging Face Action Creator, added Unit Tests

* Add Hugging Face Server Test

* [CI] Auto commit changes from spotless

* Removed parameters from media type for Chat Completion Request and unit tests

* Removed OpenAI default URL in HuggingFaceService's configuration, fixed formatting in InferenceGetServicesIT

* Refactor error message handling in HuggingFaceActionCreator and HuggingFaceService

* Update minimal supported version and add Hugging Face transport version constants

* Made modelId field optional in HuggingFaceChatCompletionModel, updated unit tests

* Removed max input tokens field from HuggingFaceChatCompletionServiceSettings, fixed unit tests

* Removed if statement checking TransportVersion for HuggingFaceChatCompletionServiceSettings constructor with StreamInput param

* Removed getFirst() method calls for backport compatibility

* Made HuggingFaceChatCompletionServiceSettingsTests extend AbstractBWCWireSerializationTestCase for future serialization testing

* Refactored tests to use stripWhitespace method for readability

* Refactored javadoc for HuggingFaceService

* Renamed HF chat completion TransportVersion constant names

* Added random string generation in unit test

* Refactored javadocs for HuggingFace requests

* Refactored tests to reduce duplication

* Added changelog file

* Add HuggingFaceChatCompletionResponseHandler and associated tests

* Refactor error handling in HuggingFaceServiceTests to standardize error response codes and types

* Refactor HuggingFace error handling to improve response structure and add streaming support

* Allowing null function name for hugging face models

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
(cherry picked from commit d1ad917)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
@jonathan-buttner
Contributor

Backport is here: #128152

elasticsearchmachine pushed a commit that referenced this pull request May 19, 2025
#128152)

* Add Hugging Face Chat Completion support to Inference Plugin

* Add support for streaming chat completion task for HuggingFace

* [CI] Auto commit changes from spotless

* Add support for non-streaming completion task for HuggingFace

* Remove RequestManager for HF Chat Completion Task

* Refactored Hugging Face Completion Service Settings, removed Request Manager, added Unit Tests

* Refactored Hugging Face Action Creator, added Unit Tests

* Add Hugging Face Server Test

* [CI] Auto commit changes from spotless

* Removed parameters from media type for Chat Completion Request and unit tests

* Removed OpenAI default URL in HuggingFaceService's configuration, fixed formatting in InferenceGetServicesIT

* Refactor error message handling in HuggingFaceActionCreator and HuggingFaceService

* Update minimal supported version and add Hugging Face transport version constants

* Made modelId field optional in HuggingFaceChatCompletionModel, updated unit tests

* Removed max input tokens field from HuggingFaceChatCompletionServiceSettings, fixed unit tests

* Removed if statement checking TransportVersion for HuggingFaceChatCompletionServiceSettings constructor with StreamInput param

* Removed getFirst() method calls for backport compatibility

* Made HuggingFaceChatCompletionServiceSettingsTests extend AbstractBWCWireSerializationTestCase for future serialization testing

* Refactored tests to use stripWhitespace method for readability

* Refactored javadoc for HuggingFaceService

* Renamed HF chat completion TransportVersion constant names

* Added random string generation in unit test

* Refactored javadocs for HuggingFace requests

* Refactored tests to reduce duplication

* Added changelog file

* Add HuggingFaceChatCompletionResponseHandler and associated tests

* Refactor error handling in HuggingFaceServiceTests to standardize error response codes and types

* Refactor HuggingFace error handling to improve response structure and add streaming support

* Allowing null function name for hugging face models

---------

Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
Co-authored-by: Jonathan Buttner <jonathan.buttner@elastic.co>
(cherry picked from commit d1ad917)

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java

Co-authored-by: Jan-Kazlouski-elastic <jan.kazlouski@elastic.co>