Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

free provision for embedding providers #2026

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

zhengwj533
Copy link
Contributor

@zhengwj533 zhengwj533 commented Mar 6, 2025

Question:

Unable to provide a new embedding provider. To adapt to the new embedding provider, multiple source code files need to be modified.

Hope:

Users can directly provide an embedding provider and specify in the configuration file to use the newly provided provider

resolvent:

  1. Modify the check logic of the embedding config and check if the configuration file is working properly based on the user-defined provider’s supported_providers() function.
  2. Modify the definition of R2RPProviders, relax the type restrictions of embedding and LLM, that is, use parent class types to avoid synchronously modifying R2RPProviders every time a new provider is added.
  3. Modifying the factory also relaxes the type restrictions.
  4. A sample provider will be like this.
import logging
import os
from typing import Any

from openai import AsyncOpenAI, AuthenticationError, OpenAI

from core.base import (
    ChunkSearchResult,
    EmbeddingConfig,
    EmbeddingProvider,
)

logger = logging.getLogger()


class DashscopeEmbeddingProvider(EmbeddingProvider):

    def __init__(self, config: EmbeddingConfig):
        ...

    @classmethod
    def supported_providers(self) -> list[str]:
        return ["dashscope"]

    def _get_embedding_kwargs(self, **kwargs):
        embedding_kwargs = {
            "model": self.base_model,
            "dimensions": self.base_dimension,
        }
        embedding_kwargs.update(kwargs)
        return embedding_kwargs

    async def _execute_task(self, task: dict[str, Any]) -> list[list[float]]:
        # fix: request one 'text' in base provider
        texts = task.get("text", task.get("texts"))
        one_text = isinstance(texts, str)
        if one_text:
            texts = [texts]

        kwargs = self._get_embedding_kwargs(**task.get("kwargs", {}))

        try:
            response = await self.aclient.embeddings.create(
                input=texts,
                **kwargs,
            )
            embeddings = [data.embedding for data in response.data]
            return embeddings[0] if one_text else embeddings
...

Important

Enable user-defined embedding providers by dynamically discovering subclasses and relaxing type restrictions in embedding.py, abstractions.py, and factory.py.

  • Behavior:
    • supported_providers() in embedding.py now dynamically discovers subclasses of EmbeddingProvider to include user-defined providers.
    • R2RProviders in abstractions.py and create_providers() in factory.py now use base class types for embedding and llm to allow custom providers.
  • Code Structure:
    • Added get_all_subclasses() in embedding.py to recursively find all subclasses of a given class.
    • Removed specific provider types in R2RProviders and create_providers() in factory.py, using EmbeddingProvider and CompletionProvider instead.
  • Misc:
    • Added a placeholder supported_providers() method in EmbeddingProvider class.

This description was created by Ellipsis for befec6d. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to befec6d in 2 minutes and 26 seconds

More details
  • Looked at 129 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 drafted comments based on config settings.
1. py/core/base/providers/embedding.py:90
  • Draft comment:
    Use 'cls' instead of 'self' for the class method parameter in supported_providers to follow Python convention.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
    The comment appears to be incorrect. The method is already a @classmethod and doesn't have any parameters to change. The comment suggests fixing something that isn't actually an issue. This seems like the automated tool misread the code or made incorrect assumptions.
    Could the tool be suggesting to add a cls parameter since this is a classmethod and it's a common pattern to have cls as the first parameter?
    While classmethods often have a cls parameter, it's not required if the method doesn't need to use it. The current implementation is valid Python and there's no clear need to add an unused parameter.
    The comment should be deleted as it suggests fixing a non-existent issue - there is no self parameter to replace with cls.
2. py/core/base/providers/embedding.py:52
  • Draft comment:
    Ensure get_all_subclasses does not introduce duplicates or unintended subclasses. Consider checking if the provider.supported_providers() is callable.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    The comment has two parts: 1) Checking for duplicates - while possible in theory with diamond inheritance, this is very unlikely to be an issue in practice with embedding providers. 2) Checking if supported_providers() is callable - this is already handled by hasattr() check. The comment is speculative and asks for verification rather than pointing out a clear issue.
    The comment could be pointing out a real edge case with multiple inheritance that could cause issues. The hasattr check doesn't guarantee the attribute is callable.
    While technically true, these are highly speculative edge cases. The code follows Python conventions and the hasattr pattern is widely used. No strong evidence these issues would occur in practice.
    The comment should be deleted as it is speculative and asks for verification rather than pointing out clear issues. The code as written follows standard Python patterns.
3. py/core/main/assembly/factory.py:196
  • Draft comment:
    The create_embedding_provider function only accounts for 'openai', 'litellm', and 'ollama'. This may not support new user-defined providers unless an override is provided. Consider updating this logic to dynamically load custom providers.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. py/core/base/providers/embedding.py:90
  • Draft comment:
    Use 'cls' (not 'self') in this @classmethod for consistency with Python conventions.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    The comment is technically correct - Python convention is to use cls for classmethod parameters. However, this is a very minor style issue that doesn't affect functionality. The method doesn't even use the parameter, as it just returns an empty list. Plus there's a TODO indicating this will be made abstract anyway.
    The comment is technically accurate about Python conventions. And consistency in code style can be important for maintainability.
    While technically correct, this is an extremely minor style issue on a method that's marked as temporary (with a TODO). The effort of changing it versus the benefit gained is not worth it.
    This comment should be deleted as it points out a very minor style issue that doesn't meaningfully impact code quality, especially given the method is temporary.
5. py/core/main/assembly/factory.py:196
  • Draft comment:
    The return type of create_embedding_provider is still specific to known providers. Consider updating the annotation to 'EmbeddingProvider' to support new/free providers added later.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. py/core/main/assembly/factory.py:318
  • Draft comment:
    Both embedding and completion_embedding providers use the same override (embedding_provider_override). This prevents independent customization. Consider introducing separate override parameters if different providers are desired.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_GWE732HveFEJdEga


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

logger.debug(f"Supported embedding providers: {providers}")
for provider in providers:
if hasattr(provider, "supported_providers"):
provider_names.extend(provider.supported_providers())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider deduplicating provider names when extending the list to avoid duplicates from multiple subclass providers.

Suggested change
provider_names.extend(provider.supported_providers())
provider_names = list(set(provider_names).union(provider.supported_providers()))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good

@@ -67,6 +86,11 @@
self.semaphore = asyncio.Semaphore(config.concurrent_request_limit)
self.current_requests = 0

@classmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this method is meant to be abstract, consider applying the @abstractmethod decorator (in combination with @classmethod) to enforce implementation in subclasses.

Suggested change
@classmethod
@abstractmethod\n @classmethod

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, but it will force existing providers to provide the implementation of this method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants