
Integration of Atla LLM Judge with Comet's Opik #114


Open
sitamgithub-MSIT wants to merge 2 commits into main

Conversation

@sitamgithub-MSIT commented Apr 19, 2025

This Python script evaluates AI-generated responses against a humor-and-wit scoring rubric, using an LLM Judge powered by Atla's open-source judge LLMs, and logs traces to Comet's observability platform, Opik.

Summary by CodeRabbit

  • New Features
    • Introduced an LLM-based evaluation tool for scoring AI-generated responses using a humor-related rubric, providing detailed reasoning and numeric scores.
    • Added example usage demonstrating response evaluation.
  • Chores
    • Added environment variable templates and project configuration files for dependency management.
    • Created an initial README with the project title.


coderabbitai bot commented Apr 19, 2025

Walkthrough

This update introduces a new Python project called "llm-as-judge," which implements an LLM-based metric for evaluating AI-generated responses. The changes include the creation of the main evaluation module, a sample README, a project configuration file specifying dependencies, and an example environment variable file. The core functionality is encapsulated in the LLMJudgeMetric class, which uses an OpenAI-compatible API with a custom model and endpoint to assess responses based on a humor scoring rubric. The project is integrated with the Opik framework for metric tracking and telemetry.
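
For orientation, here is a minimal sketch of the call pattern described above. The endpoint and model id are taken from the diffs reviewed below; the prompt text is a placeholder, not the full rubric.

# Minimal sketch of the judge call (endpoint and model id from this PR's diffs;
# the prompt is a placeholder).
import os
import json
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("ATLA_API_KEY"),   # Atla API key, read from the environment
    base_url="https://api.atla-ai.com/v1",    # Atla's OpenAI-compatible endpoint
)
response = client.chat.completions.create(
    model="atla-selene",
    messages=[{"role": "user", "content": "Score this response for humor on a 1-5 rubric and reply as JSON."}],
)
result = json.loads(response.choices[0].message.content)  # expects {"score": ..., "reason": ...}
print(result["score"], result["reason"])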

Changes

File(s) | Change Summary
llm-as-judge/.env.example | Added a sample environment variable file with an ATLA_API_KEY placeholder.
llm-as-judge/README.md | Added a new README file with the title "LLM as a Judge".
llm-as-judge/main.py | Introduced the main evaluation module with LLMJudgeMetric and LLMJudgeResult classes, using OpenAI and Opik for LLM-based scoring with a custom API endpoint and model.
llm-as-judge/pyproject.toml | Added project configuration and dependency declarations for the Python package.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LLMJudgeMetric
    participant OpenAI_API

    User->>LLMJudgeMetric: Call score(input, output)
    LLMJudgeMetric->>OpenAI_API: Send formatted prompt (input, output)
    OpenAI_API-->>LLMJudgeMetric: Return JSON response (score, reason)
    LLMJudgeMetric-->>User: Return ScoreResult (score, reason)

Poem

🐰
A metric was born, so clever and new,
To judge with a rubric, both honest and true.
It scores your best jokes with a digital mind,
And tells you the reason, so you’re never behind.
With Opik and Atla, the framework’s in place—
Let’s see which AI brings a smile to your face!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11b106d and d7b2980.

📒 Files selected for processing (1)
  • llm-as-judge/main.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • llm-as-judge/main.py


coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (5)
llm-as-judge/main.py (5)

14-15: Make Opik configuration more flexible

The Opik configuration is hardcoded with use_local=False, which limits flexibility. Consider making this configurable via an environment variable.

 # Configure Comet's Opik
 import opik
-opik.configure(use_local=False)
+opik.configure(use_local=os.environ.get("OPIK_USE_LOCAL", "False").lower() == "true")

90-96: Improve response content extraction robustness

The response parsing logic attempts to handle different response structures but could be more robust. It's difficult to determine the expected structure from this code alone.

         # Parse the response from the LLM
-        response_content = (
-            response.choices[0].message.content
-            if hasattr(response, "choices")
-            else response.message.content
-        )
+        try:
+            if hasattr(response, "choices") and response.choices:
+                response_content = response.choices[0].message.content
+            elif hasattr(response, "message"):
+                response_content = response.message.content
+            else:
+                raise ValueError("Unexpected response structure from LLM API")
+        except (AttributeError, IndexError) as e:
+            raise ValueError(f"Failed to extract content from LLM response: {str(e)}") from e

54-76: Consider extracting prompt template to a separate constant or file

The prompt template is quite long and embedded in the code. For better maintainability, consider extracting it to a separate constant or file, especially if you plan to add more templates in the future.

+# Define the prompt templates
+HUMOR_SCORING_PROMPT = """
+You are an expert, impartial judge tasked with evaluating an AI-generated response based on a given instruction and scoring rubric.
+Provide comprehensive feedback on the response, strictly adhering to the scoring rubric. Follow this with a score between 1 and 5.
+
+The format of your response should be JSON with no backticks, returning:
+{{
+    "score": <score between 1 and 5>,
+    "reason": "<reason for the score>"
+}}
+
+Scoring Rubric:
+Does the response effectively use humor or wit to enhance the conversation?
+Score 1: The response is devoid of any humor or wit.
+Score 2: The response attempts humor, but it falls flat or is inappropriate.
+Score 3: The response includes humor or wit, but it could be more effectively integrated.
+Score 4: The response uses humor or wit effectively in most instances, enhancing the conversation.
+Score 5: The response perfectly integrates humor or wit, greatly enhancing the enjoyment of the conversation.
+
+Here is the data to evaluate:
+Instruction: {input}
+Response: {output}
+"""
+
 class LLMJudgeMetric(base_metric.BaseMetric):
     """
     A metric that uses an LLM to judge the quality of AI-generated responses.
...
         )
 
         # Define the prompt template for the LLM
-        self.prompt_template = """
-        You are an expert, impartial judge tasked with evaluating an AI-generated response based on a given instruction and scoring rubric.
-        Provide comprehensive feedback on the response, strictly adhering to the scoring rubric. Follow this with a score between 1 and 5.
-
-        The format of the your response should be a json with no backticks that returns:
-        {{
-            "score": <score between 1 and 5>,
-            "reason": "<reason for the score>"
-        }}
-
-        Scoring Rubric:
-        Does the response effectively use humor or wit to enhance the conversation?
-        Score 1: The response is devoid of any humor or wit.
-        Score 2: The response attempts humor, but it falls flat or is inappropriate.
-        Score 3: The response includes humor or wit, but it could be more effectively integrated.
-        Score 4: The response uses humor or wit effectively in most instances, enhancing the conversation.
-        Score 5: The response perfectly integrates humor or wit, greatly enhancing the enjoyment of the conversation.
-
-        Here is the data to evaluate:
-        Instruction: {input}
-        Response: {output}
-        """
+        self.prompt_template = HUMOR_SCORING_PROMPT

25-41: Consider adding support for different evaluation criteria

The metric is currently hardcoded to evaluate humor/wit. To make this class more versatile, consider adding support for different evaluation criteria through configurable prompts.

 class LLMJudgeMetric(base_metric.BaseMetric):
     """
     A metric that uses an LLM to judge the quality of AI-generated responses.
 
     Attributes:
         - name (str): The name of the metric.
         - model_name (str): The name of the LLM model to use for evaluation.
+        - criterion (str): The evaluation criterion to use (e.g., "humor", "helpfulness").
         - llm_client (OpenAI): The client for communicating with the LLM.
         - prompt_template (str): The template for constructing prompts to send to the LLM.
 
     Methods:
         score(input, output, **ignored_kwargs): Evaluates the AI-generated response against the
             provided input using the LLM as a judge.
 
     Returns:
         ScoreResult: Contains the numerical score (1-5) and reasoning for the evaluation.
     """
 
-    def __init__(self, name: str = "LLM judge metric", model_name: str = "atla-selene"):
+    def __init__(
+        self, 
+        name: str = "LLM judge metric", 
+        model_name: str = "atla-selene",
+        criterion: str = "humor"
+    ):

Then you would need to implement a method to select the appropriate prompt template based on the criterion.
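
A possible shape for that selection method, as a sketch; the second criterion and its rubric are hypothetical placeholders:

# Sketch: criterion-to-template lookup. HUMOR_SCORING_PROMPT stands for the rubric shown
# above; HELPFULNESS_SCORING_PROMPT is a hypothetical second rubric used only for illustration.
HUMOR_SCORING_PROMPT = "..."        # full humor rubric from the earlier suggestion
HELPFULNESS_SCORING_PROMPT = "..."  # placeholder for an additional rubric

PROMPT_TEMPLATES = {
    "humor": HUMOR_SCORING_PROMPT,
    "helpfulness": HELPFULNESS_SCORING_PROMPT,
}

def select_prompt_template(criterion: str) -> str:
    """Return the prompt template for the requested criterion, failing loudly on unknown names."""
    try:
        return PROMPT_TEMPLATES[criterion]
    except KeyError:
        raise ValueError(
            f"Unknown evaluation criterion {criterion!r}; expected one of {sorted(PROMPT_TEMPLATES)}"
        ) from None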


78-101: Add caching mechanism for performance optimization

LLM API calls are expensive. Consider implementing a caching mechanism to avoid redundant evaluations of the same input-output pairs.

+    # Simple in-memory cache for LLM evaluations
+    _cache = {}
+
     @opik.track
     def score(self, input: str, output: str, **ignored_kwargs: Any):
         """Method to evaluate the AI-generated response using the LLM judge."""
+        # Check cache for existing evaluation
+        cache_key = f"{input}:{output}"
+        if cache_key in self._cache:
+            cached_result = self._cache[cache_key]
+            return score_result.ScoreResult(
+                name=self.name, 
+                value=cached_result["score"], 
+                reason=cached_result["reason"]
+            )
+
         # Apply prompt template and prepare the messages for the LLM
         prompt = self.prompt_template.format(input=input, output=output)
         messages = [{"role": "user", "content": prompt}]

...

         # Return the result as a ScoreResult object with the score and reason
+        # Cache the result for future use
+        self._cache[cache_key] = result_json
+
         return score_result.ScoreResult(
             name=self.name, value=result_json["score"], reason=result_json["reason"]
         )

Note: For production use, consider using a more robust caching solution with TTL (time-to-live) and size limits.
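
A minimal sketch of such a bounded cache follows; the size limit and TTL are arbitrary illustrative values:

# Sketch: small TTL + size-bounded cache for (input, output) evaluations.
import time
from collections import OrderedDict

class TTLCache:
    def __init__(self, max_size: int = 256, ttl_seconds: float = 3600.0):
        self._store: OrderedDict[str, tuple[float, dict]] = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl_seconds

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]          # entry expired
            return None
        self._store.move_to_end(key)      # mark as recently used
        return value

    def set(self, key: str, value: dict) -> None:
        self._store[key] = (time.monotonic() + self._ttl, value)
        self._store.move_to_end(key)
        while len(self._store) > self._max_size:
            self._store.popitem(last=False)   # evict least recently used

The score() method would then consult cache.get(cache_key) before calling the LLM and cache.set(cache_key, result_json) after a successful parse.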

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07f2d33 and 11b106d.

⛔ Files ignored due to path filters (1)
  • llm-as-judge/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • llm-as-judge/.env.example (1 hunks)
  • llm-as-judge/README.md (1 hunks)
  • llm-as-judge/main.py (1 hunks)
  • llm-as-judge/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
llm-as-judge/main.py (1)
autogen-stock-analyst/custom_autogen_model.py (1)
  • create (48-78)
🔇 Additional comments (2)
llm-as-judge/.env.example (1)

1-1:

✅ Verification successful

Document all required environment variables.

Currently only ATLA_API_KEY is listed. To ensure users can configure both the Atla judge and Comet Opik telemetry, please surface any additional variables that main.py depends on (for example, COMET_API_KEY, COMET_PROJECT, COMET_EXPERIMENT_NAME, etc.), and add inline comments explaining each entry in this example file.

Run this script to discover all env‑var references in your code:


🏁 Script executed:

#!/bin/bash
# List all os.getenv usages along with context
rg -n "os.getenv" -A2 llm-as-judge

Length of output: 34


I’d like to double‑check if the code actually references any Comet variables or environment loading patterns before updating the example. Let’s search for COMET_* usages and any getenv/environ calls:


🏁 Script executed:

#!/bin/bash
set -e

echo "🔍 Searching for COMET_ environment variables…"
rg -n "COMET_" -A2 llm-as-judge || true

echo "🔍 Searching for any getenv calls repository‑wide…"
rg -n "getenv" -A2 . || true

echo "🔍 Searching for os.environ usage…"
rg -n "os.environ" -A2 llm-as-judge || true

echo "🔍 Searching for dotenv or load_dotenv…"
rg -n "load_dotenv" -A2 llm-as-judge || true
rg -n "dotenv" -A2 llm-as-judge || true

Length of output: 10149


Let’s pull in the Comet section of main.py to confirm whether any COMET_ keys are actually used:


🏁 Script executed:

#!/bin/bash
set -e

echo "🔍 Searching for any Comet references in main.py…"
rg -i "comet" -n -A5 llm-as-judge/main.py || true

echo "🔍 Showing the first 200 lines of main.py for full context…"
sed -n '1,200p' llm-as-judge/main.py

Length of output: 4502


Only ATLA_API_KEY Is Required

I reviewed llm-as-judge/main.py and found that the only explicit environment variable it reads is ATLA_API_KEY. The Comet Opik client is initialized via opik.configure(use_local=False) but does not reference any COMET_* variables in this code. No further os.getenv or os.environ calls are present.

• .env.example already lists:

ATLA_API_KEY=your_atla_api_key  # API key for Atla LLM judge

No additional variables need to be documented in this example file.

llm-as-judge/pyproject.toml (1)

8-10:

✅ Verification successful

Verify dependency names and minimum versions.

Double-check that the packages you’re specifying (openai, opik, python-dotenv) are the correct PyPI names used by your imports in main.py and that the version bounds cover the features you need.

You can quickly inspect available versions with:


🏁 Script executed:

#!/bin/bash
# Example for opik
pip index versions opik

Length of output: 1004


🏁 Script executed:

#!/bin/bash
# Check available PyPI versions
pip index versions openai
pip index versions python-dotenv

# Verify import statements in the codebase
rg -n "import .*openai" -A2
rg -n "from openai" -A2
rg -n "import .*opik" -A2
rg -n "from opik" -A2
rg -n "dotenv" -A2

Length of output: 38318


All dependency names and minimum versions are correct.

I’ve cross‑checked the imports in llm-as-judge/main.py (openai, opik, dotenv) against PyPI and the specified bounds:

  • openai>=1.75.0 (latest 1.75.0)
  • opik>=1.7.7 (latest 1.7.9)
  • python-dotenv>=1.1.0 (latest 1.1.0)

No changes required.

Comment on llm-as-judge/README.md, line 1:
# LLM as a Judge

🛠️ Refactor suggestion

Expand the README with project details and instructions

The README currently only contains a title. To onboard users effectively, please add essential sections such as:

  • A brief Description of what the package does and its use case
  • Installation instructions (via pip/poetry)
  • Environment Variables setup (e.g., .env parameters)
  • Usage examples showing how to invoke the metric
  • Contribution guidelines and License information

Example snippet:

 # LLM as a Judge
+
+## Description
+This package provides an `LLMJudgeMetric` that leverages Atla’s open-source LLM to evaluate AI-generated responses on humor and wit, integrated with Comet’s Opik for telemetry.
+
+## Installation
+```bash
+pip install llm-as-judge
+```
+
+## Environment Variables
+Create a `.env` file with:
+```dotenv
+ATLA_API_KEY=your_api_key_here
+OPIK_API_KEY=your_opik_key_here
+```
+
+## Usage
+```python
+from main import LLMJudgeMetric
+
+metric = LLMJudgeMetric()
+result = metric.score(input="Your prompt", output="Your AI response")
+print(result.value, result.reason)
+```
+
+## Contributing
+1. Fork the repo  
+2. Create a feature branch  
+3. Submit a pull request  
+
+## License
+MIT © Your Name

Comment on lines +96 to +97
result_json = json.loads(response_content)


🛠️ Refactor suggestion

Add error handling for JSON parsing

The code assumes the response will always be valid JSON, but this may not be the case. Add error handling to gracefully manage parsing errors.

-        result_json = json.loads(response_content)
+        try:
+            result_json = json.loads(response_content)
+        except json.JSONDecodeError as e:
+            raise ValueError(f"Failed to parse LLM response as JSON: {response_content}") from e
+
+        # Validate response structure
+        if "score" not in result_json or "reason" not in result_json:
+            raise ValueError(f"LLM response missing required fields: {result_json}")

Comment on lines 48 to 52
# Initialize the OpenAI client with the API key and base URL for Atla
self.llm_client = OpenAI(
api_key=os.environ.get("ATLA_API_KEY"),
base_url="https://api.atla-ai.com/v1",
)

🛠️ Refactor suggestion

Add error handling for missing API key

The code doesn't handle the case when the ATLA_API_KEY environment variable is missing, which could lead to confusing errors at runtime.

+        # Check if API key is available
+        api_key = os.environ.get("ATLA_API_KEY")
+        if not api_key:
+            raise ValueError(
+                "ATLA_API_KEY environment variable not found. Please check your .env file."
+            )
+
         # Initialize the OpenAI client with the API key and base URL for Atla
         self.llm_client = OpenAI(
-            api_key=os.environ.get("ATLA_API_KEY"),
+            api_key=api_key,
             base_url="https://api.atla-ai.com/v1",
         )
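
If main.py does not already do so, loading the .env file at import time makes this check meaningful when running locally; a short sketch, noting that python-dotenv is already declared in pyproject.toml:

# Sketch: load variables from .env before the API-key check runs.
# python-dotenv is already listed in this project's dependencies.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory, if present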

Comment on lines 104 to 109
# Example usage
metric = LLMJudgeMetric()
metric.score(
input="Tell me a joke.",
output="Why did the chicken cross the road? To get to the other side",
)

🛠️ Refactor suggestion

Move example usage to if __name__ == "__main__" block

The example usage at the bottom of the file will execute whenever the module is imported, which is likely not intended. Move it to a conditional block that only executes when the file is run directly.

-# Example usage
-metric = LLMJudgeMetric()
-metric.score(
-    input="Tell me a joke.",
-    output="Why did the chicken cross the road? To get to the other side",
-)
+# Example usage
+if __name__ == "__main__":
+    metric = LLMJudgeMetric()
+    result = metric.score(
+        input="Tell me a joke.",
+        output="Why did the chicken cross the road? To get to the other side",
+    )
+    print(f"Score: {result.value}")
+    print(f"Reason: {result.reason}")

Comment on lines 85 to 89
# Call the LLM with the prepared messages
response = self.llm_client.chat.completions.create(
model=self.model_name, messages=messages
)


🛠️ Refactor suggestion

Add error handling for API calls

The LLM API call lacks error handling. OpenAI API calls can fail for various reasons (rate limits, authentication issues, network problems), and these should be gracefully handled.

         # Call the LLM with the prepared messages
-        response = self.llm_client.chat.completions.create(
-            model=self.model_name, messages=messages
-        )
+        try:
+            response = self.llm_client.chat.completions.create(
+                model=self.model_name, messages=messages
+            )
+        except Exception as e:
+            raise RuntimeError(f"Error calling LLM API: {str(e)}") from e

Comment on lines +1 to +11
[project]
name = "llm-as-judge"
version = "0.1.0"
description = "Integration of Atla LLM Judge with Comet's Opik."
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"openai>=1.75.0",
"opik>=1.7.7",
"python-dotenv>=1.1.0",
]

⚠️ Potential issue

Add PEP 621 build-system metadata.

The pyproject.toml has no [build-system] table, so build front ends won't know which backend to invoke and will fall back to legacy setuptools behavior; packaging can fail or behave unexpectedly if the project layout doesn't match that fallback. Please append something like:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

to enable proper builds.
