Integration of Atla LLM Judge with Comet's Opik #114
Conversation
Walkthrough

This update introduces a new Python project called "llm-as-judge," which implements an LLM-based metric for evaluating AI-generated responses. The changes include the creation of the main evaluation module, a sample README, a project configuration file specifying dependencies, and an example environment variable file. The core functionality is encapsulated in the `LLMJudgeMetric` class in `llm-as-judge/main.py`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant LLMJudgeMetric
    participant OpenAI_API
    User->>LLMJudgeMetric: Call score(input, output)
    LLMJudgeMetric->>OpenAI_API: Send formatted prompt (input, output)
    OpenAI_API-->>LLMJudgeMetric: Return JSON response (score, reason)
    LLMJudgeMetric-->>User: Return ScoreResult (score, reason)
```
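To make the flow concrete, here is a minimal usage sketch based on the example usage at the bottom of `main.py`; it assumes the script is run from the `llm-as-judge` directory with `ATLA_API_KEY` set in the environment:

```python
# Minimal usage sketch (assumes the working directory is llm-as-judge/
# and ATLA_API_KEY is available in the environment).
from main import LLMJudgeMetric

metric = LLMJudgeMetric()  # defaults: name="LLM judge metric", model_name="atla-selene"

# score() formats the rubric prompt, sends it to the Atla judge model through
# the OpenAI-compatible client, and parses the JSON reply into a ScoreResult.
result = metric.score(
    input="Tell me a joke.",
    output="Why did the chicken cross the road? To get to the other side",
)
print(result.value)   # numeric score between 1 and 5
print(result.reason)  # the judge's explanation
```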
Actionable comments posted: 6
🧹 Nitpick comments (5)
llm-as-judge/main.py (5)
14-15: Make Opik configuration more flexible

The Opik configuration is hardcoded with `use_local=False`, which limits flexibility. Consider making this configurable via an environment variable.

```diff
 # Configure Comet's Opik
 import opik
-opik.configure(use_local=False)
+opik.configure(use_local=os.environ.get("OPIK_USE_LOCAL", "False").lower() == "true")
```
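If you want the flag parsing to be a bit more forgiving than an exact `"true"` comparison, a small helper along these lines could work (the `env_flag` name is illustrative and not part of the PR):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}

# e.g. opik.configure(use_local=env_flag("OPIK_USE_LOCAL"))
```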
90-96: Improve response content extraction robustness

The response parsing logic attempts to handle different response structures but could be more robust. It's difficult to determine the expected structure from this code alone.

```diff
 # Parse the response from the LLM
-response_content = (
-    response.choices[0].message.content
-    if hasattr(response, "choices")
-    else response.message.content
-)
+try:
+    if hasattr(response, "choices") and response.choices:
+        response_content = response.choices[0].message.content
+    elif hasattr(response, "message"):
+        response_content = response.message.content
+    else:
+        raise ValueError("Unexpected response structure from LLM API")
+except (AttributeError, IndexError) as e:
+    raise ValueError(f"Failed to extract content from LLM response: {str(e)}") from e
```
54-76: Consider extracting the prompt template to a separate constant or file

The prompt template is quite long and embedded in the code. For better maintainability, consider extracting it to a separate constant or file, especially if you plan to add more templates in the future.

```diff
+# Define the prompt templates
+HUMOR_SCORING_PROMPT = """
+You are an expert, impartial judge tasked with evaluating an AI-generated response based on a given instruction and scoring rubric.
+Provide comprehensive feedback on the response, strictly adhering to the scoring rubric. Follow this with a score between 1 and 5.
+
+The format of your response should be a json with no backticks that returns:
+{{
+    "score": <score between 1 and 5>,
+    "reason": "<reason for the score>"
+}}
+
+Scoring Rubric:
+Does the response effectively use humor or wit to enhance the conversation?
+Score 1: The response is devoid of any humor or wit.
+Score 2: The response attempts humor, but it falls flat or is inappropriate.
+Score 3: The response includes humor or wit, but it could be more effectively integrated.
+Score 4: The response uses humor or wit effectively in most instances, enhancing the conversation.
+Score 5: The response perfectly integrates humor or wit, greatly enhancing the enjoyment of the conversation.
+
+Here is the data to evaluate:
+Instruction: {input}
+Response: {output}
+"""
+
 class LLMJudgeMetric(base_metric.BaseMetric):
     """
     A metric that uses an LLM to judge the quality of AI-generated responses.
     ...
         )

         # Define the prompt template for the LLM
-        self.prompt_template = """
-        You are an expert, impartial judge tasked with evaluating an AI-generated response based on a given instruction and scoring rubric.
-        Provide comprehensive feedback on the response, strictly adhering to the scoring rubric. Follow this with a score between 1 and 5.
-
-        The format of the your response should be a json with no backticks that returns:
-        {{
-            "score": <score between 1 and 5>,
-            "reason": "<reason for the score>"
-        }}
-
-        Scoring Rubric:
-        Does the response effectively use humor or wit to enhance the conversation?
-        Score 1: The response is devoid of any humor or wit.
-        Score 2: The response attempts humor, but it falls flat or is inappropriate.
-        Score 3: The response includes humor or wit, but it could be more effectively integrated.
-        Score 4: The response uses humor or wit effectively in most instances, enhancing the conversation.
-        Score 5: The response perfectly integrates humor or wit, greatly enhancing the enjoyment of the conversation.
-
-        Here is the data to evaluate:
-        Instruction: {input}
-        Response: {output}
-        """
+        self.prompt_template = HUMOR_SCORING_PROMPT
```

Note that the extracted constant keeps the doubled braces (`{{` / `}}`), because the template is still passed through `.format()` in `score()`.
25-41: Consider adding support for different evaluation criteria

The metric is currently hardcoded to evaluate humor/wit. To make this class more versatile, consider adding support for different evaluation criteria through configurable prompts.

```diff
 class LLMJudgeMetric(base_metric.BaseMetric):
     """
     A metric that uses an LLM to judge the quality of AI-generated responses.

     Attributes:
     - name (str): The name of the metric.
     - model_name (str): The name of the LLM model to use for evaluation.
+    - criterion (str): The evaluation criterion to use (e.g., "humor", "helpfulness").
     - llm_client (OpenAI): The client for communicating with the LLM.
     - prompt_template (str): The template for constructing prompts to send to the LLM.

     Methods:
         score(input, output, **ignored_kwargs):
             Evaluates the AI-generated response against the provided input using the LLM as a judge.

     Returns:
         ScoreResult: Contains the numerical score (1-5) and reasoning for the evaluation.
     """

-    def __init__(self, name: str = "LLM judge metric", model_name: str = "atla-selene"):
+    def __init__(
+        self,
+        name: str = "LLM judge metric",
+        model_name: str = "atla-selene",
+        criterion: str = "humor"
+    ):
```

Then you would need to implement a method to select the appropriate prompt template based on the criterion.
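As a rough sketch of that selection step (both rubric strings below are placeholders; only the humor rubric exists in the current `main.py`):

```python
# Placeholder rubric texts; the real humor rubric lives in main.py.
HUMOR_SCORING_PROMPT = "...humor rubric... Instruction: {input} Response: {output}"
HELPFULNESS_SCORING_PROMPT = "...hypothetical helpfulness rubric... Instruction: {input} Response: {output}"

PROMPT_TEMPLATES = {
    "humor": HUMOR_SCORING_PROMPT,
    "helpfulness": HELPFULNESS_SCORING_PROMPT,
}

def select_prompt_template(criterion: str) -> str:
    """Return the scoring prompt for the requested criterion."""
    try:
        return PROMPT_TEMPLATES[criterion]
    except KeyError:
        supported = ", ".join(sorted(PROMPT_TEMPLATES))
        raise ValueError(
            f"Unknown evaluation criterion {criterion!r}; supported: {supported}"
        ) from None
```

The constructor could then store `self.prompt_template = select_prompt_template(criterion)`.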
78-101: Add a caching mechanism for performance optimization

LLM API calls are expensive. Consider implementing a caching mechanism to avoid redundant evaluations of the same input-output pairs.

```diff
+    # Simple in-memory cache for LLM evaluations
+    _cache = {}
+
     @opik.track
     def score(self, input: str, output: str, **ignored_kwargs: Any):
         """Method to evaluate the AI-generated response using the LLM judge."""
+        # Check cache for existing evaluation
+        cache_key = f"{input}:{output}"
+        if cache_key in self._cache:
+            cached_result = self._cache[cache_key]
+            return score_result.ScoreResult(
+                name=self.name,
+                value=cached_result["score"],
+                reason=cached_result["reason"]
+            )
+
         # Apply prompt template and prepare the messages for the LLM
         prompt = self.prompt_template.format(input=input, output=output)
         messages = [{"role": "user", "content": prompt}]

         ...

         # Return the result as a ScoreResult object with the score and reason
+        # Cache the result for future use
+        self._cache[cache_key] = result_json
+
         return score_result.ScoreResult(
             name=self.name, value=result_json["score"], reason=result_json["reason"]
         )
```

Note: For production use, consider using a more robust caching solution with TTL (time-to-live) and size limits.
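As a rough, stdlib-only illustration of that note (not part of the PR), a TTL- and size-bounded cache could replace the bare dict:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny in-memory cache with a max size and per-entry time-to-live.

    Illustrative only; a production setup might prefer an existing caching
    library or an external store.
    """

    def __init__(self, max_size: int = 256, ttl_seconds: float = 3600.0):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # key -> (stored_at, value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            # Entry expired: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        # Evict the oldest entries once the size limit is exceeded.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)

# Usage idea: cache.get(f"{input}:{output}") before calling the judge,
# cache.set(f"{input}:{output}", result_json) after a successful call.
```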
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `llm-as-judge/uv.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (4)

- `llm-as-judge/.env.example` (1 hunks)
- `llm-as-judge/README.md` (1 hunks)
- `llm-as-judge/main.py` (1 hunks)
- `llm-as-judge/pyproject.toml` (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
llm-as-judge/main.py (1)

- `autogen-stock-analyst/custom_autogen_model.py` (1): `create` (48-78)
🔇 Additional comments (2)
llm-as-judge/.env.example (1)
1-1: ✅ Verification successful

Document all required environment variables.

Currently only `ATLA_API_KEY` is listed. To ensure users can configure both the Atla judge and Comet Opik telemetry, please surface any additional variables that `main.py` depends on (for example, `COMET_API_KEY`, `COMET_PROJECT`, `COMET_EXPERIMENT_NAME`, etc.), and add inline comments explaining each entry in this example file. Run this script to discover all env-var references in your code:

🏁 Script executed:

```bash
#!/bin/bash
# List all os.getenv usages along with context
rg -n "os.getenv" -A2 llm-as-judge
```

Length of output: 34
I’d like to double‑check if the code actually references any Comet variables or environment loading patterns before updating the example. Let’s search for COMET_* usages and any getenv/environ calls:
🏁 Script executed:
```bash
#!/bin/bash
set -e
echo "🔍 Searching for COMET_ environment variables…"
rg -n "COMET_" -A2 llm-as-judge || true
echo "🔍 Searching for any getenv calls repository-wide…"
rg -n "getenv" -A2 . || true
echo "🔍 Searching for os.environ usage…"
rg -n "os.environ" -A2 llm-as-judge || true
echo "🔍 Searching for dotenv or load_dotenv…"
rg -n "load_dotenv" -A2 llm-as-judge || true
rg -n "dotenv" -A2 llm-as-judge || true
```

Length of output: 10149
Let's pull in the Comet section of `main.py` to confirm whether any `COMET_` keys are actually used:

🏁 Script executed:

```bash
#!/bin/bash
set -e
echo "🔍 Searching for any Comet references in main.py…"
rg -i "comet" -n -A5 llm-as-judge/main.py || true
echo "🔍 Showing the first 200 lines of main.py for full context…"
sed -n '1,200p' llm-as-judge/main.py
```

Length of output: 4502
Only ATLA_API_KEY Is Required
I reviewed `llm-as-judge/main.py` and found that the only explicit environment variable it reads is `ATLA_API_KEY`. The Comet Opik client is initialized via `opik.configure(use_local=False)` but does not reference any `COMET_*` variables in this code. No further `os.getenv` or `os.environ` calls are present.

• `.env.example` already lists:

```dotenv
ATLA_API_KEY=your_atla_api_key  # API key for Atla LLM judge
```
No additional variables need to be documented in this example file.
llm-as-judge/pyproject.toml (1)
8-10: ✅ Verification successful

Verify dependency names and minimum versions.

Double-check that the packages you're specifying (`openai`, `opik`, `python-dotenv`) are the correct PyPI names used by your imports in `main.py` and that the version bounds cover the features you need. You can quickly inspect available versions with:

🏁 Script executed:

```bash
#!/bin/bash
# Example for opik
pip index versions opik
```

Length of output: 1004
🏁 Script executed:
```bash
#!/bin/bash
# Check available PyPI versions
pip index versions openai
pip index versions python-dotenv

# Verify import statements in the codebase
rg -n "import .*openai" -A2
rg -n "from openai" -A2
rg -n "import .*opik" -A2
rg -n "from opik" -A2
rg -n "dotenv" -A2
```

Length of output: 38318
All dependency names and minimum versions are correct.
I've cross-checked the imports in `llm-as-judge/main.py` (`openai`, `opik`, `dotenv`) against PyPI and the specified bounds:
- opik>=1.7.7 (latest 1.7.9)
- python‑dotenv>=1.1.0 (latest 1.1.0)
No changes required.
```diff
@@ -0,0 +1 @@
+# LLM as a Judge
```
🛠️ Refactor suggestion
Expand the README with project details and instructions
The README currently only contains a title. To onboard users effectively, please add essential sections such as:
- A brief Description of what the package does and its use case
- Installation instructions (via pip/poetry)
- Environment Variables setup (e.g., `.env` parameters)
- Usage examples showing how to invoke the metric
- Contribution guidelines and License information
Example snippet:
````diff
 # LLM as a Judge
+
+## Description
+This package provides an `LLMJudgeMetric` that leverages Atla's open-source LLM to evaluate AI-generated responses on humor and wit, integrated with Comet's Opik for telemetry.
+
+## Installation
+```bash
+pip install llm-as-judge
+```
+
+## Environment Variables
+Create a `.env` file with:
+```dotenv
+ATLA_API_KEY=your_api_key_here
+OPIK_API_KEY=your_opik_key_here
+```
+
+## Usage
+```python
+from llm_as_judge.main import LLMJudgeMetric
+
+metric = LLMJudgeMetric()
+result = metric.score(input="Your prompt", output="Your AI response")
+print(result.value, result.reason)
+```
+
+## Contributing
+1. Fork the repo
+2. Create a feature branch
+3. Submit a pull request
+
+## License
+MIT © Your Name
````
```python
result_json = json.loads(response_content)
```
🛠️ Refactor suggestion
Add error handling for JSON parsing
The code assumes the response will always be valid JSON, but this may not be the case. Add error handling to gracefully manage parsing errors.
```diff
-result_json = json.loads(response_content)
+try:
+    result_json = json.loads(response_content)
+except json.JSONDecodeError as e:
+    raise ValueError(f"Failed to parse LLM response as JSON: {response_content}") from e
+
+# Validate response structure
+if "score" not in result_json or "reason" not in result_json:
+    raise ValueError(f"LLM response missing required fields: {result_json}")
```
llm-as-judge/main.py (Outdated)

```python
# Initialize the OpenAI client with the API key and base URL for Atla
self.llm_client = OpenAI(
    api_key=os.environ.get("ATLA_API_KEY"),
    base_url="https://api.atla-ai.com/v1",
)
```
🛠️ Refactor suggestion
Add error handling for missing API key
The code doesn't handle the case when the ATLA_API_KEY environment variable is missing, which could lead to confusing errors at runtime.
```diff
+# Check if API key is available
+api_key = os.environ.get("ATLA_API_KEY")
+if not api_key:
+    raise ValueError(
+        "ATLA_API_KEY environment variable not found. Please check your .env file."
+    )
+
 # Initialize the OpenAI client with the API key and base URL for Atla
 self.llm_client = OpenAI(
-    api_key=os.environ.get("ATLA_API_KEY"),
+    api_key=api_key,
     base_url="https://api.atla-ai.com/v1",
 )
```
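Since `python-dotenv` is a declared dependency and the repo ships a `.env.example`, the key is presumably loaded from a `.env` file. A minimal sketch of that loading-plus-validation step, assuming the variable name from `.env.example`:

```python
import os

from dotenv import load_dotenv

# Read variables from a local .env file into the process environment
# (a no-op if the file does not exist).
load_dotenv()

api_key = os.environ.get("ATLA_API_KEY")
if not api_key:
    raise ValueError(
        "ATLA_API_KEY environment variable not found. Please check your .env file."
    )
```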
llm-as-judge/main.py (Outdated)

```python
# Example usage
metric = LLMJudgeMetric()
metric.score(
    input="Tell me a joke.",
    output="Why did the chicken cross the road? To get to the other side",
)
```
🛠️ Refactor suggestion
Move example usage to an `if __name__ == "__main__":` block
The example usage at the bottom of the file will execute whenever the module is imported, which is likely not intended. Move it to a conditional block that only executes when the file is run directly.
```diff
-# Example usage
-metric = LLMJudgeMetric()
-metric.score(
-    input="Tell me a joke.",
-    output="Why did the chicken cross the road? To get to the other side",
-)
+# Example usage
+if __name__ == "__main__":
+    metric = LLMJudgeMetric()
+    result = metric.score(
+        input="Tell me a joke.",
+        output="Why did the chicken cross the road? To get to the other side",
+    )
+    print(f"Score: {result.value}")
+    print(f"Reason: {result.reason}")
```
llm-as-judge/main.py (Outdated)

```python
# Call the LLM with the prepared messages
response = self.llm_client.chat.completions.create(
    model=self.model_name, messages=messages
)
```
🛠️ Refactor suggestion
Add error handling for API calls
The LLM API call lacks error handling. OpenAI API calls can fail for various reasons (rate limits, authentication issues, network problems), and these should be gracefully handled.
```diff
 # Call the LLM with the prepared messages
-response = self.llm_client.chat.completions.create(
-    model=self.model_name, messages=messages
-)
+try:
+    response = self.llm_client.chat.completions.create(
+        model=self.model_name, messages=messages
+    )
+except Exception as e:
+    raise RuntimeError(f"Error calling LLM API: {str(e)}") from e
```
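For transient failures specifically (rate limits, dropped connections), an illustrative retry sketch is shown below. It is not part of the suggested change and assumes the `openai` 1.x exception classes `RateLimitError` and `APIConnectionError`:

```python
import time

import openai

def call_judge_with_retries(client, model_name, messages, max_attempts=3):
    """Call the chat completions endpoint, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.chat.completions.create(model=model_name, messages=messages)
        except (openai.RateLimitError, openai.APIConnectionError) as e:
            if attempt == max_attempts:
                raise RuntimeError(
                    f"LLM API call failed after {max_attempts} attempts: {e}"
                ) from e
            # Exponential backoff: 1s, 2s, 4s, ...
            time.sleep(2 ** (attempt - 1))
```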
```toml
[project]
name = "llm-as-judge"
version = "0.1.0"
description = "Integration of Atla LLM Judge with Comet's Opik."
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "openai>=1.75.0",
    "opik>=1.7.7",
    "python-dotenv>=1.1.0",
]
```
Add PEP 621 build-system metadata.
The `pyproject.toml` is missing the required `[build-system]` table, so build tools won't know which backend to invoke. Without this section, installation and packaging will fail. Please append something like:

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
```

to enable proper builds.
This Python script evaluates AI-generated responses using a scoring rubric for humor and wit, utilizing an LLM Judge powered by Atla's open-source judge LLMs. It also logs traces to Comet's observability platform, Opik.