Add payer metrics #622

mkysel · 2025-03-12T14:41:37Z

Far from complete, but a step in the right direction.

Summary by CodeRabbit

New Features
- Introduced enhanced performance tracking for key operations, enabling more accurate measurement of processing times, retry counts, and activity volumes.
- Improved observability across publishing workflows and nonce management for deeper insights into system performance.
Chores
- Expanded metrics collection to support proactive diagnostics and overall system reliability.

coderabbitai · 2025-03-12T14:41:47Z

Walkthrough

The changes add new metric tracking across several modules. In the API payer package and blockchain nonce manager, calls to emit performance and operation metrics have been inserted into key methods. These modifications record durations, retry counts, message counts, and nonce values. Additionally, new Prometheus metrics collectors and emission functions are defined in the metrics package, extending observability without altering the core control flows.

Changes

File(s)	Summary of Changes
`pkg/api/payer/nodeCursorTracker.go`, `pkg/api/payer/service.go`, `pkg/blockchain/nonceManager.go`	Added new package imports and inserted calls to metrics functions within key methods to record performance durations, retry counts, message origins, and current nonce values.
`pkg/metrics/metrics.go`, `pkg/metrics/payer.go`	Introduced new Prometheus metrics collectors and corresponding emission functions for node publish duration, cursor block time, current nonce, banlist retries, and messages originated.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant NodeCursorTracker
    participant Metrics
    Client->>NodeCursorTracker: BlockUntilDesiredCursorReached(desiredCursor)
    NodeCursorTracker->>NodeCursorTracker: startTime = time.Now()
    NodeCursorTracker-->>NodeCursorTracker: Wait until desired cursor is reached
    NodeCursorTracker->>Metrics: EmitBlockUntilDesiredCursorReached(originatorId, duration)
    NodeCursorTracker-->>Client: Return result

sequenceDiagram
    participant Client
    participant Service
    participant Metrics
    Client->>Service: publishToNodeWithRetry(...)
    Service-->>Service: Attempt node publish + retry loop
    alt Retry occurred
        Service->>Metrics: EmitBanlistRetries(originatorId, retries)
    end
    Client->>Service: publishToNodes(...)
    Service->>Metrics: EmitNodePublishDuration(originatorId, duration)
    Service->>Metrics: EmitMessageOriginated(originatorId, messageCount)
    Client->>Service: publishToBlockchain(...)
    Service->>Metrics: EmitNodePublishDuration(originatorId, duration)
    Service->>Metrics: EmitMessageOriginated(originatorId, messageCount)

sequenceDiagram
    participant Caller
    participant NonceManager
    participant Metrics
    Caller->>NonceManager: GetNonce()
    NonceManager->>Metrics: EmitCurrentNonce(nonce)
    NonceManager-->>Caller: Return NonceContext

Possibly related PRs

Guarantee your own commits via a blocking payer #441: The changes in the main PR enhance the BlockUntilDesiredCursorReached method in the NodeCursorTracker struct, while the retrieved PR introduces a new implementation of the NodeCursorTracker struct and its method, indicating a direct modification of the same method. Therefore, the changes are related.

Suggested reviewers

fbac

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 golangci-lint (1.62.2)

Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24)

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)

We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
- To enable this feature, set early_access to true under in the settings.

✨ Finishing Touches

📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

pkg/api/payer/service.go (2)
329-331: Consider using a more specific identifier for blockchain metrics

While the metrics implementation is good, consider using a more descriptive name for the blockchain originator ID to distinguish these metrics from node-related metrics.
-metrics.EmitNodePublishDuration(desiredOriginatorId, time.Since(start).Seconds())
-metrics.EmitMessageOriginated(desiredOriginatorId, 1)
+metrics.EmitBlockchainPublishDuration(desiredOriginatorId, time.Since(start).Seconds())
+metrics.EmitBlockchainMessageOriginated(desiredOriginatorId, 1)
This would make it clearer in your metrics dashboard which publishes are going to the blockchain versus nodes.

233-234: Consider implementing a helper function to reduce duplication

The metrics emission code is repeated in both publishToNodes and publishToBlockchain methods. This could be extracted to a helper function to improve maintainability.
+func (s *Service) emitPublishMetrics(originatorID uint32, start time.Time, messageCount int) {
+  metrics.EmitNodePublishDuration(originatorID, time.Since(start).Seconds())
+  metrics.EmitMessageOriginated(originatorID, messageCount)
+}

 func (s *Service) publishToNodes(
   // existing code...
-  metrics.EmitNodePublishDuration(originatorID, time.Since(start).Seconds())
-  metrics.EmitMessageOriginated(originatorID, len(payerEnvelopes))
+  s.emitPublishMetrics(originatorID, start, len(payerEnvelopes))
   return resp.OriginatorEnvelopes, nil
 }

 func (s *Service) publishToBlockchain(
   // existing code...
-  metrics.EmitNodePublishDuration(desiredOriginatorId, time.Since(start).Seconds())
-  metrics.EmitMessageOriginated(desiredOriginatorId, 1)
+  s.emitPublishMetrics(desiredOriginatorId, start, 1)
Also applies to: 329-331

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cb07a8 and 8970bfe.

📒 Files selected for processing (5)

pkg/api/payer/nodeCursorTracker.go (3 hunks)
pkg/api/payer/service.go (5 hunks)
pkg/blockchain/nonceManager.go (2 hunks)
pkg/metrics/metrics.go (1 hunks)
pkg/metrics/payer.go (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: Push Docker Images to GitHub Packages (xmtpd)
GitHub Check: Build pre-baked anvil-xmtpd
GitHub Check: Test (Node)

🔇 Additional comments (16)

pkg/metrics/metrics.go (1)

75-79: Good enhancement of metrics collection

The addition of these new metrics enhances observability for payer operations, which should help with monitoring and debugging. This approach of adding metrics to the existing collection mechanism is clean and follows the established pattern.

pkg/metrics/payer.go (4)

8-19: LGTM: Good implementation of the node publish duration metric

The histogram is properly defined with appropriate namespace, help text, and label for originator_id.

21-32: LGTM: Well-structured cursor block time metric

The metric is properly defined with a descriptive name and help text. The implementation follows best practices by using a histogram with the originator_id label.

34-44: LGTM: Appropriate use of gauge for nonce tracking

Good choice using a gauge for the nonce value as it represents a single changing value rather than a distribution or cumulative count.

60-71: LGTM: Well-implemented message counter

The counter for originated messages is properly implemented with appropriate labels and emission function.

pkg/blockchain/nonceManager.go (2)

3-10: LGTM: Clean import addition

The metrics package is properly imported.

54-55: LGTM: Good instrumentation of nonce generation

The metrics emission is placed in the right location - after the nonce is generated and logged, but before it's returned.

pkg/api/payer/nodeCursorTracker.go (3)

3-12: LGTM: Clean import additions

The necessary imports for metrics and time are added appropriately.

50-51: LGTM: Simple and effective timing setup

Using time.Now() at the start of the method is the right approach for measuring duration.

105-108: LGTM: Well-positioned metrics emission

The metrics emission is correctly placed where the desired sequence ID is reached, and it properly uses time.Since(start).Seconds() to calculate the duration.
pkg/api/payer/service.go (6)

6-6: Good addition of metrics package import

Adding metrics capabilities to this service is a valuable enhancement for observability.

184-186: Appropriate conditional metrics emission for banlist retries

The conditional check ensures the metric is only emitted when retries actually occur, avoiding unnecessary metric noise for successful first attempts.

225-225: Useful timing capture for node publish operations

Correctly positioned timestamp capture to measure the duration of the node publishing operation.

233-234: Good metrics instrumentation for node publishing

The metrics track both the duration of the publish operation and the number of messages originated, providing valuable performance and throughput insights.

268-269: Appropriate timing capture for blockchain publish operations

Correctly positioned timestamp capture at the beginning of the blockchain publishing flow.

184-186:
✅ Verification successful

Consider error handling for metrics emission

The metrics emission functions don't appear to have any error handling. Verify that failures in metrics recording won't impact the service's core functionality.

Let's check how the metrics package handles errors in the implementation of these functions:

Also applies to: 233-234, 329-331

🏁 Script executed:
#!/bin/bash
# Check how metrics functions are implemented and if they handle errors
grep -r "func Emit" pkg/metrics/ --include="*.go" -A 10
Length of output: 3958
Metrics Emission: No additional error handling required

After verifying the implementation in the pkg/metrics package, the metrics emission functions (including EmitBanlistRetries) simply invoke Prometheus instrumentation methods (e.g., .With(), .Observe(), .Add()), which do not return errors. Their design is fire‑and‑forget and does not impact the service’s core functionality if a metrics recording failure were to occur. This pattern is consistent across similar metrics functions in the package.

pkg/metrics/payer.go

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

pkg/metrics/payer.go (3)
10-10: Inconsistent naming convention in metric name

The metric name xmtp_payer__node_publish_duration_seconds uses a double underscore between "payer" and "node", while other metrics in this file use a single underscore. Consider standardizing the naming convention across all metrics.
-		Name: "xmtp_payer__node_publish_duration_seconds",
+		Name: "xmtp_payer_node_publish_duration_seconds",
21-32: Function name doesn't match metric name convention

The function name EmitBlockUntilDesiredCursorReached doesn't reflect the new metric name xmtp_payer_read_own_commit_in_time_seconds that was agreed upon in the discussion. Consider renaming the function to match the metric's purpose for better code clarity.
-func EmitBlockUntilDesiredCursorReached(originatorId uint32, duration float64) {
+func EmitReadOwnCommitDuration(originatorId uint32, duration float64) {
	cursorBlockTime.With(prometheus.Labels{"originator_id": strconv.Itoa(int(originatorId))}).
		Observe(duration)
}
48-49: Consider shortening the metric name

The metric name xmtp_payer_failed_attempts_to_publish_to_node_via_banlist is quite verbose. Consider a more concise name that still captures the essence of what's being measured.
-		Name:    "xmtp_payer_failed_attempts_to_publish_to_node_via_banlist",
+		Name:    "xmtp_payer_banlist_publish_retries",

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8970bfe and ddf910b.

📒 Files selected for processing (5)

pkg/api/payer/nodeCursorTracker.go (3 hunks)
pkg/api/payer/service.go (5 hunks)
pkg/blockchain/nonceManager.go (2 hunks)
pkg/metrics/metrics.go (1 hunks)
pkg/metrics/payer.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

pkg/metrics/metrics.go
pkg/blockchain/nonceManager.go
pkg/api/payer/nodeCursorTracker.go
pkg/api/payer/service.go

⏰ Context from checks skipped due to timeout of 90000ms (4)

GitHub Check: Push Docker Images to GitHub Packages (xmtpd-cli)
GitHub Check: Push Docker Images to GitHub Packages (xmtpd)
GitHub Check: Build pre-baked anvil-xmtpd
GitHub Check: Test (Node)

🔇 Additional comments (3)

pkg/metrics/payer.go (3)

34-44: LGTM: Well-implemented gauge metric

The currentNonce gauge implementation is clean and includes a helpful comment about thread safety. The naming and help text are clear and informative.

60-71: LGTM: Well-implemented counter metric

The messagesOriginated counter implementation is straightforward and correctly uses the Add method for incrementing. The naming and help text are clear.

1-71:
✅ Verification successful

Missing metric registration with Prometheus registry

None of the defined metrics are registered with the Prometheus registry in this file. Ensure that these metrics are registered somewhere in the codebase, typically during application initialization.

🏁 Script executed:
#!/bin/bash
# Verify if metrics are registered somewhere in the codebase

# Look for metrics registration patterns in the codebase
echo "Searching for prometheus.Register or prometheus.MustRegister patterns..."
rg "prometheus\.(Must)?Register" -t go
Length of output: 271
Metrics registration verified in the codebase

The metrics defined in pkg/metrics/payer.go are indeed registered through the registerCollectors function in pkg/metrics/metrics.go. No changes are required.

mkysel requested a review from a team as a code owner March 12, 2025 14:41

coderabbitai bot reviewed Mar 12, 2025

View reviewed changes

pkg/metrics/payer.go Outdated Show resolved Hide resolved

neekolas reviewed Mar 13, 2025

View reviewed changes

pkg/metrics/payer.go Outdated Show resolved Hide resolved

mkysel added 2 commits March 13, 2025 11:32

Add payer metrics

43f15ce

better names

ddf910b

mkysel force-pushed the mkysel/payer-metrics branch from 8970bfe to ddf910b Compare March 13, 2025 15:35

coderabbitai bot reviewed Mar 13, 2025

View reviewed changes

mkysel requested review from fbac and neekolas March 13, 2025 18:06

neekolas approved these changes Mar 13, 2025

View reviewed changes

mkysel merged commit 03fd0d8 into main Mar 14, 2025
7 checks passed

mkysel deleted the mkysel/payer-metrics branch March 14, 2025 14:56

coderabbitai bot mentioned this pull request Mar 14, 2025

More sync metrics #634

Merged

mkysel linked an issue Mar 14, 2025 that may be closed by this pull request

Payer metrics #623

Closed

coderabbitai bot mentioned this pull request Mar 14, 2025

Test utility to start an anvil instance #629

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add payer metrics #622

Add payer metrics #622

mkysel commented Mar 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 12, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

Add payer metrics #622

Add payer metrics #622

Conversation

mkysel commented Mar 12, 2025 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Mar 12, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

mkysel commented Mar 12, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 12, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)