Skip to content

More sync metrics #634

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 14, 2025
Merged

More sync metrics #634

merged 2 commits into from
Mar 14, 2025

Conversation

mkysel
Copy link
Collaborator

@mkysel mkysel commented Mar 14, 2025

.. also rename all the metrics with a prefix for easier grepping

Summary by CodeRabbit

  • New Features

    • Introduced enhanced tracking metrics for synchronization, enabling improved monitoring of connection performance and message flows.
  • Refactor

    • Standardized metrics naming across operations to offer clearer insights on system performance.
    • Refined error handling in synchronization processes for more reliable operations.

@mkysel mkysel requested a review from a team as a code owner March 14, 2025 15:02
Copy link
Contributor

coderabbitai bot commented Mar 14, 2025

Walkthrough

This change refactors several metric emission functions and variable names across multiple packages by renaming generic metrics to more context-specific ones (e.g., "payer" and "indexer"). In addition, new synchronization metrics and a connection status counter have been introduced, along with improved error propagation in envelope processing. The underlying control flow remains largely unchanged aside from these naming updates and minor adjustments in error handling.

Changes

File(s) Change Summary
pkg/api/payer/nodeCursorTracker.go, pkg/api/payer/service.go, pkg/blockchain/nonceManager.go, pkg/metrics/payer.go Updated metric emission calls and variable/function names to use the "Payer" prefix in methods such as BlockUntilDesiredCursorReached, publishToNodeWithRetry, and nonce management.
pkg/blockchain/rpcLogStreamer.go, pkg/metrics/indexer.go, pkg/metrics/metrics.go Renamed metrics and associated functions/variables to use the "Indexer" prefix; removed outdated metrics and registered a new set reflecting indexer context.
pkg/metrics/sync.go, pkg/sync/syncWorker.go Introduced new synchronization metrics, including gauges and counters, and added a SyncConnectionsStatusCounter type to track connection statuses. Also improved error handling in envelope validation and insertion.

Sequence Diagram(s)

sequenceDiagram
    participant SW as syncWorker
    participant MSC as SyncConnectionsStatusCounter
    participant VE as validateAndInsertEnvelope
    participant IE as insertEnvelope

    SW->>MSC: NewSyncConnectionsStatusCounter(nodeID)
    SW->>VE: validateAndInsertEnvelope(envelope)
    alt Validation Fails
        SW->>MSC: MarkFailure()
    else Validation Succeeds
        SW->>IE: insertEnvelope(envelope)
        alt Insertion Fails
            SW->>MSC: MarkFailure()
        else Insertion Succeeds
            SW->>MSC: MarkSuccess()
        end
    end
    SW->>MSC: Close()
Loading

Possibly related PRs

  • Track unsettled usage #549: Modifies metric emission functions to reflect payer-specific contexts, closely aligning with the renaming changes in this PR.
  • Add payer metrics #622: Involves updates to metrics within NodeCursorTracker and Service structs, transitioning to payer-specific metrics.
  • Add tests and fix bugs for payer API #248: Focuses on changes to the BlockUntilDesiredCursorReached method and related Service methods, similar to parts of this diff.

Suggested reviewers

  • fbac
  • neekolas

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 golangci-lint (1.62.2)

Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24)

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true under in the settings.
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
pkg/metrics/sync.go (2)

8-15: Consider retaining the explanatory comment or providing doc comments for clarity.

Line 8 points out that this metric "should be a counter, but it does not support set and we can not rely on memory state." To avoid confusion for fellow developers and maintainers, consider either elaborating on this in the function-level doc comments or linking to relevant Prometheus documentation. This helps justify why a gauge is used here instead of a counter.


48-54: Unify connection-related metrics for better maintainability.

You are currently managing multiple related metrics:
syncOutgoingSyncConnections (gauge)
syncFailedOutgoingSyncConnections (gauge)
syncFailedOutgoingSyncConnectionCounter (counter)

Consider combining them into a single GaugeVec or CounterVec, potentially with a status label (e.g., "failed", "open") to minimize duplication and reduce the surface area for mistakes. This also improves maintainability and consistency.

Also applies to: 55-60, 62-68

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03fd0d8 and 83dcc2e.

📒 Files selected for processing (9)
  • pkg/api/payer/nodeCursorTracker.go (1 hunks)
  • pkg/api/payer/service.go (3 hunks)
  • pkg/blockchain/nonceManager.go (1 hunks)
  • pkg/blockchain/rpcLogStreamer.go (2 hunks)
  • pkg/metrics/indexer.go (1 hunks)
  • pkg/metrics/metrics.go (1 hunks)
  • pkg/metrics/payer.go (2 hunks)
  • pkg/metrics/sync.go (1 hunks)
  • pkg/sync/syncWorker.go (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Push Docker Images to GitHub Packages (xmtpd-cli)
  • GitHub Check: Test (Node)
  • GitHub Check: Build pre-baked anvil-xmtpd
  • GitHub Check: Push Docker Images to GitHub Packages (xmtpd)
🔇 Additional comments (33)
pkg/blockchain/nonceManager.go (1)

88-88: LGTM: Renamed metric with contextual prefix

The metric has been appropriately renamed from EmitCurrentNonce to EmitPayerCurrentNonce, which aligns with the PR objective of improving metric naming for better searchability.

pkg/api/payer/nodeCursorTracker.go (1)

105-108: LGTM: Renamed metric with payer context prefix

The metric has been successfully renamed from EmitBlockUntilDesiredCursorReached to EmitPayerBlockUntilDesiredCursorReached while maintaining the same parameters, improving metric organization.

pkg/blockchain/rpcLogStreamer.go (2)

198-198: LGTM: Renamed metric with indexer context prefix

The metric has been appropriately renamed from EmitCurrentBlock to EmitIndexerCurrentBlock, providing better context and searchability.


229-229: LGTM: Renamed metric with indexer context prefix

The metric has been successfully renamed from EmitNumLogsFound to EmitIndexerNumLogsFound, maintaining consistency with the indexer context.

pkg/metrics/metrics.go (1)

71-84: LGTM: Improved metric organization with contextual prefixes

The metrics registration has been updated to use consistently prefixed metric names:

  • "indexer" prefix for blockchain indexing metrics
  • "payer" prefix for payment-related metrics
  • "sync" prefix for synchronization metrics

The changes align perfectly with the PR objectives to enhance sync metrics and rename metrics for better searchability. The new sync metrics will provide valuable insights into synchronization operations.

pkg/api/payer/service.go (3)

185-185: Updated metric prefix for consistency

The banlist retries metric has been renamed to use the Payer prefix, aligning with the PR's objective to facilitate better metric filtering.


233-234: Renamed metrics with payer prefix

The node publish duration and message originated metrics have been updated with the Payer prefix, improving metric categorization.


329-330: Standardized metrics in blockchain publishing

Consistent with other changes, these metrics have been updated to use the Payer prefix, ensuring uniformity throughout the codebase.

pkg/sync/syncWorker.go (11)

170-171: Added connection status tracking metrics

New connection status counter was added to track the health of node connections for synchronization.


181-182: Added metrics for failed node connections

This change tracks failure cases when a node isn't found in the registry, providing visibility into sync issues.


187-188: Added metrics for invalid node configurations

New metric tracking for nodes with invalid configuration, improving observability.


212-213: Added metrics for connection failures during retries

Connection failures during retry attempts are now recorded in metrics, enhancing monitoring capabilities.


233-234: Added success metrics for node connections

New metric tracking for successful connections, completing the connection lifecycle monitoring.


393-395: Updated error handling approach

The code now explicitly ignores errors from validateAndInsertEnvelope to continue processing subsequent envelopes even when one fails.


416-422: Improved error propagation with metrics

The function now returns errors and adds metrics tracking for originator error messages, enhancing observability.


439-441: Added new sync metrics

New metrics for tracking last seen sequence IDs and received message counts have been added, improving sync monitoring.


463-466: Improved fee calculation clarity

Fee calculation has been restructured for better readability, separating base and congestion fees.


482-488: Enhanced error handling in envelope insertion

The function now returns errors instead of just logging them, allowing for better error propagation.


521-523: Changed log level for duplicate envelopes

Changed log level from warning to info for already inserted envelopes, as this is an expected case rather than a warning condition.

pkg/metrics/indexer.go (8)

10-16: Renamed log streamer metrics with indexer prefix

Updated the logs counter variable and its metric name to include the indexer prefix, improving metric categorization and searchability.


18-24: Renamed current block metric with indexer prefix

Block tracking gauge has been updated with an indexer prefix for consistency and better filtering.


26-33: Renamed duration metrics with indexer prefix

Log retrieval duration histogram now includes an indexer prefix for improved metric organization.


35-41: Renamed request counter with indexer prefix

Request counter now has a consistent indexer prefix, aligning with other metrics in this module.


43-46: Updated emission function to match renamed metric

Function name updated to reflect the new indexer prefix convention.


48-51: Updated block tracking function to match renamed metric

Consistent renaming pattern applied to the block tracking function.


53-56: Updated duration emission function to match renamed metric

Duration tracking function updated with the indexer prefix for consistency.


62-65: Updated function call to use new metrics name

Updated the measurement function to use the newly renamed metrics function.

pkg/metrics/payer.go (5)

8-19: Renamed node publish metrics with payer prefix

The node publish duration histogram and its associated emission function have been updated with a payer prefix for better categorization.


21-32: Renamed cursor block time metrics with payer prefix

The cursor block time histogram and its associated emission function have been updated with a payer prefix for consistent naming.


34-44: Renamed nonce metrics with payer prefix

The current nonce gauge and its associated emission function have been updated with a payer prefix to align with other metrics.


55-58: Renamed banlist retries function with payer prefix

Updated the banlist retries function to use the payer prefix, consistent with other metric changes.


60-71: Renamed messages originated metrics with payer prefix

The messages originated counter and its associated emission function have been updated with a payer prefix for better filtering.

pkg/metrics/sync.go (1)

70-117:

✅ Verification successful

Address concurrent access to struct fields if used across goroutines.

The SyncConnectionsStatusCounter struct maintains boolean fields (hasFailed and hasSucceeded) that get updated in MarkFailure(), MarkSuccess(), and Close(). If the same instance is accessed concurrently by multiple goroutines, these fields and their corresponding gauge increments/decrements can lead to data races or inconsistent gauge states. Ensure that each instance of SyncConnectionsStatusCounter is used by only one goroutine at a time, or introduce a synchronization mechanism (e.g. a mutex) around these field updates.

Below is a script that you can run to locate all uses of MarkFailure, MarkSuccess, and Close across the codebase, so you can confirm whether multiple goroutines may invoke them simultaneously:


🏁 Script executed:

#!/bin/bash
# Search the repository for potential concurrent usages of SyncConnectionsStatusCounter methods.
rg -A 5 -B 3 'MarkFailure|MarkSuccess|Close'

Length of output: 65960


Re: Concurrency Safety of SyncConnectionsStatusCounter

After investigating repository usage, it appears that each instance of SyncConnectionsStatusCounter is created and used locally within a single goroutine (for example, in pkg/sync/syncWorker.go). The methods MarkFailure(), MarkSuccess(), and Close() are invoked sequentially within their respective execution paths, so no concurrent access of its internal boolean fields is observed at this time.

If a future change requires sharing an instance across multiple goroutines, please ensure to introduce a synchronization mechanism (e.g., a mutex) around these field updates and the corresponding gauge operations.

Comment on lines +108 to +117
func (fc *SyncConnectionsStatusCounter) Close() {
if fc.hasFailed {
fc.hasFailed = false
syncFailedOutgoingSyncConnections.Dec()
}
if fc.hasSucceeded {
fc.hasSucceeded = false
syncOutgoingSyncConnections.Dec()
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider marking sync connections as closed once.

Repeated calls to Close() could cause gauge inconsistencies if invoked more than once on the same instance. Because there is no "closed" field or check, a second call to Close() might wrongly decrement gauges again if this code changes in the future. Consider adding a dedicated "closed" state to guard against double-closing or re-opening actions.

@mkysel mkysel merged commit da8c2b9 into main Mar 14, 2025
7 checks passed
@mkysel mkysel deleted the mkysel/sync-metrics branch March 14, 2025 20:12
mkysel added a commit that referenced this pull request Mar 20, 2025
.. also rename all the metrics with a prefix for easier grepping

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

- **New Features**
- Introduced enhanced tracking metrics for synchronization, enabling
improved monitoring of connection performance and message flows.

- **Refactor**
- Standardized metrics naming across operations to offer clearer
insights on system performance.
- Refined error handling in synchronization processes for more reliable
operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants