Conversation

@alpe commented Jan 16, 2026

Replaces #2954

alpe added 30 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...
alpe added 8 commits January 12, 2026 15:58
* main:
  fix: Fail fast when executor ahead (#2966)
  feat(block): async epoch fetching (#2952)
  perf: tune badger defaults and add db bench (#2950)
* main:
  chore: adding syncing tracing (#2981)
  feat(tracing): adding block production tracing (#2980)
  feat(tracing): Add Store, P2P and Config tracing (#2972)
  chore: fix upgrade test (#2979)
  build(deps): Bump github.com/ethereum/go-ethereum from 1.16.7 to 1.16.8 in /execution/evm/test in the go_modules group across 1 directory (#2974)
  feat(tracing): adding tracing to DA client (#2968)
  chore: create onboarding skill  (#2971)
  test: add e2e tests for force inclusion (part 2) (#2970)
  feat(tracing): adding eth client tracing (#2960)
  test: add e2e tests for force inclusion (#2964)
  build(deps): Bump the all-go group across 4 directories with 10 updates (#2969)
* main:
  fix: inconsistent state detection and rollback (#2983)
  chore: improve graceful shutdown restarts (#2985)
  feat(submitting): add posting strategies (#2973)

claude bot commented Jan 16, 2026

Claude finished @alpe's task in 3m 8s


Code Review: RAFT High Availability Implementation

I've completed a comprehensive review of this RAFT high availability implementation. This is a significant and well-architected feature that introduces critical resilience capabilities to the sequencer. Below are my findings organized by category.


Summary

This PR successfully implements a production-ready RAFT consensus layer for high availability, enabling dynamic leader election and automatic failover between aggregator (leader) and sync (follower) modes. The implementation shows strong attention to safety, particularly around preventing double-signing and state divergence. The e2e tests demonstrate thorough validation of failover scenarios.

Key Strengths:

  • Strong safety guarantees (FSM synchronization before leadership, quorum checks before block production)
  • Comprehensive state validation and recovery mechanisms
  • Well-structured failover architecture with clean separation of concerns
  • Excellent test coverage for critical failover scenarios

Areas for Improvement:

  • Test coverage gaps (34% patch coverage)
  • Missing error context in some critical paths
  • Configuration validation could be more user-friendly
  • Some concurrency patterns need hardening

Critical Issues

1. Race Condition in Dynamic Leader Election (pkg/raft/election.go)

Location: pkg/raft/election.go:86-90,158-178

The ticker channel is set to nil after firing, but there's a window where both the leaderCh and ticker.C could fire simultaneously, potentially causing race conditions in worker initialization.

case <-ticker.C:
    // ...
    ticker.Stop()
    ticker.C = nil  // Setting to nil, but select could already be evaluating

Recommendation: Use atomic operations or a mutex to protect the ticker state transitions.
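
A minimal sketch of the mutex approach, under the assumption that the election loop owns the ticker; the syncTicker type and its method names are illustrative, not taken from the PR (imports of sync and time omitted):

```go
// syncTicker guards a time.Ticker so stopping it and selecting on it
// cannot interleave; a nil channel disables the corresponding select arm.
type syncTicker struct {
	mu     sync.Mutex
	ticker *time.Ticker
	c      <-chan time.Time
}

func newSyncTicker(d time.Duration) *syncTicker {
	t := time.NewTicker(d)
	return &syncTicker{ticker: t, c: t.C}
}

// Chan returns the current channel for use in a select statement.
func (s *syncTicker) Chan() <-chan time.Time {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.c
}

// Stop disables the ticker exactly once; later Chan calls return nil,
// so the select arm blocks forever instead of racing a stale tick.
func (s *syncTicker) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ticker != nil {
		s.ticker.Stop()
		s.ticker = nil
		s.c = nil
	}
}
```

The election loop would then read `case <-t.Chan():`, which is safe to evaluate on every iteration.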

Severity: Medium - Could cause initialization issues during cluster startup


2. Insufficient Error Context in State Validation (block/internal/executing/executor.go, block/internal/syncing/syncer.go)

Locations:

  • executor.go:241-261
  • syncer.go:829-853

Critical state validation errors lack sufficient context for debugging production issues. When hash mismatches occur, operators need to know:

  • Which node detected the mismatch
  • The full state at both heights
  • Whether this is a cluster-wide issue or isolated divergence

Current:

return fmt.Errorf("invalid state: block hash mismatch at height %d: raft=%x local=%x", 
    state.LastBlockHeight, raftState.Hash, header.Hash())

Recommendation:

return fmt.Errorf("state divergence detected: block hash mismatch at height %d: "+
    "raft_hash=%x local_hash=%x node_id=%s cluster_state=height:%d raft_index=%d "+
    "This indicates a critical consistency violation requiring manual intervention",
    state.LastBlockHeight, raftState.Hash, header.Hash(), n.NodeID(), 
    raftState.Height, raftState.Index)

Severity: High - Impacts production debugging and incident response


3. Potential Deadlock in Failover State (node/failover.go)

Location: node/failover.go:180-235

The Run method uses errgroup.WithContext but the shutdown sequence in the deferred function could deadlock if bc.Stop() blocks while the context is already cancelled.

defer func() {
    if err := f.bc.Stop(); err != nil && !errors.Is(err, context.Canceled) {
        multiErr = errors.Join(multiErr, fmt.Errorf("stopping block components: %w", err))
    }
}()

Recommendation: Use a separate context with timeout for shutdown operations:

defer func() {
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    // Stop would need to accept the shutdown context for the timeout to take effect.
    if err := f.bc.Stop(shutdownCtx); err != nil && !errors.Is(err, context.Canceled) {
        multiErr = errors.Join(multiErr, fmt.Errorf("stopping block components: %w", err))
    }
}()

Severity: Medium - Could cause graceful shutdown failures


High Priority Issues

4. Missing Nil Check for Raft Node (pkg/raft/election.go)

Location: pkg/raft/election.go:98-99, 109

The code checks leaderID() != NodeID() but leaderID() could return empty string if raft is not ready. This should be explicitly handled.

if d.node.leaderID() != d.node.NodeID() {
    d.logger.Info().Msg("lost leadership during sync wait")
    continue
}

Recommendation: Add explicit check:

leaderID := d.node.leaderID()
if leaderID == "" {
    d.logger.Warn().Msg("raft cluster not ready, no leader elected yet")
    continue
}
if leaderID != d.node.NodeID() {
    d.logger.Info().Msg("lost leadership during sync wait")
    continue
}

Severity: Medium - Could cause confusing behavior during cluster initialization


5. Configuration Error Messages Not User-Friendly (node/full.go, pkg/config/config.go)

Location: node/full.go:124

The error message "raft config must be used in sequencer setup only" is unclear. It doesn't explain that raft requires the aggregator flag.

Current:

return nil, fmt.Errorf("raft config must be used in sequencer setup only")

Recommendation:

return nil, fmt.Errorf("raft can only be enabled for aggregator nodes: "+
    "set --node.aggregator=true when using --raft.enable=true. "+
    "Non-aggregator nodes (full nodes/light clients) cannot participate in raft consensus")

Severity: Medium - Impacts developer experience and deployment troubleshooting

Note: This was also flagged by Gemini's review.


6. Duplicate Peer Handling Too Strict (pkg/raft/node.go)

Location: pkg/raft/node.go:139-141

The code errors on duplicate peers instead of deduplicating and warning. This makes configuration more brittle than necessary.

Current:

if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}

Recommendation: Log warning and proceed:

if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v -> %v", 
        cfg.Servers, svrs)
    cfg.Servers = svrs
}

Severity: Low - Quality of life improvement

Note: This was also flagged by Gemini's review.


Medium Priority Issues

7. Unclear Recovery Logic (block/internal/executing/executor.go)

Location: executor.go:282-322

The recovery logic in initializeState performs multiple state checks but the flow is hard to follow. The nested if statements checking raft state could benefit from early returns.

Recommendation: Refactor for clarity:

if e.raftNode == nil {
    // No raft, proceed with standard initialization
    e.setLastState(state)
    e.sequencer.SetDAHeight(state.DAHeight)
    return e.syncExecutionLayer(state)
}

// Raft-enabled path
raftState := e.raftNode.GetState()
if err := e.validateStateAgainstRaft(state, raftState); err != nil {
    return err
}
syncTarget := max(state.LastBlockHeight, raftState.Height)
return e.syncToHeight(syncTarget)

Severity: Medium - Code maintainability


8. Missing Logging for Critical State Transitions (pkg/raft/election.go)

Location: election.go:93-139

When transitioning from follower to leader, there's extensive checking but missing logs for key decision points (e.g., when FSM lag causes leadership abdication).

Recommendation: Add structured logging:

if err := d.node.waitForMsgsLanded(d.node.Config().SendTimeout); err != nil {
    d.logger.Error().
        Err(err).
        Uint64("applied_index", d.node.raft.AppliedIndex()).
        Uint64("last_index", d.node.raft.LastIndex()).
        Msg("failed to wait for messages to land - FSM lagging, abdicating")
    // ...
}

Severity: Medium - Impacts production observability


9. Hardcoded Timeout Values (node/failover.go, pkg/raft/election.go)

Locations:

  • failover.go:183, 224 - 3-second timeouts
  • election.go:86 - 300ms ticker

These should be configurable for different deployment scenarios (local dev vs production).

Recommendation: Move to configuration:

type FailoverConfig struct {
    ShutdownTimeout time.Duration  // default: 3s
    SyncCheckInterval time.Duration // default: 300ms
}

Severity: Low - Deployment flexibility


Security Considerations

10. Raft Transport Security

Location: pkg/raft/node.go:88

The RAFT transport uses TCP without TLS. For production deployments, this means raft communication (including block data) is unencrypted.

Current:

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Recommendation:

  • Document that raft should run on a trusted private network
  • Consider adding TLS support for the raft transport layer
  • Add configuration validation to warn if raft is exposed on public interfaces

Severity: Medium - Security posture for production deployments
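
If TLS support is added, hashicorp/raft accepts any raft.StreamLayer, so an encrypted transport can be swapped in without touching the rest of the node. A minimal sketch, assuming certificates are loaded elsewhere into a *tls.Config (newTLSTransport is a hypothetical helper, not part of the PR):

```go
package raftutil

import (
	"crypto/tls"
	"net"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// tlsStreamLayer satisfies raft.StreamLayer (net.Listener + Dial) over TLS.
type tlsStreamLayer struct {
	net.Listener
	tlsConfig *tls.Config
}

func (t *tlsStreamLayer) Dial(address raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	dialer := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(dialer, "tcp", string(address), t.tlsConfig)
}

// newTLSTransport mirrors raft.NewTCPTransport but encrypts peer traffic.
func newTLSTransport(bindAddr string, tlsConfig *tls.Config) (*raft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, tlsConfig)
	if err != nil {
		return nil, err
	}
	stream := &tlsStreamLayer{Listener: ln, tlsConfig: tlsConfig}
	return raft.NewNetworkTransport(stream, 3 /* maxPool */, 10*time.Second, os.Stderr), nil
}
```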


11. No Authentication for Raft RPC Endpoint

Location: pkg/rpc/server/http.go (new /raft/node endpoint)

The new RPC endpoint exposing raft status has no authentication. While it only reveals leadership status, this could leak cluster topology information.

Recommendation:

  • Add optional authentication middleware for the raft endpoint
  • Consider rate limiting to prevent information disclosure attacks
  • Document security implications in deployment guides

Severity: Low - Information disclosure
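
A sketch of optional bearer-token middleware in front of the endpoint; requireToken, raftStatusHandler, and the RAFT_STATUS_TOKEN variable are hypothetical names, not the PR's actual handler wiring:

```go
import "net/http"

// requireToken rejects requests that do not carry the configured bearer token;
// with an empty token the endpoint stays open, preserving today's behavior.
func requireToken(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if token != "" && r.Header.Get("Authorization") != "Bearer "+token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// Hypothetical wiring in the HTTP server setup:
// mux.Handle("/raft/node", requireToken(os.Getenv("RAFT_STATUS_TOKEN"), raftStatusHandler))
```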


Performance & Resource Efficiency

12. Unbounded Channel in FSM (pkg/raft/node.go)

Location: node.go:304-306, 327-333

The applyCh channel is set by the caller, and if full, updates are dropped with only a warning. This could cause follower nodes to miss state updates.

if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
    default:
        f.logger.Warn().Msg("apply channel full, dropping message")
    }
}

Recommendation:

  • Document the required buffer size for the apply channel (currently 100 in raft_retriever.go:71)
  • Add metrics to track dropped messages
  • Consider applying backpressure instead of silently dropping

Severity: Medium - Could cause state sync issues under load
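
One possible shape for the last two bullets above, as a sketch against the PR's fsm type; applyNotifyTimeout and droppedApplyMsgs are assumed names for a new constant and counter metric:

```go
// notifyApply waits briefly for the consumer instead of dropping immediately,
// and records any drop so the condition shows up in metrics, not just logs.
func (f *fsm) notifyApply(msg RaftApplyMsg) {
	if f.applyCh == nil {
		return
	}
	select {
	case f.applyCh <- msg:
	case <-time.After(applyNotifyTimeout): // assumed constant, e.g. 100ms
		droppedApplyMsgs.Inc() // assumed prometheus.Counter
		f.logger.Warn().Uint64("index", msg.Index).Msg("apply channel full, dropping message")
	}
}
```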


13. Inefficient State Cloning (pkg/raft/node.go)

Location: node.go:255-257

Every GetState() call performs a full protobuf clone, which could be expensive if called frequently.

func (n *Node) GetState() *RaftBlockState {
    return proto.Clone(n.fsm.state.Load()).(*RaftBlockState)
}

Recommendation:

  • Profile to determine if this is a bottleneck
  • If so, implement copy-on-write or return immutable views
  • Add comment documenting why cloning is necessary (prevents callers from mutating shared state)

Severity: Low - Potential optimization opportunity


14. Goroutine Leak Potential (pkg/raft/election.go)

Location: election.go:67-84

The startWorker function spawns goroutines but relies on wg.Wait() in the defer. If the function returns early (e.g., due to panic), goroutines could leak.

Recommendation:

  • Add recovery handling in the worker goroutine
  • Ensure context cancellation propagates correctly
  • Add goroutine leak detection in tests

Severity: Low - Defensive programming
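
For the test-side leak detection, go.uber.org/goleak can be wired into the package's TestMain; a minimal sketch:

```go
package raft_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain fails the package's tests if any goroutine started during the
// tests is still running when they finish.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```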


Testing & Reliability

15. Low Test Coverage (34.04%)

Key gaps:

  • pkg/raft/node.go: 11.05% (193 lines missing)
  • block/internal/executing/executor.go: Many raft-related paths untested
  • block/internal/syncing/syncer.go: Raft recovery paths at 0%

Recommendation:

  • Add unit tests for raft state machine transitions
  • Test all error paths in executor/syncer raft integration
  • Add tests for edge cases (e.g., raft ahead by >1 block, hash mismatches)
  • Consider using hashicorp/raft test helpers for cluster simulation

Severity: High - Critical for production confidence


16. Missing Tests for Concurrent State Access (block/internal/executing/executor.go, block/internal/syncing/syncer.go)

Both executor and syncer use atomic pointers for lastState, but there are no tests verifying thread-safety of state updates during raft transitions.

Recommendation:

  • Add race detector tests for state transitions
  • Test concurrent reads during leader elections
  • Verify no torn reads during failover

Severity: Medium - Concurrency correctness
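
A race-detector-oriented sketch of such a test, using a stand-in state type rather than the PR's types.State so it stays self-contained; run it with go test -race:

```go
package executing_test

import (
	"sync"
	"sync/atomic"
	"testing"
)

type fakeState struct{ Height uint64 }

// TestConcurrentStateSwap mimics readers observing lastState while a
// failover-style transition keeps swapping it.
func TestConcurrentStateSwap(t *testing.T) {
	var last atomic.Pointer[fakeState]
	last.Store(&fakeState{Height: 1})

	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				if s := last.Load(); s.Height == 0 {
					t.Error("torn read: zero height observed")
				}
			}
		}()
	}
	for i := uint64(2); i < 1000; i++ {
		last.Store(&fakeState{Height: i})
	}
	wg.Wait()
}
```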


17. E2E Tests Could Be More Comprehensive

Location: test/e2e/failover_e2e_test.go

The e2e tests cover basic failover but could test:

  • Multiple sequential failovers
  • Split-brain scenarios (network partitions)
  • Follower crashes and recovery
  • Rolling upgrades
  • Byzantine scenarios (malicious raft messages)

Recommendation: Expand test matrix to cover failure modes listed above.

Severity: Medium - Production resilience validation


Architecture & Code Quality

18. Raft State Validation Logic Duplicated

Locations:

  • block/internal/executing/executor.go:236-261, 305-322
  • block/internal/syncing/syncer.go:1252-1277

Both executor and syncer have nearly identical raft state validation logic. This should be extracted to a shared utility.

Recommendation:

// block/internal/common/raft_validator.go
func ValidateStateAgainstRaft(
    localHeight uint64,
    localHash []byte,
    raftState *raft.RaftBlockState,
    store store.Store,
) error {
    // Shared validation logic
}

Severity: Low - Code maintainability and DRY principle


19. Interface Segregation (block/internal/common/raft.go)

Location: block/internal/common/raft.go:9-18

The RaftNode interface is minimal but could be split further:

  • Leader-only operations (Broadcast, HasQuorum)
  • Observer operations (IsLeader, GetState)
  • Follower operations (SetApplyCallback)

Recommendation: Consider splitting if future features require more granular permissions.
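
A hypothetical split along those lines; the method signatures are inferred from how this review describes the interface, not copied from block/internal/common/raft.go:

```go
// Hypothetical role-scoped views of the current RaftNode interface.
type RaftLeader interface {
	Broadcast(ctx context.Context, state *RaftBlockState) error
	HasQuorum() bool
}

type RaftObserver interface {
	IsLeader() bool
	GetState() *RaftBlockState
}

type RaftFollower interface {
	SetApplyCallback(cb func(RaftApplyMsg))
}

// RaftNode stays as the union so existing call sites keep compiling.
type RaftNode interface {
	RaftLeader
	RaftObserver
	RaftFollower
}
```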

Severity: Low - Future extensibility


20. Magic Numbers in Raft Configuration

Locations:

  • pkg/raft/node.go:88 - maxPool=3, timeout=10*time.Second
  • pkg/raft/node.go:78 - snapCount passed but not validated

Recommendation:

  • Extract as named constants with documentation
  • Add validation for sensible ranges (e.g., snapCount > 0)

Severity: Low - Code clarity
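
A sketch of the extraction, with illustrative names:

```go
const (
	// raftTransportMaxPool caps the pooled connections kept per raft peer.
	raftTransportMaxPool = 3
	// raftTransportTimeout bounds dial and I/O time on the raft transport.
	raftTransportTimeout = 10 * time.Second
)

// validateSnapCount rejects a zero snapshot interval before it reaches raft.
func validateSnapCount(snapCount uint64) error {
	if snapCount == 0 {
		return fmt.Errorf("raft snapshot count must be greater than zero")
	}
	return nil
}
```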


Documentation & Developer Experience

21. Missing Documentation for Recovery Scenarios

The recovery logic in IsSynced and Recover methods is complex but lacks documentation on:

  • When recovery is triggered
  • What recovery does (rollback vs replay)
  • Expected behavior for different height deltas

Recommendation: Add comprehensive godoc:

// IsSyncedWithRaft checks if local state matches raft consensus state.
// Returns: 
//   - 0 if in sync (heights and hashes match)
//   - negative if local is behind raft (safe, will catch up via normal sync)
//   - positive if local is ahead of raft (CRITICAL: requires rollback)
// 
// Hash mismatches at the same height indicate state divergence and return an error.
func (s *Syncer) IsSyncedWithRaft(raftState *raft.RaftBlockState) (int, error)

Severity: Medium - Critical for maintainers


22. Configuration Documentation Gaps

The new raft configuration flags lack comprehensive documentation:

  • What happens if --raft.bootstrap is set on all nodes?
  • How to add/remove nodes from an existing cluster?
  • Recovery procedures for lost quorum

Recommendation:

  • Add detailed comments to pkg/config/config.go raft section
  • Create docs/raft-operations.md with runbooks for common scenarios
  • Add validation to prevent misconfiguration (e.g., bootstrap without peers)

Severity: Medium - Operational readiness
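
A validation sketch in the spirit of the last bullet; the RaftConfig field names (Enable, Bootstrap, Servers) are assumptions about the PR's config shape:

```go
// Validate sketches guardrails against the misconfigurations listed above.
func (c RaftConfig) Validate() error {
	if !c.Enable {
		return nil // nothing to check when raft is disabled
	}
	if c.Bootstrap && len(c.Servers) == 0 {
		return errors.New("raft.bootstrap is set but raft.servers is empty; the bootstrap node still needs the initial peer list")
	}
	return nil
}
```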


23. Unclear TODO Comment (block/internal/syncing/syncer.go)

Location: syncer.go:654

if event.Source == common.SourceDA { // TODO(@julienrbrt): To be reverted once DA Hints are merged

This TODO references PR #2891 but doesn't explain:

  • Why this check exists
  • What DA Hints will change
  • Timeline for the revert

Recommendation: Expand comment with context or link to ADR/design doc.

Severity: Low - Code maintainability


Positive Observations

✅ Excellent Safety Mechanisms

  • FSM synchronization before leadership transition (election.go:96-105)
  • Quorum checks before block production (executor.go:411-413, 480-483)
  • Hash validation at every state transition
  • Grace period handling for forced inclusion transactions

✅ Clean Separation of Concerns

  • failoverState encapsulates mode-switching components cleanly
  • Dynamic leader election abstracted behind interface
  • Raft integration via minimal interface reduces coupling

✅ Comprehensive E2E Testing

  • Leader failover scenarios thoroughly tested
  • Rolling restart validation
  • DA gap verification
  • Double-signing prevention checks

✅ Strong Error Handling

  • Most error paths have context wrapping
  • Critical errors sent to dedicated channel
  • Proper cleanup in defer blocks

Recommendations Summary

Immediate (Before Merge)

  1. Fix race condition in ticker handling (Issue #1)
  2. Improve error context for state validation (Issue #2)
  3. Add nil checks for raft leaderID (Issue #4)
  4. Improve configuration error messages (Issue #5)

Short Term (Next Sprint)

  1. Increase test coverage to >60% (Issue #15)
  2. Add concurrency tests (Issue #16)
  3. Document recovery scenarios (Issue #21)
  4. Add raft operations guide (Issue #22)

Medium Term (Next Release)

  1. Consider TLS for raft transport (Issue #10)
  2. Add authentication for raft RPC endpoint (Issue #11)
  3. Make timeouts configurable (Issue #9)
  4. Expand e2e test scenarios (Issue #17)

Conclusion

This is a solid implementation of a critical feature. The core raft integration is well-designed with strong safety guarantees. The main concerns are around test coverage, error context for production debugging, and documentation for operators. With the recommended fixes, this will be production-ready.

The failover logic is particularly well-thought-out, with careful handling of edge cases like FSM lag and state divergence. The e2e tests demonstrate that the happy path works reliably.

Recommendation: Approve with requested changes. Address critical issues #1, #2, #4, and #5 before merge. Track medium/low priority issues for follow-up work.



github-actions bot commented Jan 16, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Jan 26, 2026, 5:24 PM


Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally changes how Rollkit nodes handle high availability by integrating the RAFT consensus algorithm. The primary goal is to ensure continuous operation and data consistency across a cluster of sequencer nodes, even in the event of node failures. The changes involve a significant architectural shift to support dynamic role management (leader/follower) and robust state synchronization, allowing nodes to automatically recover and maintain a consistent view of the blockchain state. This feature is crucial for building resilient and fault-tolerant Rollkit deployments.

Highlights

  • RAFT Consensus Integration: Introduces the Raft consensus algorithm for high availability, enabling dynamic leader election and state replication among sequencer nodes using the hashicorp/raft library.
  • Dynamic Leader Election & Failover: Implements a DynamicLeaderElection mechanism that allows nodes to seamlessly transition between leader and follower roles. This includes logic for handling leader loss, ensuring FSM synchronization before leadership, and attempting recovery if a new leader is not synced.
  • Enhanced State Synchronization & Recovery: The Executor and Syncer components are updated to verify local state against the Raft replicated state. This includes automatic rollbacks if the execution layer is ahead, checks for block existence in the store before syncing, and robust recovery mechanisms from Raft state during failover scenarios.
  • Refactored Node Architecture: The node's P2P client creation and management are refactored, and a new failoverState structure is introduced to encapsulate components that need to be reset during mode switches (aggregator/sync). This simplifies the node's lifecycle management in a HA setup.
  • New RPC Endpoint for RAFT Status: Adds a /raft/node HTTP endpoint to query the Raft node's leadership status and ID, providing visibility into the consensus state.


gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and well-executed feature: high availability for the sequencer using the Raft consensus algorithm. The changes are extensive, refactoring core components to support dynamic leader election and failover between aggregator (leader) and sync (follower) roles. The new architecture appears robust, with strong consistency checks to prevent issues like state divergence or stale leaders. The addition of comprehensive end-to-end tests for failover scenarios is particularly commendable. My review includes a few suggestions to improve configuration handling and error messages, aiming to enhance the operator experience and debuggability of this critical new functionality.

Comment on lines +163 to 166
prevState, err = s.store.GetStateAtHeight(ctx, height-1)
if err != nil {
    return fmt.Errorf("failed to get previous state: %w", err)
}

medium

For better debuggability, it would be helpful to include the specific height that was being queried in the error message when getting the previous state fails.

Suggested change
 prevState, err = s.store.GetStateAtHeight(ctx, height-1)
 if err != nil {
-    return fmt.Errorf("failed to get previous state: %w", err)
+    return fmt.Errorf("failed to get previous state at height %d: %w", height-1, err)
 }

Comment on lines +121 to 122
return nil, fmt.Errorf("raft config must be used in sequencer setup only")
}

medium

The error message "raft config must be used in sequencer setup only" could be more explicit. It seems the design requires a node to be configured as a potential aggregator to participate in a Raft cluster. A clearer message would improve user experience when configuring nodes.

Suggested change
-return nil, fmt.Errorf("raft config must be used in sequencer setup only")
+return nil, fmt.Errorf("raft can only be enabled for aggregator nodes (aggregator flag must be true)")
 }

Comment on lines +139 to +141
if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
    return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
}

medium

Instead of returning an error for duplicate peers in the configuration, it would be more user-friendly to log a warning and proceed with the deduplicated list. The deduplicateServers function already handles the removal of duplicates, so failing fast might be overly strict for what is likely a minor configuration mistake.

Suggested change
 if svrs := deduplicateServers(cfg.Servers); len(svrs) != len(cfg.Servers) {
-    return fmt.Errorf("duplicate peers found in config: %v", cfg.Servers)
+    n.logger.Warn().Msgf("duplicate peers found in config, using deduplicated list: %v", svrs)
+    cfg.Servers = svrs
 }

@alpe marked this pull request as draft January 18, 2026 10:38

codecov bot commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 34.01060% with 747 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.41%. Comparing base (2325ace) to head (5fc0b6b).

Files with missing lines                   Patch %   Lines
pkg/raft/node.go                           11.05%    193 Missing ⚠️
pkg/raft/node_mock.go                      45.40%    74 Missing and 21 partials ⚠️
block/internal/executing/executor.go       2.32%     77 Missing and 7 partials ⚠️
block/internal/syncing/syncer.go           0.00%     83 Missing and 1 partial ⚠️
block/internal/syncing/raft_retriever.go   0.00%     75 Missing ⚠️
node/full.go                               32.30%    37 Missing and 7 partials ⚠️
pkg/raft/election.go                       67.20%    29 Missing and 12 partials ⚠️
node/failover.go                           73.38%    24 Missing and 13 partials ⚠️
tools/local-da/rpc.go                      0.00%     21 Missing ⚠️
block/internal/common/replay.go            50.00%    14 Missing and 6 partials ⚠️
... and 14 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2987      +/-   ##
==========================================
- Coverage   57.98%   55.41%   -2.57%     
==========================================
  Files         110      116       +6     
  Lines       10522    11505     +983     
==========================================
+ Hits         6101     6376     +275     
- Misses       3770     4429     +659     
- Partials      651      700      +49     
Flag       Coverage Δ
combined   55.41% <34.01%> (-2.57%) ⬇️

Flags with carried forward coverage won't be shown.

@alpe marked this pull request as ready for review January 19, 2026 08:42
* main:
  feat(tracing): tracing part 9 sequencer (#2990)
  build(deps): use mainline go-header (#2988)

```go
migrationmngrtypes "github.com/evstack/ev-abci/modules/migrationmngr/types"
// ...
)
```

this needs to be inverted to keep the count correct

@tac0turtle left a comment

left some comments, nice job.

We should get some docs; it's fine in a follow-up.

It would be good to do some testing after the merge on leadership-rotation latency and the expected time during which no block may be produced while a new leader is being elected.

Some of the comments from Claude look realistic; we should have tests for them at least.

alpe added 3 commits January 26, 2026 17:15
* main:
  fix(docs): remove blog link from sidebar to fix 404 (#3014)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.20 to 0.38.21 in /execution/evm/test in the go_modules group across 1 directory (#3011)
  refactor: use slices.Contains to simplify code (#3010)
  chore: Bump mermaid version and dependencies (#3009)
  chore: Bump github.com/consensys/gnark-crypto only (#3008)
  test: evm contract interaction (#3006)
  chore: remove redundant log (#3007)
  fix: return values correctly not nil (#3004)
  feat: expose execution client params to ev-node (#2982)
  feat(tracing): HTTP propagation (#3000)
  fix: deploy docs token (#3003)
  feat(tracing): add store tracing (#3001)
  feat: p2p exchange wrapper  (#2855)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2999)
  feat(tracing): adding forced inclusion tracing (#2997)
  chore: update calculator for strategies  (#2995)
  chore: adding tracing for da submitter (#2993)
  feat(tracing): part 10 da retriever tracing (#2991)
  chore: add da posting strategy to docs (#2992)