@sidkhillon
Contributor

@sidkhillon sidkhillon commented Dec 26, 2025

Currently, sleepForRetries (the sleep time between retry attempts during replication) is only configurable globally via the replication.source.sleepforretries configuration property. This makes it impossible to tune behavior for individual replication peers that may have different requirements.

This change adds support for configuring sleepForRetries on a per-peer basis, with fallback to the global configuration when it is not set. It also adds UI support for displaying the field and shell support for editing the value.

This is related to #7578, because this change will not merge cleanly into branch-2.

skhillon added 5 commits December 26, 2025 08:14
This squashed commit combines 8 commits:
- Allow peers to override sleep config
- Dynamic config update
- Always get value
- Use protobuf instead of string
- Add to test
- Add shell command
- Use builder instead
- Update UI to include sleep
The previous commit incorrectly added methods (getStartPosition,
getRecoveredQueueStartPos, terminate) that don't exist in upstream master.
These were from the old branch base and should not be included.

@ndimiduk ndimiduk requested review from Apache9 and taklwu December 26, 2025 19:04

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 27s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 buf 0m 0s buf was not available.
+0 🆗 buf 0m 0s buf was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+0 🆗 mvndep 0m 20s Maven dependency ordering for branch
+1 💚 mvninstall 3m 3s master passed
+1 💚 compile 4m 57s master passed
+1 💚 checkstyle 1m 30s master passed
+1 💚 spotbugs 4m 29s master passed
+1 💚 spotless 0m 42s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for patch
+1 💚 mvninstall 2m 52s the patch passed
+1 💚 compile 4m 58s the patch passed
+1 💚 cc 4m 58s the patch passed
+1 💚 javac 4m 58s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 30s the patch passed
-0 ⚠️ rubocop 0m 10s /results-rubocop.txt The patch generated 12 new + 502 unchanged - 2 fixed = 514 total (was 504)
+1 💚 spotbugs 4m 50s the patch passed
+1 💚 hadoopcheck 11m 8s Patch does not cause any errors with Hadoop 3.3.6 3.4.1.
+1 💚 hbaseprotoc 1m 46s the patch passed
+1 💚 spotless 0m 42s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
51m 54s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7577
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless cc buflint bufcompat hbaseprotoc rubocop
uname Linux 4b57a8ef0ca5 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / cae2efb
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 85 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-server hbase-shell U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/3/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3 rubocop=1.37.1
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Dec 27, 2025

Since we have a Configuration object in ReplicationPeerConfig, what about just creating a combined Configuration instance and using it when creating the ReplicationSource? That way you can change sleepForRetries directly through the configuration.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 29s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for branch
+1 💚 mvninstall 3m 23s master passed
+1 💚 compile 2m 14s master passed
+1 💚 javadoc 1m 5s master passed
+1 💚 shadedjars 6m 15s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for patch
+1 💚 mvninstall 3m 10s the patch passed
+1 💚 compile 2m 14s the patch passed
+1 💚 javac 2m 14s the patch passed
+1 💚 javadoc 1m 5s the patch passed
+1 💚 shadedjars 6m 18s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 0m 34s hbase-protocol-shaded in the patch passed.
+1 💚 unit 1m 32s hbase-client in the patch passed.
+1 💚 unit 227m 33s hbase-server in the patch passed.
+1 💚 unit 7m 19s hbase-shell in the patch passed.
268m 59s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/3/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7577
Optional Tests javac javadoc unit compile shadedjars
uname Linux 6f712e9d0480 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / cae2efb
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/3/testReport/
Max. process+thread count 3729 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-server hbase-shell U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7577/3/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@sidkhillon
Contributor Author

sidkhillon commented Dec 27, 2025

Just to confirm, you're suggesting I use the existing configuration map in ReplicationPeerConfig and store the value as "replication.source.sleepforretries". Then, we can override the config by doing something like:

Configuration combinedConf = new Configuration(globalConf);
// Override any values in globalConf with the existing value in peerConfig
peerConfig.getConfiguration().forEach(combinedConf::set);
// set this.conf = combinedConf in the ReplicationSource

Set peer-specific override via update_peer_config '1', CONFIG => {"replication.source.sleepforretries" => 2000}

Is that the approach you are looking for?

@Apache9
Contributor

Apache9 commented Dec 28, 2025

Yes.

More specifically, we can change the code in ReplicationSourceManager.createSource.

HBase has a CompoundConfiguration class that can merge multiple Configurations together. See the ReplicationPeerConfigUtil.getPeerClusterConfiguration method for an example of its usage.

Thanks.
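
For readers following along, here is a minimal sketch of the layering idea discussed above. It uses plain java.util maps rather than the real Hadoop Configuration or HBase CompoundConfiguration classes, so it does not show the actual API. The class LayeredConf and its methods are hypothetical stand-ins; only the key replication.source.sleepforretries and the default of 1000 come from this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a merged configuration; not an HBase class. */
class LayeredConf {
  private final Map<String, String> merged = new HashMap<>();

  /** Later layers override earlier ones, so add the global layer first. */
  LayeredConf add(Map<String, String> layer) {
    merged.putAll(layer);
    return this;
  }

  /** Returns the merged value for the key, or the default when no layer sets it. */
  long getLong(String key, long defaultValue) {
    String value = merged.get(key);
    return value == null ? defaultValue : Long.parseLong(value.trim());
  }

  public static void main(String[] args) {
    String key = "replication.source.sleepforretries";
    Map<String, String> globalConf = Map.of(key, "1000");
    Map<String, String> peerConf = Map.of(key, "2000");

    // Peer layer added last, so its value takes precedence.
    LayeredConf withOverride = new LayeredConf().add(globalConf).add(peerConf);
    // Empty peer layer, so the global value is what a lookup sees.
    LayeredConf withoutOverride = new LayeredConf().add(globalConf).add(Map.of());

    System.out.println(withOverride.getLong(key, 1000));
    System.out.println(withoutOverride.getLong(key, 1000));
  }
}
```

With this shape, the consumer only ever reads one key from one merged view, which is presumably why the suggestion avoids a dedicated sleepForRetries field on the peer config.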

Copilot AI left a comment

Pull request overview

This pull request adds support for configuring the replication source sleepForRetries parameter on a per-peer basis, with fallback to the global configuration when not set. Previously, this parameter was only configurable globally via the replication.source.sleepforretries property.

  • Adds sleepForRetries field to ReplicationPeerConfig with builder support and protobuf serialization
  • Implements getSleepForRetries() method in ReplicationSource with fallback logic to global config when peer value is 0
  • Adds shell command set_peer_sleep_for_retries for managing the per-peer configuration

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
hbase-protocol-shaded/src/main/protobuf/server/master/Replication.proto Adds sleep_for_retries field to ReplicationPeer protobuf message
hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeerConfig.java Adds sleepForRetries field to data model with getter/setter and includes in toString()
hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationPeerConfigBuilder.java Adds setSleepForRetries() method to builder interface
hbase-client/src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationPeerConfigUtil.java Adds conversion logic for sleepForRetries between protobuf and Java objects
hbase-client/src/test/java/org/apache/hadoop/hbase/replication/ReplicationPeerConfigTestUtil.java Updates test utilities to include sleepForRetries in config generation and assertions
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java Implements getSleepForRetries() with fallback logic and updates all usage of sleepForRetries to call this method
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceInterface.java Adds getSleepForRetries() method to interface
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java Updates to use source.getSleepForRetries() instead of local field
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java Updates to use source.getSleepForRetries() instead of local field
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java Updates to use source.getSleepForRetries() instead of local field
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/ReplicationSourceDummy.java Implements getSleepForRetries() in test dummy class
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSource.java Updates mocks to return 0L for getSleepForRetries()
hbase-shell/src/main/ruby/shell/commands/set_peer_sleep_for_retries.rb Adds new shell command for setting per-peer sleep for retries
hbase-shell/src/main/ruby/hbase/replication_admin.rb Adds set_peer_sleep_for_retries() method implementation
hbase-shell/src/main/ruby/shell.rb Registers new set_peer_sleep_for_retries command
hbase-shell/src/test/ruby/hbase/replication_admin_test.rb Adds test for set_peer_sleep_for_retries command

public boolean isSerial() {
return serial;
}

Copilot AI Dec 29, 2025

The public getter method getSleepForRetries() is missing Javadoc documentation. For consistency with other public methods in this class and to help users understand the purpose and behavior of this configuration property, add Javadoc that explains:

  • What this value represents (sleep time between retries in milliseconds)
  • The default value (0, which means use global configuration)
  • When it's used (during replication retries)
Suggested change
/**
* Returns the per-peer sleep time between replication retries, in milliseconds.
* <p>
* A value of {@code 0} means that this peer does not define its own retry sleep and the
* global replication configuration should be used instead.
* This value is consulted when performing replication retries for this peer.
*
* @return sleep time between replication retries in milliseconds, or {@code 0} to use the
* global configuration
*/

Comment on lines 530 to 536
public long getSleepForRetries() {
long peerSleepForRetries = replicationPeer.getPeerConfig().getSleepForRetries();
if (peerSleepForRetries > 0) {
return peerSleepForRetries;
}
return this.conf.getLong("replication.source.sleepforretries", 1000);
}
Copilot AI Dec 29, 2025

The fallback check is inconsistent with the existing getCurrentBandwidth() pattern. Both treat 0 as "use default", but the bandwidth implementation (lines 519-523) checks != 0 while this implementation checks > 0, so a negative peer value would be used as-is for bandwidth but would fall back to the global default here. Consider using the same check as bandwidth (!= 0 instead of > 0) for consistency, or document the difference explicitly.
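
To make the difference between the two checks concrete, here is a small standalone sketch. FallbackChecks and both of its methods are hypothetical and only restate the > 0 check quoted above and the != 0 check this comment attributes to getCurrentBandwidth(); neither is copied from HBase.

```java
/** Hypothetical helper contrasting the two fallback checks; not HBase code. */
class FallbackChecks {
  /** Restates the check in this PR: only positive peer values override. */
  static long positiveOnly(long peerValue, long globalValue) {
    return peerValue > 0 ? peerValue : globalValue;
  }

  /** Restates the "!= 0" check suggested in the comment above. */
  static long nonZero(long peerValue, long globalValue) {
    return peerValue != 0 ? peerValue : globalValue;
  }

  public static void main(String[] args) {
    long global = 1000L;
    // The two checks agree for positive values and for 0, and differ for negatives.
    for (long peer : new long[] { 2000L, 0L, -1L }) {
      System.out.printf("peer=%d  >0 check: %d  !=0 check: %d%n",
        peer, positiveOnly(peer, global), nonZero(peer, global));
    }
  }
}
```

The two checks only disagree for negative inputs, so for a sleep duration the > 0 form may well be the safer choice, since a negative sleep time has no sensible meaning.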

Comment on lines 27 to 31
Examples:
# set sleep time to 2 seconds (2000ms) between retries for a peer
hbase> set_peer_sleep_for_retries '1', 2000
# unset sleep time for a peer to use the global default configured in server-side
Copilot AI Dec 29, 2025

The comment states "unset sleep time for a peer to use the global default configured in server-side" when passing 0, but this behavior differs from the bandwidth configuration pattern shown in the same file. In the bandwidth test (lines 609-622), 0 is also used as the initial/default value, but the bandwidth implementation treats 0 as "use default bandwidth". For consistency and clarity, consider documenting this explicitly in the help text that 0 has special meaning as a sentinel value for "use global default", and ensure users understand that setting it to 0 is not the same as setting it to the actual configured global default value.

Suggested change
Examples:
# set sleep time to 2 seconds (2000ms) between retries for a peer
hbase> set_peer_sleep_for_retries '1', 2000
# unset sleep time for a peer to use the global default configured in server-side
A value of 0 is treated specially as a sentinel meaning "use the global default
sleep time configured on the server side" rather than an explicit sleep time.
Examples:
# set sleep time to 2 seconds (2000ms) between retries for a peer
hbase> set_peer_sleep_for_retries '1', 2000
# unset the peer-specific sleep time so that this peer uses the global default
# configured on the server side (0 is a sentinel and does not change that default)

Comment on lines 530 to 536
public long getSleepForRetries() {
long peerSleepForRetries = replicationPeer.getPeerConfig().getSleepForRetries();
if (peerSleepForRetries > 0) {
return peerSleepForRetries;
}
return this.conf.getLong("replication.source.sleepforretries", 1000);
}
Copilot AI Dec 29, 2025

Missing test coverage for the new getSleepForRetries() method and its fallback logic. While the existing tests mock getSleepForRetries() to return 0L, there's no test that verifies:

  1. When peer config has a positive value (e.g., 2000), it returns that value
  2. When peer config is 0, it falls back to the global configuration value
  3. The integration between peer-level and global configuration

Consider adding a dedicated unit test in TestReplicationSource.java that verifies this fallback behavior, similar to how bandwidth is tested in other parts of the codebase.
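
A sketch of the three cases listed above, written as plain Java checks without JUnit or Mockito so that it runs standalone. SleepResolver is hypothetical and simply restates the getSleepForRetries() logic quoted in this review, with the global value of 1500 chosen arbitrarily; a real test in TestReplicationSource would exercise ReplicationSource itself.

```java
import java.util.Map;

/** Hypothetical restatement of the fallback logic under review; not HBase code. */
class SleepResolver {
  static final String KEY = "replication.source.sleepforretries";
  static final long DEFAULT_MS = 1000L;

  /** Peer value wins when positive, then the global setting, then the default. */
  static long resolve(long peerSleepForRetries, Map<String, String> globalConf) {
    if (peerSleepForRetries > 0) {
      return peerSleepForRetries;
    }
    String global = globalConf.get(KEY);
    return global == null ? DEFAULT_MS : Long.parseLong(global);
  }

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    Map<String, String> globalConf = Map.of(KEY, "1500");
    // 1. A positive peer value wins over the global setting.
    check(resolve(2000L, globalConf) == 2000L, "peer value should be used");
    // 2. A peer value of 0 falls back to the global setting.
    check(resolve(0L, globalConf) == 1500L, "global value should be used");
    // 3. With neither set, the hard-coded default applies.
    check(resolve(0L, Map.of()) == DEFAULT_MS, "default should be used");
    System.out.println("all fallback checks passed");
  }
}
```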

* Get the sleep time for retries. Check peer config first, if set use it, otherwise fall back to
* global configuration.
* @return sleep time in milliseconds
*/
Copilot AI Dec 29, 2025

This method overrides ReplicationSourceInterface.getSleepForRetries; it is advisable to add an Override annotation.

Suggested change
*/
*/
@Override

this.conf.getInt(WAIT_ON_ENDPOINT_SECONDS, DEFAULT_WAIT_ON_ENDPOINT_SECONDS);
decorateConf();
// 1 second
this.sleepForRetries = this.conf.getLong("replication.source.sleepforretries", 1000);
Contributor

nit: do you think sleepForRetries would change dynamically after initialization? IIRC it is only reloaded via refreshSources (e.g. via updateReplicationPeerConfig), so this value would only be created once per peer configuration or refresh.

so, can we keep this variable and just call getSleepForRetries() once within ReplicationSource.java?
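
A minimal sketch of the "resolve once, keep the field" idea, assuming (as the comment recalls) that a peer config change always leads to a refresh. CachedSleepSource is hypothetical, and modelling a refresh as a second call to init() is a simplification made for this example, not a description of how refreshSources works.

```java
import java.util.function.LongSupplier;

/** Hypothetical source that resolves its retry sleep once per (re)initialization. */
class CachedSleepSource {
  private final LongSupplier resolver;
  private long sleepForRetries;

  CachedSleepSource(LongSupplier resolver) {
    this.resolver = resolver;
    init();
  }

  /** Called at construction and again whenever the source is refreshed. */
  void init() {
    this.sleepForRetries = resolver.getAsLong();
  }

  /** Returns the cached value without consulting the peer config again. */
  long getSleepForRetries() {
    return sleepForRetries;
  }

  public static void main(String[] args) {
    long[] peerValue = { 2000L };
    CachedSleepSource source =
      new CachedSleepSource(() -> peerValue[0] > 0 ? peerValue[0] : 1000L);

    peerValue[0] = 500L;
    System.out.println(source.getSleepForRetries()); // still the value resolved at init

    source.init(); // stands in for a refresh after a peer config update
    System.out.println(source.getSleepForRetries()); // now reflects the updated peer value
  }
}
```

The trade-off is that the hot retry paths read a plain field, while a config change that does not trigger a refresh would go unnoticed.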
