@Nasf-Fan Nasf-Fan commented Jan 4, 2026

Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only 3 seconds, which helps the query move on quickly when some ranks are down. However, the short timeout increases the risk that the query fails with a timeout when the system has only a few targets and they are busy or not yet ready when queried.

This patch adds one more CRT_OPC_PROTO_QUERY RPC retry against a rank that previously reported an RPC timeout. The retry uses the default RPC timeout configuration instead of the small initial value.
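
For reviewers who want the intended control flow at a glance, here is a minimal standalone sketch of the retry idea. It is not the code in this patch. Every name in it (`proto_query_with_retry`, `query_fn_t`, `fake_query`, `struct fake_server`) is hypothetical. `DEFAULT_RPC_TIMEOUT_SEC` is an assumed value, and the choice to retry the most recent timed-out rank is also an assumption. Only the 3-second initial timeout and the -1011 timeout code come from this PR and its ticket.

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_QUERY_TIMEOUT_SEC 3       /* short first-pass timeout (from the PR text) */
#define DEFAULT_RPC_TIMEOUT_SEC   60      /* assumed default RPC timeout, illustration only */
#define ERR_TIMEDOUT              (-1011) /* DER_TIMEDOUT value quoted in the ticket title */

/* Hypothetical transport hook: query one rank, return 0 or a negative error. */
typedef int (*query_fn_t)(int rank, int timeout_sec, void *arg);

/*
 * First pass: query each rank with the short timeout and remember a rank
 * that timed out. If no rank answers, retry that rank once with the
 * default timeout before giving up.
 */
int proto_query_with_retry(const int *ranks, size_t nr, query_fn_t query, void *arg)
{
    int    timedout_rank = -1;
    int    rc = ERR_TIMEDOUT;
    size_t i;

    assert(ranks != NULL && query != NULL && nr > 0);

    for (i = 0; i < nr; i++) {
        rc = query(ranks[i], INITIAL_QUERY_TIMEOUT_SEC, arg);
        if (rc == 0)
            return 0;
        if (rc == ERR_TIMEDOUT)
            timedout_rank = ranks[i]; /* keep the most recent timed-out rank */
    }

    /* Every short-timeout attempt failed: one extra retry, default timeout. */
    if (timedout_rank >= 0)
        rc = query(timedout_rank, DEFAULT_RPC_TIMEOUT_SEC, arg);

    return rc;
}

/* Stand-in for a busy server: it answers only if given enough time. */
struct fake_server {
    int ready_after_sec; /* minimum timeout needed for a reply */
    int calls;           /* number of query attempts observed */
};

int fake_query(int rank, int timeout_sec, void *arg)
{
    struct fake_server *srv = arg;

    (void)rank;
    srv->calls++;
    return timeout_sec >= srv->ready_after_sec ? 0 : ERR_TIMEDOUT;
}
```

With two ranks and a fake server that needs 10 seconds to respond, both 3-second attempts time out and the single retry with the default timeout succeeds, for three calls in total. A server that never becomes ready still yields the timeout error after that one extra attempt.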

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions bot commented Jan 4, 2026

Ticket title is 'daos_rpc_proto_query() crt_proto_query()failed: DER_TIMEDOUT(-1011): 'Time out''
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18388

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/451/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/466/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch from b408c13 to 0863dd6 Compare January 5, 2026 06:39
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/5/execution/node/1176/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch from 0863dd6 to bd43c12 Compare January 6, 2026 05:16
Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is
only 3 seconds, which helps the query move on quickly when some
ranks are down. However, the short timeout increases the risk that
the query fails with a timeout when the system has only a few
targets and they are busy or not yet ready when queried.

This patch adds one more CRT_OPC_PROTO_QUERY RPC retry against a
rank that previously reported an RPC timeout. The retry uses the
default RPC timeout configuration instead of the small initial
value.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch from bd43c12 to bc0c86f Compare January 6, 2026 05:25
@Nasf-Fan Nasf-Fan marked this pull request as ready for review January 7, 2026 03:30
@Nasf-Fan Nasf-Fan requested review from a team as code owners January 7, 2026 03:30