Output gene name for mapped targets #67

bencap · 2025-12-12T22:29:08Z

This pull request introduces a new mechanism for inferring and attaching gene symbol information to target annotations, with provenance, in the mapping pipeline. It adds the compute_target_gene_info function, which determines a single gene symbol per target using a prioritized approach, and integrates this logic into the mapping API. Additionally, it restructures the reference_sequences data structure to use a new TargetAnnotation class, and makes several supporting improvements and bug fixes.

Gene symbol inference and annotation improvements:

Added the compute_target_gene_info async function in annotate.py, which determines a single gene symbol per target using a prioritized strategy (selected transcript, alignment overlap, variant spans, or fallback to metadata), and returns a GeneInfo object with provenance. Supporting helper functions for overlap-based inference and interval merging were also added.
Integrated gene symbol inference into the map_scoreset API route: for each target, the computed gene info is attached to its TargetAnnotation in the response. [1] [2]
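
The prioritized strategy described above can be sketched as a simple fallback chain. This is a minimal illustration, not the code in annotate.py: `GeneInfoSketch`, `pick_gene_symbol`, and the lambda strategies are hypothetical names, the real function is async, and the sketch assumes a strategy that raises is skipped rather than aborting.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GeneInfoSketch:
    """Hypothetical stand-in for the PR's GeneInfo type."""

    hgnc_symbol: Optional[str]
    selection_method: Optional[str]


# A strategy is a (method name, zero-argument callable) pair; the callable
# returns a symbol or None. Both the alias and the shape are assumptions.
Strategy = Tuple[str, Callable[[], Optional[str]]]


def pick_gene_symbol(strategies: Sequence[Strategy]) -> GeneInfoSketch:
    """Return the first symbol found, tagged with the strategy that found it."""
    for method, strategy in strategies:
        try:
            symbol = strategy()
        except Exception:
            # Assumption: a failing strategy is skipped so the job keeps going.
            symbol = None
        if symbol:
            return GeneInfoSketch(symbol, method)
    return GeneInfoSketch(None, None)


def failing_lookup() -> Optional[str]:
    raise RuntimeError("lookup unavailable")


# Strategies are listed from highest to lowest priority.
info = pick_gene_symbol(
    [
        ("tx_selection", lambda: None),
        ("alignment_max_covered_bases", failing_lookup),
        ("variants_max_covered_bases", lambda: "BRCA1"),
        ("target_metadata", lambda: "TP53"),
    ]
)
print(info)
```

Recording the method name next to the symbol is what gives the result its provenance: a consumer can tell a transcript-derived symbol from one parsed out of user metadata.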

Data structure and schema changes:

Updated the reference_sequences structure in map_scoreset to use TargetAnnotation objects, which now include a layers attribute and a new gene_info field. Adjusted all code paths to reference the new structure. [1] [2] [3]
Added TargetAnnotation and GeneInfo imports to relevant modules, ensuring the new schema types are properly used throughout. [1] [2]
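
To make the restructured response easier to picture, here is a rough sketch of a per-target annotation carrying both `layers` and `gene_info`. The class names, field types, target key, and layer contents are all assumptions for illustration; the actual schema types in the PR may be defined differently.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class GeneInfoSketch:
    """Hypothetical stand-in for GeneInfo."""

    hgnc_symbol: Optional[str] = None
    selection_method: Optional[str] = None


@dataclass
class TargetAnnotationSketch:
    """Hypothetical stand-in for TargetAnnotation."""

    # Placeholder type: the real contents of `layers` are not shown in this PR text.
    layers: Dict[str, Any] = field(default_factory=dict)
    gene_info: Optional[GeneInfoSketch] = None


# One annotation per target, keyed by an illustrative target name.
reference_sequences: Dict[str, TargetAnnotationSketch] = {
    "target_1": TargetAnnotationSketch(layers={"example_layer": {"note": "placeholder"}})
}

# Gene info is attached after the layers have been built.
reference_sequences["target_1"].gene_info = GeneInfoSketch("BRCA1", "tx_selection")

# asdict recurses into nested dataclasses, giving a JSON-friendly payload.
payload = {name: asdict(annotation) for name, annotation in reference_sequences.items()}
print(payload)
```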

Supporting infrastructure and environment:

Added ENSEMBL_API_URL to the .env.dev settings file to support Ensembl API queries for gene overlap.
Imported request_with_backoff and ENSEMBL_API_URL in lookup.py to enable robust gene feature queries.
Minor: Added Any import in lookup.py for type hinting.

Bug fixes and code improvements:

Fixed score parsing logic in annotate.py to handle zero-valued scores correctly: the checks in multiple locations now test `is not None` rather than truthiness, so a score of 0 is no longer treated as missing. [1] [2] [3]
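
The class of bug being fixed is easy to reproduce in isolation. The two helpers below are hypothetical and only illustrate the pattern; they are not the functions touched in annotate.py.

```python
from typing import Optional


def keep_score_truthy(score: Optional[float]) -> Optional[str]:
    """Buggy pattern: 0.0 is falsy, so a legitimate zero score is dropped."""
    return str(score) if score else None


def keep_score_checked(score: Optional[float]) -> Optional[str]:
    """Fixed pattern: only a missing score (None) is dropped."""
    return str(score) if score is not None else None


for value in (1.5, 0.0, None):
    print(value, keep_score_truthy(value), keep_score_checked(value))
```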

These changes collectively improve the accuracy and transparency of gene symbol assignment in the API, and lay the groundwork for robust, provenance-aware gene annotation in downstream analyses.

Computes a new `gene_info` property for all mapped targets. This property is defined by an `hgnc_symbol` and a `selection_method`. The `hgnc_symbol` is the HGNC symbol of the gene to which the target relates. The `selection_method` records how that symbol was selected and may be one of:

- `tx_selection`: via the selected transcript
- `alignment_max_covered_bases`: based on the gene 'feature' (via Ensembl) which covered the most bases of the aligned target
- `variants_max_covered_bases`: same as `alignment_max_covered_bases`, but based on variant bases rather than aligned bases
- `target_metadata`: based on parsing the target metadata the user supplied
- `target_category`: no gene info was selected because the target was not protein coding (see #66)

Various helpers were added to `dcd_mapping.annotate` to support this calculation. Gene info selection should not cause job failures: if selection fails, the target is simply left without gene info.
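
The `*_max_covered_bases` methods rest on two small operations: merging overlapping intervals, then counting how many merged bases fall inside each candidate gene. The sketch below shows one way to do that. The function names, the half-open coordinate convention, and the first-wins tie-breaking are assumptions rather than the behavior of the helpers in `dcd_mapping.annotate`.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: half-open [start, end) coordinates. Ensembl itself reports
# 1-based inclusive coordinates, so real code would need a conversion.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(query: Iterable[Interval], feature: Interval) -> int:
    """Count query bases that fall inside a single feature span."""
    f_start, f_end = feature
    return sum(
        max(0, min(end, f_end) - max(start, f_start))
        for start, end in merge_intervals(query)
    )


def gene_with_max_coverage(
    query: List[Interval], genes: Dict[str, Interval]
) -> Optional[str]:
    """Pick the gene covering the most query bases; None if nothing overlaps."""
    best_symbol, best_cov = None, 0
    for symbol, span in genes.items():
        cov = covered_bases(query, span)
        if cov > best_cov:  # ties keep the first gene seen (an assumption)
            best_symbol, best_cov = symbol, cov
    return best_symbol


aligned = [(10, 20), (15, 30), (40, 50)]
genes = {"GENE_A": (0, 12), "GENE_B": (25, 45)}
print(merge_intervals(aligned))
print(gene_with_max_coverage(aligned, genes))
```

Passing aligned blocks as `query` corresponds to the alignment-based method; passing variant spans instead corresponds to the variant-based one.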

bencap · 2025-12-12T22:29:42Z

See #66 for information on computing this information for regulatory targets, which will be included in a future release.

bencap added 4 commits December 12, 2025 11:24

fix: Update score assignment to handle None values correctly in allele mapping

8c77d7d

feat: Add hgnc_symbol field to TxSelectResult and update protein reference selection to set it

fd2b69d

feat: Implement get_overlapping_features_for_region function

6598890

This function queries the Ensembl API with exponential backoff as needed, returning a list of features which overlap the passed region.
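
For readers unfamiliar with the pattern, the following is a rough sketch of a region-overlap query wrapped in exponential backoff. It is not the PR's `request_with_backoff` or `get_overlapping_features_for_region`: `overlap_url` and `call_with_backoff` are hypothetical helpers, the retry count and delays are arbitrary, and retrying only on `ConnectionError` is an assumption. The URL shape follows the Ensembl REST `overlap/region` endpoint.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def overlap_url(base_url: str, chrom: str, start: int, end: int) -> str:
    """Build an Ensembl REST URL for gene features overlapping a human region."""
    return f"{base_url.rstrip('/')}/overlap/region/human/{chrom}:{start}-{end}?feature=gene"


def call_with_backoff(
    fetch: Callable[[], T],
    retries: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), doubling the wait after each failed attempt."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise ValueError("retries must be at least 1")


# Demo with a fake fetcher that fails twice before succeeding, and a
# recording sleep so the example runs instantly and offline.
delays: List[float] = []
attempts = {"count": 0}


def flaky_fetch() -> List[dict]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [{"feature_type": "gene"}]


features = call_with_backoff(flaky_fetch, sleep=delays.append)
print(overlap_url("https://rest.ensembl.org/", "17", 43044295, 43125483))
print(features, delays)
```

Injecting `sleep` keeps the retry logic testable without real waits; in production code the default `time.sleep` would be used.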

bencap requested a review from sallybg December 12, 2025 22:29

bencap linked an issue Dec 12, 2025 that may be closed by this pull request

Determine Mapped Target HGNC Name During Mapping #55

Open

This was referenced Dec 13, 2025

Use new layers/gene info format in mapped target metadata VariantEffect/mavedb-api#611

Open

Use mapped HGNC name for assay facts gene text when available VariantEffect/mavedb-ui#597

Open
