bencap commented Dec 12, 2025

This pull request introduces a new mechanism for inferring and attaching gene symbol information to target annotations, with provenance, in the mapping pipeline. It adds the compute_target_gene_info function, which determines a single gene symbol per target using a prioritized approach, and integrates this logic into the mapping API. Additionally, it restructures the reference_sequences data structure to use a new TargetAnnotation class, and makes several supporting improvements and bug fixes.

Gene symbol inference and annotation improvements:

  • Added the compute_target_gene_info async function in annotate.py, which determines a single gene symbol per target using a prioritized strategy (selected transcript, alignment overlap, variant spans, or fallback to metadata), and returns a GeneInfo object with provenance. Supporting helper functions for overlap-based inference and interval merging were also added.
  • Integrated gene symbol inference into the map_scoreset API route: for each target, the computed gene info is attached to its TargetAnnotation in the response. [1] [2]
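The prioritized strategy can be pictured as a chain of resolvers tried in order, where the first one that yields a symbol wins and its name is recorded as provenance. The following is a minimal sketch of that idea, not the actual compute_target_gene_info implementation: `pick_gene_info`, the resolver lambdas, and the gene symbols are hypothetical placeholders, and the real function is async and works on mapping results rather than on callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GeneInfo:
    """Illustrative stand-in for the schema type: a symbol plus its provenance."""

    hgnc_symbol: Optional[str]
    selection_method: str


# A resolver returns a gene symbol, or None when it cannot decide.
Resolver = Callable[[], Optional[str]]


def pick_gene_info(resolvers: Sequence[Tuple[str, Resolver]]) -> Optional[GeneInfo]:
    """Try each (method name, resolver) pair in priority order; first hit wins."""
    for method, resolve in resolvers:
        symbol = resolve()
        if symbol:
            return GeneInfo(hgnc_symbol=symbol, selection_method=method)
    return None  # nothing could be inferred


# Placeholder resolvers: the selected transcript gives no answer here,
# so the alignment-overlap result is used and recorded as the method.
gene_info = pick_gene_info(
    [
        ("tx_selection", lambda: None),
        ("alignment_max_covered_bases", lambda: "BRCA1"),
        ("variants_max_covered_bases", lambda: "TP53"),
        ("target_metadata", lambda: "BRCA1"),
    ]
)
```

Keeping the method name next to the symbol is what lets a consumer of the API see why a given symbol was assigned.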

Data structure and schema changes:

  • Updated the reference_sequences structure in map_scoreset to use TargetAnnotation objects, which now include a layers attribute and a new gene_info field. Adjusted all code paths to reference the new structure. [1] [2] [3]
  • Added TargetAnnotation and GeneInfo imports to relevant modules, ensuring the new schema types are properly used throughout. [1] [2]
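For readers unfamiliar with the new shape, here is a rough sketch of how reference_sequences might look once each target maps to a TargetAnnotation. Only the layers and gene_info attributes, and the hgnc_symbol and selection_method fields, come from this PR; the use of dataclasses, the field types, the target key, and the layer key are assumptions made for illustration, and the real schema classes may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class GeneInfo:
    hgnc_symbol: Optional[str] = None
    selection_method: Optional[str] = None


@dataclass
class TargetAnnotation:
    # Assumption: layers maps a layer name to that layer's reference data.
    layers: Dict[str, Any] = field(default_factory=dict)
    gene_info: Optional[GeneInfo] = None


# Assumption: reference_sequences is keyed by target name.
reference_sequences: Dict[str, TargetAnnotation] = {
    "example_target": TargetAnnotation(layers={"example_layer": {}}),
}

# Attach the computed gene info to the target's annotation.
reference_sequences["example_target"].gene_info = GeneInfo(
    hgnc_symbol="BRCA1",  # placeholder symbol
    selection_method="tx_selection",
)
```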

Supporting infrastructure and environment:

  • Added ENSEMBL_API_URL to the .env.dev settings file to support Ensembl API queries for gene overlap.
  • Imported request_with_backoff and ENSEMBL_API_URL in lookup.py to enable robust gene feature queries.
  • Minor: Added Any import in lookup.py for type hinting.

Bug fixes and code improvements:

  • Fixed score parsing in annotate.py to handle zero-valued scores correctly: the affected checks now test "is not None" rather than truthiness, so a score of 0 is no longer treated as missing. [1] [2] [3]
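To illustrate the class of bug being fixed, the sketch below contrasts a truthiness check with an explicit None check. The function names are hypothetical and are not taken from annotate.py.

```python
from typing import Optional


def parse_score_truthy(raw: Optional[float]) -> Optional[float]:
    # Buggy pattern: 0 and 0.0 are falsy, so a real score of zero is dropped.
    return float(raw) if raw else None


def parse_score(raw: Optional[float]) -> Optional[float]:
    # Fixed pattern: only a genuinely missing value (None) is treated as absent.
    return float(raw) if raw is not None else None
```

With the truthiness check a score of 0 comes back as None, while the explicit check returns 0.0.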

These changes collectively improve the accuracy and transparency of gene symbol assignment in the API, and lay the groundwork for robust, provenance-aware gene annotation in downstream analyses.

The Ensembl lookup function queries the Ensembl API, retrying with exponential backoff as needed, and returns the list of features that overlap the passed region.
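A sketch of the retry pattern follows. It does not reproduce `request_with_backoff`, whose signature is not shown in this PR; `with_backoff` and `overlap_url` are hypothetical helpers, and the URL shape reflects my understanding of the public Ensembl REST overlap endpoint (`/overlap/region/:species/:region` with `feature=gene`), which should be checked against the Ensembl documentation. The fetch step is passed in as a callable so that the example runs without network access.

```python
import time
from typing import Any, Callable, List


def overlap_url(base_url: str, chromosome: str, start: int, end: int,
                species: str = "human") -> str:
    """Build a region-overlap query URL (assumed Ensembl REST layout)."""
    return f"{base_url}/overlap/region/{species}/{chromosome}:{start}-{end}?feature=gene"


def with_backoff(fetch: Callable[[], Any], retries: int = 4,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call fetch(), doubling the wait after each failure; re-raise on the last try."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


# Demo with a fake fetch that fails twice and then returns one feature.
calls = {"n": 0}
delays: List[float] = []


def flaky_fetch() -> List[dict]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return [{"feature_type": "gene", "external_name": "BRCA1"}]  # placeholder payload


features = with_backoff(flaky_fetch, sleep=delays.append)
```

Injecting `sleep` keeps the demo instant; in real use the default `time.sleep` would apply.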
This PR computes a new `gene_info` property for all mapped targets. The property consists of an `hgnc_symbol` and a `selection_method`.

The `hgnc_symbol` is the HGNC symbol of the gene to which the target relates. The `selection_method` records how that symbol was chosen and may be one of:
- `tx_selection`: via the selected transcript
- `alignment_max_covered_bases`: based on the gene 'feature' (via Ensembl) which covered the most bases of the aligned target
- `variants_max_covered_bases`: same as `alignment_max_covered_bases`, but based on variant bases rather than aligned bases
- `target_metadata`: based on parsing the target metadata the user supplied
- `target_category`: no gene info was selected because the target was not protein coding (see #66)
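The two `*_max_covered_bases` methods rest on the same calculation: merge the query spans so that overlapping bases are counted once, then pick the gene whose interval covers the most of those bases. Below is a minimal sketch under stated assumptions: intervals are half-open `(start, end)` pairs on one chromosome, gene features are reduced to a symbol and an interval, ties keep the first gene seen, and all names and coordinates are hypothetical. The helpers in `dcd_mapping.annotate` may differ.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumption of this sketch


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(gene: Interval, spans: List[Interval]) -> int:
    """Count the bases of the merged spans that fall inside the gene interval."""
    g_start, g_end = gene
    return sum(
        max(0, min(g_end, end) - max(g_start, start))
        for start, end in merge_intervals(spans)
    )


def max_covered_gene(genes: Dict[str, Interval],
                     spans: List[Interval]) -> Optional[str]:
    """Return the symbol covering the most bases, or None if nothing overlaps."""
    best: Optional[str] = None
    best_cov = 0
    for symbol, gene in genes.items():
        cov = covered_bases(gene, spans)
        if cov > best_cov:
            best, best_cov = symbol, cov
    return best


genes = {"GENE_A": (100, 200), "GENE_B": (180, 400)}  # placeholder features
spans = [(150, 190), (185, 260)]  # aligned blocks or variant positions
best_gene = max_covered_gene(genes, spans)
```

Here the spans merge to `(150, 260)`, which covers 50 bases of GENE_A and 80 bases of GENE_B, so GENE_B is selected. Passing aligned blocks gives the alignment-based method, and passing variant positions gives the variant-based one.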

Various helpers that support this calculation were added to `dcd_mapping.annotate`. Gene info selection should never cause a job to fail: if selection raises an error, no gene info is selected for that target and the job continues.
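One way to get this fail-soft behavior is to wrap the selection call so that any exception is logged and turned into a missing value. This is a sketch of the pattern only; `safe_select` is a hypothetical name and is not part of the PR.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def safe_select(compute: Callable[[], T]) -> Optional[T]:
    """Run a gene info computation; on any error, log it and return None."""
    try:
        return compute()
    except Exception:
        logger.exception("Gene info selection failed; continuing without gene info")
        return None
```

A caller can then attach the result to the target as usual and treat None as "no gene info selected".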

bencap commented Dec 12, 2025

See #66 for information on computing this information for regulatory targets, which will be included in a future release.

Development

Successfully merging this pull request may close these issues.

Determine Mapped Target HGNC Name During Mapping