perf: Optimize clustering and I/O (4.4x faster segmented clustering) #579
Conversation
Optimize several functions by using `ds.variables` instead of iterating over `data_vars.items()` or accessing `ds[name]`, which triggers slow `_construct_dataarray` calls.

Changes:
- io.py: `save_dataset_to_netcdf`, `load_dataset_from_netcdf`, `_reduce_constant_arrays`
- structure.py: `from_dataset` (use coord_cache pattern)
- core.py: `drop_constant_arrays` (use numpy operations)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch; check the settings in the CodeRabbit UI.

📝 Walkthrough

Adds multidimensional clustering (Clustering / ClusteringResults), VariableCategory-based variable classification, segmented-timesteps support (RangeIndex), a centralized FlowSystem ↔ xarray I/O layer (FlowSystemDatasetIO), updated TransformAccessor clustering API (ClusterConfig/ExtremeConfig/SegmentConfig), and an IO benchmark script.
Sequence Diagram(s)

sequenceDiagram
participant User as User
participant FS as FlowSystem
participant TA as TransformAccessor
participant CL as Clustering
participant IO as FlowSystemDatasetIO
participant XR as xarray.Dataset
Note over User,FS: create / configure system
User->>FS: create(timesteps, periods, components)
FS->>FS: init (DatetimeIndex | RangeIndex), record variable_categories
Note over User,TA: clustering workflow
User->>TA: cluster(n_clusters, data_vars, extremes=ExtremeConfig(...))
TA->>TA: clustering_data() → prepare inputs
TA->>CL: call tsam.aggregate & build ClusteringResults
CL->>CL: build cluster_assignments, timesteps_per_cluster, metrics
TA->>FS: produce reduced FlowSystem with Clustering attached
Note over User,FS: solve & attach solution
User->>FS: solve()
FS->>FS: store solution, propagate variable_categories
Note over User,FS: expand & IO
User->>TA: expand(reduced_solution)
TA->>CL: expand_data(aggregated, segmentation-aware)
CL->>FS: produce expanded solution
User->>FS: to_dataset(include_solution=True, include_original_data=True)
FS->>IO: flow_system_to_dataset(...)
IO->>XR: construct xr.Dataset (solution, clustering, categories...)
User->>XR: save/load dataset
XR->>IO: restore_flow_system_from_dataset(ds)
IO->>CL: reconstruct Clustering from serialized data
IO->>FS: return FlowSystem with clustering and variable_categories
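Read as user-facing calls, the diagram corresponds roughly to the sketch below. Method and parameter names are taken from the diagram and from examples quoted later in this review; `build_flow_system` is a hypothetical helper, and the exact `expand` call is an assumption.

```python
import flixopt as fx

fs = build_flow_system()  # hypothetical helper that creates timesteps, periods and components

# Clustering workflow (ClusterConfig/ExtremeConfig/SegmentConfig configure the tsam call)
fs_clustered = fs.transform.cluster(n_clusters=8, cluster_duration='1D')

# Solve the reduced system and attach the solution
fs_clustered.optimize(fx.solvers.HighsSolver())

# Expand the reduced solution back to the original timesteps (exact signature assumed)
fs_expanded = fs_clustered.transform.expand()

# IO round-trip through the centralized dataset layer
ds = fs_clustered.to_dataset(include_solution=True, include_original_data=True)
fs_restored = fx.FlowSystem.from_dataset(ds)
```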
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Use ds.variables for faster access in clustering/base.py:
- _create_reference_structure: original_data and metrics iteration
- compare plot: duration_curve generation with direct numpy indexing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@benchmarks/benchmark_io_performance.py`:
- Around line 138-146: The benchmark currently calls
fs_clustered.optimize(fx.solvers.GurobiSolver()) which will fail at solve time
if Gurobi is not available; change this to detect/handle solver availability by
wrapping the optimize call in a try/except that catches solver runtime errors
and falls back to fx.solvers.HighsSolver(), or programmatically choose the
solver before calling fs_clustered.optimize by attempting to use
fx.solvers.GurobiSolver() and on failure selecting fx.solvers.HighsSolver();
update the call sites referenced (create_large_flow_system,
fs.transform.cluster, fs_clustered.optimize, fx.solvers.GurobiSolver) so the
benchmark either falls back to HighsSolver or logs a clear error and exits.
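A minimal sketch of the fallback idea, assuming `gurobipy` is the backend that `fx.solvers.GurobiSolver` ultimately needs at solve time (the import probe is an assumption; the solver classes are the ones named in the prompt above):

```python
import flixopt as fx


def pick_solver():
    """Prefer Gurobi, but fall back to HiGHS when the Gurobi backend is unavailable."""
    try:
        import gurobipy  # noqa: F401  - availability probe; GurobiSolver itself constructs fine without it
        return fx.solvers.GurobiSolver()
    except ImportError:
        print('Gurobi not available, falling back to HiGHS')
        return fx.solvers.HighsSolver()


# fs_clustered.optimize(pick_solver())
```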
In `@docs/notebooks/08c2-clustering-storage-modes.ipynb`:
- Around line 174-175: Remove the invalid import "from tsam.config import
ExtremeConfig" and replace any usage of ExtremeConfig (e.g.,
extremes=ExtremeConfig(...)) with the TimeSeriesAggregation API by setting the
extremePeriodMethod parameter (one of "None", "append", "new_cluster_center",
"replace_cluster_center") and the corresponding flags like addPeakMin,
addPeakMax, addMeanMin, addMeanMax on the TimeSeriesAggregation instance; ensure
any code that passed extremes=ExtremeConfig(...) is updated to configure extreme
handling via TimeSeriesAggregation.extremePeriodMethod and the addPeak*/addMean*
arguments instead.
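For orientation, this is roughly what the tsam-native configuration looks like in the classic `TimeSeriesAggregation` API the suggestion refers to (parameter values and the `raw_df` DataFrame are illustrative; the forked tsam used by this PR may expose a different interface):

```python
import tsam.timeseriesaggregation as tsam

aggregation = tsam.TimeSeriesAggregation(
    raw_df,  # illustrative pandas DataFrame with one column per time series
    noTypicalPeriods=8,
    hoursPerPeriod=24,
    clusterMethod='hierarchical',
    extremePeriodMethod='new_cluster_center',  # or "None", "append", "replace_cluster_center"
    addPeakMax=['heat_demand'],  # series whose maximum should force an extreme period
)
typical_periods = aggregation.createTypicalPeriods()
```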
In `@docs/notebooks/08d-clustering-multiperiod.ipynb`:
- Around line 283-285: The printed header uses hardcoded "period=2024,
scenario=High" which can diverge from the indices used; retrieve the actual
labels from the dataset/coords used to produce cluster_assignments (e.g., the
period and scenario coordinate arrays referenced when you computed
cluster_assignments) and use those values in the header instead of fixed text,
then keep the existing loop over cluster_assignments and day_names to print each
day's cluster; reference the variables/objects cluster_assignments and day_names
and the same period/scenario coord variables you used earlier to select the
slice so the header is always derived from the data.
In `@flixopt/components.py`:
- Around line 1507-1512: The decay calculation uses mean timestep duration which
underestimates total hours per cluster; update the reduction on
self._model.timestep_duration (used when computing decay_n) to compute the total
hours for the cluster (e.g., use
_scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum') or equivalent)
so decay_n = (1 - self.element.relative_loss_per_hour) **
total_hours_per_cluster correctly reflects summed duration while preserving
DataArray semantics and usage of _scalar_safe_reduce and relative_loss_per_hour.
In `@flixopt/flow_system.py`:
- Around line 210-218: The code currently allows self.timestep_duration to be
None when timesteps is a RangeIndex (segmented system), which breaks downstream
logic (e.g., weights['time'], temporal_weight); add an early validation in the
constructor or initializer where timestep_duration and
computed_timestep_duration are handled: detect the segmented case (timesteps is
a RangeIndex or computed_timestep_duration is None) and raise a clear ValueError
if timestep_duration is not provided, mentioning that RangeIndex requires an
explicit timestep_duration; keep existing logic for assigning
self.timestep_duration (including using fit_to_model_coords('timestep_duration',
computed_timestep_duration) when available) but enforce this guard before
downstream computation.
In `@flixopt/io.py`:
- Around line 1587-1603: _fast_get_dataarray currently only includes coordinates
whose names match variable.dims, dropping auxiliary coordinates whose dims are a
subset of the variable's dims; update the coords build in _fast_get_dataarray to
include any coord_cache entry where coord.dims is a subset of variable.dims (not
just keys equal to dims), so auxiliary coords are preserved when constructing
the xr.DataArray (use variable.dims and each coord.dims to decide inclusion and
pass the resulting coords dict to xr.DataArray).
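A sketch of the subset-based coordinate selection the comment asks for; it mirrors the `_fast_get_dataarray` snippet quoted further down, with only the coords filter changed:

```python
import xarray as xr


def _fast_get_dataarray(ds: xr.Dataset, name: str, coord_cache: dict[str, xr.DataArray]) -> xr.DataArray:
    """Construct a DataArray from a Variable, keeping any cached coord defined on a subset of its dims."""
    variable = ds.variables[name]
    dims = set(variable.dims)
    # Keep dimension coords *and* auxiliary coords whose dims are a subset of the variable's dims
    coords = {k: coord for k, coord in coord_cache.items() if set(coord.dims) <= dims}
    return xr.DataArray(variable, coords=coords, name=name)
```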
In `@flixopt/transform_accessor.py`:
- Around line 1379-1385: Add a guard to detect when drop_constant_arrays removes
all variables before calling to_dataframe(): after computing
temporaly_changing_ds_for_clustering =
drop_constant_arrays(ds_slice_for_clustering, dim='time') check whether
temporaly_changing_ds_for_clustering is empty (e.g., has no data_vars or no
variables) and raise a clear ValueError mentioning the input filtering
(data_vars/selector) and that there is no varying data to cluster, instead of
calling temporaly_changing_ds_for_clustering.to_dataframe() and allowing a
cryptic indexing error; update the code paths that assign df_for_clustering
accordingly so downstream logic only runs when df_for_clustering is valid.
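A hedged sketch of the guard, reusing the variable names from the prompt above (the surrounding context is assumed):

```python
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')

if not temporaly_changing_ds_for_clustering.data_vars:
    raise ValueError(
        'No time-varying data left to cluster after dropping constant arrays. '
        'Check the data_vars / selector used to filter the clustering input.'
    )

df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
```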
- Around line 1902-1914: The current indexing misuses timestep_mapping modulo
n_segments for time positions when clustering.is_segmented, causing wrong
time_indices and bad SOC interpolation; change the time index calculation so
time_indices = timestep_mapping % clustering.timesteps_per_cluster (always use
timesteps_per_cluster for the time position), keep cluster_indices =
timestep_mapping // time_dim_size (where time_dim_size remains
clustering.n_segments when segmented for clustering into clusters), and then use
those cluster_indices/time_indices to index segment_assignments,
position_within_segment and segment_durations (variables: timestep_mapping,
clustering.is_segmented, clustering.n_segments,
clustering.timesteps_per_cluster, cluster_indices, time_indices,
segment_assignments, position_within_segment, segment_durations, seg_indices,
positions, durations).
In `@pyproject.toml`:
- Around line 67-68: The pyproject.toml currently lists a git URL dependency
"tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased" in the [full] and
[dev] extras which will break PyPI publication; remove that git-based
requirement from both the [full] and [dev] extras and either (a) replace it with
a proper PyPI version specifier (e.g., "tsam >= 3.0.0, < 4") once tsam is
published, (b) omit tsam from those extras and add a note in your README/INSTALL
docs instructing users to install tsam from source, or (c) vendor/include tsam
in the package; finally, if this is intended only for development, update the
release workflow (the verify-pypi job) to skip PyPI verification or otherwise
allow dev-only dependencies so the release will pass.
🧹 Nitpick comments (2)
flixopt/structure.py (1)
901-908: Avoid eager `.values` on potentially lazy arrays.
`array.values` will fully materialize dask-backed data. If datasets can be opened lazily, this can blow memory and negate the IO perf gains. Consider guarding for lazy arrays or using a reduction that doesn't materialize the full array.

🔧 Lazy-safe null check (example)
```diff
-        has_nulls = (np.issubdtype(array.dtype, np.floating) and np.any(np.isnan(array.values))) or (
-            array.dtype == object and pd.isna(array.values).any()
-        )
+        data = array.data
+        if hasattr(data, "compute"):
+            has_nulls = bool(array.isnull().any().compute())
+        else:
+            has_nulls = (np.issubdtype(array.dtype, np.floating) and np.any(np.isnan(array.values))) or (
+                array.dtype == object and pd.isna(array.values).any()
+            )
```

To confirm whether lazy loading is possible in your IO path, you can locate the NetCDF loader implementation:
```bash
#!/bin/bash
rg -n "def load_dataset_from_netcdf" -g '*.py'
rg -n "open_dataset|load_dataset" -g '*.py'
```

flixopt/clustering/base.py (1)
528-534: Preserve dimension order in `_get_dim_values` to keep `.isel()` predictable.
`sorted(set(...))` reorders period/scenario values and can make `.isel()` indices differ from the original order (e.g., FlowSystem period/scenario lists). Consider preserving first-seen order from the results keys.

♻️ Proposed refactor
```diff
-        return sorted(set(k[idx] for k in self._results.keys()))
+        values: list = []
+        seen = set()
+        for key in self._results.keys():
+            val = key[idx]
+            if val not in seen:
+                values.append(val)
+                seen.add(val)
+        return values
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (39)
- benchmarks/benchmark_io_performance.py
- docs/notebooks/01-quickstart.ipynb
- docs/notebooks/02-heat-system.ipynb
- docs/notebooks/03-investment-optimization.ipynb
- docs/notebooks/04-operational-constraints.ipynb
- docs/notebooks/05-multi-carrier-system.ipynb
- docs/notebooks/06a-time-varying-parameters.ipynb
- docs/notebooks/06b-piecewise-conversion.ipynb
- docs/notebooks/06c-piecewise-effects.ipynb
- docs/notebooks/08a-aggregation.ipynb
- docs/notebooks/08c-clustering.ipynb
- docs/notebooks/08c2-clustering-storage-modes.ipynb
- docs/notebooks/08d-clustering-multiperiod.ipynb
- docs/notebooks/08e-clustering-internals.ipynb
- docs/notebooks/08f-clustering-segmentation.ipynb
- docs/notebooks/09-plotting-and-data-access.ipynb
- docs/notebooks/10-transmission.ipynb
- docs/user-guide/optimization/clustering.md
- docs/user-guide/optimization/index.md
- flixopt/clustering/__init__.py
- flixopt/clustering/base.py
- flixopt/clustering/intercluster_helpers.py
- flixopt/components.py
- flixopt/core.py
- flixopt/effects.py
- flixopt/elements.py
- flixopt/features.py
- flixopt/flow_system.py
- flixopt/io.py
- flixopt/modeling.py
- flixopt/statistics_accessor.py
- flixopt/structure.py
- flixopt/transform_accessor.py
- pyproject.toml
- tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py
- tests/test_cluster_reduce_expand.py
- tests/test_clustering/test_base.py
- tests/test_clustering/test_integration.py
- tests/test_clustering_io.py
🧰 Additional context used
🧬 Code graph analysis (12)
tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py (2)
- flixopt/flow_system.py (2): copy (964-989), transform (1681-1699)
- flixopt/transform_accessor.py (1): cluster (1194-1478)

benchmarks/benchmark_io_performance.py (2)
- flixopt/flow_system.py (6): FlowSystem (61-2604), transform (1681-1699), to_dataset (686-724), from_dataset (727-749), solution (1461-1478), solution (1481-1484)
- flixopt/io.py (2): to_dataset (1785-1828), from_dataset (1528-1563)

flixopt/modeling.py (1)
- flixopt/structure.py (3): VariableCategory (82-129), add_variables (204-220), add_variables (1740-1768)

flixopt/clustering/__init__.py (1)
- flixopt/clustering/base.py (2): Clustering (636-1346), ClusteringResults (161-633)

flixopt/structure.py (1)
- flixopt/flow_system.py (1): variable_categories (1487-1493)

flixopt/statistics_accessor.py (2)
- flixopt/structure.py (3): VariableCategory (82-129), solution (290-338), solution (1288-1302)
- flixopt/flow_system.py (3): get_variables_by_category (1495-1547), solution (1461-1478), solution (1481-1484)

tests/test_cluster_reduce_expand.py (1)
- flixopt/clustering/base.py (14): n_clusters (316-318), n_clusters (672-674), cluster_assignments (338-349), cluster_assignments (778-784), timesteps_per_cluster (321-323), timesteps_per_cluster (677-679), sel (235-254), sel (714-761), clusters (1603-1706), dims (217-219), dims (697-699), timestep_mapping (814-823), coords (227-233), coords (702-712)

flixopt/features.py (3)
- flixopt/structure.py (9): FlowSystemModel (188-470), VariableCategory (82-129), add_variables (204-220), add_variables (1740-1768), get_coords (415-450), get_coords (1818-1823), dims (350-352), label_full (1284-1285), label_full (1849-1850)
- flixopt/clustering/base.py (4): coords (227-233), coords (702-712), dims (217-219), dims (697-699)
- flixopt/elements.py (1): label_full (638-639)

flixopt/flow_system.py (3)
- flixopt/structure.py (9): CompositeContainerMixin (1537-1713), Element (1241-1341), FlowSystemModel (188-470), Interface (473-1238), VariableCategory (82-129), dims (350-352), to_dataset (1038-1061), to_netcdf (1063-1090), copy (1218-1230)
- flixopt/clustering/base.py (5): coords (227-233), coords (702-712), dims (217-219), dims (697-699), is_segmented (764-770)
- flixopt/io.py (4): to_dataset (1785-1828), flow_system_to_dataset (1945-1968), restore_flow_system_from_dataset (1927-1942), save_dataset_to_netcdf (532-586)

flixopt/transform_accessor.py (2)
- flixopt/structure.py (3): VariableCategory (82-129), values (1663-1668), values (1969-1970)
- flixopt/clustering/base.py (13): Clustering (636-1346), n_clusters (316-318), n_clusters (672-674), is_segmented (764-770), values (304-306), values (1321-1328), n_segments (331-333), n_segments (773-775), timesteps_per_cluster (321-323), timesteps_per_cluster (677-679), sel (235-254), sel (714-761), results (1236-1247)

flixopt/components.py (3)
- flixopt/structure.py (4): FlowSystemModel (188-470), VariableCategory (82-129), add_variables (204-220), add_variables (1740-1768)
- flixopt/clustering/base.py (7): n_clusters (316-318), n_clusters (672-674), timesteps_per_cluster (321-323), timesteps_per_cluster (677-679), n_original_clusters (687-689), cluster_assignments (338-349), cluster_assignments (778-784)
- flixopt/modeling.py (1): _scalar_safe_reduce (57-76)

flixopt/effects.py (1)
- flixopt/structure.py (3): Submodel (1716-1914), VariableCategory (82-129), register_class_for_io (148-157)
🔇 Additional comments (91)
docs/notebooks/06b-piecewise-conversion.ipynb (1)
208-217: Notebook metadata standardization looks good.The added metadata fields (
codemirror_mode,file_extension,mimetype,nbconvert_exporter,pygments_lexer) are standard Jupyter notebook fields that improve editor compatibility and export behavior.Note: The Python version was changed from
3.12.7to3.11.11. If this is intentional (e.g., to match your CI/dev environment), this is fine. Just verify it aligns with project requirements.docs/notebooks/06c-piecewise-effects.ipynb (2)
315-323: Standard Jupyter metadata additions look good.The added metadata fields (
codemirror_mode,file_extension,mimetype,nbconvert_exporter,pygments_lexer) are standard Jupyter notebook metadata that enable proper syntax highlighting and notebook export functionality.
324-324: Verify the Python version downgrade.The Python version was changed from
"3.12.7"to"3.11.11". Please confirm whether this downgrade is intentional (e.g., for CI/environment standardization) or an artifact of re-executing the notebook in a different environment.If this notebook was simply re-run in your current environment, consider whether committing the version change is necessary, especially if the project supports multiple Python versions.
docs/notebooks/04-operational-constraints.ipynb (1)
475-487: LGTM!Standard Jupyter notebook metadata enrichment adding
language_infofields for editor integration and export functionality. This aligns with the metadata updates across other notebooks in the repository.docs/notebooks/05-multi-carrier-system.ipynb (1)
544-554: LGTM!Consistent metadata enrichment matching the pattern applied to other notebooks. No functional changes.
docs/notebooks/10-transmission.ipynb (1)
636-646: LGTM!Standard notebook metadata update consistent with the other notebooks in this PR.
docs/notebooks/09-plotting-and-data-access.ipynb (1)
834-844: LGTM!Consistent notebook metadata update aligning with repository standards.
pyproject.toml (1)
86-86: Consistent with[full]extras dependency.The
[dev]dependency mirrors the git reference used in[full], maintaining consistency across installation profiles.docs/notebooks/06a-time-varying-parameters.ipynb (1)
311-324: LGTM!The metadata enrichment with
language_info(Python 3.11.11) aligns with the consistent notebook metadata pattern across the repository.docs/notebooks/08f-clustering-segmentation.ipynb (5)
152-166: LGTM!Clear demonstration of clustering with segmentation. The parameters (
n_clusters,cluster_duration,segments,extremes) are well-documented and the output logging provides useful feedback.
185-193: LGTM!Good exploration of the clustering object properties after segmentation. The
is_segmented,n_segments,n_clusters, andtimesteps_per_clusterproperties provide comprehensive insight into the segmentation structure.
317-325: LGTM!The expansion workflow is clearly demonstrated with timing metrics and verification of objective preservation.
625-643: LGTM!Complete notebook metadata with kernelspec and language_info consistent with other notebooks in the repository.
147-147: Import path is correct and consistent with existing codebase usage.The import
from tsam.config import ExtremeConfig, SegmentConfigfollows the same pattern used throughout the repository inflixopt/transform_accessor.pyand test files, confirming the tsam API is properly integrated.docs/notebooks/08c-clustering.ipynb (8)
135-147: LGTM!Clean migration to the new
ExtremeConfig-based peak forcing API. The import and usage pattern are consistent with the updated clustering interface.
184-186: LGTM!Updated access path to
fs_clustered.clustering.resultsaligns with the newClusteringResultscontainer architecture.
226-258: LGTM!Excellent addition documenting the
clustering_data()method for inspecting time-varying data before clustering. This helps users understand which variables influence the clustering algorithm.
260-303: LGTM!Good documentation of the
data_varsparameter for selective clustering. The explanation of "cluster based on subset, apply to all data" is clear and the use case examples are practical.
337-347: LGTM!Clear example of using
ClusterConfigto specify alternative clustering algorithms (kmeans vs hierarchical).
381-400: LGTM!The
apply_clustering()documentation provides a useful pattern for design variant comparison with consistent cluster structures.
623-645: LGTM!Comprehensive API reference update with renamed properties (
cluster_assignments,cluster_occurrences) and the newClusteringResultsxarray-like interface documentation.
730-748: LGTM!Updated notebook metadata consistent with repository-wide Python 3.11.11 standardization.
flixopt/effects.py (3)
20-28: LGTM!Clean import grouping with
VariableCategoryadded from.structurealongside the existing structure imports.
383-389: LGTM!Correctly annotates the
totalvariable withVariableCategory.TOTAL, enabling proper classification for segmentation and IO handling. This follows the established pattern inSubmodel.add_variableswhich stores the category in the model'svariable_categoriesmapping.
401-407: LGTM!Correctly annotates the
total_over_periodsvariable withVariableCategory.TOTAL_OVER_PERIODS. Consistent with the categorization pattern applied tototalabove.docs/notebooks/02-heat-system.ipynb (1)
383-395: LGTM!Metadata enrichment with
language_info(Python 3.11.11) consistent with the repository-wide notebook standardization.docs/notebooks/08a-aggregation.ipynb (1)
391-403: LGTM!The
language_infometadata enrichment is consistent with standard Jupyter notebook conventions and aligns with the broader metadata standardization across the notebook collection.flixopt/clustering/intercluster_helpers.py (3)
14-16: LGTM!The docstring correctly reflects the updated API using
cluster_assignments[d]instead ofcluster_order[d], aligning with the broader clustering API refactor described in the PR.
74-154: LGTM!The function correctly handles all capacity parameter types (InvestParameters, numeric, None) with appropriate fallback behavior and helpful warning messages when using default unbounded capacity.
157-201: LGTM!The function correctly constructs boundary coordinates with proper handling of optional period and scenario dimensions from the FlowSystem.
docs/notebooks/01-quickstart.ipynb (1)
285-295: LGTM!The metadata enrichment is consistent with the standardization effort across notebooks, adding proper Python runtime details and version information.
docs/notebooks/03-investment-optimization.ipynb (1)
432-444: LGTM!The metadata addition follows the same standardization pattern applied across the notebook collection.
docs/notebooks/08c2-clustering-storage-modes.ipynb (1)
195-199: LGTM!The migration from
time_series_for_high_peaksto theExtremeConfig-based API provides more explicit control over extreme period handling. Themethod='new_cluster'withmax_value=PEAK_SERIEScorrectly specifies that peak demand periods should be captured as separate clusters.tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py (1)
193-211: LGTM - Clean migration to ExtremeConfig API.The migration from
time_series_for_high_peaks/time_series_for_low_peaksto the newExtremeConfig-based approach follows the tsam 3.0 API correctly. The variable names reference the correct component labels defined earlier in the file (Wärmelastat line 123,Stromlastat line 128).docs/user-guide/optimization/index.md (1)
59-66: LGTM - Documentation correctly demonstrates the new ExtremeConfig API.The example clearly shows how to use the new
extremesparameter withExtremeConfig, aligning with the updatedtransform.cluster()signature.flixopt/core.py (2)
643-650: Potential edge case with all-NaN slices.When
datacontains slices that are entirely NaN along the specified axis,np.nanmax - np.nanminwill returnNaN, and the comparisonptp < atolwill beFalse. This means all-NaN variables won't be dropped as "constant."If all-NaN variables should be treated as constant (and dropped), consider:
- if np.all(ptp < atol): + if np.all((ptp < atol) | np.isnan(ptp)):However, if preserving all-NaN variables is intentional (e.g., for debugging), the current behavior is acceptable.
617-619: LGTM - Performance optimization with new tolerance parameter.The addition of
atolparameter with a sensible default (1e-10) allows for tolerance-based constancy detection, which is more robust for floating-point data than exact equality checks. The use ofds.variablesfor direct Variable access is a good performance optimization.flixopt/elements.py (8)
19-25: LGTM - Clean import of VariableCategory.The import is properly organized with related structure module imports.
671-677: LGTM - Proper categorization of flow_rate variable.The
category=VariableCategory.FLOW_RATEcorrectly identifies the main flow rate variable for downstream processing and filtering.
682-693: LGTM - Appropriate category for total_flow_hours tracking variable.Using
VariableCategory.TOTALcorrectly identifies this as an aggregate/total variable.
709-724: LGTM - Correct categorization for flow_hours_over_periods.The
VariableCategory.TOTAL_OVER_PERIODSappropriately distinguishes this cross-period aggregate from per-period totals.
732-749: LGTM - Status variable correctly categorized.The binary status variable is properly marked with
VariableCategory.STATUS.
751-761: LGTM - Investment model receives size category.Passing
size_category=VariableCategory.FLOW_SIZEtoInvestmentModelallows proper identification of flow sizing variables.
969-981: LGTM - Virtual flow variables properly categorized.Both
virtual_supplyandvirtual_demandare correctly marked withVariableCategory.VIRTUAL_FLOW, distinguishing them from physical flow variables.
1047-1052: LGTM - Component-level status variable categorized consistently.The component's status variable uses the same
VariableCategory.STATUSas flow-level status variables, maintaining consistency.tests/test_clustering_io.py (8)
6-6: LGTM - Required import for new tests. The `xarray` import is needed for `xr.DataArray` in the fixture and `xr.testing.assert_equal` in assertions.
82-93: LGTM - Stronger type validation for restored clustering. The explicit `isinstance(fs_restored.clustering, Clustering)` check ensures the deserialization produces the correct type, not just a truthy value.
124-125: Good defensive testing for serialization edge case. The `check_names=False` parameter appropriately handles the potential loss of index name during NetCDF serialization, which is a known xarray/NetCDF behavior.
543-586: Well-designed fixture for multi-dimensional testing. The `system_with_periods_and_scenarios` fixture creates intentionally different demand patterns for each period/scenario combination to ensure cluster assignments differ. This is a thoughtful approach to testing multi-dimensional clustering behavior.
588-612: Good coverage of cluster assignment dimensionality. These tests verify the critical requirement that cluster assignments have the correct dimensions and produce different results for different period/scenario combinations.
614-669: Comprehensive roundtrip and derived property tests. The tests properly validate that cluster_assignments are exactly preserved after NetCDF roundtrip, and that derived properties like `n_clusters`, `timesteps_per_cluster`, and `cluster_occurrences` work correctly after deserialization.
671-703: Valuable test for apply_clustering with loaded clustering. This test validates an important use case: loading a saved clustering and applying it to a fresh FlowSystem. The assertions verify both structural equivalence and assignment matching.
705-724: LGTM - End-to-end test for expand after load and optimize. This test covers the complete workflow: cluster → optimize → save → load → expand, verifying that the solution is properly expanded to original timesteps.
flixopt/modeling.py (1)
9-9: Category propagation looks consistent and useful.
Tagging tracker and duration variables with `VariableCategory` aligns with `Submodel.add_variables` and will help downstream filtering. Also applies to: 266-308, 371-378
flixopt/statistics_accessor.py (1)
34-34: Category-based variable selection is a solid improvement.
This aligns the statistics accessor with the new categorization API and simplifies label handling. Also applies to: 527-533, 571-573, 580-582, 596-598
docs/notebooks/08d-clustering-multiperiod.ipynb (1)
176-188: Notebook updates align well with the new clustering API.
The ExtremeConfig-based examples and results access paths read clearly. Also applies to: 241-249, 252-259, 262-281, 288-291, 296-317, 322-364, 368-418, 422-451, 455-475, 480-486, 490-520, 524-582, 593-602
docs/user-guide/optimization/clustering.md (1)
26-37: Docs are consistent and clearer with the new clustering options.
Examples and parameter tables read cleanly. Also applies to: 54-58, 59-88, 89-105, 107-124, 125-134
tests/test_cluster_reduce_expand.py (5)
65-65: Updated tests correctly target the new clustering public API.
The move to `n_clusters`, `cluster_assignments`, and `timesteps_per_cluster` reads cleanly. Also applies to: 123-127, 135-135, 294-295, 338-347, 352-353, 452-452, 536-538, 545-545
767-840: Extremes coverage looks solid.
These tests should guard the new `ExtremeConfig` integration well.
845-957: `data_vars` parameter tests are thorough and focused.
Nice coverage for validation, preservation, and optimization.
996-1377: Segmentation and IO round-trip coverage is excellent.
The assertions around mapping, duration, and expansion behavior are especially helpful. Also applies to: 1379-1441
1443-1523: `combine_slices` tests are clear and useful.
They exercise shape, coords, attrs, and datetime handling well.
tests/test_clustering/test_integration.py (2)
125-213: Good coverage for `clustering_data` shape and period handling.
These tests exercise the expected Dataset return type and period selection behavior well.
288-335: ClusterConfig/preserve_column_means test updates look solid.
The assertions align with the new API surface and deterministic expectations.
tests/test_clustering/test_base.py (5)
12-62: Helper-function tests look solid.
They validate both occurrences and mapping logic clearly.
63-205: ClusteringResults coverage is thorough.
Good coverage of dims/coords/selectors and DataArray accessors.
207-307: Clustering wrapper tests are consistent with the new API.
Nice coverage of derived properties and repr.
309-390: Multi-dimensional selection tests are well scoped.
392-483: Plot accessor tests are clear and focused.
flixopt/clustering/__init__.py (1)
1-51: Public exports and docstring example match the new API.
benchmarks/benchmark_io_performance.py (2)
19-75: Benchmark fixture setup is clear and representative.
78-106: Benchmark harness is clean and reusable.
flixopt/structure.py (4)
82-143: VariableCategory and expansion groupings are well organized.
Clear taxonomy for segment handling.
197-203: Nice: model-level registry for variable categories.
This cleanly centralizes category metadata for downstream expansion logic.
1118-1129: `ds.variables` + `coord_cache` construction looks good.
Nice performance-oriented replacement for repeated DataArray construction.
1740-1767: Category registration on variable creation is clean.
This keeps categorization close to variable definition and avoids later bookkeeping.
flixopt/io.py (7)
564-575: Direct attr serialization loop looks solid. Single-pass handling for data vars and coords keeps the save path consistent and efficient.
609-641: Constant-dimension reduction looks consistent. The reduction loop maintains dimensional bookkeeping and preserves attrs on the reconstructed variables.
665-706: Stacking path is clear and preserves grouping semantics. Grouping by dims and stacking into a dedicated dimension reads correctly.
721-748: Unstacking restores variables efficiently. The slice-based reconstruction logic is straightforward and aligns with the stack layout.
774-785: Load-time attr restoration mirrors the save path. Good symmetry between serialization and deserialization of variable attrs.
1527-1563: Restore pipeline is cohesive with coord caching. The sequence from separation → fast array build → restore hooks is clear and consistent.
1831-1861: Solution namespacing and has_solution flagging are consistent. The solution prefixing and presence flagging are clear and aligned with the restore path.
flixopt/components.py (2)
942-953: Storage variable categorization is a good addition. Charge state/netto discharge categories and storage-size categorization align well with the new variable taxonomy.
Also applies to: 979-985
1316-1323: Intercluster categorization is consistent. Size and SOC boundary variables are now correctly tagged for downstream handling.
Also applies to: 1373-1380
flixopt/features.py (4)
40-84: Size category plumbing is clean. The new `size_category` parameter is applied consistently to `size` and `invested` variables.
200-258: Status variables are now consistently categorized. Inactive, startup/shutdown, total, and startup_count tags fit the taxonomy well.
408-428: Piecewise variable categories are wired through cleanly. Categorizing piece and zero_point variables improves downstream handling without changing behavior.
Also applies to: 519-524
642-706: Share allocation categories correctly distinguish temporal vs. scalar shares. The `SHARE`, `TOTAL`, and `PER_TIMESTEP` tags align with how these variables are used.
docs/notebooks/08e-clustering-internals.ipynb (2)
14-239: Notebook updates align with the new Clustering API. The expanded explanations and examples match the consolidated `Clustering`/`ClusteringResults` workflow.
341-509: IO workflow and file-size guidance are clear and helpful. The preservation/loss matrix and size-reduction tips should prevent user confusion.
```python
# Create and prepare FlowSystem
print('\n1. Creating FlowSystem...')
fs = create_large_flow_system(n_timesteps, n_periods, n_components)
print(f'   Components: {len(fs.components)}')

print('\n2. Clustering and solving...')
fs_clustered = fs.transform.cluster(n_clusters=n_clusters, cluster_duration='1D')
fs_clustered.optimize(fx.solvers.GurobiSolver())
```
🧩 Analysis chain
🏁 Script executed:
cd /tmp && find . -name "benchmark_io_performance.py" 2>/dev/null | head -5Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
git ls-files | grep -E "(benchmark|solver)" | head -20Repository: flixOpt/flixopt
Length of output: 116
🏁 Script executed:
rg "GurobiSolver" --type py -B 2 -A 2 | head -50Repository: flixOpt/flixopt
Length of output: 2612
🏁 Script executed:
cat -n benchmarks/benchmark_io_performance.py | sed -n '130,155p'Repository: flixOpt/flixopt
Length of output: 1248
🏁 Script executed:
cat -n benchmarks/benchmark_io_performance.py | head -50Repository: flixOpt/flixopt
Length of output: 1739
🏁 Script executed:
cat -n flixopt/solvers.py | head -100Repository: flixOpt/flixopt
Length of output: 3688
🏁 Script executed:
rg "class.*Solver" flixopt/solvers.py -A 5Repository: flixOpt/flixopt
Length of output: 596
🏁 Script executed:
rg "import.*Gurobi\|from.*gurobi" benchmarks/benchmark_io_performance.pyRepository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
rg "def optimize" flixopt/flow_system.py -A 15Repository: flixOpt/flixopt
Length of output: 589
🏁 Script executed:
rg "GurobiSolver\|gurobipy" flixopt/ --type py -lRepository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
cat -n flixopt/flow_system.py | grep -A 20 "def optimize"Repository: flixOpt/flixopt
Length of output: 926
🏁 Script executed:
rg "class OptimizeAccessor" flixopt/ --type py -A 30Repository: flixOpt/flixopt
Length of output: 1948
🏁 Script executed:
rg "def __call__" flixopt/ --type py -B 5 -A 20 | grep -A 20 "OptimizeAccessor"Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
rg "gurobipy\|import Gurobi" flixopt/ --type pyRepository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
python3 << 'EOF'
# Check what happens when GurobiSolver() is instantiated
# by looking at the solver class definition
import sys
sys.path.insert(0, '/tmp/repo')
try:
from flixopt.solvers import GurobiSolver
# Try to create an instance (this will fail if gurobipy is not installed)
solver = GurobiSolver()
print("SUCCESS: GurobiSolver instantiated without error")
print(f"Solver: {solver}")
except ImportError as e:
print(f"ImportError when creating GurobiSolver: {e}")
except Exception as e:
print(f"Exception when creating GurobiSolver ({type(e).__name__}): {e}")
EOFRepository: flixOpt/flixopt
Length of output: 130
🏁 Script executed:
rg "gurobipy\|import.*Gurobi" . --type pyRepository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
cat -n flixopt/optimize_accessor.py | head -100Repository: flixOpt/flixopt
Length of output: 3855
🏁 Script executed:
rg "def __call__" flixopt/optimize_accessor.py -A 50 | head -80Repository: flixOpt/flixopt
Length of output: 1869
🏁 Script executed:
rg "def solve" flixopt/flow_system.py -A 40 | head -80Repository: flixOpt/flixopt
Length of output: 1504
🏁 Script executed:
rg "def solve" flixopt/ --type py -lRepository: flixOpt/flixopt
Length of output: 105
🏁 Script executed:
cat -n flixopt/flow_system.py | grep -B 5 -A 40 "def solve"Repository: flixOpt/flixopt
Length of output: 1882
🏁 Script executed:
rg "def do_modeling\|model.solve" flixopt/ --type py -B 2 -A 15 | head -120Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
rg "gurobipy\|import.*gurobi" . -i --type pyRepository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
cat -n flixopt/optimization.py | head -200Repository: flixOpt/flixopt
Length of output: 8086
🏁 Script executed:
rg "solver_name.*gurobi\|if.*solver_name" flixopt/ --type py -B 3 -A 5Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
find . -name "structure.py" -o -name "*model*.py" | grep flixopt | head -5Repository: flixOpt/flixopt
Length of output: 103
🏁 Script executed:
cat -n flixopt/structure.py | grep -A 50 "def solve\|solver_name" | head -100Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
wc -l flixopt/structure.pyRepository: flixOpt/flixopt
Length of output: 84
🏁 Script executed:
rg "def solve" flixopt/structure.py -A 50Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
rg "GurobiSolver\|gurobi\|solver_name" flixopt/structure.py -i -B 2 -A 5Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
rg "def solve" flixopt/ --type py -lRepository: flixOpt/flixopt
Length of output: 105
🏁 Script executed:
rg "solver_name\|GurobiSolver" flixopt/ --type py -B 2 -A 10Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
ls -la flixopt/ | grep -E "\.py$"Repository: flixOpt/flixopt
Length of output: 1912
🏁 Script executed:
grep -n "def solve" flixopt/structure.pyRepository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
cat -n flixopt/structure.py | sed -n '1,100p'Repository: flixOpt/flixopt
Length of output: 3920
🏁 Script executed:
rg "class FlowSystemModel" flixopt/ --type py -A 100 | head -150Repository: flixOpt/flixopt
Length of output: 6912
🏁 Script executed:
rg "\.solve\(" flixopt/structure.py -B 3 -A 10Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
python3 << 'EOF'
# Check linopy's solve method signature
import inspect
try:
import linopy
print("Linopy version:", linopy.__version__)
print("\nLinopy Model.solve signature:")
print(inspect.signature(linopy.Model.solve))
except Exception as e:
print(f"Error: {e}")
EOFRepository: flixOpt/flixopt
Length of output: 90
🏁 Script executed:
cat benchmarks/benchmark_io_performance.py | head -20Repository: flixOpt/flixopt
Length of output: 488
🏁 Script executed:
cat -n benchmarks/benchmark_io_performance.py | sed -n '1,20p'Repository: flixOpt/flixopt
Length of output: 628
🏁 Script executed:
python3 << 'EOF'
# Check if GurobiSolver() instantiation would fail without gurobipy
import sys
sys.path.insert(0, '/tmp/repo')
# The GurobiSolver class in solvers.py is just a dataclass
# Let's verify it doesn't import gurobipy at class definition or instantiation
from flixopt.solvers import GurobiSolver
try:
solver = GurobiSolver()
print("SUCCESS: GurobiSolver instantiated without importing gurobipy")
print(f"Solver name: {solver.name}")
print(f"Solver options: {solver.options}")
except Exception as e:
print(f"FAILED: {type(e).__name__}: {e}")
EOFRepository: flixOpt/flixopt
Length of output: 334
🏁 Script executed:
rg "def __call__" flixopt/optimize_accessor.py -A 60 | tail -50Repository: flixOpt/flixopt
Length of output: 1799
🏁 Script executed:
cat -n flixopt/solvers.pyRepository: flixOpt/flixopt
Length of output: 3688
🏁 Script executed:
grep -r "gurobipy\|gurobi" benchmarks/ --include="*.py"Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
rg "class GurobiSolver" flixopt/solvers.py -A 25Repository: flixOpt/flixopt
Length of output: 788
🏁 Script executed:
rg "linopy.Model.solve\|xarray" flixopt/structure.py | head -20Repository: flixOpt/flixopt
Length of output: 41
🏁 Script executed:
cat -n flixopt/flow_system.py | sed -n '1402,1433p'Repository: flixOpt/flixopt
Length of output: 1417
🏁 Script executed:
grep -n "model.solve" flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 89
🏁 Script executed:
cat -n flixopt/flow_system.py | sed -n '1429,1470p'Repository: flixOpt/flixopt
Length of output: 1999
🏁 Script executed:
rg "solver_name.*gurobi\|HighsSolver\|HiGHS" flixopt/flow_system.py -i -B 2 -A 5Repository: flixOpt/flixopt
Length of output: 41
Benchmark requires Gurobi; consider guarding the solve call or using a fallback solver.
The benchmark will fail without Gurobi, but not at GurobiSolver() instantiation—the error occurs later in model.solve() when linopy tries to invoke the actual solver. Wrapping the instantiation won't prevent the failure. Instead, either:
- Guard the `fs_clustered.optimize()` call with exception handling for the actual solve step, or
- Use `HighsSolver()` as a fallback when Gurobi is unavailable, or
- Document Gurobi as a hard requirement.
🤖 Prompt for AI Agents
In `@benchmarks/benchmark_io_performance.py` around lines 138 - 146, The benchmark
currently calls fs_clustered.optimize(fx.solvers.GurobiSolver()) which will fail
at solve time if Gurobi is not available; change this to detect/handle solver
availability by wrapping the optimize call in a try/except that catches solver
runtime errors and falls back to fx.solvers.HighsSolver(), or programmatically
choose the solver before calling fs_clustered.optimize by attempting to use
fx.solvers.GurobiSolver() and on failure selecting fx.solvers.HighsSolver();
update the call sites referenced (create_large_flow_system,
fs.transform.cluster, fs_clustered.optimize, fx.solvers.GurobiSolver) so the
benchmark either falls back to HighsSolver or logs a clear error and exits.
| "from tsam.config import ExtremeConfig\n", | ||
| "\n", |
🧩 Analysis chain
🌐 Web query:
tsam python package ExtremeConfig class configuration module
💡 Result:
There is no class named ExtremeConfig in the tsam package. Extreme-period handling is part of the TimeSeriesAggregation module — configure it with the extremePeriodMethod parameter (options: "None", "append", "new_cluster_center", "replace_cluster_center") and the addPeakMin / addPeakMax / addMeanMin / addMeanMax arguments to specify which series produce extreme periods. See the TimeSeriesAggregation docs and source for the implementation. [1][2]
Sources:
- TimeSeriesAggregation docs (parameters & extremePeriodMethod). [1]
- timeseriesaggregation source code ( _addExtremePeriods implementation ). [2]
Remove the ExtremeConfig import; it does not exist in the tsam package.
The class ExtremeConfig is not available in tsam.config. Extreme-period handling in tsam is configured via the TimeSeriesAggregation module using the extremePeriodMethod parameter (with options: "None", "append", "new_cluster_center", "replace_cluster_center") and arguments like addPeakMin, addPeakMax, addMeanMin, addMeanMax. The current import and API usage (extremes=ExtremeConfig(...)) will fail at runtime.
🤖 Prompt for AI Agents
In `@docs/notebooks/08c2-clustering-storage-modes.ipynb` around lines 174 - 175,
Remove the invalid import "from tsam.config import ExtremeConfig" and replace
any usage of ExtremeConfig (e.g., extremes=ExtremeConfig(...)) with the
TimeSeriesAggregation API by setting the extremePeriodMethod parameter (one of
"None", "append", "new_cluster_center", "replace_cluster_center") and the
corresponding flags like addPeakMin, addPeakMax, addMeanMin, addMeanMax on the
TimeSeriesAggregation instance; ensure any code that passed
extremes=ExtremeConfig(...) is updated to configure extreme handling via
TimeSeriesAggregation.extremePeriodMethod and the addPeak*/addMean* arguments
instead.
| "print('\\nCluster assignments per day (period=2024, scenario=High):')\n", | ||
| "for i, cluster_id in enumerate(cluster_assignments):\n", | ||
| " print(f' {day_names[i]}: Cluster {cluster_id}')\n", |
Avoid hardcoded period/scenario labels in output.
You index by period=0, scenario=0 but print “period=2024, scenario=High”. Consider deriving labels from the coords to avoid mismatches when inputs change.
🤖 Prompt for AI Agents
In `@docs/notebooks/08d-clustering-multiperiod.ipynb` around lines 283 - 285, The
printed header uses hardcoded "period=2024, scenario=High" which can diverge
from the indices used; retrieve the actual labels from the dataset/coords used
to produce cluster_assignments (e.g., the period and scenario coordinate arrays
referenced when you computed cluster_assignments) and use those values in the
header instead of fixed text, then keep the existing loop over
cluster_assignments and day_names to print each day's cluster; reference the
variables/objects cluster_assignments and day_names and the same period/scenario
coord variables you used earlier to select the slice so the header is always
derived from the data.
flixopt/components.py
Outdated
```diff
  # relative_loss_per_hour is per-hour, so we need total hours per cluster
  # Use sum over time to handle both regular and segmented systems
  # Keep as DataArray to respect per-period/scenario values
  rel_loss = _scalar_safe_reduce(self.element.relative_loss_per_hour, 'time', 'mean')
- hours_per_cluster = timesteps_per_cluster * _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
+ hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
  decay_n = (1 - rel_loss) ** hours_per_cluster
```
Fix decay factor to use total hours per cluster.
The decay currently uses the mean timestep duration, which underestimates self‑discharge across a full cluster period. This skews SOC boundary linking. Use a summed duration (or otherwise compute total hours).
🔧 Proposed fix
- hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
+ hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum')

🤖 Prompt for AI Agents
In `@flixopt/components.py` around lines 1507 - 1512, The decay calculation uses
mean timestep duration which underestimates total hours per cluster; update the
reduction on self._model.timestep_duration (used when computing decay_n) to
compute the total hours for the cluster (e.g., use
_scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum') or equivalent)
so decay_n = (1 - self.element.relative_loss_per_hour) **
total_hours_per_cluster correctly reflects summed duration while preserving
DataArray semantics and usage of _scalar_safe_reduce and relative_loss_per_hour.
```python
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
    self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
    self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
    self.timestep_duration = None
```
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the flow_system.py file to see the actual code
fd flow_system.pyRepository: flixOpt/flixopt
Length of output: 81
🏁 Script executed:
# Read the flow_system.py file around lines 210-218
wc -l flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 86
🏁 Script executed:
# Read lines around 210-218 to see the actual code
sed -n '200,225p' flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 1464
🏁 Script executed:
# Look for the _compute_time_metadata method to understand when computed_timestep_duration is None
sed -n '1,2604p' flixopt/flow_system.py | grep -n "_compute_time_metadata" | head -5Repository: flixOpt/flixopt
Length of output: 327
🏁 Script executed:
# Read the _compute_time_metadata method
sed -n '433,500p' flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 3265
🏁 Script executed:
# Search for weights and temporal_weight properties/methods to see how they use timestep_duration
rg -n "temporal_weight|def weights" flixopt/flow_system.py | head -20Repository: flixOpt/flixopt
Length of output: 239
🏁 Script executed:
# Read the temporal_weight property
sed -n '2075,2110p' flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 1451
🏁 Script executed:
# Read the weights property
sed -n '2209,2250p' flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 1931
🏁 Script executed:
# Check if timestep_duration being None causes any issues in weights/temporal_weight
# Let's look at the calculate_timestep_duration method
rg -n "calculate_timestep_duration" flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 180
🏁 Script executed:
# Read the calculate_timestep_duration method
sed -n '361,400p' flixopt/flow_system.pyRepository: flixOpt/flixopt
Length of output: 1832
🏁 Script executed:
# Let's check what happens when timestep_duration is None in the weights dict
# Search for any tests or usage of RangeIndex systems
rg -n "RangeIndex" flixopt/flow_system.py | head -15Repository: flixOpt/flixopt
Length of output: 1116
🏁 Script executed:
# Check for test files that use RangeIndex to understand expected behavior
fd "test.*flow_system" -o "*test*flow*"Repository: flixOpt/flixopt
Length of output: 200
🏁 Script executed:
# Search for test files related to flow_system
fd "test.*flow_system" tests/ 2>/dev/null || fd "flow_system" tests/ 2>/dev/null | head -10Repository: flixOpt/flixopt
Length of output: 173
🏁 Script executed:
# Search for RangeIndex usage in test files
rg -n "RangeIndex" tests/ | head -20Repository: flixOpt/flixopt
Length of output: 362
🏁 Script executed:
# Read the test file to understand RangeIndex usage
sed -n '1010,1100p' tests/test_cluster_reduce_expand.pyRepository: flixOpt/flixopt
Length of output: 3745
🏁 Script executed:
# Let's understand better - search for where timestep_duration might be set for RangeIndex systems
# Check if there are methods that set it after initialization
rg -n "self.timestep_duration =" flixopt/flow_system.py | head -20Repository: flixOpt/flixopt
Length of output: 278
Add validation for RangeIndex systems requiring explicit timestep_duration.
When timesteps is a RangeIndex (segmented system) and timestep_duration is not provided, self.timestep_duration becomes None. This breaks downstream uses like weights['time'] and temporal_weight. Add early validation with a clear error message.
🐛 Proposed fix
```diff
-    # Use provided timestep_duration if given (for segmented systems), otherwise use computed value
+    # Segmented systems (RangeIndex) must provide timestep_duration explicitly
+    if isinstance(self.timesteps, pd.RangeIndex) and timestep_duration is None:
+        raise ValueError(
+            'timestep_duration must be provided when timesteps is a RangeIndex (segmented system).'
+        )
+    # Use provided timestep_duration if given (for segmented systems), otherwise use computed value
     # For RangeIndex (segmented systems), computed_timestep_duration is None
     if timestep_duration is not None:
         self.timestep_duration = timestep_duration
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original lines:

```python
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
    self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
    self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
    self.timestep_duration = None
```

Suggested replacement:

```python
# Segmented systems (RangeIndex) must provide timestep_duration explicitly
if isinstance(self.timesteps, pd.RangeIndex) and timestep_duration is None:
    raise ValueError(
        'timestep_duration must be provided when timesteps is a RangeIndex (segmented system).'
    )
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
    self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
    self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
    self.timestep_duration = None
```
🤖 Prompt for AI Agents
In `@flixopt/flow_system.py` around lines 210 - 218, The code currently allows
self.timestep_duration to be None when timesteps is a RangeIndex (segmented
system), which breaks downstream logic (e.g., weights['time'], temporal_weight);
add an early validation in the constructor or initializer where
timestep_duration and computed_timestep_duration are handled: detect the
segmented case (timesteps is a RangeIndex or computed_timestep_duration is None)
and raise a clear ValueError if timestep_duration is not provided, mentioning
that RangeIndex requires an explicit timestep_duration; keep existing logic for
assigning self.timestep_duration (including using
fit_to_model_coords('timestep_duration', computed_timestep_duration) when
available) but enforce this guard before downstream computation.
```python
def _fast_get_dataarray(ds: xr.Dataset, name: str, coord_cache: dict[str, xr.DataArray]) -> xr.DataArray:
    """Construct DataArray from Variable without slow coordinate inference.

    This bypasses the slow _construct_dataarray method (~1.5ms -> ~0.1ms per var).

    Args:
        ds: Source dataset
        name: Variable name
        coord_cache: Pre-cached coordinate DataArrays

    Returns:
        Constructed DataArray
    """
    variable = ds.variables[name]
    coords = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
    return xr.DataArray(variable, coords=coords, name=name)
```
Preserve auxiliary coordinates in fast DataArray construction.
_fast_get_dataarray currently only attaches dimension coordinates (those with names matching variable dims). Auxiliary coordinates—those with dimensions that are a subset of the variable's dimensions—are discarded during reconstruction. This will cause data loss if such coordinates are present in serialized datasets.
The proposed fix correctly handles both dimension and auxiliary coordinates:
Recommended fix
- coords = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
+ coords = {
+ k: c for k, c in coord_cache.items()
+ if set(c.dims).issubset(variable.dims)
+ }
While the round-trip tests currently pass (suggesting auxiliary coordinates aren't presently used in the codebase), this change makes the implementation robust for future use cases and aligns with xarray's coordinate handling semantics.
🤖 Prompt for AI Agents
In `@flixopt/io.py` around lines 1587 - 1603, _fast_get_dataarray currently only
includes coordinates whose names match variable.dims, dropping auxiliary
coordinates whose dims are a subset of the variable's dims; update the coords
build in _fast_get_dataarray to include any coord_cache entry where coord.dims
is a subset of variable.dims (not just keys equal to dims), so auxiliary coords
are preserved when constructing the xr.DataArray (use variable.dims and each
coord.dims to decide inclusion and pass the resulting coords dict to
xr.DataArray).
# Select data for clustering (may be subset if data_vars specified)
ds_slice_for_clustering = (
    ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
)
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
Add a guard when clustering input becomes empty.
When drop_constant_arrays() removes all selected variables (especially with data_vars filtering), calling to_dataframe() on an empty Dataset fails with a cryptic indexing error. Add an explicit check with a clear error message.
🐛 Proposed fix
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+ if not temporaly_changing_ds_for_clustering.data_vars:
+ raise ValueError(
+ 'No time-varying data left for clustering after drop_constant_arrays(). '
+ 'Check data_vars or inspect transform.clustering_data().'
+ )
df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
  # Select data for clustering (may be subset if data_vars specified)
  ds_slice_for_clustering = (
      ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
  )
  temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+ if not temporaly_changing_ds_for_clustering.data_vars:
+     raise ValueError(
+         'No time-varying data left for clustering after drop_constant_arrays(). '
+         'Check data_vars or inspect transform.clustering_data().'
+     )
  df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 1379 - 1385, Add a guard to
detect when drop_constant_arrays removes all variables before calling
to_dataframe(): after computing temporaly_changing_ds_for_clustering =
drop_constant_arrays(ds_slice_for_clustering, dim='time') check whether
temporaly_changing_ds_for_clustering is empty (e.g., has no data_vars or no
variables) and raise a clear ValueError mentioning the input filtering
(data_vars/selector) and that there is no varying data to cluster, instead of
calling temporaly_changing_ds_for_clustering.to_dataframe() and allowing a
cryptic indexing error; update the code paths that assign df_for_clustering
accordingly so downstream logic only runs when df_for_clustering is valid.
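As a quick illustration of the failure mode and the guard, here is a minimal sketch; drop_constant below is a hypothetical stand-in for flixopt's drop_constant_arrays:

import numpy as np
import pandas as pd
import xarray as xr

def drop_constant(ds: xr.Dataset, dim: str = 'time') -> xr.Dataset:
    # Hypothetical stand-in: drop variables that do not vary along `dim`.
    constant = [
        name for name, da in ds.data_vars.items()
        if dim in da.dims and bool((da == da.isel({dim: 0})).all())
    ]
    return ds.drop_vars(constant)

time = pd.date_range('2024-01-01', periods=4, freq='h')
ds = xr.Dataset({'price': ('time', np.full(4, 30.0))}, coords={'time': time})  # constant series only

reduced = drop_constant(ds)
if not reduced.data_vars:
    print('No time-varying data left for clustering; raise a clear error instead of calling to_dataframe().')
else:
    df = reduced.to_dataframe()  # only reached when time-varying data remains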
"tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased",  # Time series aggregation (unreleased)
"scipy >= 1.15.1, < 2",  # Used by tsam. Prior versions have conflict with highspy. See https://github.com/scipy/scipy/issues/22257
This configuration will fail PyPI publication.
Using a git URL for tsam (line 67 in [full] extras and line 86 in [dev] extras) is incompatible with PyPI distribution. pip explicitly blocks git URL dependencies for packages installed from PyPI, so the verify-pypi job in your release workflow will fail when attempting to install from https://pypi.org/simple/.
Pinning to a commit hash does not resolve this. Before release, you must either:
- Wait for tsam v3 to be published to PyPI and use a version specifier (tsam >= 3.0.0, < 4)
- Remove tsam from the [full] and [dev] extras and document that users requiring tsam must install from source
- Vendor or bundle tsam as part of flixopt
Verify the release workflow will pass with the current dependencies, or update the workflow to skip PyPI publication if this is intentional for development-only use.
🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 67 - 68, The pyproject.toml currently lists a
git URL dependency "tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased"
in the [full] and [dev] extras which will break PyPI publication; remove that
git-based requirement from both the [full] and [dev] extras and either (a)
replace it with a proper PyPI version specifier (e.g., "tsam >= 3.0.0, < 4")
once tsam is published, (b) omit tsam from those extras and add a note in your
README/INSTALL docs instructing users to install tsam from source, or (c)
vendor/include tsam in the package; finally, if this is intended only for
development, update the release workflow (the verify-pypi job) to skip PyPI
verification or otherwise allow dev-only dependencies so the release will pass.
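If the git dependency is kept during development, one option is a pre-release check in CI. A minimal sketch, assuming Python 3.11+ for tomllib and a standard [project] table; this script is hypothetical and not part of the repo:

import tomllib  # Python 3.11+

# Hypothetical pre-release check: flag VCS/direct-URL requirements that a
# PyPI-published package cannot depend on at install time.
with open('pyproject.toml', 'rb') as f:
    project = tomllib.load(f)['project']

requirements = list(project.get('dependencies', []))
for extra, deps in project.get('optional-dependencies', {}).items():
    requirements += deps

offenders = [req for req in requirements if '@ git+' in req or '@git+' in req]
if offenders:
    raise SystemExit(f'Direct VCS dependencies are not installable from PyPI: {offenders}')
print('No direct VCS dependencies found.')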
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@flixopt/clustering/base.py`:
- Around line 619-633: The zip call that builds the selector silently truncates
mismatches (zip(self._dim_names, key, strict=False)); change it to use
strict=True so any length mismatch between self._dim_names and key raises
immediately, e.g., replace the zip invocation used when constructing selector in
the loop over self._results.items() (where selector = dict(zip(self._dim_names,
key, ...))) with zip(..., strict=True) to fail fast on dimension mismatches and
ensure selectors are built only when dimensions align.
🧹 Nitpick comments (5)
flixopt/clustering/base.py (5)
578-591: Type conversion in deserialization only handles int.
The _str_to_key method only attempts to convert parts to int. If period values are floats (e.g., 2024.5) or other numeric types, they would be stored as strings after round-trip serialization. Consider extending the conversion logic:
🔧 Suggested enhancement
  @staticmethod
  def _str_to_key(key_str: str, dim_names: list[str]) -> tuple:
      """Convert string back to key tuple."""
      if key_str == '__single__':
          return ()
      parts = key_str.split('|')
-     # Try to convert to int if possible (for period years)
+     # Try to convert to numeric types if possible (for period years/floats)
      result = []
      for part in parts:
          try:
              result.append(int(part))
          except ValueError:
-             result.append(part)
+             try:
+                 result.append(float(part))
+             except ValueError:
+                 result.append(part)
      return tuple(result)

1206-1206: Default empty DatetimeIndex may cause issues downstream.
When original_timesteps is None, it defaults to an empty pd.DatetimeIndex([]). Methods like expand_data and build_expansion_divisor default to using self.original_timesteps, which could lead to confusing behavior with zero-length expansions. Consider validating that original_timesteps is not empty when it's actually needed, or document this edge case more explicitly.

1242-1254: Property raises ValueError without checking all paths.
The results property raises ValueError if neither _results_cache nor _aggregation_results is available. However, the __init__ method allows creating a Clustering with results=None and _aggregation_results=None, which would always fail when accessing results. Consider adding validation in __init__ to ensure at least one of these is provided, or document this as an invalid state.

1425-1434: Variable shadowing: variables parameter is overwritten.
The local assignment variables = ds.variables at line 1426 shadows the function parameter variables from line 1369. While this doesn't cause a bug (the parameter was already consumed at line 1410), it's confusing and could lead to issues if the code is refactored.
🔧 Suggested fix: rename local variable
  if kind == 'duration_curve':
      sorted_vars = {}
-     # Use variables for faster access (avoids _construct_dataarray overhead)
-     variables = ds.variables
+     # Use ds_variables for faster access (avoids _construct_dataarray overhead)
+     ds_variables = ds.variables
      rep_values = ds.coords['representation'].values
      rep_idx = {rep: i for i, rep in enumerate(rep_values)}
      for var in ds.data_vars:
-         data = variables[var].values
+         data = ds_variables[var].values
          for rep in rep_values:
              # Direct numpy indexing instead of .sel()
              values = np.sort(data[rep_idx[rep]].flatten())[::-1]

1665-1669: Consider direct numpy indexing for cluster labels.
The list comprehension at line 1668 calls .sel(cluster=c).values for each cluster, creating intermediate DataArrays. For consistency with the PR's optimization pattern, consider using direct indexing:
🔧 Suggested optimization
  if occ_extra_dims:
      cluster_labels = [f'Cluster {c}' for c in range(n_clusters)]
  else:
+     occ_values = cluster_occurrences.values
      cluster_labels = [
-         f'Cluster {c} (×{int(cluster_occurrences.sel(cluster=c).values)})' for c in range(n_clusters)
+         f'Cluster {c} (×{int(occ_values[c])})' for c in range(n_clusters)
      ]
for key, cr in self._results.items():
    # Build selector for this key
    selector = dict(zip(self._dim_names, key, strict=False))

    # Select the slice for this (period, scenario)
    data_slice = data.sel(**selector, drop=True) if selector else data

    # Drop constant arrays and convert to DataFrame
    time_varying = drop_constant_arrays(data_slice, dim='time')
    df = time_varying.to_dataframe()

    # Apply clustering
    results[key] = cr.apply(df)

return Clustering._from_aggregation_results(results, self._dim_names)
Use strict=True in zip() to catch dimension mismatches.
The zip(self._dim_names, key, strict=False) at line 621 will silently truncate if key and dim_names have mismatched lengths. Since key tuples should always match the structure of dim_names (both sourced from the same ClusteringResults initialization), using strict=True would catch any unexpected inconsistencies and fail fast rather than silently producing incomplete selectors. The empty case (key=(), dim_names=[]) works fine with strict=True since both are empty.
🤖 Prompt for AI Agents
In `@flixopt/clustering/base.py` around lines 619 - 633, The zip call that builds
the selector silently truncates mismatches (zip(self._dim_names, key,
strict=False)); change it to use strict=True so any length mismatch between
self._dim_names and key raises immediately, e.g., replace the zip invocation
used when constructing selector in the loop over self._results.items() (where
selector = dict(zip(self._dim_names, key, ...))) with zip(..., strict=True) to
fail fast on dimension mismatches and ensure selectors are built only when
dimensions align.
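For reference, a small sketch of the behavioral difference (plain Python 3.10+, independent of flixopt):

dim_names = ['period', 'scenario']
key = ('2024',)  # one element short

print(dict(zip(dim_names, key)))           # {'period': '2024'} -> the mismatch is silently truncated
print(dict(zip([], (), strict=True)))      # {} -> the empty case still works with strict=True
try:
    dict(zip(dim_names, key, strict=True)) # raises because the lengths differ
except ValueError as err:
    print(err)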
_add_clustering_to_dataset was slow due to 210 individual ds[name] = arr assignments. Each triggers xarray's expensive dataset_update_method. Changed to batch assignment with ds.assign(dict):
- Before: ~2600ms for to_dataset with clustering
- After: ~109ms for to_dataset with clustering
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
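The pattern behind this change, as a minimal sketch with made-up variable names rather than the actual flixopt code:

import numpy as np
import xarray as xr

ds = xr.Dataset(coords={'time': np.arange(8760)})
new_arrays = {f'cluster|var_{i}': xr.DataArray(np.random.rand(8760), dims=['time']) for i in range(210)}

# Slow: each item assignment runs xarray's dataset update/merge machinery separately.
# for name, arr in new_arrays.items():
#     ds[name] = arr

# Fast: a single merge for all variables.
ds = ds.assign(new_arrays)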
Avoided _construct_dataarray overhead by:
- Using ds.variables instead of ds.data_vars.items()
- Using numpy slicing instead of .isel()
- Passing attrs dict directly instead of DataArray
cluster() benchmark:
- Before: ~10.1s
- After: ~8.9s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
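Roughly, the access-pattern change looks like this (sketch only; the dataset contents are synthetic):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'demand': (('cluster', 'time'), np.random.rand(12, 24)), 'price': (('cluster', 'time'), np.random.rand(12, 24))},
    coords={'cluster': np.arange(12), 'time': np.arange(24)},
)

# Before: DataArray-level access rebuilds coordinates for every variable and every .isel() call.
slow = {name: da.isel(cluster=0).values for name, da in ds.data_vars.items()}

# After: raw Variables plus plain numpy slicing skip _construct_dataarray entirely.
variables = ds.variables
fast = {name: variables[name].values[0, :] for name in ds.data_vars}

assert all(np.array_equal(slow[k], fast[k]) for k in fast)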
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@flixopt/transform_accessor.py`:
- Around line 293-313: The current shortcut returns a 1-D xr.Dataset when there
are multiple (period, scenario) slices but only one non-empty metrics DataFrame,
dropping period/scenario dims; change the condition so the single-slice fast
path is used only when there is truly one overall slice (i.e.,
len(clustering_metrics_all) == 1 and len(non_empty_metrics) == 1). If there are
multiple (period, scenario) keys but only one non-empty metrics_df, do not
return the 1-D Dataset from the block using first_key/metrics_df; instead let
execution fall through to the multi-dim assembly code (which should construct
the full Dataset with NaN-filled slices) so dimensions referenced by
Clustering._dim_names (periods/scenarios) are preserved.
♻️ Duplicate comments (2)
flixopt/transform_accessor.py (2)
1392-1397: Guard against empty clustering input after drop_constant_arrays().
Same issue as previously noted: if the selected variables become empty after drop_constant_arrays, to_dataframe() will fail with a cryptic error. Please add an explicit guard and a clear error message before converting to DataFrame.

1915-1923: Segmented SOC interpolation still uses wrong time index.
timestep_mapping % n_segments collapses to segment indices, not timestep positions within a cluster, which yields incorrect interpolation. Use modulo timesteps_per_cluster for time_indices while keeping cluster decoding as-is.
🛠️ Proposed fix
- if clustering.is_segmented and clustering.n_segments is not None:
-     time_dim_size = clustering.n_segments
- else:
-     time_dim_size = clustering.timesteps_per_cluster
- cluster_indices = timestep_mapping // time_dim_size
- time_indices = timestep_mapping % time_dim_size
+ if clustering.is_segmented and clustering.n_segments is not None:
+     time_dim_size = clustering.n_segments
+ else:
+     time_dim_size = clustering.timesteps_per_cluster
+ cluster_indices = timestep_mapping // time_dim_size
+ # Use original timestep position within each cluster for segment lookup
+ timesteps_per_cluster = clustering.timesteps_per_cluster
+ time_positions = xr.DataArray(
+     np.arange(timestep_mapping.sizes['original_time']),
+     dims=['original_time'],
+     coords={'original_time': timestep_mapping.coords['original_time']},
+ ).broadcast_like(timestep_mapping)
+ time_indices = time_positions % timesteps_per_cluster
non_empty_metrics = {k: v for k, v in clustering_metrics_all.items() if not v.empty}

if not non_empty_metrics:
    return xr.Dataset()

first_key = (periods[0], scenarios[0])

if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
    metrics_df = non_empty_metrics.get(first_key)
    if metrics_df is None:
        metrics_df = next(iter(non_empty_metrics.values()))
    return xr.Dataset(
        {
            col: xr.DataArray(
                metrics_df[col].values,
                dims=['time_series'],
                coords={'time_series': metrics_df.index},
            )
            for col in metrics_df.columns
        }
    )
Preserve period/scenario dims when only one metrics slice is non‑empty.
If there are multiple (period, scenario) slices but only one non‑empty metrics DataFrame, the shortcut returns a 1‑D Dataset and drops period/scenario dimensions, which can misalign metrics with Clustering._dim_names. Consider only using the shortcut when there is truly a single slice, and otherwise fall through to the multi‑dim path with NaN‑filled slices.
🛠️ Proposed fix
- if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
+ if len(clustering_metrics_all) == 1:
metrics_df = non_empty_metrics.get(first_key)
if metrics_df is None:
metrics_df = next(iter(non_empty_metrics.values()))
return xr.Dataset(
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 293 - 313, The current shortcut
returns a 1-D xr.Dataset when there are multiple (period, scenario) slices but
only one non-empty metrics DataFrame, dropping period/scenario dims; change the
condition so the single-slice fast path is used only when there is truly one
overall slice (i.e., len(clustering_metrics_all) == 1 and len(non_empty_metrics)
== 1). If there are multiple (period, scenario) keys but only one non-empty
metrics_df, do not return the 1-D Dataset from the block using
first_key/metrics_df; instead let execution fall through to the multi-dim
assembly code (which should construct the full Dataset with NaN-filled slices)
so dimensions referenced by Clustering._dim_names (periods/scenarios) are
preserved.
Eliminated 451,856 slow pandas .loc calls by using numpy reshape for segmented clustering data instead of iterating per-cluster.
cluster() with segments benchmark (50 clusters, 4 segments):
- Before: ~93.7s
- After: ~21.1s
- Speedup: 4.4x
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
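The core reshape trick, as a minimal sketch with synthetic data and hypothetical names rather than the actual _build_typical_das code:

import numpy as np
import pandas as pd

n_clusters, timesteps_per_cluster = 50, 24
index = pd.MultiIndex.from_product(
    [range(n_clusters), range(timesteps_per_cluster)], names=['cluster', 'time']
)
df = pd.DataFrame({'demand': np.random.rand(len(index))}, index=index)

# Slow: one pandas .loc lookup per (cluster, timestep) pair.
slow = np.array([[df.loc[(c, t), 'demand'] for t in range(timesteps_per_cluster)] for c in range(n_clusters)])

# Fast: the values are already ordered by (cluster, time), so a single reshape suffices.
fast = df['demand'].to_numpy().reshape(n_clusters, timesteps_per_cluster)

assert np.allclose(slow, fast)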
914580f to 0754f06
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@flixopt/transform_accessor.py`:
- Around line 533-565: The current branch treats any variable with
partially-missing typical slices as fully constant; change the logic in the loop
(around variables, ds_new_vars, typical_das, all_keys) so that only when name
not in typical_das you take the constant/no-time path; if name exists in
typical_das but its keys != all_keys, build the time-dependent DataArray by
iterating all_keys and for each missing (period,scenario) key fill that slice
with the appropriate constant aggregate (e.g., broadcast the available constant
array or a selected default from typical_das[name]), then concatenate/stack
these filled slices into the full (cluster, time, ...) shape and assign into
ds_new_vars with dims ['cluster','time',...] and coords new_coords; ensure you
use the same coord_cache, cluster_coords and time_coords when constructing the
filled/combined DataArray.
♻️ Duplicate comments (3)
flixopt/transform_accessor.py (3)
298-318: Preserve period/scenario dims in the metrics shortcut.
The len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1 fast path can drop period/scenario dims when multiple slices exist but only one is non-empty (Line 305). This produces a 1-D Dataset and misaligns downstream dimension expectations. Consider requiring a true single-slice case only.
♻️ Proposed fix
- if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
+ if len(clustering_metrics_all) == 1 and len(non_empty_metrics) == 1:
      metrics_df = non_empty_metrics.get(first_key)
      if metrics_df is None:
          metrics_df = next(iter(non_empty_metrics.values()))
      return xr.Dataset(

1397-1403: Guard against empty clustering input before to_dataframe().
After drop_constant_arrays, the dataset can be empty when all selected variables are constant (Line 1401–1402). to_dataframe() then fails with a cryptic indexing error. Add an explicit guard with a clear message.
🐛 Proposed fix
  ds_slice_for_clustering = (
      ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
  )
  temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+ if not temporaly_changing_ds_for_clustering.data_vars:
+     raise ValueError(
+         'No time-varying data left for clustering after drop_constant_arrays(). '
+         'Check data_vars or inspect transform.clustering_data().'
+     )
  df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()

1920-1932: Segmented charge_state interpolation uses the wrong time index basis.
For segmented clustering, timestep_mapping % n_segments gives the segment index, but segment_assignments / position_within_segment are keyed by timestep position within the cluster (Line 1926–1932). This mis-indexes segment assignments and produces incorrect interpolation. Use timestep positions mod timesteps_per_cluster for the time index while keeping cluster indices derived from timestep_mapping.
🐛 Proposed fix
  if clustering.is_segmented and clustering.n_segments is not None:
      time_dim_size = clustering.n_segments
  else:
      time_dim_size = clustering.timesteps_per_cluster
  cluster_indices = timestep_mapping // time_dim_size
- time_indices = timestep_mapping % time_dim_size
+ # Use original timestep position within each cluster for segment lookup
+ timesteps_per_cluster = clustering.timesteps_per_cluster
+ time_positions = xr.DataArray(
+     np.arange(timestep_mapping.sizes['original_time']),
+     dims=['original_time'],
+     coords={'original_time': timestep_mapping.coords['original_time']},
+ ).broadcast_like(timestep_mapping)
+ time_indices = time_positions % timesteps_per_cluster
# Use ds.variables to avoid _construct_dataarray overhead
variables = ds.variables
coord_cache = {k: ds.coords[k].values for k in ds.coords}

for name in ds.data_vars:
    var = variables[name]
    if 'time' not in var.dims:
        # No time dimension - wrap Variable in DataArray
        coords = {d: coord_cache[d] for d in var.dims if d in coord_cache}
        ds_new_vars[name] = xr.DataArray(var.values, dims=var.dims, coords=coords, attrs=var.attrs, name=name)
    elif name not in typical_das or set(typical_das[name].keys()) != all_keys:
        # Time-dependent but constant: reshape to (cluster, time, ...)
        # Use numpy slicing instead of .isel()
        time_idx = var.dims.index('time')
        slices = [slice(None)] * len(var.dims)
        slices[time_idx] = slice(0, n_reduced_timesteps)
        sliced_values = var.values[tuple(slices)]

        other_dims = [d for d in var.dims if d != 'time']
        other_shape = [var.sizes[d] for d in other_dims]
        new_shape = [actual_n_clusters, n_time_points] + other_shape
        reshaped = sliced_values.reshape(new_shape)
        new_coords = {'cluster': cluster_coords, 'time': time_coords}
        for dim in other_dims:
            if dim in coord_cache:
                new_coords[dim] = coord_cache[dim]
        ds_new_vars[name] = xr.DataArray(
            reshaped,
            dims=['cluster', 'time'] + other_dims,
            coords=new_coords,
            attrs=var.attrs,
        )
    else:
Handle partially-missing typical slices without discarding valid aggregates.
The condition name not in typical_das or set(typical_das[name].keys()) != all_keys treats any variable missing one (period, scenario) slice as fully constant (Line 543). That discards valid aggregated data for slices where the variable does vary and reshapes raw data instead. This can corrupt reduced outputs for mixed-variability cases. Only use the constant path when the variable is absent altogether, and fill missing slices with constant values before combining.
🛠️ Suggested approach (fill missing slices)
- elif name not in typical_das or set(typical_das[name].keys()) != all_keys:
+ elif name not in typical_das:
# Time-dependent but constant: reshape to (cluster, time, ...)
# Use numpy slicing instead of .isel()
time_idx = var.dims.index('time')
slices = [slice(None)] * len(var.dims)
slices[time_idx] = slice(0, n_reduced_timesteps)
sliced_values = var.values[tuple(slices)]
...
ds_new_vars[name] = xr.DataArray(
reshaped,
dims=['cluster', 'time'] + other_dims,
coords=new_coords,
attrs=var.attrs,
)
else:
+ # Fill missing (period, scenario) slices with constant values
+ slices = dict(typical_das[name])
+ missing = all_keys - set(slices.keys())
+ if missing:
+ # TODO: extract constant-slice construction into a helper reused above
+ for key in missing:
+ slices[key] = _build_constant_slice_for_key(...)
+ da = self._combine_slices_to_dataarray_2d(
+ slices=slices,
+ attrs=var.attrs,
+ periods=periods,
+ scenarios=scenarios,
+ )🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 533 - 565, The current branch
treats any variable with partially-missing typical slices as fully constant;
change the logic in the loop (around variables, ds_new_vars, typical_das,
all_keys) so that only when name not in typical_das you take the
constant/no-time path; if name exists in typical_das but its keys != all_keys,
build the time-dependent DataArray by iterating all_keys and for each missing
(period,scenario) key fill that slice with the appropriate constant aggregate
(e.g., broadcast the available constant array or a selected default from
typical_das[name]), then concatenate/stack these filled slices into the full
(cluster, time, ...) shape and assign into ds_new_vars with dims
['cluster','time',...] and coords new_coords; ensure you use the same
coord_cache, cluster_coords and time_coords when constructing the
filled/combined DataArray.
- benchmark_io_performance.py: Add Gurobi → HiGHS solver fallback
- components.py: Fix storage decay to use sum (not mean) for hours per cluster
- flow_system.py: Add RangeIndex validation requiring explicit timestep_duration
- io.py: Include auxiliary coordinates in _fast_get_dataarray
- transform_accessor.py: Add empty dataset guard after drop_constant_arrays
- transform_accessor.py: Fix timestep_mapping indexing for segmented clustering
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace data_vars.items() iteration with ds.variables pattern to avoid slow _construct_dataarray calls (5502 calls × ~1.5ms each).
Before: 3.73s
After: 1.72s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Performance optimizations using faster data access patterns throughout the codebase.
Changes
- transform_accessor.py (_build_typical_das): per-cluster .loc lookups → numpy reshape
- transform_accessor.py (_build_reduced_dataset): ds.variables + numpy slicing
- io.py (_add_clustering_to_dataset): batch ds.assign()
- io.py: ds.variables pattern
- structure.py: from_dataset
- core.py (drop_constant_arrays): ds.variables + numpy
- clustering/base.py: ds.variables pattern
Key Optimization: Numpy reshape for segmented clustering
Benchmarks
- cluster() with segments (50 clusters, 4 segments): ~93.7s → ~21.1s (4.4x)
- cluster() without segments (12 clusters): ~10.1s → ~8.9s
- to_dataset() with clustering: ~2600ms → ~109ms
Testing