From 7cd01f1defe2413504188802bdf2c72a2523361a Mon Sep 17 00:00:00 2001 From: Simon Billinge Date: Sat, 17 Jan 2026 13:50:38 -0500 Subject: [PATCH 1/5] deprecated: loadData to load_data, and move it --- docs/source/examples/parsers_example.rst | 24 +-- docs/source/examples/resample_example.rst | 6 +- docs/source/utilities/parsers_utility.rst | 4 +- news/depr-tests.rst | 23 +++ src/diffpy/utils/_deprecator.py | 12 +- src/diffpy/utils/parsers/loaddata.py | 15 +- src/diffpy/utils/tools.py | 180 ++++++++++++++++++++++ tests/test_loaddata.py | 75 +++++++++ tests/test_serialization.py | 20 +-- 9 files changed, 327 insertions(+), 32 deletions(-) create mode 100644 news/depr-tests.rst diff --git a/docs/source/examples/parsers_example.rst b/docs/source/examples/parsers_example.rst index db97e0ed..19e607eb 100644 --- a/docs/source/examples/parsers_example.rst +++ b/docs/source/examples/parsers_example.rst @@ -13,13 +13,13 @@ Using the parsers module, we can load file data into simple and easy-to-work-wit Our goal will be to extract the data, and the parameters listed in the header, from this file and load it into our program. -2) To get the data table, we will use the ``loadData`` function. The default behavior of this +2) To get the data table, we will use the ``load_data`` function. The default behavior of this function is to find and extract a data table from a file. .. code-block:: python - from diffpy.utils.parsers.loaddata import loadData - data_table = loadData('') + from diffpy.utils.tools import load_data + data_table = load_data('') While this will work with most datasets, on our ``data.txt`` file, we got a ``ValueError``. The reason for this is due to the comments ``$ Phase Transition Near This Temperature Range`` and ``--> Note Significant Jump in Rw <--`` @@ -27,9 +27,9 @@ embedded within the dataset. To fix this, try using the ``comments`` parameter. .. code-block:: python - data_table = loadData('', comments=['$', '-->']) + data_table = load_data('', comments=['$', '-->']) -This parameter tells ``loadData`` that any lines beginning with ``$`` and ``-->`` are just comments and +This parameter tells ``load_data`` that any lines beginning with ``$`` and ``-->`` are just comments and more entries in our data table may follow. Here are a few other parameters to test out: @@ -39,7 +39,7 @@ Here are a few other parameters to test out: .. code-block:: python - loadData('', comments=['$', '-->'], delimiter=',') + load_data('', comments=['$', '-->'], delimiter=',') returns an empty list. * ``minrows=50``: Only look for data tables with at least 50 rows. Since our data table has much less than that many @@ -47,7 +47,7 @@ returns an empty list. .. code-block:: python - loadData('', comments=['$', '-->'], minrows=50) + load_data('', comments=['$', '-->'], minrows=50) returns an empty list. * ``usecols=[0, 3]``: Only return the 0th and 3rd columns (zero-indexed) of the data table. For ``data.txt``, this @@ -55,14 +55,14 @@ returns an empty list. .. code-block:: python - loadData('', comments=['$', '-->'], usecols=[0, 3]) + load_data('', comments=['$', '-->'], usecols=[0, 3]) -3) Next, to get the header information, we can again use ``loadData``, +3) Next, to get the header information, we can again use ``load_data``, but this time with the ``headers`` parameter enabled. .. 
code-block:: python - hdata = loadData('', comments=['$', '-->'], headers=True) + hdata = load_data('', comments=['$', '-->'], headers=True) 4) Rather than working with separate ``data_table`` and ``hdata`` objects, it may be easier to combine them into a single dictionary. We can do so using the ``serialize_data`` function. @@ -116,8 +116,8 @@ The returned value, ``parsed_file_data``, is the dictionary we just added to ``s .. code-block:: python - data_table = loadData('') - hdata = loadData('', headers=True) + data_table = load_data('') + hdata = load_data('', headers=True) serialize_data('', hdata, data_table, serial_file='') The serial file ``serialfile.json`` should now contain two entries: ``data.txt`` and ``moredata.txt``. diff --git a/docs/source/examples/resample_example.rst b/docs/source/examples/resample_example.rst index ba28390b..5af42e73 100644 --- a/docs/source/examples/resample_example.rst +++ b/docs/source/examples/resample_example.rst @@ -16,9 +16,9 @@ given enough datapoints. .. code-block:: python - from diffpy.utils.parsers.loaddata import loadData - nickel_datatable = loadData('') - nitarget_datatable = loadData('') + from diffpy.utils.tools import load_data + nickel_datatable = load_data('') + nitarget_datatable = load_data('') Each data table has two columns: first is the grid and second is the function value. To extract the columns, we can utilize the serialize function ... diff --git a/docs/source/utilities/parsers_utility.rst b/docs/source/utilities/parsers_utility.rst index ffaf768e..954405b5 100644 --- a/docs/source/utilities/parsers_utility.rst +++ b/docs/source/utilities/parsers_utility.rst @@ -5,7 +5,7 @@ Parsers Utility The ``diffpy.utils.parsers`` module allows users to easily and robustly load file data into a Python project. -- ``loaddata.loadData()``: Find and load a data table/block from a text file. This seems to work for most datafiles +- ``loaddata.load_data()``: Find and load a data table/block from a text file. This seems to work for most datafiles including those generated by diffpy programs. Running only ``numpy.loadtxt`` will result in errors for most these files as there is often excess data or parameters stored above the data block. Users can instead choose to load all the parameters of the form `` = `` into a dictionary @@ -17,7 +17,7 @@ The ``diffpy.utils.parsers`` module allows users to easily and robustly load fil - ``serialization.deserialize_data()``: Load data from a serial file format into a Python dictionary. Currently, the only supported serial format is ``.json``. -- ``serialization.serialize_data()``: Serialize the data generated by ``loadData()`` into a serial file format. Currently, the only +- ``serialization.serialize_data()``: Serialize the data generated by ``load_data()`` into a serial file format. Currently, the only supported serial format is ``.json``. For a more in-depth tutorial for how to use these parser utilities, click :ref:`here `. 
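The rename is mechanical: every ``loadData`` call becomes ``load_data`` with the same signature. Below is a minimal migration sketch against the state after this patch, assuming the tutorial's ``data.txt`` sits in the working directory (the path is illustrative):

.. code-block:: python

    import numpy as np

    from diffpy.utils.parsers.loaddata import loadData  # deprecated alias
    from diffpy.utils.tools import load_data  # new name added by this patch

    # Both spellings return the same data block; the old one additionally
    # emits a DeprecationWarning that points callers at load_data.
    new = load_data("data.txt", comments=["$", "-->"])
    old = loadData("data.txt", comments=["$", "-->"])
    assert np.array_equal(new, old)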
diff --git a/news/depr-tests.rst b/news/depr-tests.rst new file mode 100644 index 00000000..74f7c173 --- /dev/null +++ b/news/depr-tests.rst @@ -0,0 +1,23 @@ +**Added:** + +* + +**Changed:** + +* load_data now takes a Path or a string for the file-path + +**Deprecated:** + +* diffpy.utils.parsers.loaddata.loadData replaced by diffpy.utils.tools.load_data + +**Removed:** + +* + +**Fixed:** + +* + +**Security:** + +* diff --git a/src/diffpy/utils/_deprecator.py b/src/diffpy/utils/_deprecator.py index 72172cae..c504604a 100644 --- a/src/diffpy/utils/_deprecator.py +++ b/src/diffpy/utils/_deprecator.py @@ -20,7 +20,7 @@ def deprecated(message, *, category=DeprecationWarning, stacklevel=1): .. code-block:: python - from diffpy._deprecations import deprecated + from diffpy.utils._deprecator import deprecated import warnings @deprecated("old_function is deprecated; use new_function instead") @@ -39,7 +39,6 @@ def new_function(x, y): .. code-block:: python from diffpy._deprecations import deprecated - import warnings warnings.simplefilter("always", DeprecationWarning) @@ -83,7 +82,9 @@ def wrapper(*args, **kwargs): return decorator -def deprecation_message(base, old_name, new_name, removal_version): +def deprecation_message( + base, old_name, new_name, removal_version, new_base=None +): """Generate a deprecation message. Parameters @@ -102,7 +103,10 @@ def deprecation_message(base, old_name, new_name, removal_version): str A formatted deprecation message. """ + if new_base is None: + new_base = base return ( f"'{base}.{old_name}' is deprecated and will be removed in " - f"version {removal_version}. Please use '{base}.{new_name}' instead." + f"version {removal_version}. Please use '{new_base}.{new_name}' " + f"instead." ) diff --git a/src/diffpy/utils/parsers/loaddata.py b/src/diffpy/utils/parsers/loaddata.py index 05d37497..7de4204e 100644 --- a/src/diffpy/utils/parsers/loaddata.py +++ b/src/diffpy/utils/parsers/loaddata.py @@ -18,8 +18,21 @@ import numpy from diffpy.utils import validators +from diffpy.utils._deprecator import deprecated, deprecation_message +base = "diffpy.utils.parsers.loaddata" +removal_version = "4.0.0" +loaddata_deprecation_msg = deprecation_message( + base, + "loadData", + "load_data", + removal_version, + new_base="diffpy.utils.tools", +) + + +@deprecated(loaddata_deprecation_msg) def loadData( filename, minrows=10, headers=False, hdel="=", hignore=None, **kwargs ): @@ -254,7 +267,7 @@ def readfp(self, fp, append=False): File details include: * File name. - * All data blocks findable by loadData. + * All data blocks findable by load_data. * Headers (if present) for each data block. (Generally the headers contain column name information). """ diff --git a/src/diffpy/utils/tools.py b/src/diffpy/utils/tools.py index 19f8e03b..4d4e19fb 100644 --- a/src/diffpy/utils/tools.py +++ b/src/diffpy/utils/tools.py @@ -8,6 +8,7 @@ from scipy.signal import convolve from xraydb import material_mu +from diffpy.utils import validators from diffpy.utils.parsers.loaddata import loadData @@ -396,3 +397,182 @@ def compute_mud(filepath): key=lambda pair: pair[1], ) return best_mud + + +def load_data( + filename, minrows=10, headers=False, hdel="=", hignore=None, **kwargs +): + """Find and load data from a text file. + + The data block is identified as the first matrix block of at least + minrows rows and constant number of columns. This seems to work for most + of the datafiles including those generated by diffpy programs. 
+ + Parameters + ---------- + filename: Path or string + Name of the file we want to load data from. + minrows: int + Minimum number of rows in the first data block. All rows must have + the same number of floating point values. + headers: bool + when False (default), the function returns a numpy array of the data + in the data block. When True, the function instead returns a + dictionary of parameters and their corresponding values parsed from + header (information prior the data block). See hdel and hignore for + options to help with parsing header information. + hdel: str + (Only used when headers enabled.) Delimiter for parsing header + information (default '='). e.g. using default hdel, the line ' + parameter = p_value' is put into the dictionary as + {parameter: p_value}. + hignore: list + (Only used when headers enabled.) Ignore header rows beginning with + any elements in hignore. e.g. hignore=['# ', '['] causes the + following lines to be skipped: '# qmax=10', '[defaults]'. + kwargs: + Keyword arguments that are passed to numpy.loadtxt including the + following arguments below. (See numpy.loadtxt for more details.) Only + pass kwargs used by numpy.loadtxt. + + Useful kwargs + ============= + comments: str, sequence of str + The characters or list of characters used to indicate the start of a + comment (default '#'). Comment lines are ignored. + delimiter: str + Delimiter for the data in the block (default use whitespace). For + comma-separated data blocks, set delimiter to ','. + unpack: bool + Return data as a sequence of columns that allows tuple unpacking such + as x, y = load_data(FILENAME, unpack=True). Note transposing the + loaded array as load_data(FILENAME).T has the same effect. + usecols: + Zero-based index of columns to be loaded, by default use all detected + columns. The reading skips data blocks that do not have the usecols- + specified columns. + + Returns + ------- + data_block: ndarray + A numpy array containing the found data block. (This is not returned + if headers is enabled.) + hdata: dict + If headers are enabled, return a dictionary of parameters read from + the header. + """ + from numpy import array, loadtxt + + # for storing header data + hdata = {} + # determine the arguments + delimiter = kwargs.get("delimiter") + usecols = kwargs.get("usecols") + # required at least one column of floating point values + mincv = (1, 1) + # but if usecols is specified, require sufficient number of columns + # where the used columns contain floats + if usecols is not None: + hiidx = max(-min(usecols), max(usecols) + 1) + mincv = (hiidx, len(set(usecols))) + + # Check if a line consists of floats only and return their count + # Return zero if some strings cannot be converted. + def countcolumnsvalues(line): + try: + words = line.split(delimiter) + # remove trailing blank columns + while words and not words[-1].strip(): + words.pop(-1) + nc = len(words) + if usecols is not None: + nv = len([float(words[i]) for i in usecols]) + else: + nv = len([float(w) for w in words]) + except (IndexError, ValueError): + nc = nv = 0 + return nc, nv + + # Check if file exists before trying to open + filename = Path(filename) + if not filename.is_file(): + raise IOError( + ( + f"File {str(filename)} cannot be found. " + "Please rerun the program specifying a valid filename." 
+ ) + ) + + # make sure fid gets cleaned up + with open(filename, "rb") as fid: + # search for the start of datablock + start = ncvblock = None + fpos = (0, 0) + nrows = 0 + for line in fid: + # decode line + dline = line.decode() + # find header information if requested + if headers: + hpair = dline.split(hdel) + flag = True + # ensure number of non-blank arguments is two + if len(hpair) != 2: + flag = False + else: + # ignore if an argument is blank + hpair[0] = hpair[0].strip() # name of data entry + hpair[1] = hpair[1].strip() # value of entry + if not hpair[0] or not hpair[1]: + flag = False + else: + # check if row has an ignore tag + if hignore is not None: + for tag in hignore: + taglen = len(tag) + if ( + len(hpair[0]) >= taglen + and hpair[0][:taglen] == tag + ): + flag = False + # add header data + if flag: + name = hpair[0] + value = hpair[1] + # check if data value should be stored as float + if validators.is_number(hpair[1]): + value = float(hpair[1]) + hdata.update({name: value}) + # continue search for the start of datablock + fpos = (fpos[1], fpos[1] + len(line)) + line = dline + ncv = countcolumnsvalues(line) + if ncv < mincv: + start = None + continue + # ncv is acceptable here, require the same number of columns + # throughout the datablock + if start is None or ncv != ncvblock: + ncvblock = ncv + nrows = 0 + start = fpos[0] + nrows += 1 + # block was found here! + if nrows >= minrows: + break + + # Return header data if requested + if headers: + return hdata # Return, so do not proceed to reading datablock + + # Return an empty array when no data found. + # loadtxt would otherwise raise an exception on loading from EOF. + if start is None: + data_block = array([], dtype=float) + else: + fid.seek(start) + # always use usecols argument so that loadtxt does not crash + # in case of trailing delimiters. + kwargs.setdefault("usecols", list(range(ncvblock[0]))) + data_block = loadtxt(fid, **kwargs) + return data_block diff --git a/tests/test_loaddata.py b/tests/test_loaddata.py index 82d947ee..95ac009f 100644 --- a/tests/test_loaddata.py +++ b/tests/test_loaddata.py @@ -6,6 +6,7 @@ import pytest from diffpy.utils.parsers.loaddata import loadData +from diffpy.utils.tools import load_data def test_loadData_default(datafile): @@ -80,3 +81,77 @@ def test_loadData_headers(datafile): loaddatawithheaders, headers=True, hdel=delimiter, hignore=hignore ) assert hdata == expected + + +def test_load_data_default(datafile): + """Check load_data() with default options.""" + loaddata01 = datafile("loaddata01.txt") + d2c = np.array([[3, 31], [4, 32], [5, 33]]) + + with pytest.raises(IOError) as err: + load_data("doesnotexist.txt") + assert str(err.value) == ( + "File doesnotexist.txt cannot be found. " + "Please rerun the program specifying a valid filename." 
+ ) + + # The default minrows=10 makes it read from the third line + d = load_data(loaddata01) + assert np.array_equal(d2c, d) + + # The usecols=(0, 1) would make it read from the third line + d = load_data(loaddata01, minrows=1, usecols=(0, 1)) + assert np.array_equal(d2c, d) + + # Check the effect of usecols effect + d = load_data(loaddata01, usecols=(0,)) + assert np.array_equal(d2c[:, 0], d) + + d = load_data(loaddata01, usecols=(1,)) + assert np.array_equal(d2c[:, 1], d) + + +def test_load_data_1column(datafile): + """Check loading of one-column data.""" + loaddata01 = datafile("loaddata01.txt") + d1c = np.arange(1, 6) + + # Assertions using pytest's assert + d = load_data(loaddata01, usecols=[0], minrows=1) + assert np.array_equal(d1c, d) + + d = load_data(loaddata01, usecols=[0], minrows=2) + assert np.array_equal(d1c, d) + + d = load_data(loaddata01, usecols=[0], minrows=3) + assert not np.array_equal(d1c, d) + + +def test_load_data_headers(datafile): + """Check loadData() with headers options enabled.""" + expected = { + "wavelength": 0.1, + "dataformat": "Qnm", + "inputfile": "darkSub_rh20_C_01.chi", + "mode": "xray", + "bgscale": 1.2998929285, + "composition": "0.800.20", + "outputtype": "gr", + "qmaxinst": 25.0, + "qmin": 0.1, + "qmax": 25.0, + "rmax": "100.0r", + "rmin": "0.0r", + "rstep": "0.01r", + "rpoly": "0.9r", + } + + loaddatawithheaders = datafile("loaddatawithheaders.txt") + hignore = ["# ", "// ", "["] # ignore lines beginning with these strings + delimiter = ": " # what our data should be separated by + + # Load data with headers + hdata = load_data( + loaddatawithheaders, headers=True, hdel=delimiter, hignore=hignore + ) + assert hdata == expected diff --git a/tests/test_serialization.py b/tests/test_serialization.py index 049d325c..33adb4ee 100644 --- a/tests/test_serialization.py +++ b/tests/test_serialization.py @@ -7,7 +7,7 @@ ImproperSizeError, UnsupportedTypeError, ) -from diffpy.utils.parsers.loaddata import loadData +from diffpy.utils.parsers.loaddata import load_data from diffpy.utils.parsers.serialization import deserialize_data, serialize_data @@ -21,9 +21,9 @@ def test_load_multiple(tmp_path, datafile): generated_data = None for headerfile in tlm_list: - # gather data using loadData - hdata = loadData(headerfile, headers=True) - data_table = loadData(headerfile) + # gather data using load_data + hdata = load_data(headerfile, headers=True) + data_table = load_data(headerfile) # check path extraction generated_data = serialize_data( @@ -60,8 +60,8 @@ def test_exceptions(datafile): loadfile = datafile("loadfile.txt") warningfile = datafile("generatewarnings.txt") nodt = datafile("loaddatawithheaders.txt") - hdata = loadData(loadfile, headers=True) - data_table = loadData(loadfile) + hdata = load_data(loadfile, headers=True) + data_table = load_data(loadfile) # improper file types with pytest.raises(UnsupportedTypeError): @@ -123,15 +123,15 @@ def test_exceptions(datafile): assert numpy.allclose(r_extract[data_name]["r"], r_list) assert numpy.allclose(gr_extract[data_name]["gr"], gr_list) # no datatable - nodt_hdata = loadData(nodt, headers=True) - nodt_dt = loadData(nodt) + nodt_hdata = load_data(nodt, headers=True) + nodt_dt = load_data(nodt) no_dt = serialize_data(nodt, nodt_hdata, nodt_dt, show_path=False) nodt_data_name = list(no_dt.keys())[0] assert numpy.allclose(no_dt[nodt_data_name]["data table"], nodt_dt) # ensure user is warned when columns are overwritten - hdata = loadData(warningfile, headers=True) - data_table = loadData(warningfile) + hdata 
= load_data(warningfile, headers=True) + data_table = load_data(warningfile) with pytest.warns(RuntimeWarning) as record: serialize_data( warningfile, From 0b46822b2c7b4fb721b9976c8c121cb11d1e3560 Mon Sep 17 00:00:00 2001 From: Simon Billinge Date: Sat, 17 Jan 2026 14:06:34 -0500 Subject: [PATCH 2/5] fix: bad imports so tests pass --- src/diffpy/utils/parsers/serialization.py | 2 +- src/diffpy/utils/tools.py | 3 +-- tests/test_serialization.py | 2 +- 3 files changed, 3 insertions(+), 4 deletions(-) diff --git a/src/diffpy/utils/parsers/serialization.py b/src/diffpy/utils/parsers/serialization.py index b8ed0c60..5acbbe41 100644 --- a/src/diffpy/utils/parsers/serialization.py +++ b/src/diffpy/utils/parsers/serialization.py @@ -37,7 +37,7 @@ def serialize_data( into a serial language file. Dictionary is formatted as {filename: data}. - Requires hdata and data_table (can be generated by loadData). + Requires hdata and data_table (can be generated by load_data). Parameters ---------- diff --git a/src/diffpy/utils/tools.py b/src/diffpy/utils/tools.py index 4d4e19fb..94611a96 100644 --- a/src/diffpy/utils/tools.py +++ b/src/diffpy/utils/tools.py @@ -9,7 +9,6 @@ from xraydb import material_mu from diffpy.utils import validators -from diffpy.utils.parsers.loaddata import loadData def _stringify(string_value): @@ -391,7 +390,7 @@ def compute_mud(filepath): mu*D : float The best-fit mu*D value. """ - z_data, I_data = loadData(filepath, unpack=True) + z_data, I_data = load_data(filepath, unpack=True) best_mud, _ = min( (_compute_single_mud(z_data, I_data) for _ in range(20)), key=lambda pair: pair[1], diff --git a/tests/test_serialization.py b/tests/test_serialization.py index 33adb4ee..e5bc8e1d 100644 --- a/tests/test_serialization.py +++ b/tests/test_serialization.py @@ -7,8 +7,8 @@ ImproperSizeError, UnsupportedTypeError, ) -from diffpy.utils.parsers.loaddata import load_data from diffpy.utils.parsers.serialization import deserialize_data, serialize_data +from diffpy.utils.tools import load_data def test_load_multiple(tmp_path, datafile): From 4a1ee184a07be3642d37b7655e7016114435fec2 Mon Sep 17 00:00:00 2001 From: Simon Billinge Date: Sun, 18 Jan 2026 08:25:25 -0500 Subject: [PATCH 3/5] change: change structure so load_data is back in parsers --- src/diffpy/utils/parsers/__init__.py | 10 +- src/diffpy/utils/parsers/loaddata.py | 351 ++++++++++++++------------- src/diffpy/utils/tools.py | 181 +------------- tests/test_loaddata.py | 2 +- tests/test_serialization.py | 2 +- 5 files changed, 191 insertions(+), 355 deletions(-) diff --git a/src/diffpy/utils/parsers/__init__.py b/src/diffpy/utils/parsers/__init__.py index a0278e27..956f2e6d 100644 --- a/src/diffpy/utils/parsers/__init__.py +++ b/src/diffpy/utils/parsers/__init__.py @@ -6,10 +6,18 @@ # (c) 2010 The Trustees of Columbia University # in the City of New York. All rights reserved. # -# File coded by: Chris Farrow +# File coded by: Simon Billinge # # See AUTHORS.txt for a list of people who contributed. # See LICENSE_DANSE.txt for license information. 
# ############################################################################## """Various utilities related to data parsing and manipulation.""" + +# this allows load_data to be imported from diffpy.utils.parsers +# it is needed during deprecation of the old loadData structure +# when we remove loadData we can move all the parser functionality +# a parsers.py module (like tools.py) and remove this if we want +from .loaddata import load_data + +__all__ = ["load_data"] diff --git a/src/diffpy/utils/parsers/loaddata.py b/src/diffpy/utils/parsers/loaddata.py index 7de4204e..e48090e2 100644 --- a/src/diffpy/utils/parsers/loaddata.py +++ b/src/diffpy/utils/parsers/loaddata.py @@ -13,7 +13,7 @@ # ############################################################################## -import os +from pathlib import Path import numpy @@ -35,6 +35,178 @@ @deprecated(loaddata_deprecation_msg) def loadData( filename, minrows=10, headers=False, hdel="=", hignore=None, **kwargs +): + return load_data(filename, minrows, headers, hdel, hignore, **kwargs) + + +class TextDataLoader(object): + """Smart loading of a text data with possibly multiple datasets. + + Parameters + ---------- + minrows: int + Minimum number of rows in the first data block. (Default 10.) + usecols: tuple + Which columns in our dataset to use. Ignores all other columns. If + None (default), use all columns. + skiprows + Rows in dataset to skip. (Currently not functional.) + """ + + def __init__(self, minrows=10, usecols=None, skiprows=None): + if minrows is not None: + self.minrows = minrows + if usecols is not None: + self.usecols = tuple(usecols) + # FIXME: implement usage in _findDataBlocks + if skiprows is not None: + self.skiprows = skiprows + # data items + self._reset() + return + + def _reset(self): + self.filename = "" + self.headers = [] + self.datasets = [] + self._resetvars() + return + + def _resetvars(self): + self._filename = "" + self._lines = None + self._splitlines = None + self._words = None + self._linerecs = None + self._wordrecs = None + return + + def read(self, filename): + """Open a file and run readfp. + + Use if file is not already open for read byte. + """ + with open(filename, "rb") as fp: + self.readfp(fp) + return + + def readfp(self, fp, append=False): + """Get file details. + + File details include: + * File name. + * All data blocks findable by load_data. + * Headers (if present) for each data block. (Generally the headers + contain column name information). 
+ """ + self._reset() + # try to read lines from fp first + self._lines = fp.readlines() + # and if good, assign filename + self.filename = getattr(fp, "name", "") + self._words = "".join(self._lines).split() + self._splitlines = [line.split() for line in self._lines] + self._findDataBlocks() + return + + def _findDataBlocks(self): + mincols = 1 + if self.usecols is not None and len(self.usecols): + mincols = max(mincols, max(self.usecols) + 1) + mincols = max(mincols, abs(min(self.usecols))) + nlines = len(self._lines) + nwords = len(self._words) + # idx - line index, nw0, nw1 - index of the first and last word, + # nf - number of words, ok - has data + self._linerecs = numpy.recarray( + (nlines,), + dtype=[ + ("idx", int), + ("nw0", int), + ("nw1", int), + ("nf", int), + ("ok", bool), + ], + ) + lr = self._linerecs + lr.idx = numpy.arange(nlines) + lr.nf = [len(sl) for sl in self._splitlines] + lr.nw1 = lr.nf.cumsum() + lr.nw0 = lr.nw1 - lr.nf + lr.ok = True + # word records + lw = self._wordrecs = numpy.recarray( + (nwords,), + dtype=[ + ("idx", int), + ("line", int), + ("col", int), + ("ok", bool), + ("value", float), + ], + ) + lw.idx = numpy.arange(nwords) + n1 = numpy.zeros(nwords, dtype=bool) + n1[lr.nw1[:-1]] = True + lw.line = n1.cumsum() + lw.col = lw.idx - lr.nw0[lw.line] + lw.ok = True + values = nwords * [0.0] + for i, w in enumerate(self._words): + try: + values[i] = float(w) + except ValueError: + lw.ok[i] = False + # prune lines that have a non-float values: + lw.values = values + if self.usecols is None: + badlines = lw.line[~lw.ok] + lr.ok[badlines] = False + else: + for col in self.usecols: + badlines = lw.line[(lw.col == col) & ~lw.ok] + lr.ok[badlines] = False + lr1 = lr[lr.nf >= mincols] + okb = numpy.r_[lr1.ok[:1], lr1.ok[1:] & ~lr1.ok[:-1], False] + oke = numpy.r_[False, ~lr1.ok[1:] & lr1.ok[:-1], lr1.ok[-1:]] + blockb = numpy.r_[True, lr1.nf[1:] != lr1.nf[:-1], False] + blocke = numpy.r_[False, blockb[1:-1], True] + beg = numpy.nonzero(okb | blockb)[0] + end = numpy.nonzero(oke | blocke)[0] + rowcounts = end - beg + assert not numpy.any(rowcounts < 0) + goodrows = rowcounts >= self.minrows + begend = numpy.transpose([beg, end - 1])[goodrows] + hbeg = 0 + for dbeg, dend in begend: + bb1 = lr1[dbeg] + ee1 = lr1[dend] + hend = bb1.idx + header = "".join(self._lines[hbeg:hend]) + hbeg = ee1.idx + 1 + if self.usecols is None: + data = numpy.reshape(lw.value[bb1.nw0 : ee1.nw1], (-1, bb1.nf)) + else: + tdata = numpy.empty( + (len(self.usecols), dend - dbeg), dtype=float + ) + for j, trow in zip(self.usecols, tdata): + j %= bb1.nf + trow[:] = lw.value[bb1.nw0 + j : ee1.nw1 : bb1.nf] + data = tdata.transpose() + self.headers.append(header) + self.datasets.append(data) + # finish reading to a last header and empty dataset + if hbeg < len(self._lines): + header = "".join(self._lines[hbeg:]) + data = numpy.empty(0, dtype=float) + self.headers.append(header) + self.datasets.append(data) + return + + +def load_data( + filename, minrows=10, headers=False, hdel="=", hignore=None, **kwargs ): """Find and load data from a text file. @@ -44,7 +216,7 @@ def loadData( Parameters ---------- - filename + filename: Path or string Name of the file we want to load data from. minrows: int Minimum number of rows in the first data block. All rows must have @@ -79,8 +251,8 @@ def loadData( comma-separated data blocks, set delimiter to ','. unpack: bool Return data as a sequence of columns that allows tuple unpacking such - as x, y = loadData(FILENAME, unpack=True). 
Note transposing the - loaded array as loadData(FILENAME).T has the same effect. + as x, y = load_data(FILENAME, unpack=True). Note transposing the + loaded array as load_data(FILENAME).T has the same effect. usecols: Zero-based index of columns to be loaded, by default use all detected columns. The reading skips data blocks that do not have the usecols- @@ -128,10 +300,11 @@ def countcolumnsvalues(line): return nc, nv # Check if file exists before trying to open - if not os.path.exists(filename): + filename = Path(filename) + if not filename.is_file(): raise IOError( ( - f"File {filename} cannot be found. " + f"File {str(filename)} cannot be found. " "Please rerun the program specifying a valid filename." ) ) @@ -209,169 +382,3 @@ def countcolumnsvalues(line): kwargs.setdefault("usecols", list(range(ncvblock[0]))) data_block = loadtxt(fid, **kwargs) return data_block - - -class TextDataLoader(object): - """Smart loading of a text data with possibly multiple datasets. - - Parameters - ---------- - minrows: int - Minimum number of rows in the first data block. (Default 10.) - usecols: tuple - Which columns in our dataset to use. Ignores all other columns. If - None (default), use all columns. - skiprows - Rows in dataset to skip. (Currently not functional.) - """ - - def __init__(self, minrows=10, usecols=None, skiprows=None): - if minrows is not None: - self.minrows = minrows - if usecols is not None: - self.usecols = tuple(usecols) - # FIXME: implement usage in _findDataBlocks - if skiprows is not None: - self.skiprows = skiprows - # data items - self._reset() - return - - def _reset(self): - self.filename = "" - self.headers = [] - self.datasets = [] - self._resetvars() - return - - def _resetvars(self): - self._filename = "" - self._lines = None - self._splitlines = None - self._words = None - self._linerecs = None - self._wordrecs = None - return - - def read(self, filename): - """Open a file and run readfp. - - Use if file is not already open for read byte. - """ - with open(filename, "rb") as fp: - self.readfp(fp) - return - - def readfp(self, fp, append=False): - """Get file details. - - File details include: - * File name. - * All data blocks findable by load_data. - * Headers (if present) for each data block. (Generally the headers - contain column name information). 
- """ - self._reset() - # try to read lines from fp first - self._lines = fp.readlines() - # and if good, assign filename - self.filename = getattr(fp, "name", "") - self._words = "".join(self._lines).split() - self._splitlines = [line.split() for line in self._lines] - self._findDataBlocks() - return - - def _findDataBlocks(self): - mincols = 1 - if self.usecols is not None and len(self.usecols): - mincols = max(mincols, max(self.usecols) + 1) - mincols = max(mincols, abs(min(self.usecols))) - nlines = len(self._lines) - nwords = len(self._words) - # idx - line index, nw0, nw1 - index of the first and last word, - # nf - number of words, ok - has data - self._linerecs = numpy.recarray( - (nlines,), - dtype=[ - ("idx", int), - ("nw0", int), - ("nw1", int), - ("nf", int), - ("ok", bool), - ], - ) - lr = self._linerecs - lr.idx = numpy.arange(nlines) - lr.nf = [len(sl) for sl in self._splitlines] - lr.nw1 = lr.nf.cumsum() - lr.nw0 = lr.nw1 - lr.nf - lr.ok = True - # word records - lw = self._wordrecs = numpy.recarray( - (nwords,), - dtype=[ - ("idx", int), - ("line", int), - ("col", int), - ("ok", bool), - ("value", float), - ], - ) - lw.idx = numpy.arange(nwords) - n1 = numpy.zeros(nwords, dtype=bool) - n1[lr.nw1[:-1]] = True - lw.line = n1.cumsum() - lw.col = lw.idx - lr.nw0[lw.line] - lw.ok = True - values = nwords * [0.0] - for i, w in enumerate(self._words): - try: - values[i] = float(w) - except ValueError: - lw.ok[i] = False - # prune lines that have a non-float values: - lw.values = values - if self.usecols is None: - badlines = lw.line[~lw.ok] - lr.ok[badlines] = False - else: - for col in self.usecols: - badlines = lw.line[(lw.col == col) & ~lw.ok] - lr.ok[badlines] = False - lr1 = lr[lr.nf >= mincols] - okb = numpy.r_[lr1.ok[:1], lr1.ok[1:] & ~lr1.ok[:-1], False] - oke = numpy.r_[False, ~lr1.ok[1:] & lr1.ok[:-1], lr1.ok[-1:]] - blockb = numpy.r_[True, lr1.nf[1:] != lr1.nf[:-1], False] - blocke = numpy.r_[False, blockb[1:-1], True] - beg = numpy.nonzero(okb | blockb)[0] - end = numpy.nonzero(oke | blocke)[0] - rowcounts = end - beg - assert not numpy.any(rowcounts < 0) - goodrows = rowcounts >= self.minrows - begend = numpy.transpose([beg, end - 1])[goodrows] - hbeg = 0 - for dbeg, dend in begend: - bb1 = lr1[dbeg] - ee1 = lr1[dend] - hend = bb1.idx - header = "".join(self._lines[hbeg:hend]) - hbeg = ee1.idx + 1 - if self.usecols is None: - data = numpy.reshape(lw.value[bb1.nw0 : ee1.nw1], (-1, bb1.nf)) - else: - tdata = numpy.empty( - (len(self.usecols), dend - dbeg), dtype=float - ) - for j, trow in zip(self.usecols, tdata): - j %= bb1.nf - trow[:] = lw.value[bb1.nw0 + j : ee1.nw1 : bb1.nf] - data = tdata.transpose() - self.headers.append(header) - self.datasets.append(data) - # finish reading to a last header and empty dataset - if hbeg < len(self._lines): - header = "".join(self._lines[hbeg:]) - data = numpy.empty(0, dtype=float) - self.headers.append(header) - self.datasets.append(data) - return diff --git a/src/diffpy/utils/tools.py b/src/diffpy/utils/tools.py index 94611a96..42e43bc8 100644 --- a/src/diffpy/utils/tools.py +++ b/src/diffpy/utils/tools.py @@ -8,7 +8,7 @@ from scipy.signal import convolve from xraydb import material_mu -from diffpy.utils import validators +from diffpy.utils.parsers import load_data def _stringify(string_value): @@ -396,182 +396,3 @@ def compute_mud(filepath): key=lambda pair: pair[1], ) return best_mud - - -def load_data( - filename, minrows=10, headers=False, hdel="=", hignore=None, **kwargs -): - """Find and load data from a text file. 
- - The data block is identified as the first matrix block of at least - minrows rows and constant number of columns. This seems to work for most - of the datafiles including those generated by diffpy programs. - - Parameters - ---------- - filename: Path or string - Name of the file we want to load data from. - minrows: int - Minimum number of rows in the first data block. All rows must have - the same number of floating point values. - headers: bool - when False (default), the function returns a numpy array of the data - in the data block. When True, the function instead returns a - dictionary of parameters and their corresponding values parsed from - header (information prior the data block). See hdel and hignore for - options to help with parsing header information. - hdel: str - (Only used when headers enabled.) Delimiter for parsing header - information (default '='). e.g. using default hdel, the line ' - parameter = p_value' is put into the dictionary as - {parameter: p_value}. - hignore: list - (Only used when headers enabled.) Ignore header rows beginning with - any elements in hignore. e.g. hignore=['# ', '['] causes the - following lines to be skipped: '# qmax=10', '[defaults]'. - kwargs: - Keyword arguments that are passed to numpy.loadtxt including the - following arguments below. (See numpy.loadtxt for more details.) Only - pass kwargs used by numpy.loadtxt. - - Useful kwargs - ============= - comments: str, sequence of str - The characters or list of characters used to indicate the start of a - comment (default '#'). Comment lines are ignored. - delimiter: str - Delimiter for the data in the block (default use whitespace). For - comma-separated data blocks, set delimiter to ','. - unpack: bool - Return data as a sequence of columns that allows tuple unpacking such - as x, y = load_data(FILENAME, unpack=True). Note transposing the - loaded array as load_data(FILENAME).T has the same effect. - usecols: - Zero-based index of columns to be loaded, by default use all detected - columns. The reading skips data blocks that do not have the usecols- - specified columns. - - Returns - ------- - data_block: ndarray - A numpy array containing the found data block. (This is not returned - if headers is enabled.) - hdata: dict - If headers are enabled, return a dictionary of parameters read from - the header. - """ - from numpy import array, loadtxt - - # for storing header data - hdata = {} - # determine the arguments - delimiter = kwargs.get("delimiter") - usecols = kwargs.get("usecols") - # required at least one column of floating point values - mincv = (1, 1) - # but if usecols is specified, require sufficient number of columns - # where the used columns contain floats - if usecols is not None: - hiidx = max(-min(usecols), max(usecols) + 1) - mincv = (hiidx, len(set(usecols))) - - # Check if a line consists of floats only and return their count - # Return zero if some strings cannot be converted. - def countcolumnsvalues(line): - try: - words = line.split(delimiter) - # remove trailing blank columns - while words and not words[-1].strip(): - words.pop(-1) - nc = len(words) - if usecols is not None: - nv = len([float(words[i]) for i in usecols]) - else: - nv = len([float(w) for w in words]) - except (IndexError, ValueError): - nc = nv = 0 - return nc, nv - - # Check if file exists before trying to open - filename = Path(filename) - if not filename.is_file(): - raise IOError( - ( - f"File {str(filename)} cannot be found. " - "Please rerun the program specifying a valid filename." 
- ) - ) - - # make sure fid gets cleaned up - with open(filename, "rb") as fid: - # search for the start of datablock - start = ncvblock = None - fpos = (0, 0) - nrows = 0 - for line in fid: - # decode line - dline = line.decode() - # find header information if requested - if headers: - hpair = dline.split(hdel) - flag = True - # ensure number of non-blank arguments is two - if len(hpair) != 2: - flag = False - else: - # ignore if an argument is blank - hpair[0] = hpair[0].strip() # name of data entry - hpair[1] = hpair[1].strip() # value of entry - if not hpair[0] or not hpair[1]: - flag = False - else: - # check if row has an ignore tag - if hignore is not None: - for tag in hignore: - taglen = len(tag) - if ( - len(hpair[0]) >= taglen - and hpair[0][:taglen] == tag - ): - flag = False - # add header data - if flag: - name = hpair[0] - value = hpair[1] - # check if data value should be stored as float - if validators.is_number(hpair[1]): - value = float(hpair[1]) - hdata.update({name: value}) - # continue search for the start of datablock - fpos = (fpos[1], fpos[1] + len(line)) - line = dline - ncv = countcolumnsvalues(line) - if ncv < mincv: - start = None - continue - # ncv is acceptable here, require the same number of columns - # throughout the datablock - if start is None or ncv != ncvblock: - ncvblock = ncv - nrows = 0 - start = fpos[0] - nrows += 1 - # block was found here! - if nrows >= minrows: - break - - # Return header data if requested - if headers: - return hdata # Return, so do not proceed to reading datablock - - # Return an empty array when no data found. - # loadtxt would otherwise raise an exception on loading from EOF. - if start is None: - data_block = array([], dtype=float) - else: - fid.seek(start) - # always use usecols argument so that loadtxt does not crash - # in case of trailing delimiters. - kwargs.setdefault("usecols", list(range(ncvblock[0]))) - data_block = loadtxt(fid, **kwargs) - return data_block diff --git a/tests/test_loaddata.py b/tests/test_loaddata.py index 95ac009f..92c53571 100644 --- a/tests/test_loaddata.py +++ b/tests/test_loaddata.py @@ -5,8 +5,8 @@ import numpy as np import pytest +from diffpy.utils.parsers import load_data from diffpy.utils.parsers.loaddata import loadData -from diffpy.utils.tools import load_data def test_loadData_default(datafile): diff --git a/tests/test_serialization.py b/tests/test_serialization.py index e5bc8e1d..eeab5307 100644 --- a/tests/test_serialization.py +++ b/tests/test_serialization.py @@ -3,12 +3,12 @@ import numpy import pytest +from diffpy.utils.parsers import load_data from diffpy.utils.parsers.custom_exceptions import ( ImproperSizeError, UnsupportedTypeError, ) from diffpy.utils.parsers.serialization import deserialize_data, serialize_data -from diffpy.utils.tools import load_data def test_load_multiple(tmp_path, datafile): From 8d11075c07d46594572da1c7ed2d20918f5a12ea Mon Sep 17 00:00:00 2001 From: Simon Billinge Date: Sun, 18 Jan 2026 08:46:37 -0500 Subject: [PATCH 4/5] docs: change docstring so API docs are correctly updated --- src/diffpy/utils/_deprecator.py | 7 +++++++ src/diffpy/utils/parsers/loaddata.py | 5 +++++ 2 files changed, 12 insertions(+) diff --git a/src/diffpy/utils/_deprecator.py b/src/diffpy/utils/_deprecator.py index c504604a..8cc50cd7 100644 --- a/src/diffpy/utils/_deprecator.py +++ b/src/diffpy/utils/_deprecator.py @@ -110,3 +110,10 @@ def deprecation_message( f"version {removal_version}. Please use '{new_base}.{new_name}' " f"instead." 
     )


_DEPRECATION_DOCSTRING_TEMPLATE = (
    "This function has been deprecated and will be "
    "removed in version {removal_version}. Please use "
    "{new_base}.{new_name} instead."
)
diff --git a/src/diffpy/utils/parsers/loaddata.py b/src/diffpy/utils/parsers/loaddata.py
index e48090e2..da422058 100644
--- a/src/diffpy/utils/parsers/loaddata.py
+++ b/src/diffpy/utils/parsers/loaddata.py
@@ -36,6 +36,11 @@ def loadData(
     filename, minrows=10, headers=False, hdel="=", hignore=None, **kwargs
 ):
+    """This function has been deprecated and will be removed in version
+    4.0.0.
+
+    Please use diffpy.utils.parsers.load_data instead.
+    """
     return load_data(filename, minrows, headers, hdel, hignore, **kwargs)

From 34bfb90816314d9da2bd306427ddb13eff0be48a Mon Sep 17 00:00:00 2001
From: Simon Billinge
Date: Sun, 18 Jan 2026 09:03:02 -0500
Subject: [PATCH 5/5] docs: fix typos in examples that still imported
 load_data from tools

---
 docs/source/examples/parsers_example.rst  | 2 +-
 docs/source/examples/resample_example.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/examples/parsers_example.rst b/docs/source/examples/parsers_example.rst
index 19e607eb..747d0c4f 100644
--- a/docs/source/examples/parsers_example.rst
+++ b/docs/source/examples/parsers_example.rst
@@ -18,7 +18,7 @@ Using the parsers module, we can load file data into simple and easy-to-work-wit

 .. code-block:: python

-    from diffpy.utils.tools import load_data
+    from diffpy.utils.parsers import load_data
     data_table = load_data('')

 While this will work with most datasets, on our ``data.txt`` file, we got a ``ValueError``. The reason for this is
diff --git a/docs/source/examples/resample_example.rst b/docs/source/examples/resample_example.rst
index 5af42e73..32e3e02a 100644
--- a/docs/source/examples/resample_example.rst
+++ b/docs/source/examples/resample_example.rst
@@ -16,7 +16,7 @@ given enough datapoints.

 .. code-block:: python

-    from diffpy.utils.tools import load_data
+    from diffpy.utils.parsers import load_data
     nickel_datatable = load_data('')
     nitarget_datatable = load_data('')
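Taken together, the series leaves ``load_data`` importable from ``diffpy.utils.parsers`` (re-exported from ``diffpy.utils.parsers.loaddata``) and accepting a ``Path`` or a string, while ``loadData`` survives only as a warning shim until version 4.0.0. A minimal sketch of the end state; ``powder.dat`` is a hypothetical two-column data file:

.. code-block:: python

    import warnings
    from pathlib import Path

    from diffpy.utils.parsers import load_data
    from diffpy.utils.parsers.loaddata import loadData

    # load_data now takes a Path or a string for the file path.
    x, y = load_data(Path("powder.dat"), unpack=True, usecols=(0, 1))

    # The deprecated alias forwards its arguments to load_data and warns.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", DeprecationWarning)
        loadData("powder.dat")
    assert issubclass(caught[0].category, DeprecationWarning)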