Utils Module

This module provides the essential utility functions for data conversion, statistical analysis, caching, and event detection used throughout the MHKiT library.

matlab_to_datetime

Convert MATLAB datenum format to Python datetime

excel_to_datetime

Convert Excel datenum format to Python datetime

index_to_datetime

Convert a DataFrame index from int/float to datetime and round to the nearest millisecond

get_statistics

Calculate mean, max, min and stdev statistics of continuous data for a given statistical window.

vector_statistics

Calculate statistics for vector/directional channels based on the routine from Campbell data loggers and the Yamartino algorithm

unwrap_vector

Function used to unwrap vectors into 0-360 deg range

magnitude_phase

Returns magnitude and phase in two or three dimensions.

unorm

Calculates the root mean squared value given three arrays.

handle_caching

Handles caching of data to avoid redundant network requests or computations.

clear_cache

Clears the cache.

upcrossing

Finds the zero upcrossing points.

peaks

Finds the peaks between zero crossings.

troughs

Finds the troughs between zero crossings.

heights

Calculates the height between zero crossings.

periods

Calculates the period between zero crossings.

custom

Applies a custom function to the timeseries data between upcrossing points.

to_numeric_array

Convert input data to a numeric array, ensuring all elements are numeric.

convert_to_dataset

Converts the given data to an xarray.Dataset.

convert_to_dataarray

Converts the given data to an xarray.DataArray.

convert_nested_dict_and_pandas

Recursively searches inside nested dictionaries for pandas DataFrames to convert to xarray Datasets.

mhkit.utils.matlab_to_datetime(matlab_datenum: ndarray | list | float | int) → DatetimeIndex

Convert MATLAB datenum format to Python datetime

Parameters:

matlab_datenum (numpy array, list, float, or int) – MATLAB datenum values to be converted

Returns:

time (DateTimeIndex) – Python datetime values
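
The underlying conversion can be sketched with the standard library alone. The helper name and details below are illustrative, not MHKiT's implementation; the key point is the 366-day offset between MATLAB's year-0 ordinals and Python's year-1 ordinals:

```python
from datetime import datetime, timedelta

def datenum_to_datetime(datenum):
    """Convert a single MATLAB datenum (days since year 0) to a datetime.

    MATLAB ordinals start at year 0 and Python ordinals at year 1, so a
    366-day offset aligns the two calendars; the fractional part carries
    the time of day.
    """
    fractional_day = datenum % 1
    return (datetime.fromordinal(int(datenum))
            + timedelta(days=fractional_day)
            - timedelta(days=366))

# MATLAB datenum 730486.5 corresponds to 2000-01-01 12:00:00
print(datenum_to_datetime(730486.5))
```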

mhkit.utils.excel_to_datetime(excel_num: ndarray | list | float | int) → DatetimeIndex

Convert Excel datenum format to Python datetime

Parameters:

excel_num (numpy array, list, float, or int) – Excel datenums to be converted

Returns:

time (DateTimeIndex) – Python datetime values
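
Excel serial dates can be converted similarly. The sketch below is illustrative, not MHKiT's code; the 1899-12-30 epoch absorbs Excel's fictitious 1900-02-29, so it is valid for serials above 60:

```python
from datetime import datetime, timedelta

# Epoch chosen so Excel's 1900 leap-year bug cancels out for serials >= 61
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert an Excel serial date number to a datetime.

    The integer part counts days from the epoch; the fractional part is
    the time of day.
    """
    return EXCEL_EPOCH + timedelta(days=serial)

# Excel serial 36526.25 corresponds to 2000-01-01 06:00:00
print(excel_serial_to_datetime(36526.25))
```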

mhkit.utils.index_to_datetime(index, unit='s', origin='unix')

Convert a DataFrame index from int/float to datetime and round to the nearest millisecond

Parameters:
  • index (pandas Index) – DataFrame index in int or float

  • unit (str, optional) – Units of the original index

  • origin (str) – Reference date used to define the starting time. If origin = ‘unix’, the start time is ‘1970-01-01 00:00:00’. The origin can also be defined using a datetime string in a similar format (i.e. ‘2019-05-17 16:05:45’)

Returns:

pandas Index – DataFrame index in datetime
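
For a seconds-since-epoch index with origin='unix', the conversion and millisecond rounding amount to the following stdlib sketch (function name illustrative; the real function operates on a pandas Index):

```python
from datetime import datetime, timedelta, timezone

def seconds_index_to_datetime(index):
    """Convert numeric seconds-since-epoch values to datetimes,
    rounded to the nearest millisecond (origin='unix')."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return [epoch + timedelta(milliseconds=round(v * 1000)) for v in index]

# 1.0004 s rounds down to 1.000 s; 1.0006 s rounds up to 1.001 s
out = seconds_index_to_datetime([0.0, 1.0004, 1.0006])
```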

mhkit.utils.get_statistics(data: DataFrame, freq: float | int, period: float | int = 600, vector_channels: str | List[str] | None = None) → Tuple[DataFrame, DataFrame, DataFrame, DataFrame]

Calculate mean, max, min and stdev statistics of continuous data for a given statistical window. Default length of statistical window (period) is based on IEC TS 62600-3:2020 ED1. Also allows calculation of statistics for multiple statistical windows of continuous data and accounts for vector/directional channels.

Parameters:
  • data (pandas DataFrame) – Data indexed by datetime with columns of data to be analyzed

  • freq (float/int) – Sample rate of data [Hz]

  • period (float/int) – Statistical window of interest [sec], default = 600

  • vector_channels (string or list (optional)) – List of vector/directional channel names formatted in deg (0-360)

Returns:

means, maxs, mins, stdevs (pandas DataFrame) – Calculated statistical values from the data, indexed by the first timestamp
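
The windowing logic can be illustrated without pandas: split the signal into blocks of period * freq samples and reduce each complete block. This is a simplified sketch; the real function indexes by timestamp and additionally handles vector channels:

```python
import statistics

def window_stats(data, freq, period=600):
    """Split a continuous signal into windows of period * freq samples
    and compute mean/max/min/stdev per window (complete windows only)."""
    n = int(period * freq)
    windows = [data[i:i + n] for i in range(0, len(data) - n + 1, n)]
    return (
        [statistics.mean(w) for w in windows],
        [max(w) for w in windows],
        [min(w) for w in windows],
        [statistics.stdev(w) for w in windows],
    )

# Two 4-sample windows: 2 Hz sample rate, 2 s statistical window
means, maxs, mins, stds = window_stats([1, 2, 3, 4, 10, 10, 10, 10],
                                       freq=2, period=2)
```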

mhkit.utils.vector_statistics(data: Series | ndarray | list) → Tuple[ndarray, ndarray]

Calculate statistics for vector/directional channels based on the routine from Campbell data loggers and the Yamartino algorithm

Parameters:

data (pandas Series, numpy array, list) – Vector channel to calculate statistics on [deg, 0-360]

Returns:

  • vector_avg (numpy array) – Vector mean statistic

  • vector_std (numpy array) – Vector standard deviation statistic
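
The Yamartino estimator referenced here can be sketched as follows; MHKiT's exact routine may differ in details such as wrapping conventions:

```python
import math

def yamartino_stats(degrees):
    """Vector mean and Yamartino standard deviation for directional
    data [deg]. A sketch of the published estimator, not MHKiT's code."""
    sa = sum(math.sin(math.radians(d)) for d in degrees) / len(degrees)
    ca = sum(math.cos(math.radians(d)) for d in degrees) / len(degrees)
    # Mean direction from the averaged unit vectors, wrapped to 0-360
    vector_avg = math.degrees(math.atan2(sa, ca)) % 360
    # Yamartino's one-pass approximation to the directional stdev
    eps = math.sqrt(max(0.0, 1.0 - (sa * sa + ca * ca)))
    vector_std = math.degrees(math.asin(eps)) * (1 + (2 / math.sqrt(3) - 1) * eps ** 3)
    return vector_avg, vector_std

# Directions straddling north: mean is ~0 deg, not the arithmetic 180
avg, std = yamartino_stats([350, 10])
```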

mhkit.utils.unwrap_vector(data: Series | ndarray | list) → ndarray

Function used to unwrap vectors into 0-360 deg range

Parameters:

data (pandas Series, numpy array, list) – Data points to be unwrapped [deg]

Returns:

data (numpy array) – Data points unwrapped between 0-360 deg
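
At heart this is a modulo operation, as the minimal sketch below shows (illustrative helper, not MHKiT's implementation):

```python
def unwrap_to_360(data):
    """Map angles [deg] into the 0-360 range via modulo arithmetic."""
    return [d % 360 for d in data]

print(unwrap_to_360([-30, 370, 180]))  # [330, 10, 180]
```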

mhkit.utils.magnitude_phase(x: float | int | ndarray, y: float | int | ndarray, z: float | int | ndarray | None = None) → Tuple[float | ndarray, float | ndarray] | Tuple[float | ndarray, float | ndarray, float | ndarray]

Returns magnitude and phase in two or three dimensions.

Parameters:
  • x (array_like) – x-component

  • y (array_like) – y-component

  • z (array_like) – z-component, defined positive up. Optional; default None.

Returns:

  • mag (float or array) – magnitude of the vector

  • theta (float or array) – radians from the x-axis

  • phi (float or array) – radians from z-axis defined as positive up. Optional: only returned when z is passed.
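
A minimal sketch of the 3-D case, assuming the spherical conventions stated above (theta measured from the x-axis, phi from the positive z-axis):

```python
import math

def magnitude_phase_3d(x, y, z):
    """Spherical decomposition of a 3-D vector (conventions assumed:
    theta from the x-axis, phi from the positive z-axis)."""
    mag = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)
    phi = math.atan2(math.sqrt(x * x + y * y), z)
    return mag, theta, phi

# A unit diagonal in the x-y plane: theta = pi/4, phi = pi/2
mag, theta, phi = magnitude_phase_3d(1.0, 1.0, 0.0)
```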

mhkit.utils.unorm(x: ndarray | float64 | Series, y: ndarray | float64 | Series, z: ndarray | float64 | Series) → ndarray | float64

Calculates the root mean squared value given three arrays.

Parameters:
  • x (array) – One input for the root mean squared calculation (e.g., x velocity)

  • y (array) – One input for the root mean squared calculation (e.g., y velocity)

  • z (array) – One input for the root mean squared calculation (e.g., z velocity)

Returns:

u_norm (array) – The root mean squared of x, y, and z.

Example

If the inputs are [1,2,3], [4,5,6], and [7,8,9], the code takes the corresponding value from each array and calculates the root mean squared. The resulting output is [ 8.1240384, 9.64365076, 11.22497216].
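
Judging from the example values, the computation is an elementwise sqrt(x² + y² + z²); the illustrative helper below reproduces the documented output:

```python
import math

def unorm_sketch(x, y, z):
    """Elementwise sqrt(x^2 + y^2 + z^2) over three equal-length arrays,
    matching the example output above (illustrative, not MHKiT's code)."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]

result = unorm_sketch([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(result)  # matches the documented values to float precision
```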

mhkit.utils.handle_caching(hash_params: str, cache_dir: str, cache_content: Dict[str, Any] | None = None, clear_cache_file: bool = False) → Tuple[DataFrame | None, Dict[str, Any] | None, str]

Handles caching of data to avoid redundant network requests or computations.

The function checks if a cache file exists for the given parameters. If it does, the function loads data from the cache file, unless the clear_cache_file parameter is set to True, in which case the cache file is cleared. If the cache file does not exist and the cache_content parameter is not None, the function stores the provided content in a cache file.

Parameters:
  • hash_params (str) – Parameters to generate the cache file hash.

  • cache_dir (str) – Directory where cache files are stored.

  • cache_content (Optional[Dict[str, Any]], optional) – Content to be cached. Should contain ‘data’, ‘metadata’, and ‘write_json’.

  • clear_cache_file (bool) – Whether to clear the existing cache.

Returns:

Tuple[Optional[pd.DataFrame], Optional[Dict[str, Any]], str] – Cached data, metadata, and cache file path.
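
The cache-or-compute pattern described here can be sketched as follows; the SHA-256 key, JSON storage, and helper name are assumptions for illustration, not MHKiT's actual cache format:

```python
import hashlib
import json
import os
import tempfile

def cached(hash_params, cache_dir, compute):
    """Illustrative cache-or-compute pattern: hash the parameters to a
    file name, load JSON if the file exists, otherwise compute and store."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(hash_params.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), path  # cache hit: skip the computation
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)  # cache miss: store for next time
    return result, path

cache_dir = tempfile.mkdtemp()
data1, path = cached("station=46022", cache_dir, lambda: {"a": 1})
# Second call with the same parameters is served from the cache
data2, _ = cached("station=46022", cache_dir, lambda: {"a": 999})
```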

mhkit.utils.clear_cache(specific_dir: str | None = None) → None

Clears the cache.

The function checks if a specific directory or the entire cache directory exists. If it does, the function removes the directory and recreates it. If the directory does not exist, a message indicating this is printed.

Parameters:

specific_dir (str or None, optional) – Specific sub-directory to clear. If None, the entire cache is cleared. Default is None.

Returns:

None
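
The clear-and-recreate behavior can be sketched as (illustrative helper, not MHKiT's implementation):

```python
import os
import shutil
import tempfile

def clear_dir(cache_dir):
    """Remove a cache directory and recreate it empty; print a message
    if it does not exist (illustrative sketch of the behavior)."""
    if os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)
        os.makedirs(cache_dir)
    else:
        print(f"{cache_dir} does not exist")

d = tempfile.mkdtemp()
open(os.path.join(d, "stale.json"), "w").close()  # a stale cache file
clear_dir(d)  # directory survives, contents are gone
```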

mhkit.utils.upcrossing(t: ndarray, data: ndarray) → ndarray

Finds the zero upcrossing points.

Parameters:
  • t (np.array) – Time array.

  • data (np.array) – Signal time series.

Returns:

inds (np.array) – Zero crossing indices
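
A minimal sketch of upcrossing detection, assuming the sign convention data[i] <= 0 < data[i+1] (the exact convention used by MHKiT may differ):

```python
def find_upcrossings(data):
    """Indices i where the signal crosses zero going upward between
    samples i and i+1 (convention assumed: data[i] <= 0 < data[i+1])."""
    return [i for i in range(len(data) - 1) if data[i] <= 0 < data[i + 1]]

signal = [-1, -0.5, 0.5, 1, 0.5, -0.5, -1, -0.5, 0.5]
print(find_upcrossings(signal))  # [1, 7]
```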

mhkit.utils.peaks(t: ndarray, data: ndarray, inds: ndarray | None = None) → ndarray

Finds the peaks between zero crossings.

Parameters:
  • t (np.array) – Time array.

  • data (np.array) – Signal time-series.

  • inds (np.ndarray, optional) – Optional indices for the upcrossing. Useful when using several of the upcrossing methods to avoid repeating the upcrossing analysis each time.

Returns:

peaks (np.array) – Peak values of the time-series

mhkit.utils.troughs(t: ndarray, data: ndarray, inds: ndarray | None = None) → ndarray

Finds the troughs between zero crossings.

Parameters:
  • t (np.array) – Time array.

  • data (np.array) – Signal time-series.

  • inds (np.array, optional) – Optional indices for the upcrossing. Useful when using several of the upcrossing methods to avoid repeating the upcrossing analysis each time.

Returns:

troughs (np.array) – Trough values of the time-series

mhkit.utils.heights(t: ndarray, data: ndarray, inds: ndarray | None = None) → ndarray

Calculates the height between zero crossings.

The height is defined as the maximum value minus the minimum value between the zero crossing points.

Parameters:
  • t (np.array) – Time array.

  • data (np.array) – Signal time-series.

  • inds (np.array, optional) – Optional indices for the upcrossing. Useful when using several of the upcrossing methods to avoid repeating the upcrossing analysis each time.

Returns:

heights (np.array) – Height values of the time-series

mhkit.utils.periods(t: ndarray, data: ndarray, inds: ndarray | None = None) → ndarray

Calculates the period between zero crossings.

Parameters:
  • t (np.array) – Time array.

  • data (np.array) – Signal time-series.

  • inds (np.array, optional) – Optional indices for the upcrossing. Useful when using several of the upcrossing methods to avoid repeating the upcrossing analysis each time.

Returns:

periods (np.array) – Period values of the time-series

mhkit.utils.custom(t: ndarray, data: ndarray, func: Callable[[int, int], ndarray], inds: ndarray | None = None) → ndarray

Applies a custom function to the timeseries data between upcrossing points.

Parameters:
  • t (np.array) – Time array.

  • data (np.array) – Signal time-series.

  • func (Callable[[int, int], np.ndarray]) – Function to apply between the zero crossing periods, given t[ind1] and t[ind2], where ind1 < ind2 correspond to the start and end of an upcrossing section.

  • inds (np.array, optional) – Optional indices for the upcrossing. Useful when using several of the upcrossing methods to avoid repeating the upcrossing analysis each time.

Returns:

values (np.array) – Custom values of the time-series
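
peaks, troughs, heights, periods, and custom all share one pattern: locate upcrossings once, then reduce each section between consecutive crossing indices. The sketch below is simplified (the reducer here receives the data slice directly, whereas mhkit.utils.custom passes index pairs), with the same sign convention assumed as above:

```python
def find_upcrossings(data):
    """Indices i where data[i] <= 0 < data[i + 1] (convention assumed)."""
    return [i for i in range(len(data) - 1) if data[i] <= 0 < data[i + 1]]

def apply_between_crossings(data, func):
    """Apply func to each section between consecutive upcrossing indices,
    mirroring the peaks/troughs/heights pattern (simplified sketch)."""
    inds = find_upcrossings(data)
    return [func(data[i1:i2 + 1]) for i1, i2 in zip(inds, inds[1:])]

signal = [-1, -0.5, 0.5, 1, 0.5, -0.5, -1, -0.5, 0.5, 2, -0.5]
# Height of each complete cycle: max minus min between crossings
cycle_heights = apply_between_crossings(signal, lambda seg: max(seg) - min(seg))
```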

mhkit.utils.to_numeric_array(data: list | ndarray | Series | DataArray, name: str) → ndarray

Convert input data to a numeric array, ensuring all elements are numeric.
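
A sketch of the kind of validation implied (names and error text are illustrative, not MHKiT's; note that bool must be excluded explicitly since it subclasses int in Python):

```python
def to_numeric_list(data, name):
    """Validate that every element is numeric before converting,
    raising TypeError with the offending variable's name."""
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in data):
        raise TypeError(f"{name} must contain only numeric values")
    return [float(v) for v in data]

print(to_numeric_list([1, 2.5], "velocity"))  # [1.0, 2.5]
```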

mhkit.utils.convert_to_dataset(data: DataFrame | Series | DataArray | Dataset, name: str = 'data') → Dataset

Converts the given data to an xarray.Dataset.

This function is designed to handle inputs that can be either a pandas DataFrame, a pandas Series, an xarray DataArray, or an xarray Dataset. It ensures that the output is consistently an xarray.Dataset.

Parameters:
  • data (pandas Series, pandas DataFrame, xarray DataArray, or xarray Dataset) – The data to be converted.

  • name (str (Optional)) – The name to assign to the data variable in case the input is an xarray DataArray without a name. Default value is ‘data’.

Returns:

xarray.Dataset – The input data converted to an xarray.Dataset. If the input is already an xarray.Dataset, it is returned as is.

Examples

>>> import pandas as pd
>>> import xarray as xr
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> ds = convert_to_dataset(df)
>>> type(ds)
<class 'xarray.core.dataset.Dataset'>
>>> series = pd.Series([1, 2, 3], name='C')
>>> ds = convert_to_dataset(series)
>>> type(ds)
<class 'xarray.core.dataset.Dataset'>
>>> data_array = xr.DataArray([1, 2, 3])
>>> ds = convert_to_dataset(data_array, name='D')
>>> type(ds)
<class 'xarray.core.dataset.Dataset'>

mhkit.utils.convert_to_dataarray(data: ndarray | DataFrame | Series | DataArray | Dataset, name: str = 'data') → DataArray

Converts the given data to an xarray.DataArray.

This function takes in a numpy ndarray, pandas Series, pandas DataFrame, or xarray Dataset and outputs an equivalent xarray DataArray. DataArrays are passed through unchanged.

xarray Datasets can only be input when all variables have the same dimensions.

Multivariate pandas DataFrames become 2D DataArrays, which is especially useful when IO functions return DataFrames with an extremely large number of variables. Use the function convert_to_dataset to change a multivariate DataFrame into a multivariate Dataset.

Parameters:
  • data (numpy ndarray, pandas DataFrame, pandas Series, xarray DataArray, or xarray Dataset) – The data to be converted.

  • name (str (Optional)) – The name assigned to the output data variable, overwriting any existing name for pandas or xarray input. Default value is ‘data’.

Returns:

xarray.DataArray – The input data converted to an xarray.DataArray. If the input is already an xarray.DataArray, it is returned as is.

Examples

>>> import pandas as pd
>>> import xarray as xr
>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> da = convert_to_dataarray(df)
>>> type(da)
<class 'xarray.core.dataarray.DataArray'>
>>> series = pd.Series([1, 2, 3], name='C')
>>> da = convert_to_dataarray(series)
>>> type(da)
<class 'xarray.core.dataarray.DataArray'>
>>> data_array = xr.DataArray([1, 2, 3])
>>> da = convert_to_dataarray(data_array, name='D')
>>> type(da)
<class 'xarray.core.dataarray.DataArray'>

mhkit.utils.convert_nested_dict_and_pandas(data: Dict[str, DataFrame | Dict[str, Any]]) → Dict[str, Dataset | Dict[str, Any]]

Recursively searches inside nested dictionaries for pandas DataFrames to convert to xarray Datasets. Typically called by wave.io functions that read SWAN, WEC-Sim, CDIP, NDBC data.

Parameters:

data (dictionary of dictionaries and pandas DataFrames)

Returns:

data (dictionary of dictionaries and xarray Datasets)
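
The recursion can be sketched generically; the predicate and converter below stand in for the DataFrame check and the DataFrame.to_xarray() call so the example stays dependency-free:

```python
def convert_nested(data, should_convert, convert):
    """Recursively walk a nested dict, converting every value for which
    should_convert(value) is true (e.g. DataFrame -> Dataset in MHKiT)."""
    out = {}
    for key, value in data.items():
        if isinstance(value, dict):
            out[key] = convert_nested(value, should_convert, convert)
        elif should_convert(value):
            out[key] = convert(value)
        else:
            out[key] = value
    return out

# Stand-in for the DataFrame -> Dataset conversion: lists become tuples
nested = {"run1": [1, 2], "group": {"run2": [3]}}
result = convert_nested(nested, lambda v: isinstance(v, list), tuple)
```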