QC Module

The QC module includes quality control functions from Pecos. See https://pecos.readthedocs.io for more details.

Functions – Description

  • check_corrupt – Check for corrupt data
  • check_delta – Check for stagnant data and/or abrupt changes in the data using the difference between max and min values within a rolling window
  • check_increment – Check data increments using the difference between values
  • check_missing – Check for missing data
  • check_outlier – Check for outliers using normalized data within a rolling window
  • check_range – Check for data outside the expected range
  • check_timestamp – Check time series for missing, non-monotonic, and duplicate timestamps
  • qc_data_to_dataframe – Convert qc data structure to pandas dataframe

Note

The function names below follow the convention path.path.function. In MATLAB, call a function by its name alone. For example, to call mhkit.qc.check_timestamp, simply use check_timestamp.

mhkit.qc.check_corrupt(data, vals, options)

Check for corrupt data

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • vals (cell array of floats) – Array of at least two corrupt data values. Use a cell array for one value, or pad the array with NaN. (A single-value array becomes a non-iterable scalar in Python and will cause an error.)

  • key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test. To call: check_corrupt(data,vals,"key",key)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_corrupt(data,vals,"min_failures",min_failures)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times
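To illustrate the result structure described above, here is a minimal Python sketch of flagging corrupt values. It is not the Pecos or MHKiT implementation; the helper name flag_corrupt and the sample values are hypothetical, and only the Python standard library is used.

```python
import math

def flag_corrupt(values, corrupt_vals):
    """Hypothetical helper: return (cleaned, mask) for a list of rows.

    mask uses the convention from the text: 1 = passed, 0 = failed QC test.
    """
    cleaned, mask = [], []
    for row in values:
        row_mask = [0 if v in corrupt_vals else 1 for v in row]
        # Failed elements are replaced with NaN, passed elements are kept.
        cleaned.append([v if ok else math.nan for v, ok in zip(row, row_mask)])
        mask.append(row_mask)
    return cleaned, mask

# Two columns, three time steps; -999.0 and 9999.0 are assumed sentinel values.
values = [[1.0, 2.0], [-999.0, 4.0], [5.0, -999.0]]
cleaned, mask = flag_corrupt(values, corrupt_vals=[-999.0, 9999.0])
print(mask)  # [[1, 1], [0, 1], [1, 0]]
```

The mask and the cleaned values keep the same shape as the input, which mirrors the results.values and results.mask fields.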

mhkit.qc.check_delta(data, bound, window, options)

Check for stagnant data and/or abrupt changes in the data using the difference between max and min values within a rolling window

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • bound (cell array of floats) – [lower bound, upper bound] for min/max delta checking. NaN or py.None can be used for either bound.

  • window (int or double) – Size of the rolling window (in seconds) used to compute delta

  • key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test. To call: check_delta(data,bound,window,"key",key)

  • direction (string (optional)) –

    Options: 'positive', 'negative', or py.None (default)

    If direction is positive, then only identify positive deltas (the min occurs before the max).
    If direction is negative, then only identify negative deltas (the max occurs before the min).
    If direction is py.None, then identify both positive and negative deltas.

    To call: check_delta(data,bound,window,"direction",direction)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_delta(data,bound,window,"min_failures",min_failures)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times
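The rolling-window delta idea can be sketched in plain Python as follows. This is a simplified illustration, not the Pecos algorithm: the helper names rolling_delta and delta_mask are hypothetical, only the last sample of each window is flagged, windows holding a single sample are skipped, and the direction option is not modelled.

```python
def rolling_delta(times, values, window):
    """Max minus min over samples whose time lies in (t - window, t].

    Returns None where the window holds fewer than two samples.
    """
    deltas = []
    for t in times:
        in_window = [v for s, v in zip(times, values) if t - window < s <= t]
        if len(in_window) > 1:
            deltas.append(max(in_window) - min(in_window))
        else:
            deltas.append(None)
    return deltas

def delta_mask(deltas, lower=None, upper=None):
    """1 = passed, 0 = failed; None for a bound disables that side of the check."""
    mask = []
    for d in deltas:
        too_small = d is not None and lower is not None and d < lower  # stagnant
        too_large = d is not None and upper is not None and d > upper  # abrupt change
        mask.append(0 if (too_small or too_large) else 1)
    return mask

times = [0, 1, 2, 3, 4, 5]                 # seconds
values = [1.0, 1.5, 2.0, 2.0, 2.0, 9.0]
deltas = rolling_delta(times, values, window=2)
mask = delta_mask(deltas, lower=0.1, upper=5.0)
print(deltas)  # [None, 0.5, 0.5, 0.0, 0.0, 7.0]
print(mask)    # [1, 1, 1, 0, 0, 0]
```

A lower bound catches stagnant stretches (delta too small), while an upper bound catches abrupt jumps (delta too large).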

mhkit.qc.check_increment(data, bound, options)

Check data increments using the difference between values

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • bound (cell array of floats) – [lower bound, upper bound] for min/max difference. NaN or py.None can be used for either bound.

  • key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test. To call: check_increment(data,bound,"key",key)

  • increment (int (optional)) – Time step shift used to compute the difference, default = 1. To call: check_increment(data,bound,"increment",increment)

  • absolute_value (logical (optional)) – Use the absolute value of the increment data, default = py.True. To call: check_increment(data,bound,"absolute_value",absolute_value)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_increment(data,bound,"min_failures",min_failures)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times
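A minimal Python sketch of the increment test, under the assumption that the difference is taken between each value and the value increment steps earlier. The helper name increment_mask is hypothetical and this is not the library's code.

```python
def increment_mask(values, lower=None, upper=None, increment=1, absolute_value=True):
    """1 = passed, 0 = failed. The first `increment` samples have no difference and pass."""
    mask = [1] * len(values)
    for i in range(increment, len(values)):
        diff = values[i] - values[i - increment]
        if absolute_value:
            diff = abs(diff)
        below = lower is not None and diff < lower
        above = upper is not None and diff > upper
        if below or above:
            mask[i] = 0
    return mask

values = [10.0, 10.5, 10.5, 14.0, 13.5]
mask = increment_mask(values, lower=0.25, upper=2.0)
print(mask)  # [1, 1, 0, 0, 1]
```

Here the third sample fails because it did not change (difference 0.0 is below the lower bound), and the fourth fails because it jumped by 3.5.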

mhkit.qc.check_missing(data, options)

Check for missing data

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test. To call: check_missing(data,"key",key)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_missing(data,"min_failures",min_failures)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times
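The min_failures option appears in every test. The following Python sketch shows one plausible reading of it for missing data: a run of consecutive failures is reported only if it is at least min_failures long. The helper name missing_mask is hypothetical, and the exact behavior of the library may differ.

```python
import math

def missing_mask(values, min_failures=1):
    """1 = passed, 0 = failed. NaN counts as missing; short runs are not reported."""
    failed = [math.isnan(v) for v in values]
    mask = [1] * len(values)
    i = 0
    while i < len(failed):
        if not failed[i]:
            i += 1
            continue
        # Find the end of this run of consecutive failures.
        j = i
        while j < len(failed) and failed[j]:
            j += 1
        if j - i >= min_failures:
            for k in range(i, j):
                mask[k] = 0
        i = j
    return mask

nan = math.nan
values = [1.0, nan, 2.0, nan, nan, 3.0]
print(missing_mask(values))                  # [1, 0, 1, 0, 0, 1]
print(missing_mask(values, min_failures=2))  # [1, 1, 1, 0, 0, 1]
```

With min_failures=2, the isolated missing sample is ignored and only the run of two is flagged.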

mhkit.qc.check_outlier(data, bound, options)

Check for outliers using normalized data within a rolling window. Upper and lower bounds are given in standard deviations. Data is normalized using (data-mean)/std.

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • bound (cell array of floats) – [lower bound, upper bound] of standard deviations from the mean allowed. NaN or py.None can be used for either bound.

  • key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test. To call: check_outlier(data,bound,"key",key)

  • window (int (optional)) – Size of the rolling window (in seconds) used to normalize data, default = 3600. If window is set to py.None, data is normalized using the mean and standard deviation of the entire data set (column by column). To call: check_outlier(data,bound,"window",window)

  • absolute_value (logical (optional)) – Use the absolute value of the normalized data, default = py.True. To call: check_outlier(data,bound,"absolute_value",absolute_value)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_outlier(data,bound,"min_failures",min_failures)

  • streaming (logical (optional)) – Indicates if streaming analysis should be used, default = py.False. To call: check_outlier(data,bound,"streaming",streaming)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times
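The normalization (data-mean)/std can be sketched in Python for the case where the whole data set is used rather than a rolling window. The helper name outlier_mask is hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, not a statement about the library.

```python
import statistics

def outlier_mask(values, lower=None, upper=None, absolute_value=True):
    """1 = passed, 0 = failed. Bounds are expressed in standard deviations."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumed)
    mask = []
    for v in values:
        z = (v - mean) / std
        if absolute_value:
            z = abs(z)
        below = lower is not None and z < lower
        above = upper is not None and z > upper
        mask.append(0 if (below or above) else 1)
    return mask

values = [10.0, 11.0, 9.0, 10.0, 10.0, 30.0, 10.0, 11.0, 9.0, 10.0]
mask = outlier_mask(values, upper=2.0)
print(mask)  # [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
```

Only the value 30.0 lies more than two standard deviations from the mean of this sample, so it is the only element flagged.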

mhkit.qc.check_range(data, bound, options)

Check for data that is outside the expected range

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • bound (cell array of floats) – [lower bound, upper bound] for range checking. NaN or py.None can be used for either bound.

  • key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test. To call: check_range(data,bound,"key",key)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_range(data,bound,"min_failures",min_failures)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times
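As a rough illustration of how the input and output structures relate, the Python sketch below runs a range test on a single column and returns a dictionary with values, mask, and time keys. The dictionary layout and the helper name range_check are assumptions made for this example; they mimic the qcdata fields described above but are not the MATLAB structure itself.

```python
import math

def range_check(data, lower=None, upper=None):
    """data: dict with 'values' (list of floats) and 'time' (list of posix seconds)."""
    cleaned, mask = [], []
    for v in data["values"]:
        below = lower is not None and v < lower
        above = upper is not None and v > upper
        ok = not (below or above)
        mask.append(1 if ok else 0)
        cleaned.append(v if ok else math.nan)  # failed elements become NaN
    # Times are passed through unchanged.
    return {"values": cleaned, "mask": mask, "time": list(data["time"])}

data = {"values": [0.5, 1.2, -0.1, 0.9], "time": [0, 1, 2, 3]}
results = range_check(data, lower=0.0, upper=1.0)
print(results["mask"])  # [1, 0, 0, 1]
```

Passing None for either bound disables that side of the check, which corresponds to using NaN or py.None in the bound argument.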

mhkit.qc.check_timestamp(data, freq, options)

Check time series for missing, non-monotonic, and duplicate timestamps

Parameters:
  • data (pandas dataframe or qcdata structure) –

    Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))

    OR

    qcdata structure of form:

    data.values: 2D array of doubles with arbitrary number of columns

    data.time: 1D array of datetimes or posix times

  • freq (int) – Expected time series frequency, in seconds

  • expected_start_time (Timestamp (optional)) – Expected start time in datetime format. Default: None. To call: check_timestamp(data,freq,"expected_start_time",expected_start_time)

  • expected_end_time (Timestamp (optional)) – Expected end time in datetime format. Default: None. To call: check_timestamp(data,freq,"expected_end_time",expected_end_time)

  • min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1. To call: check_timestamp(data,freq,"min_failures",min_failures)

  • exact_times (logical (optional)) – If py.True, times are expected to occur at regular intervals (specified by freq) and data is reindexed to match the expected frequency. If py.False, times only need to occur once or more within each interval (specified by freq) and data is not reindexed. To call: check_timestamp(data,freq,"exact_times",exact_times)

Returns:

results (qcdata structure of form:) –

results.values: array of doubles

Same shape as input data.values. Elements that failed the QC test are replaced with NaN.

results.mask: array of int64

Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).

results.time: array of datetimes

Same as input times (possibly reindexed by exact_times)
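The three kinds of timestamp problems named above can be illustrated with a short Python sketch. It assumes integer posix times and exact spacing (similar in spirit to exact_times being true); the helper name timestamp_issues and its return format are hypothetical and do not reflect the library's output.

```python
def timestamp_issues(times, freq):
    """times: list of integer posix seconds; freq: expected spacing in seconds."""
    seen, duplicate, non_monotonic = set(), [], []
    for i, t in enumerate(times):
        if t in seen:
            duplicate.append(t)        # timestamp already encountered
        seen.add(t)
        if i > 0 and t < times[i - 1]:
            non_monotonic.append(t)    # timestamp goes backwards
    # Expected timestamps run from the earliest to the latest observed time.
    expected = range(min(seen), max(seen) + 1, freq)
    missing = [t for t in expected if t not in seen]
    return {"missing": missing, "duplicate": duplicate, "non_monotonic": non_monotonic}

times = [0, 10, 20, 20, 50, 40]
issues = timestamp_issues(times, freq=10)
print(issues)
# {'missing': [30], 'duplicate': [20], 'non_monotonic': [40]}
```

In this sample, 30 never appears, 20 appears twice, and 40 arrives after 50.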

mhkit.qc.qc_data_to_dataframe(data)

Convert qc data structure to pandas dataframe

Parameters:

data (qcdata structure) – QC data structure to convert

Returns:

results (Pandas DataFrame)