QC Module
The QC module includes quality control functions from Pecos. See https://pecos.readthedocs.io for more details.
| Functions | Description |
|---|---|
| check_corrupt | Check for corrupt data |
| check_delta | Check for stagnant data and/or abrupt changes in the data using the difference between max and min values within a rolling window |
| check_increment | Check data increments using the difference between values |
| check_missing | Check for missing data |
| check_outlier | Check for outliers using normalized data within a rolling window |
| check_range | Check for data outside the expected range |
| check_timestamp | Check time series for missing, non-monotonic, and duplicate timestamps |
| qc_data_to_dataframe | Convert qc data structure to pandas dataframe |
Note
The names of the functions below follow the convention path.path.function. Only the function name is used when calling the function in MATLAB. For example, to call mhkit.qc.check_timestamp, simply use check_timestamp.
- mhkit.qc.check_corrupt(data, vals, options)
Check for corrupt data
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
vals (cell array of floats) – Array of at least two corrupt data values. For a single value, use a cell array or pad the array with NaN (a single-value array becomes a non-iterable scalar in Python and will cause an error).
key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test.
to call: check_corrupt(data,vals,"key",key)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_corrupt(data,vals,"min_failures",min_failures)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times
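The relationship between the input values, results.values, and results.mask can be sketched in plain Python, the language of the underlying Pecos package. This is only an illustration of the logic for a single column: the helper name flag_corrupt is hypothetical and is not part of MHKiT or Pecos.

```python
import math

def flag_corrupt(values, corrupt_vals):
    """Hypothetical helper: return (cleaned, mask) for one column of values.

    mask[i] is 1 when values[i] passed and 0 when it equals a corrupt value.
    cleaned is a copy of values with the failed entries replaced by NaN.
    """
    corrupt = set(corrupt_vals)
    mask = [0 if v in corrupt else 1 for v in values]
    cleaned = [v if ok else math.nan for v, ok in zip(values, mask)]
    return cleaned, mask

# Two sentinel values mark corrupt readings in this made-up series.
cleaned, mask = flag_corrupt([1.5, -999.0, 2.0, 9999.0], [-999.0, 9999.0])
```

Here mask marks the second and fourth samples as failed, and cleaned holds NaN in those positions.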
- mhkit.qc.check_delta(data, bound, window, options)
Check for stagnant data and/or abrupt changes in the data using the difference between max and min values within a rolling window
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
bound (cell array of floats) – [lower bound, upper bound] for min/max delta checking. NaN or py.None can be used for either bound.
window (int or double) – Size of the rolling window (in seconds) used to compute delta.
key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test.
to call: check_delta(data,bound,window,"key",key)
direction (string (optional)) – Options: 'positive', 'negative', or py.None (default).
If direction is positive, only positive deltas are identified (the min occurs before the max).
If direction is negative, only negative deltas are identified (the max occurs before the min).
If direction is py.None, both positive and negative deltas are identified.
to call: check_delta(data,bound,window,"direction",direction)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_delta(data,bound,window,"min_failures",min_failures)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times
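As a rough illustration of the rolling max-minus-min idea, the plain-Python sketch below computes the delta over a trailing window for one column with posix timestamps. It makes simplifying assumptions that are not taken from MHKiT or Pecos: the window is trailing, only the sample that closes the window is flagged, and the direction option is ignored. The helper names are hypothetical.

```python
def rolling_delta(times, values, window):
    """Max minus min of the samples in the trailing window (t - window, t]."""
    deltas = []
    for i, t in enumerate(times):
        in_window = [v for tj, v in zip(times[: i + 1], values[: i + 1])
                     if t - tj < window]
        # A window holding a single sample always has a delta of 0.
        deltas.append(max(in_window) - min(in_window))
    return deltas

def flag_delta(times, values, bound, window):
    """Return a mask with 0 where the window delta falls outside bound.

    Either entry of bound may be None, which disables that side of the test.
    """
    lower, upper = bound
    mask = []
    for d in rolling_delta(times, values, window):
        too_small = lower is not None and d < lower   # stagnant data
        too_large = upper is not None and d > upper   # abrupt change
        mask.append(0 if (too_small or too_large) else 1)
    return mask

# Made-up series sampled every 60 s with a jump at t = 180.
times = [0, 60, 120, 180, 240]
values = [1.0, 1.1, 1.2, 5.0, 5.1]
mask = flag_delta(times, values, bound=(None, 2.0), window=120)
```

With a 120-second window and only an upper bound of 2.0, the jump between t = 120 and t = 180 is the only window whose delta exceeds the bound.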
- mhkit.qc.check_increment(data, bound, options)
Check data increments using the difference between values
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
bound (cell array of floats) – [lower bound, upper bound] for min/max difference. NaN or py.None can be used for either bound.
key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test.
to call: check_increment(data,bound,"key",key)
increment (int (optional)) – Time step shift used to compute the difference, default = 1.
to call: check_increment(data,bound,"increment",increment)
absolute_value (logical (optional)) – Use the absolute value of the increment data, default = py.True.
to call: check_increment(data,bound,"absolute_value",absolute_value)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_increment(data,bound,"min_failures",min_failures)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times
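The increment test compares each value with the value a fixed number of time steps earlier. A minimal plain-Python sketch of that logic for one column follows. The helper flag_increment is hypothetical (not MHKiT or Pecos code), and it assumes that the first samples, which have no earlier value to compare against, pass the test.

```python
def flag_increment(values, bound, increment=1, absolute_value=True):
    """Return a mask with 0 where the shifted difference is outside bound.

    The difference is values[i] - values[i - increment]. Either entry of
    bound may be None, which disables that side of the test.
    """
    lower, upper = bound
    mask = [1] * len(values)  # assumption: samples without a predecessor pass
    for i in range(increment, len(values)):
        diff = values[i] - values[i - increment]
        if absolute_value:
            diff = abs(diff)
        below = lower is not None and diff < lower
        above = upper is not None and diff > upper
        if below or above:
            mask[i] = 0
    return mask

values = [10.0, 10.5, 10.5, 14.0, 14.2]
jump_mask = flag_increment(values, bound=(None, 2.0))     # abrupt changes
flat_mask = flag_increment(values, bound=(0.0001, None))  # unchanged values
```

An upper bound catches the jump from 10.5 to 14.0, while a small lower bound catches the repeated 10.5.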
- mhkit.qc.check_missing(data, options)
Check for missing data
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test.
to call: check_missing(data,"key",key)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_missing(data,"min_failures",min_failures)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times
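The min_failures option appears in every check in this module. One way to read it is that a run of consecutive failures is reported only when the run is at least min_failures long. The plain-Python sketch below applies that reading to missing data in a single column. The helper flag_missing is hypothetical, and the run-length interpretation is an assumption based on the parameter description rather than on the Pecos source.

```python
import math

def flag_missing(values, min_failures=1):
    """Return a mask with 0 for missing values that belong to a run of at
    least min_failures consecutive missing values."""
    failed = [v is None or math.isnan(v) for v in values]
    mask = [1] * len(values)
    i = 0
    while i < len(failed):
        if not failed[i]:
            i += 1
            continue
        # Find the end of the current run of consecutive failures.
        j = i
        while j < len(failed) and failed[j]:
            j += 1
        if j - i >= min_failures:
            for k in range(i, j):
                mask[k] = 0
        i = j
    return mask

series = [1.0, math.nan, 2.0, math.nan, math.nan, 3.0]
every_gap = flag_missing(series)                   # report all missing values
long_gaps = flag_missing(series, min_failures=2)   # ignore isolated gaps
```

With min_failures=2, the isolated missing value at index 1 is no longer reported, but the two-sample gap still is.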
- mhkit.qc.check_outlier(data, bound, options)
Check for outliers using normalized data within a rolling window. Upper and lower bounds are given in standard deviations. Data is normalized using (data-mean)/std.
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
bound (cell array of floats) – [lower bound, upper bound] of standard deviations from the mean allowed. NaN or py.None can be used for either bound.
key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test.
to call: check_outlier(data,bound,"key",key)
window (int (optional)) – Size of the rolling window (in seconds) used to normalize data, default = 3600. If window is set to py.None, data is normalized using the mean and standard deviation of the entire data set (column by column).
to call: check_outlier(data,bound,"window",window)
absolute_value (logical (optional)) – Use the absolute value of the normalized data, default = py.True.
to call: check_outlier(data,bound,"absolute_value",absolute_value)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_outlier(data,bound,"min_failures",min_failures)
streaming (logical (optional)) – Indicates if streaming analysis should be used, default = py.False.
to call: check_outlier(data,bound,"streaming",streaming)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times
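The normalization (data-mean)/std can be illustrated for the simplest case, where window is py.None and the statistics come from the entire column. The plain-Python sketch below is not the Pecos implementation: the helper flag_outlier is hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics

def flag_outlier(values, bound, absolute_value=True):
    """Return a mask with 0 where the normalized value is outside bound.

    Values are normalized as (value - mean) / std over the whole column.
    Either entry of bound may be None, which disables that side of the test.
    """
    lower, upper = bound
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    mask = []
    for v in values:
        z = (v - mean) / std
        if absolute_value:
            z = abs(z)
        below = lower is not None and z < lower
        above = upper is not None and z > upper
        mask.append(0 if (below or above) else 1)
    return mask

# Made-up series with one spike.
values = [10.0, 10.2, 9.9, 10.1, 25.0, 10.0]
mask = flag_outlier(values, bound=(None, 1.5))
```

Only the spike at 25.0 lies more than 1.5 standard deviations from the mean, so it is the only sample flagged.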
- mhkit.qc.check_range(data, bound, options)
Check for data that is outside the expected range
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
bound (cell array of floats) – [lower bound, upper bound] for range checking. NaN or py.None can be used for either bound.
key (string (optional)) – Data column name or translation dictionary key. If not specified or set to py.None, all columns are used for the test.
to call: check_range(data,bound,"key",key)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_range(data,bound,"min_failures",min_failures)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times
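A minimal plain-Python sketch of the range test for one column is shown below, including the case where one side of the bound is left open. The helper flag_range is hypothetical and only mirrors the behavior described above. Here None stands in for py.None.

```python
def flag_range(values, bound):
    """Return a mask with 0 where a value lies outside [lower, upper].

    Either entry of bound may be None, which disables that side of the test.
    """
    lower, upper = bound
    mask = []
    for v in values:
        below = lower is not None and v < lower
        above = upper is not None and v > upper
        mask.append(0 if (below or above) else 1)
    return mask

values = [0.5, -1.0, 3.0, 12.0]
two_sided = flag_range(values, (0.0, 10.0))   # both bounds active
lower_only = flag_range(values, (0.0, None))  # no upper bound
```

With both bounds, the negative value and the value above 10 fail. With the upper bound disabled, only the negative value fails.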
- mhkit.qc.check_timestamp(data, freq, options)
Check time series for missing, non-monotonic, and duplicate timestamps
- Parameters:
data (pandas dataframe or qcdata structure) – Pandas dataframe indexed by datetime (use py.mhkit_python_utils.pandas_dataframe.timeseries_to_pandas(ts,time,x))
OR
qcdata structure of form:
data.values: 2D array of doubles with arbitrary number of columns
data.time: 1D array of datetimes or posix times
freq (int) – Expected time series frequency, in seconds.
expected_start_time (Timestamp (optional)) – Expected start time in datetime format. Default: None.
to call: check_timestamp(data,freq,"expected_start_time",expected_start_time)
expected_end_time (Timestamp (optional)) – Expected end time in datetime format. Default: None.
to call: check_timestamp(data,freq,"expected_end_time",expected_end_time)
min_failures (int (optional)) – Minimum number of consecutive failures required for reporting, default = 1.
to call: check_timestamp(data,freq,"min_failures",min_failures)
exact_times (logical (optional)) – If py.True, times are expected to occur at regular intervals (specified by freq) and data is reindexed to match the expected frequency. If py.False, times only need to occur once or more within each interval (specified by freq) and data is not reindexed.
to call: check_timestamp(data,freq,"exact_times",exact_times)
- Returns:
results (qcdata structure of form:) –
- results.values: array of doubles
Same shape as input data.values. Elements that failed the QC test are replaced with NaN.
- results.mask: array of int64
Same shape as input data.values. Logical mask of QC results (1 = passed, 0 = failed QC test).
- results.time: array of datetimes
Same as input times (possibly reindexed by exact_times)
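The three timestamp problems named above (duplicate, non-monotonic, and missing) can be told apart with a short plain-Python sketch. It assumes integer posix times and the exact_times = py.True behavior, where a sample is expected at every multiple of freq between the first and last time. The helper inspect_timestamps and its return format are hypothetical and do not match the qcdata structure.

```python
def inspect_timestamps(times, freq):
    """Classify timestamp problems in a list of integer posix times."""
    # A time is a duplicate if it occurs more than once.
    duplicates = sorted({t for t in times if times.count(t) > 1})
    # A time is non-monotonic if it is earlier than the sample before it.
    non_monotonic = [t for prev, t in zip(times, times[1:]) if t < prev]
    # Expected times run from the earliest to the latest sample in steps of freq.
    present = set(times)
    expected = range(min(present), max(present) + 1, freq)
    missing = [t for t in expected if t not in present]
    return {
        "duplicate": duplicates,
        "non_monotonic": non_monotonic,
        "missing": missing,
    }

# Made-up series at 60 s: 60 repeats, 120 is out of order, 240 is absent.
report = inspect_timestamps([0, 60, 60, 180, 120, 300], freq=60)
```

Each problem shows up in its own list, which makes it clear why reindexing to the expected frequency can change the length of results.time.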
- mhkit.qc.qc_data_to_dataframe(data)
Convert qc data structure to pandas dataframe
- Parameters:
data
- Returns:
results (Pandas DataFrame)