QC Module

The QC module includes quality control functions from Pecos. See https://pecos.readthedocs.io for more details.

`check_timestamp`	Check time series for missing, non-monotonic and duplicate timestamps
`check_missing`	Check for missing data
`check_corrupt`	Check for corrupt data
`check_range`	Check for data that is outside expected range
`check_delta`	Check for stagnant data and/or abrupt changes in the data using the difference between max and min values (delta) within a rolling window
`check_outlier`	Check for outliers using normalized data within a rolling window

mhkit.qc.check_timestamp(data, frequency, expected_start_time=None, expected_end_time=None, min_failures=1, exact_times=True)

Check time series for missing, non-monotonic and duplicate timestamps

Parameters:

data (pandas DataFrame) – Data used in the quality control test, indexed by datetime
frequency (int or float) – Expected time series frequency, in seconds
expected_start_time (Timestamp, optional) – Expected start time. If not specified, the minimum timestamp is used
expected_end_time (Timestamp, optional) – Expected end time. If not specified, the maximum timestamp is used
min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1
exact_times (bool, optional) – Controls how missing times are checked. If True, times are expected to occur at regular intervals (specified in frequency) and the DataFrame is reindexed to match the expected frequency. If False, times only need to occur once or more within each interval (specified in frequency) and the DataFrame is not reindexed.

Returns:

dictionary – Results include cleaned data, mask, and test results summary
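The three timestamp problems and the effect of exact_times can be illustrated with a small plain-Python sketch. This is not MHKiT's or Pecos's implementation: the function name `sketch_timestamp_check`, its return keys, and the choice to anchor intervals at the first timestamp are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def sketch_timestamp_check(times, frequency, exact_times=True):
    """Illustrative only: find duplicate, non-monotonic, and missing timestamps.

    times: list of datetime objects in recorded order
    frequency: expected spacing, in seconds
    """
    step = timedelta(seconds=frequency)

    # Duplicates: any timestamp that appears more than once.
    seen = set()
    duplicates = []
    for t in times:
        if t in seen and t not in duplicates:
            duplicates.append(t)
        seen.add(t)

    # Non-monotonic: a timestamp earlier than its predecessor.
    non_monotonic = [b for a, b in zip(times, times[1:]) if b < a]

    # Missing: walk the expected grid from the earliest to the latest time.
    start, end = min(times), max(times)
    missing = []
    t = start
    while t <= end:
        if exact_times:
            # The tick itself must be present.
            present = t in seen
        else:
            # One or more timestamps anywhere in [t, t + step) is enough.
            present = any(t <= s < t + step for s in seen)
        if not present:
            missing.append(t)
        t += step

    return {"duplicates": duplicates,
            "non_monotonic": non_monotonic,
            "missing": missing}


base = datetime(2020, 1, 1)
offsets = [0, 10, 10, 40, 25]  # seconds; one duplicate, one out of order
times = [base + timedelta(seconds=s) for s in offsets]

strict = sketch_timestamp_check(times, frequency=10, exact_times=True)
loose = sketch_timestamp_check(times, frequency=10, exact_times=False)
print(strict["missing"])  # ticks at 20 s and 30 s are absent
print(loose["missing"])   # only the 30 s interval is empty
```

With exact_times=True the sample at 25 s does not satisfy the 20 s tick, so both 20 s and 30 s are reported. With exact_times=False the same sample covers the interval starting at 20 s.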

mhkit.qc.check_missing(data, key=None, min_failures=1)

Check for missing data

Parameters:

data (pandas DataFrame) – Data used in the quality control test, indexed by datetime
key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.
min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns:

dictionary – Results include cleaned data, mask, and test results summary
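The min_failures parameter, shared by every function in this module, restricts reporting to runs of consecutive failures. A minimal plain-Python sketch of that idea follows. The function name `sketch_missing_check` and its return format are hypothetical, and treating both None and NaN as missing is an assumption.

```python
import math


def sketch_missing_check(values, min_failures=1):
    """Illustrative only: report runs of missing values (None or NaN).

    Returns (start_index, end_index) pairs, inclusive, for each run of
    consecutive missing values that is at least min_failures long.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    runs = []
    run_start = None
    for i, v in enumerate(values):
        if is_missing(v):
            if run_start is None:
                run_start = i  # a new run begins here
        elif run_start is not None:
            # The run ended at i - 1; keep it only if it is long enough.
            if i - run_start >= min_failures:
                runs.append((run_start, i - 1))
            run_start = None

    # Close a run that reaches the end of the series.
    if run_start is not None and len(values) - run_start >= min_failures:
        runs.append((run_start, len(values) - 1))
    return runs


values = [1.0, None, 2.0, float("nan"), None, None, 3.0]
all_runs = sketch_missing_check(values, min_failures=1)
long_runs = sketch_missing_check(values, min_failures=2)
print(all_runs)
print(long_runs)
```

Raising min_failures from 1 to 2 drops the isolated gap at index 1 and keeps only the three-sample gap.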

mhkit.qc.check_corrupt(data, corrupt_values, key=None, min_failures=1)

Check for corrupt data

Parameters:

data (pandas DataFrame) – Data used in the quality control test, indexed by datetime
corrupt_values (list of int or floats) – List of corrupt data values
key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.
min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns:

dictionary – Results include cleaned data, mask, and test results summary

mhkit.qc.check_range(data, bound, key=None, min_failures=1)

Check for data that is outside expected range

Parameters:

data (pandas DataFrame) – Data used in the quality control test, indexed by datetime
bound (list of floats) – [lower bound, upper bound]. None can be used in place of a lower or upper bound
key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.
min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns:

dictionary – Results include cleaned data, mask, and test results summary
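How a bound with None on one side behaves can be sketched in plain Python. This is an illustration rather than the library's code: `sketch_range_check` is a hypothetical name, values equal to a bound are assumed to pass, and replacing failed values with None is an assumption about what "cleaned data" means.

```python
def sketch_range_check(values, bound):
    """Illustrative only: flag values outside [lower, upper].

    A bound of None disables that side of the test.
    Returns (cleaned, failed); failed values become None in cleaned.
    """
    lower, upper = bound
    failed = []
    for v in values:
        too_low = lower is not None and v < lower
        too_high = upper is not None and v > upper
        failed.append(too_low or too_high)
    cleaned = [None if f else v for v, f in zip(values, failed)]
    return cleaned, failed


values = [-5.0, 0.5, 12.0, 3.0]

# Two-sided test: both -5.0 and 12.0 fall outside [0, 10].
cleaned, failed = sketch_range_check(values, [0, 10])
print(cleaned)

# One-sided test: with no lower bound, only 12.0 fails.
cleaned_upper, failed_upper = sketch_range_check(values, [None, 10])
print(cleaned_upper)
```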

mhkit.qc.check_delta(data, bound, window, key=None, direction=None, min_failures=1)

Check for stagnant data and/or abrupt changes in the data using the difference between max and min values (delta) within a rolling window

Parameters:

data (pandas DataFrame) – Data used in the quality control test, indexed by datetime
bound (list of floats) – [lower bound, upper bound]. None can be used in place of a lower or upper bound
window (int or float) – Size of the rolling window (in seconds) used to compute delta
key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.
direction (str, optional) –
Options = ‘positive’, ‘negative’, or None
- If direction is positive, then only identify positive deltas (the min occurs before the max)
- If direction is negative, then only identify negative deltas (the max occurs before the min)
- If direction is None, then identify both positive and negative deltas
min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns:

dictionary – Results include cleaned data, mask, and test results summary
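The delta test and its direction option can be sketched in plain Python as follows. This is a simplified illustration, not the MHKiT or Pecos algorithm: `sketch_delta_check` is a hypothetical name, the window is assumed to trail each timestamp and include both endpoints, and partial windows at the start of the series are skipped.

```python
from datetime import datetime, timedelta


def sketch_delta_check(times, values, bound, window, direction=None):
    """Illustrative only: flag windows whose delta (max - min) is out of bounds.

    times: sorted list of datetime; values: list of floats, same length
    bound: [lower, upper]; None disables that side
    window: window length in seconds, looking back from each timestamp
    Returns the timestamps that end an offending window.
    """
    lower, upper = bound
    span = timedelta(seconds=window)
    flagged = []
    for i, t_end in enumerate(times):
        # Skip partial windows at the start of the series (a simplification).
        if t_end - times[0] < span:
            continue
        # Indices of samples inside the trailing window [t_end - span, t_end].
        idx = [j for j in range(i + 1) if times[j] >= t_end - span]
        j_min = min(idx, key=lambda j: values[j])
        j_max = max(idx, key=lambda j: values[j])
        delta = values[j_max] - values[j_min]

        # Direction filter: positive means the min occurs before the max.
        if direction == "positive" and not j_min < j_max:
            continue
        if direction == "negative" and not j_max < j_min:
            continue

        too_small = lower is not None and delta < lower
        too_large = upper is not None and delta > upper
        if too_small or too_large:
            flagged.append(t_end)
    return flagged


base = datetime(2020, 1, 1)
times = [base + timedelta(seconds=10 * k) for k in range(6)]
values = [1.0, 1.0, 1.0, 5.0, 5.2, 2.0]

# Abrupt changes: delta may not exceed 3.0 within any 20-second window.
abrupt = sketch_delta_check(times, values, [None, 3.0], window=20)
rises = sketch_delta_check(times, values, [None, 3.0], window=20,
                           direction="positive")
drops = sketch_delta_check(times, values, [None, 3.0], window=20,
                           direction="negative")

# Stagnant data: delta must be at least 0.5 within any 20-second window.
stagnant = sketch_delta_check(times, values, [0.5, None], window=20)
print(abrupt, rises, drops, stagnant, sep="\n")
```

In this sketch an upper bound catches abrupt changes and a lower bound catches stagnant data, so a single call with both bounds set covers the two cases named in the description.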

mhkit.qc.check_outlier(data, bound, window=None, key=None, absolute_value=False, streaming=False, min_failures=1)

Check for outliers using normalized data within a rolling window

The upper and lower bounds are specified in standard deviations. Data is normalized using (data - mean)/std.

Parameters:

data (pandas DataFrame) – Data used in the quality control test, indexed by datetime
bound (list of floats) – [lower bound, upper bound]. None can be used in place of a lower or upper bound
window (int or float, optional) – Size of the rolling window (in seconds) used to normalize data. If window is set to None, data is normalized using the mean and standard deviation of the entire data set (column by column), default = None
key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.
absolute_value (boolean, optional) – Use the absolute value of the normalized data, default = False
streaming (boolean, optional) – Indicates if streaming analysis should be used, default = False
min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns:

dictionary – Results include cleaned data, mask, and test results summary
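For the window=None case, the normalization and the effect of absolute_value can be sketched in plain Python. This is an illustration only: `sketch_outlier_check` is a hypothetical name, and the use of the sample standard deviation is an assumption, since the text above does not say which estimator is used.

```python
import statistics


def sketch_outlier_check(values, bound, absolute_value=False):
    """Illustrative only: flag outliers in normalized data (window=None case).

    bound: [lower, upper] in standard deviations; None disables that side
    Returns a list of booleans, True where the value is flagged.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    lower, upper = bound
    failed = []
    for v in values:
        z = (v - mean) / std  # normalized value, (data - mean)/std
        if absolute_value:
            z = abs(z)
        too_low = lower is not None and z < lower
        too_high = upper is not None and z > upper
        failed.append(too_low or too_high)
    return failed


values = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 4.0, 10.0]

# Two-sided bound: the low reading at index 6 is about 2.5 std below the mean.
two_sided = sketch_outlier_check(values, [-2, 2])

# Upper bound only: a low outlier is missed unless absolute values are used.
upper_only = sketch_outlier_check(values, [None, 2])
upper_abs = sketch_outlier_check(values, [None, 2], absolute_value=True)
print(two_sided, upper_only, upper_abs, sep="\n")
```

The last two calls show why absolute_value matters: with only an upper bound, a large negative deviation is caught only after taking the absolute value of the normalized data.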