QC Module

The QC module includes quality control functions from Pecos. See https://pecos.readthedocs.io for more details.

check_timestamp

Check time series for missing, non-monotonic and duplicate timestamps

check_missing

Check for missing data

check_corrupt

Check for corrupt data

check_range

Check for data that is outside expected range

check_delta

Check for stagnant data and/or abrupt changes in the data using the difference between max and min values (delta) within a rolling window

check_outlier

Check for outliers using normalized data within a rolling window

mhkit.qc.check_timestamp(data, frequency, expected_start_time=None, expected_end_time=None, min_failures=1, exact_times=True)

Check time series for missing, non-monotonic and duplicate timestamps

Parameters
  • data (pandas DataFrame) – Data used in the quality control test, indexed by datetime

  • frequency (int or float) – Expected time series frequency, in seconds

  • expected_start_time (Timestamp, optional) – Expected start time. If not specified, the minimum timestamp is used

  • expected_end_time (Timestamp, optional) – Expected end time. If not specified, the maximum timestamp is used

  • min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

  • exact_times (bool, optional) – Controls how missing times are checked. If True, times are expected to occur at regular intervals (specified in frequency) and the DataFrame is reindexed to match the expected frequency. If False, times only need to occur once or more within each interval (specified in frequency) and the DataFrame is not reindexed.

Returns

dictionary – Results include cleaned data, mask, and test results summary
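
The three timestamp problems named above can be illustrated with plain pandas. This is a minimal sketch of the idea, not the Pecos implementation: the sample data, the column name `power`, and the choice to keep the first of each duplicate are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical 1-second data with a duplicate, an out-of-order, and a missing timestamp
index = pd.to_datetime([
    "2020-01-01 00:00:00",
    "2020-01-01 00:00:01",
    "2020-01-01 00:00:01",  # duplicate
    "2020-01-01 00:00:03",
    "2020-01-01 00:00:02",  # out of order
    "2020-01-01 00:00:05",  # 00:00:04 is missing
])
data = pd.DataFrame({"power": [1.0, 2.0, 2.5, 4.0, 3.0, 6.0]}, index=index)

frequency = 1  # expected spacing between samples, in seconds

# Non-monotonic and duplicate timestamps
is_monotonic = data.index.is_monotonic_increasing
duplicates = data.index[data.index.duplicated(keep="first")]

# Drop duplicates (keeping the first, an assumption), sort, then reindex
# onto the expected regular grid, as described for exact_times=True
cleaned = data[~data.index.duplicated(keep="first")].sort_index()
expected = pd.date_range(
    cleaned.index.min(), cleaned.index.max(), freq=pd.Timedelta(seconds=frequency)
)
missing = expected.difference(cleaned.index)
cleaned = cleaned.reindex(expected)  # missing times become NaN rows
```

With MHKiT installed, the corresponding call per the signature above would be `mhkit.qc.check_timestamp(data, 1)`.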

mhkit.qc.check_missing(data, key=None, min_failures=1)

Check for missing data

Parameters
  • data (pandas DataFrame) – Data used in the quality control test, indexed by datetime

  • key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.

  • min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns

dictionary – Results include cleaned data, mask, and test results summary
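
The role of `min_failures` can be sketched with plain pandas: a missing value is reported only when it belongs to a run of at least `min_failures` consecutive failures. The data and the run-labelling approach below are assumptions for illustration and may differ from how Pecos implements the test.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one isolated NaN and one run of two NaNs
index = pd.date_range("2020-01-01", periods=6, freq="s")
data = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan, np.nan, 6.0]}, index=index)

min_failures = 2
failed = data["a"].isna()

# Give each run of identical pass/fail values its own id, then count
# the failures in each run
run_id = failed.astype(int).diff().ne(0).cumsum()
run_length = failed.astype(int).groupby(run_id).transform("sum")

# Keep only failures that belong to a long enough run
reported = failed & (run_length >= min_failures)
```

Here the isolated NaN at the second sample is not reported, while the two consecutive NaNs are.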

mhkit.qc.check_corrupt(data, corrupt_values, key=None, min_failures=1)

Check for corrupt data

Parameters
  • data (pandas DataFrame) – Data used in the quality control test, indexed by datetime

  • corrupt_values (list of int or floats) – List of corrupt data values

  • key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.

  • min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns

dictionary – Results include cleaned data, mask, and test results summary
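
A corrupt-value test amounts to matching each data point against a list of sentinel values. The sketch below shows that idea with plain pandas; the sentinel values, column names, and the convention that True in the mask means "passed" are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data in which -999 and 9999 are logger error codes
index = pd.date_range("2020-01-01", periods=5, freq="s")
data = pd.DataFrame(
    {"a": [1.2, -999.0, 3.4, 4.1, -999.0], "b": [0.5, 0.6, 9999.0, 0.8, 0.9]},
    index=index,
)
corrupt_values = [-999, 9999]

mask = ~data.isin(corrupt_values)  # True where the value is not a corrupt code
cleaned = data.where(mask)         # corrupt values replaced with NaN
```

With MHKiT installed, the corresponding call per the signature above would be `mhkit.qc.check_corrupt(data, [-999, 9999])`.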

mhkit.qc.check_range(data, bound, key=None, min_failures=1)

Check for data that is outside expected range

Parameters
  • data (pandas DataFrame) – Data used in the quality control test, indexed by datetime

  • bound (list of floats) – [lower bound, upper bound]. None can be used in place of a lower or upper bound

  • key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.

  • min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns

dictionary – Results include cleaned data, mask, and test results summary
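
The way a `None` entry in `bound` disables one side of the range test can be sketched as follows. The helper `in_range` is hypothetical, and treating the bounds as inclusive is an assumption, not a statement about the Pecos implementation.

```python
import pandas as pd

def in_range(data, bound):
    """Return a boolean mask that is True where data lies within bound.

    Hypothetical helper: bounds are treated as inclusive, and a bound of
    None means that side is not checked.
    """
    lower, upper = bound
    mask = pd.DataFrame(True, index=data.index, columns=data.columns)
    if lower is not None:
        mask &= data >= lower
    if upper is not None:
        mask &= data <= upper
    return mask

index = pd.date_range("2020-01-01", periods=4, freq="s")
data = pd.DataFrame({"a": [-1.0, 0.0, 5.0, 11.0]}, index=index)

both_sides = in_range(data, [0, 10])     # checks lower and upper bound
upper_only = in_range(data, [None, 10])  # lower bound disabled
```

With MHKiT installed, the corresponding calls per the signature above would be `mhkit.qc.check_range(data, [0, 10])` and `mhkit.qc.check_range(data, [None, 10])`.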

mhkit.qc.check_delta(data, bound, window, key=None, direction=None, min_failures=1)

Check for stagnant data and/or abrupt changes in the data using the difference between max and min values (delta) within a rolling window

Parameters
  • data (pandas DataFrame) – Data used in the quality control test, indexed by datetime

  • bound (list of floats) – [lower bound, upper bound]. None can be used in place of a lower or upper bound

  • window (int or float) – Size of the rolling window (in seconds) used to compute delta

  • key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.

  • direction (str, optional) –

    Options = ‘positive’, ‘negative’, or None

    • If direction is positive, then only identify positive deltas (the min occurs before the max)

    • If direction is negative, then only identify negative deltas (the max occurs before the min)

    • If direction is None, then identify both positive and negative deltas

  • min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns

dictionary – Results include cleaned data, mask, and test results summary
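
The delta described above (max minus min within a rolling window) can be computed with a time-based pandas rolling window. This is a simplified sketch: it flags the timestamp at the end of each window and ignores `direction`, both of which are assumptions that may differ from what Pecos reports. The data, bounds, and the `min_periods=3` setting (which assumes 1-second sampling) are also illustrative.

```python
import pandas as pd

# Hypothetical 1-second data: flat at first, then an abrupt jump
index = pd.date_range("2020-01-01", periods=8, freq="s")
data = pd.DataFrame({"a": [1.0, 1.0, 1.0, 1.0, 2.0, 9.0, 9.5, 10.0]}, index=index)

window = 3          # rolling window size, in seconds
bound = [0.5, 5.0]  # [lower bound, upper bound] on delta

# Require a full window (3 samples at 1 s spacing) before computing delta
rolling = data["a"].rolling(f"{window}s", min_periods=3)
delta = rolling.max() - rolling.min()

stagnant = delta < bound[0]  # delta below the lower bound: data barely changes
abrupt = delta > bound[1]    # delta above the upper bound: data changes too fast
```

A lower bound on delta catches stagnant data, while an upper bound catches abrupt changes, so a single call can test for either or both depending on which bound is set to None.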

mhkit.qc.check_outlier(data, bound, window=None, key=None, absolute_value=False, streaming=False, min_failures=1)

Check for outliers using normalized data within a rolling window

The upper and lower bounds are specified in standard deviations. Data is normalized using (data - mean)/std.

Parameters
  • data (pandas DataFrame) – Data used in the quality control test, indexed by datetime

  • bound (list of floats) – [lower bound, upper bound]. None can be used in place of a lower or upper bound

  • window (int or float, optional) – Size of the rolling window (in seconds) used to normalize data. If window is set to None, data is normalized using the entire data set's mean and standard deviation (column by column). default = None

  • key (string, optional) – Data column name or translation dictionary key. If not specified, all columns are used in the test.

  • absolute_value (boolean, optional) – Use the absolute value of the normalized data, default = False

  • streaming (boolean, optional) – Indicates if streaming analysis should be used, default = False

  • min_failures (int, optional) – Minimum number of consecutive failures required for reporting, default = 1

Returns

dictionary – Results include cleaned data, mask, and test results summary
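
For the `window=None` case, the normalization (data - mean)/std and the comparison against bounds in standard deviations can be sketched with plain pandas. The data and bounds are hypothetical, and the sketch relies on pandas' default sample standard deviation, which is an assumption about the intended definition of std.

```python
import pandas as pd

# Hypothetical data: values near 1.0 with a single spike
index = pd.date_range("2020-01-01", periods=10, freq="s")
data = pd.DataFrame(
    {"a": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 8.0]}, index=index
)

bound = [-2, 2]  # lower and upper bound, in standard deviations

# Normalize each column using its own mean and standard deviation
normalized = (data - data.mean()) / data.std()

# True where the normalized value lies within the bounds
mask = (normalized >= bound[0]) & (normalized <= bound[1])
```

Only the final spike lies more than two standard deviations from the mean, so it is the only point the mask marks as failing. With MHKiT installed, the corresponding call per the signature above would be `mhkit.qc.check_outlier(data, [-2, 2])`.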