MHKiT Quality Control Module

The following example runs a simple quality control analysis on wave elevation data using the MHKiT QC module. The data file used in this example is stored in the \\MHKiT\\examples\\data directory.

Start by importing the necessary Python packages and MHKiT modules.

[1]:

import pandas as pd
from mhkit import qc, utils

Load Data

The wave elevation data used in this example includes several issues, including timestamps that are out of order, corrupt data with values of -999, data outside the expected range, and stagnant data.

The data is loaded into a pandas DataFrame using the pandas method read_csv. The first 5 rows of data are shown below, along with a plot.

[2]:

# Load data from the csv file into a DataFrame
data = pd.read_csv("data/qc/wave_elevation_data.csv", index_col="Time")

# Plot the data
data.plot(figsize=(15, 5), ylim=(-60, 60))

# Print the first 5 rows of data
print(data.head())

        probe1  probe2  probe3
Time
10.000   24.48   28.27     1.3
10.002   34.48   40.27    -8.7
10.004   30.48   38.27   -13.7
10.006   12.48   24.27   -32.7
10.008   13.48   22.27   -21.7

The data is indexed by time in seconds. To use the quality control functions, the data must be indexed by datetime. The index can be converted to datetime using the following utility function.

[3]:

# Convert the index to datetime
data.index = utils.index_to_datetime(data.index, origin="2019-05-20")

# Print the first 5 rows of data
print(data.head())

                         probe1  probe2  probe3
Time
2019-05-20 00:00:10.000   24.48   28.27     1.3
2019-05-20 00:00:10.002   34.48   40.27    -8.7
2019-05-20 00:00:10.004   30.48   38.27   -13.7
2019-05-20 00:00:10.006   12.48   24.27   -32.7
2019-05-20 00:00:10.008   13.48   22.27   -21.7

Quality control tests

The following quality control tests are used to identify timestamp issues, corrupt data, data outside the expected range, and stagnant data.

Each quality control tests results in the following information:

Cleaned data, which is a DataFrame that has NaN in place of data that did not pass the quality control test
Boolean mask, which is a DataFrame with True/False that indicates if each data point passed the quality control test
Summary of the quality control test results, the summary includes the variable name (which is blank for timestamp issues), the start and end time of the test failure, and an error flag for each test failure

Check timestamp

Quality control analysis generally starts by checking the timestamp index of the data.

The following test checks to see if 1) the data contains duplicate timestamps, 2) timestamps are not monotonically increasing, and 3) timestamps occur at irregular intervals (an interval of 0.002s is expected for this data).

If duplicate timestamps are found, the resulting DataFrames (cleaned data and mask) keep the first occurrence. If timestamps are not monotonic, the timestamps in the resulting DataFrames are reordered.

[4]:

# Define expected frequency of the data, in seconds
frequency = 0.002

# Run the timestamp quality control test
results = qc.check_timestamp(data, frequency)

The cleaned data, boolean mask, and test results summary are shown below. The summary is transposed (using .T) so that it is easier to read.

[5]:

# Plot cleaned data
results["cleaned_data"].plot(figsize=(15, 5), ylim=(-60, 60))

# Print the first 5 rows of the cleaned data
print(results["cleaned_data"].head())

                         probe1  probe2  probe3
2019-05-20 00:00:10.000   24.48   28.27     1.3
2019-05-20 00:00:10.002   34.48   40.27    -8.7
2019-05-20 00:00:10.004   30.48   38.27   -13.7
2019-05-20 00:00:10.006   12.48   24.27   -32.7
2019-05-20 00:00:10.008   13.48   22.27   -21.7

[6]:

# Print the first 5 rows of the mask
print(results["mask"].head())

                         probe1  probe2  probe3
2019-05-20 00:00:10.000    True    True    True
2019-05-20 00:00:10.002    True    True    True
2019-05-20 00:00:10.004    True    True    True
2019-05-20 00:00:10.006    True    True    True
2019-05-20 00:00:10.008    True    True    True

[7]:

# Print the test results summary
# The summary is transposed (using .T) so that it is easier to read.
print(results["test_results"].T)

                                        0                           1  \
Variable Name
Start Time     2019-05-20 00:00:10.230000  2019-05-20 00:00:10.340000
End Time       2019-05-20 00:00:10.230000  2019-05-20 00:00:10.340000
Timesteps                               1                           1
Error Flag         Nonmonotonic timestamp         Duplicate timestamp

                                        2
Variable Name
Start Time     2019-05-20 00:00:10.042000
End Time       2019-05-20 00:00:10.044000
Timesteps                               2
Error Flag              Missing timestamp

Check for corrupt data

In the following quality control tests, the cleaned data from the previous test is used as input to the subsequent test. For each quality control test, a plot of the cleaned data is shown along with the test results summary.

Note, that if you want to run a series of quality control tests before extracting the cumulative cleaned data, boolean mask, and summary, we recommend using Pecos directly with the object-oriented approach, see https://pecos.readthedocs.io/ for more details.

The quality control test below checks for corrupt data, indicated by a value of -999.

[8]:

# Define corrupt values
corrupt_values = [-999]

# Run the corrupt data quality control test
results = qc.check_corrupt(results["cleaned_data"], corrupt_values)

# Plot cleaned data
results["cleaned_data"].plot(figsize=(15, 5), ylim=(-60, 60))

# Print test results summary
print(results["test_results"].T)

                                        0                           1
Variable Name                      probe1                      probe3
Start Time     2019-05-20 00:00:10.110000  2019-05-20 00:00:10.834000
End Time       2019-05-20 00:00:10.134000  2019-05-20 00:00:10.848000
Timesteps                              13                           8
Error Flag                   Corrupt data                Corrupt data

Check for data outside the expected range

The next quality control test checks for data that is greater than 50 or less than -50. Note that expected range tests can also be used to compare measured values to a model, or analyze the expected relationships between data columns.

[9]:

# Define expected lower and upper bound ([lower bound, upper bound])
expected_bounds = [-50, 50]

# Run expected range quality control test
results = qc.check_range(results["cleaned_data"], expected_bounds)

# Plot cleaned data
results["cleaned_data"].plot(figsize=(15, 5), ylim=(-60, 60))

# Print test results summary
print(results["test_results"].T)

                                        0                           1  \
Variable Name                      probe3                      probe3
Start Time     2019-05-20 00:00:10.240000  2019-05-20 00:00:10.468000
End Time       2019-05-20 00:00:10.240000  2019-05-20 00:00:10.468000
Timesteps                               1                           1
Error Flag        Data < lower bound, -50     Data < lower bound, -50

                                        2
Variable Name                      probe3
Start Time     2019-05-20 00:00:10.716000
End Time       2019-05-20 00:00:10.716000
Timesteps                               1
Error Flag         Data > upper bound, 50

Check for stagnant data

The final quality control test checks for stagnant data by looking for data that changes by less than 0.001 within a 0.02 second moving window.

[10]:

# Define expected lower bound (no upper bound is specified in this example)
expected_bound = [0.001, None]

# Define the moving window, in seconds
window = 0.02

# Run the delta quality control test
results = qc.check_delta(results["cleaned_data"], expected_bound, window)

# Plot cleaned data
results["cleaned_data"].plot(figsize=(15, 5), ylim=(-60, 60))

# Print test results summary
print(results["test_results"].T)

                                        0
Variable Name                      probe2
Start Time     2019-05-20 00:00:10.400000
End Time       2019-05-20 00:00:10.544000
Timesteps                              73
Error Flag     Delta < lower bound, 0.001

Cleaned Data

The cleaned data can be used directly in MHKiT analysis, or the missing values can be replaced using various methods before analysis is run. Data replacement strategies are generally defined on a case by case basis. Pandas includes methods to interpolate, replace, and fill missing values.

[11]:

# Extract final cleaned data for MHKiT analysis
cleaned_data = results["cleaned_data"]