Data Quality Check

Before beginning any modeling or analysis, it is essential to ensure the integrity and consistency of the dataset. The Quality Check section of the Data Management module provides automated tools to detect common issues such as missing values, inconsistent dosing records, and irregular time patterns. Addressing these potential problems early in the workflow is critical for ensuring reliable model performance and avoiding biased or misleading results.

Figure 1. Path to the Quality check section in the Data Management module.

Once your dataset has been properly initialized in the Data section and covariate columns have been appropriately defined, the Quality Check section becomes active and available for use.

The data checks are organized into three dedicated tabs, each focusing on a specific type of data:

  1. Covariates
  2. Dosing events
  3. Time series

1. Covariates

Once the dataset has been initialized and continuous, categorical, and (if present) time-varying covariate columns have been declared in the Data section, the Covariates tab becomes active. It summarises seven automated checks, each accompanied by contextual messages and—when appropriate—tables that highlight the issues detected.

  1. Declared continuous covariates

    • Lists the continuous covariates provided by the user.
    • If none were specified, the message “The user has not specified continuous covariates.” is shown.
  2. Declared categorical covariates

    • Analogous to point 1 but for categorical covariates.
    • If none were specified, a corresponding message is displayed.
  3. Missing-value scan

    Searches the declared covariate columns for empty cells.

    • If empties are found, the message "Empty cells were found in these columns and IDs:" appears, followed by a table listing the affected columns and IDs.
    • Otherwise, "User-specified covariates contain no empty cells."
  4. Time-varying columns change check

    Verifies that each user-defined time-varying covariate truly varies within an ID.

    • Outcomes:

      • "All user-defined time-varying columns change over time."
      • "User-defined time-varying columns that don’t change over time:" (followed by offending column names).
      • If no time-varying covariates were declared: "The user has not specified time-varying covariates."
  5. Time-invariant stability check

    Display a list of considered time invariant covariates and ensures that covariates expected to be constant within an ID do not drift over time.

    • Outcomes:
      • "All time invariant columns don't change over time."
      • "Time invariant columns that change over time: " (followed by offending column names).
  6. Invariant-covariate diagnostics Three complementary tests are run:

    6.1 Balance of categorical covariates

    Flags any categorical covariate where a level represents < 15 % of IDs.

    • Shows "Uneven distribution of covariate levels in columns:" plus a table of covariate/level pairs.
    • If no imbalance: "There is no significant imbalance detected in the distribution of levels among the categorical covariates."

    6.2 Outlier detection in continuous covariates

    Usual outliers: values outside \([Q1 - 1.5 × IQR\) ; \(Q3 + 1.5 × IQR]\) 1. Distant outliers: values outside \([Q1 - 2.5 × IQR\) ; \(Q3 + 2.5 × IQR]\) and beyond the 5th/95th percentiles.

    • If found, the message "Possible outliers in continuous data:" appears with a three-column table: covariate, IDs with usual outliers, and IDs with distant outliers.
    • Otherwise: "No possible outliers were detected."

    6.3 Potential imputations in baseline covariates

    For continuous covariates not marked as time-varying, the routine looks for values repeated in > 10 % of IDs—a sign of bulk imputation.

    • If detected, the message "Potential imputation for baseline covariate > 10 %:" appears with a table of covariates and affected IDs.
    • If none: "No imputations greater than 10 % were detected."
  7. LOCF detection in time-varying covariates

    Aims to identify possible Last Observation Carried Forward (LOCF) practices, where a value is repeated across successive time points.

    • If any time-varying covariate shows > 3 consecutive identical values within an ID, those IDs and covariates are reported.
    • Otherwise: "No LOCF were detected."
Figure 2. Example output from the Covariates tab of Quality Check section.

2. Dosing events

The Dosing events tab evaluates the internal consistency of all dose-related columns. For full functionality, the dataset should contain the following fields:

  • AMT – dose amount
  • EVID – event identifier (0 = observation, > 0 = dose or other event)
  • MDV – missing-DV flag (0 = DV present, 1 = DV missing)
  • CMT – compartment number receiving the dose or ADM – administration type (Simurg accepts either)
  • DUR – infusion duration

If one or more of these columns are absent, any check that relies on that column is skipped and the message "The column required for this check is missing." is shown.

  1. Presence of required columns

    Confirms that all five dose-related columns are in the dataset.

    • If any are missing: "The following required columns for Dosing-event checks are missing in the dataset:" followed by the list.
    • If none are missing: "All expected Dosing-event columns "AMT", "EVID", "MDV", "CMT", "ADM", "DUR" are present in the dataset."
  2. AMT vs EVID consistency

    Detects rows that combine an observation flag with a non-zero dose amount (EVID = 0 & AMT ≠ 0), i.e., dose information placed in observation rows.

    • Inconsistencies: "Inconsistencies have been found between the AMT and EVID columns. Rule: EVID = 0 & AMT ≠ 0" plus a table of ID, TIME, AMT, EVID.
    • None: "No inconsistencies have been found between the AMT and EVID columns."
  3. Zero-dose events

    Flags dosing rows that declare a dose amount of zero (AMT = 0 & EVID ≠ 0).

    • Issues found: "Dose amount 0 in dose event. Rule: AMT = 0 & EVID ≠ 0" plus a table of ID, TIME, AMT, EVID.
    • None: "No zero-dose amount for dose event."
  4. EVID vs MDV coherence

    Looks for dose events that also claim a non-missing dependent value in the same row (EVID ≠ 0 & MDV = 0).

    • If present: "Possible inconsistencies have been found between the EVID and MDV columns. Rule: EVID ≠ 0 & MDV = 0" plus a table of ID, TIME, MDV, EVID.
    • None: "No inconsistencies have been found between the EVID and MDV columns."
  5. AMT supplied without CMT/ADM

    Checks for non-zero doses that lack a target compartment or administration type (AMT ≠ 0 & (CMT = 0 | ADM = 0)).

    • Inconsistencies: "Dose amount without compartment (CMT) or administration type (ADM)" plus a table of ID, TIME, AMT, CMT/ADM.
    • None: "No inconsistencies have been found between columns AMT and CMT or AMT and ADM."
  6. Infusion-duration logic

    Verifies that dose rows belonging to a given CMT/ADM are either all bolus (DUR = 0) or all infusions (DUR > 0). Mixed usage triggers the warning AMT ≠ 0 & (all DUR = 0 | all DUR ≠ 0)

    • If violated: "Infusion duration time zero were detected" plus table of ID, TIME, AMT, CMT/ADM, DUR.
    • None: "No issues found with infusion duration time."
  7. Duplicate dose records

    Identifies duplicate dosing rows—same ID, same TIME, same CMT/ADM, and non-zero AMT*.

    • Duplicates: "Duplicates: same time, same CMT or ADM, and AMT ≠ 0 were detected." plus table of ID, TIME, AMT, CMT/ADM
    • None: "No duplicate time for same dose amount."
Figure 3. Example output from the Dosing events tab of Quality Check section.

The information returned by these checks helps correct dosing-record errors before advancing to modelling or simulation steps.

3. Time series

The Time Series tab contains 8 specific checks that assess the quality and consistency of longitudinal observations across time. These checks help ensure that observational data are well-structured, logically consistent, and reliable for analysis.

To enable all the checks in this tab, the dataset must include the following columns:

  • ID – subject identifier
  • TIME – time of observation or event
  • DV – dependent variable (measurement)
  • DVID/YTYPE – type of measurement (Simurg accepts either)
  • MDV – missing dependent variable flag (1 = missing, 0 = present)
  • EVID – event identifier (0 = observation, >0 = event such as dosing)

If any of these columns are missing, the checks relying on them will be skipped, and the message “The column required for this check is missing” will appear in place of the corresponding results.

  1. Presence of required columns

    Confirms that all six time series columns are in the dataset.

    • If any are missing: "The following required columns for time series checks are missing in the dataset:" followed by the list.
    • If all are present: "All required columns "ID", "TIME", "DV", "DVID/YTYPE", "MDV", and "EVID" for time series checks are present in the dataset."
  2. Missing data percentage (MDV == 1) per DVID

    This check calculates the proportion of measurements marked as missing (MDV = 1) for each DVID level.

    • Output: Table with columns DVID and MDV %
  3. Empty or zero DV when MDV ≠ 1 and DVID ≠ 0 This flags rows where the dependent variable is missing or zero despite being marked as valid observations. Rule: DV is empty or zero while MDV ≠ 1 and DVID ≠ 0.

    • If found: "IDs with empty or zero DV when MDV ≠ 1 and DVID ≠ 0:" plus table of ID
    • Otherwise: "No IDs found with empty DV value for observation event."
  4. Non-empty DV with EVID ≠ 0

    This check detects non-observation events that improperly contain DV values. Rule: EVID ≠ 0 & MDV = 0 with a non-empty or non-zero DV.

    • If found: "IDs with non-empty and non-zero DV while EVID ≠ 0 and MDV = 0:" plus ID list
    • Otherwise: "No IDs found with DV values for non observation event."
  5. General measurement statistics

    Presents summary statistics for each DVID, showing the number of measurements and percentage of missing values.

    • Output: "General statistics of the measurements:" plus table with columns DVID, Total measurements, Measurements with MDV = 1, Missing values %
  6. Check for positive/negative DV values

    Ensures that all measurement values make sense and are consistent in sign (e.g., no negative concentrations if not expected). Exclude rows with MDV = 1.

    • Output: "Check for positive/negative observation event:" plus table with columns DVID, Positive measurements, Negative measurements
  7. Duplicate observations

    Identifies duplicated time points per ID for the same DVID (excluding rows with MDV = 1). Rule: Duplicate (ID, TIME, DVID) combinations with AMT = 0.

    • If found: "IDs with duplicate measurements:" and table with ID, TIME, DVID, total measurements
    • Otherwise: "No IDs found with duplicate measurements."
  8. Repeated DV values across consecutive time points

    Flags sequences where the same DV value is repeated in 2 or more consecutive rows, which may suggest imputation or logging errors. Exclude MDV = 1; check is done per DVID.

    • If found: "Same value repeated in DV (for 2 or more sequential rows):" plus affected IDs
    • If none: "No repeated values in DV (for 2 or more sequential rows)"
Figure 4. Example output from the Time series tab of Quality Check section.

These checks serve as a vital step in confirming the consistency and integrity of observational data before modeling, simulation, or visual exploration.

1

Q1 = first quartile, Q3 = third quartile, IQR = inter-quartile range (Q3 – Q1).