How The Quiet Data Standards Revolution Is Impacting Drug Approval
By Varun Debbeti, principal statistical programmer

When a patient hears that a drug was just approved by the FDA, they picture scientists in labs, clinical trials, and doctors reviewing data. Few picture the data itself: how it was structured, labeled, and formatted before a single reviewer at the agency could look at it. That infrastructure, invisible to patients and barely visible to clinicians, is where CDISC's Study Data Tabulation Model (SDTM) lives. And after two decades, it's getting its most significant overhaul yet.
SDTM has been the FDA's recognized standard for organizing clinical trial data in submissions since 2004, and the agency has required it for studies that started after December 17, 2016. Every NDA, ANDA, and most BLA submissions depend on it. The standard tells sponsors how to structure data sets so FDA reviewers can actually navigate them: what a demographics domain looks like, how adverse events get coded, where device data sits. For more than 20 years, it has held largely constant while the trials themselves became far more complex through decentralized studies, adaptive designs, cell and gene therapies, and real-world evidence. The model was built for a different era of drug development.
Where The Current Model Breaks Down
A Phase 3 oncology trial I worked on permitted rescreening and re-enrollment, a common protocol design in oncology but one that exposed fundamental limitations in the current SDTM framework. When a participant failed screening, later returned under a new subject number, and was successfully enrolled and treated, the mapping challenges began immediately. The demographics (DM) domain enforces one record per participant, so the sponsor must decide which participation (the failed screening or the successful enrollment) constitutes the "primary" record. The FDA Technical Conformance Guide recommends housing additional participations in a custom domain structured like DM, but no such domain exists in SDTMIG v3.4. Downstream, the adverse events (AE) domain compounds the problem: AEs collected across both participation periods collapse under a single subject identifier, and workarounds such as supplemental qualifier (SUPPQUAL) data sets and the EPOCH variable are the only mechanisms for distinguishing which events belong to which enrollment. That approach is fragile and driven by convention rather than structure.
SDTM v3.0 and SDTMIG v4.0 directly address this gap. The new demographics for multiple participations (DC) domain formalizes the capture of all participation instances, while the subject identifier (SUBJID) has been promoted to an identifier across all domain classes. AEs, labs, and other observations can now natively link to specific participations.
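To make the participation-level linkage concrete, here is a minimal Python sketch. It assumes a simplified record layout: the sample values, the OUTCOME field, and the helper name events_by_participation are hypothetical and are not taken from the DC specification, which is still in development.

```python
from collections import defaultdict

# Hypothetical participation records: one USUBJID with two SUBJIDs
# (a screen failure followed by a successful re-enrollment).
participations = [
    {"USUBJID": "STUDY1-001", "SUBJID": "1001", "OUTCOME": "SCREEN FAILURE"},
    {"USUBJID": "STUDY1-001", "SUBJID": "2045", "OUTCOME": "ENROLLED"},
]

# Hypothetical adverse events that carry SUBJID as an identifier.
adverse_events = [
    {"USUBJID": "STUDY1-001", "SUBJID": "1001", "AETERM": "HEADACHE"},
    {"USUBJID": "STUDY1-001", "SUBJID": "2045", "AETERM": "NAUSEA"},
    {"USUBJID": "STUDY1-001", "SUBJID": "2045", "AETERM": "FATIGUE"},
]


def events_by_participation(participations, events):
    """Group event terms under the (USUBJID, SUBJID) pair they belong to."""
    known = {(p["USUBJID"], p["SUBJID"]) for p in participations}
    grouped = defaultdict(list)
    for event in events:
        key = (event["USUBJID"], event["SUBJID"])
        if key not in known:
            # An event pointing at an unrecorded participation is a data issue.
            raise ValueError(f"Event references unknown participation: {key}")
        grouped[key].append(event["AETERM"])
    return dict(grouped)


linked = events_by_participation(participations, adverse_events)
```

With the participation identifier on every record, separating the screening-period headache from the on-treatment events is a plain key lookup rather than a convention each sponsor has to document.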
What SDTM v3.0 Actually Changes
SDTM v3.0, developed in parallel with SDTMIG v4.0, is the most substantive structural overhaul of CDISC tabulation standards in decades. Beyond clarifying edits, the model reorganizes domain specification tables (centralizing assumptions and splitting overloaded columns), introduces machine-readable, automation-ready metadata, and replaces the brittle vertical supplemental data set (SUPP) construct with a simpler non-standard variables (NS) framework that enables direct merges and typed variables. Protocol deviations also move toward greater consistency, with an explicit classification mechanism emerging to standardize how severity is represented across sponsors.
The replacement of supplemental qualifier (SUPPQUAL) data sets with NS domains is, in my view, the most underestimated change in SDTM v3.0 and SDTMIG v4.0. SUPPQUAL has been a persistent source of inefficiency for statistical programmers since its introduction. Every SUPPQUAL data set requires transposing collected data from its natural horizontal structure into a narrow, vertical name-value pair format (QNAM, QLABEL, QVAL, QORIG) and then reversing that transpose during analysis. This round trip costs real programming hours across every study, for every domain that carries non-standard variables.
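The round trip described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a production mapping: the sample records are made up, AETRTEM stands in for any non-standard variable, and the helper names to_suppqual and merge_suppqual are hypothetical.

```python
# Collected AE records with one non-standard variable (AETRTEM) sitting
# alongside the standard ones. All values are made up for illustration.
collected = [
    {"USUBJID": "STUDY1-001", "AESEQ": 1, "AETERM": "NAUSEA", "AETRTEM": "Y"},
    {"USUBJID": "STUDY1-001", "AESEQ": 2, "AETERM": "FATIGUE", "AETRTEM": "N"},
]

# Non-standard variable name mapped to its label.
NON_STANDARD = {"AETRTEM": "Treatment Emergent Flag"}


def to_suppqual(records, rdomain, idvar, non_standard):
    """Transpose non-standard columns into vertical name-value rows."""
    rows = []
    for rec in records:
        for qnam, qlabel in non_standard.items():
            if rec.get(qnam) in (None, ""):
                continue  # no supplemental row is written for a missing value
            rows.append({
                "RDOMAIN": rdomain,
                "USUBJID": rec["USUBJID"],
                "IDVAR": idvar,
                "IDVARVAL": str(rec[idvar]),
                "QNAM": qnam,
                "QLABEL": qlabel,
                "QVAL": str(rec[qnam]),  # QVAL is always stored as text
                "QORIG": "CRF",
            })
    return rows


def merge_suppqual(parent, supp):
    """Reverse the transpose: restore each QNAM as a column on its parent row."""
    lookup = {}
    for row in supp:
        key = (row["USUBJID"], row["IDVAR"], row["IDVARVAL"])
        lookup.setdefault(key, {})[row["QNAM"]] = row["QVAL"]
    merged = []
    for rec in parent:
        out = dict(rec)
        for (usubjid, idvar, idvarval), quals in lookup.items():
            if usubjid == rec["USUBJID"] and str(rec.get(idvar)) == idvarval:
                out.update(quals)
        merged.append(out)
    return merged


# Split the non-standard column out, then put it back for analysis.
parent = [{k: v for k, v in rec.items() if k not in NON_STANDARD} for rec in collected]
supp = to_suppqual(collected, "AE", "AESEQ", NON_STANDARD)
restored = merge_suppqual(parent, supp)
```

Two functions and a string-keyed lookup are needed just to end up where the collected data started. Because QVAL is character, a numeric non-standard value would also come back as text and need retyping, which is the overhead a directly mergeable, typed structure is meant to remove.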
None of this is abstract. Variability or nonconformance in data standards can introduce avoidable friction during FDA review. When agencies receive nonconformant or inconsistently structured submissions, they issue information requests, ask for resubmissions, or put studies on clinical hold. Every week of delay in that process is a week patients with serious or rare conditions wait. In oncology, where I’ve spent more than eight years working across Phase 1 through Phase 4 studies, the gap between a data package that reviewers can immediately navigate and one that requires clarification can translate into months. Data standards quality sits on the critical path to approval.
Replacing A 35-Year-Old File Format
Running parallel to the standards update is a format change that has been building quietly for years. The SAS XPORT transport format (the file type FDA has required for data submissions since 1999, itself built on a format dating to 1989) is finally being challenged. In April 2025, the FDA issued a Federal Register notice requesting public comment on CDISC Dataset-JSON v1.1 as a long-term replacement for XPT. The XPT format imposes real constraints: eight-character variable name limits, no Unicode support, and poor storage efficiency. Dataset-JSON eliminates all of them. An SDTM LB data set I examined went from 2,699 KB in XPT to 640 KB in Dataset-JSON, a reduction of roughly 76% that could eliminate data set splitting for large ADaM submissions approaching FDA's 5 GB threshold. A 2023–2024 CDISC-PHUSE-FDA pilot demonstrated that Dataset-JSON can serve as a direct substitute in regulatory submissions. The transition isn't finalized, but the direction is clear. The real friction is institutional readiness; teams that build working familiarity with JSON workflows now will be better positioned when regulatory timelines formalize.
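A small Python sketch shows how the two XPT limits named above surface in practice. It is a simplification under assumptions: "no Unicode support" is approximated as an ASCII-only check, the column name LBSPECIMEN is a deliberately over-long hypothetical, the sample values are made up, and xpt_constraint_issues is not part of any real validation tool.

```python
def xpt_constraint_issues(columns, rows):
    """Return readable issues for the two XPT limits named in the text."""
    issues = []
    for name in columns:
        if len(name) > 8:
            issues.append(f"variable name '{name}' exceeds 8 characters")
    for row_number, row in enumerate(rows, start=1):
        for name, value in zip(columns, row):
            # Simplifying assumption: any non-ASCII text counts as unsupported.
            if isinstance(value, str) and not value.isascii():
                issues.append(f"row {row_number}, {name}: non-ASCII text")
    return issues


# LBSPECIMEN is a hypothetical 10-character name used to trigger the check.
columns = ["USUBJID", "LBTESTCD", "LBSPECIMEN", "LBORRESU"]
rows = [
    ["STUDY1-001", "GLUC", "SERUM", "mg/dL"],
    ["STUDY1-001", "CREAT", "SERUM", "µmol/L"],
]

issues = xpt_constraint_issues(columns, rows)

# Size reduction for the LB example quoted above (2,699 KB down to 640 KB).
reduction = 1 - 640 / 2699
```

Here issues holds one entry for the over-long name and one for the unit containing "µ", and reduction works out to about 0.76. Neither flagged item would be a problem in a JSON-based format, where names are ordinary strings and text is Unicode.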
Closing The Conformance Interpretation Gap
The other significant shift is CORE, the CDISC Open Rules Engine. Before CORE, conformance rules existed as written specifications that every organization interpreted independently. The same rule could be operationalized differently by two sponsors, two CROs, or two validation tools, with none of them necessarily matching what FDA reviewers expected. CORE publishes executable conformance rules in machine-readable format, available as open-source software, that any organization can run directly against its data. The interpretation gap closes. Sponsors can check conformance continuously throughout a study, not only at the submission finish line, catching data integrity problems months earlier than current practice allows.
For example, when reference range values are missing in the laboratory findings (LB) domain, FDA's validation software flags a warning that different sponsors and CROs resolve inconsistently. One may document the gap in the reviewer's guide while another retrospectively sources values from site records, applying different logic to the same rule in the same domain. CDISC CORE eliminates this variability by replacing narrative guidance with executable conformance rules, so that a single implementation produces consistent outcomes regardless of therapeutic area, organizational convention, or validation tool.
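The idea of a rule as executable logic rather than narrative can be illustrated with a short Python sketch of the LB example. This is not CORE's rule format or API: the function name, the rule identifier DEMO.LB.001, the choice to flag only rows where both limits are missing, and the sample records are all assumptions made for illustration.

```python
def check_lb_reference_ranges(lb_records):
    """Flag LB rows with a numeric result but neither reference range limit."""
    findings = []
    for rec in lb_records:
        has_result = rec.get("LBSTRESN") is not None
        has_range = (
            rec.get("LBSTNRLO") is not None or rec.get("LBSTNRHI") is not None
        )
        if has_result and not has_range:
            findings.append({
                "rule": "DEMO.LB.001",  # hypothetical rule identifier
                "USUBJID": rec["USUBJID"],
                "LBSEQ": rec["LBSEQ"],
                "message": "Numeric result present but reference range is missing",
            })
    return findings


# Made-up LB records: the second row has a result but no range limits.
lb = [
    {"USUBJID": "STUDY1-001", "LBSEQ": 1, "LBTESTCD": "GLUC",
     "LBSTRESN": 5.4, "LBSTNRLO": 3.9, "LBSTNRHI": 5.6},
    {"USUBJID": "STUDY1-001", "LBSEQ": 2, "LBTESTCD": "CREAT",
     "LBSTRESN": 88.0, "LBSTNRLO": None, "LBSTNRHI": None},
    {"USUBJID": "STUDY1-001", "LBSEQ": 3, "LBTESTCD": "ALT",
     "LBSTRESN": None, "LBSTNRLO": None, "LBSTNRHI": None},
]

findings = check_lb_reference_ranges(lb)
```

Once the condition is written as code, anyone who runs it on the same records gets the same single finding for LBSEQ 2. What to do about that finding remains a sponsor decision, but whether the rule fired is no longer open to interpretation.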
What Gaps Still Need To Close
CORE's rule coverage is still expanding, and the Dataset-JSON transition timeline is uncertain. SDTM v3.0 itself has not been finalized; it is still in development, with a public review process that the community should be actively engaging with. Real-world data and decentralized trial designs still lack the standardization depth that traditional interventional trials have. The gap between what trials actually generate and what the submission framework cleanly accommodates hasn't fully closed.
For patients with rare diseases or aggressive cancers where no approved alternative exists, the distance between a two-year and a three-year regulatory timeline is not an abstraction. Consider a patient with recurrent osteosarcoma of the limb, where surgery was the only conventional option and failed due to tumor regrowth, leaving amputation as the remaining path forward. A targeted therapy in late-stage review could change that outcome permanently, but only if it arrives in time. The structural improvements underway with SDTM v3.0, Dataset-JSON, and CDISC CORE do not make headlines, but they reduce the friction between a completed trial and a regulatory decision. For the statistical programmer, that means cleaner submissions and fewer review cycles. For the patient waiting on the other side of that process, it can mean a limb saved, a life altered, and a future that looked very different just months before.
About The Author:
Varun Debbeti is a statistical programming professional with over a decade of experience supporting Phase 1 through Phase 4 oncology clinical trials within the pharmaceutical industry. He has contributed to clinical data programming and regulatory submission activities supporting approvals by the FDA and the EMA, including work associated with Ogsiveo for desmoid tumors and the BLA submission for YESCARTA, developed by Kite Pharma for the treatment of certain forms of non-Hodgkin lymphoma. In addition to his industry experience, he is the founder of ClinStandards.org, an educational non-profit platform where he writes technical articles on evolving CDISC data standards for the clinical research community.