Combining RWD And Machine Learning To Determine Meaningful Patient Populations
By Monica Nandagopal, senior analyst, Beroe Inc.

Clinical trials play a crucial role in driving medical innovation but frequently encounter challenges in enrolling diverse and representative patient populations. Combining RWD with advanced ML models offers a powerful and transformative solution to optimize patient recruitment. This article examines the essential factors that define meaningful patient cohorts, reviews important sources of RWD, and highlights machine learning techniques that improve patient identification and recruitment efficiency. Case studies from the industry showcase successful implementations of RWD and AI-powered tools that accelerate trial timelines and enhance inclusivity. The analysis emphasizes the importance of integrating multiple data sources alongside customized ML algorithms to overcome recruitment challenges, lower costs, and produce clinically generalizable results. This integrated approach is poised to become the future standard in clinical research, enabling faster and more equitable delivery of effective therapies to a wider patient population.1
The Need For Appropriate Patient Cohorts
Well-chosen cohorts improve trial validity, reduce timeline delays, and help meet regulatory expectations for real-world applicability. Data from articles published in 2019 showed that, compared to manual screening methods, automated eligibility prescreening reduced patient screening time by 30%-35%, increased the number of candidates who matched relevant trial criteria by 15%, and increased the number of patients approached and consented into the trial by 10%.2
A relevant patient population is necessary to generate statistically meaningful results that confirm treatment safety and efficacy. To ensure generalizable outcomes, trial populations should reflect real-world diversity in age, gender, ethnicity, and clinical features. Meaningful cohorts include underrepresented groups (e.g., elderly patients, minorities, those with comorbidities), which helps address equity and improve treatment relevance. Properly defined cohorts also reduce screen failure rates and enable timely enrollment, accelerating study completion and market access.
Patient Cohort Selection
Major factors that impact patient selection are clinical, geographic, patient awareness, and demographic parameters. Evaluating each parameter and assigning it an appropriate weight can help researchers determine a patient’s suitability for a trial. Pharmaceutical companies can apply these weights to identify the most important factors when selecting the right patients for a specific therapeutic area. For instance, when selecting RWD for rare diseases, clinical eligibility and access parameters should be assigned higher weight, since these factors exert a greater influence on the success of clinical trials.3
Sources: Beroe Analysis
The above are indicative weights that can be adopted when evaluating patients for clinical trials; they can be modified based on disease severity and patient access. Combining multiple RWD sources can give a better understanding of the patient’s status and clinical morbidity.
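A weighted evaluation of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the author's method: the weight values, the factor names, the 0-to-1 scoring scale, and the two sample patients are all hypothetical, chosen only so that clinical eligibility and access carry the most weight, as suggested for rare diseases.

```python
# Hypothetical weights for illustration only; they are NOT the article's
# indicative weights. They are chosen so clinical eligibility and geographic
# access dominate, as described for rare diseases. Weights must sum to 1.0.
RARE_DISEASE_WEIGHTS = {
    "clinical": 0.40,
    "geographic": 0.30,
    "awareness": 0.15,
    "demographic": 0.15,
}


def cohort_fit_score(factor_scores, weights):
    """Return the weighted sum of per-factor scores, each assumed to be in [0, 1]."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(factor_scores)
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    return sum(weights[name] * factor_scores[name] for name in weights)


# Hypothetical per-factor scores for two candidate patients.
candidates = {
    "patient_a": {"clinical": 0.9, "geographic": 0.8, "awareness": 0.4, "demographic": 0.5},
    "patient_b": {"clinical": 0.5, "geographic": 0.4, "awareness": 0.9, "demographic": 0.9},
}

# Rank candidates from best to worst fit under the chosen weights.
ranked = sorted(
    candidates,
    key=lambda pid: cohort_fit_score(candidates[pid], RARE_DISEASE_WEIGHTS),
    reverse=True,
)
print(ranked)
```

Under these assumed weights, the patient with strong clinical and access scores ranks ahead of the patient with strong awareness and demographic scores. Changing the weight dictionary for another therapeutic area changes the ranking without touching the scoring logic.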
RWD Sources
Traditional recruitment is often slow, costly, and inefficient, meaning trials often struggle to enroll enough eligible and diverse patients. RWD helps by providing a richer, broader view of patient populations that better reflect real-world demographics and clinical realities. Combining and harmonizing these sources allows for a comprehensive picture of disease prevalence, patient characteristics, and physician treatment patterns.3 RWD sources include:
Highlights Of RWD Adoption In Clinical Research
The analysis below describes the RWD landscape under various parameters, such as region and therapeutic area.
Source: Global Data
The above charts show that oncology is the leading therapeutic area using RWD and RWE, accounting for 34% of studies that incorporate these elements. Trials in the central nervous system field represent 12% of studies using RWD/RWE, followed by cardiovascular indications, which account for 10%.4
China leads in the geographic distribution of RWE trials, accounting for 30% of the global total. It is followed by Italy at 10%, Germany at 10%, the United States at 9%, and Japan at 8%.4
The higher concentration of RWD/RWE trials in certain geographies is influenced by factors such as regulatory environments, data availability and quality, healthcare system maturity, and local adoption of innovative trial methodologies. Combining multiple RWD sources such as EHRs, claims, registries, pharmacy, wearables, and patient-generated data enhances patient identification precision and recruitment efficiency. This multi-source integration is particularly valuable in oncology and complex therapeutic areas where patient heterogeneity and dynamic disease status are common.5-8
Potential Solutions
RWD Source Combinations For Recruitment Efficiency
Useful combinations of data sources include:
- EHR and Claims Data: Offers a comprehensive patient snapshot, including both detailed clinical profiles and care patterns. Suitable for large-scale studies requiring detailed patient journeys.
- EHR and Patient Reported Data: Enables patient recruiters to consider both objective medical criteria and subjective patient experiences, broadening eligibility reach. Suitable for quality-of-life studies, disease symptom-based recruitment.
- Hospital Data and Data Collected from Wearables: Allows for timely identification of patients who develop or progress to desired clinical status. Suitable for dynamic conditions, digital health studies.
- EHR and Lab Data: Reduces prescreening failures and can rapidly exclude patients with contraindicated medications or insufficient lab values before outreach. Suitable for drug trials with specific biomarker requirements.7,8
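As one illustration, the EHR and lab data combination could be applied as a rule-based prescreen that excludes patients before outreach. The sketch below is a simplified example under stated assumptions: the record layout, the field names (`patient_id`, `diagnosis`, `medications`, `biomarker`), the drug and condition labels, and the threshold are all hypothetical and do not correspond to any real trial protocol or data standard.

```python
# Hypothetical EHR extract: one record per patient.
ehr_records = [
    {"patient_id": "P1", "diagnosis": "condition_a", "medications": ["drug_x"]},
    {"patient_id": "P2", "diagnosis": "condition_a", "medications": ["drug_y"]},
    {"patient_id": "P3", "diagnosis": "condition_a", "medications": []},
    {"patient_id": "P4", "diagnosis": "condition_b", "medications": []},
]

# Hypothetical lab feed keyed by patient ID; P4 has no lab result on file.
lab_results = {
    "P1": {"biomarker": 75.0},
    "P2": {"biomarker": 80.0},
    "P3": {"biomarker": 40.0},
}


def prescreen(ehr, labs, diagnosis, contraindicated, min_biomarker):
    """Join EHR and lab data, returning eligible IDs and exclusion reasons."""
    eligible, excluded = [], {}
    for rec in ehr:
        pid = rec["patient_id"]
        if rec["diagnosis"] != diagnosis:
            excluded[pid] = "diagnosis mismatch"
        elif set(rec["medications"]) & contraindicated:
            excluded[pid] = "contraindicated medication"
        elif pid not in labs:
            excluded[pid] = "no lab data"
        elif labs[pid]["biomarker"] < min_biomarker:
            excluded[pid] = "insufficient lab value"
        else:
            eligible.append(pid)
    return eligible, excluded


eligible, excluded = prescreen(
    ehr_records,
    lab_results,
    diagnosis="condition_a",
    contraindicated={"drug_y"},  # assumed exclusion criterion
    min_biomarker=60.0,          # assumed lab threshold
)
print(eligible)
print(excluded)
```

Recording the reason for each exclusion, rather than only dropping the patient, keeps the prescreen auditable and makes it easier to see which criterion drives most screen failures.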
Machine Learning Models For Patient Prescreening
Using appropriate ML models for identifying patterns can help determine which patients are most likely to benefit from or meet the criteria for specific trials. The integration of EHRs takes this a step further by allowing real-time access to comprehensive patient data, such as medical history, diagnoses, and treatments.9-11
Different model algorithms are used for different purposes. Below are a few examples:
Sources: Secondary articles and Beroe Analysis
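To make the idea of predictive prescreening concrete, the sketch below trains a small logistic regression model by gradient descent and uses it to rank candidates by their estimated likelihood of passing screening. Logistic regression is used here only as a simple stand-in for the predictive models discussed above. The two features (share of inclusion criteria matched in structured data, and a recent-visit flag), the training rows, and the hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent, starting from zero weights."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this row drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that a patient with features x passes screening."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical history: [share of criteria matched, recent visit flag] -> passed screening?
X_train = [[0.9, 1], [0.8, 1], [0.7, 0], [0.85, 0], [0.3, 0], [0.2, 1], [0.4, 0], [0.1, 0]]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_logistic(X_train, y_train)

# Rank new, hypothetical candidates so recruiters can contact the likeliest matches first.
new_candidates = {"cand_1": [0.2, 0], "cand_2": [0.9, 1]}
ranking = sorted(new_candidates, key=lambda c: predict_proba(w, b, new_candidates[c]), reverse=True)
print(ranking)
```

In practice a model like this would be trained on far more data, validated on held-out patients, and checked for bias across demographic groups before being used to prioritize outreach.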
Real-World Examples Of RWD And ML
Pharma companies have started to use RWD and ML models to improve the quality of patient selection in clinical trials. Some real-world examples include:
Sources: Press releases
Recommendations For Adopting RWD And ML
Pharma companies are actively adopting RWD combined with advanced ML models to optimize patient recruitment in clinical trials. For those interested in adopting this approach:
- Use a combination of complementary RWD sources such as EHRs, claims data, patient-reported outcomes, hospital data, wearables, and laboratory data to gain a comprehensive and accurate understanding of patient profiles and disease status.
- Apply appropriate ML algorithms, including predictive modelling, natural language processing, ensemble methods, and deep learning to screen, identify, and prioritize patients who meet trial eligibility criteria, while also predicting patient retention and optimizing diversity.
- Incorporate a weighted evaluation of key factors (clinical eligibility, demographic diversity, geographic access, patient awareness, and data quality) to ensure cohorts represent real-world populations and address health disparities.
- Leverage AI-powered tools that integrate regulatory and compliance guidelines to produce high-quality recruitment materials and minimize risks of regulatory setbacks.
- Foster strategic partnerships between pharmaceutical companies and technology/data platform providers to accelerate implementation of AI-driven recruitment solutions and maximize patient inclusion and trial efficiency.
With these integrations, clinical trials can achieve faster, more cost-effective recruitment of relevant and diverse patient cohorts, ultimately reducing study timelines and improving the generalizability of trial outcomes. This approach supports the broader goal of bringing safe and effective therapies to patients more rapidly and equitably.15,16
References:
- M. Abdalah Ismail, P. Talha Al-Zoubi, P. Issam El Naqa and M. Hina Saeed, “The role of artificial intelligence in hastening time to recruitment in clinical trials,” Oxford Academic, p. 5, 2023.
- I. Spasic, D. Krzeminski, P. Corcoran and A. Balinsky, “Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach,” National Center of Biotechnology Information, 2019.
- Biopharma Dive, “Biopharma Dive,” 17 April 2023. [Online]. Available: https://www.biopharmadive.com/spons/patient-centered-clinical-trials-improve-recruitment-and-retention/647481/
- Global Data, “Clinical Trials Arena,” November 2024. [Online]. Available: https://www.clinicaltrialsarena.com/analyst-comment/real-world-evidence-trials-increase-2024/?
- M. Grabner, C. Molife, L. Wang, K. Winfree, Z. Cui, G. Cuyun Carter and L. Hess, “Data Integration to Improve Real-world Health Outcomes Research for Non–Small Cell Lung Cancer in the United States: Descriptive and Qualitative Exploration,” JMIR Cancer, vol. 7, no. 2, 2021.
- T. Beukelman, L. Chen, N. Annapureddy, J. Oates, M. E. B. Clowse, M. Long, M. D. Kappelman, R. L. Rhee, P. A. Merkel, W. B. Nowell, F. Xie, C. Clinton and J. R. Curtis, “Using pooled electronic health records data to conduct pharmacoepidemiology safety studies: Challenges and lessons learned,” National Center for Biotechnology Information, p. 32, 2023.
- M. S. Jansen, O. M. Dekkers, S. l. Cessie, L. Hooft, H. Gardarsdottir, A. d. Boer and R. H. H. Groenwold, “Real-World Evidence to Inform Regulatory Decision Making: A Scoping Review,” American Society for Clinical Pharmacology and Therapeutics, 2024.
- “IQVIA,” 04 July 2025. [Online]. Available: https://www.iqvia.com/library/publications/unlock-the-keys-to-effective-real-world-data-usage.
- X. Lu, C. Yang, L. Liang, G. Hu, Z. Zhong and Z. Jiang, “Artificial intelligence for optimizing recruitment and retention in clinical trials: a scoping review,” National Center for Biotechnology Information, p. 32, 2025.
- M. Samuel Kaskovich, M. Kirk D. Wyatt, P. Tomasz Oliwa, M. Luca Graglia, M. Brian Furner, P. Jooho Lee, P. Anoop Mayampurath and M. P. Samuel L. Volchenboum, “Automated Matching of Patients to Clinical Trials: A Patient-Centric Natural Language Processing Approach for Pediatric Leukemia,” American Society of Clinical Oncology Journal, vol. 7, 2023.
- A. Iyer and S. Narayanaswami, “A Novel Model Using ML Techniques for Clinical Trial Design and Expedited Patient Onboarding Process,” National Center for Biotechnology Information, 2025.
- K. Kantor and M. Morzy, “Machine learning and natural language processing in clinical trial eligibility criteria parsing: a scoping review,” National Center for Biotechnology Information, 2024.
- D. Kitishian, “Pfizer’s AI Strategy: Analysis of Dominance in Pharma,” Klover.Ai, 2025.
- G. Macdonald, “BioXconomy,” 18 November 2024. [Online]. Available: https://www.bioxconomy.com/clinical-and-research/sanofi-to-use-ai-to-accelerate-recruitment-in-phase-iii-ms-program
- Drug Patent Watch, “Drug Patent Watch,” 25 July 2025. [Online]. Available: https://www.drugpatentwatch.com/blog/8-applications-machine-learning-pharmaceutical-industry/.
- K. Getz, “New Insights On the Impact of AI-Enabled Solutions,” Applied Clinical Trials, vol. 34, no. 3, 2025.
About The Author:
Monica Nandagopal is a category research analyst with over six years of experience in market research and consulting. Her insights have supported top pharma companies’ strategic decisions on supplier outsourcing, category management, and planning. In the past year, she engaged in more than 10 market sourcing studies, five supplier data visualizations, and multiple quick, reactive analyses across clientele for global and regional requirements.