Big data in clinical research

Sarah Yao

Wednesday, August 30th, 2017

Sarah Yao
MBBS III, Monash University

Sarah is a third-year medical student at Monash University. She has come to appreciate that she has various interests in medicine and surgery, but is particularly interested in paediatrics, medical education, and research.

Background: Medicine is an ever-evolving field of knowledge, new practice, and research. There are various clinical research methodologies; the clinical researcher may actively collect patient information, or retrospectively obtain patient data from traditional datasets, such as hard-copy patient records. In more recent years, clinical research has seen the emergence of ‘big data’.


Big data are large electronic databases characterised by the four V’s: volume, variety, veracity, and velocity. The rise of big data suggests that there are advantages to its use. One advantage of big data is easy accessibility, which allows information to be obtained and analysed in a short period of time. However, there are shortcomings of using big data in clinical research, mainly with regard to sampling bias and validity. Nonetheless, big data are here to stay in today’s digitised age of medicine, and the researcher must consider the appropriate contexts for the use of big data in clinical research.


Aim: The aim of this paper was to define ‘big data’ in medicine and examine its use in clinical research.


Methods: A literature review was conducted on Ovid MEDLINE to identify relevant literature. The PRISMA statement was used to screen and select articles that would be reviewed for the paper.


Conclusion: The future of big data is promising, with the allure of low-cost, immediate, and comprehensive data, but it is important that clinical researchers understand how to utilise such data well for research and knowledge translation.



The future of medical practice is shaped by the outcomes of today’s clinical research trials. Medicine is an increasingly data-intensive field reliant on clinical research [1]. For decades, the clinical research industry has conducted large amounts of research by either actively collecting patient data or retrieving it from hard-copy patient records [2]. However, recent years have seen the emergence of ‘big data’ as the key source of data for clinical trials and observational clinical research alike [3,4]. Big data in medicine broadly refers to the aggregation of individual medical datasets into large, electronic databases that are readily available for data analysis in clinical research [3,5]. The rise of big data in clinical research suggests that there are obvious advantages to its use. However, there are also challenges in optimising its use in clinical research due to the risk of bias [6]. The aim of this paper was to define ‘big data’ in medicine and examine its use in clinical research.



This literature review examined recent literature that focused on the use of big data in clinical research. This included researching its validity, reliability, advantages, and disadvantages. Recent literature was defined as articles published from 2005 until 2016.

A literature search was performed on Ovid MEDLINE using the following query:

(“validity” OR “reliability” OR “advantages” OR “disadvantages”)

AND

(“administrative data” OR “database” OR “big data” OR “electronic health records” OR “electronic medical record” OR “clinical database”)

AND

(“clinical research” OR “healthcare”)

The PRISMA statement was used when selecting articles for this review (Figure 1). A total of 164 publications were identified through the database searches. One additional record was identified through the bibliography of one of the articles retrieved from the database search. The abstracts of these articles were manually reviewed for relevance to the topic, excluding 119 articles. Full-text screening excluded a further 23 articles due to inappropriate topic focus and repetition. This paper subsequently focused on reviewing 23 articles. These articles were primarily original research articles and systematic reviews.

Figure 1. PRISMA flow diagram demonstrating search strategy.
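The screening counts reported above can be checked with simple arithmetic. The following is a minimal sketch only; the variable names are illustrative and are not part of the PRISMA statement.

```python
# Record counts as reported in the text of this review.
database_records = 164       # identified through the Ovid MEDLINE search
bibliography_records = 1     # identified through a retrieved article's bibliography

excluded_on_abstract = 119   # excluded after manual abstract review
excluded_on_full_text = 23   # excluded after full-text screening

# Follow the flow of records from identification to inclusion.
identified = database_records + bibliography_records
after_abstract_screen = identified - excluded_on_abstract
included = after_abstract_screen - excluded_on_full_text

print(identified, after_abstract_screen, included)
```

The final count agrees with the 23 articles reviewed in this paper.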

What is big data?

  • Defining big data in a medical context

Big data in medicine is characterised by the four V’s – volume, variety, veracity, and velocity [7]:

  1. Volume refers to the large amounts of patient information being collected over time and stored, as suggested by the term ‘big data’. Various elements incorporated into big data include patient demographics, history, investigations, diagnoses, and length of stay [8]. Big data are available in the form of registries and patient databases. Registries gather disease- or population-specific information, while patient databases document patient information throughout the course of an illness [8].
  2. Variety of data can be broadly discussed as structured or unstructured data. Structured data is information that is easily stored, searched, retrieved, edited, and analysed digitally, as in keying in patient ID numbers into electronic medical records to access patient information [7]. In contrast, unstructured data include traditional print records, electronic free text, radiographic films, or survey data collected from patients [9].
  3. Veracity concerns the true representativeness of the data. It refers to the goal of achieving validity and credibility in the data set [7].
  4. Velocity represents the rate at which data is recorded and generated to allow timely retrieval for analysis and decision-making [7].
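The distinction between structured and unstructured data can be illustrated with a short sketch. The patient IDs, field names, and clinical note below are hypothetical and were invented for illustration; they are not drawn from any of the reviewed studies.

```python
import re

# Hypothetical structured records: fixed fields keyed by a patient ID.
structured_records = {
    "P001": {"age": 67, "diagnosis": "pneumonia", "length_of_stay_days": 5},
    "P002": {"age": 54, "diagnosis": "appendicitis", "length_of_stay_days": 2},
}

# Hypothetical unstructured data: a free-text clinical note.
free_text_note = (
    "Pt admitted with cough and fever. Temp 38.6 C. CXR shows consolidation."
)


def lookup_length_of_stay(patient_id):
    """Structured data: a direct lookup by key returns the stored value."""
    return structured_records[patient_id]["length_of_stay_days"]


def extract_temperature(note):
    """Unstructured data: the value must first be parsed out of the text."""
    match = re.search(r"Temp\s+(\d+(?:\.\d+)?)\s*C", note)
    return float(match.group(1)) if match else None


print(lookup_length_of_stay("P001"))
print(extract_temperature(free_text_note))
```

The structured lookup relies only on the key, whereas the free-text extraction depends on how the note happens to be worded, which is one reason unstructured data are harder to analyse at scale.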


Advantages of using big data  

  • Accessibility and availability

Big data are readily available [10]. Patient records, such as admission history, investigations, diagnostic results, and medications, are all electronically documented on hospital databases. As these hospital databases are installed on staff computers in the hospital, health professionals working in these hospitals are able to easily access this data for review or, increasingly, for clinical research. The integration of multi-pathway patient records in big data provides a convenient, comprehensive pool of information available to researchers [11]. This integration facilitates retrospective cohort studies and therefore helps researchers to identify patterns in disease progression and compare the effectiveness of treatments [11].

  • Cost- and time-efficiency

Given the convenience of data collection using electronic patient registers, the process of obtaining information needed for clinical research is shortened, in comparison to the more time-consuming alternative of manually collecting patient data [12]. Big data are useful in minimising logistical impediments in prospective and retrospective, longitudinal, population-based studies [13,14]. Researchers who require large sample sizes can also easily extract information from the available pool of data in these databases, potentially increasing the study power of their research [13]. The added benefit of being able to use computerised techniques to analyse unstructured data within big data also means that finer data acquisition can be performed, compared to data acquired by laborious, manual extraction from traditional datasets [14].  In addition, this information is available at a low cost, if not free, to clinical research staff, bypassing potential additional costs that might be incurred through manual data collection [15].
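The relationship between sample size and study power can be sketched numerically. The example below is an illustration under stated assumptions: it uses the standard normal approximation for comparing two proportions, and the outcome rates of 10% and 15% are hypothetical. It is not a method taken from the cited studies.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p_control, p_treated, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation and ignores the negligible
    contribution of the opposite rejection tail.
    """
    standard_error = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treated * (1 - p_treated) / n_per_group
    )
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_treated - p_control) / standard_error
    return NormalDist().cdf(z_effect - z_critical)


# Hypothetical outcome rates: 10% in one group versus 15% in the other.
for n in (100, 1000, 5000):
    print(n, round(approximate_power(0.10, 0.15, n), 3))
```

Under these assumptions, power rises steeply as the number of patients per group increases, which is the benefit that large databases potentially offer.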


Challenges of big data

Kaplan et al. [16] suggest that several biases can arise when analysing big data, including, but not limited to, sampling bias and lack of scope in the information recorded. In addition, the validity of big data is highly dependent on the context in which it is being used [17]. Lastly, minor data security issues may arise from the utilisation of big data.

  • Sampling bias and lack of scope

Sampling bias of big data can be discussed in terms of its standardisation and completeness. Completeness of data encompasses both its comprehensiveness and whether it is a good representation of the population of interest [17]. Clinical research often requires data collection from a large sample size of patients. As every patient will have different investigations, diagnoses, and treatment plans, every patient will have varying types and amounts of clinical documentation, recorded in differing degrees of detail. There will, therefore, be difficulty in standardising a method for data collection across an array of available patient information to ensure completeness of the data. It is crucial to ensure that the data is complete, otherwise the research results could be subject to information bias [8]. Typically, the ideal method to achieve this is to conduct prospective data collection, minimising omissions [8]. However, as big data is retrospective, it is often difficult to agree upon a decision regarding inclusion of the data or methods to retrieve missing data when medical records are not available [8]. In those situations, the clinical researcher will be required to design algorithms to clean and correct the available data; however, it is difficult to design an objective method to validate the choices made in this process of data collection.
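One simple way to quantify completeness before any cleaning is to report the proportion of records with a value for each field. The sketch below assumes hypothetical records and field names; `None` is used here to mark a value that is missing from the source record.

```python
# Hypothetical extracted records; None marks a value missing from the source.
records = [
    {"age": 71, "primary_diagnosis": "hip fracture", "comorbidities": "diabetes"},
    {"age": 64, "primary_diagnosis": "hip fracture", "comorbidities": None},
    {"age": None, "primary_diagnosis": "hip fracture", "comorbidities": None},
    {"age": 80, "primary_diagnosis": "hip fracture", "comorbidities": "hypertension"},
]


def field_completeness(rows):
    """Return the proportion of rows with a recorded value for each field."""
    fields = rows[0].keys()
    return {
        field: sum(row[field] is not None for row in rows) / len(rows)
        for field in fields
    }


completeness = field_completeness(records)
for field, proportion in completeness.items():
    print(f"{field}: {proportion:.0%} complete")
```

A report of this kind does not resolve missing data, but it makes the extent of the problem explicit before decisions about inclusion or correction are made.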

In addition, the coding of information is very much skewed towards documenting and following up the primary diagnoses [17]. As such, secondary diagnoses are often missed or poorly recorded, resulting in a lack of well-documented secondary patient information, such as co-morbidities.

  • Validity of big data

Joppe defines validity in quantitative research as a criterion that determines whether a study truly measures what it was initially intended to measure [18]. The validity of big data varies between different clinical specialties and the circumstances in which the data is being used [17].

Occasionally, big data may contain incomplete data sets, or even incorrect data, due to errors in transcription or abstraction [8]. Data are sometimes misclassified during the coding process [17]. This may occur when a patient undergoes a procedure that treats more than one condition, or when a patient’s hospital admission is recorded based on the presenting complaint [17]. Such systematic errors can make the data misrepresentative [17]. A 2013 literature review by Talbert and Lou Sole [8] found substantial research suggesting that administrative databases, a subset of big data, have only moderate sensitivities and specificities for correct data coding and may underreport procedures.
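Sensitivity and specificity of data coding are typically calculated by comparing coded diagnoses against a reference standard, such as chart review. The following sketch uses the standard definitions; the counts are hypothetical and are not taken from the cited review.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of true cases that were correctly coded as cases."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of non-cases that were correctly coded as non-cases."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical validation of a diagnosis code against chart review.
true_positives = 60    # condition present and coded
false_negatives = 40   # condition present but not coded (underreporting)
true_negatives = 880   # condition absent and not coded
false_positives = 20   # condition absent but coded

print(f"Sensitivity: {sensitivity(true_positives, false_negatives):.2f}")
print(f"Specificity: {specificity(true_negatives, false_positives):.2f}")
```

In this hypothetical example, a moderate sensitivity means that a substantial share of true cases would be absent from any analysis that relies on the code alone.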

The increasing trend of activity-based funding of hospitals in some countries, such as the United States and Australia, may also influence the information recorded in big data at discharge [19]. Activity-based funding is a policy intervention targeted at restructuring incentives across healthcare systems through a fixed funding allocation for each episode of care administered to each patient, regardless of their duration of stay and resources used [19]. Obvious benefits include reduced hospital costs and shorter hospital stays; however, hospitals may misuse the system to increase revenue by up-coding diagnoses, or focusing on profitable patients and procedures [19]. As a result, the diagnoses and procedures included in the discharge coding within big data may misrepresent the actual situation in the hospital.

It is crucial to note that electronic medical records adapted for clinical research serve the purpose of a clinical care record and are not designed for research [20]. Electronic inpatient databases document the clinician’s case notes, which often focus on treating the patient’s current illness and respond to the individual clinician’s concerns. These may not always correspond with the aims of future clinical researchers. As such, the available information on the patient may not necessarily be as comprehensive as required by the clinical researcher [12].

Analysis of inaccurate data may cause incorrect conclusions to be drawn. In situations where researchers simply use whatever big data are provided to them, the validity of the clinical research is compromised as the data collected and analysed may not truly reflect the research aims.

  • Data security

Griebel et al. [1] suggest that users who lack experience in using big data and third-party users could potentially pose a threat to data confidentiality. Such circumstances may occur when healthcare providers work with commercial corporations and outsource the information to a commercial cloud [1]. However, mitigation strategies, such as the implementation of high-security data authentication protocols to limit access, can be put in place to ensure data security [1]. Examples of high-security data authentication protocols include advanced firewalls to prevent access by unauthorised users and setting up a digital certificate, which requires the user to identify himself or herself [5]. There are also newer techniques, such as obfuscation, where patient data is stored in an encrypted form and decryption is only allowed through authorised privacy manager software [5].


Choosing appropriate contexts to utilise big data in clinical research

Big data are beneficial to clinical research in providing the following information:

  • Patient demographics and risk factor profile for disease

Big data are highly applicable in the field of patient profile analytics [2]. Big data can be used to identify relationships between patient demographics and disease or treatment outcomes. By routine monitoring and documentation of patient flow and outcomes, big data allow the incidence and prevalence of diseases, as well as the overall outcome amongst selected patient groups, to be estimated [17]. Furthermore, big data are ideal for developing predictive analytic models based on risk factor profile. As big data capture patient demographics, they help the clinical researcher pinpoint patient risk factors specific to certain diseases and draw links with disease progression, and hence have the potential to be used in developing prediction models [17]. Moreover, risk factors can also be prognostic, and highlight the possibility of a future health outcome [17]. When advanced analytics are applied to these patient profiles and patients at risk of developing specific diseases are identified, there is the opportunity to intervene and provide preventive care to the selected group of patients [2].
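Prevalence and incidence can be estimated directly from counts held in a patient database. The sketch below uses the standard epidemiological definitions of point prevalence and cumulative incidence; the population and case counts are hypothetical.

```python
def point_prevalence(existing_cases, population):
    """Proportion of the population with the disease at a point in time."""
    return existing_cases / population


def cumulative_incidence(new_cases, population_at_risk):
    """Proportion of the at-risk population that develops the disease
    during the follow-up period."""
    return new_cases / population_at_risk


# Hypothetical registry counts at the start of a one-year follow-up.
population = 10_000
existing_cases = 500
population_at_risk = population - existing_cases  # disease-free at baseline
new_cases = 190                                   # diagnosed during follow-up

print(f"Prevalence: {point_prevalence(existing_cases, population):.1%}")
print(f"Incidence: {cumulative_incidence(new_cases, population_at_risk):.1%}")
```

Note that the incidence denominator excludes patients who already have the disease, a step that depends on the database recording existing diagnoses accurately.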

  • Patient treatment outcome

Additionally, by combining both structured and unstructured data across multiple disciplines—medical and surgical clinical data, financial and operational data, and genomic data—to match treatments with outcomes, big data can also predict treatment effectiveness for patients [2]. Collectively, these suggest that big data can be useful in calculating the risks and benefits for various outcomes of both a disease and treatment in different patient groups, hence enabling the clinician to provide more efficient and cost-effective care [2,15].


The future of big data

Improvements in big data organisation and an increasing familiarity with using big data will allow clinical researchers to better utilise the data to their advantage. For instance, researchers are progressively able to model inclusion criteria to obtain relevant data [21]. Emerging technological infrastructure can be expected to expand the potential of big data in medicine. For example, cloud computing allows larger datasets to be stored and processed more quickly. Cloud computing has the potential to provide researchers with multi-scale data integration tools that will help highlight relationships between discrete data entities [22,23]. Cloud computing will also enable researchers to customise personal networks and virtual servers to increase data security of the electronic resources being used [1].

Beyond clinical practice, there is also potential for big data in other healthcare areas [2]. Big data have the potential to integrate population clinical data sets with genomics data, facilitating pharmaceutical development [2]. There is also a role for big data to play in public health surveillance. Big data can aid in analysing and tracking disease patterns, which is of utmost importance in delivering effective and efficient healthcare responses during disease outbreaks [2].



Big data are a useful and efficient source for obtaining patient information. They offer immediate access to large amounts of patient data with high convenience, low cost, and easy accessibility. However, big data may be a poor source for immediate causal inference in data analysis as they lack randomisation. Yet there is much potential for big data in clinical research, and clinical researchers must improve their utilisation of big data in knowledge translation and data analysis. Big data must be handled appropriately, through well-designed algorithms and data analysis, to overcome these limitations. Nonetheless, with its allure of low-cost, immediate, and comprehensive data, the rise of big data is promising. It is here to stay.

Conflicts of interest

None declared.



The author of this paper would like to thank Dr Nora Mutalima (Research Co-ordinator Orthopaedic Services, Dandenong Hospital and Adjunct Research Fellow, Monash University) for her critical review and support.



[1]       Griebel L, Prokosch H, Köpcke F, Toddenroth D, Christoph J, Leb I et al. A scoping review of cloud computing in healthcare. BMC Med Inform Decis Mak. [Internet]. 2015 March [cited 20 May 2016];15(17):1-16. Available from:

[2]       Berger M, Doban V. Big data, advanced analytics and the future of comparative effectiveness research. Journal of Comparative Effectiveness Research. [Internet]. 2014 March [cited 25 May 2016];3(2):167-76. Available from:

[3]       Ketchersid T. Big Data in Nephrology: friend or foe? Blood Purif. [Internet]. 2014 January [cited 20 May 2016];36(3-4):160-4. Available from:

[4]       McCowan C, Thomson E, Szmigielski C, Kalra D, Sullivan F, Prokosch H, Dugas M. Using electronic health records to support clinical trials: a report on stakeholder engagement for EHR4CR. BioMed Research International. [Internet] 2015 June [cited 2 October 2016];2015:1-8. Available from:

[5]       Jee K, Kim G. Potentiality of big data in the medical sector: focus on how to reshape the healthcare system. Healthcare Informatics Research. [Internet] 2013 June [cited 2 October 2016];19(2):79-85. Available from:

[6]       Van Walraven C, Austin P. Administrative database research has unique characteristics that can risk biased results. Journal of Clinical Epidemiology. [Internet]. 2012 February [cited 20 May 2016]; 65 (2): 126-31. Available from:

[7]       Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. [Internet] 2014 February [cited 24 May 2016];2(1):3. Available from:

[8]       Talbert S, Lou Sole M. Too much information. Clinical Nurse Specialist. [Internet] 2013 March [cited 24 May 2016];27(2):73-80. Available from:

[9]       Berger M, Doban V. Big data, advanced analytics and the future of comparative effectiveness research. Journal of Comparative Effectiveness Research. [Internet]. 2014 March [cited 25 May 2016];3(2):167-76. Available from:

[10]     Jolley, RJ et al. Validity of administrative data in recording sepsis: a systematic review. Critical Care. [Internet]. 2015 April [cited 25 May 2016]; 19(1):139. Available from:

[11]     Sterckx S, Rakic V, Cockbain J, Borry P. “You hoped we would sleep walk into accepting the collection of our data”: controversies surrounding the UK scheme and their wider relevance for biomedical research. Medicine, Health Care and Philosophy. [Internet] 2016 June [cited 25 May 2016];19(2):177-90. Available from:

[12]     Byrne N, Regan C, Howard L. Administrative registers in psychiatric research: a systematic review of validity studies. Acta Psychiatrica Scandinavica. [Internet] 2005 December [cited 25 May 2016]; 112 (6): 409-14. Available from:

[13]     Lopushinsky SR et al. Accuracy of administrative health data for the diagnosis of upper gastrointestinal diseases. Surgical Endoscopy. [Internet] 2007 October [cited 1 June 2016];21(10):1733-7. Available from:

[14]     Murdoch T, Detsky A. The inevitable application of big data to health care. The Journal of the American Medical Association. [Internet] 2013 April [cited 2 October 2016];309(13):1351-2. Available from:

[15]     Angus D. Fusing randomized trials with big data. JAMA. [Internet] 2015 August [cited 1 June 2016]; 314(8):767-8. Available from:

[16]     Kaplan R, Chambers D, Glasgow R. Big data and large sample size: a cautionary note on the potential for bias. Clinical and Translational Science. [Internet] 2014 July [cited 1 June 2016]; 7(4):342-6. Available from:

[17]     Cook J, Collins G. The rise of big clinical databases. British Journal of Surgery. [Internet]. 2015 January [cited 20 May 2016];102(2): 93-101. Available from:

[18]     Joppe M. The Research Process. [Internet]. Ryerson University. 2000 [cited 3 March 2017]; Available from:

[19]     Palmer K, Agoritsas T, Martin D, Scott T, Mulla S, Miller A et al. Activity-based funding of hospitals and its impact on mortality, readmission, discharge destination, severity of illness, and volume of care: a systematic review and meta-analysis. PLoS ONE. [Internet] 2014 October [cited 1 June 2016];9(10): 1-14. Available from:

[20]     Dean, BB et al. Review: use of electronic medical records for health outcomes research: a literature review. Medical Care Research and Review. [Internet] 2009 December [cited 2 June 2016]; 66(6): 611-38. Available from:

[21]     Ioannidis J. Informed Consent, Big Data, and the Oxymoron of Research That Is Not Research. The American Journal of Bioethics. [Internet] 2013 March [cited 2 June 2016]; 13(4): 40-2. Available from: DOI: 10.1080/15265161.2013.768864

[22]     Scruggs S, Watson K, Su A, Hermjakob H, Yates J, Lindsey M et al. Harnessing the heart of big data. Circulation Research. [Internet] 2015 March [cited 2 June 2016];116(7):1115-9. Available from:

[23]     Sessler D. Big Data – and its contributions to peri-operative medicine. Anaesthesia. [Internet] 2013 December [cited 2 June 2016];69(2):100-5. Available from: