

Why democratise bioinformatics?
  1. Gabriella Captur1,2,
  2. Rodney H Stables3,
  3. Dennis Kehoe4,
  4. John Deanfield5,6,
  5. James C Moon2,7,8
  1. 1UCL Biological Mass Spectrometry Laboratory, Institute of Child Health and Great Ormond Street Hospital, London, UK
  2. 2NIHR University College London Hospitals Biomedical Research Centre, London, UK
  3. 3Liverpool Heart and Chest Hospital, Liverpool, UK
  4. 4Aimes Grid Service Providers Ltd, Fairfield, Liverpool, UK
  5. 5Farr Institute of Health Informatics Research at London, London, UK
  6. 6National Institute of Cardiovascular Outcomes Research, University College London, London, UK
  7. 7UCL Institute of Cardiovascular Science, University College London, London, UK
  8. 8The Cardiovascular Magnetic Resonance Imaging Unit, Barts Heart Centre, St Bartholomew's Hospital, London, UK
  1. Correspondence to Professor James C Moon, UCL Institute of Cardiovascular Science, University College London, London, UK, WC1E 6BT; j.moon{at}



Within clinical research institutions across the UK currently, only a small proportion of generated data is effectively being captured and safely stored long term; research efforts are fragmented and the challenges of multicentre collaboration are not yet overcome. A shared national initiative of accessible and secure bioinformatics solutions tailored to the needs of junior and senior clinical academics has the potential to address this unmet need and cardiovascular research provides a clear example.

Cardiovascular disease is a leading public health problem and the number one killer in the UK, accounting for 40% of all national deaths and costing the UK economy £29 billion a year in healthcare expenditure and lost productivity. The UK spends more of its healthcare budget on cardiovascular disease and research than any other EU economy.1,2 Over the past 20 years, there has been an explosive growth in cardiovascular investigations, imaging and therapies across the National Health Service (NHS), underpinning clinical care but also the >£117 million annual research investment3 that creates expensive clinical cohorts.4 There is a pressing need to merge and curate (for at least 10 years) not only the large well-organised big cardiac science data sets5–8 but also the richly diverse and heterogeneous smaller cohort data sets produced by small groups and individual cardiologists, the so-called long-tail data9 (figures 1 and 2). This is the large proportion of scientific data that falls into the long tail of the distribution curve,11 a product of the numerous small independent research efforts yielding a rich variety of specialty cardiac research data sets. The extreme right portion of the long tail includes unpublished dark data: siloed databases locked up in applications, null findings, laboratory notes, log archives, untagged image files, animal care records, etc.9 Dark data in cardiology can be illuminating but it is often inaccessible to the outside world. The merging of such myriad data sets and the eradication of data silos, plus linkage with outcomes, could be greatly facilitated through the provision of a national set of standardised data collection instruments: a shared cardioinformatics library of tools designed by and for clinical academics active in the long tail of cardiovascular research. Such bioinformatics set-up costs are high, usually placing them beyond a single centre's capabilities, which is why a national cross-centre initiative is required.

Figure 1

The spectrum of research in cardiology. Most cardiology research projects have between 20 and 1000 participants and typically also represent the middle of the translational pathway (red discontinuous box). The smallest studies may not necessarily need bioinformatics; the largest have funding already but are outnumbered 34:1 by the smaller studies. Creating cohort studies is expensive. Little bioinformatics exists to support them. Plot (2015) summarises study sizes in 300 consecutive cardiac trials registered with

Figure 2

Cardiac-bespoke bioinformatics platform. Searching the Bioinformatics Links Directory, >3000 biomedical data archival platforms can be found. This plot shows the number of bioinformatics web servers by domain: absolute levels and growth over time. Using search terms for the cardiac domain (‘cardiology’, ‘cardiac’ and ‘cardiovascular’), it transpires that there are no servers dedicated to cardiac research between 2006 and 2015 (plot adapted from Cummings and Temple10).

Large national initiatives aggregating registry data, like that led by the National Institute for Cardiovascular Outcomes Research (NICOR),12 are testament to the fact that linkage of national cardiovascular databases is feasible and has the potential for increased international comparative data analysis. However, NICOR infrastructure is not tailored to serve the needs of individual researchers aiming to conduct small-to-medium scale cardiovascular research in the cloud. The doctoral student with a sample size of 100 curating a 3-year project with finite funding needs accessible bioinformatics tools that he/she can customise and control. Electronic bioinformatics tools for these groups are usually limited to those provided locally by universities, but such institutional databases are not easily accessible to collaborators in other centres. These investigators (sometimes junior staff) need access to secure but intuitive electronic data collection solutions that they can customise to the needs of their niche project. They need a simple but hierarchical way of controlling access, freedom to edit instruments and ease of data export to permit local statistical analysis.
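The requirement for simple but hierarchical access control can be illustrated with a minimal sketch. The role names, action names and ordering below are hypothetical and chosen only for illustration; they are not taken from NICOR or from any particular EDC system.

```python
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical project roles, ordered from least to most privileged."""
    VIEWER = 1        # may read records
    DATA_ENTRY = 2    # may also add or edit records
    ANALYST = 3       # may also export data for local statistical analysis
    PROJECT_LEAD = 4  # may also edit data collection instruments


# Minimum role required for each action (illustrative names only).
REQUIRED_ROLE = {
    "view_records": Role.VIEWER,
    "enter_data": Role.DATA_ENTRY,
    "export_data": Role.ANALYST,
    "edit_instrument": Role.PROJECT_LEAD,
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if `role` is at or above the level the action requires."""
    return role >= REQUIRED_ROLE[action]
```

Because each role inherits every permission of the roles below it, a principal investigator would only need to assign one level per collaborator rather than manage individual permissions.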

Advantages of shared infrastructures for research in the long tail

Web-based data collection instruments have the capacity to improve the efficiency of the UK's appropriately high levels of investment into cardiovascular research. A national initiative, as opposed to segregated single-centre university-based infrastructures, automatically creates dissemination standards not through imposition, but because tools will be genuinely good, easy to use, accessible and practically helpful.

Cardiac research in the UK requires and receives high levels of funding to create expensive patient cohorts, but these cohorts are typically non-standardised, partitioned to reflect the group's niche expertise and data are rarely curated long term nor integrated with outcomes. From concept to guideline and then through to clinical practice takes many steps. Disseminated cloud-based bioinformatics broadens the range of translation that any individual research group can singly perform, facilitating the transition of ideas along the translational pathway (eg, from single-centre cohort, to multicentre, to outcome-studies, to standardisation, to guidelines). Standardised data collection, growing sample size, linking to other domains of science and then trickling results between groups suddenly become easy and information governance (IG) strengthened. Infrastructure reuse becomes possible and new areas of research are spawned through linkages, previously unachievable, leading to diffused benefits.

The release of a browser-based, flexible and secure electronic data capture (EDC) infrastructure automatically encourages research groups to share (data, instruments and dictionaries). Expensive multicentre UK cohort data sets may be securely accessed from any part of the country and robustly deidentified, standardised, curated and merged with other sorts of data for maximum scientific yield. With this infrastructure of ‘connectedness’, collaboration is suddenly easier providing a sustainable route to creating large-scale cardiac data13 and increasing the yields from UK research investment by accelerating the transition of a scientific idea into a new biomarker, clinical test or patient therapy. From scientific concept to societal benefit is a multistep process and network bioinformatics are needed along the pathway to impacts—academics may conceive ideas but teams, small-medium enterprises and pharmaceutical companies need to input and connect as ideas evolve.
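As a rough sketch of what deidentifying and merging multicentre cohort data could involve, the example below drops direct identifiers, replaces the patient identifier with a keyed hash and merges records from two centres on that pseudonym. The field names and the choice of HMAC-SHA256 are assumptions made for illustration, not a description of the pilot platform, and keyed pseudonymisation alone does not amount to robust deidentification.

```python
import hashlib
import hmac

# Assumed names of direct-identifier fields; a real project would define its own.
IDENTIFIER_FIELDS = {"name", "date_of_birth", "postcode"}


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Replace a patient identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap the patient ID for a pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in IDENTIFIER_FIELDS and key != "patient_id"
    }
    cleaned["pseudo_id"] = pseudonymise(record["patient_id"], secret_key)
    return cleaned


def merge_cohorts(*cohorts: list, secret_key: bytes) -> dict:
    """Merge records from several centres into one dict keyed by pseudonym."""
    merged = {}
    for cohort in cohorts:
        for record in cohort:
            clean = deidentify(record, secret_key)
            merged.setdefault(clean["pseudo_id"], {}).update(clean)
    return merged
```

Using the same secret key at every centre is what allows records for one patient to be linked without the original identifier leaving the contributing site; how that key is held and shared is a governance question outside this sketch.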

For research exploring the development of novel cardiac biomarkers, data sharing in the long tail is key as it ensures research transparency, mitigates known biases in publication and increases data reuse by third parties.11 Effect sizes need to be measured in phase I/II drug studies, but real-world disease and real-world biomarker performance need to be measured for phase III studies, and this is where unexpected trial futility is often discovered (globally, the last 20 phase III trials in heart failure have been negative14 at a waste of billions). This may be a ‘regression to the truth’ as cardiac biomarkers exit the expert centres and real-world data handling begins. Understanding and anticipating the size of this ‘real-world effect’ is hard without access to multicentre, unselected pan-UK patient cohorts managed and curated using standardised bioinformatics tools at national level.

The time is right—important developments in UK health informatics

The promotion of bioinformatics assets to support long-tail research in the UK coincides with the NHS's growing appetite for information technology (IT) innovation and its growing focus on the procurement of smarter health informatics strategies. Several trusts are currently undergoing major transformational change and investing in ‘Health Clouds’ as funding constraints drive health services to seek more efficient paperless reconfigurations, reduce complexity, improve data security and drive up the quality of patient services. National bodies like the Commissioning Support Units and the Health and Social Care Information Centre have been established to support this process. Healthcare clouds permit efficient electronic health information exchange, allowing providers to rapidly and securely access and share a patient's medical information electronically, but this process is dependent on data standardisation.15 Once standardised, the data transferred can seamlessly integrate into a recipient's Electronic Health Record (EHR). The Open EHR vision for UK healthcare aims to create life-long interoperable patient EHRs, a keystone component of which is semantic interoperability16 made possible through CEN/ISO EN13606, a European norm for semantic interoperability in EHR communication, approved by the European Committee for Standardization (CEN) and by the International Organization for Standardization (ISO).17 Open EHR and NHS health cloud technologies have major research ramifications: long-tail research instruments will be able to piggy-back onto this broader, evolving national infrastructure, permitting flexible spin up of resources as and when needed (‘power-by-the-hour’) and self-provisioning (studies can be containerised and rapidly deployed and reused).

Potential caveats of sharing in the long tail

Merging myriad data sets potentially introduces the risk of reanalysis of poor quality data sets or analysis of excellent data sets by non-experts using inappropriate applications, thus flooding the field with conflicting results.18 There is also the financial cost and time investment involved in preparing data and data collection instruments to permit their use by others, but shared standardised tools once developed will avoid this issue and deliver superior research network intensity. Cardiac researchers will dedicate enormous time preparing papers for publication driven by citation and H-index incentives, to the satisfaction of funders and to ensure survival of their teams and centres, but the career yields from large-scale data sharing (especially of dark data) are not that explicit.19 Investigators are at the mercy of the work ethic and replication etiquette20 of analysing third parties, coauthorship on downstream publications may be sporadic and there is commonly a sense of loss of control. Furthermore, data sharing could expose data errors or suboptimal reporting practices in high-impact studies many years after their original publication. If clinical guidelines had incorporated such data as evidence for patient care, the implications could be devastating.21

Example solution for shared cardioinformatics and future directions

In a pan-UK effort to tackle the barrier to multicentre data integration in the long tail, starting with cardiology, our group has previously partnered with IT architects (not-for-profit organisation AIMES Grid Service Providers) to deploy a cloud hypervisor pilot that provides easy-to-access bioinformatics tools for UK academics in the cardiovascular sciences. The primary EDC instrument used was REDCap (Research Electronic Data Capture,22 distributed non-commercially by Vanderbilt University for academia), a simple, proprietary, user-friendly, no-cost, browser-based, metadata-driven system for data collection and management available to academics. It has several obvious advantages over competing infrastructures like standard office applications (Microsoft Excel and Access) or other EDC systems like the open-source OpenClinica. OpenClinica is more difficult to learn than REDCap, with fewer online training module pages and no international consortium to turn to for support; it also has no project development mode, so undoing or replacing fields during set-up is cumbersome. REDCap permits advanced customisation through the use of hooks, hacks, application programming interfaces (APIs) and plugins, offering a flexible way of adding microfeatures and widgets to research projects with specific user-driven requirements (eg, our in-house 17-segment cardiac bulls-eye plot for efficient regional wall motion scoring). Another benefit is the ability to combine REDCap directly with R26 through APIs, which can easily export and import data into R, reducing the burden of data transformation and the possibility of human errors while streamlining the entire process of data collection, cleaning and analysis.23
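To make the API-based export concrete, here is a minimal sketch of pulling records from a REDCap project straight into an analysis script. Python stands in here for the R workflow described above. The request fields (`token`, `content`, `format`, `type`, `fields[i]`) follow the commonly documented REDCap record-export call but should be checked against the API documentation of the local installation; the API URL, token and field names such as `lvef` are placeholders, not details of the pilot.

```python
import json
import urllib.parse
import urllib.request


def build_export_payload(token: str, fields: list) -> dict:
    """Build the form fields for a flat JSON record export."""
    payload = {"token": token, "content": "record", "format": "json", "type": "flat"}
    for index, name in enumerate(fields):
        payload[f"fields[{index}]"] = name
    return payload


def export_records(api_url: str, token: str, fields: list) -> list:
    """POST the export request and decode the JSON reply (needs network access)."""
    data = urllib.parse.urlencode(build_export_payload(token, fields)).encode("ascii")
    request = urllib.request.Request(api_url, data=data)  # data present, so POST
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def mean_of_field(records: list, field: str) -> float:
    """Average a numeric field, skipping blanks (exported values arrive as strings)."""
    values = [float(r[field]) for r in records if r.get(field) not in (None, "")]
    if not values:
        raise ValueError(f"no numeric values found for field {field!r}")
    return sum(values) / len(values)
```

A call such as `export_records(api_url, token, ["record_id", "lvef"])` followed by `mean_of_field(records, "lvef")` would then replace a manual download, spreadsheet clean-up and re-import, which is where transcription errors tend to creep in.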

This UK model, committed to the usual STAndards for the Reporting of Diagnostic accuracy studies (STARD),24 has been designed around the baseline skillset, aptitudes and needs of the everyday principal investigator and his/her junior/senior research team. Its uptake has been exponential due to its ease of customisation and set-up efficiency, particularly for junior academics starting a new project. It currently supports 250 UK researchers, with a total of 105 projects actively recruiting. It is provisioned in a safe-haven environment consisting of a G-Cloud-assured high-availability cluster with a disaster recovery element existing in a separate G-Cloud set-up. Both are located within a highly secured data centre information security management system (ISMS) managed by AIMES and in compliance with the standards of the ISO/International Electrotechnical Commission and with the Health Insurance Portability and Accountability Act. AIMES is an accredited N3 cloud provider on the UK Government's G-Cloud procurement framework. The platform is designed to accept the upload of data free of personally identifiable information deriving from research projects that already have the necessary ethical approvals in place. Backup power, backup servers and data restore facilities provisioned are fully compliant with NHS IG requirements. In handling research data, the platform is aligned with good clinical practice and the UK Data Protection Act (1998). Investigators applying for grants and research ethics approval, interested in using this infrastructure, are provided with boilerplate verbiage outlining its data security features and with access to online and face-to-face EDC training sessions. Investigators using the platform retain responsibility for obtaining appropriate consent from participants and they are asked to verify this in the mandatory User Responsibility Document when registering for access to the system.

In collaboration with professional ISMS managers at AIMES experienced in patient data security, ISO 27001 standards and the UK NHS IG ISMS, we have established processes to ensure full respect for ethics and research governance across the pilot, relevant to participants and participating researchers. The pilot is currently awaiting registration as a database platform with the UK Research Ethics Committee.

The vision for this pilot has grown out of extensive local experience, particularly in cardiac imaging—a field that is facing barriers to the clinical delivery of biomarkers because doing this properly requires multicentre collaboration and integration with other data types. The ongoing Open EHR developments coupled with advancements in Hadoop, other Apache open-source projects and cloud computing25 offer huge opportunities to the research community—EDC research tools such as REDCap can be integrated and long-tail cardiac data sets mined from within the EHRs using big data tools rather than simply limiting the research model to data collection by individual groups. It will become possible to capture niche cohort data from out of the larger routine clinical record, but further development of electronic patient record systems in the NHS is required—the ultimate research objective is to permit flawless mining of the entire EHR in a national, secure real-time web solution that also offers complete universal follow-up of outcomes (linked to hospitalisation and death records, etc) by electronic surveillance.


Biomedical research costs will spiral in the UK if individual centres continue to build their own individual bioinformatics clouds instead of sharing these in a national resource, ideally with funding by the NHS. Expensive research eventually translates into increased tariffs for new therapies—reflecting a lack of understanding of basic biology, or at least the transition of that understanding into clinical practice.

We are convinced that across the cardiovascular research domain, like the rest of medicine, the national aggregation of diverse long-tail data is the best way to convert numerous small but expensive cohort data sources into big data for improved knowledge. This practical and structured integration is achievable through a sustainable, common platform of network bioinformatics, breaking down translational barriers, improving research efficiency and with time, patient outcomes.


The authors are indebted to the staff of AIMES Grid Service Providers for the establishment of the UK pilot for cardioinformatics.




  • Contributors GC, JCM and DK planned the pilot infrastructure and wrote the manuscript. RHS and JD provided expert support and review of the manuscript. JCM is responsible for the overall content.

  • Funding GC is supported by the National Institute for Health Research Rare Diseases Translational Research Collaboration (NIHR RD-TRC) and by the NIHR University College London Hospitals Biomedical Research Centre. JCM is directly and indirectly supported by the University College London Hospitals NIHR Biomedical Research Centre and Biomedical Research Unit at Barts Hospital, respectively. This cloud-based EDC pilot has been funded by Barts Charity Grants #MGU0305 and #1107/2356/MRC0140 to GC and JCM.

  • Disclaimer The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

  • Competing interests DK is the Chief Executive Officer of the AIMES Grid Service Providers, a commercial data centre service provider based in the North West of England.

  • Provenance and peer review Not commissioned; externally peer reviewed.
