The emergence of AI technologies for analysing and generating text has staff wondering what effect it will have on their jobs and how this technology might be assessed for its usefulness, accuracy and dangers.
The manner in which AI-NLP should be assessed has not yet emerged as a conversation between the various providers and their clients, and a set of guidelines might be useful to initiate those conversations.
The essence of AI-NLP is to identify semantic entities of interest in a target report. The technology used for this task is more broadly known as “machine learning” (ML). The ML process requires selecting an ML algorithm suited to the type of data to be analysed, in our case pathology reports, and to the relevant values that need to be identified in each report for a given task.
The algorithm is trained with a set of reports (a corpus) and their respective values – this is the training corpus/set – and it produces a language model, that is, a model of the language used in pathology reports. The trained algorithm performs classification by being fed an unclassified report, finding the training report that most closely matches it, and adopting that matching report's values for the unclassified report. As simple as the process sounds, there are many issues that affect the quality of the results and therefore the acceptability of a particular ML implementation.
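To make the process concrete, here is a minimal sketch of that train-then-classify workflow, assuming scikit-learn and using a few invented report snippets with illustrative topography codes (not real registry data or a real registry coding scheme):

```python
# A minimal sketch of the train-then-classify workflow described above, using
# scikit-learn with invented example reports and hypothetical labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Training corpus: pathology report texts paired with their coded values.
train_reports = [
    "Invasive ductal carcinoma identified in left breast core biopsy.",
    "Adenocarcinoma of the sigmoid colon, moderately differentiated.",
    "Malignant melanoma, superficial spreading type, right forearm.",
]
train_labels = ["C50", "C18", "C44"]  # illustrative site codes only

# Vectorise the text and fit a nearest-neighbour classifier: an unclassified
# report adopts the label of the most similar training report.
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(train_reports, train_labels)

print(model.predict(["Core biopsy of breast showing invasive ductal carcinoma."]))
```

A production classifier would differ in almost every detail, but the shape of the workflow – vectorise, train on labelled reports, predict values for new reports – is the one described above.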
The major issues to consider are:
The characterisation of the modelling task.
The pre-processing algorithms applied to the training reports for ingestion into the algorithm.
The variables investigated in selecting the training and test corpora.
The source of the training corpus used to create the model.
The variables selected to assess the accuracy of the model.
The test corpora selected to represent the variables.
The characteristics of the variables used in the assessment of accuracy.
The methods for improving the model for particular clients.
The methods available for updating the model for changes in standards.
These issues might be resolved in different ways by different registries, and it is an open question how well a model trained for one jurisdiction might transfer to another. If disease epidemiology varies across jurisdictions, then the differences between models might well be more important than currently considered. Nevertheless, the resolution of these issues has a material effect on the scope, accuracy and relevance of a particular classifier for a given task.
This approach was used for building a case identification classifier for the California Cancer Registry and is applicable to all AI-NLP developments.
We have all heard the great excitement generated by the release of the ChatGPT generative AI platform.
But is it as exceptional as the pundits laud?
What might it be used for?
What are its limitations?
What do we need to understand about it to evaluate this sort of language processing technology?
There are two major approaches to computer driven language processing that have developed over the past 70 years.
The first approach was developed with linguistic knowledge and focused on building algorithms that computed the features of language that linguists had defined over 300 years of investigation. We know these as the lexicon, grammar and semantics of language. This approach has been called computational linguistics (CL) and natural language processing (NLP). In the 1990s it was complemented with machine learning approaches that could improve the recognition of target content, and became known as statistical natural language processing. The target of this approach was to show understanding of text, potentially at the level of human beings – this is yet to be achieved in anything but specialised settings. However, in the background of the major thrust of CL was a small group of scholars working on text generation as part of computational language translation. These scholars initially worked on mapping lexica and grammar structures between languages. However, they discovered that machine learning algorithms trained on matched translations to predict the “next word” in the construction of a sentence could give them serious productivity and accuracy gains in machine translation. This work has now culminated in many effective automatic translation engines, with the Google engine being the most prominent. A second collection of scholars worked on text summarisation, focusing on interpreting the key aspects of a text and regurgitating it in a briefer form.
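As a toy illustration of what “predicting the next word” means, a bigram model simply counts which word most often follows each word in its training text. The corpus below is invented and far too small to be useful, but the mechanism is the one being described:

```python
# A toy illustration of "next word" prediction: a bigram model counts which
# word most often follows each word in a tiny, invented training corpus.
from collections import Counter, defaultdict

corpus = "the tumour was excised and the tumour margins were clear".split()

follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    # Return the most frequently observed follower, if any was seen in training.
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))      # -> "tumour"
print(predict_next("margins"))  # -> "were"
```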
The second approach arrived in the late 20th century with the rise of search engines, best manifested by Google. This approach treated text as a bag of words and the search problem as a string-matching task, and discarded the need for syntactic and semantic parsing to process language. This was commonly labelled string processing and eventually became known as Text Mining. In the 2000s, this approach was turned on its head with the addition of large-scale neural net machine learners used to create a more extensive characterisation of a word and its usage contexts in a body of text.
In a quirk of fate this turn actually brought the field back to linguistics, capturing the spirit of corpus linguistics, whose method is expressed in the famous saying of the 20th-century linguist J.R. Firth: “you shall know a word by the company it keeps”.
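Firth's dictum can be sketched very simply: represent each word by counts of the words that appear within a small window around it, so that words used in similar company end up with similar vectors. The sentence and window size below are invented purely for illustration:

```python
# A minimal sketch of "knowing a word by the company it keeps": each word is
# represented by counts of the words appearing within a small window around it.
from collections import Counter, defaultdict

tokens = "the biopsy showed invasive carcinoma and the margins showed no carcinoma".split()
window = 2

context_vectors = defaultdict(Counter)
for i, word in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            context_vectors[word][tokens[j]] += 1

# Words used in similar contexts end up with similar context vectors.
print(context_vectors["carcinoma"])
```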
This approach has now grown into what is known as Deep Learning, intended to represent the rich contexts of words in a neural net of such complexity (read: depth) that its operational characteristics are now unfathomable.
ChatGPT is the inheritor of most of these technologies. Crucially, however, it has stepped up its “learnings” by mining the breadth of the Internet for language examples, giving it an immeasurable scale of texts to learn from across a broad scope of natural languages. The huge scale of this mining means the learning process is so enormous that it costs more than $100 million to create a language model. These are now called large language models (LLMs) and they represent the positive and negative contexts of all (or at least most) words found in their training materials. We don't know the word-pruning strategy used by the algorithm, so we can't determine which words will be poorly represented by the model.
If we wind the clock back to the early tasks defined by scholars when natural language processing began, we can build more effective strategies for assessing ChatGPT and other competitor LLMs.
Putting aside the algorithmic work on part-of-speech tagging and parsing, the key semantic tasks were named entity recognition and relationship recognition.
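For readers unfamiliar with named entity recognition, the snippet below shows the task using spaCy's general-purpose English model (assuming en_core_web_sm has been downloaded). It is only an illustration of what entity recognition output looks like, not a claim about how ChatGPT performs it internally:

```python
# A small named entity recognition example using spaCy's general-purpose
# English model (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jon Patrick gave a seminar at the University of Sydney in 2011.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # entity spans with labels such as PERSON, ORG, DATE
```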
The question is how ChatGPT performs on these tasks.
I asked ChatGPT to generate some sample pathology reports for various cancers and the results appeared presentable. I even asked it to provide a report in an HL7 format and was impressed by the detail of the report, although the cancer content was thin. It left me puzzled as to how it could have learnt such formatting when no learning materials should be available, as all real-world reports are held confidentially by health organisations. Later a colleague at the CDC pointed out to me that there were examples in some of the cancer registry advisory manuals that it could have learnt from, maybe.
At that point I decided I needed to test it on content I knew the truth of, so I asked it what it knew about me. It gave a respectable answer based on my professional profile as available across the Internet. It identified that I worked at Sydney University, which was the first clue to a problem: it is 12 years since I was employed by Sydney University, so does ChatGPT have a problem getting time relationships correct?
I next asked it what it knew about my work in Sydney (Australia) and it identified that I worked at three universities in Sydney. I mused that it might well have found my name on lists of attendees or speakers at each of these universities and thereby constructed a relationship that was incorrect.
So I went on a test run to establish what it would relate me to. I tried a query on my work in France and it found I worked at a university (untrue) at which I had in fact given a seminar.
Next I asked it about my work in Mexico and it found that I had worked at a particular institute in Mexico. In fact I have NEVER visited Mexico in any capacity.
My conclusions from this investigation are as follows (some, of course, may be incorrect):
ChatGPT can identify entities with reasonable reliability;
ChatGPT has a serious problem with establishing the correct relationships between entities;
ChatGPT assumes there is a truth relationship between the entities in a query, and possibly across a sequence of queries;
ChatGPT has a semantic hierarchy of entities that it uses to create relationships that can be meaningless.
When you draw these characteristics together, ChatGPT is problematic because the truth of any statement it produces is potentially unreliable.
WARNING 1: Don’t believe anything that is produced by LLMs. Only believe what you know to be true from your own experience.
WARNING 2: CHAT GPT IS POTENTIALLY VERY DANGEROUS SO IT IS IMPERATIVE THAT LEGISLATION REQUIRES ANY DOCUMENT IT GENERATES TO HAVE AN IMMUTABLE WATERMARK SO WE CAN KNOW HOW IT WAS CREATED.
The AMIA 2022 AI Showcase has been devised as a 3-stage submission process where participants are assessed at Submissions 1 and 2 for their progression to the 2nd and 3rd presentations. The stages coincide with the three AMIA conferences:
Informatics Summit, March 21-24, 2022; Clinical Informatics Conference (CIC), May 24-26, 2022; and the Annual Symposium, November 5-9, 2022.
Submission requirements for each stage are:
Stage 1. System description, results from a study of algorithm performance, and an outline of the methods of the full evaluation plan
Stage 2. Submission to address usability and workflow aspects of the AI system, including prototype usability or satisfaction, algorithm explainability, implementation lessons, and/or system use in context.
Stage 3. Submissions to summarize the comprehensive evaluation to include research from prior submissions with new research results that measure the impact of the tool.
Now that we have the abstracts for the Stage 1 conference, we can analyse the extent to which the community of practice could satisfy those criteria. We can also identify where the abstracts fall short of the submission requirements, and so aid the authors in closing the gap between their submission and the putative requirements and in ensuring their poster or presentation fills some of the gaps where they have the information.
There is also the open question of whether the “call to arms” for the Showcase was effective and how it might be improved in the future.
A succinct definition of the desirable contents of an abstract might well be:
What was done?
Why was it done?
What are the results?
What is the practical significance?
What is the theoretical significance?
The Clinical AI Showcase has now provided a set of abstracts for the presentations and posters, so how might we rate the informativeness of the abstracts against these criteria?
A review of the 22 abstracts has been conducted and a summary based on these five content attributes (with 4 and 5 merged) is provided in Table 2. The summaries have been designated by their key themes and collated into Table 1.
Using the publicised criteria for each of the three Stages of the AI Showcase, it would appear that only 9 of the submissions are conformant, i.e. the papers categorised as Extended ML model, Development of ML model and Developed and Deployed. It is an open question whether the abstracts classed as Comparison of ML models fulfil the Stage 1 criteria. The authors would do well to clarify the coverage of their work in their full presentation/poster.
The other categories appear to exist in a certain type of limbo. The Methods study only group appears out of place given the objectives of the AI Showcase. The Design only group would appear to be stepping out somewhat prematurely, although the broader advertising for the Showcase certainly encouraged early-stage entries. As mentioned in my previous blog (blog reference), it would be exceedingly difficult for teams to meet the purported content deadlines for Stages 1 and 2 if the ideas for the project are in an embryonic design stage.
Teams with nearly or fully completed projects were able to submit works that also fulfilled many of the criteria for Stage 2 of the Showcase. The Developed and Deployed group showed progression in projects that had reached deployment but, with the exception of one paper that claimed their solution was installed at the bedside, none reported usability or workflow aspects.
Two abstracts did not describe clinical applications of their ML but rather secondary use and these papers were doing NLP.
Good Abstract Writing
Most abstracts provided reasonable descriptions of the work they had done or intended to do. It was rare for abstracts to describe their results or the significance of their work; this undoubtedly can be corrected in Stages 2 or 3 of the Showcase, where they are required to report on their assessment of their tool's practical use. Only one paper provided information on all four desirable abstract content items.
What can the Showcase Learn and do better
This Showcase has the admirable objective of encouraging researchers and clinical teams to perform AI projects to a better quality and in a more conclusive manner. However, its Stages cover a cornucopia of objectives set out in a timeline that is unrealistic for projects just starting and poorly co-ordinated for projects near to or at completion. This is well evidenced by the 40+ ML projects included in the Conference programme that are not part of the Showcase. If the Showcase is to continue, as it should, then a more considered approach to staged objectives, encouragement of appropriate teams, and more thoughtful timing would be a great spur to its future success.
Might I Humbly Suggest (MIHS) that a more refined definition of the stages be spelled out so that
a. groups just starting ML projects are provided with more systematic guidelines and milestones, and
b. groups in the middle of projects can ensure that they have planned for an appropriate level of completeness to their work.
Stage 1. What is the intended deliverable and Why it is necessary – Which clinical community has agreed to deployment and assessment.
Stage 2. What was done in the development – How the deliverable was created and what bench top verifications were conducted.
Stage 3. Deployment and Clinical Assessment – What were the issues in deployment. What were the methods and results of the clinical assessment. What arrangements have been made for the maintenance and improvement of the deliverable.
This definition excludes groups performing ML projects purely for their own investigative interest but without a specific participating clinical community. The place for their work is within the general programme of AMIA conferences. It also means that strictly speaking only 3 of the current acceptances would fit this definition for Stage 1, although 3 of the others could be contracted to fit this definition.
A concerning factor in the current timeline design is the short 2-month span between the deliverables for Stages 1 and 2. A group would pretty much have to have completed Stage 2 to submit to Stage 1 and be ready to submit to Stage 2 in 2 months.
Lastly, the cost of attending 3 AMIA conferences in one year would be excessively taxing, especially for younger scholars in the field. AMIA should provide a two-thirds discount for each conference to those team representatives who commit to entering the Showcase. This would be a great encouragement to get more teams involved.
Paper Topic                                           N
Design only                                           3
Comparison of ML models                               5
Extended ML model                                     1
Development of ML model                               5
Developed and Deployed – No operational assessment    3
Methods study only                                    1
Abstract Unavailable                                  2
TOTAL                                                20
Table 1. Category of key themes on the 22 Abstracts accepted into the AI Showcase.
Paper: Overgaard – CDS Tool for asthma – Presentation
What was done: Design of desirable features of AI solution.
Why was it done: To make review of the patient record more efficient.
What are their results: Unspecified.
What is the significance: Unknown – no clinical deployment.
Comments: Paper is about the putative design for a risk assessment tool and data extraction from the EHR.
Category: Design only

Paper: Rossi – 90-day Mortality Prediction Model – Presentation
What was done: Deployed 90-day mortality prediction model. **
Why was it done: To align patient preferences for advance care directives with therapeutic delivery, and to improve rates of hospice admission and length of stay.
What are their results: Unspecified.
What is the significance: Unknown – clinical deployment planned.
Comments: Model is partially implemented with operationally endorsed workflows.
Category: Development of ML model – Planned Deployment

Paper: Estiri – Unrecognised Bias in COVID Prediction Models – Presentation
What was done: Investigation of four COVID-19 prediction models for bias using an AI evaluation framework. **
Why was it done: AI algorithm biases could exacerbate health inequalities.
What are their results: Unspecified.
What is the significance: Unknown – no clinical deployment.
Comments: Two bias topics are defined: (a) whether the developed models show biases; (b) whether the bias has changed over time for patients not used in the development.
Category: Comparison of ML models

Paper: Liu – Explainable ML to predict ICU Delirium – Presentation
What was done: A range of features were identified and a variety of MLs evaluated. Three prediction models were evaluated at 6, 12 and 24 hours. ****
Why was it done: To more accurately predict the onset of delirium.
What are their results: Described but numerics not provided.
What is the significance: Implied due to described implementation design.
Comments: Paper describes some aspect of all characteristics but not always completely.
Category: Comparison of ML models

Paper: Patel – Explainable ML to predict periodontal disease – Presentation
What was done: "New" variables added to existing models revealed new associations.
Why was it done: Discover new information about risk factors.
What are their results: Described but associations not provided.
What is the significance: Unknown. No clinical assessment of the discovered associations, no clinical deployment.
Comments: AI methods not described.
Category: Extended ML model

Paper: Liang – Wait time prediction for ER/ED – Virtual
What was done: Develop ML classifier (TensorFlow) to predict ED/ER wait times. Training and test sets described.
Why was it done: No explanation.
What are their results: Unspecified.
What is the significance: Unknown – clinical deployment status unclear.
Comments: Focus is on the ML processes with little other information.
Category: Development of ML model

Paper: Patrick – Deep Understanding for coding pathology reports – Virtual
What was done: Built a system to identify cancer pathology reports and code them for 5 data items (Site, Histology, Grade, Behaviour, Laterality).
Why was it done: The California Cancer Registry requested an automated NLP pipeline to improve production line efficiencies.
What are their results: Various accuracies provided.
What is the significance: Improvements over manual case identification and coding provided.
Comments: The work of this blog author.
Category: Developed and Deployed – no operational assessment – Not clinical application

Paper: Eftekhari/Carlin – ML sepsis prediction for hematopoietic cell transplant recipients – Poster
What was done: Deployed an early warning system that used EMR data for HCT patients. Pipeline of processing extending to clinical workflows.
Why was it done: Sepsis in HCT patients has a different manifestation to sepsis in other settings.
What are their results: The only specified result is deployment.
What is the significance: Unknown – no clinical assessment.
Comments: Deployment is described showing its complexity. No evaluations.
Category: Developed and Deployed – no operational assessment

Paper: Luo – Evaluation of Deep Phenotype concept recognitions on external EHR datasets – Poster
What was done: Recognises Human Phenotype Ontology concepts in biomedical texts.
Why was it done: No explanation.
What are their results: Unspecified.
What is the significance: Unknown – no clinical deployment.
Comments: Abstract is the least informative. One sentence only.
Category: Development of ML model – Not clinical application

Paper: Pillai – Quality Assurance for ML in Radiation Oncology – Poster
What was done: Five ML models were built and a voting system devised to decide if a radiation treatment plan was Difficult or Not Difficult. Feature extraction was provided.
Why was it done: To improve clinical staff's scrutiny of difficult plans to reduce errors downstream. Feature extraction to improve interpretability and transparency. ****
What are their results: Unspecified.
What is the significance: System planned to be integrated into clinical workflow.
Comments: Mostly about the ML approach but shows some forethought into downstream adoption.
Category: Comparison of ML models – Deployment planned

Paper: Chen – Validation of prediction of Age-related macular degeneration – Poster
What was done: ML model to predict progression to late AMD using 80K images from 3K patients.
Why was it done: To predict the risk of progression to vision-threatening late AMD in subsequent years.
What are their results: Unspecified.
What is the significance: Unknown – no clinical deployment.
Comments: Focus is on the ML processes with little other information.
Category: Development of ML model

Paper: Saleh – Comparison of predictive models for paediatric deterioration – Poster
What was done: Plan to develop and implement ML model to augment prediction of paediatric clinical deterioration within the clinical workflow.
Why was it done: Detecting deterioration in paediatric cases is effective at only 41% using existing tools.
What are their results: Unspecified – planning stage only.
What is the significance: System planned to be integrated into clinical workflow.
Comments: Early conceptualisation stage. Well-framed objective and attentive to clinical acceptability. No framing of datasets, variables and ML methods.
Category: Design only

Paper: Shah – ML for medical education simulation of chest radiography – Poster
Abstract not available.

Paper: Mathur – Translational aspects of AI – Poster
What was done: Evaluation of the TEHAI framework compared to other frameworks for AI, with emphasis on translational and ethical features of model development and its deployment.
Why was it done: A lack of standard training data and the clinical barriers to introducing AI into the workplace warrant the development of an AI evaluation framework.
What are their results: Qualitative assessment of 25 features by reviewers.
What is the significance: No in vitro evaluation – only qualitative assessment.
Comments: This is an attempt to improve the evaluation criteria we should be using on AI systems. It fails to make a convincing case that it is a better method than the alternatives.
Category: Methods study only

Paper: Yu – Evaluating Pediatric sepsis predictive model – Poster
Abstract not available.

Paper: Tsui – ML prediction for clinical deterioration and intervention – Poster
What was done: Built an ML for an intensive care warning system for deterioration events and pharmacy interventions. It uses bedside monitor and EHR data, providing results in real time.
Why was it done: No explanation.
What are their results: Unspecified.
What is the significance: Unknown – no clinical assessment.
Comments: Operational system. Only a description of the deliverables – no evaluations.
Category: Developed and Deployed – no operational assessment

Paper: Rasmy – Evaluation of DL model for COVID-19 outcomes – Poster
What was done: A DL algorithm developed to predict, for COVID-19 cases on admission: in-hospital mortality, need for mechanical ventilation, and long hospitalization.
Why was it done: No explanation.
What are their results: Unspecified – no numerics supplied.
What is the significance: Unknown – no clinical deployment.
Comments: Seems to concentrate solely on the DL modelling.
Category: Development of ML model

Paper: Wu – ML for predicting 30-day cancer readmissions – Poster
What was done: ML models built to identify 30-day unplanned readmissions for cancer patients.
Why was it done: Unplanned cancer readmissions have significantly poorer outcomes, so the aim is to reduce them.
What are their results: No results, but promised in the poster/presentation.
What is the significance: Unknown – no clinical assessment.
Comments: No ML details, just a justification, in the abstract.
Category: Comparison of ML models

Paper: Mao – DL model for Vancomycin monitoring – Poster
What was done: A DL pharmacokinetic model for Vancomycin was compared to a Bayesian model.
Why was it done: To provide a more accurate model of Vancomycin monitoring.
What are their results: The DL model performed better than the Bayesian model. No numerics provided.
What is the significance: Unknown – no clinical assessment.
Comments: Focus is on the ML processes with little other information.
Category: Comparison of ML models

Paper: Ramnarine – Policy for Mass Casualty Trauma triage – Poster
What was done: Design of a strategy to build an ML for ER/ED triage and retriage categorisation for mass casualty incidents.
Why was it done: To fill a void in national standards for triage and retriage.
What are their results: Unspecified – design proposal only.
What is the significance: Unknown – no clinical assessment of practicality or acceptance.
Comments: This is a proposal with no concrete strategy for implementation, nor for what would be used in the investigation in terms of either data sources or type of ML strategy.
Category: Design only
Table 2. Summary of the 22 abstracts according to the four content attributes ascribed to good abstracts.
The recent announcement by AMIA of the “2022 Artificial Intelligence Evaluation Showcase” is no doubt welcomed by research experimenters, but will it provide revelations on how to conduct better and more effective Clinical AI studies that produce truly valuable operational deployments? (See amia-2022-artificial-intelligence-evaluation-showcase/artificial-intelligence)
The Showcase is divided into 3 phases, with results from each phase to be presented at traditional conferences conducted by AMIA. Phase 1 involves presenting at the AMIA 2022 Informatics Conference, held in March, a “system description, results from a study of algorithm performance, and an outline of the methods of the full evaluation plan”. Phase 2 involves a presentation at the AMIA 2022 Clinical Informatics Conference in May “to address usability and workflow aspects of the AI system, including prototype usability or satisfaction, algorithm explainability, implementation lessons, and/or system use in context”. Phase 3 involves presenting a submission at the AMIA 2022 Annual Symposium in November “to summarize the comprehensive evaluation to include research from prior submissions with new research results that measure the impact of the tool”.
So, coalescing these three statements and drawing on the organisers' other comments, I would like to reframe their words into these prospective and admirable outcomes: a. improve the scale and scope of the evaluation of AI tools so that we get fewer limited and poor-quality AI publications; and b. encourage the development of multidisciplinary teams with a wider range of expertise that would improve the quality of AI evaluation.
However, what might be the unexpected outcomes of the Showcase? Certainly some researchers will gain publication for their work, which they will welcome and which is not without merit of itself, but if that is the only objective then why run a special Showcase? Why not use the normal mechanisms AMIA has available, such as running a special issue of JAMIA on AI/ML? Will the three-phase submission and presentation format provide something that normal publication channels don't? It's not self-evident from the published promotional materials for the Showcase. If the objective is to promote research into putative AI solutions for clinical data processing tasks, then it is nothing different from the publication avenues currently available. If the aim is to bring forward the adoption of AI technologies into working environments, then there are a number of unspoken obstacles not addressed by the Showcase call to arms.
So I ask the question: will the Showcase create motivations that drive us in the wrong direction for the improvement of productive Clinical AI solutions? That may seem unfair to the organisers, who no doubt are working hard for legitimate outcomes, but the world of AI development has a number of deficits that may be reinforced rather than diminished by this well-intentioned initiative.
You might wonder why I am so disbelieving that this honourable initiative will provide useful outputs that push the Clinical AI industry further in a positive way. You might ask why I feel compelled to make a submission to the Showcase yet deep down think the exercise will be futile and a waste of time. I am buoyed in my misgivings by the recent article in the MIT Technology Review with the byline “Hundreds of AI tools have been built to catch COVID. None of them helped.”, as well as by my previous review of the topic (see https://www.jon-patrick.com/2021/04/ai-assessment-in-clinical-applications/).
In this conversation we will restrict our interpretation of “AI” to “supervised machine learning” (ML), the most common form of AI technology in use for analysing clinical data, and we draw on our experience in Clinical Natural Language Processing (CNLP) to formulate our analysis. It will be up to others to decide how applicable this commentary is to their own ML contexts.
Here are some of my musings over the obstacles facing the Clinical AI industry that it would be helpful for the showcase to specifically address.
1. DOES THE DATA MATCH THE OBJECTIVES? Research projects exploring the use of ML techniques for clinical case data are a FAR CRY from building industrial-quality technology that clinicians find trustworthy enough to use. Research projects are conducted under a number of limitations which often are not clearly understood by the research teams. Typically, the data set used for training the ML models is flawed and inadequate in terms of the project objectives without obviously being so. The data can be flawed because it doesn't cover all the corner cases of the problem space; that is, the training sample is not properly representative of the problem space. This commonly occurs when the data has been provided opportunistically rather than selectively according to the project objectives. The data can also be flawed because the values are poorly expressed for ML purposes. For example, in one data set of GP notes the medicines field held the precise pharmacy descriptions, which caused all data values to be virtually unique across 60K records and therefore unsuitable as a classifying feature. The remediation required a physician to go through the records and create values of {normal, weaker than normal, stronger than normal} as a surrogate for the prescription details that were thought to be meaningful to the project objective. Providing a justification of each variable and its domain range used in a model would be a useful validation criterion.
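As an illustration of that kind of remediation, here is a crude sketch of collapsing free-text prescriptions into a coarse feature. In the project described, a physician assigned the values manually; the function, regex and dose thresholds below are entirely hypothetical:

```python
# A hedged sketch of collapsing near-unique prescription strings into a small
# set of values usable as an ML feature. Thresholds and categories are
# illustrative only, not a clinical rule.
import re

def dose_category(prescription: str, usual_dose_mg: float) -> str:
    """Map a free-text prescription to {weaker, normal, stronger} by stated dose."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*mg", prescription.lower())
    if not match:
        return "normal"  # fall back when no dose is stated
    dose = float(match.group(1))
    if dose < 0.75 * usual_dose_mg:
        return "weaker than normal"
    if dose > 1.25 * usual_dose_mg:
        return "stronger than normal"
    return "normal"

print(dose_category("Atorvastatin 80 mg nocte, 30 tablets", usual_dose_mg=40))
```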
2. IS THE ML ALGORITHM APPROPRIATE FOR THE TASK? Might the Showcase lead to a plethora of studies using the current popular fad of Deep Learning, which can be inappropriate in many health circumstances? Deep Learning is metaphorically a heavyweight technology that suits the needs of steel workers assembling a new skyscraper, whereas many clinical case studies need to be assembled with a watchmaker's toolkit of minute components put together with delicacy to achieve the highest accuracy required to effectively support clinical work. Deep Learning techniques have been reported as useful in some settings, especially imaging, and are said to have great power because they are trained on very large data sets, but this begs a number of questions, e.g.: How are models corrected when small but important errors are found? How are gold standard values established to be 99.99% pure gold (better than 24 karat at 99.9%) for such large data sets? How does Deep Learning incorporate the specific knowledge, rules and standards of professional practices when those practices vary from year to year, especially when the large training set only becomes available year(s) after the fact? How does it correctly identify the extremely rare events (like certain diseases) that are definitionally much like a common event? As a generalisation Deep Learning provides the least transparency of all ML algorithms, yet at the same time, as a counterpoint, there are researchers endeavouring to travel in the opposite direction and increase the explicability of AI applications. See the $1m prize awarded to Cynthia Rudin of Duke University for research into ML systems that people can understand (https://www.wsj.com/articles/duke-professor-recognized-for-bringing-more-clarity-to-ai-decision-making-11634229889). Assessing the ML algorithm for its appropriateness to the applied task would be a useful evaluation criterion.
3. BAD DATA AND POWERFUL ALGORITHMS JUST SET US BACK. Researchers can be pressed to use the data that is available on hand and so either tackle a problem of no great value or misinterpret the meaning, value and generalisability of their outcomes. This situation can lead to routine processes being used on data that is poor quality both in its definition and in its gold standard training classes, so that while results are produced their value is limited. It must be accepted that good researchers (especially young researchers) will learn from these misguided efforts and go on to do better work next time round, so the exercise has good educational and praxis value, but the interim impact can be of limited research value and so waste a great deal of time for all the external people in its assessment chain when presented for publication or deployment. Assessing the meaningfulness of the data in the context of the problem space would be a useful assessment criterion. To give them their due, the Showcase organisers might well have faith that this will be achieved in Phases 2 and 3 of their programme.
4. RESEARCH EFFORT DOES NOT EQUAL SATISFACTORY INDUSTRIAL PERFORMANCE. The requirements for producing industrial-quality technology are often beyond the competency and experience of research teams. This can be even more true for research teams embedded in large corporations who treat their techniques as the only way to resolve the task and, like a hammer, treat everything in the world as a nail (see https://www.jon-patrick.com/2019/02/deficiencies-in-clinical-natural-language-processing-software-a-review-of-5-systems/ for an example). A working solution that is costly for staff to integrate into existing workflows requires considerable planning for adoption, but even more ingenuity and experience to provide the best software engineering techniques. The supply of the source data for the working solution has to be secured and monitored on a daily basis once the operational system is in place. The continuous storage of incoming data, its efficient application to the ML algorithms and the delivery of outputs to the points of usage are all complex engineering and organisational matters, which researchers are commonly insensitive to if not entirely inexperienced with. The software engineering of complex workable solutions is just as important to successful industrial-quality solutions as the ML algorithms, data sampling and model optimisation, but is invariably ignored in research publications and by the researchers themselves. The cry so often is “It is all about the data”, which is so far from the truth for real solutions. Assessing the software engineering development and maintenance requirements of a proposed AI solution would be a useful evaluation criterion.
5. WILL A CLINICIAN CHANGE THEIR EXPERT OPINION? The produced systems are rarely tested on the real criterion of success: will a clinician actually use this technology to correct their own expert opinion? Just asking them if they approve of the solution is not sufficient. ML projects are normally tested for their accuracy, where the most common test used, 10-fold cross-validation (10CV), is probably the weakest test that could be applied. In our work we ignore it as a test, as it provides little information that could be the basis of action to improve processing. Even experiments that use a held-out set are little better. The best computational test is validation against a post-implementation test, that is, new data that has never been seen and is drawn from the world of real practice. This approach then necessitates more infrastructure and an ongoing commitment to improving the solution – has the user client committed to that effort, and for how long?
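The contrast between an internal estimate like 10CV and a post-implementation style test can be sketched as follows, using synthetic data and an illustrative model. In this simulation the two scores will be similar; with real post-implementation data, which often drifts away from the training distribution, they frequently are not:

```python
# A sketch contrasting 10-fold cross-validation with evaluation on data that
# arrives after development (simulated here by a chronological split).
# Data and model choice are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 10-fold cross-validation on the development data: an internal estimate only.
model = LogisticRegression(max_iter=1000)
print("10CV F1:", cross_val_score(model, X[:800], y[:800], cv=10, scoring="f1").mean())

# "Post-implementation" style test: train on the earlier records, score on the
# newest records never seen during development.
model.fit(X[:800], y[:800])
print("Held-out F1:", f1_score(y[800:], model.predict(X[800:])))
```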
However, the ultimate test is the client. Will they suspend their own judgement to accept or even consider the judgement of the delivered tool? If not, then not all is lost. An ML system designed to replace human tasks can readily become an advisory to the human, prodding them to think of things they might not have otherwise thought of. But one also has to be careful not to overreach – the recent AMIA Daily Downloads (25th November 2021 EADT) headlines “Epic's sepsis algorithm may have caused alert fatigue with 43% alert increase during pandemic”. Assessing the extent to which clinicians will revise their opinions would be a useful verification criterion.
6. AI IS USED IN MANY HEALTH SETTINGS THAT ARE NOT CLINICAL CARE. Many AI/ML applications used in health are for Public Health purposes or other secondary usage. These will not be able to show improved health outcomes, as required by Phase 3 of the Showcase; rather, they contribute to greater efficiency and productivity in the workplace. At best they represent second-order effects on health outcomes. The narrowness of the Showcase call for participation appears to be based on a limited view of the breadth of ML applications in the health sector as distinct from the clinical sector. One just needs to look at the papers presented at past AMIA conferences to realise there are probably more applications of ML to secondary use of clinical data than to primary use in clinical practice. Cancer Registries are a good example of the secondary use of most, and ideally all, pathology reports generated ACROSS the whole country that describe some aspect of cancer cases. If registries of all shades and persuasions are to keep up with the increasing count of patients and methods of treatment, then Clinical NLP using ML will be a vital tool in their analytics armoury. Assessing the extent to which an AI technology makes work more efficient or reliable would be a useful productivity criterion.
7. PRESS REPORTS IGNORE ACCURACY. It is frustrating to read press reports that laud the “accomplishments” of an AI application without any content on the reliability of the application. Errors mean different things in different contexts. The meaningfulness of false positives (FPs) and false negatives (FNs) is generally undersold and often ignored. In clinical work an FP can have as serious a consequence as an FN, as it means a patient receives inappropriate care, endangering their health, even life perhaps, as much as an FN, which would lead to a missed diagnosis and failure to deliver appropriate care. However, in population-based health applications usually FNs need to be minimised, and a certain higher level of FPs can be tolerated as a compromise to minimise FNs. Assessment of the importance of FNs and FPs to the acceptability of the AI application would be a useful reliability criterion.
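The FP/FN trade-off is easy to demonstrate: with invented labels and prediction scores, lowering the decision threshold removes false negatives at the price of more false positives, which is exactly the compromise a population health application might accept:

```python
# A sketch of how the FP/FN balance can be tuned: lowering the decision
# threshold reduces false negatives at the cost of more false positives.
# Labels and scores are invented for illustration.
from sklearn.metrics import confusion_matrix

y_true   = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.55, 0.6, 0.1, 0.45, 0.3]

for threshold in (0.5, 0.3):
    y_pred = [int(score >= threshold) for score in y_scores]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```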
While I feel the motivation and projected outcomes for the Showcase are hazy, individuals will have to assess for themselves whether it sufficiently addresses the difficulties in the field for participation to provide reciprocal value for the gargantuan effort and cost needed by contributors to be involved. It is the question we are asking ourselves at the moment.
Just making something different isn’t sufficient,
someone else has to use it meaningfully for it to have value.