Modern language processing technologies have arisen from two separate pathways in the history of algorithms.
The first language processing efforts, in the 1960s, were conducted by linguists using early computers to process linguistic structures. They built algorithms that identified parts of speech and parsed sentences according to the structure of the language. As the field developed it became known as Computational Linguistics, and it continues today as a professional research community.
In the 1990s, computer scientists introduced machine learning methods into computational linguistics, following the spirit of corpus linguistics, which studies the statistical characteristics of corpora. These methods, known as Statistical Natural Language Processing, became an important subfield of what had by then become known as Natural Language Processing (NLP).
In the second half of the 1990s the rise of Google spurred the adoption of Text Mining, an approach that ignores the linguistic character of language and works on the text strings alone. It has become a fashionable approach because programmers can engage with the processing strategies as enthusiasts without needing to engage with the field of linguistics. The approach has become so popular that it has assumed the title of NLP, so today NLP is a label that conflates two methodological paradigms.
Since the early 2010s Google has also fuelled a new approach with its development of neural network machine learning methods known as Deep Learning. This fits its general strategy of using extremely large data sets to characterize language texts. While the approach has shown some useful features, it is not universally applicable, and some of its limitations are now beginning to emerge.
The two heritages of language processing are therefore Computational Linguistics and Text Mining, each with separate origins and algorithmic philosophies, and each now claiming the rubric of Natural Language Processing.
However, algorithmic processing of text is not an end in itself; it is part of a journey towards functionality that is of value to users. The two most dominant user needs are to recognize the nature of a document (document classification) and to identify entities of interest within the text (semantic entity recognition). The meaningfulness of this processing comes from how the NLP analysis is then utilized, not from the NLP itself. It is this use of the NLP analysis, whether it comes from the computational linguistics paradigm or the text mining paradigm, that creates the value for the user.
- DEEP UNDERSTANDING takes its heritage from computational linguistics and, for processing clinical documents, moves beyond text mining methods to bring to bear the full value of:
- coalescing grammatical understanding, semantic entity recognition, extensive clinical knowledge stores, and machine learning algorithmic power,
- so as to transform the native textual prose into a meaningful conceptual realm of interest that mimics the work of the people responsible for interpreting the clinical texts,
- at an accuracy level equal to or better than the human experts who do the task, and,
- knows its own limitations and passes the task back to the human processor when it cannot complete the task to sufficient accuracy (a minimal sketch of this confidence-threshold deferral follows this list), and
- provides an automatic feedback mechanism that identifies potential errors and improves performance as the body of processed material grows, so as to sustain Continuous Process Improvement.
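To illustrate the deferral behaviour described above, the following is a minimal sketch rather than the production system: it assumes a hypothetical upstream coding model that returns a proposed code together with a confidence score, and an assumed threshold below which the case is routed back to a human coder.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical result from an upstream coding model: a proposed code
# plus the model's confidence in that proposal (0.0 - 1.0).
@dataclass
class CodingResult:
    code: Optional[str]
    confidence: float

# Assumed threshold below which the task is handed back to a human coder.
CONFIDENCE_THRESHOLD = 0.95

def route_result(result: CodingResult) -> dict:
    """Accept the automatic code only when confidence is high enough;
    otherwise flag the report for human review."""
    if result.code is not None and result.confidence >= CONFIDENCE_THRESHOLD:
        return {"decision": "auto-coded", "code": result.code}
    return {"decision": "refer-to-human", "code": None}

# Example: a confident prediction is accepted, an uncertain one is deferred.
print(route_result(CodingResult(code="C50.9", confidence=0.98)))
print(route_result(CodingResult(code="C50.9", confidence=0.62)))
```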
An example is the work we do for cancer registries, which need to translate cancer pathology reports into an agreed set of codes drawn from an international coding system, the International Classification of Diseases for Oncology, Third Edition (ICD-O-3).
Our Deep Understanding pipeline consists of four major components, as shown in the diagram:
1. A machine learning document classifier that determines whether a pathology report is a cancer report or not (a simple sketch of such a classifier follows this component list).
2. A clinical entity recognizer, a machine learner that identifies forty-two different semantic classes of clinical content needed to code the report correctly.
3. A coding inductive inference engine that holds the knowledge of the ICD-O-3 codes and many hundreds of prescribed coding rules. It receives the text annotated with clinical entities and applies computational linguistics algorithms to parse the text and map the entities to the classification content, arriving at the ICD-O-3 codes. The task is complicated by the need to code for five semantic objects (a sketch of this mapping follows the list):
| Semantic Object | Number of Codes |
| --- | --- |
| Body site, the location of the cancer in the body | 300+ |
| Disease histology | 800+ |
| Behaviour of the cancer | 5 |
| Grade, or severity, of the disease | 20+ |
| Laterality, the side of the body where the disease is located, for bilateral organs | 6 |
4. A process that cycles knowledge learnt from operational use of the pipeline back to the machine learning NLP components, sustaining Continuous Process Improvement (CPI), or Continuous Learning (sketched after this list).
Each component has been periodically retrained and tuned over time to attain an accuracy that rivals, and on some aspects betters, human coding performance.
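To make the first component more concrete, here is a minimal sketch of a cancer / not-cancer document classifier. It is not our production model; it assumes a small labelled set of report texts and uses an off-the-shelf TF-IDF representation with logistic regression from scikit-learn purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data standing in for pathology report texts (1 = cancer report).
reports = [
    "Invasive ductal carcinoma identified in the left breast biopsy.",
    "Sections show benign fibroadipose tissue with no evidence of malignancy.",
    "Metastatic adenocarcinoma involving two of five lymph nodes.",
    "Chronic inflammation; no dysplasia or carcinoma seen.",
]
labels = [1, 0, 1, 0]

# Bag-of-words classifier: TF-IDF features feeding a logistic regression.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(reports, labels)

# Probability that a new report is a cancer report.
new_report = ["Biopsy demonstrates squamous cell carcinoma with clear margins."]
print(classifier.predict_proba(new_report)[0][1])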
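The third component maps the recognized clinical entities onto the five ICD-O-3 semantic objects in the table above. The sketch below is a heavily simplified illustration under assumed data: the entity annotations, lookup tables, and single rule are hypothetical stand-ins for the engine's knowledge stores and its hundreds of prescribed rules.

```python
from dataclasses import dataclass

# A clinical entity produced by the entity recognizer: the text span
# and its semantic class (one of the forty-two classes in the real system).
@dataclass
class Entity:
    text: str
    semantic_class: str

# Hypothetical lookup tables standing in for the ICD-O-3 knowledge stores.
TOPOGRAPHY = {"left breast": "C50.9"}          # body site codes
MORPHOLOGY = {"ductal carcinoma": "8500"}      # histology codes
BEHAVIOUR = {"invasive": "3", "in situ": "2"}  # behaviour codes
LATERALITY = {"left": "1", "right": "2"}       # laterality codes

def code_report(entities: list[Entity]) -> dict:
    """Toy rule: take the first entity of each class and look up its code.
    The real engine parses the text and applies many interacting rules."""
    codes = {"site": None, "histology": None, "behaviour": None,
             "grade": None, "laterality": None}
    for entity in entities:
        value = entity.text.lower()
        if entity.semantic_class == "body_site" and codes["site"] is None:
            codes["site"] = TOPOGRAPHY.get(value)
            codes["laterality"] = next(
                (c for word, c in LATERALITY.items() if word in value), None)
        elif entity.semantic_class == "histology" and codes["histology"] is None:
            codes["histology"] = MORPHOLOGY.get(value)
        elif entity.semantic_class == "behaviour" and codes["behaviour"] is None:
            codes["behaviour"] = BEHAVIOUR.get(value)
        elif entity.semantic_class == "grade" and codes["grade"] is None:
            codes["grade"] = value
    return codes

# Example annotations for "Invasive ductal carcinoma of the left breast, grade 2".
annotated = [
    Entity("left breast", "body_site"),
    Entity("ductal carcinoma", "histology"),
    Entity("invasive", "behaviour"),
    Entity("grade 2", "grade"),
]
print(code_report(annotated))
```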
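Finally, the fourth component, the continuous-improvement loop, can be sketched as nothing more than accumulating human-verified corrections and periodically refitting the learning components on the enlarged corpus. The function name and retraining trigger below are assumptions for illustration, not the production mechanism.

```python
# Accumulate human-verified corrections and retrain once a batch has built up.
training_texts: list[str] = []
training_labels: list[int] = []
RETRAIN_EVERY = 500  # assumed batch size for triggering a retrain

def record_correction(text: str, verified_label: int, model) -> None:
    """Store a human-verified example; refit the model (for example, the
    classifier pipeline sketched above) when a batch has accumulated."""
    training_texts.append(text)
    training_labels.append(verified_label)
    if len(training_texts) % RETRAIN_EVERY == 0:
        model.fit(training_texts, training_labels)  # refit on the grown corpus
```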
While Deep Learning is a new and innovative method for learning the characteristics of a large corpus of texts, it can be, and usually is, only an NLP subsystem within a larger processing objective. If that objective is to compute an understanding of the corpus content to the extent that the text can be intelligently transformed into higher-order conceptual content at an accuracy level achieved by humans, then it is performing DEEP UNDERSTANDING.