Automatic analysis of medical records can help to improve our understanding of diseases, their development and treatment. The main problem is that most medical records today are stored as semi-structured text, which makes automatic classification and analysis difficult. One possible approach to simplifying the process is to export all relevant data from semi-structured health records into a database. We propose a semantics-oriented approach for this task, as it offers relative freedom in portability. To this end, a disease ontology is used as an intermediate representation of the domain. The ontology can be modified and extended at any moment to adapt the system to another disease or medical area. Our contribution describes an experiment run on a set of medical records that had already been exported to a database manually, and compares the obtained results to estimate the efficiency of the method.
Keywords: medical records, export, database
In the biomedical domain, the growing popularity of electronic health records raises the need for a formal representation of the information stored in medical records. Most of the available health records are stored as semi-structured text files, which is convenient for human perception but causes problems with structuring, searching, classification, analysis and other automatic tasks concerned with processing patients' data. It also makes it harder to apply data mining techniques to this data, as most of those techniques are designed to work with data stored in a table or a relational database and cannot handle free-form text.
One possible approach to simplifying the process is to export all relevant data from semi-structured health records into a database automatically. This approach is rather challenging, because it is difficult to create software that understands natural language to the same extent as humans do. In general, the current state of the art suggests starting the process with extensive pre-processing of the input text using spell-checking, sentence splitting, tokenization and, in some cases, part-of-speech tagging. Medical records in most cases contain many abbreviations that should also be resolved during the pre-processing stage. We will try to design a dedicated tool that automates a significant part of the export process for a specific type of real-life medical records – see Fig. 1 for a characteristic example. Although full automation of such a process is not yet achievable, because it requires the preparation of a knowledge domain model, a database model and a thesaurus with some manual adjustments, even a partial solution can bring value to the health records domain.
Our intention is to use a semantics-oriented approach to natural language analysis. The described method is based on the semantics-oriented approach proposed by Narinyani (1980). In this approach, the lexical units of the language are matched to certain semantic classes, which express the meaning of the given lexical unit in the database. The system of semantic classes and the analysis process are designed in such a way that, during the analysis itself, those combinations of semantic classes which carry the meaning of the largest semantic structure are distinguished. Semantic orientations make it possible to mutually refine the senses of lexemes based on their context. For example, suppose the text contains the phrase "aged 29". The word "aged" will be assigned to the AGE semantic class. Determining the semantic class of "29" is not straightforward, because digits found in a text can refer to an age, a medicine dose, a phone number, a date, etc. Some clearly wrong orientations can be removed if constraints are applied (e.g. an age cannot exceed 100), but ambiguity will still be present. In this case the context helps: the nearest neighbour of "29" has the semantic class AGE, and "29" has AGE among its candidate classes as well. This leads to the conclusion that AGE is the most probable semantic orientation for "29".
Paper medical records are still the most popular way of recording patient information in most medical institutions. In order to have complete patients' medical histories in electronic form, all existing paper records should be scanned and processed by OCR software. A pure recognition system is not enough for this goal, as it would not effectively reduce the effort. This limitation is due to the fact that medical records usually follow different kinds of formats, so it is difficult to extract content using a uniform template. A number of medical-record scanning solutions have been developed lately, among them MedicScan, InfoMedix, FileScan, ExperVision, etc.
[1] provides a review of recent research and achievements in information extraction from medical records; it also mentions different approaches used in similar systems. These include pattern matching, shallow and full syntactic parsing as well as semantic parsing, the use of ontologies and sub-languages. Each approach has its own advantages and disadvantages (such as lack of generalizability, non-robust performance or expensive prerequisites), and the particular choice should be made according to the task at hand.
Our system uses a semantics-oriented approach to natural language analysis. According to it, the lexical semantics of words and phrases is represented by a so-called "orientation", which is linked to a number of domain model concepts that could represent these words and phrases. In ambiguous cases the strongest orientation is selected.
Our method of data extraction from medical records proceeds in several phases: text processing, ontology mapping and database mapping. All these phases exploit the specific structure of the textual documents we are considering.
The text processing stage is responsible for extracting meaningful entities from the original text. A careful analysis of the 70 available medical records was carried out. It was noticed that each record has several distinct sections, so it was decided first to divide each record accordingly. For example, the record shown in Fig. 1 has 9 sections: RA, OA, Léky, Operace, Alergie, Transfuze, Úrazy, Abůzus, GA.
Each section is analyzed separately, as the sections are not interconnected. A section consists of two parts – a head and a tail. The head contains the text before the colon (or its main notion, if several words appear there); in most cases it has a direct link to some column in the database tables. The tail contains the rest of the section (the text after the colon); in some cases it is a list of several items, with the comma considered the separator of tail items. Stop words should be removed from the tails. Each word in a tail item should be brought to its base (lemma) form (we suppose that we can drop the original form of the word and insert only base forms into the database, for better unification). Each abbreviation should be resolved to its full form (or left as it is in some cases, for unification).
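The head/tail parsing can be sketched as follows; the stop-word list is a tiny invented sample, and lemmatization and abbreviation resolution are omitted for brevity:

```python
# Sketch of the section-parsing step; STOP_WORDS is a tiny invented list,
# and lemmatization / abbreviation resolution are omitted for brevity.
STOP_WORDS = {"a", "v", "na"}  # assumed stop words

def parse_section(section):
    """Split a record section into its head (text before the colon)
    and a list of tail items (comma-separated text after the colon)."""
    head, _, tail = section.partition(":")
    items = []
    for item in tail.split(","):
        words = [w for w in item.split() if w.lower() not in STOP_WORDS]
        if words:
            items.append(" ".join(words))
    return head.strip(), items

print(parse_section("Alergie: penicilin, prach a pyl"))
# → ('Alergie', ['penicilin', 'prach pyl'])
```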
Special treatment has to be applied to word negations; there is no general solution to this problem yet. One possible approach concentrates on words that start with "ne-" and are not present in a list of words where "ne-" is always part of the word and does not have a negative meaning. For such words, the "ne-" particle is removed and the negation flag of the word is set to true. Negation treatment is left as a future direction for the moment, as it requires careful analysis.
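This heuristic could look like the minimal sketch below; the exception list is an invented sample, since in practice it would have to be curated manually:

```python
# Sketch of the "ne-" heuristic; EXCEPTIONS is a tiny invented sample of
# words where "ne-" is part of the word rather than a negation.
EXCEPTIONS = {"nemoc", "nedele"}

def split_negation(word):
    """Return (word, negated): strip the "ne-" prefix and set the flag,
    unless the word is a known exception."""
    if word.startswith("ne") and len(word) > 2 and word not in EXCEPTIONS:
        return word[2:], True
    return word, False

print(split_negation("nekouri"))  # → ('kouri', True)
print(split_negation("nemoc"))    # → ('nemoc', False)
```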
The idea is to obtain, at the end of the text processing stage, attribute/value pairs for each section that can be easily mapped to the ontology.
Ontology mapping. For the first experiments we used an ontology that was created manually to describe the medical domain covered by the available records. Further applications of the system will use existing disease ontologies (MeSH, SNOMED, Disease Ontology, etc.). If the existing ontologies do not fully describe the disease or area in question, a custom ontology can be added. It is also expected that the tool will allow end-users to upload custom ontologies. This will make it possible to use the tool in other areas where a transformation to a formal representation may be needed.
The ontology mapping process itself should be easy once the text is properly processed. Each word should have one (or, in some cases, several) attributes from the ontology, which allows straightforward mapping. If several attributes can be applied to a single word, we are facing an ambiguous expression. Its main attribute is selected taking into account the current context, represented by the context attributes. These attributes are analyzed and, if some intersection appears with the ambiguous word's attributes, the ambiguity is resolved; otherwise a statistical approach is used – the most common attribute for this word within the particular group of records is selected.
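A minimal sketch of this two-step resolution, with an invented ontology fragment and invented frequency counts:

```python
from collections import Counter

# Sketch of the two-step attribute resolution; the ontology fragment and
# the frequency counts are invented for illustration.
ONTOLOGY = {"tlak": {"blood_pressure", "eye_pressure"},
            "oko": {"eye", "eye_pressure"}}
FREQUENCY = Counter({"blood_pressure": 40, "eye_pressure": 3})

def resolve(word, context_words):
    """Pick one ontology attribute for a word: first by intersection with
    the context attributes, then by overall frequency."""
    candidates = ONTOLOGY.get(word, set())
    if len(candidates) <= 1:
        return next(iter(candidates), None)
    for ctx in context_words:                      # 1) context intersection
        shared = candidates & ONTOLOGY.get(ctx, set())
        if len(shared) == 1:
            return shared.pop()
    return max(candidates, key=lambda a: FREQUENCY[a])  # 2) most common

print(resolve("tlak", ["oko"]))  # → eye_pressure (context wins)
print(resolve("tlak", []))       # → blood_pressure (frequency fallback)
```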
Database mapping. Database mapping is performed according to a database model, which is elaborated for each database separately. The database model is created in strict correlation with the ontology to reduce the effort in the database mapping stage. Each column in the database should be linked to exactly one ontology attribute to prevent ambiguity at this step.
It is planned to provide the user with tools for the semi-automatic creation of the database model in the future. Such a tool will read the database structure and create a core model, which can later be broadened with additional fields and relations.
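Because each ontology attribute links to exactly one column, the mapping reduces to a simple lookup; the table and column names below are assumptions and do not correspond to the project's actual database model:

```python
# Sketch of database mapping; table and column names are invented and do
# not correspond to the project's actual database model.
COLUMN_OF = {"age":     ("general_info", "age"),
             "allergy": ("allergy_history", "allergen")}

def to_rows(pairs):
    """Turn (attribute, value) pairs into (table, column, value) rows;
    each attribute links to exactly one column, so no ambiguity arises."""
    return [COLUMN_OF[attr] + (value,) for attr, value in pairs]

print(to_rows([("age", "29"), ("allergy", "penicilin")]))
# → [('general_info', 'age', '29'), ('allergy_history', 'allergen', 'penicilin')]
```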
Usage of Ontologies
Ontologies are becoming a more and more popular way of representing the semantics of a domain – the concepts and entities in the domain, and their properties and relationships. This is mostly due to the flexibility of ontologies, which is particularly valuable nowadays, when information is constantly changing and growing. Ontologies can also join information from different domains and sources and create new relations based on it. It is very important that existing ontologies can be updated and extended with new knowledge very easily. This helps a great deal when several different domains are under research and they do not have a common terminology.
Because of the limited availability of Czech disease ontologies, the ontology for the initial experiments was created manually. Fig. 2 shows its main part. This ontology fully describes the domain referenced in the available medical records – information about diseases, medicines and other medical history, as well as general information such as concepts for age, gender, status, etc.
When creating the ontology we strictly followed the structure and content of the available health records. That is why it contains 9 main classes, including general_info, medicine_history, surgery_history, allergy_history, transfusion_history, injury_history, abuse_history and gynecological_history. Each of these classes is further supplemented with a detailed classification, additional properties and concepts.
Future experiments will be conducted using the MeSH-CZ ontology, as it seems suitable for our goals. The MeSH (Medical Subject Headings) thesaurus is a controlled vocabulary developed by the U.S. National Library of Medicine [3]. The current version of the thesaurus includes concepts related to the biomedical and health information that appears in MEDLINE/PubMed, the NLM catalog database and other NLM databases. The Czech translation of MeSH was produced in 1977 by the Czech National Medical Library and is revised and updated periodically.
Automated export of information from medical records brings value only if its results are reliable and precise. To prove the effectiveness of the approach, an evaluation of the results was performed. We evaluated the performance of the application using the precision metric, i.e. the fraction of exported fields that are correct.
Because all the medical records used in the experiment had initially been exported to the database manually, the evaluation process was straightforward. We compared the fields of the two databases – one filled manually and the other automatically – and counted the number of mis-exports. Fields that did not match were considered mis-exports. All the erroneous fields were selected and analyzed; a number of them were eventually accepted as correct exports, since the error in the automatically exported field consisted only of a different representation of the word. This gave us a precision of 94% (2793 correct exports out of 2982 fields).
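The evaluation amounts to a field-by-field comparison of the two databases; a sketch with invented field values:

```python
# Sketch of the evaluation: a field-by-field comparison of the manually
# and automatically filled databases; the field values are invented.
def precision(manual, automatic):
    """Fraction of automatically exported fields matching the manual ones."""
    correct = sum(1 for m, a in zip(manual, automatic) if m == a)
    return correct / len(automatic)

manual    = ["penicilin", "29", "paralen", "prach"]
automatic = ["penicilin", "29", "paralen", "pyl"]
print(f"{precision(manual, automatic):.0%}")  # → 75%
# in the experiment above: 2793 / 2982 ≈ 94%
```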
Our assumption is that the errors were mainly caused by misspellings in the original health records. For the moment, some corrections were made manually; to avoid this in the future, we are planning to incorporate an automatic spell-checking step in the next experiments. Another possible source of errors could be the lack of a reliable context for some concepts, which led to mistakes in the ambiguity resolution process.
Exporting data from a natural language source is a rigorous and complicated task, especially in the biomedical domain. Medical text is harder to analyze because it often contains numerous abbreviations, medical codes, Latin terms, etc. But a reliable automatic export tool would bring a lot of value to the domain, because its potential uses are numerous.
Our approach proved to be rather accurate with an ontology restricted to the available medical records. We are planning to run a set of further experiments using a general ontology, possibly slightly adjusted to meet our goal. We expect the efficiency of the method to decrease during the first tests with a more general ontology. The results will be analyzed to identify possible improvements in the method, and these improvements will be applied. The use of a restricted ontology still remains an option for a reliable data export process, if a general ontology yields a very low level of precision even after all the improvements.
We thank the BioDat Research Group for the provided database with the manually exported medical records, and Prof. Olga Stepankova for her help in reviewing this paper.
Liuba Grama, Mgr.
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University in Prague
Karlovo nám. 13,
CZ-121 35 Praha 2
Phone: +420 608 268 746
[1] S.M. Meystre, G.K. Savova, K.C. Kipper-Schuler, J.F. Hurdle. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research, 2008.
[2] M.A. Sasse, C.W. Johnson. Human-Computer Interaction, INTERACT '99, 1999.
[3] C.E. Lipscomb. Medical Subject Headings (MeSH). Bull Med Libr Assoc. 2000 July; 88(3): 265–266.
[4] M. Zakova, L. Novakova, O. Stepankova, T. Markova. Ontologies in Annotation and Analysis of Medical Records, 2008.
[5] D. Jurafsky, J.H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, ISBN 0-13-095069-6, 2000.