Authors: Liuba Grama 1
Authors‘ workplace: Czech Technical University in Prague , Prague, Czech Republic 1
Published in: Lékař a technika - Clinician and Technology No. 2, 2012, 42, 38-41
Category: Conference YBERC 2012


Automatic analysis of medical records can improve our understanding of diseases, their development and their treatment. The main problem is that most medical records are currently stored as semi-structured text, which makes automatic classification and analysis difficult. One possible way to simplify the process is to export all relevant data from semi-structured health records into a database. We propose a semantics-oriented approach for this task, as it makes the system relatively easy to port to other domains. To this end, a disease ontology is used as an intermediate representation of the domain. The ontology can be modified and expanded at any time to adapt the system to another disease or medical area. Our contribution describes an experiment run on a set of medical records that had already been exported to a database manually, and compares the automatic and manual results to estimate the efficiency of the method.

medical records, export, database


In the biomedical domain, the growing popularity of electronic health records raises the need for a formal representation of the information stored in them. Most available health records are stored as semi-structured text files. This form is convenient for human readers, but it causes problems with structuring, searching, classification, analysis and other automatic tasks that process patients' data. It also makes it harder to apply data mining techniques to this data, as most techniques are designed to work with data stored in a table or a relational database and cannot work with free-form text.

One possible way to simplify the process is to export all relevant data from semi-structured health records into a database automatically. This approach is rather challenging, because it is difficult to create software that understands natural language to the same extent as humans do. In general, the current state of the art suggests starting the process with extensive pre-processing of the input text: spellchecking, sentence splitting, tokenization and, in some cases, part-of-speech tagging. Medical records usually contain many abbreviations, which should also be resolved during the pre-processing stage. We aim to design a dedicated tool that automates a significant part of the export process for a specific type of real-life medical record – see Fig. 1 for a characteristic example. Full automation of such a process is not yet achievable, because it requires preparing a domain knowledge model, a database model and a thesaurus with some manual adjustments. Nevertheless, even a partial solution can bring value to the analysis of health records.

Fig. 1: Characteristic example of a medical record used in experiment

Our intention is to use the semantics-oriented approach to natural language analysis proposed by Narinyani, 1980. In this approach the lexical units of the language are matched to semantic classes, which express the meaning of the given lexical unit in the database. The system of semantic classes and the analysis process are designed so that the analysis singles out those combinations of semantic classes that form the largest meaningful semantic structure. Semantic orientations allow the senses of lexemes to be refined mutually on the basis of their context. For example, suppose the text contains the phrase “aged 29”. The word “aged” is assigned to the AGE semantic class. Determining the semantic class of “29” is not straightforward, because digits in the text can refer to an age, a medicine dose, a phone number, a date, etc. Some clearly wrong orientations can be removed by applying constraints (e.g. an age cannot be more than 100), but ambiguity will still be present. In this case the context helps: the nearest neighbor of “29” has the semantic class AGE, and “29” has AGE among its candidate classes as well. This leads to the conclusion that AGE is the most probable semantic orientation for “29”.
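The “aged 29” example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon, the class names other than AGE, and the functions `candidate_classes` and `resolve` are hypothetical and are not taken from the described system.

```python
# Hypothetical lexicon: each lexical unit maps to the set of semantic
# classes (its "orientation") that it may express.
LEXICON = {"aged": {"AGE"}}


def candidate_classes(token):
    """Return the set of semantic classes a token may belong to."""
    if token.isdigit():
        classes = {"AGE", "DOSE", "PHONE", "DATE"}
        # Constraint from the text: an age cannot be more than 100.
        if int(token) > 100:
            classes.discard("AGE")
        return classes
    return set(LEXICON.get(token.lower(), set()))


def resolve(tokens):
    """Narrow ambiguous tokens using the classes of their nearest neighbours."""
    orientations = [candidate_classes(t) for t in tokens]
    resolved = []
    for i, classes in enumerate(orientations):
        if len(classes) > 1:
            neighbours = set()
            if i > 0:
                neighbours |= orientations[i - 1]
            if i + 1 < len(orientations):
                neighbours |= orientations[i + 1]
            shared = classes & neighbours
            if shared:  # keep only the classes supported by the context
                classes = shared
        resolved.append(classes)
    return resolved


print(resolve(["aged", "29"]))  # [{'AGE'}, {'AGE'}]
```

A number with no informative neighbour keeps all of its candidate classes, which mirrors the remark that constraints alone do not remove the ambiguity.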

Method Description

Paper medical records are still the most popular way of recording patient information in most medical institutions. To obtain a complete patient medical history in electronic form, all existing paper records have to be scanned and processed by OCR software. A pure recognition system is not enough for this goal, as it would not effectively reduce the effort required. The limitation arises because medical records follow many different formats, so it is difficult to extract content using a uniform template.

A number of medical record scanning solutions have been developed recently, among them MedicScan, InfoMedix, FileScan and ExperVision.

[1] provides a review of recent research and achievements in information extraction from medical records and mentions the different approaches used in similar systems. They include pattern matching, shallow and full syntactic parsing, semantic parsing, and the use of ontologies and sub-languages. Each approach has its own advantages and disadvantages (such as lack of generalizability, non-robust performance or expensive prerequisites), and the choice should be made according to the goals pursued.

Our system uses the semantics-oriented approach to natural language analysis. In this approach the lexical semantics of words and phrases is represented by a so-called “orientation” [2], which is linked to the domain model concepts that these words and phrases could represent. In ambiguous cases the strongest orientation is selected.

Our method of data extraction from medical records proceeds in three phases: text processing, ontology mapping and database mapping. All these phases exploit the specific structure of the textual documents we are considering.

The text processing stage is responsible for extracting meaningful entities from the original text. We carefully analyzed the 70 available medical records. Each record turned out to have several distinct sections, so the first step is to divide each record into sections. For example, the record shown in Fig. 1 has 9 sections: RA, OA, Léky, Operace, Alergie, Transfuze, Úrazy, Abůzus, GA.

Each section is analyzed separately, as the sections are not interconnected. A section consists of two parts, a head and a tail. The head contains the text before the colon (or the main notion, if the text consists of several words); in most cases it links directly to a column in the database tables.

The tail contains the rest of the section (the text after the colon). In some cases it is a list of several items; a comma is treated as the separator of tail items.
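The head/tail split can be sketched as follows. This is a minimal sketch that assumes each section occupies one line of the record; the function names and the sample text are invented for illustration and do not come from the real records.

```python
def split_section(line):
    """Split 'head: item, item' into the head and a list of tail items."""
    head, _, tail = line.partition(":")
    # A comma separates the tail items; empty items are dropped.
    items = [item.strip() for item in tail.split(",") if item.strip()]
    return head.strip(), items


def parse_record(text):
    """Map each section head to its tail items (assumes one section per line)."""
    sections = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # not a 'head: tail' section line
        head, items = split_section(line)
        sections[head] = items
    return sections


# Invented sample; the section names follow the record shown in Fig. 1.
sample = "Alergie: pyl, prach\nOperace: 0\n"
print(parse_record(sample))  # {'Alergie': ['pyl', 'prach'], 'Operace': ['0']}
```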

Stop words should be removed from the tails. Each word in a tail item should be brought to its infinitive (base) form; we assume that the original word form can be dropped and only the base form inserted into the database, for better unification. Each abbreviation should be resolved to its full form (or, in some cases, left as it is for the sake of unification).
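A tail item could be normalized along these lines. The three lookup tables are tiny hypothetical stand-ins: a real system would need a proper Czech stop-word list, a lemmatizer and a medical abbreviation thesaurus, none of which are specified in the text.

```python
# Hypothetical lookup tables, for illustration only.
STOP_WORDS = {"a", "v", "na"}
ABBREVIATIONS = {"DM": "diabetes mellitus"}
BASE_FORMS = {"léky": "lék"}


def normalize_item(item):
    """Drop stop words, expand abbreviations and reduce words to base form."""
    words = []
    for word in item.split():
        if word.lower() in STOP_WORDS:
            continue
        if word in ABBREVIATIONS:
            # Abbreviations are matched case-sensitively.
            words.append(ABBREVIATIONS[word])
        else:
            lowered = word.lower()
            # Words missing from the table are kept unchanged (lowercased).
            words.append(BASE_FORMS.get(lowered, lowered))
    return " ".join(words)


print(normalize_item("DM a léky"))  # diabetes mellitus lék
```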

Special treatment has to be applied to word negations. There is no general solution to this problem yet. One possible approach concentrates on words that start with “ne-” and are not present in a list of words in which “ne-” is an integral part of the word and carries no negative meaning. For such words, the “ne-” prefix is removed and the negation flag of the word is set to true. Negation treatment is left for future work for now, as it requires careful analysis.
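The “ne-” heuristic described above might look like the sketch below. It is only an illustration of the possible approach, not an implemented part of the system; the exception list `NOT_NEGATED` and the function name are hypothetical.

```python
# Hypothetical exception list: words in which "ne" is part of the stem,
# e.g. "nemoc" (disease) and "nerv" (nerve).
NOT_NEGATED = {"nemoc", "nerv"}


def strip_negation(word):
    """Return (word without the 'ne-' prefix, negation flag)."""
    lowered = word.lower()
    if (lowered.startswith("ne") and len(lowered) > 2
            and lowered not in NOT_NEGATED):
        return lowered[2:], True
    return lowered, False


print(strip_negation("nekouří"))  # ('kouří', True)
print(strip_negation("nemoc"))    # ('nemoc', False)
```

The quality of the result depends entirely on how complete the exception list is, which is one reason the problem needs careful analysis.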

The goal is that, at the end of the text processing stage, each section is represented by attribute/value pairs that can be easily mapped to the ontology.

Ontology mapping. For the first experiments we used an ontology that was created manually to describe the medical domain covered by the available records. Further applications of the system will use existing disease ontologies (MeSH, SNOMED, Disease Ontology, etc.). If the existing ontologies do not fully describe the disease or area in question, a custom ontology can be added. The tool is also expected to allow end users to upload a custom ontology. This will make it possible to use the tool in other areas where transformation to a formal representation is needed.

The ontology mapping process itself should be easy once the text is properly processed. Each word should have one attribute from the ontology (or, in some cases, several), which allows straightforward mapping. If several attributes can be applied to a single word, the expression is ambiguous. Its main attribute is then selected by taking into account the current context, represented by the context attributes. If these attributes intersect with the candidate attributes of the ambiguous word, the ambiguity is resolved in favor of the intersection; otherwise a statistical approach is used, and the most common attribute for this word within the particular group of records is selected.
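The two-step resolution (context intersection first, statistics second) can be sketched as below. The function `resolve_attribute`, its parameters and the attribute names in the usage lines are assumptions made for illustration; in particular, the text does not say what happens when the intersection itself contains several attributes, so the sketch falls back to the statistics in that case.

```python
from collections import Counter


def resolve_attribute(candidates, context_attributes, history):
    """Pick one ontology attribute for a possibly ambiguous word.

    candidates: set of attributes the word may carry
    context_attributes: set of attributes found in the current context
    history: attributes assigned to this word in the same group of records
    """
    if len(candidates) == 1:
        return next(iter(candidates))
    # Step 1: intersection with the context attributes.
    shared = candidates & context_attributes
    pool = shared or candidates
    if len(pool) == 1:
        return next(iter(pool))
    # Step 2: statistical fallback, the most common attribute so far.
    counts = Counter(a for a in history if a in pool)
    if counts:
        return counts.most_common(1)[0][0]
    return None  # still unresolved


print(resolve_attribute({"AGE", "DOSE"}, {"AGE"}, []))                      # AGE
print(resolve_attribute({"AGE", "DOSE"}, set(), ["DOSE", "DOSE", "AGE"]))  # DOSE
```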

Database mapping. Database mapping is performed according to a database model, which is elaborated for each database separately. The database model is created in strict correspondence with the ontology to reduce the effort needed at the database mapping stage. Each database column should be linked to exactly one ontology attribute, to prevent ambiguity in this step.
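One way to enforce the "one column, one attribute" rule is to build the database model as a dictionary and reject conflicting links. This is a sketch under assumptions: the column names and both function names are hypothetical, and only the attribute names are borrowed from the ontology classes used in the experiment.

```python
def build_db_model(links):
    """Build a column -> ontology attribute map from (column, attribute) pairs.

    Raises ValueError if a column is linked to two different attributes.
    """
    model = {}
    for column, attribute in links:
        if column in model and model[column] != attribute:
            raise ValueError(f"column {column!r} has more than one attribute")
        model[column] = attribute
    return model


def to_row(model, pairs):
    """Turn attribute/value pairs into a column/value row for insertion."""
    return {column: pairs[attribute]
            for column, attribute in model.items()
            if attribute in pairs}


# Hypothetical column names on the left, ontology attributes on the right.
model = build_db_model([("allergies", "allergy_history"),
                        ("surgeries", "surgery_history")])
print(to_row(model, {"allergy_history": "pyl"}))  # {'allergies': 'pyl'}
```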

In the future we plan to provide the user with tools for semi-automatic creation of the database model. Such a tool will read the database structure and create a core model, which can later be extended with additional fields and ontology links.

Usage of Ontologies

Ontologies are becoming an increasingly popular way of representing domain semantics: the concepts and entities in a domain, together with their properties and relationships. This is mostly due to the flexibility of ontologies, which is particularly valuable now that information is constantly changing and growing. Ontologies can also join information from different domains and sources and create new relations on this basis. Importantly, existing ontologies can be updated and extended with new knowledge very easily. This helps considerably when several domains are under research and they do not share an existing common ontology.

Because few Czech disease ontologies are available, the ontology for the initial experiments was created manually. Fig. 2 shows its main part. This ontology fully describes the domain referenced in the available medical records: information about diseases, medicines and other medical history, as well as general information such as concepts of age, gender, status, etc.

Fig. 2: Ontology used in the experiment

When creating the ontology we strictly followed the structure and content of the available health records. That is why it contains 10 main classes: general_info, personal_disease_history, family_disease_history, medicine_history, surgery_history, allergy_history, transfusion_history, injury_history, abuse_history and gynecological_history. Each of these classes is further supplemented with a detailed classification, additional properties and concepts.

Future experiments will use the MeSH-CZ ontology, as it seems to be suitable for our goals. The MeSH (Medical Subject Headings) thesaurus is a controlled vocabulary developed by the U.S. National Library of Medicine [3]. The current version of the thesaurus includes concepts related to biomedical and health information that appear in MEDLINE/PubMed, the NLM catalog database and other NLM databases. The Czech translation of MeSH was made in 1977 by the Czech National Medical Library; it is revised and updated periodically.


Automated export of information from medical records brings value only if its result is reliable and precise. To assess the effectiveness of the approach, we evaluated the results. We measured the performance of the application using the precision metric, that is, the share of correctly exported fields among all exported fields.

Because all medical records used in the experiment had initially been exported to the database manually, the evaluation process was straightforward. We compared the fields of the two databases, one filled manually and the other automatically, and counted the number of mis-exports, defined as fields that did not match. All the erroneous fields were selected and analyzed. A number of them were eventually counted as correct exports, since the only difference in the automatically exported field was a different representation of the same word. This gave a precision of 94% (2793 correct exports out of 2982 fields).
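The reported figure can be checked with a few lines of Python. The function name `precision` is ours; the two counts are the ones given above.

```python
def precision(correct, total):
    """Share of correctly exported fields among all exported fields."""
    return correct / total


correct, total = 2793, 2982
p = precision(correct, total)
print(f"{p:.1%}")       # 93.7%, which rounds to the reported 94%
print(total - correct)  # 189 fields did not match
```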

We assume that the errors were caused mainly by misspellings in the original health records. For now, some corrections were made manually. To avoid this in the future, we plan to incorporate an automatic spell-checking step in the next experiments. Another possible source of errors is the lack of reliable context for some concepts, which led to mistakes during ambiguity resolution.


Exporting data from a natural language source is a demanding and complicated task, especially in the biomedical domain. Medical text is harder to analyze because it often contains many abbreviations, medical codes, Latin terms, etc. Nevertheless, a reliable automatic export tool would bring a lot of value to the domain, because its potential uses are numerous.

Our approach proved to be rather accurate with an ontology restricted to the available medical records. We plan to run a set of further experiments using a general ontology, possibly slightly adjusted to meet our goal. We expect the efficiency of the method to decrease during the first tests with a more general ontology. The results will be analyzed to identify possible improvements to the method, and these improvements will be applied. Using a restricted ontology remains an option for a reliable data export process if the general ontology still yields a very low level of precision after all the improvements.


We thank the BioDat Research Group for providing the database with the manually exported medical records and Prof. Olga Stepankova for her help in reviewing this paper.

Liuba Grama, Mgr.

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University in Prague

Karlovo nám. 13,

CZ-121 35 Praha 2

E-mail: gramulka@gmail.com

Phone: +420 608 268 746


[1] S.M. Meystre, G.K. Savova, K.C. Kipper-Schuler, J.F. Hurdle. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research, 2008.

[2] M.A. Sasse, C.W. Johnson. Human-computer Interaction, INTERACT '99, 1999

[3] C.E. Lipscomb. Medical Subject Headings (MeSH). Bull Med Libr Assoc. 2000 July; 88(3): 265–266.

[4] M. Zakova, L. Novakova, O. Stepankova, T. Markova. Ontologies in Annotation and Analysis of Medical Records, 2008.

[5] D. Jurafsky, J.H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall, ISBN: 0-13-095069-6, 2000.

