
Implementation

Here we explain specific design choices made during the implementation of MaSTerClass.

XML (eXtensible Markup Language) is used to annotate the corpus. XML documents have a hierarchical structure built from nested XML elements, which suits our application well. Natural language sentences are highly structured, but they are also highly variable and cannot be easily and transparently represented by a fixed structure. For example, a corpus cannot be easily translated into a relational format, because the high variability of natural language makes it impossible to predict all syntactic structures in advance. XML, on the other hand, accommodates such variability natively. Moreover, lexico-syntactic information is attached to each sentence constituent through a set of appropriate elements and their attributes. Adding structure and other linguistic information to text in this way is a great advantage for electronic text representation. At the same time, XML documents are stored as plain text, which makes them independent of platform, hardware, software and language, and hence makes MaSTerClass extensible and pluggable into wider system frameworks.

Further, MaSTerClass has been implemented in Java, a major application development language with many advantages, such as being portable, object-oriented, robust and secure. Basing MaSTerClass on XML and Java makes it flexible, since both are portable and extensible. Java is also a natural choice for working with XML, because it offers a wide range of support for XML processing. To work with XML in Java, we need to parse XML documents into XML trees, navigate these trees to access specific XML elements, construct such trees, and convert them back into XML documents.
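
The four operations named above (parsing, navigating, constructing and converting back to text) can be sketched with the DOM and transformation APIs that ship with the JDK. This is only an illustration, not the MaSTerClass code: the class name XmlTreeSketch, the element names sentence and token, and the pos attribute are all assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; element and attribute names are assumed, not taken from MaSTerClass.
final class XmlTreeSketch {

    // Parse: turn an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Navigate: count the <token> elements whose pos attribute has the given value.
    static int countTokensWithPos(Document doc, String pos) {
        NodeList tokens = doc.getElementsByTagName("token");
        int count = 0;
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            if (pos.equals(token.getAttribute("pos"))) {
                count++;
            }
        }
        return count;
    }

    // Construct: append a new <token> element to the root element.
    static void appendToken(Document doc, String text, String pos) {
        Element token = doc.createElement("token");
        token.setAttribute("pos", pos);
        token.setTextContent(text);
        doc.getDocumentElement().appendChild(token);
    }

    // Convert: serialize the DOM tree back into XML text.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<sentence><token pos=\"NN\">protein</token></sentence>");
        appendToken(doc, "binds", "VBZ");
        System.out.println(countTokensWithPos(doc, "NN"));
        System.out.println(serialize(doc));
    }
}
```

Working at this level is verbose, because every element has to be reached through generic tree nodes and string names.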

The development of MaSTerClass as an XML-aware application has been greatly facilitated by automatically generating the Java classes that access and manipulate the XML tags specific to our application. For this purpose we relied on JWSDP (Java Web Services Developer Pack), an integrated toolkit for building and testing XML applications, Web services and Web applications. The JWSDP technology used in the development of MaSTerClass includes JAXB (Java Architecture for XML Binding), which provides an efficient and standard way of mapping between XML data and Java code. JAXB enables the automatic generation of Java classes from XML schemas (binding), the conversion of XML documents into Java objects (unmarshalling) and the reverse conversion of Java objects into XML documents (marshalling). It also supports the validation of XML documents against XML schemas. A clear advantage of writing an XML schema instead of writing the Java classes directly is the simpler syntax and the smaller amount of code that has to be produced manually. In addition, any change in the structure of the corpus needs to be recorded manually only in the schema, after which the corresponding code can be regenerated automatically. This is both easier and faster, because an XML schema is self-descriptive and therefore more readable.
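
The validation step can be illustrated without the generated classes. The sketch below does not use JAXB itself, which is no longer bundled with recent JDKs. Instead it uses the JDK's javax.xml.validation package to check a document against a schema. The schema shown is a hypothetical one written for this example (a sentence holding one or more token elements, each with a required pos attribute) and is not the MaSTerClass corpus schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical example class; the schema below is assumed for illustration only.
final class SchemaValidationSketch {

    // A <sentence> contains one or more <token> elements, each with a required pos attribute.
    static final String SCHEMA =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"sentence\">"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name=\"token\" maxOccurs=\"unbounded\">"
        + "<xs:complexType><xs:simpleContent>"
        + "<xs:extension base=\"xs:string\">"
        + "<xs:attribute name=\"pos\" type=\"xs:string\" use=\"required\"/>"
        + "</xs:extension>"
        + "</xs:simpleContent></xs:complexType>"
        + "</xs:element>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Compile the schema once; a failure here means the schema text itself is broken.
    static Schema compileSchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema is malformed", e);
        }
    }

    // Return true if the document is well-formed and conforms to the schema.
    static boolean isValid(String xml) {
        try {
            compileSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // With no custom error handler, any validation error surfaces as a SAXException.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<sentence><token pos=\"NN\">protein</token></sentence>"));
        System.out.println(isValid("<sentence><token>protein</token></sentence>"));
    }
}
```

Keeping the structural rules in the schema means that a change such as a new required attribute is made in one place, and non-conforming documents are rejected before any application code touches them.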

In the Java implementation of MaSTerClass, we relied on JAXB to manipulate individual sentences in the XML-annotated corpus. However, to facilitate access to individual sentences during the retrieval phase of the CBR (case-based reasoning) cycle, we used a database management system (DBMS), because a DBMS is generally able to handle large volumes of data efficiently. XML documents can be stored in, and efficiently retrieved from, a relational, object or XML-native database. By storing the XML document in a database rather than in a file, we also opened up the possibility of using MaSTerClass within a wider system framework, where the corpus information could be accessed by different modules and security and integrity would be enforced by the DBMS itself. We opted for an XML-native database as the natural choice for storing and easily manipulating XML documents. We used an X-Hive database to store the XML-annotated corpus and the XQuery language to query it.
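
The kind of sentence retrieval described here can be sketched in miniature. The example below is a stand-in only: it uses neither X-Hive nor XQuery, whose APIs are not reproduced here, but the XPath 1.0 support in the JDK, applied to a small in-memory document. The corpus layout (corpus, sentence, token and an id attribute) and the class name are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; the corpus layout is assumed, not taken from MaSTerClass.
final class SentenceRetrievalSketch {

    // Return the ids of all sentences that contain a token equal to the given word.
    static List<String> sentenceIdsContaining(String corpusXml, String word) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $word as a variable instead of pasting it into the expression text.
        xpath.setXPathVariableResolver(name -> word);
        try {
            NodeList ids = (NodeList) xpath.evaluate(
                    "/corpus/sentence[token = $word]/@id",
                    new InputSource(new StringReader(corpusXml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < ids.getLength(); i++) {
                result.add(ids.item(i).getNodeValue());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        String corpus =
              "<corpus>"
            + "<sentence id=\"s1\"><token>protein</token><token>binds</token></sentence>"
            + "<sentence id=\"s2\"><token>gene</token><token>expressed</token></sentence>"
            + "</corpus>";
        System.out.println(sentenceIdsContaining(corpus, "protein"));
    }
}
```

An in-memory query like this re-reads the whole document on every call, which is the cost a database with stored, indexed documents is meant to avoid for a large corpus.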

Finally, the UMLS (Unified Medical Language System) ontology is used to represent the biomedical knowledge. Although it contains nested structures, for which an XML representation is particularly suitable, it has been stored in a relational database. This choice was influenced by the table-like structure of the original UMLS files and by the fact that a relational database allows fast access to specific types of elements. A table-like representation of the ontology is easily achieved, because the ontology has a fixed structure despite frequent changes to its content. We used a MySQL database to store information from the ontology and SQL queries to retrieve biomedical information from the database.