
JMI - Design and Development of a Linked Open Data-Based Health Information Repr...

source link: https://medinform.jmir.org/2014/2/e31/

Introduction

Information is a foundation for effective decision-making. This information need is even more critical in public health organizations to support areas such as epidemiologic surveillance, health outcome assessment, program evaluation and performance measurement, public health planning, and policy analysis [1]. In order to satisfy this, we need better and more flexible health data representation, analysis, querying, and visualization methods. The amount of available online health data both in structured and unstructured formats is constantly increasing. The World Health Organization (WHO), for example, has established a data repository providing access to over 50 datasets on priority health topics including mortality and prevalence of human immunodeficiency virus infection/acquired immunodeficiency syndrome (HIV/AIDS) in different WHO regions [2]. Moreover, the United Nations [3] and the Centers for Disease Control and Prevention (CDC) [4] have online data repositories on the different indicators for different countries.

While these are important initiatives to publish health data online, relatively little attention has been paid to data representation methods in most health data portals so far [5]. Current data representation and distribution methods, which rely only on tabular formats such as comma-separated values (CSV), PDF, and Excel and carry little metadata, make health information integration, comparison, and reuse very difficult. Additionally, even though different indicators are related to each other, the datasets are not linked in most portals. Vocabularies and data formats are inconsistent, which makes finding, assembling, and normalizing these datasets time consuming and error prone [6].

Exploiting the different kinds of public health information about a given topic is a challenging task because data is spread across different platforms in heterogeneous formats. Better data management methods and tools are required to move from a Web of documents, only understandable by human users, to a Web of Data in which information is expressed in a format that can be read and used by machines. This would enable us to find, share, and integrate information more easily [7].

Linked Data, as explained by Tim Berners-Lee [7], is a method to publish structured data by using standard Web technologies to connect related data and make them accessible on the Web. The Linked Data publishing pattern uses HTTP uniform resource identifiers (URIs) for identifying data items, the Resource Description Framework (RDF) for describing data, and links to describe the relationships. Other standards used in Linked Open Data (LOD) applications include Resource Description Framework Schema (RDFS) for describing RDF vocabularies, and SPARQL Protocol and RDF Query Language (SPARQL) for querying RDF graphs [8].
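The publishing pattern described above can be illustrated with a minimal sketch in Python. It represents two statements about a country as subject-predicate-object triples and serializes them as N-Triples. The rdfs:label and owl:sameAs URIs are the standard W3C ones; the http://example.org/lohd/ namespace and the helper names are assumptions made for illustration and are not taken from the system described in this paper.

```python
# Standard W3C predicate URIs.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/lohd/"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang=None):
    """Render a plain or language-tagged literal with basic escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    # A data item identified by an HTTP URI and described with RDF.
    (uri(EX + "country/Ethiopia"), uri(RDFS_LABEL), literal("Ethiopia", "en")),
    # A link that relates the item to a resource in another dataset.
    (uri(EX + "country/Ethiopia"), uri(OWL_SAME_AS),
     uri("http://dbpedia.org/resource/Ethiopia")),
]

print(to_ntriples(triples))
```

The second triple is what turns an isolated description into Linked Data: it points from a local identifier to an identifier maintained by another publisher.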

The primary goal of the Linked Data initiative is to make the World Wide Web (WWW) useful not only for interlinking documents, but also for sharing and interlinking data [9]. The movement is driven by the hypothesis that these technologies could revolutionize global data sharing, integration, and analysis, just as the classic Web revolutionized information sharing and communication over the last two decades. However, to our knowledge, few studies have examined the potential of LOD for public health data management.

Motivated by the universal hypothesis of Linked Data to revolutionize data sharing, integration, and analysis, the main objectives of this work are (1) to test the potential of LOD for health data representation and visualization, (2) to identify the available technologies and tools for Linked Data-based health information system development, and (3) to evaluate the usability level of LOD-based systems by end users.

In this paper, we present the development of the system from data modeling to visualization and potential LOD tools available for development. Identifying the tools and testing the potential of LOD will be helpful as an input to the health informatics and Semantic Web community in the research effort to find ways to represent data in a flexible manner.

Methods

Overview

Our methodology was “Integration-oriented development and evaluation” in the sense that we used the available LOD tools to develop the system and then we reflected on the development process, the potentials, and finally on usability for end users. We gave special emphasis to the data management process as efficient data management and conversion is the backbone of the LOD-based system development [10]. We used the RDF for data representation, Fuseki triple store for data storage, and Sgvizler for information visualization. Additionally, we integrated a SPARQL query interface for interacting with the data. We primarily used the WHO health observatory dataset to test the system. All the data were represented using RDF and interlinked with other related datasets on the Web of Data using Silk [11], a link discovery framework for Web of Data. A preliminary usability assessment was conducted following the System Usability Scale (SUS) method. The final revised SUS questionnaire used for the evaluation is shown in Multimedia Appendix 1. The details, with more focus on the data management process, are explained throughout this paper.

Data Sources

The dataset for this work was retrieved from the WHO global health observatory data repository [2]. The data used covered the years from 1990 to 2010. Missing data for some years were complemented with data from other similar official sources, such as the United Nations program for HIV/AIDS (UNAIDS) [3] and country-specific official sources like the national AIDS resource centers of each African country. From those databases, HIV statistical data, as well as additional location and total population information, were extracted for sub-Saharan African countries. Most of the data were in Microsoft Excel and CSV formats. All the data were prepared in Excel and converted using the Excel2RDF [12] converter. For the enrichment, DBpedia, Bio2RDF, and LinkedCT were used as sources. Regarding licensing, all our published Linked Data adheres to the original data publisher’s license and terms of use.

Data Modeling and Conversion

Shared vocabularies are key to enabling interoperability in healthcare systems because they provide an agreed-upon terminology that can be looked up through URIs that can be dereferenced [13]. We identified potential health, statistical, spatial, and time vocabularies and ontologies to share the data in a reusable way and then mapped them to the external ontologies using predicates (see Table 1). We used the common RDF [14], RDFS [15], Web Ontology Language (OWL) [16], friend of a friend (FOAF) [17], and Data Cube [18] vocabularies for data annotation. These are standard vocabularies for representing data in LOD by expressing relationships between the data. We used the Data Cube vocabulary for all the statistical data to represent not only the numbers but also advanced metadata with the space and time dimensions of each observation. Some of the standard Data Cube predicates were replaced with more domain-specific elements (eg, qb:prevalence instead of qb:observation) to make them more understandable to health professionals and healthcare managers. We assume that using terms already known to health professionals will make the system more usable and easily adaptable. After identifying the ontologies and vocabularies, the original data was converted in a semi-automated way to avoid information loss. Conversion using Excel2RDF is done by selecting the range of data values and headers from the spreadsheet that are to be converted. The headers are then fed into the mapping wizard, which assists in mapping row/column concepts to RDF vocabularies. Excel data triplification using Excel2RDF is discussed by Pesce et al [19].
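To make the modeling step more concrete, the following Python sketch renders one statistical observation in Turtle using the Data Cube vocabulary, with one dimension for space and one for time. It uses the standard qb: and SDMX terms rather than the renamed predicates mentioned above. The ex: namespace, the dataset name ex:hiv-prevalence, the function name, and the numeric value are all assumptions for illustration; the value is a placeholder and not a real statistic.

```python
# Prefix declarations. qb:, sdmx-dimension:, sdmx-measure:, and xsd: are the
# published namespace URIs; ex: is a hypothetical namespace for this sketch.
PREFIXES = """@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/lohd/> .
"""


def observation_turtle(country, year, value):
    """Render one qb:Observation with area, period, and measured value.

    `country` must be usable as a Turtle local name (no spaces);
    `value` is passed as a string so it is written out unchanged.
    """
    return (
        f"ex:obs-{country}-{year} a qb:Observation ;\n"
        f"    qb:dataSet ex:hiv-prevalence ;\n"
        f"    sdmx-dimension:refArea ex:{country} ;\n"
        f'    sdmx-dimension:refPeriod "{year}"^^xsd:gYear ;\n'
        f'    sdmx-measure:obsValue "{value}"^^xsd:decimal .\n'
    )


# "0.0" is a placeholder value, not data from the WHO repository.
document = PREFIXES + "\n" + observation_turtle("Ethiopia", 2000, "0.0")
print(document)
```

Keeping the area and period as separate dimension properties is what later allows the same observations to be queried by country, by year, or by both.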

Table 1. Different domain ontological vocabularies and predicates reused for modeling data in the conversion process.

Data Storage

The main difference between existing health information system development and Linked Data-based systems is the way data is represented and stored. Current systems mostly use tabular formats (eg, Excel, CSV) or relational database systems such as Oracle. Linked Data-based systems, however, usually build on triple stores as their main data storage. This triple-based representation enables integration of data available from various sources without the need for physical storage of the RDF triple that corresponds to the relational data [20]. These systems provide data management and data access via application programming interfaces (APIs) and query languages to RDF data. For this work, we used the Fuseki triple store [21]. It provides representational state transfer (REST)-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP [22].
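As a rough illustration of query access over HTTP, the sketch below builds a SPARQL Protocol GET request in Python using only the standard library. The endpoint URL is an assumption (a local Fuseki server exposing a dataset named ds); the actual endpoint of the system described here is not given in the text. The request is only constructed, not sent, so the snippet runs without a server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of a local Fuseki instance; adjust to the real service.
FUSEKI_QUERY_ENDPOINT = "http://localhost:3030/ds/query"


def build_sparql_request(endpoint, query):
    """Build a SPARQL Protocol query request (GET with a `query` parameter).

    The Accept header asks for the SPARQL JSON results format.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
request = build_sparql_request(FUSEKI_QUERY_ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

With a server running, the request could be sent by passing it to urllib.request.urlopen and reading the JSON body of the response.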

Data Enrichment

The primary intention of representing health data using the LOD approach is to be able to discover and link health data from different sources and use them in new applications. Interlinking data from our RDF datasets to other datasets, which are already in the LOD cloud, was challenging. It requires identification of similar link types in our datasets, and then finding suitable matching links in external datasets. Zevari et al point out similar challenges in link discovery in health datasets [23].

In our data enrichment, we used both manual and automatic methods. We manually enriched the dataset with links to some sources such as DBpedia, while large numbers of links to sources such as Bio2RDF were generated automatically. The enrichment is based on owl:sameAs relations, which interconnect different identifiers for the same real-world item across different datasets (eg, DBpedia:Ethiopia owl:sameAs geonames:7733022). Such a sameAs-link references different identifiers from different sources for the same real-world entity (Ethiopia, in our example) [10]. We enriched the data with links to data sources generated by related initiatives such as Bio2RDF [24], LinkedCT [25], Pubmed [26], and other geospatial and health-related initiatives using standard RDF and Unified Medical Language System (UMLS) vocabularies. We used the Silk Link Discovery Framework [27] for automatic link discovery and the built-in Fuseki query interface to access the data. To access the target data, we first configured access parameters to the target dataset endpoints using the <DataSource> directive. The only mandatory data source parameter is the endpoint URI. By specifying the source and destination endpoints on target datasets, we interlinked the data. In total, we retrieved 5312 data elements to be added to the system. Additionally, we implemented a visualization interface over the triple store using Sgvizler [28], a JavaScript library that renders the results of SPARQL queries as charts or HTML elements [29]. Figure 1 gives an overview of the overall methodology.
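The idea behind owl:sameAs enrichment can be sketched in a few lines of Python. The example below is not Silk and does not reproduce its configuration or matching rules; it is a deliberately naive stand-in that links two sets of resources whenever their labels match after normalization. The function names and the example.org URIs are assumptions for illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(label):
    """Lowercase a label and collapse whitespace so near-identical labels match."""
    return " ".join(label.lower().split())


def discover_links(source, target):
    """Return owl:sameAs triples for resources whose normalized labels are equal.

    `source` and `target` map resource URIs to human-readable labels.
    """
    index = {}
    for target_uri, label in target.items():
        index.setdefault(normalize(label), []).append(target_uri)

    links = []
    for source_uri, label in source.items():
        for match in index.get(normalize(label), []):
            links.append((source_uri, OWL_SAME_AS, match))
    return links


# Hypothetical local identifiers and their labels.
local = {
    "http://example.org/lohd/country/Ethiopia": "Ethiopia",
    "http://example.org/lohd/country/Kenya": "  kenya ",
}
# External identifiers with labels, as they might be retrieved from an endpoint.
external = {
    "http://dbpedia.org/resource/Ethiopia": "Ethiopia",
    "http://dbpedia.org/resource/Kenya": "Kenya",
}

for triple in discover_links(local, external):
    print(triple)
```

Exact label matching is fragile for real health data, where names vary in spelling and language, which is one reason a dedicated link discovery framework with configurable similarity measures is used in practice.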

Figure 1. The overall workflow diagram for the methodology from the data conversion, data interlinking, and data query to visualization.

Results

Overview

We developed a Linked Open Health Data (LOHD) system that integrates spatial and statistical health data from various sources. In the system, users can query HIV-related information about African countries and the system will support them in querying and visualizing the data in both space and time.

LOHD System Architecture

For the system development, we chose a multilayer architecture, which provides flexibility and reusability. For example, data management, query processing, and visualization are logically separate processes. The advantages of a multilayer architecture have been discussed in the literature in detail [27-29]. By breaking up the system into a hierarchy, different layers can be developed sequentially and modified asynchronously without affecting the entire system architecture [10]. The architecture of our system is composed of 4 main layers (see Figure 2): (1) the data layer, (2) the transformation layer, (3) the service layer, and (4) the presentation layer. The data layer stores the converted and interlinked data. The transformation layer is the processing layer, where every SPARQL query is processed using a crawling pattern to localize data from the Web of Linked Data. The service layer controls data access and bridges the client to the server via service protocols. The presentation layer allows users to interact with the services using either retrieval or visualization tools. All the system architecture layers and the underlying LOD application tools are shown in Figure 2.

Figure 2. LOHD System architecture with all the four layers from the data representation layer to presentation layer and the corresponding LOD tools.

Visualization

Coherent LOD visualizations enable nontechnical users to use the Web of Data [30] and increase the usability and accessibility of Linked Data-based systems [31]. In most Linked Data-based systems, the user is expected to write SPARQL queries, which is challenging for nontechnical users. To overcome those challenges, we integrate a live visualization interface using Sgvizler. Once the query is selected, the users have the option to choose the visualization method for the data output. All the visualization methods available on Sgvizler are supported by our system. In the following sample queries, we show some of the visualizations based on spatial or temporal queries.

Time Series Visualization of Linked Data

Time series visualizations help to display patterns and trends that are not readily apparent in the numbers themselves. In traditional databases, time series visualization is mostly done by external applications, which is cumbersome and time consuming. In Linked Data-based systems, however, users can write a query and choose the visualization type from a drop-down menu. Figure 3 shows the trend of HIV prevalence in Ethiopia as an example; the system automatically renders a live visualization of the trend for the requested period.
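The step between a query result and a chart can be sketched as follows. The Python snippet below converts a response in the standard SPARQL JSON results format into a list of (year, value) points sorted by year, which is the shape most charting tools expect. The variable names year and prevalence and the sample numbers are assumptions; the numbers are placeholders, not figures from the dataset, and the snippet does not reproduce how Sgvizler works internally.

```python
import json

# A made-up response in the SPARQL JSON results format.
# The values are placeholders, not real prevalence figures.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["year", "prevalence"]},
    "results": {"bindings": [
        {"year": {"type": "literal", "value": "1991"},
         "prevalence": {"type": "literal", "value": "2.0"}},
        {"year": {"type": "literal", "value": "1990"},
         "prevalence": {"type": "literal", "value": "1.0"}},
    ]},
})


def to_time_series(response_text, x_var="year", y_var="prevalence"):
    """Extract (year, value) pairs from SPARQL JSON results, sorted by year."""
    bindings = json.loads(response_text)["results"]["bindings"]
    points = [(int(b[x_var]["value"]), float(b[y_var]["value"])) for b in bindings]
    return sorted(points)


print(to_time_series(SAMPLE_RESPONSE))
```

Sorting on the client side keeps the sketch independent of whether the query itself includes an ORDER BY clause.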

Figure 3. Time series visualization of HIV prevalence in Ethiopia from the years 1990-2010. To visualize other countries, substitute the country name in the query.

Geographical Visualization of Linked Data

Location is becoming a basic attribute for health data [32]. Location-based visualizations are mostly difficult using traditional databases unless they are exported to geographic information system (GIS) software for further analysis. In LOD-based systems, location-based visualizations are facilitated by the ability to write queries and choose the visualization method. Figure 4 shows an example where the visualization shows the prevalence of HIV based on each country’s location on the African map. When someone clicks on the icon of the country, it will show the basic information about the country and the trend of HIV for the specified time period in the query.

Figure 4. Location-based visualization of HIV prevalence in sub-Saharan Africa. The health-related data and the time series graph are displayed by clicking on the map of the country.

Indicator-Based Visualization of Linked Data

Indicators are the basic components of any health data. Most international disease prevalence comparisons and local-level reporting are done using indicators in a specific period of time. LOD-based systems support queries with different indicators—such as HIV prevalence rate by country or region, antiretroviral therapy (ART) coverage rate, population or gross domestic product (GDP)—and make a correlation analysis between those variables over time. In Figure 5, we show a 3-dimensional correlation analysis with time series animation.

Figure 5. Indicator-based correlation visualization over LOHD system of HIV prevalence and ART coverage versus time.

Evaluation

The system was evaluated in a small-scale user study to get feedback from healthcare data managers and users regarding the usability and learnability of the system. A total of 19 participants, with both technical and nontechnical backgrounds, were selected for this evaluation. The participants had no relationship with the investigator, and the selection was purposive to ensure that we recruited participants who currently work on health data management and to get a proportional mix of different professions. Of the 19 selected participants, 17 responded to the questionnaire (89%). The technical participants (9/17, 53%) were data managers with IT backgrounds, health information system developers, and system administrators in different healthcare organizations in Africa. The nontechnical participants (8/17, 47%) were public data users, such as demographic data managers, doctors, and public health professionals. The evaluation was based on the SUS with some wording amended and tailored for our participants (see Multimedia Appendix 1). In the evaluation, we were interested in the feedback from the participants on the query-based data access. The Linked Data search engine was not provided to participants, given its early stage of development for complex query requests. The SUS is mostly regarded as a quick and easy way to conduct a usability assessment [33]. Even though the tool is self-described as “quick and dirty”, it has been evaluated in many studies (more than 600 articles) as valid and reliable [34]. Based on the SUS scoring criteria, the final calculated score was 82, which is well above the average SUS score of 68. Table 2 summarizes the evaluation responses for each criterion of the system usability.
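For readers unfamiliar with how a SUS score is derived, the following Python sketch applies the standard SUS scoring rule to one participant's answers. It assumes the conventional 10-item questionnaire rated from 1 to 5, in which odd-numbered items are positively worded and even-numbered items negatively worded; whether the reworded questionnaire used in this study kept that exact polarity is an assumption. The overall score for a study is typically the mean of the per-participant scores.

```python
def sus_score(responses):
    """Compute the SUS score (0-100) for one participant.

    `responses` holds the ten item ratings in questionnaire order, each 1-5.
    Odd-numbered items contribute (rating - 1), even-numbered items
    contribute (5 - rating); the sum is multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item ratings")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
        if item_number % 2 == 1:
            total += rating - 1
        else:
            total += 5 - rating
    return total * 2.5


# A neutral answer (3) on every item yields the midpoint of the scale.
print(sus_score([3] * 10))
```

Because each item contributes between 0 and 4 points, the raw sum ranges from 0 to 40, and the factor of 2.5 rescales it to the familiar 0 to 100 range.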

Additionally, 2 open-ended questions were asked to the users to better understand their views and their specific requirements for using the system. The frequent answers for those questions can be explained by dividing them into 2 groups. The participants with technical backgrounds were relatively happy and 8 out of 9 (89%) of them mentioned that such systems would be useful in the future. The nontechnical participants (8/8), however, mentioned that the system was not easy to use. This is understandable seeing that the current Linked Data tools demand writing queries. Publishing the data in machine-understandable form and making live visualization without having to use external applications were the most frequently mentioned benefits of the system by the participants of the evaluation (15/17, 88%). The need to write queries in the optional interface and identifying the appropriate visualization tool were reported as being the difficult aspects of such systems by 16 of the 17 (94%) participants.

Table 2. SUS evaluation criteria and participant response (n=17).

Discussion

Principal Findings

We developed a Linked Data-based health information representation, querying, and visualization system. We used the system to represent and store the data, facilitating flexible queries and different kinds of visualizations. There are other ongoing efforts to convert healthcare- and life science-related datasets to a Linked Data cloud such as Linked Open Drug Data (LODD), LinkedCT, Open Biomedical Ontologies (OBO), and the World Wide Web Consortium’s (W3C) Health Care and Life Sciences working groups [31,32]. Thanks to such initiatives and recently developed Semantic Web tools, converting data to RDF has become straightforward. However, just converting the data to RDF and publishing it online is not enough [35,36]. The main difficulty is to integrate the data representation methods to application-level tools and make them usable for health information consumers in a shared, semantically meaningful, easily discoverable, and reusable manner.

In our system, we represented the health data with its important dimensions—magnitude, time, and space—in the form of RDF and we used both manual and automatic interconnection methods to enrich the data. We integrated visualization and retrieval methods for the data to make data visualization and retrieval possible with already available tools. There was a similar initiative by Zappa et al to integrate mutation data in the LOD cloud [35]. The methodology we follow for development is similar except that they use another tool for the data conversion. What makes our work different is that in addition to converting the data and making it available in RDF, we focus on integrating additional query and visualization interface tools to make the system more usable, especially for nontechnical users.

Our system development method was integration oriented in the sense that it reflects the way to convert the different dimensions of the data to Linked Data and integrate them with already developed tools, enabling the system to support information access. In selecting our tools, we found that RDF is currently a robust data model for representing data with metadata [14,37], one that makes it possible to integrate data and make it available for querying. Our selection of Sgvizler for visualization was motivated by its current support for different types of visualization and its integration with HTML webpages by letting the user specify queries of interest [29]. One of the difficulties we noticed is that Sgvizler is relatively slow for complex queries. This may make it difficult to use for big data and complex query-based systems. Nonetheless, we believe that advanced-level, live correlation visualization of certain disease trends in space and time dimensions from different sources is one of the biggest promises of Linked Data-based systems in the future.

The degree of advancement that a Linked Data representation brings to public health data is difficult to quantify. Nonetheless, from the technology perspective, the data becomes search engine discoverable and machine understandable, which addresses the main issues of the current health data silos problem [38]. While Linked Data and Semantic Web technologies are not as mature as other database technologies, they present a promising alternative in public health information portal development. A good example is the data representation scheme in the World Bank database [39], which includes both a portal for downloading data as Excel or PDF files and a Linked Data version for downloading the data as RDF with the ability to query their endpoints. The main advantage of having Linked Data as an additional option in the World Bank database can be seen in search engine results. If we input “Prevalence of HIV in Egypt” and “GDP of Egypt” into search engines, we can clearly see the data representation limitation of health portals. Since the World Bank data is represented in a machine-understandable and search engine-discoverable way, the results include graphs and additional descriptions, which are very useful for an end user searching for them.

The user evaluation of our system confirms the existing usability limitations of Linked Data mentioned by different authors [21,32,35,36,40]. Linked Data is currently mostly used by the Semantic Web community and other users with a strong technical background. To make the Linked Data-based systems more usable by end users, we need to develop enhanced tools that can avoid the need to write queries.

In our evaluation, 41% (7/17) of the participants (strongly agree and agree together) reported that they need the support of a technical person to use this system, which is high when compared to other system evaluations [4,33]. Yet this is an expected result given the current technical nature of data access in LOD when using queries. The promising result from the evaluation is that 71% (12/17) of the participants are confident in using and understanding the visualizations of the system. This indicates that the LOD-based representation of public health data offers a new perspective in the future of health data portal development.

Limitations

There are some limitations in this work. Primarily, the amount of data we used is too small to generalize about the robustness of the LOD tools. As already outlined in different studies [41-43], Semantic Web technologies work well with small datasets but might not be the best option with big datasets. Secondly, our user evaluation was based on a small set of participants and the SUS scale, which has its own limitations, making generalization of the usability assessment result difficult.

For future research we recommend integrating and testing an advanced-level search engine to ensure that LOD-based systems are more usable outside the Semantic Web community. Additionally, implementing and testing a similar system with a big dataset by describing the data more robustly with domain-specific, additional ontological vocabularies, interlinking with more ontologies, and including more visualization options for grouped data is recommended. Moreover, implementation of advanced-level correlation analysis visualization from different sources will make LOD technology more interesting and usable by healthcare professionals.

Conclusions

The system introduced in this article shows that LOD has promising potential for the representation of complex health-related data. This is mainly because LOD represents data in a reusable and interoperable manner that can serve intelligent queries and ultimately support decision-making. However, the development of advanced LOD search engines is necessary to increase its usability.

