Navigating the Semantic Web

The World Wide Web is an ocean of heterogeneous data from a broad diversity of sources. How can we structure these data so that they can be processed by machines and yield new knowledge? Danaï Symeonidou is a member of the Joint Research Unit for Mathematics, Informatics, and Statistics for the Environment and Agronomy (MISTEA) at the INRA centre of Occitanie-Montpellier. Here, she explains the intricacies of the Semantic Web.

Semantic Web cloud. © INRA
By Nicole Ladet, translated by Jessica Pearce
Updated on 06/14/2019
Published on 05/03/2019

What is the relationship between the Semantic Web and the World Wide Web?

Danaï Symeonidou: The Semantic Web is structured, and the information it contains can be interpreted by machines. The goal is to transform the World Wide Web, which currently contains a lot of unstructured information, into the Semantic Web. At present, the World Wide Web’s lack of structure means that its data cannot be processed by machines.
Many people are working to develop the Semantic Web. For example, DBpedia is structuring the data found in Wikipedia. These data are published and accessible via the Linked Open Data (LOD) Cloud (Figure 1). Individuals, including INRA scientists, can contribute to this cloud and thus make their data available online. Each circle within the cloud diagram corresponds to a dataset that is already available via the LOD cloud. I work on data like these, which have already been structured.

Figure 1: A Linked Open Data Cloud diagram. © INRA

How are data structured?

Symeonidou: I work on structured data that come from the Web, notably RDF data. RDF stands for "Resource Description Framework," a descriptive model that allows data to be structured, processed, and analysed by third parties. DBpedia's data have been structured using RDF. In this framework, information is expressed in "triples": groups taking the form subject–predicate–object. Some examples of triples are <PersonNumber32-FirstName-Danai> or <PlantNumber1610-Variety-Gariguette>. When data are structured in this way, machines can process them. Furthermore, there are ontologies underlying this structure.
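
The triple structure can be sketched in miniature. This is not RDF tooling, just plain Python tuples; the subjects and values are the illustrative ones from the interview, and `objects_of` is an invented helper name.

```python
# A minimal sketch of RDF-style triples as (subject, predicate, object) tuples.
# The identifiers below come from the interview's examples.
triples = [
    ("PersonNumber32", "FirstName", "Danai"),
    ("PlantNumber1610", "Variety", "Gariguette"),
]

def objects_of(subject, predicate, graph):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects_of("PlantNumber1610", "Variety", triples))  # ['Gariguette']
```

Because every statement has the same three-part shape, a machine can query any of them with the same generic code.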

What is an ontology?

Symeonidou: You can think of an ontology as a data description tool. It is the abstract information associated with an object. There are two main components: classes (e.g., person, wine, vineyard, bike) and attributes (e.g., aRegion, aColor). A given wine has a color, and it is produced by a certain vineyard. This example is illustrated in Figure 2. The grey ellipses represent ontological classes, and the blue ellipses represent their attributes. On the right-hand side of the figure is an example of data associated with this ontology: ChateauBeaujolaisRed is a red wine made in the Beaujolais region by Château-Morgon. An ontology is a way of representing knowledge that allows us to understand a thematic area, employ consistent terms, and properly assess the data. For example, an ontology may specify that a person can only be married to another person. So, if our data state that a person is married to an animal, there is a problem!

Figure 2: Example of an ontology and its associated data set. © INRA
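
The consistency rule Symeonidou mentions (a person can only be married to another person) can be sketched as a range check on a predicate. The class names, predicate name, and individuals below are all invented for illustration; real ontology languages such as OWL express these constraints far more richly.

```python
# Class assertions for a few individuals (invented examples).
types = {"PersonNo32": "Person", "PersonNo47": "Person", "Rex": "Animal"}

# The ontology says the object of "marriedTo" must be a Person.
range_of = {"marriedTo": "Person"}

def violations(triples):
    """Flag triples whose object does not belong to the predicate's expected class."""
    bad = []
    for s, p, o in triples:
        expected = range_of.get(p)
        if expected and types.get(o) != expected:
            bad.append((s, p, o))
    return bad

data = [("PersonNo32", "marriedTo", "PersonNo47"),
        ("PersonNo47", "marriedTo", "Rex")]
print(violations(data))  # [('PersonNo47', 'marriedTo', 'Rex')]
```

The second triple is flagged: if our data state that a person is married to an animal, there is a problem.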

Does this mean that an ontology is essentially a set of grammatical rules for data?

Symeonidou: Yes, exactly. It is a generic structure that can be used to establish hierarchies and correlations among data. For example, within an ontology, we can define a researcher as a person. If we then take a data set in which there is a researcher, who is designated by <PersonNo32-type-Researcher>, for instance, and the ontology states that researchers are also people, we can therefore deduce that <PersonNo32-type-Person>.
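
The deduction described above can be sketched as a walk up a class hierarchy. The dictionary-based hierarchy and the helper name below are assumptions for illustration only.

```python
# Illustrative class hierarchy: Researcher is a subclass of Person.
subclass_of = {"Researcher": "Person"}

def infer_types(declared):
    """Expand a declared type into itself plus all of its ancestor classes."""
    result = [declared]
    while declared in subclass_of:
        declared = subclass_of[declared]
        result.append(declared)
    return result

# <PersonNo32-type-Researcher> lets us deduce <PersonNo32-type-Person>.
print(infer_types("Researcher"))  # ['Researcher', 'Person']
```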

What does data linking involve?

Symeonidou: When you link data, you are integrating them. To illustrate this concept, let's take a database—which is a large set of data—that describes many people. We could use the INRA web directory, for example. Now imagine that we want to link the data in the directory to the data collected by the French tax authorities and the French national health agency, so that all their services could be connected. Linking data is about finding the connections between pieces of data that reference the same real-world object.

In Figure 3, we can see an oenological example, where two pieces of data refer to the same wine. Discovering that ChâteauBeaujolaisRed and Beaujolais_Red refer to the same wine: that is the process of data linking.

Figure 3: Data linking. © INRA
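
One very crude way to notice that ChâteauBeaujolaisRed and Beaujolais_Red might denote the same wine is to normalise their labels and compare the words they share. This is only a sketch; real linking systems weigh much richer evidence than label similarity, and the function name here is invented.

```python
import re
import unicodedata

def normalise(label):
    """Strip accents and split a CamelCase/underscore label into lowercase words."""
    text = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()
    words = re.findall(r"[A-Z][a-z]+|[a-z]+", text.replace("_", " "))
    return {w.lower() for w in words}

a = normalise("ChâteauBeaujolaisRed")
b = normalise("Beaujolais_Red")
print(a & b)  # {'beaujolais', 'red'} (set order may vary)
```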

It is impossible to link data manually. You need to have an identifier. In our example involving people, social security numbers are a type of identifier: if two pieces of data are associated with the same social security number, they refer to the same person. However, most of the time, we do not know what the identifier is. In my work, I develop algorithms to efficiently discover identifiers. I take a set of characteristics—last name, first name, and birth date, for example—and I can distinguish among all the people in one database and compare them with all the people in another database. By combining people's last names, first names, and addresses, I can also find the same people. These combinations are called "keys."
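
Once a key has been chosen, linking two sources amounts to joining their records on that key. The field names, the sample records, and the `link` helper below are all invented for illustration.

```python
# A candidate key: the combination of attributes assumed to identify a person.
key = ("last_name", "first_name", "birth_date")

def link(records_a, records_b):
    """Pair records from two sources that agree on every key attribute."""
    index = {tuple(r[f] for f in key): r for r in records_a}
    links = []
    for r in records_b:
        k = tuple(r[f] for f in key)
        if k in index:
            links.append((index[k], r))
    return links

# Invented sample data standing in for two separate databases.
directory = [{"last_name": "Martin", "first_name": "Alice",
              "birth_date": "1980-05-12", "source": "directory"}]
taxes = [{"last_name": "Martin", "first_name": "Alice",
          "birth_date": "1980-05-12", "source": "tax"}]

print(len(link(directory, taxes)))  # 1
```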

Can you tell us more about keys?

Symeonidou: Keys are the combinations of attributes that allow us to individually identify each element within a dataset. In a dataset containing five attributes, the total number of possible combinations is 2⁵ − 1 = 31. A combination can be a name by itself; a name and an address; or a name plus an address and a birth date. It is important to try every combination and, using the algorithms we have developed, identify the most useful one. We work on the Web, and we are looking at enormous quantities of heterogeneous data whose features make analysis difficult. Using computational tools, we can rapidly perform calculations that could not be done by hand.
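
The count of 2⁵ − 1 = 31 candidate keys is just the number of non-empty subsets of five attributes, which is easy to enumerate. The attribute names below are invented examples.

```python
from itertools import combinations

# Five illustrative attributes of a person record.
attributes = ["last_name", "first_name", "address", "birth_date", "phone"]

# Every non-empty subset of the attributes is a candidate key.
candidates = [c for r in range(1, len(attributes) + 1)
              for c in combinations(attributes, r)]

print(len(candidates))  # 31, i.e. 2**5 - 1
```

Because the number of candidates doubles with each added attribute, exhaustively testing them on Web-scale data is exactly the kind of calculation that cannot be done by hand.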

What is the difference between symbolic data and numerical data?

Symeonidou: Symbolic data are text based, while numerical data are number based. They are analysed differently. For example, if we measured the pH of several substances, we can say that two substances with pHs of 3.47 and 3.49, respectively, are more similar to each other in pH than they are to a substance with a pH of 9.2. In contrast, an apple variety named "Royal Gala" is not necessarily more similar to an apple variety named "King of the Pippins" than to an apple variety named "Golden Delicious"! This example shows us that, in the case of symbolic data, letter order tells us nothing about similarity.
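
The contrast can be made concrete: numerical values come with a natural distance, while symbolic labels do not. The helper name is invented; the pH values are the ones from the example above.

```python
def ph_distance(a, b):
    """Absolute difference between two pH measurements."""
    return abs(a - b)

print(round(ph_distance(3.47, 3.49), 2))  # 0.02 -> very similar
print(round(ph_distance(3.47, 9.2), 2))   # 5.73 -> very different

# By contrast, no such subtraction exists for the labels "Royal Gala",
# "King of the Pippins", and "Golden Delicious": alphabetical closeness
# says nothing about how similar the apple varieties actually are.
```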

What is a function?

Symeonidou: For example, surface area equals length times width is a function. There are more complicated functions, such as those involving exponents. At INRA, I frequently work with numerical data, and I must determine if there are functions that describe the data I have at hand. To get more experience, I went to the Insight Centre for Data Analytics in Cork, where there are data analysis specialists. In particular, I worked with an expert in "genetic" algorithms.

What is a "genetic" algorithm?

Symeonidou: To use an analogy, it's like what we sometimes see in nature, where there are hybrid offspring that display a combination of their parents’ best traits. Our goal is to compare and contrast a variety of "hybrids" to obtain those of the highest quality. We decide which ones to keep and which ones to reject based on our scoring system. In the beginning, we try random functions, like 3 times length + 4 times width = surface area, and we see how well they fit the data. I try about a dozen random functions, and I calculate a score for each. I keep the top three and then try to improve them. It is like each function is a chromosome, where all the chromosomes taken together form a population. I can take a few elements, "genes," from each function/chromosome, and I combine them to see if I can obtain a better score.

You might be wondering, “What is the advantage of this approach?” Well, we use this technique because we could never explore all the functions possible. "Genetic" algorithms let us assess some of them and steer us towards those we should pursue. We retain the best functions from a given cycle, and then we try to make them even better. As soon as we reach a certain score threshold, we can decide to stop or to keep going.
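
The cycle described above (random population, scoring, keeping the top three, crossover, mutation) can be sketched in miniature. This is not Symeonidou's actual algorithm: the synthetic data, population size, mutation rate, and scoring function below are all assumptions chosen so the example fits the interview's surface-area illustration, where the true function is 3 × length + 4 × width.

```python
import random

random.seed(0)

# Synthetic observations generated from the "true" function 3*l + 4*w.
data = [(l, w, 3 * l + 4 * w) for l in range(1, 6) for w in range(1, 6)]

def score(chrom):
    """Negative squared error of a*l + b*w against the data: higher is better."""
    a, b = chrom
    return -sum((a * l + b * w - y) ** 2 for l, w, y in data)

def evolve(generations=200, pop_size=12, keep=3):
    # Start from about a dozen random functions (chromosomes).
    pop = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        best = pop[:keep]                      # keep the top three
        children = []
        while len(children) < pop_size - keep:
            p1, p2 = random.sample(best, 2)
            child = (p1[0], p2[1])             # crossover: recombine "genes"
            if random.random() < 0.7:          # mutation: small random nudge
                child = (child[0] + random.gauss(0, 0.2),
                         child[1] + random.gauss(0, 0.2))
            children.append(child)
        pop = best + children
    return max(pop, key=score)

a, b = evolve()
print(round(a, 1), round(b, 1))  # typically close to 3.0 and 4.0
```

Only a tiny fraction of all possible coefficient pairs is ever evaluated, yet selection steers the population towards good fits, which is exactly the appeal of the approach.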

What are the applications of this work?

Symeonidou: We are currently looking at agricultural data collected by INRA's Sciences for Oenology Joint Research Unit (SPO). They are simulation data, and we want to be able to assess if they are valid. I have also worked on numerical data describing the characteristics of wine, but those came from outside of INRA. Additionally, we are exploring ways in which our work can be applied to real-life experimental data.

Contact(s)
Associated Division(s):
Applied Mathematics and Informatics
Associated Centre(s):
Occitanie-Montpellier