## June 29 2015

19:36

### Rethink Big and Europe's Position in Big Data

I will here take a break from core database and talk a bit about EU policies for research funding.

I had lunch with Stefan Manegold of CWI last week, where we talked about where European research should go. Stefan is involved in RETHINK big, a European research project for compiling policy advice regarding big data for EC funding agencies. As part of this, he is interviewing various stakeholders such as end user organizations and developers of technology.

RETHINK big wants to come up with a research agenda primarily for hardware, anything from faster networks to greener data centers. CWI represents software expertise in the consortium.

So, we went through a regular questionnaire about how we see the landscape. I will summarize this below, since it is informative in any case.

### Core competence

My own core competence is in core database functionality, specifically in high performance query processing, scale-out, and managing schema-less data. Most of the Virtuoso installed base is in the RDF space, but most potential applications are in fact outside of this niche.

### User challenges

The life sciences vertical is the one in which I have the most application insight, from going to Open PHACTS meetings and holding extensive conversations with domain specialists. We have users in many other verticals, from manufacturing to financial services, but there I do not have as much exposure to the actual applications.

Having said this, the challenges throughout tend to be in diversity of data. Every researcher has their MySQL database or spreadsheet, and there may not even be a top level catalogue of everything. Data formats are diverse. Some people use linked data (most commonly RDF) as a top level metadata format. The application data, such as gene sequences or microarray assays, reside in their native file formats and there is little point in RDF-izing these.

There are also public data resources that are published in RDF serializations as vendor-neutral, self-describing format. Having everything as triples, without a priori schema, makes things easier to integrate and in some cases easier to describe and query.

So, the challenge is in the labor-intensive nature of data integration. Data comes with different levels of quantity and quality, from hand-curated to NLP extractions. Querying in the single- or double-digit terabyte range with RDF is quite possible, as we have shown many times on this blog, but most use cases do not even go that far. Anyway, what we see in the field is primarily a data diversity game. The scenario is data integration; the technology we provide is a database. The data transformation proper, data cleansing, units of measure, entity de-duplication, and similar core data-integration functions are performed using diverse, user-specific means.

Jerven Bolleman of the Swiss Institute of Bioinformatics is a user of ours with whom we have long standing discussions on the virtues of federated data and querying. I advised Stefan to go talk to him; he has fresh views about the volume challenges with unexpected usage patterns. Designing for performance is tough if the usage pattern is out of the blue, like correlating air humidity on the day of measurement with the presence of some genomic patterns. Building a warehouse just for that might not be the preferred choice, so the problem field is not exhausted. Generally, I’d go for warehousing though.

### What technology would you like to have? Network or power efficiency?

OK. Even a fast network is a network. A set of processes on a single shared-memory box is also a kind of network. InfiniBand is maybe half the throughput and 3x the latency of single threaded interprocess communication within one box. The operative word is latency. Making large systems always involves a network or something very much like one in large scale-up scenarios.

On the software side, next to nobody understands latency and contention; yet these are the core factors in any pursuit of scalability. Because of this, paradigms like MapReduce and bulk synchronous parallel (BSP) processing have become popular: they take the communication out of the program flow, so the programmer cannot muck this up, as otherwise would happen with the inevitability of destiny. Of course, our beloved SQL, or declarative query in general, does give scalability in many tasks without programmer participation. Datalog has also been used as a means of shipping computation around, as in the work of Hellerstein.

There are no easy solutions. We have built scale-out conscious, vectorized extensions to SQL procedures where one can express complex parallel, distributed flows, but people do not use or understand these. These are very useful, even indispensable, but only on the inside, not as a programmer-facing construct. MapReduce and BSP are the limit of what a development culture will absorb. MapReduce and BSP do not hide the fact of distributed processing. What about things that do? Parallel, partitioned extensions to Fortran arrays? Functional languages? I think that all the obvious aids to parallel/distributed programming have been conceived of. No silver bullet; just hard work. And above all the discernment of what paradigm fits what problem. Since these are always changing, there is no finite set of rules, and no substitute for understanding and insight, and the latter are vanishingly scarce. "Paradigmatism," i.e., the belief that one particular programming model is a panacea outside of its original niche, is a common source of complexity and inefficiency. This is a common form of enthusiastic naïveté.

If you look at power efficiency, the clusters that are the easiest to program consist of relatively few high power machines and a fast network. A typical node size is 16+ cores and 256G or more RAM. Amazon has these in entirely workable configurations, as documented earlier on this blog. The leading edge in power efficiency is in larger number of smaller units, which makes life again harder. This exacerbates latency and forces one to partition the data more often, whereas one can play with replication of key parts of data more freely if the node size is larger.

One very specific item where research might help without having to rebuild the hardware stack would be better, lower-latency exposure of networks to software. Lightweight threads and user-space access, bypassing slow protocol stacks, etc. MPI has some of this, but maybe more could be done.

So, I will take a cluster of such 16-core, 256GB machines on a faster network, over a cluster of 1024 x 4G mobile phones connected via USB. Very selfish and unecological, but one has to stay alive and life is tough enough as is.

### Are there pressures to adapt business models based on big data?

The transition from capex to opex may be approaching maturity, as there have been workable cloud configurations for the past couple of years. The EC2 from way back, with at best a 4 core 16G VM and a horrible network for $2/hr, is long gone. It remains the case that 4 months of 24x7 rent in the cloud equals the purchase price of physical hardware. So, for this to be economical long-term at scale, the average utilization should be about 10% of the peak, and peaks should not be on for more than 10% of the time. So, database software should be rented by the hour. A 100-150% markup on the $2.80 per hour that a large EC2 instance costs would be reasonable. Consider that 70% of the cost in TPC benchmarks is database software.
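The rent-versus-buy reasoning above can be made concrete with a little arithmetic. The following is a minimal sketch, not a pricing model: the 36-month hardware lifetime is an assumption introduced here for illustration, and the function names are hypothetical. Only the 4-month rent equivalence, the $2.80 hourly price, and the 100-150% markup come from the text.

```python
def break_even_utilization(rent_months_equal_to_purchase, hardware_lifetime_months):
    """Fraction of the hardware lifetime one can rent 24x7 before buying is cheaper."""
    return rent_months_equal_to_purchase / hardware_lifetime_months


def hourly_price_with_software(instance_price_per_hour, software_markup):
    """Instance price plus database software priced as a markup on the instance."""
    return instance_price_per_hour * (1.0 + software_markup)


# From the text: 4 months of 24x7 rent equals the purchase price.
# Assumption (not from the text): the purchased hardware is used for 36 months.
utilization = break_even_utilization(4, 36)
print(f"renting wins below {utilization:.0%} average utilization")

# From the text: $2.80/hour instance, 100% to 150% markup for the database software.
for markup in (1.0, 1.5):
    total = hourly_price_with_software(2.80, markup)
    print(f"markup {markup:.0%}: ${total:.2f} per hour in total")
```

Under the assumed 36-month lifetime the break-even lands at roughly 11% utilization, which is in line with the 10% rule of thumb; a shorter or longer lifetime shifts the figure accordingly.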

There will be different pricing models combining different up-front and per-usage costs, just as there are for clouds now. If the platform business goes that way and the market accepts this, then systems software will follow. Price/performance quotes should probably be expressed as speed/price/hour instead of speed/price.

The above is rather uncontroversial but there is no harm restating these facts. Reinforce often.

### Well, the question is raised, what should Europe do that would have tangible impact in the next 5 years?

This is a harder question. There is some European business in wide area and mobile infrastructures. Competing against Huawei will keep them busy. Intel and Mellanox will continue making faster networks regardless of European policies. Intel will continue building denser compute nodes, e.g., integrated Knight’s Corner with dual IB network and 16G fast RAM on chip. Clouds will continue making these available on demand once the technology is in mass production.

What’s the next big innovation? Neuromorphic computing? Quantum computing? Maybe. For now, I’d just do more engineering along the core competence discussed above, with emphasis on good marketing and scalable execution. By this I mean trained people who know something about deployment. There is a huge training gap. In the would-be "Age of Data," knowledge of how things actually work and scale is near-absent. I have offered to do some courses on this to partners and public alike, but I need somebody to drive this show; I have other things to do.

I have been to many, many project review meetings, mostly as a project partner but also as reviewer. For the past year, the EC has used an innovation questionnaire at the end of the meetings. It is quite vague, and I don’t think it delivers much actionable intelligence.

What would deliver this would be a venture capital type activity, with well-developed networks and active participation in developing a business. The EC is not now set up to perform this role, though. But the EC is a fairly large and wealthy entity, so it could invest some money via this type of channel. Also there should be higher individual incentives and rewards for speed and excellence. Getting the next Horizon 2020 research grant may be good, but better exists. The grants are competitive enough and the calls are not bad; they follow the times.

In the projects I have seen, productization does get some attention, e.g., the LOD2 stack, but it is not something that is really ongoing or with dedicated commercial backing. It may also be that there is no market to justify such dedicated backing. Much of the RDF work has been "me, too" — let’s do what the real database and data integration people do, but let’s just do this with triples. Innovation? Well, I took the best of the real DB world and adapted this to RDF, which did produce a competent piece of work with broad applicability, extending outside RDF. Is there better than this? Well, some of the data integration work (e.g., LIMES) is not bad, and it might be picked up by some of the players that do this sort of thing in the broader world, e.g., Informatica, the DI suites of big DB vendors, Tamr, etc. I would not know if this in fact adds value to the non-RDF equivalents; I do not know the field well enough, but there could be a possibility.

The recent emphasis on benchmarking, spearheaded by Stefano Bertolo, is good, as exemplified by the LDBC FP7 project. There should probably be one or two projects of this sort going at all times. These make challenges known and are an effective means of guiding research, with a large multiplier: Once a benchmark gets adopted, far more work goes into solving the problem than went into stating it in the first place.

The aims and calls are good. The execution by projects is variable. For 1% of excellence, there apparently must be 99% of so-so, but this is just a fact of life and not specific to this context. The projects are rather diffuse. There is not a single outcome that gets all the effort. In this, the level of engagement of participants is lower and the focus is much more scattered than in startups. A really hungry, go-getter mood is mostly absent. I am a believer in core competence. Well, most people will agree that core competence is nice. But the projects I have seen do not drive for it hard enough.

It is hard to say exactly what kinds of incentives could be offered to encourage truly exceptional work. The American startup scene does offer high rewards and something of this could be transplanted into the EC project world. I would not know exactly what form this could take, though.

## June 28 2015

09:08

### Improved Customer Experience by use of Semantic Web and Linked Data technologies

With the rise of Linked Data technologies, several new approaches come into play for improving customer experience across all digital channels of a company. All of these methodologies can be subsumed under the term “the connected customer”.

These are interesting not only for retailers operating a web shop, but also for enterprises seeking new ways to develop tailor-made customer services and to increase customer retention.

Linked Data methodologies can help to improve several metrics along a typical customer experience lifecycle.

2. Cross-selling through a better contextualization of product information
3. Semantically enhanced help desk, user forums and self service platforms
4. Better ways to understand and interpret a customer intention by use of enterprise vocabularies
5. More dynamic management of complex multi-channel websites through a better cost-effectiveness
6. More precise methods for data analytics, e.g. to allow marketers to better target campaigns and content to the user’s preferences
7. Enhanced search experience at aggregators like Google through the use of microdata and schema.org

At the center of this approach, knowledge graphs work like a ‘linking machine’. Based on standards-based semantic models, business entities get linked in a highly dynamic way. Those graphs go beyond the power of social graphs. While social graphs are focused on people only, knowledge graphs connect all kinds of relevant business objects to each other.

When customers and their behaviours are represented in a knowledge model, Linked Data technologies try to preserve as much semantics as possible. By these means they are able to complement other approaches for big data analytics, which tend to flatten out the data model behind business entities.

## June 26 2015

13:15

### Using SPARQL clause VALUES in PoolParty

Since PoolParty fully supports SPARQL 1.1, you can use clauses like VALUES. The VALUES clause provides an unordered solution sequence that is joined with the results of the query evaluation. From my perspective, it is a convenient way to filter on variables and it makes queries more readable.

E.g. when you want to know which cocktails you can create with Gin and a highball glass you can go to http://vocabulary.semantic-web.at/PoolParty/sparql/cocktails and fire this query:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
  FILTER (?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en )
}

When you want to filter on additional pairs of ingredients and drinkware in combination, the query gets quite clumsy. Wrongly placed parentheses or braces can break the syntax. In addition, complicated queries invite mistakes, e.g. mixing up boolean operators, which leads to wrong results…

...
FILTER ((?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en ) ||
     (?ingredientLabel = "Vodka"@en && ?drinkwareLabel ="Old Fashioned glass"@en ))
}

Using VALUES can help in this situation. For example this query shows you how to filter both pairs Gin+Highball glass and Vodka+Old Fashioned glass in a neat way:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX co: <http://vocabulary.semantic-web.at/cocktail-ontology/>
SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
}
VALUES ( ?ingredientLabel ?drinkwareLabel )
{
  ("Gin"@en "Highball glass"@en)
  ("Vodka"@en "Old Fashioned glass"@en)
}

Especially when you create SPARQL code automatically, e.g. generated by a form, this clause can be very useful.
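As an illustration of the form-driven case, here is a minimal sketch in Python that builds such a VALUES block from a list of label pairs. The function names and the sample input are hypothetical and not part of PoolParty; the sketch only escapes backslashes and double quotes, so treat it as a starting point, not a complete literal serializer.

```python
def sparql_literal(text, lang="en"):
    """Render a plain string as a language-tagged SPARQL literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def values_clause(variables, rows, lang="en"):
    """Build a VALUES block binding the given variables to each row of strings."""
    header = " ".join(f"?{name}" for name in variables)
    lines = [f"VALUES ( {header} )", "{"]
    for row in rows:
        if len(row) != len(variables):
            raise ValueError("each row needs one value per variable")
        lines.append("  (" + " ".join(sparql_literal(v, lang) for v in row) + ")")
    lines.append("}")
    return "\n".join(lines)


# Pairs as they might arrive from a form submission (hypothetical input).
pairs = [("Gin", "Highball glass"), ("Vodka", "Old Fashioned glass")]
print(values_clause(["ingredientLabel", "drinkwareLabel"], pairs))
```

The generated block can be appended after the closing brace of the WHERE clause, as in the query above. Adding another combination then means adding one more pair to the list, with no boolean operators to get wrong.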

## June 16 2015

21:53

### Virtuoso Elastic Cluster Benchmarks AMI on Amazon EC2

We have another new Amazon machine image, this time for deploying your own Virtuoso Elastic Cluster on the cloud. The previous post gave a summary of running TPC-H on this image. This post is about what the AMI consists of and how to set it up.

Note: This AMI is running a pre-release build of Virtuoso 7.5, Commercial Edition. Features are subject to change, and this build is not licensed for any use other than the AMI-based benchmarking described herein.

There are two preconfigured cluster setups; one is for two (2) machines/instances and one is for four (4). Generation and loading of TPC-H data, as well as the benchmark run itself, is preconfigured, so you can do it by entering just a few commands. The whole sequence of doing a terabyte (1000G) scale TPC-H takes under two hours, with 30 minutes to generate the data, 35 minutes to load, and 35 minutes to do three benchmark runs. The 100G scale is several times faster still.

To experiment with this AMI, you will need a set of license files, one per machine/instance, which our Sales Team can provide.

Detailed instructions are on the AMI, in /home/ec2-user/cluster_instructions.txt, but the basic steps to get up and running are as follows:

1. Instantiate machine image ami-811becea (AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs"; this one is short-named virtuoso-bench-cl) with two or four (2 or 4) R3.8xlarge instances within one virtual private cloud (VPC) and placement group. Make sure the VPC security is set to allow all connections.

2. Log in to the first, and fill in the configuration file with the internal IP addresses of all machines instantiated in step 1.

3. Distribute the license files to the instances, and start the OpenLink License Manager on each machine.

4. Run 3 shell commands to set up the file systems and the Virtuoso configuration files.

5. If you do not plan to run one of these benchmarks, you can simply start and work with the Virtuoso cluster now. It is ready for use with an empty database.

6. Before running one of these benchmarks, generate the appropriate dataset with the dbgen.sh command.

7. Bulk load the data with load.sh.

8. Run the benchmark with run.sh.

Right now the cluster benchmarks are limited to TPC-H but cluster versions of the LDBC Social Network and Semantic Publishing benchmarks will follow soon.

## June 10 2015

16:03

### In Hoc Signo Vinces (part 21 of n): Running TPC-H on Virtuoso Elastic Cluster on Amazon EC2

We have made an Amazon EC2 deployment of Virtuoso 7 Commercial Edition, configured to use the Elastic Cluster Module with TPC-H preconfigured, similar to the recently published OpenLink Virtuoso Benchmark AMI running the Open Source Edition. The details of the new Elastic Cluster AMI and steps to use it will be published in a forthcoming post. Here we will simply look at results of running TPC-H 100G scale on two machines, and 1000G scale on four machines. This shows how Virtuoso provides great performance on a cloud platform. The extremely fast bulk load — 33 minutes for a terabyte! — means that you can get straight to work even with on-demand infrastructure.

In the following, the Amazon instance type is R3.8xlarge, each with dual Xeon E5-2670 v2, 244G RAM, and 2 x 300G SSD. The image is made from the Amazon Linux with built-in network optimization. We first tried a RedHat image without network optimization and had considerable trouble with the interconnect. Using network-optimized Amazon Linux images inside a virtual private cloud has resolved all these problems.

The network optimized 10GE interconnect at Amazon offers throughput close to the QDR InfiniBand running TCP-IP; thus the Amazon platform is suitable for running cluster databases. The execution that we have seen is not seriously network bound.

### 100G on 2 machines, with a total of 32 cores, 64 threads, 488 GB RAM, 4 x 300 GB SSD

Load time: 3m 52s

| Run | Power | Throughput | Composite |
|---|---|---|---|
| 1 | 523,554.3 | 590,692.6 | 556,111.2 |
| 2 | 565,353.3 | 642,503.0 | 602,694.9 |

### 1000G on 4 machines, with a total of 64 cores, 128 threads, 976 GB RAM, 8 x 300 GB SSD

Load time: 32m 47s

| Run | Power | Throughput | Composite |
|---|---|---|---|
| 1 | 592,013.9 | 754,107.6 | 668,163.3 |
| 2 | 896,564.1 | 828,265.4 | 861,738.4 |
| 3 | 883,736.9 | 829,609.0 | 856,245.3 |

For the larger scale we did 3 sets of power + throughput tests to measure consistency of performance. By the TPC-H rules, the worst (first) score should be reported. Even after bulk load, this is markedly less than the next power score due to working set effects. This is seen to a lesser degree with the first throughput score also.
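For readers checking the figures: TPC-H defines the composite metric as the geometric mean of the power and throughput scores. The short sketch below recomputes the composite column from the other two, using the values reported above. The function name is just an illustrative choice.

```python
import math


def composite(power, throughput):
    """Geometric mean of the power and throughput scores."""
    return math.sqrt(power * throughput)


# (scale, run, power, throughput, reported composite) copied from the results above.
runs = [
    ("100G", 1, 523554.3, 590692.6, 556111.2),
    ("100G", 2, 565353.3, 642503.0, 602694.9),
    ("1000G", 1, 592013.9, 754107.6, 668163.3),
    ("1000G", 2, 896564.1, 828265.4, 861738.4),
    ("1000G", 3, 883736.9, 829609.0, 856245.3),
]

for scale, run, power, throughput, reported in runs:
    computed = composite(power, throughput)
    print(f"{scale} run {run}: computed {computed:,.1f}, reported {reported:,.1f}")
```

Because the composite is a geometric mean, the low first power score at 1000G pulls the reported result down noticeably, even though the throughput score of that run is close to the later ones.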

The numerical quantity summaries are available in a report.zip file, or individually:

• report-100-1.txt 
• report-100-2.txt 
• report-1000-1.txt 
• report-1000-2.txt 
• report-1000-3.txt 

Subsequent posts will explain how to deploy Virtuoso Elastic Clusters on AWS.

## June 09 2015

15:51

### Introducing the OpenLink Virtuoso Benchmarks AMI on Amazon EC2

The OpenLink Virtuoso Benchmarks AMI is an Amazon EC2 machine image with the latest Virtuoso open source technology preconfigured to run —

• TPC-H, the classic of SQL data warehousing

• LDBC SNB, the new Social Network Benchmark from the Linked Data Benchmark Council

• LDBC SPB, the RDF/SPARQL Semantic Publishing Benchmark from LDBC

This package is ideal for technology evaluators and developers interested in getting the most performance out of Virtuoso. This is also an all-in-one solution to any questions about reproducing claimed benchmark results. All necessary tools for building and running are included; thus any developer can use this model installation as a starting point. The benchmark drivers are preconfigured with appropriate settings, and benchmark qualification tests can be run with a single command.

The Benchmarks AMI includes a precompiled, preconfigured checkout of the v7fasttrack github repository, checkouts of the github repositories of the benchmarks, and a number of running directories with all configuration files preset and optimized. The image is intended to be instantiated on a R3.8xlarge Amazon instance with 244G RAM, dual Xeon E5-2670 v2, and 600G SSD.

Benchmark datasets and preloaded database files can be downloaded from S3 when large, and generated as needed on the instance when small. As an alternative, the instance is also set up to do all phases of data generation and database bulk load.

The following benchmark setups are included:

• TPC-H 100G
• TPC-H 300G
• LDBC SNB Validation
• LDBC SNB Interactive 100G
• LDBC SNB Interactive 300G (SF3)
• LDBC SPB Validation
• LDBC SPB Basic 256 Mtriples (SF5)
• LDBC SPB Basic 1 Gtriple

The AMI will be expanded as new benchmarks are introduced, for example, the LDBC Social Network Business Intelligence or Graph Analytics.

To get started:

1. Instantiate machine image ami-5304ef38 (AMI ID is subject to change; you should be able to find the latest by searching for "OpenLink Virtuoso Benchmarks" in "Community AMIs") with a R3.8xlarge instance.

2. Connect via ssh.

3. See the README (also found in the ec2-user's home directory) for full instructions on getting up and running.

15:24

### SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso

In this post we will look at running the LDBC SNB on Virtuoso.

First, let's recap what the benchmark is about:

1. fairly frequent short updates, with no update contention worth mentioning
2. short random lookups
3. medium complex queries centered around a person's social environment

The updates exist so as to invalidate strategies that rely too heavily on precomputation. The short lookups exist for the sake of realism; after all, an online social application does lookups for the most part. The medium complex queries are to challenge the DBMS.

The DBMS challenges have to do firstly with query optimization, and secondly with execution with a lot of non-local random access patterns. Query optimization is not a requirement, per se, since imperative implementations are allowed, but we will see that these are no more free of the laws of nature than the declarative ones.

The workload is arbitrarily parallel, so intra-query parallelization is not particularly useful, if also not harmful. There are latency constraints on operations which strongly encourage implementations to stay within a predictable time envelope regardless of specific query parameters. The parameters are a combination of person and date range, and sometimes tags or countries. The hardest queries have the potential to access all content created by people within 2 steps of a central person, so possibly thousands of people, times 2000 posts per person, times up to 4 tags per post. We are talking in the millions of key lookups, aiming for sub-second single-threaded execution.

The test system is the same as used in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The software is the feature/analytics branch of v7fasttrack, available from www.github.com.

The dataset is the SNB 300G set, with:

• 1,136,127 persons
• 125,249,604 knows edges
• 847,886,644 posts, including replies
• 1,145,893,841 tags of posts or replies
• 1,140,226,235 likes of posts or replies

As an initial step, we run the benchmark as fast as it will go. We use 32 threads on the driver side for 24 hardware threads.

Below are the numerical quantities for a 400K operation run after 150K operations worth of warmup.

• Duration: 10:41.251
• Throughput: 623.71 (op/s)

The statistics that matter are detailed below, with operations ranked in order of descending client-side wait-time. All times are in milliseconds.

| % of total | total_wait | name | count | mean | min | max |
|---|---|---|---|---|---|---|
| 20 % | 4,231,130 | LdbcQuery5 | 656 | 6,449.89 | 245 | 10,311 |
| 11 % | 2,272,954 | LdbcQuery8 | 18,354 | 123.84 | 14 | 2,240 |
| 10 % | 2,200,718 | LdbcQuery3 | 388 | 5,671.95 | 468 | 17,368 |
| 7.3 % | 1,561,382 | LdbcQuery14 | 1,124 | 1,389.13 | 4 | 5,724 |
| 6.7 % | 1,441,575 | LdbcQuery12 | 1,252 | 1,151.42 | 15 | 3,273 |
| 6.5 % | 1,396,932 | LdbcQuery10 | 1,252 | 1,115.76 | 13 | 4,743 |
| 5 % | 1,064,457 | LdbcShortQuery3PersonFriends | 46,285 | 22.9979 | 0 | 2,287 |
| 4.9 % | 1,047,536 | LdbcShortQuery2PersonPosts | 46,285 | 22.6323 | 0 | 2,156 |
| 4.1 % | 885,102 | LdbcQuery6 | 1,721 | 514.295 | 8 | 5,227 |
| 3.3 % | 707,901 | LdbcQuery1 | 2,117 | 334.389 | 28 | 3,467 |
| 2.4 % | 521,738 | LdbcQuery4 | 1,530 | 341.005 | 49 | 2,774 |
| 2.1 % | 440,197 | LdbcShortQuery4MessageContent | 46,302 | 9.50708 | 0 | 2,015 |
| 1.9 % | 407,450 | LdbcUpdate5AddForumMembership | 14,338 | 28.4175 | 0 | 2,008 |
| 1.9 % | 405,243 | LdbcShortQuery7MessageReplies | 46,302 | 8.75217 | 0 | 2,112 |
| 1.9 % | 404,002 | LdbcShortQuery6MessageForum | 46,302 | 8.72537 | 0 | 1,968 |
| 1.8 % | 387,044 | LdbcUpdate3AddCommentLike | 12,659 | 30.5746 | 0 | 2,060 |
| 1.7 % | 361,290 | LdbcShortQuery1PersonProfile | 46,285 | 7.80577 | 0 | 2,015 |
| 1.6 % | 334,409 | LdbcShortQuery5MessageCreator | 46,302 | 7.22234 | 0 | 2,055 |
| 1 % | 220,740 | LdbcQuery2 | 1,488 | 148.347 | 2 | 2,504 |
| 0.96 % | 205,910 | LdbcQuery7 | 1,721 | 119.646 | 11 | 2,295 |
| 0.93 % | 198,971 | LdbcUpdate2AddPostLike | 5,974 | 33.3062 | 0 | 1,987 |
| 0.88 % | 189,871 | LdbcQuery11 | 2,294 | 82.7685 | 4 | 2,219 |
| 0.85 % | 182,964 | LdbcQuery13 | 2,898 | 63.1346 | 1 | 2,201 |
| 0.74 % | 158,188 | LdbcQuery9 | 78 | 2,028.05 | 1,108 | 4,183 |
| 0.67 % | 143,457 | LdbcUpdate7AddComment | 3,986 | 35.9902 | 1 | 1,912 |
| 0.26 % | 54,947 | LdbcUpdate8AddFriendship | 571 | 96.2294 | 1 | 988 |
| 0.2 % | 43,451 | LdbcUpdate6AddPost | 1,386 | 31.3499 | 1 | 2,060 |
| 0.0086 % | 1,848 | LdbcUpdate4AddForum | 103 | 17.9417 | 1 | 65 |
| 0.0002 % | 44 | LdbcUpdate1AddPerson | 2 | 22 | 10 | 34 |

At this point we have in-depth knowledge of the choke points the benchmark stresses, and we can give a first assessment of whether the design meets its objectives for setting an agenda for the coming years of graph database development.

The implementation is well optimized in general but still has maybe 30% room for improvement. We note that this is based on a compressed column store. One could think that alternative data representations, like in-memory graphs of structs and pointers between them, are better for the task. This is not necessarily so; at the least, a compressed column store is much more space efficient. Space efficiency is the root of cost efficiency, since as soon as the working set is not in memory, a random access workload is badly hit.

The set of choke points (technical challenges) actually revealed by the benchmark is so far as follows:

• Cardinality estimation under heavy data skew — Many queries take a tag or a country as a parameter. The cardinalities associated with tags vary from 29M posts for the most common to 1 for the least common. Q6 has a common tag (in top few hundred) half the time and a random, most often very infrequent, one the rest of the time. A declarative implementation must recognize the cardinality implications from the literal and plan accordingly. An imperative one would have to count. Missing this makes Q6 take about 40% of the time instead of 4.1% when adapting.

• Covering indices — Being able to make multi-column indices that duplicate some columns from the table often saves an entire table lookup. For example, an index on post by author can also contain the post's creation date.

• Multi-hop graph traversal — Most queries access a two-hop environment starting at a person. Two queries look for shortest paths of unbounded length. For the two-hop case, it makes almost no difference whether this is done as a union or a special graph traversal operator. For shortest paths, this simply must be built into the engine; doing this client-side incurs prohibitive overheads. A bidirectional shortest path operation is a requirement for the benchmark.

• Top K — Most queries returning posts order results by descending date. Once there are at least k results, anything older than the kth can be dropped, adding a date selection as early as possible in the query. This interacts with vectored execution, so that starting with a short vector size more rapidly produces an initial top k.

• Late projection — Many queries access several columns and touch millions of rows but only return a few. The columns that are not used in sorting or selection can be retrieved only for the rows that are actually returned. This is especially useful with a column store, as this removes many large columns (e.g., text of a post) from the working set.

• Materialization — Q14 accesses an expensive-to-compute edge weight, the number of post-reply pairs between two people. Keeping this precomputed drops Q14 from the top place. Other materialization would be possible, for example Q2 (top 20 posts by friends), but since Q2 is just 1% of the load, there is no need. One could of course argue that this should be 20x more frequent, in which case there could be a point to this.

• Concurrency control — Read-write contention is rare, as updates are randomly spread over the database. However, some pages get read very frequently, e.g., some middle level index pages in the post table. Keeping a count of reading threads requires a mutex, and there is significant contention on this. Since the hot set can be one page, adding more mutexes does not always help. However, hash partitioning the index into many independent trees (as in the case of a cluster) helps for this. There is also contention on a mutex for assigning threads to client requests, as there are large numbers of short operations.
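The Top K choke point in the list above can be illustrated outside the engine. This is a minimal sketch, assuming posts arrive as (creation_date, post_id) pairs from some scan; the function name and the sample data are hypothetical, and this is not a description of how Virtuoso implements it. The sketch keeps a min-heap of the k newest posts and, once the heap is full, uses the date of the kth newest post as a cutoff so that older rows are rejected with a single comparison.

```python
import heapq


def newest_k(posts, k):
    """Return the k newest (creation_date, post_id) pairs, newest first,
    plus the number of rows rejected by the date cutoff."""
    if k <= 0:
        return [], 0
    heap = []  # min-heap: heap[0] is the oldest post currently kept
    skipped = 0
    for post in posts:
        if len(heap) < k:
            heapq.heappush(heap, post)
        elif post[0] <= heap[0][0]:
            # Not newer than the current kth result: the date cutoff rejects it early.
            skipped += 1
        else:
            heapq.heapreplace(heap, post)
    return sorted(heap, reverse=True), skipped


# Hypothetical sample: ISO dates compare correctly as strings.
sample = [
    ("2012-03-01", 11), ("2012-07-15", 12), ("2011-12-24", 13),
    ("2012-09-30", 14), ("2010-05-05", 15), ("2012-08-08", 16),
]
top, skipped = newest_k(sample, 3)
print(top, skipped)
```

In a query engine the same cutoff would be pushed down as a date restriction on the scan itself, so the rejected rows never reach later joins. That is also why a short initial vector helps: the first top k, and with it the first usable cutoff, is available sooner.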

In subsequent posts, we will look at specific queries, what they in fact do, and what their theoretical performance limits would be. In this way we will have a precise understanding of which way SNB can steer the graph DB community.

## June082015

00:12

### Some introductory presentations for CKAN

Reposted from the CKAN Association LinkedIn group. Feel free to join if you use LinkedIn.

Thanks to Augusto Herrmann Batista and OK Brazil for allowing the following repost:

I recently presented a couple of “lightning courses” to introduce an audience to CKAN.

One was at the Linked Open Data Brasil conference in Florianópolis, Brazil, in November 2014. It is in Portuguese.

The other one was presented at the IV Moscow Urban Forum, in Russia, in December 2014. This one is in English.

Feel free to share and reuse, as they are CC-BY.

## June032015

16:51

### The Virtuoso Science Library

There is a lot of scientific material on Virtuoso, but it has not been presented all together in any one place. So I am making here a compilation of the best resources with a paragraph of introduction on each. Some of these are project deliverables from projects under the EU FP7 programme; some are peer-reviewed publications.

For the future, an updated version of this list may be found on the main Virtuoso site.

## European Project Deliverables

• GeoKnow D2.6.1: Graph Analytics in the DBMS (2015-01-05)

This introduces the idea of unbundling basic cluster DBMS functionality like cross partition joins and partitioned group by to form a graph processing framework collocated with the data.

• GeoKnow D2.4.1: Geospatial Clustering and Characteristic Sets (2015-01-06)

This presents experimental results of structure-aware RDF applied to geospatial data. The regularly structured part of the data goes in tables; the rest is triples/quads. Furthermore, for the first time in the RDF space, physical storage location is correlated to properties of entities, in this case geo location, so that geospatially adjacent items are also likely adjacent in the physical data representation.

• LOD2 D2.1.5: 500 billion triple BSBM (2014-08-18)

This presents experimental results on lookup and BI workloads on a Virtuoso cluster with 12 nodes, for a total of 3T RAM and 192 cores. It also discusses bulk load, at up to 6M triples/s, and specifics of query optimization in scale-out settings.

• LOD2 D2.6: Parallel Programming in SQL (2012-08-12)

This discusses ways of making SQL procedures partitioning-aware, so that one can, map-reduce style, send parallel chunks of computation to each partition of the data.

## Publications

### 2015

• Pham, M.-D., Passing, L., Erling, O., and Boncz, P.A. "Deriving an Emergent Relational Schema from RDF Data," WWW, 2015.

This paper shows how RDF is in fact structured and how this structure can be reconstructed. This reconstruction then serves to create a physical schema, reintroducing all the benefits of physical design to the schema-last world. Experiments with Virtuoso show marked gains in query speed and data compactness.

### 2012

• Orri Erling: Virtuoso, a Hybrid RDBMS/Graph Column Store. IEEE Data Eng. Bull. (DEBU) 35(1):3-8 (2012)

This paper introduces the Virtuoso column store architecture and design choices. One design is made to serve both random updates and lookups as well as the big scans where column stores traditionally excel. Examples are given from both TPC-H and the schema-less RDF world.

• Minh-Duc Pham, Peter A. Boncz, Orri Erling: S3G2: A Scalable Structure-Correlated Social Graph Generator. TPCTC 2012:156-172

This paper presents the basis of the social network benchmarking technology later used in the LDBC benchmarks.

### 2009

• Orri Erling, Ivan Mikhailov: Faceted Views over Large-Scale Linked Data. LDOW 2009

This paper introduces anytime query answering as an enabling technology for open-ended querying of large data on public service end points. While not every query can be run to completion, partial results can most often be returned within a constrained time window.

• Orri Erling, Ivan Mikhailov: Virtuoso: RDF Support in a Native RDBMS. Semantic Web Information Management 2009:501-519

This is a general presentation of how a SQL engine needs to be adapted to serve a run-time typed and schema-less workload.

### 2007

• Orri Erling, Ivan Mikhailov: RDF Support in the Virtuoso DBMS. CSSW 2007:59-68

This is an initial discussion of RDF support in Virtuoso. Most specifics are by now different but this can give a historical perspective.

## May142015

15:37

### SNB Interactive, Part 2 - Modeling Choices

SNB Interactive is the wild frontier, with very few rules. This is necessary, among other reasons, because there is no standard property graph data model, and because the contestants support a broad mix of programming models, ranging from in-process APIs to declarative query.

In the case of Virtuoso, we have played with SQL and SPARQL implementations. For a fixed schema and well known workload, SQL will always win. The reason is that SQL allows materialization of multi-part indices and data orderings that make sense for the application. In other words, there is transparency into physical design. An RDF/SPARQL-based application may also have physical design by means of structure-aware storage, but this is more complex and here we are just concerned with speed and having things work precisely as we intend.

## Schema Design

SNB has a regular schema described by a UML diagram. This has a number of relationships, of which some have attributes. There are no heterogeneous sets, i.e., no need for run-time typed attributes or graph edges with the same label but heterogeneous end-points. Translation into SQL or SPARQL is straightforward. In RDF, edges with attributes (e.g., the foaf:knows relation between people) would end up represented as a subject with the end points and the effective date as properties. The relational implementation has a two-part primary key and the effective date as a dependent column. A native property graph database would use an edge with an extra property for this, as such are typically supported.

The only table-level choice has to do with whether posts and comments are kept in the same or different data structures. The Virtuoso schema uses a single table for both, with nullable columns for the properties that occur only in one. This makes the queries more concise. There are cases where only non-reply posts of a given author are accessed. This is supported by having two author foreign key columns each with its own index. There is a single nullable foreign key from the reply to the post/comment being replied to.

The workload has some frequent access paths that need to be supported by index. Some queries reward placing extra columns in indices. For example, a common pattern is accessing the most recent posts of an author or a group of authors. There, having a composite key of ps_creatorid, ps_creationdate, ps_postid pays off since the top-k on creationdate can be pushed down into the index without needing a reference to the table.
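
The effect of such a composite key can be tried out with a small sketch. The following uses Python's built-in `sqlite3` module rather than Virtuoso, so it only illustrates the idea; the table name `post`, the index name, the `ps_content` column and the demo rows are assumptions, while the three key columns are the ones named above.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a composite index (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE post (
            ps_postid       INTEGER PRIMARY KEY,
            ps_creatorid    INTEGER NOT NULL,
            ps_creationdate INTEGER NOT NULL,  -- milliseconds since epoch
            ps_content      TEXT               -- large column, not in the index
        );
        CREATE INDEX post_creator_date
            ON post (ps_creatorid, ps_creationdate, ps_postid);
    """)
    # Thirty hypothetical posts spread over three authors (ids 0, 1, 2).
    rows = [(i, i % 3, 1000 * i, f"post {i}") for i in range(1, 31)]
    conn.executemany("INSERT INTO post VALUES (?, ?, ?, ?)", rows)
    return conn


def newest_posts(conn, creator_id, k):
    """Return the k most recent (post id, creation date) pairs of one author.

    Every column the query touches is part of the index, so the ordering and
    the LIMIT can be satisfied by reading the index alone.
    """
    return conn.execute(
        "SELECT ps_postid, ps_creationdate FROM post "
        "WHERE ps_creatorid = ? ORDER BY ps_creationdate DESC LIMIT ?",
        (creator_id, k),
    ).fetchall()
```

Calling `newest_posts(build_demo_db(), 0, 3)` reads the three newest entries for author 0. For a group of authors, the per-author index ranges would still have to be merged, which this sketch does not attempt.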

The implementation is free to choose data types for attributes, particularly datetimes. The Virtuoso implementation adopts the practice of the Sparksee and Neo4j implementations and represents this as a count of milliseconds since epoch. This is less confusing, faster to compare, and more compact than a native datetime datatype that may or may not have timezones, etc. Using a built-in datetime seems to be nearly always a bad idea. A dimension table or a number for a time dimension avoids the ambiguities of a calendar or at least makes these explicit.
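
As a small illustration of the millisecond representation, the following Python sketch converts between timezone-aware datetimes and integer milliseconds since the Unix epoch. The helper names are made up for the example, and rejecting naive datetimes is a design choice of the sketch, not something the benchmark prescribes.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to integer milliseconds since epoch."""
    if dt.tzinfo is None:
        # A naive datetime does not say which zone it is in, so refuse it.
        raise ValueError("naive datetime: the time zone is ambiguous")
    delta = dt - EPOCH
    # Integer arithmetic avoids floating point rounding.
    return (delta.days * 86_400_000
            + delta.seconds * 1_000
            + delta.microseconds // 1_000)


def from_epoch_millis(ms):
    """Convert integer milliseconds since epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)
```

Once stored this way, two instants compare as plain integers, whatever time zone they were originally written in.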

The benchmark allows procedurally maintained materializations of intermediate results for use by queries as long as these are maintained transaction-by-transaction. For example, each person could have the 20 newest posts by their immediate contacts precomputed. This would reduce Q2 "top of the wall" to a single lookup. This does not however appear to be worthwhile. The Virtuoso implementation does do one such materialization for Q14: A connection weight is calculated for every pair of persons that know each other. This is related to the count of replies by either to content generated by the other. If there does not exist a single reply in either direction, the weight is taken to be 0. This weight is precomputed after bulk load and subsequently maintained each time a reply is added. The table for this is the only row-wise structure in the schema and represents a half-matrix of connected people, i.e., person1, person2 -> weight. Person1 is by convention the one with the smaller p_personid. Note that comparing IDs in this way is useful but not normally supported by SPARQL/RDF systems. SPARQL would end up comparing strings of URIs with disastrous performance implications unless an implementation-specific trick were used.
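
The half-matrix idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, a plain reply count stands in for the actual Q14 weight, and an in-memory dictionary stands in for the row-wise table.

```python
class ConnectionWeights:
    """Half-matrix of weights for pairs of persons that know each other."""

    def __init__(self, knows_pairs):
        # One entry per unordered pair, keyed with the smaller id first.
        # Pairs without any reply keep the weight 0.
        self.weight = {self._key(a, b): 0 for a, b in knows_pairs}

    @staticmethod
    def _key(a, b):
        # Comparing ids fixes which person plays the role of person1.
        return (a, b) if a < b else (b, a)

    def on_reply(self, replier, original_author):
        """Maintain the weight when a reply is added.

        In a real system this would run in the same transaction as the
        insertion of the reply.
        """
        key = self._key(replier, original_author)
        if key in self.weight:  # only connected pairs carry a weight
            self.weight[key] += 1

    def get(self, a, b):
        return self.weight.get(self._key(a, b), 0)
```

Because the key is normalized, a lookup for (2, 1) and one for (1, 2) hit the same entry, and only half of the person-by-person matrix is ever stored.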

In the next installment, we will analyze an actual run.

15:37

### SNB Interactive, Part 1 - What is SNB Interactive Really About?

This is the first in a series of blog posts analyzing the Interactive workload of the LDBC Social Network Benchmark. This is written from the dual perspective of participating in the benchmark design, and of building the OpenLink Virtuoso implementation of same.

With two implementations of SNB Interactive at four different scales, we can take a first look at what the benchmark is really about. The hallmark of a benchmark implementation is that its performance characteristics are understood; even if these do not represent the maximum of the attainable, there are no glaring mistakes; and the implementation represents a reasonable best effort by those who ought to know such, namely the system vendors.

The essence of a benchmark is a set of trick questions or "choke points," as LDBC calls them. A number of these were planned from the start. It is then the role of experience to tell whether addressing these is really the key to winning the race. Unforeseen ones will also surface.

So far, we see that SNB confronts the implementor with choices in the following areas:

• Data model — Tabular relational (commonly known as SQL), graph relational (including RDF), property graph, etc.

• Physical storage model — Row-wise vs. column-wise, for instance.

• Ordering of materialized data — Sorted projections, composite keys, replicating columns in auxiliary data structures, etc.

• Persistence of intermediate results — Materialized views, triggers, precomputed temporary tables, etc.

• Query optimization — join order/type, interesting physical data orderings, late projection, top k, etc.

• Parameters vs. literals — Sometimes different parameter values result in different optimal query plans.

• Predictable, uniform latency — Measurement rules stipulate that the SUT (system under test) must not fall behind the simulated workload.

• Durability — How to make data durable while maintaining steady throughput, e.g., logging, checkpointing, etc.

In the process of making a benchmark implementation, one naturally encounters questions about the validity, reasonability, and rationale of the benchmark definition itself. Additionally, even though the benchmark might not directly measure certain aspects of a system, making an implementation will take a system past its usual envelope and highlight some operational aspects.

• Data generation — Generating a mid-size dataset takes time, e.g., 8 hours for 300G. In a cloud situation, keeping the dataset in S3 or similar is necessary; re-generating every time is not an option.

• Query mix — Are the relative frequencies of the operations reasonable? What bias does this introduce?

• Uniformity of parameters — Due to non-uniform data distributions in the dataset, there is easily a 100x difference between "fast" and "slow" cases of a single query template. How long does one need to run to balance these fluctuations?

• Working set — Experience shows that there is a large difference between almost-warm and steady-state of working set. This can be a factor of 1.5 in throughput.

• Reasonability of latency constraints — In the present case, a qualifying run must have no more than 5% of all query executions starting over 1 second late. Each execution is scheduled beforehand and done at the intended time. If the SUT does not keep up, it will have all available threads busy and must finish some work before accepting new work, so some queries will start late. Is this a good criterion for measuring consistency of response time? There are some obvious possibilities for abuse.

• Ease of benchmark implementation/execution — Perfection is open-ended and optimization possibilities infinite, albeit with diminishing returns. Still, getting started should not be too hard. Since systems will be highly diverse, testing that these in fact do the same thing is important. The SNB validation suite is good for this and, given publicly available reference implementations, the effort of getting started is not unreasonable.

• Ease of adjustment — Since a qualifying run must meet latency constraints while going as fast as possible, setting the performance target involves trial and error. Does the tooling make this easy?

• Reasonability of durability rule — Right now, one is not required to do checkpoints but must report the time to roll forward from the last checkpoint or initial state. Inspiring vendors to build faster recovery is certainly good, but we are not through with all the implications. What about redundant clusters?
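
The latency rule in the list above is easy to state as code. The following Python sketch checks a run against it; the function name and the way start times are passed in are assumptions, while the 1 second and 5% thresholds are the ones quoted above.

```python
def run_qualifies(scheduled_ms, actual_ms, max_delay_ms=1000, max_late_percent=5):
    """Check the start-time rule for one run.

    `scheduled_ms[i]` is the intended start time of execution i and
    `actual_ms[i]` the time at which it really started, both in milliseconds.
    """
    if len(scheduled_ms) != len(actual_ms):
        raise ValueError("one actual start time is needed per scheduled one")
    total = len(scheduled_ms)
    # An execution counts as late only if it starts MORE than max_delay_ms late.
    late = sum(1 for s, a in zip(scheduled_ms, actual_ms) if a - s > max_delay_ms)
    # Integer comparison avoids floating point issues at the 5% boundary.
    return late * 100 <= max_late_percent * total
```

Note that the check only looks at when an execution starts, not at how long it takes, which is one reason to ask whether it is a good measure of response time consistency.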

The following posts will look at the above in light of actual experience.

## May052015

15:13

### Thoughts on KOS (Part 3): Trends in knowledge organization

The accelerating pace of change in the economic, legal and social environment, combined with tendencies towards increased decentralization of organizational structures, has had a profound impact on the way we organize and utilize knowledge. The internet as we know it today, and especially the World Wide Web as the multimodal interface for the presentation and consumption of multimedia information, are the most prominent examples of these developments. To illustrate the impact of new communication technologies on information practices, Saumure & Shiri (2008) conducted a survey on knowledge organization trends in the Library and Information Sciences before and after the emergence of the World Wide Web. Table 1 shows their results.

The survey illustrates three major trends: 1) the spectrum of research areas has broadened significantly from originally complex and expert-driven methodologies and systems to more light-weight, application-oriented approaches; 2) while certain research areas have kept their status over the years (e.g. Cataloguing & Classification or Machine Assisted Knowledge Organization), new areas of research have gained importance (e.g. Metadata Applications & Uses, Classifying Web Information, Interoperability Issues), while formerly prevalent topics like Cognitive Models or Indexing have declined in importance or dissolved into other areas; and 3) the quantity of papers that explicitly or implicitly deal with metadata issues has increased significantly.

These insights coincide with a survey conducted by The Economist (2010) that comes to the conclusion that metadata has become a key enabler in the creation of controllable and exploitable information ecosystems under highly networked circumstances. Metadata provide information about data, objects and concepts. This information can be descriptive, structural or administrative. Metadata adds value to data sets by providing structure (i.e. schemas) and increasing the expressivity (i.e. controlled vocabularies) of a dataset.

According to Weibel & Lagoze (1997, p. 177):

“[the] association of standardized descriptive metadata with networked objects has the potential for substantially improving resource discovery capabilities by enabling field-based (e.g., author, title) searches, permitting indexing of non-textual objects, and allowing access to the surrogate content that is distinct from access to the content of the resource itself.”

These trends influence the functional requirements of the next generation’s Knowledge Organization Systems (KOSs) as a support infrastructure for knowledge sharing and knowledge creation under conditions of distributed intelligence and competence.

Go to previous posts in this series:
Thoughts on KOS (Part1): Getting to grips with “semantic” interoperability or
Thoughts on KOS (Part 2): Classifying Knowledge Organisation Systems

## References

Saumure, Kristie; Shiri, Ali (2008). Knowledge organization trends in library and information studies: a preliminary comparison of pre- and post-web eras. In: Journal of Information Science, 34/5, 2008, pp. 651–666

The Economist (2010). Data, data everywhere. A special report on managing information. http://www.emc.com/collateral/analyst-reports/ar-the-economist-data-data-everywhere.pdf, accessed 2013-03-10

Weibel, S. L., & Lagoze, C. (1997). An element set to support resource discovery. In: International Journal on Digital Libraries, 1/2, pp. 176-187

15:07

### PoolParty 5.1 comes with integrated Graph Search feature

SWC has launched PoolParty Semantic Suite Version 5.1, its taxonomy management and knowledge graph management software platform.

Version 5.1 offers several new features including an ontology publishing module and an integrated graph based search application, which shows instantly how changes on the taxonomy will influence search results.

## New features of PoolParty 5.1 include:

• Several updates of 3rd party components used by the PoolParty server, e.g. update of Sesame to version 2.7.14 to gain full SPARQL 1.1 compatibility and provide additional RDF serialization formats (N-Quads, RDF/JSON)
• GraphSearch per project based on the calculated corpora in corpus management. After successful calculation, a GraphSearch interface is available via a persistent URL, e.g. http://vocabulary.semantic-web.at/PoolParty/graphsearch/cocktails, which is the GraphSearch over a knowledge graph about Cocktails.
• Unified URI management to support URI creation aligned for projects and custom schemes
• Enterprise Security: Several measures have been taken to provide the highest possible enterprise security
• Publishing of custom schemes: Similar to the linked data frontend for projects, a schema publishing functionality for custom schemes has been added. A human-readable version of the scheme is displayed by default when accessing the schema URL in a browser. As an example, take a look at the Cocktail ontology.

Find the detailed Release Notes in our online documentation.

## April282015

09:36

### Our semantic event recommendations

Just a couple of years ago critics argued that the semantic approach in IT wouldn’t make the transformation from an inspiring academic discipline to a relevant business application. They were wrong! With the digitalization of business, the power of semantic solutions to handle Big Data became obvious.

Thanks to a dedicated global community of semantic technology experts, we can observe a rapid development of software solutions in this field. The progress is coupled to a fast-growing number of corporations that are implementing semantic solutions to win insights from existing but unused data.

Knowledge transfer is extremely important in semantics. Let’s have a look at the community calendar for the upcoming months. We are looking forward to sharing our experiences and learning. Join us!

## April272015

13:51

### SWC’s Semantic Event Recommendations

Just a couple of years ago critics argued that the semantic approach in IT wouldn’t make the transformation from an inspiring academic discipline to a relevant business application. They were wrong! With the digitalization of business, the power of semantic solutions to handle Big Data became obvious.

Thanks to a dedicated global community of semantic technology experts, we can observe a rapid development of software solutions in this field. The progress is coupled to a fast-growing number of corporations that are implementing semantic solutions to win insights from existing but unused data.

Knowledge transfer is extremely important in semantics. Let’s have a look at the community calendar for the upcoming months. We are looking forward to sharing our experiences and learning. Join us!

13:03

### Bernhard Haslhofer about his motivation to work as advisor for Semantic Web Company

Being a researcher by training, it is my job to know the state of the art and to make significant and original contributions in my research field. Understanding technological developments, and at least keeping pace with them, is certainly challenging but also a major motivation for this job.

In the field of computer science it is common practice to validate and/or demonstrate novel techniques by writing papers and implementing software prototypes. Even though many of those prototypes offer innovative and novel features, they often remain hidden within the scientific community because of lacking long-term support or missing market knowledge and business skills. Turning research-driven innovation into products therefore requires innovative enterprises that can offer those complementary skills and are open to novel technological approaches.

I strongly believe that a tight cooperation between people from academia and industry benefits both sides: research-driven innovation for enterprises as well as a valuable real-world feedback loop for academia.

In recent years, people at SWC have already demonstrated awareness of and a high level of openness to novel ideas and developments in academia (e.g., Linked Data) and, above all, showed how those ideas can successfully be transformed into products and business. In my new role as Chief Data Scientist at SWC I am looking forward to further supporting research-driven innovation by questioning the status quo and identifying concrete steps to improve product features, with the overall goal of getting better at what we do.

Short Bio Bernhard Haslhofer

Dr. Bernhard Haslhofer is working as a Data Scientist at the Austrian Institute of Technology. His research interest lies in gaining insights from large-scale and connected datasets by applying machine learning, information retrieval, and network analytics techniques. Previously, Bernhard worked as a postdoctoral researcher and lecturer at Cornell University Information Science, and received a Ph.D. in Computer Science from the University of Vienna. He has numerous Linked Data related publications, serves on several related program committees, and is a recipient of an EU Marie Curie Fellowship and several research awards.

## April212015

15:06

### Thoughts on KOS (Part 2): Classifying Knowledge Organisation Systems

Traditional KOSs include a broad range of system types from term lists to classification systems and thesauri. These organization systems vary in functional purpose and semantic expressivity. Most of these traditional KOSs were developed in a print and library environment. They have been used to control the vocabulary used when indexing and searching a specific product, such as a bibliographic database, or when organizing a physical collection such as a library (Hodge 2000).

## KOS in the era of the Web

With the proliferation of the World Wide Web, new forms of knowledge organization principles emerged based on hypertextuality, modularity, decentralisation and protocol-based machine communication (Berners-Lee 1998). New forms of KOSs emerged like folksonomies, topic maps and knowledge graphs, also commonly and broadly referred to as ontologies[1].

With reference to Gruber’s (1993/1993a) classic definition:

“a common ontology defines the vocabulary with which queries and assertions are exchanged among agents” based on “ontological commitments to use the shared vocabulary in a coherent and consistent manner.”

From a technological perspective, ontologies function as an integration layer for semantically interlinked concepts, with the purpose of improving the machine-readability of the underlying knowledge model. Ontologies leverage interoperability from a syntactic to a semantic level for the purpose of knowledge sharing. According to Hodge et al. (2003):

“semantic tools emphasize the ability of the computer to process the KOS against a body of text, rather than support the human indexer or trained searcher. These tools are intended for use in the broader, more uncontrolled context of the Web to support information discovery by a larger community of interest or by Web users in general.” (Hodge et al. 2003)

In other words, ontologies are considered valuable for classifying web information in that they aid in enhancing interoperability, bringing together resources from multiple sources (Saumure & Shiri 2008, p. 657).

## Which KOS serves your needs?

Schaffert et al. (2005) introduce a model to classify ontologies along their scope, acceptance and expressivity, as can be seen in the figure below.

According to this model the design of KOSs has to take account of the user group (acceptance model), the nature and abstraction level of knowledge to be represented (model scope) and the adequate formalism to represent knowledge for specific intellectual purposes (level of expressiveness). Although the proposed classification leaves room for discussion, it can help to distinguish various KOSs from each other and gain a better insight into the architecture of functionally and semantically intertwined KOSs. This is especially important under conditions of interoperability.

[1] It must be critically noted that the inflationary usage of the term “ontology” often in neglect of its philosophical roots has not necessarily contributed to a clarification of the concept itself. A detailed discussion of this matter is beyond the scope of this post. In this paper the author refers to Gruber’s (1993a) definition of ontology as “an explicit specification of a conceptualization”, which is commonly being referred to in artificial intelligence research.

The next post will look at trends in knowledge organization before and after the emergence of the World Wide Web.

Go to the previous post: Thoughts on KOS (Part1): Getting to grips with “semantic” interoperability

## References:

Gruber, Thomas R. (1993). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In: International Journal of Human-Computer Studies, 43, pp. 907-928

Gruber, Thomas R. (1993a). A translation approach to portable ontologies. In: Knowledge Acquisition, 5/2, pp. 199-220

Hodge, Gail (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. In: First Digital Library Federation electronic edition, September 2008. Originally published in trade paperback in the United States by the Digital Library Federation and the Council on Library and Information Resources, Washington, D.C., 2000

Hodge, Gail M.; Zeng, Marcia Lei; Soergel, Dagobert (2003). Building a Meaningful Web: From Traditional Knowledge Organization Systems to New Semantic Tools. In: Proceedings of the 2003 Joint Conference on Digital Libraries (JCDL’03), IEEE

Saumure, Kristie; Shiri, Ali (2008). Knowledge organization trends in library and information studies: a preliminary comparison of pre- and post-web eras. In: Journal of Information Science, 34/5, 2008, pp. 651–666

Schaffert, Sebastian; Gruber, Andreas; Westenthaler, Rupert (2005). A Semantic Wiki for Collaborative Knowledge Formation. In: Reich, Siegfried; Güntner, Georg; Pellegrini, Tassilo; Wahler, Alexander (Eds.). Semantic Content Engineering. Linz: Trauner, pp. 188-202

## April102015

12:44

### Thoughts on KOS (Part1): Getting to grips with “semantic” interoperability

Enabling and managing interoperability at the data and the service level is one of the strategic key issues in networked knowledge organization systems (KOSs) and a growing issue in effective data management. But why do we need “semantic” interoperability and how can we achieve it?

## Interoperability vs. Integration

The concept of (data) interoperability can best be understood in contrast to (data) integration. Integration refers to a process in which formerly distinct data sources and their representation models are merged into one newly consolidated data source. Interoperability, by contrast, keeps knowledge sources and their representation models structurally separate, but allows connectivity and interactivity between them through deliberately defined overlaps in the representation model. Under conditions of interoperability, data sources are designed to provide interfaces for connectivity, so that data can be shared and integrated on top of a common data model while the original principles of data and knowledge representation are left intact. Thus, interoperability is an efficient means to improve and ease integration of data and knowledge sources.

## Three levels of interoperability

When designing interoperable KOSs it is important to distinguish between structural, syntactic and semantic interoperability (Galinski 2006):

• Structural interoperability is achieved by representing metadata using a shared data model like the Dublin Core Abstraction Model or RDF (Resource Description Framework).
• Syntactic interoperability is achieved by serializing data in a shared mark-up language like XML, Turtle or N3.
• Semantic interoperability is achieved by using a shared terminology or controlled vocabulary to label and classify metadata terms and relations.

Given that metadata standards carry a lot of intrinsic legacy, it is sometimes very difficult to achieve interoperability at all three levels mentioned above. Metadata formats and models have grown historically; most of the time they are the result of community decision processes, often highly formalized for specific functional purposes, and deliberately rigid and difficult to change. Hence it is important to have a clear understanding and documentation of the application profile of a metadata format as a precondition for enabling interoperability at all three levels. Semantic Web standards do a really good job in this respect!

In the next post, we will take a look at various KOSs and how they differ with respect to expressivity, scope and target group.

08:37

# Goal

For the Nolde project, we were asked to build a knowledge graph containing detailed information about the Austrian music scene: artists, bands and their music releases. We decided to use PoolParty, since these entities should be accessible in an editorial workflow. More details about the implementation will be provided in a later blog post.

In this first round, I want to share my experiences with the mapping of music data into SKOS. Obviously, LinkedBrainz was the perfect source to collect and transform such data, since it is available as RDF/NTriples dumps and even provides a SPARQL endpoint! LinkedBrainz data is modeled using the Music Ontology.

E.g. you can select all mo:MusicArtists with relation to Austria.

I took the LinkedBrainz dump files and imported them into a triple store, together with DBpedia dumps.

With two CONSTRUCT queries, I was able to collect the required data and transform it into SKOS, into a PoolParty compatible format:

## Construct Artists

Every matching MusicArtist results in a SKOS concept. The foaf:name is mapped to skos:prefLabel (in German).

As you can see, I used Custom Schema features to provide self-describing metadata on top of pure SKOS features: a MusicBrainz link, a MusicBrainz Id, DBpedia link, homepage…

In addition you can see in the query that data from DBpedia was also collected. In case an owl:sameAs relationship to DBpedia exists, a possible abstract is retrieved. When a DBpedia abstract is available it is mapped to skos:definition.
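
To make the mapping rules described in this section concrete, here is a rough Python sketch that applies them to an in-memory list of triples. It is not the CONSTRUCT query used in the project: the function name, the prefixed-name strings, the `dbo:abstract` property and the sample identifiers are assumptions for illustration, and the Custom Schema properties are left out.

```python
def construct_artist_concepts(triples):
    """Map every mo:MusicArtist to a SKOS concept.

    `triples` is an iterable of (subject, predicate, object) tuples using
    prefixed names as plain strings (a simplification for this sketch).
    """
    # Group the input by subject and predicate for easy lookup.
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, []).append(o)

    out = []
    for s, props in by_subject.items():
        if "mo:MusicArtist" not in props.get("rdf:type", []):
            continue
        out.append((s, "rdf:type", "skos:Concept"))
        # foaf:name becomes the German preferred label.
        for name in props.get("foaf:name", []):
            out.append((s, "skos:prefLabel", (name, "de")))
        # Follow owl:sameAs links; an abstract found there becomes the definition.
        for same in props.get("owl:sameAs", []):
            for abstract in by_subject.get(same, {}).get("dbo:abstract", []):
                out.append((s, "skos:definition", abstract))
    return out
```

An artist without an owl:sameAs link, or whose linked resource has no abstract, simply gets no skos:definition, which mirrors the optional nature of the DBpedia data.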

## Construct Releases (mo:SignalGroups) with relations to Artists

Similar to the Artists, a matching SignalGroup results in a SKOS Concept. A skos:related relationship is defined between an Artist and his Releases.

# Outcome

The SPARQL CONSTRUCT queries produced ttl files that could be imported directly into PoolParty, resulting in a project containing nearly 1,000 Artists and 10,000 Releases:
