
September 29 2015


SPARQL analytics proves boxers live dangerously

Have you always thought that SPARQL is only a query language for RDF data? Then think again, because SPARQL can also be used to implement some cool analytics. Here are two queries that demonstrate the principle.

For simplicity we use a publicly available dataset of DBpedia on an open SPARQL endpoint.

Mean life expectancy for different sports

The query shown here starts from the class dbp:Athlete and retrieves its subclasses, which cover the different sports. From those, the athletes of each area are obtained together with their birth and death dates (i.e. we only take deceased individuals into account). The years are then extracted from the dates. A regular expression is used here because the SPARQL function for extracting the year from a literal of a date type returned errors and could not be used. From the birth and death years the age is calculated (we filter for a range of 20 to 100 years, because in data sources like this erroneous entries always have to be accounted for). Then the data is simply grouped, and for each sport we count the number of athletes selected and the average age they reached.

prefix dbp: <http://dbpedia.org/ontology/>
select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
where {
    filter(?age >= 20 && ?age <= 100) .
    {
        select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age)
        where {
            ?subOfAthlete rdfs:subClassOf dbp:Athlete .
            ?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") .
            bind(str(?athleteGroup) as ?athleteGroupEN)
            ?athlete a ?subOfAthlete .
            ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
            ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
            bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
            bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
        }
    }
} group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg

The results are not unexpected and show that athletes in the areas of motor sports, wrestling and boxing die at a younger age. Horse riders, on the other hand, but also tennis and golf players, live clearly longer on average.

athleteGroupEN                                      count   ageAvg
wrestler                                              693    58.96
winter sport Player                                  1775    66.60
tennis player                                         577    71.48
table tennis player                                    45    68.73
swimmer                                               402    68.67
soccer player                                        6572    63.99
snooker player                                         25    70.12
rugby player                                         1452    67.27
rower                                                  69    63.06
poker player                                           30    66.87
national collegiate athletic association athlete       44    68.09
motorsport racer                                     1237    58.12
martial artist                                        197    67.16
jockey (horse racer)                                  139    65.99
horse rider                                           181    74.65
gymnast                                               175    65.81
gridiron football player                             4247    67.71
golf player                                           400    71.13
Gaelic games player                                    95    70.59
cyclist                                              1370    67.47
cricketer                                            4998    68.42
chess player                                           45    70.24
boxer                                                 869    60.35
bodybuilder                                            27    52.00
basketball player                                     822    66.17
baseball player                                      9207    68.61
Australian rules football player                     2790    69.53

(average ages rounded to two decimals)

Doing such aggregations directly in the triple store is especially relevant when the data is large and one would otherwise have to extract it from the database and import it into another tool to do the counting and calculations.

Simple statistical measures over life expectancy

Another standard statistical measure is the standard deviation. A good description of how to calculate it can be found, for example, here. We start again with the class dbp:Athlete and calculate the ages reached (this time for the entire class dbp:Athlete, not its subclasses). We also need the squares of the ages, which we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result and calculate the average age, the square of the sum of the ages, and the sum of the squared ages. With those values we can calculate in the final step the standard deviation of the ages in our data set. Note that SPARQL does not specify a function for calculating square roots, but RDF stores like Virtuoso (which hosts the DBpedia data) provide additional functions like bif:sqrt for calculating the square root of a value.

prefix dbp: <http://dbpedia.org/ontology/>
select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) /
       ?count)) / (?count - 1)) as ?standDev)
where {
   {
   select (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
          (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum)
   where {
         {
         select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare)
         where {
             filter (?age >= 20 && ?age <= 100) .
             {
                select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age)
                where {
                    ?subOfAthlete rdfs:subClassOf dbp:Athlete .
                    ?athlete a ?subOfAthlete .
                    ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
                    ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
                    bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
                    bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
                }
             }
         }
         }
   }
   }
}

The query returns a single row: 38542 athletes, an average age of 66.88 years and a standard deviation of 17.65.
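The identity the outer query relies on — sample variance = (Σx² − (Σx)²/n) / (n − 1) — can be sanity-checked with a few lines of Python (the ages below are made up for illustration; this is not part of the query itself):

```python
import math
import statistics

ages = [72, 65, 58, 80, 67, 71, 60]  # invented sample of ages at death

n = len(ages)
age_sum = sum(ages)                        # corresponds to sum(?age)
age_square_sum = sum(a * a for a in ages)  # corresponds to ?ageSquareSum

# Same formula as the SPARQL query: sqrt((sum of squares - (sum)^2/n) / (n - 1))
stand_dev = math.sqrt((age_square_sum - age_sum ** 2 / n) / (n - 1))

# The library's sample standard deviation agrees with the identity:
print(stand_dev, statistics.stdev(ages))
```

The advantage of the sum-of-squares form is that all the aggregates can be computed in a single grouped pass, which is exactly what the nested SPARQL selects do.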

These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data: it is possible to implement basic statistical methods directly at the level of the triple store, without the need to extract the data and import it into another tool.

September 25 2015


KMWorld listed PoolParty Semantic Suite as a Trend-Setting Product 2015

PoolParty Semantic Suite has been recognized by KMWorld as a “Trend-Setting Product 2015”. More than 1,000 separate software offerings from more than 200 vendors were reviewed. KMWorld is the United States’ leading magazine for topics surrounding knowledge management systems and content and document management.

Andreas Blumauer, founder and CEO of the Semantic Web Company, comments on the award as follows: “We are truly honoured that KMWorld has chosen us for its prestigious innovator list. It proves that standards-based technologies are on the rise in the enterprise sector. What makes the PoolParty Semantic Suite truly valuable is that it unites the most relevant functionalities for seamless, personalized digital experiences. Subject matter experts and IT can smoothly cooperate, which creates relevant business-technology synergies. This is the essence of a successful digital transformation.”


KMWorld Editor-in-Chief Hugh McKellar says, “The panel, which consists of editorial colleagues, market and technology analysts, KM theoreticians, practitioners, customers and a select few savvy users (in a variety of disciplines) reviewed the offerings. All selected products fulfill the ultimate goal of knowledge management—delivering the right information to the right people at the right time.”


PoolParty Semantic Suite

PoolParty is a semantic technology platform provided by the Semantic Web Company. The EU-based company has been a pioneer in the semantic web since 2001. The product is recognized by industry leaders as one of the most developed semantic technology platforms, supporting enterprise needs in knowledge management, data analytics and content excellence. Typical PoolParty users such as taxonomists, subject matter experts and data analysts can easily build and enhance a knowledge graph without coding skills. Boehringer, Credit Suisse, Roche and The World Bank are among many customers now profiting from transforming data into customer insights with PoolParty.


About KMWorld

KMWorld is the leading information provider serving the Knowledge Management systems market and covers the latest in content, document and knowledge management, informing more than 30,000 subscribers about the components and processes – and subsequent success stories – that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc.


Press Contact

Semantic Web Company

Thomas Thurner

phone: +43-1-402-12-35



September 24 2015


120+ CKAN Portals in the Palm of Your Hand. Via the Open Data Companion (ODC)

CKAN is a powerful open-source data portal platform which provides out-of-the-box tools that allow data producers to make data easily accessible and reusable by everyone. Making CKAN as Free and Open Source Software (FOSS) has been a key factor in helping grow the availability and accessibility of open data across the Internet.

The emergence of mobile devices and the mobile platform has led to a shift in the way people access and consume information. Popular consensus and reports show that mobile device usage and time spent on mobile devices are rapidly increasing. This means that mobile devices are now one of the fastest and easiest means of accessing data and information. Yet, as of now, open data lacks a strong mobile presence.

Open Data Companion (ODC) [pronounced “Odyssey”] seeks to address this challenge by providing a free mobile app that serves as a unified access point to over 120 CKAN 2.0+ compliant open data portals and thousands of datasets from around the world, right from your mobile device. Crafted with mobile-optimised features and design, it is an easy and convenient way to find, access and share open data. ODC provides a way for CKAN portal administrators and data producers to deliver open data to mobile users without additional costs or further portal configuration.

ODC provides key mobile features for CKAN Portals:

  • Mobile users can set up access to as many CKAN-powered portals as they want.
  • Browse datasets from over 120 CKAN-powered data portals around the world by categories.
  • Receive push notifications on your mobile device when new datasets are available from your selected data portals.
  • Download and view data records (resources) on your mobile device.
  • Preview dataset resources and create data visualisations in app before download (as supported by the portal).
  • Bookmark/save datasets for later viewing.
  • “Favourite” your data portals for future easy access.
  • Share links to datasets on social media, email, sms etc. right from the app.
  • In-app tutorial videos designed to help you quickly get productive with the app. Tutorial videos are available offline once downloaded.
To ensure that ODC is usable by all CKAN portals in the wild, the app uses the public and powerful CKAN API, which is supported by all CKAN portals. By using the CKAN API to access portals’ data and metadata, the app safeguards portals from external malicious attacks; more importantly, portal administrators remain in control of the data being delivered to the public through the app. For instance, in order for ODC to provide in-app previews and visualisations of datasets, portal administrators must install the correct CKAN resource preview extensions. Basically, whatever dataset can be accessed from a CKAN portal website can also be accessed by ODC through the CKAN APIs.
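To give a flavour of the kind of calls an app like ODC makes, here is a minimal Python sketch against CKAN’s standard Action API (the endpoint path and the package_search action are part of CKAN’s documented API; the portal URL, query and the canned response are examples, not ODC’s actual code):

```python
import json
from urllib.parse import urlencode

def action_url(portal, action, **params):
    """Build a CKAN Action API URL, e.g. <portal>/api/3/action/package_search."""
    url = f"{portal.rstrip('/')}/api/3/action/{action}"
    return f"{url}?{urlencode(params)}" if params else url

def dataset_titles(response_json):
    """Extract dataset titles from a package_search JSON response."""
    body = json.loads(response_json)
    if not body.get("success"):
        raise RuntimeError(body.get("error"))
    return [pkg["title"] for pkg in body["result"]["results"]]

url = action_url("https://demo.ckan.org", "package_search", q="transport", rows=5)

# A canned response in the shape package_search returns (no network call here):
canned = json.dumps({"success": True,
                     "result": {"count": 1,
                                "results": [{"title": "Traffic counts"}]}})
print(url)
print(dataset_titles(canned))
```

Because every CKAN portal exposes the same action endpoints, the same client code works unchanged against any of the 120+ portals the app supports.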

How to Make Your CKAN Portal Available to the Mobile Community

Making your CKAN portal available to the mobile community through the ODC app is done in three easy steps. As a portal administrator, ensure your CKAN portal is running on CKAN 2.0 or above (at the time of writing, the latest CKAN version is 2.4), and ensure your portal is publicly available on the World Wide Web. Finally, submit your portal details to the CKAN Census (where the app developer will periodically check for new portal submissions) or submit the portal details directly to the developer through the feedback section of the app and the app website. That’s all!

Feedback Welcome

ODC is available for download on the Google Play Store and all feedback is welcome. The app is actively developed, so more features will be released. Send feedback through the app or follow ODC on Twitter. You can also read more about the ODC vision, objectives and features from the app website.

Bringing CKAN Portals to the mobile platform is a big step in improving open data accessibility and reusability. It also opens doors to more public involvement in open data growth. I am excited to see what these new opportunities produce, first for the CKAN community and then for the Open Data community in general.

September 21 2015


Showcase your data

We all know CKAN is great for publishing and managing data, and it has powerful visualisation tools to provide instant insights and analysis. But it’s also useful and inspiring to see examples of how open data is being used.

CKAN has previously provided for this with the ‘Related Items’ feature (also known as ‘Apps & Ideas’). We wanted to enhance this feature to address some of its shortcomings, packaged up as an extension to easily replace and migrate from Related Items. So we developed the Showcase extension!

Showcase Example

A Showcase details page. This Showcase example is originally from

Separating useful but under-loved features out of CKAN core into extensions like this:

  • makes core CKAN a leaner and a more focused codebase
  • gives these additional features a home, with more dedicated ownership and support
  • means updates and fixes for an extension don’t have to wait until the next release of CKAN

Some improvements made in Showcase include:

  • each showcase has its own details page
  • more than one dataset can be linked to a showcase
  • a new role of Showcase Admin to help manage showcases
  • free tagging of showcases, instead of a predefined list of ‘types’
  • showcase discovery by search and filtering by tag

This was my first contribution to the CKAN project and I wanted to ensure the established voices from the CKAN developer community were able to contribute guidance and feedback.

Remote collaboration can be hard, so I looked at the tools we already use as a team to lower the barrier to participation. I wanted something that was versioned, allowed commenting and collaboration, and provided notification to interested parties as the specification developed. We use GitHub to collect ideas for new features in a repository as Issues, so it seemed like a natural extension to take these loose issues (ideas) and turn them into pull requests (proposals). The proposal and supporting documents can be committed as simple Markdown files and discussed within the pull request. This provides line-by-line commentary tools, enabling quick iteration based on the feedback. If a proposal is accepted and implemented, the pull request can be merged; if the proposal is unsuccessful, it can be closed.

The Pull Request for the Showcase specification has 22 commits, and 57 comments from nine participants. Their contributions were invaluable and helped to quickly establish what and how the extension was going to be built. Their insights helped me get up to speed with CKAN and its extension framework and prevented me from straying too far in the wrong direction.

So, by developing the specification and coding in the open, we’ve managed to take an unloved feature of CKAN and give it a bit of polish and hopefully a new lease of life. I’d love to hear how you’re using it!

September 18 2015


Pyramids, Pipelines and a Can-of-Sweave – CKAN Asia-Pacific Meetup

Florian Mayer from the Western Australian Department of Parks and Wildlife presents various methods he is using to create Wisdom.

Data+Code = Information; Information + Context = Wisdom

So, can this be done with workbooks, applications and active documents?

As Florian might say, “Yes it CKAN”!

Grab the code and materials related to the work from here:

This presentation was given at the first Asia-Pacific CKAN meetup on the 17th of September, hosted at Link Digital as an initiative of the CKAN Community and Communications team. You can join the meetup and come along to these fortnightly sessions via video conference.

If you have some interesting content to present then please get in touch with @starl3n to schedule a session.

September 16 2015


Implementing VectorTiles Preview of Geodata on HDX

This post is a modified version of a post on the HDX blog. It is modified here to highlight the information of most interest to the CKAN community. You can see the original post here.

Humanitarian data is almost always inherently geographic. Even the data in a simple CSV file will generally correspond to some piece of geography: a country, a district, a town, a bridge, or a hospital, for example.

HDX has built on CKAN’s preview capabilities with the ability to preview large (up to 500MB) vector geographic datasets in a variety of formats.  Resources uploaded (or linked) to HDX with the format strings ‘geojson’, ‘zipped shapefile’, or ‘kml’ will trigger the creation of a geo preview. Here is an example showing administrative boundaries for Colombia:


To minimize bandwidth use for often poorly connected field locations, we built the preview from vector tiles. This means that details are removed at small scales but reappear as you zoom in.

The preview is created only for the first layer it encounters in a resource. If the resource contains multiple layers, the others will not show up. For those cases, you can create separate resources for each layer and they will be available in the preview. Multiple geometry types (polygon + line, for example) in kml or geojson are not yet supported.


It’s a common problem in interactive mapping: to preview the whole geographic dataset, we would need to send all of the data to the browser, but that can require a long download or even crash the browser. The classic solution is to use a set of pre-rendered map tiles — static map images made for different zoom levels and cut into tiny pieces called tiles.  The browser has to load only a few of these pieces for any given view of the map. However, because they are just raster images, the user cannot interact with them in any advanced way.

We wanted to maintain interactivity with the data, eventually having hover effects or allowing users to customize styling, so we knew that we needed a different approach. We reached out to our friends at Geonode who pointed us to the recently developed Vector Tiles Specification.

The vector tile solution is a similar approach to traditional map tiles, but instead of creating static image tiles, it involves cutting the geodata layer into small tiles of vector data. Each zoom level receives a simplification (level of detail, or LoD) pass, which reduces the number of vertices displayed, similar to the way that 3D video games or simulators reduce the number of polygons in distant objects to improve performance. This means that for any given zoom level and location, the browser needs to download only the vertices necessary to fill the map.  You can learn more about how vector tiles work in this helpful FOSS4G NA talk from earlier this year.
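The tiling scheme itself is easy to illustrate. Below is a short Python sketch (not HDX’s actual code) of the standard Web-Mercator “slippy map” tile arithmetic, showing how the same coordinate falls into coarser tiles at low zoom and finer tiles at high zoom — which is why the browser only ever needs the tiles covering its current view:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Standard Web-Mercator tile index (x, y) for a coordinate at a zoom level."""
    n = 2 ** zoom                                   # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)              # linear in longitude
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# One point (New York), three zoom levels: the tile grid refines as you zoom in.
for z in (2, 8, 14):
    print(z, lonlat_to_tile(-74.006, 40.7128, z))
```

Vector tiles use the same addressing scheme as raster tiles; the difference is purely in the payload of each tile (simplified vector geometry instead of a static image).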

Because vector tiles are a somewhat-new technology, there wasn’t any off-the-shelf framework to let us integrate them with our CKAN instance. Instead, we built a custom solution from several existing components (along with our own integration code):

Our architecture looks like this:


The GISRestLayer orchestrates the entire process by notifying each component when there is a task to do. It then informs CKAN when the task is complete, and a dataset has a geo preview available.  It can take a minute or longer to generate the preview, so the asynchronous approach — managed through Redis Queue (RQ) — was essential to let our users continue to work while the process is running. A special HDX team member, Geodata Preview Bot, is used to make the changes to CKAN. This makes the nature of the activity on the dataset clear to our users.

Future development

This approach gives HDX a good foundation for adding new geodata features in the future. We will be conducting research to understand what users think is important to add next. Here are some initial new-feature ideas:

  • Automatically generate additional download formats so that every geodataset is available in zipped shapefile, GeoJSON, KML, etc.
  • Allow the contributing user to specify the order of the resources in the map legend (and therefore which one appears by default).
  • Allow users to preview multiple datasets on the same map for comparison.
  • Automatically apply different symbol colors to different resources in the same dataset.
  • Allow users to style the geographic data, changing colors and symbols.
  • Allow users to configure and embed maps of their data in their organization or crisis pages.
  • Provide OGC-compliant web services of contributed datasets (WFS, WMS, etc.).
  • Allow external geographic data services (WMS, WFS, etc) to be added to a map preview.
  • Make our vector tiles available as a web service.

If any of these enhancements sound useful or you have new ideas, send us an email. If you have geodata to share with the HDX community, start adding your data here.

We would like to say a special thanks to Jeffrey Johnson who pointed us toward the vector tiles solution and to the contributors of all the open source projects listed above! In addition to GISRestLayer, you’ll find the rest of our code here.

September 11 2015


Building tools for Open Data adoption

At DataCats, we are focused on a simple problem — how do we make sure every single government has easy access to get up and running with Open Data? In other words, how do we make it as easy as possible for governments of all levels to start publishing open data?

The answer, as you might tell by this blog, is CKAN. But CKAN uses a very non-traditional technology stack, especially by government standards. Python, PostgreSQL, Solr, and Unix are not in the toolbox of most IT departments. This is true not only for local governments in Europe and North America, but also for almost all governments in the developing world.

Our answer to this problem is two software projects which, like CKAN, are Free and Open Source Software. The first is the eponymously named datacats, and the second is named CKAN Multisite. The two projects together aim to solve the operational difficulties in deploying and managing CKAN installations.

datacats is a command line library built on Docker, a popular new alternative to virtualization that is experiencing explosive growth in industry. It aims to help CKAN developers easily get set up and running with one or more CKAN development instances, as well as deploy those easily on any provider – be it Amazon, Microsoft Azure, Digital Ocean, or a plain old physical server data centre.

Our team has been using datacats to develop a number of large CKAN projects for governments here in Canada and around the world. Being open source, we get word every week of another IT department somewhere that is trying it out.

CKAN Multisite is a companion project to datacats, targeted at system administrators who wish to manage one or more CKAN instances on their infrastructure. The project was very generously sponsored by U.S. Open Data. Multisite provides a simple API and a web interface through which starting, stopping, and managing CKAN servers is as simple as pressing a button. In essence it gives you your very own CKAN cloud.

CKAN is as an open source project that many national and large city governments depend on as the cornerstone of their open data programs. We hope that these two open source projects will help the CKAN ecosystem continue to grow. If you are a sysadmin or a developer working on CKAN, give it a try — and if you have the appetite — consider contributing to the projects themselves.

September 09 2015

PricewaterhouseCoopers is launching its Information Management Application, based on EXALEAD CloudView

August 21 2015


Matthew Fullerton and some interesting CKAN extension development.

Note: This is a re-post from one of our CKAN community contributors, Matthew Fullerton. He has been working on some interesting extensions, which are outlined below. You can support Matthew’s work by providing comments below, or you can link through to his GitHub profile to comment or get in touch there.


Styling GeoJSON data

The GeoView extension makes it easy to add resource views of GeoJSON data. In our extended extension, attributes of the features (lines, points) in the FeatureCollection are styled according to MapBox’s SimpleStyle spec.

Here’s an example where the file has been processed to add colors based on traffic flow state:

And another where the points are styled to (vaguely) look like colored traffic lights: (watch out, it can take a while to load)
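The styling step can be sketched in a few lines of Python: each GeoJSON feature gets SimpleStyle properties (“marker-color”, “stroke”, etc., as defined in MapBox’s spec) derived from one of its attributes. The attribute name and the colors below are invented for illustration; this is not the extension’s actual code:

```python
# Map a hypothetical traffic-flow attribute onto SimpleStyle properties.
FLOW_COLORS = {"free": "#2ecc40", "dense": "#ff851b", "jam": "#ff4136"}

def style_feature(feature, attribute="flow_state"):
    """Attach SimpleStyle keys to a GeoJSON feature based on one attribute."""
    state = feature["properties"].get(attribute)
    color = FLOW_COLORS.get(state, "#aaaaaa")  # grey for unknown states
    feature["properties"].update({"marker-color": color,
                                  "stroke": color,
                                  "stroke-width": 3})
    return feature

collection = {"type": "FeatureCollection",
              "features": [{"type": "Feature",
                            "geometry": {"type": "Point",
                                         "coordinates": [11.57, 48.14]},
                            "properties": {"flow_state": "jam"}}]}

styled = [style_feature(f) for f in collection["features"]]
print(styled[0]["properties"]["marker-color"])
```

Because the style lives in the feature properties themselves, any SimpleStyle-aware renderer (such as the Leaflet plugins GeoView builds on) can draw the data correctly without extra configuration.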

Realtime GeoJSON data

Using leaflet.realtime, an extension for the leaflet library that CKAN (GeoView) uses to visualize GeoJSON, maps can have changing points or colors/styles.

Here is an example of traffic lights changing according to pre-recorded data:

I’ll try and add a demo with moving data points soon, it ought to work without any further code changes. The problem is often getting the live data in GeoJSON format… but we have a backend for preprocessing other data.

Realtime data plotting

By making only a few small changes, we are able to continuously update Graph views. You can see the changing (or not) temperature in our office here:

That’s an example for ‘lines and points’ but it works for things like bar graphs too. Last week we had people competing to achieve the best time in a remote controlled robot race where their time was automatically displayed as a bar on a ‘leader board’. For good measure we had an automatically updating histogram of the times too. Updating the actual data in CKAN is easy thanks to the DataStore API.
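Pushing a new reading into CKAN boils down to a single datastore_upsert call against the Action API. Here is a hedged Python sketch of preparing such a request (datastore_upsert and its resource_id/method/records parameters are part of CKAN’s documented DataStore API; the portal URL, API key, resource id and field names are placeholders):

```python
import json
from urllib import request

def datastore_upsert_request(ckan_url, api_key, resource_id, records):
    """Prepare (but do not send) a POST to CKAN's datastore_upsert action."""
    payload = {"resource_id": resource_id,
               "method": "upsert",   # insert, or update on the primary key
               "records": records}
    return request.Request(
        f"{ckan_url.rstrip('/')}/api/3/action/datastore_upsert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": api_key})

req = datastore_upsert_request(
    "https://demo.ckan.org", "my-api-key",
    "temperature-resource-id",
    [{"time": "2015-08-21T10:00:00", "celsius": 22.4}])

print(req.get_full_url())
```

A cron job or sensor loop issuing such requests is all it takes to keep a Graph view updating, since the view reads straight from the DataStore.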

Matthew Fullerton

Freelance software developer and EXIST stipend holder with the start-up project “Tapestry”

August 16 2015


Two new CKAN extensions – Webhooks and Geopusher

Denis Zgonjanin recently shared the following update on two new extensions via the CKAN Dev mail list.

If you are working on CKAN extensions and would like to share details with other developers then post your updates via the mail list. We’ll always look at promoting the great work of community contributions via this blog :) If you have an interesting CKAN story to share feel free to ping @starl3n to organise a guest post.

From Denis:


A problem I’ve had personally is having my open data apps know when a dataset they’ve been using has been updated. You can of course poll CKAN periodically, but then you need cron jobs or a queue, and when you’re using a cheap PaaS like Heroku for your apps, integrating queues and cron is just an extra hassle.

This extension lets people register a URL with CKAN, which CKAN will call when a certain event happens – for example, a dataset update. The extension uses the built-in CKAN celery queue, so as to be non-blocking.

If you do end up using it, there are still a bunch of nice features to be built, including a simple web interface through which users can register webhooks (right now they can only be created through the action API).


So you know how you have a lot of Shapefiles and KML files in your CKANs (because government), but your users prefer GeoJSON? This extension will automatically convert Shapefiles and KML into GeoJSON and create a new GeoJSON resource within the dataset. There are some cases where this won’t work, depending on the complexity of the SHP or KML file, but it works well in general.

This extension also uses the built-in celery queue to do its work, so for both of these extensions you will need to start the celery daemon in order to use them:

`paster --plugin=ckan celeryd -c development.ini`

August 11 2015


DBpedia Usage Report, August 2015

We recently published the latest DBpedia Usage Report, covering v3.3 (released July, 2009) to v3.10 (sometimes called "DBpedia 2014"; released September, 2014).

The new report has usage data through July 31, 2015, and brought a few surprises to our eyes. What do you think?

August 05 2015


Beauty behind the scenes

Good things can often go unnoticed, especially if they’re not immediately visible. Last month the government of Sweden, through Vinnova, released a revamped version of their open data portal, Öppnadata.se. The portal still runs on CKAN, the open data management system. It even has the same visual feeling, but the principles behind the portal are completely different. The main idea behind the new version of Öppnadata.se is automation. Open Knowledge teamed up with the Swedish company Metasolutions to build and deliver an automated open data portal.

Responsive design

In modern web development, one aspect of website automation called responsive design has become very popular. With this technique the website automatically adjusts its presentation depending on the screen size; that is, it knows how best to present the content for different screen sizes. Öppnadata.se got a slight facelift in terms of tweaks to its appearance, but the big news on that front is that it now has a responsive design. The portal looks different if you access it on a mobile phone than if you visit it on a desktop, but the content is still the same.

These changes were contributed to CKAN and are part of the CKAN core web application as of version 2.3. This means everyone can now have a responsive data portal, as long as they use a recent version of CKAN.

New Öppnadata.se

Old Öppnadata.se

Data catalogs

Perhaps the biggest innovation of Öppnadata.se is how the automation process works for adding new datasets to the catalog. Normally with CKAN, data publishers log in and create or update their datasets on the CKAN site. CKAN has for a long time also supported something called harvesting, where an instance of CKAN goes out, fetches new datasets and makes them available. That’s a form of automation, but it depends on specific software being used, or on special harvesters for each source. So harvesting from one CKAN instance to another is simple. Harvesting from a specific geospatial data source is simple. Automatically harvesting from something you don’t know about, and that might not even exist yet, is hard.

That’s the reality Öppnadata.se faces. Only a minority of public organisations and municipalities in Sweden publish open data at the moment, so the majority of public entities have not yet decided which software or solution they will use to publish open data.

To tackle this problem, Öppnadata.se relies on an open standard from the World Wide Web Consortium called DCAT (Data Catalog Vocabulary). The standard describes how to publish a list of datasets, and it allows Swedish public bodies to pick whatever solution they like to publish datasets, as long as one of its outputs conforms to DCAT.

Öppnadata.se actually uses a DCAT application profile which was specially created for Sweden by Metasolutions, and which defines in more detail what to expect — for example, that Öppnadata.se expects to find dataset classifications according to the Eurovoc classification system.

Thanks to this effort, significant improvements have been made to CKAN’s support for RDF and DCAT. They include application profiles (like the Swedish one) for harvesting and exposing DCAT metadata in different formats. A CKAN instance can now automatically harvest datasets from a range of DCAT sources, which is exactly what Öppnadata.se does. The CKAN support also makes it easy for Swedish public bodies who use CKAN to expose their datasets correctly so that they can be automatically harvested by Öppnadata.se. For more information, have a look at the CKAN DCAT extension documentation.

Dead or alive

The Web is decentralised and always changing. A link to a webpage that worked yesterday might not work today because the page was moved. When automatically adding external links, for example links to resources for a dataset, you run the risk of adding links to resources that no longer exist.

To counter that, Ö uses a CKAN extension called Dead or alive. It may not be the best name, but that’s exactly what it does: it checks whether a link is dead or alive. The checking itself is performed by an external service called deadoralive; the extension just serves it the set of links to check. In this way dead links are automatically marked as broken, and the system administrators of Ö can find problematic public bodies and notify them that they need to update their DCAT catalog (this is not automatic, because nobody likes spam).
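The basic check such a service performs can be sketched in a few lines of Python. This is an illustrative sketch, not the deadoralive service's actual code, and the status-code policy shown (2xx/3xx means alive) is an assumption:

```python
import urllib.request
import urllib.error

def classify(status: int) -> str:
    """Treat 2xx and 3xx responses as alive, everything else as dead."""
    return "alive" if 200 <= status < 400 else "dead"

def check_link(url: str, timeout: float = 10.0) -> str:
    """Issue a HEAD request and classify the link; network errors count as dead.
    Note that urlopen follows redirects, so 3xx is rarely seen directly."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify(resp.status)
    except (urllib.error.URLError, ValueError):
        return "dead"
```

A real checker would also retry, rate-limit, and record when a link was last seen alive before flagging it as broken.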

These are only the automation highlights of the new Ö. Other changes were made that have little to do with automation but are still not immediately visible, so a lot of Ö’s beauty happens behind the scenes. That’s also the case for other open data portals: you might just visit your open data portal to get some open data, without realising the amount of effort and coordination it takes to get that data to you.

Image of Swedish flag by Allie_Caulfield on Flickr (cc-by)

August 03 2015


How the PoolParty Semantic Suite is learning to speak 40+ languages

Business is becoming more and more globalised, and enterprises and organisations act in several different regions, thus facing challenges arising from cultural differences as well as from language barriers. Looking at the European market, we even see 24 working languages in the EU28, which makes cross-border services considerably more complicated. As a result, powerful language technology is needed, and intense efforts have already been undertaken in the EU to deal with this situation and enable the vision of a multilingual digital single market (a priority area of the European Commission this year, see:


Here at the Semantic Web Company we also witness fast-growing demand for language-independent, specific-language, and cross-language solutions to enable business cases like cross-lingual search or multilingual data management. To provide such solutions, a multilingual metadata and data management approach is needed, and this is where the PoolParty Semantic Suite comes into play: as PoolParty follows W3C Semantic Web standards like SKOS, we have language-independent technologies in place, and our customers already benefit from them. However, as regards text analysis and text extraction, the ability to process multilingual information and data is key to success, which means that the systems need to speak as many languages as possible.

Our new cooperation with K Dictionaries (KD) is enabling the PoolParty Semantic Suite to continuously “learn to speak” more and more languages, by making use of KD’s rich monolingual, bilingual and multilingual content and its long-time experience in lexicography as a base for improved multi-language text analysis and processing.

KD ( is a technology-oriented content and data creator based in Tel Aviv that cooperates with publishing partners, ICT firms, academia and professional associations worldwide. It deals with nearly 50 languages, offering quality monolingual, bilingual and multilingual lexical datasets, morphological word forms, phonetic transcription, and more.

As a result of this cooperation, PoolParty now provides language bundles in the following languages, which can be licensed together with all types of PoolParty servers:

  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Russian
  • Slovak
  • Spanish

Additional language bundles are in preparation and will be in place soon!

Furthermore, SWC and KD are partners in a brand new EUREKA project that is supported by a bilateral technology/innovation programme between Austria and Israel. The project is called LDL4HELTA (Linked Data Lexicography for High-End Language Technology Application) and combines lexicography and language technology with Semantic Web and Linked (Open) Data mechanisms and technologies, in order to improve existing products and services and to develop new ones. It integrates the products of both partners to better serve existing and new customers, as well as to jointly enter new markets in the field of Linked Data lexicography-based language technology solutions. The project was successfully kicked off in early July and has a duration of 24 months, with the first concrete results due in early 2016.

The LDL4HELTA project is supported by a research partner (Austrian Academy of Sciences) and an expert Advisory Board including  Prof Christian Chiarcos (Goethe University, Frankfurt), Mr Orri Erling (OpenLink Software), Dr Sebastian Hellmann (Leipzig University), Prof Alon Itai (Technion, Haifa), and Ms Eveline Wandl-Wogt (Austrian Academy of Sciences).

So stay tuned: we will keep you informed here on the blog about news and activities from this cooperation!

July 31 2015

Flows, data or quality: how to manage interaction with customers efficiently

July 27 2015


Meet the PoolParty Team in Washington, Vienna, Utrecht, …, or on the web

Meet the PoolParty Team at upcoming events, either to listen to a presentation or to have a chat with our team in the exhibition area. PoolParty is a proud sponsor of the following events in the autumn of 2015:

Find details about the talk ‘Dynamic Semantic Publishing’, which will be given by Andreas Blumauer, CEO of Semantic Web Company, at Taxonomy Bootcamp.

If you are not able to visit these events, watch out for the webinars we will organize in the upcoming months in our event calendar.



July 22 2015


CKAN 2.4 release and patch releases

We are happy to announce that CKAN 2.4 is now released. In addition, new patch releases for older versions of CKAN are now available to download and install.

CKAN 2.4

The 2.4 release brings a way to set the CKAN config via environment variables and via the API, which is useful for automated deployment setups. 2.4 also includes plenty of other improvements contributed by the CKAN developer community during the past 4 months, as detailed in the 2.4.0 CHANGELOG.

If you have customizations or extensions, we suggest you trial the upgrade first in a test environment and refer to the changes in the changelog. Upgrade instructions are below.

CKAN patch releases

These new patch releases for CKAN 2.0.x, 2.1.x, 2.2.x and 2.3.x fix important bugs and security issues, so users are strongly encouraged to upgrade to the latest patch release for the CKAN version they are using.

For a list of the fixes included you can check the CHANGELOG:


For details on how to upgrade, see the following links depending on your install method:

If you find any issues, you can let the technical team know on the mailing list or in the IRC channel.


July 15 2015


Big Data, Part 2: Virtuoso Meets Impala

In this article we will look at Virtuoso vs. Impala with 100G TPC-H on two R3.8 EC2 instances. We get a single user win for Virtuoso by a factor of 136, and a five user win by a factor of 55. The details and analysis follow.

The load setup is the same as ever: the CSV files are attached as external tables and copied into Parquet tables. We get lineitem split over 88 Parquet files, which should provide enough parallelism for the platform. The Impala documentation states that there can be up to one thread per file, and here we wish to see maximum parallelism for a single query stream. We use the schema from the Impala github checkout, with string for string and date columns, and decimal for numbers. We suppose the authors know what works best.

The execution behavior is surprising. Sometimes we get full platform utilization, but quite often only 200% CPU per box. The query plan for Q1, for example, says 2 cores per box. This makes no sense, as the same plan knows full well the table cardinality. The settings for scanner threads and cores to use (in impala-shell) can be changed, but the behavior does not seem to change.

Following are the run times for one query stream.

Query    Virtuoso    Impala      Notes
--       332     s   841     s   Data Load
Q1       1.098 s     164.61 s
Q2       0.187 s     24.19 s
Q3       0.761 s     105.70 s
Q4       0.205 s     179.67 s
Q5       0.808 s     84.51 s
Q6       2.403 s     4.43 s
Q7       0.59  s     270.88 s
Q8       0.775 s     51.89 s
Q9       1.836 s     177.72 s
Q10      3.165 s     39.85 s
Q11      1.37  s     22.56 s
Q12      0.356 s     17.03 s
Q13      2.233 s     103.67 s
Q14      0.488 s     10.86 s
Q15      0.72  s     11.49 s
Q16      0.814 s     23.93 s
Q17      0.681 s     276.06 s
Q18      1.324 s     267.13 s
Q19      0.417 s     368.80 s
Q20      0.792 s     60.45 s
Q21      0.720 s     418.09 s
Q22      0.155 s     40.59 s
Total    20    s     2724   s

Because the platform utilization was often low, we made a second experiment running the same queries in five parallel sessions. We show the average execution time for each query. We then compare this with the Virtuoso throughput run average times. We permute the single query stream used in the first tests in 5 different orders, as per the TPC-H spec. The results are not entirely comparable, because Virtuoso is doing the refreshes in parallel. According to Impala documentation, there is no random delete operation, so the refreshes cannot be implemented.

Just to establish a baseline, we do SELECT COUNT (*) FROM lineitem. This takes 20s when run by itself. When run in five parallel sessions, the fastest terminates in 64s and the slowest in 69s. Looking at top, the platform utilization is indeed about 5x more in CPU%, but the concurrency does not add much to throughput. This is odd, considering that there is no synchronization requirement worth mentioning between the operations.

Following are the average times for each query in the 5 stream experiment.

Query    Virtuoso    Impala
Q1       1.95 s      191.81 s
Q2       0.70 s      40.40 s
Q3       2.01 s      95.67 s
Q4       0.71 s      345.11 s
Q5       2.93 s      112.29 s
Q6       4.76 s      14.41 s
Q7       2.08 s      329.25 s
Q8       3.00 s      98.91 s
Q9       5.58 s      250.88 s
Q10      8.23 s      55.23 s
Q11      4.26 s      27.84 s
Q12      1.74 s      37.66 s
Q13      6.07 s      147.69 s
Q14      1.73 s      23.91 s
Q15      2.27 s      23.79 s
Q16      2.41 s      34.76 s
Q17      3.92 s      362.43 s
Q18      3.02 s      348.08 s
Q19      2.27 s      443.94 s
Q20      3.05 s      92.50 s
Q21      2.00 s      623.69 s
Q22      0.37 s      61.36 s

Total for slowest stream: 67 s (Virtuoso), 3740 s (Impala)

There are 4 queries in Impala that terminated with an error (memory limit exceeded). These were two Q21s, one Q19, one Q4. One stream executed without errors, so this stream is reported as the slowest stream. Q21 will, in the absence of indexed access, do a hash build side of half of lineitem, which explains running out of memory. Virtuoso does Q21 mostly by index.

Looking at the 5 streams, we see CPU between 1000% and 2000% on either box. This looks about 5x more than the 250% per box that we were seeing with, for instance, Q1. The process sizes for impalad are over 160G, certainly enough to have the working set in memory. iostat also does not show any I/O, so we seem to be running from memory, as intended.

We observe that Impala does not store tables in any specific order. Therefore a merge join of orders and lineitem is not possible. Thus we always get a hash join with a potentially large build side, e.g., half of orders and half of lineitem in Q21, and all orders in Q9. This explains in part why these take so long. TPC-DS does not pose this particular problem though, as there are no tables in the DS schema where the primary key of one would be the prefix of that of another.
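As a reminder of why the build side matters: a hash join must materialize one entire input in memory before the other side can probe it. The following is a schematic Python sketch of the technique, not Impala code, with made-up row data:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two row streams by materializing the 'build' side in a hash
    table, then streaming the 'probe' side past it."""
    table = {}
    for row in build_rows:                      # entire build side held in memory
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                      # probe side is streamed
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

# Toy stand-ins for orders (build) and lineitem (probe).
orders = [{"o_orderkey": 1, "o_orderdate": "1995-03-15"},
          {"o_orderkey": 2, "o_orderdate": "1996-01-02"}]
lineitem = [{"l_orderkey": 1, "l_quantity": 17},
            {"l_orderkey": 1, "l_quantity": 36},
            {"l_orderkey": 3, "l_quantity": 8}]

joined = list(hash_join(orders, lineitem, "o_orderkey", "l_orderkey"))
print(len(joined))  # 2
```

When the build side is half of orders or half of lineitem, that dictionary is what exhausts memory; a merge join over co-ordered tables never needs to materialize it.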

However, the lineitem/orders join does not explain the scores on Q1, Q20, or Q19. A simple hash join of lineitem and part was about 90s, with a replicated part hash table. In the profile, the hash probe was 74s, which seems excessive. One would have to single-step through the hash probe to find out what actually happens. Maybe there are prohibitive numbers of collisions, which would throw off the results across the board. We would have to ask the Impala community about this.

Anyway, Impala experts out there are invited to set the record straight. We have attached the results and the output of the Impala profile statement for each query. One archive contains the evidence for the single-stream run; the other holds the 5-stream run.

To be more Big Data-like, we should probably run with significantly larger data than memory; for example, 3T in 0.5T RAM. At EC2, we could do this with 2 I3.8 instances (6.4T SSD each). With Virtuoso, we'd be done in 8 hours or so, counting 2x for the I/O and 30x for the greater scale (the 100G experiment goes in 8 minutes or so, all included). With Impala, we could be running for weeks, so at the very least we'd like to do this with an Impala expert, to make sure things are done right and will not have to be retried. Some of the hash joins would have to be done in multiple passes and with partitioning.

In subsequent articles, we will look at other players in this space, and possibly some other benchmarks, like the TPC-DS subset that Actian uses to beat Impala.

July 14 2015


Semantic Web Company with LOD2 project top listed at the first EC Innovation Radar

The Innovation Radar is a DG Connect support initiative which focuses on the identification of high potential innovations and the key innovators behind them in FP7, CIP and H2020 projects. The Radar supports the innovators by suggesting a range of targeted actions that can assist them in fulfilling their potential in the market place. The first Innovation Radar Report reviews the innovation potential of ICT projects funded under 7th Framework Programme and the Competitiveness and Innovation Framework Programme. Between May 2014 and January 2015, the Commission reviewed 279 ICT projects, which had resulted in a total of 517 innovations, delivered by 544 organisations in 291 European cities.

Core of the analysis is the Innovation Capacity Indicator (ICI), which measures both the ability of the innovator company and the quality of the environment in which it operates. Among these results, SWC has received two top rankings: one for the recently concluded LOD2 project (LOD2 – Creating Knowledge out of Interlinked Data), and another as one of the key organisations, and thereby innovating SMEs, within these projects. Also listed are our partners OpenLink Software and Wolters Kluwer Germany.

Ranking of the top 10 innovations and key organisations behind them (Innovation Radar 2015)

We are happy and proud that the report identifies Semantic Web Company as one of those players (10%) where commercial exploitation of innovations is already ongoing. That strengthens and confirms our approach of interconnecting FP7 and H2020 research and innovation activities with real-world business use cases coming from our customers and partners. Our core product, the PoolParty Semantic Suite, can thereby be taken as a best-practice example of embedding collaborative research into an innovation-driven commercial product. For Semantic Web Company, the report is particularly encouraging because of its emphasis on the positive role of SMEs, with the report seeing 41% of high-potential innovation coming from…


Blogpost by Martin Kaltenböck and Thomas Thurner

July 13 2015


Vectored Execution in Column/Row Stores

This article discusses the relationship between vectored execution and column- and row-wise data representations. Column stores are traditionally considered to be good for big scans but poor at indexed access. This is not necessarily so, though. We take TPC-H Q9 as a starting point, working with different row- and column-wise data representations and index choices. The goal of the article is to provide a primer on the performance implications of different physical designs.

All the experiments are against the TPC-H 100G dataset hosted in Virtuoso on the test system used before in the TPC-H series: dual Xeon E5-2630, 2x6 cores x 2 threads, 2.3GHz, 192 GB RAM. The Virtuoso version corresponds to the feature/analytics branch in the v7fasttrack github project. All run times are from memory, and queries generally run at full platform, 24 concurrent threads.

We note that RDF stores and graph databases usually do not have secondary indices with multiple key parts. However, they predominantly do index-based access, as opposed to big scans and hash joins. To explore the impact of this, we have decomposed the tables into projections with a single dependent column, which approximates a triple store or a vertically-decomposed graph database like Sparksee.

So, in these experiments, we store the relevant data four times over, as follows:

  • 100G TPC-H dataset in the column-wise schema as discussed in the TPC-H series, now complemented with indices on l_partkey and on l_partkey, l_suppkey

  • The same in row-wise data representation

  • Column-wise tables with a single dependent column for l_partkey, l_suppkey, l_extendedprice, l_quantity, l_discount, ps_supplycost, s_nationkey, p_name. These all have the original table's primary key, e.g., l_orderkey, l_linenumber for the l_-prefixed tables

  • The same with row-wise tables

The column-wise structures are in the DB qualifier, and the row-wise are in the R qualifier. There is a summary of space consumption at the end of the article. This is relevant for scalability, since even if row-wise structures can be faster for scattered random access, they will fit less data in RAM, typically 2 to 3x less. Thus, if "faster" rows cause the working set not to fit, "slower" columns will still win.

As a starting point, we know that the best Q9 is the one in the Virtuoso TPC-H implementation which is described in Part 10 of the TPC-H blog series. This is a scan of lineitem with a selective hash join followed ordered index access of orders, then hash joins against the smaller tables. There are special tricks to keep the hash tables small by propagating restrictions from the probe side to the build side.
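The trick of propagating restrictions from the probe side to the build side can be sketched abstractly. This is an illustrative Python sketch of the idea, not Virtuoso's implementation, and all names are made up:

```python
def pruned_hash_build(build_rows, build_key, probe_keys):
    """Build a hash table over only those build rows whose key actually
    occurs on the probe side, keeping the table small."""
    restriction = set(probe_keys)   # restriction propagated from the probe side
    table = {}
    for row in build_rows:
        if row[build_key] in restriction:
            table.setdefault(row[build_key], []).append(row)
    return table

# Only keys seen in the (already filtered) probe stream get built.
parts = [{"p_partkey": k, "p_name": "part %d" % k} for k in range(1000)]
probe_partkeys = [5, 17, 17, 400]
table = pruned_hash_build(parts, "p_partkey", probe_partkeys)
print(sorted(table))  # [5, 17, 400]
```

A real engine propagates such restrictions as Bloom filters or key ranges rather than materialized key sets, but the effect is the same: the build side stays small.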

The query texts are available here, along with the table declarations and scripts for populating the single-column projections. rs.sql makes the tables and indices, rsload.sql copies the data from the TPC-H tables.

The business question is to calculate the profit from sale of selected parts grouped by year and country of the supplier. This touches most of the tables, aggregates over 1/17 of all sales, and touches at least every page of the tables concerned, if not every row.

  SELECT  n_name                                                                 AS  nation, 
          EXTRACT(year FROM o_orderdate)                                         AS  o_year,
          SUM (l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity)  AS  sum_profit
    FROM  lineitem, part, partsupp, orders, supplier, nation
   WHERE    s_suppkey = l_suppkey
     AND   ps_suppkey = l_suppkey
     AND   ps_partkey = l_partkey
     AND    p_partkey = l_partkey
     AND   o_orderkey = l_orderkey
     AND  s_nationkey = n_nationkey
     AND  p_name LIKE '%green%'
GROUP BY  nation, o_year
ORDER BY  nation, o_year DESC

Query Variants

The query variants discussed here are:

  1. Hash based, the best plan -- 9h.sql

  2. Index based with multicolumn rows, with lineitem index on l_partkey -- 9i.sql, 9ir.sql

  3. Index based with multicolumn rows, lineitem index on l_partkey, l_suppkey -- 9ip.sql, 9ipr.sql

  4. Index based with one table per dependent column, index on l_partkey -- 9p.sql

  5. Index based with one table per dependent column, with materialized l_partkey, l_suppkey -> l_orderkey, l_linenumber -- 9pp.sql, 9ppr.sql

These are done against row- and column-wise data representations with 3 different vectorization settings. The dynamic vector size starts at 10,000 values in a vector, and adaptively upgrades this to 1,000,000 if it finds that index access is too sparse. Accessing rows close to each other is more efficient than widely scattered rows in vectored index access, so using a larger vector will likely cause a denser, hence more efficient, access pattern.

The 10K vector size corresponds to running with a fixed vector size. The Vector 1 setting sets the vector size to 1, effectively running a tuple at a time, which corresponds to a non-vectorized engine.

We note that lineitem and its single-column projections contain 600M rows. So, a vector of 10K values will hit, on average, every 60,000th row, and a vector of 1,000,000 will hit every 600th. This is when doing random lookups that are in no specific order, e.g., getting lineitems by a secondary index on l_partkey.
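The arithmetic above can be restated as a one-liner (the function name is made up; the row count is the 100G lineitem cardinality from the text):

```python
ROWS = 600_000_000  # lineitem row count at TPC-H 100G

def avg_hit_spacing(vector_size: int) -> int:
    """Average distance between consecutive rows hit by one vector of
    uniformly scattered index lookups."""
    return ROWS // vector_size

print(avg_hit_spacing(10_000))     # 60000 -> every 60,000th row
print(avg_hit_spacing(1_000_000))  # 600   -> every 600th row
```

The smaller the spacing, the more likely consecutive lookups land on the same page, which is exactly why the adaptive upgrade to 1M-value vectors pays off.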

1 — Hash-based plan

Vector        Dynamic    10k        1
Column-wise   4.1 s      4.1 s      145   s
Row-wise      25.6 s     25.9 s     45.4 s

Dynamic vector size has no effect here, as there is no indexed access that would gain from more locality. The column store is much faster because of less memory access (just scan the l_partkey column, filter it with a Bloom filter, and then do a hash table lookup to pick only items with the desired part). The other columns are accessed only for the matching rows. The hash lookup is vectored, since there are hundreds of compressed l_partkey values available at a time. The row store does the hash lookup row by row, hence losing cache locality and instruction-level parallelism.
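The column store's advantage here is late materialization: only the key column is scanned, and the other columns are fetched just for the matching rows. A minimal Python sketch of the access pattern (illustrative only; the plain set below stands in for the Bloom filter plus hash table of the real engine):

```python
def selective_scan(l_partkey, other_columns, wanted_parts):
    """Scan only the key column; fetch the other columns just for the rows
    that match (late materialization, as a column store does)."""
    matches = [i for i, pk in enumerate(l_partkey) if pk in wanted_parts]
    return [dict({name: col[i] for name, col in other_columns.items()},
                 l_partkey=l_partkey[i])
            for i in matches]

# Toy column data; a real scan works on compressed column segments.
l_partkey = [7, 3, 7, 9, 1]
other = {"l_extendedprice": [10.0, 20.0, 30.0, 40.0, 50.0],
         "l_quantity": [1, 2, 3, 4, 5]}
rows = selective_scan(l_partkey, other, wanted_parts={7})
print(rows[0]["l_quantity"])  # 1
```

A row store, by contrast, must touch every full row just to look at its l_partkey.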

Without vectorization, we have a situation where the lineitem scan emits one row at a time. Restarting the scan with the column store takes much longer, since 5 buffers have to be located and pinned instead of one for the row store. The row store is thus slowed down less, but it too suffers almost a factor of 2 from interpretation overhead.

2 — Index-based, lineitem indexed on l_partkey

Vector        Dynamic    10k        1
Column-wise   30.4 s     62.3 s     321 s
Row-wise      31.8 s     27.7 s     122 s

Here the plan scans part, then partsupp, which shares ordering with part; both are ordered on partkey. Then lineitem is fetched by a secondary index on l_partkey. This produces l_orderkey, l_linenumber, which are used to get the l_suppkey. We then check if the l_suppkey matches the ps_suppkey from partsupp, which drops 3/4 of the rows. The next join is on orders, which shares ordering with lineitem; both are ordered on orderkey.

There is a narrow win for columns with dynamic vector size. When access becomes scattered, rows win by 2.5x, because there is only one page to access instead of 1 + 3 for columns. This is compensated for if the next item is found on the same page, which happens if the access pattern is denser.

3 — Index-based, lineitem indexed on l_partkey, l_suppkey

Vector        Dynamic    10k        1
Column-wise   16.9 s     47.2 s     151 s
Row-wise      22.4 s     20.7 s     89  s

This is similar to the previous, except that now only lineitems that match ps_partkey, ps_suppkey are accessed, as the secondary index has two columns. Access is more local. Columns thus win more with dynamic vector size.

4 — Decomposed, index on l_partkey

Vector        Dynamic    10k        1
Column-wise   35.7 s     170 s      601 s
Row-wise      44.5 s     56.2 s     130 s

Now, each of the l_extendedprice, l_discount, l_quantity and l_suppkey is a separate index lookup. The times are slightly higher but the dynamic is the same.

The non-vectored columns case is hit the hardest.

5 — Decomposed, index on l_partkey, l_suppkey

Vector        Dynamic    10k        1
Column-wise   19.6 s     111 s      257  s
Row-wise      32.0 s     37  s      74.9 s

Again, we see the same dynamic as with a multicolumn table. Columns win slightly more at long vector sizes because of overall better index performance in the presence of locality.

Space Utilization

The following tables list the space consumption in megabytes of allocated pages. Unallocated space in database files is not counted.

The row-wise table also contains entries for column-wise structures (DB.*) since these have a row-wise sparse index. The size of this is however negligible, under 1% of the column-wise structures.

Row-Wise

   MB  structure
73515  R.DBA.LINEITEM
14768  R.DBA.ORDERS
11728  R.DBA.PARTSUPP
10161  r_lpk_pk
10003  r_l_pksk
 9908  R.DBA.l_partkey
 8761  R.DBA.l_extendedprice
 8745  R.DBA.l_discount
 8738  r_l_pk
 8713  R.DBA.l_suppkey
 6267  R.DBA.l_quantity
 2223  R.DBA.CUSTOMER
 2180  R.DBA.o_orderdate
 2041  r_O_CK
 1911  R.DBA.PART
 1281  R.DBA.ps_supplycost
  811  R.DBA.p_name
  127  R.DBA.SUPPLIER
   88  DB.DBA.LINEITEM
   24  DB.DBA.ORDERS
   11  DB.DBA.PARTSUPP
    9  R.DBA.s_nationkey
    5  l_pksk
    4  DB.DBA.l_partkey
    4  lpk_pk
    4  DB.DBA.l_extendedprice
    3  l_pk
    3  DB.DBA.l_suppkey
    2  DB.DBA.CUSTOMER
    2  DB.DBA.l_quantity
    1  DB.DBA.PART
    1  O_CK
    1  DB.DBA.l_discount

Column-Wise

   MB  structure
36482  DB.DBA.LINEITEM
13087  DB.DBA.ORDERS
11587  DB.DBA.PARTSUPP
 5181  DB.DBA.l_extendedprice
 4431  l_pksk
 3072  DB.DBA.l_partkey
 2958  lpk_pk
 2918  l_pk
 2835  DB.DBA.l_suppkey
 2067  DB.DBA.CUSTOMER
 1618  DB.DBA.PART
 1156  DB.DBA.l_quantity
  961  DB.DBA.ps_supplycost
  814  O_CK
  798  DB.DBA.l_discount
  724  DB.DBA.p_name
  436  DB.DBA.o_orderdate
  126  DB.DBA.SUPPLIER
    1  DB.DBA.s_nationkey

In both cases, the large tables are on top, but the column-wise case takes only half the space due to compression.

We note that the single column projections are smaller column-wise. The l_extendedprice is not very compressible hence column-wise takes much more space than l_quantity; the row-wise difference is less. Since the leading key parts l_orderkey, l_linenumber are ordered and very compressible, the column-wise structures are in all cases noticeably more compact.

The same applies to the multipart index l_pksk and r_l_pksk (l_partkey, l_suppkey, l_orderkey, l_linenumber) in column- and row-wise representations.

Note that STRING columns (e.g., l_comment) are not compressed. If they were, the overall space ratio would be even more to the advantage of the column store.


Conclusions

Column stores and vectorization inextricably belong together. Column-wise compression yields great gains also for indices, since sorted data is easy to compress. Also for non-sorted data, adaptive use of dictionaries, run lengths, etc., produce great space savings. Columns also win with indexed access if there is locality.

Row stores have less dependence on locality, but they also will win by a factor of 3 from dropping interpretation overhead and exploiting join locality.

For point lookups, columns lose by 2+x but considering their better space efficiency, they will still win if space savings prevent going to secondary storage. For bulk random access, like in graph analytics, columns will win because of being able to operate on a large vector of keys to fetch.

For many workloads, from TPC-H to LDBC social network, multi-part keys are a necessary component of physical design for performance if indexed access predominates. Triple stores and most graph databases do not have such and are therefore at a disadvantage. Self-joining, like in RDF or other vertically decomposed structures, can cost up to a factor of 10-20 over a column-wise multicolumn table. This depends however on the density of access.

For analytical workloads, where the dominant join pattern is the scan with selective hash join, column stores are unbeatable, as per common wisdom. There are good physical reasons for this and the row store even with well implemented vectorization loses by a factor of 5.

For decomposed structures, like RDF quads or single column projections of tables, column stores are relatively more advantageous because the key columns are extensively repeated, and these compress better with columns than with rows. In all the RDF workloads we have tried, columns never lose, but there is often a draw between rows and columns for lookup workloads. The longer the query, the more columns win.
