

November 10 2015


PoolParty 5.2 is out now!

We are proud to announce PoolParty release 5.2. Here are our top three highlights:

  • PoolParty’s Custom Scheme management capabilities have been extended, now providing a clear distinction between ontologies and custom schemes. Ontologies can still be added from a list of predefined ontologies. In addition, custom ontologies can be created, allowing users to define classes, relations and attributes that are not covered by the predefined set of ontologies.
  • The Corpus Management workflow has been redesigned to make the integration of new terms based on corpus analysis as easy as possible. The tree view in Corpus Management now also provides a view of the thesaurus, so switching between the two views is no longer necessary.
  • The Custom Schema functionality has been extended: users can reuse relations more flexibly, and new pre-defined elements have been added.

For more information, please take a look at: Release Note 5.2.

The post PoolParty 5.2 is out now! appeared first on PoolParty Semantic Suite.

November 06 2015


If you like “Friends” you will probably also like “Veronica’s Closet” (find out why with SPARQL)

In a previous blog post I discussed the power of SPARQL to go beyond data retrieval into analytics. Here I look into implementing a product recommender entirely in SPARQL. Products are considered similar if they share relevant characteristics, and the higher the overlap, the higher the similarity. In the case of movies or TV programs there are static characteristics (e.g. genre, actors, director) and dynamic ones like the viewing patterns of the audience.

The static part we can look up in resources like DBpedia. If we look at the data related to the resource <> (which represents the TV show “Friends”), we can use, for example, the associated subjects (see the predicate dcterms:subject). In this case we find, for example, <> or <>. If we want to find other TV shows related to the same subjects, we can do this with the following query:

[Screenshot: SPARQL query (click to get code)]

The query can be executed at the DBpedia SPARQL endpoint (with the default graph). Read from the inside out, the query does the following:
  1. Count the number of subjects related to TV show “Friends”.
  2. Get all TV shows that share at least one subject with “Friends” and count how many they have in common.
  3. For each of those related shows count the number of subjects they are related to.
  4. Now we can calculate the relative overlap in subjects which is (number of shared subjects) / (numbers of subjects for “Friends” + number of subjects for other show – number of common subjects).
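Put together, such a query might look roughly like the following (a minimal sketch, assuming the standard DBpedia namespaces, the class dbo:TelevisionShow and dbr:Friends as the starting resource; the original query in the screenshot above may differ in detail):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo:     <http://dbpedia.org/ontology/>
PREFIX dbr:     <http://dbpedia.org/resource/>

SELECT ?showB ?subjCountShowAB ?subjCountShowA ?subjCountShowB
       ((?subjCountShowAB / (?subjCountShowA + ?subjCountShowB - ?subjCountShowAB)) AS ?subjScore)
WHERE {
  # (1) number of subjects of "Friends"
  { SELECT (count(distinct ?subjA) AS ?subjCountShowA)
    WHERE { dbr:Friends dcterms:subject ?subjA } }
  # (2) shows sharing at least one subject with "Friends", and how many they share
  { SELECT ?showB (count(distinct ?subj) AS ?subjCountShowAB)
    WHERE {
      dbr:Friends dcterms:subject ?subj .
      ?showB dcterms:subject ?subj ;
             a dbo:TelevisionShow .
      FILTER(?showB != dbr:Friends)
    } GROUP BY ?showB }
  # (3) total number of subjects of each candidate show
  { SELECT ?showB (count(distinct ?subjB) AS ?subjCountShowB)
    WHERE { ?showB dcterms:subject ?subjB } GROUP BY ?showB }
}
ORDER BY DESC(?subjScore)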

This gives us a score of how related one show is to another one. The results are sorted by score (the higher the better) and these are the results for “Friends”:

Show                      subjCountShowAB   subjCountShowA   subjCountShowB   subjScore
Will_&_Grace              10                16               18               0.416667
Sex_and_the_City          10                16               21               0.37037
Seinfeld                  10                16               23               0.344828
Veronica’s_Closet         7                 16               12               0.333333
The_George_Carlin_Show    6                 16               9                0.315789
Frasier                   8                 16               18               0.307692

In the first line of the results we see that “Friends” is associated with 16 subjects (the same in every line), “Will & Grace” with 18, and that they share 10 subjects. That results in a score of 0.416667. Other characteristics to look at are the actors starring in a show, the creators (authors), or the executive producers.

We can pack all this into one query and retrieve similar TV shows based on shared subjects, starring actors, creators, and executive producers. The inner queries retrieve the shows that share some of those characteristics, count them as shown before and calculate a score for each dimension. The individual scores can be weighted; in the example here the creator score is multiplied by 0.5 and the producer score by 0.75 to adjust the influence of each of them.
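From the result table below one can see that the integrated score is simply the weighted average of the four dimension scores (the weights are taken from the description above; the divisor of 4 is inferred from the published numbers):

    integratedScore = (subjScore + starScore + 0.5 * creatorScore + 0.75 * execprodScore) / 4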

[Screenshot: combined SPARQL query (click to get code)]

This results in:

Show                                 subjScore   starScore   creatorScore   execprodScore   integratedScore
The_Powers_That_Be_(TV_series)       0.17391     0.0         1.0            0.0             0.1684782608
Veronica’s_Closet                    0.33333     0.0         0.0            0.428571        0.1636904761
Family_Album_(1993_TV_series)        0.14285     0.0         0.666667       0.0             0.1190476190
Jesse_(TV_series)                    0.28571     0.0         0.0            0.181818        0.1055194805
Will_&_Grace                         0.41666     0.0         0.0            0.0             0.1041666666
Sex_and_the_City                     0.37037     0.0         0.0            0.0             0.0925925925
Seinfeld                             0.34482     0.0         0.0            0.0             0.0862068965
Work_It_(TV_series)                  0.13043     0.0         0.0            0.285714        0.0861801242
Better_with_You                      0.25        0.0         0.0            0.125           0.0859375
Dream_On_(TV_series)                 0.16666     0.0         0.333333       0.0             0.0833333333
The_George_Carlin_Show               0.31578     0.0         0.0            0.0             0.0789473684
Frasier                              0.30769     0.0         0.0            0.0             0.0769230769
Everybody_Loves_Raymond              0.30434     0.0         0.0            0.0             0.0760869565
Madman_of_the_People                 0.3         0.0         0.0            0.0             0.075
Night_Court                          0.3         0.0         0.0            0.0             0.075
What_I_Like_About_You_(TV_series)    0.25        0.0         0.0            0.0625          0.07421875
Monty_(TV_series)                    0.15        0.14285     0.0            0.0             0.0732142857
Go_On_(TV_series)                    0.13043     0.07692     0.0            0.111111        0.0726727982
The_Trouble_with_Larry               0.19047     0.1         0.0            0.0             0.0726190476
Joey_(TV_series)                     0.21739     0.07142     0.0            0.0             0.0722049689

Each line shows the individual scores for each of the predicates used and, in the last column, the final score. You can also try out the query with “House” <> or “Suits” <> and get shows related to those.

This approach can be used for any comparable data where we want to obtain similar items based on the characteristics they share. One could, for example, compare persons (by profession, interests, …) or consumer electronics products like cameras (by resolution, storage, size or price range).

November 03 2015


Presenting ‘Dynamic Semantic Publishing’ at Taxonomy Boot Camp 2015

Andreas Blumauer gave a talk at Taxonomy Boot Camp 2015 in Washington D.C. His presentation covered the following key message:

Dynamic Semantic Publishing is not only about documents and news articles; rather, it is based on linking (‘triangulating’) users/employees, products/projects, and content/articles. Based on this methodology, not only can dynamically generated ‘topic pages’ be created, but a ‘connected customer’ experience also becomes reality. It is all about providing more personalized user experiences and customer journeys.

Andreas mentioned a couple of real-world use cases embracing this paradigm, among them use cases in the areas of publishing (Wolters Kluwer), media (Red Bull), health information (healthdirect Australia), and clean energy (Climate Tagger).

Download: Dynamic Semantic Publishing

The post Presenting ‘Dynamic Semantic Publishing’ at Taxonomy Boot Camp 2015 appeared first on PoolParty Semantic Suite.


ADEQUATe for the Quality of Open Data

The ADEQUATe project builds on two observations: an increasing amount of Open Data is becoming available as an important resource for emerging businesses, and the integration of such open, freely re-usable data sources into organisations’ data warehouse and data management systems is seen as a key success factor for competitive advantage in a data-driven economy.

The project identifies two crucial issues which have to be tackled to fully exploit the value of open data and enable efficient integration with other data sources:

  1. the overall quality issues with metadata and the data itself
  2. the lack of interoperability between data sources

The project’s approach is to address these issues at an early stage – when the open data is freshly provided by governmental organisations or others.

The ADEQUATe project works with a combination of data- and community-driven approaches to address the above-mentioned challenges. These include 1) the continuous assessment of the data quality of Open Data portals based on a comprehensive list of quality metrics, 2) the application of a set of (semi-)automatic algorithms in combination with crowdsourcing approaches to fix identified quality issues, and 3) the use of Semantic Web technologies to transform legacy Open Data sources (mainly common text formats) into Linked Data.

The project intends to research and develop novel automated and community-driven data quality improvement techniques and then integrate pilot implementations into existing Open Data portals. Furthermore, a quality assessment & monitoring framework will evaluate and demonstrate the impact of the ADEQUATe solutions for the above-mentioned business case.

About: ADEQUATe is funded by the Austrian FFG under the programme “ICT of the Future”. The project is run by the Semantic Web Company together with the Institute for Information Business of the Vienna University of Economics & Business and the Department for E-Governance and Administration at the Danube University Krems. The project started in August 2015 and will run until March 2018.



October 09 2015


Ensure data consistency in PoolParty

The Semantic Web Company and its PoolParty team are participating in the H2020-funded project ALIGNED. This project evaluates software engineering and data engineering processes with a view to how these two worlds can be aligned efficiently. All project partners are working on several use cases, which shall result in a set of detailed requirements for combined software and data engineering. The ALIGNED project framework also includes work and research on data consistency in the PoolParty Thesaurus Server (PPT).

ALIGNED: Describing, finding and repairing inconsistencies in RDF data sets

When using RDF to represent the data model of applications, inconsistencies can occur. Compared with the schema approach of relational databases, a data model using RDF offers much more flexibility. Usually, the application’s business logic produces and modifies the model data and can therefore guarantee the consistency needed for its operations. However, information may not only be created and modified by the application itself but may also originate from external sources like RDF imports into the data model’s triple store. This may result in inconsistent model data causing the application to fail. Therefore, constraints have to be specified and enforced to ensure data consistency for the application. In Phase 1 of the ALIGNED project, we outline the problem domain and requirements for the PoolParty Thesaurus Server use case, with the goal of establishing a solution for describing, finding and repairing inconsistencies in RDF data sets. We propose a framework as a basis for integrating RDF consistency management into PoolParty Thesaurus Server software components. The approach is a work in progress that aims to adopt technologies developed by the ALIGNED project partners and refine them for use in an industrial-strength application.

Technical View

Users of PoolParty often wish to import arbitrary datasets, vocabularies, or ontologies, but these datasets do not always meet the constraints PoolParty imposes. Currently, when users attempt to import data which violates the constraints, the data will simply fail to display or, in the worst case, cause unexpected behaviour and lead to or reflect errors in the application. An enhanced PoolParty will give users feedback on why an import has failed, suggest ways in which they can fix the problem and also identify potential new constraints that could be applied to the data structure. Apart from the import functionality, various other software components, like the taxonomy editor or the reasoning engine, drive RDF data constraints and vice versa. The following figure outlines the utilization and importance of data consistency constraints in the PoolParty application:


[Figure: data consistency constraints in the PoolParty application (click for larger view)]

Approaches and solutions for many of these components already exist. However, the exercise within ALIGNED is to integrate them in an easy-to-use way to comply with the PoolParty environment. Consistency constraints, for example, can be formulated using RDF Data Shapes or interpreting RDFS/OWL constructs with constraints-based semantics. RDFUnit already partly supports these techniques. Repair strategies and curation interfaces are covered by the Seshat Global History Databank project. Automated repair of large datasets can be managed by the UnifiedViews ETL tool, whereas immediate notification on data inconsistencies can be disseminated via the rsine semantic notification framework.
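As an illustration of what such a constraint looks like in practice (a minimal sketch, not taken from the PoolParty codebase): the SKOS integrity condition that a concept may have at most one skos:prefLabel per language can be checked with a plain SPARQL query that returns the violating concepts:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Find concepts that have more than one preferred label in the same language.
SELECT ?concept ?lang (count(?label) AS ?labels)
WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?label .
  BIND(lang(?label) AS ?lang)
}
GROUP BY ?concept ?lang
HAVING (count(?label) > 1)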


Within the ALIGNED project, all project partners demand simple (i.e. maintainable and usable) data quality and consistency management and work on solutions to meet their requirements. Our next steps will encompass research on how to apply these technologies to the PoolParty problem domain and participation in unifying and integrating the different existing tools and approaches. The immediate challenge will be to build an interoperable catalog of formalized PoolParty data consistency constraints and repair strategies so that they are machine-processable in a (semi-)automatic way.

October 08 2015


Webinar – PoolParty for Sustainable Development: The Climate Tagger

Climate change is the greatest challenge of our time, spanning countries and continents, societies and generations, sectors and disciplines. Yet crucial data and information on climate issues are still too often amassed – diffusely – in closed silos. Tools like the “Climate Tagger” utilize Linked Open Data to scan, sort, categorize and enrich climate and development-related data, improving the efficiency and performance of knowledge management systems and thereby helping to meet the challenges of climate change. We have a short window of opportunity to solve these challenges, and Open Knowledge and Open Data are key factors in facing and solving them!

This webinar explains and demonstrates how the PoolParty Semantic Suite can be used for information and data management solutions, and concept tagging in particular, in the fields of clean energy and sustainable development. Florian Bauer, COO of an international non-profit organisation whose mission is to advance clean energy markets in developing countries, will explain why “climate-smart decisions” require connected knowledge systems and how semantic technologies can help achieve that. As a concrete use case we will present the “Climate Tagger”, a tool run by REEEP that helps to connect climate knowledge and that is based on the PoolParty Semantic Suite. Other best-practice examples will be presented, and a Q&A session will allow participants to interact.


  • Florian Bauer, COO & Director “Open Knowledge” of REEEP
  • Quinn Reifmesser, Senior Project Manager REEEP
  • Martin Kaltenböck, Managing Partner & CFO of SWC
  • Sukaina Bharwani, Stockholm Environment Institute Oxford (SEI)


Save the Date: November 5, 2015
3:00pm – 4:00pm CET / 9:00am – 10:00am Eastern Time

Free registration

The post Webinar – PoolParty for Sustainable Development: The Climate Tagger appeared first on PoolParty Semantic Suite.

September 29 2015


SPARQL analytics proves boxers live dangerously

Have you always thought that SPARQL is only a query language for RDF data? Then think again, because SPARQL can also be used to implement some cool analytics. I show here two queries that demonstrate this.

For simplicity we use the publicly available DBpedia dataset on its open SPARQL endpoint (executed with the default graph).

Mean life expectancy for different sports

The query shown here starts from the class dbp:Athlete and retrieves its subclasses, which cover different sports. From these, the athletes of those areas are obtained together with their birth and death dates (i.e. we only take deceased individuals into account). From the dates the years are extracted; a regular expression is used here because the SPARQL function for extracting the year from a date literal returned errors and could not be used. From the birth and death years the age is calculated (we filter for a range of 20 to 100 years because erroneous entries always have to be accounted for in data sources like this). Then the data is simply grouped and we count, for each sport, the number of athletes selected and the average age they reached.

prefix dbp: <http://dbpedia.org/ontology/>   # assumed: the DBpedia ontology namespace (stripped in the original post)

select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
where {
    filter(?age >= 20 && ?age <= 100) .
    {
        select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age)
        where {
            ?subOfAthlete rdfs:subClassOf dbp:Athlete .
            ?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") .
            bind(str(?athleteGroup) as ?athleteGroupEN)
            ?athlete a ?subOfAthlete .
            ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
            ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
            bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
            bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
        }
    }
}
group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg

The results are not unexpected and show that athletes in the areas of motor sports, wrestling and boxing die at a younger age. Horse riders, on the other hand, but also tennis and golf players, clearly live longer on average.

athleteGroupEN                                      count   ageAvg
wrestler                                            693     58.962481962481962
winter sport Player                                 1775    66.60169014084507
tennis player                                       577     71.483535528596187
table tennis player                                 45      68.733333333333333
swimmer                                             402     68.674129353233831
soccer player                                       6572    63.992391965916007
snooker player                                      25      70.12
rugby player                                        1452    67.272038567493113
rower                                               69      63.057971014492754
poker player                                        30      66.866666666666667
national collegiate athletic association athlete    44      68.090909090909091
motorsport racer                                    1237    58.117219078415521
martial artist                                      197     67.157360406091371
jockey (horse racer)                                139     65.992805755395683
horse rider                                         181     74.651933701657459
gymnast                                             175     65.805714285714286
gridiron football player                            4247    67.713680244878738
golf player                                         400     71.13
Gaelic games player                                 95      70.589473684210526
cyclist                                             1370    67.469343065693431
cricketer                                           4998    68.420368147258904
chess player                                        45      70.244444444444444
boxer                                               869     60.352128883774453
bodybuilder                                         27      52
basketball player                                   822     66.165450121654501
baseball player                                     9207    68.611382643640708
Australian rules football player                    2790    69.52831541218638

Doing this directly in the triple store is especially relevant when the data is large and one would otherwise have to extract it from the database and import it into another tool to do the counting and calculations.

Simple statistical measures over life expectancy

Another standard statistical measure is the standard deviation; a good description of how to calculate it can be found, for example, here. We start again with the class dbp:Athlete and calculate the ages the athletes reached (this time for the entire class dbp:Athlete, not its subclasses). We also need the squares of the ages, which we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result and calculate the average age, the square of the sum and the sum of the squares. With those values we can calculate, in the next step, the standard deviation of the ages in our data set. Note that SPARQL does not specify a function for calculating square roots, but RDF stores like Virtuoso (which hosts the DBpedia data) provide additional functions like bif:sqrt for calculating the square root of a value.
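In formula terms, the query computes the familiar shortcut form of the sample standard deviation, with the ages as $x_i$ and the number of athletes as $n$:

    s = \sqrt{ \frac{\sum_i x_i^2 - \left(\sum_i x_i\right)^2 / n}{n - 1} }

which is exactly the expression (?ageSquareSum - ?ageSumSquare / ?count) / (?count - 1) under the square root in the query below.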

prefix dbp: <http://dbpedia.org/ontology/>   # assumed: the DBpedia ontology namespace (stripped in the original post)

select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) /
       ?count)) / (?count - 1)) as ?standDev)
where {
   select (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
          (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum)
   where {
         select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare)
         where {
             filter (?age >= 20 && ?age <= 100) .
             {
                 select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age)
                 where {
                     ?subOfAthlete rdfs:subClassOf dbp:Athlete .
                     ?athlete a ?subOfAthlete .
                     ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
                     ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
                     bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
                     bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
                 }
             }
         }
   }
}

The result:

count   ageAvg               standDev
38542   66.876290799647138   17.6479

These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data: it is possible to implement basic statistical methods directly at the level of the triple store, without the need to extract the data and import it into another tool.

September 25 2015


KM World listed PoolParty Semantic Suite as Trend-Setting Product 2015

PoolParty Semantic Suite has been recognized by KMWorld as  “Trend-Setting Product 2015”. More than 1,000 separate software offerings from more than 200 vendors were reviewed. KMWorld is the United States’ leading magazine for topics surrounding knowledge management systems and content and document management.

Andreas Blumauer, founder and CEO of the Semantic Web Company, comments on the award as follows: “We are truly honoured that KMWorld has chosen us for its prestigious innovator list. It proves that standards-based technologies are on the rise in the enterprise sector. What makes the PoolParty Semantic Suite truly valuable is that it unites the most relevant functionalities for seamless, personalized digital experiences. Subject matter experts and IT can cooperate smoothly, which creates relevant business-technology synergies. This is the essence of a successful digital transformation.”


KMWorld Editor-in-Chief Hugh McKellar says, “The panel, which consists of editorial colleagues, market and technology analysts, KM theoreticians, practitioners, customers and a select few savvy users (in a variety of disciplines) reviewed the offerings. All selected products fulfill the ultimate goal of knowledge management—delivering the right information to the right people at the right time.”


PoolParty Semantic Suite

PoolParty is a semantic technology platform provided by the Semantic Web Company. The EU-based company has been a pioneer in the semantic web since 2001. The product is recognized by industry leaders as one of the most developed semantic technology platforms, supporting enterprise needs in knowledge management, data analytics and content excellence. Typical PoolParty users such as taxonomists, subject matter experts and data analysts can easily build and enhance a knowledge graph without coding skills. Boehringer, Credit Suisse, Roche and The World Bank are among many customers now profiting from transforming data into customer insights with PoolParty.


About KMWorld

KMWorld is the leading information provider serving the Knowledge Management systems market and covers the latest in content, document and knowledge management, informing more than 30,000 subscribers about the components and processes – and subsequent success stories – that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc.


Press Contact

Semantic Web Company

Thomas Thurner

phone: +43-1-402-12-35



September 24 2015


120+ CKAN Portals in the Palm of Your Hand. Via the Open Data Companion (ODC)

CKAN is a powerful open-source data portal platform which provides out-of-the-box tools that allow data producers to make data easily accessible and reusable by everyone. Releasing CKAN as Free and Open Source Software (FOSS) has been a key factor in helping grow the availability and accessibility of open data across the Internet.

The emergence of mobile devices and the mobile platform has led to a shift in the way people access and consume information. Popular consensus  and reports show that mobile device usage and time spent on mobile devices are rapidly increasing. This means that mobile devices are now one of the fastest and easiest means of accessing data and information. Yet, as of now, open data lacks a strong mobile presence.

Open Data Companion (ODC) [pronounced “Odyssey”] seeks to address this challenge by providing a free mobile app that serves as a unified access point to over 120 CKAN 2.0+ compliant open data portals and thousands of datasets from around the world, right from your mobile device. Crafted with mobile-optimised features and design, this is an easy and convenient way to find, access and share open data. ODC provides a way for CKAN portal administrators and data producers to deliver open data to mobile users without the need for additional costs or further portal configuration.

ODC provides key mobile features for CKAN Portals:

  • Mobile users can set up access to as many CKAN-powered portals as they want.
  • Browse datasets from over 120 CKAN-powered data portals around the world by categories.
  • Receive push notifications on your mobile device when new datasets are available from your selected data portals.
  • Download and view data records (resources) on your mobile device.
  • Preview dataset resources and create data visualisations in app before download (as supported by the portal).
  • Bookmark/save datasets for later viewing.
  • “Favourite” your data portals for future easy access.
  • Share links to datasets on social media, email, sms etc. right from the app.
  • In-app tutorial videos designed to help you quickly get productive with the app. Tutorial videos are available offline once downloaded.

To ensure that ODC is usable by all CKAN portals in the wild, the app uses the public CKAN API, which is supported by all CKAN portals. Because the app accesses portals’ data and metadata only through the CKAN API, portals are safeguarded from external malicious attacks; more importantly, portal administrators remain in control of the data delivered to the public through the app. For instance, for ODC to provide in-app previews and visualisations of datasets, portal administrators must install the appropriate CKAN resource preview extensions. Basically, whatever dataset can be accessed from a CKAN portal website can also be accessed by ODC through the CKAN APIs.

How to Make Your CKAN Portal Available to the Mobile Community

Making your CKAN portal available to the mobile community through the ODC app takes three easy steps. As a portal administrator, ensure your CKAN portal is running CKAN 2.0 or above (at the time of writing the latest CKAN version is 2.4) and ensure your portal is publicly available on the World Wide Web. Finally, submit your portal details to the CKAN Census (where the app developer periodically checks for new portal submissions) OR submit the portal details directly to the developer through the feedback section of the app or the app website. That’s all!

Feedback Welcome

ODC is available for download on the Google Play Store and all feedback is welcome. The app is actively developed, so more features will be released. Send feedback through the app or follow ODC on Twitter. You can also read more about the ODC vision, objectives and features from the app website.

Bringing CKAN Portals to the mobile platform is a big step in improving open data accessibility and reusability. It also opens doors to more public involvement in open data growth. I am excited to see what these new opportunities produce, first for the CKAN community and then for the Open Data community in general.

September 21 2015


Showcase your data

We all know CKAN is great for publishing and managing data, and it has powerful visualisation tools to provide instant insights and analysis. But it’s also useful and inspiring to see examples of how open data is being used.

CKAN has previously provided for this with the ‘Related Items’ feature (also known as ‘Apps & Ideas’). We wanted to enhance this feature to address some of its shortcomings, packaging it up as an extension that can easily replace, and migrate data from, Related Items. So we developed the Showcase extension!

Showcase Example

A Showcase details page. This Showcase example is originally from

Separating useful but under-loved features out of CKAN core into extensions like this:

  • makes core CKAN a leaner and a more focused codebase
  • gives these additional features a home, with more dedicated ownership and support
  • means updates and fixes for an extension don’t have to wait until the next release of CKAN

Some improvements made in Showcase include:

  • each showcase has its own details page
  • more than one dataset can be linked to a showcase
  • a new role of Showcase Admin to help manage showcases
  • free tagging of showcases, instead of a predefined list of ‘types’
  • showcase discovery by search and filtering by tag

This was my first contribution to the CKAN project and I wanted to ensure the established voices from the CKAN developer community were able to contribute guidance and feedback.

Remote collaboration can be hard, so I looked at the tools we already use as a team to lower the barrier to participation. I wanted something that was versioned, allowed commenting and collaboration, and provided notification to interested parties as the specification developed. We use GitHub to collect ideas for new features in a repository as Issues, so it seemed like a natural extension to take these loose issues (ideas) and turn them into pull requests (proposals). The proposal and supporting documents can be committed as simple Markdown files and discussed within the Pull Request. This provides line-by-line commentary tools enabling quick iteration based on the feedback. If a proposal is accepted and implemented, the pull request can be merged; if the proposal is unsuccessful, it can be closed.

The Pull Request for the Showcase specification has 22 commits, and 57 comments from nine participants. Their contributions were invaluable and helped to quickly establish what and how the extension was going to be built. Their insights helped me get up to speed with CKAN and its extension framework and prevented me from straying too far in the wrong direction.

So, by developing the specification and coding in the open, we’ve managed to take an unloved feature of CKAN and give it a bit of polish and hopefully a new lease of life. I’d love to hear how you’re using it!

September 18 2015


Pyramids, Pipelines and a Can-of-Sweave – CKAN Asia-Pacific Meetup

Florian Mayer from the Western Australian Department of Parks and Wildlife presents various methods he is using to create Wisdom.

Data+Code = Information; Information + Context = Wisdom

So, can this be done with workbooks, applications and active documents?

As Florian might say, “Yes it CKAN”!

Grab the code and materials related to the work from here:

This presentation was given at the first Asia-Pacific CKAN meetup on the 17th of September, hosted at Link Digital as an initiative of the CKAN Community and Communications team. You can join the meetup and come along to these fortnightly sessions via video conference.

If you have some interesting content to present then please get in touch with @starl3n to schedule a session.

September 16 2015


Implementing VectorTiles Preview of Geodata on HDX

This post is a modified version of a post on the HDX blog, adapted to highlight the information of most interest to the CKAN community. You can see the original post here.

Humanitarian data is almost always inherently geographic. Even the data in a simple CSV file will generally correspond to some piece of geography: a country, a district, a town, a bridge, or a hospital, for example.

HDX has built on CKAN’s preview capabilities with the ability to preview large (up to 500MB) vector geographic datasets in a variety of formats.  Resources uploaded (or linked) to HDX with the format strings ‘geojson’, ‘zipped shapefile’, or ‘kml’ will trigger the creation of a geo preview. Here is an example showing administrative boundaries for Colombia:


To minimize bandwidth use for users in often poorly-connected field locations, we built the preview from vector tiles. This means that details are removed at small scales but reappear as you zoom in.

The preview is created only for the first layer it encounters in a resource. If the resource contains multiple layers, the others will not show up. For those cases, you can create separate resources for each layer and they will be available in the preview. Multiple geometry types (polygon + line, for example) in kml or geojson are not yet supported.


It’s a common problem in interactive mapping: to preview the whole geographic dataset, we would need to send all of the data to the browser, but that can require a long download or even crash the browser. The classic solution is to use a set of pre-rendered map tiles — static map images made for different zoom levels and cut into tiny pieces called tiles.  The browser has to load only a few of these pieces for any given view of the map. However, because they are just raster images, the user cannot interact with them in any advanced way.

We wanted to maintain interactivity with the data, eventually having hover effects or allowing users to customize styling, so we knew that we needed a different approach. We reached out to our friends at Geonode who pointed us to the recently developed Vector Tiles Specification.

The vector tile solution is a similar approach to traditional map tiles, but instead of creating static image tiles, it involves cutting the geodata layer into small tiles of vector data. Each zoom level receives a simplification (level of detail, or LoD) pass, which reduces the number of vertices displayed, similar to the way that 3D video games or simulators reduce the number of polygons in distant objects to improve performance. This means that for any given zoom level and location, the browser needs to download only the vertices necessary to fill the map.  You can learn more about how vector tiles work in this helpful FOSS4G NA talk from earlier this year.

Because vector tiles are a somewhat-new technology, there wasn’t any off-the-shelf framework to let us integrate them with our CKAN instance. Instead, we built a custom solution from several existing components (along with our own integration code):

Our architecture looks like this:


The GISRestLayer orchestrates the entire process by notifying each component when there is a task to do. It then informs CKAN when the task is complete, and a dataset has a geo preview available.  It can take a minute or longer to generate the preview, so the asynchronous approach — managed through Redis Queue (RQ) — was essential to let our users continue to work while the process is running. A special HDX team member, Geodata Preview Bot, is used to make the changes to CKAN. This makes the nature of the activity on the dataset clear to our users.

Future development

This approach gives HDX a good foundation for adding new geodata features in the future. We will be conducting research to understand what users think is important to add next. Here are some initial new-feature ideas:

  • Automatically generate additional download formats so that every geodataset is available in zipped shapefile, GeoJSON, KML, etc.
  • Allow the contributing user to specify the order of the resources in the map legend (and therefore which one appears by default).
  • Allow users to preview multiple datasets on the same map for comparison.
  • Automatically apply different symbol colors to different resources in the same dataset.
  • Allow users to style the geographic data, changing colors and symbols.
  • Allow users to configure and embed maps of their data in their organization or crisis pages.
  • Provide OGC-compliant web services of contributed datasets (WFS, WMS, etc.).
  • Allow external geographic data services (WMS, WFS, etc) to be added to a map preview.
  • Make our vector tiles available as a web service.

If any of these enhancements sound useful or you have new ideas, send us an email. If you have geodata to share with the HDX community, start adding your data here.

We would like to say a special thanks to Jeffrey Johnson who pointed us toward the vector tiles solution and to the contributors of all the open source projects listed above! In addition to GISRestLayer, you’ll find the rest of our code here.

September 11 2015


Building tools for Open Data adoption

At DataCats, we are focused on a simple problem — how do we make sure every single government has easy access to get up and running with Open Data? In other words, how do we make it as easy as possible for governments of all levels to start publishing open data?

The answer, as you might tell from this blog, is CKAN. But CKAN uses a very non-traditional technology stack, especially by government standards: Python, PostgreSQL, Solr, and Unix are not in the toolbox of most IT departments. This is true not only for local governments in Europe and North America, but also for almost all governments in the developing world.

Our answer to this problem is two software projects which, like CKAN, are Free and Open Source Software. The first is the eponymously named datacats, and the second is named CKAN Multisite. The two projects together aim to solve the operational difficulties in deploying and managing CKAN installations.

datacats is a command line library built on Docker, a popular new alternative to virtualization that is experiencing explosive growth in industry. It aims to help CKAN developers easily get set up and running with one or more CKAN development instances, as well as deploy those easily on any provider – be it Amazon, Microsoft Azure, Digital Ocean, or a plain old physical server data centre.

Our team has been using datacats to develop a number of large CKAN projects for governments here in Canada and around the world. Being open source, we get word every week of another IT department somewhere that is trying it out.

CKAN Multisite is a companion project to datacats, targeted at system administrators who wish to manage one or more CKAN instances on their infrastructure. The project was very generously sponsored by U.S. Open Data. Multisite provides a simple API and a web interface through which starting, stopping, and managing CKAN servers is as simple as pressing a button. In essence it gives you your very own CKAN cloud.

CKAN is an open source project that many national and large city governments depend on as the cornerstone of their open data programs. We hope that these two open source projects will help the CKAN ecosystem continue to grow. If you are a sysadmin or a developer working on CKAN, give them a try — and, if you have the appetite, consider contributing to the projects themselves.

September 09 2015

PricewaterhouseCoopers is launching its Information Management Application, based on EXALEAD CloudView

August 21 2015


Matthew Fullerton and some interesting CKAN extension development.

Note: This is a re-post from one of our CKAN community contributors, Matthew Fullerton. He has been working on some interesting extensions, which are outlined below. You can support Matthew’s work by providing comments below, or you can visit his GitHub profile to comment or get in touch there.


Styling GeoJSON data

The GeoView extension makes it easy to add resource views of GeoJSON data. In our extended extension, attributes of the features (lines, points) in the FeatureCollection are styled according to MapBox’s SimpleStyle spec.

Here’s an example where the file has been processed to add colors based on traffic flow state:

And another where the points are styled to (vaguely) look like colored traffic lights: (watch out, it can take a while to load)

Realtime GeoJSON data

Using leaflet.realtime, an extension for the leaflet library that CKAN (GeoView) uses to visualize GeoJSON, maps can have changing points or colors/styles.

Here is an example of traffic lights changing according to pre-recorded data:

I’ll try to add a demo with moving data points soon; it ought to work without any further code changes. The problem is often getting the live data in GeoJSON format… but we have a backend for preprocessing other data.

Realtime data plotting

By making only a few small changes, we are able to continuously update Graph views. You can see the changing (or not) temperature in our office here:

That’s an example for ‘lines and points’ but it works for things like bar graphs too. Last week we had people competing to achieve the best time in a remote controlled robot race where their time was automatically displayed as a bar on a ‘leader board’. For good measure we had an automatically updating histogram of the times too. Updating the actual data in CKAN is easy thanks to the DataStore API.

Matthew Fullerton

Freelance Software Developer and EXIST Stipend holder with the start up project “Tapestry” -

August 16 2015


Two new CKAN extensions – Webhooks and Geopusher

Denis Zgonjanin recently shared the following update on two new extensions via the CKAN Dev mail list.

If you are working on CKAN extensions and would like to share details with other developers then post your updates via the mail list. We’ll always look at promoting the great work of community contributions via this blog :) If you have an interesting CKAN story to share feel free to ping @starl3n to organise a guest post.

From Denis:


A problem I’ve had personally is having my open data apps know when a dataset they’ve been using has been updated. You can of course poll CKAN periodically, but then you need cron jobs or a queue, and when you’re using a cheap PaaS like Heroku for your apps, integrating queues and cron is just an extra hassle.

This extension lets people register a URL with CKAN, which CKAN will call when a certain event happens – for example, a dataset update. The extension uses the built-in CKAN celery queue, so as to be non-blocking.

If you do end up using it, there are still a bunch of nice features to be built, including a simple web interface through which users can register webhooks (right now they can only be created through the action API).


So you know how you have a lot of Shapefiles and KML files in your CKANs (because government), but your users prefer GeoJSON? This extension will automatically convert shapefiles and KML into GeoJSON, and create a new GeoJSON resource within that dataset. There are some cases where this won’t work depending on complexity of SHP or KML file, but it works well in general.

This extension also uses the built-in celery queue to do its work, so for both of these extensions you will need to start the celery daemon in order to use them:

`paster --plugin=ckan celeryd -c development.ini`

August 11 2015


DBpedia Usage Report, August 2015

We recently published the latest DBpedia Usage Report, covering v3.3 (released July, 2009) to v3.10 (sometimes called "DBpedia 2014"; released September, 2014).

The new report has usage data through July 31, 2015, and brought a few surprises to our eyes. What do you think?

August 05 2015


Beauty behind the scenes

Good things can often go unnoticed, especially if they’re not immediately visible. Last month the government of Sweden, through Vinnova, released a revamped version of their open data portal, Ö. The portal still runs on CKAN, the open data management system. It even has the same visual feeling, but the principles behind the portal are completely different. The main idea behind the new version of Ö is automation. Open Knowledge teamed up with the Swedish company Metasolutions to build and deliver an automated open data portal.

Responsive design

In modern web development, one aspect of website automation called responsive design has become very popular. With this technique the website automatically adjusts the presentation depending on the screen size. That is, it knows how best to present the content given different screen sizes. Ö got a slight facelift in terms of tweaks to its appearance, but the big news on that front is that it now has a responsive design. The portal looks different if you access it on mobile phones or if you visit it on desktops, but the content is still the same.

These changes were contributed to CKAN. They are now part of the CKAN core web application as of version 2.3. This means everyone can now have a responsive data portal as long as they use a recent version of CKAN.

[Screenshot: the new Ö portal]

[Screenshot: the old Ö portal]

Data catalogs

Perhaps the biggest innovation of Ö is how the automation process works for adding new datasets to the catalog. Normally with CKAN, data publishers log in and create or update their datasets on the CKAN site. CKAN has for a long time also supported something called harvesting, where an instance of CKAN goes out and fetches new datasets and makes them available. That’s a form of automation, but it’s dependent on specific software being used or special harvesters for each source. So harvesting from one CKAN instance to another is simple. Harvesting from a specific geospatial data source is simple. Automatically harvesting from something you don’t know about and that doesn’t exist yet is hard.

That’s the reality which Ö faces. Only a minority of public organisations and municipalities in Sweden publish open data at the moment, so the majority of public entities have not yet decided what software or solution they will use to publish open data.

To tackle this problem, Ö relies on an open standard from the World Wide Web Consortium called DCAT (Data Catalog Vocabulary). The open standard describes how to publish a list of datasets and it allows Swedish public bodies to pick whatever solution they like to publish datasets, as long as one of its outputs conforms with DCAT.
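To give a flavour of what such a catalog looks like to a harvester, here is a minimal sketch of a query over the DCAT vocabulary (standard DCAT and Dublin Core namespaces assumed; a real harvester such as the CKAN DCAT extension handles far more metadata fields and serialisations):

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

# List every dataset in a DCAT catalog together with its title and a download link.
SELECT ?dataset ?title ?downloadURL
WHERE {
  ?catalog a dcat:Catalog ;
           dcat:dataset ?dataset .
  ?dataset dct:title ?title ;
           dcat:distribution ?dist .
  ?dist    dcat:downloadURL ?downloadURL .
}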

Ö actually uses a DCAT application profile which was specially created for Sweden by Metasolutions and defines in more detail what to expect – for example, that Ö expects to find dataset classifications according to the Eurovoc classification system.

Thanks to this effort, significant improvements have been made to CKAN’s support for RDF and DCAT. They include application profiles (like the Swedish one) for harvesting and exposing DCAT metadata in different formats. So a CKAN instance can now automatically harvest datasets from a range of DCAT sources, which is exactly what Ö does. The CKAN support also makes it easy for Swedish public bodies who use CKAN to automatically expose their datasets correctly so that they can be automatically harvested by Ö. For more information, have a look at the CKAN DCAT extension documentation.

Dead or alive

The Web is decentralised and always changing. A link to a webpage that worked yesterday might not work today because the page was moved. When automatically adding external links, for example links to resources for a dataset, you run the risk of adding links to resources that no longer exist.

To counter that, Ö uses a CKAN extension called Dead or Alive. It may not be the best name, but that’s what it does: it checks whether a link is dead or alive. The checking itself is performed by an external service called deadoralive; the extension just serves the set of links that the external service should check. In this way dead links are automatically marked as broken, and the system administrators of Ö can find problematic public bodies and notify them that they need to update their DCAT catalog (this is not automatic because nobody likes spam).

These are only the automation highlights of the new Ö. Other changes were made that have little to do with automation but are still not immediately visible, so a lot of Ö’s beauty happens behind the scenes. That’s also the case for other open data portals. You might just visit your open data portal to get some open data, but you might not realise the amount of effort and coordination it takes to get that data to you.

Image of Swedish flag by Allie_Caulfield on Flickr (cc-by)

August 03 2015


How the PoolParty Semantic Suite is learning to speak 40+ languages

Business is becoming more and more globalised, and enterprises and organisations are operating in several different regions, thus facing challenges from different cultural contexts as well as the respective language barriers. Looking at the European market, we even see 24 working languages in the EU28, which makes cross-border services considerably complicated. As a result, powerful language technology is needed, and intense efforts have already been made in the EU to deal with this situation and enable the vision of a multilingual digital single market (a priority area of the European Commission this year).


Here at the Semantic Web Company we also see fast-growing demand for language-independent, language-specific and cross-language solutions to enable business cases like cross-lingual search or multilingual data management. To provide such solutions, a multilingual metadata and data management approach is needed, and this is where the PoolParty Semantic Suite comes into play: because PoolParty follows W3C semantic web standards like SKOS, we have language-independent technologies in place and our customers already benefit from them. However, when it comes to text analysis and text extraction, the ability to process multilingual information and data is key to success – which means that the systems need to speak as many languages as possible.
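To illustrate the language-independent side (a minimal sketch over hypothetical thesaurus data, not a PoolParty API): because SKOS labels carry language tags, the same concept can be addressed in every language for which a label exists, for example with a query like this:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Retrieve the English and German labels of each concept,
# e.g. to drive cross-lingual search over a SKOS thesaurus.
SELECT ?concept ?labelEN ?labelDE
WHERE {
  ?concept a skos:Concept ;
           skos:prefLabel ?labelEN , ?labelDE .
  FILTER(lang(?labelEN) = "en" && lang(?labelDE) = "de")
}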

Our new cooperation with K Dictionaries (KD) is enabling the PoolParty Semantic Suite to continuously “learn to speak” more and more languages, by making use of KD’s rich monolingual, bilingual and multilingual content and its long-time experience in lexicography as a base for improved multi-language text analysis and processing.

KD is a technology-oriented content and data creator that is based in Tel Aviv and cooperates with publishing partners, ICT firms, academia and professional associations worldwide. It deals with nearly 50 languages, offering quality monolingual, bilingual and multilingual lexical datasets, morphological word forms, phonetic transcriptions, etc.

As a result of this cooperation, PoolParty now provides language bundles in the following languages, which can be licensed together with all types of PoolParty servers:

  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Russian
  • Slovak
  • Spanish

Additional language bundles are in preparation and will be in place soon!

Furthermore, SWC and KD are partners in a brand new EUREKA project that is supported by a bilateral technology/innovation programme between Austria and Israel. The project is called LDL4HELTA (Linked Data Lexicography for High-End Language Technology Application) and combines lexicography and Language Technology with Semantic Web and Linked (Open) Data mechanisms and technologies to improve existing products and services and develop new ones. It integrates the products of both partners to better serve existing customers and new ones, as well as to enter new markets together in the field of Linked Data lexicography-based Language Technology solutions. The project was successfully kicked off in early July and has a duration of 24 months, with the first concrete results due early in 2016.

The LDL4HELTA project is supported by a research partner (Austrian Academy of Sciences) and an expert Advisory Board including  Prof Christian Chiarcos (Goethe University, Frankfurt), Mr Orri Erling (OpenLink Software), Dr Sebastian Hellmann (Leipzig University), Prof Alon Itai (Technion, Haifa), and Ms Eveline Wandl-Wogt (Austrian Academy of Sciences).

So stay tuned – we will keep you informed about news and activities of this cooperation here on the blog!
