
January 27 2016

16:14

CKAN extensions Archiver and QA upgraded

Popular CKAN extensions ‘Archiver’ and ‘QA’ have recently been significantly upgraded. It is now relatively simple to add automatic broken-link checking and 5-stars-of-openness grading to any CKAN site. At a time when many open data portals suffer from quality problems, adding these reports makes it easy to identify the problems and get credit when they are resolved.

Whilst these extensions have been around for a few years, most of the development has happened on forks while the core has been languishing. In the past couple of months there has been a big push to merge all the efforts from the US (data.gov), Finland, Greece, Slovakia and the Netherlands, and particularly those from the UK (data.gov.uk), into core. It’s been a big leap forward in functionality. Installers no longer need to customize templates – you get details of broken links and 5 stars shown on every dataset simply by installing and configuring the extensions. And now that we’re all on the same page, we can work together better from now on.


The Archiver Extension regularly tries out all datasets’ data links to see if they are still working. File URLs that work are downloaded and the user is offered the ‘cached’ copy; URLs that are broken are marked in red and listed in a report. See more: the ckanext-archiver repo, docs and demo images.

The QA Extension analyses the data files that Archiver has downloaded to reliably determine their format – CSV, XLS, PDF, etc. – rather than trusting the format that the publisher has declared. This information is combined with the data licence and whether the data is currently accessible to give a rating out of 5 according to Tim Berners-Lee’s 5 Stars of Openness. A file that has no open licence, or is not available, gets 0 stars. If it passes those tests but is only a PDF then it gets 1 star. A machine-readable but proprietary format like XLS gets 2 stars, and an open format like CSV gets 3 stars. 4- and 5-star data uses standard schemas and references other datasets, which tends to mean RDF. See the ckanext-qa repo, docs and demo images.

11:29

Code of Conduct

As the CKAN community grows and includes more people from various backgrounds it seems like a good time to adopt a Code of Conduct that will ensure it remains a welcoming place for everybody.

The Code of Conduct can be accessed on the main CKAN repository:

https://github.com/ckan/ckan/blob/master/CONDUCT.rst

Rather than trying to come up with one ourselves, we have adopted one based on The Open Code of Conduct.

As stated in the code, if you feel it has been breached you can contact conduct at ckan.org. This currently forwards to the members of the tech team.

As ever, feel free to send us any comments or feedback.

January 19 2016

08:41
How a CIO is improving the customer experience

January 11 2016

15:22

New Semantic Publishing Benchmark Record

There is a new SPB (Semantic Publishing Benchmark) 256 Mtriple record with Virtuoso.

As before, the result has been measured with the feature/analytics branch of the v7fasttrack open source distribution, and it will soon be available as a preconfigured Amazon EC2 image. The updated benchmarks AMI with this version of the software will be out there within the next week, to be announced on this blog.

On the Cost of RDF Query Optimization

RDF query optimization is harder than its relational equivalent: first, because there are more joins, hence an NP-complete explosion of the plan search space; and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF3X, uses regularities in structure to treat properties that usually occur in the same subject as columns of a table. The same idea is applied to tuning the physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.

The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.

Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS. In such applications, query optimization indeed dominates execution times. In SPB, data volumes touched by queries grow near linearly with data scale. At the 256 Mtriple scale, nearly half of CPU cycles are spent deciding a query plan. Below are the CPU cycles for execution and compilation per query type, sorted by descending sum of the times, scaled to milliseconds per execution. These are taken from a one minute sample of running at full throughput.

Test system is the same used before in the TPC-H series: dual Xeon E5-2630 Sandy Bridge, 2 x 6 cores x 2 threads, 2.3GHz, 192 GB RAM.

We measure the compile and execute times, with and without using hash join. When considering hash join, the throughput is 80 qps. When not considering hash join, the throughput is 110 qps. With query plan caching, the throughput is 145 qps whether or not hash join is considered. Using hash join is not significant for the workload but considering its use in query optimization leads to significant extra work.

With hash join

Compile     Execute     Total       Query
3156 ms     1181 ms     4337 ms     Total
1327 ms       28 ms     1355 ms     query 01
 444 ms      460 ms      904 ms     query 08
 466 ms       54 ms      520 ms     query 06
 123 ms      268 ms      391 ms     query 05
 257 ms        5 ms      262 ms     query 11
 191 ms       59 ms      250 ms     query 10
   9 ms      179 ms      188 ms     query 04
 114 ms       26 ms      140 ms     query 07
  46 ms       62 ms      108 ms     query 09
  71 ms       25 ms       96 ms     query 12
  61 ms       13 ms       74 ms     query 03
  47 ms        2 ms       49 ms     query 02

Without hash join

Compile     Execute     Total       Query
1816 ms     1019 ms     2835 ms     Total
 197 ms      466 ms      663 ms     query 08
 609 ms       32 ms      641 ms     query 01
 188 ms      293 ms      481 ms     query 05
 275 ms       61 ms      336 ms     query 09
 163 ms       10 ms      173 ms     query 03
 128 ms       38 ms      166 ms     query 10
 102 ms        5 ms      107 ms     query 11
  63 ms       27 ms       90 ms     query 12
  24 ms       57 ms       81 ms     query 06
  47 ms        1 ms       48 ms     query 02
  15 ms       24 ms       39 ms     query 07
   5 ms        5 ms       10 ms     query 04

Considering hash join always slows down compilation, and sometimes improves and sometimes worsens execution. Some improvement in the cost model and plan-space traversal order is possible, but removing compilation altogether via caching is better still. The results are as expected, since a lookup workload such as SPB has little use for hash join by nature.

The rationale for considering hash join in the first place is that analytical workloads rely heavily on it. A good TPC-H score is simply not feasible without it, as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision to use it, however, depends on accurate cardinality estimates on either side of the join.

Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard and unreliable with RDF anyway. The idea is not without its attraction, but it would lead to missing out on analytics or to relying on query hints for hash join.

The present Virtuoso thinking is that going to rule based optimization is not the preferred solution, but rather using characteristic sets for reducing triples into wider tables, which also cuts down on plan search space and increases reliability of cost estimation.

When looking at execution alone, we see that actual database operations are low in the profile, with memory management taking the top 19%. This is due to CONSTRUCT queries allocating small blocks for returning graphs, which is entirely avoidable.

December 17 2015

14:49

CKAN 2.5 released, patch versions for 2.0.x, 2.1.x, 2.2.x, 2.3.x and 2.4.x available

We are happy to announce that CKAN 2.5 is now released. In addition, new patch releases for older versions of CKAN are now available to download and install.

CKAN 2.5

The 2.5 release (actually 2.5.1, as we skipped 2.5.0) offers speed improvements to the home page, search, several other key pages and the API. In addition, CKAN extensions can provide language translations in a more integrated way, and it’s now easy to customize the file uploader to use different cloud providers. 2.5 also includes plenty of other improvements contributed by the CKAN developer community during the past 4 months, as detailed in the CHANGELOG.

If you have customizations or extensions, we suggest you trial the upgrade first in a test environment and refer to the changes in the changelog. Upgrade instructions are below.

CKAN patch releases

These new patch releases for CKAN 2.0.x, 2.1.x, 2.2.x, 2.3.x and 2.4.x fix important bugs and security issues, so users are strongly encouraged to upgrade to the latest patch release for the CKAN version they are using.

For a list of the fixes included you can check the CHANGELOG.

Upgrading

For details on how to upgrade, see the following links depending on your install method:

If you find any issues, you can let the technical team know on the mailing list or the IRC channel.

 

December 11 2015

15:31
[2015 Review] Monoprix Success Story: innovating in quality through data

December 02 2015

09:35

Ready to connect to the Semantic Web – now what?

As an open data fan, or as someone who is just looking to learn how to publish data on the Web and distribute it through the Semantic Web, you will be facing the question “How do I describe the dataset that I want to publish?” The same question is also asked by people who apply for a publicly funded project at the European Commission and need a Data Management Plan. Below we discuss possibilities which help describe the dataset to be published.

The goal of publishing the data should be to make it available for access or download and to make it interoperable. One of the big benefits is to make the data available to software applications, which in turn means the datasets have to be machine-readable. From the perspective of a software developer, some information beyond just name, author, owner, date… would be helpful:

  • the condition for re-use (rights, licenses)
  • the specific coverage of the dataset (type of data, thematic coverage, geographic coverage)
  • technical specifications to retrieve and parse an instance (a distribution) of the dataset (format, protocol)
  • the features/dimensions covered by the dataset (temperature, time, salinity, gene, coordinates)
  • the semantics of the  features/dimensions (unit of measure, time granularity, syntax, reference taxonomies)

To describe a dataset, the best approach is always to look first at existing standards and existing vocabularies. The answer is not found in one vocabulary alone but in several.

Data Catalog Vocabulary (DCAT)

DCAT is an RDF Schema vocabulary for representing data catalogs. It can describe any dataset, whether standalone or part of a catalog.

Vocabulary of Interlinked Datasets (VoID)

VoID is an RDF vocabulary, and a set of instructions, that enables the discovery and usage of linked datasets. It is used for expressing metadata about RDF datasets.

Data Cube vocabulary

The Data Cube vocabulary is focused purely on the publication of multi-dimensional data on the web. It is an RDF vocabulary for describing statistical datasets.

Asset Description Metadata Schema (ADMS)

ADMS is a W3C standard developed in 2013 and is a profile of DCAT, used to describe semantic assets.

Existing vocabularies give only partial answers on how to describe your dataset; some aspects are missing or are complicated to express.

  1. Type of data – there is no specific property for the type of data covered in a dataset. This value should be machine-readable, which means it should be standardized, preferably as a URI that can be dereferenced to a thing. And this ‘thing’ should be part of an authority list/taxonomy, which does not exist yet. However, one can use adms:representationTechnique, which gives more information about the format in which a dataset is released. This points only to dcterms:format and dcat:mediaType.
  2. Technical properties like – format, protocol etc.
    • There is no property for protocol, and again these values should be machine-readable and standardized, preferably as URIs.
    • VoID can help with the protocol metadata but only for RDF datasets: dataDump, sparqlEndpoint.
  3. Dimensions of a dataset.
    • SDMX defines a dimension as “A statistical concept used, in combination with other statistical concepts, to identify a statistical series or single observations.” Dimensions in a dataset can therefore be called features, predictors, or variables (depending on the domain). One can use dc:conformsTo with a dc:Standard if the dataset dimensions can be defined by a formalized standard. Otherwise, statistical vocabularies can help with this aspect, which can become quite complex. One can use the Data Cube vocabulary, specifically qd:DimensionProperty, qd:AttributeProperty, qd:MeasureProperty and qd:CodedProperty, in combination with skos:Concept and sdmx:ConceptRole.
  4. Data provenance – dc:source can be used at the dataset level, but there is no solution if we want to specify the source at the data record level.

In the end one needs to combine different vocabularies to best describe a dataset.
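
To make this concrete, below is a minimal sketch of such a combined description, written as a SPARQL INSERT DATA statement. All URIs and values in the example are made up, and the modelling is deliberately simplified; the properties come from DCAT, Dublin Core, VoID and the Data Cube vocabulary discussed above.

PREFIX dcat:    <http://www.w3.org/ns/dcat#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX void:    <http://rdfs.org/ns/void#>
PREFIX qb:      <http://purl.org/linked-data/cube#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>

INSERT DATA {
  # Catalog-level description (DCAT / Dublin Core / VoID)
  <http://example.org/dataset/sea-temperature> a dcat:Dataset, void:Dataset, qb:DataSet ;
      dcterms:title       "Sea surface temperature measurements" ;
      dcterms:publisher   <http://example.org/org/marine-institute> ;
      dcterms:license     <http://creativecommons.org/licenses/by/4.0/> ;      # condition for re-use
      dcterms:spatial     <http://example.org/place/north-atlantic> ;          # geographic coverage
      dcat:theme          <http://example.org/theme/oceanography> ;            # thematic coverage
      dcat:distribution   <http://example.org/dataset/sea-temperature/csv> ;
      void:sparqlEndpoint <http://example.org/sparql> ;                        # protocol (RDF only)
      qb:structure        <http://example.org/dsd/sea-temperature> .           # dimensions

  # Technical specification of one distribution (format, access URL)
  <http://example.org/dataset/sea-temperature/csv> a dcat:Distribution ;
      dcat:downloadURL <http://example.org/data/sea-temperature.csv> ;
      dcat:mediaType   <http://www.iana.org/assignments/media-types/text/csv> .

  # Features/dimensions and their semantics (Data Cube vocabulary)
  <http://example.org/dsd/sea-temperature> a qb:DataStructureDefinition ;
      qb:component [ qb:dimension <http://example.org/prop/time> ] ;
      qb:component [ qb:dimension <http://example.org/prop/location> ] ;
      qb:component [ qb:measure   <http://example.org/prop/temperature> ] .

  <http://example.org/prop/time>        a qb:DimensionProperty ; rdfs:label "time (ISO 8601 date)" .
  <http://example.org/prop/location>    a qb:DimensionProperty ; rdfs:label "measurement location" .
  <http://example.org/prop/temperature> a qb:MeasureProperty   ; rdfs:label "water temperature (°C)" .
}

The catalog-level part is roughly what DCAT-based portals expose today, while the qb: part fills the gap around dimensions noted above.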


The tools out there for helping to publish data seem to be missing one or more of the above-mentioned parts.

  • CKAN, maintained by the Open Knowledge Foundation, uses most of DCAT but doesn’t describe dimensions.
  • Dataverse, created by Harvard University, uses a custom vocabulary and doesn’t describe dimensions.
  • CIARD RING uses the full DCAT-AP with some extended properties (protocol, data type) and local taxonomies with URIs mapped, where possible, to authorities.
  • OpenAIRE, DataCite (using re3data to search repositories) and Dryad use their own vocabularies.

The solution to these existing issues seems, in general, to be introducing custom vocabularies.


November 25 2015

15:21
Inside exalead.com engine (1)

November 10 2015

14:28

PoolParty 5.2 is out now!

We are proud to announce the PoolParty release 5.2. Here are our three top highlights:

  • PoolParty’s Custom Scheme management capabilities have been extended, now providing a clear distinction between ontologies and custom schemes. Ontologies can still be added from a list of predefined ontologies. In addition, custom ontologies can be created, allowing users to define classes, relations and attributes that are not covered by the predefined set of ontologies.
  • The Corpus Management workflow has been redesigned to make integrating new terms based on corpus analysis as easy as possible. The tree view in Corpus Management now also provides a view on the thesaurus, so no switching between the two views is necessary anymore.
  • The Custom Schema functionalities have been extended. Users can reuse relations more flexibly, and new pre-defined elements have been added.

For more information, please take a look at: Release Note 5.2.

The post PoolParty 5.2 is out now! appeared first on PoolParty Semantic Suite.

November 06 2015

12:40

If you like “Friends” you probably also will like “Veronica’s Closet” (find out with SPARQL why)

In a previous blog post I discussed the power of SPARQL to go beyond data retrieval to analytics. Here I look into the possibility of implementing a product recommender entirely in SPARQL. Products are considered to be similar if they share relevant characteristics, and the higher the overlap, the higher the similarity. In the case of movies or TV programs there are static characteristics (e.g. genre, actors, director) and dynamic ones like the viewing patterns of the audience.

The static part of this we can look up in resources like DBpedia. If we look at the data related to the resource <http://dbpedia.org/resource/Friends> (which represents the TV show “Friends”), we can use, for example, the associated subjects (see the predicate dcterms:subject). In this case we find, for example, <http://dbpedia.org/resource/Category:American_television_sitcoms> or <http://dbpedia.org/resource/Category:Television_shows_set_in_New_York_City>. If we want to find other TV shows that are related to the same subjects, we can do this with the following query:

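The query was included as a screenshot in the original post; a minimal sketch of a query along the lines described below (the exact formulation in the original may differ) could look like this:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbo:     <http://dbpedia.org/ontology/>

SELECT ?showB ?subjCountShowAB ?subjCountShowA ?subjCountShowB
       (?subjCountShowAB / (?subjCountShowA + ?subjCountShowB - ?subjCountShowAB) AS ?subjScore)
WHERE {
    # (1) number of subjects associated with "Friends"
    { SELECT (COUNT(DISTINCT ?subjA) AS ?subjCountShowA)
      WHERE { <http://dbpedia.org/resource/Friends> dcterms:subject ?subjA . } }
    # (2) TV shows sharing at least one subject with "Friends", and how many they share
    { SELECT ?showB (COUNT(DISTINCT ?subj) AS ?subjCountShowAB)
      WHERE {
          <http://dbpedia.org/resource/Friends> dcterms:subject ?subj .
          ?showB dcterms:subject ?subj ;
                 a dbo:TelevisionShow .
          FILTER (?showB != <http://dbpedia.org/resource/Friends>)
      } GROUP BY ?showB }
    # (3) total number of subjects of each of those shows
    { SELECT ?showB (COUNT(DISTINCT ?subjB) AS ?subjCountShowB)
      WHERE { ?showB a dbo:TelevisionShow ; dcterms:subject ?subjB . }
      GROUP BY ?showB }
}
ORDER BY DESC(?subjScore)
LIMIT 20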

The query can be executed at the DBpedia SPARQL endpoint http://live.dbpedia.org/sparql (default graph http://dbpedia.org). Read from the inside out, the query does the following:
  1. Count the number of subjects related to TV show “Friends”.
  2. Get all TV shows that share at least one subject with “Friends” and count how many they have in common.
  3. For each of those related shows count the number of subjects they are related to.
  4. Now we can calculate the relative overlap in subjects, which is (number of shared subjects) / (number of subjects for “Friends” + number of subjects for the other show – number of shared subjects); see the formula below.
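Written out, step 4 computes the Jaccard index of the two subject sets S_A and S_B:

\[ \mathrm{score}(A,B) \;=\; \frac{|S_A \cap S_B|}{|S_A| + |S_B| - |S_A \cap S_B|} \;=\; \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \]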

This gives us a score of how related one show is to another one. The results are sorted by score (the higher the better) and these are the results for “Friends”:

showB                      subjCountShowAB   subjCountShowA   subjCountShowB   subjScore
Will_&_Grace                            10               16               18    0.416667
Sex_and_the_City                        10               16               21    0.37037
Seinfeld                                10               16               23    0.344828
Veronica’s_Closet                        7               16               12    0.333333
The_George_Carlin_Show                   6               16                9    0.315789
Frasier                                  8               16               18    0.307692

In the first line of the results we see that “Friends” is associated with 16 subjects (that is the same in every line), “Will & Grace” with 18, and they share 10 subjects. That results in a score of 0.416667. Other characteristics to look at are the actors starring in a show, the creators (authors), or the executive producers.

We can pack all this into one query and retrieve similar TV shows based on shared subjects, starring actors, creators, and executive producers. The inner queries retrieve the shows that share some of those characteristics, count numbers as shown before and calculate a score for each dimension. The individual scores can be weighted; in the example here the creator score is multiplied by 0.5 and the producer score by 0.75 to adjust the influence of each of them.
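Judging from the result table below, the integrated score appears to be the weighted sum of the four individual scores divided by the number of dimensions:

\[ \mathrm{integratedScore} \;=\; \frac{\mathrm{subjScore} + \mathrm{starScore} + 0.5\cdot\mathrm{creatorScore} + 0.75\cdot\mathrm{execprodScore}}{4} \]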


This results in:

showB                                subjScore   starScore   creatorScore   execprodScore   integratedScore
The_Powers_That_Be_(TV_series)         0.17391         0.0            1.0             0.0     0.1684782608
Veronica’s_Closet                      0.33333         0.0            0.0        0.428571     0.1636904761
Family_Album_(1993_TV_series)          0.14285         0.0       0.666667             0.0     0.1190476190
Jesse_(TV_series)                      0.28571         0.0            0.0        0.181818     0.1055194805
Will_&_Grace                           0.41666         0.0            0.0             0.0     0.1041666666
Sex_and_the_City                       0.37037         0.0            0.0             0.0     0.0925925925
Seinfeld                               0.34482         0.0            0.0             0.0     0.0862068965
Work_It_(TV_series)                    0.13043         0.0            0.0        0.285714     0.0861801242
Better_with_You                        0.25            0.0            0.0           0.125     0.0859375
Dream_On_(TV_series)                   0.16666         0.0       0.333333             0.0     0.0833333333
The_George_Carlin_Show                 0.31578         0.0            0.0             0.0     0.0789473684
Frasier                                0.30769         0.0            0.0             0.0     0.0769230769
Everybody_Loves_Raymond                0.30434         0.0            0.0             0.0     0.0760869565
Madman_of_the_People                   0.3             0.0            0.0             0.0     0.075
Night_Court                            0.3             0.0            0.0             0.0     0.075
What_I_Like_About_You_(TV_series)      0.25            0.0            0.0          0.0625     0.07421875
Monty_(TV_series)                      0.15        0.14285            0.0             0.0     0.0732142857
Go_On_(TV_series)                      0.13043     0.07692            0.0        0.111111     0.0726727982
The_Trouble_with_Larry                 0.19047         0.1            0.0             0.0     0.0726190476
Joey_(TV_series)                       0.21739     0.07142            0.0             0.0     0.0722049689

Each line shows the individual scores for each of the predicates used, and in the last column the final score. You can also try out the query with “House” <http://dbpedia.org/resource/House_(TV_series)> or “Suits” <http://dbpedia.org/resource/Suits_(TV_series)> and get shows related to those.

This approach can be applied to any similar data where we want to obtain similar items based on characteristics they share. One could, for example, compare persons (by e.g. profession, interests, …), or consumer electronics products like photo cameras (resolution, storage, size or price range).

November 03 2015

14:21

Presenting ‘Dynamic Semantic Publishing’ at Taxonomy Boot Camp 2015

Andreas Blumauer gave a talk at Taxonomy Boot Camp 2015 in Washington D.C. His presentation covered the following key message:

Dynamic Semantic Publishing is not only about documents and news articles; rather, it is based on the linking (‘triangulating’) of users/employees, products/projects, and content/articles. Based on this methodology, not only can dynamically generated ‘topic pages’ be created, but a ‘connected customer’ experience also becomes reality. It’s all about providing a more personalized user experience and customer journey.

Andreas mentioned a couple of real-world use cases embracing this paradigm. Amongst others he mentioned use cases in the areas of publishing (Wolters Kluwer), media (Red Bull), health information (healthdirect Australia), and clean energy (Climate Tagger).

Download: Dynamic Semantic Publishing

The post Presenting ‘Dynamic Semantic Publishing’ at Taxonomy Boot Camp 2015 appeared first on PoolParty Semantic Suite.

10:19

ADEQUATe for the Quality of Open Data

The ADEQUATe project builds on two observations: an increasing amount of Open Data is becoming available as an important resource for emerging businesses, and the integration of such open, freely re-usable data sources into organisations’ data warehouses and data management systems is seen as a key success factor for competitive advantage in a data-driven economy.

The project now identifies crucial issues which have to be tackled to fully exploit the value of open data and to integrate it efficiently with other data sources:

  1. the overall quality issues with meta data and the data itself
  2. the lack of interoperability between data sources

The project’s approach is to address these points at an early stage – when the open data is freshly provided by governmental organisations or others.

The ADEQUATe project works with a combination of data-driven and community-driven approaches to address the above-mentioned challenges. These include 1) the continuous assessment of the data quality of Open Data portals based on a comprehensive list of quality metrics, 2) the application of a set of (semi-)automatic algorithms in combination with crowdsourcing approaches to improve identified quality issues, and 3) the use of Semantic Web technologies to transform legacy Open Data sources (mainly common text formats) into Linked Data.

So the project intends to research and develop novel automated and community-driven data quality improvement techniques and then integrate pilot implementations into existing Open Data portals (data.gv.at and opendataportal.at). Furthermore, a quality assessment & monitoring framework will evaluate and demonstrate the impact of the ADEQUATe solutions for the above-mentioned business case.

About: ADEQUATe is funded by the Austrian FFG under the programme ICT of the Future. The project is run by Semantic Web Company together with the Institute for Information Business of the Vienna University of Economics & Business and the Department for E-Governance and Administration at the Danube University Krems. The project started in August 2015 and will run until March 2018.

 

 

October 09 2015

15:12

Ensure data consistency in PoolParty

Semantic Web Company and its PoolParty team are participating in the H2020-funded project ALIGNED. This project evaluates software engineering and data engineering processes with respect to how these two worlds can be aligned in an efficient way. All project partners are working on several use cases, which shall result in a set of detailed requirements for combined software and data engineering. The ALIGNED project framework also includes work and research on data consistency in PoolParty Thesaurus Server (PPT).

ALIGNED: Describing, finding and repairing inconsistencies in RDF data sets

When using RDF to represent the data model of applications, inconsistencies can occur. Compared with the schema approach of relational databases, a data model using RDF offers much more flexibility. Usually, the application’s business logic produces and modifies the model data and, therefore, can guarantee the consistency needed for its operations. However, information may not only be created and modified by the application itself but may also originate from external sources like RDF imports into the data model’s triple store. This may result in inconsistent model data causing the application to fail. Therefore, constraints have to be specified and enforced to ensure data consistency for the application. In Phase 1 of the ALIGNED project, we outline the problem domain and requirements for the PoolParty Thesaurus Server use case, with the goal of establishing a solution for describing, finding and repairing inconsistencies in RDF data sets. We propose a framework as a basis for integrating RDF consistency management into PoolParty Thesaurus Server software components. The approach is a work in progress that aims to adopt technologies developed by the ALIGNED project partners and refine them for use in an industrial-strength application.

Technical View

Users of PoolParty often wish to import arbitrary datasets, vocabularies, or ontologies. But these datasets do not always meet the constraints PoolParty imposes. Currently, when users attempt to import data which violates the constraints, the data will simply fail to display or, in the worst case, cause unexpected behaviour and lead to (or reflect) errors in the application. An enhanced PoolParty will give the user feedback on why the import has failed, suggest ways in which the user can fix the problem, and also identify potential new constraints that could be applied to the data structure. Apart from the import functionality, various other software components, like the taxonomy editor or the reasoning engine, drive RDF data constraints and vice versa. The following figure outlines the utilization and importance of data consistency constraints in the PoolParty application:

(Figure: resolving data consistency violations in PoolParty)

Approaches and solutions for many of these components already exist. However, the exercise within ALIGNED is to integrate them in an easy-to-use way to comply with the PoolParty environment. Consistency constraints, for example, can be formulated using RDF Data Shapes or interpreting RDFS/OWL constructs with constraints-based semantics. RDFUnit already partly supports these techniques. Repair strategies and curation interfaces are covered by the Seshat Global History Databank project. Automated repair of large datasets can be managed by the UnifiedViews ETL tool, whereas immediate notification on data inconsistencies can be disseminated via the rsine semantic notification framework.

Outlook

Within the ALIGNED project, all project partners demand simple (i.e. maintainable and usable) data quality and consistency management and work on solutions to meet their requirements. Our next steps will encompass research on how to apply these technologies to the PoolParty problem domain, and taking part in unifying and integrating the different existing tools and approaches. The immediate challenge to address will be to build an interoperable catalog of formalized PoolParty data consistency constraints and repair strategies so that they are machine-processable in a (semi-)automatic way.

October 08 2015

14:34

Webinar – PoolParty for Sustainable Development: The Climate Tagger

Climate change is the greatest challenge of our time, spanning countries and continents, societies and generations, sectors and disciplines. Yet crucial data and information on climate issues are still too often amassed – diffuse – in closed silos. Tools like the “Climate Tagger” utilize Linked Open Data to scan, sort, categorize and enrich climate and development-related data, improving the efficiency and performance of knowledge management systems and thereby helping to face today’s climate change challenges. We have a short window of opportunity to solve these challenges, and Open Knowledge and Open Data are key factors in facing and solving them!

This webinar explains and demonstrates how the PoolParty Semantic Suite can be used for information and data management solutions, and concept tagging in particular, in the fields of clean energy and sustainable development. Florian Bauer, COO of an international non-profit organisation with the mission to advance clean energy markets in developing countries, will explain why “climate smart decisions” require connected knowledge systems and how semantic technologies can help to achieve that. As a concrete use case we will present the “Climate Tagger”, a tool run by REEEP that helps to connect climate knowledge and that is based on the PoolParty Semantic Suite. Other best-practice examples will be presented, and a Q&A session will allow participants to interact.

Speakers

  • Florian Bauer, COO & Director “Open Knowledge” of REEEP
  • Quinn Reifmesser, Senior Project Manager REEEP
  • Martin Kaltenböck, Managing Partner & CFO of SWC
  • Sukaina Bharwani, Stockholm Environment Institute Oxford (SEI)

 

Save the Date: November 5, 2015
3:00pm – 4:00pm CET / 9:00am – 10:00am Eastern Time

Free registration

The post Webinar – PoolParty for Sustainable Development: The Climate Tagger appeared first on PoolParty Semantic Suite.

September 29 2015

14:30

SPARQL analytics proves boxers live dangerously

You have always thought that SPARQL is only a query language for RDF data? Then think again, because SPARQL can also be used to implement some cool analytics. I show here two queries that demonstrate that principle.

For simplicity we use a publicly available dataset of DBpedia on an open SPARQL endpoint: http://live.dbpedia.org/sparql (execute with default graph = http://dbpedia.org).

Mean life expectancy for different sports

The query shown here starts from the class dbp:Athlete and retrieves its sub-classes, which cover different sports. With those, the athletes of each area are obtained together with their birth and death dates (i.e. we only take into account deceased individuals). From the dates the years are extracted. Here a regular expression is used because the SPARQL function to extract years from a literal of a date type returned errors and could not be used. From the birth and death years the age is calculated (we filter for a range of 20 to 100 years because erroneous entries always have to be accounted for in data sources like this). Then the data is simply grouped and we count for each sport the number of athletes that were selected and the average age they reached.

prefix dbp: <http://dbpedia.org/ontology/>
select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
where {
    filter(?age >= 20 && ?age <= 100) .
    {
        select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age)
        where {
            ?subOfAthlete rdfs:subClassOf dbp:Athlete .
            ?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") .
            bind(str(?athleteGroup) as ?athleteGroupEN)
            ?athlete a ?subOfAthlete .
            ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
            ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
            bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
            bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
        }
    }
} group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg

The results are not unexpected and show that athletes in the areas of motor sports, wrestling and boxing die at a younger age. On the other hand, horse riders, but also tennis and golf players, live clearly longer on average.

athleteGroupEN                                       count   ageAvg
wrestler                                               693   58.962481962481962
winter sport Player                                   1775   66.60169014084507
tennis player                                          577   71.483535528596187
table tennis player                                     45   68.733333333333333
swimmer                                                402   68.674129353233831
soccer player                                         6572   63.992391965916007
snooker player                                          25   70.12
rugby player                                          1452   67.272038567493113
rower                                                   69   63.057971014492754
poker player                                            30   66.866666666666667
national collegiate athletic association athlete        44   68.090909090909091
motorsport racer                                      1237   58.117219078415521
martial artist                                         197   67.157360406091371
jockey (horse racer)                                   139   65.992805755395683
horse rider                                            181   74.651933701657459
gymnast                                                175   65.805714285714286
gridiron football player                              4247   67.713680244878738
golf player                                            400   71.13
Gaelic games player                                     95   70.589473684210526
cyclist                                               1370   67.469343065693431
cricketer                                             4998   68.420368147258904
chess player                                            45   70.244444444444444
boxer                                                  869   60.352128883774453
bodybuilder                                             27   52
basketball player                                      822   66.165450121654501
baseball player                                       9207   68.611382643640708
Australian rules football player                      2790   69.52831541218638

This is especially relevant when the data is large and one would otherwise have to extract it from the database and import it into another tool to do the counting and calculations.

Simple statistical measures over life expectancy

Another standard statistical measure is the standard deviation. A good description of how to calculate it can be found, for example, here. We start again with the class dbp:Athlete and calculate the ages the athletes reached (this time for the entire class dbp:Athlete, not its sub-classes). Another thing we need is the square of each age, which we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result and calculate the average age, the square of the sum, and the sum of the squares. With those values we can calculate in the next step the standard deviation of the ages in our data set. Note that SPARQL does not specify a function for calculating square roots, but RDF stores like Virtuoso (which hosts the DBpedia data) provide additional functions like bif:sqrt for calculating the square root of a value.
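In formula form, the outer query computes the usual sample standard deviation from the sum of the ages and the sum of the squared ages:

\[ s \;=\; \sqrt{\frac{\sum_i x_i^2 \;-\; \left(\sum_i x_i\right)^2 / n}{n - 1}} \]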

prefix dbp: <http://dbpedia.org/ontology/>
select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) / 
       ?count)) / (?count - 1)) as ?standDev)
where {
 {
   select (count(?athlete) as ?count) (avg(?age) as ?ageAvg) 
          (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum)
   where {
       {
         select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare)
         where {
             filter (?age >= 20 && ?age <= 100) .
             {
                select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age)
                where {
                    ?subOfAthlete rdfs:subClassOf dbp:Athlete .
                    ?athlete a ?subOfAthlete .
                    ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
                    ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
                    bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
                    bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
                 }
              }
           }
        }
     }
  }
}

count   ageAvg               standDev
38542   66.876290799647138   17.6479

These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data: basic statistical methods can be implemented directly at the level of the triple store, without the need to extract the data and import it into another tool.

September 25 2015

08:56

KM World listed PoolParty Semantic Suite as Trend-Setting Product 2015

PoolParty Semantic Suite has been recognized by KMWorld as  “Trend-Setting Product 2015”. More than 1,000 separate software offerings from more than 200 vendors were reviewed. KMWorld is the United States’ leading magazine for topics surrounding knowledge management systems and content and document management.

Andreas Blumauer, founder and CEO of the Semantic Web Company, comments on the award as follows: “We are truly honoured that KMWorld has chosen us for its prestigious innovator list. It proves that standards-based technologies are on the rise in the enterprise sector. What makes the PoolParty Semantic Suite truly valuable is that it unites the most relevant functionalities for seamless, personalized digital experiences. Subject matter experts and IT can cooperate smoothly, which creates relevant business-technology synergies. This is the essence of a successful digital transformation.”

 

KMWorld Editor-in-Chief Hugh McKellar says, “The panel, which consists of editorial colleagues, market and technology analysts, KM theoreticians, practitioners, customers and a select few savvy users (in a variety of disciplines) reviewed the offerings. All selected products fulfill the ultimate goal of knowledge management—delivering the right information to the right people at the right time.”

 

PoolParty Semantic Suite

PoolParty is a semantic technology platform provided by the Semantic Web Company. The EU-based company has been a pioneer in the semantic web since 2001. The product is recognized by industry leaders as one of the most developed semantic technology platforms, supporting enterprise needs in knowledge management, data analytics and content excellence. Typical PoolParty users such as taxonomists, subject matter experts and data analysts can easily build and enhance a knowledge graph without coding skills. Boehringer, Credit Suisse, Roche and The World Bank are among many customers now profiting from transforming data into customer insights with PoolParty.

www.poolparty.biz

 

About KMWorld

KMWorld is the leading information provider serving the Knowledge Management systems market and covers the latest in content, document and knowledge management, informing more than 30,000 subscribers about the components and processes – and subsequent success stories – that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc.

www.kmworld.com

 

Press Contact

Semantic Web Company

Thomas Thurner

phone: +43-1-402-12-35

mail: t.thurner@semantic-web.at

 

September 24 2015

06:54

120+ CKAN Portals in the Palm of Your Hand. Via the Open Data Companion (ODC)

CKAN is a powerful open-source data portal platform which provides out-of-the-box tools that allow data producers to make data easily accessible and reusable by everyone. Making CKAN Free and Open Source Software (FOSS) has been a key factor in helping grow the availability and accessibility of open data across the Internet.

The emergence of mobile devices and the mobile platform has led to a shift in the way people access and consume information. Popular consensus  and reports show that mobile device usage and time spent on mobile devices are rapidly increasing. This means that mobile devices are now one of the fastest and easiest means of accessing data and information. Yet, as of now, open data lacks a strong mobile presence.

Open Data Companion (ODC) [pronounced “Odyssey”] seeks to address this challenge by providing a free mobile app that serves as a unified access point to over 120 CKAN 2.0+ compliant open data portals and thousands of datasets from around the world; right from your mobile device. Crafted with mobile-optimised features and design, this is an easy and convenient way to find, access and share open data. ODC provides a way for CKAN portal administrators and data producers to deliver open data to mobile users without the need for additional costs or further portal configuration.

ODC provides key mobile features for CKAN Portals:

  • Mobile users can setup access to as many CKAN-powered portals as they want.
  • Browse datasets from over 120 CKAN-powered data portals around the world by categories.
  • Receive push notifications on your mobile device when new datasets are available from your selected data portals.
  • Download and view data records (resources) on your mobile device.
  • Preview dataset resources and create data visualisations in app before download (as supported by the portal).
  • Bookmark/save datasets for later viewing.
  • “Favourite” your data portals for future easy access.
  • Share links to datasets on social media, email, sms etc. right from the app.
  • In-app tutorial videos designed to help you quickly get productive with the app. Tutorial videos are available offline once downloaded.

To ensure that ODC is usable by all CKAN portals in the wild, the app uses the publicly open and powerful CKAN API, which is supported by all CKAN portals. By using the CKAN API to access portals’ data and metadata, the app safeguards portals from external malicious attacks; more importantly, portal administrators remain in control of the data being delivered to the public through the app. For instance, in order for ODC to provide in-app previews and visualisations of datasets, portal administrators must install the appropriate CKAN resource preview extensions. Basically, whatever dataset can be accessed from a CKAN portal website can also be accessed by ODC through the CKAN APIs.

How to Make Your CKAN Portal Available to the Mobile Community

Making your CKAN portal available to the mobile community through the ODC app is done in three easy steps. As a portal administrator, ensure your CKAN portal is running CKAN 2.0 or above (at the time of writing the latest CKAN version is 2.4), and ensure your portal is publicly available on the World Wide Web. Finally, submit your portal details to the CKAN Census (where the app developer will periodically check for new portal submissions) or submit the portal details directly to the developer through the feedback section of the app and the app website. That’s all!

Feedback Welcome

ODC is available for download on the Google Play Store and all feedback is welcome. The app is actively developed, so more features will be released. Send feedback through the app or follow ODC on Twitter. You can also read more about the ODC vision, objectives and features from the app website.

Bringing CKAN Portals to the mobile platform is a big step in improving open data accessibility and reusability. It also opens doors to more public involvement in open data growth. I am excited to see what these new opportunities produce, first for the CKAN community and then for the Open Data community in general.

September 21 2015

16:04

Showcase your data

We all know CKAN is great for publishing and managing data, and it has powerful visualisation tools to provide instant insights and analysis. But it’s also useful and inspiring to see examples of how open data is being used.

CKAN has previously provided for this with the ‘Related Items’ feature (also known as ‘Apps & Ideas’). We wanted to enhance this feature to address some of its shortcomings, packaged up as an extension to easily replace and migrate from Related Items. So we developed the Showcase extension!


A Showcase details page. This Showcase example is originally from http://data.beta.nyc/showcase/illegal-hotels-inspections

Separating useful but under-loved features out of CKAN core into extensions like this:

  • makes core CKAN a leaner and a more focused codebase
  • gives these additional features a home, with more dedicated ownership and support
  • means updates and fixes for an extension don’t have to wait until the next release of CKAN

Some improvements made in Showcase include:

  • each showcase has its own details page
  • more than one dataset can be linked to a showcase
  • a new role of Showcase Admin to help manage showcases
  • free tagging of showcases, instead of a predefined list of ‘types’
  • showcase discovery by search and filtering by tag

This was my first contribution to the CKAN project and I wanted to ensure the established voices from the CKAN developer community were able to contribute guidance and feedback.

Remote collaboration can be hard, so I looked at the tools we already use as a team to lower the barrier to participation. I wanted something that was versioned, allowed commenting and collaboration, and provided notification to interested parties as the specification developed. We use GitHub to collect ideas for new features in a repository as Issues, so it seemed like a natural extension to take these loose issues (ideas) and turn them into pull requests (proposals). The proposal and supporting documents can be committed as simple Markdown files and discussed within the pull request. This provides line-by-line commentary tools, enabling quick iteration based on the feedback. If a proposal is accepted and implemented, the pull request can be merged; if the proposal is unsuccessful, it can be closed.

The Pull Request for the Showcase specification has 22 commits, and 57 comments from nine participants. Their contributions were invaluable and helped to quickly establish what and how the extension was going to be built. Their insights helped me get up to speed with CKAN and its extension framework and prevented me from straying too far in the wrong direction.

So, by developing the specification and coding in the open, we’ve managed to take an unloved feature of CKAN and give it a bit of polish and hopefully a new lease of life. I’d love to hear how you’re using it!

September 18 2015

05:06

Pyramids, Pipelines and a Can-of-Sweave – CKAN Asia-Pacific Meetup

Florian Mayer from the Western Australian Department of Parks and Wildlife presents various methods he is using to create Wisdom.

Data+Code = Information; Information + Context = Wisdom

So, can this be done with workbooks, applications and active documents?

As Florian might say, “Yes it CKAN”!

Grab the code and materials related to the work from here: http://catalogue.alpha.data.wa.gov.au/dataset/data-wa-gov-au

This presentation was given at the first Asia-Pacific CKAN meetup on the 17th of September, hosted at Link Digital, as an initiative of the CKAN Community and Communications team. You can join the meetup and come along to these fortnightly sessions via video conference.

If you have some interesting content to present then please get in touch with @starl3n to schedule a session.

September 16 2015

20:52

Implementing VectorTiles Preview of Geodata on HDX

This post is a modified version of a post on the HDX blog. It has been modified here to highlight information of most interest to the CKAN community. You can see the original post here.

Humanitarian data is almost always inherently geographic. Even the data in a simple CSV file will generally correspond to some piece of geography: a country, a district, a town, a bridge, or a hospital, for example.

HDX has built on CKAN’s preview capabilities with the ability to preview large (up to 500MB) vector geographic datasets in a variety of formats.  Resources uploaded (or linked) to HDX with the format strings ‘geojson’, ‘zipped shapefile’, or ‘kml’ will trigger the creation of a geo preview. Here is an example showing administrative boundaries for Colombia:

(Screenshot: geo preview showing administrative boundaries for Colombia)

To minimize bandwidth use in often poorly connected field locations, we built the preview from vector tiles. This means that details are removed at small scales but reappear as you zoom in.

The preview is created only for the first layer it encounters in a resource. If the resource contains multiple layers, the others will not show up. For those cases, you can create separate resources for each layer and they will be available in the preview. Multiple geometry types (polygon + line, for example) in kml or geojson are not yet supported.

Implementation

It’s a common problem in interactive mapping: to preview the whole geographic dataset, we would need to send all of the data to the browser, but that can require a long download or even crash the browser. The classic solution is to use a set of pre-rendered map tiles — static map images made for different zoom levels and cut into tiny pieces called tiles.  The browser has to load only a few of these pieces for any given view of the map. However, because they are just raster images, the user cannot interact with them in any advanced way.

We wanted to maintain interactivity with the data, eventually having hover effects or allowing users to customize styling, so we knew that we needed a different approach. We reached out to our friends at Geonode who pointed us to the recently developed Vector Tiles Specification.

The vector tile solution is a similar approach to traditional map tiles, but instead of creating static image tiles, it involves cutting the geodata layer into small tiles of vector data. Each zoom level receives a simplification (level of detail, or LoD) pass, which reduces the number of vertices displayed, similar to the way that 3D video games or simulators reduce the number of polygons in distant objects to improve performance. This means that for any given zoom level and location, the browser needs to download only the vertices necessary to fill the map.  You can learn more about how vector tiles work in this helpful FOSS4G NA talk from earlier this year.

Because vector tiles are a somewhat new technology, there wasn’t any off-the-shelf framework to let us integrate them with our CKAN instance. Instead, we built a custom solution from several existing components, along with our own integration code.

Our architecture looks like this:

(Diagram: HDX geo preview architecture)

The GISRestLayer orchestrates the entire process by notifying each component when there is a task to do. It then informs CKAN when the task is complete, and a dataset has a geo preview available.  It can take a minute or longer to generate the preview, so the asynchronous approach — managed through Redis Queue (RQ) — was essential to let our users continue to work while the process is running. A special HDX team member, Geodata Preview Bot, is used to make the changes to CKAN. This makes the nature of the activity on the dataset clear to our users.

Future development

This approach gives HDX a good foundation for adding new geodata features in the future. We will be conducting research to understand what users think is important to add next. Here are some initial new-feature ideas:

  • Automatically generate additional download formats so that every geodataset is available in zipped shapefile, GeoJSON, KML, etc.
  • Allow the contributing user to specify the order of the resources in the map legend (and therefore which one appears by default).
  • Allow users to preview multiple datasets on the same map for comparison.
  • Automatically apply different symbol colors to different resources in the same dataset.
  • Allow users to style the geographic data, changing colors and symbols.
  • Allow users to configure and embed maps of their data in their organization or crisis pages.
  • Provide OGC-compliant web services of contributed datasets (WFS, WMS, etc.).
  • Allow external geographic data services (WMS, WFS, etc) to be added to a map preview.
  • Make our vector tiles available as a web service.

If any of these enhancements sound useful or you have new ideas, send us an email at hdx.feedback@gmail.com. If you have geodata to share with the HDX community, start adding your data here.

We would like to say a special thanks to Jeffrey Johnson who pointed us toward the vector tiles solution and to the contributors of all the open source projects listed above! In addition to GISRestLayer, you’ll find the rest of our code here.
