
December 10 2014

11:45

Welcome nRelate Users!

 
Our friends and competitors at nRelate announced this week that they will be discontinuing their service and plugins. They have graciously recommended that their users try our plugins as a replacement, so this blog post is intended primarily to welcome them and give them an introduction to how Zemanta works.

Similar to nRelate, Related Posts by Zemanta brings new visitors to your site, helping you increase your reach and readership. It automatically recommends posts from your blog and gives you the option to editorially curate related content from around the web for each post — OR BOTH!

To simplify the transition, we added specific support for users switching from nRelate, so that our plugin will automatically use some of your old settings. More details below.

And there are a few extra features you might enjoy:

  • Super fast, super accurate recommendations
  • In-text link recommendations from popular sources such as Wikipedia, IMDB, and YouTube
  • Responsive themes for desktop, tablet, and mobile
  • Like writing CSS? Customize Related Posts with your own code!

Performance

Our recommendations aren’t database intensive, which is why we’re one of the few plugins that is not blacklisted by WP Engine. We paid special attention to how the queries are built, using all our years of experience with building search engines. This means better performance, not to mention you can set how far back you wish to index your articles in the plugin settings.

Customizable themes

The Related Posts plugin comes with a lot of beautiful themes with different thumbnail sizes. You can even manually adjust the look and feel of a theme with custom CSS and other advanced settings. And if you bump into an obstacle, our Support team is happy to help you out; just reach out.
Customize the look and feel of Zemanta’s Related Posts plugin by navigating to WordPress Dashboard > Settings > Related Posts by Zemanta

Editorial support

Zemanta offers users editorial support by enabling a Related articles widget. You manually choose the articles you wish to recommend to others. This way you reach out to other bloggers that write about similar topics – remember sharing is caring! =)

nRelate Compatibility

To simplify the transition, we added specific support for users switching from nRelate, so that our plugin will automatically use some of your old settings:

  • number of posts to show in the widget
  • the related-content title shown above the widget
  • whether to show article excerpts
  • maximum age of articles to be recommended
  • thumbnail size
  • whether to show thumbnails at all

If you have any questions about getting started, please let us know! Email us at support@zemanta.com. We’d love to hear from you.

November 28 2014

16:06

Climate Tagger based on PoolParty launched

Climate change is the greatest challenge of our time, spanning countries and continents, societies and generations, sectors and disciplines. Yet crucial data and information on climate issues are still too often amassed in diffuse, closed silos. Climate Tagger utilizes Linked Open Data to scan, sort, categorize and enrich climate and development-related data, improving the efficiency and performance of knowledge management systems.

Climate Tagger brings together the semantic power of Semantic Web Company’s PoolParty Semantic Suite with the domain expertise of REEEP and CTCN, resulting in an automatic annotation module for Drupal 7 and CKAN with an accuracy never seen before.

“Climate Tagger is the result of a shared commitment to breaking down the ‘information silos’ that exist in the climate compatible development community, and to providing concrete solutions that can be implemented right now, anywhere,” said REEEP Director General Martin Hiller. “Together with CTCN and SWC, we laid the foundations for a system that can be continuously improved and expanded to bring new sectors, systems and organizations into the climate knowledge community.”

For the Open Data and Linked Open Data communities, a Climate Tagger plugin for CKAN has also been published. It was developed by NREL with the support of CTCN, harnesses the same taxonomy and expert-vetted thesaurus behind Climate Tagger, and helps connect open data to climate compatible content when the tools are used together.

Link: Climate Tagger website

14:30

Automatic Semantic Tagging for Drupal CMS launched

REEEP [1] and CTCN [2] have recently launched Climate Tagger, a new tool to automatically scan, label, sort and catalogue datasets and document collections. Climate Tagger now incorporates a Drupal Module for automatic annotation of Drupal content nodes. Climate Tagger addresses knowledge-driven organizations in the climate and development arenas, providing automated functionality to streamline, catalogue and link their Climate Compatible Development data and information resources.

Climate Tagger

Climate Tagger for Drupal is a simple, FREE and easy-to-use way to integrate the well-known Reegle Tagging API [3] (originally developed in 2011 with the support of CDKN [4], and now part of the Climate Tagger suite as the Climate Tagger API) into any website based on the Drupal Content Management System [5]. Climate Tagger is backed by the expansive Climate Compatible Development Thesaurus, developed by experts in multiple fields and continuously updated to remain current (explore the thesaurus at http://www.reegle.info/glossary). The thesaurus is available in English, French, Spanish, German and Portuguese, and can connect content published in these languages across different portals.

Climate Tagger for Drupal can be fine-tuned to the individual (and existing) configuration of any Drupal 7 installation by:

  • determining which content types and fields will be automatically tagged
  • scheduling “batch jobs” for automatic updating (also for already existing content; there is an option to re-tag all content or to tag only with new concepts found via a thesaurus expansion or update)
  • automatically limiting and managing the volume of tag results based on individually chosen scoring thresholds (see the sketch after this list)
  • blending with manual tagging
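
To make the scoring-threshold idea concrete, here is a minimal Python sketch of the underlying pattern: send a node’s text to a tagger service and keep only the concepts scoring above a chosen threshold. The endpoint URL, request parameters and response fields are assumptions for illustration, not the documented Climate Tagger API.

    # Illustrative only: endpoint, parameters and response shape are assumptions,
    # not the documented Climate Tagger API.
    import requests

    TAGGER_URL = "https://tagger.example.org/extract"   # hypothetical endpoint


    def tag_text(text, token, threshold=60):
        """Send text to the tagger and keep concepts scoring at or above the
        threshold, mirroring the scoring-threshold setting described above."""
        resp = requests.post(TAGGER_URL, data={"text": text, "token": token})
        resp.raise_for_status()
        concepts = resp.json().get("concepts", [])       # assumed field name
        return [c["label"] for c in concepts if c.get("score", 0) >= threshold]


    print(tag_text("Solar mini-grids improve rural energy access.", token="..."))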

“Climate Tagger [6] brings together the semantic power of Semantic Web Company’s PoolParty Semantic Suite [7] with the domain expertise of REEEP and CTCN, resulting in an automatic annotation module for Drupal 7 with an accuracy never seen before” states Martin Kaltenböck, Managing Partner of Semantic Web Company [8], which acts as the technology provider behind the module.

“Climate Tagger is the result of a shared commitment to breaking down the ‘information silos’ that exist in the climate compatible development community, and to providing concrete solutions that can be implemented right now, anywhere,” said REEEP Director General Martin Hiller. “Together with CTCN and SWC, we laid the foundations for a system that can be continuously improved and expanded to bring new sectors, systems and organizations into the climate knowledge community.”

For the Open Data and Linked Open Data communities, a Climate Tagger plugin for CKAN [9] has also been published. It was developed by NREL [10] with the support of CTCN, harnesses the same taxonomy and expert-vetted thesaurus behind Climate Tagger, and helps connect open data to climate compatible content when the tools are used together.

REEEP Director General Martin Hiller and CTCN Director Jukka Uosukainen will be talking about Climate Tagger at the COP20 side event hosted by the Climate Knowledge Brokers Group in Lima [11], Peru, on Monday, December 1st at 4:45pm.

Further reading and downloads

About REEEP:

REEEP invests in clean energy markets in developing countries to lower CO2 emissions and build prosperity. Based on a strategic portfolio of high-impact projects, REEEP works to expand energy access, improve lives and economic opportunities, build sustainable markets, and combat climate change.

REEEP understands market change from a practice, policy and financial perspective. We monitor, evaluate and learn from our portfolio to understand opportunities and barriers to success within markets. These insights then influence policy, increase public and private investment, and inform our portfolio strategy to build scale within and replication across markets. REEEP is committed to open access to knowledge to support entrepreneurship, innovation and policy improvements to empower market shifts across the developing world.

About the CTCN

The Climate Technology Centre & Network facilitates the transfer of climate technologies by providing technical assistance, improving access to technology knowledge, and fostering collaboration among climate technology stakeholders. The CTCN is the operational arm of the UNFCCC Technology Mechanism and is hosted by the United Nations Environment Programme (UNEP) in collaboration with the United Nations Industrial Development Organization (UNIDO) and 11 independent, regional organizations with expertise in climate technologies.

About Semantic Web Company

Semantic Web Company (SWC, http://www.semantic-web.at) is a technology provider headquartered in Vienna (Austria). SWC supports organizations from all industrial sectors worldwide in improving their information and data management. Its products have outstanding capabilities for extracting meaning from structured and unstructured data by making use of linked data technologies.

12:09

Introducing the Linked Data Business Cube

With the increasing availability of semantic data on the World Wide Web and its reuse for commercial purposes, questions arise about the economic value of interlinked data and the business models that can be built on top of it. The Linked Data Business Cube provides a systematic approach to conceptualizing business models for Linked Data assets. Similar to an OLAP cube, the Linked Data Business Cube provides an integrated view of stakeholders (x-axis), revenue models (y-axis) and Linked Data assets (z-axis), making it possible to systematically investigate the specificities of various Linked Data business models.
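
As a rough mental model (not code from the underlying paper), the cube can be pictured as one cell per combination of the three axes. The asset axis below uses the asset types discussed later in this post; the stakeholder and revenue-model values are merely illustrative.

    # Toy illustration of the three-axis structure; axis values are examples only.
    from itertools import product

    stakeholders   = ["data provider", "service provider", "end user"]     # illustrative
    revenue_models = ["subscription", "usage fee", "indirect revenue"]     # illustrative
    assets         = ["instance data", "metadata", "ontology",
                      "content", "service", "technology"]                  # asset types from the post

    # One cell per (stakeholder, revenue model, asset) combination to assess.
    cube = {cell: None for cell in product(stakeholders, revenue_models, assets)}
    print(len(cube), "cells, e.g.", next(iter(cube)))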

[Figure: the full Linked Data Business Cube]

 

Mapping Revenue Models to Linked Data Assets

By mapping revenue models to Linked Data assets we can modify the Linked Data Business Cube as illustrated in the figure below.

[Figure: the Linked Data Business Cube with revenue models mapped to assets]

The figure indicates that as the business value of a resource increases, so do the opportunities to derive direct revenues. Assets that are easily substitutable create little incentive for direct revenues but can be used to trigger indirect revenues. This basically applies to instance data and metadata. On the other hand, assets that are unique and difficult to imitate and substitute, e.g. in terms of the competence and investment necessary to provide the service, carry the highest potential for direct revenues. This applies to assets like content, services and technology. Generally speaking, the higher the value proposition of an asset – in terms of added value – the higher the willingness to pay.

Ontologies seem to function as a “mediating layer” between “low-incentive assets” and “high-incentive assets”. This means that ontologies as a precondition for the provision and utilization of Linked Data can be capitalized in a variety of ways, depending on the business strategy of the Linked Data provider.

It is important to note that each revenue model has specific merits and flaws and requires certain preconditions to work properly. Additionally they often occur in combination as they are functionally complementary.

Mapping Revenue Models to Stakeholders

A Linked Data ecosystem usually comprises several stakeholders that engage in the value creation process. The cube can help us identify the most reasonable business model for each stakeholder.

[Figure: the Linked Data Business Cube with revenue models mapped to stakeholders]

Summing up, Linked Data generates new business opportunities, but the commercialization of Linked Data is very context specific. Revenue models change in accordance with the assets involved and the stakeholders who make use of them. Knowing these circumstances is crucial for establishing successful business models, but doing so requires a holistic and interconnected understanding of the value creation process and of the specific benefits and limitations Linked Data generates at each step of the value chain.

Read more: Asset Creation and Commercialization of Interlinked Data

November 24 2014

09:09

Highlights of the 1st Meetup on Question Answering Systems – Leipzig, November 21st

On November 21st, the AKSW group hosted the 1st meetup on “Question Answering” (QA) systems. In this meeting, researchers from AKSW/University of Leipzig, CITEC/University of Bielefeld, Fraunhofer IAIS/University of Bonn, DERI/National University of Ireland and the University of Passau presented recent results of their work on QA systems. The following themes were discussed during the meeting:

  • Ontology-driven QA on the Semantic Web. Christina Unger presented the Pythia system for ontology-based QA. Slides are available here.
  • Distributed semantic models for achieving scalability & consistency in QA. André Freitas presented TREO and EasyESA, which employ a vector-based approach to semantic approximation.
  • Template-based QA. Jens Lehmann presented TBSL for Template-based Question Answering over RDF Data.
  • Keyword-based QA. Saeedeh Shekarpour presented the SINA approach for semantic interpretation of user queries for QA on interlinked data.
  • Hybrid QA over Linked Data. Ricardo Usbeck presented HAWK for hybrid question answering using Linked Data and full-text indexes.
  • Semantic Parsing with Combinatory Categorial Grammars (CCG). Sherzod Hakimov. Slides are available here.
  • QA on statistical Linked Data. Konrad Höffner presented LinkedSpending and RDF Data Cube vocabulary to apply QA on statistical Linked Data.
  • WDAqua (Web Data and Question Answering) project. Christoph Lange presented the WDAqua project which is part of the EU’s Marie Skłodowska-Curie Action Innovative Training Networks. WDAqua focuses on answering different aspects of the question, “how can we answer complex questions with web data?”
  • OKBQA (Open Knowledge Base & Question-Answering). Axel-C. Ngonga Ngomo presented OKBQA, which aims to bring together cutting-edge experts in knowledge base construction and application in order to create an extensive architecture for QA systems with no restriction on programming languages.
  • Open QA. Edgard Marx presented an open-source question answering framework that unifies QA approaches from several domain experts.

The group decided to meet biannually to join efforts. All agreed to investigate existing architectures for question answering systems in order to offer a promising, collaborative architecture for future endeavours. Join us next time! For more information contact Ricardo Usbeck.

Ali and Ricardo on behalf of the QA meetup

November 21 2014

15:39

New Template for CKAN Extensions

We’ve just merged a new template for CKAN extensions. Whenever you create a new CKAN extension using the paster --plugin=ckan create -t ckanext ... command (as documented in the writing extensions tutorial) it’ll now use the new template, which gives you:

  • PyPI integration – setup.py and MANIFEST.in files are automatically generated for your extension, ready for publishing to PyPI
  • A tests directory including stub tests for you to get started writing tests for your extension
  • Travis CI integration – automatically run your tests in a clean environment each time you push a new commit to GitHub. A .travis.yml file and build and run scripts are automatically generated for your extension; you still need to log in to Travis and flip the switch to turn it on for your extension, though.
  • Coveralls.io integration – track the code coverage of your tests. A .coveragerc file is automatically generated for your extension. Again, you still need to log in to Coveralls and turn it on.
  • A .gitignore file
  • A LICENSE file (uses the GNU AGPL by default)
  • A reStructuredText README file with a skeleton documentation structure including generated installation and configuration instructions, how to run the tests, etc
  • Travis, Coveralls and pypip.in README badges! Show the world that you have continuous integration, good test coverage, PyPI downloads, and your extension’s supported Python version, development status and license.
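
For illustration, a stub test in the generated tests directory looks roughly like the sketch below (the exact file the template creates may differ; the class and method names here are placeholders):

    """Stub tests for a freshly generated extension (placeholder names)."""


    class TestMyExtension(object):

        def test_placeholder(self):
            # Replace this with real assertions about your plugin's behaviour,
            # e.g. loading it via ckan.plugins and checking the hooks it implements.
            assert 1 + 1 == 2

Once you enable Travis and Coveralls, these are the tests they will run and measure on every push.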


For an example of an extension built using this template, look at ckanext-deadoralive.

What we’re trying to do with this new template is:

  1. Save ourselves time, by not having to manually create all of this boilerplate every time we roll a new CKAN extension
  2. Help improve the quality of CKAN extensions by encouraging developers to write good tests and documentation, and to use services such as PyPI, Travis and Coveralls

More to come. If you have any ideas for things to add to the CKAN extension template, let us know on ckan-dev.

November 20 2014

09:08

Announcing GERBIL: General Entity Annotator Benchmark Framework

Dear all,

We are happy to announce GERBIL – a General Entity Annotation Benchmark Framework; a demo is available online. With GERBIL, we aim to establish a highly available, easily quotable and reliable focal point for Named Entity Recognition and Named Entity Disambiguation (Entity Linking) evaluations:

  • GERBIL provides persistent URLs for experimental settings. By these means, GERBIL also addresses the problem of archiving experimental results.
  • The results of GERBIL are published in a human-readable as well as a machine-readable format. By these means, we also tackle the problem of reproducibility.
  • GERBIL provides 11 different datasets and 9 different entity annotators. Please talk to us if you want to add yours.

To ensure that the GERBIL framework is useful to both end users and tool developers, its architecture and interface were designed with the following principles in mind:

  • Easy integration of annotators: We provide a web-based interface that allows annotators to be evaluated via their NIF-based REST interface (a minimal request sketch follows this list). We provide a small NIF library for an easy implementation of the interface.
  • Easy integration of datasets: We also provide means to gather datasets for evaluation directly from data services such as DataHub.
  • Extensibility: GERBIL is provided as an open-source platform that can be extended by members of the community both to new tasks and different purposes.
  • Diagnostics: The interface of the tool was designed to provide developers with means to easily detect aspects in which their tool(s) need(s) to be improved.
  • Portability of results: We generate human- and machine-readable results to ensure maximum usefulness and portability of the results generated by our framework.
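
As a rough illustration of the NIF-based exchange mentioned above, the sketch below builds a minimal NIF document in Turtle and POSTs it to an annotator. The annotator URL is a placeholder, and the exact content type and parameters a given annotator expects may differ; the NIF property names follow the NIF Core ontology.

    # Minimal sketch only: the endpoint is hypothetical.
    import requests

    NIF_DOC = """
    @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <http://example.org/doc#char=0,37>
        a nif:Context, nif:String ;
        nif:isString "Leipzig is a city in Saxony, Germany."^^xsd:string ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "37"^^xsd:nonNegativeInteger .
    """

    resp = requests.post(
        "https://annotator.example.org/nif",          # hypothetical endpoint
        data=NIF_DOC.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
    )
    print(resp.text)   # annotated NIF document (mentions + entity links) comes back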

We are looking for your feedback!

Best regards,

Ricardo Usbeck for The GERBIL Team

November 19 2014

21:19

Zemanta & Distil Partner to Protect Content Ad Campaigns from Fraud

Bad Bots might be great villains in sci-fi, but they certainly don’t belong in our clients’ content amplification campaigns


As a Content DSP, Zemanta brings programmatic buying to native content advertising. Since launching our first content marketing solution, the Zemanta Editorial Network, in 2007, we have worked diligently to ensure our clients’ campaigns are free of fraud. We now bring that same diligence to our first-of-a-kind Content DSP, where we help brands promote their content across virtually the entire paid content ecosystem.

As a DSP, our first step in ensuring quality traffic is to work with quality networks and platforms, partnering with leaders like Yahoo!, Outbrain, Adiant, IAC/nRelate and AOL/Gravity. But we know that fraud is out there, so we don’t stop there.

Today we are happy to announce we are partnering with Distil Networks, a leader in ad fraud protection, to take the next step in ensuring bad actors play no role in our clients’ content campaigns. We chose to partner and integrate Distil’s anti-fraud technology into our content advertising platform because we were impressed by Distil’s core product experience, which has been used across hundreds of millions of impressions and dozens of publishers. Together, we are determined to keep fraudsters and their bots out of the content marketing business. “Distil is excited to extend its solution further into the online marketing and advertising industry by working with Zemanta, a market leader in content marketing promotion & distribution, to ensure that advertisers are not paying for bot views and bot clicks – just humans”, said Charlie Minesinger, Distil’s Director of Channel Partners.

Good robots help you navigate your X-wing fighter. They don’t steal your marketing dollars.

We’re happy to be working with Distil to stop the bad bots from ruining great content ad campaigns and we are desperately waiting for an R2 unit to help with the evening commute. To learn more about how Zemanta can help you succeed with content marketing you can read more here or drop us a line at partners@zemanta.com.

November 17 2014

14:29

@BioASQ challenge gaining momentum

BioASQ is a series of challenges aiming to bring us closer to the vision of machines that can answer questions of biomedical professionals and researchers. The second BioASQ challenge started in February 2013. It comprised two different tasks: Large-scale biomedical semantic indexing (Task 2a), and biomedical semantic question answering (Task 2b).

In total, 216 users and 142 systems registered with the automated evaluation system of BioASQ in order to participate in the challenge; 28 teams (with 95 systems) finally submitted their suggested solutions and answers. The final results were presented at the BioASQ workshop at the Cross Language Evaluation Forum (CLEF), which took place between September 23 and 26 in Sheffield, U.K.

The Awards Went To The Following Teams

Task 2a (Large-scale biomedical semantic indexing):

  • Fudan University (China)
  • NCBI (USA)
  • Aristotle University of Thessaloniki (Greece) and atypon.com (USA)

Task 2b (Biomedical semantic question answering):

  • Fudan University (China)
  • NCBI (USA)
  • University of Alberta (Canada)
  • Seoul National University (South Korea)
  • Toyota Technological Institute (Japan)
  • Aristotle University of Thessaloniki (Greece) and atypon.com (USA)

Best Overall Contribution:

  • NCBI (USA)

The second BioASQ challenge continued the impressive achievements of the first, pushing the research frontiers in biomedical indexing and question answering. The systems that participated in both tasks of the challenge achieved a notable increase in accuracy over the first year. Among the highlights is the fact that the best systems in Task 2a again outperformed the very strong baseline MTI system provided by NLM, even though the MTI system itself has been improved by incorporating ideas proposed by last year’s winning systems. The end of the second challenge also marks the end of the European Commission’s financial support for BioASQ. We would like to take this opportunity to thank the EC for supporting our vision. The main project results (incl. frameworks, datasets and publications) can be found at the project showcase page at http://bioasq.org/project/showcase.

Nevertheless, the BioASQ challenge will continue with its third round, BioASQ3, which will start in February 2015. Stay tuned!

About BioASQ

The BioASQ team combines researchers with complementary expertise from 6 organisations in 3 countries: the Greek National Center for Scientific Research “Demokritos” (coordinator), participating with its Institutes of ‘Informatics & Telecommunications’ and ‘Biosciences & Applications’, the German IT company Transinsight GmbH, the French University Joseph Fourier, the German research Group for Agile Knowledge Engineering and Semantic Web at the University of Leipzig, the French University Pierre et Marie Curie‐Paris 6 and the Department of Informatics of the Athens University of Economics and Business in Greece (visit the BioASQ project partners page). Moreover, biomedical experts from several countries assist in the creation of the evaluation data and a number of key players in the industry and academia from around the world participate in the advisory board of the project.
BioASQ started in October 2012 and was funded for two years by the European Commission as a support action (FP7/2007-2013: Intelligent Information Management, Targeted Competition Framework; grant agreement n° 318652). More information can be found at: http://www.bioasq.org.
Project Coordinator: George Paliouras (paliourg@iit.demokritos.gr).
08:27
Interview of Alexandre Figuiere, EXALEAD representative at the After Market Conference

November 13 2014

21:19

LDBC: Making Semantic Publishing Execution Rules

LDBC SPB (Semantic Publishing Benchmark) is based on the BBC Linked Data use case. Thus the data modeling and transaction mix reflect the BBC's actual utilization of RDF. But a benchmark is not only a condensation of current best practice. The BBC Linked Data is deployed on Ontotext GraphDB (formerly known as OWLIM).

So, in SPB we wanted to address substantially more complex queries than the lookups that the BBC linked data deployment primarily serves. Diverse dataset summaries, timelines, and faceted search qualified by keywords and/or geography are examples of the online user experience that SPB needs to cover.

SPB is not an analytical workload, per se, but we still find that the queries fall broadly in two categories:

  • Some queries are centered on a particular search or entity. The data touched by such a query does not grow at the same rate as the dataset.
  • Some queries cover whole cross-sections of the dataset, e.g., find the most popular tags across the whole database.

These different classes of queries need to be separated in the metric; otherwise the short lookups dominate at small scales and the large queries at large scales.

Another guiding factor of SPB was the BBC's and others' express wish to cover operational aspects such as online backups, replication, and fail-over in a benchmark. True, most online installations have to deal with these, yet these things are as good as absent from present benchmark practice. We will look at these aspects in a different article; for now, I will just discuss the matter of workload mix and metric.

Normally, the lookup and analytics workloads are divided into different benchmarks. Here, we will try something different. There are three things the benchmark does:

  • Updates - These sometimes insert a graph, sometimes delete and re-insert the same graph, sometimes just delete a graph. These are logarithmic to data size.

  • Short queries - These are lookups that most often touch on recent data and can drive page impressions. These are roughly logarithmic to data scale.

  • Analytics - These cover a large fraction of the dataset and are roughly linear to data size.

A test sponsor can decide on the query mix within certain bounds. A qualifying run must sustain a minimum, scale-dependent update throughput and must execute a scale-dependent number of analytical query mixes, or run for a scale-dependent duration. The minimum update rate, the minimum number of analytics mixes and the minimum duration all grow logarithmically to data size.
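
As a toy illustration of what “grows logarithmically to data size” means for these thresholds, the sketch below derives a hypothetical minimum update rate from the scale factor. The scaling constants are not part of the SPB specification; only the 7 updates/s floor at unit scale, mentioned later in this post, is taken from the run rules.

    # Toy illustration only: the scaling constants are NOT from the SPB spec.
    import math


    def min_update_rate(scale_factor, base_rate=7.0):
        """Hypothetical minimum updates/sec: 7/s at SF 1, growing with log2(SF)."""
        return base_rate * (1.0 + math.log2(max(scale_factor, 1)))


    for sf in (1, 4, 32, 256):
        print(f"SF {sf:>3}: >= {min_update_rate(sf):.1f} updates/s")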

Within these limits, the test sponsor can decide how to mix the workloads. Publishing several results emphasizing different aspects is also possible. A given system may be especially good at one aspect, leading the test sponsor to accentuate this.

The benchmark has been developed and tested at small scales, between 50 and 150M triples. Next we need to see how it actually scales. There we expect to see how the two query sets behave differently. One effect that we see right away when loading data is that creating the full text index on the literals is in fact the longest running part. For an SF 32 (1.6 billion triples) SPB database we have the following space consumption figures:

  • 46,886 MB of RDF literal text
  • 23,924 MB of full text index for RDF literals
  • 23,598 MB of URI strings
  • 21,981 MB of quads, stored column-wise with default index scheme

Clearly, applying column-wise compression to the strings is the best move for increasing scalability. The literals are individually short, so literal-by-literal compression will do little or nothing, but applying compression by the column is known to yield a 2x size reduction with Google Snappy.
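
A quick back-of-the-envelope check of that claim, assuming the 2x reduction applies to both string columns (literal text and URI strings) and leaving the full text index and quads untouched:

    # Rough estimate only, based on the SF 32 figures listed above.
    sizes_mb = {
        "rdf_literal_text": 46886,
        "fulltext_index":   23924,
        "uri_strings":      23598,
        "quads_columnar":   21981,
    }
    compressible = {"rdf_literal_text", "uri_strings"}   # assume ~2x with Snappy

    before = sum(sizes_mb.values())
    after = sum(v / 2 if k in compressible else v for k, v in sizes_mb.items())
    print(f"{before} MB -> {after:.0f} MB  (~{before / after:.2f}x smaller overall)")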

The full text index does not get much from column store techniques, as it already consists of words followed by space efficient lists of word positions. The above numbers are measured with Virtuoso column store, with quads column-wise and the rest row-wise. Each number includes the table(s) and any extra indices associated to them.

Let's now look at a full run at unit scale, i.e., 50M triples.

The run rules stipulate a minimum of 7 updates per second. The updates are comparatively fast, so we set the update rate to 70 updates per second. This is seen not to take too much CPU. We run 2 threads of updates, 20 of short queries, and 2 of long queries. The minimum run time for the unit scale is 10 minutes, so we do 10 analytical mixes, as this is expected to take a little over 10 minutes. The run stops by itself when the last of the analytical mixes finishes.

The interactive driver reports:

Seconds run : 2,144
    Editorial:
        2 agents

        68,164 inserts (avg :   46  ms, min :    5  ms, max :   3002  ms)
         8,440 updates (avg :   72  ms, min :   15  ms, max :   2471  ms)
         8,539 deletes (avg :   37  ms, min :    4  ms, max :   2531  ms)

        85,143 operations (68,164 CW Inserts   (98 errors), 
                            8,440 CW Updates   ( 0 errors), 
                            8,539 CW Deletions ( 0 errors))
        39.7122 average operations per second

    Aggregation:
        20 agents

        4120  Q1   queries (avg :    789  ms, min :   197  ms, max :   6,767   ms, 0 errors)
        4121  Q2   queries (avg :     85  ms, min :    26  ms, max :   3,058   ms, 0 errors)
        4124  Q3   queries (avg :     67  ms, min :     5  ms, max :   3,031   ms, 0 errors)
        4118  Q5   queries (avg :    354  ms, min :     3  ms, max :   8,172   ms, 0 errors)
        4117  Q8   queries (avg :    975  ms, min :    25  ms, max :   7,368   ms, 0 errors)
        4119  Q11  queries (avg :    221  ms, min :    75  ms, max :   3,129   ms, 0 errors)
        4122  Q12  queries (avg :    131  ms, min :    45  ms, max :   1,130   ms, 0 errors)
        4115  Q17  queries (avg :  5,321  ms, min :    35  ms, max :  13,144   ms, 0 errors)
        4119  Q18  queries (avg :    987  ms, min :   138  ms, max :   6,738   ms, 0 errors)
        4121  Q24  queries (avg :    917  ms, min :    33  ms, max :   3,653   ms, 0 errors)
        4122  Q25  queries (avg :    451  ms, min :    70  ms, max :   3,695   ms, 0 errors)

        22.5239 average queries per second. 
        Pool 0, queries [ Q1 Q2 Q3 Q5 Q8 Q11 Q12 Q17 Q18 Q24 Q25 ]


        45,318 total retrieval queries (0 timed-out)
        22.5239 average queries per second

The analytical driver reports:

    Aggregation:
        2 agents

        14    Q4   queries (avg :   9,984  ms, min :   4,832  ms, max :   17,957  ms, 0 errors)
        12    Q6   queries (avg :   4,173  ms, min :      46  ms, max :    7,843  ms, 0 errors)
        13    Q7   queries (avg :   1,855  ms, min :   1,295  ms, max :    2,415  ms, 0 errors)
        13    Q9   queries (avg :     561  ms, min :     446  ms, max :      662  ms, 0 errors)
        14    Q10  queries (avg :   2,641  ms, min :   1,652  ms, max :    4,238  ms, 0 errors)
        12    Q13  queries (avg :     595  ms, min :     373  ms, max :    1,167  ms, 0 errors)
        12    Q14  queries (avg :  65,362  ms, min :   6,127  ms, max :  136,346  ms, 2 errors)
        13    Q15  queries (avg :  45,737  ms, min :  12,698  ms, max :   59,935  ms, 0 errors)
        13    Q16  queries (avg :  30,939  ms, min :  10,224  ms, max :   38,161  ms, 0 errors)
        13    Q19  queries (avg :     310  ms, min :      26  ms, max :    1,733  ms, 0 errors)
        12    Q20  queries (avg :  13,821  ms, min :  11,092  ms, max :   15,435  ms, 0 errors)
        13    Q21  queries (avg :  36,611  ms, min :  14,164  ms, max :   70,954  ms, 0 errors)
        13    Q22  queries (avg :  42,048  ms, min :   7,106  ms, max :   74,296  ms, 0 errors)
        13    Q23  queries (avg :  48,474  ms, min :  18,574  ms, max :   93,656  ms, 0 errors)
        0.0862 average queries per second. 
        Pool 0, queries [ Q4 Q6 Q7 Q9 Q10 Q13 Q14 Q15 Q16 Q19 Q20 Q21 Q22 Q23 ]


        180 total retrieval queries (2 timed-out)
        0.0862 average queries per second

The metric would be 22.52 qi/s, 310 qa/h, 39.7 u/s @ 50Mt (SF 1).
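
For clarity, here is a small sketch of how that metric triple can be assembled from the driver reports above. This is not the official LDBC reporting tool; the interactive rate is taken directly from the driver’s own average, since the query threads ran slightly less than the full wall-clock time.

    # Derive the headline metric from the figures reported by the two drivers.
    seconds_run     = 2144      # wall-clock duration of the run
    update_ops      = 85143     # editorial inserts + updates + deletes
    interactive_qps = 22.5239   # reported by the interactive driver
    analytical_qps  = 0.0862    # reported by the analytical driver

    updates_per_second  = update_ops / seconds_run     # ~39.7 u/s
    analytical_per_hour = analytical_qps * 3600        # ~310 qa/h

    print(f"{interactive_qps:.2f} qi/s, {analytical_per_hour:.0f} qa/h, "
          f"{updates_per_second:.1f} u/s @ 50Mt (SF 1)")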

The SUT is dual Xeon E5-2630, all in memory. The platform utilization is steadily above 2000% CPU (over 20/24 hardware threads busy on the DBMS). The DBMS is Virtuoso Open Source (v7fasttrack at github.com, feature/analytics branch).

The minimum update rate of 7/s was sustained, but the achieved rate fell short of our own target of 70/s. In this run, most demand was put on the interactive queries. Different thread allocations would give different ratios of the metric components. The analytics mix, for example, is about 3x faster without other concurrent activity.

Is this good or bad? I would say that this is possible but better can certainly be accomplished.

The initial observation is that Q17 is the worst of the interactive lot. 3x better is easily accomplished by avoiding a basic stupidity. The query does the evil deed of checking for a substring in a URI. This is done in the wrong place and accounts for most of the time. The query is meant to test geo retrieval but ends up doing something quite different. Optimizing this right would by itself almost double the interactive score. There are some timeouts in the analytical run, which as such disqualifies the run. This is not a fully compliant result, but is close enough to give an idea of the dynamics. So we see that the experiment is definitely feasible, is reasonably defined, and that the dynamics seen make sense.

As an initial comment on the workload mix, I'd say that interactive should have a few more very short point lookups, to stress compilation times and give a higher absolute score of queries per second.

Adjustments to the mix will depend on what we find out about scaling. As with SNB, it is likely that the workload will shift a little so this result might not be comparable with future ones.

In the next SPB article, we will look closer at performance dynamics and choke points and will have an initial impression on scaling the workload.

21:09

LDBC: Creating a Metric for SNB

In the Making It Interactive post on the LDBC blog, we were talking about composing an interactive Social Network Benchmark (SNB) metric. Now we will look at what this looks like in practice.

A benchmark is known by its primary metric. An actual benchmark implementation may deal with endless complexity but the whole point of the exercise is to reduce this all to an extremely compact form, optimally a number or two.

For SNB, we suggest clicks per second Interactive at scale (cpsI@ so many GB) as the primary metric. To each scale of the dataset corresponds a rate of update in the dataset's timeline (simulation time). When running the benchmark, the events in simulation time are transposed to a timeline in real time.

Another way of expressing the metric is therefore acceleration factor at scale. In this example, we run a 300 GB database at an acceleration of 1.64; i.e., in the present example, we did 97 minutes of simulation time in 58 minutes of real time.
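
A minimal sketch of the arithmetic behind the acceleration factor (the exact reported figure of 1.64 was presumably computed from unrounded timings; the rounded minute figures quoted here give roughly the same value):

    # Acceleration factor = simulation time covered / wall-clock time spent.
    def acceleration_factor(simulated_minutes, real_minutes):
        return simulated_minutes / real_minutes


    # Rounded figures from the post: 97 min of simulation in 58 min of real time.
    print(f"{acceleration_factor(97, 58):.2f}x")   # ~1.7x, close to the reported 1.64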

Another key component of a benchmark is the full disclosure report (FDR). This is expected to enable any interested party to reproduce the experiment.

The system under test (SUT) is Virtuoso running an SQL implementation of the workload at 300 GB (SF = 300). This run gives an idea of what an official report will look like but is not one yet. The implementation differs from the present specification in the following:

  • The SNB test driver is not used. Instead, the workload is read from the file system by stored procedures on the SUT. This is done to circumvent latencies in update scheduling in the test driver which would result in the SUT not reaching full platform utilization.

  • The workload is extended by 2 short lookups, i.e., person profile view and post detail view. These are very short and serve to give the test more of an online flavor.

  • The short queries appear in the report as multiple entries. This should not be the case. This inflates the clicks per second number but does not significantly affect the acceleration factor.

As a caveat, this metric will not be comparable with future ones.

Aside from the composition of the report, the interesting point is that with the present workload, a 300 GB database keeps up with the simulation timeline on a commodity server, even when running updates. The query frequencies and run times are in the full report. We also produced a graphic showing the evolution of the throughput over a run of one hour:

[Figure: LDBC SNB throughput (queries per minute) over the one-hour run]

We see steady throughput except for some slower minutes which correspond to database checkpoints. (A checkpoint, sometimes called a log checkpoint, is the operation which makes a database state durable outside of the transaction log.) If we run updates only at full platform, we get an acceleration of about 300x in memory for 20 minutes, then 10 minutes of nothing happening while the database is being checkpointed. This is measured with 6 2TB magnetic disks. Such a behavior is incompatible with an interactive workload. But with a checkpoint every 10 minutes and updates mixed with queries, checkpointing the database does not lead to impossible latencies. Thus, we do not get the TPC-C syndrome which requires tens of disks or several SSDs per core to run.

This is a good thing for the benchmark, as we do not want to require unusual I/O systems for competition. Such a requirement would simply encourage people to ignore the specification on this point and would limit the number of qualifying results.

The full report contains the details. This is also a template for later "real" FDRs. The supporting files are divided into test implementation and system configuration. With these materials plus the data generator, one should be able to repeat the results using a Virtuoso Open Source cut from v7fasttrack at github.com, feature/analytics branch.

In later posts we will analyze the results a bit more and see how much improvement potential we find. The next SNB article will be about the business intelligence and graph analytics areas of SNB.

15:59

The Potential of Big Data Applications for the Healthcare Sector

 

At the industrial Big Data conference Big Data Minds in Berlin, Prof. Sonja Zillner presented “The Potential of Big Data Applications for the Healthcare Sector”. In the presentation, given on behalf of the BIG Data Public Private Forum, she discussed the challenges of Big Data and the emerging data economy for the healthcare sector. In particular, the results of the BIG user needs and requirements study for Big Data applications in the healthcare sector were introduced. The study shows that Big Data technologies can be used to improve the quality and efficiency of healthcare delivery. However, realizing Big Data applications in the healthcare sector is challenging. In order to take advantage of the promising opportunities of Big Data technologies, a clear understanding of drivers and constraints, user needs and requirements is needed.

The feedback from the audience was very positive, and several conference participants requested access to the BIG Requirements Study.


November 09 2014

20:01

Export Datasets from CKAN to Excel

ckanapi-exporter is a new API script that we’ve developed for exporting dataset metadata from CKAN to Excel-compatible CSV files. Check out the short presentation below, and visit ckanapi-exporter for more details:
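
In the meantime, the underlying idea can be sketched in a few lines with the ckanapi client library and the csv module. Note that this is an illustrative rough equivalent, not ckanapi-exporter’s actual implementation or CLI, and demo.ckan.org simply stands in for whatever CKAN site you use.

    # Rough sketch of exporting dataset metadata to CSV; not ckanapi-exporter itself.
    import csv
    from ckanapi import RemoteCKAN

    site = RemoteCKAN("https://demo.ckan.org")        # any CKAN site URL
    names = site.action.package_list()                # all dataset names

    with open("datasets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "title", "notes"])
        for name in names[:50]:                       # keep the example small
            pkg = site.action.package_show(id=name)
            writer.writerow([pkg["name"], pkg["title"], pkg.get("notes", "")])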

November 04 2014

12:19

CKAN Extension Registry – Share and Find CKAN Extensions

We are happy to announce the new CKAN Extensions Registry which lists available CKAN Extensions:

http://extensions.ckan.org/

CKAN Extensions are a way to extend and alter the functionality of the base CKAN platform using the numerous extension points provided by CKAN. CKAN Extensions provide limitless possibilities from altering the site look and feel to adding site pages, from new validation methods to modifying or adding APIs.
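
For a flavour of those extension points, here is a minimal plugin sketch in the style of the CKAN extension tutorial (plugin and directory names are placeholders):

    # Minimal CKAN plugin sketch: registers a template directory so the site's
    # look and feel can be altered. Names are placeholders.
    import ckan.plugins as plugins
    import ckan.plugins.toolkit as toolkit


    class ExampleThemePlugin(plugins.SingletonPlugin):
        plugins.implements(plugins.IConfigurer)

        def update_config(self, config):
            # Tell CKAN about this extension's templates/ directory.
            toolkit.add_template_directory(config, "templates")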

There are currently 100 extensions already listed in the registry, based on an initial survey of the extensions available “in the wild” (on GitHub etc.), and we will be adding more going forward.

[Screenshot: CKAN Extension Registry front page]

Add Your Extension

Instructions for adding your extension to the registry are here:

http://extensions.ckan.org/add/

All About Extensions

CKAN Extensions are a way to extend the functionality of the base CKAN platform using the numerous extension points provided by CKAN.

Support for creating CKAN Extensions was first introduced in autumn 2010 and has been extended multiple times since. Until now we have collected lists of extensions on the wiki, but with the growing number of extensions it is useful to have a proper registry (an extension registry was one of the most requested items in the Roadmap consultation).

Examples include:

Next Steps

At present, the Registry is confined to “functional” extensions which add new functionality to CKAN and are not specific to a given site.

We are considering adding a section for theme oriented and site-specific extensions (e.g. support for metadata specific to a given site) since these extensions may be useful as inspiration and instruction to others even if they are not likely to be directly installed.

10:00
BREAKFAST WITH EXALEAD DASSAULT SYSTÈMES LEADER

November 03 2014

18:12

Job offer: Technical Consultant (Data Science & Linked Data)

The Semantic Web Company (SWC) is a leading provider of software and services in the areas of Semantic Information Management and Linked Data technologies. SWC’s renowned PoolParty Software Platform is used in large enterprises, Government Organizations, NPOs and NGOs around the globe to extract meaning from big data.

We are looking for a technical consultant to work at the interface between customer projects and product development. Expertise in some of the following areas is required: data science, data mining, text mining, knowledge engineering, taxonomy management, semantic web, semantic search, computational linguistics and/or linked data. Our consultants are an integral part of a dynamic, interdisciplinary, output-focussed team of semantic technology experts.

Semantic Web Company values loyalty, intelligence and innovation, and rewards strong performance with increased responsibility and growth opportunities. We offer great work-life balance and a culture that is cutting-edge, collaborative and fun. If you are interested in making an immediate impact in a growing company, we invite you to apply today.

Job Description:

  • Requirements engineering for customer projects
  • Project management of customer and / or R&D projects
  • Collaborating with information professionals to initiate customer projects
  • Data and knowledge engineering mainly based on our core product PoolParty Semantic Suite (http://www.poolparty.biz/)
  • Conceptual assistance to the product development team
  • Conceptual assistance to the business development team
  • Supporting the R&D efforts of the Semantic Web Company

Job Requirements:

  • Profound expertise in some knowledge technologies such as graph databases, text mining, ontology engineering, machine learning, etc.
  • Knowledge of Java
  • Strong troubleshooting/problem-solving skills
  • Exacting attention to detail and documentation
  • Ownership of problems
  • Ability to effectively manage multiple projects simultaneously
  • An inquiring mind, intense curiosity, interdisciplinary understanding and strong desire to innovate in the areas of Linked Data and Semantic Systems
  • At least a Bachelor's degree related to computer science or information science and 4+ years of work experience, or a Master's degree related to computer science or information science and 2+ years of work experience
  • Excellent skills in written and spoken English. Additional languages (German, French, Spanish) are not obligatory but advantageous
  • Communication skills as well as experience in project management and requirements engineering
  • Able to travel occasionally in the US and Europe, if required

 

  • Job Category: Technology Provider
  • Career Level: Mid Career (2+ years of experience)
  • Job Type: Full Time/Permanent
  • Positions: 2
  • Company Name: Semantic Web Company GmbH
  • City: Vienna
  • Country: Austria
  • For Austria: gross salary EUR 41,160 p.a.; higher compensation is possible depending on education and experience.

 

Send your full application to:

Semantic Web Company
c/o Andreas Blumauer

Mail: jobs@semantic-web.at

Company: http://www.semantic-web.at
PoolParty Product Suite: http://poolparty.biz

October 28 2014

16:03

AKSW successful at #ISWC2014

Dear followers, nine members of AKSW participated in the 13th International Semantic Web Conference (ISWC) at Riva del Garda, Italy. Besides listening to interesting talks, giving presentations and discussing with fellow Semantic Web researchers, AKSW won 4 significant prizes:

  • Best Paper Award

We work on many more projects, which you can find at http://aksw.org/projects/. Cheers, Ricardo on behalf of the AKSW group

October 27 2014

12:38

EU Big Data Value in Heidelberg Workshop

At the Heidelberg Final Event Workshop, Sebnem Rusitschka demonstrated the value of big data by presenting current and future use cases for Siemens in Europe.
The slides are now available on SlideShare (see above) and directly as PDF.
12:24

BYTE Community Overview in Heidelberg

The BYTE community stands for "Big data roadmap and cross-disciplinarY community for addressing socieTal Externalities". Edward Curry gave a project overview at the BIG Final Event in Heidelberg.
The slides are now available on SlideShare (see above) and directly as PDF.