
August 26 2014

07:13

The world famous trade fair Arbeitsschutz Aktuell in Frankfurt, Germany

TenForce is present at the world-famous trade fair Arbeitsschutz Aktuell in Frankfurt, Germany, from 25 to 28 August.

Please come and visit us at our booth 3.1/C33, and pick up some Belgian chocolates!

http://www.arbeitsschutz-aktuell.de/

August 18 2014

20:54

LOD2 Finale (part 2 of n): The 500 Giga-triples

No epic is complete without a descent into hell. Enter the historia calamitatum of the 500 Giga-triples (Gt) at CWI's Scilens cluster.

Now, from last time, we know to generate the data without 10 GB of namespace prefixes per file and with many short files. So we have 1.5 TB of gzipped data in 40,000 files, spread over 12 machines. The data generator has again been modified; generation now took about 4 days. Also from last time, we know to treat small integers specially when they occur as partition keys: 1 and 2 are very common values and skew becomes severe if they all go to the same partition; hence consecutive small INTs each go to a different partition, but for larger ones the low 8 bits are ignored, which is good for compression: consecutive values then fall in consecutive places, which is not the case for small INTs. Another uniquely brain-dead feature of the BSBM generator has also been rectified: when generating multiple files, the program would put things in files in a round-robin manner, instead of putting consecutive numbers in consecutive places, which is how every other data generator or exporter does it. This impacts bulk load locality and, as you, dear reader, ought to know by now, performance comes from (1) locality and (2) parallelism.
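A rough sketch of that partitioning rule (illustrative Python only; the 256-value threshold for what counts as a "small" integer and the function name are assumptions for illustration, not the actual Virtuoso code):

    def partition_of(key: int, n_partitions: int) -> int:
        # Small integers (e.g. 1 and 2) are extremely common partition keys,
        # so hashing them all to one partition would cause severe skew:
        # spread consecutive small values across consecutive partitions.
        SMALL_INT_LIMIT = 256  # assumed threshold, for illustration
        if 0 <= key < SMALL_INT_LIMIT:
            return key % n_partitions
        # For larger keys, ignore the low 8 bits so that runs of up to 256
        # consecutive values land in the same partition, which keeps them
        # adjacent on disk and compresses well.
        return (key >> 8) % n_partitions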

The machines are similar to last time: each a dual E5 2650 v2 with 256 GB RAM and QDR InfiniBand (IB). No SSD this time, but a slightly higher clock than last time; anyway, a different set of machines.

The first experiment is with triples, so no characteristic sets, no schema.

So, first day (Monday), we notice that one cannot allocate more than 9 GB of memory. Then we figure out that it cannot be done with malloc, whether in small or large pieces, but it can with mmap. Ain't seen that before. One day shot. Then, towards the end of day 2, load begins. But it does not run for more than 15 minutes before a network error causes the whole thing to abort. All subsequent tries die within 15 minutes. Then, in the morning of day 3, we switch from IB to Gigabit Ethernet (GigE). For loading this is all the same; the maximal aggregate throughput is 800 MB/s, which is around 40% of the nominal bidirectional capacity of 12 GigEs. So, it works better, for 30 minutes, and one can even stop the load and do a checkpoint. But after resuming, one box just dies; it does not even respond to ping. We swap it for another. After this, still running on GigE, there are no more network errors. So, at the end of day 3, maybe 10% of the data are in. But now it takes 2h21min to make a checkpoint, i.e., make the loaded data durable on disk. One of the boxes manages to write only 2 MB/s to a RAID-0 of three 2 TB drives. Bad disk; seen such before. The data can however be read back once the write is finally done.

Well, this is a non-starter. So, by mid-day of day 4, another machine has been replaced. Now writing to disk is possible within expected delays.

In the afternoon of day 4, the load rate is about 4.3 Mega-triples (Mt) per second, all going in RAM.

In the evening of day 4, adding more files to load in parallel increases the load rate to between 4.9 and 5.2 Mt/s. This is about as fast as this will go, since the load is not exactly even. This comes from the RDF stupidity of keeping an index on everything, so even object values where an index is useless get indexed, leading to some load peaks. For example, there is an index on POSG for triples where the predicate is rdf:type and the object is a common type. Use of characteristic sets will stop this nonsense.

But let us not get ahead of the facts: At 9:10 PM of day 4, the whole cluster goes unreachable. No, this is not a software crash or swapping; this also affects boxes on which nothing of the experiment was running. A whole night of running is shot.

A previous scale-model experiment of loading 37.5 Gt in 192 GB of RAM, paging to a pair of 2 TB disks, had been done a week before. This finished in time, keeping a load rate of above 400 Kt/s on a 12-core box.

At 10 AM on day 5 (Friday), the cluster is rebooted; a whole night's run has been missed. The cluster starts and takes about 30 minutes to get back to its former 5 Mt/s load rate. We now try switching the network back to InfiniBand. The whole Ethernet network seemed to have crashed at 9 PM on day 4. This is of course unexplained, but the experiment had been driving the Ethernet at about half its cross-sectional throughput, so maybe a switch crashed. We will never know. We will now try IB rather than risk this happening again, especially since if it did repeat, the whole weekend would be shot, as we would have to wait for the admin to reboot the lot on Monday (day 8).

So, at noon on day 5, the cluster is restarted with IB. The cruising speed is now 6.2 Mt/s, thanks to the faster network. The cross-sectional throughput is about 960 MB/s, up from 720 MB/s, which accounts for the difference. CPU load is correspondingly up. This is still not the full platform capacity, since there is load imbalance as noted above.

At 9 PM on day 5, the rate is around 5.7 Mt/s, with the peak node at 1500% CPU out of a possible 1600%. The next one is under 800%, which is just to show what it means to index everything. Specifically, the node with the highest CPU is the one in whose partition the bsbm:offer class falls, so there is a local peak since one of every 9 or so triples says that something is an offer. The stupidity of the triple store is to index garbage like this to begin with. The reason why the performance is still good is that a POSG index where P and O are fixed and the S is densely ascending is very good, with everything but the S represented as run lengths and the S as bitmaps. Still, no representation at all is better for performance than even the most efficient representation.
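To see why a fixed (P, O) prefix with a densely ascending S is so cheap, here is a toy sketch (illustrative Python only, not Virtuoso's actual index format): the repeated key columns collapse into a single run length, and the subject IDs into a mostly-dense bitmap.

    def compress_posg_slice(p, o, subjects):
        # Toy illustration: for one fixed (P, O) prefix, store the prefix
        # once with a run length, and the densely ascending subject IDs as
        # a bitmap of offsets from the smallest ID.
        assert subjects, "non-empty slice expected"
        subjects = sorted(subjects)
        base = subjects[0]
        bitmap = 0
        for s in subjects:
            bitmap |= 1 << (s - base)
        return {
            "prefix": (p, o),             # stored once...
            "run_length": len(subjects),  # ...for this many rows
            "s_base": base,
            "s_bitmap": bitmap,           # dense IDs give mostly-set bits
        }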

The journey consists of three different parts. At 10 PM, the third and last part is started. The data has a different shape, with more literals, but the load is more even. The cruising speed is 4.3 Mt/s, down from 6.2.

The last stretch of the data is about reviews and has less skew, so we increase parallelism, running 8 x 24 files at a time. The load rate goes above 6.3 Mt/s.

At 6:45 in the morning of day 6, the data is all loaded. The count of triples is 490.0 billion. If the load were done in a single stretch without stops and reconfiguration, it would likely go in under 24 h. The average rate for a 4-hour sample between midnight and 4 AM of day 6 is 6.8 Mt/s; at that rate, 490 Gt would take roughly 20 hours. The resulting database files add up to 10.9 TB, with about 20% of the volume in unallocated pages.

At this time, noon of day 6, we find that some cross-partition joins need more distinct pieces of memory than the default kernel settings allow per process. A large number of partitions produces a large number of sometimes long messages, which results in many mmaps. So we will wait until the morning of day 8 (Monday) for the administrator to raise these limits. In the meantime, we analyze the behavior of the workload on the 37 Gt scale-model cluster on my desktop.
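For reference, a minimal way to check the limit that is most likely the culprit (an assumption on our part: on Linux, the per-process cap on distinct memory mappings is vm.max_map_count, and raising it needs administrator rights):

    def max_map_count() -> int:
        # Read Linux's per-process limit on distinct memory mappings.
        with open("/proc/sys/vm/max_map_count") as f:
            return int(f.read().strip())

    # The usual default is 65530; a heavily partitioned node that mmaps
    # many long messages can run into it, hence the wait for the admin.
    print(max_map_count())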

To be continued...

LOD2 Finale Series

20:54

LOD2 Finale (part 1 of n): RDF Before The Dawn

The LOD2 FP7 project ends at the end of August 2014. This post begins a series that will crown the project with a grand finale, another decisive step towards the project’s chief goal of giving RDF and linked data performance parity with SQL systems.

In a nutshell, LOD2 went like this:

  1. Triples were done right, taking the best of the column store world and adapting it to RDF. This is now in widespread use.

  2. SQL was done right, as I have described in detail in the TPC-H series. This is generally available as open source in v7fasttrack. SQL is the senior science and a runner-up like sem-tech will not carry the day without mastering this.

  3. RDF is now breaking free of the triple store. RDF is a very general, minimalistic way of talking about things. It is not a prescription for how to do a database. Confusing these two things has given rise to RDF’s relative cost against alternatives. To cap off LOD2, we will have the flexibility of triples with the speed of the best SQL.

In this post we will look at accomplishments so far and outline what is to follow during August. We will also look at what in fact constitutes the RDF overhead, why this is presently so, and why this does not have to stay thus.

This series will be of special interest to anybody concerned with RDF efficiency and scalability.

At the beginning of LOD2, I wrote a blog post discussing the RDF technology and its planned revolution in terms of the legend of Perseus. The classics give us exemplars and archetypes, but actual histories seldom follow them one-to-one; rather, events may have a fractal nature where subplots reproduce the overall scheme of the containing story.

So it is also with LOD2: The Promethean pattern of fetching the fire (state of the art of the column store) from the gods (the DB world) and bringing it to fuel the campfires of the primitive semantic tribes is one phase, but it is not the totality. This is successfully concluded, and Virtuoso 7 is widely used at present. Space efficiency gains are about 3x over the previous version, with performance gains anywhere from 3 to 100x. As pointed out in the Star Schema Benchmark series (part 1 and part 2), in the good case one can run circles in SPARQL around anything but the best SQL analytics databases.

In the larger scheme of things, this is just preparation. In the classical pattern, there is the call or the crisis: Presently this is that having done triples about as right as they can be done, the mediocre in SQL can be vanquished, but the best cannot. Then there is the actual preparation: Perseus talking to Athena and receiving the shield of polished brass and the winged sandals. In the present case, this is my second pilgrimage to Mount Database, consisting of the TPC-H series. Now, the incense has been burned and libations offered at each of the 22 stations. This is not reading papers, but personally making one of the best-ever implementations of this foundational workload. This establishes Virtuoso as one of the top-of-the-line SQL analytics engines. The RDF public, which is anyway the principal Virtuoso constituency today, may ask what this does for them.

Well, without this step, the LOD2 goal of performance parity with SQL would be both meaningless and unattainable. The goal of parity is worth something only if you compare the RDF contestant to the very best SQL. And the comparison cannot possibly be successful unless it incorporates the very same hard core of down-to-the-metal competence the SQL world has been pursuing now for over forty years.

It is now time to cut the Gorgon’s head. The knowledge and prerequisite conditions exist.

The epic story is mostly about principles. If it is about personal combat, the persons stand for values and principles rather than for individuals. Here the enemy is actually an illusion, an error of perception, that has kept RDF in chains all this time. Yes, RDF is defined as a data model with triples in named graphs, i.e., quads. If nothing else is said, an RDF store is a thing that can take arbitrary triples and retrieve them with SPARQL. The naïve implementation is to store things as rows in a quad table, indexed in any number of ways. There have been other approaches suggested, such as property tables or materialized views of some joins, but these tend to throw out the baby with the bathwater: If RDF is used in the first place, it is used for its schema-less-ness and for having global identifiers. In some cases, there is also some inference, but the matter of schema-less-ness and identifiers predominates.

We need to go beyond a triple table and a dictionary of URI names while maintaining the present semantics and flexibility. Nobody said that physical structure needs to follow this. Everybody just implements things this way because this is the minimum that will in any case be required. Combining this with a SQL database for some other part of the data/workload hits basically insoluble problems of impedance mismatch between the SQL and SPARQL type systems, maybe using multiple servers for different parts of a query, etc. But if you own one of the hottest SQL racers in DB city and can make it do anything you want, most of these problems fall away.

The idea is simple: Put the de facto rectangular part of RDF data into tables; do not naively index everything in places where an index gives no benefit; keep the irregular or sparse part of the data as quads. Optimize queries according to the table-like structure, as that is where the volume is and where getting the best plan is a make or break matter, as we saw in the TPC-H series. Then, execute in a way where the details of the physical plan track the data; i.e., sometimes the operator is on a table, sometimes on triples, for the long tail of exceptions.
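As a rough sketch of the idea (illustrative Python only; the function and the support threshold are hypothetical, not Virtuoso's adaptive-schema implementation), the "rectangular" part can be detected from characteristic sets, i.e., the set of predicates attached to each subject:

    from collections import Counter, defaultdict

    def split_rectangular(quads, min_support=1000):
        # Subjects whose predicate set ("characteristic set") is shared by
        # many other subjects go into wide, table-like structures; the
        # irregular long tail stays as plain quads.
        preds_by_subject = defaultdict(set)
        for g, s, p, o in quads:
            preds_by_subject[s].add(p)

        cs_counts = Counter(frozenset(ps) for ps in preds_by_subject.values())
        frequent = {cs for cs, n in cs_counts.items() if n >= min_support}

        tables = defaultdict(list)   # characteristic set -> rows
        long_tail = []               # sparse/irregular data stays as quads
        for g, s, p, o in quads:
            cs = frozenset(preds_by_subject[s])
            (tables[cs] if cs in frequent else long_tail).append((g, s, p, o))
        return tables, long_tail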

In the next articles we will look at how this works and what the gains are.

These experiments will for the first time showcase the adaptive schema features of the Virtuoso RDF store. Some of these features will be commercial only, but the interested will be able to reproduce the single server experiments themselves using the v7fasttrack open source preview. This will be updated around the second week of September to give a preview of this with BSBM and possibly some other datasets, e.g., Uniprot. Performance gains for regular datasets will be very large.

To be continued...

LOD2 Finale Series

August 15 2014

11:14

AKSW Colloquium “Towards an Open Question Answering Architecture” conference pre-presentation on Monday, August 18 in P702

Towards an Open Question Answering Architecture

On Monday, August 18, at 13:30, Edgard Marx will give a pre-presentation of his SEMANTiCS conference talk on the accepted paper Towards an Open Question Answering Architecture.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

08:09

AKSW member will participate in ECAI 2014, Prague, Czech Republic


Hello!

The 21st European Conference on Artificial Intelligence (ECAI) will be held in Prague, Czech Republic, from 18 to 22 August 2014. Various excellent papers on artificial intelligence, logic, rule mining and many more topics will be presented.

AKSW member Ricardo Usbeck will present a poster on AGDISTIS – Agnostic Disambiguation of Named Entities using Linked Data. To the best of our knowledge, AGDISTIS outperforms state-of-the-art approaches in entity linking by up to 29% F-measure. Come and visit him at ECAI 2014.

A demo of AGDISTIS is available here: http://agdistis.aksw.org/demo and the paper can be found here.

Cheers,
Ricardo on behalf of AKSW

August 14 2014

11:00

The Rise of Content Ads

Here is a video of a talk I gave at a Digiday Summit this spring about the rise of content ads. For those of you who are not familiar, content ads fall under the umbrella of native ads and are ads that use content as the ad itself. In this talk, I cover the following topics:

  • What are content ads
  • Evolution of the content ad ecosystem
  • What do advertisers want from content ads
  • Evolution of Zemanta‘s role, now a DSP for content ads
  • How can online publishers work with content ads
  • How will standards evolve for content ads

August 05 2014

10:26

Commission Communication "Towards a thriving data-driven economy"

The European Commission Communication "Towards a thriving data-driven economy" has been adopted on 2 July 2014. Its aim is to focus on the positive aspects of big data and to promote it to leaders, according to Vice President @NeelieKroesEU. An excerpt from the press release:

The main problems identified in public consultations on big data are:

  1. Lack of cross-border coordination

  2. Insufficient infrastructure and funding opportunities

  3. A shortage of data experts and related skills

  4. Fragmented and overly complex legal environment

Main concrete actions proposed today to solve these problems:

  1. A Big Data public-private partnership that funds “game-changing” big data ideas, in areas such as personalised medicine and food logistics.

  2. Create an open data incubator (within the Horizon 2020 framework), to help SMEs set up supply chains based on data and use cloud computing more.

  3. Propose new rules on "data ownership" and liability of data provision for data gathered via Internet of Things (Machine to Machine communication)

  4. Mapping of data standards, identifying potential gaps

  5. Establish a series of Supercomputing Centres of Excellence to increase number of skilled data workers in Europe

  6. Create network of data processing facilities in different Member States


08:31

Additional contributions to SEMANTiCS 2014

Hello again!
Unfortunately, we missed the opportunity to inform you about other contributions of AKSW to SEMANTiCS 2014.
First, we neglected to tell you about another accepted paper:
  • Towards Question Answering on Statistical Linked Data (Konrad Höffner and Jens Lehmann)

Second, there is also another excellent and interesting series of workshops.

  • 01.09.2014, 09:00 – 12:30: Link Discovery of the Web of Data (organized by GeoKnow & LinkingLOD), hosted by Axel Ngonga (Uni Leipzig)
  • 01.09.2014, 09:00 – 12:30 and 14:00 – 17:30: GeoLD – Geospatial Linked Data (organised by the GeoKnow Project), hosted by Jens Lehmann (Uni Leipzig), Daniel Hladky (Ontos) and Andreas Both (Unister)
  • 01.09.2014, 09:00 – 12:30 and 14:00 – 17:30: MLODE 2014 – Content Analysis and the Semantic Web, a LIDER Hackathon, hosted by Bettina Klimek (Uni Leipzig) and Philipp Cimiano (Uni Bielefeld)
  • 02.09.2014, 09:00 – 12:30 and 14:00 – 17:30: MLODE 2014 – Multilingual Linked Open Data for Enterprises, LIDER Roadmapping workshop, hosted by Bettina Klimek (Uni Leipzig) and Philipp Cimiano (Uni Bielefeld)
  • 02.09.2014, 14:00 – 17:30: MLODE 2014 – Community meetings and break-out sessions, hosted by Bettina Klimek (Uni Leipzig) and Philipp Cimiano (Uni Bielefeld)
Visit us from the 1st to the 5th of September in Leipzig, Germany, and enjoy the talks. More information on these publications is available at http://aksw.org/Publications.
Cheers,
Ricardo on behalf of AKSW

August 04 2014

09:56

Content Creation in Today’s World of Native Advertising – Webinar by Zemanta & Scripted

Join Zemanta’s CEO, Todd Sawicki, and Scripted Co-founder and Head of Partnerships, Ryan Buckley as they team up to discuss how to scale your content creation and amplification efforts.

Banner ads no longer deliver the results they once did, and now brands are looking for new methods to drive their businesses forward.

With consumers connected online more than ever, brands have to publish original, high quality content to spread the word about their products and services. Brands also need a cost effective and efficient way to amplify the content they create, as well as measure the success.

In this one hour webinar, you’ll learn:

  • How to use content ads to engage customers
  • Types of content that perform best with native advertising
  • How marketers can create and manage content ads at scale
  • And much, much more!

Register today with the Scripted team to save your seat for this webinar.

We look forward to seeing you!

About the Speakers:


Todd Sawicki, Zemanta CEO

Based in Seattle, Todd is Zemanta’s CEO. He’s a long time exec and founder for digital media startups who has a thing for the world of online advertising and publishing. Todd loves to play ice hockey and really hopes Seattle gets a proper hockey team.


Ryan Buckley, Scripted.com Co-Founder & Head of Partnerships

Ryan is Scripted’s Co-founder and Head of Partnerships. Ryan holds an MBA from the MIT Sloan School of Management and an MPP from the Harvard Kennedy School of Government. Still and always a Cal Bear, Ryan graduated from UC Berkeley with degrees in Economics and Environmental Sciences. He likes to dabble in PHP, Python, Rails, Quickbooks, and whatever else he needs to know to help run Scripted.

August 01 2014

14:54

Five AKSW Papers at SEMANTiCS 2014

Hello Community!
We are very pleased to announce that five of our papers were accepted for presentation at SEMANTiCS 2014.  The papers cover architectures for Big Data Search Engines, Linked Data Visualisations, Machine Learning and Dataset Descriptions. In more detail, we will present the following papers:
  • A distributed search framework for full-text, geo-spatial and semantic search (Andreas Both, Axel-Cyrille Ngonga Ngomo, Ricardo Usbeck, Christiane Lemke, Denis Lukovnikov and Maximilian Speicher)
  • LD Viewer – Linked Data Presentation Framework (Denis Lukovnikov, Claus Stadler and Jens Lehmann)
  • A Comparison of Supervised Learning Classifiers for Link Discovery (Tommaso Soru and Axel-Cyrille Ngonga Ngomo)
  • Towards an Open Question Answering Architecture (Edgard Marx, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Konrad Höffner, Jens Lehmann and Sören Auer)
  • DataID: Towards Semantically Rich Metadata For Complex Datasets (Martin Brümmer, Ciro Baron, Ivan Ermilov, Markus Freudenberg and Sebastian Hellmann)

Additionally, there are also interesting workshops, for example Link Discovery of the Web of Data (organized by GeoKnow & LinkingLOD), hosted by Axel Ngonga-Ngomo.

Visit us on the 4th and 5th of September in Leipzig, Germany, and enjoy the talks. More information on these publications is available at http://aksw.org/Publications.
Cheers,
Ricardo on behalf of AKSW

July 30 2014

17:07

The start of a publishing revolution

This January, sovrn took a stand for publishers. We started the sovrn pubhub as a community gathering place for our publishers and we’re working on a top-secret project that will change publishing as we know it — ultimately giving the power of insight, engagement, and analytics (previously only available to advertisers and marketers) to all sovrn publishers.

What does that mean for you? It means no more managing one-off brand relationships. It means no more guessing if your content is really engaging your audience. It means no more spending hours trying to find and connect with people and tools that can help you grow. We have nearly 20,000 publishers and 1 million sites in our network and we work with over 70 demand partners. So we have a pretty good idea of what works — and we want to share it with you.

For instance, you might’ve heard something about this “mobile revolution”. It’s so big that some sovrn publishers are reporting that their mobile traffic is eclipsing their desktop traffic. (Take a look at Pinch of Yum’s monthly site income report.) So, what are you, as a publisher, supposed to know and do about this mobile thing? Well, a few things for you to consider are:

  1. Make sure your website is mobile responsive. That means that your readers automatically get served the graphics and content in the size and format that makes sense for the device that they’re using. If you’re a WordPress publisher, simply choose a mobile-responsive theme.
  2. Install Zemanta’s Editorial Assistant WordPress plugin or Browser Extension. This gives your content that lift that you’re looking for by including trend-worthy content for extra share-ability and some SEO link-love.
  3. Install sovrn’s mobile ad manager WordPress plugin. Mobile-optimized ads fill better and monetize better than those teeny little display ads that no one can see on their phones or tablets. (Don’t remove the plain ol’ display ads, just add the mobile ones in there too.) Mobile-specific ads are money in your pocket.

Want to join the publishing revolution? Visit the sovrn pubhub regularly for tips, tricks and tools to help you grow — and if you have a website, join the sovrn publishing network today!

sovrn logo

About sovrn

sovrn is an advocate of and partner to almost 20,000 publishers across the independent web, representing more than a million sites, who use our tools, services and analytics to grow their audiences, engage their readers, and monetize their sites.

July 24 2014

09:55

AKSW Colloquium “Knowledge Extraction and Presentation” on Monday, July 28, 3.00 p.m. in Room P702

Knowledge Extraction and Presentation

On Monday, July 28, in room P702 at 3.00 p.m., Edgard Marx will propose a question answering system. He has a computer science background (BSc and MSc in Computer Science, PUC-Rio) and is a member of AKSW (Agile Knowledge Engineering and Semantic Web). Edgard has been engaged in Semantic Web technology research since 2010 and mainly works on evangelization and the development of conversion and mapping tools.

Abstract

The use of Semantic Web technologies has led to an increasing amount of structured data being published on the Web.
Despite advances in question answering systems, retrieving and presenting the desired information from RDF-structured sources remains substantially challenging.
In this talk we will present our proposal and working draft to address these challenges.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

July 21 2014

07:58

DBpedia Spotlight V0.7 released

DBpedia Spotlight is an entity linking tool for connecting free text to DBpedia through the recognition and disambiguation of entities and concepts from the DBpedia KB.
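For readers who have not used it, calling the annotation web service looks roughly like this (a sketch using the Python requests library; the endpoint URL, parameter names, and response fields are assumptions based on the public Spotlight REST interface and should be checked against the release notes and demos linked below):

    import requests

    # Assumed endpoint; see the release notes and demo links below for the
    # currently hosted services.
    ENDPOINT = "http://spotlight.dbpedia.org/rest/annotate"

    response = requests.get(
        ENDPOINT,
        params={
            "text": "Berlin is the capital of Germany.",
            "confidence": 0.4,  # the confidence value mentioned in the changes below
        },
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()

    # Each recognized entity is linked to a DBpedia resource.
    for resource in response.json().get("Resources", []):
        print(resource["@surfaceForm"], "->", resource["@URI"])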

We are happy to announce Version 0.7 of DBpedia Spotlight, which is also the first official release of the probabilistic/statistical implementation.

More information, as well as updated evaluation results for DBpedia Spotlight V0.7, can be found in this paper:

Joachim Daiber, Max Jakob, Chris Hokamp, Pablo N. Mendes: Improving Efficiency and Accuracy in Multilingual Entity Extraction. I-SEMANTICS 2013.

The changes to the statistical implementation include:

  • smaller and faster models through quantization of counts, optimization of search and some pruning
  • better handling of case
  • various fixes in Spotlight and PigNLProc
  • models can now be created without requiring a Hadoop and Pig installation
  • UIMA support by @mvnural
  • support for confidence value

See the release notes at [1] and the updated demos at [4].

Models for Spotlight 0.7 can be found here [2].

Additionally, we now provide the raw Wikipedia counts, which we hope will prove useful for research and development of new models [3].

A big thank you to all developers who made contributions to this version (with special thanks to Faveeo and Idio). Huge thanks to Jo for his leadership and continued support to the community.

Cheers,
Pablo Mendes,

on behalf of Joachim Daiber and the DBpedia Spotlight developer community.

[1] - https://github.com/dbpedia-spotlight/dbpedia-spotlight/releases/tag/release-0.7

[2] - http://spotlight.sztaki.hu/downloads/

[3] - http://spotlight.sztaki.hu/downloads/raw

[4] - http://dbpedia-spotlight.github.io/demo/

(This message is an adaptation of Joachim Daiber’s message to the DBpedia Spotlight list. Edited to suit this broader community and give credit to him.)

July 20 2014

21:42

Senior Python web developer wanted for world leading open source data portal

Working in the fast-growing area of open data, we build open source tools to drive transparency, accountability and re-use. Our flagship product CKAN runs the official national data portals from the UK to Brazil, US to Australia and many others. We also build data tools and OpenSpending browsers.

We’re looking for someone passionate about the technical challenges of building software that is used as the infrastructure for open data around the world, so come join our growing team to shape the future of the open data ecosystem!

Key Skills

  • Python, JavaScript, HTML, CSS
  • Python web frameworks (we use Pylons and Flask)
  • PostgreSQL and SQLAlchemy
  • Self motivated, self-starter, able to manage your own time

Extra bonus points for:

  • Open source projects/contributions
  • Front end skills, particularly in data-vis
  • You’ve written an app using open data before
  • Experience working in a distributed team

How to apply

Email jobs@okfn.org, with the subject line “Python Developer – CKAN/Services”. Please include:

  1. Your CV
  2. A link to your GitHub (or similar) profile
  3. A cover letter

More about the Job

You will be working as part of a small, dynamic team in a modern, open-source development environment. This role is full-time and we are very happy with remote-working.

We generally work remotely (with strong contingents in London and Berlin), using asynchronous communication (email, IRC, GitHub) but with standups, developer meetings and demos most days (Skype, Google Hangout) and real-world gatherings more than twice a year including at the Open Knowledge Festival. We also try and ensure our developers can attend at least one open source conference a year.

At each level of our software stack we use best-in-class open source software, including Python, Nose, Travis CI and Coveralls, Sphinx and Read the Docs, Flask, Jinja2, Solr, PostgreSQL and SQLAlchemy, JavaScript and jQuery, Bootstrap, Git and Github, and Transifex.

We iterate quickly, and publish working, open-source code early and often.

All of our code is on github – it is open to public scrutiny, and we encourage contributions from third-party developers. This means that we have to write exceptionally clear, readable, well-tested code with excellent documentation.

All code contributions, whether from internal or external developers, are made with GitHub pull requests and we do code reviews in the open on GitHub.

We are engaged with a large and active community of users, developers and translators of our open source software, via our mailing lists, GitHub issues and pull requests, public developer meetings, Stack Overflow and Transifex. We support users in getting started with our software, encourage and mentor new developers, and take on feedback and suggestions for the next releases.

About the Open Knowledge Foundation

The Open Knowledge Foundation (OKF) is an internationally recognized non-profit working to open knowledge and see it used to empower and improve the lives of citizens around the world. We build tools, provide advice and develop communities in the area of open knowledge: data, content and information which can be freely shared and used. We believe that by creating an open knowledge commons we can make a significant contribution to improving governance, research and the economy. The last two years have seen rapid growth in our activities, increasing our annual revenue to £2m and our team to over 35 across four continents. We are a virtual organisation with the whole team working remotely, although we have informal clusters in London, Cambridge and Berlin.

The OKF is an international leader in its field and has extensive experience in building open source tools and communities around open material. The Foundation’s software development work includes some of the most innovative and widely acclaimed projects in the area. For example, its CKAN project is the world’s leading open source data portal platform – used by data.gov, data.gov.uk, the European Commission’s open data portal, and numerous national, regional and local portals from Austria to Brazil. The award winning OpenSpending project enables users to explore over 13 million government spending transactions from around the world. It has an active global network which includes Working Groups and Local Groups in dozens of countries – including groups, ambassadors and partners in 21 of Europe’s 27 Member States.

We’re changing the world by promoting a global shift towards more open ways of working in government, arts, sciences and much more.

July 17 2014

15:22
SEEK – Leading Aussie Job Search Website Powered by CloudView

July 15 2014

08:57

From Taxonomies over Ontologies to Knowledge Graphs

With the rise of linked data and the semantic web, concepts and terms like ‘ontology’, ‘vocabulary’, ‘thesaurus’ or ‘taxonomy’ are being picked up frequently by information managers, search engine specialists and data engineers to describe ‘knowledge models’ in general. In many cases the terms are used without any specific meaning, which brings a lot of people to the basic question:

What are the differences between a taxonomy, a thesaurus, an ontology and a knowledge graph?

This article aims to shed some light on this discussion by guiding you through an example which starts off from a taxonomy, introduces an ontology and finally exposes a knowledge graph (linked data graph) to be used as the basis for semantic applications.

1. Taxonomies and thesauri

Taxonomies and thesauri are closely related species of controlled vocabularies that describe relations between concepts and their labels, including synonyms, most often in various languages. Such structures can be used as a basis for domain-specific entity extraction or text categorization services. Here is an example of a taxonomy about the Apollo programme, created with PoolParty Thesaurus Server:

Apollo programme taxonomy

The nodes of a taxonomy represent various types of ‘things’ (so-called ‘resources’): the topmost level (orange) is the root node of the taxonomy, purple nodes are so-called ‘concept schemes’, followed by ‘top concepts’ (dark green) and ordinary ‘concepts’ (light green). In 2009 W3C introduced the Simple Knowledge Organization System (SKOS) as a standard for the creation and publication of taxonomies and thesauri. The SKOS ontology comprises only a few classes and properties. The most important types of resources are: Concept, ConceptScheme and Collection. The hierarchical relations between concepts are ‘broader’ and its inverse ‘narrower’. Thesauri most often also cover non-hierarchical relations between concepts, like the symmetric property ‘related’. Every concept has at least one ‘preferred label’ and can have numerous synonyms (‘alternative labels’). Whereas a taxonomy can be envisaged as a tree, thesauri most often have polyhierarchies: a concept can be the child node of more than one node. A thesaurus should therefore be envisaged as a network (graph) of nodes rather than a simple tree, including polyhierarchical and also non-hierarchical relations between concepts.
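To make this concrete, here is a minimal sketch of such a taxonomy fragment built with the rdflib Python library (the example.org namespace and the chosen concepts are hypothetical, for illustration only):

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/apollo/")  # hypothetical namespace

    g = Graph()
    g.bind("skos", SKOS)
    g.bind("ex", EX)

    # A concept scheme with one top concept and one narrower concept.
    g.add((EX.apollo, RDF.type, SKOS.ConceptScheme))
    g.add((EX.missions, RDF.type, SKOS.Concept))
    g.add((EX.missions, SKOS.topConceptOf, EX.apollo))
    g.add((EX.missions, SKOS.prefLabel, Literal("Missions", lang="en")))

    g.add((EX.apollo11, RDF.type, SKOS.Concept))
    g.add((EX.apollo11, SKOS.inScheme, EX.apollo))
    g.add((EX.apollo11, SKOS.broader, EX.missions))  # hierarchical relation
    g.add((EX.apollo11, SKOS.prefLabel, Literal("Apollo 11", lang="en")))
    g.add((EX.apollo11, SKOS.altLabel, Literal("First crewed Moon landing", lang="en")))

    print(g.serialize(format="turtle"))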

2. Ontologies

Ontologies are perceived as being complex in contrast to the rather simple taxonomies and thesauri. Limitations of taxonomies and SKOS-based vocabularies in general become obvious as soon as one tries to describe a specific relation between two concepts: ‘Neil Armstrong’ is not only unspecifically ‘related’ to ‘Apollo 11’; he was the ‘commander of’ this particular Apollo mission. Therefore we have to extend the SKOS ontology by two classes (‘Astronaut’ and ‘Mission’) and the property ‘commander of’, which is the inverse of ‘commanded by’.

Apollo ontology relations

The SKOS concept with the preferred label ‘Buzz Aldrin’ has to be classified as an ‘Astronaut’ in order to be described by specific relations and attributes like ‘is lunar module pilot of’ or ‘birthDate’. The introduction of additional ontologies to expand the expressivity of SKOS-based vocabularies follows the ‘pay-as-you-go’ strategy of the linked data community. The PoolParty knowledge modelling approach suggests starting with SKOS and then extending this simple knowledge model with other knowledge graphs, ontologies, annotated documents and legacy data. This paradigm can be memorized by a rule named ‘Start SKOS, grow big’.
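Continuing the sketch above (again with hypothetical example.org identifiers), such an extension amounts to declaring the two classes and the pair of inverse properties, and then typing and linking the existing SKOS concepts:

    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL, RDF

    EX = Namespace("http://example.org/apollo/")  # hypothetical namespace

    g = Graph()

    # Two classes and a pair of inverse object properties extending SKOS.
    g.add((EX.Astronaut, RDF.type, OWL.Class))
    g.add((EX.Mission, RDF.type, OWL.Class))
    g.add((EX.commanderOf, RDF.type, OWL.ObjectProperty))
    g.add((EX.commandedBy, RDF.type, OWL.ObjectProperty))
    g.add((EX.commanderOf, OWL.inverseOf, EX.commandedBy))

    # The SKOS concepts from the taxonomy get an additional type and the
    # specific relation instead of the unspecific skos:related.
    g.add((EX.neilArmstrong, RDF.type, EX.Astronaut))
    g.add((EX.apollo11, RDF.type, EX.Mission))
    g.add((EX.neilArmstrong, EX.commanderOf, EX.apollo11))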

3. Knowledge Graphs

Knowledge graphs are all around (e.g. DBpedia, Freebase, etc.). Based on W3C’s Semantic Web Standards such graphs can be used to further enrich your SKOS knowledge models. In combination with an ontology, specific knowledge about a certain resource can be obtained with a simple SPARQL query. As an example, the fact that Neil Armstrong was born on August 5th, 1930 can be retrieved from DBpedia. Watch this YouTube video which demonstrates how ‘linked data harvesting’ works with PoolParty.
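Such a lookup can be sketched with the SPARQLWrapper Python library against DBpedia's public endpoint (the dbo:birthDate property name reflects the DBpedia ontology; endpoint and property should be verified against the current dataset):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?birthDate WHERE {
            dbr:Neil_Armstrong dbo:birthDate ?birthDate .
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["birthDate"]["value"])  # e.g. 1930-08-05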

Knowledge graphs can be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets.

Why should I transform my content and data into a large knowledge graph?

The answer is simple: to be able to run complex queries over the entirety of all kinds of information. By breaking up the data silos, there is a high probability that query results become more valid.

With PoolParty Semantic Integrator, content and documents from SharePoint, Confluence, Drupal etc. can be transformed automatically to integrate them into enterprise knowledge graphs.

Taxonomies, thesauri, ontologies, linked data graphs including enterprise content and legacy data – all kinds of information can become part of an enterprise knowledge graph, which can be stored in a linked data warehouse. Based on technologies like Virtuoso, such data warehouses have the ability to serve as complex question answering systems with excellent performance and scalability.

4. Conclusion

In the early days of the semantic web, we’ve constantly discussed whether taxonomies, ontologies or linked data graphs will be part of the solution. Again and again, discussions like ‘Did the current data-driven world kill ontologies?’ are being led. My proposal is: try to combine all of those. Embrace every method which makes meaningful information out of data. Stop denouncing communities which don’t follow one or the other aspect of the semantic web (e.g. reasoning or SKOS). Let’s put the pieces together – together!

 

July 14 2014

14:58

[CfP] Semantic Web Journal: Special Issue on Question Answering over Linked Data

Dear all,
The Semantic Web Journal is launching a special issue on Question Answering over Linked Data, soliciting original papers that
* address the challenges involved in question answering over linked data,
* present resources and tools to support question answering over linked data, or
* describe question answering systems and applications.
Submission deadline is November 30th, 2014. For more detailed information please visit:
With kind regards,
Axel Ngonga and Christina Unger
14:57

New Version of FOX

Dear all,
We are very pleased to announce a new version of FOX [1]. Several improvements have been carried out:
(1) We have fixed minor issues in the code. In addition, we have updated several libraries.
(2) As a result, the FOX output parameters have changed minimally. An exact specification of the parameters with examples is available at the demo page. [2]
(3) Moreover, we now make bindings available for Java[3] and Python[4] to use FOX’s web service within your application.
Enjoy and cheers,
The FOX team
09:57

Two Accepted Papers in the Health Sector

Two new publications by Sonja Zillner et al. have been accepted at international conferences. See http://big-project.eu/publications for all project publications.
 
User Needs and Requirements Analysis for Big Data Healthcare Applications (MIE 2014)
The realization of big data applications that improve the quality and efficiency of healthcare delivery is challenging. In order to take advantage of the promising opportunities of big data technologies, a clear understanding of the needs and requirements of the various stakeholders of healthcare, such as patients, clinicians and physicians, healthcare providers, payors, the pharmaceutical industry, medical product suppliers and government, is needed.
Our study is based on internet, literature and market study research as well as on semi-structured interviews with major stakeholder groups of healthcare delivery settings. The analysis shows that big data technologies could be used to align the opposing user needs of improved quality with improved efficiency of care. However, this requires the integrated view of various heterogeneous data sources, legal frameworks for data sharing and incentives that foster collaboration.
Towards a Technology Roadmap for Big Data Applications in the Healthcare Domain (IEEE IRI HI)
Big Data technologies can be used to improve the quality and efficiency of healthcare delivery. The highest impact of Big Data applications is expected when data from various healthcare areas, such as clinical, administrative, financial, or outcome data, can be integrated. However, as of today, seamless access to the various healthcare data pools is only possible in a very constrained and limited manner. To enable seamless access, several technical requirements, such as data digitalization, semantic annotation, data sharing, data privacy and security, as well as data quality, need to be addressed. In this paper, we introduce a detailed analysis of these technical requirements and show how the results of our analysis lead towards a technical roadmap for Big Data in the healthcare domain.

06:50

Crowdsourced “Open” Innovation an Imperative for Professional Services Firms

The Financial Times’s most recent Special Report on The Connected Business inspired me to think about the imperative for law, accounting, and other professional services firms (PSOs) to adopt crowdsourced open innovation methods for enhancing service delivery to their clients. Crowdsourcing in the context of innovation essentially refers to submitting problems to an open or restricted community and asking for suggestions and perhaps even joint development. Winners might even get a reward as part of a contest. In the context of PSOs, the community might be the members of a corporate legal department and the partners, associates and paralegals in a law firm that traditionally has provided services to them. In an era where corporate legal departments are moving more work in-house and consolidating outsourced legal work to a smaller number of law firms, open innovation methods initiated by a law firm can help a corporate legal department refine its choice of law firm service providers more accurately. Here’s how it might work.

The law firm partner or associate responsible for a corporate legal account might identify the top challenges that the corporate legal department faces in working with outside law firms in an environment of ever more challenging regulation but dwindling budgets for managing it. The law firm might propose holding an innovation tournament (a crowdsourced innovation method) involving the law firm team and the corporate legal team to come up with ideas to address those challenges. Jointly, they would generate and vote on the best ideas. An excellent source of inspiration for all individuals taking part in the tournament and subsequent development efforts might be ideas from the Reinvent Law Laboratory.

The FT’s most recent Special Report on The Connected Business reminds us of techniques to rapidly experiment with solutions: (1) generate the best ideas from the innovation tournament; (2) design solutions; (3) rapidly prototype and test to decide whether to move ahead or change the design; and (4) move to production. The law firm and corporate legal department might even jointly fund a solution, working with an outside vendor to help create it. Essentially these are the techniques identified in the very interesting book by Eric Ries, The Lean Startup. Crowdsourced open innovation initiatives involving law firm and corporate legal team members working for a joint cause can leverage the hybrid techno-legal talents in the emerging generation of lawyers, who grew up with Facebook, Google, iPods, iPads, iTunes and Amazon.

Does anyone have experience with open innovation initiatives involving PSOs and their clients? It would be great to hear from you about your experiences.

 
