
April 02 2014

11:37

Joseph A Busch: Case Study on the American Physical Society’s Taxonomy

Joseph A Busch

Taxonomy Strategies has been working with the American Physical Society (APS) to develop a new faceted classification scheme. The proposed scheme includes several discrete sets of categories, called facets, whose values can be combined to express concepts such as existing Physics and Astronomy Classification Scheme (PACS) codes, as well as new concepts that have not yet emerged or have been difficult to express with the existing PACS.

PACS codes formed a single-hierarchy classification scheme, designed to assign the “one best” category that an item will be classified under. Classification schemes come from the need to physically locate objects in one dimension, for example in a library where a book will be shelved in one and only one location, among an ordered set of other books. Traditional journal tables of contents similarly place each article in a given issue in a specific location among an ordered set of other articles, certainly a necessary constraint with paper journals and still useful online as a comfortable and familiar context for readers.

However, the real world of concepts is multi-dimensional. In collapsing to one dimension, a classification scheme makes essentially arbitrary choices that have the effect of placing some related items close together while leaving other related items in very distant bins. It also has the effect of repeating the terms associated with the last dimension in many different contexts, leading to an appearance of significant redundancy and complexity in locating terms.

A faceted taxonomy attempts to identify each stand-alone concept through the term or terms commonly associated with it, and to have it mean the same thing whenever it is used. Hierarchy in a taxonomy is useful for grouping related terms together; however, the intention is not to identify an item such as an article or book by a single concept, but rather to assign multiple concepts that together represent its meaning. In that way, related items can be closely associated along multiple dimensions corresponding to each assigned concept. Where previously a single PACS code was used to indicate the research area, now two, three, or more of the new concepts may be needed (although often a single new concept will be sufficient). Applying the new taxonomy therefore requires a different mindset and approach from the way APS has been accustomed to working with PACS; however, it also enables significant new capabilities for publishing and working with all types of content, including articles, papers, and websites.
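
To make the shift concrete, here is a minimal relational sketch of a faceted assignment. The table, facet, and label names are hypothetical and are not APS's actual scheme; the point is only that an article carries several concepts, one or more per relevant facet, and that queries can combine facets freely.

CREATE TABLE concept (
  concept_id  INTEGER PRIMARY KEY,
  facet       VARCHAR(50),    -- e.g. 'Research Area', 'Physical System', 'Technique'
  label       VARCHAR(200)
);

CREATE TABLE article_concept (
  article_id  INTEGER,
  concept_id  INTEGER REFERENCES concept,
  PRIMARY KEY (article_id, concept_id)
);

-- Articles tagged with both a given technique and a given physical system,
-- a combination that a single "best" code would have to enumerate in advance.
SELECT  ac1.article_id
  FROM  article_concept ac1
  JOIN  concept c1 ON c1.concept_id = ac1.concept_id
  JOIN  article_concept ac2 ON ac2.article_id = ac1.article_id
  JOIN  concept c2 ON c2.concept_id = ac2.concept_id
 WHERE  c1.facet = 'Technique'        AND c1.label = 'Neutron scattering'
   AND  c2.facet = 'Physical System'  AND c2.label = 'Graphene';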

To build and maintain the faceted taxonomy, APS has acquired the PoolParty taxonomy management tool. PoolParty will enable APS editorial staff to create, retrieve, update, and delete taxonomy term records. The tool will support the various thesaurus, knowledge organization system, and ontology standards for concepts, relationships, alternate terms, etc. It will also provide methods for:

  • Associating taxonomy terms with content items, and storing that association in a content index record (a simple sketch of such a record follows this list).
  • Automated indexing to suggest taxonomy terms that should be associated with content items, and text mining to suggest terms to potentially be added to the taxonomy.
  • Integrating taxonomy term look-up, browse and navigation in a selection user interface that, for example, authors and the general public could use.
  • Implementing a feedback user interface allowing authors and the general public to suggest terms, record the source of the suggestion, and inform the user on the disposition of their suggestion.
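
As a simple sketch of the first item above, a content index record might associate a content item with a taxonomy term and remember where the assignment came from. The table and column names here are hypothetical, not PoolParty's actual data model.

CREATE TABLE content_index (
  content_id   INTEGER,        -- article, paper, or web page
  concept_uri  VARCHAR(300),   -- identifier of the taxonomy term, e.g. a SKOS concept URI
  source       VARCHAR(20),    -- 'editor', 'auto-indexer', or 'suggestion'
  PRIMARY KEY (content_id, concept_uri)
);

-- All content items indexed with a given term:
SELECT  content_id
  FROM  content_index
 WHERE  concept_uri = 'http://example.org/taxonomy/quantum-computing';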

Taxonomy Strategies (www.taxonomystrategies.com) is an information management consultancy that specializes in applying taxonomies, metadata, automatic classification, and other information retrieval technologies to the needs of business and other organizations.

March 31 2014

11:04

Why SKOS should be a focal point of your linked data strategy

The Simple Knowledge Organization System (SKOS) has become one of the ‘sweet spots’ of the linked data ecosystem in recent years. Especially where semantic web technologies are being adapted to the requirements of enterprises or public administration, SKOS has played a central role in creating knowledge graphs.

In this webinar, key people from the Semantic Web Company will describe why controlled vocabularies based on SKOS play a central role in a linked data strategy, and how SKOS can be enriched by ontologies and linked data to further improve semantic information management.

SKOS unfolds its potential at the intersection of three disciplines and their methods:

  • library sciences: taxonomy and thesaurus management
  • information sciences: knowledge engineering and ontology management
  • computational linguistics: text mining and entity extraction

Linked Data-based IT architectures cover all three aspects and provide means for agile data, information, and knowledge management.

In this webinar, you will learn about the following questions and topics:

  • How does SKOS build the foundation of enterprise knowledge graphs that can be enriched by additional vocabularies and ontologies?
  • How can knowledge graphs be used to build the backbone of metadata services in organisations?
  • How can text mining be used to create high-quality taxonomies and thesauri?
  • How can knowledge graphs be used for enterprise information integration?

You will see several live demos of end-user applications based on linked data, built with the PoolParty Semantic Suite, as well as of PoolParty’s latest release, which provides outstanding facilities for professional linked data management, including taxonomy, thesaurus, and ontology management.

Register here: https://www4.gotomeeting.com/register/404918583

 

March 26 2014

15:19

Seymour: Maybe I Was Wrong About Legal Wearables

Maybe I was wrong about wearables: I needed to go beyond my comfort zone to see what’s around the bend. I too easily settled for limits. Seymour is the project name for one of the ideas that took shape during the Innovation Tournament, and while it’s a technical challenge, it may not be entirely without merit. Here’s why.

Business Intake made Easy 

I once attended an Intellectual Property audit for a niche software company to support their Intellectual Asset Management. We had an extensive paper questionnaire and a preparation meeting to ensure our visit would be fruitful. The goal was to get as many of the questions answered as possible without making the client feel like it was an inquisition. When we arrived, it was a bit chaotic and nobody managed to get all the answers. We left with boxes of documents to help us finalize the form.

Modern-day logic would dictate that we need some kind of database for the intake. Even better if it were a mobile app with just a checklist, so we could divide the load across the team during our visit. It would help if we already had information pre-populated and just needed to fill in the blanks. Looks good on ‘paper,’ right?

Well, I believe that in the last decade our process thinking has resulted in convoluted systems. We used the wishful add-on “it should be easy to use and intuitive” like a sprinkle of angel dust to make the core product usable, assuming the core product, the database, is what it is all about. No, it isn’t, because nobody can use an empty database or, worse, outdated data. So the time of laborious data entry should be in our past.

Seymour, See, Save, Share

I suspect wearables will play a major role in this space. Wearables will be the fastest way to grow any database, simply because data entry will be more convenient. Forbes reported this as the first useful Glassware, and seeing their video, you might agree:

“Sullivan Solar Power …developed a Google Glass app that gives its field technicians “volumes” of electrical system data in a hands’ free, or close to it, manner—which I would imagine to be a welcome delivery mode for someone wrestling with heavy equipment on a rooftop.”

Being fed real-time contextual information in situations where it’s slightly awkward to break out the laptop and do desktop legal research seems extremely powerful. Only consuming information might not make it ‘killer’ for me, but if you can combine it with creation, it will be close. A quote from this Glass-wearing president and creative director:

“…The thing I use it for the most is taking notes. I tap it and say, “Take a note,” and then a microphone shows up and it will accurately dictate everything I say for about 30 seconds. And then when I stop talking it sends it to Evernote. At the same time, if someone else is using Evernote, they can send the note to me and it will appear in my screen.”

First Look: Evernote for Google Glass

This may be far-fetched, but the possibility of having a checklist as Glassware and just ‘nodding’ items off the list would be quite cool. Better yet, just tapping your wrist would be even cooler:

Tick off checklists for groceries with the Pebble, which syncs to Evernote, for a hands-free shopping experience. Evernote Reminders are supported, so you’ll always have your notifications and to-dos close to you.

Evernote on the Go: Introducing Evernote for Pebble

If we just infuse the right legal context into these workflows, we can even make legal research fun again. Shopping for groceries is not that different from shopping for Intellectual Property; it can only be made more pleasant by the tools we use.

Going off the grid

There was one little caveat with wearables, or really with any internet-connected device: it needs an internet connection. Well, maybe not. Let me introduce you to “Wireless Mesh Networking.” This enables device-to-device communication in a free-form, non-internet-dependent way. And that’s almost perfect for having wearables talk to our phone, or to each other. It’s one of the best-kept secrets in the latest iOS 7, and it is what Google is betting on to extend wearables and even home automation.

Last year I just had a name and a notion. Now it’s slowly making sense and Seymour is my reminder to keep going beyond the bend.

 

March 24 2014

07:06

2015 FDA Budget Request Supports 5-Year Strategic Plan

In April 2011, the Food and Drug Administration (FDA) published a five-year strategic plan, entitled “Strategic Priorities 2011 – 2015.” In that plan, the FDA set forth five modernization priorities for the agency. The plan also committed to using these priorities to improve agency infrastructure, modernize the regulatory processes, strengthen its workforce, and, ultimately, do a better job promoting and protecting the health of Americans. The five modernization priorities are:

  • Advance Regulatory Science and Innovation
  • Strengthen the Safety and Integrity of the Global Supply Chain
  • Strengthen Compliance and Enforcement Activities to Support Public Health
  • Expand Efforts to Meet the Needs of Special Populations
  • Advance Medical Countermeasures and Emergency Preparedness

Since 2011, the FDA budget requests have focused on these five strategic priorities. The President’s Fiscal Year (FY) 2015 Budget Request also tries to build FDA capacity in these five areas.

Budget Overview

The FY 2015 FDA Budget Request (Justification of Estimates for Appropriations Committees) totals $4.74 billion, which is $358 million above the FY 2014 enacted level and is comprised of $2.58 billion in budget authority and $2.16 billion in user fees. The $358 million increase consists of $23 million in budget authority and $335 million in user fees. The user fee funding growth flows from new programs and increased user fee collection authority for FDA’s existing programs.

Budget Increases

The increased spending in the FY 2015 Budget Request focuses on (1) medical product safety, (2) strengthening oversight of the pharmacy compounding industry, (3) supporting food safety, and (4) implementation of the Food Safety and Modernization Act (FSMA) (P.L. 111-353). Here is the breakdown:

  • Medical product safety. The budget provides a program level of $2.6 billion for medical product safety, which is $61 million above the FY 2014 enacted level, to continue core medical product safety activities across FDA programs.
  • Pharmacy compounding. Out of the $61 million in medical product safety spending, FDA will invest $25 million in budget authority to enhance pharmacy compounding oversight activities. This request also includes $4.6 million for proposed International Courier user fees.
  • Food safety. The budget provides a program level of $1.48 billion for food safety, which is $263 million above the FY 2014 enacted level. This request also includes $255 million in proposed new user fees.
  • FSMA implementation. Within the $1.48 billion proposed for food safety, FDA will invest $24 million to further implement the FSMA.

The spending for further FSMA implementation includes: (1) finalizing mandated rulemakings on preventive controls for human and animal food, produce safety, and foreign supplier verification; (2) developing technical support for ongoing FSMA guidance development; (3) increased training and certification of federal, state, local, tribal, territorial, and international partners conducting food safety inspections; (4) supporting state capacity building, FDA-state joint work planning and data sharing, and collaborative agreements, such as the Manufactured Food Regulatory Program Standards; (5) increased data gathering and analytical capacity to support risk-based priority setting and resource allocation; and (6) phasing out animal production uses of medically important antimicrobial drugs and bringing the remaining legitimate animal health uses under veterinary supervision.

New User Fees

The FY 2015 Budget Request proposes new user fees for food imports, food facility registration and inspection, food contact substance notification, cosmetics, and international couriers.

  • Food safety. To support implementation of FSMA, FDA is proposing $169 million in new user fees for Food Import and $60 million for Food Facility Registration and Inspection. These fees will be used to improve FDA’s import process and modernize FDA’s food facility inspection system.
  • Food contact substance notification. FDA is proposing a new user fee of $5.1 million to ensure that the Food Contact Substance Notification (FCN) program operates more predictably by providing a stable, long-term source of funding to supplement budget authority appropriations. Section 409(h)(5) of the Food Drug & Cosmetic Act (FDC Act) (21 U.S.C. sec. 348(h)(5)) specifies that the FCN program can operate only if adequately funded. The proposed user fee will provide greater predictability of program funding and operations.
  • Cosmetics. FDA is proposing a new user fee of $19.5 million to support FDA cosmetic safety responsibilities. The FDC Act does not currently authorize FDA to collect user fees to support its Cosmetics Program. The proposed user fees will improve FDA’s capacity to promote greater safety and understanding of cosmetic products.
  • International courier. FDA is proposing a new International Courier user fee to support activities associated with increased surveillance of FDA-regulated commodities at express courier hubs. Approximately 20 percent, or $1.2 million, of this $4.6 million proposed fee will support imported food safety. The user fee will help FDA keep pace with the growing volume of imports that enter through international couriers and the increased cost of import operations.

Current User Fees

Included in the funding request for medical product and food safety, FDA is requesting a $75.4 million increase for current user fees. The current user fees support the review and surveillance of human and animal drugs, medical and mammography devices, food and feed, color additives, export certification, and tobacco products.

Medical Countermeasures

The FY 2015 Budget Request asks for $24.5 million to continue medical countermeasures (MCMs) across FDA programs. MCMs, such as drugs, vaccines, and diagnostic tests, are used to protect the United States from chemical, biological, radiological, nuclear, and emerging infectious disease threats. The FDA plays a critical role in ensuring that these MCMs are safe, effective, and secure. This 2015 request is $48,000 below the FY 2014 enacted level due to decreased rent costs.

Infrastructure, Rent, and White Oak Consolidation

Included in the funding request for medical product and food safety, and medical countermeasures, FDA is requesting an increase of $5.8 million for infrastructure. The FDA is also requesting a $20.6 million increase for General Services Administration rental payments and other rent and rent-related costs, such as guard services and security systems. Finally, the proposed budget calls for $47.1 million ($14.8 million below the FY 2014 enacted level) for the ongoing and expanded operational and logistical functions for 9,000 employees on the White Oak Campus.

March 21 2014

08:45

Just Intelligent or Something More?

In this blog, many posts discuss how new technologies can be exploited to implement new systems and services for our customers. Maybe we have gone too far in looking ahead; however, researchers and scholars in various fields are working to ensure that human-machine interaction can take place through increasingly intelligent systems. After the Industrial Revolution, we are now in the middle of a new one.

Systems Intelligence is a multidisciplinary matter

For many years, researchers from human factors, engineering, computer science, psychology, work sciences, design, and architecture have been treating this as a multidisciplinary challenge: discussing ideas and theories, and implementing prototypes of interactive systems that increasingly act and make decisions as humans do.

Systems can be smart or intelligent

The term intelligent immediately reminds us of the Artificial Intelligence concept, denoting computer systems that use inference methods and knowledge representation to arrive at conclusions. Intelligent systems solve problems in the same way human brains do, mainly using principles from probability theory and formal logic.

Systems, instead, are called smart if they are able to perform as wanted and if they know what we intend. They have to adhere to the Principle of Least Astonishment: “the program should always respond to the user in the way that astonishes him least.” One service that operates this way, for me, is Google Now: without my having requested or configured anything, it shows on my phone news about the public transportation situation on my route back home, at around the time I usually leave the office. This makes me say “wow!” In other words, smart systems are those with a flavor of magic.

Systems can be made wise by Connected Intelligence

Speaking of wisdom is still hard when related to information systems. Anyway, the term wisdom of crowds can be associated with the use of information services like social platforms: information constructed from diverse opinions, by independent and decentralized members, with proper methods of aggregation, does better than information constructed by any single individual.

The social platform can be considered a sort of collective mind or connected intelligence and even a source for human wisdom. It is not the system itself that is intelligent or wise, but the proper processing and linking of information provided by the members of a social group together with its selection, transformation, aggregation, and presentation.

Being social and taking advantage of social

The social web, as a set of tools and technologies that link people over the internet while posting news, pictures, videos, thoughts and comments, has also profoundly changed the way in which organizations, both public and private, engage with their stakeholders. Just look at how organizations are increasingly developing sophisticated social web strategies to capitalize on their ability to directly engage with a wider public.

On the other hand, the social web is a rich source of information from which companies can also understand the feelings and attitudes of people towards their brands and products through the semantic analysis of big data, that is, of the social connected intelligence.

I think we still have to exploit all the available tools and information to provide our customers with new services that really amaze them.

Yes, we are in the middle of a revolution, or maybe even still at the beginning. Do you agree?

March 20 2014

20:01

In Hoc Signo Vinces (part 10 of n): TPC-H Q9, Q17, Q20 - Predicate Games

TPC-H is a hash join game. The rules do allow indices, but maintaining these takes time, and indices will quickly result in non-local access patterns. Indices also take space. Besides, somebody must know what indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without them.

Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense since only green ones are being looked for.

So, let's consider Q9. The query is:

SELECT                 nation,
                       o_year,
        SUM(amount) AS sum_profit
 FROM  ( SELECT
                                                                          n_name AS nation,
                                               EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
           FROM
                 part,
                 supplier,
                 lineitem,
                 partsupp,
                 orders,
                 nation
          WHERE  s_suppkey = l_suppkey
            AND  ps_suppkey = l_suppkey
            AND  ps_partkey = l_partkey
            AND  p_partkey = l_partkey
            AND  o_orderkey = l_orderkey
            AND  s_nationkey = n_nationkey
            AND  p_name like '%green%'
       ) AS profit
GROUP BY  nation,
          o_year
ORDER BY  nation,
          o_year DESC
;

The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest running of the queries. The query is restricted by part only, and the condition selects 1/17 of all parts.

The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash is faster than using the index. Partsupp is the 3rd largest table in the database. This has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows where the part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
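
As a rough SQL illustration of the restricted build, the partsupp side of the hash join behaves as if it were the following derived table, already joined with the qualifying parts. This is a sketch of what the optimizer derives on its own from the conditions in Q9, not a rewrite one has to perform by hand:

-- Only partsupp rows whose part matches the part condition can contribute,
-- so the build side is in effect partsupp restricted via part.
SELECT  COUNT(*)
  FROM  lineitem l
  JOIN  ( SELECT  ps_suppkey, ps_partkey, ps_supplycost
            FROM  partsupp, part
           WHERE  ps_partkey = p_partkey
             AND  p_name LIKE '%green%'
        ) ps
    ON  l.l_partkey = ps.ps_partkey
   AND  l.l_suppkey = ps.ps_suppkey;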

Looking further into the plan, we note a scan of lineitem followed by a hash join with part. Restricting the build of the partsupp hash would have the same effect, hence part is here used twice while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with a 2 part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part first is faster since this becomes a single-key, invisible hash join, merged into the lineitem scan, done before even accessing the l_suppkey and other columns.

{ 
time   3.9e-06% fanout         1 input         1 rows
time   4.7e-05% fanout         1 input         1 rows
{ hash filler
time   3.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, nation)
 
time   8.8e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
 
}
time      0.16% fanout         1 input         1 rows
{ hash filler
time     0.011% fanout     1e+06 input         1 rows
SUPPLIER     1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
 
time      0.03% fanout         0 input     1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
 
}
time      0.57% fanout         1 input         1 rows
{ hash filler
Subquery 58 
{ 
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(t1.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time       1.1% fanout         4 input 1.17076e+06 rows
PARTSUPP       3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
 inlined  PS_PARTKEY = t1.P_PARTKEY
 
After code:
      0: t4.PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: t4.PS_PARTKEY :=  := artm t4.PS_PARTKEY
      8: t1.P_PARTKEY :=  := artm t1.P_PARTKEY
      12: t4.PS_SUPPLYCOST :=  := artm t4.PS_SUPPLYCOST
      16: BReturn 0
time      0.33% fanout         0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
 
}
}
time      0.18% fanout         1 input         1 rows
{ hash filler
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time     0.017% fanout         0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time   5.1e-06% fanout         1 input         1 rows
{ fork
time   4.1e-06% fanout         1 input         1 rows
{ fork
time        59% fanout 3.51125e+07 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
 
hash partition+bloom by 108 (tmp)hash join merged always card     0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card         1 -> (.S_NATIONKEY)
time      0.18% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
Hash source 101 merged into ts      0.058 rows(.L_PARTKEY) -> ()
time        17% fanout         1 input 3.51125e+07 rows
Hash source 82       0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> (  <none> , .PS_SUPPLYCOST)
time       6.2% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
      4: temp := artm temp - temp
      8: BReturn 0
ORDERS unq         1 rows (.O_ORDERDATE)
 inlined  O_ORDERKEY = k_.L_ORDERKEY
time    0.0055% fanout         1 input 3.51125e+07 rows
Hash source 49 merged into ts          1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time       3.5% fanout         1 input 3.51125e+07 rows
Hash source 35           1 rows(k_.S_NATIONKEY) -> (nation)
time       8.8% fanout         0 input 3.51125e+07 rows
 
Precode:
      0: o_year := Call year (.O_ORDERDATE)
      5: BReturn 0
Sort (nation, o_year) -> (temp)
 
}
time   4.7e-05% fanout       175 input         1 rows
group by read node  
(nation, o_year, sum_profit)
time   0.00028% fanout         0 input       175 rows
Sort (nation, o_year) -> (sum_profit)
 
}
time   2.2e-05% fanout       175 input         1 rows
Key from temp (nation, o_year, sum_profit)
 
time   1.6e-06% fanout         0 input       175 rows
Select (nation, o_year, sum_profit)
}


 6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq   99.6068% same seg  0.357328% same pg 

6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.

*           *           *           *           *

Let us next consider Q17.

SELECT
        SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
        lineitem,
        part
WHERE
        p_partkey = l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l_quantity 
           < (
                   SELECT
                            2e-1 * AVG(l_quantity)
                   FROM
                            lineitem
                   WHERE
                            l_partkey = p_partkey
                )

Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and this would take so long the race would be automatically lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives us a plan that does a scan of lineitem twice, doing a very selective hash join (few parts). There is a lookup for the average for each lineitem with the part. The average is calculated potentially several times.

The below plan is workable but better is possible: We notice that the very selective join need be done just once; it is cheaper to remember the result than to do it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as

SELECT
        ... 
  FROM
        lineitem, 
        part, 
        ( SELECT 
                                           l_partkey, 
                 0.2 * AVG (l_quantity) AS qty 
            FROM 
                 lineitem, 
                 part 
            ...
        ) f 
 WHERE
        l_partkey = f.l_partkey 
 ...

In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
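
Filling in the elided parts of the fragment above from the original Q17, one possible hand-written decorrelated form is sketched below. This is only an illustration of the shape the optimizer can work with; its internal rewrite need not be textually identical.

SELECT  SUM(l.l_extendedprice) / 7.0 AS avg_yearly
  FROM  lineitem l,
        part p,
        ( SELECT  l_partkey,
                  0.2 * AVG(l_quantity) AS qty
            FROM  lineitem, part
           WHERE  l_partkey = p_partkey
             AND  p_brand = 'Brand#23'
             AND  p_container = 'MED BOX'
           GROUP BY  l_partkey
        ) f
 WHERE  p.p_partkey = l.l_partkey
   AND  p.p_brand = 'Brand#23'
   AND  p.p_container = 'MED BOX'
   AND  l.l_partkey = f.l_partkey
   AND  l.l_quantity < f.qty;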

{ 
time   7.9e-06% fanout         1 input         1 rows
time    0.0031% fanout         1 input         1 rows
{ hash filler
time      0.27% fanout     20031 input         1 rows
PART     2e+04 rows(.P_PARTKEY)
 P_BRAND =  <c Brand#23> ,  P_CONTAINER =  <c MED BOX>
time   0.00047% fanout         0 input     20031 rows
Sort hf 34 (.P_PARTKEY)
}
time       0.1% fanout         1 input         1 rows
{ hash filler
Subquery 40 
{ 
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
 
hash partition+bloom by 38 (tmp)hash join merged always card     0.001 -> ()
time    0.0042% fanout         1 input    600982 rows
Hash source 34 merged into ts not partitionable     0.001 rows(t4.L_PARTKEY) -> ()
 
After code:
      0: t4.L_PARTKEY :=  := artm t4.L_PARTKEY
      4: t4.L_QUANTITY :=  := artm t4.L_QUANTITY
      8: BReturn 0
time     0.059% fanout         0 input    600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
 
}
}
time   6.8e-05% fanout         1 input         1 rows
{ fork
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
 
hash partition+bloom by 38 (tmp)hash join merged always card   0.00052 -> ()
time   0.00021% fanout         1 input    600982 rows
Hash source 34 merged into ts    0.00052 rows(.L_PARTKEY) -> ()
 
Precode:
      0: .P_PARTKEY :=  := artm .L_PARTKEY
      4: BReturn 0
END Node
After test:
      0: { 
time     0.038% fanout         1 input    600982 rows
time      0.17% fanout         1 input    600982 rows
{ fork
time       6.8% fanout         0 input    600982 rows
Hash source 62  not partitionable      0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
 
After code:
      0:  sum sum.L_QUANTITYset no set_ctr
      5:  sum count 1 set no set_ctr
      10: BReturn 0
}
 
After code:
      0: temp := artm sum / count
      4: temp := artm  0.2  * temp
      8: aggregate :=  := artm temp
      12: BReturn 0
time     0.042% fanout         0 input    600982 rows
Subquery Select(aggregate)
}
 
      8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
 
After code:
      0:  sum sum.L_EXTENDEDPRICE
      5: BReturn 0
}
 
After code:
      0: avg_yearly := artm sum /  7 
      4: BReturn 0
time   4.6e-06% fanout         0 input         1 rows
Select (avg_yearly)
}


 2695 msec 1996% cpu,         3 rnd 1.18242e+09 seq         0% same seg         0% same pg 

2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.

*           *           *           *           *

Q20 is not very long-running, but it is maybe the hardest to optimize of the lot. But as usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation
    WHERE
           s_suppkey IN 
             ( SELECT  
                       ps_suppkey
                 FROM  
                       partsupp
                WHERE  
                       ps_partkey IN 
                         ( SELECT  
                                   p_partkey
                             FROM  
                                   part
                            WHERE  
                                   p_name LIKE 'forest%'
                         )
                  AND  ps_availqty > 
                         ( SELECT  
                                   0.5 * SUM(l_quantity)
                             FROM  
                                   lineitem
                            WHERE  
                                   l_partkey = ps_partkey
                              AND  l_suppkey = ps_suppkey
                              AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                              AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                         )
             )
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name

This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:

SELECT 
        l_suppkey 
FROM
        lineitem 
WHERE
        l_partkey IN  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_name LIKE 'forest%'
       )
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem 
 WHERE  EXISTS  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_partkey = l_partkey 
               AND  p_name LIKE 'forest%')
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        ( SELECT  
                    DISTINCT p_partkey 
            FROM  
                    part 
           WHERE  
                    p_name LIKE 'forest%') f 
 WHERE  
        l_partkey = f.p_partkey
;

But since p_partkey is unique, the DISTINCT drops off, and we have:

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        part 
 WHERE  
        p_name LIKE 'forest%' 
   AND  l_partkey = p_partkey
;

You see, the innermost IN with the ps_partkey goes through all these changes, and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
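
Putting the pieces together, Q20 can be written by hand roughly as below, with the outermost IN kept as a DISTINCT derived table and the innermost IN turned into a plain join. Again, this is a sketch of the shape the optimizer works with, not a rewrite the user has to make:

SELECT TOP 100
        s_name,
        s_address
  FROM  supplier,
        nation,
        ( SELECT DISTINCT  ps_suppkey
            FROM  partsupp, part
           WHERE  ps_partkey = p_partkey
             AND  p_name LIKE 'forest%'
             AND  ps_availqty >
                    ( SELECT  0.5 * SUM(l_quantity)
                        FROM  lineitem
                       WHERE  l_partkey = ps_partkey
                         AND  l_suppkey = ps_suppkey
                         AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                         AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                    )
        ) f
 WHERE  s_suppkey = f.ps_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
 ORDER BY  s_name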

The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded, if each thread gets a specific subset of all the keys. The stage node is an exchange of tuples between several threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.

{ 
time   8.2e-06% fanout         1 input         1 rows
time   0.00017% fanout         1 input         1 rows
{ hash filler
time   6.1e-05% fanout         1 input         1 rows
NATION         1 rows(.N_NATIONKEY)
 N_NAME =  <c CANADA>
time   1.2e-05% fanout         0 input         1 rows
Sort hf 34 (.N_NATIONKEY)
}
time     0.073% fanout         1 input         1 rows
{ hash filler
time       4.1% fanout    240672 input         1 rows
PART   2.4e+05 rows(t74.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time     0.011% fanout         0 input    240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time      0.69% fanout         1 input         1 rows
{ hash filler
Subquery 56 
{ 
time        42% fanout 1.09657e+06 input         1 rows
LINEITEM   9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
 L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card     0.012 -> ()
time     0.022% fanout         1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable     0.012 rows(t76.L_PARTKEY) -> ()
 
After code:
      0: t76.L_PARTKEY :=  := artm t76.L_PARTKEY
      4: t76.L_SUPPKEY :=  := artm t76.L_SUPPKEY
      8: t76.L_QUANTITY :=  := artm t76.L_QUANTITY
      12: BReturn 0
time      0.22% fanout         0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
 
}
}
time   2.1e-05% fanout         1 input         1 rows
time   3.2e-05% fanout         1 input         1 rows
{ fork
time       5.3% fanout    240672 input         1 rows
PART   2.4e+05 rows(t6.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time       1.9% fanout         4 input    240672 rows
PARTSUPP       1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
 inlined  PS_PARTKEY = t6.P_PARTKEY
time        16% fanout  0.680447 input    962688 rows
END Node
After test:
      0: { 
time      0.08% fanout         1 input    962688 rows
time       9.4% fanout         1 input    962688 rows
{ fork
time       3.6% fanout         0 input    962688 rows
Hash source 80       0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
 
After code:
      0:  sum sumt8.L_QUANTITYset no set_ctr
      5: BReturn 0
}
 
After code:
      0: temp := artm  0.5  * sum
      4: aggregate :=  := artm temp
      8: BReturn 0
time      0.85% fanout         0 input    962688 rows
Subquery Select(aggregate)
}
 
      8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
time         1% fanout         1 input    655058 rows
Stage 2
time     0.071% fanout         1 input    655058 rows
Distinct (q_t4.PS_SUPPKEY)
 
After code:
      0: PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: BReturn 0
time     0.016% fanout         1 input    655058 rows
Subquery Select(PS_SUPPKEY)
time       3.2% fanout 0.0112845 input    655058 rows
SUPPLIER unq     0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
 inlined  S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card      0.04 -> ()
top k on S_NAME
time    0.0012% fanout         1 input      7392 rows
Hash source 34 merged into ts       0.04 rows(.S_NATIONKEY) -> ()
time     0.074% fanout         0 input      7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
 
}
time   0.00013% fanout       100 input         1 rows
top order by read (.S_NAME, .S_ADDRESS)
time     5e-06% fanout         0 input       100 rows
Select (.S_NAME, .S_ADDRESS)
}


 1777 msec 1355% cpu,    894483 rnd 6.39422e+08 seq   79.1214% same seg   19.3093% same pg 

1.8s is sufficient, and in the ballpark with VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are in the end considered. Further simplifications are possible. Another 20% of time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.

The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

March 18 2014

15:20

Introducing the CKAN Association

We are pleased to announce the CKAN “Association”. The Association will manage and oversee the CKAN project going forward, supporting the growth of CKAN, its community and stakeholders. The Association reflects more than a year of discussion and consultation with key stakeholders and the wider community.

Key aspects of the Association are:

  • A Steering Group and Advisory Group which oversee the project and represent stakeholders.
  • Specific teams to look after particular areas such as a “Technical Team” to oversee technical development and a “Content and Outreach team” to oversee materials (including project website) and drive community engagement
  • Membership to allow stakeholders to contribute to the longer-term sustainability of the project – more below

The Association has its formal institutional home at the Open Knowledge Foundation but is autonomous and has its own independent governance, in the form of the Steering Group, which is drawn from major CKAN stakeholders. The Open Knowledge Foundation, the original creators of CKAN, continue to contribute to CKAN at all levels, but the Association allows others – from government users to suppliers of CKAN services – to have a formal role in the development of the CKAN project going forward.

Membership

The CKAN Association will have members. Membership is a way for individuals, companies and organisations to support the CKAN Project and be recognised for doing so. By becoming a member you are helping to ensure the long-term sustainability of CKAN.

Member organizations are expected to contribute resources – either through contributing money or providing in-kind resources such as staff time. Members receive recognition for their contribution through display on the website, participation in events etc.

You can find more information about membership here »

Frequently Asked Questions

Why Create the Association?

Over the last few years CKAN has seen rapid growth in terms of technology, deployments, and the community. It is now the basis of dozens of major sites around the world, including national data portals in the UK, US, Canada, Brazil, Australia, Germany, Austria and Norway. There also has been substantial growth in the developer and vendor community deploying, customising and working with CKAN.

We believe that, as with many open-source projects when they achieve a certain size, the time has come to bring some more structure to the community of CKAN developers and users. By doing so we aim to provide a solid foundation for the future growth of the project, and to empower more explicitly its growing array of stakeholders.

Will the CKAN Association be a Separate Legal Entity?

No, at least not initially. The Association will retain its legal home at the Open Knowledge Foundation, operating as a self-governed and autonomous project. If a strong need for a separate legal entity arises, this is something the Steering Committee will consider in due course.

How will the CKAN Association relate to the Open Knowledge Foundation’s technical consulting around CKAN?

These two activities will be strictly separated. The CKAN Services team at the Open Knowledge Foundation will no doubt participate in the CKAN Association as a stakeholder, like other organizations and groups, but will have no special rights or privileges.

What “Assets” will the Association have responsibility for?

The Association will have responsibility for items such as:

  • The primary CKAN codebase
  • The CKAN project roadmap including overseeing and steering technical development of CKAN
  • Overseeing and driving user and community engagement
  • The ckan.org website and any related media assets
  • Managing any project finances and resources (e.g. from membership fees)

Will the CKAN Association have dedicated staff?

We imagine that the CKAN Association may appoint dedicated staff on an as-needed basis, where there are resources to do so (note also that Members may contribute in-kind resources in the form of staff time). However, at least initially, the CKAN Association will not have dedicated staff, but will have in-kind support time provided by the Open Knowledge Foundation and other key stakeholders.

14:20

Linked Geospatial Data 2014 Workshop, Part 4: GeoKnow, London, Brussels, The Message

Last Friday (2014-03-14) I gave a talk about GeoKnow at the EC Copernicus Big Data workshop. This was a trial run for more streamlined messaging. I have, aside from the practice of geekcraft, occupied myself with questions of communication these last weeks.

The clear take-home from London and Brussels alike is that these events have full days and 4 or more talks an hour. It is not quite TV commercial spots yet but it is going in this direction.

If you say something complex, little will get across unless the audience already knows what you will be saying.

I had a set of slides from Jens Lehmann, the GeoKnow project coordinator, for whom I was standing in. Now these are a fine rendition of the description of work. What is wrong with partners, work packages, objectives, etc? Nothing, except everybody has them.

I recall the old story about the journalist and the Zen master: The Zen master repeatedly advises the reporter to cut the story in half. We get the same from PR professionals, "If it is short, they have at least thought about what should go in there," said one recently, talking of pitches and messages. The other advice was to use pictures. And to have a personal dimension to it.

Enter "Ms. Globe" and "Mr. Cube". Frans Knibbe of Geodan gave the Linked Geospatial Data 2014 workshop's most memorable talk entitled "Linked Data and Geoinformatics - a love story" (pdf) about the excitement and the pitfalls of the burgeoning courtship of Ms. Globe (geoinformatics) and Mr. Cube (semantic technology). They get to talking, later Ms. Globe thinks to herself... "Desiloisazation, explicit semantics, integrated metadata..." Mr. Cube, young upstart now approaching a more experienced and sophisticated lady, dreams of finally making an entry into adult society, "critical mass, global scope, relevant applications..." There is a vibration in the air.

So, with Frans Knibbe's gracious permission I borrowed the storyline and some of the pictures.

We ought to make a series of cartoons about the couple. There will be twists and turns in the story to come.

Mr. Cube is not Ms. Globe's first lover, though; there is also rich and worldly Mr. Table. How will Mr. Cube prove himself? The eternal question... Well, not by moping around, not by wise-cracking about semantics, no. By boldly setting out upon a journey to fetch the Golden Fleece from beyond the crashing rocks. "Column store, vectored execution, scale out, data clustering, adaptive schema..." he affirms, with growing confidence.

This is where the story stands right now. Virtuoso runs circles around PostGIS doing aggregations and lookups on geometries in a map-scrolling scenario (GeoKnow's GeoBenchLab). Virtuoso SPARQL outperforms PostGIS SQL against planet-scale OpenStreetMap; Virtuoso SQL goes 5-10x faster still.

Mr. Cube is fast on the draw, but some corners can still be smoothed out.

Later in GeoKnow, there will be still more speed but also near parity between SQL and SPARQL via taking advantage of data regularity in guiding physical storage. If it is big, it is bound to have repeating structure.

The love story grows more real by the day. To be consummated still within GeoKnow.

Talking of databases has the great advantage that this has been a performance game from the start. There are few people who need convincing about the desirability of performance, as this also makes for lower cost and more flexibility on the application side.

But this is not all there is to it.

In Brussels, the audience was from E-science (Earth observation). In science, it is understood that qualitative aspects can be even more crucial. I told the story about an E-science-oriented workshop I attended in America years ago. The practitioners, from high energy physics to life sciences to climate, had invariably come across the need for self-description of data and for schema-last. This was essentially never addressed with RDF, except in some life science cases. Rather, we had one-off schemes, ranging from key-value pairs to putting the table name in a column of the same table to preserve the origin across data export.

Explicit semantics and integrated metadata are important, Ms. Globe knows, but she cannot sacrifice operational capacity for this. So it is more than a DBMS or even data model choice -- there must be a solid tool chain for data integration and visualization. GeoKnow provides many tools in this space.

Some of these, such as the LIMES entity matching framework (pdf) are probably close to the best there is. For other parts, the SQL-based products with hundreds of person years invested in user interaction are simply unbeatable.

In these cases, the world can continue to talk SQL. If the regular part of the data is in fact tables already, so much the better. You connect to Virtuoso via SQL, just like to PostGIS or Oracle Spatial, and talk SQL MM. The triples, in the sense of flexible annotation and integrated metadata, stay there; you just do not see them if you do not want them.
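
For example, a client that already speaks SQL/MM can run a spatial lookup such as the following against either engine. The table and column names here are made up for illustration, and the function names follow the ST_* convention as used by PostGIS; Virtuoso's spatial functions are similar, though details may differ.

SELECT  name
  FROM  osm_amenity
 WHERE  amenity = 'cafe'
   AND  ST_Intersects ( geom,
          ST_GeomFromText ('POLYGON((4.85 52.33, 4.95 52.33, 4.95 52.40, 4.85 52.40, 4.85 52.33))', 4326) );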

There are possibilities all right. In the coming months I will showcase some of the progress, starting with a detailed look at the OpenStreetMap experiments we have made in GeoKnow.

Linked Geospatial Data 2014 Workshop posts:

14:20

Linked Geospatial Data 2014 Workshop, Part 3: The Stellar Reach of OKFN

The Open Knowledge Foundation (OKFN) held a London Open Data Meetup in the evening of the first day of the Linked Geospatial Data 2014 workshop. The event was, as they themselves put it, at the amazing open concept office of OKFN at the Center for Creative Collaboration in Central London. What could sound cooler? True, OKFN threw a good party, with ever engaging and charismatic founder Rufus Pollock presiding. Phil Archer noted, only half in jest, that OKFN was so influential, visible, had the ear of government and public alike, etc., that it put W3C to shame.

Now, OKFN is a party in the LOD2 FP7 project, so I have over the years met people from there on and off. In LOD2, OKFN is praised to the skies for its visibility and influence and outreach and sometimes, in passing, critiqued for not publishing enough RDF, let alone five star linked data.

As it happens, CSV rules, and even the W3C will, it appears, undertake to standardize a CSV-to-RDF mapping. As far as I am concerned, as long as there is no alignment of identifiers or vocabulary, whether a thing is CSV or exactly equivalent RDF, there is no great difference, except that CSV is smaller and loads into Excel.

For OKFN, which has a mission of opening data, insisting on any particular format would just hinder the cause.

What do we learn from this? OKFN is praised not only for government relations but also for developer friendliness. Lobbying for open data is something I can understand, but how do you do developer relations? This is not like talking to customers, where the customer wants to do something and it is usually possible to give some kind of advice or recommendation on how they can use our technology for the purpose.

Are JSON and Mongo DB the key? A well renowned database guy once said that to be with the times, JSON is your data model, Hadoop your file system, Mongo DB your database, and JavaScript your language, and failing this, you are an old fart, a legacy suit, well, some uncool fossil.

The key is not limited to JSON. More generally, it is zero time to some result and no learning curve. Some people will sacrifice almost anything for this, such as the possibility of doing arbitrary joins. People will even write code, even lots of it, if it only happens to be in their framework of choice.

Phil again deplored the early fiasco of RDF messaging. "Triples are not so difficult. It is not true that RDF has a very steep learning curve." I would have to agree. The earlier gaffes of the RDF/XML syntax and the infamous semantic web layer cake diagram now lie buried and unlamented; let them be.

Generating user experience from data or schema is an old mirage that has never really worked out. The imagined gain from eliminating application writing has however continued to fascinate IT minds and attempts in this direction have never really ceased. The lesson of history seems to be that coding is not to be eliminated, but that it should have fast turnaround time and immediately visible results.

And since this is the age of data, databases should follow this lead. Schema-last is a good point; maybe adding JSON alongside XML as an object type in RDF might not be so bad. There are already XML functions, so why not the analog for JSON? Just don't mention XML to the JSON folks...

How does this relate to OKFN? Well, in the first instance this is the cultural impression I received from the meetup, but in a broader sense these factors are critical to realizing the full potential of OKFN's successes so far. OKFN is a data opening advocacy group; it is not a domain-specific think tank or special interest group. The data owners and their consultants will do analytics and even data integration if they see enough benefit in this, all in the established ways. However, the widespread opening of data does create possibilities that did not exist before. Actual benefits depend in great part on constant lowering of access barriers, and on a commitment by publishers to keep the data up to date, so that developers can build more than just a one-off mashup.

True, there are government users of open data, since there is a productivity gain in already having the neighboring department's data opened to a point; one no longer has to go through red tape to gain access to it.

For an application ecosystem to keep growing on the base of tens of thousands of very heterogeneous datasets coming into the open, continuing to lower barriers is key. This is a very different task from making faster and faster databases or of optimizing a particular business process, and it demands different thinking.

Linked Geospatial Data 2014 Workshop posts:

14:20

Linked Geospatial Data 2014 Workshop, Part 2: Is SPARQL Slow?

I had a conversation with Andy Seaborne of Epimorphics, initial founder of the Jena RDF Framework tool chain and editor of many W3C recommendations, among which the two SPARQLs. We exchanged some news; I told Andy about our progress in cutting the RDF-to-SQL performance penalty and doing more and better SQL tricks. Andy asked me if there were use cases doing analytics over RDF, not in the business intelligence sense, but in the sense of machine learning or discovery of structure. There is, in effect, such work, notably in data set summarization and description. A part of this has to do with learning the schema, like one would if wanting to put triples into tables when appropriate. CWI in LOD2 has worked in this direction, as has DERI (Giovanni Tummarello's team), in the context of giving hints to SPARQL query writers. I would also mention Chris Bizer et al., at University of Mannheim, with their data integration work, which is all about similarity detection in a schema-less world, e.g., the 150M HTML tables in the Common Crawl, briefly mentioned in the previous blog. Jens Lehmann from University of Leipzig has also done work in learning a schema from the data, this time in OWL.

Andy was later on a panel where Phil Archer asked him whether SPARQL was slow by nature or whether this was a matter of bad implementations. Andy answered approximately as follows: "If you allow for arbitrary ad hoc structure, you will always pay something for this. However, if you tell the engine what your data is like, it is no different from executing SQL." This is essentially the gist of our conversation. Most likely we will make this happen via adaptive schema for the regular part and exceptions as quads.

Later I talked with Phil about the "SPARQL is slow" meme. The fact is that Virtuoso SPARQL will outperform or match PostGIS SQL for geospatial lookups against the OpenStreetMap dataset. Virtuoso SQL will win by a factor of 5 to 10. Still, the "SPARQL is slow" meme is not entirely without a basis in fact. I would say that the really blatant cases that give SPARQL a bad name are query optimization problems. With 50 triple patterns in a query, there are 50-factorial possible join orders, hence countless ways of getting a bad plan. This is where the catastrophic failures of 100+ times worse than SQL come from. The regular penalty of doing triples vs. tables is somewhere between 2.5x (Star Schema Benchmark) and 10x (lookups with many literals), quite acceptable for many applications. Some really bad cases can occur with regular expressions on URI strings or literals, but then, if this is the core of the application, it should use a different data model or an n-gram index.
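To make "geospatial lookup" concrete, here is a minimal sketch of the kind of query involved. It is illustrative only, not the actual benchmark query, and the table and column names (osm_point, amenity, geom) are assumptions:

-- Hypothetical PostGIS-style lookup: pubs within 1 km of a point in London.
-- Table and column names are illustrative, not the benchmark schema.
SELECT  name,
        ST_AsText (geom)
  FROM  osm_point
 WHERE  amenity = 'pub'
   AND  ST_DWithin ( geom::geography,
                     ST_SetSRID ( ST_MakePoint (-0.1276, 51.5072), 4326 )::geography,
                     1000 );

A GeoSPARQL version expresses the same thing with a spatial filter function over a WKT literal; the comparison above is between that and this SQL form.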

The solutions, including more dependable query plan choice, will flow from adaptive schema which essentially reduces RDF back into relational, however without forcing schema first and with accommodation for exceptions in the data.

Phil noted here that there already exist many (so far, proprietary) ways of describing the shape of a graph. He said there would be a W3C activity aimed at converging these. If so, a vocabulary that can express relationships, the types of related entities, their cardinalities, etc., comes close to a SQL schema and its statistics. Such a thing can be the output of data analysis, or the input to a query optimizer or storage engine, for using a schema where one in fact exists. With such a description in hand, there is no reason why things would be less predictable than with SQL. The idea of a re-convergence of data models is definitely in the air; this is in no sense limited to us.
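As a rough illustration of how close such a shape description comes to a SQL schema plus statistics, here is what the corresponding relational declaration would look like; the names and numbers are made up:

-- A relational schema already states types, relationships and cardinalities.
CREATE TABLE company (
  company_id  INTEGER PRIMARY KEY,
  name        VARCHAR (200) NOT NULL
);

CREATE TABLE person (
  person_id   INTEGER PRIMARY KEY,
  name        VARCHAR (100) NOT NULL,         -- exactly one name per person
  employer_id INTEGER REFERENCES company      -- at most one employer
);

-- The statistics an optimizer keeps add the counts, e.g.:
--   person: 1,000,000 rows; employer_id: 50,000 distinct values, 3% NULL.
-- A graph shape vocabulary would state the same facts about the classes and
-- properties of the graph, without requiring the data to be stored this way.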

Linked Geospatial Data 2014 Workshop posts:

14:20

Linked Geospatial Data 2014 Workshop, Part 1: Web Services or SPARQL Modeling?

The W3C (World Wide Web Consortium) and OGC (Open Geospatial Consortium) organized the Linked Geospatial Data 2014 workshop in London this week. The GeoKnow project was represented by Claus Stadler of Universität Leipzig, and Hugh Williams and myself (Orri Erling) from OpenLink Software. The Open Knowledge Foundation (OKFN) also held an Open Data Meetup in the evening of the first day of the workshop.

Reporting on each talk and the many highly diverse topics addressed is beyond the scope of this article; for that, you can go to the program and the slides, which will be put online. Instead, I will talk about questions that seemed to me to be in the air, and about some conversations I had with the relevant people.

The trend at events like this is toward shorter and shorter talks and more and more interaction. In this workshop, talks were given in groups of three, with all questions at the end and all the presenters on stage. This is not a bad idea, since we get a panel-like effect where several presenters can address the same question. If the subject matter allows, a panel is my preferred format.

Web services or SPARQL? Is GeoSPARQL good? Is it about Linked Data or about ontologies?

Geospatial data tends to be exposed via web services, e.g., WFS (Web Feature Service). This allows item retrieval on a lookup basis and some predefined filtering, transformation, and content negotiation. Capabilities vary; OGC now has WFS 2.0, and there are open source implementations that do a fair job of providing the functionality.

Of course, a real query language is much more expressive, but a service API is more scalable, as people say. What they mean is that an API is more predictable. For pretty much any complex data task, a query language is near-infinitely more efficient than going back and forth, often over a wide area network, via an API. So, as Andreas Harth put it: as a data publisher, make an API; an open SPARQL endpoint is too "brave" [Andreas' word, in the sense of foolhardy]. When you analyze, he continued, you load the data into an endpoint of your own. Any quality-of-service terms must be formulated with respect to a fixed workload; this is not meaningful with ad hoc queries in an expressive language. Things like anytime semantics (return whatever is found within a time limit) are only good for a first interactive look, not for applications.

Should the application go to the data, or the reverse? Some data is big, and moving it is not trivial. A culture of datasets being hosted in the cloud may be forming. Some linked data, like DBpedia, has of course long been available as Amazon machine images. Recently, SindiceTech has made a similar packaging of Freebase. The data of interest here is larger, and its target audience is more specific, on the e-science side.

How should geometries be modeled? I greeted GeoSPARQL, and the SQL MM standard on which it is based, with a sense of relief, as these are reasonable things that can be efficiently implemented. There are proposals where points have URIs, linestrings are ordered sets of points, and collections are actual trees with RDF subjects as nodes. As a standard, such a thing is beyond horrible, as it hits all the RDF penalties and overheads full force, and promises easily 10x worse space consumption and 100x worse run times compared to the sweetly reasonable GeoSPARQL. One presenter said that cases of actually hanging attributes off the points of complex geometries had been heard of but were, in his words, anecdotal. He posed a question to the audience about use cases where points in fact needed separately addressable identities. Several cases did emerge, involving, for example, different measurement certainties for different points on a trajectory trace obtained by radar. Applications that need data of this sort will perforce be very domain specific. OpenStreetMap (OSM) itself is a bit like this, but there the points that have individual identity also have predominantly non-geometry attributes and stand for actually-distinct entities. OSM being a practical project, these are then again collapsed into linestrings for cases where this is more efficient. The OGC data types themselves have up to 4 dimensions, of which the 4th could be used as an identifier of a point in the event this really were needed. If so, this would likely be empty for most points and would compress away if the data representation were done right.
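The relational analogy makes the cost concrete. Below is a minimal sketch, with hypothetical table names, of the two styles: a geometry stored as a single value (the GeoSPARQL / SQL MM way) versus a geometry exploded into individually addressable points:

-- Style 1: the whole linestring is one value, as in SQL MM / GeoSPARQL WKT.
CREATE TABLE road (
  road_id  INTEGER PRIMARY KEY,
  name     VARCHAR (200),
  geom     GEOMETRY                    -- e.g., LINESTRING (x1 y1, x2 y2, ...)
);

-- Style 2: every point is a first-class object with its own identity.
CREATE TABLE road_point (
  road_id  INTEGER,
  seq_no   INTEGER,                    -- position of the point in the line
  x        DOUBLE PRECISION,
  y        DOUBLE PRECISION,
  PRIMARY KEY (road_id, seq_no)
);

Reassembling a geometry in the second style takes a join and a sort per feature; done with one triple per point and per coordinate, this is roughly where the 10x space and 100x time estimates above come from.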

For data publishing, Andreas proposed giving OGC geometries URIs; i.e., the borders of a country can be more or less precisely modeled, and the large polygon may have different versions and provenances. This is reasonable enough, as long as the geometries are big. For applications, one will then collapse the 1:n between an entity and its geometry into a 1:1. In the end, when you make an application, even an RDF one, you do not just throw all the data in a bucket and write queries against that. Some alignment and transformation is generally involved.

Linked Geospatial Data 2014 Workshop posts:

March 17 2014

16:41

In Hoc Signo Vinces (part 9 of n): TPC-H Q18, Ordered Aggregation, and Top K

Here we return to polishing the cutting edge, the high geekcraft of databases. We will look at more of the wonders of TPC-H and cover two more tricks. The experts can skip the preliminaries and go straight to the query profiles; for the others, there is some explanation first.

From the TPC-H specification:

    SELECT  TOP 100
                     c_name,
                     c_custkey,
                     o_orderkey,
                     o_orderdate,
                     o_totalprice,
                     SUM ( l_quantity )
     FROM  customer,
           orders,
           lineitem
    WHERE  o_orderkey 
             IN 
               (
                  SELECT  l_orderkey
                    FROM  lineitem
                GROUP BY  l_orderkey 
                            HAVING
                              SUM ( l_quantity ) > 312
               )
      AND  c_custkey = o_custkey
      AND  o_orderkey = l_orderkey
 GROUP BY  c_name,
           c_custkey,
           o_orderkey,
           o_orderdate,
           o_totalprice
 ORDER BY  o_totalprice DESC, 
           o_orderdate 

The intent of the query is to return order and customer information for cases where an order involves a large quantity of items, with highest-value orders first.

We note that the only restriction in the query is the one on the SUM of l_quantity in the IN subquery. Everything else is a full scan or a JOIN on a foreign key.

Now, the first query optimization rule of thumb could be summarized as "start from the small." Small here means something that is restricted; it does not mean a small table. The smallest input is the one from which the highest percentage of rows is dropped by a condition that does not depend on other tables.

The next rule of thumb is to try starting from the large, if the large has a restricting join; for example, scan all the lineitems and hash join to parts that are green and of a given brand. In this case, the idea is to make a hash table from the small side and sequentially scan the large side, dropping everything that does not match something in the hash table.
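A sketch of this second pattern against the TPC-H schema (the brand value below is an arbitrary illustration):

-- Build a hash table from the restricted small side (part), then scan the
-- large side (lineitem) once, dropping rows that find no match in the hash.
SELECT  SUM ( l_extendedprice * (1 - l_discount) )
  FROM  lineitem,
        part
 WHERE  p_partkey = l_partkey
   AND  p_name LIKE '%green%'          -- "green" parts
   AND  p_brand = 'Brand#23';          -- of a given brand (value illustrative)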

The only restriction here is on orders via a join on lineitem. So, the IN subquery can be flattened, so as to read like --

SELECT ... 
  FROM  (   SELECT  l_orderkey, 
                    SUM ( l_quantity ) 
              FROM  lineitem 
          GROUP BY  l_orderkey 
                      HAVING
                        SUM ( l_quantity ) > 312
        ) f, 
          orders, 
          customer, 
          lineitem 
 WHERE  f.l_orderkey = o_orderkey ....

The above (left to right) is the best JOIN order for this type of plan. We start from the restriction, and for all the rest the JOIN is foreign key to primary key, sometimes n:1 (orders to customer), sometimes 1:n (orders to lineitem). A 1:n is usually best by index; an n:1 can be better by hash if there are enough tuples on the n side to make it worthwhile to build the hash table.
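Spelled out in full, the flattened query could read as below. This is a sketch for readability, not the exact internal form; the inner SUM need not be projected out of the derived table, since the outer query recomputes it over the joined lineitem rows.

SELECT  TOP 100
        c_name,
        c_custkey,
        o_orderkey,
        o_orderdate,
        o_totalprice,
        SUM ( l_quantity )
  FROM  (   SELECT  l_orderkey
              FROM  lineitem
          GROUP BY  l_orderkey
                      HAVING
                        SUM ( l_quantity ) > 312
        ) f,
        orders,
        customer,
        lineitem
 WHERE  f.l_orderkey = o_orderkey
   AND  c_custkey = o_custkey
   AND  o_orderkey = l_orderkey
 GROUP BY  c_name,
           c_custkey,
           o_orderkey,
           o_orderdate,
           o_totalprice
 ORDER BY  o_totalprice DESC,
           o_orderdate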

We note that the first GROUP BY makes a very large number of groups, e.g., 150M at 100 GB scale. We also note that if lineitem is ordered so that the lineitems of a single order are together, the GROUP BY is ordered. In other words, once you have seen a specific value of l_orderkey change to the next, you will not see the old value again. In this way, the groups do not have to be remembered for all time. The GROUP BY produces a stream of results as the scan of lineitem proceeds.

Considering vectored execution, the GROUP BY does remember a bunch of groups, up to a vector's worth, so that output from the GROUP BY is produced in large enough batches, not a tuple at a time.

Considering parallelization, the scan of lineitem must be split in such a way that all lineitems with the same l_orderkey get processed by the same thread. If this is the case, all threads will produce an independent stream of results that is guaranteed to need no merge with the output of another thread.

So, we can try this:

{ 
time     6e-06% fanout         1 input         1 rows
time       4.5% fanout         1 input         1 rows
{ hash filler

-- Make a hash table from c_custkey to c_name

time      0.99% fanout   1.5e+07 input         1 rows
CUSTOMER   1.5e+07 rows(.C_CUSTKEY, .C_NAME)
 
time      0.81% fanout         0 input   1.5e+07 rows
Sort hf 35 (.C_CUSTKEY) -> (.C_NAME)
 
}
time   2.2e-05% fanout         1 input         1 rows
time   1.6e-05% fanout         1 input         1 rows
{ fork
time   5.2e-06% fanout         1 input         1 rows
{ fork

-- Scan lineitem

time        10% fanout 6.00038e+08 input         1 rows
LINEITEM     6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)

-- Ordered GROUP BY (streaming with duplicates)

time        73% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)

-- The ordered aggregation above emits a batch of results every so often, having accumulated 20K or so groups (DISTINCT l_orderkey's)

-- The operator below reads the batch and sends it onward, emptying the GROUP BY hash table for the next batch.

time        10% fanout   21231.4 input      7065 rows
group by read node 
(t5.L_ORDERKEY, aggregate)
END Node
After test:
      0: if (aggregate >  312 ) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0
 
After code:
      0: L_ORDERKEY :=  := artm t5.L_ORDERKEY
      4: BReturn 0

-- This marks the end of the flattened IN subquery. 1063 out of 150M groups survive the test on the SUM of l_quantity.

-- The main difficulty of Q18 is guessing that this condition is this selective.

time    0.0013% fanout         1 input      1063 rows
Subquery Select(L_ORDERKEY)
time     0.058% fanout         1 input      1063 rows
ORDERS unq      0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
 inlined  O_ORDERKEY = L_ORDERKEY
hash partition+bloom by 42 (tmp)hash join merged always card      0.99 -> (.C_NAME)
time    0.0029% fanout         1 input      1063 rows
Hash source 35 merged into ts       0.99 rows(.O_CUSTKEY) -> (.C_NAME)
 
After code:
      0: .C_CUSTKEY :=  := artm .O_CUSTKEY
      4: BReturn 0
time     0.018% fanout         7 input      1063 rows
LINEITEM       4.3 rows(.L_QUANTITY)
 inlined  L_ORDERKEY = .O_ORDERKEY
time     0.011% fanout         0 input      7441 rows
Sort (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
 
}
time   0.00026% fanout      1063 input         1 rows
group by read node  
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time   0.00061% fanout         0 input      1063 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
 
}
time   1.7e-05% fanout       100 input         1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time   1.2e-06% fanout         0 input       100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}


 6351 msec 1470% cpu,      2151 rnd 6.14898e+08 seq  0.185874% same seg   1.57993% same pg 

What is wrong with this? The result is not bad, in the ballpark of published VectorWise results (4.9s on a slightly faster box), but better is possible. We note that there is a hash join from orders to customer. Only about 1K customers out of 15M get hit. The whole hash table of 15M entries is built in vain. Let's cheat and declare the join to be by index. Cheats like this are not allowed in an official run, but here we are just looking. So we change the mention of the customer table in the FROM clause from FROM ... customer, ... to FROM ... customer TABLE OPTION (loop), ...
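Concretely, the FROM clause of the flattened query becomes roughly:

  FROM  (   SELECT  l_orderkey
              FROM  lineitem
          GROUP BY  l_orderkey
                      HAVING
                        SUM ( l_quantity ) > 312
        ) f,
        orders,
        customer TABLE OPTION (loop),   -- force an index-based (loop) join
        lineitem

With that change, the profile looks like this: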

{ 
time   1.4e-06% fanout         1 input         1 rows
time     9e-07% fanout         1 input         1 rows

-- Here was the hash build in the previous plan; now we start direct with the scan of lineitem.

time   2.2e-06% fanout         1 input         1 rows
{ fork
time   2.3e-06% fanout         1 input         1 rows
{ fork
time        11% fanout 6.00038e+08 input         1 rows
LINEITEM     6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
 
time        78% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
 
time        11% fanout   21231.4 input      7065 rows
group by read node  
(t5.L_ORDERKEY, aggregate)
END Node
After test:
      0: if (aggregate >  312 ) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0
 
After code:
      0: L_ORDERKEY :=  := artm t5.L_ORDERKEY
      4: BReturn 0
time    0.0014% fanout         1 input      1063 rows
Subquery Select(L_ORDERKEY)
time     0.051% fanout         1 input      1063 rows
ORDERS unq      0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
 inlined  O_ORDERKEY = L_ORDERKEY

-- We note that getting the 1063 customers by index takes no time, and there is no hash table to build

time     0.023% fanout         1 input      1063 rows
CUSTOMER unq      0.99 rows (.C_CUSTKEY, .C_NAME)
 inlined  C_CUSTKEY = .O_CUSTKEY
time     0.021% fanout         7 input      1063 rows
LINEITEM       4.3 rows(.L_QUANTITY)
 inlined  L_ORDERKEY = k_.O_ORDERKEY

-- The rest is identical to the previous plan, cut for brevity

 3852 msec 2311% cpu,      3213 rnd 5.99907e+08 seq  0.124456% same seg   1.08899% same pg 
Compilation: 1 msec 0 reads         0% read 0 messages         0% clw

We save over 2s of real time. But the problem is how to know in advance that very few customers will be hit. One could reason that l_quantity is between 1 and 50, and that an order has an average of 4 lineitems with a maximum of 7. The maximum possible SUM is therefore 7 × 50 = 350, so for the SUM to exceed 312, only orders with 7 lineitems are eligible, and even then the quantities must average above 312 / 7 ≈ 44.6. Assuming flat distributions, which here happens to be the case, one could estimate that the condition selects very few orders. The problem is that real data with this kind of regularity is practically never seen, so such a trick, while allowed, would only work for benchmarks.

*           *           *           *           *

As it happens, there is a better way. We also note that the query selects the TOP 100 orders with the highest o_totalprice. This is a very common pattern; there is almost always a TOP k clause in analytics queries unless they GROUP BY something that is known to be of low cardinality, like nation or year.

If the ordering falls on a grouping column, as soon as there are enough groups generated to fill a TOP 100, one can take the lowest o_totalprice as a limit and add this into the query as an extra restriction. Every time the TOP 100 changes, the condition becomes more selective, as the 100th highest o_totalprice increases.

Sometimes the ordering falls on the aggregation result, which is not known until the aggregation is finished; then the restriction cannot be pushed down in this way. However, in lookup-style queries it is common to take the latest so-many events, or the TOP k items by some metric that is stored in the data. In these cases, pushing the TOP k restriction down into the selection always works.
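Conceptually, once a TOP 100 exists, the engine behaves as if the query text had an extra, ever-tightening condition on the ordering column. The sketch below shows the idea; the cutoff value is purely illustrative, and in reality it is maintained by the TOP k operator and rises as better rows are found:

-- Conceptual rewrite only: the engine, not the SQL text, maintains the cutoff.
SELECT  TOP 100
        c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
        SUM ( l_quantity )
  FROM  customer, orders, lineitem
 WHERE  o_totalprice > 400000.00        -- 100th-highest o_totalprice seen so far
   AND  o_orderkey IN
          (   SELECT  l_orderkey
                FROM  lineitem
            GROUP BY  l_orderkey
                        HAVING SUM ( l_quantity ) > 312
          )
   AND  c_custkey = o_custkey
   AND  o_orderkey = l_orderkey
 GROUP BY  c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
 ORDER BY  o_totalprice DESC, o_orderdate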

So, we try this:

{ 
time     4e-06% fanout         1 input         1 rows
time   6.1e-06% fanout         1 input         1 rows
{ fork

-- The plan begins with orders, as we now expect a selection on o_totalprice

-- We see that out of 150M orders, a little over 10M survive the o_totalprice selection, which gets more restrictive as the query proceeds.

time        33% fanout 1.00628e+07 input         1 rows
ORDERS   4.3e+04 rows(.O_TOTALPRICE, .O_ORDERKEY, .O_CUSTKEY, .O_ORDERDATE)
 
top k on O_TOTALPRICE
time        32% fanout 3.50797e-05 input 1.00628e+07 rows
END Node
After test:
      0: if ({ 

-- The IN subquery is here kept as a subquery, not flattened.

time      0.42% fanout         1 input 1.00628e+07 rows
time        11% fanout   4.00136 input 1.00628e+07 rows
LINEITEM         4 rows(.L_ORDERKEY, .L_QUANTITY)
 inlined  L_ORDERKEY = k_.O_ORDERKEY
time        21% fanout 2.55806e-05 input 4.02649e+07 rows
Sort streaming with duplicates (set_ctr, .L_ORDERKEY) -> (.L_QUANTITY)
 
time       2.4% fanout   9769.72 input      1030 rows
group by read node  
(gb_set_no, .L_ORDERKEY, aggregate)
END Node
After test:
      0: if (aggregate >  312 ) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0
time   0.00047% fanout         0 input       353 rows
Subquery Select(  )
}
) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0

-- Here we see that fewer customers are accessed than in the non-TOP k plans, since there is an extra cut on o_totalprice that takes effect earlier

time     0.013% fanout         1 input       353 rows
CUSTOMER unq         1 rows (.C_CUSTKEY, .C_NAME)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
time    0.0079% fanout         7 input       353 rows
LINEITEM         4 rows(.L_QUANTITY)
 inlined  L_ORDERKEY = k_.O_ORDERKEY
time    0.0063% fanout 0.0477539 input      2471 rows
Sort streaming with duplicates (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
 
time    0.0088% fanout   2.99153 input       118 rows
group by read node  
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time    0.0063% fanout         0 input       353 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
 
}
time   8.5e-05% fanout       100 input         1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time   2.7e-06% fanout         0 input       100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}


 949 msec 2179% cpu, 1.00486e+07 rnd 4.71013e+07 seq   99.9267% same seg 0.0318055% same pg 

Here we see that the time is about 4x better than with the cheat version. We note that about 10M of 1.5e8 orders get considered. After going through the first 10% or so of orders, there is a TOP 100, and a condition on o_totalprice that will drop most orders can be introduced.

If we set the condition on the SUM of quantity so that no orders match, there is no TOP k at any point, and we get a time of 6.8s, which is a little worse than the initial time with the flattened IN. But since the TOP k trick does not allocate memory, it is relatively safe even in cases where it does not help.

We can argue that the TOP k pushdown trick is more robust than guessing the selectivity of a SUM of l_quantity. Further, it applies to a broad range of lookup queries, while the SUM trick applies to only TPC-H Q18, or close enough. Thus, the TOP k trick is safer and more generic.

We are approaching the end of the TPC-H blog series, with still two families of tricks to consider, namely, moving predicates between subqueries, and late projection. After this we will look at results and the overall picture.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

09:10

When building a house…

Maybe you read the famous French Asterix comic book series in your childhood. I especially like Edifis (called Numérobis in other languages), an Egyptian architect in the “Asterix and Cleopatra” volume. He is the best architect Egypt can provide, but his buildings are really horrible. They seem to collapse every minute, and obviously no proper foundation or even a sufficient plan was used during construction. Sometimes I think that many people in our industry still work like Numérobis.

Numérobis’ buildings use a bottom-up approach. Construction starts with whatever is available, and one layer of bricks is then put on top of the other. Although isolated parts may look nice, the overall result in the end is a disaster. So how should we build a temple instead?

I strongly believe that when we think about something new, we should not start with a temple as a building, but as an idea – as something that serves a specific purpose. When we know that purpose, then we can start thinking about how best to achieve this purpose. Which might – or might not – lead to building a temple.

Another important aspect is that we need to accept basic conditions and work with them accordingly: stable ones like gravity, changing ones like fashion trends, and completely new ones like the use of marble instead of sandstone (and the consequences that come with it!).

Our main challenge when using “marble” from now on for our information products is to accept that the business we are in is currently in a phase of disruption, and that the ways we used to build our temples simply do not work anymore.

The consequences are manifold, ranging from completely new ways of accessing information, to a different information strategy and architecture in the background, to serving new business models in new application settings. But the “glue” that brings all these aspects together is proactively making collaboration a cornerstone of all our efforts. By that I mean collaboration across departments, business units, countries, divisions and even industries. This could even mean “sacrificing” existing assets within the company when others are better on a topic and are willing to cooperate. There should be no boundaries in our thinking when looking for the most promising options. I will be looking for new collaboration opportunities this Wednesday in Athens at the European Data Forum.

Numérobis finally succeeded in creating an impressive building, even within the given timeframe, but only with the help of his friends from the other side of the world!

March 14 2014

08:30

Why I Remain Passionate about Innovation Tournaments

Since posting about innovation tournaments last year, I’ve had the privilege of leading and participating in several tournaments. In fact, here at Wolters Kluwer we are holding monthly innovation tournaments in multiple locations around the world organized around a common monthly theme as part of our GPO (Wolters Kluwer’s Global Platform Organization) Presents webinar series. January’s theme was the topic of innovation itself. February’s theme was wearable computing devices and emerging technologies. March’s theme will be UX (User Experience). So far we are holding monthly tournaments in our offices in New York, Alphen aan den Rijn (Netherlands) and Chicago. After March, more tournaments will be held corresponding to monthly themes as diverse as search, analytics, and hybrid content-software tools. And here is why I remain passionate about innovation tournaments as a tool to stimulate innovation.

Individuals from multiple Wolters Kluwer business units (health, tax & accounting, legal and regulatory) congregate in a single room to share expertise. Many people who otherwise do not regularly interact discover that the need to integrate content in a contextually relevant way runs across all market segments.

The tournament participants are multidisciplinary. In a single room for a tournament, you will find product managers, software architects, project managers, business analysts and many other roles interacting to unlock value for Wolters Kluwer’s professional customers.

We use an innovation platform to store ideas. Each participant is invited to create a description of his or her idea accompanied by any supporting presentations and photographs. A local moderator follows up with participants to clear up ambiguities. A global moderator follows up by selecting the winners of the three monthly local tournaments and creating a separate innovation campaign for a vote. The ultimate winner moves to the proof-of-concept stage.

I’m passionate about leading and participating in this process because it exposes valuable insights and thinking from diverse skill sets and knowledge sets across all of Wolters Kluwer. The tournaments generate energy; they are fun; and, most important, they are immensely valuable in unlocking innovation at Wolters Kluwer.

I look forward to posting more at the end of 2014.

Do you have any experiences with innovation tournaments to share?

00:37

Five AKSW Papers at ESWC 2014

Hello World!
We are very pleased to announce that five of our papers were accepted for presentation at ESWC 2014. These papers range from natural-language processing to the acquisition of temporal data. In more detail, we will present the following papers:
  • NLP Data Cleansing Based on Linguistic Ontology Constraints (Dimitris Kontokostas, Martin Brummer, Sebastian Hellmann, Jens Lehmann and Lazaros Ioannidis)
  • Unsupervised Link Discovery Through Knowledge Base Repair (Axel-Cyrille Ngonga Ngomo, Mohamed Sherif and Klaus Lyko)
  • conTEXT – Lightweight Text Analytics using Linked Data (Ali Khalili, Sören Auer and Axel-Cyrille Ngonga Ngomo)
  • HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation (Muhammad Saleem and Axel-Cyrille Ngonga Ngomo)
  • Hybrid Acquisition of Temporal Scopes for RDF Data (Anisa Rula, Matteo Palmonari, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, Jens Lehmann and Lorenz Bühmann)
Come over to ESWC and enjoy the talks. More information on these publications at http://aksw.org/Publications.
Cheers,
Axel on behalf of AKSW

March 13 2014

19:16

Improvements to the CKAN Roadmap Process

We have switched to a Github issue tracker for tracking and managing the ideas in the CKAN Roadmap (see also this pretty version).

As detailed previously we have a public process for managing the roadmap for CKAN.

This was being managed in Trello but is moving to a Github issue tracker.

We have also updated the Roadmap page to give more information about the Roadmap and how it works:

The Roadmap provides stakeholders in CKAN, including users, developers and vendors, with the ability to shape and understand the future technical path for CKAN. Specifically, the Roadmap provides for:

  • Suggesting new ideas and features for CKAN
  • Prioritizing these into a schedule for future work especially on “core” CKAN

We emphasize that ideas don’t just have to be about improvements to the core CKAN software. For example, the idea of creating a new phone app client for CKAN would be a perfect thing to submit.

Lastly, we should emphasize that, of course, just because an item is in the ideas tracker does not mean it will get worked on. If you want a certain feature implemented, the best way to ensure that happens is to sponsor its development; get in touch for more information.

Check out the Roadmap

New ideas and the current roadmap itself are managed via an issue tracker on GitHub.

Idea issue tracker, along with instructions on how to make new suggestions »

Prettier column-based view of the ideas and roadmap »

10:43

Startup crawl 2014

The 2nd Startup Crawl is taking place next Friday, on the 21st of March, in Ljubljana. Zemanta proudly participates in it and is thus opening three time slots, each for up to 30 pre-registered attendees. We strongly encourage you to register ASAP, over here!

So… what’s a Startup Crawl?

  • same as a pub crawl, except you don’t have to pay for the booze,
  • you finally get to meet all those folks you wanted to and see if they really have a ping-pong table (and show ‘em who’s the boss!),
  • ask them awkward questions, receive discounts & freebies (and eat all their cookies),
  • spend a day out of office getting lost navigating through the sunny streets of Ljubljana (take your date!),
  • a really, really good opportunity to talk about ideas and problems along the way, and to gain contacts for the future. It’s an ideal ice-breaker!


I remember when the first Startup Crawl took place a couple of months ago. It was a sunny autumn day in Ljubljana, and most of my coworkers had left the office early, leaving me all alone at my desk with loads of work. And it was a Friday! What was one to do?

Well, to be honest, it wasn’t such a hard decision after all. The work wasn’t urgent, and I had always wondered what that office called D-Labs upstairs was all about. 20 seconds later I was welcomed with a drink (!), and although the office was pretty much empty (most of them were also crawling around the city), I got to learn what their latest projects were, where their hidden stash of cookies is (the secretary’s got them locked!) and what their WiFi password is (•••••••••). All equally interesting facts!

Next up were Flaviar and DietPoint in Trnovo! Or should I say an afternoon aperitivo and free pizza? Lumu was third on my list, since their place was really close (it’s Ljubljana, after all) and I had heard they were supposedly notorious for their ping-pong skills, a rumor I successfully proved false. But they are really nice guys: give them your tobacco and they’ll open their (business-oriented) hearts.

Work is always much easier if you stir up your daily routine, go out for a crawl, talk with strangers and just relax a bit. When I finally got home, that workload from earlier was done before I could pour myself another drink, and I’m sure this coming Friday is going to be the same. So you really should register now, mark your calendar as AVAILABLE that afternoon and see what’s actually going on around Ljubljana. You won’t regret it; after all, we are already stacking up our supplies (interpret at will).

Again, slots are only available to those who register, so don’t wait until it’s too late! Zemanta has opened up three different time slots (approximately 14:00-20:00) for those who do-not-want / partially-want / yes-do-want to escape from work early! Below are our registration form and the list of all the open doors you can enter next Friday. Be there or be square!

Eventbrite Registration form

List of open Startup doors

 

08:31

AKSW Colloquium “Current semantic web initiatives in the Netherlands” on Friday, March 14, Room P901

Current semantic web initiatives in the Netherlands: Heritage & Location, PiLOD 2.0

On Friday, March 14, at 10:00 a.m. in room P901, visiting researchers Tine van Nierop and Rein van ‘t Veer from E&L will discuss, among several other semantic web initiatives in the Netherlands, two projects: Heritage & Location (www.erfgoedenlocatie.nl) and PiLOD 2.0 (www.pilod.nl). Heritage & Location assembles linked geospatial data from all kinds of heritage institutions and discloses it in geotemporal semantic applications. PiLOD (Platform implementation Linked Open Data) is a sector-independent initiative for any institution or individual wanting to explore the possibilities of the semantic web.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.

March 11 2014

11:42

Improve how the CKAN Community Works – Your Suggestions Wanted

We want to collect ideas about how to improve how the CKAN Community works — whether that’s a change to the CKAN.org website, creating a new help forum or making it easier to locate relevant documentation.

A key first step is to identify and prioritize the needs of the Community.

That’s where we want your help! We’ve created a shared editable document where people can contribute their thoughts and idea on what is needed and how to provide it:

Shared Document for Our Ideas – Take a Look and Contribute Now »

Please jump in and add your suggestions and thoughts on what you most want (and how that could be provided).

Note: we are not looking for suggestions on how to improve the CKAN the software (if you have ideas there please see the Roadmap page)

09:59

AKSW Colloquium “Towards a Computer Algebra Semantic Social Network” on Monday, March 17

Towards a Computer Algebra Semantic Social Network

On Monday, March 17th, 2014, from 1:30 to 4:00 p.m. in Room P702 (Paulinum), Prof. Dr. Hans-Gert Gräbe will present and discuss a bootstrap process for a Semantic Social Network within a special scientific community, based on enhancements of xodx.

For the moment, we operate a SPARQL endpoint and extract valuable information to the WordPress-based website of the German CA-Fachgruppe.




Additional Information

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.
