
September 26 2014

21:02

In Hoc Signo Vinces (part 18 of n): Cluster Dynamics

This article is about how scale-out differs from single-server. It shows large effects from parameters whose very existence most would not anticipate, and introduces some low-level metrics for assessing them. The moral of the story is that this is the stuff that makes the difference between merely surviving scale-out and winning with it. The developer and DBA would not normally know about this; these things therefore fall into the category of adaptive self-configuration expected from the DBMS. But since this series is about what makes performance, I will discuss the dynamics such as they are and how to play them.

We take the prototypical cross-partition join in Q13: Make a hash table of all customers, partitioned by c_custkey; this is done independently, with full parallelism, in each partition. Scan orders and get the customer (in a different partition), flagging the customers that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.
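For reference, the standard TPC-H Q13 text is below (shown with the usual 'special'/'requests' substitution parameters); the hash build on customer and the probe from orders described above implement its left outer join and the grouping on order count:

select c_count, count(*) as custdist
from (
       select c_custkey, count(o_orderkey)
       from customer
         left outer join orders
           on c_custkey = o_custkey
          and o_comment not like '%special%requests%'
       group by c_custkey
     ) as c_orders (c_custkey, c_count)
group by c_count
order by custdist desc, c_count desc;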

The single-server times in Part 12 were 7.8 s and 6.0 s with a single user. We consider the better of the times; the difference is due to allocating memory on the first go, while on the second go the memory is already in reserve.

With default settings, we get 4595 ms, with per-node resource utilization at:


Cluster 4 nodes, 4 s. 112405 m/s 742602 KB/s  2749% cpu 0%  read 4% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 27867 m/s 185654 KB/s  733% cpu 0%  read 4% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 28149 m/s 185372 KB/s  672% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 28220 m/s 185621 KB/s  675% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 28150 m/s 185837 KB/s  667% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs

 

The top line is the summary; the lines below are per-process. The m/s figure is messages per second; KB/s is interconnect traffic per second; clw % is idle time spent waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each machine having 2 NUMA nodes. Each process has affinity to its NUMA node, so all memory is local. The time is reasonable in light of the overall CPU of 2700%. The maximum would be 4800%, with all threads of all cores busy all the time.

The catch here is that we do not have a steady half-platform utilization all the time, but full platform peaks followed by synchronization barriers with very low utilization. So, we set the batch size differently:


cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');

 

This means that we set, on each process, the cl_dfg_batch_bytes to 50M from a default of 10M. The effect is that each scan of orders, one thread per slice, 48 slices total, will produce 50MB worth of o_custkeys to be sent to the other partition for getting the customer. After each 50M, the thread stops and will produce the next batch when all are done and a global continue message is sent by the coordinator.

The time is now 3173 ms with:


Cluster 4 nodes, 3 s. 158220 m/s 1054944 KB/s  3676% cpu 0%  read 1% clw threads 1r 0w 0i buffers 8577766 287874 d 0 w 0 pfs
cl 1: 39594 m/s 263962 KB/s  947% cpu 0%  read 1% clw threads 1r 0w 0i buffers 2144242 71757 d 0 w 0 pfs
cl 2: 39531 m/s 263476 KB/s  894% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144640 71903 d 0 w 0 pfs
cl 3: 39523 m/s 263684 KB/s  933% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144454 71962 d 0 w 0 pfs
cl 4: 39535 m/s 263586 KB/s  900% cpu 0%  read 0% clw threads 0r 0w 0i buffers 2144430 72252 d 0 w 0 pfs

 

The platform utilization is better as we see. The throughput is nearly double that of the single-server, which is pretty good for a communication-heavy query.

This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits these 48 ways to go to every recipient. 1/4 are in the same process, 1/4 in a different process on the same machine, and 2/4 on a different machine. The recipient gets messages with an average of 208 o_custkey values, puts them back together in batches of 10K, and passes these to the hash join with customer.

We try different vector sizes, such as 100K:


cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');

 

There are two metrics of interest here: the write block time and the scheduling overhead. The write block time is a count of microseconds that increases whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clock cycles spent by threads waiting for the critical section that deals with dispatching messages to consumer threads. Long messages cause blocking; short messages cause frequent scheduling decisions.


SELECT cl_sys_stat ('local_cll_clk', clr=>1), 
       cl_sys_stat ('write_block_usec', clr=>1)
;

 

cl_sys_stat gets the counters from all processes and returns the sum. clr=>1 means that the counter is cleared after read.
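As a minimal measurement sketch (the vector size value and the placement of the Q13 run are illustrative; __dbf_set and cl_sys_stat are used exactly as above):

-- set the vector size on every process
cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');

-- clear the counters before the measured run
SELECT cl_sys_stat ('local_cll_clk', clr=>1),
       cl_sys_stat ('write_block_usec', clr=>1);

-- ... run Q13 here ...

-- read the mutex wait (clocks) and write block (microseconds) accumulated during the run
SELECT cl_sys_stat ('local_cll_clk', clr=>1),
       cl_sys_stat ('write_block_usec', clr=>1);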

We run Q13 with vector sizes of 10K, 100K, and 1000K.

Vector size   msec   mtx (clocks)      wblock (µs)
10K           3297   10,829,910,329    0
100K          3150   1,663,238,367     59,132
1000K         3876   414,631,129       4,578,003

So, 100K seems to strike the best balance between scheduling and blocking on write.

The times are measured after several samples with each setting. The times stabilize after a few runs, as memory blocks of the appropriate size are then in reserve. Calling mmap to allocate these on the first run with each size has a very high penalty, e.g., 60 s for the first run with the 1M vector size. We note that blocking on write is really bad even though 1/3 of the time there is no network and 2/3 of the time there is a fast network (QDR InfiniBand) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability on consecutive runs is under 5%, which is similar to single-server behavior.

It would seem that a mutex, as bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for continuing a thread thus blocked is of course higher than the latency for continuing one that is waiting for a mutex.

We note that a cluster with more machines can take a larger vector size, because a vector spreads out to more recipients. The key seems to be to set the message size so that blocking on write is not common. This is a possible adaptive execution feature. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy; this is a TCP replacement that comes with the InfiniBand drivers.

We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

21:02

In Hoc Signo Vinces (part 17 of n): 100G and 300G Runs on Dual Xeon E5 2650v2

This is an update presenting sample results for a single-server configuration on a newer platform, to verify that performance scales with the addition of cores and clock speed. Further, we note that the jump from 100G to 300G changes the score very little: 3x larger data takes approximately 3x longer, as long as things stay in memory.

The platform is one node of the CWI cluster which was also used for the 500Gt RDF experiments reported on this blog. The specification is dual Xeon E5 2650v2 (8 core, 16 thread, 2.6 GHz) with 256 GB RAM. The disk setup is a RAID-0 of three 2 TB rotating disks.

For 100G, the composite score goes from about 240,000 to about 396,000, which is about 1.64x. The new platform has 16 cores vs. 12, and a clock of 2.6 GHz as opposed to 2.3 GHz; this makes a multiplier of 1.5. The rest of the acceleration is probably attributable to a faster memory clock. In any case, the point that a larger platform gives more speed is made.

The top level scores per run are as follows; the numerical quantities summaries are appended.

100G

Run     Power       Throughput   Composite
Run 1   391,000.1   401,029.4    395,983.0
Run 2   388,746.2   404,189.3    396,392.6

300G

Run     Power       Throughput   Composite
Run 1    61,988.7   384,883.7    154,461.6
Run 2   423,431.8   387,248.6    404,936.3
Run 3   417,672.0   389,719.5    403,453.7

The interested may reproduce the results using the feature/analytics branch of the v7fasttrack git repository on GitHub as described in Part 13.

For the 300G runs, we note a much longer load time (see below), as the load is seriously IO-bound.

The first power test at 300G is a non-starter, even though it comes right after the bulk load. The data is not in the working set, and getting it from disk is simply an automatic disqualification, unless maybe one had 300 separate disks. This happens in TPC benchmarks, but not very often in the field. Looking at the first power run, the first queries take the longest, but by the time the throughput run starts, the working set is there. By an artifact of the metric (use of the geometric mean for the power test), long queries are penalized less there than in the throughput run.

So, we run 3 executions instead of the prescribed 2, to have 2 executions from warm state.

To do 300G well in 256 GB of RAM, one needs either to use several SSDs, or to increase compression and keep everything in memory, with no secondary storage at all. To keep everything in memory, one could stream-compress string columns. Stream-compressing strings (e.g., o_comment, l_comment) does not pay if one is already in memory, but if it eliminates going to secondary storage, then the win is sure.

As before, all caveats apply; the results are unaudited and for information only. Therefore we do not use the official metric name.

100G Run 1

Virt-H Executive Summary

Report Date                                          September 15, 2014
Database Scale Factor                                100
Start of Database Load                               09/15/2014 07:04:08
End of Database Load                                 09/15/2014 07:15:58
Database Load Time                                   0:11:50
Query Streams for Throughput Test                    5
Virt-H Power                                         391,000.1
Virt-H Throughput                                    401,029.4
Virt-H Composite Query-per-Hour Metric (Qph@100GB)   395,983.0
Measurement Interval in Throughput Test (Ts)         98.846000 seconds

Duration of stream execution

             Start Date/Time       End Date/Time         Duration
Stream 0     09/15/2014 13:13:01   09/15/2014 13:13:28   0:00:27
Stream 1     09/15/2014 13:13:29   09/15/2014 13:15:06   0:01:37
Stream 2     09/15/2014 13:13:29   09/15/2014 13:15:07   0:01:38
Stream 3     09/15/2014 13:13:29   09/15/2014 13:15:07   0:01:38
Stream 4     09/15/2014 13:13:29   09/15/2014 13:15:04   0:01:35
Stream 5     09/15/2014 13:13:29   09/15/2014 13:15:08   0:01:39
Refresh 0    09/15/2014 13:13:01   09/15/2014 13:13:03   0:00:02
             09/15/2014 13:13:28   09/15/2014 13:13:29   0:00:01
Refresh 1    09/15/2014 13:14:10   09/15/2014 13:14:16   0:00:06
Refresh 2    09/15/2014 13:13:29   09/15/2014 13:13:42   0:00:13
Refresh 3    09/15/2014 13:13:42   09/15/2014 13:13:53   0:00:11
Refresh 4    09/15/2014 13:13:53   09/15/2014 13:14:02   0:00:09
Refresh 5    09/15/2014 13:14:02   09/15/2014 13:14:10   0:00:08

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Stream 0 1.442477 0.304513 0.720263 0.351285 0.979414 0.479455 0.865992 0.875236 Stream 1 3.938133 0.920533 3.738724 2.769707 3.209728 1.339146 2.759384 3.626868 Stream 2 4.104738 0.952245 4.719658 0.865586 2.139267 0.850909 2.044402 2.600373 Stream 3 3.692119 1.024876 3.430172 1.579846 4.097845 1.859468 2.312921 6.238070 Stream 4 5.419537 0.531571 2.116176 1.256836 4.787617 2.117995 3.517466 3.982180 Stream 5 5.167029 0.746720 3.157557 1.255182 3.004802 2.131963 3.648316 2.835751 Min Qi 3.692119 0.531571 2.116176 0.865586 2.139267 0.850909 2.044402 2.600373 Max Qi 5.419537 1.024876 4.719658 2.769707 4.787617 2.131963 3.648316 6.238070 Avg Qi 4.464311 0.835189 3.432457 1.545431 3.447852 1.659896 2.856498 3.856648 Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Stream 0 2.606044 1.117063 1.847930 0.618534 4.327600 1.110908 0.995289 0.975910 Stream 1 7.463593 4.686463 4.549733 4.168129 15.759178 5.247666 4.495030 4.075198 Stream 2 9.398552 5.170904 3.934405 1.880683 19.968787 3.767992 6.965337 3.849845 Stream 3 7.581069 4.109905 4.301159 2.123634 17.683200 5.383603 4.376887 2.854777 Stream 4 9.927887 6.913209 3.351489 2.802724 16.985827 3.925148 4.691474 4.080586 Stream 5 7.035080 3.921425 6.844778 2.899238 14.839509 4.986742 6.629664 4.089547 Min Qi 7.035080 3.921425 3.351489 1.880683 14.839509 3.767992 4.376887 2.854777 Max Qi 9.927887 6.913209 6.844778 4.168129 19.968787 5.383603 6.965337 4.089547 Avg Qi 8.281236 4.960381 4.596313 2.774882 17.047300 4.662230 5.431678 3.789991 Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2 Stream 0 1.215956 0.745257 0.699801 1.281834 1.291110 0.518425 1.827192 1.014431 Stream 1 5.779854 2.383264 2.396793 6.130511 5.002700 1.968425 4.172437 2.427047 Stream 2 7.828176 1.833416 3.175649 4.785709 5.385834 1.403290 6.383005 6.366525 Stream 3 5.880139 1.797383 3.258024 5.601364 6.373216 1.977848 5.235542 6.385010 Stream 4 3.989621 1.252891 2.478303 4.678629 3.212176 2.740586 5.037995 3.911379 Stream 5 5.030440 2.010988 4.188428 6.221990 5.418788 2.187718 3.589915 3.517380 Min Qi 3.989621 1.252891 2.396793 4.678629 3.212176 1.403290 3.589915 2.427047 Max Qi 7.828176 2.383264 4.188428 6.221990 6.373216 2.740586 6.383005 6.385010 Avg Qi 5.701646 1.855588 3.099439 5.483641 5.078543 2.055573 4.883779 4.521468

100G Run 2

Virt-H Executive Summary

Report Date                                          September 15, 2014
Database Scale Factor                                100
Total Data Storage/Database Size                     87,312M
Start of Database Load                               09/15/2014 07:04:08
End of Database Load                                 09/15/2014 07:15:58
Database Load Time                                   0:11:50
Query Streams for Throughput Test                    5
Virt-H Power                                         388,746.2
Virt-H Throughput                                    404,189.3
Virt-H Composite Query-per-Hour Metric (Qph@100GB)   396,392.6
Measurement Interval in Throughput Test (Ts)         98.074000 seconds

Duration of stream execution:

             Start Date/Time       End Date/Time         Duration
Stream 0     09/15/2014 13:15:11   09/15/2014 13:15:38   0:00:27
Stream 1     09/15/2014 13:15:39   09/15/2014 13:17:13   0:01:34
Stream 2     09/15/2014 13:15:39   09/15/2014 13:17:16   0:01:37
Stream 3     09/15/2014 13:15:39   09/15/2014 13:17:15   0:01:36
Stream 4     09/15/2014 13:15:39   09/15/2014 13:17:17   0:01:38
Stream 5     09/15/2014 13:15:39   09/15/2014 13:17:15   0:01:36
Refresh 0    09/15/2014 13:15:11   09/15/2014 13:15:12   0:00:01
             09/15/2014 13:15:38   09/15/2014 13:15:39   0:00:01
Refresh 1    09/15/2014 13:16:13   09/15/2014 13:16:20   0:00:07
Refresh 2    09/15/2014 13:15:39   09/15/2014 13:15:47   0:00:08
Refresh 3    09/15/2014 13:15:47   09/15/2014 13:15:56   0:00:09
Refresh 4    09/15/2014 13:15:56   09/15/2014 13:16:03   0:00:07
Refresh 5    09/15/2014 13:16:03   09/15/2014 13:16:12   0:00:09

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Stream 0 1.467681 0.277665 0.766102 0.365185 0.941206 0.549381 0.938998 0.803514 Stream 1 3.883169 1.488521 3.366920 1.627478 3.632321 2.065565 2.911138 2.444544 Stream 2 3.294589 1.138066 3.260775 1.899615 5.367725 1.820374 3.655119 2.186642 Stream 3 3.797641 0.995877 3.239690 2.483035 2.737690 1.505998 4.058083 4.268644 Stream 4 4.099187 0.402685 4.704959 1.469825 5.367910 2.783018 2.706164 2.551061 Stream 5 3.651273 1.598314 2.051899 1.283754 4.711897 1.519763 2.851300 2.484093 Min Qi 3.294589 0.402685 2.051899 1.283754 2.737690 1.505998 2.706164 2.186642 Max Qi 4.099187 1.598314 4.704959 2.483035 5.367910 2.783018 4.058083 4.268644 Avg Qi 3.745172 1.124693 3.324849 1.752741 4.363509 1.938944 3.236361 2.786997 Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Stream 0 2.734812 1.115539 1.679910 0.633239 4.391739 1.130082 1.137284 0.919646 Stream 1 9.271071 5.664855 3.377869 2.148228 16.046021 2.935643 4.897009 2.891040 Stream 2 10.272523 4.578427 4.086788 2.312762 16.295728 2.714776 6.393897 2.414951 Stream 3 7.095213 4.544636 4.073433 2.710320 18.789088 3.903873 5.471600 2.994184 Stream 4 7.567924 3.691088 3.951049 2.207944 18.189014 4.985841 6.568935 3.965322 Stream 5 8.173577 4.959777 4.736593 3.507469 17.106990 5.405699 7.357104 3.125788 Min Qi 7.095213 3.691088 3.377869 2.148228 16.046021 2.714776 4.897009 2.414951 Max Qi 10.272523 5.664855 4.736593 3.507469 18.789088 5.405699 7.357104 3.965322 Avg Qi 8.476062 4.687757 4.045146 2.577345 17.285368 3.989166 6.137709 3.078257 Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2 Stream 0 1.206347 0.792013 0.699476 1.349182 1.505387 0.543947 1.549135 0.824344 Stream 1 5.135036 1.873195 4.978155 5.988226 4.705365 1.211049 4.175947 3.579242 Stream 2 7.656125 2.229819 2.805272 6.629781 4.138014 1.423334 5.165700 3.197300 Stream 3 6.385983 2.086301 3.450305 3.292353 5.503905 2.302992 4.860041 3.865383 Stream 4 6.514967 2.876895 3.481100 1.629007 5.715903 2.121692 3.681208 3.347289 Stream 5 4.100205 2.400816 2.142291 4.710677 5.765320 1.616445 6.095817 3.007436 Min Qi 4.100205 1.873195 2.142291 1.629007 4.138014 1.211049 3.681208 3.007436 Max Qi 7.656125 2.876895 4.978155 6.629781 5.765320 2.302992 6.095817 3.865383 Avg Qi 5.958463 2.293405 3.371425 4.450009 5.165701 1.735102 4.795743 3.399330

300G Run 1

Virt-H Executive Summary

Report Date                                          September 25, 2014
Database Scale Factor                                300
Start of Database Load                               09/25/2014 16:38:20
End of Database Load                                 09/25/2014 18:32:06
Database Load Time                                   1:53:46
Query Streams for Throughput Test                    6
Virt-H Power                                         61,988.7
Virt-H Throughput                                    384,883.7
Virt-H Composite Query-per-Hour Metric (Qph@300GB)   154,461.6
Measurement Interval in Throughput Test (Ts)         370.498000 seconds

Duration of stream execution:

             Start Date/Time       End Date/Time         Duration
Stream 0     09/25/2014 19:00:29   09/25/2014 19:22:25   0:21:56
Stream 1     09/25/2014 19:22:27   09/25/2014 19:28:23   0:05:56
Stream 2     09/25/2014 19:22:27   09/25/2014 19:28:23   0:05:56
Stream 3     09/25/2014 19:22:27   09/25/2014 19:28:26   0:05:59
Stream 4     09/25/2014 19:22:27   09/25/2014 19:28:13   0:05:46
Stream 5     09/25/2014 19:22:27   09/25/2014 19:28:38   0:06:11
Stream 6     09/25/2014 19:22:27   09/25/2014 19:28:38   0:06:11
Refresh 0    09/25/2014 19:00:29   09/25/2014 19:03:56   0:03:27
             09/25/2014 19:22:25   09/25/2014 19:22:27   0:00:02
Refresh 1    09/25/2014 19:25:22   09/25/2014 19:25:58   0:00:36
Refresh 2    09/25/2014 19:22:27   09/25/2014 19:23:11   0:00:44
Refresh 3    09/25/2014 19:23:10   09/25/2014 19:23:40   0:00:30
Refresh 4    09/25/2014 19:23:40   09/25/2014 19:24:21   0:00:41
Refresh 5    09/25/2014 19:24:21   09/25/2014 19:24:58   0:00:37
Refresh 6    09/25/2014 19:24:59   09/25/2014 19:25:22   0:00:23

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Stream 0 183.735463 95.826361 79.826802 87.603164 47.099641 1.301704 2.606488 52.667426 Stream 1 9.400003 1.983777 15.839250 3.001843 15.593335 6.067716 8.870516 11.679706 Stream 2 12.634711 3.472203 13.683075 8.057952 16.500741 5.403771 11.181661 12.393932 Stream 3 10.807287 3.793587 15.844244 3.214977 15.960600 7.099744 10.424530 21.001623 Stream 4 11.900829 3.741707 14.219904 5.616907 16.487144 14.229782 11.100193 8.769539 Stream 5 13.933423 2.916529 19.453452 5.258843 16.706269 7.948711 8.982104 17.566729 Stream 6 17.084445 0.738683 11.503079 8.324812 23.483917 20.101834 9.207737 10.311292 Min Qi 9.400003 0.738683 11.503079 3.001843 15.593335 5.403771 8.870516 8.769539 Max Qi 17.084445 3.793587 19.453452 8.324812 23.483917 20.101834 11.181661 21.001623 Avg Qi 12.626783 2.774414 15.090501 5.579222 17.455334 10.141926 9.961123 13.620470 Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Stream 0 41.997798 2.727870 21.651730 25.704209 293.103984 3.171437 2.886688 5.298823 Stream 1 29.662265 22.788618 12.979253 7.121358 62.774323 22.132581 22.616793 21.625334 Stream 2 28.041750 22.481172 19.262140 5.790272 58.105179 16.809177 32.813330 12.692499 Stream 3 32.534297 15.460256 12.038047 7.012926 59.413740 18.540284 25.968635 16.716208 Stream 4 28.759993 15.123651 21.734471 6.920480 63.119744 12.848884 21.372432 11.662102 Stream 5 18.315308 21.781800 26.141212 8.230858 60.985590 22.369824 27.098660 25.283066 Stream 6 31.455961 27.078707 12.954580 11.081669 72.483462 12.376376 22.129120 11.439147 Min Qi 18.315308 15.123651 12.038047 5.790272 58.105179 12.376376 21.372432 11.439147 Max Qi 32.534297 27.078707 26.141212 11.081669 72.483462 22.369824 32.813330 25.283066 Avg Qi 28.128262 20.785701 17.518284 7.692927 62.813673 17.512854 25.333162 16.569726 Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2 Stream 0 7.793403 81.545934 41.648484 4.638731 25.003179 0.536267 206.980380 2.501589 Stream 1 27.058060 3.894254 8.664394 25.315007 11.921265 3.561859 22.936601 13.235777 Stream 2 25.718500 6.140657 8.856586 14.761290 11.870351 7.728217 13.882613 29.328859 Stream 3 15.896774 8.631035 15.742406 20.621604 13.370582 5.536313 14.677463 14.772753 Stream 4 22.458327 5.319241 11.973431 22.344017 11.534642 2.402683 24.214115 16.236299 Stream 5 13.407745 5.413278 8.800650 18.055743 17.528827 4.173171 15.927165 21.636801 Stream 6 8.069721 5.531066 13.233927 21.321389 7.622026 12.064182 11.457848 12.342336 Min Qi 8.069721 3.894254 8.664394 14.761290 7.622026 2.402683 11.457848 12.342336 Max Qi 27.058060 8.631035 15.742406 25.315007 17.528827 12.064182 24.214115 29.328859 Avg Qi 18.768188 5.821588 11.211899 20.403175 12.307949 5.911071 17.182634 17.925471

300G run 2

Virt-H Executive Summary

Report Date                                          September 25, 2014
Database Scale Factor                                300
Start of Database Load                               09/25/2014 16:38:20
End of Database Load                                 09/25/2014 18:32:06
Database Load Time                                   1:53:46
Query Streams for Throughput Test                    6
Virt-H Power                                         423,431.8
Virt-H Throughput                                    387,248.6
Virt-H Composite Query-per-Hour Metric (Qph@300GB)   404,936.3
Measurement Interval in Throughput Test (Ts)         368.236000 seconds

Duration of stream execution:

             Start Date/Time       End Date/Time         Duration
Stream 0     09/25/2014 19:28:42   09/25/2014 19:29:58   0:01:16
Stream 1     09/25/2014 19:30:00   09/25/2014 19:36:04   0:06:04
Stream 2     09/25/2014 19:30:00   09/25/2014 19:36:00   0:06:00
Stream 3     09/25/2014 19:30:00   09/25/2014 19:36:06   0:06:06
Stream 4     09/25/2014 19:30:00   09/25/2014 19:36:07   0:06:07
Stream 5     09/25/2014 19:30:00   09/25/2014 19:35:53   0:05:53
Stream 6     09/25/2014 19:30:00   09/25/2014 19:36:08   0:06:08
Refresh 0    09/25/2014 19:28:41   09/25/2014 19:28:46   0:00:05
             09/25/2014 19:29:58   09/25/2014 19:30:00   0:00:02
Refresh 1    09/25/2014 19:32:23   09/25/2014 19:32:55   0:00:32
Refresh 2    09/25/2014 19:30:00   09/25/2014 19:30:31   0:00:31
Refresh 3    09/25/2014 19:30:31   09/25/2014 19:31:00   0:00:29
Refresh 4    09/25/2014 19:31:01   09/25/2014 19:31:23   0:00:22
Refresh 5    09/25/2014 19:31:23   09/25/2014 19:31:54   0:00:31
Refresh 6    09/25/2014 19:31:55   09/25/2014 19:32:23   0:00:28

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Stream 0 4.197427 1.011516 2.535959 0.858781 2.857279 1.293530 2.682266 2.260502 Stream 1 15.467757 3.517499 13.820864 4.157259 13.141556 10.902710 16.899687 8.986535 Stream 2 15.639991 6.026485 13.521624 3.918031 17.336458 1.975310 9.718194 15.165247 Stream 3 14.891929 4.481383 15.322621 5.272911 15.266543 6.771253 13.430646 20.171084 Stream 4 14.560526 2.464157 11.567112 5.526629 20.531540 5.225971 16.288606 17.209475 Stream 5 10.390577 3.549165 9.598328 8.783847 17.351211 6.308214 12.606512 13.035716 Stream 6 16.275922 4.086475 14.109963 4.385887 10.174709 6.703266 8.936217 16.798526 Min Qi 10.390577 2.464157 9.598328 3.918031 10.174709 1.975310 8.936217 8.986535 Max Qi 16.275922 6.026485 15.322621 8.783847 20.531540 10.902710 16.899687 20.171084 Avg Qi 14.537784 4.020861 12.990085 5.340761 15.633670 6.314454 12.979977 15.227764 Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Stream 0 8.300092 2.598145 5.168418 1.619399 11.958836 3.191672 3.097822 2.497410 Stream 1 26.412829 17.354745 12.942454 8.169447 58.600101 15.227942 32.985324 13.914978 Stream 2 34.523245 17.635531 15.193748 8.435375 62.442800 16.276300 26.533303 12.414575 Stream 3 25.334301 18.595422 11.663933 10.029387 63.664992 20.378320 24.760768 15.710589 Stream 4 36.971957 15.645673 14.672851 13.196301 58.214728 17.375053 26.581101 11.624989 Stream 5 30.891797 12.993365 14.089049 10.515091 65.232712 20.807026 26.920526 11.362095 Stream 6 38.143281 21.106772 15.152299 18.845766 66.240343 12.295624 22.510610 18.081103 Min Qi 25.334301 12.993365 11.663933 8.169447 58.214728 12.295624 22.510610 11.362095 Max Qi 38.143281 21.106772 15.193748 18.845766 66.240343 20.807026 32.985324 18.081103 Avg Qi 32.046235 17.221918 13.952389 11.531894 62.399279 17.060044 26.715272 13.851388 Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2 Stream 0 4.016212 1.603004 1.836489 3.542383 3.901876 0.515102 4.759612 2.358873 Stream 1 22.162387 10.067834 15.772705 22.091355 12.974776 8.354196 19.342171 12.771250 Stream 2 25.647926 4.263008 11.590737 19.179326 17.899770 4.137031 15.720245 14.719776 Stream 3 14.511279 7.484608 20.735250 13.041037 17.139046 6.014141 16.234122 13.454647 Stream 4 19.297494 10.110707 10.907458 19.649066 15.206251 3.423503 11.268082 11.852223 Stream 5 17.445165 5.582309 15.266324 19.788382 14.245770 2.810949 16.601461 14.019717 Stream 6 25.115339 6.896503 11.661563 21.900028 5.520025 3.093050 15.436258 13.353446 Min Qi 14.511279 4.263008 10.907458 13.041037 5.520025 2.810949 11.268082 11.852223 Max Qi 25.647926 10.110707 20.735250 22.091355 17.899770 8.354196 19.342171 14.719776 Avg Qi 20.696598 7.400828 14.322339 19.274866 13.830940 4.638812 15.767057 13.361843

300G run 3:

Virt-H Executive Summary

Report Date                                          September 25, 2014
Database Scale Factor                                300
Total Data Storage/Database Size                     258,888M
Start of Database Load                               09/25/2014 16:38:20
End of Database Load                                 09/25/2014 18:32:06
Database Load Time                                   1:53:46
Query Streams for Throughput Test                    6
Virt-H Power                                         417,672.0
Virt-H Throughput                                    389,719.5
Virt-H Composite Query-per-Hour Metric (Qph@300GB)   403,453.7
Measurement Interval in Throughput Test (Ts)         365.902000 seconds

Duration of stream execution:

             Start Date/Time       End Date/Time         Duration
Stream 0     09/25/2014 19:36:11   09/25/2014 19:37:29   0:01:18
Stream 1     09/25/2014 19:37:32   09/25/2014 19:43:13   0:05:41
Stream 2     09/25/2014 19:37:32   09/25/2014 19:43:31   0:05:59
Stream 3     09/25/2014 19:37:32   09/25/2014 19:43:37   0:06:05
Stream 4     09/25/2014 19:37:32   09/25/2014 19:43:33   0:06:01
Stream 5     09/25/2014 19:37:32   09/25/2014 19:43:32   0:06:00
Stream 6     09/25/2014 19:37:32   09/25/2014 19:43:37   0:06:05
Refresh 0    09/25/2014 19:36:12   09/25/2014 19:36:16   0:00:04
             09/25/2014 19:37:29   09/25/2014 19:37:31   0:00:02
Refresh 1    09/25/2014 19:40:02   09/25/2014 19:40:33   0:00:31
Refresh 2    09/25/2014 19:37:31   09/25/2014 19:38:01   0:00:30
Refresh 3    09/25/2014 19:38:01   09/25/2014 19:38:30   0:00:29
Refresh 4    09/25/2014 19:38:30   09/25/2014 19:38:58   0:00:28
Refresh 5    09/25/2014 19:38:58   09/25/2014 19:39:27   0:00:29
Refresh 6    09/25/2014 19:39:27   09/25/2014 19:40:01   0:00:34

Numerical Quantities Summary Timing Intervals in Seconds:

Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Stream 0 4.305006 1.083442 2.502758 0.845763 2.840824 1.346166 2.659511 2.233550 Stream 1 11.513360 3.732513 14.530428 3.819517 14.821291 7.561547 10.435082 8.984230 Stream 2 13.486433 3.373689 9.620363 3.914320 16.857542 5.837487 10.695443 17.901191 Stream 3 11.015942 1.780220 4.830412 9.073543 15.587709 9.661989 12.374931 15.262485 Stream 4 13.600461 0.820899 12.254226 7.799415 19.860761 13.145017 14.404345 11.807583 Stream 5 13.358000 3.885118 11.099935 4.845043 18.286721 6.424272 9.735255 15.041608 Stream 6 13.588873 3.789631 13.503399 5.130389 13.104065 3.517076 14.929079 19.831639 Min Qi 11.015942 0.820899 4.830412 3.819517 13.104065 3.517076 9.735255 8.984230 Max Qi 13.600461 3.885118 14.530428 9.073543 19.860761 13.145017 14.929079 19.831639 Avg Qi 12.760511 2.897012 10.973127 5.763705 16.419681 7.691231 12.095689 14.804789 Query Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Stream 0 8.553183 3.215484 4.652364 1.620089 11.936052 2.916132 3.219969 2.374276 Stream 1 29.441108 20.348266 9.994556 14.965432 60.537168 13.302875 30.159402 10.277570 Stream 2 41.799347 18.197400 16.773638 6.510347 67.461446 20.362328 0.109929 9.908769 Stream 3 24.306937 20.555376 17.140758 16.715188 61.724168 22.469230 27.967206 13.434167 Stream 4 34.820796 11.795664 18.015120 7.176057 63.134711 11.427374 23.959842 16.759246 Stream 5 23.139366 12.655317 13.152401 7.258740 64.273225 22.854106 28.803059 12.832364 Stream 6 27.955059 24.633526 11.046285 5.995041 74.965966 15.636579 22.803890 13.221303 Min Qi 23.139366 11.795664 9.994556 5.995041 60.537168 11.427374 0.109929 9.908769 Max Qi 41.799347 24.633526 18.015120 16.715188 74.965966 22.854106 30.159402 16.759246 Avg Qi 30.243769 18.030925 14.353793 9.770134 65.349447 17.675415 22.300555 12.738903 Query Q17 Q18 Q19 Q20 Q21 Q22 RF1 RF2 Stream 0 4.298092 1.702071 1.894548 4.118591 3.922889 0.491145 4.519734 2.347913 Stream 1 16.432222 6.908918 17.749058 18.756674 11.148628 5.464975 18.300673 12.972871 Stream 2 20.588544 4.387662 14.527229 23.844364 15.500462 15.543458 13.666574 15.240662 Stream 3 14.008049 6.222633 12.833421 22.811602 16.013232 9.449069 16.486111 12.974515 Stream 4 16.964699 8.106044 11.207675 22.483826 17.354675 4.641183 14.583941 13.679087 Stream 5 25.243144 7.359437 16.986615 19.855391 17.183725 5.750937 14.759597 13.052316 Stream 6 12.986721 10.160993 17.496662 19.267026 17.300224 4.955930 19.267721 15.421241 Min Qi 12.986721 4.387662 11.207675 18.756674 11.148628 4.641183 13.666574 12.972871 Max Qi 25.243144 10.160993 17.749058 23.844364 17.354675 15.543458 19.267721 15.421241 Avg Qi 17.703896 7.190948 15.133443 21.169814 15.750158 7.634259 16.177436 13.890115

To be continued...

In Hoc Signo Vinces (TPC-H) Series

September 25 2014

17:02

Community Meeting: Friday 26 September 2014

The next CKAN Association Community meeting is planned for this Friday 26th September at 4pm London / 11am EDT.

One major proposed topic is the Roadmap, but please suggest additional topics in the meeting doc or here on the list.

To help us manage the meeting and get an idea of numbers, please add yourself to the participants list in the meeting doc if you intend to come.

Details

  • When: Friday 26 September 2014 – 4pm London (BST) / 11am US East Coast (EDT) / 5pm European (CET)
  • Where: online (likely via WebEx or similar – link to come)
  • Text chat: http://webchat.freenode.net/?channels=ckan
  • Chair: Jeanne Holm (GSA / Data.Gov)
  • Meeting Doc: Google Doc here – please add your name if you plan to come!

Draft Agenda

September 24 2014

17:05

In Hoc Signo Vinces (part 16 of n): Introduction to Scale-Out

So far, we have analyzed TPC-H in a single-server, memory-only setting. We will now move to larger data and cluster implementations. In principle, TPC-H parallelizes well, so we should expect near-linear scalability; i.e., twice the gear runs twice as fast, or close enough.

In practice, things are not quite so simple. Larger data, particularly a different data-to-memory ratio, and the fact of having no shared memory, all play a role. There is also a network, so partitioned operations, which also existed in the single-server case, now have to send messages across machines, not across threads. For data loading and refreshes, there is generally no shared file system, so data distribution and parallelism have to be considered.

As an initial pass, we look at 100G and 1000G scales on the same test system as before. This is two machines, each with dual Xeon E5-2630, 192 GB RAM, 2 x 512 GB SSD, and QDR InfiniBand. We will also try other platforms, but if nothing else is said, this is the test system.

As of this writing, there is a working implementation, but it is not guaranteed to be optimal as yet. We will adjust it as we go through the workload. One outcome of the experiment will be a precise determination of the data-volume-to-RAM ratio that still gives good performance.

A priori, we know of the following things that complicate life with clusters:

  • Distributed memory — The working set must be in memory for a run to have a competitive score. A cluster can have a lot of memory, and the data partitions very evenly, so at first this appears not to be a problem. The difficulty comes with query memory: If each machine has 1/16th of the total RAM and a hash table would be 1/64th of the working set, then on a single server building the hash table is no problem. On a scale-out system, the same hash table, replicated on each node, would be 1/4 of that node's share of the working set, which will not fit, especially if there are many such hash tables at the same time. Two main approaches exist: The hash table can be partitioned, but this forces the probe to go cross-partition, which takes time. The other possibility is to build the hash table many times, each time with a fraction of the data, and to run the probe side many times. Since hash tables often have Bloom filters, it is sometimes possible to replicate the Bloom filter and partition the hash table. One has also heard of hash tables that go to secondary storage, but should this happen, the race is already lost; so, we do not go there.

    We must evaluate different combinations of these techniques and have a cost model that accurately predicts the performance of each variant. Adding realism to the cost model is always safe, but it is somewhat difficult to do.

  • NUMA — Most servers are NUMA (non-uniform memory architecture), where each CPU socket has its own local memory. For single-server cases, we use all the memory for the process. Some implementations have special logic for memory affinity between threads. With scale-out there is the choice of having a server process per-NUMA-node or per-physical-machine. If per-NUMA-node, we are guaranteed only local memory accesses. This is a tradeoff to be evaluated.

  • Network and Scheduling — Execution on a cluster is always vectored, for the simple reason that sending single-tuple messages is infeasible in terms of performance. With an otherwise vectored architecture, the message batching required on a cluster comes naturally. However, the larger the cluster, the more partitions there are, which rapidly leads to shorter messages. Increasing the vector size is possible and makes messages longer, but an indefinite increase in vector size has drawbacks for cache locality and takes memory. To run well, each thread must stay on core. There are two ways of being taken off core ahead of time: blocking on a mutex, and blocking on the network. Lots of short messages run into scheduling overhead, since the recipient must decide what to do with each, which is not really possible without some sort of critical section. This is more efficient if messages are longer, as the decision time does not depend on message length. Longer messages are, however, liable to block on write at the sender side. So one pays in either case. This is another tradeoff to be balanced.

  • Flow control — A query is a pipeline of producers and consumers. Sometimes the consumer is in a different partition. The producer must not get indefinitely ahead of the consumer because this would run out of memory, but it must stay sufficiently ahead so as not to stop the consumer. In practice, there are synchronization barriers to check even progress. These will decrease platform utilization, because two threads never finish at exactly the same time. The price of not having these is having no cap on transient memory consumption.

  • Non-homogeneous performance — Identical machines do not always perform identically. This is seen especially with disk, where wear on SSDs can affect write speed, and where uncontrollable hazards of data placement lead to uneven read speeds on rotating media. Purely memory-bound performance is quite close, though. Unpredictable and uncontrollable hazards of scheduling cause network messages to arrive at different times, which introduces variation in run time on consecutive runs. Single servers have some such variation from threading, but the effects are larger with a network.

The logical side of query optimization stays the same. Pushing down predicates is always good, and all the logical tricks with moving conditions between subqueries stay the same.

Schema design stays much the same, but there is the extra question of partitioning keys. In this implementation, there are only indices on identifiers, not on dates, for example. So, for a primary-key-to-foreign-key join, if there is an index on the foreign key, the index should be partitioned the same way as the primary key. Thus, joining from orders to lineitem on orderkey will be co-located. Joining from customer to orders by index will be co-located for the c_custkey = o_custkey part (assuming an index on o_custkey) and cross-partition for getting the customer row on c_custkey, supposing that the query needs some property of the customer other than c_custkey itself.

A secondary question is the partition granularity. For good compression, nearby values should be consecutive, so here we leave the low 12 bits out of the partitioning. This affects bulk load and refreshes, for example: a batch of 10,000 lineitems ordered on l_orderkey will go to only 2 or 3 distinct destinations, thus producing longer messages and longer insert batches, which is more efficient.
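As an illustration only, a partitioning declaration in the style of Virtuoso's cluster DDL might look roughly like the sketch below; the exact syntax, the index names, and the 0hexfffff000 mask (dropping the low 12 bits of the key) are assumptions for the sketch, not a verified excerpt from the schema used here:

-- assumed syntax: partition orders and lineitem on the order key, ignoring the low 12 bits,
-- so that nearby orderkeys stay in the same partition
alter index ORDERS on ORDERS partition (O_ORDERKEY int (0hexfffff000));
alter index LINEITEM on LINEITEM partition (L_ORDERKEY int (0hexfffff000));

-- assumed: a foreign-key index on o_custkey, partitioned like the customer primary key,
-- so the customer-to-orders index lookup stays co-located
alter index O_CUSTKEY on ORDERS partition (O_CUSTKEY int (0hexfffff000));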

This is a quick overview of the wisdom so far. In subsequent installments, we will take a quantitative look at the tradeoffs and consider actual queries. As a conclusion, we will show a full run on a couple of different platforms, and likely provide Amazon machine images for the interested to see for themselves. Virtuoso Cluster is not open source, but the cloud will provide easy access.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

September 23 2014

15:37

Zemanta Partners with Getty Images

Today, we are thrilled to announce a partnership with Getty Images, the global leader in digital media, to offer Zemanta users the ability to embed high quality images directly from our Editorial Assistant. Whether it’s improving SEO or helping authors tell a better story, images play a critical role in the content creation process. The use of high quality images can help your blog stand above the rest. Now with Zemanta Editorial Assistant, bloggers can draw from a library of more than 50 million free images to complement their content. Let’s take a look at how to embed Getty Images using Editorial Assistant. Just like our existing image recommendations, images from the Getty Images library are available via the Media Gallery.

  1. As you begin writing, Editorial Assistant will make image recommendations based on the topic of your post.  Draw on Getty Images’ latest news, sports, celebrity, music and fashion coverage; immense digital photo archive; and rich conceptual images to illustrate your unique passions, ideas and interests. This partnership opens one of the largest, deepest and most comprehensive image collections in the world for easy sharing.
  2. Hover over the images for a larger view, and click on an image once you are ready to embed it in an article. The image will be inserted at the position of your pointer in the text editor; it’s also worth noting that the image is one-size-fits-all, which should fit all WordPress design layouts. It really is that easy.
  3. Also, all images already have their credits and the source URL hardcoded at the very bottom of the image, so you won’t have to worry about adding them later.
  4. Once you’ve selected an image, make sure to include some in-text links and related articles recommended by Editorial Assistant to make your post even richer.

Getty Images embeds are available via both our Editorial Assistant WordPress plugin and Browser Extension. If you haven’t tried Editorial Assistant just yet, or are looking to start blogging again, you can download the plugin or browser extension here. If you have any questions or concerns getting set up, we’re always happy to help. Send us an email at support@zemanta.com, or find us on Twitter @Zemanta.

09:17

Big Data Applications in the Healthcare Domain Presentation at IEEE IRI 2014

From 13th to 15th August 2014, the 15th IEEE International Conference on Information Reuse and Integration took place in San Francisco, USA. Sabrina Neururer had the chance to present the accepted paper entitled “Towards a Technology Roadmap for Big Data Applications in the Healthcare Domain” to a broad public on the first day of the conference. Very good feedback was received, confirming the results of the presented study and highlighting the huge impact of Big Data applications on the healthcare domain. Technical challenges, such as semantic annotation, data sharing, data quality, privacy and security, as well as open R&D questions, were discussed. Non-technical challenges for Big Data applications, such as user acceptance, were also debated intensively.


September 19 2014

09:26

BIG Project Final Event Workshop co-located with ISC BIG DATA 2014

Big Data Project Final Event Workshop

In this workshop we will present the results of the BIG project, including an analysis of foundational Big Data research technologies, technology and strategy roadmaps that enable businesses to understand the potential of Big Data technologies across different sectors, and the necessary collaboration and dissemination infrastructure to link technology suppliers, integrators, and leading user organizations.

Agenda: includes speakers from Atos, AGT International, OKF and SIEMENS

PPP approved: the BIG Project is pleased to have achieved its goal of creating a Big Data Public Private Partnership (PPP). The PPP will be signed by Neelie Kroes on October 13th and will allow industry to influence Europe's funding policy. Learn more about it in Session 2 of our workshop.

The workshop is highly interactive. If you wish to give a position statement, please contact info@big-project.eu (please include [finalevent] in the subject).

Registration is still open (and free) at our workshop website

Date: September 30, 2014

Time: 9 a.m.–5:30 p.m.

Venue: Heidelberg Marriott Hotel, Heidelberg, Germany

Co-located with: ISC BigData 2014

The workshop is free to attend; lunch and refreshments will be provided by the BIG Project. If you have questions regarding registration for the event, email info@big-project.eu (please include [finalevent] in the subject of your email for a faster response).

06:43

How to Turn Dry Statistics into Discussions on Twitter

Four years ago we were standing in the offices of DG INFSO, the European Commission directorate-general for information society and media, now called DG Connect, for a first meeting on what is today the Digital Agenda Scoreboard.

DG INFSO invited us to discuss the future of the Digital Agenda Scoreboard. Their objective: to create, together with us, a Linked Open Data platform for publishing statistics on how digital Europe is.

We did the classical data extraction towards Linked Data, turning the CSV representation of the data into RDF according to the Data Cube Vocabulary. This vocabulary was brand new at the time, and, as far as we know, we were one of the first to build an application with it. A lot of effort went into the creation of the use case scenarios, such as:

  • Visitors can select the data dimensions of their interest.
  • The system dynamically creates graphs with all the background information attached.

Already at the first meeting we pointed out the new scoreboard's potential for triggering vivid discussions on the data. This potential was shown right before the launch of the first version. During the preview at DG Connect, we showed a graph, and while we were explaining the selection, a vivid discussion on the correctness of the data started between the business owners. Mission accomplished, I thought.

Now, 4 years later, my Twitter feed regularly shows tweets pointing to the scoreboard. Boring statistics in a book have been turned into a lively, public place where European citizens discuss how digital their country is.

Looking back, I am proud to be part of this story. It shows that breaking down the walls around data can kick-start discussions in a community. The Digital Agenda Scoreboard was also the start of the development of cool extensions for publishing statistical data as Linked Open Data within the LOD2 project.



September 15 2014

17:38

Partner with Zemanta

At Zemanta, the tools and services we build are designed to help content creators create and then reach their total potential audience. Independent bloggers, publishers, and Fortune 500 brands work with Zemanta to create and amplify high quality content. To support our customers’ goals we rely on a diverse partner ecosystem — from Advertising Networks to WordPress Hosts.

If you run a network, platform, or software service that is seeking to further monetize or add a valuable service, let’s chat! Take a look at the opportunities we have available:

  • Advertising Networks. Zemanta is the first dedicated Demand Side Platform (DSP) for Content Ads. Partner with Zemanta to monetize your network by accessing our inventory of high quality content ads programmatically. Our unique Promoted Content API gives Ad Networks full control over their content ads. Promoted Content from Zemanta can be found on networks such as Outbrain, Yahoo!, nRelate, and many others.
  • WordPress Hosts. Give your WordPress users the power of our Editorial Assistant WordPress plugin by default to create richer posts more efficiently and share in the revenues we create with our unique Editorial Network ad product.
  • Content Creation, Curation, and Marketing Platforms. Partner with Zemanta to monetize your “authors” as an audience with Editorial Assistant or enable your customers to easily amplify their original content via our Content Ad DSP.

Do you run a platform or site that would benefit from a partnership with Zemanta? Visit our new partner page or email us at partners@zemanta.com to learn more. Even if one of the opportunities above doesn’t quite fit with your projects we’d still love to hear from you.

September 10 2014

09:25
The LOD2 project results presented in “Linked Open Data — Creating Knowledge Out of Interlinked Data”

September 09 2014

10:36

Big Data Healthcare Applications Presentation at MIE 2014

On September 1st, at the 25th European Medical Informatics Conference (MIE 2014), Dr. Sonja Zillner presented "User Needs and Requirements Analysis for Big Data Healthcare Applications", which was received very well. As the realization of the promising opportunities of big data technologies for healthcare relies on an integrated view of heterogeneous health data sources in high quality and on the availability of legal frameworks for secure data sharing, an intensive discussion of how to address these health data management challenges was triggered.

08:58

DBpedia Version 2014 released

Hi all,

We are happy to announce the release of DBpedia 2014.

The most important improvements of the new release compared to DBpedia 3.9 are:

1. the new release is based on updated Wikipedia dumps dating from April / May 2014 (the 3.9 release was based on dumps from March / April 2013), leading to an overall increase of the number of things described in the English edition from 4.26 to 4.58 million things.

2. the DBpedia ontology is enlarged and the number of infobox to ontology mappings has risen, leading to richer and cleaner data.

The English version of the DBpedia knowledge base currently describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology (http://wiki.dbpedia.org/Ontology2014), including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases.

We provide localized versions of DBpedia in 125 languages. All these versions together describe 38.3 million things, out of which 23.8 million are localized descriptions of things that also exist in the English version of DBpedia. The full DBpedia data set features 38 million labels and abstracts in 125 different languages, 25.2 million links to images and 29.8 million links to external web pages; 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories. DBpedia is connected with other Linked Datasets by around 50 million RDF links.

Altogether, the DBpedia 2014 release consists of 3 billion pieces of information (RDF triples), out of which 580 million were extracted from the English edition of Wikipedia and 2.46 billion from other language editions.

Detailed statistics about the DBpedia data sets in 28 popular languages are provided at Dataset Statistics page ( http://wiki.dbpedia.org/Datasets2014/DatasetStatistics ).

The main changes between DBpedia 3.9 and 2014 are described below. For additional, more detailed information please refer to the DBpedia Change Log ( http://wiki.dbpedia.org/Changelog ).

  1. Enlarged Ontology

The DBpedia community added new classes and properties to the DBpedia ontology via the mappings wiki. The DBpedia 2014 ontology encompasses

  • 685  classes (DBpedia 3.9: 529)
  • 1,079 object properties (DBpedia 3.9: 927)
  • 1,600 datatype properties (DBpedia 3.9: 1,290)
  • 116 specialized datatype properties (DBpedia 3.9: 116)
  • 47 owl:equivalentClass and 35 owl:equivalentProperty mappings to http://schema.org

2. Additional Infobox to Ontology Mappings

The editors community of the mappings wiki also defined many new mappings from Wikipedia templates to DBpedia classes. For the DBpedia 2014 extraction, we used 4,339 mappings (DBpedia 3.9: 3,177 mappings), which are distributed as follows over the languages covered in the release.

  • English: 586 mappings
  • Dutch: 469 mappings
  • Serbian: 450 mappings
  • Polish: 383 mappings
  • German: 295 mappings
  • Greek: 281 mappings
  • French: 221 mappings
  • Portuguese: 211 mappings
  • Slovenian: 170 mappings
  • Korean: 148 mappings
  • Spanish: 137 mappings
  • Italian: 125 mappings
  • Belarusian: 125 mappings
  • Hungarian: 111 mappings
  • Turkish: 91 mappings
  • Japanese: 81 mappings
  • Czech: 66 mappings
  • Bulgarian: 61 mappings
  • Indonesian: 59 mappings
  • Catalan: 52 mappings
  • Arabic: 52 mappings
  • Russian: 48 mappings
  • Basque: 37 mappings
  • Croatian: 36 mappings
  • Irish: 17 mappings
  • Wiki-Commons: 12 mappings
  • Welsh: 7 mappings
  • Bengali: 6 mappings
  • Slovak: 2 Mappings

3. Extended Type System to cover Articles without Infobox

  Until the DBpedia 3.8 release, a concept was only assigned a type (like person or place) if the corresponding Wikipedia article contained an infobox indicating this type. Starting from the 3.9 release, we provide type statements for articles without an infobox that are inferred based on the link structure within the DBpedia knowledge base, using the algorithm described in Paulheim/Bizer 2014 ( http://www.heikopaulheim.com/documents/ijswis_2014.pdf ). For the new release, an improved version of the algorithm was run to produce type information for 400,000 things that were formerly not typed. A similar algorithm (presented in the same paper) was used to identify and remove potentially wrong statements from the knowledge base.

  4. New and updated RDF Links into External Data Sources

  We updated the following RDF link sets pointing at other Linked Data sources: Freebase, Wikidata, Geonames and GADM. For an overview about all data sets that are interlinked from DBpedia please refer to http://wiki.dbpedia.org/Interlinking .

Accessing the DBpedia 2014 Release 

  You can download the new DBpedia datasets from http://wiki.dbpedia.org/Downloads .

  As usual, the new dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql .

Credits

  Lots of thanks to

  1. Daniel Fleischhacker (University of Mannheim) and Volha Bryl (University of Mannheim) for improving the DBpedia extraction framework, for extracting the DBpedia 2014 data sets for all 125 languages, for generating the updated RDF links to external data sets, and for generating the statistics about the new release.
  2. All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  3.  The whole DBpedia Internationalization Committee for pushing the DBpedia internationalization forward.
  4. Dimitris Kontokostas (University of Leipzig) for improving the DBpedia extraction framework and loading the new release onto the DBpedia download server in Leipzig.
  5. Heiko Paulheim (University of Mannheim) for re-running his algorithm to generate additional type statements for formerly untyped resources and to identify and remove wrong statements.
  6. Petar Ristoski (University of Mannheim) for generating the updated links pointing at the GADM database of Global Administrative Areas. Petar will also generate an updated release of DBpedia as Tables soon.
  7. Aldo Gangemi (LIPN University, France & ISTC-CNR, Italy) for providing the links from DOLCE to DBpedia ontology.
  8.  Kingsley Idehen, Patrick van Kleef, and Mitko Iliev (all OpenLink Software) for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.
  9.  OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  10. Michael Moore (University of Waterloo, as an intern at the University of Mannheim) for implementing the anchor text extractor and contributing to the statistics scripts.
  11. Ali Ismayilov (University of Bonn) for implementing Wikidata extraction, on which the interlanguage link generation was based.
  12. Gaurav Vaidya (University of Colorado Boulder) for implementing and running Wikimedia Commons extraction.
  13. Andrea Di Menna, Jona Christopher Sahnwaldt, Julien Cojan, Julien Plu, Nilesh Chakraborty and others who contributed improvements to the DBpedia extraction framework via the source code repository on GitHub.
  14.  All GSoC mentors and students for working directly or indirectly on this release: https://github.com/dbpedia/extraction-framework/graphs/contributors

 The work on the DBpedia 2014 release was financially supported by the European Commission through the project LOD2 - Creating Knowledge out of Linked Data (http://lod2.eu/).

More information about DBpedia is found at http://dbpedia.org/About as well as in the new overview article about the project available at   http://wiki.dbpedia.org/Publications .

Have fun with the new DBpedia 2014 release!

Cheers,

Daniel Fleischhacker, Volha Bryl, and Christian Bizer

 

 

September 08 2014

20:11

SEMANTiCS 2014 (part 3 of 3): Conversations

I was asked at the conference for an oracular statement about the future of the relational database (RDBMS). The answer, without doubt or hesitation, is that it is forever. But this does not mean that the RDBMS world is immutable; quite the opposite.

The specializations converge. The RDBMS becomes more adaptable and less schema-first. Of course, the RDBMS also takes on new data models besides the relational one: RDF and other property graph models, for instance.

The schema-last-ness is now well in evidence. For example, PostgreSQL has an hstore column type, which stores a set of key-value pairs. Vertica has a feature called flex tables, where a column can be added on a row-by-row basis.
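A minimal PostgreSQL sketch of this schema-last style (the table and the keys are made up for the example):

-- hstore ships with PostgreSQL as a contrib extension
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE product (
  id    serial PRIMARY KEY,
  attrs hstore,   -- ad hoc key-value pairs, no fixed columns
  doc   json      -- free-form JSON document
);

INSERT INTO product (attrs, doc)
VALUES ('color => "red", size => "L"', '{"tags": ["sale", "summer"]}');

-- filter and project on keys that were never declared in the schema
SELECT id, attrs -> 'color' AS color
FROM product
WHERE attrs ? 'size';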

Specialized indexing for text and geometries is a well established practice. However, dedicated IR systems, often Lucene derivatives, can offer more transparency in the IR domain for things like vector-space-models and hit-scoring. There is specialized faceted search support which is quite good. I do not know of an RDBMS that would do the exact same trick as Lucene for facets, but, of course, in the forever expanding scope of RDB, this is added easily enough.

JSON is all the rage in the web developer world. Phil Archer even said in his keynote, as a parody of the web developer: "I will never touch that crap of RDF or the semantic web; this is a pipe dream of reality-ignoring academics and I will not have it. I will only use JSON-LD."

XML and JSON are much the same thing. While most databases have had XML support for over a decade, there is a crop of specialized JSON systems like MongoDB. PostgreSQL also has a JSON datatype. Unsurprisingly, MarkLogic too has JSON, as this is pretty much the same thing as their core competence of XML.

Virtuoso, too, naturally has a JSON parser, and mapping this to the native XML data type is a non-issue. This should probably be done.

Stefano Bertolo of the EC, also the LOD2 project officer, used the term "Cambrian explosion" when talking about the proliferation of new database approaches in recent years.

Hadoop is a big factor in some environments. Actian Vector (née VectorWise), for example, can use this as its file system. HDFS is singularly cumbersome for this but still not impossible and riding the Hadoop bandwagon makes this adaptation likely worthwhile.

Graphs are popular in database research. We have a good deal of exposure to this via LDBC. Going back to an API for database access, as is often done in graph databases, can have its point, especially as a reaction to the opaque and sometimes hard-to-predict query optimization of declarative languages. Query optimization just keeps getting more complex, so a counter-reaction is understandable. APIs are good if crossed infrequently and bad otherwise. So, graph database APIs will develop vectoring; that is my prediction and even recommendation in LDBC deliverables.

So, there are diverse responses to the same evolutionary pressures. These are of initial necessity one-off special-purpose systems, since the time to solution is shorter that way. Doing these things inside an RDBMS usually takes longer. The geek also likes to start from scratch. Well, not always, as there have been some cases of grafting entirely non-MySQL-like functionality onto MySQL, e.g., Infobright and Kickfire.

From the Virtuoso angle, adding new data and control structures has been done many times. There is no reason why this cannot continue. The next instances will consist of some graph processing (BSP, Bulk Synchronous Parallel) in the query languages. Another recent example is an interface for pluggable specialized content indices. One can make chemical structure indices, use alternate full text indices, etc., with this.

Most of this diversification has to do with physical design. The common logical side is a demand for more flexibility in schema and sometimes in scaling, e.g., various forms of elasticity in growing scale-out clusters, especially with the big web players.

The diversification is a fact, but the results tend to migrate into the RDBMS given enough time.

On the other hand, when a new species like the RDF store emerges, with products that do this and no other thing and are numerous enough to form a market, the RDBMS functionality seeps in. Bigdata has a sort of multicolumn table feature, if I am not mistaken. We just heard about the wish for strict schema, views, and triggers. By all means.

From the Virtuoso angle, with structure awareness, the difference of SQL and RDF gradually fades, and any advance can be exploited to equal effect on either side.

Right now, I would say we have convergence when all the experimental streams feel many of the same necessities.

Of course you cannot have a semantic tech conference without the matter of the public SPARQL end point coming up. The answer is very simple: If you have operational need for SPARQL accessible data, you must have your own infrastructure. No public end points. Public end points are for lookups and discovery; sort of a dataset demo. If operational data is in all other instances the responsibility of the one running the operation, why should it be otherwise here? Outsourcing is of course possible, either for platform (cloud) or software (SaaS). To outsource something with a service level, the service level must be specifiable. A service level cannot be specified in terms of throughput with arbitrary queries but in terms of well defined transactions; hence the services world runs via APIs, as in the case of Open PHACTS. For arbitrary queries (i.e., analytics on demand), with the huge variation in performance dependent on query plans and configuration of schema, the best is to try these things with platform on demand in a cloud. Like this, there can be a clear understanding of performance, which cannot be had with an entirely uncontrolled concurrent utilization. For systems in constant operation, having one's own equipment is cheaper, but still might be impossible to procure due to governance.

Having clarified this, the incentives for operators also become clearer. A public end point is a free evaluation; a SaaS deal or product sale is the commercial offering.

Anyway, common datasets like DBpedia are available preconfigured on AWS with a Virtuoso server. For larger data, there is a point to making ready-to-run cluster configurations available for evaluation, now that AWS has suitable equipment (e.g., dual E5 2670 with 240 GB RAM and SSD for USD 2.8 an hour). According to Amazon, up to five of these are available at a time without special request. We will try this during the fall and make the images available.

SEMANTiCS 2014 Series

19:22

SEMANTiCS 2014 (part 2 of 3): RDF Data Shapes

The first keynote of Semantics 2014 was by Phil Archer of the W3C, entitled "10 Years of Achievement." After my talk, in the questions, Phil brought up the matter of the upcoming W3C work group charter on RDF Data Shapes. We had discussed this already at the reception the night before and I will here give some ideas about this.

After the talk, my answer was that naturally the existence of something that expressed the same sort of thing as SQL DDL, with W3C backing, can only be a good thing and will give the structure awareness work by OpenLink in Virtuoso and probably others a more official seal of approval. Quite importantly, this will be a facilitator of interoperability and will raise this from a product specific optimization trick to a respectable, generally-approved piece of functionality.

This is the general gist of the matter and can hardly be otherwise. But underneath is a whole world of details, which we discussed at the reception.

Phil noted that there was controversy around whether a lightweight OWL-style representation or SPIN should function as the basis for data shapes.

Phil stated in the keynote that the W3C considered the RDF series of standards as good and complete, but would still have working groups for filling in gaps as these came up. This is what I had understood from my previous talks with him at the Linking Geospatial Data workshop in London earlier this year.

So, against this backdrop, as well as what I had discussed with Ralph Hodgson of Top Quadrant at a previous LDBC TUC meeting in Amsterdam, SPIN seems to me a good fit.

Now, it turns out that we are talking about two different use cases. Phil said that the RDF Data Shapes use case was about making explicit what applications required of data. For example, all products should have a unit price, and this should have one value that is a number.
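
Absent a finished shapes language, such a requirement can at least be checked with plain SPARQL. A minimal sketch, assuming a hypothetical ex: product vocabulary: the first query finds products with a missing or non-numeric unit price; the second finds products with more than one value.

PREFIX ex: <http://example.org/schema#>   # hypothetical vocabulary

# Products with a missing or non-numeric unit price
SELECT ?product WHERE {
  ?product  a  ex:Product .
  OPTIONAL { ?product  ex:unitPrice  ?price }
  FILTER ( !BOUND (?price) || !isNumeric (?price) )
}

# Products with more than one distinct unit price
SELECT ?product WHERE {
  ?product  a  ex:Product ;
            ex:unitPrice  ?price .
}
GROUP BY ?product
HAVING ( COUNT (DISTINCT ?price) > 1 )

A data shapes declaration would state the same constraint once, instead of as a query per rule.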

The SPIN proposition on the other hand, as Ralph himself put it in the LDBC meeting, is providing to the linked data space functionality that roughly corresponds to SQL views. Well, this is one major point, but SPIN involves more than this.

So, is it DDL or views? These are quite different. I proposed to Phil that there was in fact little point in fighting over this; best to just have two profiles.

To be quite exact, even SQL DDL equivalence is tricky, since enforcing this requires a DBMS; consider, for instance, foreign key and check constraints. At the reception, Phil stressed that SPIN was certainly good but since it could not be conceived without a SPARQL implementation, it was too heavy to use as a filter for an application that, for example, just processed a stream of triples.

The point, as I see it, is that there is a wish to have data shape enforcement, at least to a level, in a form that can apply to a stream without random access capability or general purpose query language. This can make sense for some big data style applications, like an ETL-stage pre-cooking of data before the application. Applications mostly run against a DBMS, but in some cases, this could be a specialized map-reduce or graph analytics job also, so no low cost random access.

My own take is that views are quite necessary, especially for complex query; this is why Virtuoso has the SPARQL macro extension. This will do, by query expansion, a large part of what general purpose inference will do, except for complex recursive cases. Simple recursive cases come down to transitivity and still fit the profile. SPIN is a more generic thing, but has a large intersection with SPARQL macro functionality.

My other take is that structure awareness needs a way of talking about structure. This is a use case that is clearly distinct from views.

A favorite example of mine is the business rule that a good customer is one that has ordered more than 5 times in the last year, for a total of more than so much, and has no returns or complaints. This can be stated as a macro or SPIN rule with some aggregates and existences. This cannot be stated in any of the OWL profiles. When presented with this, Phil said that this was not the use case. Fair enough. I would not want to describe what amounts to SQL DDL in these terms either.
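
For illustration, here is roughly how that rule looks as a query, with a hypothetical ex: vocabulary and made-up thresholds; a macro or SPIN rule would package the same body under a name.

PREFIX ex:  <http://example.org/schema#>   # hypothetical vocabulary
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?customer WHERE {
  ?order  ex:hasCustomer  ?customer ;
          ex:orderDate    ?d ;
          ex:orderTotal   ?total .
  FILTER ( ?d >= "2013-09-26"^^xsd:date )                 # "the last year"
  FILTER NOT EXISTS { ?r  ex:returnedBy    ?customer }    # no returns
  FILTER NOT EXISTS { ?c  ex:complainedBy  ?customer }    # no complaints
}
GROUP BY ?customer
HAVING ( COUNT (?order) > 5 && SUM (?total) > 10000 )     # made-up total threshold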

A related topic that has come up in other conversations is the equivalent of the trigger. One use case of this is enforcement of business rules and complex access rights for updates. So, we see that the whole RDBMS repertoire is getting recreated.

Now, talking from the viewpoint of the structure-aware RDF store, or the triple-stream application for that matter, I will outline some of what data shapes should do. The triggers and views matter is left out, here.

The commonality of bulk load, ETL, and stream processing is that they should not rely on arbitrary database access. This would slow them down. Still, they must check the following sorts of things:

  • Data types
  • Presence of some required attributes
  • Cardinality — e.g., a person has no more than one date of birth
  • Ranges — e.g., a product's price is a positive number; gender is male/female; etc.
  • Limited referential integrity — e.g., a product has one product type, and the referenced object is itself a subject with the RDF type product type.
  • Limited intra-subject checks — e.g., delivery date is greater-than-or-equal-to ship date.

All these checks depend on previous triples about the subject; for example, these checks may be conditional on the subject having a certain RDF type. In a data model with a join per attribute, some joining cannot be excluded. Checking conditions that can be resolved one triple at a time is probably not enough, at least not for the structure-aware RDF store case.
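
To make this concrete, here is a minimal sketch of two such checks as plain SPARQL over the contents of a processing window, with a hypothetical ex: vocabulary; a shapes processor would apply the equivalent logic without needing a full DBMS underneath.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/schema#>   # hypothetical vocabulary

# Cardinality: persons with more than one date of birth
SELECT ?s WHERE {
  ?s  a  foaf:Person ;
      ex:dateOfBirth  ?dob .
}
GROUP BY ?s
HAVING ( COUNT (DISTINCT ?dob) > 1 )

# Intra-subject check: delivery date earlier than ship date
SELECT ?s WHERE {
  ?s  ex:shipDate      ?ship ;
      ex:deliveryDate  ?delivery .
  FILTER ( ?delivery < ?ship )
}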

But, to avoid arbitrary joins which would require a DBMS, we have to introduce a processing window. The triples in the window must be cross-checkable within the window. With RDF set semantics, some reference data may be replicated among processing windows (e.g., files) with no ill effect.

A version of foreign key declarations is useful. To fit within a processing window, complete enforcement may not be possible but the declaration should still be possible, a little like in SQL where one can turn off checking.

In SQL, it is conventional to name columns by prefixing them with an abbreviation of the table name. All the TPC schemas are like that, for example. Generally in coding, it is good to prefix names with data type or subsystem abbreviation. In RDF, this is not the practice. For reuse of vocabularies, where a property may occur in anything, the namespace or other prefix denotes where the property comes from, not where it occurs.

So, in TPC-H, l_partkey and ps_partkey are both foreign keys that refer to part, and l_partkey is also part of a composite foreign key to partsupp. By RDF practices, both would be called rdfh:hasPart. So, depending on which subject type we have, rdfh:hasPart is 30:1 or 4:1 (distinct subjects to distinct objects). Due to this usage, the property's features do not depend only on the property, but on the property plus the subject/object where it occurs.

In the relational model, when there is a parent and a child item (one to many), the child item usually has a composite key prefixed with the parent's key, with a distinguishing column appended, e.g., l_orderkey, l_linenumber. In RDF, this is rdfh:hasOrder as a property of the lineitem subject. In SQL, there is no single-column lineitem key at all, but in RDF a single subject identifier must be made, since everything must be referenceable with a single value. This does not have to matter very much, as long as it is possible to declare that lineitems will be primarily accessed via their order. It is either this or a scan of all lineitems. Sometimes a group of lineitems is accessed by the composite foreign key of l_partkey, l_suppkey. There could be a composite index on these. Furthermore, for each l_partkey, l_suppkey in lineitem there exists a partsupp. In an RDF translation, the rdfh:hasPart and rdfh:hasSupplier, when they occur in a lineitem subject, specify exactly one subject of type partsupp. When they occur in a partsupp subject, they are unique as a pair. Again, because names are not explicit as to where they occur and what role they play, the referential properties do not depend only on the name, but on the name plus the enclosing data shape. Declaring and checking all this is conventional in the mainstream and actually useful for query optimization also.
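
A sketch of the composite-key check in SPARQL, using the rdfh: names from above (the class names and the prefix URI are assumptions): find lineitems whose part/supplier pair has no matching partsupp.

PREFIX rdfh: <http://example.org/tpch#>   # prefix URI assumed

SELECT ?li WHERE {
  ?li  a                 rdfh:lineitem ;   # class names assumed
       rdfh:hasPart      ?part ;
       rdfh:hasSupplier  ?supp .
  FILTER NOT EXISTS {
    ?ps  a                 rdfh:partsupp ;
         rdfh:hasPart      ?part ;
         rdfh:hasSupplier  ?supp .
  }
}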

Take the other example of a social network where the foaf:knows edge is qualified by a date when this edge was created. This may be by reification, or more usually by an "entitized" relationship where the foaf:knows is made into a subject with the persons who know each other and the date of acquaintance as properties. In a SQL schema, this is a key person1, person2 -> date. In RDF, there are two join steps to go from person1 to person2; in SQL, one. This penalty is eliminated by declaring that the foaf:knows entity is usually referenced by the person1 Object or person2 Object, not by the Subject identifier of the foaf:knows.

This allows making the physical storage by O, S, G -> O2, O3, …. A secondary index with S, G, O still allows access by the mandatory subject identifier. In SQL, a structure like this is called a clustered table. In other words, the row is stored contiguously, ordered by a key that is not necessarily the primary key.
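
As a sketch, assuming a hypothetical ex: vocabulary for the entitized edge, the typical access pattern looks like this; the knows-entity ?k is never given by the application, only the persons are, which is exactly why person1 (or person2) should be the clustering key.

PREFIX ex:  <http://example.org/schema#>   # hypothetical vocabulary
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Friends of a given person, with the date of acquaintance
SELECT ?friend ?since WHERE {
  ?k  ex:person1  <http://example.org/person/1234> ;   # ?k is the entitized foaf:knows edge
      ex:person2  ?friend ;
      ex:since    ?since .
  FILTER ( ?since >= "2014-01-01"^^xsd:date )
}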

So, identifying a clustering key in RDF can be important.

Identifying whether there are value-based accesses on a given Object without making the Object a clustering key is also important. This is equivalent to creating a secondary index in SQL. In the tradition of homogenous access by anything, such indexing may be on by default, except if the property is explicitly declared of low cardinality. For example, an index on gender makes no sense. The same is most often true of rdfs:type. Some properties may have many distinct values (e.g., price), but are still not good for indexing, as this makes for the extreme difference in load time between SQL and the all-indexing RDF.

Identifying whether a column will be frequently updated is another useful thing. This will turn off indexing and use an easy-to-update physical representation. Plus, properties which are frequently updated are best put physically together. This may, for example, guide the choice between row-wise and column-wise representation. A customer's account balance and orders year-to-date would be an example of such properties.

Some short string valued properties may be frequently returned or used as sorting keys. This requires accessing the literal via an ID in the dictionary table. Non-string literals, numbers, dates, etc., are always inlined (at least in most implementations), but strings are a special question. Bigdata and early versions of Virtuoso would inline short ones; later versions of Virtuoso would not. So specifying, per property/class combination, a length limit for an inlined string is very high gain and trivial to do. The BSBM explore score at large scales can get a factor of 2 gain just from inlining one label. BSBM is out of its league here, but this is still really true and yields benefits across the board. The simpler the application, the greater the win.

If there are foreign keys, then data should be loaded with the referenced entities first. This makes dimensional clustering possible at load time. If the foreign key is frequently used for accessing the referencing item (for example, if customers are often accessed by country), then loading customers so that customers of the same country end up next to each other can result in great gains. The same applies to a time dimension, which in SQL is often done as a dimension table, but rarely so in linked data. Anyhow, if date is a frequent selection criterion, physically putting items in certain date ranges together can give great gains.

The trick here is not necessarily to index on date, but rather to use zone maps (aka min/max index). If nearby values are together, then just storing a min-max value for thousands of consecutive column values is very compact and fast to check, provided that the rows have nearby values. Actian Vector's (VectorWise) prowess in TPC-H is in part from smart use of date order in this style.

To recap, the data shapes desiderata from the viewpoint of guiding physical storage is as follows:

(I will use "data shape" to mean "characteristic set," or "set of Subjects subject to the same set of constraints." A Subject belonging to a data shape may be determined either by its rdfs:type or by the fact of it having, within the processing window, all or some of a set of properties.)

  • All normal range, domain, cardinality, optionality, etc. Specifically, declaring something as single valued (as with SQL's UNIQUE constraint) and mandatory (as with SQL's NOT NULL constraint) is good.
  • Primary access path — Identifying the Properties whose Objects are the dominant access criteria is important
  • No-index — Declare that no index will be made on the Object of a Property within a data shape.
  • Inlined string — String values of up to so many characters in this data shape are inlined
  • Clustering key — The Subject identifiers will be picked to be correlated with the Object of this Property in this data shape. This can be qualified by a number of buckets (e.g., if dates are from 2000 to 2020, then this interval may be 100 buckets), with an exception bucket for out of range values.
  • No full text index — A string value will not need to be full text indexed in this Property even if full text indexing is generally on.
  • Full text index desired — This means that if the value of the property is a string, then the row must be locatable via this string. The string may or may not be inlined, but an index will exist on the literal ID of the string, e.g., POSG.
  • Co-location — This is akin to clustering but specifies, for a high cardinality Object, that the Subject identifier should be picked to fall in the same partition as the Object. The Object is typically a parent of the Subject being loaded; for example, the containing assembly of a sub-assembly. Traversing the assembly created in this way will be local on a scale-out system. This can also apply to geometries or text values: If primary access is by text or geo index, then the metadata represented as triples should be in the same partition as the entry in the full text/geo index.
  • Update group — A set of properties that will often change together. Implies no index and some form of co-location, plus update-friendly physical representation. Many update groups may exist, in which case they may or may not be collocated.
  • Composite foreign/primary key. A data shape can have a multicolumn foreign key, e.g., l_partkey, l_suppkey in lineitem with the matching primary key of ps_partkey, ps_suppkey in partsupp. This can be used for checking and for query optimization: Looking at l_partkey and l_suppkey as independent properties, the guess would be that there hardly ever exists a partsupp, whereas one does always exist. The XML standards stack also has a notion of a composite key for random access on multiple attributes.

These things have the character of "hints for physical storage" and may all be ignored without effect on semantics, at least if the data is constraint-compliant to start with.
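
Purely for illustration, such hints could travel as triples alongside the data; the ds: vocabulary below is entirely hypothetical, and a store that does not understand it simply ignores it.

PREFIX ds:   <http://example.org/data-shapes#>   # entirely hypothetical vocabulary
PREFIX rdfh: <http://example.org/tpch#>          # prefix URI assumed

INSERT DATA {
  ds:lineitemShape
      ds:appliesToClass   rdfh:lineitem ;
      ds:primaryAccess    rdfh:hasOrder ;                       # lineitems are reached via their order
      ds:clusteringKey    rdfh:hasOrder ;
      ds:noIndex          rdfh:quantity ;
      ds:inlineStringMax  40 ;                                  # inline strings up to 40 characters
      ds:compositeKey     ( rdfh:hasPart rdfh:hasSupplier ) .   # pair refers to partsupp
}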

These things will have some degree of reference implementation through the evolution of Virtuoso structure awareness, though not necessarily immediately. These are, to the semanticist, surely dirty low-level disgraceful un-abstractions, some of the very abominations the early semanticists abhorred or were blissfully ignorant of when they first raised their revolutionary standard.

Still, these are well-established principles of the broader science of database. SQL does not standardize some of these, nor does it have much need to, as the use of these features is system-specific. The support varies widely and the performance impacts are diverse. However, since RDF excels as a reference model and as a data interchange format, giving these indications as hints to back-end systems cannot hurt, and can make a difference of night and day in load and query time.

As Phil Archer said, the idea of RDF Data Shapes is for an application to say that "it will barf if it gets data that is not like this." An extension is for the data to say what the intended usage pattern is so that the system may optimize for this.

All these things may be learned from static analysis and workload traces. The danger of this is over-fitting a particular profile. This enters a gray area in benchmarking. For big data, if RDF is to be used as the logical model and the race is about highest absolute performance, never mind what the physical model ends up being, all this and more is necessary. And if one is stretching the envelope for scale, the race is always about highest absolute performance. For this reason, these things will figure at the leading edge with or without standardization. I would say that the build-up of experience in the RDBMS world is sufficient for these things to be included as hints in a profile of data shapes. The compliance cost will be nil if these are ignored, so for the W3C, these will not make the implementation effort for compliance with an eventual data shapes recommendation prohibitive.

The use case is primarily the data warehouse to go. If many departments or organizations publish data for eventual use by their peers, users within the organization may compose different combinations of extractions for different purposes. Exhaustive indexing of everything by default makes the process slow and needlessly expensive, as we have seen. Much of such exploration is bounded by load time. Federated approaches for analytics are just not good, even though they may work for infrequent lookups. If datasets are a commodity to be plugged in and out, the load and query investment must be minimized without the user/DBA having to run workload analysis and manual schema optimization. Therefore, bundling guidelines such as these with data shapes in a dataset manifest can do no harm and can in cases provide 10-50x gains in load speeds and 2-4x in space consumption, not to mention unbounded gains in query time, as good and bad plans easily differ by 10-100x, especially in analytics.

So, here is the pitch:

  • Dramatic gains in ad hoc user experience
  • Minimal effort by data publishers, as much of the physical guidelines can be made from workload trace and dataset; the point is that the ad hoc user does not have to do this.
  • Great optimization potential for system vendors; low cost for initial compliance
  • Better understanding of the science of performance by the semantic community

To be continued...

SEMANTiCS 2014 Series

17:17

SEMANTiCS 2014 (part 1 of 3): Keynote

I was invited to give a keynote at SEMANTiCS 2014 in Leipzig, Germany last Thursday. I will here recap some of the main points, and comment on some of the ensuing controversy. The talk was initially titled Virtuoso, the Prometheus of RDF. Well, mythical Prometheus did perform a service but ended up paying for it. Still, the mythical reference is sometimes used when talking of major breakthroughs and big-gain ambitions. In the first slide, I changed it to Linked Data at Dawn, which is less product specific and more a reflection on the state of the linked data enterprise at large.

The first part of the talk was under the heading of the promise and the practice. The promise we know well and find no fault with: Schema-last-ness, persistent unique identifiers, self-describing data, some but not too much inference. The applications usually involve some form of integration and often have a mix of strictly structured content with semi-structured or textual content.

These values are by now uncontroversial and embraced by many; however, most instances of this embracing do not occur in the context of RDF as such. For example, the big online systems on the web all have some schema-last (key-value) functionality. Applications involving long-term data retention have diverse means of having persistent IDs and self-description, from UUIDs to having the table name in a column so that one can tell where a CSV dump came from.

The practice involves competing with diverse alternative technologies: SQL, key-value, information retrieval (often Lucene-derived). In some instances, graph databases occur as alternatives: Young semanticist, do or die.

In this race, linked data is often the prettiest and most flexible, but gets a hit on different aspects of performance and scalability. This is a database gig, and database is a performance game; make no mistake.

After these preliminaries we come to the "RDF tax," or the more or less intrinsic overheads of describing all as triples. The word "triple" is used by habit. In fact, we nearly always talk about quads, i.e., subject-predicate-object-graph (SPOG). The next slide is provocatively titled the Bane of the Triple, and is about why having all as triples is, on the surface, much like relational, except it makes life hard, where tables make it at least manageable, if still not altogether trivial.

The very first statement on the tax slide reads "90% of bad performance comes from non-optimal query plans." If one does triples in the customary way (i.e., a table of quads plus dictionary tables to map URIs and literal strings to internal IDs), one incurs certain fixed costs.

These costs are deemed acceptable by users who deploy linked data. If these costs were not acceptable, the proof of concept would have already disqualified linked data.

The support cases that come my way are nearly always about things taking too much time. Much less frequently are these about something unambiguously not working. Database has well defined semantics, so whether something works or not is clear cut.

So, support cases are overwhelmingly about query optimization. The problems fall in two categories:

  • The plan is good in the end, but it takes much longer to make the plan than to execute it.
  • The plan either does the wrong things or does things in the wrong order, but produces a correct result.

Getting no plan at all or getting a clearly wrong result is much less frequent.

If the RDF overheads incurred with a good query plan were show stoppers, the show would have already stopped.

So, let's look at this in more detail; then we will talk about the fixed overheads.

The join selectivities of triple patterns are correlated. Some properties occur together all the time; some occur rarely; some not at all. Some property values can be correlated, e.g., order number and order date. Capturing these by sampling in a multicolumn table is easy; capturing this in triples would require doing the join in the cost model, which is not done since it would further extend compilation times. When everything is a join, selectivity estimation errors build up fast. When everything is a join, the space of possible graph query plans explodes as opposed to tables; thus, while the full plan space can be covered with 7 tables, it cannot be covered with 18 triple patterns. This is not just factorial (the number of permutations). For different join types (index/hash) and the different compositions of the hash build side, it is much worse, in some nameless outer space fringe of non-polynomiality.

TPC-H can be run with success because the cost model hits the right plan every time. The primary reason for this is the fact that the schema and queries unambiguously suggest the structure, even without foreign key declarations. The other reason is that with a handful of tables, all plans can be reviewed, and the cost model reliably tells how many rows will result from each sequence of operations.

Try this with triples; you will know what I mean.

Now, some people have suggested purely rule-based models of SPARQL query compilation. These are arguably faster to run and more predictable. But the thing that must be done, yet will not be done with these, is the right trade-off between index and hash. This is the crux of the matter, and without this, one can forget about anything but lookups. The choice depends on reliable estimation of cardinality (number of rows, number of distinct keys) on either side of the join. Quantity, not pattern matching.

Well, many linked data applications are lookups. The graph database API world is sometimes attractive because it gives manual control. Map reduce in the analytical space is sometimes attractive for the same reason.

On the other hand, query languages also give manual control, but then this depends on system specific hints and cheats. People are often black and white: Either all declarative or all imperative. We stand for declarative, but still allow physical control of plan, like most DBMS.

To round off, I will give a concrete example:

{  ?thing  rdfs:label    ?lbl         . 
   ?thing  dc:title      ?title       . 
   ?lbl    bif:contains  "gizmo"      . 
   ?title  bif:contains  "widget"     . 
   ?thing  a             xx:Document  . 
   ?thing  dc:date       ?dt          . 
   FILTER  ( ?dt  > "2014-01-01"^^xsd:date ) 
}

There are two full text conditions, one date, and one class, all on the same subject. How do you do this? Most selective text first, then get the data and check, then check the second full text given the literal and the condition, then check the class? Wrong. If widgets and gizmos are both frequent and most documents new, this is very bad because using a text index to check for a specific ID having a specific string is not easily vectorable. So, the right plan is: Take the more selective text expression, then check the date and class for the results, put the ?things in a hash table. Then do the less selective text condition, and drop the ones that are not in the hash table. Easily 10x better. Simple? In the end yes, but you do not know this unless you know the quantities.

This gives the general flavor of the problem. Doing this with TPC-H in RDF is way harder, but you catch my drift.

Each individual instance is do-able. Having closer and closer alignment between reality and prediction will improve the situation indefinitely, but since the space is as good as infinite there cannot be a guarantee of optimality except for toy cases.

The Gordian Knot shall not be defeated with pincers but by the sword.

We will come to this in a bit.

Now, let us talk of the fixed overheads. The embarrassments are in the query optimization domain; the daily grind, relative cost, and provisioning are in this one.

The overheads come from:

  • Indexing everything
  • Having literals and URI strings via dictionary
  • Having a join for every attribute

These all fall under the category of having little to no physical design room.

In the indexing everything department, we load 100 GB TPC-H in 15 minutes in SQL with ordering only on primary keys and almost no other indexing. The equivalent with triples is around 12 hours. This data can be found on this blog (TPC-H series and Meeting the Challenges of Linked Data in the Enterprise). This is on the order of confusing a screwdriver with a hammer. If the nail is not too big, the wood not too hard, and you hit it just right — the nail might still go in. The RDF bulk load is close to the fastest possible given the general constraints of what it does. The same logic is used for the record-breaking 15 minutes of TPC-H bulk load, so the code is good. But indexing everything is just silly.

The second, namely the dictionary of URIs and literals, is a dual edge. I talked to Bryan Thompson of SYSTAP (Bigdata RDF store) in D.C. at the ICDE there. He said that they do short strings inline and long ones via dictionary. I said we used to do the same but stopped in the interest of better compression. What is best depends on workload and working-set-to-memory ratio. But if you must make the choice once and for all, or at least as a database-wide global setting, you are between a rock and a hard place. Physical vs. logical design, again.

The other aspect of this is the applications that do regexps on URI strings or literals. Doing this is like driving a Formula 1 race in reverse gear. Use a text index. Always. This is why most implementations have one even though SPARQL itself makes no provisions for this. If you really need regexps, and on supposedly opaque URIs at that, tokenize them and put them in a text index as a text literal. Or if an inverted-file-word index is really not what you need, use a trigram one. So far, nobody has wanted one hard enough for us to offer this, even though this is easy enough. But special indices for special data types (e.g., chemical structure) are sometimes wanted, and we have a generic solution for all this, to be introduced shortly on this blog. Again, physical design.

I deliberately name the self-join-per-attribute point last, even though this is often the first and only intrinsic overhead that is named. True, if the physical model is triples, each attribute is a join against the triple table. Vectored execution and right use of hash-join help, though. The Star Schema Benchmark SQL to SPARQL gap is only 2.5x, as documented last year on this blog. This makes SPARQL win by 100+x against MySQL and lose by only 0.8x against column store pioneer MonetDB. Let it be said that this is so far the best case and that the gap is wider in pretty much all other cases. This gap is well and truly due to the self-join matter, even after the self-joins are done vectored, local, ordered; in one word, right. The literal and URI translation matter plays no role here. The needless indexing hurts at load but has no effect at query time, since none of the bloat participates in the running. Again, physical design.

Triples are done right, so?

In the summer of 2013, after the Star Schema results, it became clear that maybe further gains could be had and query optimization made smoother and more predictable, but that these would be paths of certain progress but with diminishing returns per effort. No, not the pincers; give me the sword. So, between fall 2013 and spring 2014, aside from doing diverse maintenance, I did the TPC-H series. This is the proficiency run for big league databases; the America's Cup, not a regatta on the semantic lake.

Even if the audience is principally Linked Data, the baseline must be that of the senior science of SQL.

It stands to reason and has been demonstrated by extensive experimentation at CWI that RDF data, by and large, has structure. This structure will carry linked data through the last mile to being a real runner against the alternative technologies (SQL, IR, key value) mentioned earlier.

The operative principles have been mentioned earlier and are set forth on the slides. In forthcoming articles I will display some results.

One important proposal for structure awareness was by Thomas Neumann in an RDF3X paper introducing characteristic sets. There, the application was creation of more predictable cost estimates. Neumann correctly saw this as possibly the greatest barrier to predictable RDF performance. Peter Boncz and I discussed the use of this for physical optimization once when driving back to Amsterdam from a LOD2 review in Luxembourg. Pham Minh Duc of CWI did much of the schema discovery research, documented in the now published LOD2 book (Linked Open Data -- Creating Knowledge Out of Interlinked Data). The initial Virtuoso implementation had to wait for the TPC-H and general squeezing of the quads model to be near complete. It will likely turn out that the greatest gain of all with structure awareness will be bringing optimization predictability to SQL levels. This will open the whole bag of tricks known to data warehousing to safe deployment for linked data. Of course, much of this has to do with exploiting physical layout; hence it also needs the physical model to be adapted. Many of these techniques have high negative impact if used in the wrong place; hence the cost model must guess right. But they work in SQL and, as per Thomas Neumann's initial vision, there is no reason why these would not do so in a schema-less model if adapted in a smart enough manner.

All this gives rise to some sociological or psychological observations. Jens Lehmann asked me why now, why not earlier; after all, over the years many people have suggested property tables and other structured representations. This is happening now because there are no further breakthroughs to be had within an undifferentiated physical model.

For completeness, we must here mention other approaches to alternative, if still undifferentiated, physical models. A number of research papers mention memory-only, pointer-based (i.e., no index, no hash-join) implementations of triples or quads. Some of these are on graph processing frameworks, some stand-alone. Yarc Data is a commercial implementation that falls in this category. These may have higher top speeds than column stores, even after all vectoring and related optimizations. However, the space utilization is perforce larger than with optimum column compression, and this, plus the requirement of having 100% of the data in memory, makes these more expensive to scale. The linked data proposition is usually about integration, and this implies initially large data even if not all ends up being used.

The graph analytics, pointer-based item will be especially good for a per-application extraction, as suggested by Oracle in their paper at GRADES 13. No doubt this will come under discussion at LDBC, where Oracle Labs is now a participant.

But back to physical model. What we have in mind is relational column store — multicolumn-ordered column-wise compressed tables — a bit like Vertica and Virtuoso in SQL mode for the regular parts and quads for the rest. What is big is regular, since a big thing perforce comes from something that happens a lot, like click streams, commercial transactions, instrument readings. For the 8-lane-motorway of regular data, you get the F1 racer with the hardcore best in column store tech. When the autobahn ends and turns into the mountain trail, the engine morphs into a dirt bike.

This is complex enough, and until all the easy gains have been extracted from quads, there is little incentive. Plus this has the prerequisite of quads done right, plus the need for top of the line relational capability for not falling on your face once the speedway begins.

Steve Buxton of MarkLogic gave a talk right before mine. Coming from a document-centric world, it stands to reason that MarkLogic would have a whole continuum of different mixes between SPARQL and document oriented queries. Steve correctly observed that some users found this great; others found this a near blasphemy, an unholy heterodoxy of confusing distinct principles.

This is our experience as well, since usage of XML fragments in SPARQL with XPath and such things in Virtuoso is possible but very seldom practiced. This is not the same as MarkLogic, though, as MarkLogic is about triples-in-documents, and the Virtuoso take is more like documents-in-triples. Not to mention that use of SQL and stored procedures in Virtuoso is rare among the SPARQL users.

The whole thing about the absence of physical design in RDF is a related, but broader instance of such purism.

In my talk, I had a slide titled The Cycle of Adventure, generally philosophizing on the dynamics of innovation. All progress begins with an irritation with the status quo; to mention a few examples: the No-SQL rebellion; the rejection of parallel SQL database in favor of key-value and map-reduce; the admission that central schema authority at web scale is impossible; the anti-ACID stance when having wide-area geographies to deal with. The stage of radicalism tends to discard the baby with the bathwater. But when the purists have their own enclave, free of the noxious corruption of the rejected world, they find that life is hard and defects of human character persist, even when all subscribe to the same religion. Of course, here we may have further splinter groups. After this, the dogma adapts to reality: the truly valuable insights of the original rebellion gain in appreciation, and the extremism becomes more moderate. Finally there is integration with mainstream, which becomes enriched by new content.

By the time the term Linked Data came to broad use, the RDF enterprise had its break-away colonies that started to shed some of the initial zeal. By now, we have the last phase of reconciliation in its early stages.

This process is in principle complete when linked data is no longer a radical bet, but a technology to be routinely applied to data when the nature of the data fits the profile. The structure awareness and other technology discussed here will mostly eliminate the differential in deployment cost.

The spreading perception of an expertise gap in this domain will even-out the cost in terms of personnel. The flexibility gains that were the initial drive for the movement will be more widely enjoyed when these factors fuel broader adoption.

To help this along, we have LDBC, the Linked Data Benchmark Council, with the agenda of creating industry consensus on measuring progress across the linked data and graph DB frontiers. I duly invited MarkLogic to join.

There were many other interesting conversations at the conference; I will comment on these later.

To be continued...

SEMANTiCS 2014 Series

11:34

SEMANTiCS – the emergence of a European Marketplace for the Semantic Web

The SEMANTiCS conference celebrated its 10th anniversary this September in Leipzig. And this year's edition managed to open a new age for the Semantic Web in Europe: a marketplace for the next generation of semantic technologies was born.


As Phil Archer stated in his keynote, the Semantic Web is now mature, and academia and industry can be proud of the achievements so far. Exactly this fact set the thread for the conference: real-world use cases demonstrated by industry representatives, new and already running applied projects presented by the leading consortia in the field, and a vibrant academic community showing the next ideas and developments. So this year's SEMANTiCS conference brought together the European community in Semantic Web technology, both from academia and industry.

  • Papers and Presentations: 45 (50% of them industry talks)
  • Posters: 10 (out of 22)
  • A marketplace with 11 permanent booths
  • Presented Vocabularies at the 1st Vocabulary Carnival: 24
  • Attendance: 225
  • Geographic Coverage: 21 countries

This year’s SEMANTiCS was co-located and connected with a couple of other related events, like the German ISKO, the Multilingual Linked Open Data for Enterprises (MLODE 2014), and the 2nd DBpedia Community Meeting 2014. These wisely connected gatherings brought people together and allowed transdisciplinary exchange.

To recap: this SEMANTiCS has opened up new perspectives on semantic technologies when it comes to

  • industry use
  • problem solving capacity
  • next generation development
  • knowledge about top companies, institutes and people in the sector

08:30

Announcing the CKAN Association Steering Group

We are delighted to announce that we have finalized the initial membership of the CKAN Association Steering Group. At present it consists of 4 organizations and their representatives as follows:

  • Antonio Acuña (*), Head of Data.gov.uk, UK Cabinet Office
  • Jeanne Holm, Evangelist, Data.gov, U.S. General Services Administration (Data.gov)
  • Pat McDermott and Ashley Casovan, Open Government Secretariat, Treasury Board of Canada (data.gc.ca)
  • Rufus Pollock (+), President (Open Knowledge)

(*) indicates Chair of the Steering Group
(+) indicates Secretary of the Steering Group

Steering Group Members will serve an initial term of 2 years and serve “ex officio”, representing their organizations. We will continue to review Steering Group membership and to consider potential new members as we go forward.

More Information About the Steering Group

The Steering Group is a key part of the CKAN Association. It is made up of key stakeholders who have committed to oversee and steer the CKAN Association going forward. The initial selection of the steering committee was coordinated by Open Knowledge.

Full details of the Steering Group can be found on the Steering Group page.

September 07 2014

20:10

PoolParty Server Release 4.5

PoolParty Server 4.5, a major release of the PoolParty suite, is available now. Learn more about the new features and improvements:

  • Automatic Batch Linking between Taxonomies
    For linked projects, a batch linking mechanism is available. The mechanism makes it possible to calculate links between whole projects or between subtrees of projects.

  • Create SKOS-XL labels with PoolParty
    PoolParty has been extended to provide the ability to create SKOS-XL labels for concepts and add custom relations and attributes for SKOS-XL labels.

  • Improved Free Terms suggestion in Corpus management
    A quality measure has been integrated into corpus management to show the quality of suggested terms.

  • Extended MS Excel import
    Excel import now covers all SKOS properties and also includes all custom schema properties.

Find all new features, improvements and bug fixes in our release notes.

September 01 2014

10:49

Webinar: Taxonomy management & content management – well integrated!

With the arrival of semantic web standards and linked data technologies, new options for smarter content management and semantic search have become available. Taxonomies and metadata management should play a central role in your content management system: by combining text mining algorithms with taxonomies and knowledge graphs from the web, more accurate annotation and categorization of documents and more complex queries over text-oriented repositories like SharePoint, Drupal, or Confluence become possible.

Nevertheless, the predominant opinion that taxonomy management is a tedious process currently impedes a widespread implementation of professional metadata strategies.

In this webinar, key people from the Semantic Web Company will describe how content management and collaboration systems like SharePoint, Drupal or Confluence can benefit from professional taxonomy management. We will also discuss why taxonomy management is not necessarily a tedious process when well integrated into content management workflows.

In this webinar, you will learn about the following questions and topics:

… how standards like SKOS build the foundation of enterprise taxonomies to be linked with and enriched by additional vocabularies and taxonomies
… how knowledge graphs can be used to build the backbone of metadata services in organizations
… how automatic text mining can be used to create high-quality taxonomies and thesauri
… how automatic tagging can be integrated into existing content workflows of your SharePoint, Drupal or Confluence system
… how search-driven applications will become feasible when accurate metadata becomes available

Based on PoolParty Semantic Platform, you will see several live demos of end-user applications based on taxonomies and linked data.

We will showcase PoolParty’s latest release which provides outstanding facilities for professional linked data management, including entity extraction, PowerTagging, corpus analysis, taxonomy, thesaurus and ontology management.

Register for free: Webinar “Taxonomy management & content management – well integrated!”, October 8th, 2014

August 29 2014

16:49

LOD2 Finale (part 3 of n): The 500 Giga-triple Runs

In the evening of day 8, we have kernel settings in the cluster changed to allow more mmaps. At this point, we notice that the dataset is missing the implied types of products; i.e., the most specific type is given but its superclasses are not directly associated with the product. We have always run this with this unique inference materialized, which is also how the data generator makes the data, with the right switch. But the switch was not used. So a further 10 Gt (Giga-triples) are added, by running a SQL script to make the superclasses explicit.
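
The run used a SQL script over the internal tables, but in SPARQL terms the materialization amounts to something like the following (graph names omitted); this is a sketch, not the script that was actually run.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Make the implied superclass types explicit
INSERT { ?s  a  ?super }
WHERE  { ?s  a  ?type .
         ?type  rdfs:subClassOf+  ?super }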

At this point, we run BSBM explore for the first time. To what degree does the 37.5 Gt predict the 500 Gt behavior? First, there is an overflow that causes a query plan cost to come out negative if the default graph is specified. This is a bona fide software bug you don't get unless a sample is quite large. Also, we note that starting the databases takes a few minutes due to disk. Further, the first query takes a long time to compile, again because of sampling the database for overall statistics.

The statistics are therefore gathered by running a few queries, and then saved. Subsequent runs will reload the stats from the file system, saving some minutes of start time. There is a function for this, stat_import and stat_export. These are used for a similar purpose by some users.

On day 10, Wednesday August 20, we have some results of BSBM explore.

Then, we get into BSBM updates. The BSBM generator makes an update dataset, but it cannot be made large enough. The BSBM test driver suite is by now hated and feared in equal measure. Is it bad in and of itself? Depends. It was certainly not made for large data. Anyway, no fix will be attempted this time. Instead, a couple of SQL procedures are made to drive a random update workload. These can run long enough to get a steady state with warm cache, which is what any OLTP measurement needs.

On day 12, some updates are measured, with a one hour ramp-up to steady-state, but these are not quite the right mix, since these are products only and the mix needs to contain offers and reviews also. The first steady-state rate was 109 Kt/s, a full 50x less than the bulk load, but then this was very badly bound by latency. So, the updates are adjusted to have more variety. The final measurement was on day 17. Now the steady-state rate is 2563 Kt/s, which is better but still quite bound by network. By adding diversity to the dataset, we get slammed by a sharp rise in warm-up time (now 2 hours to be at 230 Kt/s), at which point we launch the explore mix to be timed during update. Time is short and we do not want to find out exactly how long it takes to get the plateau in insert rate. As it happens, the explore mix is hardly slowed down by the updates, but the updates get hit worse, so that the rate goes to about 1/3 of what it was, then comes back up when the explore is finished. Finally, half an hour after this, there is a steady state of 263 Kt/s update rate.

Of course, the main object of the festivities is still the business intelligence (BI) mix. This is our (specifically, Orri's) own invention from years back, subsequently formulated in SPARQL by FU Berlin (Andreas Schultz). Well, it is already something to do big joins with 150 Gt, all on index and vectored random access, as was done in January 2013, the last time results were published on the CWI cluster. You may remember that there was an aborted attempt in January 2014. So now, with the LOD2 end date under two weeks away, we will take the BI racer out for a spin with 500 Gt. This is now a very different proposition from Jan 2013, as we have by now done the whole TPC-H work documented on this blog. This serves to show, inter alia, that we can run with the best in the much bigger and harder mainstream database sports. The full benefits of this will be realized for the semantic data public still this year, so this is more than personal vanity.

So we will see. The BI mix is not exactly TPC-H, but what is good for one is good for the other. Checking that the plans are good on the 37 Gt scale model is done around day 12. On day 13, we try this on the larger cluster. You never know — pushing the envelope, even when you know what you are doing and have written the whole thing, is still a dive in the fog. Claiming otherwise would be a lie lacking credibility. The iceberg which first emerges is overflow and partition skew. Well, there can be a lot of messages if all messages go via the same path. So we make the data structure different and retry and now die from out of memory. On the scale model, this looks like a little imbalance you don't bother to notice; at 13x scale, this kills. So, as is the case with most database problems, the query plan is bad. Instead of using a PSOG index, it uses a POSG index, and there is a constant for O. Partitioning is on either S or O, whichever is first. Not hard to fix, but still needs a cost-model adjustment to penalize low-cardinality partition columns. This is something you don't get with TPC-H, where there are hardly any indices. Once this is fixed there are other problems, such as Q5, which we ended up leaving out. The scale model is good; the large one does not produce a plan, because some search-space corner is visited that is not visited in the scale model, due to different ratios of things in the cost model. Could be a couple of days to track; this is complex stuff. So we dropped it. It is not a big part of the metric, and its omission is immaterial to the broader claim of handling 500 Gt in all safety and comfort. The moral is: never get stuck; only do what is predictable, insofar as anything in this shadowy frontier is such.

So, on days 15 and 16, the BI mix that is reported was run. The multiuser score was negatively impacted by memory skew, so some swapping on one of the nodes, but the run finished in about 2 hours anyway. The peak of transient memory consumption is another thing that you cannot forecast with exact precision. There is no model for that; the query streams are in random order, and you just have to try. And it is a few hours per iteration, so you don't want to be stuck doing that either. A rerun would get a higher multiuser BI score; maybe one will be made but not before all the rest is wrapped up.

Now we are talking 2 hours, versus 9 hours with the 150 Gt set back in January 2013. So 3.3x the data, 4.5x less time, 1.5x the gear. This comes out at one order of magnitude. With a better score from better memory balance and some other fixes, a 15x improvement for BSBM BI is in the cards.

The final explore runs were made on day 18, while writing the report to be published at the LOD2 deliverables repository. The report contains in depth discussion on the query plans and diverse database tricks and their effectiveness.

The overall moral of this trip into these uncharted spaces is this: Expect things to break. You have to be the designer and author of the system to take it past its limits. You will cut it or you won't, and nobody can do anything about it, not with the best intentions, nor even with the best expertise, which both were present. This is true of the last minute daredevil stuff like this; if you have a year full time instead of the last 20 days of a project, all is quite different, and these things are more leisurely. This might then become a committee affair, though, which has different problems. In the end, the Virtuoso DBMS has never thrown anything at us we could not handle. The uncertainty in trips of this sort is with the hardware platform, of which we had to replace 2 units to get on the way, and with how fast you can locate and fix a software problem. So you pick the quickest ones and leave the uncertain aside. There is another category of rare events like network failures that in theory cannot happen. Yet they do. So, to program a cluster, you have to have some recovery things for these. We saw a couple of these along the way. Duplication of these can take days, and whether this correlates with specific links or is a bona fide software thing is time consuming to prove, and getting into this is a sure way to lose the race. These seem to be load peaks outside of steady-state; steady-state is in fact very steady once it is there. Except at the start, network glitches were not a big factor in these experiments. The bulk of these went away after replacing a machine. After this we twice witnessed something that cannot exist but knew better than to get stuck with that. Neither incident happened again. This is days of running at a cross sectional 1 GB/s of traffic. These are the truly unpredictable, and, in a crash course like this, can sink the whole gig no matter how good you are.

Thanks are due to CWI and especially Peter Boncz for providing the race track as well as advice and support.

In the next installments of this series, we will look at how schema and characteristic sets will deliver the promise of RDF without its cost. All the experiments so far were done with a quads table, as always before. So we could say that the present level is close to the limit of the achievable within this physical model. The future lies beyond the misconception of triples/quads as primary physical model.

To be continued...

LOD2 Finale Series
