Skip to content

Addressing SPARQL EXISTS errata #156

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
afs opened this issue Sep 20, 2024 · 25 comments
Open

Addressing SPARQL EXISTS errata #156

afs opened this issue Sep 20, 2024 · 25 comments
Labels
ErratumRaised Errata management: proposed erratum

Comments

@afs
Copy link
Contributor

afs commented Sep 20, 2024

Recap

TPAC 2023 presentation

Issues: sparql-query/issues for EXISTS

After TPAC 2023, an email was sent to interested parties.

Proposals

1:: Improved substitution
SEP-0007/Substitution

2:: SemiJoin/Antijoin
https://w3c.github.io/sparql-exists/docs/sparql-exists.html#proposal-a

Proposal 1

Proposal 1 is based on errata for the "Substitution" operation.

Full details including relationship to errata: SEP-0007/Substitution.

Proposal 2

Proposal 2 is SemiJoin/AntiJoin.

SPARQL already has MINUS which is an antijoin with a special condition for the case of disjoint domains (a decision of the SPARQL 1.1 working group).

A way forward.

A compromise way forward:

  1. Replace "Substitute" with the errata-derived fix SEP-0007/Substitution

  2. Plan for adding LATERAL, SEMIJOIN and ANTIJOIN (both pure forms) to the SPARQL language. This may have to be additional features added in the "new features" phase due to timing.

@pchampin
Copy link
Contributor

This was discussed during the rdf-star meeting on 26 September 2024.

View the transcript

Addressing SPARQL EXISTS errata 4

ora: Are there people fine with the current syntax?

ora: In any case, chairs will discuss this, let's move on

AndyS: [about SPARQL EXISTS] There are two proposals

AndyS: 1. substitution based on various existing errata

AndyS: 2. an other one based on ANTIJOIN. We already have MINUS. Except the behavior with disjoin domain. But outside of it it's ANTIJOIN

AndyS: On an other note, there are other things that might go to SPARQL like LATERAL that can be based on substitution. And pure form of anti join and semi join

AndyS: It's a possibility to move these additions (LATERAL, anti join...) to sparql dev

pchampin: we would add more subtly differences between operators like FILTER NOT EXISTS vs MINUS

pchampin: Your point of having multiple ways might create problems

ora: SPARQL spec spends a bit of time presenting this difference

AndyS: It was quite contentious in SPARQL 1.1

<pchampin> I'm more than happy to let the editors decide on that

AndyS: I am not aware of any outgoing opinion, I think it ends up to a choice on which way to go

tl: is it related to triple terms in any way of is it a SPARQL errata

AndyS: it has nothing to do with triple terms

tl: what is the criteria of SPARQL errata to discuss now?

tl: it's a central issue, is that the argument?

pfps: There are a bunch of problems with SPARQL, the ones with EXIST are the biggies

pfps: They end up splitting the SPARQL implementation space

pfps: The decision that has to be made is to move SPARQL EXIST toward a more database-like implementation and keep it more consistent with the existing

AndyS: The current implementation is present in SQL with correlated subqueries

pfps: if you use the semi/anti join interepretation of EXISTS you change SPARQL more than the other option

pfps: In the end people who will see and understand the differences are very few

ora: I would like to know preferences

AndyS: My preference is for substitution and applying errata (option 1)

pfps: I don't have much of a horse in this race

pfps: Idealy I would love to get more SPARQL developers on board

ora: we could talk outside of the group

ktk: I reached out to stardog but not got an input

gtw: I am not sure much value to reach out to more developers. sparql-dev has been opened for a long time

<pchampin> Tpt: I have a signicant preference for option 1; option 2 is basically equivalent to MINUS

pfps: One way to check the issue would be to pull some tests

<pfps> which PR?

<gkellogg> w3c/rdf-tests#42

<gb> Issue 42 tests to document current definition of EXISTS in SPARQL (by pfps) [SPARQL]

<gkellogg> w3c/rdf-tests#43

<gb> CLOSED Pull Request 43 Add tests to document current definition of EXISTS (by pfps)

ora: Whatever solutions we pick, someone will ask why we pick it

AndyS: picking sustitution breaks the least queries

ora: That seems to me a as good reason as any, let's make a decision

tl: I would like to ask james about it

ora: Let's vote on it next Thursday

ora: Let's do it


@klinovp
Copy link

klinovp commented Sep 27, 2024

Here's my 2c as a query engine tech lead at Stardog: I prefer fixing the substitution semantics, making it a part of the spec, and keeping EXISTS as a form of correlated semi-join based on substitution. Two main reasons:

  • fixing substitution is important beyond EXISTS. It's naturally used, for example, for parameterised SPARQL queries (not standardised in 1.1). It has uses in SHACL (SPARQL constraints). It's important that its semantics is clear and understandable. However, it tends to be much, much trickier than most SPARQL users tend to think and there're multiple, non-equivalent ways how it can be defined.
  • if [NOT] EXISTS becomes an uncorrelated semi[anti]-join with the standard bottom-up evaluation semantics, I don't see how we can still support cases like
SELECT * {
  ?s :p ?o
  FILTER NOT EXISTS { [] :q ?t . FILTER (?t > ?o)  }
}

I have seen lots of queries like this over the years where the bottom-up eval would produce zero results (I know this because Stardog, like many relational query engines, has a de-correlation optimiser and that has to carefully analyse for this sort of cases).

Eventually, I would really like to see LATERAL as a part of a future SPARQL standard and that also will require substitution (Stardog implements it using the SERVICE syntax). I think it's important that the next release of the spec actually uses the term "correlated" in the text when it articulates the differences between EXISTS and MINUS, so that introduction of general correlated subqueries through LATERAL then looks like a natural next step in the process.

@VladimirAlexiev
Copy link

@domel sent us this "questionnaire" by email. I'll paste it below, because I find it a useful roadmap to this topic.

The RDF-star Working Group is currently addressing issues related to updating the semantics of SPARQL EXISTS, specifically regarding:

  • Certain uses of EXISTS being undefined during evaluation,
  • Substitution occurring where definitions apply only to variables,
  • Blank nodes being substituted into BGPs and acting as variables,
  • Substitution potentially flipping MINUS into its disjoint-domain case,
  • Substitution impacting disconnected variables.

Would you be able to provide your insights on these matters

@afs
Copy link
Contributor Author

afs commented Sep 27, 2024

@domel sent us this "questionnaire" by email. I'll paste it below, because I find it a useful roadmap to this topic.

The RDF-star Working Group is currently addressing issues related to updating the semantics of SPARQL EXISTS, specifically regarding:

  • Certain uses of EXISTS being undefined during evaluation,
  • Substitution occurring where definitions apply only to variables,
  • Blank nodes being substituted into BGPs and acting as variables,
  • Substitution potentially flipping MINUS into its disjoint-domain case,
  • Substitution impacting disconnected variables.

Would you be able to provide your insights on these matters

This section of SEP-0007 expands on these points:
https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0007/sep-0007.md#identified-issues

@hannahbast
Copy link

Dear all, Hannah (@hannahbast) and Johannes (@joka921) here from the University of Freiburg and developers of QLever. Fascinating and important question. There is a lot to untangle here, so first

TLDR: In border cases regarding the semantics of a complex query, we recommend to follow the inner logic of the query language and not what a user thinks the query should do. Following that, we recommend to define EXISTS via a standard semijoin. That being said, the SPARQL standard currently misses some important features, but that is a different topic and should not be conflated with the topic under discussion here.

Here is the long version:

  1. There are two principle ways to determine what the semantics of a certain query should be. One is based on the internal logic of the query language. The other is based on what users think a certain query does or should do. The two should typically coincide, otherwise something would be very wrong with the design of the query language. But the definition of anything non-trivial is bound to have some unintuitive border bases.

  2. The definition of the semantics of EXISTS has several such border cases, notably when the right side uses filters on variables that only occur on the left side. It is aggravated by the fact that most users don't really think about EXISTS as the functional form it is, but about its usage in combination with a filter, as in FILTER NOT EXISTS, and then wondering why on earth SPARQL has both MINUS and FILTER NOT EXISTS, which are very similar but not equal. But that is really just a side-effect of EXISTS. So better think about EXISTS in isolation, like in this simple example query: three people with their name and whether they have a birth date in Wikidata or not.

  3. Most SPARQL features translate to some form of join operation when processing the query. For example, two graph patterns with joint variables can be processed by a standard join operation: the result is all pairs of solutions from the left and from the right, where the joint variables have the same value. One graph pattern MINUS another can be processed by an antijoin: the result is only those solutions from the left, which don't match one from the right. Similarly, OPTIONAL translates to a left outer join.

  4. All these join operations have an important property: the left and right side can be evaluated independently, and the computational complexity is linear in the size of the input plus the size of the output (modulo some details related to UNDEF values omitted here).

  5. Following that logic, it would make a lot of sense to define EXISTS via a semijoin: That is, each solution on the left evaluates to true if and only if it matches a solution on the right. This includes the special case, where there are no variables in common: then all solutions on the right match, just like the would in a normal join, and each solution on the left evaluates to true (and this also explains the difference between FILTER NOT EXISTS and MINUS).

  6. Unfortunately, the standard didn't define EXISTS that way, but chose to define it via substitution. Substitution is conceptually different from a join-like operation because it is a quadratic algorithm: for each solution on the left, perform the corresponding substitution on the right and evaluate it. It is true that one can reduce this to ordinary joins in many cases (a practice called de-correlation or unnesting, here is a good paper about it). But the current definition does go against the inner logic of all the other SPARQL constructs, as sketched above. And you can already tell that something is fishy by looking at the standard, where "substitute" is defined solely for the purpose of defining EXISTS and used nowhere else: https://www.w3.org/TR/sparql11-query/#sparqlAlgebraEval . The many issues with EXISTS are a direct consequence of this special treatment.

  7. All that being said, there are important features missing from SPARQL. One notable example is queries like the highest peak of each country, where you have a GROUP BY (for each country, all the peaks), an aggregation (find the maximum height per country), but then want the value from another column corresponding to the result of the aggregation (the peak which has the maximum height). SPARQL currently forces you to write a very unnatural query for this, where parts of the query have to be repeated in a subquery. Another example are parametrized queries like all streets in region X (click on "Map view" to see the result on a map), where you want to define one or more parameters at the top, which are then used in the rest of the query. This works for simple queries, but with current SPARQL breaks as soon as you have subqueries or nested group graph patterns using the query parameters, which is confusing for many users.

  8. These missing features are related to the "substitute" semantics and to nested queries and should definitely be addressed and we do have recommendations for that. But we strongly advise against conflating this with the current discussion, which is about EXISTS and not about all kinds of missing features. When implementing new features that depart from the join logic inherent to the current version of the SPARQL standard, we would advise to use new keywords (LATERAL being an example) in order to make the departure from the standard logic explicit.

@pchampin
Copy link
Contributor

pchampin commented Oct 3, 2024

This was discussed in today's meeting

@gkellogg gkellogg removed the discuss-f2f Proposed for discussion during the next face-to-face meeting label Oct 24, 2024
@gkellogg gkellogg added the ErratumRaised Errata management: proposed erratum label Mar 27, 2025
@ktk
Copy link

ktk commented Apr 3, 2025

We need to discuss a Task Force for this

@w3cbot
Copy link

w3cbot commented Apr 10, 2025

This was discussed during the #rdf-star meeting on 10 April 2025.

View the transcript

Task Force for SPARQL EXISTS 3

ora: Idea is to prepare for EXISTS - not a priority for the whole WG at this time.
… does the group agrees?
… who will chair?

<AndyS> s/agress/agree/

ora: james - would you chair this?

james: insufficient experience of W3C processes

ora: it involves scheduling and making the TF moves forward

gkellogg: Agree to TF and maybe other items in the WG for subsets of the participants
… creates a place for discussions. I would participate.

james: sub-group decide chair?

Andys: I can schedule a first meeting

<TallTed> TF(s) will let focused conversation(s) take place in parallel with main group without consuming main group time. I don't think I will have the time to do much if any more than participate (which I will *try* to do).

ktk: can external people participate?

<TallTed> TF participants must be WG members, whether as W3CMember reps or as IEs

james: certainly agree for external participants

pchampin: chairs can invite (and to WG meetings)
… need to check if a member org can specific a person outside of their org.

<TallTed> IP issues can quickly become quite hairy.

tallted: depends on their input (IP issues need care)
… if they are contributing significantly, should do the IP side.

ora: adjourned


@afs
Copy link
Contributor Author

afs commented Apr 11, 2025

schedule a first meeting

WG members - please let me know (by direct email) of your interest and availability.

@ktk ktk removed the needs discussion Proposed for discussion in an upcoming meeting label Apr 17, 2025
@pfps
Copy link
Contributor

pfps commented Apr 23, 2025

There appears to be two different ways of fixing EXISTS and they produce different answers.

One method adds the current bindings to the FILTER expression, roughly as if a VALUES expression was prepended to it, but evaluates the resulting expression normally. This means the result of any subqueries is the same for all the bindings so they need only be evaluated once.

The other method is much more involved, and is described in https://github.com/w3c/sparql-dev/blob/main/SEP/SEP-0007/sep-0007.md In this method the current bindings are pushed into subqueries with the result that they may need to be evaluated multiple times. (It may be possible to optimize the evalation so that the subqueries do not need to be completely re-evaluated for each binding.)

A query that shows a difference between the two ways is:

PREFIX ex: <ex:>
SELECT ?v WHERE {
BIND ( ex:a AS ?v )
FILTER EXISTS { SELECT ?v WHERE { FILTER (?v = ex:a) } }
}

Under the first method this query produces no answers because no binding is available for ?v in the embedded query. Under the second method an answer is produced because the current bindings are pushed all the way down in the expression.

Currently MilleniumDB and Blazegraph produce an empty answer set for this query.
QLever and www.sparql.org produce an answer set with one set of bindings.

@TallTed
Copy link
Member

TallTed commented Apr 23, 2025

For what it's worth, current Virtuoso (version 08.03.3333 (b7f38078cb)) also produces a single row result set. ( See query, and result.)

@afs
Copy link
Contributor Author

afs commented Apr 30, 2025

@pfps -- What does the SPARQL 1.1 say?

"the pattern formed by replacing every occurrence of a variable v in pattern by μ(v) for each v in dom(μ)"

Aside from the projection of ?v (separate issue that you have already identified,

What about:

PREFIX ex: <ex:>
SELECT ?v WHERE {
BIND ( ex:a AS ?v )
FILTER EXISTS { SELECT * WHERE { FILTER (?v = ex:a) } }
}
PREFIX ex: <ex:>
SELECT ?v WHERE {
BIND ( ex:a AS ?v )
FILTER EXISTS { FILTER (?v = ex:a) }
}

@afs
Copy link
Contributor Author

afs commented Apr 30, 2025

PR 177 has one way in which substitute might be fixed.

The principle is that during execution of the row being filtered, variables have their value constrained to the binding in the row.

The PR describes this by a process of rewriting the algebra on a per-row basis. It is not a required implementation approach.

It is not the only way to communicate the effect. An alternative possibility is to introduce a new algebra operator which evaluates to the row being filtered. This would make for a rewrite a static (algebra build time) step.

@afs
Copy link
Contributor Author

afs commented May 1, 2025

I've setup a repo for tests and notes: https://github.com/afs/SPARQL-exists

In it, there is the example query, and variations, together with NOT EXISTS versions and NOT EXISTS replaced by MINUS.

https://github.com/afs/SPARQL-exists/tree/main/tests/exists-filter

@lisp
Copy link

lisp commented May 2, 2025

there is now a pull request to explore minimal changes to the recommendation to support the interpretation, that it should be implemented analogous to nested loop bgp mechanisms.

the change clarifies that patterns are not to be instantiated, which eliminates the errors otherwise associated the blank nodes.
we have followed this interpretation from our first implementation over a decade ago.
the implementation produces results which differed with those proposed during the sparql-dev working group discussions only due to differences in variable scoping rules.

@pfps
Copy link
Contributor

pfps commented May 2, 2025

It would be useful to describe how this proposal would change common uses of EXISTS. For example what happens with

ex:a ex:b ex:c .
ex:k ex:d ex:f .
SELECT ?x WHERE {
  ex:a ex:b ?x .
  FILTER NOT EXISTS { ?x ex:d ex:f . }
}

@afs
Copy link
Contributor Author

afs commented May 2, 2025

One method adds the current bindings to the FILTER expression, roughly as if a VALUES expression was prepended to it, but evaluates the resulting expression normally.

Prepend inside or outside the {...}?

The second method can be seen as prepending inside the {...}.

In the syntax to algebra algorithm of SPARQL, there is also always an empty BGP at the start of a group graph pattern. The "simplification" step in the spec replaces "join(Z, X)" and "join(X, Z)" with X for clarity.

The values insertion process includes a VALUES block of a single row (actually Table(μ)), at the start of all group graph patterns.

There is a binding for ?v. In

{ FILTER (?v = ex:a) }

there is no join, and Z is still there.

So it becomes (algebra)

(filter (?v = ex:a)
    (join Z Table(μ))
)

Filter (algebra) always filers something.

Under the first method this query produces no answers because no binding is available for ?v in the embedded query.

Maybe because it is outside the {...}?

{ VALUES(μ) FILTER (?v = ex:a) } would work, but VALUES(μ) { FILTER (?v = ex:a) } is not prepending immediately before the FILTER,

@afs
Copy link
Contributor Author

afs commented May 3, 2025

Added blank node tests (issue 3).

@hannahbast
Copy link

Dear all, I just reread my comment from 03.10.2024 and still feel the same way. In case this got forgotten, here are the main points again; see the comment for examples and all the details:

  1. When in doubt about the semantics of a query, follow the inner logic of the query language and not what users think the result should be
  2. According to the inner logic of the query language, the natural definition of EXISTS is via a standard semijoin and not via substitution
  3. I understand the desire and need for substitution semantics in SPARQL, but there should be separate keywords for this

@pfps
Copy link
Contributor

pfps commented May 3, 2025

@pfps -- What does the SPARQL 1.1 say?

"the pattern formed by replacing every occurrence of a variable v in pattern by μ(v) for each v in dom(μ)"

Aside from the projection of ?v (separate issue that you have already identified,

What about:

PREFIX ex: <ex:>
SELECT ?v WHERE {
BIND ( ex:a AS ?v )
FILTER EXISTS { SELECT * WHERE { FILTER (?v = ex:a) } }
}

My view is that embedded queries should be treated inside EXISTS just as they are outside exists.

PREFIX ex: <ex:>
SELECT ?v WHERE {
BIND ( ex:a AS ?v )
FILTER EXISTS { FILTER (?v = ex:a) }
}

But that in-scope bindings are available at the top level of the EXISTS pattern.

So these two have different results, just like

PREFIX ex: <ex:>
SELECT ?v WHERE {
 BIND (ex:a AS ?v )
 FILTER (?v = ex:a)
}

and

PREFIX ex: <ex:>
SELECT ?v WHERE {
 BIND (ex:a AS ?v )
 { SELECT * WHERE { FILTER (?v = ex:a) } } 
}

do.

@kasei
Copy link
Contributor

kasei commented May 3, 2025

@hannahbast

2. According to the inner logic of the query language, the natural definition of EXISTS is via a standard semijoin and not via substitution
3. I understand the desire and need for substitution semantics in SPARQL, but there should be separate keywords for this

My recollection of the 1.1 WG is that (NOT-)EXISTS was mostly motivated by the desire for negation support. In that context, I think (NOT) EXISTS is the "separate keyword" for substitution semantics, while MINUS provided the join-based negation semantics. (And we didn't get a semi-join counterpart to the negative MINUS).

@afs
Copy link
Contributor Author

afs commented May 4, 2025

Dear all, I just reread my comment from 03.10.2024 and still feel the same way. In case this got forgotten,

@hannahbast - The comment has not been forgotten.

Johannes presented a summary in the meeting on Friday (May 2nd)

https://www.w3.org/2025/05/02-rdf-star-minutes.html#10000

@afs
Copy link
Contributor Author

afs commented May 14, 2025

The topic for the SPARQL TF meeting this week is agree on the 5 issues, with the understanding that new issues may be found. The expectation for a new issue is that it needs clearly stating what the problem is, with evidence, such as query, data and results.

WG members - the meetings should now be appearing on your W3C calendar and also in the working group upcoming meetings.

@lisp
Copy link

lisp commented May 16, 2025

attached here is a script to run the tests from https://github.com/afs/SPARQL-exists/tree/main/tests against a repository in dydra.

the data is present in distinct graphs in a single dataset.

run_tests.txt

@afs
Copy link
Contributor Author

afs commented May 19, 2025

I've reorganised the tests area with a set of tests per issue as suggested at the last meeting. (There is overlap because the issues are not independent.)

Also - I've added a description of using an algebra function to PR #177. The function evaluates to the current solution mapping (mentioned in comment.

The spec description now does imply changing the algebra for query execution once algebra expression is initially constructed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ErratumRaised Errata management: proposed erratum
Projects
None yet
Development

No branches or pull requests