Validating Material Composition Data Using SHACL With SPARQL-Based Constraints

This is the second in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). The series results from a recent project we undertook for Nexperia. In the first post we looked at the basic data model for material composition and how the basic SHACL vocabulary can be used to describe the constraints. In this post we look at how SPARQL-based constraints can be used to implement more complex rules by means of a SPARQL SELECT query.

As a working example, we will look at how we can write a rule to validate the CAS (Chemical Abstracts Service) Registry Number® (CAS RN®) of a substance. The registry contains information on more than 130 million organic and inorganic substances.

Each CAS RN identifier:

  • Is a unique numeric identifier
  • Designates only one substance
  • Has no chemical significance
  • Is a link to a wealth of information about a specific chemical substance

An example is 9003-35-4, which is the identifier for the substance ‘Phenol, polymer with formaldehyde’.

[Figure: Phenol, polymer with formaldehyde]

A CAS RN includes up to 10 digits which are separated into 3 groups by hyphens. The first part of the number, starting from the left, has 2 to 7 digits; the second part has 2 digits. The final part consists of a single check digit.

In the first post, we already saw how the syntax of the CAS RN can be checked by matching it against the regex "^[0-9]{2,7}-[0-9]{2}-[0-9]$".
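As a quick illustration outside SHACL, the same pattern can be exercised with Python's standard re module. This is only a sketch: the helper name has_cas_syntax is made up for this example, and it checks the shape of the identifier, nothing more.

```python
import re

# The pattern quoted above, compiled once for reuse.
CAS_PATTERN = re.compile(r"^[0-9]{2,7}-[0-9]{2}-[0-9]$")


def has_cas_syntax(value: str) -> bool:
    """Return True if value has the shape of a CAS RN (syntax only)."""
    return CAS_PATTERN.match(value) is not None


print(has_cas_syntax("9003-35-4"))  # True
print(has_cas_syntax("1333-8-4"))   # False: the second group has only one digit
```

A value such as "60-35-0" also passes this test, even though its check digit is wrong, which is why the pattern alone is not enough.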

However, the CAS RN also provides a way to do check digit verification to detect mistyped numbers, which would be useful to incorporate into our validation rules.

The CAS RN may be written in a general form as:

  Nᵢ......N₄N₃ - N₂N₁ - R

In which R represents the check digit and N represents a fundamental sequential number. The check digit is derived from the following formula:

(iNᵢ + ... + 4N₄ + 3N₃ + 2N₂ + 1N₁) mod 10 = R

For example, for ‘Phenol, polymer with formaldehyde’ RN 9003-35-4, the validity is checked as follows:

  CAS RN: 9003-35-4
sequence: 6543 21

N₆ = 9; N₅ = 0; N₄ = 0; N₃ = 3; N₂ = 3; N₁ = 5

  ((6 x 9) + (5 x 0) + (4 x 0) + (3 x 3) + (2 x 3) + (1 x 5)) mod 10
= (54 + 0 + 0 + 9 + 6 + 5) mod 10
= 74 mod 10
= 4

Valid!
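The same arithmetic can be sketched in a few lines of Python as a reference point for the SPARQL version. The function names are hypothetical, and the sketch assumes its input already has the correct syntax (digits and hyphens only).

```python
def cas_check_digit(cas_rn: str) -> int:
    """Compute the expected check digit from the digits that precede R."""
    digits = cas_rn.replace("-", "")
    body = digits[:-1]  # N_i ... N_1, without the trailing check digit R
    # The rightmost body digit gets weight 1, the next one weight 2, and so on.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10


def has_valid_check_digit(cas_rn: str) -> bool:
    """Compare the computed check digit with the final digit R."""
    return cas_check_digit(cas_rn) == int(cas_rn[-1])


print(cas_check_digit("9003-35-4"))        # 4
print(has_valid_check_digit("9003-35-4"))  # True
```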

The SHACL Core language has no construct for this kind of arithmetic. With a little thought, however, we can implement the validity check in SPARQL as follows:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select ?casNum ?checksum ?test
where {
  # remove the hyphens
  bind(replace(?casNum, "-", "") as ?casNum_)

  # get the length of the RN
  bind(strlen(?casNum_) as ?len)

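  # get the check digit R (the last character; substr positions are 1-based)
  bind(xsd:integer(substr(?casNum_, ?len-0, 1)) as ?0)   # R

  # weight each preceding digit by its position, counting from the right;
  # a position before the start of the string yields "", so the cast fails
  # and the variable stays unbound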
  bind(xsd:integer(substr(?casNum_, ?len-1, 1))*1 as ?1) # 1N₁
  bind(xsd:integer(substr(?casNum_, ?len-2, 1))*2 as ?2) # 2N₂
  bind(xsd:integer(substr(?casNum_, ?len-3, 1))*3 as ?3) # 3N₃
  bind(xsd:integer(substr(?casNum_, ?len-4, 1))*4 as ?4) # 4N₄
  bind(xsd:integer(substr(?casNum_, ?len-5, 1))*5 as ?5) # 5N₅
  bind(xsd:integer(substr(?casNum_, ?len-6, 1))*6 as ?6) # 6N₆
  bind(xsd:integer(substr(?casNum_, ?len-7, 1))*7 as ?7) # 7N₇
  bind(xsd:integer(substr(?casNum_, ?len-8, 1))*8 as ?8) # 8N₈
  bind(xsd:integer(substr(?casNum_, ?len-9, 1))*9 as ?9) # 9N₉
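  # pick the sum that matches the RN length; 1/0 raises an error,
  # which makes coalesce fall through to the next alternative
  bind(
    coalesce(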
      # if RN length = 10, then sum positions 1N₁ thru 9N₉, else
      if(?len=10, ?1+?2+?3+?4+?5+?6+?7+?8+?9, 1/0),
      # if RN length = 9, then sum positions 1N₁ thru 8N₈, else
      if(?len=9, ?1+?2+?3+?4+?5+?6+?7+?8, 1/0),
      # if RN length = 8, then sum positions 1N₁ thru 7N₇, else
      if(?len=8, ?1+?2+?3+?4+?5+?6+?7, 1/0),
      # if RN length = 7, then sum positions 1N₁ thru 6N₆, else
      if(?len=7, ?1+?2+?3+?4+?5+?6, 1/0),
      # if RN length = 6, then sum positions 1N₁ thru 5N₅, else
      if(?len=6, ?1+?2+?3+?4+?5, 1/0),
      # if RN length = 5, then sum positions 1N₁ thru 4N₄
      if(?len=5, ?1+?2+?3+?4, 1/0)
    ) as ?sum
  )

  # divide the sum by 10
  bind(?sum/10 as ?sum_10)

  # calculate the remainder and multiply by 10 to give the checksum
  bind(10*(?sum_10 - floor(?sum_10))  as ?checksum)

  # test if checksum = R
  bind(?checksum = ?0 as ?test)
}
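The query above has no modulo operator to lean on, so it derives the remainder from the fractional part of the sum divided by 10. The following Python sketch mirrors that step with exact decimal arithmetic, on the understanding that SPARQL returns an xsd:decimal when dividing two xsd:integer values. The function name remainder_via_floor is invented for this illustration.

```python
from decimal import Decimal, ROUND_FLOOR


def remainder_via_floor(total: int) -> Decimal:
    """Mirror 10 * (sum/10 - floor(sum/10)) using exact decimals."""
    quotient = Decimal(total) / Decimal(10)                   # e.g. 74 -> 7.4
    whole = quotient.to_integral_value(rounding=ROUND_FLOOR)  # e.g. 7
    return 10 * (quotient - whole)                            # e.g. 4.0


print(remainder_via_floor(74) == 4)  # True
```

Exactness matters here: with binary floating point, 7.4 - 7 is not exactly 0.4, so the final comparison against R could fail for a valid number.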

We can then use a VALUES clause to pass some examples and counterexamples as bindings for ?casNum into the query:

values ?casNum {
  "9003-35-4"
  "1333-86-4"
  "138265-88-0"
  "60676-86-0"
  "60676-86-1"
  "1344-28-1"
  "603-35-0"
  "60-35-0"
}

This yields the following results:

+-------------+----------+-------+
|   casNum    | checksum | test  |
+-------------+----------+-------+
| 9003-35-4   |        4 | true  |
| 1333-86-4   |        4 | true  |
| 138265-88-0 |        0 | true  |
| 60676-86-0  |        0 | true  |
| 60676-86-1  |        0 | false |
| 1344-28-1   |        1 | true  |
| 603-35-0    |        0 | true  |
| 60-35-0     |        5 | false |
+-------------+----------+-------+

Now that we have validated the query logic, the constraint can be incorporated into the property shape for our plm:casNumber property by using sh:sparql:

:casNumberShape a sh:PropertyShape ;
  sh:path plm:casNumber ;
  sh:maxCount 1 ;
  sh:datatype xsd:string ;
  sh:pattern "^[0-9]{2,7}-[0-9]{2}-[0-9]$" ; # match pattern "nnnnnNN-NN-N"
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:message "Checksum of CAS Registry Number must be valid." ;
    sh:prefixes [
      sh:declare [
        sh:prefix "plm" ;
        sh:namespace "http://example.com/def/plm/"^^xsd:anyURI
      ]
    ] , [
      sh:declare [
        sh:prefix "xsd" ;
        sh:namespace "http://www.w3.org/2001/XMLSchema#"^^xsd:anyURI
      ]
    ] ;
    sh:select """
      select $this (?casNum as ?value)
      where {
        $this $PATH ?casNum                                         # match the plm:casNumber predicate
        bind(replace(?casNum, "-", "") as ?casNum_)                 # remove the hyphens
        bind(strlen(?casNum_) as ?len)                              # get the length of the RN
        bind(xsd:integer(substr(?casNum_,?len-0,1)) as ?0)          # get the checksum value R
        bind(xsd:integer(substr(?casNum_,?len-1,1))*1 as ?1)        # 1N₁
        bind(xsd:integer(substr(?casNum_,?len-2,1))*2 as ?2)        # 2N₂
        bind(xsd:integer(substr(?casNum_,?len-3,1))*3 as ?3)        # 3N₃
        bind(xsd:integer(substr(?casNum_,?len-4,1))*4 as ?4)        # 4N₄
        bind(xsd:integer(substr(?casNum_,?len-5,1))*5 as ?5)        # 5N₅
        bind(xsd:integer(substr(?casNum_,?len-6,1))*6 as ?6)        # 6N₆
        bind(xsd:integer(substr(?casNum_,?len-7,1))*7 as ?7)        # 7N₇
        bind(xsd:integer(substr(?casNum_,?len-8,1))*8 as ?8)        # 8N₈
        bind(xsd:integer(substr(?casNum_,?len-9,1))*9 as ?9)        # 9N₉
        bind(
          coalesce(
            if(?len=10,?1+?2+?3+?4+?5+?6+?7+?8+?9,1/0),             # if RN length = 10, then sum positions 1N₁ thru 9N₉, else
            if(?len=9,?1+?2+?3+?4+?5+?6+?7+?8,1/0),                 # if RN length = 9, then sum positions 1N₁ thru 8N₈, else
            if(?len=8,?1+?2+?3+?4+?5+?6+?7,1/0),                    # if RN length = 8, then sum positions 1N₁ thru 7N₇, else
            if(?len=7,?1+?2+?3+?4+?5+?6,1/0),                       # if RN length = 7, then sum positions 1N₁ thru 6N₆, else
            if(?len=6,?1+?2+?3+?4+?5,1/0),                          # if RN length = 6, then sum positions 1N₁ thru 5N₅, else
            if(?len=5,?1+?2+?3+?4,1/0)                              # if RN length = 5, then sum positions 1N₁ thru 4N₄
          ) as ?sum
        )
        bind(?sum/10 as ?sum_10)                                    # divide the sum by 10
        bind(10*(?sum_10 - floor(?sum_10))  as ?checksum)           # calculate the remainder and multiply by 10 to give the checksum
        filter(?checksum != ?0)                                     # test if checksum != R
      }
      """
  ] .

A few things to note:

  • Any prefixes that will be used in the SPARQL query must be defined using the sh:prefixes property, in this case plm: and xsd:
  • The $PATH variable in the SPARQL query is substituted at runtime by the sh:path used by the shape, in this case plm:casNumber
  • The SPARQL query must be written so that it returns a result for each value that violates the constraint. In this case, the FILTER clause matches when the calculated checksum is not equal to the value of R in the CAS RN

The extended shape file is available here.

If we use this extended property shape to validate our data, we see these additional validation results (some details omitted for brevity):

[ a       <http://www.w3.org/ns/shacl#ValidationResult> ;
  <http://www.w3.org/ns/shacl#focusNode>
          <http://example.com/132285000223> ;
  <http://www.w3.org/ns/shacl#resultMessage>
          "Checksum of CAS Registry Number must be valid." ;
  <http://www.w3.org/ns/shacl#resultPath>
          plm:casNumber ;
  <http://www.w3.org/ns/shacl#resultSeverity>
          <http://www.w3.org/ns/shacl#Violation> ;
  <http://www.w3.org/ns/shacl#sourceConstraint>
          _:b1 ;
  <http://www.w3.org/ns/shacl#sourceConstraintComponent>
          <http://www.w3.org/ns/shacl#SPARQLConstraintComponent> ;
  <http://www.w3.org/ns/shacl#sourceShape>
          <http://example.com/ns#casNumberShape> ;
  <http://www.w3.org/ns/shacl#value>
          "1333-8-4"
]

and

[ a       <http://www.w3.org/ns/shacl#ValidationResult> ;
  <http://www.w3.org/ns/shacl#focusNode>
          <http://example.com/132285000108> ;
  <http://www.w3.org/ns/shacl#resultMessage>
          "Checksum of CAS Registry Number must be valid." ;
  <http://www.w3.org/ns/shacl#resultPath>
          plm:casNumber ;
  <http://www.w3.org/ns/shacl#resultSeverity>
          <http://www.w3.org/ns/shacl#Violation> ;
  <http://www.w3.org/ns/shacl#sourceConstraint>
          _:b1 ;
  <http://www.w3.org/ns/shacl#sourceConstraintComponent>
          <http://www.w3.org/ns/shacl#SPARQLConstraintComponent> ;
  <http://www.w3.org/ns/shacl#sourceShape>
          <http://example.com/ns#casNumberShape> ;
  <http://www.w3.org/ns/shacl#value>
          "7441-22-4"
]

The first violation is also picked up by the existing regex pattern match. The second value matches the regex pattern but is still invalid, because it fails the newly added check digit verification constraint.
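To make the difference between the two violations concrete, here is a small Python sketch that applies both checks to the reported values. The function name and the failure labels are hypothetical, and the sketch assumes the input contains only digits and hyphens.

```python
import re

CAS_PATTERN = re.compile(r"^[0-9]{2,7}-[0-9]{2}-[0-9]$")


def failed_checks(cas_rn: str) -> list:
    """Return the names of the checks that cas_rn fails."""
    failures = []
    # Check 1: the syntax pattern.
    if CAS_PATTERN.match(cas_rn) is None:
        failures.append("pattern")
    # Check 2: the weighted sum of the body digits, mod 10, must equal R.
    digits = cas_rn.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    if total % 10 != check:
        failures.append("check digit")
    return failures


print(failed_checks("1333-8-4"))   # ['pattern', 'check digit']
print(failed_checks("7441-22-4"))  # ['check digit']
print(failed_checks("9003-35-4"))  # []
```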

This demonstrates how SPARQL-based constraints can be used to capture more complex rules that cannot be described with the SHACL Core language. Having the full expressiveness of SPARQL available opens up an almost endless range of possibilities. These constraints can be checked using any SHACL processor that implements SHACL-SPARQL.

Note that this check will still not guarantee that the CAS RN actually exists in the CAS registry. In order to do that we would need to somehow reconcile the CAS RN against the CAS registry, or some other authority like Wikidata (e.g. Carbon Black is Q764245).

This is beyond the scope of SHACL and our project, but would open the door to integrating data published by those authorities into a consuming application.

In the next post in the series, we will continue to explore the use of SPARQL constraints for other validation rules involving aggregation.