Yimu Yin Jiji Zhang
Abstract.We give a category-theoretic treatment of causal models that formalizes the syntax for causal reasoning over a directed acyclic graph(DAG)by associating a free Markov category with the DAG in a canonical way.This framework enables us to define and study important concepts in causal reasoning from an abstract and“purely causal”point of view,such as causal independence/separation,causal conditionals,and decomposition of intervention effects.Our results regarding these concepts abstract away from the details of the commonly adopted causal models such as(recursive)structural equation models or causal Bayesian networks.They are therefore more widely applicable and in a way conceptually clearer.Our results are also intimately related to Judea Pearl’s celebrated do-calculus,and yield a syntactic version of a core part of the calculus that is inherited in all causal models.In particular,it induces a simpler and specialized version of Pearl’s do-calculus in the context of causal Bayesian networks,which we show is as strong as the full version.
Causal models based on directed acyclic graphs(DAGs),such as recursive structural equation models ([3,4]) and causal Bayesian networks ([13,11]),have been vigorously studied and widely applied as powerful tools for causal reasoning.However,from a logical point of view,the syntax underlying such causal models is usually left implicit or even obscure in the literature.This lacuna is fixed in recent category-theoretic work on the subject ([1,6]),where the distinction between syntax and semantics is made clear in the style of F.W.Lawvere’s functorial semantics([7]).Specifically,the syntax is provided by a monoidal category of a certain kind induced by a DAG,and a distinguished class of morphisms therein can be viewed assyntacticcausal effects,which may then be interpreted in various ways.Causal Bayesian networks,for example,interpret a causal effect of this kind with a stochastic matrix that represents probability distributions over the outcome-variables resulting from interventions on the treatment-variables.In this light,a causal Bayesian network is a functor—a structure-preserving mapping—from the syntax category to a category whose morphisms are stochastic matrices.As another example (and a degenerate case of causal Bayesian networks),deterministic structural equation models interpret a causal effect of this kind as a function that represents how the values of the outcome-variables depend on those of the treatment-variables.Thus,a(recursive)deterministic structural equation model may be viewed as a functor from the syntax category to a category whose morphisms are functions.
In this paper we build on this category-theoretic framework and study some important concepts in DAG-based causal reasoning from a syntactic and more abstract perspective.In particular,we work with the categories defined in[1],calledcausal theories,with an extra constraint to make them Markov categories in the sense of[2].We study the morphisms in that category that correspond to what we call syntactic causal effects,using the graphical language of string diagrams as vehicles for our arguments.One of our main results concerns the decomposition or disintegration of causal effect morphisms,or in the terminology of[2],the existence of a conditional for a causal effect morphism.Roughly,this refers to the property that the causal effect ofxonyandzcan be decomposed into the causal effect ofxonyfollowed by that ofxandyonz.We derive a condition that is sufficient and necessary for decomposability in a causal theory.Interestingly,the condition is precisely the condition in a specialized version of the second rule of Judea Pearl’s do-calculus([11]).This agreement is of course not a coincidence and has,we submit,several instructive implications.The other rules of the do-calculus have more straightforward counterparts in terms of causal effect morphisms,and the upshot is a generic do-calculus at the syntactic level.
This generic calculus,we argue,captures the“causal core”of reasoning about interventions,and is automatically inherited in all causal models,including but not limited to the popular probabilistic ones.In particular,it induces a simpler and specialized version of Pearl’s do-calculus in the context of causal Bayesian networks.Importantly,we show that given the probability calculus,the simpler and specialized version entails the full version of the do-calculus,corroborating our contention that the generic do-calculus corresponds to the purely causal component of the well-known probabilistic do-calculus.
The rest of the paper is organized as follows.In Section 2,we introduce the basics of category theory and the intuitive language of string diagrams,leading to the notion of a Markov category.In Section 3,we define causal theories as an abstraction of causal Bayesian networks and as free Markov categories,and highlight a class of morphisms constructed in [1],which we call “causal effect morphisms”.Section 4 presents some results about causal effect morphisms,which yield a more abstract and syntactic counterpart to Pearl’s do-calculus.We show in Section 5 that the syntactic do-calculus entails a simpler and specialized version of probabilistic do-calculus,and that,despite its simplicity,the specialized version is actually as strong as the full version.
For the sake of space and readability,we will only describe the notions of category theory that are essential for understanding this paper,and introduce the axioms for a Markov category using the language of string diagrams.For readers interested in learning more about category theory and string diagrams,we recommend the canonical treatises[8]and[12],among other excellent textbooks and surveys.
A category C consists of two types of entities:objects A,B,C,...andarrows f,g,h,...,subject to the following three rules:
· For each arrowfthere are given two objects dom(f) and cod(f),called thedomainand thecodomainoff.We usually writef:to indicate thatA=dom(f)andB=cod(f).
· Given two arrowsf:andg:,that is,cod(f)=dom(g),there is a third arrowg ?f:,called thecompositionoffandg.
· For each objectAthere is an arrow 1A:,called theidentityorunitarrow ofA.
In addition,the obviousunitalityandassociativitylaws hold for compositions:for allf:,g:,andh:,
An arrow in category theory is also called amorphismor amap.Here is a more formal definition:
Definition 2.1.Let C be a quadruple(C0,C1,dom,cod),where C0is referred to as a class ofobjects,C1is referred to as a class ofmorphisms,and dom :C1C0,cod:C1C0are functions.A morphismfin C1is usually written asf:with dom(f)=Aand cod(f)=B.For each pair of objectsA,Bin C0,the class of all morphismsfwith dom(f)=Aand cod(f)=Bis denoted by homC(A,B).
Let C2={(f,g)∈C1×C1|cod(f)=dom(g)}.We say that C is acategoryif it also comes with a morphism 1A:for everyA ∈C0,called theidentity morphismofA,and a function?:C2C1,calledcomposition,subject to the associativity and unitality laws given above.
Often we just writex ∈C and let the context determine whetherxis an object or a morphism.
A paradigmatic example of a category is the category Set,containing sets as objects and functions as morphisms.In this category,the composition of morphisms is just the composition of functions and for each objectA,the identity morphism is just the identity function.
It is helpful to think of a morphism as an abstract function,or a box with input wires and output wires,as in the graphical language of string diagrams.The four rudiments of a category are depicted in such a graphical language as follows:
Note that string diagrams are parsed in the lower-left to upper-right order.
Remark2.2.A string diagram is a topological graph in which every edge is labelled with an object and every vertex with a morphism.([12]) Object labels such asA,Bare usually omitted except when they are needed for clarity or emphasis.A labelled vertex is also called anode,and is often drawn as a box such asffor readability.Just as in the usual symbolic formalism,a morphismfmay be represented by many string diagrams.
Categories may serve as objects in a“higher”category,and the morphisms between categories are known as functors:
Definition 2.3.Let C,D be categories andFa pair of mappingsF0:C0D0,F1:C1D1.ThenFis afunctor,written asF:CD,if the following three conditions,corresponding to the three conditions for a category,are satisfied:
·Fpreserves domains and codomains,that is,F1(f:)is a morphismF0(A)0(B)for all morphismsf ∈C1.
·Fpreserves compositions,that is,F1(g ?f)=F1(g)?F1(f)for all compositionsg ?f ∈C1.
·Fpreserves identities,that is,for all objectsA ∈C0.
Compositions of functors may be defined using composition of mappings.Then it is routine to check that categories and functors form a“higher”category.
We now introduce more structures needed for our purpose.Start with the(strict)monoidal structure:
Definition 2.4.Astrict monoidal categoryis a category C equipped with a functor,called themonoidal product,and a distinguished objectI,called themonoidal unit(of the monoidal product),that satisfy associativity and unitality:
Many commonly encountered monoidal categories are actually not strict because equation(2.2)holds only“up to isomorphism”.For example,the category Set has an obvious monoidal product,which is just the Cartesian product(of sets and of functions).The monoidal unit is any singleton set,but unitality is a matter of isomorphism rather than strict identity.The formal definition of a (possibly non-strict) monoidal category is rather more complex and requires the notion of a natural transformation,which we omit to keep things simple.The monoidal categories we will focus on in this paper are strict.1By S.Mac Lane’s coherence theorem([8,Theorem XI 3.1]),every monoidal category is monoidally equivalent to a strict monoidal category.So there is a sense in which we can treat monoidal categories as if they are all strict(even though they are not).See[12]for more on how this is justified.
Since the monoidal product is a functor,it applies to both objects and morphisms in the category.Thus the graphical syntax in(2.1)is extended for monoidal categories as follows:
Notation2.5.We denote byAnthe monoidal product ofncopies of an objectAitself?this includes the empty productA0=I.When an object of the formA1?...?Akis introduced in the discussion,the indices are in general meant to indicate the ordering in which the monoidal product is taken.
For the present purpose,the monoidal structure is especially useful because it can be used to express causal processes or mechanisms that run in parallel,as is visualized in(2.3),whereas composition is used to express those that run in sequence.
Asymmetricmonoidal category is a monoidal category with natural isomorphismsσABAthat satisfy certain coherence conditions(the details do not matter for the present purpose).Graphically,a symmetry isomorphism is depicted as a crossing:
Finally,we can define a Markov category,following the lead of[2].
Definition 2.6.AMarkov categoryis a symmetric monoidal category such that for each objectAin C there are distinguished morphismsδA:,called theduplicateonA,and?A:I,called thediscardonA,that satisfy the following:
Notation2.7.Recall that our convention is to draw a string diagram in the lower-left to upper-right direction.So,above,the duplicateδAAis depicted as an upward forkand the discard?A:Ian upward dead-end·.
Thus a Markov category is endowed with both a symmetric monoidal structure and additional duplicate morphisms and discard morphisms that satisfy (2.5)-(2.7).For our purpose,duplicate morphisms are needed mainly to express the same input entering several causal processes,and discard morphisms are needed to express ignoring or marginalizing over some outcomes of a causal process.The equations in(2.5)are axioms for the so-calledcocommutative monoidal comonoid,and the equations in(2.6)express the condition that duplicates and discards respect the monoidal product.All these axioms are fairly intuitive.The equation in(2.7)says roughly that discarding the output of a morphism is identical to discarding the input in the first place.This condition is not explicitly imposed in[1]but is actually needed for a main result therein(more on this later).
For categories with additional structures,the notion of a functor can be strengthened accordingly.For example,a(strong)Markov functorbetween two Markov categories is a functor that preserves the relevant structures(see[2,Definition 10.14]for details.)
In this section we describe the central object of study in this paper,a syntax category for DAG-based causal models defined in [1],called a causal theory.The framework is an abstraction of causal Bayesian networks,so we first review the latter in 3.1,and then introduce causal theories as free Markov categories in 3.2.In 3.3 we highlight a class of morphisms constructed in[1],which we refer to as causal effect morphisms.Our main results are concerned with these morphisms.
Adirected graphis a quadrupleG=(V,A,s,t),whereV,Aare sets ands:,t:are functions.Elements inVare calledverticesofGand those inAaredirected edgesorarrowsofG.Fora ∈A,s(a)is thesourceofaandt(a)thetargetofa.Gisfiniteif bothVandAare.It issimpleif,for alla,a′ ∈A,we havea=a′whenevers(a)=s(a′)andt(a)=t(a′).We consider only finite simple graphs in this paper.A sequence of distinct arrowsa1,...,an ∈A(G)is adirected path,starting froms(a1) and ending att(an),ifs(ai+1)=t(ai) for 1≤i <n.It is acycleif,in addition,s(a1)=t(an).Gisacyclicif it contains no cycle.Forx,y ∈V(G),xis called aparentofyandyachildofxif for somea ∈A(G),s(a)=xandt(a)=y.
ABayesian network(BN) over a set of (categorical) random variablesVconsists of a triple (G,P,υ),whereGis a directed acyclic graph (DAG),Pis a joint probability law ofV,andυ:V(G)→Vis a bijection between the vertices ofGand the random variables.Following common practice,we will leave the bijectionυimplicit and simply identifyV(G)withV,and callGa DAG overV.The defining condition of a BN is thatGandPsatisfy a Markov condition,which requires thatPcan be factorized according toGas follows:
where paG(X)denotes the set of parents ofXinG.WhenGis sufficiently sparse,the factorization enables efficient computations of various probabilities entailed by the joint probability law,which makes the BN useful for probabilistic reasoning.([9])
The DAG in a BN usually lends itself to a causal interpretation,as a representation of the qualitative causal structure.With this causal reading,the BN framework can be extended to handle reasoning about effects of interventions.([13,11]) Specifically,acausal Bayesian network(CBN) overVdoes not represent just one joint probability law,but a number ofinterventionalprobability distributions.The standard setup is that for every subsetT?Vand every possible value configurationtforT,there is a probability distribution resulting from an (exogenous) intervention that forcesTto take valuet.Such an interventional distribution,denoted asP(V|do(T=t)) using Pearl’s ([11]) do-operator,is assumed to be equal to a truncated factorization:
As a special case,whenT=?,we obtain the factorization(3.1)of thepre-interventiondistribution.Equation(3.2)can be viewed as the defining axiom for the CBN,sometimes referred to as theintervention principle.([17])
Note two key ideas in this formulation of a CBN:(1)for eachX ∈V,P(X|paG(X))encodes amodularcausal process or mechanism(when paG(X)=?,P(X)is taken to encode anexogenousmechanism forX),and the whole causal system is composed of these causal modules? (2) an intervention breaks the modules for its target variables but does not affect the other modules(hence the truncated factorization).Put this way,P(X|paG(X))is a particular,probabilistic model of the causal module?the causal theory,as a syntax category,will express the causal module more abstractly,to which we now turn.
We now follow[1]to define the causal theory induced by a DAGG=(V,A,s,t),a category denoted as Cau(G).The objects are given by words overV.AwordoverVis a finite sequence of elements ofV,and this also includes the empty word?.LetW(V)be the set of words overV.ObviouslyW(V)is closed underconcatenation:ifv,w ∈W(V)thenvw ∈W(V).So concatenation provides a monoidal product onW(V),with the empty word?as the unit.
Terminology3.1.For convenience,elements ofW(V) are also referred to asvariablesand those of length 1,that is,the vertices inV,are calledatomic variables.To ease the notation,we will henceforth denote all variables with lower case letters.Concatenation of two variablesv,wis also written asv ?w.
An atomic variablevis apath ancestorof an atomic variablewif there is a directed path inGfromvtow,and is anancestorofwif it is a path ancestor ofwor is identical withw.
If no atomic variable occurs more than once in a variablevthenvissingular?in particular,?is singular.A singular variable ismaximalif each atomic variable occurs exactly once in it.
Letvwhere eachviis atomic.LetforS ?{1,...,n}? setv?=?.Writew ?v,orw ∈vifwis atomic,and say thatwis asub-variableofvifwis equal to somevS.Letw=vSandw′=vS′.Writev/w=v{1,...,n}?S,w ∩w′=vS∩S′,w?w′=vS?S′,and so on.We say thatw,w′aredisjointif no atomic variable occurs in both of them.Note that being disjoint is not the same asw ∩w′=?,unlessvis singular.
The morphisms in Cau(G)are generated from two distinct classes of generators(basic morphisms),in addition to the identity morphisms:
· The first class consists of duplicate and discard morphisms for each atomic variablev
As mentioned previously,duplicate morphisms are needed to express the same variable entering multiple causal processes,and discard morphisms are needed to express ignoring or marginalizing over some outcomes of a causal process.
· The second class is the heart of the matter and consists of acausal mechanismfor each atomic variablev
where pa(v)is a chosen singular variable that contains all the parents ofv,and is more accurately denoted by paG(v) if necessary.If pa(v)=?then this is just,which shall be called aexogenouscausal mechanism.
The causal theory Cau(G) is thefreeMarkov category generated from these two classes of morphisms(and the identity morphisms),by taking all compositions and products as depicted in(2.1)and(2.3),subject only to the constraints in axioms(2.5)-(2.7).
Note that??=δ?=1?in any Markov category.Also writeκ?=1?.
A free category is a category generated from certain generators by well-defined operations in a “no junk no noise” manner:“no junk” in the sense that only those morphisms that can be so generated are in the category,and“no noise”in the sense that no relations between morphisms hold unless they are required by the axioms.For precise technical definitions and graphical constructions of free monoidal categories,see[12].A graphical construction of free Markov categories takes a little more work,which can be found in [15].For the present purposes,we need not enter the rather technical details of the constructions,and we will simply use some lemmas from[15]in some of our proofs.Remark3.2.It may seem that the construction of Cau(G)depends on the choice of the singular variables pa(v) for the causal mechanismsκv.But this is not so:two distinct choices of pa(v) (and hence ofκv) only differ by a permutation of atomic variables in pa(v)and the resulting free Markov categories are isomorphic.
Following[1](see also[6]),we take Cau(G)as an categorical embodiment of the syntax for causal reasoning withG.It can then be interpreted in any Markov category via strong Markov functors,yielding different kinds of causal models.For example,a CBN based onGis a model of Cau(G)in the Markov category FinStoch,the category containing stochastic matrices as morphisms([1,2,6]),whereas a deterministic structural equation model based onGis a model of Cau(G)in the Markov category Set.We may also explore less studied causal models,such as possibilistic ones,which are models of Cau(G) in the Markov category Rel,the morphisms in which are relations between sets.([1])
Recall the interventional probability distributionsP(V|do(T))in the context of CBN,which is often referred to as the causal effect ofTonV.([14]) We now construct a class of morphisms in a causal theory that is a syntactic counterpart to such causal effects.
Notation3.3.In any Markov category such as Cau(G),a morphismis called amultiplieron the monoidal productA=?i Aiif it is generated from the duplicates,discards,symmetries,and identities on the factorsAi? soBmust be a monoidal product of copies of the factorsAi.IfAi /=Ajfori /=jthen the multiplier is unique,which is denoted byιA→B.This is due to(2.5)or,more intuitively,coherence of the graphical language for Markov categories(see[15]).For instance,ifA=A1?A2?A3andB=thenιA→Bmay be depicted aswhere how the duplicates in the trident are arranged,how the edges at the nodes are ordered,how the copies of the same object in the codomain are ordered,and so on,can all be left unspecified.
Henceforth we work in Cau(G).
Terminology3.4.By the construction in [15],a morphism in Cau(G) is an equivalence class of string diagrams up to surgeries.Therefore,by a string diagram of a morphism,we mean any diagram in the equivalence class in question.
Notation3.5.Although,for our purpose,there is no need to distinguish betweenwvandvwin Cau(G),for a technical reason (symmetries in free Markov categories cannot be identities),we cannot work with the quotient ofW(V)with respect to the relationwv=vwon words.This is also the reason why the maneuver in Remark 3.2 is needed.
To remedy this,we first choose a total ordering onVand denote the corresponding maximal singular variable by=1?...?n.All singular variables we shall speak of are sub-variables of.Ifv,ware singular variables thenv ∪wis abbreviated asvworwv,which denotes the unique sub-variable ofthat contains exactly the atomic variables inv,w.
The results below depend on the chosen ordering only because taking monoidal products of atomic variables does.
Definition 3.6.For singular variablestandv,letGt→vbe the subgraph ofGthat consists of all the vertices int ∪vand all the directed paths that end invbut do nottravel toward t,that is,do not pass through or end int(starting intis allowed).Note that,for everyi ∈V(Gt→v),ifitthen its parents inGare all inGt→vas well and ifi ∈tthen it has no parents inGt→v.
Construct a string diagram as follows.For eachi ∈V(Gt→v),letbe the monoidal product of as many copies ofias the number of children ofiinGt→v.Let Γibe a string diagram of
note the extra copy ofiin the codomain ofAccording to Notation 3.3,there is no need to choose orderings for the codomains of the multipliers employed here.Forj ∈V(Gt→v),letojbe the number of output wires of Γjandpjthat of all the input wires labelled byjof all the other Γi.Observe thatoj=pj+1 ifj ∈vandoj=pjin all other cases.So we may connect the corresponding wires and fuse thesecomponentsΓiinto a single string diagram,denoted by Γ[v‖t],whose input wires are labelled bytand the output wires byv.The string diagram thus obtained may not be unique up to isomorphisms,but the morphism it represents is,due to coherence of the graphical language for Markov categories.This morphism is referred to as thecausal effectoftonvand is denoted by[v‖t]:,or simply[v]whent=?,which is also called theexogenous effectonv.
This class of morphisms was called“causal conditionals”in[1,§4].We prefer to call them“causal effects”here because of their eponymous counterparts in probabilistic causal modeling mentioned earlier,but also because we will study a notion of a conditional in the next section,and[v‖t]may not be a conditional in that sense.
Example 3.7.For any atomic variablev,if pa(v)=?thenκvis simply depicted asThe simplest causal effects are the ones of the form[v‖pa(v)],which is of course just the causal mechanismκv.Below are some simple examples of the causal effect[z‖x]in Cau(G)for four different graphs with three vertices:
Note that in the second and third examples,xis not an ancestor ofzin the causal graphG,and the causal effect [z‖x] is accordingly “disconnected”:the morphism factors through the monoidal unit?,which marks the lack of influence ofxonz.
With string diagrams,we can use topological notions to aid reasoning.Here are some notions that will be used in some of the arguments below:
Definition 4.1.Let Γ be a string diagram in a symmetric monoidal category C.Denote the source of an edgeein Γ bye(0)and the target bye(1).Adirected path a?bin Γ is a sequencep=(a=e0,e1,...,ek,b=ek+1) of edges in Γ such that,for eachi,ei(1)=ei+1(0)andeiis not labeled byI?we also regard the source and target of eacheias belonging to the directed path,and writep(0)=a(0)andp(1)=b(1).Apathis just a concatenation of finitely many directed paths.In particular,asplitter path(respectively,acollider path)is a concatenation of two directed paths joined at the starting nodes(respectively,at the ending nodes).
Two(not necessarily distinct)edges areconnectedin Γ if there is a path between them.More generally,two setsA,Bof edges are connected in Γ if somea ∈Ais connected with someb ∈B,of particular interest is the caseA=dom(Γ) andB=cod(Γ).
In some proofs below,we shall need the theory on surgeries on string diagrams developed in [15].For Markov categories,there are four types of surgeries,corresponding to the four diagrams in(2.5)and(2.7),which shall be accordingly referred to as coassociativity surgery,counitality surgery,cocommutativity surgery,and discard surgery,respectively.Only the following bit from that theory is needed here.
Recall Remark 2.2 and Terminology 3.4.Let Γ be a string diagram of a morphism in Cau(G).A nodexof Γ isdecorativeif either it is a dead-end(discard)or every maximal directed pathpin Γ withp(0)=xruns into a dead-end or,in case thatxis a duplicate,this is so for all such paths through one of the prongs.Denote the set of decorative nodes of Γ by ΔΓand its complement by ?ΔΓ.Denote byPΓthe set of the directed paths that end in cod(Γ)and bySΓthe set of splitter paths between edges in cod(Γ).
Lemma 4.2.Suppose thatΓ,Υare string diagrams of the same morphism inCau(G).Then
·There is a bijection π:compatible with the labels inΓ,Υ.
·There is a bijection:compatible with π,that is,π restricts to a bijection between the nodes inbelonging to p ∈PΓand those inbelonging to(p).
·There is a bijection:compatible with π.
We continue to work in Cau(G).Here is a useful fact from [15] that will be needed in the subsequent arguments:
Lemma 4.3.Let f::v/v′ w for some sub-variable v′ v such that f=/v′ is connected with an atomic variable in w via a directed path.
We now proceed to establish some results about causal effect morphisms in a causal theory.A central result has to do with the existence of conditionals in a Markov category,as is defined in[2].
Definition 4.4.Letf:be a morphism in a Markov category M.
· Themarginal fX|ZoffoverYis the morphism=(1X ??Y)?f:.
·fadmits aconditional over Xif there is a morphismfY|XZ:such that
The marginals of a causal effect morphism behave as expected.
Lemma 4.5.Let u,v,and w be singular variables with v∩w=?.Then the marginal of the causal effect[vw‖u]over v is the causal effect[w‖u](and that over w is[v‖u]).
Proof.By induction on the cardinality ofv,this is immediately reduced to the case wherevis an atomic variable.We examine how composing with?vchanges the component Γvand other subsequently impacted components Γiwithout changing the morphism represented (recall Definition 3.6).If Γvhas more than one output wire then,by counitality surgery,is changed toand Γvis thus changed without impacting any other Γi.If Γvhas only one output wire then,by discard surgery,it is reduced to ?where pa(v)is computed inGu→vw.In that case,for anyj ∈pa(v),we ask the same question that whether Γjhas more than one output wire or not,and proceed accordingly as before.Observe that,when there are no more surgeries to be performed,the remaining components Γj,including the modified ones,are exactly those required to construct Γ[w‖u].The lemma follows.
Note that this lemma relies on the axiom(2.7),or discard surgery,which is not imposed in[1]as we do here.To see it,consider again the third graph in Example 3.7 and the marginal(1x ??y)?[xy‖z]of the causal effect[xy‖z]overy,then we have:
where the first equality is begotten by discard surgery and the second one by counitality surgery.Without(2.7),the first equality would fail.
Related to this observation is a claim in [1,Proposition 4.2] that ifv,ware atomic variables andvis not an ancestor ofwinG,then there exists no morphismf:in Cau(G)such thatvandware connected.Again,this is not quite right without (2.7),as shown by the example in (4.2).Now that we have imposed (2.7),this claim does hold,and is immediate from Lemma 4.3:
Proposition 4.6([1]).Let v,w be atomic variables.If there exists a morphism v w in which v,w are connected then v is an ancestor of w.Conversely,if v is an ancestor of w then they are connected by a directed path inΓ[w‖v].
This fact signals that Cau(G)is“purely causal”,in that all connected morphisms in the category go from causal ancestors to descendants.As a result,merely“associational”or“evidential”relations are not expressed by any morphism in the category.(Recall the“no junk”property of a free category.)
In some Markov categories such as FinStoch mentioned earlier,every morphism of the formf:admits conditionals over both objects in the codomain([2]),but this is not the case in Cau(G).Take,for instance,the simple graphx →yand consider the exogenous effect[xy]:.If a conditional[xy]x|y:overyexisted then we would have Since the duplicateδydoes not occur on the left-hand side,by the first claim of Lemma 4.2,its displayed occurrence on the right-hand side must be decorative,but thenx,ycannot be connected by a splitter path,violating the third claim of Lemma 4.2.So the equality is not possible.On the other hand,[xy]obviously admits a conditional overx,which is just[y‖x]=κy.
This simple example actually illustrates a general fact:for pairwise disjoint singular variables in Cau(G),u,v,w,if[vw‖u]admits a conditional overv,the conditional must be[w‖uv].We will leave the proof of this fact to another occasion,since it is a little involved and not directly relevant to the intended contributions of this paper.For the present purpose,the directly relevant question is when[w‖uv]is a conditional of[w‖uv],or in other words,when the followingdecompositionordisintegrationof a causal effect holds:
Call the property expressed by(4.4)thedecomposabilityof[vw‖u]overv.We now introduce some graphical conditions relevant to characterizing decomposability and other significant concepts to be introduced presently:
Definition 4.7.Leti,jbe two distinct vertices inG.Aforward trekfromitojinGis a directed path fromitoj.Abackward trekfromitojis a directed path fromjtoi,or a disjoint union of two directed paths joined at a distinct starting vertexk(i.e.,i ←···←k →···→j).
GivenX,Y ?V(G),aproperforward(respectively,backward)trek fromXtoYis a forward(respectively,backward)trek from somei ∈Xto somej ∈Ythat does not contain any other vertex inXor inY.
We say that
·Xisforward-t-separatedfromYbyZif every proper forward trek inGfromXtoYcontains somek ∈Z?
·Xisbackward-t-separatedfromYbyZif every proper backward trek inGfromXtoYcontains somek ∈Z?
·XandYaret-separatedbyZifXis both forward-t-separated and backwardt-separated fromYbyZ.
Observe thatt-separation is a symmetric relation,but forward-t-separation and backward-t-separation are not.Also note thatt-separation is a simpler condition than the well-knownd-separation([9])?the former is concerned only with blocking treks,whereas the latter also has an explicit requirement for paths that contain colliders,where two arrows point at the same vertex(i.e.,i →k ←j).
For the rest of this section,letu,v,andwbe pairwise disjoint singular variables in Cau(G).
Theorem 4.8.[vw‖u]is decomposable over v if and only if v is backward-t-separated from w by u.
Proof.For the “if” direction,supposevis backward-t-separated fromwbyu,and we show that the equality (4.4) holds.We examine the components Γi,Γjfori ∈V(Gu→v),j ∈V(Guv→w)and show that,together withδvandδuon the righthand side of (4.4),they are exactly those needed to construct the string diagram Γ[vw‖u].There are the following cases.
·i/∈uv.Then there is a directed path fromito somei′ ∈vinGthat does not pass throughu.Ifialso occurs inGuv→wthen either there is a directed path from it to somej′ ∈winGthat does not pass throughuv,or it is contained inw,both of which are prohibited by the backward-t-separation ofvfromwbyu.So Γiis the same in Γ[vw‖u]and Γ[v‖u].
·i ∈uv.Leti′,j′be any children ofiinGu→v,Guv→w,respectively.Soi′/∈uandj′/∈uv.Ifi′/∈vthen,by the case just considered,i′does not occur inGuv→wat all? for the same reason,j′does not occur inGu→v.On the other hand,ifi′ ∈vthen it cannot be a child ofiinGuv→w.So Γiin Γ[vw‖u]is the juxtaposition of the two Γiin Γ[v‖u]and Γ[w‖uv]joined byδu.
·j/∈uv.Thenjis an ancestor of somej′ ∈winG(uv→w,and hence cannot occur inGu→v,again due to the backward-t-separation ofvfromwbyu.So Γjis the same in Γ[vw‖u]and Γ[w‖uv].
This establishes(4.4).
For the“only if”direction,suppose that(4.4)holds.Letπbe a proper backward trek froma ∈vtob ∈winG.Suppose for contradiction thatπdoes not contain any vertex inu.By the first claim of Lemma 4.2,κbcannot occur in Γ[v‖u].Thus,by the other two claims of Lemma 4.2,πwould translate into a directed pathb?ain Γ[w‖vu],which is not possible,or a splitter path betweenaandbin Γ[vw‖u]that does not pass throughvin the direction ofband hence must pass throughu,contradiction again.
Readers familiar with Pearl’s do-calculus([10])may have noticed the close affinity between backward-t-separation and the condition for Rule 2 of the do-calculus.Before we elaborate on the connection,let us introduce two more notions to fully match the do-calculus.One of them (“conditional independence”) is introduced in[2]for all Markov categories.
Definition 4.9.Let M be a Markov category.
· Letf:be a morphism in M.We say thatXisconditionally irrelevant to Y given Z over fif there is a morphismfY|Z:such that
· Letf:be a morphism in M.We say thatX,Yareconditionally independent given Z over fif
We consider a specialized version of conditional irrelevance:viscausally screened-offfromwbyuif
It is easy to show that causal screening-off is captured precisely by forward-t-separation.
Theorem 4.10.[w‖vu]=?v ?[w‖u]if and only if v is forward-t-separated from w by u in G.
Proof.For the“if”direction,sincevis forward-t-separated fromwbyu,noi ∈vcan have children inGvu→wand henceGvu→wis the union ofGu→wand the trivial graph with vertices inv.It then follows from the construction of causal effects in Definition 3.6 that(4.7)holds.
For the“only if”direction,suppose that(4.7)holds.If there is a proper forward trek fromvtowinGthat does not contain any vertex inuthen,by Lemma 4.2,it would translate into a directed path on the right-hand side of(4.7)connectingvandw,which is not possible.
Similarly,conditional independence over causal effects is captured precisely byt-separation.
Theorem 4.11.We have that v,w are conditionally independent given u over[vw‖u]if and only if they are t-separated by u in G.
Proof.For the“if”direction,note thatt-separation betweenvandwbyuentails that,on the one hand,vis backward-t-separated fromwbyuand hence,by Theorem 4.8,the equality(4.4)holds,and on the other hand,vis forward-t-separated fromwbyuand hence,by Theorem 4.10,the equality(4.7)holds.It then follows that
So,by Lemma 4.5,v,ware conditionally independent givenuover[vw‖u].
For the“only if”direction,by Lemma 4.5 again,we may assume
Letπbe a proper forward or backward trek froma ∈vtob ∈w.Ifπdoes not contain any vertex inuthen it would translate into a directed or splitter pathγbetweenaandbin Γ[vw‖u].Sincea,bu,we see thatκa,κboccur exactly once in Γ[vw‖u]and hence,by (4.9) and the first claim of Lemma 4.2,κaand henceado not occur in Γ[w‖u],whereasκband hencebdo not occur in Γ[v‖u].It follows from the other two claims of Lemma 4.2 thatγhas to pass throughu,contradiction.
A merit of such theorems about the syntax category is that the sufficiency claims in them are immediately transferred to all models.
Corollary 4.12.Let M be a Markov category and F:Cau(G)Ma strong Markov functor.Then
1.If v and w are t-separated by u in G then,inM,F(v)and F(w)are conditionally independent given F(u)over F([vw‖u]):
2.If v is backward-t-separated from w by u in G then,inM,F([w‖uv])is a conditional of F([vw‖u]):
3.If v is forward-t-separated from w by u in G then,inM,F(v)is conditionally irrelevant to F(w)given F(u)over F([w‖uv]):
On the other hand,the necessity claims do not hold in all models.For example,the decomposition property (4.11) always holds in deterministic structural equation models(as functors from Cau(G)to Set),regardless of backward-t-separation.The necessity in question is necessity for validity(“true in all models”),rather than truth in particular models.
Corollary 4.12 is particularly interesting because its three clauses correspond to the three rules in Pearl’s do-calculus,respectively.Suppose M=FinStoch,andFsends each causal mechanismκvto a positive stochastic matrix,so that it gives rise to a causal Bayesian network(CBN)model([1]),in which the pre-intervention joint probability distribution is positive(which is assumed by Pearl’s do-calculus).LetXdenote the set of random variables represented by the objectu,Ybyw,andZbyv.Then equations(4.10)-(4.12)can be reformulated as follows.2
Equation(4.10)is rendered as:
Equation(4.11)is rendered as:
Equation(4.12)is rendered as:
2In FinStoch,we can simply use nonzero natural numbers as the objects,and the morphisms are stochastic matrices:a morphismis am×nstochastic matrix.A discard morphism?n:1 is the 1×nstochastic matrix in which each entry is 1,and a duplicate morphismδn:2is then2×nstochastic matrix in which=1(and other entries are zero).The composition of morphisms is given by matrix multiplication,and the monoidal product is given by the Kronecker product of matrices.([1,2]) As shown in [1],F([vG]),the image of the exogenous effect onvGin FinStoch,is a stochastic matrix(in fact,a column vector)encoding a joint probability distribution over the set of random variables(V)represented byvGthat satisfies the factorization in(3.1).His argument can be generalized to show thatF([v‖u]) is a stochastic matrix encoding the distributions over the random variables represented byvgiven that those represented byuare intervened to take various values,according to the intervention principle (3.2).Note that equations (5.1)-(5.3) are understood as holding for all values ofX,Y,Z,and so express equality between specific entries in the relevant matrices.
Note in addition that by the probability calculus and the assumed positivity,(5.1)is equivalent to
and(5.2)is equivalent to
because by the chain rule of the probability calculus,P(Y,Z|do(X))=P(Z|do(X))P(Y|do(X),Z).
The upshot is that Corollary 4.12,when applied to a CBN model based onG(with a positive or regular pre-intervention distribution),entails the following rules.Rule 1(Insertion/deletion of observations):ifYandZaret-separated byXinG,then
Rule 2(Action/observation exchange):ifZis backward-t-separated fromYbyXinG,then
Rule 3(Insertion/deletion of actions):ifZis forward-t-separated fromYbyXinG,then
Pearl’s do-calculus([10])consists of exactly three rules like these,but each rule therein is more general than the corresponding rule above and is formulated in terms of the more complicatedd-separation criterion and various modifications ofG.The extra generality in Pearl’s version is that the consequent equation in each rule has an extra set of variablesWto be conditioned upon on both sides of the equation.For example,the consequent equation in the first rule of Pearl’s do-calculus is
and similarly for the other two rules.It is a simple exercise to check that each of the rules above is exactly equivalent to the corresponding rule in Pearl’s calculus whenWis taken to be empty.
So Corollary 4.12,when applied to a CBN model,yields a specialized version of Pearl’s do-calculus.However,although each rule in the specialized version is a special case of the corresponding rule in the full version,taken together they are actually as strong as the full version.To see this,it suffices to show that we can recover the intervention principle(3.2)from the specialized rules,or to be more exact,from Rule 2 and Rule 3 above,since Rule 1 is entailed by the conjunction of Rule 2 and Rule 3(just as in the full version,see[5]).Since the full version is entailed by the intervention principle(plus the probability calculus),it is also entailed by the specialized version(plus the probability calculus)if the intervention principle is entailed by the specialized version(plus the probability calculus).
We now sketch a fairly simple argument to that effect.It helps to first consider the pre-intervention case,where we need to show that Rule 2 and Rule 3 entail that the pre-intervention probability distribution factorizes as in(3.1).This is equivalent to deriving the local Markov condition from Rule 2 and Rule 3,the condition that every variable is probabilistically independent of its non-descendants conditional on its parents.([13]) The derivation is straightforward.For any random variableV ∈Vin the CBN,by Rule 2,we have
where pa(V) and nd(V) denote the set ofV’s parents inGand the set ofV’s nondescendants inG(i.e.,those variables of whichVis not an ancestor),respectively.This is so because pa(V)is trivially backward-t-separated fromVby the empty set(for there is no proper backward trek from pa(V) toV),and so is nd(V),which contains pa(V)as a subset.Then by Rule 3,we have
simply because every forward trek toVcontains a parent ofV.It follows that for eachV,
So the factorization required by the intervention principle holds in the pre-intervention case.This argument easily generalizes to any post-intervention probability distributionP(V|do(T))?that is,we can derive in the same fashion from Rule 2 and Rule 3 that for everyV,
where pa*(V) and nd*(V) denote the set ofV’s parents and the set ofV’s nondescendants,respectively,in the subgraph ofGin which all arrows into variables inTare deleted.From this follows the factorization ofP(V|do(T))required by the intervention principle(3.2).
Therefore,the full do-calculus can in principle be derived from the specialized do-calculus together with the probability calculus.This fact reveals an equivalent formulation of the do-calculus for CBN models that is simpler than the standard formulation.More importantly,this simpler formulation reflects the“causal core”of the do-calculus,for it is an instance of the generic do-calculus given in Corollary 4.12,and the generic do-calculus is derived from results in a syntax category that is,so to speak,purely causal (because there is no morphism in that category to match noncausal or evidential relations between variables.More precisely,the causal core of the do-calculus is given by the rule for causal decomposition((4.11),rendered as(5.2)in a CBN model)and the rule for causal screening-off((4.12),rendered as(5.3)in a CBN model).These are derived without any consideration of the non-causal features of the models.The standard do-calculus for the CBN models can be seen as derived from a conjunction of these two rules on the one hand,which are purely causal,and the probability calculus on the other hand,which is non-causal.
Following the pioneering work of [1],we studied the causal effect morphisms in a causal DAG-induced free Markov category in some depth,and established sufficient and necessary graphical conditions for some conceptually important properties of such morphisms,including especially decomposition and screening-off.Our results yield a generic do-calculus that is more general and abstract than the standard do-calculus in the context of causal Bayesian networks.Not only is the generic docalculus more widely applicable,it is also conceptually illuminating in that it reveals the purely causal component of the do-calculus.When applied to causal Bayesian networks,it also results in a simpler but equivalent formulation of the probabilistic do-calculus.
Since the simpler do-calculus uses trek-separation rather than the more convoluted d-separation,it is probably easier to explain and understand.Moreover,the simpler do-calculus may also be readily extendable to other causal graphical models derived from DAG models.For example,in[16],Pearl’s do-calculus is extended to the so-called partial ancestral graphs (PAGs),which are used to represent Markov equivalence classes of DAG models.The extension is intended to capture the applicability of a do-calculus rule in all DAGs in the equivalence class represented by a PAG,but due to the complex graphical conditions in Pearl’s do-calculus,it only accommodates some but not all such cases of unanimous applicability.We suspect that an extension of the simpler do-calculus highlighted in this paper would be more straightforward and complete.