Explainable AI for Graph Data Augmentation in Machine Learning
PhD position – starting fourth quarter 2025
Location: GREYC laboratory, CNRS UMR 6072, Université de Caen Normandie, 14000 Caen, France
Scientific context
Pandora
This thesis is funded within the Pandora project, supported by the French ANR (National Research Agency) and underway since February 2025. Pandora is situated in the context of explainable artificial intelligence (XAI) as applied to graph neural networks (GNNs). By focusing on the internal functioning of GNNs, the objectives of the project are as follows:
— characterize, understand, and clearly explain the internal workings of GNNs using pattern extraction techniques;
— uncover statistically significant patterns of neural activation, called “activation rules,” to determine how networks encode concepts [7, 8];
— translate these activation rules into graph patterns interpretable by a user;
— use this knowledge to improve GNNs by identifying learning biases, generating additional data, and building explanatory systems.
The thesis will focus on the last of these objectives. The work carried out in this project (and by extension in the thesis) will be partially based on molecular data from biochemical experiments conducted within our collaboration with the CERMN laboratory (Centre d’Études et de Recherche sur le Médicament de Normandie), Université de Caen Normandie.
Problem setting
In machine learning, training data sets are not always sufficiently representative of the real world (for example, chemical and biological experiments often focus only on certain well-explored molecules or certain therapeutic targets). How can we detect that a training data set is insufficient? Two non-exhaustive indicators are:
— parts of the data space are not represented (e.g., some node/edge combinations never occur);
— the learned model is unreliable in some subspaces of the data (the reliability of a supervised model can be studied, for example, by examining the importance of individual instances in the construction of decision boundaries).
The literature contains methods to characterize data in a model-independent
manner [5] and methods to characterize the behavior of a model based on the
components of the individual graphs considered [9, 2, 6, 3, 4, 1]. However,
there is no approach that establishes the link between the data and the performance of a specific model. Furthermore, there are no approaches for augmenting the data as a means of improving model performance and reliability. The thesis is intended to address these gaps.
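As a minimal illustration of the first indicator above, the short Python sketch below lists node-label pairs that never appear as edge endpoints in a training set. It assumes networkx-style graphs whose nodes carry a "label" attribute; the data format, the function name, and the label vocabulary are illustrative assumptions, not the project's actual data model.

from itertools import combinations
import networkx as nx

def missing_edge_label_pairs(graphs, node_labels):
    """Return node-label pairs that are never connected by an edge in the dataset."""
    seen = set()
    for g in graphs:
        for u, v in g.edges():
            seen.add(frozenset((g.nodes[u]["label"], g.nodes[v]["label"])))
    # All unordered label pairs, including (l, l) pairs for edges between same-label nodes.
    all_pairs = {frozenset(p) for p in combinations(node_labels, 2)}
    all_pairs |= {frozenset((l,)) for l in node_labels}
    return all_pairs - seen

# Toy usage with two molecule-like graphs over the label set {C, N, O}.
g1 = nx.Graph(); g1.add_node(0, label="C"); g1.add_node(1, label="N"); g1.add_edge(0, 1)
g2 = nx.Graph(); g2.add_node(0, label="C"); g2.add_node(1, label="O"); g2.add_edge(0, 1)
print(missing_edge_label_pairs([g1, g2], ["C", "N", "O"]))  # N-O and all same-label pairs are missing

Such uncovered combinations point to regions of the data space where additional (possibly synthetic) instances could be collected or generated.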
Objectives
This thesis has three objectives. First, we want to characterize graph datasets at a global level, in a way similar to what is already done for vectorial datasets. Second, we want to design one (or more) approaches that use explanations of the behavior of GNNs to identify relevant instances of the training set. Finally, we will leverage the results of the first two points to generate additional data instances that improve the data set and thereby make GNNs more accurate and more robust.
Topic and overview of the work plan of the thesis
In short, the thesis deals with using patterns learnt from GNNs to improve GNNs by identifying learning biases, generating additional data, and building explanatory systems. More precisely, we wish to develop new methods to improve the learning of graph models by relying on an analysis of the internal functioning of these models via, for example, activation rules expressed in the latent space. This will involve analyzing decision boundaries and characterizing the errors of the studied model in the data space or in its latent representations in order to propose corrective solutions. This approach can be broken down into the following sub-problems:
Data characterization and bias identification. Characterizing the training data can help identify instances on which the model makes errors, and also detect whether the data themselves are a source of bias in learning. One direction of work is to study the complexity of activation rules and compare them to domain knowledge.
Targeted generation of additional data. Once the model's limitations have been identified, we want to automatically define "corrective patches" to improve the model's robustness. A preferred line of work will be the generation of targeted additional data that allow the model to better separate the data according to the studied class in the constructed representation.
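To make the notion of activation rules more concrete, the sketch below gives a deliberately simplified illustration in the spirit of [7, 8]: graph-level GNN activations are binarized, and small subsets of hidden units that are active far more often in one class than in the other are kept as candidate rules. The pooled activation matrix H, the binarization threshold, and the exhaustive search over small unit subsets are illustrative assumptions, not the project's method.

from itertools import combinations
import numpy as np

def discriminative_activation_rules(H, y, threshold=0.0, max_units=2, min_gap=0.4):
    """H: (n_graphs, n_units) pooled hidden activations; y: binary class labels."""
    A = (H > threshold).astype(int)                 # binarized activation matrix
    rules = []
    for k in range(1, max_units + 1):
        for units in combinations(range(A.shape[1]), k):
            active = A[:, list(units)].all(axis=1)  # graphs activating every unit of the subset
            supp_pos = active[y == 1].mean()        # support of the pattern in class 1
            supp_neg = active[y == 0].mean()        # support of the pattern in class 0
            if abs(supp_pos - supp_neg) >= min_gap:
                rules.append((units, supp_pos, supp_neg))
    return rules

# Toy usage: random activations stand in for a trained GNN's hidden layer.
rng = np.random.default_rng(0)
H = rng.normal(size=(100, 8))
y = (H[:, 0] + H[:, 3] > 0).astype(int)             # synthetic labels tied to units 0 and 3
for units, sp, sn in discriminative_activation_rules(H, y):
    print(f"units {units}: support {sp:.2f} in class 1 vs {sn:.2f} in class 0")

Comparing such rules (their number, length, and overlap) with domain knowledge is one way of approaching the bias-identification sub-problem above.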
The first problem, i.e., data characterization, will start from the knowledge developed in meta-learning for vectorial data, combined with existing work on explaining GNN predictions and on activation rules.
The second problem poses relatively complex research questions since
realistic graph data with desired properties is rather hard to generate.
While a number of graph data generators exist in the literature, the
generated data have often been found to lack properties observed in
real-world data.
Preliminary work plan
- Conduct a literature review of methods for explaining the behavior of GNN models [9, 2, 8, 7, 6, 3, 4, 1]. The aim of this study is to establish in what sense the different methods identify certain aspects of the data used to train the model.
- Design and implement approaches to identify the instances (graphs) involved in the explanatory descriptors/rules. It is not certain that such approaches can be found for all of them, which will then lead to a selection of descriptors. Highlighting the instances and subgraphs linked to the explanatory descriptors/rules will also make it possible to determine how the descriptors characterize different subsets of data.
- Develop a formalism to extend concepts defined for vector data (density, decision boundaries, value distributions) to graph data. This formalism, in combination with the results of step 2, will make it possible to determine where learning instances are missing in a training dataset and thus where it is useful to generate synthetic data (a minimal illustration is sketched after this list).
- Exploit the information derived from the first three points, as well as other sources (for instance, graph patterns extracted using pattern mining methods), to define constraints on symbolic data generators, so as to arrive at data with precise properties that fill the gaps in the data sets.
- Evaluate the generated data in the context of the project's use cases, in particular the prediction of molecular activity.
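As a minimal illustration of step 3, the sketch below transfers a vector-space notion of density to graphs: each graph is embedded with crude node-label counts (a stand-in for a proper graph kernel or GNN embedding, which remain to be chosen in the thesis), and graphs whose k nearest neighbours are far away are flagged as lying in sparsely covered regions where synthetic data could be generated. All function names and parameter values here are illustrative assumptions.

import numpy as np
import networkx as nx

def label_histogram(g, vocabulary):
    """Crude graph embedding: counts of each node label (placeholder for a real embedding)."""
    counts = dict.fromkeys(vocabulary, 0)
    for _, data in g.nodes(data=True):
        counts[data["label"]] += 1
    return np.array([counts[l] for l in vocabulary], dtype=float)

def low_density_graphs(graphs, vocabulary, k=3, quantile=0.9):
    """Return indices of graphs whose mean distance to their k nearest neighbours
    lies above the given quantile, i.e. graphs in sparsely covered regions."""
    X = np.stack([label_histogram(g, vocabulary) for g in graphs])
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)                                  # ignore self-distances
    knn_dist = np.sort(D, axis=1)[:, :k].mean(axis=1)            # mean distance to k nearest neighbours
    cutoff = np.quantile(knn_dist, quantile)
    return [i for i, d in enumerate(knn_dist) if d >= cutoff]

The flagged regions, combined with the instances highlighted in step 2, would then drive the constraints imposed on the symbolic generators of step 4.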
Keywords: statistical learning, graph neural networks, explainable AI, data mining.
Thesis period: starting in autumn 2025.
Remuneration: approximately €2,200 gross per month.
Supervising team:
— Bruno Crémilleux (GREYC – Université de Caen Normandie)
— Marc Plantevit (LRE – EPITA)
— Albrecht Zimmermann (GREYC – Université de Caen Normandie)
Candidate profile
The candidate must be enrolled in the final year of a Master’s degree or an
engineering degree, or hold such a degree, in a field related to computer
science or applied mathematics, and have solid programming skills.
Experience in data science, deep learning, etc. would be a plus. The candidate must be able to write scientific reports and communicate research results at conferences in English.
To apply
Application period: from now until the position is filled.
Send the following documents (exclusively in PDF format) to bruno.cremilleux@unicaen.fr, marc.plantevit@epita.fr and albrecht.zimmermann@unicaen.fr:
— a cover letter explaining your qualifications, experience, and motivation for this subject;
— a curriculum vitae;
— transcripts of grades (if possible with ranking) for the 3rd year of the Bachelor's degree and the 1st and 2nd years of the Master's degree, or equivalent for engineering schools;
— if possible, names of people (teachers or others) who can provide information on your skills and your work;
— a link to personal project repositories (e.g., GitHub);
— any other information you consider useful.
References
[1] C. Abrate, G. Preti, and F. Bonchi. Counterfactual explanations for graph classification through the lenses of density. In World Conference on Explainable Artificial Intelligence, pages 324–348. Springer, 2023.
[2] A. Duval and F. D. Malliaros. GraphSVX: Shapley value explanations for graph neural networks. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part II, pages 302–318. Springer, 2021.
[3] Q. Huang, M. Yamada, Y. Tian, D. Singh, and Y. Chang. GraphLIME: Local interpretable model explanations for graph neural networks. IEEE Transactions on Knowledge and Data Engineering, 35(7):6968–6972, 2022.
[4] A. Mastropietro, G. Pasculli, C. Feldmann, R. Rodríguez-Pérez, and J. Bajorath. EdgeSHAPer: Bond-centric Shapley value-based explanation method for graph neural networks. iScience, 25(10), 2022.
[5] M. A. Munoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.
[6] A. Perotti, P. Bajardi, F. Bonchi, and A. Panisson. GraphSHAP: Explaining identity-aware graph classifiers through the language of motifs. arXiv preprint arXiv:2202.08815, 2022.
[7] L. Veyrin-Forrer, A. Kamal, S. Duffner, M. Plantevit, and C. Robardet. In pursuit of the hidden features of GNN's internal representations. Data & Knowledge Engineering, 142:102097, 2022.
[8] L. Veyrin-Forrer, A. Kamal, S. Duffner, M. Plantevit, and C. Robardet. On GNN explainability with activation rules. Data Mining and Knowledge Discovery, pages 1–35, 2022.
[9] H. Yuan, H. Yu, J. Wang, K. Li, and S. Ji. On explainability of graph neural networks via subgraph explorations. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12241–12252. PMLR, 18–24 Jul 2021.