SIR: Structured Image Representations for Explainable Robot Learning

Paul Mattes, Jan Schwab, Jens Oliver Bosch, Maximilian Xiling Li, Nils Blank, Minh-Trung Tang, Moritz Haberland, Rudolf Lioutikov

June, 2026

Abstract

Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions. Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret. To address this, we introduce Structured Image Representations (SIR), a method that leverages Scene Graphs (SGs) as an intermediate representation for robot policy learning. Our approach first constructs a fully connected graph, using 2D or 3D image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a minimal, task-relevant sub-graph that is passed to the action generation model. This process makes our model intrinsically explainable. Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate. We also demonstrate that our graph-based representations are significantly more robust to distractor objects, showing almost no performance degradation, as opposed to image representations. Most importantly, we show that the learned sparse graphs are a powerful tool for model analysis. By analysing when the model’s sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases.

Type

Conference paper

Publication

In Computer Vision and Pattern Recognition Conference

SIR: Structured Image Representations for Explainable Robot Learning

Abstract

Read the Paper

Paul Mattes

PhD at KIT in the IRL