SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data

¹Adam Mickiewicz University, ²ArtiCollect, ³KAUST, ⁴Kiel University
SADGE method diagram: real and synthetic image pairs are scored along an Appearance Similarity branch (pretrained vision encoder, cosine similarity) and a Geometry Consistency branch (learned matcher, correspondence-based verification); the two z-score-standardized signals are fused into a single SADGE score.

SADGE predicts the utility of a synthetic image dataset for downstream visual recognition by jointly modeling appearance similarity and geometry consistency between real and synthetic images. For each real image, comparison is performed either using an aligned real–synthetic pair or by retrieving the best synthetic match from a candidate subset. After dataset-level aggregation, the appearance and geometry scores are fused with a constrained bilinear interaction model to produce the final SADGE score.
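The appearance branch and the fusion step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the mean as the dataset-level aggregate, the fusion weights `w_a`, `w_g`, `w_ag`, and the reading of "constrained" as non-negative weights are all assumptions made here. The embeddings would come from a pretrained vision encoder, which is outside the scope of this sketch.

```python
import numpy as np


def cosine_matrix(real, synth):
    """Pairwise cosine similarity between rows of two embedding matrices."""
    real_n = real / np.linalg.norm(real, axis=1, keepdims=True)
    synth_n = synth / np.linalg.norm(synth, axis=1, keepdims=True)
    return real_n @ synth_n.T


def appearance_score(real, synth, aligned):
    """Dataset-level appearance score (assumed aggregate: the mean).

    aligned=True  -> compare real image i with synthetic image i.
    aligned=False -> retrieve the best synthetic match for each real image.
    """
    sims = cosine_matrix(real, synth)
    per_image = np.diag(sims) if aligned else sims.max(axis=1)
    return float(per_image.mean())


def zscore(values):
    """Standardize dataset-level scores across datasets (needs non-zero spread)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def bilinear_fuse(a, g, w_a=1.0, w_g=1.0, w_ag=0.5):
    """Hypothetical bilinear fusion: w_a*a + w_g*g + w_ag*a*g.

    The weights are placeholders, and the non-negativity check is only
    one possible interpretation of a "constrained" interaction model.
    """
    if min(w_a, w_g, w_ag) < 0:
        raise ValueError("weights are assumed to be non-negative")
    return w_a * a + w_g * g + w_ag * a * g


# Toy embeddings: the synthetic set is a shuffled copy of the real set.
real = np.eye(3)
synth = real[[1, 2, 0]]
aligned_score = appearance_score(real, synth, aligned=True)
retrieved_score = appearance_score(real, synth, aligned=False)
```

In this toy case the aligned comparison sees only mismatched pairs while retrieval recovers each image's shuffled counterpart, which shows why the two comparison modes can give very different appearance scores for the same data.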

Abstract

We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real-world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, no existing study evaluates which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show across a wide variety of synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non-linear interplay that dictates synthetic data utility.

Specifically, we measure how commonly used Appearance and Geometric Similarity metrics (e.g., CLIP, PSNR, LPIPS, SSIM) computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic-to-real benchmark families and 15 dataset-level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria, reaching Pearson r = 0.879 and Spearman ρ = 0.768 (n = 15, approximate p = 8.3×10⁻⁴). We compute SADGE scores across all benchmark families for every combination of geometry-based methods (SSIM, SuperPoints, MASt3R, LoFTR) and appearance-based approaches (FID, DINOv2, DINOv3, SigLIP2, SAM3, PSNR, CLIP, LPIPS). The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry-only baseline (LoFTR, ρ = 0.582) and the strongest appearance-only baseline (PSNR, ρ = 0.536).
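The linear and rank-based criteria mentioned above (Pearson r and Spearman ρ) can be computed from dataset-level scores as in the sketch below. The helper names and the toy numbers are illustrative assumptions and are not values from the paper. Spearman ρ is computed here as the Pearson correlation of average ranks, which is the standard definition when ties are present.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: strength of the linear association."""
    return float(np.corrcoef(x, y)[0, 1])


def average_ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Toy example (made-up numbers): one similarity score and one downstream
# metric per dataset variant. The relation is monotonic but not linear.
similarity_scores = [1.0, 2.0, 3.0, 4.0]
downstream_metric = [1.0, 4.0, 9.0, 16.0]

r = pearson_r(similarity_scores, downstream_metric)
rho = spearman_rho(similarity_scores, downstream_metric)
```

Reporting both values is useful because a metric can rank datasets correctly (high ρ) without tracking downstream performance linearly (lower r), as the toy data illustrates.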

BibTeX

@misc{bartkowiak2026sadgestructureappearancedomain,
      title={SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data},
      author={Patryk Bartkowiak and Bartosz Kotrys and Dominik Michels and Soren Pirk and Wojtek Palubicki},
      year={2026},
      eprint={2605.22467},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.22467},
}