1HKU MMLab | 2Meituan
†Corresponding authors
Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions—Customization, Illustration, Spatial reasoning, and Temporal dynamics—to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.
MacroData comprises 400,000 high-quality samples spanning four tasks, curated from diverse real-world sources. The dataset supports up to 10 reference images per sample, with an average of 5.44 references. Samples are organized into four reference-count categories (1–3, 4–5, 6–7, and 8–10 images), enabling fine-grained evaluation of models' ability to leverage increasing amounts of visual context. Each task contributes 100K samples, distributed across these reference-count categories.
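The four-category organization above can be sketched as a simple bucketing rule; this helper is illustrative (the function and constant names are our own), with only the bucket boundaries taken from the dataset description.

```python
# Reference-count categories used by MacroData: 1-3, 4-5, 6-7, 8-10.
# BUCKETS and reference_bucket are hypothetical names for illustration.
BUCKETS = [(1, 3), (4, 5), (6, 7), (8, 10)]

def reference_bucket(n_refs: int) -> str:
    """Map a sample's reference-image count to its category label."""
    for lo, hi in BUCKETS:
        if lo <= n_refs <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"unsupported reference count: {n_refs}")
```

For example, a sample with 5 references falls in the "4-5" category, and one with 10 references in "8-10".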
Each task is independently sourced and verified. Customization focuses on generating compositions conditioned on multiple reference images, covering diverse human and object subjects with varying backgrounds. Illustration produces illustrative images from multimodal context, featuring diverse topics derived from interleaved contexts. Spatial requires predicting novel-view images from multiple given views, leveraging structured multi-view captures of indoor scenes, outdoor environments, and isolated objects. Temporal involves forecasting future frames from historical sequences, with data extracted from videos with dense temporal correspondence. Rigorous LLM-based quality filtering ensures that each retained sample satisfies task-specific consistency and instruction-following criteria, yielding a dataset with a high signal-to-noise ratio suitable for both training and evaluation.
MacroBench evaluates models across 16 sub-categories defined by the cross-product of the four tasks and four image-count buckets (1–3, 4–5, 6–7, 8–10), yielding a comprehensive assessment of multi-reference generation capabilities. Our evaluation reveals several key findings. First, performance consistently degrades as the number of input reference images increases, highlighting that handling large visual contexts remains an open challenge for all current architectures. Second, models fine-tuned on MacroData show substantial gains over their base counterparts across all tasks and image-count groups, with our fine-tuned Bagel achieving an average score of 5.71, ranking third overall behind only the closed-source Nano Banana Pro and GPT-Image-1.5, and substantially improving over the base Bagel (3.03). Notably, it approaches Nano Banana Pro in Customization and surpasses it on Spatial tasks. Under identical architectures, MacroData also outperforms alternative fine-tuning datasets, including Echo4o, MICo, and OpenSubject, validating its effectiveness.
Furthermore, while increasing the number of input images from 1–5 to 6–10 generally degrades performance, models trained on MacroData exhibit improved robustness. For instance, our fine-tuned Qwen mitigates the severe drops in Customization and Illustration observed in the base model. MacroData also provides stable gains on challenging Spatial tasks, where base models typically score below 1.0. To further validate the generalization of our dataset, we evaluate performance on the OmniContext benchmark, which targets Customization tasks with 1–3 images. Despite targeting the broader multi-reference setting, MacroData achieves strong OmniContext performance, notably surpassing Echo4o, a dataset purpose-built for OmniContext that also distills from closed-source models, validating the quality of our data collection pipeline.
We examine how the distribution of training samples across input image counts affects model performance on progressive tasks—where task difficulty increases with input count (e.g., Customization)—and non-progressive tasks (e.g., Temporal). We compare four sampling ratios (1:1:1:1, 2:2:3:3, 1:2:3:4, and 1:3:7:9) applied to input groups of 1–3, 4–5, 6–7, and 8–10 images. We observe that upweighting large-input samples substantially boosts high-input performance on progressive tasks without hurting low-input performance, while non-progressive tasks show no such sensitivity. Based on these findings, MacroData adopts a 2:2:3:3 ratio for Customization and 1:1:1:1 for all other tasks.
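Allocating a task's sample budget across the four input groups according to one of these ratios amounts to a proportional split; a minimal sketch (the function name is our own, only the ratios and budget come from the text):

```python
def bucket_counts(total: int, ratio: tuple) -> list:
    """Split `total` samples across input-count buckets proportionally
    to `ratio`, assigning any integer-rounding remainder to the
    largest-input bucket."""
    s = sum(ratio)
    counts = [total * r // s for r in ratio]
    counts[-1] += total - sum(counts)  # preserve the exact total
    return counts
```

Under the 2:2:3:3 ratio adopted for Customization, a 100K budget splits into 20K/20K/30K/30K samples for the 1–3, 4–5, 6–7, and 8–10 groups respectively.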
We study how dataset size (1K, 5K, 10K, and 20K samples) affects Customization performance, evaluated on the Customization subsets of both MacroBench and OmniContext. Performance scales consistently with data volume, with the sharpest gains occurring between 1K and 10K. Returns diminish from 10K to 20K, suggesting performance is approaching saturation, though larger datasets continue to stabilize training convergence. We therefore scale each task to 100K samples in MacroData.
To analyze the trade-off between multi-reference and standard T2I generation, we evaluate models trained with varying T2I data ratios (0%, 10%, 20%, 40%) on GenEval and a representative MacroBench subset. While T2I co-training significantly enhances GenEval performance, increasing the ratio beyond 10% yields negligible marginal gains. Consequently, we adopt a 10% T2I data ratio for models trained on MacroData to optimize training efficiency.
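Sizing the T2I portion of the training mix follows directly from the chosen ratio; the sketch below assumes the ratio is measured against the combined mix (an interpretation we adopt for illustration; the function name is our own):

```python
def t2i_mix_size(multi_ref_total: int, t2i_ratio: float = 0.10) -> int:
    """Number of T2I samples needed so that they make up `t2i_ratio`
    of the combined (multi-reference + T2I) training mix.

    Solves n / (multi_ref_total + n) = t2i_ratio for n.
    """
    return round(multi_ref_total * t2i_ratio / (1.0 - t2i_ratio))
```

For instance, at a 10% ratio, every 9 multi-reference samples are paired with 1 T2I sample.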