MACRO:
Advancing Multi-Reference Image Generation
with Structured Long-Context Data

Zhekai Chen1, Yuqing Wang1, Manyuan Zhang†2, Xihui Liu†1

1HKU MMLab  |  2Meituan
†Corresponding authors

Abstract

Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions—Customization, Illustration, Spatial reasoning, and Temporal dynamics—to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.

Dataset Overview

Dataset Statistics

MacroData comprises 400,000 high-quality samples spanning four tasks, curated from diverse real-world sources. The dataset supports up to 10 inputs per sample, with an average of 5.44 references. Samples are structured across four reference-count buckets (1–3, 4–5, 6–7, and 8–10 images), enabling fine-grained evaluation of models' ability to leverage increasing amounts of visual context. Each task contributes 100K samples, distributed across these buckets.
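The bucket structure above amounts to a simple lookup from a sample's reference count to its group; a minimal sketch, assuming each sample exposes its reference-image count:

```python
# Map a sample's reference-image count to one of MacroData's four buckets.
# Bucket boundaries (1-3, 4-5, 6-7, 8-10) follow the dataset description.
BUCKETS = [(1, 3), (4, 5), (6, 7), (8, 10)]

def bucket_of(num_refs: int) -> str:
    """Return the bucket label (e.g. "4-5") for a reference count."""
    for lo, hi in BUCKETS:
        if lo <= num_refs <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"unsupported reference count: {num_refs}")
```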


Each task is independently sourced and verified. Customization focuses on generating compositions conditioned on multiple reference images, covering diverse human and object subjects with varying backgrounds. Illustration produces illustrative images from multimodal context, featuring diverse topics derived from interleaved contexts. Spatial requires predicting novel-view images given multiple views, leveraging structured multi-view captures of indoor scenes, outdoor environments, and isolated objects. Temporal involves forecasting future frames from a historical sequence, with data extracted from video sequences with dense temporal correspondence. Rigorous LLM-based quality filtering ensures that each retained sample satisfies task-specific consistency and instruction-following criteria, resulting in a dataset with a high signal-to-noise ratio, suitable for both training and evaluation.

Data Construction Pipelines

Customization Pipeline
We aggregate source metadata from OpenSubject for humans, MVImgNet for objects, DL3DV for scenes, the Vibrant Clothes Rental Dataset for clothing, and WikiArt for style. Each source undergoes tailored preprocessing: we uniformly sample video keyframes for scenes, filter out clothing images containing faces via VLMs to prevent identity leakage, and categorize styles by tag to guarantee diversity. Aesthetic scoring is then applied to select high-quality images. We sample metadata across these categories to build diverse composition sets, using LLMs to evaluate each set and iteratively resample any logically or spatially unreasonable combinations. The valid sets are then synthesized into target images. To ensure data fidelity, we perform a VLM-based bidirectional assessment, filtering out samples whose input images are not faithfully reflected in the output or whose text prompts are semantically inconsistent with the generated images.
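The sample-then-resample loop above can be sketched as follows. The `judge` callable stands in for the LLM validity check described in the text and is a hypothetical placeholder here, as are the toy category lists:

```python
import random

# Sketch of composition-set sampling with iterative resampling. A candidate
# set draws one subject per category; a judge (the LLM in the pipeline,
# a toy rule here) rejects unreasonable combinations, triggering a resample.
CATEGORIES = {
    "human": ["chef", "dancer"],
    "object": ["guitar", "bicycle"],
    "scene": ["kitchen", "street"],
    "style": ["watercolor", "oil"],
}

def sample_set(rng):
    return {cat: rng.choice(items) for cat, items in CATEGORIES.items()}

def build_valid_set(judge, rng=None, max_tries=20):
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        candidate = sample_set(rng)
        if judge(candidate):   # LLM-style plausibility check
            return candidate
    raise RuntimeError("no valid combination found")

# Toy judge: deem a chef in a street scene spatially unreasonable.
toy_judge = lambda s: not (s["human"] == "chef" and s["scene"] == "street")
```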
Illustration Pipeline
We leverage large-scale interleaved image-text sequences from the web crawls of OmniCorpus-CC-210M as raw source material. Since such data is inherently noisy, we employ VLMs to identify "anchor images" that exhibit high semantic relevance to both the accompanying text and the preceding images. These anchors are designated as generation targets, with the preceding context serving as input conditions. Because the preceding contexts are often noisy and textually verbose, we employ strong VLMs to regenerate each sequence: the VLM re-evaluates semantic relevance, discards images inconsistent with the narrative flow, and synthesizes a concise, coherent textual context. Each reorganized sample is then assigned a quality score, and low-scoring samples are filtered out to finalize the Illustration dataset.
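Anchor selection reduces to a thresholded scan over the interleaved sequence. A minimal sketch, assuming the VLM relevance scores are already computed (the field names and threshold are illustrative, not the paper's):

```python
# Sketch of anchor-image selection over an interleaved sequence.
# Each item carries VLM-assigned relevance scores in [0, 1]:
#   text_rel  - relevance to the accompanying text
#   image_rel - relevance to the preceding images
def select_anchors(items, tau=0.7):
    """Return (anchor_id, context_ids) pairs. An item becomes an anchor when
    both scores exceed tau; everything before it serves as input context
    (the first item is never an anchor, since it has no context)."""
    samples = []
    for i, it in enumerate(items):
        if i > 0 and it["text_rel"] > tau and it["image_rel"] > tau:
            samples.append((it["id"], [p["id"] for p in items[:i]]))
    return samples
```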
Spatial Pipeline
We construct this subtask from multi-view object renderings in the G-buffer Objaverse dataset. The outside-in setup captures a central object from surrounding external viewpoints. To bridge the domain gap with real imagery, we filter out low-quality samples based on color saturation and brightness in HSV space. From a set of canonical views, we designate one as the target and randomly sample diverse input views from the remainder, strictly ensuring visual overlap for physical plausibility. For inside-out scenes, which capture the surrounding environment by looking outward from a central internal point, we start from panoramic images in DIT360, Pano360, and Polyhaven, filtering out non-standard formats and categorizing the rest into indoor and outdoor scenes using VLMs. We define canonical viewing directions from the panorama's center, designate a target view, and sample input views from the remaining directions, ensuring adequate spatial overlap with the target.
Temporal Pipeline
To mitigate redundancy in raw video content from OmniCorpus-YT, we apply shot-boundary detection to segment videos into semantically distinct clips and extract the central frame of each clip as a representative keyframe, compressing temporal information while preserving key visual transitions. We then group keyframes into coherent sequences within scene boundaries identified via DINOv2 visual-feature similarity thresholds, ensuring smooth transitions. A VLM generates a descriptive summary and a quality score for each sequence, and low-scoring samples are discarded. The final frame of each valid sequence is designated as the generation target, completing the Temporal dataset.
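The similarity-threshold grouping step can be sketched as a single pass over consecutive keyframe features. Here the features are assumed inputs standing in for DINOv2 embeddings, and the threshold is illustrative:

```python
import math

# Sketch of scene grouping: consecutive keyframes stay in the same group while
# the cosine similarity of their feature vectors stays above a threshold;
# a similarity drop starts a new group (a scene boundary).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def group_keyframes(features, tau=0.8):
    """features: list of vectors. Return groups of frame indices."""
    groups, current = [], [0]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) >= tau:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups
```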

Experimental Results

Main Results Table

Benchmark Analysis

MacroBench evaluates models across 16 sub-categories defined by the cross-product of four tasks and four image-count buckets (1–3, 4–5, 6–7, 8–10), yielding a comprehensive assessment of multi-reference generation capabilities. Our evaluation reveals several key findings. First, performance consistently degrades as the number of input reference images increases, highlighting that handling large visual contexts remains an open challenge for all current architectures. Second, models fine-tuned on MacroData show substantial gains over their base counterparts across all tasks and image-count groups, with our fine-tuned Bagel achieving a 5.71 average score, ranking third overall behind only the closed-source Nano Banana Pro and GPT-Image-1.5, and substantially improving over the base Bagel (3.03). Notably, it approaches Nano Banana Pro in Customization and surpasses it in Spatial tasks. Under identical architectures, MacroData also outperforms alternative fine-tuning datasets including Echo4o, MICo, and OpenSubject, validating its effectiveness.

Furthermore, while increasing input images from 1–5 to 6–10 generally degrades performance, models trained on MacroData exhibit improved robustness. For instance, our fine-tuned Qwen mitigates the severe drops in Customization and Illustration observed in the base model. MacroData also provides stable gains in challenging Spatial tasks, where base models typically score below 1.0. To further validate the generalization capabilities of our dataset, we evaluate performance on the OmniContext benchmark, which targets 1–3 image Customization tasks. MacroData achieves strong OmniContext performance despite targeting the broader multi-reference setting, notably surpassing Echo4o, a dataset purpose-built for OmniContext that also distills from closed-source models, validating the quality of our data collection pipeline.

Ablation Studies

Data Ratio Ablation

We examine how the distribution of training samples across input image counts affects model performance on progressive tasks—where task difficulty increases with input count (e.g., Customization)—and non-progressive tasks (e.g., Temporal). We compare four sampling ratios (1:1:1:1, 2:2:3:3, 1:2:3:4, and 1:3:7:9) applied to input groups of 1–3, 4–5, 6–7, and 8–10 images. We observe that upweighting large-input samples substantially boosts high-input performance on progressive tasks without hurting low-input performance, while non-progressive tasks show no such sensitivity. Based on these findings, MacroData adopts a 2:2:3:3 ratio for Customization and 1:1:1:1 for all other tasks.
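Converting a chosen ratio into per-bucket sample counts for a fixed task budget is a small arithmetic step; a sketch, with the rounding remainder absorbed into the last bucket:

```python
# Sketch: turn a bucket ratio (e.g. Customization's 2:2:3:3 over the
# 1-3 / 4-5 / 6-7 / 8-10 groups) into per-bucket sample counts for a
# fixed per-task budget.
def bucket_counts(total, ratio):
    weight = sum(ratio)
    counts = [total * r // weight for r in ratio]
    counts[-1] += total - sum(counts)  # absorb integer-division remainder
    return counts
```

For a 100K-sample task, 2:2:3:3 yields 20K/20K/30K/30K, while 1:1:1:1 splits evenly into 25K per bucket.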

Data Scale Ablation

We study how dataset size (1K, 5K, 10K, 20K samples) affects Customization performance, evaluated on the Customization subsets of both MacroBench and OmniContext. Performance scales consistently with data volume, with the sharpest gains occurring between 1K and 10K. Returns diminish from 10K to 20K, suggesting performance is approaching saturation, though larger datasets continue to stabilize training convergence. We therefore scale each task to 100K samples in MacroData.

T2I Ratio Ablation

To analyze the trade-off between multi-reference and standard T2I generation, we evaluate models trained with varying T2I data ratios (0%, 10%, 20%, 40%) on GenEval and a representative MacroBench subset. While T2I co-training significantly enhances GenEval performance, increasing the ratio beyond 10% yields negligible marginal gains. Consequently, we adopt a 10% T2I data ratio for models trained on MacroData to optimize training efficiency.
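Holding the T2I fraction at a target value determines how many T2I samples to mix in for a given multi-reference set; a sketch of that bookkeeping:

```python
# Sketch: given N multi-reference samples, compute how many T2I samples to
# add so that T2I makes up a target fraction of the combined training mix.
# Solves t / (n + t) = f  =>  t = n * f / (1 - f).
def t2i_samples_needed(n_multi_ref, t2i_fraction=0.10):
    return round(n_multi_ref * t2i_fraction / (1.0 - t2i_fraction))
```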

BibTeX

@article{chen2026macroadvancingmultireferenceimage,
  title   = {MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data},
  author  = {Zhekai Chen and Yuqing Wang and Manyuan Zhang and Xihui Liu},
  journal = {arXiv preprint arXiv:2603.25319},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.25319},
}