Overview
Generative image models are increasingly used in interior design for client previews, mood boards, and end-to-end room visualisation. But there is no widely accepted way to measure how well a given model handles interior-design-specific tasks. Generic metrics and evaluation protocols (FID, CLIPScore, general-purpose preference panels) do not capture what actually matters in this domain: whether a chosen design style is rendered faithfully, whether room proportions and furniture placement are realistic, and whether outputs remain consistent across multiple generations of the same prompt.
The Room AI Interior Design Benchmark addresses this gap. It is purpose-built for interior visualisation and evaluates 11 image generation models head-to-head, on the same prompts, using the same review panel.
Dataset
The benchmark uses a fixed, versioned dataset of 60 interior design prompts (Q2 2026 dataset version). Each prompt specifies a room type, a design style, and a small set of contextual constraints (e.g. “a small studio apartment kitchenette in Japandi style with limited natural light”).
Room types (10)
- living room
- bedroom
- kitchen
- bathroom
- dining room
- home office
- kids room
- hallway
- balcony
- studio
Design styles (6)
- Modern Scandinavian
- Industrial
- Bohemian
- Minimalist
- Mid-Century Modern
- Japandi
The 10 room types × 6 design styles grid yields exactly 60 cells. Each applicable cell contributes one prompt; cells where the pairing is not applicable are reallocated to edge cases (small rooms, unusual layouts, niche style requests).
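A minimal sketch of that grid construction follows; the prompt record at the end is a hypothetical shape, since the dataset's exact fields are not published here.

```python
from itertools import product

ROOMS = ["living room", "bedroom", "kitchen", "bathroom", "dining room",
         "home office", "kids room", "hallway", "balcony", "studio"]
STYLES = ["Modern Scandinavian", "Industrial", "Bohemian",
          "Minimalist", "Mid-Century Modern", "Japandi"]

# 10 room types x 6 styles = 60 cells, matching the 60-prompt dataset.
grid = list(product(ROOMS, STYLES))
assert len(grid) == 60

# Hypothetical record shape for one prompt; the constraint fields are
# illustrative assumptions, not a published schema.
prompt = {
    "room": "kitchen",
    "style": "Japandi",
    "constraints": ["small studio apartment kitchenette",
                    "limited natural light"],
}
```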
Models tested
All 11 models were run on the same prompt set, each at the default generation parameters documented by its vendor.
- Suede 2.5 (Room AI)
- Midjourney v6.1 (Midjourney)
- Flux.1 Pro (Black Forest Labs)
- DALL-E 3 (OpenAI)
- Imagen 3 (Google)
- Ochre (Room AI)
- Ideogram 2.0 (Ideogram)
- Linen 1.0 (Room AI)
- Stable Diffusion XL (Stability AI)
- Adobe Firefly 3 (Adobe)
- Leonardo AI (Leonardo)
Scoring dimensions
Elo rating
Pairwise blind preference scoring across 1,200+ A/B comparisons by interior design reviewers. Each comparison shows two unlabelled outputs from the same prompt; reviewers pick the stronger interior visualisation. Elo ratings are computed from the aggregate win-loss matrix using a standard implementation (K-factor 32, initial rating 1000).
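For reference, a minimal sketch of the standard Elo update with the stated parameters (K-factor 32, initial rating 1000). How comparisons are ordered or batched is not specified here, so treat this as illustrative rather than the benchmark's exact implementation.

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """New (rating_a, rating_b) after one A/B comparison."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * (e_a - s_a)

# Every model starts at 1000 and is updated across the 1,200+ comparisons.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True)
```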
Generation time
Wall-clock seconds from prompt submission to final image, averaged across 5 runs per prompt. Measurements use each vendor’s production endpoint at default settings; queue time is included where applicable, and network round-trip time is excluded.
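A minimal timing-harness sketch under these rules. The `generate` callable is a hypothetical stand-in for a vendor's blocking generation call; a naive wall-clock timer also captures network round-trip time, which the benchmark excludes, so the sketch subtracts a separately measured baseline.

```python
import statistics
import time

def mean_generation_seconds(generate, prompt: str, runs: int = 5,
                            network_overhead: float = 0.0) -> float:
    """Average wall-clock seconds from submission to final image.

    `generate` (hypothetical) must block until the image is ready, so
    server-side queue time is naturally included. `network_overhead` is
    a separately measured per-call baseline to subtract, matching the
    benchmark's exclusion of round-trip time.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)  # blocks until the final image exists
        samples.append(time.perf_counter() - start - network_overhead)
    return statistics.mean(samples)
```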
Style fidelity (1–10)
Manual scoring of how accurately the output matches the specified design style. Reviewers check signature elements (e.g. tapered legs and walnut tones for Mid-Century Modern; limewashed timber and wabi-sabi imperfection for Japandi).
Spatial coherence (1–10)
Proportions, perspective, and furniture placement realism. Penalises common failure modes: floating furniture, impossible perspectives, doors that open into walls, mismatched ceiling-line vanishing points.
Style consistency (1–10)
Variance across 5 generations of the same prompt; lower variance scores higher. Measured on style adherence, palette stability, and material consistency. A high score means a user can re-roll the same prompt and reliably get on-brand outputs.
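One way to turn re-roll variance into a 1–10 score, as a hypothetical sketch; the benchmark does not publish its exact formula, and the linear mapping below is an assumption.

```python
import statistics

def consistency_score(sub_scores: list[float]) -> float:
    """Map re-roll variance to a 1-10 consistency score.

    `sub_scores` are per-generation ratings (e.g. style adherence) for
    the 5 re-rolls of one prompt. The linear penalty is a hypothetical
    illustration, not the benchmark's published formula.
    """
    sd = statistics.stdev(sub_scores)
    return max(1.0, 10.0 - 3.0 * sd)  # sd 0 -> 10; sd >= 3 -> floor of 1
```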
Edge case handling (1–10)
Small rooms, awkward layouts, niche styles, and unusual constraints. Distinguishes models that perform well on canonical prompts but fall apart on long-tail requests.
Composite
Weighted average of five scoring dimensions; generation time is reported separately and carries no weight. Weights: style fidelity 0.30, spatial coherence 0.25, style consistency 0.20, edge case handling 0.15, Elo-derived component 0.10.
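A worked sketch of the composite under these weights. The dimension scores below are made up for illustration, and the rescaling of the Elo rating onto a 1–10 scale is an assumption; the benchmark does not state how that component is normalised.

```python
WEIGHTS = {
    "style_fidelity": 0.30,
    "spatial_coherence": 0.25,
    "style_consistency": 0.20,
    "edge_case_handling": 0.15,
    "elo_component": 0.10,  # assumes the Elo rating is rescaled to 1-10
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average over the five 1-10 dimension scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Illustrative (made-up) scores for one model:
example = {"style_fidelity": 8.5, "spatial_coherence": 7.8,
           "style_consistency": 8.0, "edge_case_handling": 7.0,
           "elo_component": 8.2}
print(round(composite(example), 2))  # 7.97
```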
Reviewers
Scoring was conducted by a panel of 4 interior design professionals and 2 spatial-AI researchers. Reviewers received the prompt context but not the model identity for any individual output. Inter-rater agreement was ≥ 0.78 (Cohen’s κ) on a held-out calibration set.
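Cohen’s κ is defined for two raters, so a six-reviewer panel figure is presumably a pairwise summary. A sketch of that computation, assuming scikit-learn and assuming the ≥ 0.78 refers to the worst rater pair; both are assumptions, since the aggregation method is not stated.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def min_pairwise_kappa(labels_by_rater: dict[str, list[int]]) -> float:
    """Lowest Cohen's kappa over all rater pairs on the calibration set."""
    return min(
        cohen_kappa_score(labels_by_rater[a], labels_by_rater[b])
        for a, b in combinations(labels_by_rater, 2)
    )
```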
Limitations
- Elo comparisons are limited to interior design prompts; results may not generalise to other generative tasks (illustration, character art, product photography, etc.).
- Generation-time measurements depend on vendor endpoint capacity at the time of testing. We re-run timing measurements weekly during the benchmark window to reduce noise.
- Style fidelity scores are inherently subjective. We mitigate this with multi-reviewer scoring and inter-rater calibration, but cannot eliminate it.
- The benchmark scores public-API model versions only. Fine-tuned or LoRA-customised variants of any listed model are out of scope.
Reproducibility
The 60-prompt dataset is available on request for academic and partner use. Reach out via the contact form below and reference the Q2 2026 dataset version. The structured benchmark data is also published as a public JSON endpoint at /data/benchmarks.json and as the LLM-readable record at /models.json.
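A minimal sketch of pulling the structured results, assuming the endpoints live on the site root; the base URL and the response schema are assumptions, so inspect the payload before relying on specific keys.

```python
import json
import urllib.request

BASE = "https://roomai.com"  # assumed host; the paths come from this page

with urllib.request.urlopen(f"{BASE}/data/benchmarks.json") as resp:
    benchmarks = json.load(resp)

# The response schema is not documented here, so inspect it before use.
print(type(benchmarks).__name__)
```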
Last run
The benchmark was last run on 2026-05-15 (Q2 2026). We re-run the benchmark each quarter and on any major model release. Subscribe to the Room AI changelog to be notified when new results are published.
See the results
View the full leaderboard of all 11 models, category winners by style and room type, and how each model handles edge cases.