Overview
Generative image models are increasingly used in interior design for client previews, mood boards, and end-to-end room visualisation. But there is no widely accepted way to measure how well a given model handles interior-design-specific tasks. Generic metrics and evaluation protocols (FID, CLIPScore, general-purpose preference panels) do not capture what actually matters in this domain: whether a chosen design style is rendered faithfully, whether room proportions and furniture placement are realistic, and whether outputs remain consistent across multiple generations of the same prompt.
The Room AI Interior Design Benchmark addresses this gap. It is purpose-built for interior visualisation and evaluates 11 image generation models head-to-head, on the same prompts, using the same review panel.
Dataset
The benchmark uses a fixed, versioned dataset of 60 interior design prompts (Q2 2026 dataset version). Each prompt specifies a room type, a design style, and a small set of contextual constraints (e.g. “a small studio apartment kitchenette in Japandi style with limited natural light”).
Room types (10)
- living room
- bedroom
- kitchen
- bathroom
- dining room
- home office
- kids room
- hallway
- balcony
- studio
Design styles (6)
- Modern Scandinavian
- Industrial
- Bohemian
- Minimalist
- Mid-Century Modern
- Japandi
The 10 room types × 6 design styles grid yields exactly 60 cells. Each applicable cell contributes one prompt; cells where the pairing is not applicable are reallocated to edge cases (small rooms, unusual layouts, niche style requests).
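A minimal sketch of that grid construction follows; the prompt record at the end is a hypothetical shape, since the dataset's exact fields are not published here.

```python
from itertools import product

ROOMS = ["living room", "bedroom", "kitchen", "bathroom", "dining room",
         "home office", "kids room", "hallway", "balcony", "studio"]
STYLES = ["Modern Scandinavian", "Industrial", "Bohemian",
          "Minimalist", "Mid-Century Modern", "Japandi"]

# 10 room types x 6 styles = 60 cells, matching the 60-prompt dataset.
grid = list(product(ROOMS, STYLES))
assert len(grid) == 60

# Hypothetical record shape for one prompt; the constraint fields are
# illustrative assumptions, not a published schema.
prompt = {
    "room": "kitchen",
    "style": "Japandi",
    "constraints": ["small studio apartment kitchenette",
                    "limited natural light"],
}
```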
Models tested
All 11 models were run on the same prompt set, each at the default generation parameters documented by its vendor.
- Suede 2.5 (Room AI)
- Midjourney v6.1 (Midjourney)
- Flux.1 Pro (Black Forest Labs)
- DALL-E 3 (OpenAI)
- Imagen 3 (Google)
- Ochre (Room AI)
- Ideogram 2.0 (Ideogram)
- Linen 1.0 (Room AI)
- Stable Diffusion XL (Stability AI)
- Adobe Firefly 3 (Adobe)
- Leonardo AI (Leonardo)
Scoring dimensions
Elo rating
Pairwise blind preference scoring across 1,200+ A/B comparisons by interior design reviewers. Each comparison shows two unlabelled outputs from the same prompt; reviewers pick the stronger interior visualisation. Elo ratings are computed from the aggregate win-loss matrix using a standard implementation (K-factor 32, initial rating 1000).
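For reference, a minimal sketch of the standard Elo update with the stated parameters (K-factor 32, initial rating 1000). How comparisons are ordered or batched is not specified here, so treat this as illustrative rather than the benchmark's exact implementation.

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """New (rating_a, rating_b) after one A/B comparison."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * (e_a - s_a)

# Every model starts at 1000 and is updated across the 1,200+ comparisons.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True)
```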
Generation time
Wall-clock seconds from prompt submission to final image, averaged across 5 runs per prompt. Measurements use each vendor’s production endpoint at default settings; queue time is included where applicable, and network round-trip time is excluded.
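A minimal timing-harness sketch under these rules. The `generate` callable is a hypothetical stand-in for a vendor's blocking generation call; a naive wall-clock timer also captures network round-trip time, which the benchmark excludes, so the sketch subtracts a separately measured baseline.

```python
import statistics
import time

def mean_generation_seconds(generate, prompt: str, runs: int = 5,
                            network_overhead: float = 0.0) -> float:
    """Average wall-clock seconds from submission to final image.

    `generate` (hypothetical) must block until the image is ready, so
    server-side queue time is naturally included. `network_overhead` is
    a separately measured per-call baseline to subtract, matching the
    benchmark's exclusion of round-trip time.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)  # blocks until the final image exists
        samples.append(time.perf_counter() - start - network_overhead)
    return statistics.mean(samples)
```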
Style fidelity (1–10)
Manual scoring of how accurately the output matches the specified design style. Reviewers check signature elements (e.g. tapered legs and walnut tones for Mid-Century Modern; limewashed timber and wabi-sabi imperfection for Japandi).
Spatial coherence (1–10)
Proportions, perspective, and furniture placement realism. Penalises common failure modes: floating furniture, impossible perspectives, doors that open into walls, mismatched ceiling-line vanishing points.
Style consistency (1–10)
Variance across 5 generations of the same prompt; lower variance scores higher. Measured on style adherence, palette stability, and material consistency. A high score means a user can re-roll the same prompt and reliably get on-brand outputs.
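One way to turn re-roll variance into a 1–10 score, as a hypothetical sketch; the benchmark does not publish its exact formula, and the linear mapping below is an assumption.

```python
import statistics

def consistency_score(sub_scores: list[float]) -> float:
    """Map re-roll variance to a 1-10 consistency score.

    `sub_scores` are per-generation ratings (e.g. style adherence) for
    the 5 re-rolls of one prompt. The linear penalty is a hypothetical
    illustration, not the benchmark's published formula.
    """
    sd = statistics.stdev(sub_scores)
    return max(1.0, 10.0 - 3.0 * sd)  # sd 0 -> 10; sd >= 3 -> floor of 1
```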
Edge case handling (1–10)
Small rooms, awkward layouts, niche styles, and unusual constraints. Distinguishes models that perform well on canonical prompts but fall apart on long-tail requests.
Composite
Weighted average of five scoring dimensions; generation time is reported separately and carries no weight. Weights: style fidelity 0.30, spatial coherence 0.25, style consistency 0.20, edge case handling 0.15, Elo-derived component 0.10.
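A worked sketch of the composite under these weights. The dimension scores below are made up for illustration, and the rescaling of the Elo rating onto a 1–10 scale is an assumption; the benchmark does not state how that component is normalised.

```python
WEIGHTS = {
    "style_fidelity": 0.30,
    "spatial_coherence": 0.25,
    "style_consistency": 0.20,
    "edge_case_handling": 0.15,
    "elo_component": 0.10,  # assumes the Elo rating is rescaled to 1-10
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average over the five 1-10 dimension scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Illustrative (made-up) scores for one model:
example = {"style_fidelity": 8.5, "spatial_coherence": 7.8,
           "style_consistency": 8.0, "edge_case_handling": 7.0,
           "elo_component": 8.2}
print(round(composite(example), 2))  # 7.97
```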
Reviewers
Scoring was conducted by a panel of 4 interior design professionals and 2 spatial-AI researchers. Reviewers received the prompt context but not the model identity for any individual output. Inter-rater agreement was ≥ 0.78 (Cohen’s κ) on a held-out calibration set.
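Cohen’s κ is defined for two raters, so a six-reviewer panel figure is presumably a pairwise summary. A sketch of that computation, assuming scikit-learn and assuming the ≥ 0.78 refers to the worst rater pair; both are assumptions, since the aggregation method is not stated.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def min_pairwise_kappa(labels_by_rater: dict[str, list[int]]) -> float:
    """Lowest Cohen's kappa over all rater pairs on the calibration set."""
    return min(
        cohen_kappa_score(labels_by_rater[a], labels_by_rater[b])
        for a, b in combinations(labels_by_rater, 2)
    )
```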
Limitations
- Elo comparisons are limited to interior design prompts; results may not generalise to other generative tasks (illustration, character art, product photography, etc.).
- Generation-time measurements depend on vendor endpoint capacity at the time of testing. We re-run timing measurements weekly during the benchmark window to reduce noise.
- Style fidelity scores are inherently subjective. We mitigate this with multi-reviewer scoring and inter-rater calibration, but cannot eliminate it.
- The benchmark scores public-API model versions only. Fine-tuned or LoRA-customised variants of any listed model are out of scope.
Reproducibility
The 60-prompt dataset is available on request for academic and partner use. Reach out via the contact form below and reference the Q2 2026 dataset version. The structured benchmark data is also published as a public JSON endpoint at /data/benchmarks.json and as the LLM-readable record at /models.json.
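A minimal sketch of pulling the structured results, assuming the endpoints live on the site root; the base URL and the response schema are assumptions, so inspect the payload before relying on specific keys.

```python
import json
import urllib.request

BASE = "https://roomai.com"  # assumed host; the paths come from this page

with urllib.request.urlopen(f"{BASE}/data/benchmarks.json") as resp:
    benchmarks = json.load(resp)

# The response schema is not documented here, so inspect it before use.
print(type(benchmarks).__name__)
```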
Last run
The benchmark was last run on 2026-05-15 (Q2 2026). We re-run the benchmark each quarter and on any major model release. Subscribe to the Room AI changelog to be notified when new results are published.
See the results
View the full leaderboard of all 11 models, category winners by style and room type, and how each model handles edge cases.