90 Metrics, Complete Transparency
Every claim on this site is backed by reproducible benchmarks. We show our wins, our losses, and even our overfitting analysis.
ColorBench: deterministic, float64, 13 categories. All data, scripts, and checkpoints are open source. No cherry-picking. Full train/test split analysis included.
83 internal + 7 independent validation = 90 total metrics
Color Difference Accuracy
MetricSpace predicts human color perception more accurately than any existing standard, including the industry-standard CIEDE2000 formula.
STRESS (CIE 217:2016) evaluated on COMBVD (3,813 pairs from 6 sub-datasets), MacAdam 1974 (128 pairs), and Human Feedback (3,552 judgements). Lower is better.
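For readers checking our numbers, STRESS is only a few lines of code. A minimal NumPy sketch of the CIE 217:2016 definition (variable names are ours; this is not the exact ColorBench implementation):

```python
import numpy as np

def stress(dE, dV):
    """STRESS between predicted differences dE and visual differences dV,
    after an optimal least-squares scaling. 0 = perfect agreement;
    lower is better."""
    dE, dV = np.asarray(dE, float), np.asarray(dV, float)
    F = np.sum(dE * dV) / np.sum(dE ** 2)  # optimal scale factor
    return 100.0 * np.sqrt(np.sum((F * dE - dV) ** 2) / np.sum(dV ** 2))
```

Because of the scaling step, STRESS is invariant to multiplying all predicted differences by a constant, so metrics with different natural units can be compared fairly.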
Generation Benchmark: 66-9
GenSpace is purpose-built for creating colors: gradients, palettes, gamut mapping. Head-to-head against OKLab (the current CSS standard), it wins 66 out of 90 metrics.
83 internal metrics (deterministic, float64) + 7 independent validation metrics. Opponent: OKLab with standard Euclidean deltaE. Same test harness, same precision.
Category Breakdown
Performance breakdown across 13 test categories. Click any card to expand and see every individual metric in that category.
83 internal metrics in 12 categories + 7 independent validation metrics. All deterministic, float64.
Gamut
How well the space maps to real device screens
Cusp validity, boundary smoothness, clipping across sRGB/P3/Rec2020
Gradient
How smooth colors blend between two endpoints
CV of perceptual step size, hue drift, banding metrics
Application
Real-world tasks like palettes, tints, and accessibility
Palette generation, gamut mapping, WCAG contrast, animation
Perceptual
Agreement with how humans actually see color
Munsell, MacAdam, Hung-Berns hue linearity validation
Structural
Mathematical properties that affect reliability
Hue reversals, OOG excursion, chroma amplification, LMS
Hue
Whether hue labels match human expectation
Hue RMS vs Munsell, primary lightness range
Achromatic
Perfect grays without color contamination
Gray ramp chroma residual under sRGB and D65
Advanced
Edge cases and stress tests
1000-trip roundtrip, Jacobian condition, 8-bit precision
Special
Problem areas where OKLab is known to struggle
Yellow chroma, blue-to-white midpoint, red-to-white shift
Banding
Visible stepping artifacts in gradients
Invisible step ratio, duplicate 8-bit bucket count
Accessibility
Usability for colorblind viewers
CVD simulation minimum step deltaE (protan/deutan)
Numerical
Mathematical precision of conversions
Round-trip error across sRGB, P3, Rec2020 (float64)
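Many of the gradient and animation metrics above reduce to one statistic: the coefficient of variation (CV) of perceptual step sizes along a sampled ramp. A minimal sketch of that statistic (the sampling density and the choice of perceptual space are up to the caller; this is not the exact ColorBench code):

```python
import numpy as np

def gradient_step_cv(stops):
    """CV (%) of per-step Euclidean deltaE along a sampled gradient.
    stops: (N, 3) array of colors in a perceptual space.
    0% = perfectly uniform perceptual steps."""
    stops = np.asarray(stops, float)
    steps = np.linalg.norm(np.diff(stops, axis=0), axis=1)
    return 100.0 * steps.std() / steps.mean()
```

A straight line sampled at equal intervals scores 0%; a ramp whose perceptual steps lurch scores high, which is what the "worst-case gradient CV" rows capture.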
Metric Explorer
Search and filter all 83 internal benchmark metrics. Every number is reproducible.
Sortable, filterable table of all metrics. Values are from ColorBench HEAD running GenSpace v10-BH vs OKLab, both at float64 precision.
| Metric | Category | OKLab | GenSpace | Winner |
|---|---|---|---|---|
| CVD deutan min step (ΔE) | Accessibility | 0.16 | 0.11 | OKLab |
| CVD protan min step (ΔE) | Accessibility | 0.13 | 0.13 | GenSpace |
| Gray ramp pure D65 (C*) | Achromatic | 7.61e-7 | 1.88e-15 | GenSpace |
| Gray ramp sRGB (C*) | Achromatic | 5.57e-7 | 6.30e-13 | GenSpace |
| 1000-trip RT (max ΔE) | Advanced | 5.77e-13 | 6.97e-14 | GenSpace |
| 8-bit exact / 10K (count) | Advanced | 10,000 | 10,000 | Tie |
| Animation frame CV (%) | Advanced | 62.1 | 60.1 | GenSpace |
| Channel mono violations (count) | Advanced | 0 | 0 | Tie |
| Cross-gamut amplification (×) | Advanced | 1.0 | 1.0 | Tie |
| Jacobian condition | Advanced | 6.49 | 6.47 | Tie |
| Chroma preservation (no mud) | Application | 0.414 | 0.41 | Tie |
| Data viz min pairwise (ΔE) | Application | 14.34 | 14.5 | GenSpace |
| Eased animation CV (%) | Application | 64.1 | 64.5 | Tie |
| Muddy gradients, C drop >50% (count) | Application | 12 | 12 | Tie |
| Multi-stop gradient CV (%) | Application | 37.7 | 37.3 | GenSpace |
| Palette harmony accuracy (°) | Application | 11.7 | 9.1 | GenSpace |
| Palette L* spacing (%) | Application | 78.9 | 76.5 | GenSpace |
| Photo gamut map fidelity (°) | Application | 0.98 | 0.96 | GenSpace |
| Shade palette hue drift (°) | Application | 8.6 | 6 | GenSpace |
| Shade palette worst hue drift (°) | Application | 20.9 | 20.4 | GenSpace |
| Tint/shade hue preservation (°) | Application | 8.8 | 7.9 | GenSpace |
| WCAG midpoint contrast (ratio) | Application | 2.73 | 2.88 | GenSpace |
| Duplicate 8-bit steps (%) | Banding | 16.1 | 13.8 | GenSpace |
| Invisible gradient steps (%) | Banding | 99.7 | 99.8 | Tie |
| Cusp smoothness (max jump) | Gamut | 0.805 | 0.072 | GenSpace |
| Gamut volume fill (%) | Gamut | 1 | 1 | Tie |
| P3 boundary bad hues (count) | Gamut | 121 | 4 | GenSpace |
| P3 boundary continuity | Gamut | 0.444 | 0.079 | GenSpace |
| P3 boundary mean jump | Gamut | 0.02 | 0.003 | GenSpace |
| P3 cliff max (%) | Gamut | 0.16 | 0.1 | GenSpace |
| P3 cusp mean smoothness | Gamut | 0.008 | 0.005 | GenSpace |
| P3 cusp smoothness | Gamut | 0.778 | 0.039 | GenSpace |
| P3 invalid cusps (count) | Gamut | 52 | 0 | GenSpace |
| P3 mono violations (count) | Gamut | 71 | 0 | GenSpace |
| P3 valid cusps | Gamut | 308/360 | 360/360 | GenSpace |
| Rec2020 boundary bad hues (count) | Gamut | 130 | 20 | GenSpace |
| Rec2020 boundary continuity | Gamut | 0.562 | 0.248 | GenSpace |
| Rec2020 boundary mean jump | Gamut | 0.025 | 0.006 | GenSpace |
| Rec2020 cliff max (%) | Gamut | 0.72 | 0.18 | GenSpace |
| Rec2020 cusp mean smoothness | Gamut | 0.007 | 0.006 | GenSpace |
| Rec2020 cusp smoothness | Gamut | 0.756 | 0.157 | GenSpace |
| Rec2020 mono violations (count) | Gamut | 60 | 1 | GenSpace |
| Rec2020 valid cusps | Gamut | 360/360 | 360/360 | Tie |
| sRGB boundary bad hues (count) | Gamut | 123 | 15 | GenSpace |
| sRGB boundary continuity | Gamut | 0.545 | 0.301 | GenSpace |
| sRGB boundary mean jump | Gamut | 0.02 | 0.005 | GenSpace |
| sRGB cliff max (%) | Gamut | 0.65 | 0.16 | GenSpace |
| sRGB cusp mean smoothness | Gamut | 0.009 | 0.005 | GenSpace |
| sRGB invalid cusps (count) | Gamut | 61 | 0 | GenSpace |
| sRGB mono violations (count) | Gamut | 88 | 0 | GenSpace |
| sRGB valid cusps | Gamut | 299/360 | 360/360 | GenSpace |
| 3-color gradient CV (%) | Gradient | 39.34 | 34.92 | GenSpace |
| Banding mean (steps) | Gradient | 1.84 | 1.83 | Tie |
| Bright gradient CV, L>0.6 (%) | Gradient | 32.18 | 32.76 | OKLab |
| Cross-lightness gradient CV (%) | Gradient | 22.08 | 18.03 | GenSpace |
| Dark gradient CV, L<0.4 (%) | Gradient | 47.28 | 37.24 | GenSpace |
| Gradient CV, mean (%) | Gradient | 38.2 | 37.45 | GenSpace |
| Gradient CV, p95 (%) | Gradient | 136.69 | 138.78 | OKLab |
| High-chroma gradient CV (%) | Gradient | 29.63 | 26.92 | GenSpace |
| Max hue drift, non-crossing (°) | Gradient | 112.7 | 77.5 | GenSpace |
| Near-achromatic gradient CV (%) | Gradient | 85.95 | 106.73 | OKLab |
| Worst-case gradient CV (%) | Gradient | 412.6 | 377.7 | GenSpace |
| Hue RMS (°) | Hue | 30.1 | 27.5 | GenSpace |
| Primary L range | Hue | 0.516 | 0.6 | GenSpace |
| Round-trip P3, 16.7M colors (max ΔE) | Numerical | 1.67e-15 | 2.00e-15 | Tie |
| Round-trip Rec2020, 2.1M colors (max ΔE) | Numerical | 1.55e-15 | 1.78e-15 | Tie |
| Round-trip sRGB, 16.7M colors (max ΔE) | Numerical | 1.67e-15 | 5.64e-8 | OKLab |
| Hue agreement w/ CIE Lab (°) | Perceptual | 8.5 | 8.3 | GenSpace |
| Hue leaf constancy (°) | Perceptual | 73.3 | 59.8 | GenSpace |
| MacAdam isotropy (ratio) | Perceptual | 1.99 | 1.78 | GenSpace |
| Munsell Hue spacing (%) | Perceptual | 18.5 | 11.4 | GenSpace |
| Munsell Value uniformity (%) | Perceptual | 2.8 | 0.16 | GenSpace |
| Blue→White midpoint (G/R ratio) | Special | 1.408 | 1.513 | GenSpace |
| Red→White midpoint (G−B) | Special | 0.062 | 0.063 | OKLab |
| Yellow chroma | Special | 0.211 | 0.333 | GenSpace |
| Extreme chroma amplification (×) | Structural | 5.79 | 3.79 | GenSpace |
| Hue reversal max angle (°) | Structural | 3 | 0.6 | GenSpace |
| Hue reversals (count) | Structural | 80 | 66 | GenSpace |
| Negative LMS colors (%) | Structural | 0 | 0 | Tie |
| OOG excursion pairs (%) | Structural | 9.8 | 9.8 | Tie |
| OOG max distance | Structural | 0.11 | 0.103 | GenSpace |
| Primary hue disc, P3 (°) | Structural | 1.08 | 1.37 | OKLab |
| Primary hue disc, sRGB (°) | Structural | 1.31 | 1.65 | OKLab |
Tested on Data We Never Trained On
Three independent datasets from published color science research (1980-1998). GenSpace wins 6 out of 7 metrics against OKLab on data it never saw.
Hung & Berns 1995 (hue linearity, 168 samples), Ebner & Fairchild 1998 (constant-hue surfaces, 321 samples), Pointer 1980 (real surface color gamut, 576 boundary points). None used in optimization.
Hung & Berns 1995
168 samples. Do straight lines in the color space match straight lines in human hue perception?
Hue linearity: angular deviation from constant-hue lines. 12 hues, 13 targets each, 9 observers.
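The angular-deviation idea is simple to sketch: project each constant-hue sample series into the chroma plane and measure its spread around the circular-mean hue angle. A hypothetical minimal version (not the exact harness):

```python
import numpy as np

def max_hue_deviation_deg(ab):
    """Max angular deviation (degrees) of constant-hue samples from
    their circular-mean hue angle. ab: (N, 2) array of (a, b)
    chroma-plane coordinates."""
    ab = np.asarray(ab, float)
    hue = np.arctan2(ab[:, 1], ab[:, 0])
    mean = np.arctan2(np.sin(hue).mean(), np.cos(hue).mean())
    dev = np.angle(np.exp(1j * (hue - mean)))  # wrap to (-pi, pi]
    return float(np.degrees(np.abs(dev)).max())
```

A perfectly hue-linear space maps each constant-hue series onto a ray from the neutral axis, giving a deviation of 0°.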
Ebner & Fairchild 1998
321 samples. When you change lightness and chroma but keep the hue name the same, does the color space agree?
Constant perceived-hue surface deviation. 15 hues. Mean and max angular deviation from ideal.
| Space | Mean | Max |
|---|---|---|
| CIE Lab | 2.95 | 16.0 |
| OKLab | 2.23 | 8.1 |
| GenSpace | 2.10 | 8.6 |
Pointer's Gamut 1980
576 pts. How uniformly does each space represent real-world surface colors?
Real surface color boundary (16 L levels, 36 hue angles). Chroma CV, boundary smoothness, hue uniformity.
| Space | C* CV | Smooth | Hue CV |
|---|---|---|---|
| CIE Lab | 0.479 | 0.144 | 0.034 |
| OKLab | 0.413 | 0.132 | 0.370 |
| GenSpace | 0.404 | 0.125 | 0.262 |
Independent Validation Total
Across 3 published datasets (1980-1998), none used in training
Overfitting Analysis
We optimized MetricSpace on color difference data. Could it have just memorized the answers? We tested this honestly and show you the results.
80/20 stratified split (seed=42), multiple DOF configurations. Train-test gap exists (+1.8) but held-out test still beats all competitors.
Does the model genuinely predict color perception, or did it just memorize the training data? We tested this rigorously with held-out data the model never saw during training.
80/20 train-test split (seed=42, 3050/763 pairs). Multiple DOF configurations tested. Cross-validated estimate: STRESS 24.3.
| Model | Params (DOF) | Train STRESS | Test STRESS | Gap |
|---|---|---|---|---|
| v20b baseline | 0 | 27.72 | 27.57 | -0.15 |
| v21 (full-data) | 72 | 22.14 | 23.91 | +1.77 |
| Phase 1 train-only | 6 | 25.35 | 25.65 | +0.30 |
| Phase 1+2 train-only | 48 | 22.78 | 24.59 | +1.82 |
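The hold-out protocol is easy to reproduce. A simplified sketch of an 80/20 split with seed 42 (it reproduces the 3050/763 counts, but omits the stratification by sub-dataset that the real protocol uses):

```python
import numpy as np

def split_indices(n_pairs, test_frac=0.2, seed=42):
    """Shuffle pair indices with a fixed seed and hold out test_frac."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pairs)
    n_test = int(round(n_pairs * test_frac))
    return idx[n_test:], idx[:n_test]  # (train, test)

train, test = split_indices(3813)  # COMBVD: 3050 train / 763 test
```

The fixed seed is what makes the split deterministic: anyone running the same code gets the same 763 held-out pairs.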
Key Findings
Published STRESS: 22.48 (full-data optimized) | Cross-validated: ~24.3 | Both are still #1 among all tested competitors.
When to Use MetricSpace
MetricSpace is purpose-built for color difference prediction — not generation. Use it when you need to measure, not create.
Quality Control
Print, display, textile color matching. 23% lower STRESS than CIEDE2000 on COMBVD.
Color Matching Tolerance
Pair-dependent SL/SC weighting adapts to the specific lightness and chroma of each color pair.
A/B Testing
Human Feedback STRESS = 23.26 vs CIEDE2000's 62.54. 63% better at predicting real user preferences.
Accessibility Checking
Euclidean deltaE that's actually perceptually calibrated. OKLab STRESS = 47 — not designed for distance prediction.
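For reference, the WCAG contrast figures in the metric table use the standard WCAG 2.x ratio, which is defined on relative luminance rather than on any perceptual space. A self-contained sketch of that definition:

```python
def _linear(c):
    """sRGB channel linearization per WCAG 2.x (c in 0..1)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def wcag_contrast(rgb1, rgb2):
    """WCAG 2.x contrast ratio between two sRGB colors (0..1 floats).
    Ranges from 1:1 (identical) to 21:1 (black vs white)."""
    def luminance(rgb):
        r, g, b = (_linear(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b
    hi, lo = sorted((luminance(rgb1), luminance(rgb2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)
```

Because this ratio ignores hue entirely, a perceptually calibrated deltaE complements it rather than replaces it.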
Research
Transparent pipeline, fully invertible, open source. All parameters, datasets, and optimization scripts published.
Are We the Best?
Color difference measurement: Yes.
MetricSpace v21 achieves the lowest published STRESS on COMBVD, MacAdam, and Human Feedback simultaneously. No other metric matches human perception this accurately across multiple datasets. Caveat: cross-validated estimate is ~24.3 (not the published 22.48), and CIEDE2000 wins on 3 of 6 COMBVD sub-datasets.
Generation tasks: Best-rounded, not best at everything.
GenSpace wins 66-9 vs OKLab across 90 metrics, including 6-1 on independent 3rd-party datasets OKLab was optimized on. However: OKLab is better for near-achromatic gradients (24%), CVD deutan palettes (43%), and native CSS oklch(). CIE Lab's hue angles remain the established industry reference for hue naming.
Overall: First to do both.
Helmlab is the first color space library to achieve state-of-the-art in both perceptual color difference measurement and visual generation quality simultaneously. No other space does both.
How We Test
Deterministic
Every metric is computed at float64 precision with fixed seeds. Run the same code, get the same numbers. No stochastic variation.
Head-to-head
Same test harness, same input colors, same precision for both spaces. Winner is determined by the metric's natural direction (lower or higher is better).
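A hypothetical sketch of that decision rule (the relative tie tolerance is our assumption for illustration; ColorBench's actual threshold may differ):

```python
def decide_winner(oklab, genspace, direction="lower", rel_tol=0.01):
    """Pick a winner by the metric's natural direction, declaring a
    tie when the two values differ by less than rel_tol relatively."""
    scale = max(abs(oklab), abs(genspace), 1e-30)
    if abs(oklab - genspace) / scale <= rel_tol:
        return "Tie"
    better = min if direction == "lower" else max
    return "GenSpace" if better(oklab, genspace) == genspace else "OKLab"
```

With a tolerance like this, near-identical results such as 99.7 vs 99.8 invisible gradient steps register as ties instead of inflating either side's win count.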
No Cherry-picking
All metrics are reported, including our 9 losses. We do not add or remove metrics based on whether we win them.
Open Source
ColorBench source code, all data files, and checkpoint parameters are publicly available on GitHub for independent verification.
What We Did NOT Test
- HDR color differences — no HDR psychophysical dataset available
- Cross-surround conditions — all data is standard viewing conditions
- Display-specific gamuts — only standard sRGB / Display P3 / Rec.2020 primaries
- Computational performance — not benchmarked (GenSpace ~35 FLOPs, MetricSpace ~150 FLOPs per color)
- Perceptual ranking with human observers — GenSpace metrics test geometric/mathematical properties, not direct human preference