90 Metrics, Complete Transparency
Every claim on this site is backed by reproducible benchmarks. We show our wins, our losses, and even our overfitting analysis.
ColorBench: deterministic, float64, 13 categories. All data, scripts, and checkpoints are open source. No cherry-picking. Full train/test split analysis included.
83 internal + 7 independent validation = 90 total metrics
Color Difference Accuracy
MetricSpace predicts human color perception more accurately than any existing standard, including the industry-standard CIEDE2000 formula.
STRESS (CIE 217:2016) evaluated on COMBVD (3,813 pairs from 6 sub-datasets), MacAdam 1974 (128 pairs), and Human Feedback (3,552 judgements). Lower is better.
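For readers checking our numbers, STRESS is only a few lines of code. A minimal NumPy sketch of the CIE 217:2016 definition (variable names are ours; this is not the exact ColorBench implementation):

```python
import numpy as np

def stress(dE, dV):
    """STRESS between predicted differences dE and visual differences dV,
    after an optimal least-squares scaling. 0 = perfect agreement;
    lower is better."""
    dE, dV = np.asarray(dE, float), np.asarray(dV, float)
    F = np.sum(dE * dV) / np.sum(dE ** 2)  # optimal scale factor
    return 100.0 * np.sqrt(np.sum((F * dE - dV) ** 2) / np.sum(dV ** 2))
```

Because of the scaling step, STRESS is invariant to multiplying all predicted differences by a constant, so metrics with different natural units can be compared fairly.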
Generation Benchmark: 66-9
GenSpace is purpose-built for creating colors: gradients, palettes, gamut mapping. Head-to-head against OKLab (the current CSS standard), it wins 66 out of 90 metrics.
83 internal metrics (deterministic, float64) + 7 independent validation metrics. Opponent: OKLab with standard Euclidean deltaE. Same test harness, same precision.
Category Breakdown
Performance breakdown across 13 test categories. Click any card to expand and see every individual metric in that category.
83 internal metrics in 12 categories + 7 independent validation metrics. All deterministic, float64.
Gamut
How well the space maps to real device screens
Cusp validity, boundary smoothness, clipping across sRGB/P3/Rec2020
Gradient
How smooth colors blend between two endpoints
CV of perceptual step size, hue drift, banding metrics
Application
Real-world tasks like palettes, tints, and accessibility
Palette generation, gamut mapping, WCAG contrast, animation
Perceptual
Agreement with how humans actually see color
Munsell, MacAdam, Hung-Berns hue linearity validation
Structural
Mathematical properties that affect reliability
Hue reversals, OOG excursion, chroma amplification, LMS
Hue
Whether hue labels match human expectation
Hue RMS vs Munsell, primary lightness range
Achromatic
Perfect grays without color contamination
Gray ramp chroma residual under sRGB and D65
Advanced
Edge cases and stress tests
1000-trip roundtrip, Jacobian condition, 8-bit precision
Special
Problem areas where OKLab is known to struggle
Yellow chroma, blue-to-white midpoint, red-to-white shift
Banding
Visible stepping artifacts in gradients
Invisible step ratio, duplicate 8-bit bucket count
Accessibility
Usability for colorblind viewers
CVD simulation minimum step deltaE (protan/deutan)
Numerical
Mathematical precision of conversions
Round-trip error across sRGB, P3, Rec2020 (float64)
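Many of the gradient and animation metrics above reduce to one statistic: the coefficient of variation (CV) of perceptual step sizes along a sampled ramp. A minimal sketch of that statistic (the sampling density and the choice of perceptual space are up to the caller; this is not the exact ColorBench code):

```python
import numpy as np

def gradient_step_cv(stops):
    """CV (%) of per-step Euclidean deltaE along a sampled gradient.
    stops: (N, 3) array of colors in a perceptual space.
    0% = perfectly uniform perceptual steps."""
    stops = np.asarray(stops, float)
    steps = np.linalg.norm(np.diff(stops, axis=0), axis=1)
    return 100.0 * steps.std() / steps.mean()
```

A straight line sampled at equal intervals scores 0%; a ramp whose perceptual steps lurch scores high, which is what the "worst-case gradient CV" rows capture.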
Metric Explorer
Search and filter all 83 internal benchmark metrics. Every number is reproducible.
Sortable, filterable table of all metrics. Values are from ColorBench HEAD running GenSpace v10-BH vs OKLab, both at float64 precision.
| Metric | Category | OKLab | GenSpace | Winner |
|---|---|---|---|---|
| CVD deutan min step (ΔE) | Accessibility | 0.16 | 0.11 | OKLab |
| CVD protan min step (ΔE) | Accessibility | 0.13 | 0.13 | GenSpace |
| Gray ramp pure D65 (C*) | Achromatic | 7.61e-7 | 1.88e-15 | GenSpace |
| Gray ramp sRGB (C*) | Achromatic | 5.57e-7 | 6.30e-13 | GenSpace |
| 1000-trip RT (max ΔE) | Advanced | 5.77e-13 | 6.97e-14 | GenSpace |
| 8-bit exact / 10K (count) | Advanced | 10,000 | 10,000 | Tie |
| Animation frame CV (%) | Advanced | 62.1 | 60.1 | GenSpace |
| Channel mono violations (count) | Advanced | 0 | 0 | Tie |
| Cross-gamut amplification (×) | Advanced | 1.0 | 1.0 | Tie |
| Jacobian condition | Advanced | 6.49 | 6.47 | Tie |
| Chroma preservation (no mud) | Application | 0.414 | 0.41 | Tie |
| Data viz min pairwise (ΔE) | Application | 14.34 | 14.5 | GenSpace |
| Eased animation CV (%) | Application | 64.1 | 64.5 | Tie |
| Muddy gradients, C drop >50% (count) | Application | 12 | 12 | Tie |
| Multi-stop gradient CV (%) | Application | 37.7 | 37.3 | GenSpace |
| Palette harmony accuracy (°) | Application | 11.7 | 9.1 | GenSpace |
| Palette L* spacing (%) | Application | 78.9 | 76.5 | GenSpace |
| Photo gamut map fidelity (°) | Application | 0.98 | 0.96 | GenSpace |
| Shade palette hue drift (°) | Application | 8.6 | 6 | GenSpace |
| Shade palette worst hue drift (°) | Application | 20.9 | 20.4 | GenSpace |
| Tint/shade hue preservation (°) | Application | 8.8 | 7.9 | GenSpace |
| WCAG midpoint contrast (ratio) | Application | 2.73 | 2.88 | GenSpace |
| Duplicate 8-bit steps (%) | Banding | 16.1 | 13.8 | GenSpace |
| Invisible gradient steps (%) | Banding | 99.7 | 99.8 | Tie |
| Cusp smoothness (max jump) | Gamut | 0.805 | 0.072 | GenSpace |
| Gamut volume fill (%) | Gamut | 1 | 1 | Tie |
| P3 boundary bad hues (count) | Gamut | 121 | 4 | GenSpace |
| P3 boundary continuity | Gamut | 0.444 | 0.079 | GenSpace |
| P3 boundary mean jump | Gamut | 0.02 | 0.003 | GenSpace |
| P3 cliff max (%) | Gamut | 0.16 | 0.1 | GenSpace |
| P3 cusp mean smoothness | Gamut | 0.008 | 0.005 | GenSpace |
| P3 cusp smoothness | Gamut | 0.778 | 0.039 | GenSpace |
| P3 invalid cusps (count) | Gamut | 52 | 0 | GenSpace |
| P3 mono violations (count) | Gamut | 71 | 0 | GenSpace |
| P3 valid cusps | Gamut | 308/360 | 360/360 | GenSpace |
| Rec2020 boundary bad hues (count) | Gamut | 130 | 20 | GenSpace |
| Rec2020 boundary continuity | Gamut | 0.562 | 0.248 | GenSpace |
| Rec2020 boundary mean jump | Gamut | 0.025 | 0.006 | GenSpace |
| Rec2020 cliff max (%) | Gamut | 0.72 | 0.18 | GenSpace |
| Rec2020 cusp mean smoothness | Gamut | 0.007 | 0.006 | GenSpace |
| Rec2020 cusp smoothness | Gamut | 0.756 | 0.157 | GenSpace |
| Rec2020 mono violations (count) | Gamut | 60 | 1 | GenSpace |
| Rec2020 valid cusps | Gamut | 360/360 | 360/360 | Tie |
| sRGB boundary bad hues (count) | Gamut | 123 | 15 | GenSpace |
| sRGB boundary continuity | Gamut | 0.545 | 0.301 | GenSpace |
| sRGB boundary mean jump | Gamut | 0.02 | 0.005 | GenSpace |
| sRGB cliff max (%) | Gamut | 0.65 | 0.16 | GenSpace |
| sRGB cusp mean smoothness | Gamut | 0.009 | 0.005 | GenSpace |
| sRGB invalid cusps (count) | Gamut | 61 | 0 | GenSpace |
| sRGB mono violations (count) | Gamut | 88 | 0 | GenSpace |
| sRGB valid cusps | Gamut | 299/360 | 360/360 | GenSpace |
| 3-color gradient CV (%) | Gradient | 39.34 | 34.92 | GenSpace |
| Banding mean (steps) | Gradient | 1.84 | 1.83 | Tie |
| Bright gradient CV, L>0.6 (%) | Gradient | 32.18 | 32.76 | OKLab |
| Cross-lightness gradient CV (%) | Gradient | 22.08 | 18.03 | GenSpace |
| Dark gradient CV, L<0.4 (%) | Gradient | 47.28 | 37.24 | GenSpace |
| Gradient CV, mean (%) | Gradient | 38.2 | 37.45 | GenSpace |
| Gradient CV, p95 (%) | Gradient | 136.69 | 138.78 | OKLab |
| High-chroma gradient CV (%) | Gradient | 29.63 | 26.92 | GenSpace |
| Max hue drift, non-crossing (°) | Gradient | 112.7 | 77.5 | GenSpace |
| Near-achromatic gradient CV (%) | Gradient | 85.95 | 106.73 | OKLab |
| Worst-case gradient CV (%) | Gradient | 412.6 | 377.7 | GenSpace |
| Hue RMS (°) | Hue | 30.1 | 27.5 | GenSpace |
| Primary L range | Hue | 0.516 | 0.6 | GenSpace |
| Round-trip P3, 16.7M colors (max ΔE) | Numerical | 1.67e-15 | 2.00e-15 | Tie |
| Round-trip Rec2020, 2.1M colors (max ΔE) | Numerical | 1.55e-15 | 1.78e-15 | Tie |
| Round-trip sRGB, 16.7M colors (max ΔE) | Numerical | 1.67e-15 | 5.64e-8 | OKLab |
| Hue agreement w/ CIE Lab (°) | Perceptual | 8.5 | 8.3 | GenSpace |
| Hue leaf constancy (°) | Perceptual | 73.3 | 59.8 | GenSpace |
| MacAdam isotropy (ratio) | Perceptual | 1.99 | 1.78 | GenSpace |
| Munsell Hue spacing (%) | Perceptual | 18.5 | 11.4 | GenSpace |
| Munsell Value uniformity (%) | Perceptual | 2.8 | 0.16 | GenSpace |
| Blue→White midpoint (G/R ratio) | Special | 1.408 | 1.513 | GenSpace |
| Red→White midpoint (G−B) | Special | 0.062 | 0.063 | OKLab |
| Yellow chroma | Special | 0.211 | 0.333 | GenSpace |
| Extreme chroma amplification (×) | Structural | 5.79 | 3.79 | GenSpace |
| Hue reversal max angle (°) | Structural | 3 | 0.6 | GenSpace |
| Hue reversals (count) | Structural | 80 | 66 | GenSpace |
| Negative LMS colors (%) | Structural | 0 | 0 | Tie |
| OOG excursion pairs (%) | Structural | 9.8 | 9.8 | Tie |
| OOG max distance | Structural | 0.11 | 0.103 | GenSpace |
| Primary hue disc, P3 (°) | Structural | 1.08 | 1.37 | OKLab |
| Primary hue disc, sRGB (°) | Structural | 1.31 | 1.65 | OKLab |
Tested on Data We Never Trained On
Three independent datasets from published color science research (1980-1998). GenSpace wins 6 out of 7 metrics against OKLab on data it never saw.
Hung & Berns 1995 (hue linearity, 168 samples), Ebner & Fairchild 1998 (constant-hue surfaces, 321 samples), Pointer 1980 (real surface color gamut, 576 boundary points). None used in optimization.
Hung & Berns 1995
168 samples. Do straight lines in the color space match straight lines in human hue perception?
Hue linearity: angular deviation from constant-hue lines. 12 hues, 13 targets each, 9 observers.
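The angular-deviation idea is simple to sketch: project each constant-hue sample series into the chroma plane and measure its spread around the circular-mean hue angle. A hypothetical minimal version (not the exact harness):

```python
import numpy as np

def max_hue_deviation_deg(ab):
    """Max angular deviation (degrees) of constant-hue samples from
    their circular-mean hue angle. ab: (N, 2) array of (a, b)
    chroma-plane coordinates."""
    ab = np.asarray(ab, float)
    hue = np.arctan2(ab[:, 1], ab[:, 0])
    mean = np.arctan2(np.sin(hue).mean(), np.cos(hue).mean())
    dev = np.angle(np.exp(1j * (hue - mean)))  # wrap to (-pi, pi]
    return float(np.degrees(np.abs(dev)).max())
```

A perfectly hue-linear space maps each constant-hue series onto a ray from the neutral axis, giving a deviation of 0°.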
Ebner & Fairchild 1998
321 samples. When you change lightness and chroma but keep the hue name the same, does the color space agree?
Constant perceived-hue surface deviation. 15 hues. Mean and max angular deviation from ideal.
| Space | Mean | Max |
|---|---|---|
| CIE Lab | 2.95 | 16.0 |
| OKLab | 2.23 | 8.1 |
| GenSpace | 2.10 | 8.6 |
Pointer's Gamut 1980
576 pts. How uniformly does each space represent real-world surface colors?
Real surface color boundary (16 L levels, 36 hue angles). Chroma CV, boundary smoothness, hue uniformity.
| Space | C* CV | Smooth | Hue CV |
|---|---|---|---|
| CIE Lab | 0.479 | 0.144 | 0.034 |
| OKLab | 0.413 | 0.132 | 0.370 |
| GenSpace | 0.404 | 0.125 | 0.262 |
Independent Validation Total
Across 3 published datasets (1980-1998), none used in training
Overfitting Analysis
We optimized MetricSpace on color difference data. Could it have just memorized the answers? We tested this honestly and show you the results.
80/20 stratified split (seed=42), multiple DOF configurations. Train-test gap exists (+1.8) but held-out test still beats all competitors.
Does the model genuinely predict color perception, or did it just memorize the training data? We tested this rigorously with held-out data the model never saw during training.
80/20 train-test split (seed=42, 3050/763 pairs). Multiple DOF configurations tested. Cross-validated estimate: STRESS 24.3.
| Model | Params (DOF) | Train STRESS | Test STRESS | Gap |
|---|---|---|---|---|
| v20b baseline | 0 | 27.72 | 27.57 | -0.15 |
| v21 (full-data) | 72 | 22.14 | 23.91 | +1.77 |
| Phase 1 train-only | 6 | 25.35 | 25.65 | +0.30 |
| Phase 1+2 train-only | 48 | 22.78 | 24.59 | +1.82 |
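The hold-out protocol is easy to reproduce. A simplified sketch of an 80/20 split with seed 42 (it reproduces the 3050/763 counts, but omits the stratification by sub-dataset that the real protocol uses):

```python
import numpy as np

def split_indices(n_pairs, test_frac=0.2, seed=42):
    """Shuffle pair indices with a fixed seed and hold out test_frac."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pairs)
    n_test = int(round(n_pairs * test_frac))
    return idx[n_test:], idx[:n_test]  # (train, test)

train, test = split_indices(3813)  # COMBVD: 3050 train / 763 test
```

The fixed seed is what makes the split deterministic: anyone running the same code gets the same 763 held-out pairs.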
Key Findings
Published STRESS: 22.48 (full-data optimized) | Cross-validated: ~24.3 | Both are still #1 among all tested competitors.
When to Use MetricSpace
MetricSpace is purpose-built for color difference prediction — not generation. Use it when you need to measure, not create.
Quality Control
Print, display, textile color matching. 23% lower STRESS than CIEDE2000 on COMBVD.
Color Matching Tolerance
Pair-dependent SL/SC weighting adapts to the specific lightness and chroma of each color pair.
A/B Testing
Human Feedback STRESS = 23.26 vs CIEDE2000's 62.54. 63% better at predicting real user preferences.
Accessibility Checking
Euclidean deltaE that's actually perceptually calibrated. OKLab STRESS = 47 — not designed for distance prediction.
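For reference, the WCAG contrast figures in the metric table use the standard WCAG 2.x ratio, which is defined on relative luminance rather than on any perceptual space. A self-contained sketch of that definition:

```python
def _linear(c):
    """sRGB channel linearization per WCAG 2.x (c in 0..1)."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def wcag_contrast(rgb1, rgb2):
    """WCAG 2.x contrast ratio between two sRGB colors (0..1 floats).
    Ranges from 1:1 (identical) to 21:1 (black vs white)."""
    def luminance(rgb):
        r, g, b = (_linear(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b
    hi, lo = sorted((luminance(rgb1), luminance(rgb2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)
```

Because this ratio ignores hue entirely, a perceptually calibrated deltaE complements it rather than replaces it.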
Research
Transparent pipeline, fully invertible, open source. All parameters, datasets, and optimization scripts published.
Are We the Best?
Color difference measurement: Yes.
MetricSpace v21 achieves the lowest published STRESS on COMBVD, MacAdam, and Human Feedback simultaneously. No other metric matches human perception this accurately across multiple datasets. Caveat: cross-validated estimate is ~24.3 (not the published 22.48), and CIEDE2000 wins on 3 of 6 COMBVD sub-datasets.
Generation tasks: Best-rounded, not best at everything.
GenSpace wins 66-9 vs OKLab across 90 metrics, including 6-1 on independent 3rd-party datasets OKLab was optimized on. However: OKLab is better for near-achromatic gradients (24%), CVD deutan palettes (43%), and native CSS oklch(). CIE Lab's hue angles remain the established industry reference for hue naming.
Overall: First to do both.
Helmlab is the first color space library to achieve state-of-the-art in both perceptual color difference measurement and visual generation quality simultaneously. No other space does both.
How We Test
Deterministic
Every metric is computed at float64 precision with fixed seeds. Run the same code, get the same numbers. No stochastic variation.
Head-to-head
Same test harness, same input colors, same precision for both spaces. Winner is determined by the metric's natural direction (lower or higher is better).
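A hypothetical sketch of that decision rule (the relative tie tolerance is our assumption for illustration; ColorBench's actual threshold may differ):

```python
def decide_winner(oklab, genspace, direction="lower", rel_tol=0.01):
    """Pick a winner by the metric's natural direction, declaring a
    tie when the two values differ by less than rel_tol relatively."""
    scale = max(abs(oklab), abs(genspace), 1e-30)
    if abs(oklab - genspace) / scale <= rel_tol:
        return "Tie"
    better = min if direction == "lower" else max
    return "GenSpace" if better(oklab, genspace) == genspace else "OKLab"
```

With a tolerance like this, near-identical results such as 99.7 vs 99.8 invisible gradient steps register as ties instead of inflating either side's win count.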
No Cherry-picking
All metrics are reported, including our 9 losses. We do not add or remove metrics based on whether we win them.
Open Source
ColorBench source code, all data files, and checkpoint parameters are publicly available on GitHub for independent verification.
What We Did NOT Test
- HDR color differences — no HDR psychophysical dataset available
- Cross-surround conditions — all data is standard viewing conditions
- Display-specific gamuts — only standard sRGB / Display P3 / Rec.2020 primaries
- Computational performance — not benchmarked (GenSpace ~35 FLOPs, MetricSpace ~150 FLOPs per color)
- Perceptual ranking with human observers — GenSpace metrics test geometric/mathematical properties, not direct human preference