
Pricing Diamonds

A gradient booster prices diamonds to within $276. Then it meets a big one.

April 30, 2026

Predicted vs actual diamond price for the gradient booster: a hexbin density hugging the perfect-prediction diagonal across 13,480 test stones, with an inset showing mean absolute error climbing 27-fold across price deciles while percentage error stays pinned between 6 and 7 percent

Hold out a quarter of the diamonds, train on the rest, and a gradient-boosted model prices a stone it has never seen to within $276 on average. That is the mean absolute error on 13,480 test diamonds, against prices that run from $326 to nearly $19,000. R² of 0.981. For a model that gets nothing but the 4Cs and three physical measurements, that is a good day.

Then you look at where the $276 comes from, and the average turns out to be a lie of composition. On the cheapest tenth of the stones the model misses by $36. On the priciest tenth it misses by $987. Same model, same features, a 27-fold gap in dollar error. The number that makes a diamond expensive is the same number that makes it hard to price.

The diagonal above is the whole pitch and the whole catch. Predictions hug it all the way up the range, which is what an R² of 0.981 looks like. Watch the inset: the dollar miss grows 27x from the cheapest decile to the priciest, while the model stays the same 6 to 7 percent wrong everywhere. The headline number is honest and misleading at once.

The setup

This is the ggplot2 diamonds set, 53,940 stones, pulled through the catalog (source: ggplot2 diamonds, via seaborn-data). Carat, the three quality grades (cut, color, clarity), price in dollars, and the x/y/z dimensions in millimeters. I dropped 20 rows with a zero dimension and 3 more with an impossible one: a 0.51-carat stone logged at 31.8mm tall, a 2-carat one at 58.9mm wide. The longest real side in this set runs to about 11mm, so those are transcription errors. The z=31.8 row matters beyond its absurdity. Leave it in and a log-space linear model exponentiates that single leverage point into a multi-billion-dollar prediction. That is the price of fitting in logs: an extrapolation that is merely large in log space becomes astronomical once it is exponentiated.
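
The cleaning rule can be sketched as one boolean mask. This is a minimal pandas sketch under stated assumptions, not the code behind the article: the helper name `drop_bad_dimensions` and the 15 mm cap are hypothetical (the cap is picked only to sit above the roughly 11 mm longest real side), and the three toy rows are made up to mirror the kinds of errors described.

```python
import pandas as pd

def drop_bad_dimensions(df: pd.DataFrame, max_mm: float = 15.0) -> pd.DataFrame:
    """Keep rows whose x, y and z are all positive and below a plausibility cap."""
    dims = df[["x", "y", "z"]]
    ok = (dims > 0).all(axis=1) & (dims < max_mm).all(axis=1)
    return df.loc[ok].reset_index(drop=True)

# Made-up rows, not the real data: one clean stone, one zero dimension,
# one physically impossible z.
toy = pd.DataFrame({
    "carat": [0.51, 1.00, 0.51],
    "x": [5.12, 0.00, 5.12],
    "y": [5.15, 6.40, 5.15],
    "z": [3.17, 3.90, 31.8],
})
cleaned = drop_bad_dimensions(toy)
print(len(toy), "->", len(cleaned))  # prints: 3 -> 1
```

A fixed cap is the bluntest possible rule. It works here because the bad values are several times larger than any real stone, so the exact threshold barely matters.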

One split, seed 42, 75/25. Every number below is on the test set. All preprocessing (scaling the numerics, one-hot encoding the grades) lives inside an sklearn Pipeline fit on training data only, so the test fold never leaks into the standardization. I wanted an honest out-of-sample read, not a flattering in-sample one.
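
The no-leakage setup might look roughly like this. It is a sketch on synthetic stand-in data: the two columns, the power-law price curve, and the `LinearRegression` placeholder are assumptions for illustration. Only the seed, the 75/25 split, and the idea of keeping scaling and one-hot encoding inside a `Pipeline` come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the diamonds table (assumed shape, not real data).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.5, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
})
toy["price"] = 4000 * toy["carat"] ** 1.7 * np.exp(rng.normal(0, 0.1, n))

X_train, X_test, y_train, y_test = train_test_split(
    toy[["carat", "cut"]], toy["price"], test_size=0.25, random_state=42
)

# Scaling and encoding live inside the pipeline, so fit() only sees X_train.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["carat"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cut"]),
])
model = Pipeline([("prep", preprocess), ("ols", LinearRegression())])
model.fit(X_train, y_train)

# The scaler's stored mean is the training mean, not the full-data mean.
scaler = model.named_steps["prep"].named_transformers_["num"]
print(scaler.mean_[0], X_train["carat"].mean())
```

The two printed numbers should agree, which is the leak check: the standardization was learned from the training fold alone.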

Three models, and why logging the target is not optional

Three contenders. Plain OLS on raw dollar price. The same OLS on log price, scored back in dollars. And a HistGradientBoostingRegressor, also on log price.

Raw-price OLS lands at $716 MAE, R² 0.924. Not terrible. But it does something disqualifying: it predicts a negative price 893 times. Almost one test stone in fifteen comes out worth less than nothing. Picture the fitted line as a straight ramp through a quantity with a hard floor at zero and a long tail above it. To reach the cheap stones at the bottom, the ramp has to dip below the floor, and below the floor means a price under zero.
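
The ramp picture is easy to reproduce on made-up data. This sketch assumes a power-law price curve (`4000 * carat ** 1.7`) with multiplicative noise, chosen only because it is positive and convex. It is not fitted to the diamonds set, and the count it prints has nothing to do with the 893 above.

```python
import numpy as np

# Synthetic, strictly positive prices that accelerate with carat (assumed curve).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 2000)
price = 4000 * carat ** 1.7 * np.exp(rng.normal(0, 0.15, carat.size))

# Ordinary least squares on raw dollars: one straight line through a convex cloud.
slope, intercept = np.polyfit(carat, price, 1)
pred = slope * carat + intercept

print("cheapest real price:", round(float(price.min())))
print("negative predictions:", int((pred < 0).sum()))
```

Every input price is positive, yet the fitted line has a negative intercept and dips under zero for the smallest stones.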

Move to log price and the floor problem disappears, because exp of anything is positive. MAE drops to $410, R² climbs to 0.959. Logging the target alone, no new features, no new model, buys $305 of accuracy and kills every nonsensical prediction. The first diamonds post I did showed price is almost perfectly symmetric in log space; this is the predictive payoff of that same fact. When your target is a price, fit in logs.
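
One way to fit in logs and score in dollars is scikit-learn's `TransformedTargetRegressor`, which applies `func` to the target before fitting and `inverse_func` to the predictions. Whether the article used this wrapper or logged the target by hand is not stated, so treat this as one possible wiring, again on assumed synthetic data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Same assumed synthetic curve as before: positive, convex, noisy.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 2000)
price = 4000 * carat ** 1.7 * np.exp(rng.normal(0, 0.15, carat.size))
X = carat.reshape(-1, 1)

# Fit OLS on log(price); predictions come back through exp, so in dollars.
log_ols = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
log_ols.fit(X, price)
pred = log_ols.predict(X)

print("smallest prediction:", float(pred.min()))
print("negative predictions:", int((pred < 0).sum()))
```

The positivity is structural rather than lucky: whatever the linear model outputs, `exp` maps it above zero.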

The gradient booster wins outright at $276 MAE and R² 0.981. It beats log-OLS by $134 a stone, and it earns that on the curvature the linear model cannot bend to: the way price accelerates against carat, the interactions between size and clarity. Where the line is one shape forever, the trees change shape wherever the data tells them to.

Three models ranked by test-set mean absolute error as horizontal lollipops: OLS on raw price at $716 (and 893 negative price predictions), OLS on log price at $410, gradient boosting on log price at $276, the winner highlighted

The predicted-versus-actual plot is a tight cigar on the diagonal across the whole price range. Tight, but not uniform, and the non-uniformity is the whole story.

GBM predicted vs actual price, test set, as a log-scaled hexbin density with the perfect-prediction diagonal

Where it breaks

I split the test set into deciles by actual price and asked the obvious question: is the $276 spread evenly, or does it pile up somewhere? It piles up.

The cheapest decile, stones from $326 to $648, gets priced to a mean error of $36. The error climbs every single step up the ladder, $45, $60, $85, $144, and by the top decile, the stones from $9,648 to $18,787, the model is off by $987 on average. The priciest 10% of diamonds carry 35.7% of all the dollar error in the test set. One tenth of the stones, more than a third of the misery.
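
The share figure can be sanity-checked from the rounded numbers alone. Assuming the ten deciles hold equal counts (13,480 stones split ten ways), the overall MAE is the plain average of the ten decile MAEs, so the top decile's share of total dollar error is its MAE divided by ten times the overall MAE.

```python
overall_mae = 276      # dollars, whole test set (rounded, from the text)
top_decile_mae = 987   # dollars, priciest tenth (rounded, from the text)
bottom_decile_mae = 36
n_deciles = 10         # assumes equal-sized deciles

share = top_decile_mae / (n_deciles * overall_mae)
ratio = top_decile_mae / bottom_decile_mae
print(f"{share:.1%}", f"{ratio:.1f}x")  # prints: 35.8% 27.4x
```

The rounded inputs give 35.8%, within rounding of the 35.7% quoted above, and the 27-fold gap falls out of the same two numbers.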

Mean absolute error by actual-price decile rising from $36 to $987 as colored bars, overlaid with a flat percentage-error line holding at 6 to 7 percent across all ten deciles

Switch the lens from dollars to percent and the panic drains out. Mean absolute percentage error in the bottom decile is 6.9%. In the top decile it is 7.3%. Across all ten deciles MAPE never leaves the 6.1% to 7.4% band. The model misses by roughly the same fraction everywhere. It is just that 7% of a $13,000 stone is about $900, and 7% of a $540 stone is about $38. The dollar error is not the model getting confused on expensive diamonds. It is the model being consistently, proportionally wrong, viewed through a scale that grows underneath it.
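
A decile table like the one described takes a few lines of pandas. The helper `error_by_decile` is a hypothetical name, and the data below are synthetic: prices drawn from a lognormal and "predictions" that are off by a constant proportional noise, which is an assumption built in to show the pattern, not a result.

```python
import numpy as np
import pandas as pd

def error_by_decile(actual, predicted) -> pd.DataFrame:
    """MAE, MAPE and share of total dollar error per decile of actual price."""
    df = pd.DataFrame({"actual": actual, "predicted": predicted})
    df["abs_err"] = (df["predicted"] - df["actual"]).abs()
    df["pct_err"] = df["abs_err"] / df["actual"]
    df["decile"] = pd.qcut(df["actual"], 10, labels=False) + 1
    out = df.groupby("decile").agg(
        lo=("actual", "min"),
        hi=("actual", "max"),
        mae=("abs_err", "mean"),
        mape=("pct_err", "mean"),
    )
    out["share_of_error"] = (
        df.groupby("decile")["abs_err"].sum() / df["abs_err"].sum()
    )
    return out

# Synthetic check: the same proportional miss at every price level.
rng = np.random.default_rng(0)
actual = np.exp(rng.normal(7.8, 1.0, 5000))
predicted = actual * np.exp(rng.normal(0, 0.09, actual.size))
table = error_by_decile(actual, predicted)
print(table.round(3))
```

With a constant proportional error baked in, the `mae` column climbs steeply while `mape` stays flat, which is the same signature the real model shows.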

That distinction is the practical one. If you are pricing engagement rings under a grand, $276 of headline MAE wildly oversells your pain: you will be within forty bucks. If you are underwriting a vault of investment stones, the same model that looked surgical will be a thousand dollars out per diamond, and you should quote a percentage band, not a dollar one. The metric you report should match the stones you sell.

The residual fan

The residual-versus-carat plot says it without deciles. Up to about a carat the residuals are a tight ribbon pinned to zero. Past a carat the ribbon flares into a cone, and out past three carats it is just scatter, individual stones flung hundreds or thousands of dollars high and low.
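
The fan can also be put in a table: bin the residuals by carat and watch the standard deviation grow while the row count collapses. The helper name, the band edges, and the synthetic data (a thin exponential tail of big stones with proportional error) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def residual_spread_by_carat(carat, residual, edges=(0, 1, 2, 3, 6)) -> pd.DataFrame:
    """Count, mean and standard deviation of residuals within carat bands."""
    df = pd.DataFrame({"carat": carat, "residual": residual})
    df["band"] = pd.cut(df["carat"], bins=list(edges))
    return df.groupby("band", observed=True)["residual"].agg(
        n="count", mean="mean", std="std"
    )

# Synthetic stones: many small, few large, with a constant proportional miss.
rng = np.random.default_rng(0)
carat = np.clip(0.2 + rng.exponential(0.6, 5000), None, 5.0)
price = 4000 * carat ** 1.7
residual = price * (np.exp(rng.normal(0, 0.07, carat.size)) - 1)

spread = residual_spread_by_carat(carat, residual)
print(spread.round(1))
```

The `n` column is worth reading alongside `std`: the widest band of residuals is also the one estimated from the fewest stones.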

GBM residuals against carat: a tight ribbon pinned to zero under one carat that flares into a wide cone past one carat, points colored by absolute error, with the worst 4.5-carat miss circled

There is a data reason the fan opens. The set has tens of thousands of sub-one-carat stones and only a handful above three. A tree, asked to price a 4.5-carat stone, has a few dozen comparable rows in all of training to lean on, so it pulls its guess back toward the dense middle it knows. The worst single miss in the test set is exactly that: a 4.5-carat stone that sold for $18,531, priced by the model at $11,605, low by $6,926. Big diamonds are rare, rarity is what a tree-based model interpolates worst, and that same rarity is why those stones cost so much in the first place. The model is least sure precisely where the money is.

Two caveats I will keep honest. This is a listing snapshot, roughly 2017-era, prices nominal: no inflation adjustment, no confirmation a stone changed hands. And $276 is one split’s verdict; a different seed would wobble it. The shape holds regardless of the seed: a good model, a clean diagonal, and a fan that opens right where the stones get expensive enough to care about.
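
The size of that wobble is cheap to measure: repeat the split under several seeds and look at the spread of test MAE. The sketch below does it on assumed synthetic data with a placeholder log-log linear model, so the dollar figures it prints say nothing about the real $276. Only the procedure carries over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed curve and noise level).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, 2000)
price = 4000 * carat ** 1.7 * np.exp(rng.normal(0, 0.15, carat.size))
X, y_log = np.log(carat).reshape(-1, 1), np.log(price)

maes = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_log, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    # Score in dollars: exponentiate both the truth and the prediction.
    maes.append(mean_absolute_error(np.exp(y_te), np.exp(model.predict(X_te))))

print("MAE per seed:", np.round(maes))
print("spread (std):", round(float(np.std(maes)), 1))
```

Reporting the mean and spread across seeds, rather than a single split, is the usual way to put an error bar on a headline number.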