Statistical Society of Australia - Truncated regression

19 Nov 2024 4:45 PM

Reply # 13432386 on 13220333

Chris Lloyd

John Maindonald wrote:
Chris Lloyd wrote:
Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

Chris, the real issue is that the distribution of values of 'carat' is strongly banded, which gives the dark vertical bands in the plots to which you refer. Most of the points are concentrated in these bands, so that one does not see the dropoff in density as the price increases. One needs to use smoothScatter() or an equivalent to see a more visually meaningful picture. I am attaching the plot from
with(ggplot2::diamonds, smoothScatter(carat, price))

Or, repeat the ggplot2 plot with a small enough sample of data that the dark bands separate out to show the separate points. Possibly also an issue is that only for higher valued diamonds is there finesse in setting the carat, setting it higher and moving the carat measures for high dollar diamonds up and away from the dark vertical bands. The second plot is from a sample of 1000 points.
library(ggplot2)
samp1K <- diamonds[sample(1:nrow(diamonds), 1000), ]
ggplot(samp1K, aes(x=carat, y=price)) +
    geom_point(fill="#F79420", color="black", shape=21) +
    scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99)) ) +
    scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99)) ) +
    ggtitle("Diamonds: Price vs. Carat")

I do not agree that the vertical bands with respect to carats are a problem. That is the DATA!!! No jeweler in history ever sold a 0.99 carat diamond. I can see that you have not taught in a b-school! Haha. While the price penalty at whole carats is hard to see in a plot, this does not mean it cannot be well estimated. And the effect is clear in a parametric regression with dummies for price discontinuities.
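
A minimal sketch of what such a regression might look like, not necessarily the model referred to above. The thresholds (0.5, 1 and 2 carats), the cubic in log(carat) and the variable names ge_half, ge_one and ge_two are all illustrative assumptions.

```r
# Data from ggplot2; columns carat, cut, color, clarity, price are used
d <- as.data.frame(ggplot2::diamonds)

# Step indicators at assumed "round number" carat thresholds (illustrative choices)
d$ge_half <- as.numeric(d$carat >= 0.5)
d$ge_one  <- as.numeric(d$carat >= 1.0)
d$ge_two  <- as.numeric(d$carat >= 2.0)

# Smooth polynomial trend in log(carat) plus a jump at each threshold
fit <- lm(log(price) ~ poly(log(carat), 3) + ge_half + ge_one + ge_two +
            cut + color + clarity, data = d)

# Estimated jumps in log(price) at the thresholds, with standard errors
coef(summary(fit))[c("ge_half", "ge_one", "ge_two"), ]
```

Each indicator coefficient estimates the jump in log(price) on crossing that threshold, after allowing for the smooth trend and the quality factors.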

19 Nov 2024 4:42 PM

Reply # 13432384 on 13214920

Chris Lloyd

Thanks for your reply John and others. I think the key problem is that this data set should not have been published without giving the background on the truncation. Unless I misunderstand the origin, this dataset was provided by Hadley Wickham. I did contact him to get some background on the data but received no reply.

Personally, I have NEVER in my entire life failed to respond to a polite query from another academic about a research or professional issue. But then again, I do not have a kool ring in my ear either.

Unless we understand at least some details of the truncation mechanism, I do not see how we can measure success of any prediction method. Under the simplest model that any observed y-value>c is excluded from the data set, the problem of prediction is at least well defined.
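
To make that simplest model concrete, here is a sketch on simulated data (not the diamonds data). It assumes a normal linear model and a known cutoff; the names c_cut and negll and all parameter values are invented for illustration. Under truncation above at c, the density of an observed y is the normal density divided by P(Y <= c | x), and that is the likelihood maximised below.

```r
# Simulate a linear model, then drop every case whose response exceeds c_cut
# (truncation: such cases never enter the data set at all)
set.seed(1)
n <- 5000
x <- runif(n, 0, 3)
y_full <- 1 + 2 * x + rnorm(n, sd = 1.5)  # assumed true model, illustration only
c_cut <- 6                                # assumed known truncation point
keep <- y_full <= c_cut
x_obs <- x[keep]
y_obs <- y_full[keep]

# Negative log-likelihood for a normal response truncated above at c_cut
negll <- function(par, x, y, c_cut) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])                    # log scale keeps sigma positive
  -sum(dnorm(y, mu, sigma, log = TRUE) -
         pnorm(c_cut, mu, sigma, log.p = TRUE))
}

# Ordinary least squares on the truncated sample ignores the truncation
naive <- unname(coef(lm(y_obs ~ x_obs)))

# Maximum likelihood under the truncation model, started from the naive fit
fit <- optim(c(naive, 0), negll, x = x_obs, y = y_obs, c_cut = c_cut,
             method = "BFGS")
trunc_est <- c(intercept = fit$par[1], slope = fit$par[2],
               sigma = exp(fit$par[3]))

rbind(naive = c(naive, NA), truncated_ML = trunc_est)
```

Comparing the two rows shows how far the naive slope is pulled away from the simulated value of 2 when the truncation is ignored.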

22 Aug 2024 8:10 AM

Reply # 13396500 on 13214920

John Maindonald

There is a 2023 paper that compares the performance of a number of different machine learning approaches with this 'Diamonds' dataset, with random forests coming out on top.

Kigo, S.N., Omondi, E.O. & Omolo, B.O. Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model. Sci Rep 13, 17315 (2023). https://doi.org/10.1038/s41598-023-44326-w

The article states that "The dataset contains information on approximately 53,000 diamonds sold by a US-based retailer between 2008 and 2018."

Supporting data and code are available from:
https://datadryad.org/stash/share/C6GT0Srv2PTHitjNyS29EPgAk29hHu_s2RR6CyzVpeM

30 Jun 2023 7:53 PM

Reply # 13222069 on 13214920

John Maindonald

A randomForest model does a pretty good job for predicting price:

MASS::boxcox(price~., data=ggplot2::diamonds) # Suggests a log transform
# although what matters is the distribution of residuals after the fit
diamonds <- ggplot2::diamonds; Y <- diamonds[,"price", drop=T]
samp5K <- sample(1:nrow(diamonds), size=5000)
library(randomForest)
(diamond5K.rf <- randomForest(x=diamonds[samp5K,-7], y=log(Y[samp5K]),
                   xtest=diamonds[-samp5K,-7], ytest=log(Y[-samp5K])))
. . .
          Mean of squared residuals: 0.01350986
                    % Var explained: 98.66
                       Test set MSE: 0.01
                    % Var explained: 98.65

Chris, what kind of modeling did or do you have in mind? I have not tried, but I suspect that it will be hard to find a GAM or more conventional regression model that does anything like as well. Incidentally, fitting a single regression tree generates a tree with ~2400 leaves, and still does somewhat more poorly than the random forest model. This suggests to me that there are some quite hard to chase down interactions involved.

Last modified: 30 Jun 2023 7:56 PM | John Maindonald

30 Jun 2023 12:23 PM

Reply # 13222016 on 13214920

Duncan Lowes

Following with interest and still nervous to comment, but I reckon I can see the regression line (and the truncation) on those graphs. If I could I would draw the line with a pen.

Here is my estimate. Went for a polynomial rather than exponential:

price = 3825.5*carat² + 2486.9*carat - 297.38

Haven't worked out confidence intervals or percentage explained variance - I just drew a line.

I will read with interest for what the true relationship is. Concerned about people seeing too many of my methods

Sorry I forgot before anyone corrects me. That is log price or something like that. Or the variance is clearly log/exponential around whatever mean we have chosen. Whatever

Disclaimer - I am rusty, just having fun speculating, and it's strange that my function went through 5 points randomly located on a grid

1 file

Last modified: 30 Jun 2023 12:57 PM | Duncan Lowes

27 Jun 2023 12:55 PM

Reply # 13220333 on 13219818

John Maindonald

Chris Lloyd wrote:
Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

Chris, the real issue is that the distribution of values of 'carat' is strongly banded, which gives the dark vertical bands in the plots to which you refer. Most of the points are concentrated in these bands, so that one does not see the dropoff in density as the price increases. One needs to use smoothScatter() or an equivalent to see a more visually meaningful picture. I am attaching the plot from
with(ggplot2::diamonds, smoothScatter(carat, price))

Or, repeat the ggplot2 plot with a small enough sample of data that the dark bands separate out to show the separate points. Possibly also an issue is that only for higher valued diamonds is there finesse in setting the carat, setting it higher and moving the carat measures for high dollar diamonds up and away from the dark vertical bands. The second plot is from a sample of 1000 points.
library(ggplot2)
samp1K <- diamonds[sample(1:nrow(diamonds), 1000), ]
ggplot(samp1K, aes(x=carat, y=price)) +
    geom_point(fill="#F79420", color="black", shape=21) +
    scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99)) ) +
    scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99)) ) +
    ggtitle("Diamonds: Price vs. Carat")

2 files

diamondsSM.pdf (31.9 KB)
diamonds1Kpts.pdf (48.62 KB)

Last modified: 27 Jun 2023 1:15 PM | John Maindonald

26 Jun 2023 3:27 PM

Reply # 13219819 on 13214920

Chris Lloyd

Thanks Gillian, I will try out that package.

26 Jun 2023 3:26 PM

Reply # 13219818 on 13214920

Chris Lloyd

Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

23 Jun 2023 7:46 AM

Reply # 13218781 on 13214920

John Maindonald

A good recourse in this context may be quantile regression, using R's qgam package. At the very least, it will offer an insightful perspective on an analysis that fits the mean. For example, fit 25%, 50% and 75% quantiles, allowing a check that the spread of the distribution, as measured by the inter-quartile range, is reasonably constant. The abilities in qgam have the advantage over those in quantreg that the smoothing parameters can be chosen automatically, under independence assumptions.
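
A minimal sketch of that check, assuming the qgam package is installed. The 5000-row subsample, the log transforms and the single smooth term in log(carat) are illustrative choices made to keep the fit quick, not a recommended model.

```r
library(qgam)

set.seed(1)
d <- as.data.frame(ggplot2::diamonds)
d <- d[sample(nrow(d), 5000), ]          # subsample to keep fitting time modest
d$lprice <- log(d$price)
d$lcarat <- log(d$carat)

# Fit the three quartile curves in one call
fits <- mqgam(lprice ~ s(lcarat), data = d, qu = c(0.25, 0.5, 0.75))

# Predict the lower and upper quartiles on a grid of carat values
grid <- data.frame(lcarat = seq(min(d$lcarat), max(d$lcarat), length.out = 50))
q25 <- qdo(fits, 0.25, predict, newdata = grid)
q75 <- qdo(fits, 0.75, predict, newdata = grid)

# Inter-quartile range of log(price) as a function of carat
iqr <- as.numeric(q75 - q25)
plot(exp(grid$lcarat), iqr, type = "l",
     xlab = "carat", ylab = "IQR of log(price)")
```

A roughly flat curve would support a constant-spread assumption on the log scale; a curve that narrows at high carat values would be one possible sign of an upper limit on price.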

But is the price data really truncated? A density plot for the prices gives little suggestion of truncation:
> d <- density(diamonds$price)
> range(d$y[d$x>0])
[1] 8.346320e-09 3.455133e-04
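
A complementary check, sketched here as an assumption about what would be informative rather than as part of the analysis above, is to tabulate the counts in narrow price bins just below the maximum. The 15000 starting point and the bin width of 250 are arbitrary choices.

```r
p <- ggplot2::diamonds$price

# Largest observed price
max(p)

# Counts in bins of width 250 above 15000, up to just past the maximum
brks <- seq(15000, max(p) + 250, by = 250)
h <- hist(p[p > 15000], breaks = brks, plot = FALSE)
data.frame(lower = head(h$breaks, -1), count = h$counts)
```

Counts that stay steady and then stop abruptly at the top bin would point to a cutoff; counts that tail off smoothly towards zero would not.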

There is a quite detailed look at the data at
https://www.rpubs.com/Mohit_kumar_5522/diamonds_53940

1 file

diamonds.pdf (26.37 KB)

Last modified: 24 Jun 2023 12:26 PM | John Maindonald

22 Jun 2023 8:18 AM

Reply # 13218286 on 13217710

Gillian Heller

Hi Chris,

The R package gamlss does censored and truncated regression, for any response distribution in its repertoire (gamlss.dist).
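
As a rough sketch of how this can look, assuming the add-on package gamlss.tr is installed: gen.trun() generates a truncated version of an existing gamlss.dist family, which can then be passed to gamlss(). The data are simulated, and the cutoff of 6 and the suffix "right6" are made up for illustration.

```r
library(gamlss)
library(gamlss.tr)

# Simulated data: cases with y at or above the cutoff are never observed
set.seed(1)
n <- 2000
x <- runif(n, 0, 3)
y_full <- 1 + 2 * x + rnorm(n, sd = 1.5)
dat <- data.frame(x = x[y_full < 6], y = y_full[y_full < 6])

# Generate a normal family truncated on the right at 6; this creates NOright6
gen.trun(par = 6, family = "NO", name = "right6", type = "right")

# Fit the truncated regression
fit <- gamlss(y ~ x, family = NOright6, data = dat, trace = FALSE)
coef(fit)
```

Censored responses are handled separately, by the gamlss.cens package.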

Gillian