Statistical Society of Australia - Truncated regression

30 Jun 2023 7:53 PM

Reply # 13222069 on 13214920

John Maindonald

A randomForest model does a pretty good job of predicting price:

MASS::boxcox(price~., data=ggplot2::diamonds) # Suggests a log transform
# although what matters is the distribution of residuals after the fit
diamonds <- ggplot2::diamonds
Y <- diamonds[, "price", drop=TRUE]   # price as a plain numeric vector
samp5K <- sample(1:nrow(diamonds), size=5000)
library(randomForest)
(diamond5K.rf <- randomForest(x=diamonds[samp5K,-7], y=log(Y[samp5K]),
                   xtest=diamonds[-samp5K,-7], ytest=log(Y[-samp5K])))
. . .
          Mean of squared residuals: 0.01350986
                    % Var explained: 98.66
                       Test set MSE: 0.01
                    % Var explained: 98.65

Chris, what kind of modeling did or do you have in mind? I have not tried, but I suspect that it will be hard to find a GAM or more conventional regression model that does anything like as well. Incidentally, fitting a single regression tree generates a tree with ~2400 leaves, and still does somewhat more poorly than the random forest model. This suggests to me that there are some interactions involved that are quite hard to chase down.

Last modified: 30 Jun 2023 7:56 PM | John Maindonald

30 Jun 2023 12:23 PM

Reply # 13222016 on 13214920

Duncan Lowes

Following with interest and still nervous to comment but I reckon I can see the regression line (and the truncation) on those graphs. If I could I would draw the line with a pen

Here is my estimate. Went for a polynomial rather than exponential

price = 3825.5*carat² + 2486.9*carat - 297.38

Haven't worked out confidence intervals or percentage explained variance - I just drew a line

I will read with interest for what the true relationship is. Concerned about people seeing too many of my methods

Sorry I forgot before anyone corrects me. That is log price or something like that. Or the variance is clearly log/exponential around whatever mean we have chosen. Whatever

Disclaimer - I am rusty, just having fun speculating, and it's strange that my function went through 5 points randomly located on a grid
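For anyone wanting to check the hand-drawn curve against the ceiling, here is a minimal sketch that evaluates the formula exactly as posted. It assumes the formula is on the raw price scale (the post itself is unsure about that), and the 18900 ceiling is simply the "about k$18.9" figure from the original question; `price_hat` and `hit` are names made up for illustration.

```r
# Duncan's quadratic, copied as posted; raw price scale is an assumption
price_hat <- function(carat) 3825.5 * carat^2 + 2486.9 * carat - 297.38

# Evaluate at a few carat values
carats <- c(0.5, 1, 1.5, 2)
print(data.frame(carat = carats, price_hat = price_hat(carats)))

# Carat value at which the curve reaches the approximate ceiling of 18900
hit <- uniroot(function(carat) price_hat(carat) - 18900, interval = c(1, 3))
print(hit$root)
```

At 2 carats the curve is already above the ceiling, so on this reading any line drawn through the visible points beyond roughly that size is being shaped by the cut-off.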

1 file

Last modified: 30 Jun 2023 12:57 PM | Duncan Lowes

27 Jun 2023 12:55 PM

Reply # 13220333 on 13219818

John Maindonald

Chris Lloyd wrote:
Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

Chris, the real issue is that the distribution of values of 'carat' is strongly banded, which gives the dark vertical bands in the plots to which you refer. Most of the points are concentrated in these bands, so that one does not see the dropoff in density as the price increases. One needs to use smoothScatter() or an equivalent to see a more visually meaningful picture. I am attaching the plot from
with(ggplot2::diamonds, smoothScatter(carat, price))

Or, repeat the ggplot2 plot with a small enough sample of data that the dark bands separate out to show the individual points. Another possible issue is that only for higher-valued diamonds is there finesse in setting the carat, which moves the carat measures for high-dollar diamonds up and away from the dark vertical bands. The second plot is from a sample of 1000 points.
library(ggplot2)
samp1K <- diamonds[sample(nrow(diamonds), 1000), ]   # random sample of 1000 rows
ggplot(samp1K, aes(x=carat, y=price)) +
    geom_point(fill="#F79420", color="black", shape=21) +
    scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
    scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
    ggtitle("Diamonds: Price vs. Carat")

2 files

diamondsSM.pdf (31.9 KB)
diamonds1Kpts.pdf (48.62 KB)

Last modified: 27 Jun 2023 1:15 PM | John Maindonald

26 Jun 2023 3:27 PM

Reply # 13219819 on 13214920

Chris Lloyd

Thanks Gillian, I will try out that package.

26 Jun 2023 3:26 PM

Reply # 13219818 on 13214920

Chris Lloyd

Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

23 Jun 2023 7:46 AM

Reply # 13218781 on 13214920

John Maindonald

A good recourse in this context may be quantile regression, using R's qgam package. At the very least, it will offer an insightful perspective on an analysis that fits the mean. For example, fit 25%, 50% and 75% quantiles, allowing a check that the spread of the distribution, as measured by the inter-quartile range, is reasonably constant. Abilities in qgam have the advantage over those in quantreg that the smoothing parameters can be automatically chosen, under independence assumptions.
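A rough sketch of the inter-quartile-range check described above, assuming the qgam package is installed. It uses simulated data with constant spread rather than the diamonds data, so the variable names (`dat`, `grid`, `iqr`) and the model formula are illustrative only.

```r
library(qgam)  # assumed to be installed from CRAN

# Simulated data with constant spread (an assumption for illustration)
set.seed(3)
n <- 1000
dat <- data.frame(x = runif(n, 0, 3))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)

# Fit smooth 25%, 50% and 75% quantile curves in one call
fits <- mqgam(y ~ s(x), data = dat, qu = c(0.25, 0.5, 0.75))

# Predict the lower and upper quartiles on a grid and take the difference
grid <- data.frame(x = seq(0.2, 2.8, length.out = 50))
q25 <- qdo(fits, 0.25, predict, newdata = grid)
q75 <- qdo(fits, 0.75, predict, newdata = grid)
iqr <- as.numeric(q75 - q25)

# A roughly flat line suggests constant spread across x
plot(grid$x, iqr, type = "l", xlab = "x", ylab = "Fitted IQR")
```

With the diamonds data one would replace the simulated frame with the real one and, for example, model log(price) on a smooth of carat; a fitted IQR that trends with carat would point to non-constant spread.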

But is the price data really truncated? A density plot for the prices gives little suggestion of truncation:
> d <- density(diamonds$price)
> range(d$y[d$x>0])
[1] 8.346320e-09 3.455133e-04

There is a quite detailed look at the data at
https://www.rpubs.com/Mohit_kumar_5522/diamonds_53940

1 file

diamonds.pdf (26.37 KB)

Last modified: 24 Jun 2023 12:26 PM | John Maindonald

22 Jun 2023 8:18 AM

Reply # 13218286 on 13217710

Gillian Heller

Hi Chris,

The R package gamlss does censored and truncated regression, for any response distribution in its repertoire (gamlss.dist).

Gillian

21 Jun 2023 6:55 AM

Reply # 13217710 on 13214920

Duncan Hedderley

Hi Chris

Is this censored regression? In which case R packages censReg (https://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf) or AER (https://stats.stackexchange.com/questions/149091/censored-regression-in-r) look like they will handle it.

Wish I knew the story about the upper limit on quoted price.

According to https://stats.oarc.ucla.edu/r/dae/tobit-models/ the speedometers in American cars in the 1980s weren't allowed to read more than 85 mph, so if you wanted to predict maximum speed from engine size and weight, and could only use the speedo' to measure maximum speed, the data would be censored (any speed above 85 mph is recorded as 85).

Duncan
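The speedometer story is a case of censoring: fast cars stay in the data, but their reading is stuck at the cap. A minimal sketch of a censored (Tobit-style) fit follows, using survreg() from the survival package that ships with R rather than censReg or AER. The data are simulated, and the variable names and numbers (`engine`, `cap`, the coefficients) are invented for illustration.

```r
library(survival)  # recommended package, ships with R

# Simulated speeds (assumed values, not real car data)
set.seed(2)
n <- 2000
engine <- runif(n, 1, 5)                          # hypothetical engine size
true_speed <- 50 + 10 * engine + rnorm(n, sd = 8) # latent top speed
cap <- 85
shown <- pmin(true_speed, cap)                    # what the speedometer shows
observed <- as.integer(true_speed < cap)          # 1 = exact, 0 = censored at cap

# Ordinary regression on the capped readings understates the slope
naive <- lm(shown ~ engine)

# Gaussian right-censored regression recovers the latent relationship
tobit <- survreg(Surv(shown, observed) ~ engine, dist = "gaussian")

print(coef(naive))
print(coef(tobit))
```

If instead the cars above the cap were dropped from the data altogether, that would be truncation, and a different likelihood is needed.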

20 Jun 2023 7:51 PM

Reply # 13217310 on 13214920

Duncan Lowes

Hi Chris

No response yet. This can be a lonely forum

Sorry I have no knowledge of the dataset and my memories of truncated methods are very rusty

I am looking at it now out of curiosity but may not be able to help

I feel unqualified even to comment and run the risk of saying something stupid if I do. I thought most people ignored stuff like that. The further you get from the (possibly transformed) means of your data, maybe truncation doesn't matter much most of the time :) Can you not see from the truncated data what may be going on? Sorry

Hope somebody has helped

regards Duncan

Last modified: 20 Jun 2023 8:06 PM | Duncan Lowes

14 Jun 2023 10:02 AM

Message # 13214920

Chris Lloyd

I have constructed an assignment for my Business Analytics course using the Diamonds data set in the ggplot2 package. Unfortunately, I have just realised that the y-variable, price, is truncated for some reason at about $18.9K (see HERE for a plot). There is nothing in the documentation of this data set that explains its source or the reason for truncation.

I guess I have two questions.

Is anybody familiar with this data set and the true story for why it is truncated? (I can easily write my own story).
Is there a standard linear model package that will estimate a regression where the Y-variable has a hard known truncation (let’s assume the observed y is conditional on Y<c)? I am kind of embarrassed that I never encountered a truncated regression before.
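To make the second question concrete, here is a minimal sketch of a truncated regression fitted by maximum likelihood in base R. It assumes a normal linear model observed only when Y < c, so each observation contributes the normal density divided by P(Y < c | x). The data are simulated, not the diamonds set, and `cut_point`, `negloglik` and `trunc_est` are hypothetical names. Dedicated packages exist for this; the sketch only shows the idea.

```r
# Simulated data (an assumption for illustration, not the diamonds data)
set.seed(1)
n <- 5000
x <- runif(n, 0, 3)
y_latent <- 1 + 2 * x + rnorm(n, sd = 1.5)
cut_point <- 6                       # hypothetical truncation point c
keep <- y_latent < cut_point         # rows with Y >= c are never observed
x_obs <- x[keep]
y_obs <- y_latent[keep]

# Negative log-likelihood: normal density renormalised by P(Y < c | x)
negloglik <- function(par, x, y, upper) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])               # log scale keeps sigma positive
  -sum(dnorm(y, mu, sigma, log = TRUE) -
         pnorm(upper, mu, sigma, log.p = TRUE))
}

# Ordinary least squares ignores the truncation; used here as a start value
ols <- lm(y_obs ~ x_obs)
start <- c(unname(coef(ols)), log(sigma(ols)))

fit <- optim(start, negloglik, x = x_obs, y = y_obs, upper = cut_point,
             method = "BFGS")
trunc_est <- c(intercept = fit$par[1], slope = fit$par[2],
               sigma = exp(fit$par[3]))

print(coef(ols))    # slope pulled towards zero by the truncation
print(trunc_est)    # should sit close to the simulated values 1, 2, 1.5
```

The comparison of the two printed fits is the point: the least squares slope is attenuated because high-Y observations are missing at large x, while the truncated likelihood corrects for it.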