

Truncated regression

  • 19 Nov 2024 4:45 PM
    Reply # 13432386 on 13220333
    John Maindonald wrote:
    Chris Lloyd wrote:

    Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

    Chris, the real issue is that the distribution of values of 'carat' is strongly banded, which gives the dark vertical bands in the plots to which you refer.  Most of the points are concentrated in these bands, so that one does not see the dropoff in density as the price increases.  One needs to use smoothScatter() or an equivalent to see a more visually meaningful picture.  I am attaching the plot from
    with(ggplot2::diamonds, smoothScatter(carat, price))

    Or, repeat the ggplot2 plot with a small enough sample of data that the dark bands separate out to show the separate points.  Possibly also an issue is that only for higher valued diamonds is there finesse in setting the carat, setting it higher and moving the carat measures for high dollar diamonds up and away from the dark vertical bands. The second plot is from a sample of 1000 points.
    library(ggplot2)
    samp1K <- diamonds[sample(nrow(diamonds), 1000), ]
    ggplot(samp1K, aes(x = carat, y = price)) +
        geom_point(fill = "#F79420", color = "black", shape = 21) +
        scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
        scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
        ggtitle("Diamonds: Price vs. Carat")


    I do not agree that the vertical bands with respect to carats are a problem. That is the DATA!!! No jeweler in history ever sold a 0.99 carat diamond. I can see that you have not taught in a b-school! Haha.  While the price penalty at whole carats is hard to see in a plot, this does not mean it cannot be well estimated. And the effect is clear in a parametric regression with dummies for price discontinuities.
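
    A regression with a dummy at a price discontinuity can be sketched as follows. This is only an illustration on simulated data, not a fit to diamonds: the 1-carat threshold, the step size, and every coefficient are made-up assumptions, and a real analysis would presumably need dummies at several carat thresholds.

```r
# Simulated stand-in for the diamonds data; all numbers are made up.
set.seed(7)
n     <- 20000
carat <- runif(n, 0.3, 2)
jump  <- 0.15                         # hypothetical log-price step at 1 carat
logp  <- 7 + 1.8 * log(carat) + jump * (carat >= 1) + rnorm(n, sd = 0.2)

# Smooth trend in log(carat) plus a 0/1 dummy that switches on at 1 carat.
fit <- lm(logp ~ log(carat) + I(carat >= 1))
est <- unname(coef(fit)[3])           # coefficient on the dummy
print(round(est, 3))
```

    The dummy coefficient estimates the step even though a scatterplot of the same data would show it only faintly.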

  • 19 Nov 2024 4:42 PM
    Reply # 13432384 on 13214920

    Thanks for your reply John and others. I think the key problem is that this data set should not have been published without giving the background on the truncation. Unless I misunderstand the origin, this dataset was provided by Hadley Wickham. I did contact him to get some background on the data but received no reply.

    Personally, I have NEVER in my entire life failed to respond to a polite query from another academic about a research or professional issue. But then again, I do not have a kool ring in my ear either.

    Unless we understand at least some details of the truncation mechanism, I do not see how we can measure success of any prediction method. Under the simplest model that any observed y-value>c is excluded from the data set, the problem of prediction is at least well defined.
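
    Under that simplest model, each observed point contributes the untruncated density divided by P(Y <= c | x) to the likelihood. Here is a minimal sketch on simulated data, assuming normal errors and a known cutoff. The cutoff, the coefficients, and all names are invented for illustration, and base R's optim() stands in for a dedicated package.

```r
# Simulate a linear model, then drop every observation with y > c_cut.
set.seed(1)
n     <- 5000
x     <- runif(n, 0, 2)
y     <- 1 + 2 * x + rnorm(n, sd = 1)
c_cut <- 4                            # hypothetical truncation point
keep  <- y <= c_cut
xt    <- x[keep]
yt    <- y[keep]

# Naive least squares on the truncated sample (slope pulled towards zero).
ols <- unname(coef(lm(yt ~ xt)))

# Negative log-likelihood for a normal response truncated above at c_cut:
# log f(y | x) - log P(Y <= c_cut | x), summed over the retained points.
negll <- function(par) {
  mu    <- par[1] + par[2] * xt
  sigma <- exp(par[3])                # log scale keeps sigma positive
  -sum(dnorm(yt, mu, sigma, log = TRUE) -
       pnorm(c_cut, mu, sigma, log.p = TRUE))
}
fit <- optim(c(ols, 0), negll, method = "BFGS")
mle <- fit$par[1:2]

print(rbind(ols = ols, mle = mle))    # true values are 1 and 2
```

    With the mechanism known, success can be judged against the untruncated relationship; without it, there is nothing to compare against.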

  • 22 Aug 2024 8:10 AM
    Reply # 13396500 on 13214920

    There is a 2023 paper that compares the performance of a number of different machine learning approaches with this 'Diamonds' dataset, with random forests coming out on top.

    Kigo, S.N., Omondi, E.O. & Omolo, B.O. Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model. Sci Rep 13, 17315 (2023). https://doi.org/10.1038/s41598-023-44326-w

    The article states that "The dataset contains information on approximately 53,000 diamonds sold by a US-based retailer between 2008 and 2018."

    Supporting data and code are available from:
    https://datadryad.org/stash/share/C6GT0Srv2PTHitjNyS29EPgAk29hHu_s2RR6CyzVpeM

  • 30 Jun 2023 7:53 PM
    Reply # 13222069 on 13214920

    A randomForest model does a pretty good job for predicting price:

    MASS::boxcox(price~., data=ggplot2::diamonds)  # Suggests a log transform
     # although what matters is the distribution of residuals after the fit
    diamonds <- ggplot2::diamonds; Y <- diamonds[,"price", drop=T]
    samp5K <- sample(1:nrow(diamonds), size=5000)
    library(randomForest)
    (diamond5K.rf <- randomForest(x=diamonds[samp5K,-7], y=log(Y[samp5K]),
                       xtest=diamonds[-samp5K,-7], ytest=log(Y[-samp5K])))
    . . .
              Mean of squared residuals: 0.01350986
                        % Var explained: 98.66
                           Test set MSE: 0.01
                        % Var explained: 98.65

    Chris, what kind of modeling did or do you have in mind?  I have not tried, but I suspect that it will be hard to find a GAM or more conventional regression model that does anything like as well.  Incidentally, fitting a single regression tree generates a tree with ~2400 leaves, and still does somewhat more poorly than the random forest model.  This suggests to me that there are some quite hard to chase down interactions involved.

    Last modified: 30 Jun 2023 7:56 PM | John Maindonald
  • 30 Jun 2023 12:23 PM
    Reply # 13222016 on 13214920

    Following with interest and still nervous to comment, but I reckon I can see the regression line (and the truncation) on those graphs. If I could, I would draw the line with a pen.

    Here is my estimate. Went for a polynomial rather than exponential

    price = 3825.5*carat^2 + 2486.9*carat - 297.38. Haven't worked out confidence intervals or percentage explained variance - I just drew a line.

    I will read with interest for what the true relationship is. Concerned about people seeing too many of my methods

    Sorry I forgot before anyone corrects me. That is log price or something like that. Or the variance is clearly log/exponential around whatever mean we have chosen. Whatever

    Disclaimer - I am rusty, just having fun speculating, and it's strange that my function went through 5 points randomly located on a grid.
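
    For anyone wanting to see what the hand-drawn curve implies, a small sketch that evaluates it at a few carat values. It reads the first term as carat squared and takes the fitted value at face value, in whatever units the response was in. The helper name price_hat is made up.

```r
# Hypothetical helper wrapping the coefficients quoted in the post.
price_hat <- function(carat) 3825.5 * carat^2 + 2486.9 * carat - 297.38

carats <- c(0.5, 1, 2)
print(data.frame(carat = carats, fitted = price_hat(carats)))
```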



    1 file
    Last modified: 30 Jun 2023 12:57 PM | Duncan Lowes
  • 27 Jun 2023 12:55 PM
    Reply # 13220333 on 13219818
    Chris Lloyd wrote:

    Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

    Chris, the real issue is that the distribution of values of 'carat' is strongly banded, which gives the dark vertical bands in the plots to which you refer.  Most of the points are concentrated in these bands, so that one does not see the dropoff in density as the price increases.  One needs to use smoothScatter() or an equivalent to see a more visually meaningful picture.  I am attaching the plot from
    with(ggplot2::diamonds, smoothScatter(carat, price))

    Or, repeat the ggplot2 plot with a small enough sample of data that the dark bands separate out to show the separate points.  Possibly also an issue is that only for higher valued diamonds is there finesse in setting the carat, setting it higher and moving the carat measures for high dollar diamonds up and away from the dark vertical bands. The second plot is from a sample of 1000 points.
    library(ggplot2)
    samp1K <- diamonds[sample(nrow(diamonds), 1000), ]
    ggplot(samp1K, aes(x = carat, y = price)) +
        geom_point(fill = "#F79420", color = "black", shape = 21) +
        scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
        scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
        ggtitle("Diamonds: Price vs. Carat")

    2 files
    Last modified: 27 Jun 2023 1:15 PM | John Maindonald
  • 26 Jun 2023 3:27 PM
    Reply # 13219819 on 13214920

    Thanks Gillian, I will try out that package.

  • 26 Jun 2023 3:26 PM
    Reply # 13219818 on 13214920

    Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

  • 23 Jun 2023 7:46 AM
    Reply # 13218781 on 13214920

    A good recourse in this context may be quantile regression, using R's qgam package.  At the very least, it will offer an insightful perspective on an analysis that fits the mean.  For example, fit 25%, 50% and 75% quantiles, allowing a check that the spread of the distribution, as measured by the inter-quartile range, is reasonably constant.  Abilities in qgam have the advantage over those in quantreg that the smoothing parameters can be automatically chosen, under independence assumptions.
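
    As a crude base-R stand-in for that check (binned empirical quartiles, not a qgam fit), one can cut the predictor into narrow bins and compare the inter-quartile range across them. The data below are simulated, and the variable names and coefficients are assumptions for illustration only.

```r
# Simulated stand-in for (carat, log price); not the diamonds data.
set.seed(42)
n     <- 4000
carat <- runif(n, 0.2, 2.5)
logp  <- 6 + 1.2 * carat + rnorm(n, sd = 0.25)

# Cut carat into 8 equal-count bins and take the quartiles of logp in each.
breaks <- quantile(carat, probs = seq(0, 1, length.out = 9))
bin    <- cut(carat, breaks = breaks, include.lowest = TRUE)
q      <- t(sapply(split(logp, bin), quantile, probs = c(0.25, 0.5, 0.75)))
iqr    <- q[, "75%"] - q[, "25%"]

# Roughly equal IQRs across bins suggest constant spread on this scale.
print(round(cbind(q, IQR = iqr), 2))
```

    Smooth quantile fits do the same job without the arbitrary choice of bins, and without mixing the within-bin trend into the spread.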

    But is the price data really truncated?  A density plot for the prices gives little suggestion of truncation:
    > d <- density(diamonds$price)
    > range(d$y[d$x>0])
    [1] 8.346320e-09 3.455133e-04

    There is a quite detailed look at the data at
    https://www.rpubs.com/Mohit_kumar_5522/diamonds_53940

    1 file
    Last modified: 24 Jun 2023 12:26 PM | John Maindonald
  • 22 Jun 2023 8:18 AM
    Reply # 13218286 on 13217710

    Hi Chris,

    The R package gamlss does censored and truncated regression, for any response distribution in its repertoire (gamlss.dist). 

    Gillian



