Truncated regression

  • 22 Aug 2024 8:10 AM
    Reply # 13396500 on 13214920

    There is a 2023 paper that compares the performance of a number of different machine learning approaches with this 'Diamonds' dataset, with random forests coming out on top.

    Kigo, S.N., Omondi, E.O. & Omolo, B.O. Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model. Sci Rep 13, 17315 (2023). https://doi.org/10.1038/s41598-023-44326-w

    The article states that "The dataset contains information on approximately 53,000 diamonds sold by a US-based retailer between 2008 and 2018."

    Supporting data and code are available from:
    https://datadryad.org/stash/share/C6GT0Srv2PTHitjNyS29EPgAk29hHu_s2RR6CyzVpeM

  • 30 Jun 2023 7:53 PM
    Reply # 13222069 on 13214920

    A randomForest model does a pretty good job for predicting price:

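    library(randomForest)
    diamonds <- ggplot2::diamonds
    MASS::boxcox(price ~ ., data = diamonds)  # Suggests a log transform,
      # although what matters is the distribution of residuals after the fit
    Y <- diamonds[["price"]]
    # Random training sample of 5000 rows (call set.seed() first for reproducibility)
    samp5K <- sample(nrow(diamonds), size = 5000)
    # Column 7 is price: drop it from the predictors and model log(price);
    # the remaining ~49,000 rows serve as the test set
    (diamond5K.rf <- randomForest(x = diamonds[samp5K, -7], y = log(Y[samp5K]),
                       xtest = diamonds[-samp5K, -7], ytest = log(Y[-samp5K])))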
    . . .
              Mean of squared residuals: 0.01350986
                        % Var explained: 98.66
                           Test set MSE: 0.01
                        % Var explained: 98.65

    Chris, what kind of modeling did you, or do you, have in mind?  I have not tried, but I suspect that it will be hard to find a GAM or more conventional regression model that does anything like as well.  Incidentally, fitting a single regression tree generates a tree with ~2400 leaves, and it still does somewhat more poorly than the random forest model.  This suggests to me that there are some interactions involved that are quite hard to chase down.

    Last modified: 30 Jun 2023 7:56 PM | John Maindonald
  • 30 Jun 2023 12:23 PM
    Reply # 13222016 on 13214920

    Following with interest and still nervous to comment, but I reckon I can see the regression line (and the truncation) on those graphs. If I could, I would draw the line with a pen.

    Here is my estimate. Went for a polynomial rather than exponential:

    price = 3825.5*carat^2 + 2486.9*carat - 297.38

    Haven't worked out confidence intervals or percentage explained variance - I just drew a line.

    I will read with interest for what the true relationship is. Concerned about people seeing too many of my methods.

    Sorry, I forgot; before anyone corrects me: that is log price or something like that. Or the variance is clearly log/exponential around whatever mean we have chosen. Whatever.

    Disclaimer - I am rusty, just having fun speculating, and it's strange that my function went through 5 points randomly located on a grid.
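
    For reference, the eyeballed curve can be evaluated at a few carat values. This is only a sketch: it reads the "carat2" term as carat squared, the function name `eyeballed_price` and the chosen carat values are made up for illustration, and the coefficients are simply those quoted in the post, not a fitted model.

```r
# Eyeballed quadratic from the post (coefficients as quoted; not a fitted model).
# `eyeballed_price` is a hypothetical helper name used only for this sketch.
eyeballed_price <- function(carat) {
  3825.5 * carat^2 + 2486.9 * carat - 297.38
}

carats <- c(0.5, 1, 2)            # illustrative carat values only
est <- eyeballed_price(carats)    # curve height at each carat value
print(data.frame(carat = carats, estimate = round(est, 2)))
```

    If the formula is read as price in dollars, the curve reaches the roughly $19,000 level mentioned elsewhere in the thread at about 2 carats.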



    1 file
    Last modified: 30 Jun 2023 12:57 PM | Duncan Lowes
  • 27 Jun 2023 12:55 PM
    Reply # 13220333 on 13219818
    Chris Lloyd wrote:

    Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

    Chris, the real issue is that the distribution of values of 'carat' is strongly banded, which gives the dark vertical bands in the plots to which you refer.  Most of the points are concentrated in these bands, so that one does not see the dropoff in density as the price increases.  One needs to use smoothScatter() or an equivalent to see a more visually meaningful picture.  I am attaching the plot from
    with(ggplot2::diamonds, smoothScatter(carat, price))

    Or, repeat the ggplot2 plot with a sample of data small enough that the dark bands separate out into individual points.  Another possible issue is that only for higher-valued diamonds is there any finesse in setting the carat; setting it higher moves the carat measures for high-dollar diamonds up and away from the dark vertical bands. The second plot is from a sample of 1000 points.
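    library(ggplot2)   # provides both ggplot() and the diamonds data
    samp1K <- diamonds[sample(nrow(diamonds), 1000), ]
    # Axis limits are set at the 99th percentiles of the full data;
    # sampled points beyond those limits are dropped from the plot
    ggplot(samp1K, aes(x = carat, y = price)) +
        geom_point(fill = "#F79420", color = "black", shape = 21) +
        scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
        scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
        ggtitle("Diamonds: Price vs. Carat")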

    2 files
    Last modified: 27 Jun 2023 1:15 PM | John Maindonald
  • 26 Jun 2023 3:27 PM
    Reply # 13219819 on 13214920

    Thanks Gillian, I will try out that package.

  • 26 Jun 2023 3:26 PM
    Reply # 13219818 on 13214920

    Replying to John. If you plot price against say carats, you will see the truncation very clearly (at about $19,000).

  • 23 Jun 2023 7:46 AM
    Reply # 13218781 on 13214920

    A good recourse in this context may be quantile regression, using R's qgam package.  At the very least, it will offer an insightful perspective on an analysis that fits the mean.  For example, fit the 25%, 50% and 75% quantiles, allowing a check that the spread of the distribution, as measured by the inter-quartile range, is reasonably constant.  Abilities in qgam have the advantage over those in quantreg that the smoothing parameters can be chosen automatically, under independence assumptions.
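
    A minimal sketch of that inter-quartile check, assuming the qgam package is installed. It uses simulated data rather than the diamonds data, so the data frame `dat`, the grid `nd` and the noise level are all illustrative assumptions; the code is skipped if qgam is not available.

```r
# Sketch: is the inter-quartile range roughly constant across x?
# Simulated data (an assumption for illustration), not the diamonds data.
if (requireNamespace("qgam", quietly = TRUE)) {
  library(qgam)
  set.seed(42)
  n <- 500
  dat <- data.frame(x = runif(n, 0, 3))
  dat$y <- sin(dat$x) + rnorm(n, sd = 0.5)   # constant spread by construction

  # Fit the 25%, 50% and 75% quantiles with a common smooth formula
  fit <- mqgam(y ~ s(x), data = dat, qu = c(0.25, 0.5, 0.75))

  # Predict the outer quartiles on a grid and take their difference
  nd <- data.frame(x = seq(0.2, 2.8, length.out = 20))
  q25 <- qdo(fit, qu = 0.25, fun = predict, newdata = nd)
  q75 <- qdo(fit, qu = 0.75, fun = predict, newdata = nd)
  iqr_hat <- as.numeric(q75 - q25)
  print(summary(iqr_hat))   # a narrow range suggests roughly constant spread
}
```

    With real data, a fitted inter-quartile range that widens or narrows systematically along the predictor would point to non-constant spread.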

    But is the price data really truncated?  A density plot for the prices gives little suggestion of truncation:
    > d <- density(diamonds$price)
    > range(d$y[d$x>0])
    [1] 8.346320e-09 3.455133e-04

    There is a quite detailed look at the data at
    https://www.rpubs.com/Mohit_kumar_5522/diamonds_53940

    1 file
    Last modified: 24 Jun 2023 12:26 PM | John Maindonald
  • 22 Jun 2023 8:18 AM
    Reply # 13218286 on 13217710

    Hi Chris,

    The R package gamlss does censored and truncated regression, for any response distribution in its repertoire (gamlss.dist). 
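
    To show what a truncated-regression fit does, here is a hand-written sketch in base R. It does not use the gamlss interface; it maximises a right-truncated normal likelihood directly with optim(). The data are simulated, and the cutoff `upper`, the coefficients and the helper `negloglik` are assumptions made up for the illustration.

```r
# Sketch: regression when rows with y >= upper are missing from the sample
# entirely (truncation). Simulated data; all values are illustrative.
set.seed(1)
n <- 5000
x <- runif(n, 0, 3)
y_full <- 1 + 2 * x + rnorm(n, sd = 1.5)   # true intercept 1, slope 2
upper <- 6                                  # assumed truncation point
keep <- y_full < upper
xt <- x[keep]
yt <- y_full[keep]

# Negative log-likelihood of a normal model truncated above at `upper`:
# each density is rescaled by the probability of falling below the cutoff.
negloglik <- function(par) {
  mu <- par[1] + par[2] * xt
  sigma <- exp(par[3])                      # log scale keeps sigma positive
  -sum(dnorm(yt, mu, sigma, log = TRUE) -
         pnorm(upper, mu, sigma, log.p = TRUE))
}

ols <- lm(yt ~ xt)                          # ignores the truncation
start <- unname(c(coef(ols), log(sigma(ols))))
fit <- optim(start, negloglik, method = "BFGS")

ols_slope <- unname(coef(ols)[2])
trunc_slope <- fit$par[2]
print(c(ols = ols_slope, truncated = trunc_slope))
```

    In this simulation the ordinary least squares slope is pulled towards zero, while the truncated likelihood should recover a slope much closer to the true value of 2.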

    Gillian




  • 21 Jun 2023 6:55 AM
    Reply # 13217710 on 13214920

    Hi Chris

    Is this censored regression? In which case R packages censReg (https://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf) or AER (https://stats.stackexchange.com/questions/149091/censored-regression-in-r) look like they will handle it.

    Wish I knew the story about the upper limit on quoted price.

    According to https://stats.oarc.ucla.edu/r/dae/tobit-models/ the speedometers in American cars in the 1980s weren't allowed to read more than 85 mph, so if you wanted to predict maximum speed from engine size and weight, and could only use the speedo' to measure maximum speed, the data would be censored from above at 85 mph.
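
    A sketch of the speedometer situation with simulated data, where values above the limit are recorded at the limit. It uses survreg() from the survival package with a Gaussian distribution, which is one way of fitting a tobit-type model (to my understanding AER::tobit is a convenience interface to the same function). The variable names, coefficients and noise level are assumptions for illustration only.

```r
# Sketch: regression with an upper recording limit (simulated data).
library(survival)
set.seed(123)
n <- 2000
size <- runif(n, 1, 5)                            # hypothetical engine size
speed_true <- 50 + 10 * size + rnorm(n, sd = 8)   # true slope is 10
limit <- 85                                       # speedometer maximum
speed_obs <- pmin(speed_true, limit)              # readings capped at the limit
exact <- as.numeric(speed_true < limit)           # 1 = exact reading, 0 = capped

# Ordinary regression treats capped readings as if they were exact
fit_ols <- lm(speed_obs ~ size)

# Gaussian survreg treats capped readings as "at least `limit`"
fit_tobit <- survreg(Surv(speed_obs, exact, type = "right") ~ size,
                     dist = "gaussian")

print(rbind(ols = coef(fit_ols), tobit = coef(fit_tobit)))
```

    In this simulation the ordinary regression slope is flattened by the capped readings, while the censored fit should come out close to the true slope of 10.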


    Duncan

  • 20 Jun 2023 7:51 PM
    Reply # 13217310 on 13214920

    Hi Chris

    No response yet. This can be a lonely forum.

    Sorry, I have no knowledge of the dataset and my memories of truncated methods are very rusty.

    I am looking at it now out of curiosity but may not be able to help.

    I feel unqualified even to comment, and run the risk of saying something stupid if I do. I thought most people ignored stuff like that. The further you get from the (possibly transformed) means of your data, maybe truncation doesn't matter much most of the time :) Can you not see from the truncated data what may be going on? Sorry.

    Hope somebody has helped.

    regards Duncan


    Last modified: 20 Jun 2023 8:06 PM | Duncan Lowes