In a previous post I loaded, cleaned and visualized aspects of a data set retrieved from Kaggle. The data set is based on used car postings on eBay in Germany.
In this post I want to conduct additional analysis steps. The main objective is to gain a better understanding of how prices, and thus the value of used cars, are influenced by mileage and age.
Mileage is measured in kilometres (since the data is from Germany), and age is measured as the difference in years between 2016 and the year of car registration. Only cars with a price between 100 and 100,000 EUR are considered in this analysis. Moreover, I only consider registration years from 2006 onwards. Most data was scraped from eBay in 2016, so 2016 serves as the reference year, and the filter used below (yearOfRegistration < 2016) keeps cars registered up to and including 2015.
# read in the data
data_df = read.csv("autos.csv",header=TRUE,sep = ",",stringsAsFactors = TRUE)
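# load dplyr and magrittr
library(dplyr)
library(magrittr)
# select relevant variables only (the columns that appear in the summary further down)
data_df = data_df %>% select(name, seller, price, vehicleType, yearOfRegistration, gearbox, powerPS,
                             model, kilometer, fuelType, monthOfRegistration, brand, postalCode)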
# clean out data entries with empty data points
for(i in 1:length(colnames(data_df))){
  data_df = data_df[data_df[,i]!="",]
}
# keep only prices between 100 and 100000 EUR and registration years from 2006 up to (but excluding) 2016 (in this case using dplyr)
data_df = data_df %>% filter(price>=100,price<=100000,yearOfRegistration<2016,yearOfRegistration>=2006)
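To make the effect of the two cleaning steps easier to follow, here is a minimal sketch on a made-up toy data frame. The rows, values and the names `toy_df`, `keep` and `in_range` are invented for illustration and are not part of the Kaggle data; the sketch uses base R only and is one possible way to express the same conditions, not necessarily how the original pipeline behaves on every edge case (for example with missing values).

```r
# toy data, invented for illustration only
toy_df <- data.frame(
  name               = c("car_a", "car_b", "", "car_d", "car_e"),
  price              = c(50, 5000, 7000, 120000, 9000),
  yearOfRegistration = c(2010, 2008, 2012, 2014, 2016),
  stringsAsFactors   = TRUE
)

# step 1: drop rows that contain an empty string in any column
keep <- rep(TRUE, nrow(toy_df))
for (col in names(toy_df)) {
  keep <- keep & toy_df[[col]] != ""
}
toy_df <- toy_df[keep, ]

# step 2: keep prices in [100, 100000] EUR and registration years 2006 to 2015
in_range <- toy_df$price >= 100 & toy_df$price <= 100000 &
  toy_df$yearOfRegistration >= 2006 & toy_df$yearOfRegistration < 2016
toy_df <- toy_df[in_range, ]

# only "car_b" passes every condition
print(toy_df)
```

In this toy example the row with the empty name is removed in step 1, while the rows priced at 50 and 120000 EUR and the row registered in 2016 are removed in step 2.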
# print a summary
summary(data_df)
##                name                     seller              price
##  BMW_116i                 :  141   gewerblich:    1   Min.   :  101
##  Audi_A4_Avant_2.0_TDI_DPF:  107   privat    :55390   1st Qu.: 5850
##  BMW_118d_DPF             :   97                      Median : 9300
##  BMW_320d_DPF_Touring     :   87                      Mean   :11783
##  Ford_Fiesta_1.25_Trend   :   74                      3rd Qu.:14999
##  MINI_Mini_One            :   71                      Max.   :99999
##  (Other)                  :54814
##    vehicleType       yearOfRegistration      gearbox          powerPS
##  limousine :14469   Min.   :2006                  :    0   Min.   :    0.0
##  kombi     :12394   1st Qu.:2007         automatik:16259   1st Qu.:  100.0
##  kleinwagen:11109   Median :2009         manuell  :39132   Median :  136.0
##  bus       : 6357   Mean   :2009                           Mean   :  143.2
##  suv       : 4088   3rd Qu.:2011                           3rd Qu.:  170.0
##  cabrio    : 3708   Max.   :2015                           Max.   :17011.0
##  (Other)   : 3266
##      model          kilometer         fuelType      monthOfRegistration
##  golf   : 3968   Min.   :  5000   diesel :27570   Min.   : 0.000
##  andere : 3441   1st Qu.: 60000   benzin :27075   1st Qu.: 3.000
##  3er    : 2476   Median :100000   lpg    :  465   Median : 6.000
##  passat : 1767   Mean   :102639   cng    :  165   Mean   : 6.259
##  a4     : 1750   3rd Qu.:150000   hybrid :   98   3rd Qu.: 9.000
##  1er    : 1542   Max.   :150000   elektro:   12   Max.   :12.000
##  (Other):40447                    (Other):    6
##         brand            postalCode
##  volkswagen   :11368   Min.   : 1067
##  bmw          : 6614   1st Qu.:33102
##  audi         : 6350   Median :53340
##  mercedes_benz: 4902   Mean   :53415
##  opel         : 4276   3rd Qu.:74363
##  ford         : 3409   Max.   :99998
##  (Other)      :18472
Now that I have the processed data frame, I produce a series of plots. The first plot compares car prices across mileage (kilometer) categories. The second plot compares car age and car price. The third plot contains information on both mileage and age and visualizes their combined impact on car prices.
# compare car prices vs kilometer "mileage"
library(ggplot2)
# boxplot for mileage categories
ggplot(data_df) +
  geom_boxplot(mapping = aes(x=kilometer,group=kilometer,y=price),color="blue") +
  labs(title="prices vs mileage (kilometres) category",
       subtitle="data for 2006 - 2016, German ebay postings") +
  xlab("mileage category [kilometres]") +
  ylab("prices [EUR]")

# boxplot for year of registration
ggplot(data_df) +
  geom_boxplot(mapping = aes(x=2016-as.integer(yearOfRegistration),group=2016-as.integer(yearOfRegistration),y=price),color="red") +
  labs(title="prices vs year of registration category",
       subtitle="data for 2006 - 2016, German ebay postings") +
  xlab("years after registration [-]") +
  ylab("prices [EUR]")

# jitter plot that contains information on BOTH mileage (kilometres) and age
# (y = price and color = kilometer are assumed here, matching the y-axis label and the continuous color scale)
ggplot(data_df) +
  geom_jitter(mapping = aes(x=2016-as.integer(yearOfRegistration),
                            y=price,color=kilometer),
              alpha=0.08) + scale_color_viridis_c()+
  labs(title="prices vs year of registration category",
       subtitle="data for 2006 - 2016, German ebay postings") +
  xlab("years after registration [-]") +
  ylab("prices [EUR]")
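As a numeric companion to the box plots, the median price per mileage category and per age can also be tabulated. The sketch below is only an illustration under stated assumptions: it runs on simulated data (`sim_df`, with arbitrary made-up coefficients and noise) rather than on the real postings, and uses base R's `aggregate()`.

```r
# simulated stand-in data, NOT the real eBay postings
set.seed(1)
n <- 200
sim_df <- data.frame(
  kilometer = sample(c(50000, 100000, 150000), n, replace = TRUE),
  age       = sample(1:10, n, replace = TRUE)
)
# arbitrary made-up price relationship plus noise, for illustration only
sim_df$price <- 24000 - 0.007 * sim_df$kilometer - 1700 * sim_df$age +
  rnorm(n, sd = 3000)

# median price per mileage category (the center line of each box in a box plot)
median_by_km <- aggregate(price ~ kilometer, data = sim_df, FUN = median)

# median price per age in years
median_by_age <- aggregate(price ~ age, data = sim_df, FUN = median)

print(median_by_km)
print(median_by_age)
```

On the real data, the same two `aggregate()` calls could be applied to `data_df` once an `age` column has been computed.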

The last step of my analysis is to fit a linear regression model. Price is the dependent variable, and car age as well as car mileage (kilometres) are the independent variables. Below I implement a linear regression with these two predictors in R and display the distribution of residuals. I provide a summary of model performance, based on the entire data set.
# compute car age in years; the model is fit on the entire data set
# (see my post on regression analysis in R for more on splitting data into training and test sets)
data_df$age = 2016-as.integer(data_df$yearOfRegistration)
# fit predictor with multiple linear regression methodology
predictor <- lm(formula = price ~ kilometer +
                  age, data = data_df)
# summarize regression outcome
summary(predictor)
## Call:
## lm(formula = price ~ kilometer + age, data = data_df)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -22293  -4558  -1412   2713  93603
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.415e+04  9.827e+01  245.70  < 2e-16 ***
## kilometer   -6.632e-03  9.475e-04   -7.00  2.6e-12 ***
## age         -1.718e+03  1.722e+01  -99.77  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7790 on 55388 degrees of freedom
## Multiple R-squared:  0.2505, Adjusted R-squared:  0.2505
## F-statistic:  9257 on 2 and 55388 DF,  p-value: < 2.2e-16
# distribution of residuals, considering the entire data set
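One way to look at the distribution of residuals is a histogram of `residuals()` applied to the fitted model. The sketch below is an assumption about how this could be done, not the plotting code from the original analysis; to stay self-contained it fits a model on simulated data (`sim_df` and `sim_fit` are hypothetical names, and the coefficients used to simulate prices are arbitrary).

```r
# simulated stand-in data, NOT the real eBay postings
set.seed(42)
n <- 500
sim_df <- data.frame(
  kilometer = runif(n, min = 5000, max = 150000),
  age       = sample(1:10, n, replace = TRUE)
)
# arbitrary made-up price relationship plus noise, for illustration only
sim_df$price <- 24000 - 0.007 * sim_df$kilometer - 1700 * sim_df$age +
  rnorm(n, sd = 4000)

# fit the same model form as in the post: price ~ kilometer + age
sim_fit <- lm(price ~ kilometer + age, data = sim_df)

# residuals = observed price minus fitted price, one value per row
res <- residuals(sim_fit)

# histogram of the residual distribution
hist(res, breaks = 40,
     main = "distribution of residuals (simulated data)",
     xlab = "residual [EUR]")

# five-number summary plus mean, comparable to the "Residuals" block of summary()
print(summary(res))
```

On the real data, `residuals(predictor)` would take the place of `res`. The residual quantiles in the regression summary (median of -1412, maximum of 93603) suggest a right-skewed distribution there, which a histogram would make visible.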

The model's fit is not very impressive (adjusted R-squared is approx. 25%), but the process of getting to this point gives you an idea of how you can build a prediction model on this data set.
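To make the fitted coefficients concrete, the sketch below plugs them into the model equation price = intercept + b_kilometer * kilometer + b_age * age. The coefficient values are copied from the regression summary above; the function name `predict_price` and the example car are hypothetical and only serve as illustration.

```r
# coefficients copied from the regression summary above
b_intercept <- 2.415e+04
b_kilometer <- -6.632e-03
b_age       <- -1.718e+03

# hypothetical helper: evaluates the fitted linear equation by hand
predict_price <- function(kilometer, age) {
  b_intercept + b_kilometer * kilometer + b_age * age
}

# hypothetical example car: 5 years old with 100,000 km
# 24150 - 663.2 - 8590 = 14896.8
print(predict_price(kilometer = 100000, age = 5))
```

Read this way, the fit associates each additional year of age with roughly 1,718 EUR less, and each additional 10,000 km with roughly 66 EUR less. With the fitted object at hand, `predict(predictor, newdata = data.frame(kilometer = 100000, age = 5))` should give the same kind of estimate without copying coefficients manually. Given the low R-squared, such point estimates are rough at best.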

