# Vehicle age and milage Kaggle analysis in R

In previous post I loaded, cleaned and visualized aspects of a data set retrieved from Kaggle. The data set was based on ebay and used car postings from Germany.

In this post I want to conduct additional analysis steps. Main objective will be to gain better understanding of how prices and thus the value of used cars is influced by milage and age.

Milage is measured in kilometres (since the data is from Germany), and age is measured as the years difference between 2016 and year of car registration. Only cars with a price between 100 and 100,000 EUR will be considered by this analysis. Moreover, I consider years of registration between 2006 and 2016 (most data was scraped from ebay in 2016, so this is the last year being considered by me).

``````# read in the data
data_df = read.csv("autos.csv",header=TRUE,sep = ",",stringsAsFactors = TRUE)

# read in dplyr and magrittr
library(dplyr)
library(magrittr)

# select relevant variables Only
data_df = data_df %>% select(name,
seller,
price,
vehicleType,
yearOfRegistration,
gearbox,
powerPS,
model,
kilometer,
fuelType,
monthOfRegistration,
brand,
postalCode)

# clean out data entries with empty data points
for(i in 1:length(colnames(data_df))){
data_df = data_df[data_df[,i]!="",]
}

# clean out data entries where prices are below 100 or above 100000 EUR (in this case using dplyr)
data_df = data_df %>% filter(price>=100,price<=100000,yearOfRegistration<2016,yearOfRegistration>=2006)

# print a summary
summary(data_df)``````
``````##                         name              seller          price
##  BMW_116i                 :  141   gewerblich:    1   Min.   :  101
##  Audi_A4_Avant_2.0_TDI_DPF:  107   privat    :55390   1st Qu.: 5850
##  BMW_118d_DPF             :   97                      Median : 9300
##  BMW_320d_DPF_Touring     :   87                      Mean   :11783
##  Ford_Fiesta_1.25_Trend   :   74                      3rd Qu.:14999
##  MINI_Mini_One            :   71                      Max.   :99999
##  (Other)                  :54814
##      vehicleType    yearOfRegistration      gearbox         powerPS
##  limousine :14469   Min.   :2006                :    0   Min.   :    0.0
##  kombi     :12394   1st Qu.:2007       automatik:16259   1st Qu.:  100.0
##  kleinwagen:11109   Median :2009       manuell  :39132   Median :  136.0
##  bus       : 6357   Mean   :2009                         Mean   :  143.2
##  suv       : 4088   3rd Qu.:2011                         3rd Qu.:  170.0
##  cabrio    : 3708   Max.   :2015                         Max.   :17011.0
##  (Other)   : 3266
##      model         kilometer         fuelType     monthOfRegistration
##  golf   : 3968   Min.   :  5000   diesel :27570   Min.   : 0.000
##  andere : 3441   1st Qu.: 60000   benzin :27075   1st Qu.: 3.000
##  3er    : 2476   Median :100000   lpg    :  465   Median : 6.000
##  passat : 1767   Mean   :102639   cng    :  165   Mean   : 6.259
##  a4     : 1750   3rd Qu.:150000   hybrid :   98   3rd Qu.: 9.000
##  1er    : 1542   Max.   :150000   elektro:   12   Max.   :12.000
##  (Other):40447                    (Other):    6
##            brand         postalCode
##  volkswagen   :11368   Min.   : 1067
##  bmw          : 6614   1st Qu.:33102
##  audi         : 6350   Median :53340
##  mercedes_benz: 4902   Mean   :53415
##  opel         : 4276   3rd Qu.:74363
##  ford         : 3409   Max.   :99998
##  (Other)      :18472``````

Now that I have the processed data frame I produce a series of plots. The first plot compares car prices and milage (kilometer) categories. The second plot compares car age and car price. The third plot contains information on both milage and age and visualizes the combined impact on car prices.

``````# compare car prices vs kilometer "milage"
library(ggplot2)
# boxplot for milage categories
ggplot(data_df) +
geom_boxplot(mapping = aes(x=kilometer,group=kilometer,y=price),color="blue") +
labs(title="prices vs milage (kilometres) category",
subtitle="data for 2006 - 2016, German ebay postings") +
xlab("milage category [kilometres]") +
ylab("prices [EUR]")``````
``````# boxplot for year of registration
ggplot(data_df) +
geom_boxplot(mapping = aes(x=2016-as.integer(yearOfRegistration),group=2016-as.integer(yearOfRegistration),y=price),color="red") +
labs(title="prices vs year of registratio category",
subtitle="data for 2006 - 2016, German ebay postings") +
xlab("years after registration [-]") +
ylab("prices [EUR]")``````
``````# boxplot graph that contains information on BOTH milage (kilometres) and agel
ggplot(data_df) +
geom_jitter(mapping = aes(x=2016-as.integer(yearOfRegistration),
group=kilometer,
y=price,
color=kilometer),
size=0.5,
alpha=0.08) + scale_color_viridis_c()+
labs(title="prices vs year of registratio category",
subtitle="data for 2006 - 2016, German ebay postings") +
xlab("years after registration [-]") +
ylab("prices [EUR]")``````

The last step of my analysis shall be to train a linear regression model. Prices are the dependent variable and car age as well as car milage (kilometres) are the independent variables. Below I implement bi-variate linear regression in R and display the distribution of redisuals. I provide a summary of model performance, based on the entire data set.

``````# split data in training and test set (see my post on regression analysis in R for more on this)
data_df\$age = 2016-as.integer(data_df\$yearOfRegistration)

# train predictor with mutliple linear regression methodology, on training set
predictor <- lm(formula = price ~ kilometer +
age, data_df)

# summarize regression outcome
summary(predictor)``````
``````##
## Call:
## lm(formula = price ~ kilometer + age, data = data_df)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -22293  -4558  -1412   2713  93603
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.415e+04  9.827e+01  245.70  < 2e-16 ***
## kilometer   -6.632e-03  9.475e-04   -7.00  2.6e-12 ***
## age         -1.718e+03  1.722e+01  -99.77  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7790 on 55388 degrees of freedom
## Multiple R-squared:  0.2505, Adjusted R-squared:  0.2505
## F-statistic:  9257 on 2 and 55388 DF,  p-value: < 2.2e-16``````
``````# distribution of residuals, considering training set
hist(predictor\$residuals)``````

The model is really not having that impressive ratings (adjusted R square is approx. 25%) but the process of getting to this point gives you an idea of how you can build a prediction model on this data set.