Kaggle second hand car price analysis in R

In a previous post I downloaded a data set from kaggle containing data that had been scraped from ebay. The data set contained ebay postings on used car. The data was from Germany.

My analysis considers cars registered at latest in 2015, since most of the data entries were crawled in 2016.

The data can be downloded from kaggle. See here: https://www.kaggle.com/orgesleka/used-cars-database/data

In this post I will clean the data set and analyze main determinants in price formation. First step is to select relevent variables only. I do so by applying the select-function from dplyr. I proceesed by filtering out entries containing empty data points. Furthermore I filter out entries below a price 100 and above 10,000,000:

# read in the data
data_df = read.csv("autos.csv",header=TRUE,sep = ",",stringsAsFactors = TRUE)

# read in dplyr and magrittr
library(dplyr)
library(magrittr)

# select relevant variables Only
data_df = data_df %>% select(name, 
                             seller,
                             price,
                             vehicleType,
                             yearOfRegistration,
                             gearbox,
                             powerPS,
                             model,
                             kilometer,
                             fuelType,
                             monthOfRegistration,
                             brand,
                             postalCode)

# clean out data entries with empty data points
for(i in 1:length(colnames(data_df))){
  data_df = data_df[data_df[,i]!="",]
}

# clean out data entries where prices are below 100 or above 10000000 EUR (in this case using dplyr)
data_df = data_df %>% filter(price>=100,price<=10000000,yearOfRegistration<2016)

# print a summary
summary(data_df)
##                   name               seller           price         
##  BMW_318i           :   331   gewerblich:     2   Min.   :     100  
##  Volkswagen_Golf_1.4:   330   privat    :149557   1st Qu.:    1499  
##  Ford_Fiesta        :   264                       Median :    3550  
##  BMW_316i           :   255                       Mean   :    6631  
##  BMW_320i           :   255                       3rd Qu.:    8200  
##  Opel_Corsa         :   244                       Max.   :10000000  
##  (Other)            :147880                                         
##      vehicleType    yearOfRegistration      gearbox          powerPS       
##  limousine :43364   Min.   :1910                :     0   Min.   :    0.0  
##  kleinwagen:34928   1st Qu.:1999       automatik: 34135   1st Qu.:   75.0  
##  kombi     :30624   Median :2003       manuell  :115424   Median :  115.0  
##  bus       :14101   Mean   :2003                          Mean   :  123.2  
##  cabrio    :10402   3rd Qu.:2008                          3rd Qu.:  150.0  
##  coupe     : 8336   Max.   :2015                          Max.   :17019.0  
##  (Other)   : 7804                                                          
##      model          kilometer         fuelType     monthOfRegistration
##  golf   : 12099   Min.   :  5000   benzin :97702   Min.   : 0.000     
##  andere : 11422   1st Qu.:100000   diesel :49102   1st Qu.: 3.000     
##  3er    :  9087   Median :150000   lpg    : 2316   Median : 6.000     
##  polo   :  5276   Mean   :125281   cng    :  265   Mean   : 6.096     
##  corsa  :  4977   3rd Qu.:150000   hybrid :  110   3rd Qu.: 9.000     
##  a4     :  4529   Max.   :150000   andere :   47   Max.   :12.000     
##  (Other):102169                    (Other):   17                      
##            brand         postalCode   
##  volkswagen   :31505   Min.   : 1067  
##  bmw          :17187   1st Qu.:31193  
##  opel         :15428   Median :50825  
##  mercedes_benz:15196   Mean   :51632  
##  audi         :14000   3rd Qu.:72459  
##  ford         :10208   Max.   :99998  
##  (Other)      :46035

Let us proceed by having a look at our data. I start by printing boxplots by milage categories, visualizing price distribution. I consider postings with prices below 100,000 EUR only.

# read in ggplot2
library(ggplot2)

# make a boxplot describing the content of the current data frame; use hist
ggplot(filter(data_df,price<100000)) + geom_boxplot(mapping=aes(x=kilometer,y=price,group=kilometer)) + 
  labs(title = "price distribution by milage category (milage measured in kilometres",
       subtitle="postings with prices below 100,000 EUR considered") +
  xlab("milage categories [kilometres]") + 
  ylab("prices [EUR]")

Next, I want to produce a scater plot of mean prices by milage (kilometre) group. Mean prices are calculated per yearOfRegistration. Also, I decided to not consider cars with a value of 100,000 EUR or above in this analysis, as well as I exclude registered ahead of year 2008:

# use to group by yearOfRegistration and to summarize data with a mean statistic
summary_df = data_df%>%filter(price<100000,yearOfRegistration>=2008,yearOfRegistration<=2015)%>%group_by(yearOfRegistration,kilometer)%>%summarize(mean=mean(price))

# force year to be character string
summary_df$yearOfRegistration = as.character(summary_df$yearOfRegistration)

# rename columns
colnames(summary_df) = c("year","kilometres","meanPrice")

# plot in ggplot2
ggplot(summary_df) + 
  geom_path(mapping=aes(x=kilometres,y=meanPrice,color=year),size=2) + 
  labs(title = "mean prices by kilometer and year of registration group",
       subtitle = "only postings with cars offered at <100,000 EUR and registered in 2008 or later considered") +
  xlab("milage category [kilometres]") + 
  ylab("mean prices [EUR]")

The mean (here as the arithmetic average) is not that a good statistical indicator since it is very sensitive to outliers.

Lets produce above plot once more, but by using the median instead of the mean as central statistic:

# use to group by yearOfRegistration and to summarize data with a mean statistic
summary_df = data_df%>%filter(price<100000,yearOfRegistration>=2008)%>%group_by(yearOfRegistration,kilometer)%>%summarize(median=median(price))

# force year to be character string
summary_df$yearOfRegistration = as.character(summary_df$yearOfRegistration)

# rename columns
colnames(summary_df) = c("year","kilometres","medianPrice")

# plot in ggplot2
ggplot(summary_df) + 
  geom_path(mapping=aes(x=kilometres,y=medianPrice,color=year),size=2) + 
  labs(title = "median prices by kilometer and year of registration group",
       subtitle = "only postings with cars offered at <100,000 EUR and registered in 2010 or later considered") +
  xlab("milage category [kilometres]") + 
  ylab("median prices [EUR]")

The chart does not seem to have changed much, i.e. median and mean must be not that far from eachother.

One last thing I want to do is to take a look at postings for cars first registered in 2015 and to consider price distribution by brand, using boxplots:

# subsetting data
subset_df = filter(data_df,yearOfRegistration==2015) %>% filter(price<=100000) %>% filter(price>=100)%>%select(kilometer,yearOfRegistration,price,brand)

# rename columns
colnames(subset_df) = c("kilometer",
                        "year",
                        "price",
                        "brand")

# fore year to be character strings
subset_df$year = as.character(subset_df$year)

# boxplot plotting by year kilometer category and brand
ggplot(subset_df) + 
  geom_boxplot(mapping=aes(x=brand,y=price,color=brand))  + 
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

These are just some quick-and-dirty examples. For a fully cleaned data set and clear findings one would have to take into consideration many influencing factors, such as e.g. whether a a car posted is damaged (i.e. has been involved in a crash or is in urgent need of a repair, etc.).

You May Also Like

Leave a Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.