In a previous post I downloaded a data set from kaggle containing data that had been scraped from ebay. The data set contained ebay postings on used car. The data was from Germany.
My analysis considers cars registered at latest in 2015, since most of the data entries were crawled in 2016.
The data can be downloded from kaggle. See here: https://www.kaggle.com/orgesleka/used-cars-database/data
In this post I will clean the data set and analyze main determinants in price formation. First step is to select relevent variables only. I do so by applying the select-function from dplyr. I proceesed by filtering out entries containing empty data points. Furthermore I filter out entries below a price 100 and above 10,000,000:
# read in the data
data_df = read.csv("autos.csv",header=TRUE,sep = ",",stringsAsFactors = TRUE)
# read in dplyr and magrittr
library(dplyr)
library(magrittr)
# select relevant variables Only
data_df = data_df %>% select(name,
seller,
price,
vehicleType,
yearOfRegistration,
gearbox,
powerPS,
model,
kilometer,
fuelType,
monthOfRegistration,
brand,
postalCode)
# clean out data entries with empty data points
for(i in 1:length(colnames(data_df))){
data_df = data_df[data_df[,i]!="",]
}
# clean out data entries where prices are below 100 or above 10000000 EUR (in this case using dplyr)
data_df = data_df %>% filter(price>=100,price<=10000000,yearOfRegistration<2016)
# print a summary
summary(data_df)
## name seller price
## BMW_318i : 331 gewerblich: 2 Min. : 100
## Volkswagen_Golf_1.4: 330 privat :149557 1st Qu.: 1499
## Ford_Fiesta : 264 Median : 3550
## BMW_316i : 255 Mean : 6631
## BMW_320i : 255 3rd Qu.: 8200
## Opel_Corsa : 244 Max. :10000000
## (Other) :147880
## vehicleType yearOfRegistration gearbox powerPS
## limousine :43364 Min. :1910 : 0 Min. : 0.0
## kleinwagen:34928 1st Qu.:1999 automatik: 34135 1st Qu.: 75.0
## kombi :30624 Median :2003 manuell :115424 Median : 115.0
## bus :14101 Mean :2003 Mean : 123.2
## cabrio :10402 3rd Qu.:2008 3rd Qu.: 150.0
## coupe : 8336 Max. :2015 Max. :17019.0
## (Other) : 7804
## model kilometer fuelType monthOfRegistration
## golf : 12099 Min. : 5000 benzin :97702 Min. : 0.000
## andere : 11422 1st Qu.:100000 diesel :49102 1st Qu.: 3.000
## 3er : 9087 Median :150000 lpg : 2316 Median : 6.000
## polo : 5276 Mean :125281 cng : 265 Mean : 6.096
## corsa : 4977 3rd Qu.:150000 hybrid : 110 3rd Qu.: 9.000
## a4 : 4529 Max. :150000 andere : 47 Max. :12.000
## (Other):102169 (Other): 17
## brand postalCode
## volkswagen :31505 Min. : 1067
## bmw :17187 1st Qu.:31193
## opel :15428 Median :50825
## mercedes_benz:15196 Mean :51632
## audi :14000 3rd Qu.:72459
## ford :10208 Max. :99998
## (Other) :46035
Let us proceed by having a look at our data. I start by printing boxplots by milage categories, visualizing price distribution. I consider postings with prices below 100,000 EUR only.
# read in ggplot2
library(ggplot2)
# make a boxplot describing the content of the current data frame; use hist
ggplot(filter(data_df,price<100000)) + geom_boxplot(mapping=aes(x=kilometer,y=price,group=kilometer)) +
labs(title = "price distribution by milage category (milage measured in kilometres",
subtitle="postings with prices below 100,000 EUR considered") +
xlab("milage categories [kilometres]") +
ylab("prices [EUR]")
Next, I want to produce a scater plot of mean prices by milage (kilometre) group. Mean prices are calculated per yearOfRegistration. Also, I decided to not consider cars with a value of 100,000 EUR or above in this analysis, as well as I exclude registered ahead of year 2008:
# use to group by yearOfRegistration and to summarize data with a mean statistic
summary_df = data_df%>%filter(price<100000,yearOfRegistration>=2008,yearOfRegistration<=2015)%>%group_by(yearOfRegistration,kilometer)%>%summarize(mean=mean(price))
# force year to be character string
summary_df$yearOfRegistration = as.character(summary_df$yearOfRegistration)
# rename columns
colnames(summary_df) = c("year","kilometres","meanPrice")
# plot in ggplot2
ggplot(summary_df) +
geom_path(mapping=aes(x=kilometres,y=meanPrice,color=year),size=2) +
labs(title = "mean prices by kilometer and year of registration group",
subtitle = "only postings with cars offered at <100,000 EUR and registered in 2008 or later considered") +
xlab("milage category [kilometres]") +
ylab("mean prices [EUR]")
The mean (here as the arithmetic average) is not that a good statistical indicator since it is very sensitive to outliers.
Lets produce above plot once more, but by using the median instead of the mean as central statistic:
# use to group by yearOfRegistration and to summarize data with a mean statistic
summary_df = data_df%>%filter(price<100000,yearOfRegistration>=2008)%>%group_by(yearOfRegistration,kilometer)%>%summarize(median=median(price))
# force year to be character string
summary_df$yearOfRegistration = as.character(summary_df$yearOfRegistration)
# rename columns
colnames(summary_df) = c("year","kilometres","medianPrice")
# plot in ggplot2
ggplot(summary_df) +
geom_path(mapping=aes(x=kilometres,y=medianPrice,color=year),size=2) +
labs(title = "median prices by kilometer and year of registration group",
subtitle = "only postings with cars offered at <100,000 EUR and registered in 2010 or later considered") +
xlab("milage category [kilometres]") +
ylab("median prices [EUR]")
The chart does not seem to have changed much, i.e. median and mean must be not that far from eachother.
One last thing I want to do is to take a look at postings for cars first registered in 2015 and to consider price distribution by brand, using boxplots:
# subsetting data
subset_df = filter(data_df,yearOfRegistration==2015) %>% filter(price<=100000) %>% filter(price>=100)%>%select(kilometer,yearOfRegistration,price,brand)
# rename columns
colnames(subset_df) = c("kilometer",
"year",
"price",
"brand")
# fore year to be character strings
subset_df$year = as.character(subset_df$year)
# boxplot plotting by year kilometer category and brand
ggplot(subset_df) +
geom_boxplot(mapping=aes(x=brand,y=price,color=brand)) +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())
These are just some quick-and-dirty examples. For a fully cleaned data set and clear findings one would have to take into consideration many influencing factors, such as e.g. whether a a car posted is damaged (i.e. has been involved in a crash or is in urgent need of a repair, etc.).
Data scientist focusing on simulation, optimization and modeling in R, SQL, VBA and Python
Leave a Reply