Problem Statement: The goal of this Exploratory Data Analysis is to explore the Kaggle Movies dataset and draw some interesting insights to questions like which countries produce most movies, profitability analysis, kind of movies that are most produced, most produced genres etc.
Solution Overview: After completing the data cleaning, univariate and multi-variate exploration of the data helped us answer some of the above questions. A variety of packages were used for this exercise, which are explained in the later sections. Each analysis has been explained along with its codes.
Insights: Some very interesting insights were obtained from this analysis:
Following packages were used:
library(tibble)
library(DT)
library(knitr)
library(tm)
library(ggplot2)
library(wordcloud)
library(dplyr)
library(fitdistrplus)
library(plotly)
IMDB 5000 Movie Dataset from Kaggle
This dataset contains 28 variables about 5,043 movies spanning across 100 years in 66 countries. There are 2399 unique director names, and thousands of actors/actresses.
More information about the dataset can be found on the Kaggle page here.
Data Import Code:
library(tibble)
# url <- "https://github.com/adityadiwane/Exploratory_Data_Analysis_IMDB_Movies/blob/main/movie_metadata.csv"
movie <- as_data_frame(read.csv("L:/Udemy/R Project/movie_metadata.csv", stringsAsFactors = TRUE))
class(movie)
## [1] "tbl_df" "tbl" "data.frame"
colnames(movie)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
dim(movie)
## [1] 5043 28
The first part of data cleaning involves removal of spurious characters (Â) from a the movie title, genre and plot keyword columns. This could have come up during data scrapping from the net.
Then we remove the duplicates in the data, using then “movie_title” column. Duplicate data will skew our analysis hence needs to be removed.
Third part of data cleaning is the currency columns. The “Budget” and “Gross” (revenue) columns for a few countries were not converted to USD while compiling the data. Hence I have done this manually. This cleaning will help us compare these financial columns across movies and countries. (Conversion as of 9th Oc)
Lastly I created a profit_flag column, which is 1 if the movie is profitable (Revenue > Budget) and 0 otherwise.
movie$movie_title <- (sapply(movie$movie_title,gsub,pattern="\\Â",replacement=""))
movie$genres_2 <- (sapply(movie$genres,gsub,pattern="\\|",replacement=" "))
movie$plot_keywords_2 <- (sapply(movie$plot_keywords,gsub,pattern="\\|",replacement=" "))
movie = movie[!duplicated(movie$movie_title),]
movie <- transform(movie, budget = ifelse(country == "South Korea", budget/1143.14, budget))
movie <- transform(movie, budget = ifelse(country == "Japan", budget/105.72, budget))
movie <- transform(movie, budget = ifelse(country == "Turkey", budget/7.84, budget))
movie <- transform(movie, budget = ifelse(country == "Hungary", budget/301.39, budget))
movie <- transform(movie, budget = ifelse(country == "Thailand", budget/31.01, budget))
movie <- transform(movie, gross = ifelse(country == "South Korea", gross/1143.14, gross))
movie <- transform(movie, gross = ifelse(country == "Japan", gross/105.72, gross))
movie <- transform(movie, gross = ifelse(country == "Turkey", gross/7.84, gross))
movie <- transform(movie, gross = ifelse(country == "Hungary", gross/301.39, gross))
movie <- transform(movie, gross = ifelse(country == "Thailand", gross/31.01, gross))
movie$profit_flag <- as.factor(ifelse((movie$gross > movie$budget),1,0))
After the data cleaning, I once again check the number of rows and columns, as shown in the code below. The results show 4,917 unique movies.
dim(movie)
## [1] 4917 31
Missing Values
There are 3,700 rows which do not have any missing value.For Character values missing values are blanks, while numeric variables have missing values as NAs. But I have decided not to remove any rows with missing data yet. I will take care of this issue while doing individual analysis.
print(paste(sum(complete.cases(movie)),"Complete cases!"))
## [1] "3700 Complete cases!"
In the data, each row is a movie and the columns are the information regarding the movie.
Preview (top 50 rows) of the cleaned dataset:
library(DT)
datatable(head(movie,50))
Below is the table containing the the variable names, data types and a short description.
Variable.type <- lapply(movie,class)
Variable.desc <- c("Specifies if it was color/black & white movie",
"Name of movie director","Number of critics who reviewed",
"Duration of the movie (minutes)","Number of likes on director's FB page",
"Number of likes on 3rd actor's FB page","Name of second actor",
"Number of likes on 1st actor's FB page","Gross earning by the movie ($)",
"Genres of the movie","Name of the first actor",
"Title of the movie","Number of users voted on IMDB",
"Total facebook likes for all cast members","Name of the third actor",
"Number of the actor who featured in the movie poster",
"Keywords describing the movie plot","IMDB link of the movie",
"Number of users who gave a review","Language of the movie",
"Country the movie was produced in",
"Content rating of the movie","Budget of the movie ($)",
"Year the movie released in","Number of facebook likes for actor 2",
"IMDB score for the movie (out of 10)","Aspect ratio the movie was made in",
"Number of facebook likes","Cleaned genre column","Cleaned Plot keyword column",
"Flag indicating profitability of the movie (1-profit, 0-loss)")
Variable.name1 <- colnames(movie)
data.desc <- as_data_frame(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
Variable Name | Data Type | Variable Description |
---|---|---|
color | factor | Specifies if it was color/black & white movie |
director_name | factor | Name of movie director |
num_critic_for_reviews | integer | Number of critics who reviewed |
duration | integer | Duration of the movie (minutes) |
director_facebook_likes | integer | Number of likes on director’s FB page |
actor_3_facebook_likes | integer | Number of likes on 3rd actor’s FB page |
actor_2_name | factor | Name of second actor |
actor_1_facebook_likes | integer | Number of likes on 1st actor’s FB page |
gross | numeric | Gross earning by the movie ($) |
genres | factor | Genres of the movie |
actor_1_name | factor | Name of the first actor |
movie_title | character | Title of the movie |
num_voted_users | integer | Number of users voted on IMDB |
cast_total_facebook_likes | integer | Total facebook likes for all cast members |
actor_3_name | factor | Name of the third actor |
facenumber_in_poster | integer | Number of the actor who featured in the movie poster |
plot_keywords | factor | Keywords describing the movie plot |
movie_imdb_link | factor | IMDB link of the movie |
num_user_for_reviews | integer | Number of users who gave a review |
language | factor | Language of the movie |
country | factor | Country the movie was produced in |
content_rating | factor | Content rating of the movie |
budget | numeric | Budget of the movie ($) |
title_year | integer | Year the movie released in |
actor_2_facebook_likes | integer | Number of facebook likes for actor 2 |
imdb_score | numeric | IMDB score for the movie (out of 10) |
aspect_ratio | numeric | Aspect ratio the movie was made in |
movie_facebook_likes | integer | Number of facebook likes |
genres_2 | character | Cleaned genre column |
plot_keywords_2 | character | Cleaned Plot keyword column |
profit_flag | factor | Flag indicating profitability of the movie (1-profit, 0-loss) |
Each movie in my dataset had more than one genre, hence some cleaning and genre separation was required. Some pre-built functions in the “TM” package were very useful. In short, in the Analysis of Genre, the following was done: 1. Cleaning of the Genre Variable 2. Converting Genre variable to Corpus 3. Frequency Analysis: See which are the most used genres in a movie 4. Genre Associations Analysis: See which genres are closely associated and used together in a movie
Drama, Comedy and Thriller are the top movie genres, as showm in the below Word-Cloud and Bar-Chart
library(tm)
library(dplyr)
library(ggplot2)
library(wordcloud)
genre <- Corpus(VectorSource(movie$genres_2))
genre_dtm <- DocumentTermMatrix(genre)
genre_freq <- colSums(as.matrix(genre_dtm))
freq <- sort(colSums(as.matrix(genre_dtm)), decreasing=TRUE)
genre_wf <- data.frame(word=names(genre_freq), freq=genre_freq)
ggplot(genre_wf, aes(x=reorder(word,-freq), y=freq))+
geom_bar(stat="identity")+
theme(axis.text.x=element_text(angle=45, hjust=1))+
ggtitle("Movie Genre frequency graph")+
xlab("Genre")+
ylab("Frequency")
set.seed(10)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(genre_wf$word,genre_wf$freq,random.order=FALSE,
rot.per=.15, colors=pal2,scale=c(4,.9),
title="WordCloud: Movie Genres")
Genre Association Analysis: Here I have analyzed which genres occur together in a movie. Since most of the movies in our database have multiple Genres, it would be interesting to understand how do these genres overlap.
I made a custom function, which finds the association for each genre and then binds all the individual results together to show them on a single graph.
word_assoc <- function(word)
{
assoc_1 <- as.data.frame(findAssocs(genre_dtm,c(word),corlimit = 0.1))
assoc_1$words <- rownames(assoc_1)
colnames(assoc_1)[1] <- c("score")
assoc_1$key <- c(word)
rownames(assoc_1) <- NULL
return(assoc_1)
}
drama_assoc <- word_assoc("drama")
comedy_assoc <- word_assoc("comedy")
thriller_assoc <- word_assoc("thriller")
action_assoc <- word_assoc("action")
romance_assoc <- word_assoc("romance")
adventure_assoc <- word_assoc("adventure")
crime_assoc <- word_assoc("crime")
assoc <- rbind(drama_assoc,comedy_assoc,thriller_assoc,action_assoc,
romance_assoc,adventure_assoc,crime_assoc)
ggplot(assoc,aes(x=words,y=score))+
geom_bar(stat="identity")+
facet_grid(~key,scales = 'free',space="free_x")+
theme(axis.text.x=element_text(angle=45, hjust=1))+
ylab("Association Score")+
xlab("Genre")+
ggtitle("Genre Associations")
The above graph shows the association strength, which can be thought of as a correlation to understand easily. The score is between 0 and 1. Lower score means they two genres in question were not used together while a higher score shows strong association.
This part of the analysis deals with exploring the country variable. The first part shows how many movies were produced in which countries across the years. I have tried to show that through a heat map. Dark blue indicates more number of movies produces and vice-versa.
Most of the countries started producing movies in the early 2000s, except a handful which had prevalant movie production going on since the mid 1900s.
country_count <- movie %>%
subset(country != "") %>%
subset(country != "New Line") %>%
na.omit() %>%
group_by(country,title_year) %>%
summarise(count=n())
colnames(country_count)[3] <- "Movie_Count"
ggplot(country_count,aes(title_year,country))+
geom_tile(aes(fill=Movie_Count),colour="white")+
scale_fill_gradient(low="light blue",high = "dark blue")+
xlab("Year of movie release")+
ylab("Country")+
ggtitle("Heat Map: Country vs Movie Release Year")+
guides(fill=FALSE)
The second part deals with exploring how does the budget varies by each country. I have assumed that the budget given in the data is the amount of money required to produce the movie. However lot of times there are over runs and other costs involved after a movie is released. The variation is definitely not linear, the top countries have significantly higher movies budgets while the bottom ones have significantly lower movie budgets. (Outliers were removed):
country_summary <- movie %>%
subset(country != "") %>%
subset(country != "New Line") %>%
group_by(country) %>%
summarise(count=n(),
avg_score=mean(imdb_score,na.rm="true"),
avg_budget = mean(budget,na.rm="true"),
avg_gross = mean(gross,na.rm="true"))
country_with_multiple_movies <- subset(country_summary,count>1)[1]
ggplot(country_summary[complete.cases(country_summary), ],
aes(x=reorder(country,-avg_budget),avg_budget/1000000))+
geom_bar(stat = "identity")+
theme(axis.text.x=element_text(angle=45, hjust=1))+
ylab("Average Movie Budget (Million $)")+
xlab("")+
ggtitle("Average Movie Budget by Country")
ggplot(subset(movie,country %in% country_with_multiple_movies$country),
aes(x=country,y=budget/1000000))+
geom_boxplot()+
theme(axis.text.x=element_text(angle=45, hjust=1))+
ylab("Average Movie Budget (Million $)")+
xlab("")+
ggtitle("Movie Budget variation by Country")+
ylim(0,100)
The third part in country analysis deals with exploring the gross revenue for the movie across the countries (Similar to the above analysis). Here the top 5 are significantly higher (more than 50 M USD) than the rest while the 6th highest is approximately 35 USD.
ggplot(country_summary[complete.cases(country_summary), ],
aes(x=reorder(country,-avg_gross),avg_gross/1000000))+
geom_bar(stat = "identity")+
theme(axis.text.x=element_text(angle=45, hjust=1))+
ylab("Average Movie Revenue (Million $)")+
xlab("")+
ggtitle("Average Movie Revenue by Country")
ggplot(subset(movie,country %in% country_with_multiple_movies$country),
aes(x=country,y=gross/1000000))+
geom_boxplot()+
theme(axis.text.x=element_text(angle=45, hjust=1))+
ylab("Average Movie Revenue (Million $)")+
xlab("")+
ggtitle("Movie Revenue variation by Country")+
ylim(0,50)
In the fifth and the final part has the analysis, I have seen which country released which languages of films. English is obviosuly the language overlapping most countries, but the below heat map helps us get a better picture. Dark blue indicates higher number of movie released.
country_language <- movie %>%
subset(country != "") %>%
subset(language != "") %>%
group_by(country,language) %>%
summarise(count=n())
colnames(country_language)[3] <- "Movie_Count"
ggplot(country_language,aes(language,country))+
geom_tile(aes(fill=log(Movie_Count)),colour="white")+
scale_fill_gradient(low="light blue",high = "dark blue")+
xlab("Language")+
ylab("Country")+
ggtitle("Heat Map: Country vs Language")+
theme(axis.text.x=element_text(angle=45, hjust=1))+
guides(fill=FALSE)
We can see below that USA produces movies in most number of languages (14), followed by UK.
movie %>%
group_by(country) %>%
summarise(num = n_distinct(language)) %>%
arrange(-num) %>%
subset(num > 3) %>%
ggplot(aes(y=num,x=reorder(country,-num)))+
geom_bar(stat = "identity")+
xlab("")+ylab("Number of Languages")+
ggtitle("Top countries by number of languages of films produced")
In here, I have tried to see which kind of movies are more successful in terms of the IMDB ratings.
We first start by looking at the basic central tendency (mean) and the variation in movie score. For this purpose I have plotted a histogram which also has the 5th and 95th percentile mark for the IMDB score.
ggplot(movie,aes(imdb_score))+
geom_histogram(bins=80)+
geom_vline(xintercept = mean(movie$imdb_score,na.rm = TRUE),colour="steel blue")+
geom_vline(xintercept = quantile(movie$imdb_score, prob = c(0.05)),colour="red",linetype = "longdash")+
geom_vline(xintercept = quantile(movie$imdb_score, prob = c(0.95)),colour="red",linetype = "longdash")+
annotate("text", label = "5th Percentile (4.3)",x = 4.2, y = 100, size = 4, colour = "red",angle=90)+
annotate("text", label = "95th Percentile (8.1)",x = 8.2, y = 100, size = 4, colour = "red",angle=90)+
annotate("text", label = "Mean (6.4)",x = 6.3, y = 100, size = 4, colour = "light blue",angle=90)+
ylab("Count of Movies")+
xlab("IMDB Score")+
ggtitle("Histogram: IMDB Score")
Since I could see a bell shaped graph, I also tried to check which distribution best represents this data distribution. I plotted a quick “Cullen and Frey Graph” which basically separates distributions on the bases of Kurtosis and Square of Skewness. The IMDB score fell somewhere between Gamma and Lognormal distributions. I did not perform further distribution fitting for this particular project.
library(fitdistrplus)
descdist(movie$imdb_score, boot = 100)
## summary statistics
## ------
## min: 1.6 max: 9.5
## median: 6.6
## mean: 6.437584
## estimated sd: 1.12774
## estimated skewness: -0.7407298
## estimated kurtosis: 3.929777
I wanted to study the relationship between IMDB score, revenue and budget of the movie. So I plotted a 3D scatter plot using the “plotly” package. This is an interactive graph and hence we can comfortably see the relationship between the three variable. Red signifies loss for the movie while blue signifies profit:
library(plotly)
plot_ly(movie, x = ~imdb_score, y = ~budget/1000000, z = ~gross/1000000,
color = ~profit_flag, colors = c('#BF382A', '#0C4B8E'),size = I(3)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'IMDB Score'),
yaxis = list(title = 'Budget (M$)'),
zaxis = list(title = 'Revenue (M$)')),
title = "INTERACTIVE 3D Scatter plot: IMDB Score vs Revenue vs Budget",
showlegend = FALSE)
Then I wanted to see which countries do well in terms of the movie IMDB Score. I combines this analysis with the overall profitability of the movies produced in the country (total revenue - total budget). Scatter plot below shows this analysis. Size if the bubble indicates the number of movies produced in the country while colour shows protability (Green is profit, red is loss):
library(plotly)
imdb_country <- movie %>%
group_by(country) %>%
summarise(num_movie = n(),
avg_score = mean(imdb_score,na.rm=TRUE),
profit = (sum(gross,na.rm=TRUE)-sum(budget,na.rm=TRUE))/1000000) %>%
subset(country != "") %>%
subset(num_movie > 5) %>%
arrange(-num_movie)
imdb_country$profit_flag <- as.factor(ifelse(imdb_country$profit > 0 , 1,0))
imdb_country$profit_2 <- imdb_country$profit + 2009
plot_ly(imdb_country, x = ~profit_2, y = ~avg_score,
color = ~profit_flag, colors = c('red', 'green'),
size = ~num_movie,text=~country) %>%
add_markers() %>%
layout(xaxis = list(type = "log",title="Profitability",
zeroline = TRUE,showline = TRUE,showticklabels = FALSE,showgrid = TRUE),
yaxis = list(title="Average IMDB Score",
zeroline = TRUE,showline = TRUE,showticklabels = FALSE,showgrid = TRUE),
title = "Interactive Scatter: IMDB Score vs Profitability",
showlegend = FALSE)
The above analysis helped us understand the movies dataset better. Now we have answers to some inetersting questions and are in a position to do follow-up and deep dive analysis. The following is the summary of the analysis: