class: center, middle, inverse, title-slide

# Reading Between the Wines
## BUS 696
### BART | GARRETT | RICKY
### CHAPMAN UNIVERSITY
### 2019/10/24 (updated: 2019-12-17)

---
background-image: url(https://camo.githubusercontent.com/06e0a35a7e3014fe8c7f009d2f6bfacf14cefa93/687474703a2f2f6761727265747468617765732e636f6d2f722f636f7665722e6a706567)
background-position: center
background-repeat: no-repeat
background-size: contain
class: overview

# PROJECT OVERVIEW

.pull-left[
- <a href="https://github.com/hawesg/BUS696-final-project"><i class="fab fa-github-alt"></i> GitHub Code Base</a>
- Data Cleaning, Feature Generation and Exploratory Data Analysis <a href="http://bus696.garretthawes.com/deliverable-2/index.html">Deliverable</a>
]

.pull-right[]

---
class: inverse
background-image: url(wine-heartbeat.svg)
background-size: contain

# TODAY'S AGENDA

- DATA
  - Source data sets
  - Cleaning
  - Feature Generation and transformations
  - Exploratory Data Analysis
  - Preprocessing and Train/Test split
- MODELING THE DATA
  - Stepwise Feature Selection
  - Linear Regression
  - Elastic Net
  - Bootstrapping
  - Random Forest
  - Logistic Regression
- SUMMARY
  - Model comparison
  - Future Work

---

# OVERVIEW OF BUSINESS PROPOSITION

We are trying to accurately model the price of wine using several variables connected with the winemaking process. The majority of the wines in this dataset fall below $100, but there are notable exceptions. A model of this sort can be applied directly to numerous businesses and ventures. Examples of potential use cases:

- Restaurant Owner: Sell Wine For Larger Markup
- Wine Speculator: Naturally Can Make Bigger Profits
- Wine Distributor: Can Know What Wines Not To Sell
- Winery Owner: Shift Focus To Higher Margin Grape Varieties

---
class: inverse, middle, center
background-image: url(wine-bottles.svg)
background-size: contain

# DATA

---

# KAGGLE

#### The original [data from Kaggle](https://www.kaggle.com/zynicide/wine-reviews) included 14 columns

.pull-left[
- `#`: A numeric id from 0-130,000
- `country`: The country the wine originated from
- `description`: The body of the review
- `designation`: The vineyard within the winery where the grapes that made the wine are from, e.g. Reserve Harvest
- `points`: The number of points WineEnthusiast rated the wine on a scale of 1-100*
- `price`: The cost in USD for one bottle
- `province`: The province or state the wine originated from
- `region_1`: The wine-growing area in a province or state (e.g. Napa)
]

.pull-right[
- `region_2`: A more specific region within a wine-growing area (e.g. Rutherford inside the Napa Valley); this value can be blank
- `taster_name`: The name of the reviewer
- `taster_twitter_handle`: The Twitter handle of the reviewer
- `title`: The title of the wine
- `variety`: The type of grapes used to make the wine (e.g. Pinot Noir)
- `winery`: The winery that made the wine
]

.footnote[<i class="fad fa-asterisk"></i> The lowest score in the data set was 80]

---
class: twitter

# TWITTER

- We pulled data from the Twitter profile of each reviewer, such as gender, number of tweets and number of followers.
- The underlying assumption is that reviewers who were prolific on social media might be more generous with their assigned points than those who weren't.
- Reviewers may also behave differently along gender lines.
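A minimal sketch of how such a pull might look, assuming the `rtweet` and `gender` packages; the package choice and column names are our assumptions, not necessarily the repo's exact code:

```r
library(rtweet)   # Twitter profile lookups
library(gender)   # name-based gender inference
library(dplyr)
library(stringr)

handles <- unique(na.omit(wine_data$taster_twitter_handle))

# Profile statistics for each reviewer handle
taster_twitter <- lookup_users(str_remove(handles, "^@")) %>%
  transmute(
    taster_twitter_handle = paste0("@", screen_name),
    taster.n_tweets       = statuses_count,
    taster.n_followers    = followers_count
  )

# Infer gender from the reviewer's first name (a heuristic, not ground truth)
first_names <- str_extract(unique(wine_data$taster_name), "^\\S+")
taster_gender <- gender(first_names, method = "ssa") %>%
  select(first_name = name, taster.gender = gender)
```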
---
class: wine-color

# WINE COLOR

- The original data set included the variety of the grape but not the color of the wine.
- An initial set of colors was assigned via regular expressions, e.g. all varieties matching `REGEX` were classified as red. Each match was then validated to make sure there were no false positives.
- The rest were classified manually with the aid of a Google Sheet that auto-populated with thumbnails from Google Image search. Afterward, this data was imported and left-joined to the data set.

---

# DATA CLEANING

- Various encoding issues in the text columns were fixed, rows with NA prices were dropped, and columns where a blank implies absence (e.g. designation) were converted to character.
- Wines with a missing country were removed, as were wines missing a grape variety or taster name.
- Some columns were converted to character vectors to make them easier to work with.

```r
wine_data <- wine_data %>%
  dplyr::mutate(
    country = as.character(country),
    variety = as.character(variety),
    taster_name = as.character(taster_name),
    color_lump = fct_lump(color, n = 2),
    province_lump = fct_lump(province, n = 25),
    country_lump = fct_lump(country, n = 25)
  ) %>%
  dplyr::filter(country != "" & variety != "" & taster_name != "") %>%
  drop_na(price)
```

---
class: features

# FEATURE GENERATION

#### Several columns were added to help get a better picture of the data.

- Reviews were grouped by points into "Not Recommended", "Mediocre", "Good", "Very Good", "Outstanding" and "Classic", as per the Wine Spectator website.
- Sentiment analysis was performed on the title of the wine.
- Additional title-based features included the number of characters, the number of words and whether the title contained any accents (as a proxy for whether the wine seemed exotic and foreign in the American market).
- Wine designation was grouped into broader categories, e.g. any designation containing a variation of the word reserve (reserva, riserva, reserve, estate reserve, etc.) was renamed "reserve". This was done with the aid of `grepl`, `stringr` and `stringi`; `wordcloud` was used to identify target groups.

.pull-left[
<img src="present_files/figure-html/wordcloud-1.png" width="504" />
]
.pull-right[
<img src="present_files/figure-html/sentement-1.png" width="504" />
]

---

# FEATURE GENERATION

- The `n` passed to `fct_lump` for each factor column is set by parameters defined at the beginning of main.R or on the command line.
- Parameters are `by_count`, `taster_name`, `taster_twitter`, `designation`, `country`, `variety`, `variety.red`, `variety.white`, `province`, `winery`

```r
FCT_LUMPS <- list(
  by_count = 0,
  ...,
  winery = 15)
```

- `by_count` supersedes the other parameters: setting it causes `fct_lump` to lump levels based on how many observations each level has.

```r
if (FCT_LUMPS$by_count != 0) {
  fct_x <- .number_of_factor_levels_containing_num_observations
  FCT_LUMPS[['taster_name']] <- fct_x(wine_data$taster_name,
                                      FCT_LUMPS$by_count)
  FCT_LUMPS[['taster_twitter']] <- fct_x(wine_data$taster_twitter_handle,
                                         FCT_LUMPS$by_count)
  ...
}
```
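The helper `.number_of_factor_levels_containing_num_observations` lives elsewhere in the code base; a plausible minimal sketch of it (our assumption, not the repo's exact implementation):

```r
# Count how many levels of x are backed by at least min_n observations,
# so that fct_lump(x, n = <this value>) keeps exactly those levels.
.number_of_factor_levels_containing_num_observations <- function(x, min_n) {
  sum(table(x) >= min_n)
}
```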
---
class: maps

# EXPLORATORY DATA ANALYSIS

- Distinct patterns emerge when looking at both price and review scores broken down by country.

<img src="present_files/figure-html/maps-1.png" width="50%" /><img src="present_files/figure-html/maps-2.png" width="50%" />

---
class: marginal

# EXPLORATORY DATA ANALYSIS

.pull-left[
<img src="present_files/figure-html/marginal-price-1.png" width="504" />
]
.pull-right[
<img src="present_files/figure-html/marginal-price-logged-1.png" width="504" />
]

---
class: exploritory

# EXPLORATORY DATA ANALYSIS

.pull-left[
<img src="present_files/figure-html/price-by-country-1.png" width="504" />
]
.pull-right[
- As the previous slide shows, price is heavily skewed in the original data; scaling the y axis to log base 10 with `scale_y_log10()`, combined with `coord_flip()`, makes the plot far more readable.

```r
ggplot(wine_data_clean, aes(x = country_lump, y = price, color = cat)) +
* coord_flip() +
* scale_y_log10() +
  geom_jitter(alpha = 1/2) +
  labs(color = "Rating Category",
       title = "Price by Country",
       caption = "Price is scaled to log 10",
       x = "Country", y = "Price") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.key = element_rect(fill = '#fdf6e3'),
        legend.box.background = element_rect(colour = "#fdf6e3"))
```
]

---

# TRANSFORMING Y - TUKEY'S LADDER OF POWERS

- Sometimes called the Bulging Rule: a way to reshape a skewed distribution so that it becomes normal or nearly normal.

.pull-left[

```r
transformTukey(sample(prices, 5000), plotit = FALSE, returnLambda = TRUE)
```

```
## 
##     lambda      W Shapiro.p.value
## 388 -0.325 0.9956       3.846e-11
## 
## if (lambda >  0){TRANS = x ^ lambda}
## if (lambda == 0){TRANS = log(x)}
## if (lambda <  0){TRANS = -1 * x ^ lambda}
```

```
## lambda 
## -0.325
```

$$-1 \cdot \frac{1}{price^{0.3}}$$

(λ is rounded to -0.3 in what follows)
]

.pull-right[
<img src="present_files/figure-html/tukey-histogram-1.png" width="504" />
]

---

# DATA PRE-PROCESSING

- The `preProcess()` function in `caret` was used to transform the independent numeric variables into standardized variables.

```r
wine_data_to_be_standardized <- wine_data_clean %>%
  select(points, title.n_words, title.sentement, title.n_chars,
         taster.avg_points, taster.n_reviews, taster.n_tweets,
         taster.n_followers)

preprocessParams <-
* preProcess(wine_data_to_be_standardized[, 1:8],
*            method = c("center", "scale"))
print(preprocessParams)

transformed <- predict(preprocessParams, wine_data_to_be_standardized[, 1:8])
summary(transformed)
head(transformed)

wine_data_standardized <- bind_cols(wine_data_not_to_be_standardized,
                                    transformed)
```

---
class: exploritory

# EXPLORATORY DATA ANALYSIS

- Centering and scaling the predictors, together with transforming the dependent variable to `\(-1 \cdot price^{-0.3}\)`, strengthens the correlations with price.

.pull-left[
<img src="present_files/figure-html/ggpairs-1.png" width="720" />
]
.pull-right[
<img src="present_files/figure-html/ggpairs-standardized-1.png" width="720" />
]
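Since this transformation is reused across the models that follow (the random forest code later calls a `.tukey()` helper), a minimal sketch of it, with an inverse we add for mapping predictions back to dollars:

```r
# Tukey-ladder transform of price (lambda rounded to -0.3)
.tukey <- function(price) -1 * price^(-0.3)

# Inverse transform (our addition): back to the dollar scale
.tukey_inverse <- function(y) (-y)^(-1 / 0.3)

.tukey_inverse(.tukey(42))  # recovers 42
```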
---

# TEST/TRAIN SPLIT

- The `createDataPartition()` function from `caret` was used to create balanced training and test sets containing 75% and 25% of the data respectively.
- Specifically, price was broken into 5 groups and sampled within each group to ensure an even distribution.

```r
set.seed(1861)
options(scipen = 50)
TRAIN.PERCENT <- 0.75

*inTrainSetIndex <- createDataPartition(y = wine_data_standardized$price,
*                                       p = TRAIN.PERCENT,
*                                       list = FALSE, groups = 5)

data.train <- wine_data_standardized[ inTrainSetIndex, ]
data.test  <- wine_data_standardized[-inTrainSetIndex, ]
```

.footnote[
<i class="fad fa-pen-square"></i> Serialized versions of the training and test sets were saved so that the entire team worked from the same data.
]

---
class: inverse, middle, center
background-image: url(wine-bottles.svg)
background-size: contain

# MODELING THE DATA

---
class: ols-log

# LINEAR REGRESSION - LOG OF PRICE

- Forward and backward fits were used to determine significant variables,
- using `regsubsets` from the `leaps` package.

.pull-left[
<img src="present_files/figure-html/backward-fit-.gif" width="504" />
]
.pull-right[
- Once significant variables were determined, we ran several OLS models.
- Transformation of the DV: because price is so dispersed, our original strategy was to take the log of price.
- This makes the data more normal and yields a significant increase in model accuracy.
]

---

# LINEAR REGRESSION - TUKEY TRANSFORMED PRICE

.pull-left[
- Slightly more accurate than the log(price) model.
- These results favor the transformation provided by Tukey's ladder over the log transformation.

`$$\widehat{price_{tukey}} = .01 \cdot points + .04 \cdot country_{Canada} \\ + .07 \cdot country_{Germany} + .04 \cdot variety_{Chardonnay} \\ + .01 \cdot designation_{Estate} - .03 \cdot province_{Champagne} \\ - .02 \cdot color_{White} - .02 \cdot taster.gender_{M} \\ \dots \\ \text{ }\\ \text{where } price_{tukey} \text{ is } -1 \cdot price^{-0.3}$$`
]
.pull-right[
<img src="present_files/figure-html/ols-bkwfit-.gif" width="504" />
]

---

# ELASTIC NET - TUKEY TRANSFORMED PRICE

.pull-left[
<img src="present_files/figure-html/enet-.gif" width="504" />
]
.pull-right[
- Combines the penalties from lasso and ridge regression
- The figure locates the optimal α
- Far more robust than OLS
- The optimal α is 0.38 and the optimal λ is `\(3.6978697\times 10^{-4}\)`
- Interestingly, when given more factor levels the optimal λ was 1.

`$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^N\left(y_i-\sum_{j=1}^p x_{ij} \beta_j\right)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \right\}$$`
]

---

# BOOTSTRAPPING - LOG OF PRICE

.pull-left[
- Number of bootstrap samples: 100
- Size of each bootstrap sample: 500, to ensure adequate data
- Individual models were fit via `ctree`

`$$\hat{f}_{bag}(X)=\frac{1}{B}\left[\hat{f}_{1}(X)+\hat{f}_{2}(X)+\cdots+\hat{f}_{B}(X)\right]$$`

where `\(X\)` is the record we want to generate a prediction for and `\(\hat{f}_{1}(X),\cdots,\hat{f}_{B}(X)\)` are predictions from the individual models.

```r
B <- 100      # number of bootstrap samples
num_b <- 500  # sample size of each bootstrap
boot_mods <- list()  # store our bagging models

for (i in 1:B) {
  boot_idx <- sample(1:nrow(data.train), size = num_b,
                     replace = FALSE)
  data_slice <- data.train %>% slice(boot_idx)
  ...
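  # Sketch (assumption): the elided step fits a conditional-inference
  # tree on this bootstrap slice and stores it for aggregation across
  # all B samples; the predictors mirror the splits observed later
  # (points, taster.avg_points, variety)
  boot_mods[[i]] <- partykit::ctree(log(price) ~ points +
                                      taster.avg_points + variety,
                                    data = data_slice)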
}
```
]
.pull-right[
<div class="figure">
<img src="present_files/figure-html/bootstrap-1.png" alt="Example Tree" width="504" />
<p class="caption">Example Tree</p>
</div>
]

---

# BOOTSTRAPPING - DISTRIBUTION OF VALUES

.pull-left[
<img src="present_files/figure-html/bootstrap_distribution-1.png" width="504" />
]
.pull-right[
- Due to time and computing constraints the accuracy of this model is distorted, since ~50% of the observations got no prediction
- Bootstrap predictions have a narrower distribution than the actual prices, i.e. less variance
- Bootstrap predictions are biased to the left of the actual price distribution
- Hence this bootstrap model will tend to predict lower-than-actual prices (i.e. undervalue wine)
- Reviewing a number of the decision trees, we see that most root and decision nodes split on `points`, followed by `taster.avg_points` and `variety`
]

---

# BOOTSTRAPPING - FURTHER WORK

.pull-left[
- Given the issues running the bootstrapping code from class at larger sample counts, and an error about not being able to make binary splits on factors with more than 31 levels, we investigated using `caret` to train a bagged model.

```r
cl <- makeCluster(4)    # use 4 workers
registerDoParallel(cl)  # register the parallel backend

model.bag <- train(
  -1 * (price^(-.3)) ~ ., data = data.train,
  method = "treebag",
  trControl = trainControl(method = "cv", number = 10),
  nbagg = 200,
  control = rpart.control(minsplit = 2, cp = 0)
)

pushover(message = "Process finished", user = userID, app = appToken)
```
]
.pull-right[

```
Bagged CART 

68626 samples
   19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 61763, 61765, 61763, 61764, 61764, 61763, ... 
Resampling results:

  RMSE       Rsquared   MAE       
  0.0406081  0.6767615  0.03010213

Test Set Results

  RMSE       Rsquared   MAE       
  0.0402520  0.6823170  0.02976208
```
]

.footnote[<i class="fad fa-pen-square"></i> The Pushover API was used to send a notification when the run finished, since even on 4 cores this took 50 hours.]

---

# RANDOM FOREST - TUKEY TRANSFORMED PRICE

.pull-left[]
.pull-right[
- Only the default `mtry` and `ntree` settings allowed us to generate results for the random forest model; this is a potential area of improvement.
- Hence no tuning (varying the `mtry` and `ntree` parameters) of the random forest was performed.

```r
rf_fit <- randomForest(
  .tukey(price) ~ points + country + color + winery + taster.gender +
    taster.avg_points + title.n_words + title.n_chars +
    title.sentement + title.has_accents,
  data = data.train,
  importance = TRUE,
  localImp = TRUE)
```

where `.tukey` transforms price to `\(-1 \cdot price^{-0.3}\)`
]
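The importance measures plotted on the next two slides come off this fit; a minimal sketch of extracting them, assuming the `rf_fit` object above:

```r
# With importance = TRUE, randomForest records %IncMSE and IncNodePurity
# per predictor; these are the quantities shown on the following slides
importance(rf_fit)
varImpPlot(rf_fit)
```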
---
name: GINI
class: gini

# RANDOM FOREST - GINI COEFFICIENT

- `IncNodePurity`: the total decrease in node impurities from splitting on the variable, averaged over all trees.
- `points` is by far the most important predictor of price and hence has the highest %IncMSE and IncNodePurity values.

.pull-left[
<img src="present_files/figure-html/gini-coeficient-2-1-1.png" width="504" />
]
.pull-right[
<img src="present_files/figure-html/gini-coeficient-2-2-1.png" width="504" />
]

---

# RANDOM FOREST - SUMMARY

.pull-left[
- As in the bootstrap model, `points` is the most important factor in modeling `\(-1 \cdot price^{-0.3}\)`: it scores a mean minimal depth of only 1.1 and a significantly larger % increase in both MSE and node purity than any other parameter.
- The variables `country`, `taster.avg_points` and `color` also strengthen the model, with mean minimal depths below 2 but less impressive % increases in MSE (except for `color`) and node purity.
- All title-based parameters appear to have much less impact, and price is essentially unaffected by accents in the title.
- An `\(R^2\)` of .5987 implies that the model explains roughly 60% of the variation in our data.
]
.pull-right[
<img src="present_files/figure-html/random-forest-summary-.gif" width="504" />
]

---

# LOGISTIC REGRESSION - WELL PRICED

.pull-left[
- Modeling `well_priced` as a function of price and points.

`$$\frac{points_{median}-points_{minimum}}{\log(price_{median})-\log(price_{minimum})}$$`

- This takes diminishing returns into consideration, i.e. each marginal increase in points is accompanied by a larger and larger increase in price.
- `price`<sub>`minimum`</sub> = $2.50 (the lowest viable price of a drinkable wine)
- `points`<sub>`minimum`</sub> = the lowest score in the training set (80, or -2.901992 after centering and scaling)
]
.pull-right[
<img src="present_files/figure-html/points-vs-price-well-priced-1.png" width="504" />
]

---

# LOGISTIC REGRESSION - SUMMARY

.pull-left[
<img src="present_files/figure-html/roc-curves-1.png" width="504" />
]
.pull-right[
- Both the train and test curves sit above the diagonal chance-only line
- The model determines whether a wine is a "good value" better than chance
- AUC values are high: above 80%, or about 70% higher than chance
- AUC for train and test is nearly identical, hence the model is neither over- nor underfit
- Points, specific province, specific variety and specific winery, followed by specific taster, explain the model best, whereas the title information contributes little

```r
well_priced ~ points.category + country + province + winery + color +
  variety + designation + title.has_accents + taster.name +
  taster.gender + title.n_words + title.sentement + title.n_chars +
  taster.avg_points + taster.n_reviews + taster.n_tweets +
  taster.n_followers
```
]

---
class: inverse, middle, center
background-image: url(wine-bottles.svg)
background-size: contain

# SUMMARY

---

# MODEL COMPARISON
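A minimal sketch of how the comparison metrics can be computed on the held-out test set; the object names `ols_fit`, `enet_fit` and the use of `postResample` are our assumptions, and all four models are assumed to predict the Tukey-transformed price:

```r
# caret::postResample() returns RMSE, R-squared and MAE for regression
test_preds <- list(
  ols         = predict(ols_fit,   newdata = data.test),
  elastic_net = predict(enet_fit,  newdata = data.test),
  rf          = predict(rf_fit,    newdata = data.test),
  bagged_cart = predict(model.bag, newdata = data.test)
)
sapply(test_preds, caret::postResample,
       obs = .tukey(data.test$price))
```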
.footnote[<i class="fad fa-pen-square"></i> The in-class bootstrap method was left out since it did not generate a full range of predictions; logistic regression was left out since these metrics don't apply to it.]

---

# FURTHER WORK

### Various other candidate models were considered that would be interesting to work with

- **Support Vector Machine**
- **Neural Networks**
- **K Nearest Neighbors**
- **Bayesian Additive Regression Trees**, mostly because the R package is called bartMachine

#### Additionally, we would like to delve deeper into the capabilities of the `caret` package, specifically hyperparameter tuning and model comparison, as well as using `caretEnsemble` to run a composite model.

---
background-image: url(present_files/thanks.png)
background-position: center
background-repeat: no-repeat
background-size: contain
class: center, middle

# THANK YOU

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). xaringan uses [remark.js](https://remarkjs.com), [**knitr**](http://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com) in the background.