Problem Statement: As a data scientist for the marketing department at Reddit

I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so we can use them to determine which ads should populate on each page. Because this is a classification problem, we’ll use Logistic Regression & Bayes models. Misclassifications in this case are fairly benign, so I will use the accuracy score and set a baseline of 63.3% to rate success. Using TfidfVectorizer, I’ll get the feature importance to determine which words have the highest predictive power for the target variables. If successful, this model could be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.
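The scrape notebooks are not reproduced here, so the following is only a minimal sketch of the "scrape to DataFrame to CSV" step. It assumes a payload shaped like a Reddit listing response (posts nested under data.children[].data); the sample payload and the helper name listing_to_frame are hypothetical, and the CSV is written to an in-memory buffer instead of the dataset folder.

```python
import io

import pandas as pd

# Hypothetical payload shaped like a Reddit listing response.
# In the real notebooks this would come from the API, not a literal.
listing = {
    "data": {
        "after": None,
        "children": [
            {"data": {"subreddit": "dating_advice",
                      "title": "First date tips",
                      "selftext": "We met online"}},
            {"data": {"subreddit": "dating_advice",
                      "title": "Should I text back",
                      "selftext": ""}},
        ],
    }
}


def listing_to_frame(payload, columns=("subreddit", "title", "selftext")):
    """Flatten one listing payload into a DataFrame with the given columns."""
    rows = [
        {col: child["data"].get(col) for col in columns}
        for child in payload["data"]["children"]
    ]
    return pd.DataFrame(rows, columns=list(columns))


df = listing_to_frame(listing)

# Write to an in-memory buffer here; a real run would pass a file path instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```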

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new all_text column
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
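The cleaning steps above can be sketched in pandas roughly as follows. The toy rows and the subreddit label values are made up for illustration; only the title, selftext, and all_text column names come from the write-up.

```python
import pandas as pd

# Toy stand-in for the scraped posts (hypothetical rows).
posts = pd.DataFrame({
    "subreddit": ["dating_advice", "relationship_advice",
                  "dating_advice", "relationship_advice"],
    "title": ["First date tips", "My partner and I argue",
              "Should I text back", "Moving in together"],
    "selftext": ["We met online last week", None,
                 "It has been three days", "We have dated for two years"],
})

# 1. Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# 2. Combine title and selftext into one all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# 3. Word counts per post, then compare the two subreddits.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
print(summary)
```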

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most often, I will be right 63.3% of the time.
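A majority-class baseline like this can be computed in one line. The label series below is a made-up example, so it yields 0.6 and not the 0.633 reported for the real data.

```python
import pandas as pd

# Hypothetical target column: 1 = one subreddit, 0 = the other.
y = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# Share of each class; the largest share is the accuracy you get
# by always predicting the most frequent class.
class_shares = y.value_counts(normalize=True)
baseline_accuracy = class_shares.max()
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```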

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test 75 | cross val 74

Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraping; pretty bad score with high variance. Train 99%, test 72%

  • tried to decrease max features and the score got a lot worse
  • tried with lemmatizer preprocessing instead and the test score went up to 74%

Simply increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3; however, these scores disappeared.
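As a rough sketch of that setup, here is a CountVectorizer with min_df=3 and ngram_range=(1, 2) feeding a logistic regression, with a stratified train/test split. The toy corpus, the 25% test size, and random_state are assumptions for illustration; because the toy sentences are repeated, the scores it prints are inflated and say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; repeated so that min_df=3 keeps some terms.
dating = [
    "first date went well", "nervous about first date",
    "she did not text back", "should i text back after first date",
    "how to ask for second date", "first date ideas please",
] * 5
relationship = [
    "my husband and i argue", "my wife wants to move",
    "my boyfriend forgot our anniversary", "my girlfriend and i argue a lot",
    "husband and i disagree about money", "my wife and i need advice",
] * 5
texts = dating + relationship
labels = [0] * len(dating) + [1] * len(relationship)

# Stratify on the labels so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# min_df=3 drops terms seen in fewer than 3 documents;
# ngram_range=(1, 2) uses single words and bigrams.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
cv_acc = cross_val_score(model, texts, labels, cv=5).mean()
print(f"train {train_acc:.3f} | test {test_acc:.3f} | cross val {cv_acc:.3f}")
```

Comparing the train score with the test and cross-validation scores is what exposes the variance problem described here.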

I think Tfidf worked the best to reduce my overfitting due to the variance issue because I customized the stop words to remove the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could’ve tweaked them a bit more to improve all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My initial list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to eliminate oversaturated words too. Finally, setting max features to 5000 cut down my columns to about a quarter of what they were, to focus only on the most commonly used words of what was left.

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully reduce the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting, but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words produce a different or…


Used Reddit’s API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice & Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.
