Using Reddit’s API for Predicting Comments
In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.
As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.
For this article, your problem statement will be: What characteristics of a post on Reddit contribute most to what subreddit it belongs to?
Your method for acquiring the data will be scraping threads from at least two subreddits.
Once you’ve got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.
Scraping Thread Info from Reddit.com
Set up a request (using requests) to the URL below.
NOTE: Reddit will throw a 429 error when using the following code:
res = requests.get(URL)
This is because Reddit throttles Python's default user agent. You'll need to set a custom User-agent to get your request to work.
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
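As a minimal sketch (the URL and bot name here are placeholders, not the project's final loop), a single request with a custom header and a status check might look like this:

```python
# Minimal sketch: one request with a custom User-agent, checking the status
# code before parsing the JSON.
import requests

URL = "https://www.reddit.com/r/datascience.json"
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
if res.status_code == 200:
    data = res.json()
else:
    print(f"Request failed with status {res.status_code}")
```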
# Imports - used or otherwise.
import pandas as pd
import requests
import json
import time
import regex as re
import praw
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
# Create the URL variables
URL_ds = "http://www.reddit.com/r/datascience.json"
URL_stats = "https://www.reddit.com/r/statistics.json"
# Authenticating via OAuth for praw
reddit = praw.Reddit(client_id='AOJTLQLavhOXPg',
                     client_secret='eS08QOpy2lWh37qkVBGlN7yMjRI',
                     username='TCRAY_DSI',
                     password='dsi123',
                     user_agent='TK Bot 0.1')
# Check
print(reddit.user.me())
TCRAY_DSI
# Create subs for praw:
sub_ds = reddit.subreddit('datascience')
sub_stats = reddit.subreddit('statistics')
# Create top pulls
top_ds = sub_ds.top(time_filter='year')
top_stats = sub_stats.top(time_filter='year')
# These were run successfully before the loop below was written
# Request the JSON files
# I did them in separate cells to space out the scraping, so Reddit wouldn't throw a 429 error
# res_ds = res.get(URL_ds, headers={'User-agent': 'TK Bot 0.1'})
# res_stats = requests.get(URL_stats, headers={'User-agent': 'TK Bot 0.1'})
# res_stats.status_code
Use res.json() to convert the response into a dictionary format and set this to a variable.
data = res.json()
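For orientation, the listing JSON nests posts under data → children, and each child's data dict carries the post fields used throughout this notebook; the after key is the paging cursor used below:

```python
# Peek at the structure of the listing we just converted
posts = data['data']['children']
first_post = posts[0]['data']
print(first_post['title'])      # post title
print(first_post['selftext'])   # post body (empty for link posts)
print(data['data']['after'])    # fullname of the last post, used for paging
```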
# These were run successfully before the loop below was written
# Convert the JSON responses
# data_ds = res_ds.json()
# data_stats = res_stats.json()
# Check out data
# data_ds
# data_stats
# Testing adding nested dictionaries to each other
# doubling_up = data_ds['data']['children'] + data_ds['data']['children']
# doubling_up
Getting more results
By default, Reddit will give you the top 25 posts:
print(len(data['data']['children']))
If you want more, you’ll need to do two things:
- Get the name of the last post:
data['data']['after']
- Use that name to hit the following url:
http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1
- Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts.
NOTE: Reddit will limit the number of requests per second you’re allowed to make. When you create your loop, be sure to add the following after each iteration.
time.sleep(3) # sleeps 3 seconds before continuing
This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
# Check out length
# print(len(data_ds['data']['children']))
# print(len(data_stats['data']['children']))
# Test the last post pull
# data_ds['data']['after']
# For DS set - previously run to generate the CSV
url_ds = "https://www.reddit.com/r/datascience.json"
data_ds = []
total = []
next_get = ''
# I went with 40 b/c 40 * 25 = 1000 posts total
for i in range(40):
    # Request get
    res = requests.get(url_ds+next_get, headers={'User-agent': 'TK Bot 0.1'})
    # Convert the JSON
    new_dict = res.json()
    # Add to already collected data set
    data_ds.extend(new_dict['data']['children'])
    # Collect 'after' from new dict to generate next URL
    new_url_end = str(new_dict['data']['after'])
    # Generate the next URL
    next_get = '?after='+new_url_end
    # CSV add/update along with DF creation
    # Chose greater than 0 so the else executes on the first iteration
    if i > 0:
        # Read in previous csv for comparison/add
        # Establish current DF - left over from previous way of running
        # past_posts = pd.read_csv('data_ds.csv')
        # current_df = pd.DataFrame(data_ds)
        # Append new and old
        total = pd.DataFrame(data_ds)
        # Convert to DF and save to new csv file
        pd.DataFrame(total).to_csv('data_ds.csv', index = False)
    else:
        pd.DataFrame(data_ds).to_csv('data_ds.csv', index = False)
    # Sleep to fit within Reddit's pull limit
    time.sleep(3)
# For stats set - previously run to generate the CSV
url_stats = "https://www.reddit.com/r/statistics.json"
data_stats = []
total = []
next_get = ''
# I went with 40 b/c 40 * 25 = 1000 posts total
for i in range(40):
    # Request get
    res = requests.get(url_stats+next_get, headers={'User-agent': 'TK Bot 0.1'})
    # Convert the JSON
    new_dict = res.json()
    # Add to already collected data set
    data_stats.extend(new_dict['data']['children'])
    # Collect 'after' from new dict to generate next URL
    new_url_end = str(new_dict['data']['after'])
    # Generate the next URL
    next_get = '?after='+new_url_end
    # CSV add/update along with DF creation
    # Chose greater than 0 so the else executes on the first iteration
    if i > 0:
        # Read in previous csv for comparison/add
        # Establish current DF - left over from previous way of running
        # past_posts = pd.read_csv('data_stats.csv')
        # current_df = pd.DataFrame(data_stats)
        # Append new and old
        total = pd.DataFrame(data_stats)
        # Convert to DF and save to new csv file
        pd.DataFrame(total).to_csv('data_stats.csv', index = False)
    else:
        pd.DataFrame(data_stats).to_csv('data_stats.csv', index = False)
    # Sleep to fit within Reddit's pull limit
    time.sleep(3)
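Since the two loops above differ only in the URL and output file, they could be folded into a single helper. A sketch under the same request pattern and rate limit (the function name is my own):

```python
def scrape_subreddit(url, out_csv, n_pages=40, pause=3):
    """Pull n_pages * 25 posts from a subreddit listing, saving progress to CSV."""
    collected = []
    next_get = ''
    for _ in range(n_pages):
        res = requests.get(url + next_get, headers={'User-agent': 'TK Bot 0.1'})
        listing = res.json()['data']
        collected.extend(listing['children'])
        next_get = '?after=' + str(listing['after'])
        # Save every pass so a crash doesn't lose the data collected so far
        pd.DataFrame(collected).to_csv(out_csv, index=False)
        time.sleep(pause)
    return collected

# Example usage:
# data_ds = scrape_subreddit("https://www.reddit.com/r/datascience.json", 'data_ds.csv')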
# This was an older attempt at writing the loop that I scrapped before starting fresh:
# url_ds = "https://www.reddit.com/r/datascience.json?after=" + last_post_ds
# for i in range(25):
# # Get the name of the last post
# last_post_ds = data_ds['data']['after']
# # Set the url from the last post
# new_url_ds = "https://www.reddit.com/r/datascience.json?after=" + last_post_ds
# # Perform request get
# new_res_ds = res.get(new_url_ds, headers={'User-agent': 'TK Bot 0.1'})
# # Convert the JSON to a dict
# new_data_ds = new_res_ds.json()
# # Add the new dict to the already existing one
# data_ds.update(new_data_ds)
# data_ds['data']['children'] = data_ds['data']['children'] + new_data_ds['data']['children']
# data_ds['data']['after'] = new_data_ds['data']['after']
# # Sleep
# # time.sleep(3)
# Next few cells devoted to understanding how to generate a combined dict
# new_data_ds.items()
# OG_ds_data = data_ds.copy()
# new_data_ds = new_res_ds.json()
# data_ds.update(new_data_ds)
# Testing adding nested dictionaries to each other
# doubling_up = data_ds['data']['children'] + data_ds['data']['children']
# doubling_up
Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
# My loop in the previous cell completes this step.
Read my files back in and clean them up / EDA
%pwd
'/Users/tomkelly/Desktop/general_assembly/DSI-US-5/project-3'
df_ds = pd.read_csv('./data_ds.csv')
df_stats = pd.read_csv('./data_stats.csv')
# 983 DS posts vs 978 stats posts
# df_ds.shape[0]
df_stats.shape[0]
978
# for i in df_ds.shape[0]
# Testing what I want to loop
df_ds['body'] = pd.Series(re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_ds['data'][0]))
df_ds.head()
 | data | kind | body |
---|---|---|---|
0 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | The Mod Team has decided that it would be nice... |
1 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | NaN |
2 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | NaN |
3 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | NaN |
4 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | NaN |
type(df_ds['body'].iloc[0,])
str
# To pull out the body of the post and make it a new column
for i in range(0, df_ds.shape[0]):
    try: # Since regex makes it a list, this helps deal with nulls
        df_ds['body'][i] = re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_ds['data'][i])[0]
    except:
        df_ds['body'][i] = ''
df_ds.head()
 | data | kind | body |
---|---|---|---|
0 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | The Mod Team has decided that it would be nice... |
1 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | \n\nWelcome to this week's 'Entering & Tr... |
2 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | |
3 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | |
4 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | I'm working on making a list of Machine Learni... |
# For some reason, wrapping it in pd.Series makes this work before I loop it
try:
    df_ds['title'] = pd.Series(re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_ds['data'][0]))[0]
except:
    df_ds['title'] = ''
# To pull out the title of the post and make it a new column
for i in range(0, df_ds.shape[0]):
    try:
        df_ds['title'][i] = re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_ds['data'][i])[0]
    except:
        df_ds['title'][i] = ''
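As an aside (not the approach used here), a sketch of a sturdier alternative to the regex extraction: the 'data' column holds stringified Python dicts, so ast.literal_eval can usually parse them and hand back 'selftext' and 'title' directly, with a fallback for malformed rows:

```python
import ast

def extract_field(raw, field):
    """Parse a stringified post dict and pull out one field, falling back to ''."""
    try:
        return ast.literal_eval(raw).get(field, '')
    except (ValueError, SyntaxError):
        return ''

# Hypothetical usage on the same DataFrames:
# df_ds['body'] = df_ds['data'].map(lambda raw: extract_field(raw, 'selftext'))
# df_ds['title'] = df_ds['data'].map(lambda raw: extract_field(raw, 'title'))
```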
df_stats.shape[0]
978
df_ds.head()
 | data | kind | body | title |
---|---|---|---|---|
0 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | The Mod Team has decided that it would be nice... | DS Book Suggestions/Recommendations Megathread |
1 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | \n\nWelcome to this week's 'Entering & Tr... | Weekly 'Entering & Transitioning' Thread. ... |
2 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | Mo Data, Mo Problems. Everyone always talks ab... | |
3 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | Make “Fairness by Design” Part of Machine Lear... | |
4 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | I'm working on making a list of Machine Learni... | Papers with Code |
# Looks like the body/title got pulled in as a list, turning it into a str
# This is leftover from an older method
# for w in range(0,df_ds['body'].shape[0]):
# df_ds['body'][w] = str(df_ds['body'][w])
# Additional Clean-up - DS
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\\n',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\n',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\\',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace("\\'","'"))
# df_ds['body'] = [w.replace('/n', '') for w in df_ds['body']]
df_ds['body']
0 The Mod Team has decided that it would be nice...
1 Welcome to this week's 'Entering & Transi...
2
3
4 I'm working on making a list of Machine Learni...
5 I do most of my work in Python. Building the m...
6 [Project Link](https://github.com/HiteshGorana...
7 Before I got hired, my company had a contracto...
8 I'm looking for an open-source web-based tool ...
9 I've been reading around online a bit as to wh...
10 I am new to time series data, so bear with me....
11
12 Hey all, Do people have recommendations for pi...
13 I am quite old (23), but would like to become ...
14
15 I know that python and R are the standard lang...
16 Which tools and packages do you use the most a...
17
18 Has anyone dealt with such a problem statement...
19
20
21 So, I'm trying to build playlists based on val...
22 My intents are to analyze the results with Ex...
23
24 Does anyone have experience in using either pl...
25 Since I started as a data scientist, I have be...
26
27 Good Afternoon Everyone,&#x200B;I was work...
28 Hi all, this is a followup on [Separated from ...
29 This is maybe not a specific DS question, but ...
...
953 Specifically, as AI gets better and better, an...
954 What is the difference between sklearn.impute....
955
956 ', 'author_fullname': 't2_pqifw', 'saved': Fal...
957 I have a prospective client who’s keen to do s...
958
959
960 Hello all!I have a final interview for a Sales...
961
962 So here’s a little about me. I’ve been a lead ...
963 Please shoo me away to the proper sub if I'm a...
964
965 What's the best open source (i.e., free) appro...
966 Bayesian Network is a probabilistic graphical ...
967 ', 'author_fullname': 't2_r3q3m', 'saved': Fal...
968
969
970 Hi, guys. I have a dataset of different addres...
971 I have been reading a lot of quora answers and...
972 This is my first kernel on Kaggle doing some d...
973 Hi Guys, I need some advise or personal experi...
974
975 I'm finding myself in a position where I may h...
976 I'm looking to make some data science projects...
977 Hi, this is my first post ever, so sorry in ad...
978 Cheers everyone! This is my first kernel on Ka...
979 Hello /r/datascience. TLDR: given the current ...
980 What data science course you studied from and ...
981
982 I'm looking for a ISO file of a distro that it...
Name: body, Length: 983, dtype: object
# Add target column for later combination
df_ds['subreddit_target'] = 1
# Check out the nulls
df_ds.isnull().sum().sort_values()
data 0
kind 0
body 0
title 0
subreddit_target 0
dtype: int64
# Same process of pulling out body/post for df_stats
try:
    df_stats['body'] = pd.Series(re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_stats['data'][0]))[0]
except:
    df_stats['body'] = ''
# To pull out the body of the post and make it a new column
for i in range(0, df_stats.shape[0]):
    try:
        df_stats['body'][i] = re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_stats['data'][i])[0]
    except:
        # Index the single row (as in the DS loop above); assigning the whole
        # column here would wipe previously extracted bodies
        df_stats['body'][i] = ''
# For some reason, wrapping it in pd.Series makes this work before I loop it
try:
    df_stats['title'] = pd.Series(re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_stats['data'][0]))[0]
except:
    df_stats['title'] = ''
# To pull out the title of the post and make it a new column
for i in range(0, df_stats.shape[0]):
    try:
        df_stats['title'][i] = re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_stats['data'][i])[0]
    except:
        df_stats['title'][i] = ''
# Looks like the body got pulled in as a list, restricting how I clean it up; turning it into a str
# Leftover from an older method
# for w in range(0,df_stats['body'].shape[0]):
# df_stats['body'][w] = str(df_stats['body'][w])
# Additional Clean-up - stats
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\\n',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\n',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\\',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace("\\'","'"))
# df_stats['body'] = [w.replace('/n', '') for w in df_stats['body']]
df_stats['subreddit_target'] = 0
# Check out the nulls
df_stats.isnull().sum().sort_values()
data 0
kind 0
body 0
title 0
subreddit_target 0
dtype: int64
# Renaming the columns so they're easier to discern
# Left over from previous way of solving
# df_ds.columns = ['data','kind','body_ds','title_ds']
# df_stats.columns = ['data','kind','body_stats','title_stats']
df_ds.head(1)
 | data | kind | body | title | subreddit_target |
---|---|---|---|---|---|
0 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | The Mod Team has decided that it would be nice... | DS Book Suggestions/Recommendations Megathread | 1 |
# Create combined list for later usage
dflist = [df_ds, df_stats]
dfCombined = pd.concat(dflist, axis=0, sort=True)
dfCombined.head()
# .fillna(value=" ")
 | body | data | kind | subreddit_target | title |
---|---|---|---|---|---|
0 | The Mod Team has decided that it would be nice... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | DS Book Suggestions/Recommendations Megathread |
1 | Welcome to this week's 'Entering & Transi... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Weekly 'Entering & Transitioning' Thread. ... |
2 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Mo Data, Mo Problems. Everyone always talks ab... | |
3 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Make “Fairness by Design” Part of Machine Lear... | |
4 | I'm working on making a list of Machine Learni... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Papers with Code |
# Check length is what I expected
dfCombined['body'].shape[0]
1961
dfCombined['title_body'] = dfCombined['body'] + dfCombined['title']
dfCombined
 | body | data | kind | subreddit_target | title | title_body |
---|---|---|---|---|---|---|
0 | The Mod Team has decided that it would be nice... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | DS Book Suggestions/Recommendations Megathread | The Mod Team has decided that it would be nice... |
1 | Welcome to this week's 'Entering & Transi... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Weekly 'Entering & Transitioning' Thread. ... | Welcome to this week's 'Entering & Transi... |
2 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Mo Data, Mo Problems. Everyone always talks ab... | Mo Data, Mo Problems. Everyone always talks ab... | |
3 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Make “Fairness by Design” Part of Machine Lear... | Make “Fairness by Design” Part of Machine Lear... | |
4 | I'm working on making a list of Machine Learni... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Papers with Code | I'm working on making a list of Machine Learni... |
5 | I do most of my work in Python. Building the m... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Looking for resources to learn how to launch m... | I do most of my work in Python. Building the m... |
6 | [Project Link](https://github.com/HiteshGorana... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | DataScience365 ( A project started recently to... | [Project Link](https://github.com/HiteshGorana... |
7 | Before I got hired, my company had a contracto... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Anyone have experience parsing hospital data f... | Before I got hired, my company had a contracto... |
8 | I'm looking for an open-source web-based tool ... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Open Source Tools for Dashboard Design | I'm looking for an open-source web-based tool ... |
9 | I've been reading around online a bit as to wh... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | MS online vs in-person | I've been reading around online a bit as to wh... |
10 | I am new to time series data, so bear with me.... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Best method for predicting the likelihood of a... | I am new to time series data, so bear with me.... |
11 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Very low cost cloud GPU instances (<$0.15/h... | Very low cost cloud GPU instances (<$0.15/h... | |
12 | Hey all, Do people have recommendations for pi... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Pipeline Versioning (Open Source / Free) What ... | Hey all, Do people have recommendations for pi... |
13 | I am quite old (23), but would like to become ... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Data Science and being a Quant: how transferab... | I am quite old (23), but would like to become ... |
14 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Data Democratization - Data and Analytics Take... | Data Democratization - Data and Analytics Take... | |
15 | I know that python and R are the standard lang... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Mathematica is the best tool for data science ... | I know that python and R are the standard lang... |
16 | Which tools and packages do you use the most a... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | What tools do you actually use at work? | Which tools and packages do you use the most a... |
17 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Feature engineering that exploit symmetries ca... | Feature engineering that exploit symmetries ca... | |
18 | Has anyone dealt with such a problem statement... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | R clustering with maximum size per cluster | Has anyone dealt with such a problem statement... |
19 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Get free GPU for training with Google Colab - ... | Get free GPU for training with Google Colab - ... | |
20 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | [Cheat Sheet] Snippets for Plotting With ggplot | [Cheat Sheet] Snippets for Plotting With ggplot | |
21 | So, I'm trying to build playlists based on val... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | How to use recommender Systems with Multiple "... | So, I'm trying to build playlists based on val... |
22 | My intents are to analyze the results with Ex... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Please Take This Survey if You're a College Gr... | My intents are to analyze the results with Ex... |
23 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | How useful is a reference letter from an econ ... | How useful is a reference letter from an econ ... | |
24 | Does anyone have experience in using either pl... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | H2O.ai vs Datarobot? Your take | Does anyone have experience in using either pl... |
25 | Since I started as a data scientist, I have be... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Are independent research papers useful for a d... | Since I started as a data scientist, I have be... |
26 | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Super helpful cheat sheets for Keras, Numpy, P... | Super helpful cheat sheets for Keras, Numpy, P... | |
27 | Good Afternoon Everyone,&#x200B;I was work... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Correlation Plot of a correlation matrix ( usi... | Good Afternoon Everyone,&#x200B;I was work... |
28 | Hi all, this is a followup on [Separated from ... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | Step down from Data Scientist in next job- how... | Hi all, this is a followup on [Separated from ... |
29 | This is maybe not a specific DS question, but ... | {'approved_at_utc': None, 'subreddit': 'datasc... | t3 | 1 | How do you deal with post-job-interview though... | This is maybe not a specific DS question, but ... |
... | ... | ... | ... | ... | ... | ... |
948 | Hello all. I'm a grad school student who ended... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Need to Learn How to Use SPSS Syntax ASAP | Hello all. I'm a grad school student who ended... |
949 | I have been reading the Wikipedia explanations... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | ELI5: bray curtis dissimilarity matrix and UPG... | I have been reading the Wikipedia explanations... |
950 | Hello all.u200bThe survey: Our survey asks peo... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Weighting an online survey with a lot of unknowns | Hello all.u200bThe survey: Our survey asks peo... |
951 | Hi everyone. I'm curious whether anyone knows ... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Textbooks in statistics with great problem sets | Hi everyone. I'm curious whether anyone knows ... |
952 | I am analyzing dyadic data in a multilevel mod... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Residuals plot: Is this autocorrelation? | I am analyzing dyadic data in a multilevel mod... |
953 | How do you apply the Bonferroni correction if ... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Bonferroni corrections | How do you apply the Bonferroni correction if ... |
954 | Hello everyone, I'm looking for books which ta... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Resources for undergrad material in Python &am... | Hello everyone, I'm looking for books which ta... |
955 | Hi there! I was hoping someone may be able to ... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Unsure which test to use | Hi there! I was hoping someone may be able to ... |
956 | I should preface this by saying I know very li... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Help with normalization of data | I should preface this by saying I know very li... |
957 | Hey r/statistics, I need some advice on how to... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Advice on an epidemiology dataset | Hey r/statistics, I need some advice on how to... |
958 | I'm facing 3 problems in my current analysis (... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Groupsize differences, unequal genders and g p... | I'm facing 3 problems in my current analysis (... |
959 | I am working with panel data with n=30 and t=7... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | How to interpret counterintuitive signs from m... | I am working with panel data with n=30 and t=7... |
960 | I just finished gelmans Bayesian data analysis... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Where to go after Gelman's BDA3? | I just finished gelmans Bayesian data analysis... |
961 | Ignore for a moment the issues with NHST.If a ... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | If you are working in the paradigm of NHST, wh... | Ignore for a moment the issues with NHST.If a ... |
962 | For the “big” study this group says they hypot... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | How can I use pilot data to plan sample sizes ... | For the “big” study this group says they hypot... |
963 | Hi. I need to write two predictive supply and ... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Predictive supply and demand model | Hi. I need to write two predictive supply and ... |
964 | An illustration of my issue: For e.g. X is a h... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Determining which variable is more affected | An illustration of my issue: For e.g. X is a h... |
965 | A group of students takes a PRE test with 50 q... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Repeat measures t-test on exam data, but pre a... | A group of students takes a PRE test with 50 q... |
966 | Trying to figure out that if I have 7 variable... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Easy question from one confused boi; 7 variabl... | Trying to figure out that if I have 7 variable... |
967 | Hello,I’ve been doing some analysis regardin... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | How to deal with the log of a variable where s... | Hello,I’ve been doing some analysis regardin... |
968 | Hi there, I'm a bit confused about usage of F... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Questions about Firth logistic regressions | Hi there, I'm a bit confused about usage of F... |
969 | I have ranked preference data for 7 items. How... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Analyzing Ranked Preference Data | I have ranked preference data for 7 items. How... |
970 | I am measuring the effect of scale on the numb... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Wondering which test to conduct and how to con... | I am measuring the effect of scale on the numb... |
971 | I'm looking at some instruction/examples on A/... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | [Q] non-parametric, permutations A/B testing | I'm looking at some instruction/examples on A/... |
972 | &#x200B; | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | What is a good tutorial for learning how to ca... | &#x200B;What is a good tutorial for learni... |
973 | &#x200B; | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | i'm a psych phd student who wants to befriend ... | &#x200B;i'm a psych phd student who wants ... |
974 | Howdy, So I’m in the beginnings of a PhD in ep... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Any grad students from other fields also looki... | Howdy, So I’m in the beginnings of a PhD in ep... |
975 | Correlation And Causation By Examplehttp://blo... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Correlation And Causation By Example | Correlation And Causation By Examplehttp://blo... |
976 | Hi all,&#x200B;I'm having a bit of trouble... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | Merging item responses into a single variable ... | Hi all,&#x200B;I'm having a bit of trouble... |
977 | Can somebody help this statistics rookie?Resea... | {'approved_at_utc': None, 'subreddit': 'statis... | t3 | 0 | [Question] Should I use a Two-way ANOVA? | Can somebody help this statistics rookie?Resea... |
1961 rows × 6 columns
# Save the cleaned-up product on the side
dfCombined.to_csv('Combined.csv', index = False)
NLP
Use CountVectorizer or TfidfVectorizer from scikit-learn to create features from the thread titles and descriptions (NOTE: not all threads have a description).
- Examine using count or binary features in the model (see the sketch after this list)
- Re-evaluate your models using these. Does this improve the model performance?
- What text features are the most valuable?
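On the count-vs-binary bullet, a sketch of the comparison. It reuses the X_train/X_test split created below, and the stop-word list is just 'english' here for brevity:

```python
# Sketch: raw counts vs. binary (presence/absence) features, same classifier.
for name, vec in [('counts', CountVectorizer(stop_words='english')),
                  ('binary', CountVectorizer(stop_words='english', binary=True))]:
    Xtr = vec.fit_transform(X_train)
    Xte = vec.transform(X_test)
    clf = LogisticRegression()
    clf.fit(Xtr, y_train)
    print(name, clf.score(Xte, y_test))
```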
N-grams = 1
# Going back after the fact to add some obvious stop words
# This was from a 'normal' run of CountVectorizer, e.g. n-grams = 1
# amp seems to be some bad html code that got pulled in mistakenly
new_stop_words = {'science', 'like', 'https', 'com', 've', '10', '12', 'amp'}
stop_words = ENGLISH_STOP_WORDS.union(new_stop_words)
# Instantiate
cvec = CountVectorizer(stop_words=stop_words) # First run through of n-grams = 1
# Set variables and train_test_split
# Sticking with the normal 75/25 split
X = dfCombined['title_body'].values
y = dfCombined['subreddit_target']
# .map({'statistics':0, 'datascience':1})
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Fit and transform
cvec.fit(X_train)
X_train_transform = cvec.transform(X_train)
X_test_transform = cvec.transform(X_test)
df_view_stats = pd.DataFrame(X_test_transform.todense(),
                             columns=cvec.get_feature_names(),
                             index=y_test.index)
df_view_stats.head()
# .T.sort_values('statistics', ascending=False).head(10).T
00 | 000 | 0005 | 0016 | 0031 | 004 | 004100341sig | 00411621sig | 004p2 | 00625 | ... | zipper | zippers | zjt | zones | zoo | zuckerberg | zwitch | zziz | µᵢ | χ2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
113 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
572 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
383 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
506 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8621 columns
# Most commonly used words on data science
# This was run multiple times for different sets of n-grams
word_count_test = pd.concat([df_view_stats, y_test], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='datascience', axis=1, ascending=False).T
subreddit_target | datascience | statistics |
---|---|---|
data | 435 | 74 |
learning | 97 | 17 |
work | 79 | 10 |
time | 74 | 36 |
python | 70 | 5 |
model | 67 | 34 |
know | 65 | 25 |
false | 64 | 1 |
using | 63 | 16 |
use | 61 | 28 |
looking | 53 | 10 |
new | 52 | 6 |
job | 51 | 8 |
learn | 51 | 5 |
just | 51 | 11 |
dataset | 44 | 5 |
want | 44 | 21 |
tf | 43 | 0 |
need | 43 | 19 |
project | 42 | 4 |
code | 41 | 1 |
good | 41 | 12 |
projects | 41 | 7 |
way | 39 | 19 |
set | 39 | 10 |
tensorflow | 39 | 0 |
lt | 38 | 25 |
machine | 38 | 4 |
analysis | 38 | 12 |
working | 37 | 7 |
... | ... | ... |
ljung | 0 | 0 |
classifying | 0 | 0 |
livestream | 0 | 0 |
classname | 0 | 0 |
lived | 0 | 0 |
cleanly | 0 | 0 |
classification_report | 0 | 0 |
classical | 0 | 1 |
claim | 0 | 0 |
class3 | 0 | 0 |
claimed | 0 | 0 |
claims | 0 | 0 |
clarify | 0 | 1 |
lol | 0 | 0 |
logs | 0 | 1 |
logo | 0 | 0 |
lognormal | 0 | 0 |
logits | 0 | 0 |
logit | 0 | 1 |
logistics | 0 | 0 |
clarifying | 0 | 0 |
logical | 0 | 0 |
logic | 0 | 1 |
clarityhow | 0 | 0 |
logarithms | 0 | 0 |
logarithmicaly | 0 | 0 |
class1 | 0 | 0 |
locked | 0 | 0 |
class2 | 0 | 0 |
χ2 | 0 | 0 |
8621 rows × 2 columns
# Most commonly used words on statistics
# This was run multiple times for different sets of n-grams
word_count_test = pd.concat([df_view_stats, y_test], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='statistics', axis=1, ascending=False).T
subreddit_target | datascience | statistics |
---|---|---|
data | 435 | 74 |
statistics | 26 | 50 |
mean | 6 | 48 |
variables | 20 | 44 |
variable | 15 | 42 |
test | 15 | 42 |
help | 36 | 41 |
time | 74 | 36 |
regression | 18 | 36 |
model | 67 | 34 |
use | 61 | 28 |
know | 65 | 25 |
lt | 38 | 25 |
question | 34 | 25 |
11 | 5 | 23 |
different | 27 | 22 |
distribution | 3 | 21 |
x200b | 25 | 21 |
make | 34 | 21 |
want | 44 | 21 |
way | 39 | 19 |
number | 15 | 19 |
need | 43 | 19 |
09 | 2 | 18 |
statistical | 18 | 18 |
sample | 10 | 18 |
linear | 17 | 18 |
day | 15 | 18 |
population | 0 | 17 |
15 | 7 | 17 |
... | ... | ... |
fine | 1 | 0 |
flagship | 0 | 0 |
fishermen | 0 | 0 |
flagged | 1 | 0 |
flag | 1 | 0 |
fizzle | 0 | 0 |
fizzbuzz | 0 | 0 |
fixing | 0 | 0 |
fix | 1 | 0 |
fivethirtyeight | 0 | 0 |
fitted | 0 | 0 |
fitness | 0 | 0 |
fit_transform | 1 | 0 |
fit2 | 0 | 0 |
fishing | 0 | 0 |
fischer | 0 | 0 |
finger | 0 | 0 |
fiscal | 0 | 0 |
firmly | 0 | 0 |
firm | 2 | 0 |
firing | 0 | 0 |
firefox | 0 | 0 |
fintech | 0 | 0 |
finnoq | 0 | 0 |
finnish | 0 | 0 |
finland | 0 | 0 |
finite | 0 | 0 |
finishes | 0 | 0 |
finished | 2 | 0 |
χ2 | 0 | 0 |
8621 rows × 2 columns
N-grams = 2
# This was the second run of CountVectorizer, e.g. n-grams = 2
# I removed science, because I wanted to make a differentiation b/w 'science' and 'data science', and also 've', b/c it was only getting picked up from 'I've'
# Going to leave stop words as is for n-grams = 2, aside from the HTML junk that got pulled in
new_stop_words = {'amp', 'x200b', 'amp x200b'}
stop_words = ENGLISH_STOP_WORDS.union(new_stop_words)
# Instantiate
cvec2 = CountVectorizer(stop_words=stop_words, ngram_range=(2,2)) #Second run through of n-grams = 2
# Set variables and train_test_split
# Sticking with the normal 75/25 split
X = dfCombined['title_body'].values
y = dfCombined['subreddit_target']
# .map({'statistics':0, 'datascience':1})
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=42)
# Fit and transform
cvec2.fit(X_train2)
X_train_transform2 = cvec2.transform(X_train2)
X_test_transform2 = cvec2.transform(X_test2)
df_view_stats2 = pd.DataFrame(X_test_transform2.todense(),
                              columns=cvec2.get_feature_names(),
                              index=y_test2.index)
df_view_stats2.head()
# .T.sort_values('statistics', ascending=False).head(10).T
00 00 | 00 29s2 | 00 9987 | 00 cheap | 00 cost | 00 established | 00 mean | 00 primarily | 00 went | 000 10 | ... | zippers validate | zjt vector | zones topping | zoo ggplot2 | zuckerberg eric | zwitch mapd | zziz pwcpapers | µᵢ fixed | χ2 05 | χ2 distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
113 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
572 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
383 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
506 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 43537 columns
# Most commonly used words on data science
word_count_test = pd.concat([df_view_stats2, y_test2], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='datascience', axis=1, ascending=False).T
subreddit_target | datascience | statistics |
---|---|---|
data science | 127 | 5 |
machine learning | 38 | 2 |
data scientist | 30 | 0 |
https www | 26 | 0 |
data scientists | 20 | 0 |
gt lt | 17 | 0 |
tensorflow js | 16 | 0 |
https github | 15 | 0 |
statistical learning | 15 | 2 |
github com | 15 | 0 |
data analyst | 15 | 1 |
kaggle com | 13 | 0 |
www kaggle | 13 | 0 |
time series | 13 | 3 |
https redd | 12 | 0 |
data analytics | 12 | 0 |
feel like | 12 | 0 |
https youtu | 10 | 0 |
data set | 10 | 3 |
open source | 10 | 0 |
greatly appreciated | 9 | 1 |
linear algebra | 8 | 3 |
don know | 8 | 2 |
scikit learn | 8 | 0 |
data analysis | 7 | 1 |
work data | 7 | 0 |
lt script | 7 | 0 |
sql queries | 7 | 0 |
little bit | 7 | 0 |
new data | 7 | 0 |
... | ... | ... |
gallery 1hbpy1w | 0 | 0 |
gallery ehcawau | 0 | 0 |
gallery ej9di3f | 0 | 0 |
gallery html | 0 | 0 |
gallery http | 0 | 0 |
gallery o45qf8o | 0 | 0 |
gallery olzrzxz | 0 | 0 |
gallery plotly | 0 | 0 |
gallery wtdpir3 | 0 | 0 |
gain round | 0 | 0 |
gain opinions | 0 | 0 |
gain followers | 0 | 0 |
future timeseries | 0 | 0 |
future performance | 0 | 0 |
future price | 0 | 0 |
future research | 0 | 0 |
future researcherdon | 0 | 0 |
future statistical | 0 | 0 |
future thoughts | 0 | 0 |
future time | 0 | 0 |
future using | 0 | 0 |
gain academic | 0 | 0 |
future weather | 0 | 0 |
fyi data | 0 | 0 |
fyi learning | 0 | 0 |
g1 mn | 0 | 0 |
g2 13 | 0 | 0 |
ga 90 | 0 | 0 |
ga tools | 0 | 0 |
χ2 distribution | 0 | 0 |
43537 rows × 2 columns
# Most commonly used words on statistics
word_count_test = pd.concat([df_view_stats2, y_test2], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='statistics', axis=1, ascending=False).T
subreddit_target | datascience | statistics |
---|---|---|
standard deviation | 0 | 6 |
linear regression | 3 | 6 |
non stationary | 0 | 6 |
https imgur | 3 | 5 |
regression model | 2 | 5 |
independent variables | 0 | 5 |
don think | 0 | 5 |
imgur com | 3 | 5 |
make sense | 1 | 5 |
data science | 127 | 5 |
things like | 4 | 5 |
normally distributed | 0 | 4 |
logistic regressions | 0 | 4 |
prediction model | 1 | 4 |
need help | 4 | 4 |
normal distribution | 0 | 4 |
hypothesis testing | 0 | 4 |
capture recapture | 0 | 4 |
comp sci | 3 | 4 |
random sample | 0 | 4 |
average mean | 0 | 4 |
post test | 0 | 3 |
real time | 2 | 3 |
pre post | 0 | 3 |
make statistical | 0 | 3 |
data excel | 0 | 3 |
hotspot mapping | 0 | 3 |
independent variable | 0 | 3 |
index variables | 0 | 3 |
statistical curve | 0 | 3 |
... | ... | ... |
frames day | 0 | 0 |
framework aware | 0 | 0 |
framework building | 0 | 0 |
framework cheersbest | 0 | 0 |
framework consistent | 0 | 0 |
framework guidance | 0 | 0 |
framework implemented | 0 | 0 |
framework interactive | 0 | 0 |
fragments feeding | 0 | 0 |
fraction discard | 0 | 0 |
forward similar | 0 | 0 |
fpsyg 2018 | 0 | 0 |
forward want | 0 | 0 |
forxa03xa0months july | 0 | 0 |
foundation hiring | 0 | 0 |
foundation mathematics | 0 | 0 |
foundation prior | 0 | 0 |
foundations predictive | 0 | 0 |
foundations python | 0 | 0 |
founder kdnuggets | 0 | 0 |
fourmilab ch | 0 | 0 |
fourth generate | 0 | 0 |
foxes hounds | 0 | 0 |
foxes immediately | 0 | 0 |
foxes seven | 0 | 0 |
foxhole inside | 0 | 0 |
fp growth | 0 | 0 |
fp persons | 0 | 0 |
fpsyg 09 | 0 | 0 |
χ2 distribution | 0 | 0 |
43537 rows × 2 columns
# Instantiate and fit
lr2 = LogisticRegression()
lr2.fit(X_train_transform2, y_train2)
lr2.score(X_train_transform2, y_train2)
0.9863945578231292
lr2.score(X_test_transform2, y_test2)
# Looks like a pretty decent overfit
0.7637474541751528
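The train/test gap above suggests overfitting. A sketch of one way to probe it, sweeping LogisticRegression's regularization strength C (the values below are arbitrary choices):

```python
# Sweep the inverse regularization strength; smaller C = stronger regularization.
for C in [0.01, 0.1, 1.0, 10.0]:
    lr_c = LogisticRegression(C=C)
    lr_c.fit(X_train_transform2, y_train2)
    print(C,
          round(lr_c.score(X_train_transform2, y_train2), 3),
          round(lr_c.score(X_test_transform2, y_test2), 3))
```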
Predicting subreddit using Random Forests + Another Classifier
# Instantiate and fit
# From here on out, it's the n-grams = 1 features (X_train_transform)
lr = LogisticRegression()
lr.fit(X_train_transform, y_train)
lr.score(X_train_transform, y_train)
0.9897959183673469
lr.score(X_test_transform, y_test)
# Looks like a pretty decent overfit
0.8757637474541752
We want to predict a binary variable: class 0 for one of your subreddits and 1 for the other.
preds = lr.predict(X_test_transform)
pred_proba = lr.predict_proba(X_test_transform)[:,1]
roc_auc = roc_auc_score(y_test, preds)
roc_auc
0.8772046367954297
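One note: roc_auc_score above is fed the hard class predictions; the more conventional input is the predicted probabilities, which uses the full ranking rather than a single threshold:

```python
# AUC computed on predicted probabilities rather than hard 0/1 predictions
roc_auc_score(y_test, pred_proba)
```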
roc_auc = roc_auc_score(y_test, preds)
FPR, TPR, thresholds = roc_curve(y_test, pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(FPR, TPR, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.title('ROC-AUC (n-grams=1)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.plot([0, 1], [0, 1],'r--')
plt.legend(loc="lower right")
plt.show()
Thought experiment: What is the baseline accuracy for this model?
## I'm going to take an educated guess that the baseline accuracy is about 50% - always predicting the majority class, since the two subreddits contribute nearly equal numbers of posts (983 vs 978)
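To check that guess directly, the baseline is the majority-class share of the combined data:

```python
# Baseline accuracy = share of the majority class (983 datascience vs 978 statistics posts)
dfCombined['subreddit_target'].value_counts(normalize=True).max()
```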
Create a RandomForestClassifier
model to predict which subreddit a given post belongs to.
# Instantiate
rf = RandomForestClassifier()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
Use cross-validation in scikit-learn to evaluate the model above.
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate.
- Bonus: Use GridSearchCV with Pipeline to optimize your CountVectorizer/TfidfVectorizer and classification model (sketched below).
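A sketch of that bonus with illustrative (assumed) parameter grids. The pipeline takes the raw text, so it is fit on X_train rather than the pre-vectorized matrix:

```python
# Sketch: tune the vectorizer and classifier together with a Pipeline + GridSearchCV.
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('clf', RandomForestClassifier(random_state=42)),
])
params = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__max_features': [2000, 5000],
    'clf__n_estimators': [100, 200],
}
gs = GridSearchCV(pipe, params, cv=cv, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
```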
cvs_train = cross_val_score(rf, X_train_transform, y_train, cv=cv, n_jobs=-1)
print(cvs_train)
print(cvs_train.mean())
[0.81292517 0.86734694 0.79591837 0.78231293 0.79931973]
0.8115646258503402
cvs_test = cross_val_score(rf, X_test_transform, y_test, cv=cv, n_jobs=-1)
print(cvs_test)
print(cvs_test.mean())
# Still slight overfit
[0.70707071 0.71717172 0.76767677 0.82474227 0.74226804]
0.7517859002395084
Repeat the model-building process using a different classifier (e.g. MultinomialNB, LogisticRegression, etc.)
MultinomialNB
mnb = MultinomialNB()
cvs_train = cross_val_score(mnb, X_train_transform, y_train, cv=cv, n_jobs=-1)
print(cvs_train)
print(cvs_train.mean())
[0.82312925 0.8537415 0.81632653 0.80272109 0.81972789]
0.8231292517006802
cvs_test = cross_val_score(mnb, X_test_transform, y_test, cv=cv, n_jobs=-1)
print(cvs_test)
print(cvs_test.mean())
# Not as bad of an overfit
[0.7979798 0.75757576 0.7979798 0.81443299 0.77319588]
0.788232843902947
GaussianNB
gnb = GaussianNB()
cvs_train = cross_val_score(gnb, X_train_transform.toarray(), y_train, cv=cv, n_jobs=-1)
print(cvs_train)
print(cvs_train.mean())
[0.77891156 0.80612245 0.78231293 0.81292517 0.80952381]
0.7979591836734693
cvs_test = cross_val_score(gnb, X_test_transform.toarray(), y_test, cv=cv, n_jobs=-1)
print(cvs_test)
print(cvs_test.mean())
# Overfit isn't as much of a problem on this model
# However, the overall score isn't as strong as the other models
[0.71717172 0.76767677 0.80808081 0.78350515 0.70103093]
0.7554930750807038
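VotingClassifier was imported at the top but never used; as an assumed extension (not part of the original analysis), the classifiers above could be combined into a soft-voting ensemble:

```python
# Sketch: soft-voting ensemble over the classifiers evaluated above.
vote = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier(random_state=42)),
                ('mnb', MultinomialNB())],
    voting='soft')
print(cross_val_score(vote, X_train_transform, y_train, cv=cv, n_jobs=-1).mean())
```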
Executive Summary
Put your executive summary in a Markdown cell below.
Reclassifying all of Reddit is an incredibly daunting task. However, Python's machine learning and natural language processing tooling can make it manageable. Reddit calls itself 'the front page of the internet,' and in the same spirit of innovation that built the internet, it can work its way past this challenge as it has countless obstacles before.
Specifically, the distinction between r/DataScience and r/Statistics is relatively small, as these subreddits generally discuss similar ideas and concepts. Despite these similarities, I believe my models performed quite well (especially my first run of Logistic Regression using n-grams = 1). Additionally, I chose to treat as stop words several terms ('science', 'https', 'com') that would too easily identify r/DataScience as the correct subreddit, both to 'challenge' my modeling and evaluation skills and to let the process generalize to the rest of Reddit's subreddits. Leaving those words in would have pushed my models' classification scores even higher.
Finally, I believe that this machine learning/NLP process can be applied to Reddit as a whole to help reclassify and realign its subreddits with a high degree of success. Coupled with Reddit's strong community, including its committed mods, this is a challenge that Reddit can overcome, and it could come out stronger because of it.