Thomas Kelly

Classifying Reddit Posts: r/DataScience vs r/Statistics

Published 30 Sep 2018.

Using Reddit’s API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request, and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: What characteristics of a post on Reddit contribute most to what subreddit it belongs to?

Your method for acquiring the data will be scraping threads from at least two subreddits.

Once you’ve got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

Scraping Thread Info from Reddit.com

Set up a request (using requests) to the URL below.

NOTE: Reddit will throw a 429 error when using the following code:

res = requests.get(URL)

This is because Reddit has throttled Python's default user agent. You'll need to set a custom User-agent to get your request to work.

res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
# Imports - used or otherwise.
import pandas as pd
import requests
import json
import time
import regex as re
import praw
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
# Create the URL variables
URL_ds = "http://www.reddit.com/r/datascience.json"
URL_stats = "https://www.reddit.com/r/statistics.json"
# Authenticating via OAuth for praw
reddit = praw.Reddit(client_id='AOJTLQLavhOXPg',
                     client_secret='eS08QOpy2lWh37qkVBGlN7yMjRI',
                     username='TCRAY_DSI',
                     password='dsi123',
                     user_agent='TK Bot 0.1')
# Check
print(reddit.user.me())
TCRAY_DSI
# Create subs for praw:
sub_ds = reddit.subreddit('datascience')
sub_stats = reddit.subreddit('statistics')
# Create top pulls
top_ds = sub_ds.top(time_filter='year')
top_stats = sub_stats.top(time_filter='year')
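These praw listings are lazy generators and are not used further below; the JSON endpoint is what actually feeds the dataset. For reference, a minimal sketch of pulling the same fields through praw (hypothetical, not run here):

```python
# Hypothetical praw-based pull; each item is a praw Submission with .title and .selftext
praw_posts = [{'title': post.title, 'body': post.selftext}
              for post in sub_ds.top(time_filter='year', limit=100)]
```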
# These were used and attempted (successfully) before creating the loop
# Request the JSON files
# I did them in separate cells to space out the scraping, so Reddit wouldn't throw a 429 error
# res_ds = res.get(URL_ds, headers={'User-agent': 'TK Bot 0.1'})
# res_stats = requests.get(URL_stats, headers={'User-agent': 'TK Bot 0.1'})
# res_stats.status_code

Use res.json() to convert the response into a dictionary format and set this to a variable.

data = res.json()
# These were used and attempted (successfully) before creating the loop
# Convert the JSON responses
# data_ds = res_ds.json()
# data_stats = res_stats.json()
# Check out data
# data_ds
# data_stats
# Testing adding nested dictionaries to each other
# doubling_up = data_ds['data']['children'] + data_ds['data']['children']
# doubling_up

Getting more results

By default, Reddit will give you the top 25 posts:

print(len(data['data']['children']))

If you want more, you’ll need to do the following:

  1. Get the name of the last post: data['data']['after']
  2. Use that name to hit the following url: http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1
  3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts (a minimal sketch follows the note below).

NOTE: Reddit will limit the number of requests per second you’re allowed to make. When you create your loop, be sure to add the following after each iteration.

time.sleep(3) # sleeps 3 seconds before continuing

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
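A minimal sketch of that loop, assuming `requests` and `time` are imported and `URL` is one of the `.json` endpoints defined earlier:

```python
# Sketch of the paginated pull described above (not the full loop used later)
posts, after = [], ''
for _ in range(4):                                   # 4 * 25 = 100 posts; adjust as needed
    res = requests.get(URL + after, headers={'User-agent': 'YOUR NAME Bot 0.1'})
    data = res.json()
    posts.extend(data['data']['children'])           # step 1's payload
    after = '?after=' + str(data['data']['after'])   # step 2's query string
    time.sleep(3)                                    # stay within Reddit's rate limit
```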


# Check out length
# print(len(data_ds['data']['children']))
# print(len(data_stats['data']['children']))
# Test the last post pull
# data_ds['data']['after']
# For DS set - previously run to generate the CSV
url_ds = "https://www.reddit.com/r/datascience.json"
data_ds = []
total = []
next_get = ''

# I went with 40 b/c 40 * 25 = 1000 posts total
for i in range(40):

    # Request get
    res = requests.get(url_ds+next_get, headers={'User-agent': 'TK Bot 0.1'})

    # Convert the JSON
    new_dict = res.json()

    # Add to already collected data set
    data_ds.extend(new_dict['data']['children'])

    # Collect 'after' from new dict to generate next URL
    new_url_end = str(new_dict['data']['after'])

    # Generate the next URL
    next_get = '?after='+new_url_end

    # CSV add/update along with DF creation
    # Chose greater than 0 so the else executes on the first iteration
    if i > 0:
        # Read in previous csv for comparison/add
        # Establish current DF - left over from previous way of running
        # past_posts = pd.read_csv('data_ds.csv')
        # current_df = pd.DataFrame(data_ds)

        # Append new and old
        total = pd.DataFrame(data_ds)

        # Convert to DF and save to new csv file
        pd.DataFrame(total).to_csv('data_ds.csv', index = False)

    else:
        pd.DataFrame(data_ds).to_csv('data_ds.csv', index = False)

    # Sleep to fit within Reddit's pull limit
    time.sleep(3)
# For stats set - previously run to generate the CSV
url_stats = "https://www.reddit.com/r/statistics.json"
data_stats = []
total = []
next_get = ''

# I went with 40 b/c 40 * 25 = 1000 posts total
for i in range(40):

    # Request get
    res = requests.get(url_stats+next_get, headers={'User-agent': 'TK Bot 0.1'})

    # Convert the JSON
    new_dict = res.json()

    # Add to already collected data set
    data_stats.extend(new_dict['data']['children'])

    # Collect 'after' from new dict to generate next URL
    new_url_end = str(new_dict['data']['after'])

    # Generate the next URL
    next_get = '?after='+new_url_end

    # CSV add/update along with DF creation
    # Chose greater than 0 so the else executes on the first iteration
    if i > 0:
        # Read in previous csv for comparison/add
        # Establish current DF - left over from previous way of running
        # past_posts = pd.read_csv('data_stats.csv')
        # current_df = pd.DataFrame(data_stats)

        # Append new and old
        total = pd.DataFrame(data_stats)

        # Convert to DF and save to new csv file
        pd.DataFrame(total).to_csv('data_stats.csv', index = False)

    else:
        pd.DataFrame(data_stats).to_csv('data_stats.csv', index = False)

    # Sleep to fit within Reddit's pull limit
    time.sleep(3)
# This was an older attempt at writing the function that I scrapped and decided to start fresh on:
# url_ds = "https://www.reddit.com/r/datascience.json?after=" + last_post_ds

# for i in range(25):
#     # Get the name of the last post
#     last_post_ds = data_ds['data']['after']

#     # Set the url from the last post
#     new_url_ds = "https://www.reddit.com/r/datascience.json?after=" + last_post_ds

#     # Perform request get
#     new_res_ds = res.get(new_url_ds, headers={'User-agent': 'TK Bot 0.1'})

#     # Convert the JSON to a dict
#     new_data_ds = new_res_ds.json()

#     # Add the new dict to the already existing one
#     data_ds.update(new_data_ds)
#     data_ds['data']['children'] = data_ds['data']['children'] + new_data_ds['data']['children']
#     data_ds['data']['after'] = new_data_ds['data']['after']

#     # Sleep
#     # time.sleep(3)
# Next few cells devoted to understanding how to generate a combined dict
# new_data_ds.items()
# OG_ds_data = data_ds.copy()
# new_data_ds = new_res_ds.json()
# data_ds.update(new_data_ds)
# Testing adding nested dictionaries to each other
# doubling_up = data_ds['data']['children'] + data_ds['data']['children']
# doubling_up

Save your results as a CSV

You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don’t lose all your data.

# My loop in the previous cell completes this step.

Read my files back in and clean them up / EDA

%pwd
'/Users/tomkelly/Desktop/general_assembly/DSI-US-5/project-3'
df_ds = pd.read_csv('./data_ds.csv')
df_stats = pd.read_csv('./data_stats.csv')
# 983 DS posts vs 978 stats posts
# df_ds.shape[0]
df_stats.shape[0]
978
# for i in df_ds.shape[0]
# Testing what I want to loop
df_ds['body'] = pd.Series(re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_ds['data'][0]))
df_ds.head()
data kind body
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice...
1 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
4 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
type(df_ds['body'].iloc[0,])
str
# To pull out the body of the post and make it a new column
for i in range(0, df_ds.shape[0]):
    try: #Since regex makes it a list, this helps deal with nulls
        df_ds['body'][i] = re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_ds['data'][i])[0]
    except:
        df_ds['body'][i] = ''
df_ds.head()
data kind body
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice...
1 {'approved_at_utc': None, 'subreddit': 'datasc... t3 \n\nWelcome to this week's 'Entering &amp; Tr...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3
4 {'approved_at_utc': None, 'subreddit': 'datasc... t3 I'm working on making a list of Machine Learni...
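The regex pull above works, but it leans on the exact field order inside the stringified dict. A less brittle alternative (a sketch only, assuming the data column still holds the repr of each post's dict) is to evaluate the string back into a dict and read 'selftext' and 'title' directly:

```python
import ast

def extract_fields(row):
    # The 'data' column stores repr(dict) from the scrape, not JSON, so use literal_eval
    try:
        post = ast.literal_eval(row)
        return pd.Series({'body': post.get('selftext', ''),
                          'title': post.get('title', '')})
    except (ValueError, SyntaxError):
        return pd.Series({'body': '', 'title': ''})

# df_ds[['body', 'title']] = df_ds['data'].apply(extract_fields)
```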
# For some reason, wrapping it in pd.Series makes this work before I loop it
try:
    df_ds['title'] = pd.Series(re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_ds['data'][0]))[0]
except:
    df_ds['title'] = ''
# To pull out the title of the post and make it a new column
for i in range(0, df_ds.shape[0]):
    try:
        df_ds['title'][i] = re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_ds['data'][i])[0]
    except:
        df_ds['title'][i] = ''
df_stats.shape[0]
978
df_ds.head()
data kind body title
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice... DS Book Suggestions/Recommendations Megathread
1 {'approved_at_utc': None, 'subreddit': 'datasc... t3 \n\nWelcome to this week's 'Entering &amp; Tr... Weekly 'Entering &amp; Transitioning' Thread. ...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 Mo Data, Mo Problems. Everyone always talks ab...
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 Make “Fairness by Design” Part of Machine Lear...
4 {'approved_at_utc': None, 'subreddit': 'datasc... t3 I'm working on making a list of Machine Learni... Papers with Code
# Looks like the body/title got pulled in as a list, turning it into a str
# This is leftover from an older method
# for w in range(0,df_ds['body'].shape[0]):
#     df_ds['body'][w] = str(df_ds['body'][w])
# Additional Clean-up - DS
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\\n',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\n',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace("\\'","'")) # un-escape apostrophes before stripping remaining backslashes
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\\',''))
# df_ds['body'] = [w.replace('/n', '') for w in df_ds['body']]
df_ds['body']
0      The Mod Team has decided that it would be nice...
1       Welcome to this week's 'Entering &amp; Transi...
2                                                       
3                                                       
4      I'm working on making a list of Machine Learni...
5      I do most of my work in Python. Building the m...
6      [Project Link](https://github.com/HiteshGorana...
7      Before I got hired, my company had a contracto...
8      I'm looking for an open-source web-based tool ...
9      I've been reading around online a bit as to wh...
10     I am new to time series data, so bear with me....
11                                                      
12     Hey all, Do people have recommendations for pi...
13     I am quite old (23), but would like to become ...
14                                                      
15     I know that python and R are the standard lang...
16     Which tools and packages do you use the most a...
17                                                      
18     Has anyone dealt with such a problem statement...
19                                                      
20                                                      
21     So, I'm trying to build playlists based on val...
22      My intents are to analyze the results with Ex...
23                                                      
24     Does anyone have experience in using either pl...
25     Since I started as a data scientist, I have be...
26                                                      
27     Good Afternoon Everyone,&amp;#x200B;I was work...
28     Hi all, this is a followup on [Separated from ...
29     This is maybe not a specific DS question, but ...
                             ...                        
953    Specifically, as AI gets better and better, an...
954    What is the difference between sklearn.impute....
955                                                     
956    ', 'author_fullname': 't2_pqifw', 'saved': Fal...
957    I have a prospective client who’s keen to do s...
958                                                     
959                                                     
960    Hello all!I have a final interview for a Sales...
961                                                     
962    So here’s a little about me. I’ve been a lead ...
963    Please shoo me away to the proper sub if I'm a...
964                                                     
965    What's the best open source (i.e., free) appro...
966    Bayesian Network is a probabilistic graphical ...
967    ', 'author_fullname': 't2_r3q3m', 'saved': Fal...
968                                                     
969                                                     
970    Hi, guys. I have a dataset of different addres...
971    I have been reading a lot of quora answers and...
972    This is my first kernel on Kaggle doing some d...
973    Hi Guys, I need some advise or personal experi...
974                                                     
975    I'm finding myself in a position where I may h...
976    I'm looking to make some data science projects...
977    Hi, this is my first post ever, so sorry in ad...
978    Cheers everyone! This is my first kernel on Ka...
979    Hello /r/datascience. TLDR: given the current ...
980    What data science course you studied from and ...
981                                                     
982    I'm looking for a ISO file of a distro that it...
Name: body, Length: 983, dtype: object
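The bodies above still carry HTML entities such as &amp; and &amp;#x200B; (Reddit's zero-width-space placeholder). An optional extra cleanup step, sketched with the standard library:

```python
from html import unescape

# Decode entities like &amp;, then drop the &#x200B; placeholders that remain
df_ds['body'] = df_ds['body'].map(lambda x: unescape(x).replace('&#x200B;', ' '))
```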
# Add target column for later combination
df_ds['subreddit_target'] = 1
# Check out the nulls
df_ds.isnull().sum().sort_values()
data                0
kind                0
body                0
title               0
subreddit_target    0
dtype: int64
# Same process of pulling out body/post for df_stats
try:
    df_stats['body'] = pd.Series(re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_stats['data'][0]))[0]
except:
    df_stats['body'] = ''
# To pull out the body of the post and make it a new column
for i in range(0, df_stats.shape[0]):
    try:
        df_stats['body'][i] = re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_stats['data'][i])[0]
    except:
        df_stats['body'][i] = ''  # set just this row (was overwriting the whole column)
# For some reason, wrapping it in pd.Series makes this work before I loop it
try:
    df_stats['title'] = pd.Series(re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_stats['data'][0]))[0]
except:
    df_stats['title'] = ''
# To pull out the title of the post and make it a new column
for i in range(0, df_stats.shape[0]):
    try:
        df_stats['title'][i] = re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_stats['data'][i])[0]
    except:
        df_stats['title'][i] = ''  # set just this row (was overwriting the whole column)
# Looks like the body got pulled in as a list, turning it into a str
# This is leftover from an older method
# for w in range(0,df_stats['body'].shape[0]):
#     df_stats['body'][w] = str(df_stats['body'][w])
# Additional Clean-up - Stats
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\\n',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\n',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace("\\'","'")) # un-escape apostrophes before stripping remaining backslashes
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\\',''))
# df_stats['body'] = [w.replace('/n', '') for w in df_stats['body']]
df_stats['subreddit_target'] = 0
# Check out the nulls
df_stats.isnull().sum().sort_values()
data                0
kind                0
body                0
title               0
subreddit_target    0
dtype: int64
# Renaming the columns so they're easier to discern
# Left over from previous way of solving
# df_ds.columns = ['data','kind','body_ds','title_ds']
# df_stats.columns = ['data','kind','body_stats','title_stats']
df_ds.head(1)
data kind body title subreddit_target
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice... DS Book Suggestions/Recommendations Megathread 1
# Create combined list for later usage
dflist = [df_ds, df_stats]
dfCombined = pd.concat(dflist, axis=0, sort=True)
dfCombined.head()
# .fillna(value=" ")
body data kind subreddit_target title
0 The Mod Team has decided that it would be nice... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 DS Book Suggestions/Recommendations Megathread
1 Welcome to this week's 'Entering &amp; Transi... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Weekly 'Entering &amp; Transitioning' Thread. ...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Mo Data, Mo Problems. Everyone always talks ab...
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Make “Fairness by Design” Part of Machine Lear...
4 I'm working on making a list of Machine Learni... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Papers with Code
# Check length is what I expected
dfCombined['body'].shape[0]
1961
dfCombined['title_body'] = dfCombined['body'] + dfCombined['title']
dfCombined
body data kind subreddit_target title title_body
0 The Mod Team has decided that it would be nice... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 DS Book Suggestions/Recommendations Megathread The Mod Team has decided that it would be nice...
1 Welcome to this week's 'Entering &amp; Transi... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Weekly 'Entering &amp; Transitioning' Thread. ... Welcome to this week's 'Entering &amp; Transi...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Mo Data, Mo Problems. Everyone always talks ab... Mo Data, Mo Problems. Everyone always talks ab...
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Make “Fairness by Design” Part of Machine Lear... Make “Fairness by Design” Part of Machine Lear...
4 I'm working on making a list of Machine Learni... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Papers with Code I'm working on making a list of Machine Learni...
5 I do most of my work in Python. Building the m... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Looking for resources to learn how to launch m... I do most of my work in Python. Building the m...
6 [Project Link](https://github.com/HiteshGorana... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 DataScience365 ( A project started recently to... [Project Link](https://github.com/HiteshGorana...
7 Before I got hired, my company had a contracto... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Anyone have experience parsing hospital data f... Before I got hired, my company had a contracto...
8 I'm looking for an open-source web-based tool ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Open Source Tools for Dashboard Design I'm looking for an open-source web-based tool ...
9 I've been reading around online a bit as to wh... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 MS online vs in-person I've been reading around online a bit as to wh...
10 I am new to time series data, so bear with me.... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Best method for predicting the likelihood of a... I am new to time series data, so bear with me....
11 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Very low cost cloud GPU instances (&lt;$0.15/h... Very low cost cloud GPU instances (&lt;$0.15/h...
12 Hey all, Do people have recommendations for pi... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Pipeline Versioning (Open Source / Free) What ... Hey all, Do people have recommendations for pi...
13 I am quite old (23), but would like to become ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Data Science and being a Quant: how transferab... I am quite old (23), but would like to become ...
14 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Data Democratization - Data and Analytics Take... Data Democratization - Data and Analytics Take...
15 I know that python and R are the standard lang... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Mathematica is the best tool for data science ... I know that python and R are the standard lang...
16 Which tools and packages do you use the most a... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 What tools do you actually use at work? Which tools and packages do you use the most a...
17 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Feature engineering that exploit symmetries ca... Feature engineering that exploit symmetries ca...
18 Has anyone dealt with such a problem statement... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 R clustering with maximum size per cluster Has anyone dealt with such a problem statement...
19 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Get free GPU for training with Google Colab - ... Get free GPU for training with Google Colab - ...
20 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 [Cheat Sheet] Snippets for Plotting With ggplot [Cheat Sheet] Snippets for Plotting With ggplot
21 So, I'm trying to build playlists based on val... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 How to use recommender Systems with Multiple "... So, I'm trying to build playlists based on val...
22 My intents are to analyze the results with Ex... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Please Take This Survey if You're a College Gr... My intents are to analyze the results with Ex...
23 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 How useful is a reference letter from an econ ... How useful is a reference letter from an econ ...
24 Does anyone have experience in using either pl... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 H2O.ai vs Datarobot? Your take Does anyone have experience in using either pl...
25 Since I started as a data scientist, I have be... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Are independent research papers useful for a d... Since I started as a data scientist, I have be...
26 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Super helpful cheat sheets for Keras, Numpy, P... Super helpful cheat sheets for Keras, Numpy, P...
27 Good Afternoon Everyone,&amp;#x200B;I was work... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Correlation Plot of a correlation matrix ( usi... Good Afternoon Everyone,&amp;#x200B;I was work...
28 Hi all, this is a followup on [Separated from ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Step down from Data Scientist in next job- how... Hi all, this is a followup on [Separated from ...
29 This is maybe not a specific DS question, but ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 How do you deal with post-job-interview though... This is maybe not a specific DS question, but ...
... ... ... ... ... ... ...
948 Hello all. I'm a grad school student who ended... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Need to Learn How to Use SPSS Syntax ASAP Hello all. I'm a grad school student who ended...
949 I have been reading the Wikipedia explanations... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 ELI5: bray curtis dissimilarity matrix and UPG... I have been reading the Wikipedia explanations...
950 Hello all.u200bThe survey: Our survey asks peo... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Weighting an online survey with a lot of unknowns Hello all.u200bThe survey: Our survey asks peo...
951 Hi everyone. I'm curious whether anyone knows ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Textbooks in statistics with great problem sets Hi everyone. I'm curious whether anyone knows ...
952 I am analyzing dyadic data in a multilevel mod... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Residuals plot: Is this autocorrelation? I am analyzing dyadic data in a multilevel mod...
953 How do you apply the Bonferroni correction if ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Bonferroni corrections How do you apply the Bonferroni correction if ...
954 Hello everyone, I'm looking for books which ta... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Resources for undergrad material in Python &am... Hello everyone, I'm looking for books which ta...
955 Hi there! I was hoping someone may be able to ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Unsure which test to use Hi there! I was hoping someone may be able to ...
956 I should preface this by saying I know very li... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Help with normalization of data I should preface this by saying I know very li...
957 Hey r/statistics, I need some advice on how to... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Advice on an epidemiology dataset Hey r/statistics, I need some advice on how to...
958 I'm facing 3 problems in my current analysis (... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Groupsize differences, unequal genders and g p... I'm facing 3 problems in my current analysis (...
959 I am working with panel data with n=30 and t=7... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 How to interpret counterintuitive signs from m... I am working with panel data with n=30 and t=7...
960 I just finished gelmans Bayesian data analysis... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Where to go after Gelman's BDA3? I just finished gelmans Bayesian data analysis...
961 Ignore for a moment the issues with NHST.If a ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 If you are working in the paradigm of NHST, wh... Ignore for a moment the issues with NHST.If a ...
962 For the “big” study this group says they hypot... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 How can I use pilot data to plan sample sizes ... For the “big” study this group says they hypot...
963 Hi. I need to write two predictive supply and ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Predictive supply and demand model Hi. I need to write two predictive supply and ...
964 An illustration of my issue: For e.g. X is a h... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Determining which variable is more affected An illustration of my issue: For e.g. X is a h...
965 A group of students takes a PRE test with 50 q... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Repeat measures t-test on exam data, but pre a... A group of students takes a PRE test with 50 q...
966 Trying to figure out that if I have 7 variable... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Easy question from one confused boi; 7 variabl... Trying to figure out that if I have 7 variable...
967 Hello,I’ve been doing some analysis regardin... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 How to deal with the log of a variable where s... Hello,I’ve been doing some analysis regardin...
968 Hi there, I'm a bit confused about usage of F... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Questions about Firth logistic regressions Hi there, I'm a bit confused about usage of F...
969 I have ranked preference data for 7 items. How... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Analyzing Ranked Preference Data I have ranked preference data for 7 items. How...
970 I am measuring the effect of scale on the numb... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Wondering which test to conduct and how to con... I am measuring the effect of scale on the numb...
971 I'm looking at some instruction/examples on A/... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 [Q] non-parametric, permutations A/B testing I'm looking at some instruction/examples on A/...
972 &amp;#x200B; {'approved_at_utc': None, 'subreddit': 'statis... t3 0 What is a good tutorial for learning how to ca... &amp;#x200B;What is a good tutorial for learni...
973 &amp;#x200B; {'approved_at_utc': None, 'subreddit': 'statis... t3 0 i'm a psych phd student who wants to befriend ... &amp;#x200B;i'm a psych phd student who wants ...
974 Howdy, So I’m in the beginnings of a PhD in ep... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Any grad students from other fields also looki... Howdy, So I’m in the beginnings of a PhD in ep...
975 Correlation And Causation By Examplehttp://blo... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Correlation And Causation By Example Correlation And Causation By Examplehttp://blo...
976 Hi all,&amp;#x200B;I'm having a bit of trouble... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Merging item responses into a single variable ... Hi all,&amp;#x200B;I'm having a bit of trouble...
977 Can somebody help this statistics rookie?Resea... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 [Question] Should I use a Two-way ANOVA? Can somebody help this statistics rookie?Resea...

1961 rows × 6 columns

# Save the cleaned-up product on the side
dfCombined.to_csv('Combined.csv', index = False)

NLP

Use CountVectorizer or TfidfVectorizer from scikit-learn to create features from the thread titles and descriptions (NOTE: not all threads have a description). A TF-IDF sketch follows the list below.

  • Examine using count or binary features in the model
  • Re-evaluate your models using these. Does this improve the model performance?
  • What text features are the most valuable?
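The cells below stick with CountVectorizer; a TfidfVectorizer variant is sketched here for comparison (assuming the same X_train/X_test split created below, with illustrative, untuned settings):

```python
# Sketch of the TF-IDF alternative; swap in for cvec below if desired
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), min_df=2)
X_train_tfidf = tvec.fit_transform(X_train)
X_test_tfidf = tvec.transform(X_test)

lr_tfidf = LogisticRegression()
lr_tfidf.fit(X_train_tfidf, y_train)
# lr_tfidf.score(X_test_tfidf, y_test)
```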

N-grams = 1

# Going back after the fact to add some obvious stop words
# This was from a 'normal' run of CountVectorizer, i.e. n-grams = 1
# 'amp' seems to be a bad HTML entity that got pulled in mistakenly
new_stop_words = {'science', 'like', 'https', 'com', 've', '10', '12', 'amp'}
stop_words = ENGLISH_STOP_WORDS.union(new_stop_words)
# Instantiate
cvec = CountVectorizer(stop_words=stop_words) # First run through of n-grams = 1
# Set variables and train_test_split
# Sticking with the normal 75/25 split
X = dfCombined['title_body'].values
y = dfCombined['subreddit_target']
# .map({'statistics':0, 'datascience':1})

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42)
# Fit and transform
cvec.fit(X_train)
X_train_transform = cvec.transform(X_train)
X_test_transform = cvec.transform(X_test)
df_view_stats = pd.DataFrame(X_test_transform.todense(),
                             columns=cvec.get_feature_names(),
                             index=y_test.index)
df_view_stats.head()
# .T.sort_values('statistics', ascending=False).head(10).T
00 000 0005 0016 0031 004 004100341sig 00411621sig 004p2 00625 ... zipper zippers zjt zones zoo zuckerberg zwitch zziz µᵢ χ2
113 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
572 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
450 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
383 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
506 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 8621 columns

# Most commonly used words on data science
# This was run multiple times for different sets of n-grams
word_count_test = pd.concat([df_view_stats, y_test], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='datascience', axis=1, ascending=False).T
subreddit_target datascience statistics
data 435 74
learning 97 17
work 79 10
time 74 36
python 70 5
model 67 34
know 65 25
false 64 1
using 63 16
use 61 28
looking 53 10
new 52 6
job 51 8
learn 51 5
just 51 11
dataset 44 5
want 44 21
tf 43 0
need 43 19
project 42 4
code 41 1
good 41 12
projects 41 7
way 39 19
set 39 10
tensorflow 39 0
lt 38 25
machine 38 4
analysis 38 12
working 37 7
... ... ...
ljung 0 0
classifying 0 0
livestream 0 0
classname 0 0
lived 0 0
cleanly 0 0
classification_report 0 0
classical 0 1
claim 0 0
class3 0 0
claimed 0 0
claims 0 0
clarify 0 1
lol 0 0
logs 0 1
logo 0 0
lognormal 0 0
logits 0 0
logit 0 1
logistics 0 0
clarifying 0 0
logical 0 0
logic 0 1
clarityhow 0 0
logarithms 0 0
logarithmicaly 0 0
class1 0 0
locked 0 0
class2 0 0
χ2 0 0

8621 rows × 2 columns

# Most commonly used words on statistics
# This was run multiple times for different sets of n-grams
word_count_test = pd.concat([df_view_stats, y_test], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='statistics', axis=1, ascending=False).T
subreddit_target datascience statistics
data 435 74
statistics 26 50
mean 6 48
variables 20 44
variable 15 42
test 15 42
help 36 41
time 74 36
regression 18 36
model 67 34
use 61 28
know 65 25
lt 38 25
question 34 25
11 5 23
different 27 22
distribution 3 21
x200b 25 21
make 34 21
want 44 21
way 39 19
number 15 19
need 43 19
09 2 18
statistical 18 18
sample 10 18
linear 17 18
day 15 18
population 0 17
15 7 17
... ... ...
fine 1 0
flagship 0 0
fishermen 0 0
flagged 1 0
flag 1 0
fizzle 0 0
fizzbuzz 0 0
fixing 0 0
fix 1 0
fivethirtyeight 0 0
fitted 0 0
fitness 0 0
fit_transform 1 0
fit2 0 0
fishing 0 0
fischer 0 0
finger 0 0
fiscal 0 0
firmly 0 0
firm 2 0
firing 0 0
firefox 0 0
fintech 0 0
finnoq 0 0
finnish 0 0
finland 0 0
finite 0 0
finishes 0 0
finished 2 0
χ2 0 0

8621 rows × 2 columns

N-grams = 2

# This was the second run of CountVectorizer, i.e. n-grams = 2
# I removed 'science' from the stop words because I wanted to differentiate between 'science' and 'data science', and also removed 've', which had only been added because it got picked up from "I've"
# Going to leave stop words as is for n-grams = 2, aside from the HTML artifacts that got pulled in
new_stop_words = {'amp', 'x200b', 'amp x200b'}
stop_words = ENGLISH_STOP_WORDS.union(new_stop_words)
# Instantiate
cvec2 = CountVectorizer(stop_words=stop_words, ngram_range=(2,2)) #Second run through of n-grams = 2
# Set variables and train_test_split
# Sticking with the normal 75/25 split
X = dfCombined['title_body'].values
y = dfCombined['subreddit_target']
# .map({'statistics':0, 'datascience':1})

X_train2, X_test2, y_train2, y_test2 = train_test_split(X,
                                                    y,
                                                    random_state=42)

# Fit and transform
cvec2.fit(X_train2)
X_train_transform2 = cvec2.transform(X_train2)
X_test_transform2 = cvec2.transform(X_test2)
df_view_stats2 = pd.DataFrame(X_test_transform2.todense(),
                             columns=cvec2.get_feature_names(),
                             index=y_test2.index)
df_view_stats2.head()
# .T.sort_values('statistics', ascending=False).head(10).T
00 00 00 29s2 00 9987 00 cheap 00 cost 00 established 00 mean 00 primarily 00 went 000 10 ... zippers validate zjt vector zones topping zoo ggplot2 zuckerberg eric zwitch mapd zziz pwcpapers µᵢ fixed χ2 05 χ2 distribution
113 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
572 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
450 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
383 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
506 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 43537 columns

# Most commonly used words on data science
word_count_test = pd.concat([df_view_stats2, y_test2], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='datascience', axis=1, ascending=False).T
subreddit_target datascience statistics
data science 127 5
machine learning 38 2
data scientist 30 0
https www 26 0
data scientists 20 0
gt lt 17 0
tensorflow js 16 0
https github 15 0
statistical learning 15 2
github com 15 0
data analyst 15 1
kaggle com 13 0
www kaggle 13 0
time series 13 3
https redd 12 0
data analytics 12 0
feel like 12 0
https youtu 10 0
data set 10 3
open source 10 0
greatly appreciated 9 1
linear algebra 8 3
don know 8 2
scikit learn 8 0
data analysis 7 1
work data 7 0
lt script 7 0
sql queries 7 0
little bit 7 0
new data 7 0
... ... ...
gallery 1hbpy1w 0 0
gallery ehcawau 0 0
gallery ej9di3f 0 0
gallery html 0 0
gallery http 0 0
gallery o45qf8o 0 0
gallery olzrzxz 0 0
gallery plotly 0 0
gallery wtdpir3 0 0
gain round 0 0
gain opinions 0 0
gain followers 0 0
future timeseries 0 0
future performance 0 0
future price 0 0
future research 0 0
future researcherdon 0 0
future statistical 0 0
future thoughts 0 0
future time 0 0
future using 0 0
gain academic 0 0
future weather 0 0
fyi data 0 0
fyi learning 0 0
g1 mn 0 0
g2 13 0 0
ga 90 0 0
ga tools 0 0
χ2 distribution 0 0

43537 rows × 2 columns

# Most commonly used words on statistics
word_count_test = pd.concat([df_view_stats2, y_test2], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='statistics', axis=1, ascending=False).T
subreddit_target datascience statistics
standard deviation 0 6
linear regression 3 6
non stationary 0 6
https imgur 3 5
regression model 2 5
independent variables 0 5
don think 0 5
imgur com 3 5
make sense 1 5
data science 127 5
things like 4 5
normally distributed 0 4
logistic regressions 0 4
prediction model 1 4
need help 4 4
normal distribution 0 4
hypothesis testing 0 4
capture recapture 0 4
comp sci 3 4
random sample 0 4
average mean 0 4
post test 0 3
real time 2 3
pre post 0 3
make statistical 0 3
data excel 0 3
hotspot mapping 0 3
independent variable 0 3
index variables 0 3
statistical curve 0 3
... ... ...
frames day 0 0
framework aware 0 0
framework building 0 0
framework cheersbest 0 0
framework consistent 0 0
framework guidance 0 0
framework implemented 0 0
framework interactive 0 0
fragments feeding 0 0
fraction discard 0 0
forward similar 0 0
fpsyg 2018 0 0
forward want 0 0
forxa03xa0months july 0 0
foundation hiring 0 0
foundation mathematics 0 0
foundation prior 0 0
foundations predictive 0 0
foundations python 0 0
founder kdnuggets 0 0
fourmilab ch 0 0
fourth generate 0 0
foxes hounds 0 0
foxes immediately 0 0
foxes seven 0 0
foxhole inside 0 0
fp growth 0 0
fp persons 0 0
fpsyg 09 0 0
χ2 distribution 0 0

43537 rows × 2 columns

# Instantiate and fit
lr2 = LogisticRegression()
lr2.fit(X_train_transform2, y_train2)
lr2.score(X_train_transform2, y_train2)
0.9863945578231292
lr2.score(X_test_transform2, y_test2)
# Looks like a pretty decent overfit
0.7637474541751528

Predicting subreddit using Random Forests + Another Classifier

# Instantiate and fit
# From here on out, it's n-grams = 1
lr = LogisticRegression()
lr.fit(X_train_transform, y_train)
lr.score(X_train_transform, y_train)
0.9897959183673469
lr.score(X_test_transform, y_test)
# Looks like a pretty decent overfit
0.8757637474541752

We want to predict a binary variable - class 0 for one of your subreddits and 1 for the other.

preds = lr.predict(X_test_transform)
pred_proba = lr.predict_proba(X_test_transform)[:,1]
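# Note: roc_auc_score is fed the hard class predictions here; using pred_proba
# (already computed above) is the more usual input for a ROC AUC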
roc_auc = roc_auc_score(y_test, preds)
roc_auc
0.8772046367954297
roc_auc = roc_auc_score(y_test, preds)
FPR, TPR, thresholds = roc_curve(y_test, pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(FPR, TPR, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.title('ROC-AUC (n-grams=1)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.plot([0, 1], [0, 1],'r--')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve for the logistic regression model (n-grams = 1)]

Thought experiment: What is the baseline accuracy for this model?

## I'm going to take an educated guess that the baseline accuracy is roughly 50%: the two classes are nearly balanced, so always predicting the majority class does about as well as random guessing
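That guess can be checked directly from the class balance (983 r/datascience posts vs 978 r/statistics posts):

```python
# Baseline = always predicting the majority class
dfCombined['subreddit_target'].value_counts(normalize=True).max()
# 983 / 1961 ≈ 0.501, so ~50% is indeed the accuracy to beat
```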

Create a RandomForestClassifier model to predict which subreddit a given post belongs to.

# Instantiate
rf = RandomForestClassifier()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Use cross-validation in scikit-learn to evaluate the model above.

  • Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate.
  • Bonus: Use GridSearchCV with Pipeline to optimize your CountVectorizer/TfidfVectorizer and classification model (a sketch follows the cross-validation results below).
cvs_train = cross_val_score(rf, X_train_transform, y_train, cv=cv, n_jobs=-1)

print(cvs_train)
print(cvs_train.mean())
[0.81292517 0.86734694 0.79591837 0.78231293 0.79931973]
0.8115646258503402
cvs_test = cross_val_score(rf, X_test_transform, y_test, cv=cv, n_jobs=-1)

print(cvs_test)
print(cvs_test.mean())
# Still slight overfit
[0.70707071 0.71717172 0.76767677 0.82474227 0.74226804]
0.7517859002395084
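The bonus above (GridSearchCV with a Pipeline) was not run in this notebook; a minimal sketch of what it could look like, feeding in the raw text split from earlier with an illustrative, untuned parameter grid:

```python
# Sketch only: the pipeline lets the grid tune the vectorizer and classifier together
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('rf', RandomForestClassifier(random_state=42)),
])
params = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__max_features': [2000, 5000, None],
    'rf__n_estimators': [50, 100],
}
gs = GridSearchCV(pipe, param_grid=params, cv=cv, n_jobs=-1)
# gs.fit(X_train, y_train)   # raw text goes in, not the transformed matrices
# gs.best_params_, gs.best_score_
```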

Repeat the model-building process using a different classifier (e.g. MultinomialNB, LogisticRegression, etc)

MultinomialNB

mnb = MultinomialNB()
cvs_train = cross_val_score(mnb, X_train_transform, y_train, cv=cv, n_jobs=-1)

print(cvs_train)
print(cvs_train.mean())
[0.82312925 0.8537415  0.81632653 0.80272109 0.81972789]
0.8231292517006802
cvs_test = cross_val_score(mnb, X_test_transform, y_test, cv=cv, n_jobs=-1)

print(cvs_test)
print(cvs_test.mean())
# Not as bad of an overfit
[0.7979798  0.75757576 0.7979798  0.81443299 0.77319588]
0.788232843902947

GaussianNB

gnb = GaussianNB()
cvs_train = cross_val_score(gnb, X_train_transform.toarray(), y_train, cv=cv, n_jobs=-1)

print(cvs_train)
print(cvs_train.mean())
[0.77891156 0.80612245 0.78231293 0.81292517 0.80952381]
0.7979591836734693
cvs_test = cross_val_score(gnb, X_test_transform.toarray(), y_test, cv=cv, n_jobs=-1)

print(cvs_test)
print(cvs_test.mean())
# Overfit isn't as much of a problem on this model
# However, the overall score isn't as strong as the other models
[0.71717172 0.76767677 0.80808081 0.78350515 0.70103093]
0.7554930750807038

Executive Summary


Put your executive summary in a Markdown cell below.

Reclassifying all of Reddit is an incredibly daunting task. However, the machine learning and natural language processing capabilities of Python can turn it into a manageable one. Reddit calls itself the 'front page of the internet,' and, true to the innovation that drove the creation of the internet, Reddit can innovate to overcome this challenge as it has countless obstacles before it.

Specifically, the distinction between r/DataScience and r/Statistics is relatively subtle, as these subreddits generally discuss similar ideas and concepts. Despite these similarities, I believe my models performed quite well (especially my first run of Logistic Regression using n-grams = 1). Additionally, I chose to add specific stop words ('science', 'https', 'com') that would otherwise too easily identify r/DataScience as the correct subreddit, in order to 'challenge' my modeling and evaluation skills, as well as to allow this process to be applied more generally across Reddit's various subreddits. Keeping those words as features would have increased my models' classifying ability even further.

Finally, I believe that this machine learning/NLP process can be applied to Reddit as a whole to help reclassify and realign its subreddits with a high degree of success. Coupled with Reddit's strong community, including its committed mods, this is a challenge that Reddit can overcome, and it may come out stronger because of it.