Thomas Kelly

Classifying Reddit Posts: r/DataScience vs r/Statistics

Published 30 Sep 2018.

Using Reddit’s API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request, and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: What characteristics of a post on Reddit contribute most to what subreddit it belongs to?

Your method for acquiring the data will be scraping threads from at least two subreddits.

Once you’ve got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

Scraping Thread Info from Reddit.com

Set up a request (using requests) to the URL below.

NOTE: Reddit will throw a 429 error when using the following code:

res = requests.get(URL)

This is because Reddit has throttled Python's default user agent. You'll need to set a custom User-agent to get your request to work.

res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
# Imports - used or otherwise.
import pandas as pd
import requests
import json
import time
import regex as re
import praw
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
# Create the URL variables
URL_ds = "http://www.reddit.com/r/datascience.json"
URL_stats = "https://www.reddit.com/r/statistics.json"
# Authenticating via OAuth for praw
reddit = praw.Reddit(client_id='AOJTLQLavhOXPg',
                     client_secret='eS08QOpy2lWh37qkVBGlN7yMjRI',
                     username='TCRAY_DSI',
                     password='dsi123',
                     user_agent='TK Bot 0.1')
# Check
print(reddit.user.me())
TCRAY_DSI
# Create subs for praw:
sub_ds = reddit.subreddit('datascience')
sub_stats = reddit.subreddit('statistics')
# Create top pulls
top_ds = sub_ds.top(time_filter='year')
top_stats = sub_stats.top(time_filter='year')
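These praw listings are lazy generators and are not used further below; the JSON endpoint is what actually feeds the dataset. For reference, a minimal sketch of pulling the same fields through praw (hypothetical, not run here):

```python
# Hypothetical praw-based pull; each item is a praw Submission with .title and .selftext
praw_posts = [{'title': post.title, 'body': post.selftext}
              for post in sub_ds.top(time_filter='year', limit=100)]
```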
# These were used and attempted (successfully) before creating the loop
# Request the JSON files
# I did them in separate cells to space out the scraping, so Reddit wouldn't throw a 429 error
# res_ds = res.get(URL_ds, headers={'User-agent': 'TK Bot 0.1'})
# res_stats = requests.get(URL_stats, headers={'User-agent': 'TK Bot 0.1'})
# res_stats.status_code

Use res.json() to convert the response into a dictionary format and set this to a variable.

data = res.json()
# These were used and attempted (successfully) before creating the loop
# Convert the JSON responses
# data_ds = res_ds.json()
# data_stats = res_stats.json()
# Check out data
# data_ds
# data_stats
# Testing adding nested dictionaries to each other
# doubling_up = data_ds['data']['children'] + data_ds['data']['children']
# doubling_up

Getting more results

By default, Reddit will give you the top 25 posts:

print(len(data['data']['children']))

If you want more, you’ll need to do the following:

  1. Get the name of the last post: data['data']['after']
  2. Use that name to hit the following url: http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1
  3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts (a minimal sketch follows the note below).

NOTE: Reddit will limit the number of requests per second you’re allowed to make. When you create your loop, be sure to add the following after each iteration.

time.sleep(3) # sleeps 3 seconds before continuing

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
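A minimal sketch of that loop, assuming `requests` and `time` are imported and `URL` is one of the `.json` endpoints defined earlier:

```python
# Sketch of the paginated pull described above (not the full loop used later)
posts, after = [], ''
for _ in range(4):                                   # 4 * 25 = 100 posts; adjust as needed
    res = requests.get(URL + after, headers={'User-agent': 'YOUR NAME Bot 0.1'})
    data = res.json()
    posts.extend(data['data']['children'])           # step 1's payload
    after = '?after=' + str(data['data']['after'])   # step 2's query string
    time.sleep(3)                                    # stay within Reddit's rate limit
```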


# Check out length
# print(len(data_ds['data']['children']))
# print(len(data_stats['data']['children']))
# Test the last post pull
# data_ds['data']['after']
# For DS set - previously run to generate the CSV
url_ds = "https://www.reddit.com/r/datascience.json"
data_ds = []
total = []
next_get = ''

# I went with 40 b/c 40 * 25 = 1000 posts total
for i in range(40):

    # Request get
    res = requests.get(url_ds+next_get, headers={'User-agent': 'TK Bot 0.1'})

    # Convert the JSON
    new_dict = res.json()

    # Add to already collected data set
    data_ds.extend(new_dict['data']['children'])

    # Collect 'after' from new dict to generate next URL
    new_url_end = str(new_dict['data']['after'])

    # Generate the next URL
    next_get = '?after='+new_url_end

    # CSV add/update along with DF creation
    # Chose greater than 0 so the else executes on the first iteration
    if i > 0:
        # Read in previous csv for comparison/add
        # Establish current DF - left over from previous way of running
        # past_posts = pd.read_csv('data_ds.csv')
        # current_df = pd.DataFrame(data_ds)

        # Append new and old
        total = pd.DataFrame(data_ds)

        # Convert to DF and save to new csv file
        pd.DataFrame(total).to_csv('data_ds.csv', index = False)

    else:
        pd.DataFrame(data_ds).to_csv('data_ds.csv', index = False)

    # Sleep to fit within Reddit's pull limit
    time.sleep(3)
# For stats set - previously run to generate the CSV
url_stats = "https://www.reddit.com/r/statistics.json"
data_stats = []
total = []
next_get = ''

# I went with 40 b/c 40 * 25 = 1000 posts total
for i in range(40):

    # Request get
    res = requests.get(url_stats+next_get, headers={'User-agent': 'TK Bot 0.1'})

    # Convert the JSON
    new_dict = res.json()

    # Add to already collected data set
    data_stats.extend(new_dict['data']['children'])

    # Collect 'after' from new dict to generate next URL
    new_url_end = str(new_dict['data']['after'])

    # Generate the next URL
    next_get = '?after='+new_url_end

    # CSV add/update along with DF creation
    # Chose greater than 0 so the else executes on the first iteration
    if i > 0:
        # Read in previous csv for comparison/add
        # Establish current DF - left over from previous way of running
        # past_posts = pd.read_csv('data_stats.csv')
        # current_df = pd.DataFrame(data_stats)

        # Append new and old
        total = pd.DataFrame(data_stats)

        # Convert to DF and save to new csv file
        pd.DataFrame(total).to_csv('data_stats.csv', index = False)

    else:
        pd.DataFrame(data_stats).to_csv('data_stats.csv', index = False)

    # Sleep to fit within Reddit's pull limit
    time.sleep(3)
# This was an older attempt at writing the function that I scrapped and decided to start fresh on:
# url_ds = "https://www.reddit.com/r/datascience.json?after=" + last_post_ds

# for i in range(25):
#     # Get the name of the last post
#     last_post_ds = data_ds['data']['after']

#     # Set the url from the last post
#     new_url_ds = "https://www.reddit.com/r/datascience.json?after=" + last_post_ds

#     # Perform request get
#     new_res_ds = res.get(new_url_ds, headers={'User-agent': 'TK Bot 0.1'})

#     # Convert the JSON to a dict
#     new_data_ds = new_res_ds.json()

#     # Add the new dict to the already existing one
#     data_ds.update(new_data_ds)
#     data_ds['data']['children'] = data_ds['data']['children'] + new_data_ds['data']['children']
#     data_ds['data']['after'] = new_data_ds['data']['after']

#     # Sleep
#     # time.sleep(3)
# Next few cells devoted to understanding how to generate a combined dict
# new_data_ds.items()
# OG_ds_data = data_ds.copy()
# new_data_ds = new_res_ds.json()
# data_ds.update(new_data_ds)
# Testing adding nested dictionaries to each other
# doubling_up = data_ds['data']['children'] + data_ds['data']['children']
# doubling_up

Save your results as a CSV

You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don’t lose all your data.

# My loop in the previous cell completes this step.

Read my files back in and clean them up / EDA

%pwd
'/Users/tomkelly/Desktop/general_assembly/DSI-US-5/project-3'
df_ds = pd.read_csv('./data_ds.csv')
df_stats = pd.read_csv('./data_stats.csv')
# 983 DS posts vs 978 stats posts
# df_ds.shape[0]
df_stats.shape[0]
978
# for i in df_ds.shape[0]
# Testing what I want to loop
df_ds['body'] = pd.Series(re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_ds['data'][0]))
df_ds.head()
data kind body
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice...
1 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
4 {'approved_at_utc': None, 'subreddit': 'datasc... t3 NaN
type(df_ds['body'].iloc[0,])
str
# To pull out the body of the post and make it a new column
for i in range(0, df_ds.shape[0]):
    try: #Since regex makes it a list, this helps deal with nulls
        df_ds['body'][i] = re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_ds['data'][i])[0]
    except:
        df_ds['body'][i] = ''
df_ds.head()
data kind body
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice...
1 {'approved_at_utc': None, 'subreddit': 'datasc... t3 \n\nWelcome to this week's 'Entering &amp; Tr...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3
4 {'approved_at_utc': None, 'subreddit': 'datasc... t3 I'm working on making a list of Machine Learni...
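The regex pull above works, but it leans on the exact field order inside the stringified dict. A less brittle alternative (a sketch only, assuming the data column still holds the repr of each post's dict) is to evaluate the string back into a dict and read 'selftext' and 'title' directly:

```python
import ast

def extract_fields(row):
    # The 'data' column stores repr(dict) from the scrape, not JSON, so use literal_eval
    try:
        post = ast.literal_eval(row)
        return pd.Series({'body': post.get('selftext', ''),
                          'title': post.get('title', '')})
    except (ValueError, SyntaxError):
        return pd.Series({'body': '', 'title': ''})

# df_ds[['body', 'title']] = df_ds['data'].apply(extract_fields)
```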
# For some reason, wrapping it in pd.Series makes this work before I loop it
try:
    df_ds['title'] = pd.Series(re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_ds['data'][0]))[0]
except:
    df_ds['title'] = ''
# To pull out the title of the post and make it a new column
for i in range(0, df_ds.shape[0]):
    try:
        df_ds['title'][i] = re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_ds['data'][i])[0]
    except:
        df_ds['title'][i] = ''
df_stats.shape[0]
978
df_ds.head()
data kind body title
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice... DS Book Suggestions/Recommendations Megathread
1 {'approved_at_utc': None, 'subreddit': 'datasc... t3 \n\nWelcome to this week's 'Entering &amp; Tr... Weekly 'Entering &amp; Transitioning' Thread. ...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 Mo Data, Mo Problems. Everyone always talks ab...
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 Make “Fairness by Design” Part of Machine Lear...
4 {'approved_at_utc': None, 'subreddit': 'datasc... t3 I'm working on making a list of Machine Learni... Papers with Code
# Looks like the body/title got pulled in as a list, turning it into a str
# This is leftover from an older method
# for w in range(0,df_ds['body'].shape[0]):
#     df_ds['body'][w] = str(df_ds['body'][w])
# Additional Clean-up - DS
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\\n',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\n',''))
df_ds['body'] = df_ds['body'].map(lambda x: x.replace("\\'","'")) # un-escape apostrophes before stripping remaining backslashes
df_ds['body'] = df_ds['body'].map(lambda x: x.replace('\\',''))
# df_ds['body'] = [w.replace('/n', '') for w in df_ds['body']]
df_ds['body']
0      The Mod Team has decided that it would be nice...
1       Welcome to this week's 'Entering &amp; Transi...
2                                                       
3                                                       
4      I'm working on making a list of Machine Learni...
5      I do most of my work in Python. Building the m...
6      [Project Link](https://github.com/HiteshGorana...
7      Before I got hired, my company had a contracto...
8      I'm looking for an open-source web-based tool ...
9      I've been reading around online a bit as to wh...
10     I am new to time series data, so bear with me....
11                                                      
12     Hey all, Do people have recommendations for pi...
13     I am quite old (23), but would like to become ...
14                                                      
15     I know that python and R are the standard lang...
16     Which tools and packages do you use the most a...
17                                                      
18     Has anyone dealt with such a problem statement...
19                                                      
20                                                      
21     So, I'm trying to build playlists based on val...
22      My intents are to analyze the results with Ex...
23                                                      
24     Does anyone have experience in using either pl...
25     Since I started as a data scientist, I have be...
26                                                      
27     Good Afternoon Everyone,&amp;#x200B;I was work...
28     Hi all, this is a followup on [Separated from ...
29     This is maybe not a specific DS question, but ...
                             ...                        
953    Specifically, as AI gets better and better, an...
954    What is the difference between sklearn.impute....
955                                                     
956    ', 'author_fullname': 't2_pqifw', 'saved': Fal...
957    I have a prospective client who’s keen to do s...
958                                                     
959                                                     
960    Hello all!I have a final interview for a Sales...
961                                                     
962    So here’s a little about me. I’ve been a lead ...
963    Please shoo me away to the proper sub if I'm a...
964                                                     
965    What's the best open source (i.e., free) appro...
966    Bayesian Network is a probabilistic graphical ...
967    ', 'author_fullname': 't2_r3q3m', 'saved': Fal...
968                                                     
969                                                     
970    Hi, guys. I have a dataset of different addres...
971    I have been reading a lot of quora answers and...
972    This is my first kernel on Kaggle doing some d...
973    Hi Guys, I need some advise or personal experi...
974                                                     
975    I'm finding myself in a position where I may h...
976    I'm looking to make some data science projects...
977    Hi, this is my first post ever, so sorry in ad...
978    Cheers everyone! This is my first kernel on Ka...
979    Hello /r/datascience. TLDR: given the current ...
980    What data science course you studied from and ...
981                                                     
982    I'm looking for a ISO file of a distro that it...
Name: body, Length: 983, dtype: object
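The bodies above still carry HTML entities such as &amp; and &amp;#x200B; (Reddit's zero-width-space placeholder). An optional extra cleanup step, sketched with the standard library:

```python
from html import unescape

# Decode entities like &amp;, then drop the &#x200B; placeholders that remain
df_ds['body'] = df_ds['body'].map(lambda x: unescape(x).replace('&#x200B;', ' '))
```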
# Add target column for later combination
df_ds['subreddit_target'] = 1
# Check out the nulls
df_ds.isnull().sum().sort_values()
data                0
kind                0
body                0
title               0
subreddit_target    0
dtype: int64
# Same process of pulling out body/post for df_stats
try:
    df_stats['body'] = pd.Series(re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_stats['data'][0]))[0]
except:
    df_stats['body'] = ''
# To pull out the body of the post and make it a new column
for i in range(0, df_stats.shape[0]):
    try:
        df_stats['body'][i] = re.findall('(?<=selftext).{4}(.*).{4}(?=author_fullname)', df_stats['data'][i])[0]
    except:
        df_stats['body'][i] = ''  # set just this row (was overwriting the whole column)
# For some reason, wrapping it in pd.Series makes this work before I loop it
try:
    df_stats['title'] = pd.Series(re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_stats['data'][0]))[0]
except:
    df_stats['title'] = ''
# To pull out the title of the post and make it a new column
for i in range(0, df_stats.shape[0]):
    try:
        df_stats['title'][i] = re.findall('(?<= .title).{4}(.*).{4}(?=link_flair_richtext)', df_stats['data'][i])[0]
    except:
        df_stats['title'][i] = ''  # set just this row (was overwriting the whole column)
# Looks like the body got pulled in as a list, turning it into a str
# This is leftover from an older method
# for w in range(0,df_stats['body'].shape[0]):
#     df_stats['body'][w] = str(df_stats['body'][w])
# Additional Clean-up - Stats
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\\n',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\n',''))
df_stats['body'] = df_stats['body'].map(lambda x: x.replace("\\'","'")) # un-escape apostrophes before stripping remaining backslashes
df_stats['body'] = df_stats['body'].map(lambda x: x.replace('\\',''))
# df_stats['body'] = [w.replace('/n', '') for w in df_stats['body']]
df_stats['subreddit_target'] = 0
# Check out the nulls
df_stats.isnull().sum().sort_values()
data                0
kind                0
body                0
title               0
subreddit_target    0
dtype: int64
# Renaming the columns so they're easier to discern
# Left over from previous way of solving
# df_ds.columns = ['data','kind','body_ds','title_ds']
# df_stats.columns = ['data','kind','body_stats','title_stats']
df_ds.head(1)
data kind body title subreddit_target
0 {'approved_at_utc': None, 'subreddit': 'datasc... t3 The Mod Team has decided that it would be nice... DS Book Suggestions/Recommendations Megathread 1
# Create combined list for later usage
dflist = [df_ds, df_stats]
dfCombined = pd.concat(dflist, axis=0, sort=True)
dfCombined.head()
# .fillna(value=" ")
body data kind subreddit_target title
0 The Mod Team has decided that it would be nice... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 DS Book Suggestions/Recommendations Megathread
1 Welcome to this week's 'Entering &amp; Transi... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Weekly 'Entering &amp; Transitioning' Thread. ...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Mo Data, Mo Problems. Everyone always talks ab...
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Make “Fairness by Design” Part of Machine Lear...
4 I'm working on making a list of Machine Learni... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Papers with Code
# Check length is what I expected
dfCombined['body'].shape[0]
1961
dfCombined['title_body'] = dfCombined['body'] + dfCombined['title']
dfCombined
body data kind subreddit_target title title_body
0 The Mod Team has decided that it would be nice... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 DS Book Suggestions/Recommendations Megathread The Mod Team has decided that it would be nice...
1 Welcome to this week's 'Entering &amp; Transi... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Weekly 'Entering &amp; Transitioning' Thread. ... Welcome to this week's 'Entering &amp; Transi...
2 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Mo Data, Mo Problems. Everyone always talks ab... Mo Data, Mo Problems. Everyone always talks ab...
3 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Make “Fairness by Design” Part of Machine Lear... Make “Fairness by Design” Part of Machine Lear...
4 I'm working on making a list of Machine Learni... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Papers with Code I'm working on making a list of Machine Learni...
5 I do most of my work in Python. Building the m... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Looking for resources to learn how to launch m... I do most of my work in Python. Building the m...
6 [Project Link](https://github.com/HiteshGorana... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 DataScience365 ( A project started recently to... [Project Link](https://github.com/HiteshGorana...
7 Before I got hired, my company had a contracto... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Anyone have experience parsing hospital data f... Before I got hired, my company had a contracto...
8 I'm looking for an open-source web-based tool ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Open Source Tools for Dashboard Design I'm looking for an open-source web-based tool ...
9 I've been reading around online a bit as to wh... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 MS online vs in-person I've been reading around online a bit as to wh...
10 I am new to time series data, so bear with me.... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Best method for predicting the likelihood of a... I am new to time series data, so bear with me....
11 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Very low cost cloud GPU instances (&lt;$0.15/h... Very low cost cloud GPU instances (&lt;$0.15/h...
12 Hey all, Do people have recommendations for pi... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Pipeline Versioning (Open Source / Free) What ... Hey all, Do people have recommendations for pi...
13 I am quite old (23), but would like to become ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Data Science and being a Quant: how transferab... I am quite old (23), but would like to become ...
14 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Data Democratization - Data and Analytics Take... Data Democratization - Data and Analytics Take...
15 I know that python and R are the standard lang... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Mathematica is the best tool for data science ... I know that python and R are the standard lang...
16 Which tools and packages do you use the most a... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 What tools do you actually use at work? Which tools and packages do you use the most a...
17 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Feature engineering that exploit symmetries ca... Feature engineering that exploit symmetries ca...
18 Has anyone dealt with such a problem statement... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 R clustering with maximum size per cluster Has anyone dealt with such a problem statement...
19 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Get free GPU for training with Google Colab - ... Get free GPU for training with Google Colab - ...
20 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 [Cheat Sheet] Snippets for Plotting With ggplot [Cheat Sheet] Snippets for Plotting With ggplot
21 So, I'm trying to build playlists based on val... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 How to use recommender Systems with Multiple "... So, I'm trying to build playlists based on val...
22 My intents are to analyze the results with Ex... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Please Take This Survey if You're a College Gr... My intents are to analyze the results with Ex...
23 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 How useful is a reference letter from an econ ... How useful is a reference letter from an econ ...
24 Does anyone have experience in using either pl... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 H2O.ai vs Datarobot? Your take Does anyone have experience in using either pl...
25 Since I started as a data scientist, I have be... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Are independent research papers useful for a d... Since I started as a data scientist, I have be...
26 {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Super helpful cheat sheets for Keras, Numpy, P... Super helpful cheat sheets for Keras, Numpy, P...
27 Good Afternoon Everyone,&amp;#x200B;I was work... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Correlation Plot of a correlation matrix ( usi... Good Afternoon Everyone,&amp;#x200B;I was work...
28 Hi all, this is a followup on [Separated from ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 Step down from Data Scientist in next job- how... Hi all, this is a followup on [Separated from ...
29 This is maybe not a specific DS question, but ... {'approved_at_utc': None, 'subreddit': 'datasc... t3 1 How do you deal with post-job-interview though... This is maybe not a specific DS question, but ...
... ... ... ... ... ... ...
948 Hello all. I'm a grad school student who ended... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Need to Learn How to Use SPSS Syntax ASAP Hello all. I'm a grad school student who ended...
949 I have been reading the Wikipedia explanations... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 ELI5: bray curtis dissimilarity matrix and UPG... I have been reading the Wikipedia explanations...
950 Hello all.u200bThe survey: Our survey asks peo... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Weighting an online survey with a lot of unknowns Hello all.u200bThe survey: Our survey asks peo...
951 Hi everyone. I'm curious whether anyone knows ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Textbooks in statistics with great problem sets Hi everyone. I'm curious whether anyone knows ...
952 I am analyzing dyadic data in a multilevel mod... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Residuals plot: Is this autocorrelation? I am analyzing dyadic data in a multilevel mod...
953 How do you apply the Bonferroni correction if ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Bonferroni corrections How do you apply the Bonferroni correction if ...
954 Hello everyone, I'm looking for books which ta... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Resources for undergrad material in Python &am... Hello everyone, I'm looking for books which ta...
955 Hi there! I was hoping someone may be able to ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Unsure which test to use Hi there! I was hoping someone may be able to ...
956 I should preface this by saying I know very li... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Help with normalization of data I should preface this by saying I know very li...
957 Hey r/statistics, I need some advice on how to... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Advice on an epidemiology dataset Hey r/statistics, I need some advice on how to...
958 I'm facing 3 problems in my current analysis (... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Groupsize differences, unequal genders and g p... I'm facing 3 problems in my current analysis (...
959 I am working with panel data with n=30 and t=7... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 How to interpret counterintuitive signs from m... I am working with panel data with n=30 and t=7...
960 I just finished gelmans Bayesian data analysis... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Where to go after Gelman's BDA3? I just finished gelmans Bayesian data analysis...
961 Ignore for a moment the issues with NHST.If a ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 If you are working in the paradigm of NHST, wh... Ignore for a moment the issues with NHST.If a ...
962 For the “big” study this group says they hypot... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 How can I use pilot data to plan sample sizes ... For the “big” study this group says they hypot...
963 Hi. I need to write two predictive supply and ... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Predictive supply and demand model Hi. I need to write two predictive supply and ...
964 An illustration of my issue: For e.g. X is a h... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Determining which variable is more affected An illustration of my issue: For e.g. X is a h...
965 A group of students takes a PRE test with 50 q... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Repeat measures t-test on exam data, but pre a... A group of students takes a PRE test with 50 q...
966 Trying to figure out that if I have 7 variable... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Easy question from one confused boi; 7 variabl... Trying to figure out that if I have 7 variable...
967 Hello,I’ve been doing some analysis regardin... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 How to deal with the log of a variable where s... Hello,I’ve been doing some analysis regardin...
968 Hi there, I'm a bit confused about usage of F... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Questions about Firth logistic regressions Hi there, I'm a bit confused about usage of F...
969 I have ranked preference data for 7 items. How... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Analyzing Ranked Preference Data I have ranked preference data for 7 items. How...
970 I am measuring the effect of scale on the numb... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Wondering which test to conduct and how to con... I am measuring the effect of scale on the numb...
971 I'm looking at some instruction/examples on A/... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 [Q] non-parametric, permutations A/B testing I'm looking at some instruction/examples on A/...
972 &amp;#x200B; {'approved_at_utc': None, 'subreddit': 'statis... t3 0 What is a good tutorial for learning how to ca... &amp;#x200B;What is a good tutorial for learni...
973 &amp;#x200B; {'approved_at_utc': None, 'subreddit': 'statis... t3 0 i'm a psych phd student who wants to befriend ... &amp;#x200B;i'm a psych phd student who wants ...
974 Howdy, So I’m in the beginnings of a PhD in ep... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Any grad students from other fields also looki... Howdy, So I’m in the beginnings of a PhD in ep...
975 Correlation And Causation By Examplehttp://blo... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Correlation And Causation By Example Correlation And Causation By Examplehttp://blo...
976 Hi all,&amp;#x200B;I'm having a bit of trouble... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 Merging item responses into a single variable ... Hi all,&amp;#x200B;I'm having a bit of trouble...
977 Can somebody help this statistics rookie?Resea... {'approved_at_utc': None, 'subreddit': 'statis... t3 0 [Question] Should I use a Two-way ANOVA? Can somebody help this statistics rookie?Resea...

1961 rows × 6 columns

# Save the cleaned-up product on the side
dfCombined.to_csv('Combined.csv', index = False)

NLP

Use CountVectorizer or TfidfVectorizer from scikit-learn to create features from the thread titles and descriptions (NOTE: not all threads have a description). A TF-IDF sketch follows the list below.

  • Examine using count or binary features in the model
  • Re-evaluate your models using these. Does this improve the model performance?
  • What text features are the most valuable?
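The cells below stick with CountVectorizer; a TfidfVectorizer variant is sketched here for comparison (assuming the same X_train/X_test split created below, with illustrative, untuned settings):

```python
# Sketch of the TF-IDF alternative; swap in for cvec below if desired
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), min_df=2)
X_train_tfidf = tvec.fit_transform(X_train)
X_test_tfidf = tvec.transform(X_test)

lr_tfidf = LogisticRegression()
lr_tfidf.fit(X_train_tfidf, y_train)
# lr_tfidf.score(X_test_tfidf, y_test)
```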

N-grams = 1

# Going back after the fact to add some obvious stop words
# This was from a 'normal' run of CountVectorizer, i.e. n-grams = 1
# 'amp' seems to be a bad HTML entity that got pulled in mistakenly
new_stop_words = {'science', 'like', 'https', 'com', 've', '10', '12', 'amp'}
stop_words = ENGLISH_STOP_WORDS.union(new_stop_words)
# Instantiate
cvec = CountVectorizer(stop_words=stop_words) # First run through of n-grams = 1
# Set variables and train_test_split
# Sticking with the normal 75/25 split
X = dfCombined['title_body'].values
y = dfCombined['subreddit_target']
# .map({'statistics':0, 'datascience':1})

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42)
# Fit and transform
cvec.fit(X_train)
X_train_transform = cvec.transform(X_train)
X_test_transform = cvec.transform(X_test)
df_view_stats = pd.DataFrame(X_test_transform.todense(),
                             columns=cvec.get_feature_names(),
                             index=y_test.index)
df_view_stats.head()
# .T.sort_values('statistics', ascending=False).head(10).T
00 000 0005 0016 0031 004 004100341sig 00411621sig 004p2 00625 ... zipper zippers zjt zones zoo zuckerberg zwitch zziz µᵢ χ2
113 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
572 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
450 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
383 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
506 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 8621 columns

# Most commonly used words on data science
# This was run multiple times for different sets of n-grams
word_count_test = pd.concat([df_view_stats, y_test], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='datascience', axis=1, ascending=False).T
subreddit_target datascience statistics
data 435 74
learning 97 17
work 79 10
time 74 36
python 70 5
model 67 34
know 65 25
false 64 1
using 63 16
use 61 28
looking 53 10
new 52 6
job 51 8
learn 51 5
just 51 11
dataset 44 5
want 44 21
tf 43 0
need 43 19
project 42 4
code 41 1
good 41 12
projects 41 7
way 39 19
set 39 10
tensorflow 39 0
lt 38 25
machine 38 4
analysis 38 12
working 37 7
... ... ...
ljung 0 0
classifying 0 0
livestream 0 0
classname 0 0
lived 0 0
cleanly 0 0
classification_report 0 0
classical 0 1
claim 0 0
class3 0 0
claimed 0 0
claims 0 0
clarify 0 1
lol 0 0
logs 0 1
logo 0 0
lognormal 0 0
logits 0 0
logit 0 1
logistics 0 0
clarifying 0 0
logical 0 0
logic 0 1
clarityhow 0 0
logarithms 0 0
logarithmicaly 0 0
class1 0 0
locked 0 0
class2 0 0
χ2 0 0

8621 rows × 2 columns

# Most commonly used words on statistics
# This was run multiple times for different sets of n-grams
word_count_test = pd.concat([df_view_stats, y_test], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='statistics', axis=1, ascending=False).T
subreddit_target datascience statistics
data 435 74
statistics 26 50
mean 6 48
variables 20 44
variable 15 42
test 15 42
help 36 41
time 74 36
regression 18 36
model 67 34
use 61 28
know 65 25
lt 38 25
question 34 25
11 5 23
different 27 22
distribution 3 21
x200b 25 21
make 34 21
want 44 21
way 39 19
number 15 19
need 43 19
09 2 18
statistical 18 18
sample 10 18
linear 17 18
day 15 18
population 0 17
15 7 17
... ... ...
fine 1 0
flagship 0 0
fishermen 0 0
flagged 1 0
flag 1 0
fizzle 0 0
fizzbuzz 0 0
fixing 0 0
fix 1 0
fivethirtyeight 0 0
fitted 0 0
fitness 0 0
fit_transform 1 0
fit2 0 0
fishing 0 0
fischer 0 0
finger 0 0
fiscal 0 0
firmly 0 0
firm 2 0
firing 0 0
firefox 0 0
fintech 0 0
finnoq 0 0
finnish 0 0
finland 0 0
finite 0 0
finishes 0 0
finished 2 0
χ2 0 0

8621 rows × 2 columns

N-grams = 2

# This was the second run of CountVectorizer, i.e. n-grams = 2
# I removed 'science' from the stop words because I wanted to differentiate between 'science' and 'data science', and also removed 've', which had only been added because it got picked up from "I've"
# Going to leave stop words as is for n-grams = 2, aside from the HTML artifacts that got pulled in
new_stop_words = {'amp', 'x200b', 'amp x200b'}
stop_words = ENGLISH_STOP_WORDS.union(new_stop_words)
# Instantiate
cvec2 = CountVectorizer(stop_words=stop_words, ngram_range=(2,2)) #Second run through of n-grams = 2
# Set variables and train_test_split
# Sticking with the normal 75/25 split
X = dfCombined['title_body'].values
y = dfCombined['subreddit_target']
# .map({'statistics':0, 'datascience':1})

X_train2, X_test2, y_train2, y_test2 = train_test_split(X,
                                                    y,
                                                    random_state=42)

# Fit and transform
cvec2.fit(X_train2)
X_train_transform2 = cvec2.transform(X_train2)
X_test_transform2 = cvec2.transform(X_test2)
df_view_stats2 = pd.DataFrame(X_test_transform2.todense(),
                             columns=cvec2.get_feature_names(),
                             index=y_test2.index)
df_view_stats2.head()
# .T.sort_values('statistics', ascending=False).head(10).T
00 00 00 29s2 00 9987 00 cheap 00 cost 00 established 00 mean 00 primarily 00 went 000 10 ... zippers validate zjt vector zones topping zoo ggplot2 zuckerberg eric zwitch mapd zziz pwcpapers µᵢ fixed χ2 05 χ2 distribution
113 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
572 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
450 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
383 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
506 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 43537 columns

# Most commonly used words on data science
word_count_test = pd.concat([df_view_stats2, y_test2], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='datascience', axis=1, ascending=False).T
subreddit_target datascience statistics
data science 127 5
machine learning 38 2
data scientist 30 0
https www 26 0
data scientists 20 0
gt lt 17 0
tensorflow js 16 0
https github 15 0
statistical learning 15 2
github com 15 0
data analyst 15 1
kaggle com 13 0
www kaggle 13 0
time series 13 3
https redd 12 0
data analytics 12 0
feel like 12 0
https youtu 10 0
data set 10 3
open source 10 0
greatly appreciated 9 1
linear algebra 8 3
don know 8 2
scikit learn 8 0
data analysis 7 1
work data 7 0
lt script 7 0
sql queries 7 0
little bit 7 0
new data 7 0
... ... ...
gallery 1hbpy1w 0 0
gallery ehcawau 0 0
gallery ej9di3f 0 0
gallery html 0 0
gallery http 0 0
gallery o45qf8o 0 0
gallery olzrzxz 0 0
gallery plotly 0 0
gallery wtdpir3 0 0
gain round 0 0
gain opinions 0 0
gain followers 0 0
future timeseries 0 0
future performance 0 0
future price 0 0
future research 0 0
future researcherdon 0 0
future statistical 0 0
future thoughts 0 0
future time 0 0
future using 0 0
gain academic 0 0
future weather 0 0
fyi data 0 0
fyi learning 0 0
g1 mn 0 0
g2 13 0 0
ga 90 0 0
ga tools 0 0
χ2 distribution 0 0

43537 rows × 2 columns

# Most commonly used words on statistics
word_count_test = pd.concat([df_view_stats2, y_test2], axis=1)
word_count_test['subreddit_target'] = word_count_test['subreddit_target'].map({0:'statistics', 1:'datascience'})
word_count_test.groupby(by='subreddit_target').sum().sort_values(by='statistics', axis=1, ascending=False).T
subreddit_target datascience statistics
standard deviation 0 6
linear regression 3 6
non stationary 0 6
https imgur 3 5
regression model 2 5
independent variables 0 5
don think 0 5
imgur com 3 5
make sense 1 5
data science 127 5
things like 4 5
normally distributed 0 4
logistic regressions 0 4
prediction model 1 4
need help 4 4
normal distribution 0 4
hypothesis testing 0 4
capture recapture 0 4
comp sci 3 4
random sample 0 4
average mean 0 4
post test 0 3
real time 2 3
pre post 0 3
make statistical 0 3
data excel 0 3
hotspot mapping 0 3
independent variable 0 3
index variables 0 3
statistical curve 0 3
... ... ...
frames day 0 0
framework aware 0 0
framework building 0 0
framework cheersbest 0 0
framework consistent 0 0
framework guidance 0 0
framework implemented 0 0
framework interactive 0 0
fragments feeding 0 0
fraction discard 0 0
forward similar 0 0
fpsyg 2018 0 0
forward want 0 0
forxa03xa0months july 0 0
foundation hiring 0 0
foundation mathematics 0 0
foundation prior 0 0
foundations predictive 0 0
foundations python 0 0
founder kdnuggets 0 0
fourmilab ch 0 0
fourth generate 0 0
foxes hounds 0 0
foxes immediately 0 0
foxes seven 0 0
foxhole inside 0 0
fp growth 0 0
fp persons 0 0
fpsyg 09 0 0
χ2 distribution 0 0

43537 rows × 2 columns

# Instantiate and fit
lr2 = LogisticRegression()
lr2.fit(X_train_transform2, y_train2)
lr2.score(X_train_transform2, y_train2)
0.9863945578231292
lr2.score(X_test_transform2, y_test2)
# Looks like a pretty decent overfit
0.7637474541751528

Predicting subreddit using Random Forests + Another Classifier

# Instantiate and fit
# From here on out, it's n-grams = 1
lr = LogisticRegression()
lr.fit(X_train_transform, y_train)
lr.score(X_train_transform, y_train)
0.9897959183673469
lr.score(X_test_transform, y_test)
# Looks like a pretty decent overfit
0.8757637474541752

We want to predict a binary variable - class 0 for one of your subreddits and 1 for the other.

preds = lr.predict(X_test_transform)
pred_proba = lr.predict_proba(X_test_transform)[:,1]
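# Note: roc_auc_score is fed the hard class predictions here; using pred_proba
# (already computed above) is the more usual input for a ROC AUC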
roc_auc = roc_auc_score(y_test, preds)
roc_auc
0.8772046367954297
roc_auc = roc_auc_score(y_test, preds)
FPR, TPR, thresholds = roc_curve(y_test, pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(FPR, TPR, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.title('ROC-AUC (n-grams=1)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.plot([0, 1], [0, 1],'r--')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve for the logistic regression model (n-grams = 1)]

Thought experiment: What is the baseline accuracy for this model?

## I'm going to take an educated guess that the baseline accuracy is roughly 50%: the two classes are nearly balanced, so always predicting the majority class does about as well as random guessing
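That guess can be checked directly from the class balance (983 r/datascience posts vs 978 r/statistics posts):

```python
# Baseline = always predicting the majority class
dfCombined['subreddit_target'].value_counts(normalize=True).max()
# 983 / 1961 ≈ 0.501, so ~50% is indeed the accuracy to beat
```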

Create a RandomForestClassifier model to predict which subreddit a given post belongs to.

# Instantiate
rf = RandomForestClassifier()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Use cross-validation in scikit-learn to evaluate the model above.

  • Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate.
  • Bonus: Use GridSearchCV with Pipeline to optimize your CountVectorizer/TfidfVectorizer and classification model (a sketch follows the cross-validation results below).
cvs_train = cross_val_score(rf, X_train_transform, y_train, cv=cv, n_jobs=-1)

print(cvs_train)
print(cvs_train.mean())
[0.81292517 0.86734694 0.79591837 0.78231293 0.79931973]
0.8115646258503402
cvs_test = cross_val_score(rf, X_test_transform, y_test, cv=cv, n_jobs=-1)

print(cvs_test)
print(cvs_test.mean())
# Still slight overfit
[0.70707071 0.71717172 0.76767677 0.82474227 0.74226804]
0.7517859002395084
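The bonus above (GridSearchCV with a Pipeline) was not run in this notebook; a minimal sketch of what it could look like, feeding in the raw text split from earlier with an illustrative, untuned parameter grid:

```python
# Sketch only: the pipeline lets the grid tune the vectorizer and classifier together
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('rf', RandomForestClassifier(random_state=42)),
])
params = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__max_features': [2000, 5000, None],
    'rf__n_estimators': [50, 100],
}
gs = GridSearchCV(pipe, param_grid=params, cv=cv, n_jobs=-1)
# gs.fit(X_train, y_train)   # raw text goes in, not the transformed matrices
# gs.best_params_, gs.best_score_
```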

Repeat the model-building process using a different classifier (e.g. MultinomialNB, LogisticRegression, etc)

MultinomialNB

mnb = MultinomialNB()
cvs_train = cross_val_score(mnb, X_train_transform, y_train, cv=cv, n_jobs=-1)

print(cvs_train)
print(cvs_train.mean())
[0.82312925 0.8537415  0.81632653 0.80272109 0.81972789]
0.8231292517006802
cvs_test = cross_val_score(mnb, X_test_transform, y_test, cv=cv, n_jobs=-1)

print(cvs_test)
print(cvs_test.mean())
# Not as bad of an overfit
[0.7979798  0.75757576 0.7979798  0.81443299 0.77319588]
0.788232843902947

GaussianNB

gnb = GaussianNB()
cvs_train = cross_val_score(gnb, X_train_transform.toarray(), y_train, cv=cv, n_jobs=-1)

print(cvs_train)
print(cvs_train.mean())
[0.77891156 0.80612245 0.78231293 0.81292517 0.80952381]
0.7979591836734693
cvs_test = cross_val_score(gnb, X_test_transform.toarray(), y_test, cv=cv, n_jobs=-1)

print(cvs_test)
print(cvs_test.mean())
# Overfit isn't as much of a problem on this model
# However, the overall score isn't as strong as the other models
[0.71717172 0.76767677 0.80808081 0.78350515 0.70103093]
0.7554930750807038

Executive Summary


Put your executive summary in a Markdown cell below.

Reclassifying all of Reddit is an incredibly daunting task. However, the machine learning and natural language processing capabilities of Python can turn it into a manageable one. Reddit calls itself the 'front page of the internet,' and, true to the innovation that drove the creation of the internet, Reddit can innovate to overcome this challenge as it has countless obstacles before it.

Specifically, the distinction between r/DataScience and r/Statistics is relatively subtle, as these subreddits generally discuss similar ideas and concepts. Despite these similarities, I believe my models performed quite well (especially my first run of Logistic Regression using n-grams = 1). Additionally, I chose to add specific stop words ('science', 'https', 'com') that would otherwise too easily identify r/DataScience as the correct subreddit, in order to 'challenge' my modeling and evaluation skills, as well as to allow this process to be applied more generally across Reddit's various subreddits. Keeping those words as features would have increased my models' classifying ability even further.

Finally, I believe that this machine learning/NLP process can be applied to Reddit as a whole to help reclassify and realign its subreddits with a high degree of success. Coupled with Reddit's strong community, including its committed mods, this is a challenge that Reddit can overcome, and it may come out stronger because of it.