8 Commits

7 changed files with 149 additions and 6292 deletions

.gitignore vendored

@@ -2,6 +2,8 @@
 **/*lock*
 **/*-slice*.csv
 **/*.zip
+**/*.html
+**/*.htm
 /ALL-SENATORS-LONG.csv
 /ALL-SENATORS.csv
 /collect2.py

README.md

@@ -1,7 +1,131 @@
-# How to use
-Execute collect.py to scrape tweets and generate the ´ALL-SENATORS-TWEETS.csv´.
-Execute collectSenData.py to scrape senator data and generate ´ALL-SENATORS.csv´.
-All new files will be written to ´data/OUT/´. Necessary data has to be located in ´data/IN/´
+# Requirements
+- python 3.10+
+- snscrape 0.6.2.20230321+ (see git repo in this folder)
+- transformers 4.31.0
+- numpy 1.23.5
+- pandas 2.0.3
+- scikit-learn 1.3.0
+- torch 2.0.1
+# About
+This collection of scripts scrapes tweets of US senators from 2020-01-01T00:00:00Z to 2023-01-03T00:00:00Z, scrapes the senators' account data, prepares the tweets for training an NLP model, and trains two models that classify (1) a tweet's topic as covid or non-covid and (2) a tweet as either a "fake news" or a "non-fake news" tweet.
 Training only works with a prepared dataset in which the tweets are pre-classified.
 More info in the comments of the scripts.
 Due to time constraints, most of the code is procedurally coded and ugly but effective.
+# How to
+Tested on Ubuntu 22.04.
+If needed, the virtual environment can be exported and sent to you.
+All files in the folder data/IN have to exist in order to execute the scripts.
+Execute in the following order (a runner sketch follows the file tree below):
+01 collect.py (see below for further info on scraping)
+02 collectSenData.py
+03 cleanTweets.py
+04 preTestClassification.py
+05 trainTopic.py
+06 trainFake.py
+07 ClassificationFake.py
+08 ClassificationTopic.py
# Files & Folders
Datafiles are not included in the repository but can be found in the full package that can be downloaded from [here](https://ncloud.mischbeck.de/s/T4QcMDSfYSkadYC) (password protected).
```
├── data
│   ├── IN
│   │   ├── counterKeywordsFinal.txt
│   │   ├── counterKeywords.txt
│   │   ├── keywords-raw.txt
│   │   ├── keywords.txt
│   │   ├── own_keywords.txt
│   │   ├── pretest-tweets_fake.txt contains tweet ids for pretest
│   │   ├── pretest-tweets_not_fake.txt contains tweet ids for pretest
│   │   └── senators-raw.csv senator datafile
│   ├── OUT
│   │   ├── ALL-SENATORS-TWEETS.csv
│   │   ├── graphs
│   │   │   ├── Timeline.png
│   │   │   ├── Wordcloud-All.png
│   │   │   └── Wordcloud-Cov.png
│   │   ├── Pretest-Prep.csv
│   │   ├── Pretest-Results.csv
│   │   ├── Pretest-SENATORS-TWEETS.csv
│   │   ├── profiles dataset profiles
│   │   │   ├── AllTweets.html
│   │   │   └── CovTweets.html
│   │   ├── SenatorsTweets-Final.csv
│   │   ├── SenatorsTweets-OnlyCov.csv
│   │   ├── SenatorsTweets-train-CovClassification.csv
│   │   ├── SenatorsTweets-train-CovClassificationTRAIN.csv
│   │   ├── SenatorsTweets-train-CovClassification.tsv
│   │   ├── SenatorsTweets-train-FakeClassification.csv
│   │   ├── SenatorsTweets-train-FakeClassificationTRAIN.csv
│   │   ├── SenatorsTweets-train-FakeClassification.tsv
│   │   ├── SenatorsTweets-Training.csv
│   │   ├── SenatorsTweets-Training_WORKING-COPY.csv
│   │   ├── topClass-PRETEST-Prep.csv
│   │   ├── topClass-PRETEST-Results.csv
│   │   ├── Tweets-All-slices.zip
│   │   ├── Tweets-Classified-Fake-Prep.csv
│   │   ├── Tweets-Classified-Fake-Results.csv
│   │   ├── Tweets-Classified-Prep.csv
│   │   ├── Tweets-Classified-Topic-Prep.csv
│   │   ├── Tweets-Classified-Topic-Results.csv
│   │   └── Tweets-Stub.csv
├── funs
│   ├── CleanTweets.py 2023-01-03T00:00:00Z multiple functions to clean tweet contents for NLN-processing
│   ├── ClearDupes.py function for deletion of duplicate keywords
│   ├── __init__.py
│   ├── Scrape.py scraper functions to be used for multiprocessing
│   └── TimeSlice.py time slice script to slice the time span in 24 slices, speeds up scraping through multiprocessing
├── log logs of the scraping process
│   ├── log_2023-06-23_21-06-10_err.log
│   ├── log_2023-06-23_21-06-10.log
│   └── log_2023-06-23_21-06-10_missing.log
├── models
│   ├── CovClass Covid tweet classification model
│   │   └── 2023-08-15_05-56-50
│   │   ├── 2023-08-15_05-56-50.csv training output
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── vocab.txt
│   └── FakeClass Fake tweet classification model
│   └── 2023-08-15_14-35-43
│   ├── 2023-08-15_14-35-43.csv training output
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.json
│   └── vocab.txt
├── snscrape contains snscrape 0.6.2.20230321+ git repo
├── ClassificationFake.py classifies tweets as fake or non-fake, saves:
│ Tweets-Classified-Fake-Prep.csv - prepared training dataset
│ Tweets-Classified-Fake-Results.csv - Tweets-Classified-Topic-Results.csv with cov classification results
├── ClassificationTopic.py classifies tweet topic, saves:
│ Tweets-Classified-Topic-Prep.csv - prepared training dataset
│ Tweets-Classified-Topic-Results.csv - SenatorsTweets-OnlyCov.csv with cov classification results
├── cleanTweets.py Curates keywordlists
│ Merges senator and tweet datasets
│ Creates multiple datasets:
│ SenatorsTweets-Final.csv - all tweets with keyword columns
│ SenatorsTweets-OnlyCov.csv - only covid tweets, filtered by keywordlist
│ SenatorsTweets-Training.csv - training dataset, containing ~1800 randomly selected tweets from SenatorsTweets-OnlyCov.csv
├── collect.py scrapes tweets, saves to ALL-SENATORS-TWEETS.csv
├── collectSenData.py scrapes senator account data, saves to ALL-SENATORS.csv
├── createGraphs.py creates wordcloud & timeline graphs
├── preTestClassification.py pretest script that uses bvrau/covid-twitter-bert-v2-struth to analyze 100 preclassified tweets
├── profiler.py creates dataset profiles
├── README.md readme
├── trainFake.py training script for the fake tweet classification model
└── trainTopic.py training script for the tweet topic classification model
```
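The eight steps above can be chained in one go. A minimal runner sketch (not part of the repository; it assumes the scripts sit in the repository root and are run with the current interpreter):

```python
import subprocess
import sys

# Execution order from the README's "How to" section.
PIPELINE = [
    "collect.py",                # 01: scrape tweets
    "collectSenData.py",         # 02: scrape senator account data
    "cleanTweets.py",            # 03: curate keywords, merge datasets
    "preTestClassification.py",  # 04: pretest on 100 pre-classified tweets
    "trainTopic.py",             # 05: train the topic model
    "trainFake.py",              # 06: train the fake-news model
    "ClassificationFake.py",     # 07: classify fake / non-fake
    "ClassificationTopic.py",    # 08: classify covid / non-covid topic
]

for script in PIPELINE:
    print(f"--- running {script} ---")
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:  # stop the pipeline on the first failure
        sys.exit(f"{script} exited with code {result.returncode}")
```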

cleanTweets.py

@@ -9,7 +9,8 @@ Created on Mon Jun 26 20:36:43 2023
 import pandas as pd
 # import pyreadstat
 import numpy as np
-from funs.ClearDupes import deDupe
+import sys
 # Seed for training dataset generation
 seed = 86431891
@@ -49,6 +50,11 @@ senDatasetPath = wd + di + senDataset
 df = pd.read_csv(senCSVPath, dtype=(object))
+## Import own functions
+funs = wd+"funs"
+sys.path.insert(1, funs)
+from ClearDupes import deDupe
+
 mixed_columns = df.columns[df.nunique() != len(df)]
 print(mixed_columns)
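The change above (repeated in collect.py and preTestClassification.py below) swaps the package-style `from funs.X import Y` imports for a `sys.path` insertion, so the helper modules in `funs/` resolve even when the scripts are not run as part of an installed package. A minimal sketch of the pattern, with `wd` as a placeholder for the working-directory variable the scripts define earlier:

```python
import sys

wd = "/path/to/repo/"  # placeholder; the scripts set wd further up

# Put the helper folder on the module search path, just behind the
# script's own directory (index 0), then import the helpers directly.
funs = wd + "funs"
sys.path.insert(1, funs)

from ClearDupes import deDupe  # i.e. funs/ClearDupes.py
```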

collect.py

@@ -66,7 +66,6 @@ which is the final output.
 import os
 import pandas as pd
 import glob
-import time
 import sys
 from datetime import datetime
 import concurrent.futures
@@ -149,10 +148,12 @@ tweetDFColumns = [
 ################## do NOT change anything below this line ###################
 #############################################################################
-## Import functions
-from funs.TimeSlice import *
-from funs.ClearDupes import deDupe
-from funs.Scrape import scrapeTweets
+## Import own functions
+funs = wd+"funs"
+sys.path.insert(1, funs)
+from TimeSlice import get_Tslices
+from ClearDupes import deDupe
+from Scrape import scrapeTweets
 ###################
 # Create logfile & log all outputs
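collect.py combines these imports with concurrent.futures: per the file tree, funs/TimeSlice.py cuts the scraping window into 24 slices so they can be scraped in parallel. The real signature of get_Tslices is not shown in this diff, so the sketch below uses a hypothetical stand-in to illustrate the idea:

```python
from datetime import datetime
import concurrent.futures

# Hypothetical stand-in for funs/TimeSlice.get_Tslices: cut [start, end)
# into n equal slices, one per scraping worker.
def get_time_slices(start: datetime, end: datetime, n: int = 24):
    step = (end - start) / n
    return [(start + i * step, start + (i + 1) * step) for i in range(n)]

def scrape_slice(bounds):
    since, until = bounds
    # the real worker would call funs/Scrape.py's scrapeTweets here
    return f"scraped {since:%Y-%m-%d} .. {until:%Y-%m-%d}"

if __name__ == "__main__":
    slices = get_time_slices(datetime(2020, 1, 1), datetime(2023, 1, 3))
    with concurrent.futures.ProcessPoolExecutor() as pool:
        for line in pool.map(scrape_slice, slices):
            print(line)
```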

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

preTestClassification.py

@@ -1,13 +1,8 @@
-import re
-import string
-import numpy as np
 import pandas as pd
 from datetime import datetime
 from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
 from datasets import load_dataset
 from transformers.pipelines.pt_utils import KeyDataset
-from funs.CleanTweets import remove_URL, remove_emoji, remove_html, remove_punct
 #%%
 # prepare
@@ -40,7 +35,6 @@ senCSVPretest = "Pretest.csv"
 senCSVPretestPrep = "Pretest-Prep.csv"
 senCSVPretestResult = "Pretest-Results.csv"
-# don't change this one
 senCSVPath = wd + ud + senCSV
 senCSVcPath = wd + ud + senCSVc
@@ -50,6 +44,11 @@ senCSVcPretestResultPath = wd + ud + senCSVPretestResult
 preTestIDsFakePath = wd + di + preTestIDsFake
 preTestIDsNotPath = wd + di + preTestIDsNot
+import sys
+funs = wd+"funs"
+sys.path.insert(1, funs)
+import CleanTweets
+
 # List of IDs to select
 # Read the IDs from a file
 preTestIDsFakeL = []
@@ -85,11 +84,7 @@ tokenizer = AutoTokenizer.from_pretrained("bvrau/covid-twitter-bert-v2-struth")
 # Source https://www.kaggle.com/code/daotan/tweet-analysis-with-transformers-bert
-dfPreTest['cleanContent'] = dfPreTest['rawContent'].apply(remove_URL)
-dfPreTest['cleanContent'] = dfPreTest['cleanContent'].apply(remove_emoji)
-dfPreTest['cleanContent'] = dfPreTest['cleanContent'].apply(remove_html)
-dfPreTest['cleanContent'] = dfPreTest['cleanContent'].apply(remove_punct)
-dfPreTest['cleanContent'] = dfPreTest['cleanContent'].apply(lambda x: x.lower())
+dfPreTest['cleanContent'] = dfPreTest['rawContent'].apply(CleanTweets.preprocess_text)
 #%%
 timeStart = datetime.now() # start counting execution time
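The five chained cleaning steps are collapsed into a single CleanTweets.preprocess_text call. The helper's body is not part of this diff; a plausible reconstruction, assuming it simply composes the four removed helpers plus lowercasing (the regexes here are illustrative, not the repository's):

```python
import re
import string

def remove_URL(text: str) -> str:
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_emoji(text: str) -> str:
    # rough emoji ranges; the repo's pattern may differ
    return re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)

def remove_html(text: str) -> str:
    return re.sub(r"<.*?>", "", text)

def remove_punct(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))

def preprocess_text(text: str) -> str:
    # same order as the removed .apply() chain, ending with lowercasing
    for step in (remove_URL, remove_emoji, remove_html, remove_punct):
        text = step(text)
    return text.lower()
```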