Data Wrangling Report¶
Introduction¶
In this project, I gathered data from the WeRateDogs Twitter archive. The goal for this project was to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.
The wrangling tasks completed in this project are:
- Data gathering
- Assessing data
- Cleaning data
Data Gathering¶
Data for this project was gathered from three sources, as explained below.
1. Enhanced Twitter Archive¶
This archive contains basic tweet data (tweet ID, timestamp, text, etc.) from the creation of the WeRateDogs account in 2015 through August 1, 2017. It was provided by Udacity as a CSV file and contains 2,000+ tweets with each dog's rating, name, and "stage".
2. Tweet Image Predictions Dataset¶
This file, image_predictions.tsv, is hosted on Udacity's servers and was downloaded programmatically using the Requests library. The dataset contains dog breed predictions (from a neural network classifier) for every dog image in the WeRateDogs Twitter archive.
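The programmatic download can be sketched as below. This is a minimal illustration with Requests; the URL shown is a placeholder, not the actual Udacity-hosted location.

```python
import requests

def download_file(url, dest_path):
    """Download a file over HTTP and save it to dest_path."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(dest_path, 'wb') as f:
        f.write(response.content)

if __name__ == "__main__":
    # Placeholder URL; the real hosted location differs
    download_file("https://example.com/image_predictions.tsv",
                  "image_predictions.tsv")
```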
3. Twitter API (tweepy)¶
This data resides on Twitter and was pulled via its API using the tweepy library. I used the API to query additional data (in JSON format) and saved it to a file named tweet_json.txt. This file holds the favorite and retweet counts for each tweet ID in the WeRateDogs Twitter archive, which are crucial for the dog rating analysis.
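Once tweet_json.txt is written, it can be parsed into a tidy table. The sketch below assumes one tweet's JSON object per line (the usual layout for this step); the function name is illustrative.

```python
import json
import pandas as pd

def load_tweet_metrics(path):
    """Parse a file with one tweet JSON object per line into a
    DataFrame of tweet_id, retweet_count and favorite_count."""
    rows = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            rows.append({
                'tweet_id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return pd.DataFrame(rows)
```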
Assessing Data¶
Below are the steps taken in assessing the data.
Enhanced Twitter Archive¶
- As a first step, a sample of data was assessed visually and a summary of data types and non-null values was displayed. This allowed us to identify columns with incorrect data type and/or null values.
- Then, IDs were checked for duplicates.
- Next, the number of tweets which are replies and retweets was assessed.
- Expanded URLs were first assessed visually and then checked programmatically for the existence of more than one URL.
- The name column was assessed programmatically for anomalies and data inconsistency.
- Then, all tweets were checked for dogs with more than one growth stage assigned.
- Rating denominators and numerators were assessed visually by displaying a sample of the data; based on that assessment of the rating columns, the text column was checked programmatically for any float ratings.
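The programmatic checks above can be sketched with pandas. The snippet below uses a toy stand-in for the archive (the real dataset has many more columns); column names follow the standard Twitter archive layout.

```python
import pandas as pd

# Toy stand-in for the enhanced archive (assumed column names)
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [None, 99.0, None],
    'retweeted_status_id': [None, None, 77.0],
    'text': ['13/10 good boy', 'rating of 11.27/10', '12/10'],
})

# Duplicate tweet IDs
dup_ids = archive['tweet_id'].duplicated().sum()

# Number of replies and retweets
replies = archive['in_reply_to_status_id'].notna().sum()
retweets = archive['retweeted_status_id'].notna().sum()

# Tweets whose text contains a float rating like 11.27/10
floats = archive[archive['text'].str.contains(r'\d+\.\d+/\d+', regex=True)]

print(dup_ids, replies, retweets, len(floats))
```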
Observations from Enhanced Twitter Archive Assessment¶
Quality issues¶
- The dataset contains retweets.
- The name column contains "None" and some stopwords such as 'a', 'an', etc.
- Some dogs are not classified as one of "doggo", "floofer", "pupper" or "puppo".
- The source column contains HTML code rather than plain source names.
- Some expanded URL entries contain more than one URL.
- The timestamp column has the wrong data type.
- Some numerator ratings are wrong.
Tidiness issues¶
- The columns doggo, floofer, pupper and puppo all represent the dog's stage and should be combined into one column.
Tweet Image Predictions Dataset¶
- A sample of the data was assessed visually and a summary of data types and non-null values was displayed. This allowed us to identify columns with incorrect data types and/or null values.
- Then, the jpg_url column was checked for duplicates.
- Lastly, the first prediction was checked to see how many images were correctly classified as dog images.
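These two checks can be sketched as follows, again on a toy stand-in; the column names (jpg_url, p1, p1_dog) match the prediction dataset's usual layout.

```python
import pandas as pd

# Toy stand-in for the predictions table (assumed column names)
preds = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'jpg_url': ['a.jpg', 'b.jpg', 'a.jpg'],
    'p1': ['golden_retriever', 'paper_towel', 'golden_retriever'],
    'p1_dog': [True, False, True],
})

# Duplicated image URLs indicate retweets of the same picture
dup_images = preds['jpg_url'].duplicated().sum()

# How many top predictions were actually a dog breed
dogs = preds['p1_dog'].sum()

print(dup_images, dogs)
```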
Observations from Image Predictions Dataset Assessment¶
Quality issues¶
- The dataset contains 66 duplicated images/retweets.
- Some pictures were not predicted to contain a dog by the top prediction model.
- The breed prediction columns contain inconsistent casing, and underscores are used to separate words in breed names.
Tidiness issues¶
- The dataset contains a tweet_id column, so it should be merged with the Twitter Archive dataset.
Twitter API Dataset¶
- Checked summary of data types and non-null values in the dataset.
- Then checked if the API Data contains Retweets
Observations from Twitter API Dataset Assessment¶
Tidiness issues¶
- The display_text_range column contains two variables.
- The dataset contains a tweet_id column, so it should be merged with the Twitter Archive dataset.
Data Cleaning¶
The quality and tidiness issues identified in the Data Assessment section were cleaned using pandas, regular expressions, and some custom functions.
Twitter Archive Dataset¶
- First, a copy of the dataset was created for use throughout the cleaning exercise.
- Then, I removed retweets and replies from the dataset and dropped the columns holding retweet and reply information.
- Replaced names that are stopwords or "None" with NaN.
- The dog 'stage' classification (doggo, floofer, pupper, puppo), which was spread across four separate columns, was merged into one column.
- Extracted the dog stage from the text column.
- The source column, which contains HTML, was redefined by extracting the source names from the HTML.
- Some tweet URLs contain more than one link, so correct links were rebuilt from the tweet ID.
- Next, the timestamp column, which had an incorrect data type, was fixed by converting it to a DateTime object.
- Lastly, the numerator ratings were re-extracted from the text column and cleaned appropriately.
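Several of the archive cleaning steps above can be sketched in a few pandas operations. This is a minimal illustration on toy data, not the full cleaning notebook; the regex keeps float ratings such as 9.75/10.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the archive after gathering (assumed columns)
clean = pd.DataFrame({
    'tweet_id': [1, 2],
    'retweeted_status_id': [np.nan, np.nan],
    'doggo': ['doggo', 'None'],
    'floofer': ['None', 'None'],
    'pupper': ['None', 'pupper'],
    'puppo': ['None', 'None'],
    'text': ['13/10 would pet', 'rated 9.75/10'],
    'timestamp': ['2017-08-01 00:00:00', '2017-07-31 00:00:00'],
})

# Drop retweets, then the retweet bookkeeping column
clean = (clean[clean['retweeted_status_id'].isna()]
         .drop(columns='retweeted_status_id'))

# Collapse the four stage columns into one
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
clean['stage'] = (clean[stage_cols].replace('None', '')
                  .agg(''.join, axis=1).replace('', np.nan))
clean = clean.drop(columns=stage_cols)

# Fix the timestamp dtype and re-extract the numerator (floats allowed)
clean['timestamp'] = pd.to_datetime(clean['timestamp'])
clean['rating_numerator'] = (clean['text']
                             .str.extract(r'(\d+(?:\.\d+)?)/\d+',
                                          expand=False)
                             .astype(float))
```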
Tweet Image Predictions Dataset¶
- First, a copy of the dataset was created for use throughout the cleaning exercise
- Then dropped the 66 duplicated images from the dataset
- For pictures where the top prediction was not a dog, the 2nd or 3rd prediction was used to obtain the dog breed.
- Then, underscores were replaced with whitespace in the breed column, and the first letter of each word was capitalized to make it human readable.
- Finally, the cleaned version of this dataset was merged with the Twitter Archive dataset using tweet_id.
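The prediction fallback and the breed name tidy-up can be sketched as below, on a toy table with the dataset's usual p1/p2/p3 columns; the breed column name is this report's own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the predictions table (assumed columns)
preds = pd.DataFrame({
    'p1': ['golden_retriever', 'paper_towel'],
    'p1_dog': [True, False],
    'p2': ['labrador_retriever', 'pembroke'],
    'p2_dog': [True, True],
    'p3': ['kuvasz', 'cardigan'],
    'p3_dog': [True, True],
})

# Take the first prediction that is a dog, else NaN
preds['breed'] = np.where(preds['p1_dog'], preds['p1'],
                 np.where(preds['p2_dog'], preds['p2'],
                 np.where(preds['p3_dog'], preds['p3'], np.nan)))

# Human readable names: underscores -> spaces, title case
preds['breed'] = preds['breed'].str.replace('_', ' ').str.title()
```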
Twitter API¶
- First, a copy of the dataset was created for use throughout the cleaning exercise
- The display_text_range column was split into two separate columns: lower_text_range and upper_text_range.
- Since the dataset contains a tweet_id column, it was then merged with the Twitter Archive dataset.
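The column split can be sketched as follows; display_text_range holds a two-element list per tweet, and the two new column names are the ones introduced above.

```python
import pandas as pd

# Toy stand-in for the API dataset (assumed columns)
api_df = pd.DataFrame({
    'tweet_id': [1, 2],
    'display_text_range': [[0, 85], [0, 120]],
})

# Split the two-element list into two tidy columns
api_df[['lower_text_range', 'upper_text_range']] = pd.DataFrame(
    api_df['display_text_range'].tolist(), index=api_df.index)
api_df = api_df.drop(columns='display_text_range')
```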
Storing Data¶
Before further analysis, the cleaned, consolidated dataset was saved to a CSV file named twitter_archive_master.csv and to an SQLite database.
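The storage step can be sketched with pandas and the standard library's sqlite3 module; the database filename and table name below are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the cleaned master dataset
master = pd.DataFrame({'tweet_id': [1], 'rating_numerator': [13.0]})

# Save to CSV
master.to_csv('twitter_archive_master.csv', index=False)

# And to an SQLite database (assumed file and table names)
with sqlite3.connect('twitter_archive_master.db') as conn:
    master.to_sql('master', conn, if_exists='replace', index=False)
```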