Data Wrangling Report¶

Introduction¶

In this project, I gathered data from the WeRateDogs Twitter archive. The goal for this project was to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

The wrangling tasks completed in this project are:

  • Data gathering
  • Assessing data
  • Cleaning data

Data Gathering¶

Data for the project was gathered from three sources, as explained below.

1. Enhanced Twitter Archive¶

This archive contains basic tweet data (tweet ID, timestamp, text, etc.) from the creation of the WeRateDogs account in 2015 up to August 1, 2017. It was provided by Udacity as a CSV file and contains 2,000+ tweets with each dog's rating, name, and "stage".

2. Tweet Image Predictions Dataset¶

The file is hosted on Udacity's servers and was downloaded programmatically as image_predictions.tsv using the Requests library. This dataset contains dog breed predictions (from a neural network classifier) for every dog image in the WeRateDogs Twitter archive.
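The programmatic download can be sketched as follows. The function name `download_predictions` is my own, and the URL argument stands in for the actual file location given in the project materials:

```python
import requests

def download_predictions(url, path="image_predictions.tsv"):
    """Download the hosted TSV file and save it locally."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad HTTP status
    with open(path, "wb") as f:
        f.write(response.content)  # write raw bytes, preserving the file as served
    return path
```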

3. Twitter API (tweepy)¶

This data resides on Twitter and was pulled via its API using the tweepy library. I used the API to query additional data (in JSON format) and saved it to a file named tweet_json.txt. This file holds the favorite and retweet counts for each tweet ID in the WeRateDogs Twitter archive, which are crucial for the dog rating analysis.
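Once the API responses are saved one JSON object per line, the counts can be pulled into a DataFrame. A minimal sketch, assuming the standard Twitter status fields `id`, `favorite_count`, and `retweet_count`; the function name is my own:

```python
import json
import pandas as pd

def load_tweet_json(path="tweet_json.txt"):
    """Read one tweet's JSON per line and keep only the fields needed for analysis."""
    rows = []
    with open(path) as f:
        for line in f:
            t = json.loads(line)
            rows.append({"tweet_id": t["id"],
                         "favorite_count": t["favorite_count"],
                         "retweet_count": t["retweet_count"]})
    return pd.DataFrame(rows)
```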

Assessing Data¶

Below are the steps taken in assessing the data.

Enhanced Twitter Archive¶

  • As a first step, a sample of data was assessed visually and a summary of data types and non-null values was displayed. This allowed us to identify columns with incorrect data type and/or null values.
  • Then, IDs were checked for duplicates.
  • Next, the number of tweets which are replies and retweets was assessed.
  • Expanded URLs were first assessed visually and then checked programmatically for the existence of more than one URL.
  • The name column was assessed programmatically for anomalies and data inconsistency.
  • Then, all tweets were checked for dogs with more than one growth stage assigned.
  • Rating denominators and numerators were assessed visually by displaying a sample of data, and then, based on that assessment, the text column was checked programmatically for any float ratings.
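The programmatic checks above can be sketched as a single helper. `assess_archive` is a name of my own choosing; the column names (`tweet_id`, `in_reply_to_status_id`, `retweeted_status_id`, `expanded_urls`, `text`) follow the archive's schema:

```python
import pandas as pd

def assess_archive(df):
    """Programmatic checks mirroring the assessment steps above."""
    return {
        "duplicate_ids": int(df["tweet_id"].duplicated().sum()),
        "replies": int(df["in_reply_to_status_id"].notna().sum()),
        "retweets": int(df["retweeted_status_id"].notna().sum()),
        # rows whose expanded_urls field holds more than one comma-separated URL
        "multi_url_rows": int(df["expanded_urls"].str.contains(",", na=False).sum()),
        # float ratings like "13.5/10" hiding in the tweet text
        "float_ratings": int(df["text"].str.contains(r"\d+\.\d+/\d+", na=False).sum()),
    }
```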
Observations from Enhanced Twitter Archive Assessment¶
Quality issues¶
  1. Dataset contains retweets

  2. The name column contains "None" and some stopwords such as 'a', 'an', etc.

  3. Some dogs are not classified as one of "doggo", "floofer", "pupper" or "puppo".

  4. The source column contains raw HTML rather than a readable source name

  5. Some expanded_urls entries contain more than one URL

  6. Wrong data type for the timestamp column

  7. Some numerator ratings are wrong (e.g., float ratings in the text were extracted incorrectly)

Tidiness issues¶
  1. The doggo, floofer, pupper, and puppo columns all represent the dog's stage and should be a single column

Tweet Image Predictions Dataset¶

  • A sample of data was assessed visually and a summary of data types and non-null values was displayed. This allowed us to identify columns with an incorrect data type and/or null values.
  • Then, the jpg_url column was checked for duplicates.
  • Lastly, the first prediction was checked to see how many images were correctly classified as dog images.
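These checks can be sketched as below. `assess_predictions` is my own name; `jpg_url` and the boolean flag `p1_dog` (top prediction is a dog breed) follow the dataset's schema:

```python
import pandas as pd

def assess_predictions(df):
    """Duplicate images and top-prediction dog hits, per the steps above."""
    return {
        "duplicate_images": int(df["jpg_url"].duplicated().sum()),
        "dogs_in_p1": int(df["p1_dog"].sum()),  # rows whose top prediction is a dog
    }
```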
Observations from Image Predictions Dataset Assessment¶
Quality issues¶
  1. The dataset contains 66 duplicated images/retweets

  2. Some pictures were not predicted to contain a dog by the top prediction model

  3. The breed prediction columns use inconsistent letter case, and underscores separate the words of breed names

Tidiness issues¶
  1. The dataset contains tweet_id. Thus, it should be merged with the Twitter Archive dataset.

Twitter API Dataset¶

  • Checked summary of data types and non-null values in the dataset.
  • Then checked whether the API data contains retweets.
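The retweet check can be sketched against the saved tweet_json.txt file: in the Twitter API payload, a retweet carries a `retweeted_status` object. The function name is my own:

```python
import json

def count_retweets(path="tweet_json.txt"):
    """Count saved tweets that are retweets (carry a retweeted_status payload)."""
    with open(path) as f:
        return sum("retweeted_status" in json.loads(line) for line in f)
```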
Observations from Twitter API Dataset Assessment¶

Tidiness issues¶

  1. display_text_range holds two variables in a single column

  2. Contains tweet_id. Thus, it should be merged with the Twitter Archive dataset.

Data Cleaning¶

The quality and tidiness issues identified in the Data Assessment section were cleaned using pandas, regular expressions, and some custom functions.

Twitter Archive Dataset¶

  • First, a copy of the dataset was created for use throughout the cleaning exercise.
  • Then, I removed retweet and reply data from the dataset and dropped the columns holding retweet and reply information.
  • Replaced names that are stopwords and None with NaN
  • Dog 'stage' classification (doggo, floofer, pupper, puppo) which was broken into four separate columns, was merged into one column.
  • Extracted the dog stage from the text column.
  • The source column, which contained HTML, was redefined by extracting the source names from the HTML.
  • We have some tweet URLs which contain more than one link, therefore we built correct links by using the tweet id.
  • Next, we fixed the timestamp column, which had an incorrect data type, by converting it to a datetime object.
  • Lastly, the numerator ratings were re-extracted from the text column and cleaned appropriately.
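The core of these cleaning steps can be sketched in one function. `clean_archive` is my own name; the column names follow the archive's schema, and the regex allows float numerators such as "13.5/10":

```python
import numpy as np
import pandas as pd

def clean_archive(df):
    """Drop retweets/replies, merge dog stages, fix timestamp and numerators."""
    df = df.copy()
    # keep only original tweets, then drop the retweet/reply columns
    df = df[df["retweeted_status_id"].isna() & df["in_reply_to_status_id"].isna()]
    df = df.drop(columns=["retweeted_status_id", "retweeted_status_user_id",
                          "in_reply_to_status_id", "in_reply_to_user_id"])
    # collapse the four stage columns into a single dog_stage column
    stages = ["doggo", "floofer", "pupper", "puppo"]
    df["dog_stage"] = (df[stages].replace("None", "")
                                 .agg("".join, axis=1)
                                 .replace("", np.nan))
    df = df.drop(columns=stages)
    # correct the timestamp dtype and re-extract (possibly float) numerators
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["rating_numerator"] = (df["text"]
                              .str.extract(r"(\d+\.?\d*)/\d+", expand=False)
                              .astype(float))
    return df
```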

Tweet Image Predictions Dataset¶

  • First, a copy of the dataset was created for use throughout the cleaning exercise
  • Then dropped the 66 duplicated images from the dataset
  • For the pictures where the top prediction was not a dog, 2nd or 3rd prediction was used to obtain the dog breed
  • Then replaced underscores with whitespace in the breed column, and then capitalized the first letter of each word to make it human readable
  • Finally, the cleaned version of this dataset was merged with the Twitter Archive dataset using tweet_id
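The fallback-prediction and name-cleanup steps can be sketched as below. `clean_predictions` is my own name; `p1`–`p3` and their `p*_dog` flags follow the dataset's schema:

```python
import pandas as pd

def clean_predictions(df):
    """Dedupe images, fall back to lower predictions, humanize breed names."""
    df = df.copy().drop_duplicates(subset="jpg_url")
    # use p1 when it is a dog, else p2, else p3 (NaN if none is a dog)
    df["breed"] = df["p1"].where(df["p1_dog"],
                   df["p2"].where(df["p2_dog"],
                    df["p3"].where(df["p3_dog"])))
    # underscores out, title case in, for human-readable breed names
    df["breed"] = df["breed"].str.replace("_", " ").str.title()
    return df
```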

Twitter API¶

  • First, a copy of the dataset was created for use throughout the cleaning exercise
  • The text range column was split into two separate columns: lower_text_range and upper_text_range
  • Since the dataset contains a tweet_id column, it was merged with the Twitter Archive dataset
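The split can be sketched as below, assuming `display_text_range` holds two-element lists as returned by the Twitter API; the function and output column names are my own:

```python
import pandas as pd

def split_text_range(df):
    """Split the two-element display_text_range into its own columns."""
    df = df.copy()
    bounds = pd.DataFrame(df["display_text_range"].tolist(),
                          columns=["lower_text_range", "upper_text_range"],
                          index=df.index)
    return df.drop(columns="display_text_range").join(bounds)
```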

Storing Data¶

Before further analysis, the cleaned, consolidated dataset was saved to a CSV file named twitter_archive_master.csv and to an SQLite database file.
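The storage step can be sketched as below; `store_master`, the database file name, and the table name `master` are my own choices, since the report does not specify them:

```python
import sqlite3
import pandas as pd

def store_master(df, csv_path="twitter_archive_master.csv",
                 db_path="twitter_archive_master.db"):
    """Persist the consolidated dataset as a CSV file and an SQLite table."""
    df.to_csv(csv_path, index=False)
    with sqlite3.connect(db_path) as conn:
        # pandas writes directly through a sqlite3 connection
        df.to_sql("master", conn, if_exists="replace", index=False)
```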