Building a Twitter Data pipeline using twarc and jq


01 Oct 2021 - Shell scripting

TODO: full write-up… For a standalone bash script that both accesses the Twitter API and wrangles the initial raw dataset, please view: GitHub Script… For the script's role as a component of a full NLP project (the context the diagram is most relevant to), please view: GitHub Script

The diagram depicts the bash script's processes, that is, the initial transformation workflow applied to the gathered raw datasets: tweet property parametrization; character pattern matching to replace newline escape characters; format conversion from JSONL to CSV; removal of duplicate tweet entries; and text lowercasing. De-duplication removed 37,000 unwanted data records.


Diagram of process workflow of the raw data processing script
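
The steps listed above can be sketched in a few lines of bash and jq. This is only a minimal illustration, not the script linked above: the function name `clean_tweets`, the chosen fields (`id`, `created_at`, `text`), and the choice to de-duplicate on the tweet id are all assumptions, and the input is assumed to hold one tweet object per line. Lowercasing is done inside the jq filter for convenience, so the step order differs slightly from the list above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads tweet JSONL on stdin, writes cleaned CSV on stdout.
# The field names (id, created_at, text) are assumptions about the tweet objects.
clean_tweets() {
  # CSV header for the selected tweet properties.
  echo '"id","created_at","text"'

  # 1. Select properties and build one CSV row per tweet (@csv).
  # 2. Replace runs of newline / carriage-return characters in the text with a space.
  # 3. Lowercase the text (ascii_downcase only affects ASCII letters).
  # 4. Drop rows whose first field (the tweet id) has already been seen.
  jq -r '[.id, .created_at, ((.text // "") | gsub("[\\r\\n]+"; " ") | ascii_downcase)] | @csv' \
    | awk -F, '!seen[$1]++'
}
```

With a raw file named, say, `tweets.jsonl` (a placeholder name), the function could be run as `clean_tweets < tweets.jsonl > tweets.csv`. Replacing newlines before the CSV conversion keeps every tweet on a single output line, which is what makes the line-based de-duplication in awk safe.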


