ptools
ptools is a package to help you organize your data pipeline project. Since setting up a project follows recurrent steps, a default procedure is suggested here to save time. The purpose is also to ease project upgrades and to allow unit testing.
The first step of a project is to define what you want to do with your data. If there are several uses for your data, you may want to split your project into pieces. Straightforward runs will let you get quicker wins and also keep your teams (end users, devs, …) motivated.
By default the following structure is created to keep track of your data processing:
landing
: contains only outside raw data that will be processed then archived. If things go wrong, you will be able to start from scratch from there.

data
: your operations start here; the default is to process only csv files before building Impala or Hive tables.

raw
: for primary operations like converting the format of your data (e.g. from json to csv); a minimal sketch of such a step follows this list.

intermediate
: optional; depending on your pipeline you may want to reshape the data or add more cleaning steps.

final
: here is stored the cleaned data that you want to build Impala or Hive tables upon. The final clean data will be automatically converted to Impala (or Hive) tables, with the corresponding types, from the final folder(s) on HDFS.

Then you may aggregate it using Hadoop. There are several operations that you cannot perform easily, or at all, in Impala/Hive (such as complex data transformations or reshaping). But you should perform joins using Hadoop, to ensure that all the data is matched as intended no matter how late it was uploaded, and to keep your project viable over time (e.g. five years from now your R code might not be able to join efficiently on much more data).
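As a minimal sketch of what a raw_*.R step could look like, here is a conversion from json in the landing folder to csv in the raw folder. The file names, paths and the use of jsonlite are illustrative assumptions, not part of ptools:

```r
# raw_to_csv.R -- illustrative "raw" step: convert a landing JSON file to csv
# in the raw folder. File names and paths below are assumptions.
library(jsonlite)

landing_file <- "landing/events.json"   # hypothetical file dropped by an outside source
raw_file     <- "data/raw/events.csv"   # hypothetical output consumed by later steps

events <- jsonlite::fromJSON(landing_file, flatten = TRUE)  # read raw JSON into a data.frame
dir.create(dirname(raw_file), recursive = TRUE, showWarnings = FALSE)
write.csv(events, raw_file, row.names = FALSE)              # only csv files past this point
```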
Thus you can focus solely on cleaning your data, without ever touching the landing folder. The structure of your project should look like:
project_name
|- .gitignore
|- data
|  |- references.csv
|- documents
|  |- meeting_notes.md
|- R
|  |- raw_*.R    # e.g. raw_to_csv.R
|  |- inter_*.R  # e.g. inter_reshape.R
|  |- final_*.R  # e.g. final_types.R
|- project_name.Rproj
|- README.md
|- README.Rmd
|- vignette
|  |- report.Rmd
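If you prefer to create this skeleton by hand rather than through whatever helper ptools provides, a minimal base-R sketch could look like the following (the concrete script names are assumptions taken from the examples above):

```r
# scaffold.R -- hand-rolled sketch of the project skeleton above; ptools may
# offer its own helper for this, so treat these calls as an assumption.
project <- "project_name"
dirs  <- file.path(project, c("data", "documents", "R", "vignette"))
files <- file.path(project, c(".gitignore", "README.md", "README.Rmd",
                              "project_name.Rproj", "R/raw_to_csv.R",
                              "R/inter_reshape.R", "R/final_types.R",
                              "vignette/report.Rmd"))

for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
invisible(lapply(files, file.create))   # empty placeholders to fill in later
```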
For example, inter_reshape.R is an intermediate job that takes data from the raw folder, reshapes it (e.g. from long to wide format) and writes it to the intermediate folder; a sketch is given below. In the data folder one can put a variables dictionary. The README.Rmd file is the minimal documentation needed for you or a peer to take the project over later on.

It is your choice whether to build a package from this structure, to ensure the reproducibility and stability of your code over time, or simply to source your scripts for each pipeline job.
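As an illustration of what inter_reshape.R could contain, here is a long-to-wide reshape. The column names, paths and the use of tidyr are assumptions made for the sketch, not requirements:

```r
# inter_reshape.R -- illustrative intermediate job: read the raw csv, reshape
# it from long to wide, write it to the intermediate folder. Names are assumptions.
library(tidyr)

raw <- read.csv("data/raw/events.csv", stringsAsFactors = FALSE)

# one row per id, one column per measurement name (long -> wide)
wide <- tidyr::pivot_wider(raw, id_cols = id,
                           names_from = measure, values_from = value)

dir.create("data/intermediate", recursive = TRUE, showWarnings = FALSE)
write.csv(wide, "data/intermediate/events_wide.csv", row.names = FALSE)
```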
If you rely on this pipeline often, or plan to keep the code stable over time, the best thing to do is to package your code (from the structure above, this means adding a DESCRIPTION file, then compiling and debugging) and to add a Dockerfile.
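A minimal sketch of that packaging step, assuming you use the standard devtools workflow (this is not a ptools requirement, and it presumes a DESCRIPTION file already sits at the project root):

```r
# package.R -- sketch of turning the project into a package with devtools.
library(devtools)

devtools::document("project_name")   # generate NAMESPACE and man pages from roxygen comments
devtools::check("project_name")      # build, run R CMD check and the unit tests
devtools::install("project_name")    # install locally so pipeline jobs can library() it
```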