This project focuses on creating a workflow to take data in its raw form and process it to a point where it can be handed off to analysts for further work (exploratory analysis, predictive modeling, etc.). The software Cloudera is used to set up a "Data Hub": various databases to divide up the ETL process into two parts: the back office and the front office.
The back office consists of the first three parts of the ETL process: staging, cleansing, and comforming. Staging the data involves transferring an copy from the original system for analysis without transformations. Cleansing the data involves a general cleanup to fit specific analyst needs (applying data types like timezones stamps, replacing NULL values, verify addresses, etc.). Conforming the data is where the all final checks and ids are filled in, as well as calculation of various metrics for the analysts to use.
The front office is the destination for the final prepped data for analysts to use. Any new incoming observations received is inserted as a row, and any changes made to observations are updated then inserted (also called upserted).
A general view of the this process can be seen here:
<img src = "http://i.imgur.com/oRJrVHJ.png" height = "500" width = "750">
In an ideal world, data that is delivered to front office is:
+ Contains an auditable history of changes
+ Structured in a way that is easy to understand
The two main sources of inputs from the Data Hub are:
+ Real-Time Events, via streaming
+ Static Data, via batch processing
The three main sources of outputs from Data Hub are:
+ Analysts, via SQL/BI tools
+ Applications, via APIs
+ Data Science use cases, such as Machine Learning (R, Python, Scala, etc.)
This highlights the general workflow (inputs/outputs):
<img src = "http://i.imgur.com/wWInzdM.png" height = "500" width = "750">
Everything is done in via the software Cloudera (which needs to be hosted via VirtualBox); within the software SQL through Hive will be mainly be used in this project).
# Part 1
## Creating the Cloudera environment with VirtualBox (Mac)
### Downloading the necessary software
First, we must install the software that will allow us to run our Cloudera virtual environment: VirtualBox. The link to install can be found [here](https://www.virtualbox.org/wiki/Downloads). Browse to the correct software depending on your MacOS.
**Disclaimer: If you have a Windows machine you may not need to download VirtualBox, but these instructions are not inclusive of the process for non-Mac machines.**
Next, download the sandbox image from Cloudera for VirtualBox [here](https://www.cloudera.com/downloads/quickstart_vms/5-10.html). You will likely get a .zip file; unzip it to get the specific file you need.
Lastly for downloads: Go to the link [here](https://github.com/Njfritter/linkedinLearning/tree/master/dataEngineering) and download the exercise files. Or clone the repository, whichever you choose :)
### Installing software/prepping Cloudera
Now run the .dmg file for VirtualBox and follow the instructions. This should install VirtualBox on your machine and should look like this:
Now within VirtualBox:
+ Open VirtualBox, File → Import Appliance
+ Select icon to the right of the text box, find .ovf file (Open Virtualization Format) from the unzipped Cloudera file
* You may delete the other Cloudera file as it is big and not necessary for this project
You should now see the Cloudera software as "cloudera-quickstart-vm-VERSION-NUMBER-virtualbox" (with whatever version you've download subbed in above); to make sure we have access to our exercise files, within the Cloudera environment:
+ Click Settings → Shared Folders, then the folder icon on the right with a plus sign
+ Find the Exercise Files and select them
+ Make sure the settings “Auto-Mount” and “Make Permanent” are selected
+ “Make Permanent” may not be there no worries
+ Run the Cloudera machine within VirtualBox (Double click, then Start --> Normal Start)
* It may take a while to boot up, just wait and an image like this should show up:
Now, the Exercise files will be avaiable in the form “sf_Exercise_Files”. It will not be accessible yet on the left because we have not given the environment permission to use it.
To do this:
+ Open terminal within Cloudera (far right icon on top bar) & type:
* sudo gedit /etc/group
+ Scroll down to find vboxsf:x:474:
+ Append “cloudera” to the end of this
+ Hit “Save” then close the window and the Terminal
Now the files are saved; however we need to reset the virtual environment to have the changes take place. So within VirtualBox, double click the cloudera machine and click "Reset" and wait for the machine to reset. Once reset, you should be able to access “sf_Exercise_Files” folder.
### Setting up the file structure for analysis
Lastly we must set up the file structure through which we will be completing the rest of this project.
+ Through the Mozilla Firebox internet tab that opened with the Cloudera Live page, scroll down to “Analyze Your Data” click “Launch Hue UI”
+ Sign in with the username AND password “cloudera”
+ When Hue opens it will check the current configuration
* If “OOZIE_EMAIL_SERVER” is the only issue, it can safely be ignored
* Otherwise a Google search is in order
+ Go to “Query Editors” and click “Hive” (place to run queries)
* Located in top left barmenu under "Query Data"
+ Open the “sf_Exercise_Files” and open “Scripts”
+ Open the “2_1.sql” file and hit “Display”
+ Run first three commands to make directories in hadoop
+ Scroll down to the line that reads:
* Add jar hdfs:///user/cloudera/csv-serde-1.1.2-0.11.0-all.jar;
+ Copy the command, go to Hive and click “File Browser” or “HDFS Browser”
+ Under “/user/cloudera” select “Upload” and “File”
+ Find the jar file in the “sf_Exercise_Files” under “setup” and upload
+ Go back to Hive and run the command you copied in the query editor
* This will allow us to access csv’s in a special way
Now that our environment is up and running, it's time to start creating our pipeline!
(Please stay tuned for the next parts of this project). Also feel free to clone this repository on Github.