Tame the Anaconda (Python)

If you are planning to work with Python, you can download the Anaconda distribution of Python and install it on your system. conda is the main command you will use when working with Anaconda. It lets you create multiple virtual environments, each with its own version of Python and its own set of tools. Conda Commands: Check the version of conda:

Check…

Apache Spark Basics with Scala

Prerequisites: Be comfortable with the Scala language and have access to a Spark installation. Dataset: Record Linkage Comparison Patterns Data Set. 1. Download and Store: Download the dataset from the above repository and store it either on the local file system or on HDFS: Local File System:

HDFS File System:

2. Launch Spark: Launch Spark on local…

Frequently used HDFS / Hadoop shell commands

Hadoop version

Contents of the root directory in HDFS

Amount of space used and available on currently mounted filesystem

Number of directories, files and bytes under the paths that match the specified file pattern

DFS filesystem checking utility

A cluster balancing utility

Create a new directory named “data” below…
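The commands that belong to the descriptions above are not shown in this excerpt. As a rough sketch only, the following Python snippet pairs the first six descriptions with the standard Hadoop CLI invocations they most plausibly refer to. The mapping, the `HDFS_COMMANDS` name, the `run_command` helper and the choice of `/` as the path are all assumptions, not the original post's commands. The directory-creation entry is left out because its path is truncated.

```python
import shutil
import subprocess

# Assumed mapping from the descriptions above to standard Hadoop CLI commands.
# These are likely equivalents, not necessarily the author's exact invocations.
HDFS_COMMANDS = {
    "Hadoop version": ["hadoop", "version"],
    "Contents of the root directory in HDFS": ["hdfs", "dfs", "-ls", "/"],
    "Space used and available": ["hdfs", "dfs", "-df", "-h", "/"],
    "Directories, files and bytes under a path": ["hdfs", "dfs", "-count", "/"],
    "Filesystem check": ["hdfs", "fsck", "/"],
    "Cluster balancer": ["hdfs", "balancer"],
}


def run_command(description):
    """Run one command and return its stdout, or None if the CLI is not installed."""
    argv = HDFS_COMMANDS[description]
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Print the command lines without executing anything against a cluster.
    for description, argv in HDFS_COMMANDS.items():
        print(f"{description}: {' '.join(argv)}")
```

Calling `run_command("Hadoop version")` on a machine with Hadoop on the PATH returns the version text. Note that the balancer entry starts a long-running job, so it is best printed rather than executed casually.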

Installing Hadoop 2.6 on Ubuntu 14.04 (Single-Node Cluster)

Let’s see how to install a single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu. 1. Update the source list:

2. Check if Java is installed

3. Add a dedicated Hadoop user

4. Install ssh: How to install ssh and…

Plots / Charts / Graphs

A picture is worth a thousand words. Scatter Plot: A scatter plot reveals relationships or associations between two variables (ref). How do you draw a scatter plot using matplotlib (ref)? What is a Contour Plot? A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format (ref). How…
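To make the two definitions above concrete, here is a minimal matplotlib sketch. It is not the code from the original post: the synthetic data, the Gaussian surface, the variable names and the output file name `plots.png` are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (assumed): y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Synthetic surface (assumed): z = exp(-(x^2 + y^2) / 2) on a regular grid.
gx, gy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
gz = np.exp(-(gx**2 + gy**2) / 2)

fig, (ax_scatter, ax_contour) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: one marker per (x, y) pair exposes the association.
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

# Contour plot: each line joins grid points that share the same z value.
contours = ax_contour.contour(gx, gy, gz, levels=8)
ax_contour.clabel(contours, inline=True, fontsize=8)
ax_contour.set_title("Contour plot")
ax_contour.set_xlabel("x")
ax_contour.set_ylabel("y")

fig.tight_layout()
fig.savefig("plots.png")  # hypothetical output file name
```

In the left panel the upward-sloping cloud of points shows the positive association between the two variables. In the right panel the concentric rings are the constant-z slices of the bell-shaped surface.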
