Saturday, July 16, 2016

Basic Stats

Mean
A mean score is an average score, often denoted by X̄ (X-bar). It is the sum of the individual scores divided by the number of individuals.

Median

The median is a simple measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.

Mode
The mode is the most frequently appearing value in a population or sample.
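
A quick illustrative sketch using Python's built-in statistics module (the list of scores below is made-up sample data):

# Mean, median, and mode of a small made-up list of scores.
import statistics

scores = [70, 85, 85, 90, 60, 75, 85]

print(statistics.mean(scores))    # sum of scores / number of scores -> 78.57...
print(statistics.median(scores))  # middle value after sorting -> 85
print(statistics.mode(scores))    # most frequently appearing value -> 85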


Wednesday, July 6, 2016

Accessing HBase from the shell

Create table

create 'Test-Table', 'Test-Column-Family'

Add records to the table

put 'Test-Table','ROW_KEY','Test-Column-Family:Test-Column-Qualifier','Test-Value'

Scan entire table

scan 'Test-Table'

Scan with Column Family

scan 'Test-Table', {COLUMNS => 'Test-Column-Family'}


Scan with Column Family and Column Qualifier
scan 'Test-Table', {COLUMNS => 'Test-Column-Family:Test-Column-Qualifier'}

Scan the entire table, limit number of records displayed to 10

scan 'Test-Table', {COLUMNS => 'Test-Column-Family:Test-Column-Qualifier', LIMIT => 10}

Add a column family to an existing table

alter 'Test-Table', NAME => 'Test-Column-Family2'


Delete a Column Family

alter 'Test-Table', 'delete' => 'Test-Column-Family2'

Delete entire table

Deleting a table is a two-step process: first it has to be disabled, then dropped.

disable 'Test-Table' 

drop 'Test-Table'

Saturday, July 2, 2016

Machine Learning Terminology

Regularization is a technique used to address the overfitting problem. Other techniques include early stopping and cross-validation.
The bias–variance trade-off is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
i) The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

ii) The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.
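
As a rough, self-contained sketch of these ideas (the data, polynomial degree, and penalty strength below are all made-up choices for illustration), an L2 (ridge) penalty shrinks the weights of an otherwise overfitting polynomial fit:

# Sketch: L2 (ridge) regularization shrinking the weights of a noisy polynomial fit.
# The data, degree, and penalty strength (lam) are made up for illustration.
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.3 * rng.randn(x.size)  # noisy training targets

X = np.vander(x, 8)  # degree-7 polynomial features

def fit(X, y, lam):
    # Ridge solution: w = (X^T X + lam * I)^(-1) X^T y; lam = 0 is ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = fit(X, y, lam=0.0)    # no penalty: free to fit the noise (high variance)
w_ridge = fit(X, y, lam=0.1)  # L2 penalty: smaller weights, smoother fit (more bias, less variance)

print("unregularized weight norm:", np.linalg.norm(w_ols))
print("ridge weight norm:        ", np.linalg.norm(w_ridge))

Early stopping and cross-validation serve the same goal: the former halts training before the model memorizes noise, and the latter estimates generalization error so settings like the penalty strength can be chosen.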
Classification of algorithms:
Discriminative Learning Algorithms:
Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0,1} (such as the perceptron algorithm), are called discriminative learning algorithms.
Generative Learning Algorithms:
Algorithms that instead try to model p(x|y) (and p(y)) are called generative learning algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.
Source: http://cs229.stanford.edu/notes/cs229-notes2.pdf
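
A hedged scikit-learn sketch of the distinction (the synthetic dataset and the specific models are assumptions chosen for illustration): logistic regression learns p(y|x) directly, while Gaussian Naive Bayes models p(x|y) and p(y) and applies Bayes' rule at prediction time.

# Sketch: discriminative (logistic regression) vs. generative (Gaussian Naive Bayes) learners.
# The synthetic dataset and model choices are made up for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # learns p(y|x) directly
from sklearn.naive_bayes import GaussianNB            # models p(x|y) and p(y)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

discriminative = LogisticRegression(max_iter=1000).fit(X_train, y_train)
generative = GaussianNB().fit(X_train, y_train)

print("logistic regression accuracy:", discriminative.score(X_test, y_test))
print("gaussian naive bayes accuracy:", generative.score(X_test, y_test))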

What is a Self-Organizing Map in Neural Nets?

Teuvo Kohonen introduced a special class of ANNs called Self-Organizing feature maps. These maps are based on competitive learning.

In competitive learning, neurons compete among themselves to be activated.

The brain is dominated by the cerebral cortex, a very complex structure of billions of neurons and hundreds of billions of synapses.

It includes areas that are responsible for different human activities (motor, visual, auditory, somatosensory, etc.) and associated with different sensory inputs.

Each sensory input is mapped into a corresponding area of the cerebral cortex. The cortex is a self-organizing computational map in the human brain.
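
A minimal sketch of competitive learning in a Kohonen self-organizing map, written in plain NumPy (the grid size, learning rate, neighborhood radius, decay schedule, and training data are all made-up choices for illustration):

# Minimal sketch of a Self-Organizing Map (Kohonen map) trained with competitive learning.
# Grid size, learning rate, neighborhood radius, and training data are made up.
import numpy as np

rng = np.random.RandomState(0)
grid_h, grid_w, dim = 5, 5, 3           # 5x5 map of neurons, 3-dimensional inputs
weights = rng.rand(grid_h, grid_w, dim)

# Grid coordinates of every neuron, used for the neighborhood function.
coords = np.array([[i, j] for i in range(grid_h) for j in range(grid_w)]).reshape(grid_h, grid_w, 2)

data = rng.rand(200, dim)               # made-up training inputs

lr, radius = 0.5, 2.0
for epoch in range(20):
    for x in data:
        # Competition: the neuron whose weights are closest to x wins (best matching unit).
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Cooperation: the winner's neighbors are also pulled toward x, less strongly.
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights += lr * influence[..., None] * (x - weights)
    lr *= 0.9       # decay the learning rate over time
    radius *= 0.9   # shrink the neighborhood over time

print("trained weight shape:", weights.shape)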

Classification Model Performance Metrics

Confusion matrix

The standard 2x2 layout (rows are the actual class, columns are the predicted class):

                     Predicted: Cancer       Predicted: No Cancer
Actual: Cancer       TP (true positive)      FN (false negative)
Actual: No Cancer    FP (false positive)     TN (true negative)

Precision: the proportion of patients diagnosed as having cancer who actually had cancer. In other words, how many of the selected items are relevant.
Precision = TP / (TP + FP)

Recall: the proportion of patients who actually had cancer that were diagnosed as having cancer. In other words, how many of the relevant items are selected.
Recall = TP / (TP + FN)
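
A small sketch computing both metrics from made-up predictions (1 = cancer, 0 = no cancer; the labels below are invented for illustration):

# Sketch: precision and recall computed from a 2x2 confusion matrix.
# The true labels and predictions are made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)  # of everyone diagnosed with cancer, how many actually had it
recall = tp / (tp + fn)     # of everyone who actually had cancer, how many were diagnosed

print("TP, FP, FN, TN:", tp, fp, fn, tn)  # 3, 1, 1, 3
print("precision:", precision)            # 0.75
print("recall:   ", recall)               # 0.75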