Tuesday, February 28, 2017

Digging Deeper: Probability Space & Euclidean Space

“Measure Theory becomes Probability Theory when Euclidean Space becomes Probability Space”

In probability theory we model the probability of a complex real-world phenomenon that is random in nature. The probability of occurrence of the outcomes of such a phenomenon depends on several variables that are themselves random. This can be pictured as a multidimensional Euclidean space with one coordinate axis representing probability; this probability is a function of the multiple random variables. So a multidimensional probability space has to be visualized mentally. If there is only one random variable, the probability space reduces to a two-dimensional Euclidean space.

As we come closer to capturing the complexity of a real-life phenomenon by computing its probability, the simple formula of favorable cases/total cases takes an entirely new form. This form is governed by the axioms of probability theory, which are closely linked to measure theory. In a two-dimensional probability space, computing the probability for a continuous random variable amounts to finding an area. This is done with simple integration, as shown in the figure above (please refer to the blog post of 26 Feb 2017 for the details of these figures). For a three-dimensional space it is a volume, and we use double integrals to find this volume. But for an n-dimensional space, it is the Lebesgue measure (measure theory) that aids the computation of the n-dimensional volume. For ease of graphical representation, the figure below shows a three-dimensional space.
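As a small illustration of the "probability as area" idea (not part of the original post; the interval and the standard normal distribution are chosen arbitrarily), the sketch below approximates P(a ≤ X ≤ b) for a continuous random variable by numerically integrating its density, and cross-checks the result against the distribution function.

```python
# A minimal sketch: probability of an interval as the area under a density.
from scipy import integrate
from scipy.stats import norm

a, b = -1.0, 1.0  # illustrative interval, chosen arbitrarily

# Area under the standard normal density between a and b
area, _ = integrate.quad(norm.pdf, a, b)

# Cross-check with the distribution's own CDF
print(area, norm.cdf(b) - norm.cdf(a))  # both close to 0.6827
```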

Monday, February 27, 2017

Probability Distributions: In search of an underlying law/pattern

A probability distribution gives the distribution of probability over all the outcomes of a real phenomenon or experiment. These phenomena are random in nature. If the pattern exhibited by a probability distribution is described by an algebraic equation, that equation is called a probability density function (PDF) or a probability mass function (PMF), depending on whether the random variable is continuous or discrete. The Binomial and Poisson distributions are examples of PMFs; the Normal and Exponential distributions are examples of PDFs. So the dynamics of change in probability are predicted and analyzed with the help of PDFs and PMFs. The probability p of an event is directly related to the frequency of that event: more frequent outcomes have a higher probability of occurrence, so p takes a higher value for more frequent events than for less frequent ones. p is a ratio taking values between 0 and 1. The probability of an impossible event is 0 and that of a sure event is 1.
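As a hedged illustration (the parameter values here are assumed, not taken from the post), the snippet below evaluates a PMF for a discrete distribution and a PDF for a continuous one using scipy.stats, and checks that the PMF sums to 1 over all outcomes.

```python
# Evaluating a PMF (discrete) and a PDF (continuous) with scipy.stats.
from scipy.stats import binom, norm

# Binomial PMF: probability of exactly 3 successes in 10 trials, p = 0.4
print(binom.pmf(3, n=10, p=0.4))

# Normal PDF: density at x = 1 for mean 0, standard deviation 1
print(norm.pdf(1, loc=0, scale=1))

# A PMF sums to 1 over all possible outcomes
print(sum(binom.pmf(k, n=10, p=0.4) for k in range(11)))  # 1.0
```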

Sunday, February 26, 2017

Let’s zoom into Probability Theory

For estimating the probability of an event resulting from a random phenomenon, we have to visualize all the possible outcomes of this phenomenon (experiment). It could be a simple event, where the probability space is two-dimensional, or a complex event (dependent on several other random variables), where the probability space is multidimensional. When probability is a function of a single random variable, we are looking at the random phenomenon from a simple perspective. The probability space here is two-dimensional, with the random variable on the X axis and the probability on the Y axis. For example, when we study the change in blood pressure with respect to weight, the probability of an individual suffering from hypertension will be defined by the following two-dimensional space.

We now discuss another example with more variables in the picture. Here the probability space is multidimensional. For example, we study the duration of stay of a patient in the ICU (Intensive Care Unit) with respect to his/her weight, blood pressure, blood sugar and cholesterol. The probability of a longer-than-average stay will be defined by a five-dimensional Euclidean space. For ease of graphical representation, a three-dimensional Euclidean space will look like the following.

Wednesday, February 22, 2017

Probability Theory: An effort to bind chance factor

Initially developed for the gamblers of the 16th century, probability theory has crept into all the disciplines of the modern world. Gambling is called the game of chance, but every exact science is also influenced by this “game of chance”. There is a deterministic and a probabilistic perspective to every subject. Deterministic aspects can be exactly determined and completely controlled. The probabilistic part cannot be controlled completely and hence becomes a “game of chance”. Let’s take two examples, one from the field of medicine and one from engineering science. In medical science, the different steps in the surgery of an individual are deterministic, but the reaction of the individual’s body to this surgery is probabilistic. The number of days spent in the hospital before he/she is discharged and the speed of his/her recovery are probabilistic. Similarly, in a manufacturing industry, the steps in the production of an electronic device are deterministic, but the number of defectives in a lot, the time to first failure of the device and the lifetime of the device are probabilistic. The concept of probability initially developed for gamblers is now replicated in all disciplines. Here probability is quantified with a mathematical value p, where p lies between 0 and 1. Our aim is to use the different theorems developed in this field and find a suitable value of p. An accurate value of p will help us make valid inferences about the population from the sample.

Tuesday, February 21, 2017

Computational Tools in Data Analysis

In today’s world, evidence-based (quantitative data) methods of approaching and handling a problem have become indispensable. This is due to the cheap computing facilities and software available on laptops and desktop computers. These “computational aids” help crunch large sets of numbers in a very short time. There are different approaches to this process: a spreadsheet-like approach (for example MS Excel), a user-friendly software approach (like SPSS, the Statistical Package for the Social Sciences), a statistical-computing-environment approach in the form of free software (like R), or a programming-language approach with, say, FORTRAN.
Laborious and time-consuming calculations previously done by hand, or with the help of a simple pocket calculator, can now be “mechanized” with computer software. Tedious calculations that once took hours can now be done within minutes, saving us the drudgery of long calculations.

These formula-based calculations, previously done by hand, form the foundation of such computer programs and software. Using the “old-fashioned” technique of solving simple versions of such mathematical problems helps us get a feel for the problem. So the practice of doing a sum by hand or with a pocket calculator before trying it in the software is also necessary. This way we can keep track of the algorithm needed to implement a specific statistical methodology in a specific software package.
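A minimal sketch of this cross-checking habit (the numbers are illustrative only): the sample mean and variance computed step by step, as one would with a pocket calculator, and then the same quantities obtained from Python's statistics module.

```python
# "By hand" versus "mechanized": the two results should agree.
import statistics

data = [4.2, 5.1, 3.8, 4.9, 5.0]  # illustrative numbers, not from the post

# By hand: mean = sum / n, sample variance = sum of squared deviations / (n - 1)
n = len(data)
mean_by_hand = sum(data) / n
var_by_hand = sum((x - mean_by_hand) ** 2 for x in data) / (n - 1)

# The same calculation done by the software
print(mean_by_hand, statistics.mean(data))
print(var_by_hand, statistics.variance(data))
```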

Univariate, Bivariate and Multivariate data – Amplified with an example

Example: If we are interested in studying the efficacy of medicine B on 20 individuals, the terms univariate, bivariate and multivariate can be explained in the following manner. The univariate, bivariate and multivariate data collected will be used for wider statistical inference (sample to population).
Univariate data: If we record data on, say, the blood pressure of these 20 individuals, where the blood pressure readings are an indicator of the impact of medicine B on their health, then this is univariate data.
Bivariate data: If we record data on, say, the blood pressure and cholesterol levels of these 20 individuals, where blood pressure and cholesterol are indicators of the impact of medicine B on their health, then this is bivariate data. The interdependence between the change in blood pressure levels and cholesterol levels as an impact of medicine B can also be further analyzed.

Multivariate data: If we record data on, say, the blood pressure, cholesterol, weight and blood sugar levels of each individual, then it is called multivariate data. In this case the interrelationships between these variables can be minutely analyzed. The dynamics of change in the value of one variable as a result of change in another variable can be statistically predicted. The collective effect of all the other variables (say cholesterol, weight and blood sugar) on a single variable (say blood pressure) can also be studied and predicted.
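A hypothetical sketch of how such data might be laid out for the 20 individuals (the values below are simulated purely for illustration): univariate data as a single column, bivariate data as two columns and multivariate data as four columns, after which interrelationships can be examined, for example via a correlation matrix.

```python
# Illustrative univariate, bivariate and multivariate layouts.
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Univariate: one measurement per individual (blood pressure)
blood_pressure = rng.normal(120, 10, size=n)                # shape (20,)

# Bivariate: two measurements per individual (blood pressure, cholesterol)
bivariate = np.column_stack([blood_pressure,
                             rng.normal(190, 25, size=n)])  # shape (20, 2)

# Multivariate: four measurements per individual
multivariate = np.column_stack([bivariate,
                                rng.normal(70, 8, size=n),    # weight
                                rng.normal(95, 12, size=n)])  # blood sugar
print(blood_pressure.shape, bivariate.shape, multivariate.shape)

# Interrelationships between the four variables
print(np.corrcoef(multivariate, rowvar=False))  # 4 x 4 correlation matrix
```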

Monday, February 20, 2017

Some more clarification…

Example: A manufacturing company has developed and introduced a new technology in its factory. This technology produces superior cables in comparison to the cables produced by the existing technology. The company wants to test the cables produced by the new technology and compare them with the cables produced by the old technology.
Research Hypothesis: The new technology is superior to the existing technology.
Population: All cables that have ever been produced by the new technology and by the existing technology.
Sample: Ten cables produced by the new technology and ten cables produced by the existing technology.
Data Collection: Various tests related to the quality of the cables are conducted on the sample of twenty cables, and quantitative data are collected on these variables.
Statistical Analysis and Inference: The analysis of the data obtained from these two test groups will result in the inference that the new technology is better than the old technology, provided the quality-related variables show superior results for the new technology in comparison to the existing one. This has to be validated by the sample data of 20 cables classified into the two technology groups (a sketch of such a comparison is given below).
Advantage: We test 20 cables and generalize to the thousands of cables produced, thus saving time and money.
Probability: This generalization of results from 20 cables to the thousands of cables ever produced involves some uncertainty, quantified by probability (level of significance and level of confidence).
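The comparison described above could be carried out, for example, with a two-sample t-test. The sketch below uses invented breaking-strength values for the ten cables from each technology and scipy's ttest_ind (the one-sided alternative needs a reasonably recent scipy); it is one possible analysis, not the only one.

```python
# Hypothetical breaking strengths (invented for illustration) compared
# with a two-sample t-test.
from scipy import stats

new_tech = [52.1, 53.4, 51.8, 54.0, 52.9, 53.1, 52.5, 53.8, 52.2, 53.6]
old_tech = [50.2, 49.8, 51.0, 50.5, 49.9, 50.8, 50.1, 50.6, 49.7, 50.4]

# One-sided alternative: the new technology gives higher strength
res = stats.ttest_ind(new_tech, old_tech, alternative='greater')
print(res.statistic, res.pvalue)

# If the p-value is below the chosen level of significance (say 0.05),
# the sample supports the research hypothesis that the new technology
# is superior.
```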

Thursday, February 16, 2017

The roles of sample mean and sample variance in the statistical landscape

The sample mean and sample variance can be taken as a “genetic code” of the data. These measures provide information on the average and the spread of the sample data, but they also play a significant role in the wider perspective of inferential statistics. In hypothesis testing, hypotheses about the population are tested on the basis of test statistics built from these measures. Intervals within which the unknown population parameter lies are also built with these measures; these intervals are called confidence intervals. These measures satisfy the properties of a good estimator, namely unbiasedness, consistency and efficiency. They can be used for fitting probability distributions to the data, and the parameters of these distributions are also expressed in terms of the mean and variance.
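As a small hedged example of the confidence-interval role described above (the data values are invented), the sketch below builds a 95% t-based interval for the unknown population mean from the sample mean and sample variance.

```python
# A 95% t-based confidence interval for the population mean.
import math
from scipy import stats

data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]  # illustrative values
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # sample variance
se = math.sqrt(s2 / n)                              # standard error of the mean

t_crit = stats.t.ppf(0.975, df=n - 1)               # 95% critical value
print(xbar - t_crit * se, xbar + t_crit * se)       # interval for the population mean
```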

Wednesday, February 15, 2017

What Should be the Sample Size?

This is a fundamental question in any data-based research. A big sample ensures accuracy and wider predictability of the results, but it also requires more time and money, making the study expensive and slow. This might not be realistic for countries like Nepal, where, in comparison to a developed country, research is conducted with much less financial support from governmental funding agencies. So if the inherent variability of the population with respect to the objective of the study is known in advance from previous studies, a small sample that is just big enough to reflect this variability can be equally efficient. While deciding on the sample size, a compromise therefore has to be made, with time and money on one side and accuracy on the other. If the inherent variability is low, the population is homogeneous and a small sample will be enough to reflect this variability; for a heterogeneous population, the sample size should be big enough to reveal this variability.
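One common textbook rule for this compromise, not stated explicitly in the post, sizes the sample for estimating a population mean as n = (z·σ/E)², where σ is the (previously studied) population standard deviation and E the acceptable margin of error. The sketch below shows how a homogeneous population needs far fewer observations than a heterogeneous one for the same precision.

```python
# Sample size for estimating a mean: n = (z * sigma / E)^2.
import math
from scipy.stats import norm

def required_sample_size(sigma, margin_of_error, confidence=0.95):
    z = norm.ppf(0.5 + confidence / 2)          # e.g. 1.96 for 95% confidence
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Homogeneous population (small sigma) versus heterogeneous population (large sigma)
print(required_sample_size(sigma=5, margin_of_error=2))    # small n suffices
print(required_sample_size(sigma=20, margin_of_error=2))   # much larger n needed
```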

Tuesday, February 14, 2017

Why do we need the Variance when we already have the Mean?

The roles played by the mean and the variance are different but complementary. The mean gives the average value of the data, whereas the variance gives the spread of the data. So, as a manufacturer interested in the quality of a product, the mean gives the average quality while the variance gives the consistency of this quality. These measures can also be used for in-depth analysis, resulting in wider generalization of results obtained from the sample to the whole population. This concept can be explained by the following example.

Average life expectancy, or average longevity in years, is a development indicator of any country. For a developed country it takes a higher value than for a developing country. For example, the average life expectancy of a Nepalese in 2012 was 67 years, whereas the average life expectancy of a Japanese in 2012 was 83 years. The higher average value for Japan, in contrast to that of Nepal, reflects a higher standard of living in Japan, with good health care facilities offered by the government. The variance of individual longevity around this average will be smaller in Japan, implying that individual lifespans do not differ much from the average value. So for a developed country the average is high and the variance (around the mean of 83 years) is low, as many individuals live to about 83 years. For Nepal the average is low and the variance (around the mean of 67 years) is high, as many die in their late fifties and many live into their seventies.
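An illustrative sketch of this point, using two invented samples of ages at death (hypothetical numbers, not the actual 2012 data): the "Japan-like" sample has a high mean and a low variance, while the "Nepal-like" sample has a lower mean and a much higher variance.

```python
# Hypothetical ages at death illustrating high-mean/low-variance versus
# low-mean/high-variance.
import statistics

japan_like = [82, 84, 83, 85, 81, 83, 84, 82, 83, 86]
nepal_like = [58, 75, 62, 71, 55, 78, 66, 69, 59, 77]

print(statistics.mean(japan_like), statistics.variance(japan_like))  # ~83, small
print(statistics.mean(nepal_like), statistics.variance(nepal_like))  # ~67, large
```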

Monday, February 13, 2017

Mean, Mode, Median, Pie Charts and Bar Graphs

In Statistics Data is the main speaker. We have to let the Data do all the speaking and tell us its story.
But should we hear the full story? Or should we be satisfied with just the summary?
Mean, mode, median, pie charts and bar graphs seem to be very popular among students and researchers. But these measures give only an overview of the information conveyed by the data; they give only a summary of its story. There are several other statistical methods which conduct in-depth analysis of the data and tell us the full story. Let’s use the full potential of the data and hear its full story. Regression analysis, principal component analysis, factor analysis and categorical data analysis are some of these methods.

Our work as students and researchers is to apply the right statistics to the right data, so that the information communicated by the data is understood and imbibed by us. Then we have to link this information to the right context in the right manner.

Sunday, February 12, 2017

Zoom in & Zoom out/ Inferential Statistics & Descriptive Statistics

Statistical Analysis is split into two parts: Descriptive Statistics and Inferential Statistics. Zooming into the data, or taking a close-up view of the data, can be metaphorically linked to Inferential Statistics. Zooming out, or an overview of the data, is provided by Descriptive Statistics.


Data are collected in raw form. These data constitute a sample with a predetermined sample size. Some statistical measures provide a summary of the story told by the data; this summary is provided quantitatively in the form of descriptive statistics. Mean, mode, median, variance, etc. are some tools of descriptive statistics. Graphical representations like pie charts, bar charts, box plots, Pareto charts, etc. provide a visual summary of the data. These immediately follow the data collection process and help us get “acquainted” with the data. They are among the first milestones on the long road leading to good statistical inference; inferential statistics are the milestones appearing from the halfway point of this journey. Inferential statistics frequently uses results from descriptive statistics to make valid inferences about the population from the sample.

Friday, February 10, 2017

Unknown population & Known sample


The population is always unknown. Population parameters like the population mean µ, the population variance σ² and the population correlation ρ are also unknown. A population is the collection of all objects under study; these objects may or may not be human beings. In many situations it is impossible to study the whole population, and in others we are bound by limited time and money. A sample is a representative part of this population. If the sample is drawn in a proper manner, so that it reflects the variability in the population with respect to the objective of the study, we can make correct inferences about the population from the sample itself. With the data collected from this sample we compute the sample mean, sample variance and sample correlation; these are called sample statistics. We want the sample mean to be as close as possible to the population mean, the sample variance as close as possible to the population variance and the sample correlation as close as possible to the population correlation.
Unknown population & Known sample – Example
The following example can illustrate the concept of unknown population and known sample.
A company wants to know the efficacy of medicine B that is being sold currently in the market.
Population: It is the collection of all individuals in the world who are prescribed medicine B. The details of all individuals taking medicine B are unknown; it would require a lot of time and money to learn about all of them, so it is impossible to know the whole population. The proportion of people in this population cured of the disease after the use of medicine B is unknown. It is denoted by π and is a population parameter.
But we want to estimate π!
Sample: With prior knowledge of all the age groups using this medicine, we can select a sample of 20 individuals who have been taking it (this requires some research in sampling). These 20 people are thoroughly examined and the proportion of people cured of the disease is calculated. This sample proportion p is a sample statistic. With this known p we estimate the unknown π.

We are estimating π with p!
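A small sketch of this estimation (the cure count below is hypothetical): the sample proportion p estimates π, and a normal-approximation interval gives a range likely to contain π.

```python
# Estimating the unknown population proportion (pi) with the sample proportion p.
import math

n = 20          # individuals examined
cured = 14      # hypothetical number cured

p = cured / n                                   # sample proportion, estimates pi
se = math.sqrt(p * (1 - p) / n)                 # standard error of p
z = 1.96                                        # 95% normal critical value

print(p)                                        # point estimate of pi
print(p - z * se, p + z * se)                   # approximate 95% interval for pi
```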

Sample & Population/ Sample Statistics and Population Parameter



The population is unknown, so the values of the population parameters are also unknown. Population parameters are attributes related to the population: µ, σ² and ρ are the population parameters for the mean, variance and correlation. Due to time and monetary constraints it is impossible to study each and every unit of the population and hence to know these values.
The sample is known, so the sample statistics are known. Sample statistics are attributes related to the sample: the sample mean, sample variance and sample correlation are sample statistics. For a particular sample these statistics take known values.
We are trying to estimate the unknown population parameter from a known value of a sample statistic.


The core of Statistics lies in estimating these unknown population parameters with (known) sample statistics.
We want the sample estimates to be as close as possible to the unknown value of population parameters.
Based on these sample values, we predict an interval within which this unknown population parameter lies. This is done with a certain probability level, also called level of confidence.

Wednesday, February 8, 2017

Data & Data Analysis/ Statistics & Statistics

Generation of good quality data is crucial. But once the data have been generated, with a lot of patience and perseverance, they have to be analyzed. Some crucial questions have to be addressed here.
1. Should we be satisfied with just pie charts and bar charts?
2. Should we leave everything to the software to handle?
3. Should we hire a statistician to do all the work?
We analyze sample data to extract information. This information is drawn from the sample, but the results obtained from this extraction are not only true for that sample; they should hold true for the entire population. This process is the core of Statistics: we get data from a sample, but the results are generalized to the entire population. Some uncertainty is involved in this generalization process, and it is quantified by probability, or the level of confidence.


This process can be likened to squeezing a lemon, where the lemon is a metaphor for the data.
1. Extraction of information by Statistics/Data Analysis is like squeezing a lemon for its juice.
2. The right amount of pressure has to be applied to extract the right amount of juice.
Correct statistical tools have to be applied to the data.
3. If we apply less pressure, we get less lemon juice. The lemon is wasted, as much of the juice within it is not extracted.
If we just use simple graphical representations, we are not realizing the potential of the data. Our patience and perseverance in data generation go to waste, as we have not utilized the data to their full extent.
4. If we apply more pressure than required, the lemon juice becomes bitter.
If we do not use the correct statistical methodology, or overuse various methodologies, the information inferred might not be correct.
So the appropriateness of the methodology should be questioned at every step.
5. Should we let the device decide how much of the lemon should be squeezed? Should we leave everything to the statistical software?
The statistical software works under our instructions. We decide which methodology is suitable for the data and give instructions accordingly. The software simply obeys our orders through the commands (in a programming language) that we give. Our knowledge is very important; the software just obeys our commands. If we give wrong instructions due to inadequate knowledge, the software cannot tell us that we are applying the wrong methodology.
6. Should we let somebody else squeeze the lemon? Should we hire a statistician?

A statistician can do the analyses suitable for the data, but we should have enough knowledge ourselves. We can then monitor his/her progress and see whether the analyses are being steered in the right direction. Our knowledge is crucial: it complements the work of the statistician and can also crosscheck his/her work in many situations.

Tuesday, February 7, 2017

Optimize! Optimize! Minimize Sample Size, Maximize Data Quality

Good quality data is generated in the following manner.
1. A clear objective of study

It should answer the following questions.
a. What do we want to study?
b. How do we want to study it?
c. What questions (related to the study) should be addressed?
d. How should these questions be addressed?
Answering these questions requires knowledge of the subject; they bridge that subject (say Medicine or Engineering) to Statistics. We collect quantitative facts that validate the theory, and this theory is the object of our study. Points 1c and 1d above are an art in themselves, which can also be called the art of questionnaire design.
2. A representative sample

This sample should represent the entire cross-section of the population with respect to the objective of the study. A sample is a small replica of the population, in which all the features of the population are reflected in the right proportions.
For example, if we are interested in testing the efficacy of medicine A, this medicine should be administered to a sample comprising the entire cross-section of age groups that consume it. This cross-section should represent the population.
3. Design of experiments based on the principles of randomization, replication and local control
These principles ensure that the data are collected with minimum background noise. This can be metaphorically linked to tuning a radio: if we are at the right frequency, we get a clear radio message with minimum white noise. A well-designed experiment ensures that we can clearly hear the message conveyed by the data with minimum background noise (a small sketch of the randomization principle follows below).
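A minimal sketch of the randomization principle (the experimental units are hypothetical): twenty units are assigned at random to two treatment groups, so that background noise is spread evenly across the groups rather than confounded with the treatment.

```python
# Random assignment of 20 hypothetical experimental units to two groups.
import random

units = list(range(1, 21))      # experimental units, e.g. patients or cables
random.seed(1)                  # fixed seed only so the example is reproducible
random.shuffle(units)

group_a, group_b = units[:10], units[10:]
print("Treatment A:", sorted(group_a))
print("Treatment B:", sorted(group_b))
```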


Monday, February 6, 2017

Optimize Data Collection Process

Data collection, or the generation of primary data, is time-consuming and expensive work. But we can optimize the data collection process by obtaining maximum information from the minimum possible sample size.
Methods of optimization in data collection are the following.
1. A clear objective of study – what do we want to measure?
2. A list of well-worked-out questions; the data collected are the answers to these questions.
3. Methods of asking these questions: tests related to these questions, or the art of asking them.
4. Design of the experiments conducted to furnish answers to our questions while minimizing the background noise.
5. A sample representing the entire cross-section of the population (always keep the objective of study in mind!).

Types of data collected: Primary data and Secondary data

Primary data: Here we generate our own data through our own data collection process, so we can assure the quality of such primary data. This was discussed yesterday.

Secondary data: When secondary data are used in data analysis, we are actually using data published by other sources. Here we use fish that has been caught by somebody else, so we cannot be sure of its quality and freshness. Especially in developing countries, one cannot always be sure of the quality of secondary data, so we should use secondary data published by reliable sources.


Sunday, February 5, 2017

Let's get started! Data collection - continued

A well-exercised data collection process asks the right questions in the right manner, thus generating good quality data. Data are generated from experiments and sample surveys. High quality data comes from a well-exercised data collection process, properly designed experiments and pretested questionnaires.
Data are of two types, namely primary data and secondary data.
Primary data: Here we collect our own data, or in other words catch our own fish, so we can assure the quality of these data. Collection of primary data is a time-consuming and expensive process, but it lets us generate very good quality data.
For example: A pharmaceutical company claims that medicine A, which it has developed, is better than medicine B, which is already in the market.

So medicine A is administered to 10 people and medicine B to 10 other people. Several tests then have to be made on these 20 individuals. The results of these tests are numerical values which will validate or reject our hypothesis that medicine A is better than medicine B. The tests to be conducted on these 20 individuals have to be clearly specified; these tests are the questions for which we are seeking answers in terms of data. So the right questions, asked in the right manner, will generate good quality primary data.

Let's get started! Data collection

Good quality data is the foundation of good statistics and data analysis.
So if the foundation is not strong, meaning the data collected are faulty, the big structures of theory built on this foundation will not be strong either.
Data collection should be conducted with a lot of patience and perseverance.
Fishing is a suitable metaphor for the data collection process: it requires the same patience and perseverance.
But how is data collected?
The objective of data collection (study) should be clear before the process of data collection is started.
We should clearly know what we want to study.
Then data collection is like seeking answers to the right questions asked in the right manner.
A questionnaire is a list of such questions.

Preparation of a questionnaire requires a lot of practice and theoretical knowledge of the concept on which data are collected.

Saturday, February 4, 2017

Let's get started. But how do we start?

We have to validate our theory with quantitative facts.
But how is this done?
We collect data related to a theory we want to validate.
Why do we collect data?
Data answer our questions related to a theory.
For Example: A company has developed a medicine (say A) which it claims is better than the one available in the market (say B).
Theory: Medicine A is better than medicine B.
Validation (statistical): We give medicine A to ten people and note down its effects (this might take some time). We also give medicine B to ten people for comparison and note down its effects. That is, we note down data, which are quantitative values.
Results (requires time and patience): If the performance of A tested on 10 people is better than that of B, we claim that A is better than B.

So our theory (medicine A is better than medicine B) is backed by data.

Let's get started!

This need to justify any theory (in medicine, engineering, sociology or management) by solid quantitative facts makes Statistics indispensable in any discipline.

But then how much Statistics should be used?
Should we be satisfied with the use of bar graphs and pie charts?
Shouldn’t we explore the vast potential of statistical methodologies?
So we have to go bit by bit and step by step on this long and beautiful road.
This journey will be long but very interesting.
When we have overcome our fear of numbers and are not scared to apply them, we will have reached our goal.

So let’s get started!

Friday, February 3, 2017

What are we doing? We are giving answers with quantitative facts.


We are backing our results with quantitative facts.


But why?


Quantitative variables are objective in nature. Theories (in any discipline) validated by quantitative variables (numerical facts) are objective in nature. Such facts are widely accepted in all communities, including the scientific community. Subjective theories not backed by quantitative facts are not accepted, especially in the scientific community.


What are we doing?



1. Why do social scientists use quantitative variables such as the age at first marriage or the age of a woman at the birth of her first child as development indicators of a society?
2. Why do doctors use quantitative variables like blood pressure, total cholesterol and blood sugar as "well-being indicators" of an individual?
3. Why do engineers use quantitative variables like the breaking strength of cables as one of the indicators of the quality of cables produced by a particular engineering design?