Sunday, April 30, 2017

Insights into Concept of Hypothesis Testing (Part 13)


Decreasing Type I Error Increases Type II Error

Decreasing alpha increases beta




Population 1 = {1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5}
Population Mean = 3 and Population variance = 1.263
Population 2 = {3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 8}
Population Mean = 4.944, Population Variance = 1.719

The probability of rejecting a true null hypothesis is denoted by α (alpha). Rejecting a true null hypothesis is called Type I Error. In the adjacent figure, alpha is the area of the region where a sample coming from a parent population with mean 3 is still rejected, and it is falsely concluded that it comes from a parent population with mean 4.944. Committing Type I error disturbs the status quo, so we try to minimize this error by choosing a small alpha. But when we reduce alpha, the cutoff moves further away from the null mean, so more samples from the population with mean 4.944 fall in the acceptance region and beta increases, as seen from the areas of alpha and beta shown above.
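As a rough numerical sketch of this trade-off, the snippet below computes beta for a few values of alpha. It assumes samples of size 3 (the size used in Part 11), a normal sampling distribution for the sample mean, and an upper-tailed Z test; the function name `beta_for_alpha` is only illustrative and not part of the original posts.

```python
from math import sqrt
from statistics import NormalDist

def beta_for_alpha(alpha, mu0=3.0, var0=1.263, mu1=4.944, var1=1.719, n=3):
    """Type II error probability of an upper-tailed Z test at level alpha.

    Assumes the sample mean is approximately normal under both populations.
    """
    se0 = sqrt(var0 / n)  # standard error under Population 1 (null)
    se1 = sqrt(var1 / n)  # standard error under Population 2 (alternative)
    # Cutoff for the sample mean implied by alpha under the null hypothesis
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se0
    # Chance that a sample mean from Population 2 still falls below the cutoff
    return NormalDist(mu1, se1).cdf(cutoff)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  beta = {beta_for_alpha(alpha):.3f}")
```

Under these assumptions, each smaller alpha pushes the cutoff to the right and the printed beta grows.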

Thursday, April 27, 2017

Insights into Hypothesis Testing (Part 12)

Understanding Mathematics behind Type I Error and Type II Error
Let’s consider the following populations.
Population 1 = {1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5}
Mean = 3, Mode = 3, Median = 3
Population 2= {3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 8}
Mean = 5, Mode = 5, Median = 5
As shown in the previous post, if the sample mean is more than 4.27 we conclude that the sample doesn’t belong to Population 1. If the sample actually comes from Population 1, this is a Type I error. But if the sample mean is less than 4.27 we accept the null hypothesis, and then either we accept a true null hypothesis or we accept a false null hypothesis (Type II error). Type I error and Type II error with respect to Population 1 and Population 2 mentioned above are explained by the following diagram.


Type I Error and Type II Error
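The two error rates can also be checked by brute force. The sketch below enumerates every ordered sample of size 3 drawn with replacement and counts how often the sample mean crosses the 4.27 cutoff. The sample size, sampling with replacement, and the use of only the upper cutoff are assumptions made for illustration; Population 1 is taken in the 19-value form listed in Part 11, whose mean is exactly 3.

```python
from itertools import product

# Population 1 as listed in Part 11 (19 values, mean exactly 3)
population_1 = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5]
population_2 = [3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 8]
CUTOFF = 4.27
SAMPLE_SIZE = 3  # assumed, matching the sample size used in Part 11

def share_above_cutoff(population):
    """Fraction of all ordered samples (with replacement) whose mean exceeds CUTOFF."""
    samples = list(product(population, repeat=SAMPLE_SIZE))
    above = sum(1 for s in samples if sum(s) / SAMPLE_SIZE > CUTOFF)
    return above / len(samples)

alpha_exact = share_above_cutoff(population_1)     # Type I: true null rejected
beta_exact = 1 - share_above_cutoff(population_2)  # Type II: false null accepted
print(f"Type I error rate:  {alpha_exact:.4f}")
print(f"Type II error rate: {beta_exact:.4f}")
```

Because the populations are small and discrete, these exact rates need not equal the values implied by the normal approximation.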

Sunday, April 23, 2017

Insights into concepts of hypothesis testing (Part 11)

Demonstration of Type I Error with an Example
This example demonstrates how Type I error can be committed. Suppose this is the population data: Population = {1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5}.
It is a symmetric population with Population Mean = 3, Population Mode = 3 and Population Median = 3. These are normally unknown, but for ease of comprehension I have mentioned them here. Let’s draw a sample of size 3.
Sample = {4, 5, 5}
Due to variations in sampling we can get such an extreme sample with sample mean = 4.67. We test the hypothesis that this sample comes from a population with mean 3.
Null Hypothesis: Population Mean is equal to 3
Alternative Hypothesis: Population Mean is not equal to 3
Under the assumption that the population standard deviation is known and is 1.1239, the Z test statistic

Z = (sample mean - population mean (under null hypothesis))/(population standard deviation/√sample size) = (4.67 - 3)/(1.1239/√3)

is about 2.57, which is more than 1.96. So here the null hypothesis is rejected at the 5% level of significance and we conclude that the sample doesn’t come from this population. The one-tailed p value is about 0.005 (about 0.010 for the two-tailed alternative stated above). In terms of the sample mean, if it is greater than 4.27 (that is, 3 + 1.96 × 1.1239/√3) we have to reject the null hypothesis and conclude that the population mean is not 3.

If the sample mean is more than 4.27, then we reject the null hypothesis
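The calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the two-sided critical value is taken from the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist, mean

sample = [4, 5, 5]
mu0 = 3.0       # population mean under the null hypothesis
sigma = 1.1239  # population standard deviation, assumed known
alpha = 0.05

standard_error = sigma / sqrt(len(sample))
z = (mean(sample) - mu0) / standard_error

z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a two-sided test
upper_cutoff = mu0 + z_crit * standard_error  # sample means above this are rejected
p_one_tailed = 1 - NormalDist().cdf(z)
p_two_tailed = 2 * p_one_tailed

print(f"z = {z:.2f}, cutoff for sample mean = {upper_cutoff:.2f}")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```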

Friday, April 21, 2017

Insights into concepts of Hypothesis Testing (Part 10)

Assumption of Normality of Parent Population
The population from which a sample is drawn is called a parent population. This parent population is usually unknown. Parametric tests under testing of hypothesis are based on the assumption of normality of parent population. Non parametric tests are not based on this assumption. Today’s discussion delves deeper into this assumption of normal population. We look at various transformations that convert a skewed (non normal) population to a normal population.
For the sake of simplicity let’s consider the following population. It is unknown under normal conditions.
Population = {1, 1, 1, 1, 1, 1, 2, 2, 2, 5, 5, 20, 25}
This is non-normal, positively skewed data. It is represented by the frequency curve on the left side of the image. Here Mean = 5.15, Median = 2 and Mode = 1. Hence Mode < Median < Mean for positively skewed data.
If the population were normal, its simplified version would be the following.
Population = {3, 4, 5, 5, 5, 5, 5, 5, 6, 7}
It is represented by the frequency curve on the right side. Here Mean = Mode = Median = 5. For normal data, Mean = Mode = Median.
For conducting parametric tests the parent population should be normal, with mean, mode and median very close to one another.
Some transformations like logarithmic, square root and power (1/4) can pull in the long right tail of skewed data and bring the mean, mode and median closer to each other. But for this population they don’t remove the skew: the order Mode < Median < Mean is preserved after each transformation.

After we apply the logarithmic (base 10) transformation to the non-normal population {1, 1, 1, 1, 1, 1, 2, 2, 2, 5, 5, 20, 25}, Mean = 5.15, Median = 2 and Mode = 1 change to Mean = 0.385, Median = 0.301 and Mode = 0. After taking the square root of the original data, Mean = 1.86, Median = 1.414 and Mode = 1. And if the power (1/4) is applied to the original data, Mean = 1.3007, Median = 1.189 and Mode = 1. In every case Mode < Median < Mean still holds, so the transformed data remain positively skewed. This is illustrated by the image below, a screenshot of the Excel worksheet comparing the different transformations.
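The same comparison can be reproduced outside a spreadsheet. The sketch below recomputes the mean, median and mode under each transformation; base 10 for the logarithm is an assumption inferred from the reported median of 0.301.

```python
from math import log10, sqrt
from statistics import mean, median, mode

population = [1, 1, 1, 1, 1, 1, 2, 2, 2, 5, 5, 20, 25]

transforms = {
    "original": float,
    "log10": log10,                   # base 10 assumed
    "square root": sqrt,
    "power 1/4": lambda x: x ** 0.25,
}

summary = {}
for name, fn in transforms.items():
    values = [fn(x) for x in population]
    summary[name] = (mean(values), median(values), mode(values))
    m, md, mo = summary[name]
    print(f"{name:12s} mean = {m:.4f}  median = {md:.4f}  mode = {mo:.4f}")
```

In each row the mode stays below the median and the median below the mean, which is the signature of positive skew.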

Thursday, April 20, 2017

News “cycling to work can halve cancer risk” - from a statistical perspective

Holiday Special Update III
Today’s blog gives a statistical perspective on the article published in BBC health news on 20 April 2017 titled “Cycling to work halves cancer risk”.
Risk of a disease means the probability of catching that disease. Catching a disease in an individual’s lifetime is a random event and it is a function of time. Let X(t) be a random variable denoting the state of having a disease (say cancer) at time t; it takes two values, 0 and 1. X(t) = 0 implies a disease-free state with probability P(X(t) = 0), denoted by p1, and X(t) = 1 implies the existence of the disease at time t with probability P(X(t) = 1), denoted by p2. Here p1 + p2 = 1, because at time t an individual is either in a state of health or has the disease, and these two states are exhaustive and mutually exclusive. This disease could be as common as the common cold or as uncommon as cancer. For a healthy and young individual p1 > p2, and as time progresses, that is, as the individual becomes older and older, p2 becomes closer and closer to 1.
These probabilities can be explained by a probability distribution. The number of cases of diseases that are common in modern-day life, like high blood pressure and high blood sugar, can be explained by the Binomial probability distribution. This distribution tends to the normal distribution when the size of the population is large. The incidence of not so common or rare diseases can be explained by the Poisson distribution and the Negative Binomial distribution. The normal distribution is a limiting case for these distributions as well.
As the news says “Cycling to work can halve the cancer risk”, this implies that p2, the probability of catching the disease (cancer) at time t, is reduced by half when people cycle to their places of work. Similarly, regular exercise and a balanced diet can also reduce p2. p2 is a function of time/age and it increases as age increases. But its growth can be checked by cycling to work.
Human life is governed by several random events. These occurrences can be statistically analyzed through the probability distributions of the random variables that explain these events. Clinical trials and data-based research give us an idea of the values of p1 and p2. This is an evidence/data-based approach to estimating p1 and p2. This kind of research complements laboratory-based research. If time and energy are invested in building a foundation of good quality data, then breakthrough results can be obtained with much less time and money.


Monday, April 17, 2017

News “Publication of Gender Biased Books” - from a statistical perspective


Holiday Special Update II



Today’s blog gives a statistical perspective on the article published in BBC Asia on 15 April 2017 titled “India’s enquiry into sexist text books” and also on the BBC radio journal talk “Gender biased books”.
Gender biased is not gender balanced. Gender biased means giving preference to one specific gender. The article published in BBC Asia on 15 April 2017, “India’s enquiry into sexist text books”, and the talk program on BBC radio journal, “Gender biased books”, try to sensitize us to the importance of having gender balanced views for the sustainable development of society. But what is gender balanced? A gender balanced attribute implies maintaining the natural balance between the number of males and the number of females with respect to that attribute. Sex ratio measures this ratio between the number of males and the number of females. It is normally defined as the number of males per 100 females. Sex ratio at birth measures this ratio at the time of birth. Under normal conditions and in the absence of any external biological intervention the sex ratio at birth is between 103 and 105. This implies that there are normally 103 – 105 male births to every 100 female births. So with a sex ratio of 103, out of every 100 births 50.74% [(103/203)*100] are male and 49.26% are female. This is the gender balance given by nature.
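The conversion from a sex ratio to percentage shares used above is a one-line formula. A small sketch follows; the function name `male_share_percent` is only illustrative.

```python
def male_share_percent(sex_ratio):
    """Percentage of males implied by a sex ratio (males per 100 females)."""
    return sex_ratio / (sex_ratio + 100) * 100

for ratio in (103, 105):
    male = male_share_percent(ratio)
    print(f"sex ratio {ratio}: {male:.2f}% male, {100 - male:.2f}% female")
```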
For publications to be gender balanced the same ratio has to be maintained, implying that for every 10 books published in the market 5 – 6 should portray the male perspective of an issue and 4 – 5 should portray the female perspective of the same issue. Or if there are 10 stories published in a book, 5 – 6 stories should have a male in the lead role and 4 – 5 should have female heroines. So when a reader reads the entire book, he/she has an idea of how a man would think and also of how a woman would tackle an issue. Reading gender balanced books results in the development of impartial views on any issue.
According to the 2011 census, the sex ratio is 94 for Nepal. There are 94 males per 100 females when the whole population of 2011 is taken into consideration. The 2011 census tells us that the sex ratio in the age group 00 – 04 years is 105. So this drop from 105 to 94 is attributed to the increased life expectancy of females in all age groups. The vast rural-urban differential existing in many developing countries, including Nepal, is also reflected in the sex ratio. The sex ratio is 104 for urban areas and 92.3 for rural areas. Males from rural areas migrate to urban areas for education and employment. Thus the sex ratio in urban areas is above 100 and higher than the sex ratio of rural areas.
So what is gender balanced and what is gender biased? If sex ratio is close to 100 with respect to an attribute then that attribute is gender balanced and if sex ratio is much less than 100 or much more than 100 then it is gender biased.


Saturday, April 15, 2017

Today’s news “World’s oldest person dies at 117 years” - From a statistical perspective

Holiday Special Update

I am on a summer vacation this week. There will not be regular daily updates but a holiday special update. I will look at the news published by the BBC on 16 April 2017, “World’s oldest person dies at 117 years”, from a statistical perspective. This lady hails from Italy, where the average longevity of a female is 84.8 years (source: http://www.worldlifeexpectancy.com/italy-life-expectancy). Italy ranked sixth in 2015 in the global ranking of average life expectancy. Average life expectancy is a development indicator of a country. Higher average life expectancy means a higher standard of living and better health care facilities provided by the government. Many questions arise in our mind as we read this news. Some of them are the following.
1.       What is it like to be the world’s oldest person?
2.       What is the probability of being world’s oldest person?
3.       What is the probability of being in a country’s top 2% longest living people?
Life tables try to address these questions by giving the mortality experience of a group called a cohort. l(x), q(x), L(x), T(x) and e(x) are some of the columns of a life table. l(x) is the number of persons in the cohort surviving to exact age x. q(x) is the conditional probability of dying before age x+1 given that the person has lived till age x. L(x) is the number of person years lived between ages x and x+1. T(x) is the total number of person years lived beyond age x. e(x) is the average life expectancy at age x, that is T(x)/l(x). The values of this life table are governed by the current mortality experience.
I will focus on the last column of life tables, e(x), and especially on the average life expectancy at birth, e(0). The average life expectancy of a Nepalese woman is 67.97 years (Source: Population Monograph of Nepal 2014). Due to the vast rural-urban differential in health facilities, which is common to all developing countries, it is 71 years in urban areas and 68 years in rural areas. This implies that under the given health and socioeconomic conditions a woman on average lives till 71 years in urban areas of Nepal. If this woman belongs to a rural area, she is exposed to the socioeconomic status and health conditions of rural areas and lives till 68 years on average. This is a very promising figure indicating that a woman of Nepal has 67.97 years to fulfill all her dreams and aspirations. In contrast, a woman from Italy has 84.8 years to meet all her dreams and aspirations. So she has on average 17 more years to live. The Italian lady mentioned in the BBC news today, by living till 117 years, far exceeded the average life expectancy of an Italian citizen, which is 82.7 years (Source: http://www.worldlifeexpectancy.com/italy-life-expectancy).
The life expectancy at birth for a female in Nepal has increased from 28.5 years in 1954 to 67.9 years in 2011 (Source: Population Monograph of Nepal 2014). This is due to increasing modern health facilities that have reduced death rates such as maternal mortality rates, infant mortality rates and child death rates. If we assume that the standard deviation is 2 years, then what is the age that marks the top 2% in terms of female longevity for Nepal? Assume that longevity is normally distributed. The 98th percentile lies about 2.05 standard deviations above the mean, that is, at 67.9 + 2.05 × 2 ≈ 72 years. So the women living longer than 72 years comprise the top 2% of the female population, given that longevity is normally distributed with mean 67.9 years and standard deviation of 2 years. This is under the current mortality conditions. This is portrayed in the image below.
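The 72-year threshold can be checked directly from the normal distribution. This is a minimal sketch under the same assumptions as the paragraph above (normal distribution, mean 67.9 years, standard deviation 2 years).

```python
from statistics import NormalDist

# Assumptions from the text: normal distribution, mean 67.9 years, SD 2 years
longevity = NormalDist(mu=67.9, sigma=2.0)
top_2_percent_age = longevity.inv_cdf(0.98)  # 98th percentile
print(f"Age above which the top 2% lie: {top_2_percent_age:.1f} years")
```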



Wednesday, April 12, 2017

Insights into the concepts of hypothesis testing (Part 9)

Consequences of Hypothesis Testing
In continuation with yesterday’s example, if hypothesis testing is based on well designed experiments aimed at maximizing the information with minimum background noise and complemented by a representative sampling scheme, then we are most likely to make correct decisions. Among four possible consequences listed below, only one will take place.
Null Hypothesis: No increment in SPM in 2017 compared to 2016
Alternative Hypothesis: There is an increment in SPM in 2017 compared to 2016
Correct Decision: Accepting a true null hypothesis. This implies that we correctly conclude that there is no increment in SPM in 2017 in comparison to 2016.
Correct Decision: Rejecting a false null hypothesis. This implies that we correctly conclude that there has been an increment in SPM of 2017 in comparison to 2016.
Error I: Type I Error: Rejecting a true null hypothesis and thus falsely concluding that SPM in 2017 is significantly more than in 2016. The consequence of this error is very expensive (in time and money) for the government, which is wrongly forced to take measures to control pollution, although such measures are still beneficial to the public.
Error II: Type II Error: Accepting a false null hypothesis and thus falsely concluding that there is no change in SPM of 2017 in comparison to 2016. Consequences of this conclusion will be drastic for the public as it will have adverse impact on their health.
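The four consequences listed above depend only on two yes/no facts: whether the null hypothesis is actually true, and whether the test rejects it. A small sketch of that mapping follows; the function name `outcome` is purely illustrative.

```python
def outcome(null_is_true, null_rejected):
    """Classify one hypothesis-testing decision into its possible consequences."""
    if null_is_true and null_rejected:
        return "Type I error"
    if not null_is_true and not null_rejected:
        return "Type II error"
    return "Correct decision"

for truth in (True, False):
    for rejected in (True, False):
        print(f"null true: {truth!s:5}  rejected: {rejected!s:5}  ->  {outcome(truth, rejected)}")
```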



Tuesday, April 11, 2017

Insights into the concepts of hypothesis testing (Part 8)

Interchange of Null and Alternative Hypothesis doesn't change the final conclusion. The following two images validate this statement.


Monday, April 10, 2017

Insights into the concepts of hypothesis testing (Part 7)

In continuation with yesterday's example
For example: We want to test whether the level of air pollution in terms of Suspended Particulate Matter (SPM) for the year 2017 is higher than that of 2016. We analyze this on the basis of daily SPM data collected in March – April (dry season) at a specific time, say between 12:00 and 14:00 hours, for the years 2017 and 2016. The choice of Null and Alternative Hypothesis is explained in detail in the image below.

Sunday, April 9, 2017

Insights into the concepts of hypothesis testing (Part 6)

Choice of appropriate null and alternative hypothesis

The null hypothesis is the hypothesis of no difference. The alternative hypothesis is the opposite of the null hypothesis. The choice of an appropriate null and alternative hypothesis is always a cause of concern among researchers. The null hypothesis supports the status quo of the scenario under study. The alternative hypothesis tries to endorse the result obtained from the sample data. We want to use the sample data as evidence and reject the null hypothesis in favor of the alternative hypothesis. For example, if the sample data shows an increment in air pollution in comparison to last year, the alternative hypothesis should reflect this trend. Then the null hypothesis will be the opposite of the alternative hypothesis, which is also the “Status Quo”. How is this status quo known to us? It is known through previous studies, through no-difference assumptions, or as the opposite of the alternative hypothesis.
For example: We want to test whether the level of air pollution in terms of Suspended Particulate Matter (SPM) for the year 2017 is higher than that of 2016. We analyze this on the basis of daily SPM data collected in March – April (dry season) at a specific time, say between 12:00 and 14:00 hours, for the years 2017 and 2016.
What should be the Null hypothesis and Alternative hypothesis? What will be the consequence of Type I Error and Type II Error? This will be discussed tomorrow.

Friday, April 7, 2017

Insights into the concepts of hypothesis testing (Part 5)

Understanding Steps of Hypothesis Testing

In continuation to yesterday’s discussion, a sample of size 5 is drawn from a normally distributed population with population variance equal to 0.75. The sample mean is 5.112. Test the hypothesis at 5% level of significance that this sample comes from a population with mean 5.
1.       Null Hypothesis: Population Mean = 5
(Parent population of this sample has a mean of 5)
2.       Alternative Hypothesis: Population Mean >5
(Parent population of this sample has a mean of greater than 5. Since sample mean is 5.112, we think that alternative hypothesis might be true)
3.       Alpha = 0.05
(Probability of error of rejecting a true null hypothesis is 5 in 100)
4.       Z = (sample mean - population mean (under null hypothesis))/standard error of sample mean 
Sample mean = 5.112. Big Question: Is this sample mean large enough to conclude that it comes from a parent population with Mean > 5?
We know that the sample mean of 5.112 is greater than the population mean of 5. But we are generalizing results from a sample of size 5 to the entire population, that is, we are inferring for the whole population, so these steps of hypothesis testing have to be followed. Here the standard error of the sample mean is √(0.75/5) = 0.387, so Z = (5.112 - 5)/0.387 = 0.29, which is less than the critical value of 1.645. The figure below gives the sampling distribution of the sample mean. We see that the sample mean should be at least 5.637 (that is, 5 + 1.645 × 0.387) for us to conclude that it is large enough to reject the null hypothesis. Since 5.112 is below this value, we do not reject the null hypothesis.
Sampling Distribution of Sample Mean with Mean Equal to 5 and Standard Error Equal to 0.387
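The four steps above can be followed in code. This is a minimal sketch, assuming an upper-tailed Z test with the population variance known; the variable names are illustrative, and the printed values may differ in the third decimal from hand-rounded figures.

```python
from math import sqrt
from statistics import NormalDist

n = 5
population_variance = 0.75
sample_mean = 5.112
mu0 = 5.0     # population mean under the null hypothesis
alpha = 0.05

standard_error = sqrt(population_variance / n)
z = (sample_mean - mu0) / standard_error
z_crit = NormalDist().inv_cdf(1 - alpha)     # upper-tailed test
cutoff_mean = mu0 + z_crit * standard_error  # smallest sample mean that rejects

decision = "reject" if z > z_crit else "do not reject"
print(f"SE = {standard_error:.3f}, Z = {z:.3f}, critical Z = {z_crit:.3f}")
print(f"cutoff for sample mean = {cutoff_mean:.3f}; decision: {decision} the null hypothesis")
```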




 

Thursday, April 6, 2017

Insights into the concepts of hypothesis testing (Part 4)

Type II error – Consumer’s risk

In continuation of yesterday’s discussion, we don’t have the luxury of knowing the whole population. An unknown population means unknown values of the population parameters. But through hypothesis testing, confirmatory inferences about the unknown population parameters can be made. When we accept a null hypothesis, either we make the correct decision of accepting a true null hypothesis or we make the error of accepting a false null hypothesis. The error of accepting a false null hypothesis is called Type II error. This error is not considered as serious as Type I error, as the status quo is not disturbed as a consequence of it. It can be compared to accepting a bad lot due to variations in sampling, also called Consumer’s Risk. A bad lot is wrongfully accepted because the sample drawn from it happens to conform to the decision rule developed for the acceptance of a lot. The probability of this error is called beta.

Tuesday, April 4, 2017

Insights into the concepts of hypothesis testing (Part 3)

Type I error – Producer’s risk

In continuation of yesterday’s discussion, we don’t have the luxury of knowing the whole population. An unknown population means unknown values of the population parameters. But through hypothesis testing, confirmatory inferences about the unknown population parameters can be made. When we reject a null hypothesis, either we make the correct decision of rejecting a false null hypothesis or we make the error of rejecting a true null hypothesis. This error of rejecting a true null hypothesis is called Type I error. It is a very serious error, as committing it disturbs the “Status Quo”, normally resulting in the loss of a significant amount of time and money. It is also called Producer’s Risk, where a producer rejects a perfectly good lot because of variations in sampling. Producer’s risk is very expensive for the producer, as the time and money invested in the lot go to waste because the product cannot reach the consumer. So the probability of committing this error is kept small. The probability of committing Type I Error is called the level of significance and is denoted by alpha. In the entire hypothesis testing process we fix this risk alpha at 0.05 or 0.01. This means that the probability of rejecting a good lot is only 5 in 100 or 1 in 100.
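The meaning of "5 in 100" can be illustrated by simulation: draw many samples from a population for which the null hypothesis is true and count how often the test rejects. The sketch below is only an illustration; the population mean, standard deviation, sample size, number of trials and seed are all hypothetical choices, not values from this post.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, mu0=5.0, sigma=1.0, n=5, trials=20000, seed=1):
    """Share of samples from a TRUE null that an upper-tailed Z test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    se = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        # Every sample really does come from the null population
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        if (sum(sample) / n - mu0) / se > z_crit:
            rejections += 1
    return rejections / trials

print(type_i_error_rate(alpha=0.05))  # should land close to 0.05
print(type_i_error_rate(alpha=0.01))  # should land close to 0.01
```

Every rejection counted here is a good lot being thrown away, which is exactly the producer's risk described above.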

Monday, April 3, 2017

Insights into the concepts of hypothesis testing (Part 2)

As mentioned in yesterday’s blog on hypothesis testing, the null hypothesis is interested in maintaining the “Status Quo”. The alternative hypothesis is the opposite of the null hypothesis.
We don’t know the consequences of our decisions. We don’t have the luxury of knowing each and every unit of the population, so we have to be satisfied with the sample. But results obtained from this sample should hold true for the entire population. Because of this generalization of results we run the risk of committing two types of errors. Either we reject a true null hypothesis, which is called Type I error, or we accept a false null hypothesis, which is called Type II error. Falsely rejecting a true null hypothesis in favor of the alternative hypothesis disturbs our “Status Quo”. So committing Type I error lands us in a worse-off scenario.
For example: A person is looking for greener pastures and so is on the lookout for a new job.
Null Hypothesis: Current job is good   (maintaining the status quo)
Alternative Hypothesis: The new job is good (opposite of null hypothesis)
Consequence of Type I error
Leaving the old job and being unsatisfied (due to poor wages and bad working conditions) with the new job. This error disturbs our “Status Quo” and moves us to a worse scenario.
Consequence of Type II error
Remaining in the old job and remaining unsatisfied. This error is not as serious as Type I error, as we are not pushed below our “Status Quo” position.


Sunday, April 2, 2017

Insights into the concepts of hypothesis testing (Part 1)

This week I will dig deeper into the concept of hypothesis testing. This blog aims to explain various statistical concepts in an example-oriented and simplified manner. Testing of hypothesis is confirmatory in nature. Either the null hypothesis is rejected and the alternative hypothesis is accepted, or the null hypothesis is accepted and the alternative hypothesis is rejected. These conclusions are based on results obtained from a sample, through the computation of a test statistic. This test statistic is based on the sampling distribution of a sample statistic. The value of the test statistic is compared with the tabulated value, and the null hypothesis is either accepted or rejected. These are inferences about the population based on results obtained from the sample.
The null hypothesis is interested in maintaining the “Status Quo”. The alternative hypothesis is the opposite of the null hypothesis.
For example: A person is looking for greener pastures and so is on the lookout for a new job.
Null Hypothesis: Current job is good   (maintaining the status quo)
Alternative Hypothesis: The new job is good (opposite of null hypothesis)

(to be continued tomorrow)