Friday, March 31, 2017

Survival Analysis – Reliability Theory / Human Life – Transistor’s Life (Final Part)

Some Mathematics of Reliability Theory – Survival Analysis
In this blog I try to explain mathematical concepts with real-life metaphors and examples. Today's post concludes the discussion of the similarities between reliability theory and survival analysis by explaining the mathematics behind these concepts. This mathematics summarizes the discussion of the previous five parts.
Let the random variable T denote the lifetime of a device or a human, and let t denote a specific value taken by T.
R(t) = reliability of a device at time t = P(T ≥ t) = probability that the lifetime of the device is at least t
S(t) = survival function of an individual = P(T ≥ t) = probability that the individual survives at least until time t
Cumulative distribution function = F(t) = 1 - R(t) = probability that the device fails before t
Cumulative distribution function = F(t) = 1 - S(t) = probability that the individual dies before t
h(t) = hazard rate = limit, as Δt tends to 0, of P(the device fails between time t and t + Δt | the device has survived till t) / Δt = -R'(t)/R(t)
h(t) = instantaneous force of mortality = limit, as Δt tends to 0, of P(the individual dies between time t and t + Δt | the individual has survived till t) / Δt = -S'(t)/S(t)
h(t) = λ = constant for an electronic device (memoryless property, independent of time): Reliability Theory
h(t) = λ(t) = function of time for human life: Survival Analysis
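The relation h(t) = -S'(t)/S(t) can be checked numerically. The sketch below is only an illustration: the rate 0.5 and the Weibull shape and scale are hypothetical values, and the Weibull model is an assumed stand-in for a time-dependent hazard, not something fitted to human data.

```python
import math

LAM = 0.5                 # hypothetical constant failure rate
SHAPE, SCALE = 2.0, 3.0   # hypothetical Weibull parameters

def exponential_survival(t):
    """S(t) = exp(-LAM * t): the constant-hazard lifetime model."""
    return math.exp(-LAM * t)

def weibull_survival(t):
    """S(t) = exp(-(t / SCALE) ** SHAPE): an assumed time-dependent model."""
    return math.exp(-((t / SCALE) ** SHAPE))

def hazard(survival, t, dt=1e-6):
    """Approximate h(t) = -S'(t) / S(t) with a central difference."""
    derivative = (survival(t + dt) - survival(t - dt)) / (2 * dt)
    return -derivative / survival(t)

times = (1.0, 2.0, 5.0)
exp_hazards = [hazard(exponential_survival, t) for t in times]
weib_hazards = [hazard(weibull_survival, t) for t in times]

print(exp_hazards)   # stays close to LAM at every time
print(weib_hazards)  # grows with t for SHAPE > 1
```

With the exponential model the hazard does not depend on t, while the assumed Weibull model gives a hazard that changes with time, which is the contrast drawn in the last two lines above.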



Thursday, March 30, 2017

Survival Analysis – Reliability Theory / Human Life – Transistor’s Life (Part 5)

Life of an electronic device
The life of an electronic device might not have as many shades of color as a human life, but it still has enough to make it a playground of chance and randomness. For instance, the number of times an electronic device goes out of order before it is finally discarded is a discrete random variable. The time of first failure of a device is a continuous random variable. The time between two consecutive failures of a device is also a continuous random variable. A failure in a device can be related to the occurrence of illness in a human life. The time taken to repair a device is a continuous random variable. The failure rate or hazard rate follows a bathtub-like curve, as does the human force of mortality. The failure rate of a device is highest in the first few days after its installation and in its last days. The high hazard rate in the last days is due to wear and tear of the device. Further, the lifetimes of electronic devices and batteries are commonly modeled as having the lack of memory property. This implies that the time of failure of the device has "no memory" of how long that device has already worked.
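The "no memory" statement can be written as P(T > s + t | T > s) = P(T > t). A minimal sketch of this check, assuming an exponential lifetime with a hypothetical failure rate of 0.2 per unit time:

```python
import math

def survival(t, lam):
    """P(T > t) for an exponential lifetime with rate lam."""
    return math.exp(-lam * t)

def conditional_survival(s, t, lam):
    """P(T > s + t | T > s) = P(T > s + t) / P(T > s)."""
    return survival(s + t, lam) / survival(s, lam)

lam = 0.2  # hypothetical failure rate

# Chance that a brand-new device lasts 3 more time units ...
fresh = survival(3.0, lam)
# ... versus a device that has already worked for 10 time units.
used = conditional_survival(10.0, 3.0, lam)

print(fresh, used)  # the two probabilities agree
```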


Wednesday, March 29, 2017

Survival Analysis – Reliability Theory / Human Life – Transistor’s Life (Part 4)



Summary of Human Life in Terms of Random Variables
Discrete Random Variables:
1. Gender of an individual at the time of birth
2. Number of times he/she falls sick in his/her entire lifetime
3. Number of children born to him/her in the entire lifetime
4. Number of girls among the total children born
5. Number of jobs changed in his/her entire lifetime
Continuous Random Variables:
1. Time of birth of an individual
2. Time of illness
3. Time between two consecutive illnesses
4. Time of recovery from illness
5. Time of marriage
6. Time of birth of children
7. Time between two consecutive births
8. Age at the time of death

In all these events memory plays an important role. For example, the time between two consecutive births is at least 9 months. The probability of a second attack of a severe illness is highest shortly after recovery from the first illness, because of the possibility of a relapse. So the failure rate or force of mortality is affected by memory. The failure rate is higher at very old and very young ages. But this dependence on memory does not hold for electronic devices, batteries etc. This will be discussed tomorrow in the discussion on reliability theory.

Tuesday, March 28, 2017

Survival Analysis – Reliability Theory / Human Life – Transistor’s Life (Part 3)

Continuing yesterday's discussion, we saw that human life is marked by several random events. The time of marriage is a continuous random variable as it can take any value in a real continuum. Similarly, the number of children born to the individual is a discrete random variable, denoted by say n. Here n = 0, 1, 2, 3, …., m, where n = 0 denotes having no children, n = 1 denotes having one child in the entire lifetime of the individual, and so on. The time of birth of the first child is a continuous random variable, the time of birth of the second child is also a continuous random variable, and so is the time between two consecutive births. These are continuous random variables because time is continuous and can take any value in the continuum. But the time of the second birth is always greater than the time of the first birth. The time of death is a continuous random variable, and so is the age of the individual at death. The number of professional jobs taken by an individual in his entire lifetime is a discrete random variable. We see that human life is governed by multiple discrete and continuous random variables, which take values according to some probability law. Our objective is to describe this probability law with the help of a function. This way we can quantify the chances of occurrence of these events. In survival analysis we are interested in predicting the mortality of an individual. This is very useful in actuarial science, which is based on the quantitative value of the chance of mortality of an individual.

Monday, March 27, 2017

Survival Analysis – Reliability Theory / Human Life – Transistor’s Life (Part 2)

I am continuing yesterday's discussion. The beginning of human life is a playground of probability, where the time of birth can be described by a continuous random variable and the outcomes (gender/number) are discrete random variables. Similarly, the number of times an individual becomes seriously ill in his entire lifetime, so that he has to take more than two days of leave from his school/office, is a discrete random variable, say Y. Then Y = 0, 1, 2, …., n, implying that he can be seriously ill zero times, once, twice and so on. This random variable depends on the health status of the individual. For a healthy individual the lower values of Y will have higher probabilities than for a physically weak individual. The number of days of sick leave that he has to take from his school/office for a serious illness is also a discrete random variable, Z = 3, 4, 5, …, m, since we have defined serious illness as illness where more than two days of sick leave are taken. The time between two consecutive illnesses is a continuous random variable. For a physically weak individual, shorter times between two consecutive illnesses will have higher probability than for a healthy individual. Healthy individuals tend to fall seriously ill much less frequently, and hence longer times between two consecutive illnesses have higher probability. The time of outbreak of an illness for any individual is also a continuous random variable. (To be continued tomorrow).

Sunday, March 26, 2017

Survival Analysis – Reliability Theory / Human Life – Transistor’s Life (Part 1)

Survival analysis quantifies the uncertainty associated with the mortality of human life. To understand this concept we have to zoom into human life. Human life is governed by multiple random events. But some aspects of human life can be controlled and are hence deterministic in nature. These are normal day-to-day activities like waking up at a specific time, going to school at a specific time, activities in school and offices conducted at a specific time, engagement in specific extracurricular activities after school/office and finally going to bed at a specific time. Each individual can exercise some control over how and when these activities are carried out. But there are several events whose time of occurrence and outcome are random in nature. Hence these outcomes and their times of occurrence can be described by random variables. The following examples highlight different shades of human life where probability theory plays a crucial role. We start from the very beginning, where the time of birth of an individual is a random variable. Although a date of birth is predicted by a gynecologist, the exact time of birth (by natural process) cannot be controlled. This time is a continuous random variable. The gender of the child is also a random variable. If X is a random variable indicating the birth of a girl, then X takes the values 0 and 1: X = 0 if a boy is born and X = 1 if a girl is born. (To be continued tomorrow)


Thursday, March 23, 2017

Visualization of Sample Space through Concepts of Functions and Relations

Sample Space: The Sample Space is the set of all possible outcomes of an experiment. The set of Favorable cases is a subset of the Sample Space. While computing a probability we should visualize the sample space. For a single discrete random variable this sample space can be listed. A random variable is a function defined on this sample space: it assigns a number to every outcome. This is illustrated by the following example.
Example: Getting a face 6 in a roll of a die
Sample Space: {1, 2, 3, 4, 5, 6}
Outcome: Getting a Six
Random variable X denotes getting a face 6 in a single roll of a die.
X = 0, 1
This is same as saying,  
X = not getting six, getting six
P(not getting six) = 5/6
P( getting six) = 1/6
The figure above shows the mapping from Sample Space to Probability Space
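The same mapping can be written out as a short sketch. The function name X and the dictionary pmf are illustrative choices, and the die is assumed to be fair.

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]

def X(outcome):
    """Random variable: 1 if the face is a six, 0 otherwise."""
    return 1 if outcome == 6 else 0

# Each face of a fair die has probability 1/6; add up the
# probabilities of all outcomes that X maps to the same value.
pmf = {}
for outcome in sample_space:
    value = X(outcome)
    pmf[value] = pmf.get(value, Fraction(0)) + Fraction(1, len(sample_space))

print(pmf[0], pmf[1])  # 5/6 1/6
```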

Wednesday, March 22, 2017

Parameter and Parametric Space – Example

Let me explain the concept with the help of an example. The biophysical process behind catching a disease, say Breast Cancer, has a deterministic and a probabilistic aspect to it. The deterministic aspect can be explained (determined) and controlled. Having a healthy and balanced diet, exercising regularly etc. are examples of controllable, deterministic factors. But several other unexplained factors, including the psychology of an individual, genetic makeup etc., contribute to the probabilistic or random facet of this disease. This random aspect has to be explained by a probability distribution function. So if the chance of getting cancer has to be quantified with a numerical value and described by an equation, then there will be a deterministic part and a probabilistic part. The parameters and the parametric space of this biophysical phenomenon have to be identified for an accurate estimation of this probability. The probability distribution will be a multivariate distribution, as multiple random variables contribute to this process.


The Normal distribution can describe various indicators of health condition like blood pressure, cholesterol, sugar etc. The Normal distribution tells us that the probability density is highest at the mean or average value and decreases as we move away from the mean on either side. The figure below gives the Normal probability density function for different values of the parameters µ and σ². Here each curve could depict, say, blood pressure for one age group. The normal curve on the leftmost side could represent the probability density function of the blood pressure pattern at age 20 - 25. The normal curve in the middle could represent the blood pressure pattern at age 40 - 45. The normal curve on the right could represent the blood pressure pattern at age 60 - 65. The average blood pressure is lowest for age 20 - 25. The variance of blood pressure around the mean is also lowest for age 20 - 25.
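A minimal sketch of such a family of curves is given here. The means and standard deviations for the three age groups are made-up illustrative numbers, not measured blood pressure data.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the Normal distribution with mean mu and variance sigma**2."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical (mu, sigma) pairs for three age groups.
groups = {
    "20-25": (115.0, 8.0),
    "40-45": (125.0, 12.0),
    "60-65": (135.0, 16.0),
}

# Height of each curve at its own mean: a smaller sigma gives a taller peak.
peak = {age: normal_pdf(mu, mu, sigma) for age, (mu, sigma) in groups.items()}

for age, (mu, sigma) in groups.items():
    print(age, "mean", mu, "peak density", round(peak[age], 4))
```

Each curve is symmetric about its mean, so normal_pdf(mu - d, mu, sigma) equals normal_pdf(mu + d, mu, sigma) for any distance d.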


Tuesday, March 21, 2017

Parameter and Parametric Space

The working mechanism of a human body is a complex biophysical process. This can be used as a metaphor for the term population that is frequently used in Statistics. The population is unknown, and so is this complex biophysical system of the human body. The parameters governing this process are thus called population parameters. To completely describe this biophysical system with equations, we require the values of these population parameters. This system is not solely deterministic; chance also governs some aspects of it. This chance factor can be explained by a probability distribution based on the values of the population parameters. These population parameters can take different values from the parametric space, which are related to the state of wellbeing of an individual. For a realistic description of the randomness in the functioning of the human body, multiple parameters can play a vital role. These are described by a multidimensional parametric space. By drawing a sample and collecting data from it, we try to find realistic sample estimates of these unknown population parameters. We aim to choose estimators that are unbiased, consistent and efficient. We will discuss this in detail with an example tomorrow in this blog.

Monday, March 20, 2017

Types of Data

Data can be classified on Nominal, Ordinal, Interval and Ratio scale.
Nominal Data are simple classifications. Such data cannot be subjected to mathematical operations like addition, subtraction, multiplication and division.
Ordinal Data are simple classifications that can be ranked. Such data cannot be subjected to mathematical operations but can be ranked.
Interval Data are simple classifications that can be ranked and can be subjected to mathematical operations of addition and subtraction. Interval data don’t have absolute zero.
Ratio Data are Data that can be subjected to all mathematical operations and have absolute zero.
Example: Suppose a questionnaire comprises the following questions.
Q1. What is your name? Response here is a Nominal Data
Q2. How old are you? Response is Ratio Data
Q3. What is your date of birth? Response here is Interval Data (dates can be ordered and subtracted, but the calendar has no absolute zero)
Q4. How do you rate our services? Response here is Ordinal Data
Not good (1)   average (2)   good (3)    very good (4)
Q5. What is your body temperature? Response is interval data

Sunday, March 19, 2017

Proportion of defective integrated chips (IC) /incidence rate of a disease

We refer to the blogs of March 13 and 14. Suppose a company can allow at most 10% defectives. Then the maximum number of defectives in a lot of 52 ICs is 5. A sampling scheme is designed: a sample of size 10 is drawn from a lot of 52 ICs. 10% of this sample of size 10 is 1 defective IC. But due to sampling variation, the number of defectives in 95% of the samples will lie between 0 and about 2.8, i.e. roughly 3. Here C(52, 10) = 15820024220 unique samples of size 10 can be drawn from this lot of size 52. This example can also be related to the number of patients suffering from a disease. Suppose that the incidence of a disease is 10% in the entire population. Then in a sample of 10 people we expect 0 to 3 people to suffer from this disease. But if the number of patients in the sample of size 10 is higher than 3, then we conclude that the incidence of the disease is higher than 10% in the entire population.
H0: p = 0.10
H1: p > 0.10
The mathematics is given in the figure below.
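As a rough numerical check of this test, the sketch below assumes that the number of cases in a sample of 10 follows a Binomial(10, 0.10) distribution under H0. The figure may rely on a different approximation, so this is an illustration rather than a reproduction of it.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p0 = 10, 0.10

# Probability of seeing at most 3 cases when H0: p = 0.10 is true.
p_at_most_3 = sum(binomial_pmf(k, n, p0) for k in range(4))
# Probability of seeing more than 3 cases under H0 (chance of wrongly rejecting H0).
p_more_than_3 = 1 - p_at_most_3

# Number of distinct samples of size 10 from a lot of 52.
number_of_samples = comb(52, 10)

print(round(p_at_most_3, 4), round(p_more_than_3, 4), number_of_samples)
```

Under this assumed model, more than 3 cases in a sample of 10 occurs in only about 1.3% of samples when the true incidence is 10%, which is why such a sample is taken as evidence for H1.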

Thursday, March 16, 2017

Quantify the risk of wrong diagnosis of a disease




The population is unknown and the population parameters are also unknown. The biophysics of the human body is a complex system that can be related to an unknown population. Knowing the values of the population parameters is normally beyond the scope of a medical examination of a patient. Limited time and money compel the examiner and the patient to draw valid conclusions from a sample. Blood, urine and stool samples are taken, in addition to several scans conducted with advanced medical equipment. Although these tests and equipment are very advanced and accurate, the risk of a wrong diagnosis always remains. This risk of a wrong diagnosis can be quantified by probability, as illustrated by the example below.
Example: The incidence of a rare disease (X) is only 4% in the entire population. Hospital records have shown that there is a chance of 4 in 100 that a person with the disease is wrongly diagnosed and a chance of 1 in 100 that a healthy person is wrongly diagnosed. This information is exhibited by the tree diagram above.
P( wrong diagnosis) = (0.04)(0.04)+(0.96)(0.01) = 0.0112

So the risk of a wrong diagnosis is 1.12 %
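The calculation adds up the two branches of the tree that end in a wrong diagnosis (the law of total probability). A short sketch with exact fractions, using illustrative variable names:

```python
from fractions import Fraction

p_disease = Fraction(4, 100)             # incidence of the disease
p_healthy = 1 - p_disease                # 96 in 100
p_wrong_given_disease = Fraction(4, 100)
p_wrong_given_healthy = Fraction(1, 100)

# Sum over both branches of the tree that lead to a wrong diagnosis.
p_wrong = (p_disease * p_wrong_given_disease
           + p_healthy * p_wrong_given_healthy)

print(p_wrong, float(p_wrong))  # 7/625 0.0112
```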

Wednesday, March 15, 2017

Demonstration of Role of Combinatorics in Probability

There are 8 people, of whom 3 are girls and 5 are boys (refer to the blog of 10 March 2017). We are interested in the probability distribution of the number of top 3 positions occupied by girls. Let X be a random variable denoting the number of top 3 positions occupied by girls, so X = 0, 1, 2, 3.
X = 0 means none of the top three positions is occupied by a girl, i.e. all three are occupied by boys
X = 1 means one position is occupied by a girl and two positions are occupied by boys
X = 2 means two positions are occupied by girls and one position is occupied by a boy
X = 3 means all top three positions are occupied by girls
P(X = 0) = P (Only boys in the top 3) = C(5, 3) C(3, 0) 3!5!/8! = 10/56
C(5, 3) means choosing 3 boys out of 5 boys
C(3, 0) means choosing 0 girls out of 3 girls
3!5! means arranging the three chosen students in the top three positions and the remaining 5 students (girls and boys) in the other 5 positions.
Similarly
P(X = 1) = P (One girl and two boys) = C(5, 2) C(3, 1) 3!5!/8! = 30/56
P(X = 2) = P( two girls and one boy) = C(3, 2) C(5, 1) 3!5!/8! = 15/56
P(X = 3) = C(3, 3) 3!5!/8! = 1/56
Total probability is 1
P(X = 0) + P(X=1) + P(X=2) + P(X = 3) = 1
This probability distribution can also be used to describe the gender distribution among the first 3 deaths out of a total of 8 people. Here permutations and combinations have been used together as counting techniques for selection (where order is not important) and arrangement (where order is important).
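Since 3!5!/8! = 1/C(8, 3), each probability above equals C(3, k) C(5, 3 - k) / C(8, 3). A short sketch that tabulates the whole distribution with exact fractions (the names are illustrative):

```python
from fractions import Fraction
from math import comb

girls, boys, top = 3, 5, 3

# P(X = k): choose k of the 3 girls and (3 - k) of the 5 boys for the
# top three places, out of all C(8, 3) ways to fill those places.
pmf = {
    k: Fraction(comb(girls, k) * comb(boys, top - k), comb(girls + boys, top))
    for k in range(top + 1)
}

for k, p in pmf.items():
    print(k, p)

print("total:", sum(pmf.values()))
```

Fraction reduces automatically, so 10/56 is printed as 5/28 and 30/56 as 15/28.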



Tuesday, March 14, 2017

Consumer’s Risk & Producer’s Risk

The population is unknown and the population parameters are also unknown. In industrial applications, examining the whole population is a very expensive and time-consuming process. So a sampling scheme is designed, and we base our judgment on a sample. This involves a risk of wrongfully rejecting a good lot and of wrongfully accepting a bad lot. The former is called producer's risk and the latter is called consumer's risk. With probability we can quantify these risks. Yesterday's example illustrated the quantification of producer's risk with a probability value. These concepts are discussed further in this example.
Example: A lot contains 52 integrated chips (IC). A sample of 4 ICs is selected at random from each lot. According to the sampling scheme designed for quality control, if the sample contains more than 2 defectives the lot is rejected. Suppose that, due to the choice of a bad supplier, the number of defectives in a lot of 52 rises to 12 (this is unknown to us). What is the probability that this lot is still accepted?
Solution: Let X be the number of defective ICs in a sample of 4 ICs.
P(lot is accepted) = P(number of defectives is less than or equal to 2) = P(X = 0) + P(X = 1) + P(X = 2)
= [C(40, 4)/C(52, 4)] + [C(12, 1)C(40, 3)/C(52, 4)] + [C(12, 2)C(40, 2)/C(52, 4)]
= 261430/270725 ≈ 0.966

There is about a 96.6% chance that we accept a lot with more than 10% defectives, even though our objective is to have at most 10% defectives in the population. This is the consumer's risk, as consumers can get these defective products. The producer's risk was quantified as about 0.071% in yesterday's example.
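The acceptance probability can be checked with a few lines of code. The sketch uses C(52, 4) as the common denominator of all three terms, since every sample of 4 is drawn from the full lot of 52. The function name is an illustrative choice.

```python
from fractions import Fraction
from math import comb

def prob_defectives(k, lot=52, defective=12, sample=4):
    """P(exactly k defectives in the sample), drawn without replacement."""
    good = lot - defective
    return Fraction(comb(defective, k) * comb(good, sample - k), comb(lot, sample))

# The lot is accepted when the sample contains 0, 1 or 2 defectives.
p_accept = sum(prob_defectives(k) for k in range(3))

print(p_accept, round(float(p_accept), 4))
```

Written over the denominator C(52, 4) = 270725, the result is 261430/270725, which is roughly 0.9657.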

Monday, March 13, 2017

Demonstration of Role of Combinations in Probability

Let’s illustrate the power of counting techniques in the computation of probability with the following example. Here Combination is used as a counting technique. A combination is a selection, where order is not important. A permutation is an arrangement, where order is important.

Example: Four cards are drawn one after the other without replacement from a well shuffled deck of 52 cards. What is the probability of drawing three kings in these four draws?
Solution: Probability of a king in first draw = 4/52 [4 favorable cases and 52 total cases]
Probability of a king in second draw = 3/51 [3 favorable cases and 51 total cases]
Probability of a king in third draw = 2/50 [2 favorable cases and 50 total cases]
Probability of a non-king card in the fourth draw = 48/49 [48 favorable cases and 49 total cases]
Probability of drawing three kings in these four draws = C(4, 3) (4/52)(3/51)(2/50)(48/49) = 192/270725
Here order is not important, just receiving 3 Kings in a hand of 4 Cards counts. The factor C(4, 3) = 4 counts the possible positions of the three kings among the four draws.
Using Combinations - Probability of drawing three kings in these four draws = C(4, 3)C(48, 1)/C(52, 4) = 192/270725
If four cards are distributed to four players from a well shuffled deck of 52 cards, individual listing of possible combination of cards among four players becomes tedious and time consuming. Combination formula simplifies counting techniques of identifying set of favorable cases from set of total cases.
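Both routes to 192/270725 can be verified with exact fractions. In this sketch the variable names are illustrative:

```python
from fractions import Fraction
from math import comb

# Route 1: multiply the draw-by-draw probabilities for one fixed order
# (king, king, king, non-king), then count the orders with C(4, 3).
one_order = (Fraction(4, 52) * Fraction(3, 51)
             * Fraction(2, 50) * Fraction(48, 49))
sequential = comb(4, 3) * one_order

# Route 2: count hands directly with combinations.
combinatorial = Fraction(comb(4, 3) * comb(48, 1), comb(52, 4))

print(sequential, combinatorial)  # both 192/270725
```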
Industrial Application: A lot of 52 integrated chips contains 4 defective chips. While testing the quality of any lot of 52 integrated chips, a sample of 4 chips is drawn. If more than two chips are found defective in the sample, the entire lot is rejected. What is the probability that this lot is rejected? In that case the entire lot is wrongly rejected even though it contains only four defectives. This is also called Producer’s risk!
P(3 defectives) = C(4, 3)C(48, 1)/C(52, 4) = 192/270725 ≈ 0.00071
P(4 defectives) = C(4, 4)/C(52, 4) = 1/270725
P(the lot is rejected) = 193/270725 ≈ 0.000713
This risk is only about 0.0713%. So we can quantify uncertainty with probability.
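The producer's risk can be recomputed exactly. The sketch below adds the probabilities of 3 and 4 defectives in the sample; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

lot, defective, sample = 52, 4, 4
good = lot - defective

# The lot is rejected when the sample holds more than 2 defectives.
p_reject = sum(
    Fraction(comb(defective, k) * comb(good, sample - k), comb(lot, sample))
    for k in (3, 4)
)

print(p_reject, f"{float(p_reject):.6f}")  # 193/270725 0.000713
```

So the exact producer's risk is 193/270725, or about 0.0713%.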



Friday, March 10, 2017

Demonstration of Role of Permutation in Probability

Let’s illustrate the power of counting techniques in the computation of probability with the following example. There are 8 students in the class namely A, B, C, D, E, F, G, H.  
B, D, F are girls &
A, C, E, G, H are boys.
What is the probability that girls occupy top three positions in the end semester exam? No two students get exactly same marks. All the students are academically equally sound.
Solution:  Total arrangements of 8 students = 8!
If we list down each individual arrangement it will be a very time consuming process.
And it will look like this.
ABCDEFGH, HABCDEFG, …………………………., ABHCDEFG
These are 8! = 40320 arrangements of 8 students.
Arrangement in which girls occupy top 3 places and boys occupy last 5 positions = 3! 5!
If we list down each individual arrangement it will be a very time consuming process.
And it will look like this.
BDFACEGH, DBFACEGH, FDBACEGH, FBDACEGH, BFDACEGH, DFBACEGH, ……….DFBAGHCE
These are 3! 5! = 720 arrangements
P( only girls occupy top 3 positions)= (3!5!)/(8!)=1/56
We see that in the long run, girls occupy all of the top 3 positions in about one out of every 56 exams taken by these 8 students.
Listing down the 40320 outcomes of the total cases and identifying the 720 outcomes of favorable cases is a time-consuming and tedious process. Use of permutations here simplifies everything. This example can be related to other similar situations, such as the probability that the three girls are the first to die when considering the mortality experience of these people. We assume that they all have the same health status.
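A computer can do the tedious listing for us. The following sketch enumerates all 8! rankings and counts the favorable ones, as a check on the formula 3!5!/8!:

```python
from fractions import Fraction
from itertools import permutations

students = "ABCDEFGH"
girls = set("BDF")

total = 0
favorable = 0
for ranking in permutations(students):
    total += 1
    # Favorable case: the first three places are exactly the three girls.
    if set(ranking[:3]) == girls:
        favorable += 1

probability = Fraction(favorable, total)
print(total, favorable, probability)  # 40320 720 1/56
```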


Wednesday, March 8, 2017

Variable and Random Variable – continuous case


Let us consider the inequality 2X > 3 …….(1). This inequality is satisfied by every X > 3/2, so the solution set of (1) is {X | X > 3/2}. This solution set is an infinite set and lies in a continuum. It can be represented as 

Whereas for,
 One Dimensional Random Variables
We illustrate with the following example.
Let X be a random variable denoting the time taken by an individual in a telephone call. The PDF of X can be described by an Exponential Distribution with parameter λ.
Now let’s consider the following inequalities.

2X+3Y< 5 and X-3Y<2, the solution to these inequalities can be represented by the shaded region in the following diagram

 A Two Dimensional Random Variable
Let X be a random variable denoting the time of failure of an electronic device and Let Y be a random variable denoting the time to repair this device.
We assume X and Y to be independent.
Then (X, Y) takes values in {(X, Y) | X > 0, Y > 0}. The PDF of X can be represented by an Exponential Distribution and the PDF of Y can be represented by a Normal Distribution.

Tuesday, March 7, 2017

Random World and Random Variables



We live in a random world, which can be seen as a multidimensional probability space. Some parts of our world are deterministic; these can be controlled by us and probability has no role in them. But our “random world” is governed by multiple random variables. This fact is illustrated by the following example of the use of a mobile phone.
One dimensional random variables
Number of phone calls made per day by a person: Discrete Random Variable
Number of wrong numbers made by a person per day: Discrete Random Variable
Time spent on a phone call by a person: Continuous Random Variable
Time between two phone calls made by a person in a day: Continuous Random Variable
Two dimensional random variables
Marks obtained by an undergraduate student in two internal assessments: Discrete Random Variables
Weight and height of a second year undergraduate student: Continuous Random Variables
Multidimensional (n) random variables
Time spent on each of n phone calls made per day by an individual: n dimensional continuous random variable
No. of phone calls made per day by n members of a family: n dimensional discrete random variable

Here n is also a one dimensional random variable

Monday, March 6, 2017

Variable and Random Variable




Random variable is a variable that is governed by chance. We first get acquainted with the word “variable” when we are taught to solve a simple algebraic equation like X + 5 = 7………(1). Here X is always equal to 2. A “random variable” is random in nature, so it takes different values unlike the variable X discussed in the equation (1).
Discrete Case
One dimensional random variable X
A player rolls a die once.  X denotes the number appearing on the face of a die. Unlike X discussed in equation (1), this number changes in every roll. Hence the X here is called a random variable and X = 1, 2, 3, 4, 5, 6. It takes values according to a probability distribution.
Two dimensional random variables (X1, X2)
Let’s consider a pair of simultaneous equations.

2X1+ 3X2 = 5
X1-3X2=2
X1 = 7/3 and X2 = 1/9 is the solution of the pair of simultaneous equations. These are called variables and take only these values for this pair of simultaneous equations.

Whereas if two players roll a die one after the other, then the numbers appearing on the faces of the dice thrown by these two players are denoted by a two dimensional random variable (X1, X2). Here X1 denotes the number thrown by player 1 and X2 denotes the number thrown by player 2. X1 = 1, 2, 3, 4, 5, 6 & X2 = 1, 2, 3, 4, 5, 6. These are random in nature as they take different values in every roll, and these values are governed by a probability distribution.
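The contrast between the two cases can be sketched in code: the simultaneous equations always give the same pair of values, while the dice give a new pair on every roll. The seed 0 and the helper name solve_2x2 are arbitrary choices made for this illustration.

```python
import random
from fractions import Fraction

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x1 + b1*x2 = c1 and a2*x1 + b2*x2 = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    x1 = Fraction(c1 * b2 - c2 * b1, det)
    x2 = Fraction(a1 * c2 - a2 * c1, det)
    return x1, x2

# Ordinary variables: 2*X1 + 3*X2 = 5 and X1 - 3*X2 = 2 have one fixed solution.
x1, x2 = solve_2x2(2, 3, 5, 1, -3, 2)
print(x1, x2)  # 7/3 1/9

# Random variables: each roll of the two dice gives a new pair (X1, X2).
rng = random.Random(0)
rolls = [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(5)]
print(rolls)
```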

Sunday, March 5, 2017

Set Theory, Combinatorics & Probability Theory



Probability is the ratio of total numbers of elements of two sets namely set of favorable cases and set of total cases. For discrete random variable these sets are countable. For a continuous random variable these are uncountable and infinite. Set operators like union, intersection, complement and difference are used. Venn diagram helps visualize sets and various operations on such sets. For large countable set generated from multiple discrete random variables, we also use counting techniques also called Combinatorics. This is used in the calculation of probability of an event. Here we can find all the possible arrangements (permutation) and selections (combinations) of multiple discrete random variables for the computation of probability of an event. This value of probability can also be obtained by listing down all the possible outcomes. But this process is tedious and time consuming. So Combinatorics uses formula of permutation and combination for quick computation of such probabilities.

Thursday, March 2, 2017

Symmetric or Skewed?

A curve is said to be symmetric about the Y axis if the right side of the curve is the mirror image of the left side. The frequency distribution of symmetric data is represented by the diagram on the left hand side below. Here the frequencies of values equidistant from the mean are the same, so f(x1) = f(x2). The mean (average value), the mode (most frequent value), and the median (the middle value of the ordered data) coincide: Mean = Mode = Median. The income distribution of a European country, where most people have an income around the average value, can be described as symmetric. Since the values cluster around the mean, the variance is low. A small sample can serve the purpose.

The divergence from symmetry is called skewness. The frequency curve below on the right hand side is positively skewed. Here lower values have high frequency and higher values have low frequency. Mode < Median < Mean, and f(x1) is not equal to f(x2). The income distribution of countries like Nepal and India is positively skewed. Here most people live on a very low income, but a few people are among the richest in the world. Since the values are scattered from very low to very high, the variance is high. A relatively large sample has to be collected to reflect the inherent variability.
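The two orderings of mean, median and mode can be seen on small data sets. The income figures below are invented purely for illustration and do not describe any real country.

```python
import statistics

# Hypothetical symmetric incomes: values balance around the centre.
symmetric = [10, 20, 20, 30, 30, 30, 40, 40, 50]
# Hypothetical positively skewed incomes: many low values, one very high.
skewed = [10, 10, 10, 12, 15, 20, 30, 100]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(
        name,
        "mode", statistics.mode(data),
        "median", statistics.median(data),
        "mean", statistics.mean(data),
        "variance", round(statistics.pvariance(data), 1),
    )
```

For the symmetric data the three measures coincide at 30. For the skewed data the mode is 10, the median is 13.5 and the mean is 25.875, so Mode < Median < Mean, and the variance is much larger.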




Wednesday, March 1, 2017

Art of Visualization in Statistics

Art of visualization is very crucial for critical thinking.
It plays an important role in statistical analysis. The right statistics can be applied to the right data only if the scenario generating that data can be visualized correctly. For example, while analyzing air pollution data, a discussion with the subject expert helps visualize the scenario. Exploratory data analysis helps to explore this data further and identify underlying trends and patterns. Various software packages offer graphical functions like rotation and grids that help in understanding this scenario further. Computing the probability of a complex real phenomenon requires visualization of a multidimensional (n) probability space. Probability theory is based on this visualization of an n dimensional probability space. Probability is defined as the ratio of the numbers of elements of two sets, the set of favorable cases and the set of total cases. In such complex real-life random phenomena, we should visualize the probability of an event as an n dimensional volume in an n dimensional probability space.

Mental visualization of the data collection process, backed by a good theory, helps in the generation of good quality data. Thus the development of computer-aided visualization skills builds strong intra-disciplinary and interdisciplinary bridges.