Normalization/Correction/Sampling problem

04-14-2020 , 07:28 PM
Let's say I have two groups of cells (a biology example): red blood cells (RBCs) and white blood cells (WBCs).

My dataset has 90 RBCs and 10 WBCs.

I can order the cells in the dataset by some metric, let's say size. I order them by size and take out the 20 biggest ones (I choose a size cutoff that gives me the top 20% of cells) to examine them.

I examine them and discover that 15 (75%) are RBCs and 5 (25%) are WBCs.

Here is the question. It would obviously be misleading to conclude that "most cells above size X are RBCs", because the sample has far more RBCs than WBCs to begin with.


Is there a way to normalize or correct the 75% and 25% (15 and 5) figures so they reflect what the numbers would be if the two sets were equal in size (that is, if you had equal numbers of RBCs and WBCs)?

Thanks
04-15-2020 , 05:43 AM
1. The sample is too small.
2. There is a difference between "the top 20% of cells" and "most cells above size X".
If I were you, I would make more groups and run more analyses.
For example, I would include these calculations: the top 20% of RBCs and WBCs together, the top 20% of RBCs, the top 20% of WBCs, and then the top 20% together (RBCs and WBCs) over size X and under size X (I would use at least two different values of X). I would do the same for the top 40% and the top 60% to 80%.
ADDED LATER:
I would also compare the top 20% of WBCs against the top 20% of RBCs (by size, in %), and do the same with the top 40% and the top 60% to 80%.

The more groups you make (you can make them even more specific, with groups smaller than 20%, for example 10%), the better the interpretation will be.
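
A rough sketch of the multi-cutoff idea above, assuming you have each cell's size and type available as `(size, cell_type)` pairs. The function name `share_in_top` and the ten toy cells are made up for illustration; they are not data from this thread.

```python
from collections import Counter


def share_in_top(cells, top_fraction):
    """For each cell type, return the fraction of that type that lands in
    the largest `top_fraction` of all cells (ties keep their input order)."""
    ranked = sorted(cells, key=lambda cell: cell[0], reverse=True)
    n_top = round(len(ranked) * top_fraction)
    totals = Counter(cell_type for _, cell_type in ranked)
    in_top = Counter(cell_type for _, cell_type in ranked[:n_top])
    return {cell_type: in_top[cell_type] / totals[cell_type] for cell_type in totals}


# Hypothetical toy data as (size, type); NOT measurements from the thread.
toy_cells = [
    (9.0, "WBC"), (8.0, "RBC"), (7.0, "WBC"), (6.0, "RBC"), (5.0, "RBC"),
    (4.0, "RBC"), (3.0, "RBC"), (2.0, "RBC"), (1.0, "RBC"), (0.5, "RBC"),
]

# Repeat the same within-type calculation at several cutoffs.
for cutoff in (0.2, 0.4, 0.6, 0.8):
    print(cutoff, share_in_top(toy_cells, cutoff))
```

Looking at how the per-type shares change across cutoffs gives a fuller picture than a single top-20% slice.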

Last edited by felitelli; 04-15-2020 at 06:02 AM.
04-15-2020 , 09:24 AM
To add to the previous post:
Sample of 100:
90 RBCs
10 WBCs
Top 20%: 15 RBCs and 5 WBCs

Some calcs:
5/10 of the WBCs are in the top 20%, so 50% of WBCs are in the top 20%.
15/90 of the RBCs are in the top 20%, so 17% of RBCs are in the top 20%.

This means WBCs are "bigger": 50% > 17%.

But this is really too simple an interpretation.

The RBC and WBC samples do not have to be the same size (but the overall sample needs to be bigger), and you should run the calculations I mentioned in the previous post combined with the ones in this post.
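
The calcs above can be sketched in a few lines of Python using only the counts quoted in this thread. The last step, rescaling the two within-type rates so they sum to 1, is one possible way to answer the "equal numbers" question; it assumes each type's rate would stay the same in a balanced sample, which is an assumption and not something the data here can confirm.

```python
# Counts quoted in the thread.
totals = {"RBC": 90, "WBC": 10}
in_top = {"RBC": 15, "WBC": 5}

# Share of each cell type that made it into the top 20%.
rates = {t: in_top[t] / totals[t] for t in totals}

# Rescale so the shares sum to 1: the expected make-up of the top group
# if both types were equally common (assumes the rates stay unchanged).
rate_sum = sum(rates.values())
balanced_share = {t: rates[t] / rate_sum for t in rates}

for t in totals:
    print(f"{t}: {rates[t]:.1%} in top 20%, balanced share {balanced_share[t]:.1%}")
```

Under that assumption the observed 75% RBC / 25% WBC split flips to about 25% RBC / 75% WBC.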

Last edited by felitelli; 04-15-2020 at 09:42 AM.
04-15-2020 , 05:07 PM
Quote:
Originally Posted by felitelli

Some calcs:
5/10 of WBC are in the top 20%. That would make: 50% of WBC in top 20%.
15/90 of RBC are in top 20%. That would make: 17% of RBC in top 20%.
This is exactly what I was looking for, but I couldn't think of it for some reason. It seems trivial looking at it now, but I couldn't figure it out.

Thanks so much
04-22-2020 , 11:21 PM
This is where statistical significance testing comes in. Just by looking at the percentages, you cannot tell whether there is enough information to draw a conclusion. Using a chi-squared test, or something similar, you could check whether the association between color (red vs. white cell) and size is stronger than chance alone would plausibly produce.
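
As a minimal sketch, here is a chi-squared test on the 2x2 table built from the counts earlier in the thread (15 RBCs and 5 WBCs in the top 20%, 75 RBCs and 5 WBCs below it), written with only the standard library. Because the expected number of WBCs in the top group is just 2, below the usual rule of thumb of 5, the chi-squared p-value is only approximate, so a one-sided exact (hypergeometric, Fisher-style) p-value is included as the "something similar". The function names are illustrative, not from any library.

```python
import math

# Rows: in top 20% / below cutoff. Columns: RBC / WBC. Counts from the thread.
table = [[15, 5],
         [75, 5]]


def chi_squared_2x2(t):
    """Pearson chi-squared statistic (no continuity correction) and its
    p-value for a 2x2 table, which has 1 degree of freedom."""
    row = [sum(r) for r in t]
    col = [t[0][j] + t[1][j] for j in range(2)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (t[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


def exact_one_sided(wbc_in_top, n_top, n_wbc, n_total):
    """P(at least `wbc_in_top` WBCs among the top cells) if type and size
    were unrelated, from the hypergeometric distribution."""
    ways = sum(
        math.comb(n_wbc, k) * math.comb(n_total - n_wbc, n_top - k)
        for k in range(wbc_in_top, min(n_wbc, n_top) + 1)
    )
    return ways / math.comb(n_total, n_top)


stat, p_chi2 = chi_squared_2x2(table)
p_exact = exact_one_sided(wbc_in_top=5, n_top=20, n_wbc=10, n_total=100)
print(f"chi-squared = {stat:.2f}, p = {p_chi2:.4f}; exact one-sided p = {p_exact:.4f}")
```

With a sample this small, either p-value should be read as a rough indication rather than a firm conclusion.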
05-28-2020 , 08:38 AM
Yeah, just compute a correlation coefficient and its associated standard error.

You could also do some hypothesis testing, such as a difference-in-means test. I second VBAces' post, which goes in the same direction as mine.
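
A small sketch of the correlation idea, using only the counts from the thread. With two yes/no variables ("is a WBC" and "is in the top 20%"), the Pearson correlation reduces to the phi coefficient of the 2x2 table. The standard error below is the common large-sample approximation sqrt((1 - r^2) / (n - 2)), which is an assumption on my part about which s.e. is meant. A difference-in-means test would need the individual cell sizes, which were not posted.

```python
import math

# Counts from the thread:
# a = WBCs in the top 20%, b = RBCs in the top 20%,
# c = WBCs below the cutoff, d = RBCs below the cutoff.
a, b, c, d = 5, 15, 5, 75
n = a + b + c + d

# Pearson correlation between the two 0/1 indicators (phi coefficient).
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Large-sample approximation for the standard error of a correlation.
se = math.sqrt((1 - phi ** 2) / (n - 2))

print(f"phi = {phi:.3f}, approx. s.e. = {se:.3f}")
```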