Forum Replies Created
-
AuthorPosts
-
pstinchcombeMember
OK, thanks — so basically the conclusion is that we can do descriptive statistics on a non-representative sample just fine, and it’s only when we want to make inferences about the whole population that we need to worry about representativeness.
It seems to me that questions about whether there *is* a linear trend (or, similarly, a nonlinear trend, or a relationship between categorical variables) are inherently inferential. The sample correlation is almost never exactly zero, so a purely descriptive reading would answer “yes” for nearly every data set. Do you agree with that, or is there a purely descriptive way to make sense of such a question?
The progressions doc does indicate that we should make such judgments (“if the two proportions… are about the same…”). When one of two proportions is a little higher than the other, the only way I know to distinguish “about the same” from “one is higher” is to make an (informal) inferential judgment. If I flip a coin 100 times and 51% of the flips are heads, that’s about what I’d expect from a fair coin; if I flip a coin 100,000,000 times and 51% are heads, that’s not what I’d expect from a fair coin.
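To put rough numbers on the coin comparison, here is a minimal sketch using the usual normal approximation for a proportion. The function name `z_score_for_heads` is just an illustration, not anything from the progressions doc. It measures how many standard errors separate the observed share of heads from the 50% a fair coin would give.

```python
import math


def z_score_for_heads(n_flips, observed_share, fair_share=0.5):
    """Distance, in standard errors, between the observed share of heads
    and the share expected from a fair coin (normal approximation)."""
    standard_error = math.sqrt(fair_share * (1 - fair_share) / n_flips)
    return (observed_share - fair_share) / standard_error


# Same 51% share of heads, very different sample sizes.
for n_flips in (100, 100_000_000):
    z = z_score_for_heads(n_flips, 0.51)
    print(f"{n_flips:>11,} flips, 51% heads: {z:.1f} standard errors from fair")
```

With 100 flips the gap is about 0.2 standard errors, well within ordinary chance variation; with 100,000,000 flips the same 51% is about 200 standard errors away. The sample proportion is identical in both cases, so the “about the same” judgment is coming from the sample size, which is what makes it inferential.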
Does this seem right to you, that questions about whether there is a relationship are inferential by their very nature? If so, we’ve got to be careful about drawing any conclusions from the data set in the example on speeds and lifespans. Saying “the linear trend is due to a couple of outliers” seems okay, as does “the line that best fits the data is…” But asking whether there is a linear trend is a bad idea, because the data are unrepresentative and the question is inferential (or else the answer is “yes” for almost every data set, including this one).
pstinchcombeMember
Just to clarify, I certainly don’t think this problem is a good way to introduce students to the idea of representative sampling. The question isn’t whether we should use such datasets to teach students about representative sampling: it’s whether we should use such datasets at all.
My impression has been that representativeness of a sample is still relevant — although harder to assess — when the population in question isn’t as easy to understand. If we don’t believe our sample is representative at least in the key regards, what justification could we have for drawing conclusions about the relationship between characteristics of animals from that sample?
Part of the problem, for me, is that the standards do address this, briefly, when they say “random sampling tends to produce representative samples and valid inferences.” The probability that any random selection of animals — whether you weighted by geographic distribution, or by population, or biomass, or any other reasonable criterion except familiarity to us — would produce a sample which is this weighted towards large animals and mammals is vanishingly small.
pstinchcombeMember
On question two:
Mathematically, I think the mean absolute deviation is more sensitive to outliers than the IQR is, and it makes sense for a measure of spread to be sensitive to outliers. For example, the IQR of the data set {4,5,6,6,7,8,9} is the same as the IQR of {1,5,5,8,8,8,20} (in both, the quartiles are 5 and 8), but I think any reasonable person would say the second set is more dispersed.
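For anyone who wants to check the arithmetic, here is a small sketch that computes both measures for the two data sets. It assumes the common school convention of taking the quartiles as the medians of the lower and upper halves, leaving out the middle value when the count is odd; other quartile conventions can give slightly different numbers. The function names are made up for this example.

```python
from statistics import mean, median


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves.
    The middle value is left out when the count is odd.
    Assumes at least two values."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


first = [4, 5, 6, 6, 7, 8, 9]
second = [1, 5, 5, 8, 8, 8, 20]

for name, data in (("first", first), ("second", second)):
    print(name, "quartiles:", quartiles(data),
          "IQR:", iqr(data),
          "MAD:", round(mean_absolute_deviation(data), 2))
```

Both sets come out with quartiles 5 and 8 and an IQR of 3, while the mean absolute deviation is about 1.35 for the first set and about 3.59 for the second, so only the MAD registers the extra spread from the 1 and the 20.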
From a practical standpoint, I think IQR is rarely used by professionals, so students are unlikely to see it reported. If part of the goal of the S&P curriculum is to prepare students to understand the stats they encounter, IQR does very little to serve that goal. MAD isn’t used very often either (I think it is more common than IQR, but it’s still not very standard), but it’s closer to the measure of spread that actually is used most often, the standard deviation.