Monday, 16 June 2014

Could the Flynn effect be an invalid artefact? Yes - if IQ tests are no better than any other type of exam at tracking long-term changes in cognitive ability

*

Supposing we just accept that IQ tests are no better at measuring long-term change in abilities than any other type of examination?

Then it would not be surprising that the 'Flynn effect' - of rising raw IQ test scores over the twentieth century - seems to have no real-world validity; and is contradicted by slowing simple reaction times over the same timescale.

*

But why should we suppose, why should we assume (without proof) in the first place that the raw scores of IQ tests are any better at tracking longitudinal changes of general intelligence than are the raw scores of examinations of (for instance) Latin vocabulary, arithmetic, or historical knowledge?

Everybody knows that academic exams in Latin, Maths, History or any other substantive field will depend on a multitude of factors - what is taught, how big is the curriculum, how it is taught, how the teaching relates to the exam, how much practice of exams and of what type, the conditions of the exam (including possibilities for cheating), how the exam is marked (including possibilities of cheating), and the proportion and nature of the population or sample to whom the exam is administered.

In cross-sectional use, this type of exam is good at predicting relative future performance on the basis of rank order in the results (not on the basis of absolute percentage scores) when applied to same-age groups that have been taught a common curriculum, etc. - and in this respect academic exams resemble IQ tests (IQ tests being, of course, marked and interpreted as age-specific, rank-order exams).

All of which means that the raw score of academic exams - the percentage correct - means nothing (or not necessarily anything) when looked at longitudinally. Different percentage scores among different groups at different times are exactly what we expect from academic exams.

*

Cross-sectionally, performances in different academic exams correlate with each other; and with 'g' as calculated from IQ tests, or with sub-tests of IQ tests.

But just because differential performance in an IQ test (a specific test, in a specific group, at a specific time) is a valid predictor, that does not mean that IQ testing over time is a valid measure of change in general intelligence.

The two things are utterly different.

Cross-sectional use of IQ testing measures relative differences now to predict relative differences in future; but longitudinal use of IQ data uses relative differences at various time-points to try to measure objective change over time: incommensurable.

*

So, what advantage do IQ tests have over academic exams? Mainly that good IQ tests are less dependent on prior educational experience; and also (which is not exactly the same thing) that their components are 'g-loaded'.

Historically, IQ tests were mainly used to pick out intelligent children from poor and deprived backgrounds - whose social and educational experience had led to them under-performing on, say, Latin, arithmetic and History exams - because they had never been taught these subjects - or because their teaching was insufficient or inadequate in some way.

It was found that a high rank-order score in IQ testing was usefully-predictive of high rank-order performance in future educational exams (assuming that the requisite educational inputs were sufficient: high IQ does not lead to high scores in Latin vocabulary unless the child has actually studied Latin.)

But IQ tests were done cross-sectionally - to put test-takers in rank order - they were not developed to measure longitudinal change within or between age cohorts. Indeed, since IQ tests are rank-order tests, they have no reference point to anchor them: 100 is the average IQ (for England, as the reference population), but that number of 100 is not anchored or referenced to anything else - it is merely an average; '100' does not mean anything at all as a measure of intelligence, just as an average score of 50% in a Latin vocabulary exam is not an absolute measure of Latin ability - the test score number 50 does not mean anything at all in terms of an absolute measure of Latin ability.
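To make the anchoring point concrete, here is a minimal sketch (Python, with invented raw-score numbers) of the standard norming step: each cohort is rescaled against its own mean and spread, so both come out at IQ 100 regardless of how their raw scores differ.

```python
import numpy as np

rng = np.random.default_rng(42)

def norm_to_iq(raw):
    """Rescale raw scores to the conventional IQ metric
    (mean 100, SD 15) - within this sample only."""
    return 100 + 15 * (raw - raw.mean()) / raw.std()

# Two hypothetical cohorts sitting the same 60-item test, with
# different average raw performance (all numbers invented).
cohort_1920 = rng.normal(loc=25, scale=6, size=10_000).clip(0, 60)
cohort_1980 = rng.normal(loc=35, scale=6, size=10_000).clip(0, 60)

for year, raw in [(1920, cohort_1920), (1980, cohort_1980)]:
    iq = norm_to_iq(raw)
    print(f"{year}: mean raw = {raw.mean():.1f}, mean IQ = {iq.mean():.1f}")

# Both cohorts come out at mean IQ 100.0: each is normed against
# itself, so the '100' is anchored to nothing outside the sample,
# and the 10-point raw difference carries no unit of intelligence.
```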

*

What applies to the academic exam or IQ test as a whole, also applies to each of the individual items of the test. The ability to answer any specific individual test item correctly, or wrongly, depends on those things I mentioned before: "what is taught, how big is the curriculum, how it is taught, how the teaching relates to the exam, how much practice of exams and of what type, the conditions of the exam" etc. etc...

My point is that we have been too ready to assume that IQ testing (in particular raw average scores and specific item scores) is immune to the limitations, variations and problems of all other types of academic exam - problems which render them more-or-less meaningless when raw average scores or specific item scores are used, decontextualized, in the attempt to track long-term changes in cognitive ability.

*

It is entirely conjectural to suppose, to assume, that IQ tests can function in a way that other cognitive ability tests (such as academic exams) cannot. And once this is understood, it can be seen that - far from being a mystery - there is nothing to explain about the Flynn effect.

If longitudinal raw average or test-item IQ scores have zero expected predictive validity as a measure of intelligence change, then there is no mystery to solve regarding why they might change, at such and such a rate, or stop changing, or anything else!

The Flynn effect might show IQ raw scores or specific item responses going up, down, or round in circles - and it would not necessarily mean anything at all!

*

10 comments:

pumpkinperson said...

There are two Flynn Effects. A huge Flynn Effect on culture fair non-verbal IQ tests, and a generally smaller Flynn Effect on culturally biased verbal-numerical IQ tests. In my humble opinion, the culture fair Flynn Effect reflects real huge biological gains in non-verbal brain power caused by 20th century nutrition, but the culturally biased Flynn Effect is just a spurious consequence of more schooling, mass media and cultural change:

http://brainsize.wordpress.com/2014/06/10/more-thoughts-on-the-flynn-effect/

Bruce Charlton said...

@p - IF IQ tests are a valid instrument for measuring changes in long term cognitive ability...

pumpkinperson said...

"But why should we suppose, why should we assume (without proof) in the first place that the raw scores of IQ tests are any better at tracking longitudinal changes of general intelligence than are the raw scores of examinations of (for instance) Latin vocabulary, arithmetic, or historical knowledge?"

We can't assume that all IQ tests are better at tracking cohort differences in "general intelligence" than scholastic achievement tests are, but we should tentatively assume that non-verbal "culture fair" tests are better at tracking real cognitive differences between cohorts, because the "culture fair" tests were explicitly designed to be insensitive to cultural and educational differences. So, since long-term changes in the culture should not influence "culture fair" tests, the default assumption is that changes in intelligence have occurred. Remember, "culture fair" tests have been used to compare people whose cultures differ by tens of thousands of years of social change and thousands and thousands of miles of geography (hunter/gatherers and modern yuppies), so the notion that they are not even valid for comparing two generations within the same family and the same town means that 100 years of cross-cultural research has been absolutely worthless. It's far more likely that intelligence itself is changing.

"Indeed, since IQ tests are rank-order tests, they have no reference point to anchor them: 100 is the average IQ (for England, as the reference population), but that number of 100 is not anchored or referenced to anything else - it is merely an average; '100' does not mean anything at all as a measure of intelligence, just as an average score of 50% in a Latin vocabulary exam is not an absolute measure of Latin ability - the test score number 50 does not mean anything at all in terms of an absolute measure of Latin ability."

All of that's true, but so what? If we randomly selected 100 words from a Latin dictionary to test students on, then a score of 50% would have an objective meaning: one knows 50% of Latin vocabulary. It would tell us objectively whether Latin knowledge is rising, falling, or stabilizing - but it would give us no more insight into whether that reflects a change in intelligence.
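The domain-sampling idea here can be made concrete. In the minimal sketch below (all numbers hypothetical), the test items are treated as a simple random sample from a defined word list, so the percentage correct estimates an absolute quantity - the fraction of the domain known - with a computable margin of error.

```python
import math

sample_items = 100  # words drawn at random from a defined Latin word list
correct = 50        # items the student answers correctly

p_hat = correct / sample_items                      # estimated fraction of the list known
se = math.sqrt(p_hat * (1 - p_hat) / sample_items)  # binomial standard error
print(f"Estimated share of the word list known: {p_hat:.0%} +/- {1.96 * se:.0%}")

# Because the items are a random sample of a fixed, defined domain,
# the 50% is anchored to something absolute - unlike a normed IQ of 100.
```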

"Cross-sectional use of IQ testing measures relative differences now to predict relative differences in future; but longitudinal use of IQ data uses relative differences at various time-points to try to measure objective change over time: incommensurable."

In the 19th century, 25% of Dutch people failed the army's height requirement. Today fewer than one in 1,000 Dutch people fail that same height requirement. Even without knowing what that height requirement was, or what the height difference between the two eras is, the mere relative difference in passing rate allows us to conclude that there have been huge objective changes in Dutch height.
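For what it is worth, the arithmetic behind this inference can be made explicit. Assuming heights in each era are roughly normally distributed with a similar spread (an assumption, not something stated in the comment), the two failure rates pin down the size of the shift in standard-deviation units.

```python
from scipy.stats import norm

# Failure rates quoted in the comment; the height cut-off itself is unknown.
fail_then, fail_now = 0.25, 0.001

# Under a normal model, the same fixed cut-off sits at these z-scores
# within each era's height distribution.
z_then = norm.ppf(fail_then)  # about -0.67
z_now = norm.ppf(fail_now)    # about -3.09

shift_in_sds = z_then - z_now  # about 2.4 standard deviations
print(f"Implied upward shift: {shift_in_sds:.2f} SD")

# With an SD of roughly 6-7 cm (assumed), that is on the order of 15 cm:
# a large objective change inferred purely from two relative pass rates.
```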

Bruce Charlton said...

@p - From what you say, I don't think you have followed my argument.

pumpkinperson said...

Perhaps I'm not understanding your argument. It sounds like you're arguing that just because a measurement has predictive and construct validity within generations does not mean it's valid between generations - that measurements need to be anchored in something concrete to have long-term meaning. Jensen used an analogy: imagine we measured height by measuring people's shadows. The relative differences would have predictive validity for, say, basketball success among people all measured at the same time of day, but could not be compared with people measured at a different time of day (where all shadows were longer because of the sun's changing position, not because people were taller) - because shadows are only useful as a relative measure of height, not an absolute measure of height.
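Jensen's shadow analogy is easy to simulate. In the sketch below (all numbers invented), shadow length preserves the rank order of heights perfectly within any one time of day, yet comparing raw shadow lengths across times of day says nothing about changes in height.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

heights = rng.normal(loc=175, scale=7, size=1_000)  # cm; hypothetical sample

def shadow(heights_cm, sun_elevation_deg):
    """Shadow length = height / tan(sun elevation)."""
    return heights_cm / math.tan(math.radians(sun_elevation_deg))

noon = shadow(heights, 60)     # high sun: short shadows
evening = shadow(heights, 20)  # low sun: long shadows

# Within either session, the rank order of shadows matches heights exactly:
assert (np.argsort(noon) == np.argsort(heights)).all()

print(f"noon mean shadow:    {noon.mean():.0f} cm")
print(f"evening mean shadow: {evening.mean():.0f} cm")

# The evening mean is far larger although nobody grew: a relative measure
# can be a perfect cross-sectional predictor while its raw level, compared
# across occasions, says nothing about the trait itself.
```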

Bruce Charlton said...

@p - yes, that's it. The idea that culture-fair IQ tests were suitable for tracking long-term changes in intelligence was just an assertion - when they were actually used for that purpose (in the mid 20th century) they gave the opposite result to that which was predicted - in other words, they showed rising IQ test scores when it was predicted (from fertility being inversely correlated with IQ) that IQ scores would be (ought to be) falling.

This matter of rising IQ scores is, of course, the Flynn effect. But this does not prove that average g is rising, because the suitability of IQ testing for measuring longitudinal change never was established - all that happened was that it yielded an unexpected (opposite) result, and gradually - over a few decades - people started to believe the IQ tests rather than the theory which predicted that g *must* be declining due to 'dysgenic' fertility patterns.

The new work on simple reaction times has, I think, clarified what is going on - and has proven that IQ testing is unsuitable for measuring long-term change, and that the Flynn effect was essentially an artefact of testing - which is how IQ test scores can rise (or zig zag, or loop the loop) while g is declining - and declining fast.

pumpkinperson said...

"The idea that culture-fair IQ tests were suitable for tracking long-term changes in intelligence was just an assertion - when they were actually used for that purpose (in the mid 20th century) they gave the opposite result to that which was predicted - in other words, they showed rising IQ test scores when it was predicted (from fertility being inversely correlated with IQ) that IQ scores would be (ought to be) falling."

But the mere fact that scores are increasing for non-genetic reasons needn't imply that the measuring tool lacks long-term validity. By that logic a stadiometer is an invalid measure of 20th century height increases.

"This matter of rising IQ scores is, of course, the Flynn effect. But this does not prove that average g is rising, because the suitability of IQ testing for measuring longitudinal change never was established - all that happened was that it yielded an unexpected (opposite) result, and gradually - over a few decades - people started to believe the IQ tests rather than the theory which predicted that g *must* be declining due to 'dysgenic' fertility patterns."

But the long-term suitability of measuring g loadings has also not been established. For g, much like IQ scores, is a relative measure. g is just the single biggest source of cognitive variation within cohorts, but if we were to factor analyse differences between cohorts, we might find very different g loadings for various subtests. Within cohorts, IQ differences are extremely genetic, and so the g loading closely parallels the genetic loading. But if between cohorts, IQ differences are caused by nutrition, then between cohort g loadings might closely parallel nutrition loadings.
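That within- versus between-cohort contrast can be sketched numerically. In the toy simulation below (all loadings invented), the dominant factor among individuals within a cohort is 'g', while the dominant factor among cohort means has a quite different loading pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loadings: which of four subtests each source of variation drives.
g_loadings = np.array([0.9, 0.8, 0.4, 0.3])          # within-cohort factor
nutrition_loadings = np.array([0.2, 0.3, 0.8, 0.9])  # between-cohort factor

def first_component(X):
    """First principal component of the rows of X, sign-fixed for readability."""
    X = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]
    return v * np.sign(v.sum())

# Individuals within one cohort: subtest variation driven by g.
g = rng.normal(size=(5_000, 1))
people = g * g_loadings + rng.normal(scale=0.5, size=(5_000, 4))

# Means of 30 cohorts: shifts driven by a 'nutrition' level.
nutrition = rng.normal(size=(30, 1))
cohort_means = nutrition * nutrition_loadings + rng.normal(scale=0.1, size=(30, 4))

print("within-cohort loadings: ", first_component(people).round(2))
print("between-cohort loadings:", first_component(cohort_means).round(2))

# The two patterns differ: the biggest source of differences between
# individuals (g) need not be what moves the cohort averages over time.
```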

"The new work on simple reaction times has, I think, clarified what is going on - and has proven that IQ testing is unsuitable for measuring long-term change, and that the Flynn effect was essentially an artefact of testing - which is how IQ test scores can rise (or zig zag, or loop the loop) while g is declining - and declining fast."

I agree that the reaction time work proves that GENETIC intelligence has not been increasing, but it does not yet prove the Flynn Effect is an artefact of testing. As I documented in the link above, some cognitive abilities are much more sensitive to pre-natal nutrition than others. It could be that poor nutrition impaired the ability of Victorians to solve certain IQ test problems, but that basic information processing speed was preserved, thus reflecting their genetic potential. If that's true, then the Flynn Effect would reflect real valid biological gains in intelligence that just happen to be completely non-genetic.

Bruce Charlton said...

@p "I agree that the reaction time work proves that GENETIC intelligence has not been increasing, but it does not yet prove the Flynn Effect is an artefact of testing."

Agreed.

One point is that (beyond a limited practice effect) you cannot teach people to have faster reaction times; but you can teach people - partly by remembering, partly by working-out - more (and more) correct answers to what are often, in practice, a limited range of questions and question-types.

This has been further exacerbated by the progressively increasing use of multiple-choice answers - where the test-taker merely needs to recognize (not remember) the correct answer.

But these are just illustrations. The general point is that IQ tests are not designed for long-term monitoring and there is no compelling reason to assume that they can do this.

In fact, the first use of IQ testing was probably to find bright children who were poor and uneducated - the tests were designed to require minimal socialization and a minimal educational curriculum - so that good marks were attainable by some children who were (for example) farm labourers' children living in rural Northumberland, who were literate and numerate but had little else in the way of education.

http://medicalhypotheses.blogspot.co.uk/2008/09/pioneering-studies-of-iq.html

Many of the issues of IQ testing are a consequence of assuming that their precision or resolving power is greater than is actually attainable. They are fairly crude and rough and ready rank orderings - but very useful, nonetheless.

pumpkinperson said...

"The general point is that IQ tests are not designed for long-term monitoring and there is no compelling reason to assume that they can do this."

But culture-reduced IQ tests were designed to transcend huge cultural differences, so in my humble opinion, the burden of proof is on those who claim that long-term cultural changes have been too great for these tests to bridge the gap.

Simply stating that even culture-reduced IQ tests might not be valid for comparing different generations is not a very satisfactory explanation for the Flynn Effect, because such arguments could be used to dismiss almost any group difference in IQ. Culture-reduced IQ tests are considered valid for comparing different towns, different cities, different countries, different continents, different language groups, different genders, different education levels, different social classes, different religions - and yet, for some mysterious reason, you believe they are uniquely suspect for comparing different generations.

Maybe you're right, but if one is to claim that generational differences are an exception to the cross-cultural validity these tests enjoy, one should have a specific theory as to why they're an exception, and that theory should make falsifiable predictions.

Bruce Charlton said...

@p - You need to think a bit more about what I have written, and what you are saying - because it doesn't cohere.

Also - what you are suggesting makes no sense in terms of how science works - what you are asking is impossible in principle to answer.