Warren D.Smith, May 2013.
The British Association for Advancement of Science in 1932 tasked a committee to report on "quantitative measurement of sensory events." It produced its final report in 1940. This was stimulated by the "sone scale of loudness" purported to measure "objective scale of [subjective] auditory sensation."
Encyclopaedia Britannica, "Sone": Loudness is a subjective characteristic of a sound (as opposed to the sound-pressure level in decibels, which is objective and directly measurable). Consequently, the sone scale of loudness is based on data obtained from subjects who were asked to judge the loudness of pure tones and noise. One sone is arbitrarily set equal to the loudness of a 1,000-hertz tone at a sound level of 40 decibels above the standard reference level (i.e., the minimum audible threshold). A sound with a loudness of four sones is one that listeners perceive to be four times as loud as the reference sound.
Such scales are crucially important for purposes such as telephony, computer speech synthesis, audio compression, etc. But one member of the BAAS committee claimed any such quantitative scale "is not merely false but in fact meaningless unless and until a meaning can be given to the concept of addition as applied to sensation." (Final report p.245.) Meanwhile other members had extremely opposite views!
You can already tell that that quote was hogwash. A counterexample is "temperature," an apparently obscure and little known concept unfamiliar to eminent members of the British Association for Advancement of Science. Temperature is meaningful and measurable, despite the fact that the "sum of two temperatures" seems meaningless and you never do that ("what is the total temperature of this apple and this cup of coffee?").
S.S.Stevens, who'd co-invented the Sone scale, tried to resolve this controversy in his influential paper
On the Theory of Scales of Measurement, SCIENCE magazine 103,2684 (June 1946) 677-680
by proposing four "scale" notions, which he named 'nominal', 'ordinal', 'interval' and 'ratio'. [Interval and Ratio scales automatically are Ordinal also, which in turn automatically are Nominal also.] Stevens claimed that "the statistical manipulations that can be [meaningfully] applied to empirical data depend on the type of scale" – and some kinds of scales make sense in certain situations while others do not and would be meaningless. Here is Stevens' table, but we have added two additional rows to it at bottom.
Scale name | Basic empirical operations | Mathematical group | Some "permissible" statistics |
---|---|---|---|
Nominal | Equality testing only | Permutation group of 1-to-1 bijections x'=f(x) among scale elements | Number of cases. Mode. Contingency correlation. |
Ordinal | also: x<y and x>y testing | "Isotonic" group, x'=f(x) where f monotonic increasing | Median. Quantiles. Disallowed: determining the midpoint (x+y)/2 of two scale values x,y. |
Interval | also: comparing differences | "General linear" group x'=ax+b, [a>0, b both real] | Ordinal stuff, and also: Mean. Standard deviation. Rank-order correlation. Product-moment correlation. |
Ratio | "ordinal" operations and also: comparing ratios | "Similarity" group x'=ax, [a>0 real] | Ordinal stuff, and also: Coefficient of variation. |
Angular (new) | Comparing pair-differences | "Modular additive" group x'=x+a mod 360, where a is any real | Ordinal stuff, and also: mean squared difference. Disallowed: multiplying by constants. |
Range (new) | "interval" operations and also computing and comparing convex linear combinations αx+(1-α)y where 0≤α≤1 | See text for more | Ordinal stuff, and also: Weighted and unweighted means. Disallowed in general: multiplying by constants>1. |
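A minimal sketch of what "permissible" means in the table above: a statistic is permissible on a scale if it commutes with every transformation in that scale's group. The sample data and the two transformations (`monotone`, `affine`) are made-up illustrations, not anything taken from Stevens' paper.

```python
import math
from statistics import mean, median

# Made-up sample; odd length so the median is an actual data point.
scores = [1.0, 2.0, 3.0, 10.0, 20.0]

def monotone(x):
    # An arbitrary increasing map: admissible on an ordinal scale.
    return math.sqrt(x)

def affine(x):
    # x' = a*x + b with a > 0: admissible on an interval scale.
    return 2.0 * x + 7.0

# The median commutes with any increasing relabeling of the scale...
median_commutes = median([monotone(s) for s in scores]) == monotone(median(scores))

# ...but the mean does not, so the mean is not an "ordinal" statistic.
mean_commutes_monotone = math.isclose(
    mean([monotone(s) for s in scores]), monotone(mean(scores))
)

# The mean does commute with affine maps, so it is an "interval" statistic.
mean_commutes_affine = math.isclose(
    mean([affine(s) for s in scores]), affine(mean(scores))
)

print(median_commutes, mean_commutes_monotone, mean_commutes_affine)
```

The same check could in principle be repeated for any row of the table by swapping in a transformation from that row's group.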
1. Is it reasonable to use numerical scales in voting? The answer is a resounding no, for several reasons:

2. The numbers mean nothing unless they are defined: proposals to use weights give them no definition. Their only real "meaning" is found in their strategic use. This induces comparisons, which immediately leads to Arrow's paradox... E.g. with these actual ballot instructions: "Give a grade to each of the twelve candidates: either 0, or 1, or 2 (2 the best grade, 0 the worst). To do so, place a cross in the corresponding box etc. The candidate elected with [this] method is the one who receives the highest number of points."

3. nothing is said concerning the meaning of 0, 1, or 2. The numbers induce relative, so strategic, behavior. Other numbers could have been given. For example, with {-1, 0, +1} mathematically there is no difference, but were these numbers used the behavior of the voters would almost surely have been different. [In fact, this experiment later was tried and voter behavior was significantly different.]

4. When numbers are used, they may well not be used in the same way at all: when a 0-100 scale is used, some voters may view 80 to be an excellent grade, others may see it as merely middling.

5. Even if the numbers did provide a common language, they will almost certainly not be a proper interval measure [in the sense of Stevens – it is here that Balinski & Laraki invoke "measurement theory"] – that depends on who the candidates are and how the voters give their grades. For example, the 0-20 scale used in France is a common language, but an 18, 19, or 20 is unheard of in philosophy or literature, so the scale is not an interval measure. Once the distribution of the grades is known – after many elections (or many examinations) – it is possible to determine whether the scale is an interval measure and, if not, to correct it (as did the Danes). But then it is too late, since the weights must be announced ahead of time.

6. Even if it turned out that the scale did approximate an interval measure, the procedure depends on irrelevant alternatives, [hence] is subject to Arrow's paradox: for if one or several candidates drop out, the distribution of the remaining grades will almost certainly be different, so the scale is no longer an interval measure. [For example, in the French 2007 presidential election, the counts of the number of times each of their 6 verbal scores was used, changed considerably when all scores for the 8 "unimportant" among the 12 candidates were removed.]
Balinski and Laraki also attack approval voting. After noting that in their MJ experiment in Orsay France 2007, approval voting would have elected Bayrou if scores≥"assez bien" were approved, but Sarkozy if only scores≥"bien" were, they complained:
7. Approval voting is extremely sensitive to the question posed. Imagine what would have happened if the threshold had been either higher or lower. This shows that approval voting's two-word language is insufficient and arbitrary.
OK, we've had enough. It is time to respond to these attacks.
1a. It is rather strange to see the mathematician Balinski rejecting the use of "numbers" as "meaningless" since "undefined." Numbers are certainly more concise, and I would think better defined (especially for voters some of whom might be poor French speakers or come from different cultures), than essentially all adjectives such as "excellent," "passable," and "assez bien." Indeed, if it were me voting, I would have thought "assez bien" and "passable" meant exactly the same thing, even though Balinski & Laraki think the first is clearly superior! No such difficulty happens with "3>2"; that is agreed by everybody from every culture.
Several French→English dictionaries (e.g. 1982 Harrap's) say "assez bien" means "good enough" and "passable"→"passable." The 2013 Merriam-Webster English→English dictionary says "passable"="good enough."
We also warn the reader that "bien assez" has a third meaning ("quite enough") different from "assez bien." After consulting several French speakers I agree with Balinski & Laraki that "assez bien" is commonly regarded as superior to "passable"; I am just pointing out its non-obviousness to a non-native speaker like me, even with aid from dictionaries.
Indeed, the question of quantifying the strength of adjectives has been examined by the psychometricians Jones & Thurstone 1955. They examined the semantic meanings, to respondents, of 51 scale-point descriptors using numerical scales, and subsequently presented a listing of words and phrases ranging from those expressing "greatest like" to those conveying "greatest dislike." That is, they succeeded in constructing a "continuum of meaning" ranging between the end points "best of all" to its extreme opposite "despise" (p.33), providing the mean scale value and standard deviation of each of the tested words and phrases. Myers & Warner 1968 and later teams such as Bartram & Yelding 1973, Vidali 1975, Wildt & Mazis 1978, and Braunsberger & Gates 2009 all redid the same sort of study over again. They declared a high degree of success in the sense that their numbers were "surprisingly consistent among very diverse groups of people." However, they were not completely consistent. Thus Myers & Warner 1968 showed there are statistically significant differences between the quantified meanings of various words, as perceived by housewives versus students: "slightly poor" was rated 8.48±1.83 (mean±standard deviation) by 25 undergraduate students but 5.92±1.96 by 25 housewives, which is a 4.77σ discrepancy, i.e. taken by itself would have 99.9999% confidence of being genuine. But actually since this was "cherrypicked" as one of the largest among about 100-200 such discrepancies, the confidence really should be only about 99.98% that it is real.
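The "4.77σ" figure can be reproduced with a short calculation. This is a sketch only: it assumes a two-sample z-statistic with unpooled variances, a one-sided normal tail, and a Bonferroni-style correction over an assumed 200 comparisons (the upper end of the "about 100-200" quoted above). None of these choices is stated explicitly in the text.

```python
import math

def z_score(mean1, sd1, n1, mean2, sd2, n2):
    """Difference of two sample means in units of its standard error."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# "slightly poor": students 8.48 +/- 1.83 (n=25), housewives 5.92 +/- 1.96 (n=25)
z = z_score(8.48, 1.83, 25, 5.92, 1.96, 25)

# One-sided normal tail probability (assumption: normal approximation).
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

# Bonferroni-style correction for cherry-picking the largest of many
# discrepancies; 200 is an assumed count, not a figure from the studies.
assumed_comparisons = 200
p_adjusted = min(1.0, assumed_comparisons * p_one_sided)

print(f"z = {z:.2f}")
print(f"confidence, single test: {100 * (1 - p_one_sided):.5f}%")
print(f"confidence, adjusted:    {100 * (1 - p_adjusted):.3f}%")
```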
P.Bartram & D.Yelding: The development of an empirical method of selecting phrases used in verbal rating scales, Journal of the Market Research Society 15 (1973) 151-156.
Karin Braunsberger & Roger Gates: Developing inventories for satisfaction and Likert scales in a service environment, J. of Services Marketing 23,4 (2009) 219-225.
L.V.Jones & L.L.Thurstone: The psychophysics of semantics: an experimental investigation, Journal of Applied Psychology 39,1 (1955) 31-36.
J.H.Myers & W.Gregory Warner: Semantic properties of selected evaluation adjectives, J. Marketing Research 5,4 (November 1968) 409-412.
J.J.Vidali: Context effects on scaled evaluatory adjective meaning, J. Market Research Society 17,1 (1975) 21-25.
A.R.Wildt & M.B.Mazis: Determinants of scale response: label versus position, J. Marketing Research 15,2 (May 1978) 261-267.
Is it democratically fair to partially disenfranchise (say) housewives because they interpret some word differently? Or is it more fair to provide a numerical scale whose meaning is unambiguously defined by the rules of the voting system?
1b. Why must vote-scores have some sort of Balinski-approved "meaning" at all? Why must they be "measurable" according to Stevens' notions at all? Voting systems input "votes" (which are information packets, e.g. bitstrings) and output a winner. Full stop.
Voting is an exercise of power, not a sentiment. The votes need not be "measured" à la Stevens, they merely must be "transmitted." The only true meaning of a vote, in this general setting, is defined – and wholly, completely, and mathematically defined – by the rules the voting system uses to deduce the winner from the votes.
1c. However, there could be additional (untrue) "meaning" carried around by humans as psychological baggage. For example, within the MJ voting system, the true meaning of "assez bien" is defined solely by the MJ winner-determining-algorithm. The untrue meaning that "assez bien"="passable" is carried around by me, and by various dictionaries, as psychological baggage. Naturally, we wish that the 'baggage' and 'true' meanings of votes should correspond as closely as possible for as many people as possible! I also contend (and apparently Balinski & Laraki agree, since they approvingly use Stevensian "measurement theory") that more humans would be happier the more Stevens-like properties votes have. I.e. while it is not necessary that votes viewed as "quality measures" be addable, multipliable, </=/>-comparable, or whatever, the more of that kind of stuff approximately-works, the better a lot of humans will probably feel about it, and the better the voting system will probably work in practice.
1d. For standard greatest-average-wins range voting using numerical scores, the baggage/true meanings-correspondence seems close, and the "range" Stevensian properties exactly work. In contrast, Balinski & Laraki contend that their 6 verbal MJ scores merely form an "ordinal scale," enjoying a strictly smaller set of Stevensian properties. I.e. with MJ only "=", ">" and "<" have "meaning." With MJ, taking the midpoint (x+y)/2 of two scores "has no meaning" unlike in standard range voting (using the real interval [0,1] as the score-set) in which the meaning of m=(x+y)/2 is precisely this: two votes m are equivalent to one x and one y vote.
Now Balinski & Laraki's contention that MJ scores form an "ordinal scale" is substantially justified by the fact that the MJ voting system is monotonic, i.e. a voter increasing her score for Sarkozy can help, but cannot hurt, Sarkozy's winning chances. Thus ">" has its naively expected meaning. Standard range voting also is monotonic, hence also forms an ordinal scale, and my stronger contention that range voting's scores form a range scale is justified by the "true meaning" of convex combination for any rational α with 0≤α≤1: if α=a/b then b votes of αx+(1-α)y are equivalent to a votes of x together with b-a votes of y.
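The "true meaning" of a convex combination under average-based range voting can be checked with exact arithmetic. The sketch below assumes the real interval [0,1] as the score set and uses arbitrary illustrative values for x, y, a and b. It verifies that b ballots scoring a candidate αx+(1-α)y contribute the same total (and hence, since the ballot counts match, the same average) as a ballots scoring x plus b-a ballots scoring y.

```python
from fractions import Fraction

def total(candidate_scores):
    """Sum of the scores one candidate receives (exact rational arithmetic)."""
    return sum(candidate_scores, Fraction(0))

# Illustrative values only: two scores in [0,1] and a rational alpha = a/b.
x, y = Fraction(1, 5), Fraction(9, 10)
a, b = 2, 7
alpha = Fraction(a, b)
blend = alpha * x + (1 - alpha) * y

blended_total = total([blend] * b)              # b voters all give the blend
split_total = total([x] * a + [y] * (b - a))    # a voters give x, b-a give y

# The midpoint case from the text: two votes m versus one x and one y.
m = (x + y) / 2
midpoint_total = total([m, m])
pair_total = total([x, y])

print(blended_total == split_total, midpoint_total == pair_total)
```

No comparable identity is available for a median-based rule, which is why the midpoint of two MJ grades is said above to have "no meaning."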
1e. But in order for a vote to have "meaning" in the eyes of a voter it also would seem desirable for the "participation property" to hold: you, by casting an honest vote, cannot worsen the election winner (with your view of "worsen") versus if you had not voted at all. Without this property, your vote could be "less meaningful" than nothing! And MJ fails the participation property while average-based range voting obeys it! (IRV also fails it.)
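A small made-up election illustrates a participation failure for a median-based rule. This sketch is a simplification, not the full MJ procedure: it elects the candidate with the highest lower-median score (the lower middle grade when the number of voters is even) and ignores MJ's tie-breaking. The ballots are hypothetical.

```python
from statistics import mean

def lower_median(values):
    """Middle value; the lower of the two middle values when the count is even."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]

def winner(ballots, summarize):
    """Candidate whose scores have the largest summary statistic."""
    candidates = ballots[0].keys()
    return max(candidates, key=lambda c: summarize([b[c] for b in ballots]))

# Hypothetical ballots on a 0-5 scale.
before = [{"A": 3, "B": 0}, {"A": 3, "B": 5}]
new_voter = {"A": 5, "B": 4}          # honestly prefers A to B
after = before + [new_voter]

median_before = winner(before, lower_median)   # A: medians are A=3, B=0
median_after = winner(after, lower_median)     # B: medians are A=3, B=4
mean_before = winner(before, mean)             # A: averages are A=3, B=2.5
mean_after = winner(after, mean)               # A: averages are A=11/3, B=3

print(median_before, median_after, mean_before, mean_after)
```

Under the median rule the new voter, who prefers A, flips the winner from A to B by showing up. Under averaging, the same ballot can only move the totals in A's favor relative to B, since it adds more to A than to B.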
1-SUMMARY: At this point it is clear that standard range voting scores have strictly more meaning than MJ scores both according to Balinski & Laraki's very own preferred bludgeon – measurement theory – as well as the participation criterion. In short, their whole attack was exactly wrong, is refuted, shows the opposite of what they thought, and lies completely in ruins.
But wait, there is more.
Contradiction 2 versus 3:
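Balinski & Laraki began their attack by immediately contradicting themselves. They assert the only true meaning of numerical voting scores lies in their strategic use, i.e. as defined by the rules the voting system uses to elect winners. (I devoutly agree, for general voting systems and whether or not the votes are numerical.) And they then correctly claim voter behavior empirically differs when the score set {0,1,2} is replaced by {-1,0,+1}, even though, as they themselves say, "mathematically there is no difference." If strategic use were the scores' only meaning, mathematically equivalent score sets could not induce different behavior.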
Non-logical 4: This exact "argument" can be used against Balinski/Laraki MJ, not for it. So it has no logical impact at all. E.g. I could equally well have said (using their exact same sentence but with a few items swapped) "When adjectives are used, they may well not be used in the same way at all: when an 'Excellent'↔'A Rejeter' scale is used, some voters may view "Bien" to be a 90 while others may see it as merely 60."
Just wrong 5: Balinski & Laraki here implicitly assume (or act as though) the only possible Stevensian scales are the four originally proposed by Stevens. As we have shown there is at least one further such notion, the "range" type scale. (Also the question of whether "18 is unheard of in philosophy or literature" has absolutely nothing to do with it – Stevens in his paper nowhere mentions philosophical or literature use as a criterion – plus we anyhow are rather surprised to hear that 18 has never been used in literature.)
Absurd garbage 6: Both standard range voting and MJ evade Arrow's "impossibility theorem." But the crux of the matter is "Arrow's IIA condition" which says, essentially, that removing candidates should leave the voting system's output-ordering of the remaining candidates unaffected. If, after we remove some candidates, we allow voters to change their ballots to make them more strategic, then the winner could change. In that case, we still get "Arrow's paradox" with both MJ and average-based range voting. For example, in a 3-candidate race where a voter scores Sarkozy="A Rejeter", Bayrou="Insuffisant", and Royal="Excellent", after Royal drops out, that voter, if allowed to do so, might modify her vote to still score Sarkozy="A Rejeter" but Bayrou="Excellent." Balinski & Laraki seem to think that due to the magic-meaning property of their magic words, no voter would ever do that, whereas, if we instead were using average-based range voting with score set {0,1,2,3,4,5} with Sarkozy=0, Bayrou=1, Royal=5, then voters would change to Bayrou=5. That contention is absurd garbage.
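The distinction between fixed ballots and re-strategized ballots can be sketched for average-based range voting. Everything here is hypothetical: the three ballots are invented, and `renormalize` is just one simple model of strategic rescaling (stretch each ballot so its least-liked remaining candidate gets 0 and its favorite gets 5). With fixed ballots, deleting Royal cannot change the Sarkozy-versus-Bayrou ordering; with rescaled ballots it can.

```python
def totals(ballots, candidates):
    """Score totals per candidate (same ordering as averages for equal turnout)."""
    return {c: sum(b[c] for b in ballots) for c in candidates}

def renormalize(ballot, candidates, top=5):
    """Hypothetical strategy model: stretch remaining scores to span 0..top."""
    kept = {c: ballot[c] for c in candidates}
    lo, hi = min(kept.values()), max(kept.values())
    if hi == lo:                      # voter is indifferent; leave unchanged
        return kept
    return {c: top * (s - lo) / (hi - lo) for c, s in kept.items()}

# Invented ballots on the {0,...,5} scale.
ballots = [
    {"Sarkozy": 0, "Bayrou": 1, "Royal": 5},
    {"Sarkozy": 0, "Bayrou": 1, "Royal": 5},
    {"Sarkozy": 3, "Bayrou": 0, "Royal": 5},
]
remaining = ["Sarkozy", "Bayrou"]

full = totals(ballots, ["Sarkozy", "Bayrou", "Royal"])
fixed = totals(ballots, remaining)
strategic = totals([renormalize(b, remaining) for b in ballots], remaining)

print(full)       # Royal leads; Sarkozy is ahead of Bayrou
print(fixed)      # Royal removed, ballots untouched: Sarkozy still ahead
print(strategic)  # Royal removed, ballots rescaled: Bayrou now ahead
```

The same rescaling model could be applied to verbal grades by mapping them to ranks, which is the point of the paragraph above: nothing about words prevents a voter from re-strategizing.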
Strange 7: It is quite odd that Balinski & Laraki attack approval voting but not MJ, because approval voting is the special case of MJ voting when there are only two allowed scores "approve" and "disapprove"! (It also coincides with standard range voting using score set {0,1}.) They complain that the election results could be sensitive to the baggage-meaning of the word "approve." Well, of course! And in MJ, the election results would similarly be sensitive to the baggage-meaning of any and all the 6 Balinski-approved magic words. (For example, if "assez bien", "passable", "insuffisant" and "à rejeter" all meant the same, then the winner in their Orsay MJ study would change.) So what? Apparently, Balinski & Laraki believe that some words, such as "assez bien," have magic meanings, while others, such as "approve" and "3 on a 0-5 scale," do not.
Oh.
I would say that unfortunately humans can and do attach strange time-varying culture-varying baggage meanings to words, albeit I would guess that "3" will probably remain comparatively unambiguous across all cultures at all future times. The question of which verbal or numerical scales work the best in the presence of such stresses (and how badly they are hurt) is an experimental question, which simply is not answerable by abstract mathematical "theories of measurement" such as Narens 2002, and not by Balinski and Laraki by Proclamation either.
For example, {-1, 0, +1} score voting was used for centuries to elect the Venetian doge. It seems to have worked well, and we are unaware of any complaints that their votes all were meaningless and undefined. For another example, the "meaningless" {0,1,2} score voting and approval voting systems elected the president France wanted – Bayrou – in 2007 (as did {0,1,2,3,4,5} score voting using Balinski & Laraki's own vote data), while the MJ system extrapolated from Balinski & Laraki's data to all France (here I am using their own extrapolation) instead elected Sarkozy. How can it be that two meaningless voting systems outperformed their MJ system in their own experiment?!?!
A tremendous amount of experimental evidence and literature exists about different verbal and numerical scales (e.g. Balinski & Laraki's "too late" claim in their 5 is also false), which Balinski & Laraki almost entirely ignored. We shall discuss it, but on a different web page.
Stevensian "measurement theory" is clearly incomplete, wasn't needed, and to the extent we do complete and apply it, shows precisely the opposite of Balinski & Laraki's contentions. Balinski and Laraki's entire attack on all voting methods that use numbers (e.g. average-based range voting) has been completely busted from start to finish.