Aram’s Lair of Mad Science

The Arrow of Time

2022-10-19T00:00:00+00:00

Apologies for taking so long to follow up on the probability post. I was floored by its engagement on Hacker News, with over 200 thoughtful comments! I’ve revised the post to incorporate all your feedback.

Now to unveil the secret behind that post: it was a part of my preparation for a very ambitious research project concerning the arrow of time. This has occupied all of my writing energy, hence the absence from blogging. I’ve since completed that paper and presented it to Prof. Marcus Hutter’s group.

You may download the presentation slides or the full paper, and reach out to me with any questions or comments. I look forward to collaborating on closely related topics! Naturally, this means I’m not done talking about probability and induction. Please stay tuned :)

Why is the Universe Quantum?

2021-04-08T00:00:00+00:00

Quantum mechanics is such a bizarre theory. In the iconic thought experiment, Schrödinger’s cat couldn’t decide whether to be dead or alive until it was looked at. It’s natural to ask ourselves: why is the Universe like this? Must it be so, or could a classical universe equally well support life as we know it?

A beautiful paper by Lucien Hardy demonstrates that quantum mechanics is the logical consequence of five simple axioms. The first four are fairly straightforward and are satisfied by classical theories. The fifth axiom, that of continuous transitions, is the most interesting.

To understand it in more detail, let’s imagine that we’re designing the universe. We could make it discrete, like a cellular automaton. However, we would then miss out on the nicer symmetries of continuous space-time. To give an example, continuous space can grant equal status to all its directions. The line segment connecting any pair of points determines such a direction; this segment represents the unique shortest path between its endpoints; furthermore, it has a unique midpoint which is equidistant from its endpoints. Any two points in time likewise have a midpoint between them. A cellular grid lacks these symmetries.

Let’s, at the very least, commit to a continuous time. It seems natural for space to be continuous too, with all its points considered equally good positions for objects to occupy. Even if restricted to a finite-sized region, such as the interior of a box, an object can occupy any of infinitely many possible states. If our laws of physics allow infinitely precise measurements (say, perhaps, by shining incredibly weak and narrow beams of light to “see” the object’s position), then our box can store an unlimited amount of information. In such a world, it would seem that computers of bounded size can be constructed to have unbounded power: for instance, by making their parts arbitrarily small. Natural selection and life, as we know them, thrive on the challenge of searching for solutions intelligently, using bounded computation. To avoid trivializing life, it’s essential to limit computational resources somehow. One solution is to make it so that finite-sized systems have a finite number of distinguishable states. Let’s define exactly what we mean.

States and Measurement

Consider a collection of states $S_1, S_2, \ldots, S_n$. Suppose we are given an object, known to be in one of these $n$ states, but we don’t know which one. We’ll say this collection of states is reliably distinguishable if there exists a measurement technique that can tell us, with certainty, which state it’s in. This is a strong, clear-cut test. On the other hand, suppose we have a machine which can produce, at the press of a button, a freshly prepared object with a fixed state from the collection, but once again we don’t know which one. We’ll say this collection of states is statistically distinguishable if, by performing experiments on enough objects from the machine, we can infer which state is prepared by the machine. Finally, we’ll say a pair of states is statistically indistinguishable if it’s not statistically distinguishable. We make a few observations regarding our definitions:

First, any reliably distinguishable collection is obviously statistically distinguishable as well. Second, statistical indistinguishability is an equivalence relation. For all practical purposes, any statistically indistinguishable pair might as well be considered to be the same state; by defining states in this way, it follows that any collection of distinct states must be statistically distinguishable. Our final observation concerns states which are statistically, but not reliably, distinguishable. In this case, it follows that some measurements will necessarily alter the state: for if they did not, then any series of measurements that we would perform using the machine could instead be performed on a lone object in the state in question, which would make the states reliably distinguishable after all.

From Bit to Qubit

Let’s return to looking at a closed system in our made-up universe. It evolves with continuous time. We want a finite maximum on the number of states which can be reliably distinguished from one another: for simplicity, let’s assume this maximum is two, though our arguments can be generalized. Suppose A and B are two such mutually distinguishable states, or eigenstates. Is it feasible for A and B to be the only states taken by the system? In order to do anything interesting in continuous time, we should allow A to transition to B after a non-zero length of time. However, if this length of time were deterministic, say fixed to $\Delta t$, then that would imply the existence of additional states. Indeed, let C be the system’s state after evolving A for a time $\Delta t / 2$. Since C is prior to the transition, our A-B measurement would detect it as A; thus, C is distinct from B. On the other hand, C would be detected as B if we perform a “delayed measurement”, in which we wait an additional $\Delta t / 2$ before applying our A-B measurement; thus, C is also distinct from A. We must conclude that C is a third state, different from A and B. In fact, by waiting for different amounts of time, we find an infinite continuum of intermediate states.

In a last desperate attempt to avoid adding extra states, we might allow the transition times to be random, like radioactive decay. However, if we’re allowing genuinely random processes into our theory, we might as well consider the “maybe-A maybe-B” situation to itself be a state: after all, distinct probabilistic mixtures of distinguishable states remain statistically distinguishable from one another. This formalism has the advantage that an outsider, not involved in the experiment, can model the evolution of our system deterministically: from their view, the state A simply evolves into “maybe”-states that contain increasing shares of B. When an experimenter measures the state, from their perspective it can be said that the state “collapses” to either A or B. The outsider, who doesn’t communicate with the experimenter, would then say that the experimenter became “entangled” with the system: together, they are jointly in a state of “maybe A, with the experimenter seeing A; or maybe B, with the experimenter seeing B”. Substituting A and B for the dead and alive states of Schrödinger’s cat, the resemblance becomes clear! The entangled experimenter, having observed the system, is resigned to either the A branch or the B branch. For the unentangled outsider, on the other hand, neither branch has “materialized”¹.

Recall that A and B are a maximal set of reliably distinguishable states for our system. Having accepted that our theory must support additional states that are only statistically distinguishable, we can consider alternative formulations, aside from the probabilistic mixtures that we’ve just discussed. Indeed, the probabilistic theory has some shortcomings. For instance, it’s not clear how to make nice reversible laws that transition from A to B. Upon reaching B, such a law should begin to transition back to A; however, how would a “maybe-A maybe-B” state know whether it’s currently on the “forward swing” from A to B, or on the “return swing” back to A? For reasons such as this, it becomes convenient to use a complex number-valued variant of probability theory, in which, rather than swinging linearly from A to B, the states are arranged on a sphere, transitioning along its geodesics. I’ll defer to Hardy for the rigorous argument. The upshot is that complex numbers have phases and amplitudes, which allow the “random” outcomes to interfere with one another, constructively and destructively, much like vibrating strings. Quantum weirdness ensues.

Waves

We’ve sketched out quantum mechanics for a two-eigenstate system, or qubit. While a classical computer bit has clear-cut 0 and 1 states, we saw that a qubit can take on a variety of “in-between” states, which can be conceptualized on a sphere². A real-life example of a qubit is the spin of an electron. But what about the position of an electron, or of anything for that matter? We still want our continuous spatial geometry, while somehow bounding the number of distinguishable states!

Here, nature has a trick up its sleeve: the position and momentum are Fourier transforms of one another. Since Fourier transforms are also relevant to how we produce and perceive sound, we can illustrate by analogy: if you think of position as being spread out in a wave like a plucked guitar string, then the momentum would be spread out like the frequency spectrum that characterizes the pitch and timbre of its sound. Notice that a string cannot simultaneously have both a precise frequency and a precise position of displacement: a pure tone displaces the entire string, whereas a pure point displacement lacks a frequency. In the same way, physical objects cannot simultaneously have a pure position and a pure momentum³; even attaining a pure position requires infinite energy. This is the Heisenberg uncertainty principle.
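The trade-off between the two kinds of precision can be checked numerically on an ordinary classical signal. The sketch below is only an analogy, not a quantum simulation: it samples Gaussian bumps of two widths, applies a discrete Fourier transform, and compares the spread in position with the spread in frequency. The grid size `N`, the widths, and the helper names `spread` and `spreads` are my own choices for illustration.

```python
import cmath
import math

N = 256  # number of sample points (arbitrary choice)
COORDS = range(-N // 2, N // 2)

def spread(weights):
    """Standard deviation of COORDS under the given non-negative weights."""
    total = sum(weights)
    mean = sum(w * c for w, c in zip(weights, COORDS)) / total
    var = sum(w * (c - mean) ** 2 for w, c in zip(weights, COORDS)) / total
    return math.sqrt(var)

def spreads(sigma):
    """Position and frequency spreads of a Gaussian bump of width sigma."""
    signal = [math.exp(-n * n / (4 * sigma ** 2)) for n in COORDS]
    # Plain O(N^2) discrete Fourier transform, fine for a small N.
    spectrum = [
        sum(s * cmath.exp(-2j * math.pi * k * n / N)
            for s, n in zip(signal, COORDS))
        for k in COORDS
    ]
    return (spread([s * s for s in signal]),
            spread([abs(z) ** 2 for z in spectrum]))

for sigma in (4, 8):
    dx, dk = spreads(sigma)
    print(f"sigma={sigma}: position spread {dx:.2f}, "
          f"frequency spread {dk:.2f}, product {dx * dk:.2f}")
```

Squeezing the bump in position widens it in frequency, and the product of the two spreads stays essentially fixed: neither can be made small without the other growing.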

Unfortunately, I can’t sketch this out more convincingly without diving into the math⁴. Nonetheless, at its core, the theory is the same as in the simpler qubit system. Instead of A and B, the eigenstates now are the component sine waves that combine together to make a wave packet.

Conclusions

There is a great irony in the argument we’ve laid out. Because quantum mechanics appears so mysterious, it has become fashionable to speculate that it may hold the key to more powerful forms of computation, intelligence, and even consciousness. There do exist computational problems for which the fastest known algorithms are quantum, but that’s only if we presuppose a discrete model of computation, corresponding to the regime of our universe in which quantum states decohere. Our discussion suggests, in fact, that the true “purpose” of quantum mechanics may be directly antithetical to these popular interpretations: it serves not to increase our power, but to constrain it! From the perspective of our universe-building exercise, quantum mechanics offers the best of both worlds: the symmetries of a continuous universe, with the informational constraints of a discrete one.

Please note that, unlike the paper on which they’re based, my arguments here are not at all rigorous. Nonetheless, I hope they may provide some intuition into the mysterious nature of quantum mechanics, without demanding as much technical depth. Thanks to Sid Jain for pointing me to the paper!

¹ This distinction is unimportant in classical probability theory, where the branches add up independently. However, in quantum theory, an outsider may yet have the branches interfere.
² I wonder… could this help explain why space has three dimensions? A competing justification is that it takes exactly three dimensions to embed all graphs.
³ In everyday life, when you don’t need too much precision, objects may appear to have definite positions and momenta. Similarly, musical composers may notate definite pitches occurring at definite times. However, if you could analyze just a microsecond from a live recording, you wouldn’t easily decipher the pitches playing at that instant.
⁴ Would anyone like to demonstrate using animations, perhaps?

Is probability real? (Part 1)

2020-11-22T00:00:00+00:00

Today, I want to address an issue with statements involving chance. To demonstrate, let’s first consider a statement that doesn’t involve chance:

“A cubic die tossed onto a flat surface will come to rest on one of its six sides.”

This claim can be empirically tested, with various dice and surfaces. If any one of our experiments results in the die spinning endlessly on a corner, we will have disproven the claim. We may have to refine the claim’s conditions; for instance, by requiring the presence of gravity. Nonetheless, it’s fairly clear what it means for the statement to be true or false. Now let’s try to make a claim involving probability:

“If a pair of standard dice are thrown, the probability of their upward-facing sides summing to nine will be one in nine (about 0.11 or 11%).”

What does it mean for this statement to be true? Unlike the first statement, this one doesn’t specify which result we’ll actually see. How can we possibly hope to test it, or to make use of its information?

The mathematician’s multiverse

Within the realm of abstract mathematics, we’re free to model probability in a way that fits our intuitions. Imagine a multiverse containing an infinity of possible worlds, whose total measure is 100%. Define the probability of an event, such as that of rolling a nine, to be the measure assigned to the subset of worlds in which the event actually occurs.

In the abstract formalism, we’re allowed to assign the measure however we like, subject to Kolmogorov’s axioms: the measure must be non-negative, countably additive, and sum up to 100%. Using the symmetry of an idealized die, we might argue that only one such assignment makes sense¹; from it, we can calculate the probability of any event involving dice rolls.
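For a pair of idealized dice, the symmetric assignment gives each of the 36 ordered outcomes the same measure, and any event's probability follows by counting. A minimal sketch, in which the helper name `probability` is mine:

```python
from fractions import Fraction
from itertools import product

# All 36 equally weighted outcomes for a pair of idealized six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Measure of the outcomes (worlds) in which `event` holds."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

p_nine = probability(lambda o: sum(o) == 9)
print(p_nine, float(p_nine))
```

The four favourable outcomes (3,6), (4,5), (5,4) and (6,3) out of 36 give the one-in-nine figure from the opening claim.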

There are two shortcomings to this approach. Firstly, it doesn’t generalize well to asymmetrical objects and events. Secondly, while a priori deductions are neat, we’d still like some means of testing probabilistic claims using real-life observations. Empirical testing presents a serious challenge: how can we hope to infer the measure on a hypothetical multiverse, when we only ever experience one world? Indeed, a realist might question if it makes any sense to discuss the probability of an event happening: either it happens or it doesn’t!

The economist’s wager

You might be more trusting of someone who puts their money where their mouth is. To back up a definite claim, not involving chance, I can sign a contract, agreeing to pay a hefty fine if it turns out that I’m wrong.

This idea can be extended to probabilistic claims in the following manner: consider a lottery that pays a $90 jackpot if the next roll of a pair of dice yields a nine. If the maximum that I’m willing to pay to play is $10, this indicates that I believe I have a one in nine chance of winning. This approach is appealing because, after all, the raison d’être of probability theory is to explain the decision-making of individuals facing uncertainty.

If another gambler’s view conflicts with mine, we may aggregate our beliefs by creating a market on which we buy and sell predictions. Consider a contract that pays $100 (plus interest) when a specified event occurs. Then, its price on the market can be interpreted as the percentage probability of that event. Thus, to say an event is twice as “likely” as another, simply means its contract’s market price is twice that of the other.

Compared to the lone gambler, a frictionless market offers the advantage of transparent, near-identical buy and sell prices. As a result, any violations of Kolmogorov’s axioms become money-making arbitrage opportunities. Arbitrage activity acts as an enforcer of the axioms, giving rise to what economists call the risk-neutral probability measure.
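To see how arbitrage enforces additivity, consider contracts on a set of mutually exclusive, exhaustive outcomes. Exactly one of them pays out, so the bundle as a whole is worth the payout, and any other total price is free money. The sketch below uses made-up prices and ignores interest and fees; `arbitrage_profit` is a hypothetical helper, not a real trading API.

```python
PAYOUT = 100  # each contract pays $100 if its event occurs

def arbitrage_profit(prices):
    """Guaranteed profit from trading one contract on every outcome.

    `prices` maps mutually exclusive, exhaustive outcomes to contract prices.
    Exactly one contract pays out, so the whole bundle is worth PAYOUT.
    """
    total = sum(prices.values())
    if total < PAYOUT:
        return PAYOUT - total  # buy the whole bundle, collect PAYOUT
    if total > PAYOUT:
        return total - PAYOUT  # sell the whole bundle, pay out PAYOUT
    return 0  # prices are consistent with additivity

# Hypothetical prices on whether a pair of dice sums to nine.
print(arbitrage_profit({"nine": 15, "not nine": 80}))
print(arbitrage_profit({"nine": 11, "not nine": 89}))
```

Traders exploiting the first set of prices would push them toward a total of $100, at which point the prices behave like probabilities.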

In real markets, however, this probability measure exhibits several inconsistencies. Firstly, it depends on which currency is used: as an extreme example, we’d never buy a dollar-denominated contract that only pays out if the dollar collapses, no matter how likely we imagine the collapse to be. Secondly, this measure is sensitive to non-diversifiable risk: if a widely-believed prophecy held that rolling a nine would induce a catastrophic famine, the market would value this outcome a lot more, because everyone would want to buy insurance against such a catastrophe. Thirdly, frictionless markets can be hard to set up in practice. And finally, markets can be misinformed: indeed, a common motivation for participating in a market is to try to beat it!

Interlude: a hybrid approach?

The preceding approaches represent two extremes on a spectrum: the “objective” probability measure over some imagined multiverse, for which no empirical test exists; and the “subjective” gambling probabilities, which can be elicited by imposing suitable stakes. Can we get the best of both worlds? That is, does nature have any empirically meaningful notion of probability, consistent with our mathematical concept’s objectivity (e.g., being independent of market idiosyncrasies)?

Every gambler alive today owes their existence to an intensive optimization process: Darwinian natural selection. As such, it makes sense to ask how an idealized Darwinian agent would act when faced with uncertainty. In general, this is a complex question (see for example, our earlier discussion of risk). To start easy, let’s imagine that an agent, or a species, is faced with repeated instances of some scenario involving uncertainty. If the environment is sufficiently competitive, a necessary condition for survival is to attain the best cumulative outcome over a large number of trials. Thus, while a coin will always land either heads or tails, it’s considered unwise to wager your life on either outcome. Intuitively speaking, the rationale is that you’re almost certain to lose eventually, if you keep playing this way. This idea of repeated trials inspires our next interpretation, which is currently the most popular among scientists.

The statistician’s frequentism

According to the frequentist school of thought, a probabilistic statement is not to be taken literally as regarding one isolated event, but rather, as shorthand for a claim involving a very large collection of similar events. Thus, they bear a resemblance to universally quantified claims such as:

“All swans are white.”

The statement has an empirical meaning: we consider it to be true if it’s impossible to find a counterexample, e.g., a black swan. Inferring it in practice is a bit trickier, of course: while a single black swan suffices to falsify the statement, we can never really prove it, short of checking every swan. If we see lots of swans in lots of different locations, and find that they’re all white, then we may be inclined to believe the claim. Nonetheless, our observations cannot distinguish it from such alternatives as:

“All swans that we saw by [today’s date] are white, while all other swans are black.”

If we’re indifferent between the $2^S$ assignments of black or white to a global population of $S$ swans, then both hypotheses are equally good. Prior knowledge may lead us to prefer the first hypothesis, but this simply regresses the problem, since prior knowledge should itself be learned from data.

To avoid the baggage of common knowledge about swans, let’s instead consider the sequence $2, 3, 5, 7, 11,\ldots,$ whose elements are observed one at a time. Three hypotheses consistent with these starting terms are:

“The sequence of prime integers in ascending order.”

“The sequence that starts at 2, adding 1 on the next term, adding 2 on each of the next 2 terms, and so on, doubling both the increment and the number of terms each time.”

“The sequence whose n’th term is $\frac{1}{8}n^4 - \frac{17}{12}n^3 + \frac{47}{8}n^2 - \frac{103}{12}n + 6$.”

Depending on whether the next term is 13, 15, or 22, we’ll be able to eliminate two of the three hypotheses; nonetheless, no matter how many terms we see, there will always remain infinitely many possible extensions to the sequence. The scientific method demands that we collect fresh test data after selecting our hypothesis. So for example, if we choose the prime numbers hypothesis, and then observe the subsequent terms $13,17,19,23,29$, we’ll be justified in believing it.
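The three hypotheses are easy to write down as programs, which makes it simple to confirm that they agree on the first five terms and part ways on the sixth. The function names below are my own labels for the three descriptions.

```python
from fractions import Fraction

def primes(count):
    """First `count` primes by trial division."""
    found = []
    n = 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

def doubling_increments(count):
    """Start at 2; add 1 once, 2 twice, 4 four times, and so on."""
    terms, step = [2], 1
    while len(terms) < count:
        for _ in range(step):
            terms.append(terms[-1] + step)
        step *= 2
    return terms[:count]

def quartic(count):
    """Values of the degree-4 polynomial at n = 1, 2, ..., count."""
    coeffs = [Fraction(1, 8), Fraction(-17, 12), Fraction(47, 8),
              Fraction(-103, 12), Fraction(6)]
    return [int(sum(c * n ** (4 - i) for i, c in enumerate(coeffs)))
            for n in range(1, count + 1)]

for hypothesis in (primes, doubling_increments, quartic):
    print(hypothesis.__name__, hypothesis(6))
```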

The scientific method only works if we have a means of identifying good hypotheses. If we choose arbitrarily from a large collection of strange hypotheses (e.g., high-degree polynomials) that fit the first 5 terms, we’ll be much more likely to match these terms by chance, than to find an answer that successfully predicts future terms. In general, inductive learning is impossible without some sort of prior bias; this fact is called the No Free Lunch Theorem.
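The polynomial family shows how cheap such fits are. Using Lagrange interpolation (a standard construction that I am bringing in here as an illustration), we can build a polynomial that matches the five observed terms and then "predicts" whatever sixth term we like. The helper name `interpolate` is hypothetical.

```python
from fractions import Fraction

def interpolate(points):
    """Return the lowest-degree polynomial through `points`, as a function."""
    def poly(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total
    return poly

observed = list(enumerate([2, 3, 5, 7, 11], start=1))

# Whatever sixth term we wish for, some polynomial "predicts" it.
for wished in (13, 15, 22, -1000):
    fit = interpolate(observed + [(6, wished)])
    print(wished, [int(fit(n)) for n in range(1, 7)])
```

Every one of these polynomials fits the data seen so far perfectly, so agreement with past observations alone cannot single out a hypothesis.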

Now that we’ve seen how universally quantified claims run into empirical problems, let’s complicate matters further by reintroducing probabilities. Returning to the probabilistic claim that began this article, its frequentist interpretation is as follows:

“If a pair of standard dice are thrown repeatedly, then in the limit as the number of throws goes to infinity, the proportion of nines converges to one in nine (about 0.11 or 11%).”

The one-roll probability is replaced by a long-run proportion. Given an infinite sequence of rolls, this statement unambiguously reveals itself to be either true or false. In light of the frequentist interpretation, we can even make more sense of our earlier interpretations. While we only experience one world, repeating an experiment under similar conditions is like observing the experiment in a parallel universe: whether we count trials or worlds, the math is virtually identical. Likewise, in the limit of infinitely many bets, we can make some unambiguous conclusions about the long-term success of a gambler’s strategy: this is how casinos ensure that the house always wins!
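A quick simulation gives a feel for the long-run proportion. This is only a sketch: a seeded pseudorandom generator stands in for physical dice, and the function name and throw counts are arbitrary choices of mine.

```python
import random

def proportion_of_nines(throws, seed=0):
    """Simulate `throws` rolls of two dice; return the fraction summing to nine."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    nines = sum(
        1 for _ in range(throws)
        if rng.randint(1, 6) + rng.randint(1, 6) == 9
    )
    return nines / throws

for throws in (100, 10_000, 200_000):
    print(throws, proportion_of_nines(throws))
```

The printed proportions should drift toward 1/9 as the number of throws grows, though no finite run can confirm the limit itself.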

Testing our claim is a simple matter: we roll the dice, over and over, and over and over… infinity times. Oops. Of course, there is no such thing as an experiment with infinity trials. Our arms will get tired, the dice will wear out, the Sun will explode, and all the free energy in the universe will be consumed. At best, we can do a very large number of trials. Let’s say we roll the dice 9,000 times; one in nine of these would amount to 1,000. Perhaps we won’t roll exactly 1,000 nines, so let’s interpret our claim with a suitable margin of error, essentially an inverted confidence interval:

“If a pair of standard dice are thrown 9,000 times, then the upward-facing sides will sum to nine between 920 and 1,080 times.”

Skipping some calculations, it turns out the probability of obtaining between 920 and 1,080 nines is 99.3%. Thus, we’ve turned our probabilistic statement into a much more certain (but still probabilistic!) prediction. We would hope that observing 1,100 nines would falsify our claim, but things are no longer as clear-cut as our black swan example. Indeed, if every household on Earth were to independently perform this 9,000-throw experiment, probability theory predicts that a great many (about 0.7%) of their results would falsify our claim, even if it’s true!
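For readers who want to redo the skipped calculation, here is one way to get the figure exactly, assuming independent throws that each yield a nine with probability 1/9. The number of nines is then binomially distributed, and Python's big integers let us sum the terms without rounding until the final division. The function name is mine.

```python
from math import comb

N_THROWS = 9000

def prob_nines_between(lo, hi):
    """Exact P(lo <= number of nines <= hi) for 9,000 throws with p = 1/9."""
    # Treat each throw as 9 equally likely slots, one of which is a nine.
    # A count of k nines then arises in comb(N, k) * 8^(N - k) ways
    # out of 9^N equally likely slot sequences.
    favourable = sum(
        comb(N_THROWS, k) * 8 ** (N_THROWS - k) for k in range(lo, hi + 1)
    )
    return favourable / 9 ** N_THROWS

print(f"{prob_nines_between(920, 1080):.4f}")
```

The result should land close to the 99.3% quoted above.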

There’s no getting around it: despite its intuitive appeal, the frequentist definition of probability is circular, reducing probability claims to probability claims. To close the cycle, the frequentist chooses a threshold (say, 99%) beyond which to treat likely predictions as definitive. If we count a number of nines that’s outside the interval $[920,1080]$, we simply declare our probabilistic claim to be false. In practice, it seems fine to ignore sufficiently small chances of error: if 1% is not good enough, make it 0.0001%! Confidence can be increased by gathering more data, i.e., increasing the sample size.

This approach turns out to be very powerful. Even some phenomena which are not precisely repeatable, due to a dependence on uncontrollable parameters, can be statistically analyzed: we design more sophisticated hypotheses, or models, that generalize over those parameters. For example, weather forecasts are based on well-tested models that take measured input parameters such as temperature, pressure, humidity, and wind. The prediction probabilities coming out of such a model will vary as a function of its inputs.

Let’s consider domains even more irregular than weather. Statistical models of sports games, democratic elections, or stock markets are hard to test: the interactions are very complex and there are too few outcomes from which to extrapolate. Similarly, in your own life, when you try to predict which colleges will admit you, which of your friends will start a business, whether you’re being lied to, or whether extraterrestrial life exists, you don’t base your conclusions on repeated trials of the same situation. Is it a stretch to view these as statistical inference problems? Perhaps, but there’s no denying their importance and prevalence in ordinary decision-making. Let’s push the frequentist methodology to its limits here, by proposing an even more general kind of hypothesis: it consists of a model of the world, perhaps encoded in some part of your brain, that makes probabilistic predictions in arbitrary scenarios. Presumably, this model won’t be perfect. Since it can make a lifetime supply of predictions, one can design all sorts of tests to falsify it. But then, having falsified one model, how do we find a better candidate?

The last example is a bit extreme, going far beyond the capabilities of modern AI technology, let alone routine statistical analysis. Nonetheless, it exemplifies both the strengths and the weaknesses of the frequentist methodology. On one hand, we see how the methodology extends beyond the basic setting of independent, identically distributed trials. On the other hand, its use depends upon prior knowledge²: before collecting observations, we must commit to a limited number of hypotheses and testable predictions. If we make too many predictions, it may happen that while each is 99% likely, their intersection is less than 1% likely. This stands in contrast with the universally quantified setting, in which we were free to test any number of predictions.
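To put a number on this, assume purely for illustration that the predictions are independent. Then $n$ predictions, each 99% likely, all hold with probability $0.99^n$, and $0.99^n < 0.01$ exactly when $n > \ln(0.01) / \ln(0.99) \approx 458.2$. So from 459 such predictions onward, seeing every one of them succeed is itself a less-than-1% event, even if the hypothesis is true.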

Since our testing “budget” is limited, we should design our tests to fail in the event that some plausible alternative hypothesis is true. In the brain example, our model certainly has many flaws, and even more potential fixes, but we must be judicious in choosing which fixes to consider. In the simpler dice example, you might have wondered why we chose the expectation-centered interval $[920,1080]$ as our prediction, rather than some other 99%-probability region. It’s because we expected that, if our hypothesis were wrong, the most realistic alternative would be that our rolls are still independent and identically distributed, but with a modified frequency of nines. If we’re certain that the frequency isn’t lower, but suspect it may be higher, then the one-sided 99%-probability interval $[0,1074]$ is more likely to detect that: by specializing our alternative hypothesis, we increased the test’s power.
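The gain in power can be made concrete by picking an alternative and computing how often each interval would reject it. The sketch below assumes a rigged frequency of 0.12, a number I chose only for illustration, and evaluates binomial probabilities through log-gamma to stay within floating-point range. The helper names are hypothetical.

```python
from math import exp, lgamma, log

N_THROWS = 9000

def pmf(k, p):
    """Binomial probability of exactly k nines in N_THROWS throws."""
    log_choose = (lgamma(N_THROWS + 1) - lgamma(k + 1)
                  - lgamma(N_THROWS - k + 1))
    return exp(log_choose + k * log(p) + (N_THROWS - k) * log(1 - p))

def prob_inside(lo, hi, p):
    """P(lo <= nines <= hi) when each throw is a nine with probability p."""
    return sum(pmf(k, p) for k in range(lo, hi + 1))

P_ALT = 0.12  # hypothetical rigged frequency, chosen only for illustration

intervals = {"two-sided": (920, 1080), "one-sided": (0, 1074)}
for name, (lo, hi) in intervals.items():
    false_alarm = 1 - prob_inside(lo, hi, 1 / 9)
    power = 1 - prob_inside(lo, hi, P_ALT)
    print(f"{name}: false alarm {false_alarm:.4f}, power {power:.4f}")
```

Both intervals reject a true hypothesis less than 1% of the time, but the one-sided interval catches this particular alternative more often, because it spends none of its error budget on the lower tail.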

Since the chances of obtaining exactly 1,000 nines are about 1%, another perfectly valid prediction region would be the set of all integers except 1,000! Does such a test ever make practical sense? Yes, if our suspicion (i.e., alternative hypothesis) is that the dice, rather than being random, are rigged to produce exactly 1,000 nines. Since every fixed number of nines can be tested against in this way, every conceivable result will fail some test. The point is that the frequentist methodology derives its power by making a prior commitment to as few hypotheses as possible; as such, it cannot generate, nor universally compare, hypotheses.

To summarize, we now have a circular definition of probabilistic statements in terms of empirical predictions, which themselves have a probability of failure. To keep misprediction probabilities small, human judgment is invoked to ensure a curated selection of hypotheses and predictions. A priori, it’s unclear whether any selection can be considered more valid than any other.

The philosopher’s razor

Suppose the universe were arbitrarily messy, complicated and irregular, its randomness devoid of any patterns; then, it would seem as if all events were decided by divine intervention. In such a world, there would be no role for science. The ancients believed in a magical world where everything, from weather to animal morphology, was subject to the daily whims of the Gods. Nevertheless, even the ancients believed in some basic patterns, which they could use to cook, hunt, navigate, build shelter, and otherwise live their lives. Without patterns to exploit, there would be no advantage for intelligent life, no reason for it to emerge. That there’s a simple order underlying our universe is one of its most remarkable characteristics. Let’s take advantage of it.

If we don’t know which hypothesis to test, we might begin by considering every hypothesis that comes to mind: potentially thousands, millions, or infinitely many. In the dice experiment, we might consider some strange hypotheses, such as ones where the chances of rolling doubles depend on which celebrity’s credit card number was spelled out by the most recent rolls. Since no finite test is perfect, we would expect some fraction of these hypotheses to spuriously pass.

What makes the “true” hypothesis stand out from the many fakes? Well, the fakes are unlikely to stand up to additional testing. The more data we collect, the more contrived the hypotheses that we’ll have to resort to; nonetheless, it will always remain possible to fit an incorrect hypothesis to all of the data seen so far. This is such a serious problem in science that it has a name: overfitting.

Somehow, we must narrow down our hypotheses. Maybe you think that’s easy: only a few hypotheses describe plausible dice behavior; the rest are patently absurd! But now you’re relying on intuitive judgment, not a rigorous methodology. If you try to sort out the source of your intuitive knowledge about how dice ought to behave, you’ll find it to be rooted in your prior knowledge about how the world works, which itself must be tested against various hypotheses. If you have good prior knowledge and take it on faith, then this works fine in practice. However, it seriously begs the question: how do we manage to obtain any knowledge about the world in which we live?

There is a solution to this dilemma. For most of history, nobody knew how to state it in precise mathematical terms; hence, it was confined to the realm of philosophy. The solution to overfitting is the law of parsimony, Occam’s razor:

“Given competing theories that can explain our observations, always prefer the simplest.”

If we take “simple” to mean that it must be described by a short English sentence, then there are only a limited number of such sentences. By restricting ourselves to this limited number, it becomes possible to eliminate all the bad hypotheses with a finite amount of data.

Thus, we see that Occam’s razor provides the prior bias that traditional frequentist theories lacked. In order to complete this idea, we must make it precise, and justify its universal applicability to hypothesis selection. The advent of computer science gave us a theoretical framework in which to do so. With it comes not only a cleaner interpretation of probability, but also a general theory of inference with Occam’s razor at its front and center.

The computer scientist’s electric razor

English sentences can be a bit ambiguous so, for precision’s sake, we’ll express our hypotheses as computer programs, and encode our observations as computer data. If you’re not a programmer, rest assured that it’s mostly kosher to replace the programs in our discussion with instructions written in your mother tongue.³ In the computer science framework, we restate Occam’s razor as follows:

“Given competing programs whose outputs match our observations, always prefer the shortest.”

Let’s see just how powerful this definition is. Right away, we see there’s no longer a need to carefully select hypotheses or tests, as both are built-in: all computer programs are hypotheses, with preference given to shorter ones. Testing a program amounts to verifying that its output exactly matches our observation record.

At first blush, the requirement to use deterministic programs appears to be a limitation. Luckily, randomized programs can be made deterministic by supplying the results of “coin flips” as an extra string of ones and zeros. This string makes the program longer, so Occam’s razor will prefer explanations that don’t depend on too much randomness, whenever such explanations exist.

Given a string $x$, perhaps representing a very long sequence of observations, the length of the shortest program that outputs $x$ is called its Kolmogorov complexity $K(x)$. By prioritizing programs by their length, we ensure that each incumbent theory only competes against finitely many rival hypotheses. While one might imagine prioritizing programs by other criteria, it turns out that any computable criterion can be turned into a description language for $x$, and is therefore dominated by the Kolmogorov complexity. See the footnotes for an excellent technical reference,⁴ as well as a more accessible overview.⁵ We won’t go deeply into the theory here, but merely highlight how it helps us interpret and infer probabilistic statements.

Classical information theory studies the optimal rate at which random objects can be compressed. If the objects are drawn from a known probability distribution $\mathcal P$, then on average, the number of bits needed to compress one object is equal to a quantity known as the entropy $H(\mathcal P)$. No compression algorithm can beat this on average. In general, it’s unclear whether we should care about the average, as opposed to the median, mode, maximum, or some other statistic.

But now, suppose we independently sample a very large number $N$ of objects from $\mathcal P$. The Law of Large Numbers tends to make totals proportional to averages: almost certainly, the total encoding length of the entire sequence of objects will be very close to $N\cdot H(\mathcal P)$. The sequence’s Kolmogorov complexity will not be much greater: a suitable program consists of a description of $\mathcal P$, along with the classical encoding (optimized for $\mathcal P$) of each object in the sequence, for a total complexity of approximately $K(\mathcal P) + N\cdot H(\mathcal P)$.
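To make the $N\cdot H(\mathcal P)$ claim concrete, here is a minimal sketch. It assumes, purely for illustration, that $\mathcal P$ is the sum of two fair dice with faces 0 to 5, and it uses an idealized code that spends $-\log_2 \mathcal P(x)$ bits on each object $x$. The function names are my own, not from any library.

```python
import math
import random
from collections import Counter

# Distribution P: the sum of two fair dice with faces 0..5 (an illustrative choice).
counts = Counter(a + b for a in range(6) for b in range(6))
P = {total: c / 36 for total, c in counts.items()}

def entropy_bits(dist):
    """Shannon entropy H(P) in bits."""
    return -sum(p * math.log2(p) for p in dist.values())

def ideal_code_length_bits(samples, dist):
    """Total length of an idealized code that spends -log2 P(x) bits on each x."""
    return sum(-math.log2(dist[x]) for x in samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000
samples = rng.choices(list(P), weights=list(P.values()), k=N)

H = entropy_bits(P)
total = ideal_code_length_bits(samples, P)
ratio = total / (N * H)
print(f"H(P) = {H:.4f} bits, total / (N * H) = {ratio:.4f}")
```

For large $N$ the printed ratio should sit very close to 1, which is the Law of Large Numbers at work on code lengths.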

If there’s no significantly shorter program that generates the same sequence, then the above program meets our definition of a good theory to explain the sequence: it is approximately the simplest. That is, we can now look at an individual string $x$, and determine whether or not it looks like a sequence of random draws from $\mathcal P$. For example, consider the following sequence:

$3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9,5,0,2,8,8,4,1,9,7$

This will not pass as a random sequence of rolls from a pair of fair dice (with sides numbered 0 to 5, to ensure all digits are representable). Why? To interpret it as such, we must include an encoding for each digit. This might be slightly shorter than writing the sequence literally, but it’s much longer than the phrase: “first forty digits of pi”.

If it were possible to algorithmically compute the shortest program that outputs any given $x$, we would have a ridiculously powerful inference engine. For instance, we could feed it a bunch of data from physics experiments, and out comes a fully-formed scientific theory, better than any we know today. Naturally, such a thing is too good to be true. For fundamental reasons related to the theory of proofs and computation, the Kolmogorov complexity is not computable.⁶ Thus, we can only try our best to discover the most parsimonious explanations we can, never knowing how close we are to the best possible. Occam’s razor can distinguish good and bad hypotheses and, unlike pure frequentism, resists abuse by a barrage of ridiculous hypotheses. Nonetheless, it takes some ingenuity, or luck, to discover a good hypothesis.
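Although $K(x)$ itself is out of reach, any compressor gives a computable upper bound on it, up to the constant cost of describing the decompressor. The sketch below uses Python's standard `zlib` module as a stand-in for "the best explanation we have found so far"; the two test strings are my own illustrative choices.

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of a zlib encoding of `data`.

    This is only an upper bound on the shortest description: a cleverer
    program might describe the same data far more briefly.
    """
    return len(zlib.compress(data, 9))

patterned = b"0123456789" * 1000     # highly regular, 10,000 bytes
random_ish = os.urandom(10_000)      # operating-system randomness, 10,000 bytes

print(compressed_length(patterned), compressed_length(random_ish))
```

The regular string shrinks dramatically, while the random bytes barely shrink at all. A general-purpose compressor would also fail to shrink the digits of pi, even though a short program generates them, which illustrates why we never know how close we are to the best possible explanation.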

In a sense, that’s exactly what the pursuit of scientific theories is about. The ancients would be astounded to learn that so much of the world (perhaps all of it!), with its vast richness, can be described by a few simple laws. Over the centuries, we’ve found more and more patterns, making our theories ever more parsimonious. The scientific method only works because the rules of the universe happen to be simple, while the set of observations it offers is vast. Kolmogorov complexity captures this defining characteristic of our reality.

Next time…

In the sequel, we’ll see that in pretty much any world where inference is possible, the Kolmogorov complexity approach applies. Thus, we’ll come to understand the limits of knowledge. Analogous issues will crop up in the Kolmogorov complexity methodology, via an ambiguity in the definition that we’ve overlooked until now: namely, the choice of computer programming language. Nonetheless, we’ll find that it’s possible to mostly mitigate the issues we found with frequentism. Finally, we’ll see what this means in practice, for probabilistic claims and their inference.

A more convincing argument might also use the roll’s chaos and ergodicity. However, this requires an initial source of randomness, putting us back where we started. ↩
Ironically, supporters of frequentism have been known to apply precisely this critique to the rival Bayesian school of thought, denouncing it for relying upon unjustified prior knowledge. In the sequel to this article, we’ll see how the universal prior completes both the frequentist and the Bayesian interpretations. ↩
That said, we can start to appreciate why a basic education in computational thinking is fundamental to understanding nature, just as math and science courses are. ↩
Ming Li and Paul Vitanyi, 2019. An Introduction to Kolmogorov Complexity and Its Applications (4th. ed.). Springer Publishing Company, Incorporated. ↩
Samuel Rathmanner and Marcus Hutter, 2011. A philosophical treatise of universal induction. Entropy, 13(6), 1076-1136. ↩
The uncomputability of $K(x)$ has to do with the fact that it’s hard to distinguish a program that takes absurdly long to run, from one that never finishes. Since an absurdly slow program is fairly useless, we might decide to include resource bounds in our complexity measure (see Chapter 7 from Li & Vitanyi). For instance, while the Standard Model of particle physics might in principle describe all of life’s processes, a supercomputer would struggle to simulate even one atom this way. To make useful inferences, we also need the theories of chemistry, biology, and the social sciences. Unlike $K(x)$, resource-bounded measures can be computed by trying every possible program until the resource bounds are exhausted. Of course, this still isn’t very practical, as the number of programs to try would be astronomical. ↩

How fickle should a forecaster be?

2020-11-01T00:00:00+00:00

Suppose one day, I say to you that there’s a 20% chance of rain next weekend. Or that there’s a 20% chance of Donald Trump being elected for another term as President of the United States. Upon reading about some scandal the next day, I revise my prediction to 85%. By nighttime the matter settles, so I announce my updated prediction of 5%. In the end, it doesn’t rain (or Joe Biden defeats the incumbent).

After following a bunch of my forecasts, you might criticize me for being too fickle, my predictions swinging wildly as if I can’t make up my mind. Or you might think the opposite: that I play it too safe, only shifting my opinion to one side when the evidence becomes overwhelming.

What’s the right amount of variation? Is it a thing that can be measured?

Yes it is! In this blog post, I propose the sum of squared changes as such a measure.

First, a formal derivation. If you’re not into the technicalities of probability theory, the next section can be skipped. We’ll understand its implications afterward.

A formal interlude

Let $X_t$ be a bounded martingale adapted to the filtration $\mathcal F_t$,¹ and let $0 = T_0 \le T_1 \le \ldots \le T_N$ be stopping times. Then,

\[\begin{aligned} \mathbb E\Big( \sum_{n=1}^N (X_{T_n} - X_{T_{n-1}})^2 \mid \mathcal F_0\Big) &= \mathbb E\Big(X_{T_N}^2 - X_{0}^2 + 2 \sum_{n=1}^N X_{T_{n-1}} (X_{T_{n-1}} - X_{T_n}) \mid \mathcal F_0\Big) \\&= \mathbb E(X_{T_N}^2 \mid \mathcal F_0) - X_{0}^2 + 2 \sum_{n=1}^N \mathbb E\Big(X_{T_{n-1}} (X_{T_{n-1}} - X_{T_n}) \mid \mathcal F_0\Big) \\&= \mathbb E(X_{T_N}^2 \mid \mathcal F_0) - X_{0}^2 + 2 \sum_{n=1}^N \mathbb E\Big(X_{T_{n-1}} \underbrace{\mathbb E(X_{T_{n-1}} - X_{T_n} \mid \mathcal F_{T_{n-1}})}_{=0} \mid \mathcal F_0\Big) \\&= \mathbb E(X_{T_N}^2 \mid \mathcal F_0) - X_{0}^2 \end{aligned}\]

Let $X_t$ be the $\mathcal F_t$-conditioned probability of some event whose outcome is determined by time $T_N$. This is a martingale for which $X_{T_N}$ is $1$ with probability $X_0$, and $0$ otherwise. Hence, $\mathbb E(X_{T_N}^2 \mid \mathcal F_0) = X_0$.

And we’re back!

So what have we shown? In essence, the martingale $X_t$ represents the prediction at time $t$. The role of the stopping times is to grant the forecaster some flexibility: they don’t have to announce predictions at every time $t$, nor even at pre-specified intervals. All that matters is that the decision on whether or not to announce a prediction at time $t$ be made by time $t$; deciding to withhold a prediction based on information from the future would be cheating! As long as this rule is satisfied, we can just add up the squared changes in the forecaster’s predictions to get a measure of their fickleness. If the forecaster reports genuine conditional probabilities, the first of which was $X_0$, then on average this sum should equal $X_0 - X_0^2$.
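If you'd like to see the identity hold without trusting the algebra, here is a small exact check on a toy forecaster. The toy martingale is an assumption of this sketch: for a few rounds the forecast moves up or down by $d = \min(x, 1-x)/2$ with equal probability, and then the event resolves to 1 with probability equal to the current forecast. Exact fractions avoid any rounding.

```python
from fractions import Fraction

def expected_squared_changes(x: Fraction, steps: int) -> Fraction:
    """Exact expected sum of squared forecast changes, starting from forecast x.

    Toy martingale (an assumption of this sketch): for `steps` rounds the
    forecast moves up or down by d = min(x, 1 - x) / 2 with equal probability;
    then the event resolves, jumping to 1 with probability x and to 0 otherwise.
    """
    if steps == 0:
        # Final resolution: a change of (1 - x) w.p. x, a change of x w.p. (1 - x).
        return x * (1 - x) ** 2 + (1 - x) * x ** 2
    d = min(x, 1 - x) / 2
    up = d ** 2 + expected_squared_changes(x + d, steps - 1)
    down = d ** 2 + expected_squared_changes(x - d, steps - 1)
    return (up + down) / 2

x0 = Fraction(1, 5)
result = expected_squared_changes(x0, steps=8)
print(result, x0 - x0 ** 2)
```

Both printed values should agree exactly, no matter how many intermediate rounds the forecaster takes.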

Using the example we started with, $X_0 = 0.2$, so $X_0 - X_0^2 = 0.16$. Our prediction series was $(0.2,\, 0.85,\, 0.05,\, 0)$, so the sum of squared changes is

\[(0.2 - 0.85)^2 + (0.85 - 0.05)^2 + (0.05 - 0)^2 = 1.065\]

This is a lot more than the expected $0.16$. Is the gap statistically significant enough to support your critique that my predictions are too fickle? The ratio between a positive random variable and its expectation is known by statisticians as an e-value. In this case, it is $1.065 / 0.16 \approx 6.66$. E-values between $4$ and $10$ are considered substantial evidence, while e-values above $10$ are strong evidence. As an exercise, you can calculate the e-value for a prediction series from a well-known forecaster, such as FiveThirtyEight!
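For the exercise, the arithmetic is easy to automate. This is a minimal sketch with a helper name of my own choosing; the input is simply the list of announced probabilities, ending with the outcome (1 or 0).

```python
def sum_of_squared_changes(forecasts):
    """Add up the squared differences between consecutive forecasts."""
    return sum((b - a) ** 2 for a, b in zip(forecasts, forecasts[1:]))

forecasts = [0.2, 0.85, 0.05, 0.0]           # the series from the opening example
observed = sum_of_squared_changes(forecasts)
expected = forecasts[0] - forecasts[0] ** 2  # X_0 - X_0^2
e_value = observed / expected
print(round(observed, 3), round(expected, 3), round(e_value, 2))
```

Swap in any other forecaster's series to compute its e-value the same way.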

Trading on volatility

Aside from professional forecasters such as FiveThirtyEight or your local weather channel, another source of prediction series is public markets, such as PredictIt. There, you can purchase any number of contracts which pay out $1 each, if the corresponding event comes to pass. If “Donald Trump wins the presidency” contracts sell for 40 cents apiece, one may say that the market of buyers and sellers has collectively estimated his chances at 40%.² Assuming no interest, transaction fees, or other frictions in the market, “Donald Trump doesn’t win the presidency” contracts must then sell for 60 cents apiece, ensuring that the two opposing contracts precisely cancel out.

Now, let’s say you don’t know which event will come to pass, but you’re confident that the prediction market is more volatile than a true martingale, meaning that its sum of squared changes will exceed the expected $X_0 - X_0^2$. Is it possible to bet on this outcome, making a profit if it comes to pass?

Again, the answer is yes. The general strategy is to ensure that, at every point in time, we hold contracts on the side that’s deemed less likely to win. The number of contracts we hold should be in proportion to the difference between the two sides. For example, if we buy one contract for 40 cents, then we should sell it when its price rises to 50 cents. If it then shoots up to 80 cents, we should buy three of the opposite contract, for 20 cents apiece, and so on. Can you prove that this method works?
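Before attempting the proof, it may help to check the strategy numerically. The sketch below is a numerical check, not a proof. It assumes the frictionless market described above and reads the example as holding $5(1 - 2p)$ "yes" contracts whenever the "yes" price is $p$; a negative holding stands for holding "no" contracts. The scale factor 5 and the function names are my own choices to match the example's numbers.

```python
def volatility_bet_profit(prices, scale=5.0):
    """Profit from holding scale * (1 - 2p) "yes" contracts at each "yes" price p.

    A negative holding stands for holding "no" contracts, whose price is
    1 - p in the frictionless market assumed here.
    """
    profit = 0.0
    for p_prev, p_next in zip(prices, prices[1:]):
        holding = scale * (1 - 2 * p_prev)
        profit += holding * (p_next - p_prev)
    return profit

def sum_of_squared_changes(prices):
    return sum((b - a) ** 2 for a, b in zip(prices, prices[1:]))

results = {}
for outcome in (1.0, 0.0):
    path = [0.4, 0.5, 0.8, outcome]   # prices from the example, then the resolution
    p0 = path[0]
    excess = sum_of_squared_changes(path) - (p0 - p0 ** 2)
    results[outcome] = (volatility_bet_profit(path), 5.0 * excess)
    print(outcome, results[outcome])
```

On both paths, the profit matches the scale factor times the amount by which the sum of squared changes exceeds $X_0 - X_0^2$, so the bet pays off exactly when the market turns out to be more volatile than expected.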

In fact, every e-value can be interpreted as a betting strategy.³ The intuitive idea is that if you do not believe my probabilistic forecasts, you can propose a bet that I should be willing to accept if I believe my own forecasts. If you end up much richer, that serves as strong evidence against my forecasts.

Filtration is a very technical term, but you can think of $\mathcal F_t$ as all of the information known at time $t$. ↩
More precisely, 40% would be what economists call the risk-neutral measure of such an event. ↩
Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, and Glenn Shafer, 2023. Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38(4), pp.576-601. The connection to e-values was added to this blog post in 2024. ↩

How to cut your cake and eat it too

2017-09-20T00:00:00+00:00

A classic challenge goes as follows: in the absence of precise measurements, how would you divide a cake between two diners so that both agree they got a fair share? The cake, naturally, is a metaphor: it can stand for any bundle of goods that should be divided between parties.

In the case of two diners, the classic solution is I cut, you choose. If I let you inspect and choose between the two pieces, I can’t easily cheat you: if the pieces are uneven, you’ll go for the best one. Knowing this, I have an incentive to make the most equal cut that I can manage. We’ll each end up with about half the cake: I’ll lose no more than my own measurement error when cutting; likewise, you’ll lose no more than your own measurement error when inspecting. Seems fair.

Now, in real life, who would you rather be: the cutter or the chooser? I think the chooser is better off. The cutter has to do all the work of being precise, and in the end will probably end up with a slightly smaller piece. The chooser, on the other hand, can always choose the best piece. If that’s too difficult, they can just choose randomly, and still be guaranteed half the cake on average.

We’ve been deliberately vague about the nature of the problem: rather than stating it in mathematical terms, we used our intuition regarding cakes and knives. This got me thinking: is there any setting where it’s better to be the cutter instead of the chooser?

First attempt

The first idea that comes to mind is deception: maybe I can cut the cake so that the smaller piece looks bigger. However, if the cut looks fishy, a distrustful chooser has the easy defense of choosing randomly, say by tossing a coin. If the cake is homogeneous, this always works.

Second attempt

In our quest to make the cutter win, let’s look at a fancier cake: this one has 10 berries and 10 cherries on top. I love berries but don’t care much about cherries. Meanwhile, you adore cherries but couldn’t care less about berries. Knowing your tastes, how about I cut the cake so that one piece has all 10 berries but only 4 cherries? Then you’ll prefer the other piece, with 6 cherries. While that’s an ok deal for you, I’m the clear winner since I keep everything that’s valuable to me.

If you’re rational according to the classical definition used by economists and game theorists, it appears you have no recourse. However, an “irrational” chooser can punish me by taking the 10-berries piece. This spiteful strategy carries a cost for you, but it’s less than the damage done to me. You still get 4 cherries (instead of 6), while I’m effectively empty-handed.
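The payoffs in this example are simple enough to tabulate. In this sketch I assume the simplest possible utilities: the cutter scores one point per berry, and the chooser scores one point per cherry.

```python
# Each piece is (berries, cherries). The cutter values berries; the chooser values cherries.
piece_a = (10, 4)
piece_b = (0, 6)

def payoffs(chosen, left_over):
    """Return (cutter_utility, chooser_utility) when the chooser takes `chosen`."""
    cutter = left_over[0]    # berries in the piece left for the cutter
    chooser = chosen[1]      # cherries in the piece the chooser took
    return cutter, chooser

rational = payoffs(chosen=piece_b, left_over=piece_a)
spiteful = payoffs(chosen=piece_a, left_over=piece_b)
print("rational chooser:", rational)
print("spiteful chooser:", spiteful)
```

Spite costs the chooser 2 cherries but costs the cutter all 10 berries.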

Spite seems like an expensive strategy. However, the implied threat of punishment might deter me from cutting greedily, so that you’ll rarely pay the price in practice. If I believe that you would take revenge if wronged, then I’ll take care to cut in a fair manner.

Wait, really?

Classical game theory doesn’t predict spite; one may argue this to be a flaw in the theory itself. Sure, we can invent workarounds: for instance, by employing fancy social contracts or threats that are costly to lie about. However, I feel that the framework of Updateless Decision Theory offers a cleaner abstraction to explain such behavior. Perhaps I’ll write about UDT, or about the power of good abstractions more generally, in a future post.

For readers who are unacquainted with applied math research, the takeaway is that one’s assumptions, whether stated or implied, can affect the outcome. We haven’t formally defined what fairness means, nor how the diners behave. Unless we clarify our assumptions, it’s impossible to narrow down to one right answer: the chooser may be classically rational, or they may be spiteful. If you happen to have the math background, I leave it as an exercise to formalize, and then generalize, the arguments in this post!

Last attempt

Ok, so I still can’t win against a spiteful chooser. It appears I’m doomed whenever our utilities are additive. Additive utility, here, is just a fancy way of saying we enjoy a piece of cake exactly as much as the sum of its parts, not caring about special combinations.

But now suppose the “cake” is actually a basket of different items, some combinations of which form recipes that I enjoy, while other combinations form recipes that you enjoy. To construct a very simple example, let’s say the cake contains not only berries and cherries, but also chocolate chips and vanilla chips. I enjoy berries only when they’re mixed with chocolate, and I enjoy cherries only when mixed with vanilla. For you it’s the reverse: berries require vanilla and cherries require chocolate.

Now as the cutter, I can cut the cake so that one piece contains mostly berries and chocolate, while the other contains mostly cherries and vanilla. No matter which you choose, you’ll be unhappy and I’ll be happy. If I were more considerate, I could make us both happy by splitting everything equally. However, if the cake had just one (indivisible) topping of each type, then only the cutter can win. Sweet victory!
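With one indivisible topping of each type, the whole game is small enough to enumerate. The sketch below assumes a utility of one point per complete recipe in a piece. It lists every possible cut, together with the cutter's worst case and the chooser's best case.

```python
from itertools import combinations

ITEMS = ("berry", "cherry", "chocolate", "vanilla")
CUTTER_RECIPES = [{"berry", "chocolate"}, {"cherry", "vanilla"}]
CHOOSER_RECIPES = [{"berry", "vanilla"}, {"cherry", "chocolate"}]

def utility(piece, recipes):
    """Number of complete recipes contained in a piece."""
    return sum(recipe <= piece for recipe in recipes)

def cutter_guarantee(piece):
    """Cutter's worst-case utility and chooser's best-case utility for a cut."""
    other = set(ITEMS) - piece
    options = [(piece, other), (other, piece)]   # (chooser takes, cutter keeps)
    worst_for_cutter = min(utility(kept, CUTTER_RECIPES) for _, kept in options)
    best_for_chooser = max(utility(taken, CHOOSER_RECIPES) for taken, _ in options)
    return worst_for_cutter, best_for_chooser

for size in range(len(ITEMS) + 1):
    for combo in combinations(ITEMS, size):
        piece = set(combo)
        print(sorted(piece), cutter_guarantee(piece))
```

The cut that separates berry and chocolate from cherry and vanilla guarantees the cutter one recipe and leaves the chooser with none, whichever piece is taken. Not even spite helps the chooser here.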

What’s a Color Made of?

2017-08-31T00:00:00+00:00

As highly visual creatures, we understand the world largely in terms of what we see. The colors in a scene, their presence and interactions, have the power to delight, disgust, and inform.

As citizens of a scientific age, we may wonder about the nature of color. What could a red apple possibly have in common with a red fire? Is color an objective physical property of these objects, or a subjective experience we made up? If it’s objective, what sort of physical laws are involved? If it’s subjective, how did we make it up? Is there a way to characterize all possible colors?

Part I: Primary Colors

You may try to answer these questions by summoning memories of lessons from elementary school. Perhaps you had a painting session, in which you saw that a wide range of colors can be obtained from combinations of three primary paints: red, blue, and yellow. For example, blue and yellow paints can be mixed to obtain green, a secondary color. Adding red to the green mixture yields black. To summarize in a diagram:

Maybe a few years pass, and you find yourself in an IT class. You learn that TVs and other electronic displays produce realistic images by emitting combinations of three primary lights: red, blue and… green! This theory seems to agree with your biology class, where the retina is said to contain three types of cone cells, each sensitive to red, blue or green light. Apparently, light comes in three varieties, and our brains process combinations of them as follows:

But hold on, how did red and green go from making black to making yellow? This diagram seems to completely defy our common sense experience with paint and crayons!

Before we can finish our thoughts, a physics teacher enters the room. As if to delight in our confusion, as physics teachers often do, she explains that our eyes are in fact sensitive to electromagnetic (EM) radiation whose wavelength and frequency are within a thin continuous band. We have names for each of the bands: the longest EM waves are radio waves, whereas short ones are gamma and X-rays, but they are all fundamentally the same physical phenomenon. The human eye is only sensitive to a thin band of EM waves, which we call visible light:

In this theory, no mixing is needed because light comes in infinite varieties, providing the entire spectrum of color that you see in a rainbow! Indeed, rainbows occur when mixed (white) light is refracted in a wavelength-dependent manner. Now if you happen to be especially attentive, you might notice that magenta is missing from the rainbow… hmm!

It looks like our teachers have made a fun sport out of contradicting each other at the expense of impressionable minds! Can we reconcile their theories?

What might surprise you is that the plain act of uncovering these apparent contradictions is a major breakthrough in our journey! By digging to the source of disagreements, we’ve arrived very close to the truth. Taking this little-known technique to heart will turbocharge your learning.

In this case, having learned an objective (physical) theory that includes an infinite continuum of varieties of light, it’s a good time to revise our subjective (perceptual) theory of color. Let’s revisit those cone cells…

Part II: Cone Cells

Alright, so the physics part of our investigation was pretty easy. You may be tempted (or scared!) to read more about how EM works, but there’s really no need. The last diagram gave us the right idea: light is just EM radiation whose wavelength happens to fall within some range. The complications come not from the physics of EM, but from the biology.

Before turning to our favorite search engines for a full answer, let’s imagine how color perception might work, hypothetically. This exercise will deepen our understanding of the underlying logic behind the mess in which we find ourselves.

We know that we’re capable of seeing an infinite variety of light (up to a reasonable precision) because we see it in the rainbow: this is the continuum of wavelengths known as the visible spectrum. We perceive this spectrum as varying continuously from deep red, slightly orangish red, orange-red, reddish orange, orange, golden orange, golden yellow, and so on through all the colors of the rainbow, down to indigo and violet. These are the spectral colors, produced by monochromatic light, meaning light composed purely of a single wavelength.

By mixing the spectral colors, it’s possible to get non-spectral colors that don’t appear on the rainbow, such as white and magenta. With an infinite variety of pure spectral colors available, the possible combinations we can make with them boggle the mind! However, our eyes cannot distinguish between all possible combinations. For example, recall that our displays mix red and green lights to appear yellow: this is a mixture whose appearance closely resembles that of the rainbow’s pure spectral yellow.

Our limitation stems from the fact that our eyes typically only have three types of light-sensitive cone cells. Our visual cortex never learns the full spectral power distribution, which is a fancy term for the full “recipe” we’d write down if we were to describe how much of each wavelength is included in an EM mixture. Instead, the visual cortex effectively works with three numbers per location in its visual field, corresponding to the amount of activation on each cone type.

How do three types of cones accommodate a continuous spectrum? Since we’re able to perceive the entire visible EM band, it must be the case that the three types together cover the band. Indeed, here’s a graph plotting the sensitivities of each cone type against each wavelength:

There are three curves, corresponding to three types of cone cells. The horizontal axis corresponds to wavelength, and the height of a curve at each position tells us how sensitive the cone is to that wavelength. We see here that one type of cone (let’s call it R) peaks in the red-to-yellow range, one (G) in the green and one (B) in the blue-to-indigo. When multiple wavelengths shine on the same cone cell, its activation is the sum of its sensitivities to the individual wavelengths: this is an empirical result known as Grassmann’s law.

Now, we see that monochromatic yellow light activates both R and G cones, with almost no activation of B. This is very similar to the activations induced by a mixture of red and green light; as a result, these two cases yield very similar subjective appearances. Different spectral power distributions that appear the same, because they induce the same RGB activations, are called metamers.
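Here is a toy calculation of a metamer. Loud caveat: the Gaussian sensitivity curves below are illustrative stand-ins that I made up for this sketch, not measured cone data, and the chosen wavelengths (580, 620 and 540 nm) are just plausible examples of yellow, red and green. The point is only the logic: find amounts of red and green light whose R and G activations match those of monochromatic yellow.

```python
import math

# Toy Gaussian cone sensitivities: (peak wavelength in nm, width in nm).
# These are illustrative stand-ins, NOT measured cone data.
CONES = {"R": (570.0, 40.0), "G": (540.0, 40.0), "B": (445.0, 25.0)}

def sensitivity(cone, wavelength):
    peak, width = CONES[cone]
    return math.exp(-0.5 * ((wavelength - peak) / width) ** 2)

def activations(spectrum):
    """Cone activations for a spectrum given as {wavelength_nm: intensity}.

    Contributions from different wavelengths add up linearly.
    """
    return {
        cone: sum(power * sensitivity(cone, wl) for wl, power in spectrum.items())
        for cone in CONES
    }

yellow = activations({580.0: 1.0})   # monochromatic yellow
red = activations({620.0: 1.0})
green = activations({540.0: 1.0})

# Solve a * red + b * green = yellow on the R and G cones (Cramer's rule).
det = red["R"] * green["G"] - green["R"] * red["G"]
a = (yellow["R"] * green["G"] - green["R"] * yellow["G"]) / det
b = (red["R"] * yellow["G"] - red["G"] * yellow["R"]) / det

mixture = activations({620.0: a, 540.0: b})
print(yellow)
print(mixture)
```

With these toy curves, the mixture reproduces the R and G activations of pure yellow, while the B activation stays negligible in both cases.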

Some RGB activations cannot be induced by monochromatic light. For example, we can see from the graph that it takes at least two distinct wavelengths to simultaneously activate both R and B without G; the brain’s response to this combination is what we call magenta, and explains why we don’t see it in the rainbow. White is another example: no single wavelength activates all three cone types equally.

Finally, we find certain RGB activations that cannot be induced by any sort of light, mixtures included. For example, it’s impossible to get a pure G activation without also getting a bit of R or B mixed in there. If you could magically activate only your G cones, you would see a color that doesn’t exist in the real world: a subjective experience with no physical counterpart! This is a forbidden color.

Without going into every possible detail of physics and biology, we now have a coherent understanding of color, connecting the physical phenomenon with how it’s perceived. Summarized in one sentence, color is derived from the cone cell activations induced by mixtures of EM waves. Armed with these insights, can we draw a new diagram that characterizes all possible colors, including the non-spectral ones, in a natural way?

Part III: Color Space

We already came close to our goal when we saw the rainbow embedded within the EM spectrum. The spectral colors can be arranged along one axis, because they are specified by a single parameter: the wavelength. If we wanted to, we could add a second axis for brightness: shining a more intense light at the same wavelength makes it brighter.

To describe an arbitrary EM mixture, you’ll need infinitely many axes: one to specify how much of each wavelength is used. However, our subjective experience is determined only by the level of R, G and B activation. If we let the three spatial coordinate axes correspond to these activations, we get the Cube of All Colors:

Or do we? Alas, this cube is a fake! It contains those forbidden colors we mentioned; in particular, its corners correspond to pure activations, so we should at least carve out some of the perimeter. If we don’t care about brightness, we can remove even more. In this image of a cube, every point that’s hidden from view (e.g., the back and interior) is just a dimmer version of a color on the surface.

In summary, after removing the forbidden envelope of the cube, we can take a slice of the remaining solid to represent all of its colors at a fixed level of brightness. The result is a two-dimensional color space, allowing us to present all possible colors¹ on a flat display or sheet of paper.

Okay, while it’s nice that we can theoretically present everything in 2D, this is starting to sound complicated. Let’s cheat for a moment and look up the answer. We can resume trying to understand it afterward. A quick web search reveals the CIE 1931 color space:

Remember that all we’ve done is project the color cube onto a flat plane. Why the triangular and horseshoe shapes? Let’s explore whether they make sense.

Given any spectral distribution, our handy cone sensitivity graph lets us calculate the corresponding cone activations, which become coordinates in the color cube. The CIE standard then projects these to coordinates in the plane corresponding to a slice from the cube.

Which positions in the CIE space do the spectral colors occupy? Try to convince yourself that they must form an unbroken continuous curve. For a hint, look up the formal mathematical definition of a continuous function. I won’t give a formal proof, but here’s an intuitive sketch:

Let’s start from the longest visible wavelength (red) and find its place in CIE space. Then, let’s gradually lower the wavelength towards the violet end. Our cone sensitivity graph shows a smooth dependence on wavelength. Therefore, a sufficiently tiny change in wavelength causes a tiny change in cone activations, which remains a tiny change after projecting to CIE. As we visit all the wavelengths from red to violet, moving a little bit at a time, we’ll trace out a continuous curve in CIE space.

What about the non-spectral colors? By Grassmann’s law, the mixture of two colors always lies on the line segment between them. Now imagine repeating this process, mixing additional colors one at a time. Given any set of colors to start with, we would find that their possible combinations take up the whole interior region bounded by the starting colors; in formal terms, their convex hull. A hands-on way to see the convex hull is to place thumbtacks all over the starting set. If we surround them all tightly with a thin rubber band, the region it would contain is the convex hull.
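The line-segment property is easy to check numerically. In this sketch, the projection divides each activation by the total, which discards brightness; this is similar in spirit to how CIE coordinates are normalized, though the activation values here are hypothetical numbers of my own.

```python
def chromaticity(activation):
    """Project an (R, G, B) activation onto a brightness-free 2D point."""
    total = sum(activation)
    return activation[0] / total, activation[1] / total

def mix(u, w, a, b):
    """Additive mixture: a parts of light u plus b parts of light w."""
    return tuple(a * ui + b * wi for ui, wi in zip(u, w))

u = (0.9, 0.3, 0.1)   # hypothetical activations for one light
w = (0.2, 0.5, 0.8)   # hypothetical activations for another
a, b = 2.0, 1.0

cu, cw, cm = chromaticity(u), chromaticity(w), chromaticity(mix(u, w, a, b))

# Position of the mixture along the segment from cw (t = 0) to cu (t = 1).
t = a * sum(u) / (a * sum(u) + b * sum(w))
on_segment = tuple(t * p + (1 - t) * q for p, q in zip(cu, cw))
print(cm, on_segment, t)
```

The mixture's projected point coincides with a point on the segment, with $t$ strictly between 0 and 1. Brighter or more plentiful light pulls the mixture closer to its own end.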

The CIE “horseshoe” is simply the convex hull of our spectral curve: it connects the red and violet ends with a straight line segment, and fills in the interior. The straight segment closes our curve with purples and magentas, and explains why artists talk about a color wheel instead of a two-ended spectrum. Meanwhile, the interior contains colors of lower saturation. Altogether, the horseshoe contains every color that your eyes can see. The forbidden colors, which cannot be produced by any light, lie outside the horseshoe. Beautiful!

Finally, in case you were wondering, the black interior curve corresponds to the spectrum emitted by a blackbody, an idealized object that absorbs all incoming light. At room temperature, such an object would appear black; approaching 1,000 degrees, it glows red-hot; past 10,000 degrees, it radiates blue-white. Ironically, the hottest stars in the night sky glow in “cool” colors that we typically associate with icy climates. The point labeled D65 is a standard white point representing average daylight, close to the color of sunlight. It should come as no surprise that our visual sensitivity has evolved to peak at about the same wavelength as sunlight!

Part IV: Color Technology

Scientists propose and test theories to explain how natural things work. Engineers apply theories to design new, useful things. In a way, these activities are inverses of each other. Together, they enable technology: the art of manipulating nature to one’s will.

Having learned the scientific basis for color perception, let’s put on our engineer hats and see what we can create! Video is a technology that convincingly (to human eyes) reproduces the visual stimulus of a detailed scene. This requires the ability to produce a spectacular variety of colors at a moment’s notice. While it would be an impressive feat to faithfully replicate the spectral power distribution from a real scene, our goal in practice is less ambitious: to fool the human eye. To produce convincing scenery, we need only produce the right RGB activations.

TVs and computer screens can be built from tiny subpixels that each emit a dot of one specific color. We can turn each light off, or have it on at any brightness. Since the CIE space projects away the brightness dimension, each type of subpixel is anchored to one point in the space.

White pixels alone suffice to give us black-and-white television: just dim the pixels to produce darker shades of grey. Now suppose we had two types of subpixels: red and green. Adjusting their brightness in different proportions yields a range of hues such as orange and yellow. In color space, we’re capable of representing not just one point, but the entire line segment connecting the two source colors.

In general, the range of colors we can represent will be the convex hull of the source colors. For example, the sRGB standard uses red, green and blue subpixels to produce colors in a triangular region, as shown inside the horseshoe above. If your monitor is sRGB, it’s incapable of displaying colors outside the triangle. The region outside the triangle but inside the horseshoe corresponds to colors that you might find in the wild, but that your screen cannot display. Due to this limitation, the color space image on this webpage is actually drawn in the closest colors that your screen can display.
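As a rough illustration of the triangle test, the sketch below checks whether a point in the CIE diagram falls inside the sRGB triangle. The corner coordinates are the standard sRGB primaries; the function names are my own, and the “spectral green” test point is only an approximate location on the horseshoe boundary near 520 nm.

```python
# CIE xy coordinates of the sRGB red, green and blue primaries.
SRGB_PRIMARIES = [(0.64, 0.33), (0.30, 0.60), (0.15, 0.06)]

D65_WHITE = (0.3127, 0.3290)       # the sRGB white point
SPECTRAL_GREEN = (0.0743, 0.8338)  # approximate: pure light near 520 nm


def cross(origin, a, b):
    """2D cross product of the vectors origin->a and origin->b."""
    return ((a[0] - origin[0]) * (b[1] - origin[1])
            - (a[1] - origin[1]) * (b[0] - origin[0]))


def in_gamut(point, primaries=SRGB_PRIMARIES):
    """True if point lies inside (or on the edge of) the primaries' triangle."""
    r, g, b = primaries
    sides = [cross(r, g, point), cross(g, b, point), cross(b, r, point)]
    # Inside means the point is on the same side of all three edges.
    return all(s >= 0 for s in sides) or all(s <= 0 for s in sides)


if __name__ == "__main__":
    print("D65 white displayable:", in_gamut(D65_WHITE))
    print("Spectral green displayable:", in_gamut(SPECTRAL_GREEN))
```

With four or more primaries, the same idea applies to the convex polygon they span instead of a triangle.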

It’s often said that red, green and blue are the additive primary colors. Additive refers to the act of mixing colors by emitting, hence adding, different source lights. Colors made from one source light are primary; colors made from an equal mix of exactly two source lights are secondary. Red, green and blue occupy distant corners of the color space, suggesting their suitability as primary colors.

To display richer colors than sRGB is to grow its region in color space. This can be done in two ways. The first is to make our subpixels purer, i.e., place them closer to the horseshoe boundary, corresponding to monochromatic light. This is the approach taken by UHDTVs. The alternative is to add more primary colors.

Using these two approaches, we might hope to capture all the colors. However, since the convex hull of any finite set of points is a polygon, it cannot possibly fill the rounded edges of the horseshoe. We conclude that this type of technology must necessarily miss some colors.

If we cheat a bit, we can view the forbidden colors corresponding to pure R, G and B cone activation as the ultimate primary colors. Together, they form a big triangle that fully contains the horseshoe; we know this because every color corresponds to some RGB cone activations. Unfortunately, these “primary colors” are not real colors realizable by any physical process!

In retrospect, we can view our IT class’s RGB Venn diagram as a simplification of additive color mixing. What about the RYB diagram from painting class? That’s subtractive color mixing. Unlike TVs, paint and printer inks don’t produce any light of their own. Instead, you shine an ambient light, such as from a bulb or the Sun, on a sheet of paper. The paper is a good reflector at all visible wavelengths, so it appears white. The ink absorbs, hence subtracts, selected wavelengths, preventing their reflection.

At this point, maybe we’re feeling tired and don’t want to dig deeply into the physics of subtractive mixing. That’s ok! We can reason through the simplified, though technically incorrect, model that discretizes light into three varieties: long-wave (red), medium-wave (green) and short-wave (blue). In this simplified model, it follows that the primary subtractive colors correspond to secondary additive colors, and secondary subtractive colors correspond to primary additive colors. This is best explained with pictures:
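The same simplified three-band model can also be sketched in a few lines of Python. This is a toy under the stated (technically incorrect) assumption that light comes in exactly three varieties; the function names are my own. Each color is a triple saying which of the long, medium and short bands is present.

```python
# Each color is (long, medium, short): 1 if that band is present, else 0.
RED, GREEN, BLUE = (1, 0, 0), (0, 1, 0), (0, 0, 1)
CYAN, MAGENTA, YELLOW = (0, 1, 1), (1, 0, 1), (1, 1, 0)
WHITE, BLACK = (1, 1, 1), (0, 0, 0)


def add_lights(*lights):
    """Additive mixing: a band is present if any source light emits it."""
    return tuple(min(1, sum(band)) for band in zip(*lights))


def layer_inks(*inks):
    """Subtractive mixing on white paper: a band is reflected only if
    every ink lets it through."""
    result = WHITE
    for ink in inks:
        result = tuple(r * i for r, i in zip(result, ink))
    return result


if __name__ == "__main__":
    # Additive primaries pair up into the subtractive primaries...
    print(add_lights(RED, GREEN) == YELLOW)
    print(add_lights(GREEN, BLUE) == CYAN)
    print(add_lights(RED, BLUE) == MAGENTA)
    # ...and subtractive primaries pair up into the additive primaries.
    print(layer_inks(MAGENTA, YELLOW) == RED)
    print(layer_inks(CYAN, YELLOW) == GREEN)
    print(layer_inks(CYAN, MAGENTA) == BLUE)
```

In this model, each ink is just white paper with one band removed, which is why the two sets of primaries swap roles.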

Indeed, most color inkjet printers have three primary inks, cyan, magenta and yellow, plus a separate black ink (CMYK). K stands for black, which is preferable to mixing CMY inks because it’s cheaper and typically yields a deeper, sharper black. Cyan and magenta are sometimes called “process blue” and “process red”, respectively. Thus, you can think of CMY as a modern improvement over the traditional RYB primaries.

Conclusion

Phew! We covered a lot, but all of our conclusions were logical consequences of the fact that we have three types of light-sensitive cells following Grassmann’s law. We might try to memorize every detail of additive and subtractive color mixing, the horseshoes, the triangles and so on. We might see them as random, interesting, messy facts about the world.

Instead, we took a different approach. We explored how a variety of ideas from physics, biology, geometry, and painting could come together and tell a story. We thought critically, asking questions and testing potential answers, some of which stood in apparent contradiction to one another. Through a theoretical exercise, we discovered the hidden beauty of colors.

I hope you enjoyed this post! There’s a lot more to human color perception than presented here. If you want to learn more, a good place to start is the HyperPhysics Color Puzzles.

If you’d like to read someone else’s explanation of the same topic, check out:

Hey Kids! Red’s Not a Primary!

A Beginner’s Guide to Colorimetry

Note, however, that grey is effectively a darker white, and brown is a darker orange. Our perception makes relative comparisons that take context, light sources and shadows into account. Optical illusions take advantage of this. By necessity, this exposition contains simplifications. ↩

Announcing the Algorithm Cookbook

2017-06-19T00:00:00+00:00

I built a reference cookbook of algorithms and data structures for contest problem solvers. It’s written in the Rust programming language, as I believe it’s ideally suited to the task. For more info, please check out the repository at github.com/EbTech/rust-algorithms.

While I believe Rust is not too difficult in absolute terms, it does present a significant departure from most developers’ mental models. If you’d like to practice the language on small toy problems, contests can serve as a useful playground. Unfortunately, it’s hard to get started when there are still so few examples of Rust contest code out in the wild, and no established guidelines to tie Rust’s compile-time discipline with the constraints of contest programming. This project seeks to remedy the situation.

Note that it’s not meant to act as a full-fledged general-purpose library. Contest problems often require understanding an algorithm so well that you can dig in and make subtle modifications to make it suitable for a brand new problem. Therefore, in this setting, one ought not to rely on blackbox implementations. Instead, I try to distill each algorithm into its simplest possible form, so that you can quickly read over the code, understand it, and augment it to suit your needs.

Rust and Codeforces represent two of my favorite technology communities, so I’m interested to see how they can support each other. If you’re a Rust programmer interested in honing your technical interview skills or solving cool algorithmic puzzles, you might enjoy Codeforces. If you’re a Codeforces member and find that debugging is a huge time drain, Rust’s emphasis on safety may give you a competitive edge. In either case, I hope this reference will help ease the learning curve. I’m still learning too; suggestions are welcome!

Out of Hiding

2017-06-18T00:00:00+00:00

Woo finally, site launch! ¹ Pleased to meet y’all fancy people :)

After more than a year spent learning in Silicon Valley, masquerading as some non-crazy dude, I’m ready to take my lessons out into the world. This site, opening today, is my portal. Welcome! Your encounter with this page was preordained by the choice of Stein’s Gate. HAHAHAHAHAHA!

Er, why was that necessary? What are you, an agent of CERN? Fine, here’s my story:

5 years ago, I entered grad school with no clue of what I was trying to do or why. 3.5 years, several great friendships and life lessons later, I dropped out, still having hardly a clue. So I did the modern equivalent of picking up some books, and dedicated myself to studies like never before. Free of any structured requirements, inspired in equal parts by sheer curiosity and by the very real challenges faced at work, I put my heart into the work. Turns out the web is chock-full of quality free resources! It didn’t cost a dime to learn about:

But of course, it’s not enough to just read. As a former theoretician, it’s been an interesting struggle to learn my way around a living, whirring computer at work and beyond. Even the toolchain needed to generate this site was a level up. And, though I skip over it here, I’ve also been busy picking up a bunch of non-technical skills.

As a result… uh I still don’t know what the goal is :/ OK there is something. Indeed, the first stages of disruption are well under way! I’m all-in on this movement: human driving on public roads is both wasteful and dangerous. For a minority, it’s not even an option. So I’d like to play a small role here. Besides, I kinda left Canada while my driver’s license was still in the mail…

What’s next? Well, I’d like to share what I’ve learned in collaboration with you awesome people. You’ll have to look to official channels for the Waymo stuff, but my roommate Shriphani and I have some fun things planned of our own ², so please stay tuned!

Helped in part by today’s 40℃ heat wave. ↩
Collectively, we are DeepFear :D You can already check out our first collaboration. ↩