“Human, All Too Human”: ChatGPT and the Conjunction Fallacy

When a new technology becomes popular in AI, just like in other fields, it’s quite common to give in to one of two opposite temptations. Either we try to find tasks where the technology fails miserably, taking a questionable pleasure in diminishing its capabilities, or we start collecting examples where the new tool performs impressively well, spreading enthusiasm around the novelty and sharing the excitement with interested people. Nowadays, ChatGPT is all the rage in the AI and NLP community, so I started pondering which of the two temptations I really wanted to give in to in this case. And while still unsure about which temptation was stronger, I caught the chatbot failing on a fairly simple task in a glorious kind of way: namely, failing in a way which shows how good ChatGPT is. Let me illustrate this finding.

I actually had a long and very entertaining conversation with the bot, starting from axiomatic set theory, going through the Greek etymology of some Italian words, and finishing with a discussion of the relationship between probability theory and logical implication. The last topic is where I was most fascinated by the way ChatGPT seems to reason. A classic case where Boolean logic and probability theory are found to clash with human reasoning is what cognitive scientists call the “conjunction fallacy”. In a nutshell, the conjunction fallacy is the logical mistake we make when judging that a conjunction of two statements is more likely to be true than one of its conjuncts. This judgment contradicts probability theory, since the joint probability of two events cannot be larger than the probability of either one occurring alone. However, the fallacy is pretty frequent in human reasoning—and not clearly due to a lack of education in probability theory and mathematical logic. Here is an excerpt from my discussion with ChatGPT:

The question about Linda poses a task that even the most rudimentary implementation of probabilistic reasoning should be able to address correctly. However, many humans fail on that task. And here is what happens with ChatGPT. Like a human, ChatGPT falls victim to the conjunction fallacy. But unlike some humans, ChatGPT is well aware of the conjunction fallacy. Now, unlike any human I have seen so far, ChatGPT seems to remain blind to the fact that it commits the fallacy, even after being caught with a smoking gun. Or maybe I’m lying to myself here, and ChatGPT is simply more human than I’m willing to believe. Indeed, many humans seem to remain blind to their own mistakes—especially after being caught with a smoking gun.
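For reference, the probabilistic constraint that the Linda task relies on is elementary. Writing A for “Linda is a bank teller” and B for “Linda is active in the feminist movement” (the two statements in the classic formulation of the task), we have:

```latex
P(A \wedge B) \;=\; P(A)\,P(B \mid A) \;\le\; P(A),
\qquad \text{and symmetrically} \qquad
P(A \wedge B) \;\le\; P(B)
```

since P(B | A) can never exceed 1. No assignment of probabilities can make the conjunction more likely than either conjunct, which is exactly the judgment the fallacy gets wrong.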

How German Is Machine Learning?

While learning about lean thinking, I got a bit concerned by the similarities I see between the German engineering tradition on the one hand and applied machine learning (ML) on the other.

German Engineering

German technology enjoys an incredibly high reputation all over the world. With the automotive industry as probably the brightest example, “Made in Germany” is almost synonymous with “meeting the strictest quality standards”—and rightfully so. Even Germans are not perfect, though:

A […] German weakness has been the tendency to substitute the voice of the product engineer for the voice of the customer in making trade-offs between product refinement and variety on the one hand and cost as reflected in product price on the other. While quality may be free, variety and refinement almost always entail costs, particularly when products are designed without much attention to manufacturability. Good hearing is therefore needed to ensure that product designs contain what customers want rather than what designers enjoy making.

Womack, James P., & Jones, Daniel T. (2003). Lean Thinking: Banish Waste and Create Wealth in Your Corporation.

A Cardinal Sin of Applied ML

While the German tech industry might have more problems than just the one described above, many of its weaknesses are definitely balanced by a number of strengths. But rather than looking at those strengths now, let us shift our focus to ML, and particularly to how ML is used in industry. If we spend some time looking at how ML is applied to engineering products in the high-tech industry, we will notice a few things.

Let’s consider a concrete example. Imagine we are dealing with the ranking model to be used by the recommender system of an e-commerce platform, i.e., the software component responsible for picking the most relevant recommendations to be shown to our customers on the website. During the product design stage, it might be pretty easy to completely disregard cost considerations, operational requirements, and customer expectations. For example, deciding on the degree of personalization that the ranking component should exhibit when recommending products to different customers, or choosing between a simple logistic model and a deep neural network, is a design step where engineers (or “applied scientists”) usually feel detached enough from its implications in terms of budget and operations not to actually bother about them.

When facing the choice between a linear and a non-linear model, the question of whether the latter will have a lower error on the training data is usually considered the most relevant one. By contrast, the question of whether a linear approximation will ever cause a measurable difference in customer experience—and in particular, whether such a difference justifies any additional engineering and operating costs—is usually perceived as less relevant or less pressing, and probably too hard to answer. Here, the reasoning typically runs along the following lines: “Let’s go for the most accurate solution—if it takes a faster CPU or a larger RAM, then we’ll look for better hardware”. Or: “If maintaining the system gets too complex, then we’ll try to hire more DevOps engineers”.

To some extent, this attitude might even be reasonable, given that product design in the ML area is often tightly coupled with research, hence it needs some flexibility w.r.t. the application constraints in order to properly explore the most promising solutions. Moreover, the subtle—and often very indirect—way that algorithmic choices in product design end up affecting customer experience in a measurable way hardly makes a compelling case for worrying about customer impact already at the design stage. However, every design choice generates costs downstream in the development pipeline. If a modeling choice brings adequate value to customers, then its costs are fully justified. Otherwise, it is nothing but waste. Hence, the problem with ML design is not with the costs it generates as such, but rather with its intrinsic tendency to detach itself from value stream analysis. To sum up, applied ML scientists often—and sometimes quite happily—indulge in “the tendency to substitute the voice of the product engineer for the voice of the customer”.
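To make this asymmetry tangible, here is a minimal, purely illustrative sketch of the kind of offline comparison that usually settles the linear-versus-non-linear debate. The data is synthetic and the model choices are hypothetical, not taken from any real recommender system; what matters is how naturally the exercise stays silent about latency, operating cost, and customer impact.

```python
# Purely illustrative: the offline comparison that usually decides the model choice.
# Synthetic data and hypothetical models; not a description of any real system.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Pretend these are (customer, product) feature vectors with a click/no-click label.
X, y = make_classification(n_samples=20_000, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
nonlinear = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200, random_state=0).fit(X_train, y_train)

for name, model in [("logistic regression", linear), ("neural network", nonlinear)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: offline AUC = {auc:.3f}")

# The offline metric is easy to obtain; whether any gap between the two numbers
# translates into a measurable difference in customer experience, and whether that
# difference justifies the extra engineering and operating costs, is the question
# this comparison leaves unanswered.
```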

Compensation Patterns

Is there a problem at all? Judging by the impact that ML has made in industry over the last few years—shifting the focus of preexisting business models more and more towards data virtually everywhere—we would hardly blame ML developers for building technologies which are a bit too expensive, heavy on operations, and not always rooted in unquestionable customer needs. After all, many—if not most—ML developers still have a background in academic research, where customer-centricity and operational excellence are not relevant evaluation criteria. But for sure, business would never forgive ML its academic sins if they were not offset by a number of strengths. And how are ML folks managing to make those sins go almost unnoticed? If we look back at the situation in one or two decades, maybe we’ll summarize it as follows:

• Because skill levels were so high on the plant floor it was possible to fix each problem as it arose rather than fix the system which created the problems in the first place. The finished product handed to the customer was usually of superlative quality, even if also of high cost.

• Because the skill level of product development engineers was so high, they could reengineer designs coming from upstream rather than talk to upstream specialists about the problems their designs were creating. Again, the end product reaching the customer was superlative in achieving the promised performance, but at high cost.

• Because of the technical depth of a firm’s functions, it was often possible to add performance features to products which offset their inherently high development and production costs.

Womack, James P., & Jones, Daniel T. (2003). Lean Thinking: Banish Waste and Create Wealth in Your Corporation.

These words were actually used to explain how German manufacturing has traditionally been able to compensate for its inefficiencies. I’m genuinely impressed by how smoothly the diagnosis above can be recast from the German manufacturing domain to the applied ML scenario.

Compensating by DevOps

The first type of compensation is something we see when an ML system goes live, and problems are observed which we did not anticipate early enough. For example, in the recommender system case discussed above, once the ranking model starts serving online customer traffic, we might realize that the amount of data transfer required for the ranker to consume all relevant feature vectors is causing unbearably high latency. We then resort to all the tricks of the trade in order to overcome this issue, such as enabling data compression/decompression before/after the transfer, moving to a different hardware configuration in order to optimize network performance and (de)compression speed, or increasing the volume of data cached in local memory—if a local cache is available. Here, the bottom line is: if we can count on great operational skills, there’s virtually no runtime glitch that we can’t overcome.
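As a toy illustration of the first trick of the trade mentioned above, here is a minimal sketch of compressing feature vectors before transfer. The vector shapes, the artificial sparsity, and the choice of zlib are assumptions made purely for the sake of the example.

```python
# Illustrative only: shrink the feature payload before transfer to claw back latency.
# Shapes, sparsity, and the choice of zlib are assumptions for the sake of the example.
import zlib
import numpy as np

rng = np.random.default_rng(0)
feature_vectors = rng.standard_normal((10_000, 512)).astype(np.float32)
feature_vectors[rng.random(feature_vectors.shape) < 0.9] = 0.0  # sparse-ish vectors compress well

payload = feature_vectors.tobytes()
compressed = zlib.compress(payload, level=6)  # CPU spent on the sending side...

restored = np.frombuffer(zlib.decompress(compressed), dtype=np.float32)  # ...and on the receiving side
restored = restored.reshape(feature_vectors.shape)

assert np.array_equal(feature_vectors, restored)
print(f"raw payload: {len(payload) / 1e6:.1f} MB, compressed: {len(compressed) / 1e6:.1f} MB")
# How much this actually buys depends on how redundant the real vectors are,
# and the (de)compression time itself has to be paid for, e.g. with faster hardware.
```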

Compensating by Reengineering

The second way to compensate for inefficiency is something which occurs when the ML product hasn’t reached our end consumers yet, but the research engineers have handed it over to the product engineers in order to “put it into production”. We now move from a software prototype to a business product which must be able to cope with the applicable requirements in terms of design, reliability, and performance. In the ranking system example we were imagining above, what might happen is something like the following. While the prototype was lightheartedly filling a local memory cache with as much data as possible (e.g. all consumed feature vectors) in order to minimize latency, once we go to large scale the original cache is not sufficient anymore, and we can’t cache all the feature vectors that the ranker needs to consume. But we are so skilled at reengineering our model that we quickly think of a suitable dimensionality reduction technique. Via dimensionality reduction, we manage to squeeze the size of our feature vectors to a minimum, without significantly hurting the original ranking accuracy. By switching to a lower dimensionality (which might involve quite a bit of refactoring/reconfiguration throughout the ML pipeline), all the needed feature vectors fit again into the local memory cache. Bottom line: when engineering skills are high, no design flaw prevents us from redesigning the ML pipeline to meet any relevant constraints.
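Here is a minimal sketch of that reengineering step, with PCA standing in for whichever dimensionality reduction technique we might actually pick. The dimensions, sample size, and cache budget below are assumptions, not numbers from any real pipeline.

```python
# Illustrative sketch: reduce feature dimensionality so the cache can hold everything again.
# Dimensions, sample size, and PCA as the technique of choice are assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_items, original_dim, reduced_dim = 1_000_000, 512, 64
sample = rng.standard_normal((50_000, original_dim))  # stand-in for a sample of real feature vectors

pca = PCA(n_components=reduced_dim).fit(sample)  # fit on a sample, then apply to every vector
reduced_sample = pca.transform(sample)

def cache_gb(dim: int) -> float:
    # assuming vectors are stored as float32 (4 bytes per dimension)
    return n_items * dim * 4 / 1e9

print(f"cache footprint: {cache_gb(original_dim):.1f} GB -> {cache_gb(reduced_dim):.1f} GB")
print(f"variance retained on the sample: {pca.explained_variance_ratio_.sum():.2f}")

# Whether the retained variance is enough to avoid hurting the original ranking accuracy
# still has to be verified against the ranker's own offline and online metrics.
```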

Compensating by Gimmicks

Finally, a third compensation strategy is to let production inefficiency go almost unnoticed by surrounding it with a number of gimmicks. Just think of the many, highly beneficial side effects of putting state-of-the-art ML techniques into production. Prestigious scientific conferences—such as NeurIPS, KDD, or RecSys for the recommender system domain—regularly host talks and publish papers contributed by ML researchers from industry, and most tech players on the market strenuously compete to get their contributions accepted for publication. If a product using ML has a high development cost, then publishing papers about the underlying technology at world-leading scientific conferences definitely makes that cost more easily acceptable—thanks to the return in terms of reputation, employer branding, and implicit marketing. Other gimmicks consist in winning public competitions, letting tech demos go viral on the Web, and so on. The most notable example in this direction was probably provided by IBM, when Deep Blue was awarded the Fredkin Prize for defeating the world chess champion Garry Kasparov—although Deep Blue was still rooted in old-fashioned AI rather than in modern ML.

Forgiving ML?

Compensation efforts can give rise to genuine excellence. Indeed, they play a crucial role in making both German Technik and ML technology so great and successful. There’s nothing wrong with compensation per se. However, what makes me uncomfortable is the temptation to absolve ML engineering from its academic sins just because ML engineers are so damn good at compensating for those sins. Business does not adhere to any logic of forgiveness. Although we know that being German is such a good thing in engineering, that’s not a good reason to forgive ML for being so German.

Smashing Inertia

Right now, I’m upset by something I read in Lean Thinking—a wonderful book by James Womack and Daniel Jones—under the section titled “Smashing Inertia to Get Started”:

there’s a […] very serious paradox inherent in introducing lean thinking in real organizations to pursue perfection. […] the catalytic force moving firms and value streams out of the world of inward-looking batch-and-queue is generally applied by an outsider who breaks all the traditional rules, often in a moment of profound crisis. We call this individual the change agent.
In fact, there is no way to reconcile this paradox, no way to square the circle. The change agent is typically something of a tyrant […] hellbent on imposing a profoundly egalitarian system in profoundly inegalitarian organizations.
Yet there are tyrants and there are tyrants. Those who succeed in creating lean systems over the long term are clearly understood by the participants in the firm and along the value stream to be promoting a set of ideas which have enormous potential for benefiting everyone. […] Because lean systems can only flourish if everyone along the value stream believes the new system being created treats everyone fairly and goes the extra mile to deal with human dilemmas, only beneficent despots can succeed.

Womack, James P., & Jones, Daniel T. (2003). Lean Thinking: Banish Waste and Create Wealth in Your Corporation.

The argument made by the authors can be paraphrased as follows:

  1. Lean thinking is superior (in some agreed sense of the word “superior”) to batch-and-queue thinking;
  2. Lean thinking is not compatible with the establishment (rules, processes, culture) of organizations which operate in a batch-and-queue fashion;
  3. Therefore, lean thinking can only be introduced by breaking the rules.

Simply put, what I learn from the argument above is that, sometimes, you have to break the rules in order to introduce an improvement in your corporation. Still, breaking the rules is not the most appreciated trait of a professional. For sure, so far I’ve never seen a job posting where a company was looking for candidates with “proven experience in breaking the rules”. Nor have I seen an executive praising an employee for their exemplary inclination to break the rules. Should we conclude that, if we constantly strive for improvements in our organization, we will ultimately have to give up being professional? Or rather that, if we want to be professional at all costs, we will ultimately have to give up pursuing improvements? If we agree with what we read in Lean Thinking, “there is no way to reconcile this paradox, no way to square the circle”.

I think there is a way to reconcile this paradox. But in order to square the circle, we need to get a bit philosophical—just a little bit. First of all, if we abstract a bit from the usual large-scale interpretation of the concept, we can use the term “revolution” to denote the type of change discussed by the authors of Lean Thinking. Here, a revolution is simply a change, within an organization, which requires breaking the traditional rules of that organization in order to be introduced. Now, one general way of classifying revolutions is by distinguishing peaceful revolutions from violent ones. While the notion of violence may seem inappropriate for analyzing the organizational changes we usually see in modern corporations, another bit of abstraction will make that notion useful for reconciling our paradox—at least if we assume that violence is something we want to avoid at all costs. An old teacher of mine once said: “Violence is when you get out of dialogue”. I find this definition of violence incredibly inspiring, as it keeps reminding me that the only way to avoid violence is by engaging in dialogue.

But then, here is the thing. If it takes a revolution in order to make the lean leap, and if it takes dialogue in order for a revolution to be peaceful, then it takes dialogue in order for the lean leap not to end up in tyranny. A successful change agent, as Womack and Jones call it, is one who effectively engages in dialogue with all of the stakeholders. How else could the change agent be “understood […] to be promoting a set of ideas which have enormous potential for benefiting everyone”, if not because the stakeholders are involved in an effective dialogue with the agent? And here, the dialogue is not an easy one to sustain, as it is about breaking the rules—the most subversive type of dialogue you can have within a corporation. If that’s the case, then we should always be mindful of John F. Kennedy’s maxim: “Those who make peaceful revolution impossible will make violent revolution inevitable”. Promoting and preserving dialogue, especially inconvenient dialogue, is an essential part of allowing for positive change within an organization.