Artificial intelligence
(AI) researchers are unable to explain exactly how deep learning
algorithms arrive at their conclusions. Deep learning is complex by
nature, but that complexity does not excuse us from seeking clarity and
understanding of its black-box decision making.
Assessing the quality of a machine learning algorithm requires some level of
transparency and an understanding of how a decision was reached; this
affects the generalizability of the algorithm and the reliability of its
output. In March 2019, researchers from the Fraunhofer
Heinrich Hertz Institute, Technische Universität Berlin, Singapore
University of Technology and Design, Korea University, and the Max Planck
Institut für Informatik published in Nature Communications a
method of validating the behavior of nonlinear machine learning in order
to better assess the quality of the learning system.
The research team of Klaus-Robert Müller, Wojciech Samek, Grégoire
Montavon, Alexander Binder, Stephan Wäldchen, and Sebastian Lapuschkin
discovered that various AI systems rely on what psychologists would
characterize as a “Clever Hans” type of decision making based on correlation.
In 1891 in Berlin, a performing show horse named “Clever Hans” and
his trainer Wilhelm von Osten would hold exhibitions to demonstrate the
horse’s unusual human-like intelligence
capabilities. The duo would stun audiences as von Osten asked
Clever Hans a variety of questions in subjects such as reading,
spelling, music, mathematics, and color identification; and the horse
would correctly respond with actions such as tapping its hooves. For
years, audiences, investigators, and psychologists were baffled and
unable to determine whether the horse’s intellectual gifts were real or
not.
Fast-forward 16 years: in a paper published in 1907, Oskar
Pfungst of the Psychological Institute at the University of Berlin
determined that the horse was responding to minute, and most likely
unintentional, cues from its human trainer. Pfungst arrived at this
conclusion after conducting a series of experiments and carefully
observing the horse’s behavior.
How much does context weigh in when considering intelligence? If a
deep learning system learns the concept of “tractor” mainly from the
pixels of the surrounding farm field rather than the pixels of the
tractor itself, is this truly learning?
As the saying goes, “Correlation does not equal causation.” If deep
learning arrives at its decisions from contextual clues rather than a true
understanding of the subject itself, and those clues are absent in
future input data, the algorithm may reach the wrong conclusion.
Generalization becomes a problem for the algorithm.
Another way to look at this is a student who passes an exam by
memorizing the multiplication table without learning the underlying
principles. When presented with two numbers she or he did not
memorize, the student is unable to produce the correct answer because the
concept of multiplication wasn’t truly “learned” in the first place.
The researchers used semi-automated spectral relevance analysis
(SpRAy) and layer-wise relevance propagation (LRP) to identify valid and
invalid machine learning problem-solving behaviors. In the first
experiment, the team created a machine learning model based on Fisher
vectors (FV) that was trained on the PASCAL VOC 2007 image dataset and
achieved excellent, state-of-the-art test set accuracy on select categories
such as car, train, person, and horse. A pretrained deep neural network
(DNN) was the model’s competitor. The team deployed LRP to inspect the
basis of the decisions and discovered that the FV heatmap focused on the
lower left corner of the image, whereas the DNN’s heatmap indicated the
horse and its rider as the relevant features. The FV model “cheated” by
relying on a source tag present on the horse images. The team confirmed
this by adding the horse source tag to a Ferrari image, and the FV model
incorrectly predicted the car image as a horse.
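To give a rough sense of how LRP produces such heatmaps, here is a minimal sketch in Python (NumPy only) of the LRP-epsilon rule applied to a toy fully connected ReLU network. The network, its random weights, and the lrp_dense helper are illustrative assumptions for this post, not the authors’ implementation; in practice the rule is applied layer by layer to a trained model such as the FV or DNN classifiers above.

```python
import numpy as np

def lrp_dense(a, W, b, R_out, eps=1e-6):
    # LRP-epsilon rule for one dense layer: redistribute the relevance R_out
    # of the layer's outputs back to its inputs, in proportion to each input's
    # contribution a_i * W_ij to the pre-activation z_j.
    z = a @ W + b                          # forward pre-activations, shape (out,)
    z = z + eps * np.sign(z)               # stabilizer to avoid division by zero
    s = R_out / z                          # relevance per unit of pre-activation
    return a * (W @ s)                     # shape (in,): a_i * sum_j W_ij * s_j

# Toy two-layer ReLU network with random weights; a hypothetical stand-in
# for a trained classifier.
rng = np.random.default_rng(0)
x = rng.random(8)                          # "image" features
W1, b1 = rng.normal(size=(8, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

h = np.maximum(0.0, x @ W1 + b1)           # hidden ReLU activations
logits = h @ W2 + b2

# Start the backward pass from the relevance of the predicted class only.
R_logits = np.zeros(3)
R_logits[np.argmax(logits)] = logits[np.argmax(logits)]

R_hidden = lrp_dense(h, W2, b2, R_logits)  # propagate through layer 2
R_input = lrp_dense(x, W1, b1, R_hidden)   # propagate through layer 1

print("input relevance (the 'heatmap' over features):", np.round(R_input, 3))
```

The per-input relevance scores produced by the final step are what get rendered as a heatmap over the image pixels, which is how the misplaced focus on the source tag became visible.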
In the second experiment, the researchers analyzed neural network
models trained to play Atari Pinball. The team once again deployed LRP
heatmaps in order to visualize the DNN’s behavior, and found that the
model has “learned to abuse the ‘nudging’ threshold implemented through
the tilting mechanism in the Atari Pinball software.” In a real game of
pinball, nudging the machine that way would trigger a tilt and the player
would most likely lose.
In both the first and second experiments, the team concluded that
“even though test set error may be very low (or game scores very high),
the reason for it may be due to what humans would consider as cheating
rather than valid problem-solving behavior,” and it “may not correspond
to true performance when the latter is measured in a real-world environment.”
In a third experiment, the researchers analyzed a DNN playing Atari
Breakout. The heatmaps showed the “unfolding of strategic behavior” and
“conspicuous structural changes during the learning process.”
In the final experiment, the researchers used SpRAy for whole-dataset
analysis of classification behavior in a semi-automated way. When the
team applied SpRAy to the horse images of the PASCAL VOC dataset,
they identified four classification strategies: detecting a horse and
rider, detecting a source tag in portrait-oriented images, detecting a
source tag in landscape-oriented images, and detecting contextual elements
such as a wooden hurdle. SpRAy was able to summarize the various strategies
the classifier was using without any human intervention.
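As a rough illustration of the idea behind SpRAy, the sketch below clusters flattened LRP heatmaps with spectral clustering so that images explained by similar regions land in the same group. The random heatmaps array, the choice of four clusters, and the scikit-learn call are assumptions made for this sketch; the published method additionally inspects the eigenvalue spectrum to decide how many distinct strategies are present.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical input: one LRP heatmap per "horse" image, flattened to a vector.
# In practice these would be downsized relevance maps from the trained classifier.
rng = np.random.default_rng(1)
heatmaps = rng.random((200, 32 * 32))      # 200 images, 32x32 relevance maps

# Normalize each heatmap so clustering compares spatial patterns, not magnitudes.
heatmaps /= np.linalg.norm(heatmaps, axis=1, keepdims=True)

# Spectral clustering groups images whose heatmaps highlight similar regions;
# each cluster then corresponds to one classification strategy (e.g. horse and
# rider, source tag, contextual hurdle).
clustering = SpectralClustering(n_clusters=4, affinity="nearest_neighbors",
                                n_neighbors=10, random_state=0)
labels = clustering.fit_predict(heatmaps)

for k in range(4):
    print(f"strategy cluster {k}: {np.sum(labels == k)} images")
```

A human then only needs to look at a few representative heatmaps per cluster to judge whether a strategy is legitimate or a “Clever Hans” shortcut.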
According to the researchers, their work “adds a dimension that has
so far not been a major focus of the machine intelligence discourse, but
that is instrumental in verifying the correct behavior of these models,
namely explaining the decision making,” and that there is a
“surprisingly rich spectrum of behaviors ranging from strategic decision
making (Atari Breakout) to ‘Clever Hans’ strategies or undesired
behaviors.”
Similar to how fMRI (functional magnetic resonance imaging) and optogenetics serve as windows into the human brain in neuroscience,
this AI research team has demonstrated a methodology for peering
into the complex artificial brain of machine
learning. That’s how a 19th century horse, who turned out to be
not so clever, may impact the future of artificial intelligence in the
21st century and beyond.