Lambdas and Tuples and (Panda) Bears… Oh my!

Lili Beit
Published in Analytics Vidhya
7 min read · Mar 29, 2021

“Sorry, I can’t talk now. I’m iterating on a random forest that’s orthogonal to your value proposition.”

Techy people talk funny.

At least that’s what it feels like when you enter the tech world from a not-quite-technical background. In this story, I’ll explore the origins and meanings of some interestingly named Data Science terms.

When I started learning to code for Data Science, I was told that I would need Python. But in order to get Python, I would need Anaconda. Once I had both Anaconda and Python, I should then import pandas. And I began to wonder if all data scientists are obsessed with exotic animals.

Photo by David Clode on Unsplash

Python

As you may know, the name ‘Python’ is not a reference to constrictor snakes, but to its creator’s fondness for Monty Python. Guido van Rossum was reading scripts from the TV series “Monty Python’s Flying Circus” while working on his new programming language. As the official Python documentation explains in its Frequently Asked Questions, “Van Rossum thought he needed a name that was short, unique, and slightly mysterious, so he decided to call the language Python.”

Anaconda

Anaconda Inc’s documentation does not explain the origins of its name. But since Anaconda is a distribution of the Python programming language, one could surmise that its creators named their product after another long constrictor snake to highlight its relationship to Python.

Pythons are found in Africa, Asia, and Australia, while anacondas live in South America, but the two have a lot in common. Both can reach about 9 meters (30 ft) in length, and both are constrictors, meaning they wrap around and suffocate their prey. Both are heavy, but the anaconda outweighs the python and can weigh up to 250 kilograms (550 pounds). Anacondas can also grow to more than 30 centimeters (12 in) in diameter. Yikes! I’m glad I’m just using the software package.

Pandas

Once I set up Anaconda and Python, I was finally ready to import pandas. I live in Washington DC, where the National Zoo has imported pandas from China since the 1970s. According to the Smithsonian Institution, the preferred method of international travel for pandas is a FedEx cargo plane equipped with a zookeeper and panda snacks.

Photo by Sid Balachandran on Unsplash

Oh, wait. They meant pandas, as in the Python library used to analyze structured data sets. Not the giant, bamboo-eating bears (they really are bears, by the way). According to Wes McKinney, the library’s author, “The library’s name derives from panel data, a common term for multidimensional data sets encountered in statistics and econometrics.” Ah, I see: pan + da. Knowing this derivation does help me understand the library’s primary purpose, which is to provide an easy way to store and manipulate tabular data in Python. It also helps explain why the standard alias for pandas is “pd”.

Seaborn

Like Python’s author, the creator of Seaborn drew inspiration from popular culture. Michael Waskom named his library of statistical visualization tools after fictional character Samuel Norman Seaborn of the TV series “The West Wing”. Hence, the library’s standard alias is its eponym’s initials, sns.

Matplotlib

Seaborn is built on the visualization library Matplotlib, whose name, in contrast, makes perfect sense. This Python library for making great-looking plots was modeled on the plotting interface of MATLAB, an older programming language, according to the Matplotlib User’s Guide. MATLAB’s name, in turn, is short for Matrix Laboratory, according to its author, Cleve Moler of MathWorks. Interestingly, the word matrix derives from the Latin word mater (mother), and its original Latin meaning was “a female animal used for breeding” according to Merriam-Webster. This meaning inspired James Sylvester to use “matrix” to describe a rectangular array of numbers in 1850:

“I have in previous papers defined a “Matrix” as a rectangular array of terms, out of which different systems of determinants may be engendered as from the womb of a common parent.” (The Collected Mathematical Papers of James Joseph Sylvester: 1837–1853, Paper 37, p. 247)

Photo by Antoine Dautry on Unsplash

Tuple

As a beginner coder, I began to encounter unfamiliar words. For example, what is a “tuple” and why does Python keep accusing me of passing them?

A tuple in Python is an ordered, immutable collection of objects: once it is created, its items cannot be added, removed, or replaced. A Python list, by contrast, can be changed in place. Tuples are usually written with parentheses, while lists are written with square brackets.
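
Here is a minimal sketch of that difference. The variable names are just mine for illustration, and the last two lines show a detail that tripped me up: it is really the comma, not the parentheses, that makes a tuple.

```python
# A list can be changed after it is created.
snakes_list = ["python", "anaconda"]
snakes_list.append("boa")            # fine: lists are mutable

# A tuple cannot.
snakes_tuple = ("python", "anaconda")
try:
    snakes_tuple[0] = "boa"          # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)         # Python raises a TypeError here

# The comma makes the tuple; parentheses alone do not.
not_a_tuple = ("python")             # just the string "python"
one_item_tuple = ("python",)         # trailing comma makes a one-item tuple
```

If Python "accuses" you of passing a tuple, a stray comma is often the culprit.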

The word derives from the suffix -tuple used in English words such as quintuple, sextuple, septuple, and the generic term n-tuple.

Float

I was also surprised by the data type “float”, which Python uses for numbers with a fractional part, such as 3.14. Why not just call it a decimal? The answer is that “float” derives from the term “floating-point arithmetic”, which is the specific way that computers approximate real numbers. A number is stored as a fixed number of significant digits plus an exponent that scales it, so the decimal point can “float”, or be placed anywhere.
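
A small sketch, using only Python's standard `math` module, shows both halves of the idea: a float is a scaled number, and it is only an approximation of the real number you typed. The variable names are my own.

```python
import math

# Scientific notation is the same idea in base 10: digits plus an exponent.
fifteen_hundred = 1.5e3                    # 1.5 * 10**3

# math.frexp splits a float into the pieces the computer works with:
# value == mantissa * 2 ** exponent
mantissa, exponent = math.frexp(8.0)       # 8.0 == 0.5 * 2**4

# Because 0.1 and 0.2 have no exact binary form, tiny errors creep in.
total = 0.1 + 0.2
exactly_equal = (total == 0.3)             # False, surprisingly
close_enough = math.isclose(total, 0.3)    # True: compare with a tolerance
```

This is why comparing floats with `==` can bite, and why a tolerance-based check such as `math.isclose` is usually the safer habit.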

Boolean

Similarly, I could think of many more sensible names for “Boolean”, a data type which has exactly two possible values, True and False. We can thank George Boole, who invented Boolean algebra, for this term. It turns out we have a lot to thank him for, as his work on symbolic logic in the 19th century helped lay the foundation for modern computer science.

Lambda

While “pandas” derives from “panel data”, a lambda expression has nothing to do with lambs or data, although that would be adorable. A lambda expression in Python creates a small, unnamed function on the fly, often one that is then applied to each item in a collection such as a pandas Series. Mathematicians will recognize this term, as it derives from lambda calculus. Alonzo Church’s choice of the Greek letter lambda to represent the concept of a generic function derives from the fact that he initially used a caret, as in ^x, then changed it to λx for ease of printing. (http://www.users.waitrose.com/~hindley/SomePapers_PDFs/2006CarHin,HistlamRp.pdf, page 9)
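
To see what “on the fly” means in practice, here is a minimal sketch in plain Python. The names and numbers are made up for illustration.

```python
# A lambda is a one-line function without a name. These two are equivalent:
def double_named(x):
    return x * 2

double = lambda x: x * 2

# Lambdas are handy when a function is needed once, for example to
# transform every item in a list...
lengths = [30, 250, 900]
doubled = list(map(lambda x: x * 2, lengths))          # [60, 500, 1800]

# ...or to tell sorted() what to sort by (here, the length of each word).
names = ["anaconda", "python", "boa"]
by_length = sorted(names, key=lambda name: len(name))  # ['boa', 'python', 'anaconda']
```

In pandas, the same pattern usually shows up when a lambda is passed to a method that runs it on each value in a column.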

Photo by Evan Dennis on Unsplash

Random Forest

After getting the hang of lambdas, tuples, and panda bears (oh my), I was ready to begin running machine learning models, such as a random forest. The name “random forest” is probably the least random term in Data Science. This model consists of a group of decision trees (the forest), each one trained on a random sample of the same data and using randomly selected features to predict an outcome. The predictions of all the trees are then combined, by majority vote for classification or by averaging for regression, to produce the final prediction.
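
To make the idea concrete, here is a toy sketch in plain Python. It is not how a real library does it: each “tree” is only a single-split stump that looks at one randomly chosen feature, the measurements are invented, and the function names (`train_stump`, `train_forest`, `predict`) are my own. For real work you would reach for a library such as scikit-learn.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split 'tree': find the threshold on one feature with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def train_forest(rows, labels, n_trees=25, seed=42):
    """Train many stumps, each on a random resample of rows and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]   # sample rows with replacement
        feature = rng.randrange(len(rows[0]))              # choose one feature at random
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest

def predict(forest, row):
    """Let every stump vote, then return the most common answer."""
    votes = [left if row[feature] <= threshold else right
             for feature, threshold, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Invented data: (length in meters, weight in kilograms) for two made-up groups.
rows = [(1.0, 2.0), (1.5, 3.0), (2.0, 2.5), (2.5, 4.0),
        (7.0, 60.0), (8.0, 75.0), (8.5, 90.0), (9.0, 80.0)]
labels = ["small"] * 4 + ["large"] * 4

forest = train_forest(rows, labels)
prediction = predict(forest, (8.2, 70.0))
```

Each stump on its own is a weak guesser, and one trained on an unlucky sample can be wrong. The vote across the whole forest is what makes the final answer dependable.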

Iterating

Machine learning models need tuning to produce optimal results, which means running them over and over with minor changes to see which version is best. This is why data scientists use the word “iterating” all the time. The term “iterate” has a few different definitions, the most basic of which is “to repeat,” and derives from the Latin word iterum (again). In computer science, to iterate means “make repeated use of a mathematical or computational procedure, applying it each time to the result of the previous application.” Many machine learning algorithms iterate in exactly this way until their results stop improving. Apparently, people can also iterate: they can tune a model repeatedly and learn from their results to improve the model. Hence the phrase, “Let’s iterate on this model.”
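
The computer science definition is easier to see in code. The sketch below uses a classic example that is not from machine learning, the Babylonian method for square roots, where each guess is fed back in to produce a better one. The function name and the number of steps are my own choices.

```python
def babylonian_sqrt(number, steps=6):
    """Estimate the square root of a positive number by repeated refinement."""
    guess = number / 2                         # a rough starting point
    for _ in range(steps):
        # Apply the same procedure to the result of the previous application.
        guess = (guess + number / guess) / 2
    return guess

root_of_two = babylonian_sqrt(2.0)             # very close to 1.41421356...
```

Each pass through the loop is one iteration. Tuning a model works the same way, except that the “procedure” is a person changing a setting and rerunning the model.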

Orthogonal

Oh, orthogonal. Somehow this term drifted out of math and statistics into software engineering, and from there into business and legal jargon. I appreciate how super fun it is to say “orthogonal”, but I don’t think it provides a more precise meaning than more commonly understood words, such as “unrelated” or “independent.” But it does sound really cool.

“Orthogonal” started as a math term meaning at right angles, and derives from the Greek orthos (upright) and gonia (angle). In math, “orthogonal” is used to describe perpendicular lines, as well as vectors which are at right angles, whose dot product is 0, and which do not influence each other. In statistics it applies to variables which are uncorrelated (based on the vector definition above). In software engineering (and now many other fields), it has been used to describe two pieces of a project that do not influence each other. Its use confused Supreme Court Chief Justice John Roberts in 2010, when an attorney used it in an argument.
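
The dot product test is simple enough to sketch in a few lines of plain Python. The helper `dot` and the example vectors are mine, chosen for illustration.

```python
def dot(u, v):
    """Multiply matching components and add up the results."""
    return sum(a * b for a, b in zip(u, v))

east = (1, 0)
north = (0, 1)
northeast = (1, 1)

east_vs_north = dot(east, north)          # 0: at right angles, so orthogonal
east_vs_northeast = dot(east, northeast)  # 1: not orthogonal, they share a direction
```

Walking east does nothing to how far north you are, which is the sense of “independent” that the jargon borrows.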

Photo by Markus Spiske on Unsplash

Men

While I did not set out to explore the history of math and computer science, it is striking that all the mathematical concepts and programming libraries in this story were named by men. Even the popular culture references (Monty Python, The West Wing) were written mostly by men about men. I recently googled why the Data Science library scikit-learn uses random state 42 so often in its documentation, and found that 42 is a reference to “The Hitchhiker’s Guide to the Galaxy” books, also written by a man about men. I am in no way accusing these men of purposely promoting a male-dominated culture, merely noting its pervasiveness in the world of computers and programming. It is not comforting that the matrix (a foundational concept in computer science) is named after ‘a female animal used for breeding.’ I hope that soon more women write programming languages and libraries, and choose interesting names for them.
