Data Normalization in Pointless
“Artificial Intelligence Is Pointless without Human Intelligence.”
― Khalid Masood
Pointless was written and designed by Avery Nortonsmith as a "scripting language for fun and learning". He created this fun little language as the functional language he would have liked to learn with as a beginner.
There is a lot to like about Pointless. The name, the logo, and the functionality are all full of personality, which makes the language a joy to use. It is heavily inspired by the ML languages, but without a Hindley-Milner type system.
Every Pointless program terminates at a sink named output:
output = [1, 2, 3] |> println
You also have the extremely useful pipe operator |> that makes writing complex code simple.
As a small starting example: I was thinking earlier about the distribution of digits in various real numbers. Are there more even or odd digits in a real number's decimal expansion?
A Bayesian would tell you to expect a $50\%$ even/odd split if the digits are uniformly distributed. They seem to be so distributed, at least in the natural numbers, so it may be reasonable to expect the same to hold in a decimal expansion.
I decided to learn Pointless a little better by testing this theory on $\pi$. In a previous post we calculated the digits of $\pi$. Now we'll just get them from the internet.
Using this site I got several hundred thousand digits of $\pi$ and stored them in a text file.
The first task for Pointless: reading from a file.
At present, Pointless provides two input sources - lines read from stdin, and random numbers.
- https://ptls.dev/docs.html#input%20commands
That's easy enough. Just pipe the file in with cat pi.txt | ptls digits.ptls, where ptls is a custom alias I made to call the Pointless interpreter.
stdin is read with a number of commands. I found readLines, which lazily reads from stdin, to work quite well.
output = readLines |> println
The pipe operator is a mainstay in functional programming and it allows you to "chain" operations together. Instead of setting conditionals and loops as you would in most imperative languages, you can simply send data down the pipe. This is not universally a good idea, but it generally works well and is a fun challenge.
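For instance, here is a minimal sketch contrasting nested calls with pipes (double is a hypothetical helper, not a standard library function):
-- a hypothetical helper: double a single number
double(n) = n * 2

-- the nested style reads inside-out: println(map(double, [1, 2, 3]))
-- the piped style reads in the order the data flows:
output = [1, 2, 3] |> map(double) |> println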
After we have the digits of $\pi$ in a list (one digit per line in the text file), we convert each to a float.
output = readLines |> map(toFloat) |> println
Then, we can perform our parity check using a custom function isEven(n) = n % 2 == 0
output = readLines |> map(toFloat) |> map(isEven) |> println
This gives a list of boolean values [false, false, true, false, false, false, true, true, false, false, ...] which maps, as expected, to the digits $3.141592653$. Excellent. We can then sum these values up and divide by the total count.
output = readLines |> map(toFloat) |> map(isEven) |> map(toInt) |> reduceFirst(add) |> println
This gives us $0.49775$, which seems to confirm our hypothesis, at least without doing anything much more formal.
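For completeness, here is a sketch of how the final division could be folded into the program itself; it assumes the list produced by readLines can be traversed twice (once for the sum and once for length):
isEven(n) = n % 2 == 0

-- fraction of even digits: sum of ones over the total count
evenRatio(li) = total / length(li)
  where { total = li |> map(toInt) |> reduceFirst(add) }

output = readLines |> map(toFloat) |> map(isEven) |> evenRatio |> println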
Less Strange
It's time for the data science portion: normalization.
Oftentimes, when working with data, it has some crazy outliers. These can be very large or very small values, and on either end they can muddle the statistics of the data during analysis.
Therefore, we want to normalize the data, that is, shift it about into a more manageable state. There are a number of different normalization schemes; we'll briefly explore three here.
Standard score normalization
The first scheme uses population statistics for normalization. It is often used when the data is believed to be normally distributed, since it makes use of the $(\mu, \sigma)$ parameters.
The formula $\frac{X - \mu}{\sigma}$ operates, in one dimension, on our data $X$. It shifts by the mean and scales by the standard deviation. This nice, scale-invariant method can be written in Pointless as follows.
mean(li) = reduceFirst(add, li) / length(li)

std(li) = map(sub(mean(li)), li) |> map(pow(2)) |> mean |> pow(0.5)
We start by defining simple mean and standard deviation functions, using several of the powerful standard library functions that come with Pointless to manipulate our data functionally.
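As a quick sanity check, a sketch like this should print the mean and standard deviation of a small list:
nums = [1, 2, 3, 4]

-- mean = 10 / 4 = 2.5; std = sqrt(1.25), roughly 1.118
output = [mean(nums), std(nums)] |> println
A standard deviation of roughly $1.118$ lines up with the $\pm 0.447$ z-scores for the middle values that we'll see below.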
Then, we use these definitions to subtract the mean and divide by the standard deviation.
normalizeStdScore(li) =
  map(sub(mu), li) |> map(mul((-1))) |> map(div(std(li)))
  where { mu = mean(li) }
This shows a good and a bad feature of Pointless. The good is the where clause, which allows us to define useful pieces of reusable code. The bad is that sub is backwards from what I'd expect, so we have to multiply by $-1$.
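To make the quirk concrete, here is a small sketch: sub(2) appears to compute 2 - x rather than x - 2, so mapping it flips the sign we want.
-- sub(2) computes 2 - x, so this prints [1, 0, -1] rather than [-1, 0, 1]
output = map(sub(2), [1, 2, 3]) |> println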
output = li |> normalizeStdScore |> println
This code will transform [1, 2, 3, 4] to [-1.3416407864998738, -0.44721359549995793, 0.44721359549995793, 1.3416407864998738], which shows how the data gets reshaped. Pointless has a fun little plotting utility, but it doesn't have a y-axis scale, so it doesn't help our understanding of how the data is scaled.
Min-Max feature normalization
The second method squashes all the values of your data into the interval $[0, 1]$. This is often done when you want to restrict the range of some operation. It is used in feature engineering to make a feature more robust to outliers.
The equation $\frac{X - X_{min}}{X_{max} - X_{min}}$ again operates on our data $X$ but this time doesn't need knowledge of the population parameters. It simply uses information from the data itself.
Again, in just a few lines of Pointless code (lol)
minMaxFeatureScaling(li) =
  map(sub(minVal), li) |> map(mul((-1))) |> map(div(maxVal - minVal))
  where {
    minVal = minimum(li)
    maxVal = maximum(li)
  }
we can implement Min-Max normalization.
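We invoke it the same way as before; a quick usage sketch:
output = [1, 2, 3, 4] |> minMaxFeatureScaling |> println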
Now, we transform [1, 2, 3, 4] to [-0, 0.33333333333333331, 0.66666666666666663, 1], which looks closer to my intuitive grasp of normalization.
Quantile normalization
Finally, if you have a source distribution and you want to match its statistics with some target distribution for comparison (for example, when controlling for socioeconomic status), then you can use quantile normalization.
There isn't a nice one-line equation; essentially, you sort both datasets and replace each value with the average of the correspondingly ranked values.
quantileNormalization(li1, li2) =
  for n in range(length(li1)) yield (at(n, s1) + at(n, s2)) / 2
  where {
    s1 = sort(li1)
    s2 = sort(li2)
  }
This then "pulls" the statistics of li1 to match li2. Notice how everything is functional, and how we use at() for list indexing.
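A small sketch of the averaging in action, where sorting pairs the values by rank:
-- [5, 3, 1] sorts to [1, 3, 5]; [2, 4, 6] is already sorted
-- rank-wise averages: [(1+2)/2, (3+4)/2, (5+6)/2]
output = quantileNormalization([5, 3, 1], [2, 4, 6]) |> println
This should print [1.5, 3.5, 5.5].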
And so...
Normalization is a very useful tool in a data scientist's toolbox. It is used for feature engineering and statistical data exploration. Pointless is a fun language that makes difficult tasks simple through a clean functional workflow.
Check out Pointless and subscribe to the email newsletter if you'd like to receive updates when I publish more posts.