Logistic Regression in Yorick
(takes the skull) Alas, poor Yorick! I knew him, Horatio, a fellow of infinite jest, of most excellent fancy. He hath borne me on his back a thousand times, and now, how abhorred in my imagination it is! My gorge rises at it.
-Hamlet: Act 5 Scene 1 Page 8
Welcome back! In this post we'll be exploring another array based language. However, unlike J this language is not packed into as few characters as possible.
Yorick was created by physicist David Munro in 1996 at Lawrence Livermore national lab. It was designed for numerical computation and mathematical simulation. It was fairly easy to install and has a nice C-like syntax.
You simply fire up the repl and can start coding
$ yorick Copyright (c) 2005. The Regents of the University of California. All rights reserved. Yorick 2.2.04x ready. For help type 'help' >
> print, "Hello, World!"
Hello, World!
My first order of business was to see if I can plot something. I ran > plg, random(2)
. The docs suggested that a graphics window should open and display the desired plot.
I spent a few hours trying to get this to work with little success. I opened an issue on Github and decided to cheat a little. I wrote my own simple plotting utility.
func plot(a, b){ file = text_csv("data_file.csv", a, b); close,file // system, "scatter --f data_file.csv"; system, "python3 plot.py data_file.csv"; }
Yes, yes, I know. I wrote a python file that reads a csv (written to via Yorick's text_csv
function) and plots the contents. I have no illusions of scalability, I just wanted something a bit quick.
It's really unfortunate I couldn't get plotting to work natively, there is a large body of work online that suggests Yorick's plotting is top notch. If any of my readers have a solution, it would be greatly appreciated.
However, other than this small hiccup, coding in Yorick was a joy.
Logistic Regression
Imagine you are a young data scientist, cleaning data like a digital detergent, fitting regression lines left and right when you come across a problem. What is the likelihood that a physical therapy patient has back pain given a number of measurements?
Not to fear, you are armed with regression!
$$y = Wx + b$$
You solve for $W$ via gradient descent, or use an exact solution... but then you look back at the problem statement and see that you need "likelihood".
Well, you likely remember that likelihood is a synonym for probability. How do we get probabilities? That's where the logistic part of logistic regression comes in to play.
The logistic function $f: \mathbb{R} \to [0,1]$ $$\frac{1}{1 + e^{-x}}$$ can be used to interpret real values as probabilities.
In our case, we take the beautiful equation from above $y = Wx+b$ and wrap it in a logistic function to obtain probabilities. $$\frac{1}{1 + e^{-(Wx + b)}}$$
Boom, so by using that equation we can get the likelihood of someone having back pain (yes or no) given a number of measurements.
The code!
We'll start by including our hand made plotting function, and a linear regression utility
#include "plot.i" #include "regress.i"
We then need to read in our data, Yorick makes this process extremely simple.
df = text_cells("back_pain_data.csv", ",")(1:-1,);
Here we are slicing off the last column which is not useful for our purposes. Something important to note, and disappointing for me is that Yorick is 1-indexed.
> a = random(10) > print, a [0.0891647,0.0719207,0.739632,0.191635,0.350893,0.597056,0.420973,0.608159, 0.607138,0.0986435] > a(0) 0.0986435 > a(1) 0.0891647 > a(-1) 0.607138
Many high powered scientific languages are 1-indexed, so the choice makes sense.
I then decided that the built in numberof
function was insufficient. It returned a full count (similar to #
in J) of all the items in this list.
So I wrote a small utility shape
function for working with 2-d arrays. Again, it doesn't scale very well, but it worked for my purposes.
func shape(df){ if(numberof(df) > 0){ return [numberof(df(1,)), numberof(df(,1))]; } }
This allowed me to see row and column information of the newly created dataframe
> shape(df) [310,13]
Excellent! I spent several hours trying to figure out how to cast strings to integers. I felt so silly, I scoured the documentation (or so I thought) and couldn't find anything.
I brought this up to some colleagues, and within 3 minutes they had found the tonum
function... That's what I get for coding instead of sleeping, lesson learning.
We can now cast various columns of df
no numeric types with no trouble. I started by converting the proper labels to a binary representation.
/* convert the labels to binary digits*/ df(0,)(where(df(0,) == "Normal")) = "0" df(0,)(where(df(0,) == "Abnormal")) = "1" df = tonum(df)
This takes the last column and sets the values according to a where criterion. The where( == )
function returns a mask that we can use to index into our list.
I wrote another small utility function, head
, that allows me to peak at the array.
func head(df){ print, df(, 1:6) }
> head, df [[63.0278,22.5526,39.6091,40.4752,98.6729,-0.2544,0.744503,12.5661,14.5386, 15.3047,-28.6585,43.5123,1],[39.057,10.061,25.0154,28.996,114.405,4.56426, 0.415186,12.8874,17.5323,16.7849,-25.5306,16.1102,1],[68.832,22.2185,50.0922, 46.6135,105.985,-3.53032,0.474889,26.8343,17.4861,16.659,-29.0319,19.2221,1], [69.297,24.6529,44.3112,44.6441,101.868,11.2115,0.369345,23.5603,12.7074, 11.4245,-30.4702,18.8329,1],[49.7129,9.65207,28.3174,40.0608,108.169,7.9185, 0.54336,35.494,15.9546,8.87237,-16.3784,24.9171,1],[40.2502,13.9219,25.1249, 26.3283,130.328,2.23065,0.789993,29.323,12.0036,10.4046,-1.51221,9.6548,1]]
For the curious, the column values are
["pelvic_incidence","pelvic tilt","lumbar_lordosis_angle","sacral_slope","pelvic_radius","degree_spondylolisthesis","pelvic_slope","Direct_tilt","thoracic_slope","cervical_tilt","sacrum_angle","scoliosis_slope","Status"]
I'm not going to focus on what each column represents in this post, but in general they are physical measurements relating to the back and body.
We're in the home stretch now. We just need to split our independent and dependent variables.
y = df(0,) x = df(1:-1,)
I tried for quite some time to shuffle the data, I even wrote a shuffle function that I was pretty proud of:
func randint(low, high, n){ return floor(low + (high - low) * random(n)) } func shuffle(li_in){ li = li_in for(i=1;i < numberof(li);i++){ r = sread(totxt(i + (randint(0,numberof(li)+3,1)(1) % (numberof(li) - i))), "i") tmp=li(i) li(i)=li(r) li(r)=tmp } return li }
However, the sread
function was being a bit finicky, and would only spit out "doubles" and I needed a "long" to index into the array properly. So I decided not to pursue that further.
So in this analysis, performance will purely be measured on the training set, which is generally bad practice, so we won't really be able to measure generalization of our model.
However! In spite of the small set back, we can now find our $W$ values using the build in OLS regression function.
w = regress(y, grow(transpose(x), 1))
We include a column of 1s on the end to account for the bias term.
We can then run inference by using these weights.
func infer(x_i, w){ x_i = grow(transpose(x_i), 1) y_hat = x_i(,+) * w(+) return (1 / (1 + exp(-y_hat))) }
These three lines add a column of 1s to the data we want to infer. Then we do an element wise dot product. Yorick allows us to specify which dimension operations occur on with a (+)
symbol and commas.
We then return the logistic value for the data, which will likely be an array if you are working with more than a single data point.
The output of this function is a collection of probabilities which can be used as you see fit. I decided to threshold them for classification, just to showcase.
y_hat(where(y_hat > .61)) = 1 y_hat(where(y_hat < .61)) = 0
Then, if we run this on our training data, we see
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] [1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,1,1,0,1,1,1,0,0,1,0,0,1,1,0,0, 0,1,1,0,0,0,1,1,0,1,1,1,1,1,0,1,0,0,0,1,1,1,1,0,0,1,0,1,0,0,1,0,1,0,1,1,0,1,1, 0,0,0,0,0,0,0,1,1,1,1,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0,1,0,1,1,0,0,1,1,0,1,0]
Which can be summarized using an accuracy function
func accuracy(y, y_hat){ return tonum(totxt(sum(y_hat == y))) / tonum(totxt(numberof(y))) }
> print(accuracy(y, y_hat))
0.816
Wrapping up
In conclusion, I enjoyed Yorick somewhat. I felt its type system was far too Dynamic for my taste. I often found myself confused at the type of object returned or expected by various functions. However, there are a number of really great utilities for working with numerical data in a scientific computing setting. It's definitely a language I'll try again sometime.