Pubsplained #1: How do you fit a straight line through a set of points with uncertainty in both directions?

Publication

Thirumalai, K., Singh, A., & Ramesh, R. (2011). A MATLAB™ code to perform weighted linear regression with (correlated or uncorrelated) errors in bivariate data. Journal of the Geological Society of India, 77(4), 377–380. 
doi: 10.1007/s12594-011-0044-1

Summary

We present a code that fits a line through a set of points (“linear regression”). It is based on math first described in 1966 that provides general and exact solutions to the multitude of linear regression methods out there. Here is a link to our code.

Pubsplainer

Fitting a straight line through a bunch of points with X and Y uncertainty.

My first peer-reviewed publication in the academic literature described a procedure to perform linear regression, or, in other words, build a straight line (of “best fit”) through a set of points. We wrote our code in MATLAB and applied it to a classic dataset from Pearson (1901).

“Why?”, you may ask, perhaps followed by “doesn’t MATLAB have linear regression built into it already?” or “wait a minute, what about polyfit?!”

Good questions, but here’s the kicker: our code numerically solves this problem when there are errors in both x and y variables… and… get this, even when those errors might be correlated! And if someone tells you that there is no error in the x measurement or that errors are rarely correlated - I can assure you that they are most probably erroneous.

York was the first to find general solutions for the “line of best fit” problem when he was working with isochron data where the abscissa (x) and ordinate (y) axis variables shared a common term (and hence resulted in correlated errors). He first published the general solutions to this problem in 1966 and subsequently published the solutions to the correlated-error problem in 1969.

If these solutions were published so long ago, why are there so many different regression techniques detailed in the literature? Well, it’s always useful to have different approaches to solving numerical problems, but as Wehr & Saleska (2017) point out in a nifty paper from last year, the York solutions have largely remained internal to the geophysics community (in spite of 2000+ citations), escaping even the famed “Numerical Recipes” textbooks. Furthermore, they state that there is abundant confusion in the isotope ecology & biogeochemistry community about the myriad available linear regression techniques and which one to use when. I can somewhat echo that feeling when it comes to calibration exercises in the (esp. coral) paleoclimate community. A short breakdown of these methods follows.

Ordinary Least Squares (OLS) or Orthogonal Distance Regression (ODR) or Geometric Mean Regression (GMR): which one to use?!

Although each one of these techniques might be more appropriate for certain sets of data versus others, the ultimate take-home message here is that all of these methods are approximations of York’s general solutions, when particular criteria are matched (or worse, unknowingly assumed).

  • OLS provides unbiased slope and intercept estimates only when the x variable has negligible errors and when the y error is normally distributed and does not change from point to point (i.e. no heteroscedasticity).

  • ODR, formulated by Pearson (1901), works only when the variances of the x and y errors do not change from point to point, and when the errors themselves are not correlated. ODR also fails to handle scaled data, i.e. slopes and intercepts derived from ODR do not rescale accordingly if the x or y data are scaled by some factor. Note that ODR is also called “major axis regression”.

  • GMR transforms the x and y data and can thus rescale estimates of the slope and intercept, but it works only when the ratio of the standard deviation of x to the standard deviation of the error in x is equal to the corresponding ratio for y.
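To make the differences concrete, here is a quick numerical sketch of the three slope formulas on synthetic data. Everything below (the data, variable names, and the scaling check) is invented for illustration; this is not the published MATLAB routine:

```python
# Toy comparison of OLS, GMR, and ODR (major-axis) slopes on synthetic data.
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 10, 50)                       # true x values
x = t + rng.normal(0, 0.4, t.size)               # observed x, with error
y = 2.0 * t + 1.0 + rng.normal(0, 0.8, t.size)   # observed y, with error

sxx, syy = np.var(x), np.var(y)
sxy = np.cov(x, y, bias=True)[0, 1]
r = sxy / np.sqrt(sxx * syy)

b_ols = sxy / sxx                                # OLS slope (y on x)
b_gmr = np.sign(r) * np.sqrt(syy / sxx)          # GMR: geometric mean of the two OLS slopes
b_odr = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)  # major axis

# ODR is not scale-invariant: rescale x by c and the slope does not rescale by 1/c,
# whereas the OLS slope rescales exactly.
c = 10.0
sxx_c, sxy_c = np.var(c * x), np.cov(c * x, y, bias=True)[0, 1]
b_odr_c = (syy - sxx_c + np.sqrt((syy - sxx_c) ** 2 + 4 * sxy_c ** 2)) / (2 * sxy_c)
b_ols_c = sxy_c / sxx_c
```

Note that |b_ols| = |r|·sqrt(syy/sxx) can never exceed |b_gmr| = sqrt(syy/sxx), which is one way to see that OLS attenuates the slope when x carries error.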

Most importantly, and perhaps quite shockingly, NONE of these methods involve the actual point-to-point measurement uncertainty in the construction of the ensuing regression. Essentially, each method is an algebraic approximation of York’s equations, and although his equations have to be solved numerically in their most general form, they provide the most unbiased estimates of the slope and intercept for a straight line. In 2004, York and colleagues showed that his 1969 equations (based on least-squares estimation) were also consistent with (newer) methods based on maximum likelihood estimation when dealing with (correlated or uncorrelated) bivariate errors. Our paper in 2011 provides a relatively fast way to iteratively solve for the slope and intercept.
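York’s iterative scheme is compact enough to sketch here. Below is a minimal Python translation following the notation of York et al. (2004); the published code is in MATLAB, and the function name `york_fit` and all variable names are my own. It is run on Pearson’s (1901) data with York’s (1966) weights, the classic benchmark:

```python
# Minimal sketch of York's iterative solution for a line with weighted,
# (optionally) correlated errors in both x and y.
import numpy as np

def york_fit(x, y, wx, wy, r=0.0, tol=1e-10, max_iter=100):
    """wx, wy are weights 1/sigma_x**2 and 1/sigma_y**2; r is the
    per-point correlation between the x and y errors."""
    x, y, wx, wy = (np.asarray(v, dtype=float) for v in (x, y, wx, wy))
    r = np.broadcast_to(r, x.shape)
    alpha = np.sqrt(wx * wy)
    b = np.polyfit(x, y, 1)[0]                    # initial slope from OLS
    for _ in range(max_iter):
        W = wx * wy / (wx + b**2 * wy - 2.0 * b * r * alpha)
        xbar = np.sum(W * x) / np.sum(W)
        ybar = np.sum(W * y) / np.sum(W)
        U, V = x - xbar, y - ybar
        beta = W * (U / wy + b * V / wx - (b * U + V) * r / alpha)
        b_new = np.sum(W * beta * V) / np.sum(W * beta * U)
        if abs(b_new - b) < tol:
            b = b_new
            break
        b = b_new
    # recompute the weighted centroid with the converged slope
    W = wx * wy / (wx + b**2 * wy - 2.0 * b * r * alpha)
    xbar = np.sum(W * x) / np.sum(W)
    ybar = np.sum(W * y) / np.sum(W)
    return b, ybar - b * xbar                     # slope, intercept

# Pearson's (1901) data with York's (1966) weights
x = [0.0, 0.9, 1.8, 2.6, 3.3, 4.4, 5.2, 6.1, 6.5, 7.4]
y = [5.9, 5.4, 4.4, 4.6, 3.5, 3.7, 2.8, 2.8, 2.4, 1.5]
wx = [1000, 1000, 500, 800, 200, 80, 60, 20, 1.8, 1.0]
wy = [1.0, 1.8, 4.0, 8.0, 20.0, 20.0, 70.0, 70.0, 100.0, 500.0]
slope, intercept = york_fit(x, y, wx, wy)
# slope ≈ -0.4805, intercept ≈ 5.4799, the accepted values for this dataset
```

With r = 0 this reduces to York’s 1966 weighting; a nonzero r handles the correlated-error case of the 1969 paper.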

In our publication, besides the Pearson data, we also applied our algorithm to perform “force-fit” regression - a special case where one point is almost exactly known (i.e. very little error and near-infinite weight) - on meteorite data and showed that our results were consistent with published values.
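The force-fit idea is easy to demonstrate with plain weighted least squares (standing in here for the York-based routine in the paper): give one point an enormous weight and the fitted line gets pinned to it. The data and weights below are made up for the demo:

```python
# Illustrative "force-fit": one near-infinitely weighted point pins the line.
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 4.3, 5.8, 8.4, 10.1])

w = np.ones_like(x)
w[0] = 1e6    # near-infinite weight: treat (0.5, 1.2) as almost exactly known

# np.polyfit minimizes sum((w_i * (y_i - p(x_i)))**2) for the given weights
slope, intercept = np.polyfit(x, y, 1, w=w)
residual_at_anchor = y[0] - (slope * x[0] + intercept)
# the fitted line passes essentially through the anchored point
```

The remaining points then determine the slope subject to the line passing (almost) exactly through the anchor.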

All in all, if you want to fit a line through a bunch of points in an X-Y space, you won’t be steered too far off course by using our algorithm.

References

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

Wehr, R., & Saleska, S. R. (2017). The long-solved problem of the best-fit straight line: application to isotopic mixing lines. Biogeosciences, 14(1), 17–29.

York, D. (1966). Least-squares fitting of a straight line. Canadian Journal of Physics, 44(5), 1079–1086.

York, D. (1969). Least squares fitting of a straight line with correlated errors. Earth and Planetary Science Letters, 5, 320–324.

York, D., Evensen, N. M., Martínez, M. L., & De Basabe Delgado, J. (2004). Unified equations for the slope, intercept, and standard errors of the best straight line. American Journal of Physics, 72(3), 367–375.

#Pubsplained

I am introducing a new series on this blog called Pubsplained, where I plan on breaking down my peer-reviewed publications into (more) digestible blog-posts. The motivation for this is threefold:

  1. To see if it’s possible to broaden the audience of some of these manuscripts
  2. To be more productive on Paleowave
  3. To “keep in touch” with my older publications.

The idea is to provide an accessible summary (perhaps a tweet-length synopsis) of our publications, and also a little more background on the topic, including problems and challenges, for those who might be interested.

Podcast Review: Bubble

There’s a new podcast on the block called Bubble, put out by Maximum Fun and it has been an immensely enjoyable ride so far. Bubble is set inside the city of Fairhaven, in the not-so-distant future (or alternative present), where a giant bubble protects the town from the “brush” outside. The brush is a wild landscape with exotic plants, psychedelic herbs, and deadly prehistoric monsters — some of which manage to get inside Fairhaven from time to time. It is also filled with mysterious peoples who have shunned the cushy life inside the bubble and tend to fend for themselves; fierce, proud, and earthy. If you’re inside the bubble, however, you’ll find yourself in a millennial utopia(/dystopia) replete with the uber-ification of pretty much everything, including monster hunters with IG profiles for when the terrors of the Brush decide to show up in your house; but make sure you give them 5 stars only if you think their special powers are totes entertaining. The main plot revolves around Morgan, Mitch, Annie, and Van Joyce - late 20/30 somethings who are somewhat unwittingly pushed into the business of monster hunting. 

Where Bubble succeeds is its potent mixture of sci-fi landscapes and cyberpunk charm, bolstered by the depth of the characters and the quirks of the city. Ultimately, the podcast relies on a strong plotline with sharp, tongue-in-cheek, absurd, and self-deprecating humor dotted along the way, which fans of BoJack Horseman, Arrested Development, or 30 Rock will not find out of place. Bubble is the podcast version of Archer set inside a Transmetropolitan-lite world, seen through the eyes of Broad City's lead characters. Oh and also, Bubble has a star-studded cast with many guest appearances. The arc is set to last for a total of eight episodes, and Episode 6 came out this week — so it’s relatively easy to catch up.

Go check it out.