The concept of linearity in the linear model refers to linearity in the parameters, not in the predictors themselves. For example, \(y = \beta_0 + \beta_1 x + \beta_2 x^2\) is still a linear model because it is linear in the \(\beta\). A model becomes nonlinear when the parameters enter through nonlinear functions, e.g. \(\beta_1 ^{x_1}\), \(\sin(\beta_1 x_1)\) etc. When referring to a linearity violation, one is usually asking whether the relationship between a predictor and the outcome is adequately represented by a linear term.
For an example of a non-linear relationship, consider age versus height in humans. In the early to mid-years, the height/age relationship is roughly linear. However, from age 18 onwards, increases in age would not be expected to result in changes in height. This may change again much later in life, when height might start to decrease with increasing age. More generally, one could consider any relationship between an outcome and an explanatory variable that is not a straight line on the scale of the linear predictor. When this occurs, a one unit change in the explanatory variable results in a change in the outcome that varies depending on where in the range of the predictor the change was made.
Piecewise regression is a way to model such structure: breakpoints are defined (usually arbitrarily) along the range of the explanatory variable, allowing the relationship with the outcome to vary. Formally, a piecewise (linear) regression model with \(K\) breakpoints \(\tau_1 < \dots < \tau_K\) can be defined as:

\[
\begin{align*}
y_i = \beta_0 + \beta_1 x_i + \sum_{k=1}^{K} \delta_k (x_i - \tau_k)_+ + \epsilon_i
\end{align*}
\]

where the basis functions \((x - \tau_k)_+ = \max(0, x - \tau_k)\) are zero when \(x \le \tau_k\) and increase linearly when \(x > \tau_k\). The parameters associated with these basis functions only influence the mean when \(x_i > \tau_k\), and \(\delta_k\) represents the incremental change to the slope at the breakpoint \(\tau_k\). This form of model gives a kink in the functional form of the association at each \(x = \tau_k\), which allows the slope to change while ensuring continuity.
Figure 1 shows the basis functions for \(x\) over the \((0, 10)\) interval. The basis functions are an essential concept in the construction of splines.
Piecewise basis
```r
library(data.table)
library(ggplot2)

x <- seq(0, 10, len = 100)

# Breakpoints
tau1 <- 2
tau2 <- 4.5
tau3 <- 8.5

# Construct basis functions
x1 <- pmax(0, x - tau1)
x2 <- pmax(0, x - tau2)
x3 <- pmax(0, x - tau3)

d_fig <- data.table(cbind(x = x, basis_1 = x1, basis_2 = x2, basis_3 = x3))
d_fig <- melt(d_fig, id.vars = "x")

ggplot(d_fig, aes(x = x, y = value, group = variable, col = variable)) +
  geom_line() +
  scale_x_continuous("x", breaks = 0:10) +
  scale_y_continuous("f(x)") +
  scale_color_discrete("") +
  theme_bw() +
  theme(legend.position = "bottom")
```
Figure 1: Basis functions - piecewise regression
B-spline to P-splines
Selecting some arbitrary values for the parameters, data can be simulated as shown in Figure 2; these data will be used throughout. The components are:
- \(x\) is a continuous predictor
- \(\tau_1 = 2\), \(\tau_2 = 4.5\), \(\tau_3 = 8.5\) are the breakpoints defined earlier
- \(\beta_0 = 1\) intercept
- \(\beta_1 = 0.2\) initial slope
- \(\delta_1 = 1.0\) change in slope at \(\tau_1\)
- \(\delta_2 = -1.2\) change in slope at \(\tau_2\) (the mean flatlines here, since \(0.2 + 1.0 - 1.2 = 0\))
- \(\delta_3 = -0.8\) change in slope at \(\tau_3\)
- \(\sigma = 0.15\) standard deviation of the noise \(\epsilon\)
Data simulation
```r
N <- 250
d_1 <- data.table()
d_1[, x := sort(runif(N, 0, 10))]

# tau defined above

# Construct basis for piecewise reg
d_1[, u_1 := pmax(0, x - tau1)]
d_1[, u_2 := pmax(0, x - tau2)]
d_1[, u_3 := pmax(0, x - tau3)]

# Parameters
beta0 <- 1
beta1 <- 0.2    # initial slope
delta1 <- 1.0   # change in slope at tau1
delta2 <- -1.2  # change in slope at tau2
delta3 <- -0.8  # change in slope at tau3
sigma <- 0.15

# Generate response
d_1[, mu := beta0 + beta1 * x + delta1 * u_1 + delta2 * u_2 + delta3 * u_3]
d_1[, y := rnorm(.N, mu, sigma)]

d_fig <- copy(d_1)

ggplot(d_fig, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  scale_x_continuous("x", breaks = 0:10) +
  scale_y_continuous("f(x)") +
  theme_bw() +
  theme(legend.position = "bottom")
```
Figure 2: Simulated data
Estimating a regression function based on a piecewise specification is straightforward, but the lack of smoothness might be a concern in many settings. One might therefore opt for an approach that has smooth transitions across the range of \(x\) rather than kinks. In doing so, we make an assumption about the qualitative properties of the relationship, i.e. that the \(f\) in \(y = f(x) + \epsilon\) is smooth.
Roughly speaking, splines are piecewise polynomials that are continuously differentiable (i.e. smooth) up to a certain degree and connected at a sequence of breakpoints called knots. In contrast to global polynomial models, which combine higher-order terms for \(x\) (i.e. \(x^2\), \(x^3\) etc.), splines partition \(x\) into smaller intervals (as was done with the piecewise regression) and influence the fit locally, rather than globally.
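As a point of contrast before introducing splines, a global polynomial can be fit to the simulated data from earlier; a minimal sketch (the cubic degree is an arbitrary choice for illustration):

```r
# Global cubic fit - every observation influences the fitted curve everywhere
f_poly <- lm(y ~ poly(x, 3), data = d_1)
d_poly <- data.table(x = seq(0, 10, len = 200))
d_poly[, y_hat := predict(f_poly, newdata = d_poly)]
```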
A B-spline basis can be computed using a recursive algorithm, which is skipped here as the implementation is provided in several R packages. The basis matrix is formed as:

\[
\begin{align*}
B_{ij} = B_j(x_i), \quad i = 1, \dots, N, \; j = 1, \dots, Q
\end{align*}
\]

with one column per basis function evaluated at the observed \(x\) values, so that the model becomes \(y = B \gamma + \epsilon\).
The simulated data generated earlier can be used to form a B-spline basis matrix from the \(x\) values. The result can then be passed into a linear regression as the design matrix to obtain estimates for \(\gamma\).
If we choose \(K = 5\) knots and basis functions of degree 3, we end up with 8 columns in the basis matrix (assuming the intercept was dropped from the basis and included separately in the model, as is done by default when using lm).
To predict values for the fitted curve, we use a reference basis matrix created from sequential values of \(x\) over the range of interest and then run predict, see Figure 4 (A).
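The code that sets up the 5-knot basis inputs is not shown below; a minimal sketch, assuming the interior knots are evenly spaced over \((1, 9)\) to match the 50-knot version used later, is:

```r
library(splines2)

# 5 interior knots (assumed placement)
knots <- seq(1, 9, len = 5)

# Reference basis over a grid of x, used for prediction/plotting
x_ref <- seq(0, 10, len = 200)
d_B_ref <- data.table(x = x_ref, bSpline(x_ref, knots = knots, intercept = F))
names(d_B_ref) <- c("x", paste0("g_", 1:(ncol(d_B_ref) - 1)))
```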
The number and placement of knots clearly has an impact on the nature of the fit and the amount of curvature that can be represented. In this toy example, we know the true data generation process and thus can see that the spline has overfit the data. If the number of knots is increased to 50, overfitting is further exacerbated, as shown in Figure 4 (B). However, the extent of the wiggliness depends not only on the number of columns in the basis matrix, but also on the parameter values. Specifically, when the \(\gamma\) parameters are highly erratic, there will be more movement in the fitted curves.
The main idea of P-splines (penalised B-splines) is to take a large basis matrix and then apply a penalty to some measure of roughness. For the penalty, differences (first or second order) of neighbouring parameter values can be considered, from which roughness can be computed as the sum of squares. In a least squares setting, the roughness penalty enters the objective function, i.e.
\[
\begin{align*}
O = (y - B \gamma)^\prime W (y - B \gamma) + \lambda || D \gamma ||^2
\end{align*}
\]
where \(\lambda\) is a smoothing parameter controlling the strength of the penalty, \(D\) is a matrix difference operator and \(W\) corresponds to a set of regression weights.
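For intuition, with \(W\) taken as the identity and \(D\) a second-order difference operator, the minimiser of the objective has a closed form; a minimal sketch (the value of \(\lambda\) here is arbitrary):

```r
# Penalised least squares: gamma_hat = (B'B + lambda D'D)^{-1} B'y
B <- bSpline(d_1$x, knots = seq(1, 9, len = 50), intercept = TRUE)
D <- diff(diag(ncol(B)), differences = 2)  # second-order difference operator
lambda <- 10
g_hat <- solve(t(B) %*% B + lambda * crossprod(D), t(B) %*% d_1$y)
y_hat <- drop(B %*% g_hat)
```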
Model based on spline with 5 knots
```r
# Construct basis functions except exclude the intercept
d_B <- data.table(bSpline(d_1$x, knots = knots, intercept = F))
names(d_B) <- paste0("g_", 1:ncol(d_B))
d_1 <- cbind(d_1, d_B)

# x has now been replaced entirely by the linear combination of basis
# functions
fmla_1 <- as.formula(paste("y ~ ", paste(names(d_B), collapse = "+")))
f1 <- lm(fmla_1, data = d_1)

# predict y for the new x (as represented by the reference basis)
d_B_ref[, y_hat := predict(f1, newdata = d_B_ref)]

# save the plot
p1 <- ggplot(d_1, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  geom_line(data = d_B_ref, aes(x = x, y = y_hat), col = "red") +
  scale_x_continuous("x", breaks = 0:10) +
  theme_bw() +
  theme(legend.position = "bottom")
```
Model based on spline with 50 knots
```r
knots <- seq(1, 9, len = 50)
d_2 <- copy(d_1[, .(x, mu, y)])

# Construct basis functions except exclude the intercept
d_B <- data.table(bSpline(d_2$x, knots = knots, intercept = F))
names(d_B) <- paste0("g_", 1:ncol(d_B))
d_2 <- cbind(d_2, d_B)

# x has now been replaced entirely by the linear combination of basis
# functions
fmla_1 <- as.formula(paste("y ~ ", paste(names(d_B), collapse = "+")))
f1 <- lm(fmla_1, data = d_2)

x <- seq(0, 10, len = 200)

# Recreate reference basis with 50 knots
d_B_ref <- data.table(x = x, bSpline(x, knots = knots, intercept = F))
names(d_B_ref) <- c("x", paste0("g_", 1:(ncol(d_B_ref) - 1)))

# Predict outcome
d_B_ref[, y_hat := predict(f1, newdata = d_B_ref)]

# Out of interest: where the coefficient values sit relative to the
# modes of the basis functions on x
B_modes <- apply(d_B_ref[, .SD, .SDcols = names(d_B_ref) %like% "g_"], 2, which.max)
# d_B_ref$x[B_modes]

p2 <- ggplot(d_2, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  geom_line(data = d_B_ref, aes(x = x, y = y_hat), col = "red") +
  # geom_point(
  #   data = data.table(x = d_B_ref$x[B_modes], y = coef(f1)[-1]),
  #   aes(x = x, y = y), pch = 2, size = 2
  # ) +
  scale_x_continuous("x", breaks = 0:10) +
  theme_bw() +
  theme(legend.position = "bottom")
```
Figure 4: Simulated data and fitted splines with 5 knots (A) and 50 knots (B)
Bayesian variations
A Bayesian analogue of the B-spline fit is shown below. In the absence of any penalty, the posterior mean of the fitted curve aligns closely with the least squares fit, see Figure 5 and the parameter comparison. However, the results are sensitive to the specification of the priors for the basis matrix parameters.
```stan
data {
  int<lower=0> N;
  vector[N] y;
  // basis (excludes intercept)
  int Q;
  matrix[N, Q] B;
  // for prediction over domain of x
  int N_ref;
  matrix[N_ref, Q] B_ref;
  vector[2] prior_g;
  int prior_only;
}
parameters {
  real b0;
  real<lower=0> s_e;
  vector[Q] g;
}
model {
  target += normal_lpdf(b0 | 0, 2);
  target += normal_lpdf(g | prior_g[1], prior_g[2]);
  target += exponential_lpdf(s_e | 1);
  if (!prior_only) {
    target += normal_lpdf(y | b0 + B * g, s_e);
  }
}
generated quantities {
  vector[N_ref] mu = b0 + B_ref * g;
}
```
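Assuming the Stan program above is saved as bspline.stan, it might be compiled and fit with cmdstanr along the following lines; this is a minimal sketch where the file name, the values passed for prior_g and the reuse of the 50-knot basis objects (d_B, d_B_ref) from the previous section are all assumptions:

```r
library(cmdstanr)

m1 <- cmdstan_model("bspline.stan")

ld <- list(
  N = nrow(d_2), y = d_2$y,
  Q = ncol(d_B), B = as.matrix(d_B),
  N_ref = nrow(d_B_ref),
  B_ref = as.matrix(d_B_ref[, .SD, .SDcols = names(d_B_ref) %like% "g_"]),
  prior_g = c(0, 2),  # assumed prior location/scale for the basis coefficients
  prior_only = 0
)

f_bayes <- m1$sample(data = ld, chains = 1, refresh = 0)
```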
Running MCMC with 1 chain...
Chain 1 finished in 1.2 seconds.
Comparing a selection of frequentist and Bayesian model parameters
In a Bayesian context, a P-spline analogue would aim to replicate the essential aspects of the penalisation. One way to achieve this is to induce a dependency over the set of spline parameters through a random walk prior. Since P-splines commonly use second differences of the parameters as the input to the roughness measure, from which a derived value is used as a penalty, a second order random walk (RW2) prior is apt. Assuming equal spacing of the knots¹, we can specify a RW2 using a difference representation, which maps closely to a relatively efficient implementation.
RW2 assumes that the second differences of the latent process \(\theta_j\) are IID normal:

\[
\begin{align*}
\Delta^2 \theta_j = \theta_j - 2 \theta_{j-1} + \theta_{j-2} \sim N(0, \sigma_d^2)
\end{align*}
\]
In the implementation, we draw starting values for \(\theta_1\) and \(\delta_1[1]\) (the first element of \(\delta_1\), which has length \(Q-1\)) along with \(\delta_2\) (which has length \(Q-2\)) from our priors, and then reconstruct the first differences and the latent process as:

\[
\begin{align*}
\delta_1[j] &= \delta_1[j-1] + \sigma_d \, \delta_2[j-1] \\
\theta_j &= \theta_{j-1} + \delta_1[j-1]
\end{align*}
\]
The Stan implementation and fitted results are shown below, Figure 6. The model takes quite a while to fit under MCMC, but addresses some of the overfitting issues, even though we continue to fit the model using a basis matrix set up over 50 knots. Accepting that 50 knots is excessive for these data, the point is that we can reduce the effective dimension of the model via the RW2 prior.
```stan
data {
  int<lower=0> N;
  vector[N] y;
  // basis (excludes intercept)
  int Q;
  matrix[N, Q] B;
  // for prediction over domain of x
  int N_ref;
  matrix[N_ref, Q] B_ref;
  int prior_only;
}
parameters {
  real b0;
  real<lower=0> s_e;
  // offsets
  real z0;
  real z1;
  vector[Q-2] delta2;
  real<lower=0> s_d;
}
transformed parameters {
  vector[Q] g;
  vector[Q-1] delta1;
  g[1] = z0;
  delta1[1] = z1;
  for (i in 2:(Q-1)) {
    delta1[i] = delta1[i-1] + delta2[i-1] * s_d;
  }
  for (i in 2:Q) {
    g[i] = g[i-1] + delta1[i-1];
  }
}
model {
  target += normal_lpdf(b0 | 0, 10);
  // standard normal prior on the offsets
  target += normal_lpdf(z0 | 0, 1);
  target += normal_lpdf(z1 | 0, 1);
  target += normal_lpdf(delta2 | 0, 1);
  // prior on sd for rw
  target += exponential_lpdf(s_d | 1);
  target += exponential_lpdf(s_e | 1);
  if (!prior_only) {
    target += normal_lpdf(y | b0 + B * g, s_e);
  }
}
generated quantities {
  vector[N_ref] mu = b0 + B_ref * g;
}
```
Running MCMC with 1 chain...
Chain 1 finished in 78.5 seconds.
Figure 6: Unpenalised (A) vs Penalised (RW2) fit (B)
Imposing sparsity on deviations
A further restriction might be imposed on the deviations to induce sparsity in the parameters; here this is implemented via a mixture of normals (a spike and slab style prior).
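In symbols, each second difference receives a two-component normal mixture prior, mirroring the spike_slab_lpdf function in the Stan program below:

\[
\begin{align*}
\delta_2[j] \sim (1 - \pi) \, N(0, \sigma_{\text{spike}}^2) + \pi \, N(0, \sigma_{\text{slab}}^2)
\end{align*}
\]

where the spike component concentrates mass near zero and the slab accommodates the occasional larger shift.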
The first order difference components from the RW2 deviations, which form the skeleton of the fitted curve, are shown in Figure 7 for the RW2 model and the RW2 with sparsity model. While the sparsity is applied to the second order differences, the approach has also induced sparsity in the first order differences, whereby low magnitude shifts have been pulled closer to zero and their uncertainty reduced.
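For reference, the summaries behind a comparison like Figure 7 might be extracted along these lines; f_rw2 and f_ss are assumed cmdstanr fit objects for the RW2 and sparsity models:

```r
library(posterior)

# Posterior summaries of the first-order differences from each model
s_rw2 <- summarise_draws(f_rw2$draws("delta1"))
s_ss  <- summarise_draws(f_ss$draws("delta1"))
```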
The approach is shown in the implementation below; it can help avoid wiggliness in parts of the domain where the relationship is flat, see Figure 8.
```stan
functions {
  real spike_slab_lpdf(real delta, real pi, real sigma1, real sigma2) {
    real log_p1 = log1m(pi) + normal_lpdf(delta | 0, sigma1);
    real log_p2 = log(pi) + normal_lpdf(delta | 0, sigma2);
    return log_sum_exp(log_p1, log_p2);
  }
}
data {
  int<lower=0> N;
  vector[N] y;
  // basis (excludes intercept)
  int Q;
  matrix[N, Q] B;
  // for prediction over domain of x
  int N_ref;
  matrix[N_ref, Q] B_ref;
  int prior_only;
}
parameters {
  real b0;
  real<lower=0> s_e;
  // offsets
  real z0;
  real z1;
  vector[Q-2] delta2;
  real<lower=0, upper=1> pi; // mixture weight
  real<lower=0> s_spike;
  real<lower=0> s_slab;
}
transformed parameters {
  vector[Q] g;
  vector[Q-1] delta1;
  g[1] = z0;
  delta1[1] = z1;
  for (i in 2:(Q-1)) {
    // delta2 has its scale set directly by the spike and slab prior
    delta1[i] = delta1[i-1] + delta2[i-1];
  }
  for (i in 2:Q) {
    g[i] = g[i-1] + delta1[i-1];
  }
}
model {
  target += normal_lpdf(b0 | 0, 10);
  // standard normal prior on the offsets
  target += normal_lpdf(z0 | 0, 1);
  target += normal_lpdf(z1 | 0, 1);
  // mixing prob
  target += beta_lpdf(pi | 1, 1);
  target += exponential_lpdf(s_spike | 1);
  target += exponential_lpdf(s_slab | 1);
  for (i in 1:(Q - 2)) {
    target += spike_slab_lpdf(delta2[i] | pi, s_spike, s_slab);
  }
  target += exponential_lpdf(s_e | 1);
  if (!prior_only) {
    target += normal_lpdf(y | b0 + B * g, s_e);
  }
}
generated quantities {
  vector[N_ref] mu = b0 + B_ref * g;
}
```
Running MCMC with 1 chain...
Chain 1 finished in 79.5 seconds.
Warning: 1 of 1 chains had an E-BFMI less than 0.3.
See https://mc-stan.org/misc/warnings for details.
Figure 8: Penalised (RW2) (A) vs Penalised (RW2) with sparsity (B)
Other perspectives
There are many other approaches to spline-based modelling. For example, it is common to convert a spline model into an equivalent mixed model form, which imposes a shared distribution on the basis matrix parameters. An alternative to the B-spline is the low-rank thin plate spline, which uses a radial basis; I believe this is the default approach in brms and in mgcv.
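As an illustration of the latter, a low-rank thin plate regression spline fit via mgcv is a one-liner; a minimal sketch on the simulated data:

```r
library(mgcv)

# Thin plate regression spline ("tp" is the default basis for s())
f_gam <- gam(y ~ s(x, bs = "tp"), data = d_1)
```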
Footnotes
This is an important caveat; the approach needs to be changed somewhat if equal spacing is not possible.