Identifiability constraints

Author

David L Miller

Published

January 8, 2026

If you’ve spent a bit of time working with GAMs with multiple smooths in them, your mind might have strayed to thinking about how to make sure the different components of the model don’t “overlap”. That is: if each smoother in the model has the capacity to reproduce basically any function, then each effectively includes its own intercept. If we have multiple intercepts in the model then we can just add some number to one of them and subtract it from another and end up with the same model, right?

Not only does this problem make it hard to interpret the model coefficients, it also makes fitting pretty much impossible (we have an infinite set of possible solutions to our optimization problem, since we can add/subtract any number \(a\in\mathbb{R}\)).

Ignoring the GAM case for now, we can see a version of this problem occurs in linear models when we have a factor covariate. Naively writing down a model with an intercept and a categorical variable \(x\) with two levels (“A” and “B”, say) we would have the following model: \[\begin{equation} \mathbb{E}(Y_i) = \beta_0 + \beta_A \mathbb{I}(x=A) + \beta_B \mathbb{I}(x=B), \label{notidentifiable} \end{equation}\] where \(\mathbb{I}\) is an indicator function that is 1 when the expression inside it is true.

Model (\(\ref{notidentifiable}\)) is not identifiable: we can add some number to \(\beta_0\), then subtract it from \(\beta_A\) and \(\beta_B\) with no effect on \(\mathbb{E}(Y_i)\). Our usual way to deal with this problem in R is to use contrasts; we absorb one of the level coefficients into \(\beta_0\) and let the other be the difference (or “contrast”) between \(\beta_0\) and that level. So we might let \(\beta_0 = \beta_A\) and let \(\beta_{B^+}\) be the difference between levels A and B, giving the following model: \[\begin{equation} \mathbb{E}(Y_i) = \beta_0 + \beta_{B^+}\mathbb{I}(x=B). \label{identifiable-contrast} \end{equation}\] So the effect of the A level is \(\beta_0\) and the effect of the B level is \(\beta_0+\beta_{B^+}\). As an alternative, we could instead use the “index variable” approach, where we do away with \(\beta_0\) altogether and estimate only \(\beta_A\) and \(\beta_B\).1
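To make this concrete, here is a quick sketch in R (with simulated data and invented variable names) comparing the two parameterizations:

```r
# simulated data for a two-level factor (names invented for illustration)
set.seed(1)
x <- factor(rep(c("A", "B"), each = 10))
y <- c(1, 2)[x] + rnorm(20, sd = 0.1)

# contrast parameterization: (Intercept) is level A,
# xB is the A-to-B difference
coef(lm(y ~ x))

# index-variable parameterization: one coefficient per level, no intercept
coef(lm(y ~ x - 1))
```

Both fits give identical fitted values; they are just different parameterizations of the same model.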

Extending our categorical variable to three levels2, we have a naive model that looks like \[\begin{equation} \mathbb{E}(Y_i) = \beta_0 + \beta_A \mathbb{I}(x=A) + \beta_B \mathbb{I}(x=B) + \beta_C \mathbb{I}(x=C), \end{equation}\] which we can write in matrix form per (\(\ref{matform}\)) with (taking just the first six rows of the data with some imagined values) \[\begin{equation} \mathbf{X} = \pmatrix{1 & 1 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0\\ 1 & 0 & 0 & 1\\ 1 & 0 & 0 & 1 }, \label{simpleX} \end{equation}\] where we can see that the first column can be obtained as the sum of the other three. Our approach above, dropping a level and absorbing that effect into the overall intercept, is equivalent to dropping the second column here, leaving \[\begin{equation} \mathbf{X}^* = \pmatrix{1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1 }. \end{equation}\]
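We can check this rank deficiency directly in R, building the naive design matrix by hand (a sketch, using an arbitrary six-row dataset):

```r
x  <- factor(rep(c("A", "B", "C"), each = 2))
X1 <- model.matrix(~ x - 1)  # one indicator column per level
X  <- cbind(1, X1)           # naive design: intercept plus all three indicators

# the intercept column is the sum of the indicator columns,
# so X has 4 columns but only rank 3
ncol(X)      # 4
qr(X)$rank   # 3
```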

So, how can we translate this idea into a general approach? First let \(\mathbf{X}_1\) be the non-all-ones columns in \(\mathbf{X}\) (the last three columns in (\(\ref{simpleX}\)), for example), so we might now write \(\mathbf{X} = [\mathbf{1}, \mathbf{X}_1]\). We might then imagine some matrix, \(\mathbf{C}_1\), which maps the indicator columns to their contrasted versions, so that \(\mathbf{X}^* = [\mathbf{1}, \mathbf{X}_1\mathbf{C}_1]\). In our example case, we can write \[ \mathbf{C}_1 = \pmatrix{0 & 0\\ 1 & 0\\ 0 & 1 }. \] We refer to \(\mathbf{C}_1\) as a contrast3 and there are a number of different rules we can use to form different contrasts that give different interpretations. You can see which ones your R session uses with the following:

```r
options("contrasts")
```

```
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly" 
```
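contr.treatment is exactly the rule that produced \(\mathbf{C}_1\) above; we can pull out the matrix and check that applying it to the indicator columns reproduces what model.matrix does by default:

```r
x  <- factor(rep(c("A", "B", "C"), each = 2))
X1 <- model.matrix(~ x - 1)  # full indicator matrix

(C1 <- contr.treatment(3))   # the contrast matrix C_1 from above

# [1, X1 C1] matches the default (contrasted) design matrix
all.equal(cbind(1, X1 %*% C1), model.matrix(~ x),
          check.attributes = FALSE)
# TRUE
```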

With these examples in mind, we can start to think about what happens when we have splines (or more generally structured random effects) as terms in our models and how we might generalize the above ideas to allow for automatically identifiable models.

In one dimension

When we add a smoother to our model, we might write it as \(s(x)\) (a smooth function, \(s\), of a variable, \(x\)). We model \(s(x)\) as a set of basis functions, so we can decompose \(s(x)\) into a sum of (\(k=1,\ldots,K\)) basis functions (\(b_k\)) multiplied by coefficients (\(\beta_k\)) that we estimate: \[ s(x)=\sum_{k=1}^K \beta_k b_k(x). \]
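In mgcv we can look at these evaluated basis functions directly via smoothCon (a sketch with simulated x; the k argument sets the basis dimension \(K\)):

```r
library(mgcv)

set.seed(1)
dat <- data.frame(x = runif(100))

# evaluate the K = 10 basis functions b_k at each observed x;
# absorb.cons = FALSE keeps the raw, unconstrained basis
sm <- smoothCon(s(x, k = 10), data = dat, absorb.cons = FALSE)[[1]]
dim(sm$X)  # 100 rows (one per datum), 10 columns (one per basis function)
```

Given coefficients \(\boldsymbol{\beta}\), the evaluated smooth is then just sm$X %*% beta.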

Thin-plate splines as an example

The exact form of the \(b_k\)s will depend on the basis we choose. For the default thin-plate regression splines in mgcv (Wood, 2003), the basis can be thought of in two parts (here simplifying to just one dimension to avoid headaches). Splitting the \(K\) functions into sets of lengths \(K_1\) and \(K_2\) (such that \(K=K_1+K_2\), with elements indexed by \(k_1\) and \(k_2\)4): \[\begin{equation} s(x) = \sum_{k_1=1}^{K_1} \delta_{k_1} \eta_{k_1}(\lvert \lvert x-x_{k_1} \rvert \rvert) + \sum_{k_2=1}^{K_2} \alpha_{k_2} \phi_{k_2}(x), \label{tprs-basis-1d} \end{equation}\] where the first summation contains (locally-acting) radial basis functions5 (\(\eta_{k_1}\), functions of the radial distance between \(x\) and the \(k_1^\text{th}\) knot, \(x_{k_1}\)), with corresponding coefficients \(\delta_{k_1}\) which are penalized (they are in the range space of the penalty). The second summation contains (globally-acting) orthogonal polynomials \(\phi_{k_2}\) with coefficients \(\alpha_{k_2}\) which are not penalized (they lie in the nullspace of the penalty, so even with arbitrarily high levels of smoothing, they remain in the model).

In constructing the polynomial part of (\(\ref{tprs-basis-1d}\)) in one dimension we include the terms \(\phi_1(x)=1\) and \(\phi_2(x)=x\). As in the case with categorical variables above, \(\phi_1(x)=1\) will cause an issue, as it’s effectively going to cause us to estimate a second intercept term. As such, we need to remove it to ensure our model fits and we can interpret the results.

General approach using constraints

Thinking about our problem in terms of matrices, we start by writing down our model in matrix notation: \[\begin{equation} \mathbb{E}(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta} \label{matform} \end{equation}\] where \(\mathbf{y}\) is the response, \(\mathbf{X}\) is the design matrix and \(\boldsymbol{\beta}\) are the parameters to be estimated. We form \(\mathbf{X}\) by evaluating \(\eta_{k_1}\) for \(k_1 \in 1,\ldots,K_1\) and \(\phi_{k_2}\) for \(k_2 \in 1,\ldots,K_2\) at each datum (giving \(K\) values) and placing the values in the rows of \(\mathbf{X}\). The identifiability problem manifests itself in the design matrix by having columns which can be formed as linear combinations of each other.
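We can see this linear dependence for a smooth plus an intercept using the unconstrained thin-plate basis (again a sketch with simulated data):

```r
library(mgcv)

set.seed(1)
dat <- data.frame(x = runif(100))

# raw thin-plate basis: its nullspace includes a constant column
sm <- smoothCon(s(x, k = 10), data = dat, absorb.cons = FALSE)[[1]]
X  <- cbind(1, sm$X)  # add a model intercept

# the constant basis function duplicates the intercept,
# so the rank is one less than the number of columns
ncol(X)      # 11
qr(X)$rank   # 10
```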

Back to factors

How can we deal with identifiability in a general way? We can recast our identifiability constraint as a constraint on our optimization problem. The constraint we need is that the coefficients are such that the smooth sums to zero over the observed data, \(\sum_i s(x_i) = 0\), or in matrix terms \(\mathbf{1}^\top \mathbf{X}\boldsymbol{\beta} = 0\), which removes the smooth’s ability to act as an intercept.
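We can check that mgcv’s default sum-to-zero constraint does its job: once the constraint is absorbed into the basis (as gam() does internally), every column of the smooth’s design matrix sums to zero over the data, so no linear combination of them can reproduce the intercept (a sketch with simulated data):

```r
library(mgcv)

set.seed(1)
dat <- data.frame(x = runif(100))

# absorb.cons = TRUE reparameterizes the basis so that 1'X beta = 0 holds
sm <- smoothCon(s(x, k = 10), data = dat, absorb.cons = TRUE)[[1]]

max(abs(colSums(sm$X)))  # effectively zero (up to floating point)
```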

Wood (2017) Section 1.8.1 covers this for linear models

Tensor products

Bayesian stuff

McElreath, R. (2020) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second edition. Chapman & Hall/CRC texts in statistical science series. Boca Raton: CRC Press.
Wood, S. N. (2003) Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65, 95–114.
Wood, S. N. (2017) Generalized Additive Models. An Introduction with R. 2nd ed. Texts in Statistical Science. CRC Press.

Footnotes

  1. This approach is particularly useful in a fully Bayesian setting where we might want to set the same prior on each level. See McElreath (2020) section 5.3.1 for a friendly introduction. Wood (2017) Section 1.8.2 covers this in more mathematical detail.↩︎

  2. This example is borrowed from Wood (2017) Section 1.8.2.↩︎

  3. The term contrast generally means a linear combination of variables whose coefficients sum to zero; see [the Wikipedia page](https://en.wikipedia.org/wiki/Contrast_(statistics)).↩︎

  4. Usually, we would set \(K_1=n\), the number of observations in the data, then do a reduced rank approximation using an eigen decomposition, but we don’t care about these technicalities here. \(K_2\) (usually referred to as \(M\)) is defined as \({m+d-1 \choose d}\), where \(m\) is the penalty order and \(d\) is the dimension of the problem. Again, let’s ignore these technical details here and assume second order penalties in 1 dimension, as would be usual, making \(K_2=2\).↩︎

  5. Let’s not worry about their exact definitions here. You can look at Wood (2003) if you really want to know.↩︎