A fact about linear regression with a factorial predictor

A collaborator generated a set of RNA-seq data from the brains of B6 mice and a panel of congenic mice. Each congenic mouse carried one mutation that increases an individual’s risk of developing Alzheimer’s disease. The objective was to reveal the pathways that link the mutations to Alzheimer’s disease.

Data arrived in batches. The first batch was from B6 and one congenic strain, which I will call “strain X” for this post. To identify differentially expressed genes, I ran a t-test on log-transformed TPM between the two groups. Later on, with data from all the other strains in hand, I put the expression levels of each gene in all samples into a linear regression model, in which strain identity was included as a factorial predictor with B6 as the reference level. For each gene, this model returned estimates of the change from B6 to each of the congenic strains.

The collaborators quickly complained that the results for “strain X” from the second round of analysis were quite different from those of the first round, when the t-test was applied. About a quarter of the genes in the first list disappeared from the second. This did not surprise me at first, because I knew one sample of strain X had been removed from the second analysis because of a genotype issue. I expected this one sample to account for all the differences. To prove this, I applied the same t-test again on strain X without the problematic sample. I expected this new t-test to give me exactly the same results as the linear regression.

It turned out that this was not the case: the new t-test still disagreed with the full linear regression. Meanwhile, applying linear regression to only B6 and strain X returned the same result as the t-test, which was expected. So, including additional levels of a factorial predictor changes the results for the other levels of the predictor. How can this be? I knew that the parameter estimates for the existing levels do not change after additional levels are included. With the same parameter estimates, the residuals of the existing observations do not change either. I imagined the model was borrowing information among the levels, but how?

Unfolding the math of linear regression showed that the differences came from computing the “residual variance”. Including more levels in the factorial predictor changes the residual variance in two ways: through the sample size and through the degrees of freedom. The change in residual variance in turn changes each parameter’s standard error, and thereby its t and p values.
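As a compact sketch of that argument (not a reproduction of the attached PDF), the standard formulas for a one-way model with treatment coding are written out below. The notation is my own assumption for illustration: G strains including B6, n_g samples in strain g, N samples in total, y_{gi} the log expression of sample i in strain g, and \bar{y}_g the mean of strain g.

```latex
\[
\begin{aligned}
\hat{\beta}_g &= \bar{y}_g - \bar{y}_{\mathrm{B6}}
  && \text{(depends only on strain } g \text{ and B6)} \\
\hat{\sigma}^2 &= \frac{1}{N - G} \sum_{g=1}^{G} \sum_{i=1}^{n_g} \left( y_{gi} - \bar{y}_g \right)^2
  && \text{(pooled over all strains in the model)} \\
\mathrm{SE}(\hat{\beta}_g) &= \hat{\sigma} \sqrt{\frac{1}{n_{\mathrm{B6}}} + \frac{1}{n_g}},
  \qquad
  t_g = \frac{\hat{\beta}_g}{\mathrm{SE}(\hat{\beta}_g)}
  && \text{(compared against a } t \text{ distribution with } N - G \text{ d.f.)}
\end{aligned}
\]
```

With only B6 and strain X in the model (G = 2), these expressions reduce to the equal-variance two-sample t-test. Adding another strain leaves the estimate for strain X untouched, but it adds that strain's within-group sum of squares to the numerator of the residual variance and changes N - G in the denominator.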

All the details are included in this attached PDF.

To summarize, by putting data from different conditions into a single regression model, we implicitly assume that they all share the same sources of variation. We therefore expect a better estimate of that variation from more data, even data from another factorial level that seems completely independent of the other levels. As a matter of fact, any new data added to a model affects the estimated standard errors of all other parameters through the residual variance. It is also worth noting that including more data can make the estimated errors of other parameters either bigger or smaller.
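The last point can be checked with a small numerical sketch. The expression values and the strain names "Y" and "Z" below are made up purely for illustration, and the helper `contrast_vs_reference` is a hypothetical function, not part of the original analysis. It uses the closed-form solution of a one-way regression with treatment coding, so only the Python standard library is needed.

```python
from math import sqrt

def contrast_vs_reference(groups, reference, target):
    """Estimate, standard error and residual d.f. for target vs reference
    in a one-way regression with treatment coding (closed-form solution)."""
    n_total = sum(len(values) for values in groups.values())
    n_groups = len(groups)
    means = {name: sum(values) / len(values) for name, values in groups.items()}
    # Residual sum of squares: every group contributes its within-group spread.
    rss = sum((y - means[name]) ** 2
              for name, values in groups.items() for y in values)
    df = n_total - n_groups            # residual degrees of freedom
    sigma2 = rss / df                  # residual variance, pooled over all groups
    estimate = means[target] - means[reference]
    se = sqrt(sigma2 * (1 / len(groups[reference]) + 1 / len(groups[target])))
    return estimate, se, df

# Hypothetical log-expression values for one gene (illustrative only).
b6 = [5.0, 5.2, 4.8, 5.1]
strain_x = [5.6, 5.9, 5.5, 5.8]
quiet_y = [6.0, 6.01, 5.99, 6.0]   # hypothetical extra strain with low variance
noisy_z = [4.0, 7.0, 5.0, 8.0]     # hypothetical extra strain with high variance

two = contrast_vs_reference({"B6": b6, "X": strain_x}, "B6", "X")
with_quiet = contrast_vs_reference({"B6": b6, "X": strain_x, "Y": quiet_y}, "B6", "X")
with_noisy = contrast_vs_reference({"B6": b6, "X": strain_x, "Z": noisy_z}, "B6", "X")

for label, (est, se, df) in [("B6 + X", two),
                             ("B6 + X + quiet Y", with_quiet),
                             ("B6 + X + noisy Z", with_noisy)]:
    print(f"{label:18s} estimate={est:.3f}  se={se:.3f}  t={est / se:.2f}  df={df}")
```

In all three fits the estimated change from B6 to strain X is identical. Only the standard error and the degrees of freedom move: the low-variance extra strain shrinks the standard error for strain X, while the high-variance one inflates it.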

Xulong Wang, Ph.D., was a postdoctoral associate in The Carter Lab at The Jackson Laboratory. He covers spontaneous topics about his work on computational genetics and neurobiology. Follow him on Twitter @xulong82.