## Similarity Examples

### A Triangular Digression

An entire branch of mathematics focuses on shape factors for a particular kind of geometric figure, the triangle.  Trigonometry defines an array of functions that describe the interrelationships of angles and shape factors, the latter being ratios of the lengths of two sides of a right triangle (a triangle with 90° = $$\pi/2$$ radians as one of its angles).  The best-known functions are the sine, cosine, and tangent.

Albert Einstein is said to have proven the theorem of Pythagoras using the similarity principle.  If you think that ought to be easy, try it yourself before reading on.  The proof is simple and elegant:

We wish to prove that the lengths of the sides of the right triangle obey the following relationship:

$$\Large A^2+B^2=C^2$$

Line D, drawn from the right-angle vertex perpendicular to the hypotenuse C, partitions the large triangle into two smaller ones and divides the hypotenuse into segments C' (adjacent to side A) and C'' (adjacent to side B).  Each smaller triangle is geometrically similar to the large outer triangle because it has exactly the same three angles.  Each has a 90° angle, by construction, and each shares an angle with the original; therefore two angles of each small triangle are identical to those of the large one.  Since the three interior angles of every triangle sum to 180°, all three angles are the same for all three triangles; they are all geometrically similar.  Now we can invoke the similarity principle to prove the theorem of Pythagoras.

From the similarity principle:

$$\Large D/A = C''/B = B/C$$

and

$$\Large C'/A = D/B = A/C$$

The letters represent the lengths of the sides indicated.  Using simple algebra:

$$\Large C''=B^2/C$$ and $$\Large C'= A^2/C$$

Since $$\Large C''+C'=C$$

$$\Large B^2/C+A^2/C=C$$ and

$$\Large B^2+A^2=C^2$$
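The algebra can be checked numerically.  The following is a minimal sketch, assuming a 3-4-5 right triangle (an illustrative choice, not taken from the text) and a hypothetical helper name; it is a sanity check on the ratios, not a substitute for the proof.

```python
import math

def hypotenuse_segments(a, b, c):
    """Apply the similarity ratios to a right triangle with legs a, b and hypotenuse c.

    Returns (d, c1, c2): the altitude D and the hypotenuse segments C' and C''.
    """
    d = a * b / c    # from D/A = B/C
    c1 = a * a / c   # from C'/A = A/C
    c2 = b * b / c   # from C''/B = B/C
    return d, c1, c2

# Illustrative 3-4-5 triangle (an assumed example).
A, B, C = 3.0, 4.0, 5.0
D, C1, C2 = hypotenuse_segments(A, B, C)

# The two segments must rebuild the hypotenuse: C' + C'' = C.
print(D, C1, C2, math.isclose(C1 + C2, C))

# The remaining ratios in each chain agree as well.
print(math.isclose(D / A, C2 / B), math.isclose(D / B, C1 / A))
```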

#### Half Empty or Half Full?

For the next example of the similarity principle, consider the extremely practical problem of a martini glass filled with your favorite form of liquid refreshment.  At each level of fullness, the liquid fills a circular cone, and each such cone is geometrically similar to all of the others.  Suppose your glass is only "half full", that is, the height of the liquid is only one half of the original full-to-the-brim level.  Even though you know that the volume of a cone is $$V = \pi r^2 h/3$$, what is your guess, based on the similarity principle, as to how much fluid is left?

The correct answer is that the glass contains only 1/8 of the original volume!  That's because volume is proportional to length cubed, as we have seen; the answer is instantly apparent from the similarity principle.  It can also be derived the long way around.  The formula for cone volume can be re-expressed in terms of the height alone, since $$r = k h$$ (the similarity principle), where $$k$$ is the shape factor.  Substituting $$k h$$ for $$r$$ gives $$V = \pi (k h)^2 h/3 = \pi k^2 h^3/3$$.  If $$h$$ is then replaced by half its original value, $$V = \pi k^2 (h/2)^3/3 = (\pi k^2 h^3/3)/8$$, i.e. 1/8 of the original volume.

The lower section of the figure displays the martini glass when it is actually half full, i.e. holding half of its full volume.  What would you guess the correct height of the liquid to be?

ANSWER:  $$h/2^{1/3} \approx 0.79\,h$$, i.e. almost 80% of the original height.
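Both answers can be reproduced in a few lines.  This is a sketch only; the shape factor `k` and full height `h` below are made-up values, and the function names are hypothetical.  Because the cones are similar, the results do not depend on the values chosen.

```python
import math

def cone_volume(radius, height):
    """Volume of a right circular cone, V = pi r^2 h / 3."""
    return math.pi * radius ** 2 * height / 3.0

def volume_fraction_at(height_fraction):
    """Fraction of the full volume when the liquid height is scaled."""
    return height_fraction ** 3

def height_fraction_for(volume_fraction):
    """Fraction of the full height needed to hold a given volume fraction."""
    return volume_fraction ** (1.0 / 3.0)

# Hypothetical glass: shape factor k = r/h and full height h are assumed.
k, h = 0.75, 8.0
full = cone_volume(k * h, h)
half_height = cone_volume(k * (h / 2), h / 2)

print(half_height / full)         # ratio from the explicit cone formula
print(volume_fraction_at(0.5))    # the same ratio from similarity alone
print(height_fraction_for(0.5))   # liquid height when half the volume remains
```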

#### All Boxed Up

Consider now a rectangular box with edges L, W, and H (length, width, and height).  Besides being conceptually simple, the box analogy that follows exhibits the same difficulties that arise when normalizing dimensional data for body size, e.g. measurements from echocardiograms.  Unlike spheres, boxes need not all be the same shape.  The volume of each box is $$V = L W H$$, and we can quickly derive two additional characteristic lengths from the geometry: the diagonal, $$d = (L^2 + W^2 + H^2)^{1/2}$$, and a length derived from the volume, $$L_V = (L W H)^{1/3}$$.

Rectangular boxes come in different shapes, a fact captured by shape factors such as $$L/W$$, $$W/H$$, $$H/L$$, $$L/d$$, $$L/L_V$$, etc.  The choice of which box edge is labeled L, W, or H is arbitrary in this case.  We are going to focus on $$L/L_V$$ because $$L_V$$ is one of the global lengths of the box; it depends on all of the other lengths.  We'll allow the shape of the boxes to vary somewhat and see that some interesting statistical issues arise.
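As a quick illustration, the characteristic lengths and a few shape factors can be computed directly.  The box dimensions and function names below are hypothetical; the point of the sketch is that scaling every edge by the same factor leaves all of the shape factors unchanged.

```python
import math

def characteristic_lengths(length, width, height):
    """Return (V, d, L_V) for a rectangular box."""
    volume = length * width * height
    diagonal = math.sqrt(length ** 2 + width ** 2 + height ** 2)
    volume_length = volume ** (1.0 / 3.0)
    return volume, diagonal, volume_length

def shape_factors(length, width, height):
    """Dimensionless ratios of lengths that describe the box shape."""
    _, diagonal, volume_length = characteristic_lengths(length, width, height)
    return {
        "L/W": length / width,
        "W/H": width / height,
        "L/d": length / diagonal,
        "L/L_V": length / volume_length,
    }

small = shape_factors(2.0, 3.0, 4.0)      # hypothetical box
large = shape_factors(20.0, 30.0, 40.0)   # same shape, ten times the size
for name in small:
    print(name, small[name], large[name])
```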

In the first example, we're letting the shape of the boxes vary minimally and plotting the length of one edge, L, against the volume, V.

Anyone unfamiliar with the geometric issues would readily accept that the edge length data are represented quite well by a linear regression, $$L = m V + B$$, with box volume, V, as the independent variable and m and B the regression parameters corresponding to the slope and intercept of the regression line, respectively.  For a sufficiently restricted range of V, this doesn't even cause any discernible statistical anomalies.  Nevertheless, the green line depicts the actual nonlinear relationship between L and V.  We also know from the foregoing discussion, and from the formula $$V = L W H$$, that the relationship between V and L passes through the origin.  The linear regression has a positive intercept on the y axis, indicating nonsensically that edge length can be positive when volume is 0.  Anyone conducting an uninformed experiment, i.e. without considering the geometry, might conclude that length and volume are linearly related.  A more informed approach is a power regression of the form $$L = A V^B$$, where A and B are the regression parameters (B here is the exponent, not the intercept of the linear fit); the statistically determined values are shown with the figure.  The actual average shape factor used to create the data was $$L/L_V = 0.8$$, and the actual exponent is 1/3 because of the nonlinear relationship between length and volume.  Hence the power regression procedure provided good estimates of the actual parameters in this case.
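The comparison can be imitated with synthetic data.  The sketch below is not the data set behind the figure: the sample size, size range, noise level, and random seed are all assumptions, and the power regression is done here by ordinary least squares on the logarithms, which is one common way to fit $$L = A V^B$$ but may differ from the procedure used for the plot.

```python
import math
import random

def least_squares_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Assumed design: shape L/L_V scatters slightly around 0.8, independent of size,
# and the boxes span only a narrow range of sizes.
rng = random.Random(0)
volumes, lengths = [], []
for _ in range(200):
    l_v = rng.uniform(2.0, 3.0)       # cube root of volume
    shape = rng.gauss(0.8, 0.02)      # shape factor L/L_V
    volumes.append(l_v ** 3)
    lengths.append(shape * l_v)

# Linear regression L = m V + intercept.
m, intercept = least_squares_line(volumes, lengths)

# Power regression L = A V^B, fitted as ln L = ln A + B ln V.
log_v = [math.log(v) for v in volumes]
log_l = [math.log(l) for l in lengths]
exponent, log_a = least_squares_line(log_v, log_l)
coefficient = math.exp(log_a)

print("linear: slope", m, "intercept", intercept)
print("power:  A", coefficient, "B", exponent)
```

With these assumptions the linear fit reports a clearly positive intercept, while the power fit returns values near the design parameters of 0.8 and 1/3.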

Since length and volume are not linearly related, we have only to include a sufficient range of box sizes for the true nature of this relationship to be apparent and for the linear regression method to fail.

These data could still be represented ostensibly by a linear regression.  However we begin to see obvious visual evidence of a statistical anomaly that precludes valid linear regression for this example.   Below we see the result of plotting the residual of the regression; the residual is defined as the difference between the actual value and the regression prediction.  The expectation, i.e. requirement for a valid regression procedure, is that the residual points are distributed randomly with an expected value (average) of 0.

This residual plot does not depict a random distribution but a quadratic dependency, i.e. a curve that crosses the zero line twice as shown.  The red line shows the result of another regression, of the residual data against V, of the form $$R = A + B V + C V^2$$, where R is the residual and A, B, and C are the regression parameters.  If the value of C is significantly nonzero (in a statistical sense), then there is significant quadratic dependency, which indicates that the original data are not well represented by a linear regression.  This procedure should be performed every time you attempt to regress data to a formula!  It is one of the ways to evaluate whether the regression procedure is valid.
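The residual check can be sketched as follows.  To keep the outcome unambiguous, the example assumes noise-free data generated from $$L = 0.8\,V^{1/3}$$ over an arbitrary size range, and it fits the polynomials with a small hand-written normal-equation solver (hypothetical helper names).  The statistical test of whether C differs significantly from zero is not included; with real, noisy data that step is essential.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def polynomial_fit(xs, ys, degree):
    """Least squares polynomial coefficients, lowest order first."""
    size = degree + 1
    matrix = [[sum(x ** (i + j) for x in xs) for j in range(size)]
              for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(matrix, rhs)

# Assumed noise-free boxes: L_V runs from 1 to 4, L = 0.8 * L_V, V = L_V^3.
volume_lengths = [1.0 + 3.0 * i / 49 for i in range(50)]
volumes = [l_v ** 3 for l_v in volume_lengths]
lengths = [0.8 * l_v for l_v in volume_lengths]

# Step 1: linear regression of L on V, then its residuals.
line = polynomial_fit(volumes, lengths, 1)
residuals = [l - (line[0] + line[1] * v) for v, l in zip(volumes, lengths)]

# Step 2: regress the residuals on V with R = A + B V + C V^2.
quadratic = polynomial_fit(volumes, residuals, 2)

# Count how often the residual curve crosses the zero line.
signs = [r > 0 for r in residuals]
crossings = sum(1 for s, t in zip(signs, signs[1:]) if s != t)

print("residual fit A, B, C:", quadratic)
print("zero crossings:", crossings)
```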

We also observe that the linear regression parameters in the first example differ substantially from those in the second, whereas the power regression parameters are virtually the same and, in fact, are even closer to the "actual" values than in the first example.  The linear regression parameters will differ for every substantially different range of data because the actual ("local") slope and intercept change.  We know in this case that the data are represented by the formula $$L = A V^B$$, where $$A = 0.8$$ and $$B = 1/3$$ by construction.  Differentiating the formula (calculus) shows that the slope of this relationship is $$dL/dV = A B V^{B-1} \approx 0.267\,V^{-2/3}$$; the slope approaches infinity at small values of V and approaches 0 for very large values of V.  Hence we cannot hope to determine any characteristic features of the relationship from a linear regression because the data are not representable by a straight line.
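A short calculation makes the range dependence concrete.  The volumes below are arbitrary illustrative values and the function names are hypothetical; the sketch evaluates the local slope of $$L = 0.8\,V^{1/3}$$ and the average (secant) slope over two different size ranges.

```python
def local_slope(volume, coefficient=0.8, exponent=1.0 / 3.0):
    """Derivative dL/dV of L = A * V**B, i.e. A * B * V**(B - 1)."""
    return coefficient * exponent * volume ** (exponent - 1.0)

def secant_slope(v_low, v_high, coefficient=0.8, exponent=1.0 / 3.0):
    """Average slope of L = A * V**B between two volumes."""
    l_low = coefficient * v_low ** exponent
    l_high = coefficient * v_high ** exponent
    return (l_high - l_low) / (v_high - v_low)

# The local slope falls steadily as the boxes get larger.
for volume in (1.0, 8.0, 1000.0):   # illustrative volumes
    print(volume, local_slope(volume))

# Straight lines drawn over different size ranges have very different slopes.
print(secant_slope(1.0, 8.0), secant_slope(8.0, 1000.0))
```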

Carrying this sequence to its natural conclusion, we observe that the linear regression is entirely untenable for a sufficiently wide range of data (below).  Already apparent in the previous example, it is now grossly evident that confidence intervals determined from the linear regression procedure do not depict the expected range of values for L in any statistically valid or practically useful way.  In particular, we know that the relationship between length and volume must pass through the origin, and a linear regression with a nonzero intercept does not.  It follows that the linear regression will fail to represent the data accurately whenever we include data that are "close enough" to the origin.  The power regression continues to reproduce the known parameters and automatically passes through the origin because of its functional form.

The residual plot resulting from the linear regression is now grossly nonlinear and cannot even be adequately represented by a quadratic regression (red line).

This example relates directly to the regression of linear echocardiographic data against body weight.  Body weight is proportional to body volume, and we can expect spectacularly poor results from such a regression.  The approach is also simply incorrect, the result of not checking the statistical validity of the regression assumptions.  It's possible to get away with it for a sufficiently narrow range of the independent variable, as shown in the first box regression.  Almost any function behaves linearly locally; this is one of the fundamental premises of calculus, that a function can be approximated locally by a straight line.

There are even subtler issues to be understood from this simple example that apply to analyzing size interrelationships within the cardiovascular system.  In subsequent examples, we continue to work with rectangular boxes but allow for a greater degree of variation in the shape factors.  In the next example, we allow size to vary greatly but retain average shape over the entire range of size.  The plot of length against volume gives the now obvious and familiar nonlinear relationship;  the linear regression (in blue, shown with prediction confidence intervals) is of no practical value.

We now know better than to attempt to establish a linear regression between two quantities that are related nonlinearly.  Let us circumvent this problem by plotting box length against the cube root of volume, $$L_V$$, and performing a linear regression again.

The result yields the expected linear relationship through the origin, but the confidence intervals for the regression are still inappropriate.  The data can now be seen to exhibit heteroscedasticity; they are distributed randomly on both sides of the regression line, but the variance is a function of (changes with) the independent variable.  We determined previously that the relationship between L and $$L_V$$ must pass through the origin exactly, i.e. with no variance whatsoever.  This means that the confidence intervals of the regression line should narrow for smaller boxes, reaching 0 at the origin.
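Heteroscedasticity of this kind is easy to reproduce.  The following sketch assumes a synthetic set of boxes (sample size, size range, shape scatter, and seed are all made up), fits a straight line of L against the cube root of volume, and compares the spread of the residuals for small and large boxes; the size thresholds are arbitrary.

```python
import random
import statistics

def least_squares_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Assumed design: shape L/L_V scatters around 0.8 independently of size.
rng = random.Random(1)
sizes, lengths = [], []
for _ in range(600):
    l_v = rng.uniform(0.5, 10.0)      # cube root of volume
    shape = rng.gauss(0.8, 0.05)      # shape factor L/L_V
    sizes.append(l_v)
    lengths.append(shape * l_v)

slope, intercept = least_squares_line(sizes, lengths)
residuals = [l - (slope * x + intercept) for x, l in zip(sizes, lengths)]

# Compare the residual scatter of small boxes with that of large boxes.
small = [r for x, r in zip(sizes, residuals) if x < 3.0]
large = [r for x, r in zip(sizes, residuals) if x > 7.0]
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)

print("slope", slope, "intercept", intercept)
print("residual SD, small boxes:", spread_small)
print("residual SD, large boxes:", spread_large)
```

Under these assumptions the fitted slope stays near 0.8, but the residual scatter of the large boxes is several times that of the small ones, which is exactly what a constant-width confidence band fails to represent.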

Admittedly this result is a consequence of the way in which this hypothetical set of boxes has been constructed.  The characteristics of the set have been specified, i.e. determined, by allowing the shape ($$L/L_V$$) to be randomly distributed and independent of box size.  This is illustrated directly by plotting the shape against $$L_V$$.

Here we have finally arrived at a circumstance where we may legitimately perform a standard linear regression, i.e. of the shape, $$L/L_V$$, against $$L_V$$.  We see that $$L/L_V$$ is randomly distributed on either side of the regression line and that the variance is independent of box size.  It so happens that the slope of the regression line is also statistically equal to zero, again by construction.  The shape factor remains close to the design value of 0.8 for the entire set of boxes.  Because the set of boxes was designed with shape independent of size, the value of the exponent of volume is truly 1/3 ($$L = 0.8\,L_V^1 = 0.8\,V^{1/3}$$).

Because $$L/L_V$$ is not a function of box size in this example, we are justified in determining the statistics of this quantity without performing a regression, i.e. representing it in terms of its mean, standard deviation, median, range, etc.  Having obtained numerical values for these quantities, recognize that the shape factor $$L/L_V$$ is a slope, or rate of change, between these two quantities.  In determining its statistics, we determine the confidence intervals for the slope of a line that passes through the origin.  Here now is a plot of L vs $$L_V$$ with the mean slope and the ± 2 standard deviation slopes depicted, i.e. the confidence intervals for $$L/L_V$$.
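The construction of these bands can be sketched in a few lines.  The synthetic boxes below are an assumption (sample size, size range, scatter, and seed are made up), as is the helper name `band`; the idea is simply that the mean and standard deviation of the ratio define three lines through the origin.

```python
import random
import statistics

# Assumed design: shape L/L_V scatters around 0.8 independently of size.
rng = random.Random(2)
boxes = []
for _ in range(500):
    l_v = rng.uniform(0.5, 10.0)                  # cube root of volume
    boxes.append((l_v, rng.gauss(0.8, 0.05) * l_v))

# Statistics of the shape factor itself; no regression is involved.
ratios = [length / l_v for l_v, length in boxes]
mean_shape = statistics.mean(ratios)
sd_shape = statistics.stdev(ratios)

def band(l_v, n_sd=2.0):
    """Lower and upper lines through the origin with slopes mean -/+ n_sd * SD."""
    lower = (mean_shape - n_sd * sd_shape) * l_v
    upper = (mean_shape + n_sd * sd_shape) * l_v
    return lower, upper

# Fraction of boxes that fall between the two outer lines.
inside = sum(1 for l_v, length in boxes
             if band(l_v)[0] <= length <= band(l_v)[1]) / len(boxes)

print("mean shape", mean_shape, "SD", sd_shape)
print("band at L_V = 0, 1, 5:", band(0.0), band(1.0), band(5.0))
print("fraction inside the band:", inside)
```

The band has zero width at the origin and widens in proportion to box size, in contrast to the constant-width intervals of an ordinary linear regression.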

This now appears as we expect, with confidence intervals diminishing to zero at the origin.  The regression line and confidence intervals derived from a power regression are not mathematically identical but appear quite similar to the ones shown.  It will be shown now that the regression line derived by this approach amounts to a weighted linear regression through the origin.

Suppose now that we have a situation where box shape changes systematically with box size.  In the terms presented thus far, we would seek to determine whether $$L/L_V$$ is correlated with box size; in the previous example it was not.  Before proceeding as indicated, however, let us reconsider a naïve but generally recommendable approach that we have already used, the power law regression, and return to the relationship between L and box volume.

$$\Large y=Ax^B$$

When data conform to this relationship, it can also be said that they obey an allometric scaling law.  Statisticians will breathe a sigh of relief when we apply the power regression to this situation, for reasons we have already seen, but let us consider this equation in light of what we determined from geometric considerations.  Suppose we have box length (y) and volume (x) data and do not yet understand the geometric relationships outlined previously so that we must determine regression parameters A and B using a statistical procedure, e.g. least squares regression.

$$\Large L=AV^B$$

It would not be surprising now if B were determined to be numerically close, i.e. statistically equal to 1/3.  Suppose, however, that we perform the procedure and the value of B is not 1/3.  This situation can be rewritten mathematically as an expression that is identical to the first as follows:

$$\Large L=(AV^{B-1/3})V^{1/3}$$

Here the geometric expectation, that $$B = 1/3$$, is isolated from the term in parentheses, the shape factor, and the formula indicates that the shape is a function of the independent variable, V.  Recognizing this allows the allometric formula to be interpreted in light of the similarity principle: the power law regression describes size-dependent variation in shape, depending on the numerical value of the regression parameter B.  If $$B = 1/3$$, then $$A V^{B-1/3} = A$$, where A is a constant; we have the previously determined relationship where the shape factor A is not a function of size.  If B is not equal to 1/3, then the term in parentheses varies with size; i.e. shape is a function of size.
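The decomposition is easy to explore numerically.  In the sketch below the coefficient 0.8 matches the earlier box examples, while the exponent 0.40 and the volumes are hypothetical values chosen only to show a case where B differs from 1/3; the function names are likewise illustrative.

```python
def size_dependent_shape(volume, coefficient, exponent):
    """Shape factor L / V**(1/3) implied by the power law L = A * V**B."""
    return coefficient * volume ** (exponent - 1.0 / 3.0)

def predicted_length(volume, coefficient, exponent):
    """Length predicted by the power law L = A * V**B."""
    return coefficient * volume ** exponent

volumes = (1.0, 10.0, 1000.0)              # illustrative volumes
for exponent in (1.0 / 3.0, 0.40):         # 0.40 is a hypothetical fitted value
    shapes = [size_dependent_shape(v, 0.8, exponent) for v in volumes]
    print("B =", exponent, "shape factors:", shapes)

# The rewritten formula is identical to the original power law.
v = 50.0
print(predicted_length(v, 0.8, 0.40),
      size_dependent_shape(v, 0.8, 0.40) * v ** (1.0 / 3.0))
```

With B = 1/3 the shape factor stays at 0.8 for every volume; with B = 0.40 it grows steadily with size.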

Note however that the function $$A V^{B-1/3}$$ (or more simply $$A V^C$$, where $$C = B - 1/3$$ in this example) may not be a good choice to describe the variation of shape with size in a particular case.  The power law regression is a statistical expedient that simplifies the determination of shape dependency.  Note, for example, that we are raising a physical quantity (volume) to a fractional exponent.  It makes perfect sense to evaluate the cube root of a volume, which has physical units of length cubed; it makes no sense to evaluate the square root of a volume or to raise the volume to any exponent that does not result in a physical quantity.  Furthermore we should recognize that the term in parentheses is not $$O(1)$$; the formula predicts physical absurdity at both very large and very small volumes, indicating that the ratio of two lengths approaches 0 or infinity depending on the value of B.