Principal Component Analysis
for a Dynamical System of Emigration
Manabu Yuasa and Wasaburo Unno
Kinki University, Research Institute for Science and Technology, Higashi-Osaka, Osaka 577, Japan
(Received January 17, 199ii)
Abstract
A dynamical system theory of migration is developed and applied to study the causal relations among the principal components describing emigration from England and Wales, 1861-1900. The material is taken exclusively from Morgan (Appendix in Baines, 1985). Our method can elucidate the driving forces of migration as far as the relevant information is included in the input data, even implicitly. Morgan's data system is found to be controlled essentially by 4 principal components, which account for over 90% of its activity. The first principal component, which governs about 50% of the system variations, is however a combination of several variables in the data base, implying that Morgan's data base does not contain the direct agencies responsible for emigration. The driving force of emigration depends by only about 50% on the 4 principal components describing 90% of the system activity, so that the controlling factors of the other 50% lie outside the system constructible from the data base.
Key words: Emigration, Principal component analysis, Dynamical system, Causal relation of activities
1 Introduction
In our preceding study [1], a method was developed to find causal relations among various entities related to migration by use of principal component analysis (PCA). In this paper, the data base used by Morgan [2] for linear covariance analysis is employed to construct a multi-dimensional phase space in which the dynamical system of emigration is embedded. The result of the analysis differs from Morgan's in that the dynamical structure of emigration is obtained in the form of a differential equation. For this purpose, the multi-dimensional phase space is constructed not only from the hierarchy of observational data, including nonlinear products, but also from their time derivatives.
The data are taken from Morgan's article (Appendix in Baines, 1985), treating migration in England and Wales, 1861-1900. The inclusion of the time derivatives in the extended PCA enables us to elucidate the driving agencies of the system.
In Morgan [2], the data base contains: $X_1$ = (Age), $X_2$ = (ΔAgric), $X_3$ = (ΔLiteracy), $X_4$ = (ΔWages), $X_5$ = (Urban), and $X_6$ = (Mlag), $Y_1$ = (male overseas emigrant), $Y_2$ = (female overseas emigrant), $Y_3$ = (male internal migrant), $Y_4$ = (female internal migrant). For the meaning of these variables, the reader is referred to Morgan in Baines [2] (see also Unno, Yuasa and Onishi [1]). These data are given for 52 counties and for each of the 4 decades from 1861 to 1900. Therefore, these ten variables and their combinations are considered as the state variables (like pressure, density, temperature, and velocity components in fluid dynamics) describing the state of each county at each decade.
In other words, these variables and their combinations form a vector phase space in which the dynamical state of the system is embedded. For complete embedding without ambiguity, however, a (2n+1)-dimensional space is needed for embedding an n-dimensional system. Since n may be fairly large, it is better to increase the number of variables by taking as state variables not only the lagged emigration, $X_6$ = (Mlag), but also 2nd order quantities: $Z_1 = Y_1^2$, $Z_2 = Y_1 Y_2$, $Z_3 = Y_1 Y_3$, $Z_4 = Y_1 Y_4$, $Z_5 = Y_1 X_6$, etc., which together construct a state vector. We are now interested mostly in how the rate of change of emigration, $\partial_t (Y_1 + Y_2)$, depends on the state variables $X_i$, $Y_j$ and $Z_k$. The extended principal component analysis developed in the previous paper [1] is appropriate to establish such a dynamical system formalism.
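As a minimal sketch of how such a state vector might be assembled, the following Python fragment concatenates the six X variables, the four Y variables and five second-order quantities for one county in one decade. The function name `state_vector` and the sample numbers are illustrative assumptions, not values from Morgan's data base, and the first second-order quantity is taken here to be the square of Y1.

```python
def state_vector(X, Y):
    """Return the 15-component state vector (X1..X6, Y1..Y4, Z1..Z5).

    X: six background variables, with X[5] the lagged emigration (Mlag).
    Y: four migration variables, with Y[0] the male overseas emigration.
    """
    if len(X) != 6 or len(Y) != 4:
        raise ValueError("expected 6 X-variables and 4 Y-variables")
    Y1 = Y[0]
    # Second-order quantities (assumed): Y1^2, Y1*Y2, Y1*Y3, Y1*Y4, Y1*X6
    Z = [Y1 * Y1, Y1 * Y[1], Y1 * Y[2], Y1 * Y[3], Y1 * X[5]]
    return list(X) + list(Y) + Z


# Made-up numbers for one county and one decade (not real data).
X_example = [0.30, -0.02, 0.05, 0.10, 0.40, 2.0]
Y_example = [3.0, 2.0, 5.0, 6.0]
q_raw = state_vector(X_example, Y_example)
print(len(q_raw))    # 15
print(q_raw[10:])    # [9.0, 6.0, 15.0, 18.0, 6.0]
```

Repeating this for every county and decade would give the 208 rows of raw state variables that are normalized in the next section.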
2 Standardization of Variables
Variables should be normalized in the PCA, since we are interested in deriving causal relations.
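Normalized variables $x_i$, $y_j$, $z_k$ are given by the following formulae:

$$x_i = \frac{X_i - \langle X_i \rangle}{\sigma_{X_i}}, \quad y_j = \frac{Y_j - \langle Y_j \rangle}{\sigma_{Y_j}}, \quad z_k = \frac{Z_k - \langle Z_k \rangle}{\sigma_{Z_k}}, \tag{1}$$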
where $\langle X_i \rangle$ and $\sigma_{X_i}$ etc. denote the average and the standard deviation of $X_i$ etc., respectively. The values of these normalized variables should be given for every county and for every decade. Extrapolation and/or interpolation may be used to evaluate the normalized variables if the data are not available at the right epoch. The multiplicity can be regarded as $(m_1 + m_2)$ for all variables, because in this paper we are investigating the net overseas emigration, $Y (= Y_1 + Y_2)$, against the background of all variables. Then the average and the standard deviation are given for $X_i$ by

$$\langle X_i \rangle = \sum_{c=1}^{52} \sum_{p=1}^{4} (m_1 + m_2) X_i \Big/ \sum_{c=1}^{52} \sum_{p=1}^{4} (m_1 + m_2), \tag{2}$$

and

$$\sigma_{X_i} = \left[ \frac{\sum_{c=1}^{52} \sum_{p=1}^{4} (m_1 + m_2) (X_i - \langle X_i \rangle)^2}{\sum_{c=1}^{52} \sum_{p=1}^{4} (m_1 + m_2)} \right]^{1/2}, \tag{3}$$

where the summation extends over counties from c = 1 to 52 and over epochs from p = 1 to 4. Similar equations hold for $Y_j$ and $Z_k$. Thus, we have all the variables normalized for each of the 52 counties and for each of the 4 decades, and 208 points can therefore be plotted in the phase space of 15 dimensions, $(x_1, \ldots, x_6, y_1, \ldots, y_4, z_1, \ldots, z_5)$.
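The weighted normalization of equations (1) to (3) can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `weighted_standardize` is hypothetical, and the 208 x 15 array and the multiplicities are synthetic random numbers standing in for the real data, which are not reproduced here.

```python
import numpy as np


def weighted_standardize(V, m):
    """Normalize each column of V with multiplicities m, as in eqs. (1)-(3).

    V: array of shape (n_cases, n_vars), one row per county and decade.
    m: array of shape (n_cases,), the multiplicity (m1 + m2) of each row.
    Returns the normalized array and the weights (which sum to one).
    """
    V = np.asarray(V, dtype=float)
    w = np.asarray(m, dtype=float)
    w = w / w.sum()                        # weights proportional to (m1 + m2)
    mean = w @ V                           # eq. (2): weighted average per column
    sigma = np.sqrt(w @ (V - mean) ** 2)   # eq. (3): weighted standard deviation
    return (V - mean) / sigma, w           # eq. (1)


# Synthetic stand-in data (assumption): 208 cases, 15 state variables.
rng = np.random.default_rng(0)
V = rng.normal(size=(208, 15))
m = rng.integers(1, 1000, size=208)
q, w = weighted_standardize(V, m)
print(np.allclose(w @ q, 0.0), np.allclose(w @ q**2, 1.0))  # True True
```

After this step every column has weighted average zero and weighted variance one, which is the property the correlation matrix of the next section relies on.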
3 Principal Component Analysis
Each emigrant plots one point representing his background of emigration in the 15-D phase space. Then, each of the above 208 points should have the multiplicity $(m_1 + m_2)$. A 15-D ellipsoidal distribution approximates the distribution of these points in the phase space, and the principal axes of the ellipsoid correspond to the principal components. The principal axes are, therefore, specified by the eigenvalues and eigenvectors of the matrix of correlations taken for all combinations of two coordinates among $(x_i, y_j, z_k)$.
Let $q_n$ (n = 1, ..., 15) be the unified symbol for $x_i$, $y_j$, and $z_k$ in this order. A set of values of $q_n$ (n = 1, ..., 15) is obtained for 208 cases (denoted by c = 1 to c = 208) of different counties and epochs to which emigrants belong, with the multiplicity $m_1 + m_2$ or the weight $w_c$ given by $w_c = (m_1 + m_2) / \sum_{c=1}^{208} (m_1 + m_2)$. Note that these $q_n$'s have been normalized so that the averages of $q_n$ are zero, $\langle q_n \rangle = \sum_{c=1}^{208} w_c q_n^{(c)} = 0$, and the variances are unity, $\sigma_{q_n}^2 = \sum_{c=1}^{208} w_c [q_n^{(c)}]^2 = 1$, for all n. The (i,j) element of the correlation matrix, $r_{ij}$, is given by

$$r_{ij} = \sum_{c=1}^{208} w_c q_i^{(c)} q_j^{(c)}. \tag{4}$$

Eigenvalues $\lambda_\alpha$ and the corresponding eigenvectors $\mu^{(\alpha)}$ are defined by the following set of linear equations,

$$\sum_{j=1}^{15} r_{ij} \mu_j^{(\alpha)} = \lambda_\alpha \mu_i^{(\alpha)}, \quad (i = 1, \ldots, 15), \tag{5}$$

having nontrivial solutions under the fulfillment of the characteristic equation,

$$\det (r_{ij} - \lambda_\alpha \delta_{ij}) = 0, \tag{6}$$

det being the determinant.

The principal component axes are the eigenvectors $\mu^{(\alpha)}$ belonging to the roots $\lambda_\alpha$ ($\alpha$ = 1, ..., 15) of the characteristic equation, and the component $\mu_i^{(\alpha)}$, normalized so that $\sum_{i=1}^{15} (\mu_i^{(\alpha)})^2 = 1$, defines the direction-cosine of the principal component vector $\mu^{(\alpha)}$ with respect to the $q_i$ axis. The value of a principal component $P_\alpha$ of a point $q_n$ (n = 1, ..., 15) in the space is, therefore, given by

$$P_\alpha^{(c)} = \sum_{n=1}^{15} \mu_n^{(\alpha)} q_n^{(c)} \quad (\alpha = 1, \ldots, 15). \tag{7}$$

The importance of a principal component $P_\alpha$ is judged from the value of the eigenvalue $\lambda_\alpha$ being larger or smaller than unity. This is because the eigenvalue is the variance of the principal component, as shown by the expression,

$$\sum_{c=1}^{208} w_c [P_\alpha^{(c)}]^2 = \sum_{i=1}^{15} \sum_{j=1}^{15} \mu_i^{(\alpha)} \mu_j^{(\alpha)} \sum_{c=1}^{208} w_c q_i^{(c)} q_j^{(c)} = \lambda_\alpha, \tag{8}$$

which measures the contribution of the principal component $P_\alpha$ in controlling the system. We note here that $\sum_{\alpha=1}^{15} \lambda_\alpha = 15$, as derived from the trace of the matrix in equation (6).
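The results are as follows.

Eigenvalues:

$$\begin{aligned}
&\lambda_1 = 7.581,\ \lambda_2 = 3.521,\ \lambda_3 = 1.322,\ \lambda_4 = 1.451,\ \lambda_5 = 0.404, \\
&\lambda_6 = 0.322,\ \lambda_7 = 0.206,\ \lambda_8 = 0.194,\ \lambda_9 = 0.126,\ \lambda_{10} = 0.071, \\
&\lambda_{11} = 0.050,\ \lambda_{12} = 0.035,\ \lambda_{13} = 0.017,\ \lambda_{14} = 0.003,\ \lambda_{15} = 0.002,
\end{aligned} \tag{9}$$

Eigenvectors corresponding to the 4 largest eigenvalues (components listed in the order $q_1, \ldots, q_{15}$):

$$\begin{aligned}
\mu^{(1)} = (\ & 0.159,\ 0.113,\ -0.211,\ -0.211,\ 0.288, \\
& 0.327,\ 0.281,\ 0.111,\ 0.115,\ 0.167, \\
& 0.332,\ 0.332,\ 0.333,\ 0.335,\ 0.327\ ), \\
\mu^{(2)} = (\ & 0.188,\ 0.409,\ -0.111,\ -0.290,\ -0.334, \\
& -0.179,\ -0.185,\ -0.215,\ 0.438,\ 0.432, \\
& -0.169,\ -0.191,\ 0.053,\ 0.045,\ -0.166\ ), \\
\mu^{(3)} = (\ & 0.548,\ 0.252,\ 0.151,\ -0.023,\ -0.333, \\
& 0.084,\ -0.034,\ -0.224,\ -0.367,\ -0.371, \\
& 0.087,\ -0.031,\ -0.261,\ -0.253,\ 0.126\ ), \\
\mu^{(4)} = (\ & -0.382,\ 0.081,\ 0.819,\ -0.328,\ -0.072, \\
& -0.008,\ -0.096,\ 0.134,\ -0.004,\ -0.014, \\
& -0.107,\ -0.004,\ -0.086,\ -0.085,\ -0.106\ ).
\end{aligned} \tag{10}$$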
The largest (first) principal component describes about 50% ($= \lambda_1/15$) and the 4 largest ones describe over 90% ($= (\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4)/15$) of the system. The first principal component is composed of several variables. The second and the third also have no eminently effective variables. The fourth principal component depends considerably on $q_3$ (Literacy).
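The procedure of this section, namely the weighted correlation matrix, its eigenvalues and eigenvectors, and the principal component values, can be sketched as follows in Python with NumPy. The function name `weighted_pca` is a hypothetical label, and the data are synthetic correlated random numbers used only as a stand-in for the 208 cases, so the eigenvalues printed by this sketch are not those reported above.

```python
import numpy as np


def weighted_pca(q, w):
    """Weighted PCA of normalized data q (n_cases, n_vars); weights w sum to 1."""
    r = (q * w[:, None]).T @ q          # correlation matrix r_ij = sum_c w_c q_i q_j
    lam, mu = np.linalg.eigh(r)         # symmetric eigenproblem, eigenvalues ascending
    order = np.argsort(lam)[::-1]       # put the largest eigenvalue first
    lam, mu = lam[order], mu[:, order]  # mu[:, a] is the unit eigenvector of lam[a]
    P = q @ mu                          # principal component values of every case
    return lam, mu, P


# Synthetic stand-in for the 208 x 15 data set (assumption, not Morgan's data).
rng = np.random.default_rng(1)
raw = rng.normal(size=(208, 15)) @ rng.normal(size=(15, 15))  # correlated columns
w = rng.integers(1, 1000, size=208).astype(float)
w /= w.sum()
mean = w @ raw
q = (raw - mean) / np.sqrt(w @ (raw - mean) ** 2)             # weighted normalization

lam, mu, P = weighted_pca(q, w)
share = lam / lam.sum()                 # fraction of the system described by each component
print(np.isclose(lam.sum(), 15.0))      # True: the trace of the correlation matrix
print(np.allclose(w @ P**2, lam))       # True: variance of each component is its eigenvalue
```

The two printed checks correspond to the statements that the eigenvalues sum to 15 and that each eigenvalue is the variance of its principal component; `share[:4].sum()` would be the analogue of the "over 90%" figure for whatever data are supplied.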
4 Extended PCA
We examine the extended PCA to find the driving forces of emigration. In our case the system is controlled over 90% by 4 significant principal components belonging to eigenvalues larger than unity. Then, if $\partial_t q_0$, the normalized value of $\partial_t Y = \partial_t (Y_1 + Y_2)$, is added to the $q_n$'s to form the extended system, the extended PCA will yield practically the same set of significant principal components, since the new additional principal component will be an insignificant one. If we neglect the new additional insignificant one, we obtain the following concluding equation,

$$\partial_t q_0 = \sum_{n=1}^{15} a_n q_n, \tag{11}$$

where $a_n$ can be expressed by using the eigenvector corresponding to the new insignificant principal component of the extended PCA and the eigenvectors corresponding to the significant eigenvalues of the first PCA. Errors of this equation include the contributions from insignificant principal components and other external causes that were not considered. In this investigation we adopt the principal components corresponding to the 4 largest eigenvalues in the preceding section.

The results of the extended PCA are as follows.

Eigenvalues:

$$\lambda_1 = 1.425,\ \lambda_2 = 1.002,\ \lambda_3 = 0.998,\ \lambda_4 = 0.998,\ \lambda_5 = 0.577, \tag{12}$$

Eigenvectors:

$$\begin{aligned}
\mu_1 &= (\ 0.306,\ -0.248,\ 0.116,\ 0.577,\ -0.706\ ), \\
\mu_2 &= (\ 0.246,\ 0.629,\ 0.737,\ -0.009,\ -0.001\ ), \\
\mu_3 &= (\ -0.718,\ -0.388,\ 0.570,\ 0.096,\ -0.002\ ), \\
\mu_4 &= (\ -0.489,\ 0.576,\ -0.321,\ 0.570,\ -0.001\ ), \\
\mu_5 &= (\ 0.302,\ -0.247,\ 0.118,\ 0.577,\ 0.708\ ).
\end{aligned} \tag{13}$$
If we neglect $\lambda_5$, which corresponds to the new additional principal component, the following concluding relation is obtained:

$$\begin{aligned}
\partial_t q_0 = {} & 0.220 q_1 - 0.047 q_2 - 0.683 q_3 + 0.246 q_4 + 0.074 q_5 - 0.084 q_6 - 0.007 q_7 - 0.153 q_8 + 0.121 q_9 + 0.127 q_{10} \\
& - 0.014 q_{11} - 0.080 q_{12} + 0.061 q_{13} + 0.058 q_{14} - 0.019 q_{15}.
\end{aligned} \tag{14}$$

The largest driving force for the emigration seems to be $q_3$ (Literacy). If $q_3$ decreases, namely Literacy increases, the rate of change of emigration increases. Other small driving forces can also be interpreted similarly. But the value of $\lambda_5$ is rather large, and the error derived from the neglect of $\lambda_5$ is estimated to be about 50% ($= \lambda/\sqrt{2}$). Consequently, the driving force of emigration depends by only about 50% on the 4 principal components describing 90% of the system activity, so that the controlling factors of the other 50% lie outside the system constructible from the data base.
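The general idea of this section, that neglecting the least significant principal component of an extended system yields a linear relation for the added variable, can be illustrated with a small Python sketch. It is not a reconstruction of the authors' computation and does not reproduce the coefficients of equation (14): the function name `relation_from_smallest_pc`, the equal weights and the synthetic data (a target built as an exact linear combination of four columns) are all assumptions made for illustration.

```python
import numpy as np


def relation_from_smallest_pc(S, target, w):
    """Express the normalized target through the normalized columns of S.

    The target is appended to S, the extended weighted correlation matrix is
    diagonalized, and the principal component with the smallest eigenvalue is
    set to zero, which gives target ~ sum_k coeff[k] * S_k in normalized units.
    """
    E = np.column_stack([S, target]).astype(float)
    mean = w @ E
    E = (E - mean) / np.sqrt(w @ (E - mean) ** 2)   # weighted normalization
    r = (E * w[:, None]).T @ E                      # extended correlation matrix
    lam, vec = np.linalg.eigh(r)                    # eigenvalues in ascending order
    nu = vec[:, 0]                                  # eigenvector of the smallest eigenvalue
    # Neglected component: sum_k nu[k] * E_k + nu[-1] * E_target ~ 0
    coeff = -nu[:-1] / nu[-1]
    return coeff, lam, E


# Synthetic example (assumption): the target is an exact linear combination.
rng = np.random.default_rng(2)
S = rng.normal(size=(208, 4))
target = S @ np.array([0.5, -0.2, 0.1, 0.6])
w = np.full(208, 1.0 / 208)                         # equal weights for simplicity
coeff, lam, E = relation_from_smallest_pc(S, target, w)
print(np.allclose(E[:, :-1] @ coeff, E[:, -1]))     # True
```

In this noise-free example the smallest eigenvalue is zero to rounding error and the relation is exact. With real data the smallest eigenvalue is finite, and its size indicates how much is lost by neglecting that component, which is the situation discussed above for $\lambda_5 = 0.577$.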
References
[1] Unno, W., Yuasa, M. and Onishi, T., Dynamical System Theory of Migration, Science and Technology, Kinki University, No. 8, 1996.
[2] Morgan, M., in Migration in a Mature Economy: Emigration and Internal Migration in England and Wales, 1861-1900, by D. Baines (Cambridge University Press, 1985).