SYMPOSIUM ON INFORMATION THEORY IN BIOLOGY

Gatlinburg, Tennessee, October 29-31, 1956

Edited by
HUBERT P. YOCKEY
Oak Ridge National Laboratory

With the assistance of
ROBERT L. PLATZMAN, Purdue University
HENRY QUASTLER, Brookhaven National Laboratory

SYMPOSIUM PUBLICATIONS DIVISION
PERGAMON PRESS
NEW YORK • LONDON • PARIS • LOS ANGELES

PERGAMON PRESS INC.
722 East 55th Street, New York 22, N.Y.
P.O. Box 47715, Los Angeles, California

PERGAMON PRESS LTD.
4 & 5 Fitzroy Square, London W.1

PERGAMON PRESS, S.A.R.L.
24 Rue des Ecoles, Paris, Ve

First published 1958
Library of Congress Card No. 58-9687
Printed in Northern Ireland at The Universities Press, Belfast

CONTENTS

Foreword
    A. M. Weinberg
Preface

PART I. INTRODUCTION
A Primer on Information Theory
    Henry Quastler
Some Introductory Ideas Concerning the Application of Information Theory in Biology
    Hubert P. Yockey

PART II. STORAGE AND TRANSFER OF INFORMATION
Editorial Introduction
The Cryptographic Approach to the Problem of Protein Synthesis
    George Gamow and Martynas Ycas
The Protein Text
    Martynas Ycas
Discussion
Protein Structure and Information Content
    L. G. Augenstine
Discussion
Specific Mechanisms of Protein Synthesis and Information Transfer in the Developing Chick Embryo
    H. R. Mahler, H. Walter, A. Bulbenko and D. W. Allmann
Discussion
The Mechanism of Action of Methyl Xanthines in Mutagenesis
    Arthur L. Koch
Evidence for a Negative Feedback System Controlling Liver Regeneration
    Andre D. Glinos
Fluctuations in Neural Thresholds
    Lawrence S. Frishkopf and Walter A. Rosenblith

PART III. DETERMINATION OF INFORMATION MEASURES
Editorial Introduction
Chemistry and Biochemistry at Low Temperatures and Discrimination of States and Reactivities
    Simon Freed
Discussion
Information Content of Tracer Data With Respect to Steady-state Systems
    Mones Berman and Robert L. Schoenfeld
The Domain of Information Theory in Biology
    Henry Quastler
Discussion
Some Membrane Phenomena from the Point of View of Information Theory
    Herman Branson
Efficiency of Information Transmission by Biochemical Co-factors
    Peter D. Klein
Discussion
Antigenic Specificity
    Bernard N. Jaroslow and Henry Quastler
Information Content and Biotopology of the Cell in Terms of Cell Organelles
    Charles F. Ehret
Quantification of Performance in a Logical Task With Uncertainty
    A. Rapoport

PART IV. DESTRUCTION OF INFORMATION BY IONIZING RADIATION
Editorial Introduction
Electron Spin Resonance in the Study of Radiation Damage
    Walter Gordy
A Physical Mechanism for the Inactivation of Proteins by Ionizing Radiation
    Robert Platzman and James Franck
Information and Inactivation of Biological Material
    Harold J. Morowitz
Discussion
The Absence of Radiation-Induced Disulfide Interchanges
    Arthur L. Koch
A Proposed Mechanism of Protein Inactivation
    L. G. Augenstine
Discussion

PART V. AGING AND RADIATION DAMAGE
Editorial Introduction
A Study of Aging, Thermal Killing, and Radiation Damage by Information Theory
    Hubert P. Yockey
Entropic Contributions to Mortality and Aging
    George A. Sacher
A Quantitative Description of Latent Injury from Ionizing Radiation
    H. A. Blair
Some Notes on Aging
    Hardin B. Jones
Cancer as a Special Case of a General Degenerative Process
    Harry Auerbach
Discussion
Free Radicals as a Possible Cause of Mutations and Cancer
    Walter Gordy

PART VI. INFORMATION NETWORKS
A Probabilistic Model for Morphogenesis
    Murray Eden
Functional Geometry and the Determination of Pattern in Mosaic Receptors
    John R. Platt

PART VII. THE STATUS OF INFORMATION THEORY IN BIOLOGY
The Status of Information Theory in Biology: A Round Table Discussion
    Edited by Henry Quastler

Author Index
Subject Index

FOREWORD

Alvin M. Weinberg
Director, Oak Ridge National Laboratory

The reader of this book may wonder why it is that an institution such as the Oak Ridge National Laboratory, which is primarily interested in the control and release of nuclear energy, should also be interested in sponsoring a meeting on Information Theory in Health Physics and Radiobiology. The answer rests in the fact that among the activities that are pursued at this Laboratory there are two which bear very directly on general problems of growth and of the impairment of growth by radiation and allied agents. Broad programs in fundamental research in the basic physical mechanisms and in the basic biological manifestations of radiation damage have been established in the Health Physics Division and in the Biology Division. In the Biology Division there is a great deal of experimental work being done on protein synthesis, on the mechanism of action of the nucleic acids, and on problems of the characterization of the nucleic acids. In the Health Physics Division there is a lively interest in the problems of dosimetry and the basic mechanisms of the interaction of radiation and matter.

It is in establishing a tie-up between the physical and biological aspects of radiation damage that information theory may play an important role. We hope that this conference will help to assess the value of information theory to phenomena involved in the interaction of radiation and living matter.

[...] in our example:

\[ \sum_i p(i) \cdot z_i = 1/2 + 3/8 + 3/8 + 3/8 + 4/16 + 5/32 + 5/32 = 70/32 = 2.19 \]

From $p(i) = (1/2)^{z_i}$ we get: $\log_2 p(i) = z_i \cdot \log_2 (1/2)$, and, because $\log_2 (1/2) = -1$, we have: $z_i = -\log_2 p(i)$.

We get (for p(i)'s which are integral powers of 1/2!) the following result:

Average number of binary symbols per event $= -\sum_i p(i) \log_2 p(i)$.

We will check this result for the case of equiprobable categories.
For r categories, the probability of every one will be 1/r; so:

\[ -\sum_i p(i) \log_2 p(i) = -r \cdot \frac{1}{r} \log_2 \frac{1}{r} = \log_2 r \]

This is the expression previously obtained for equiprobable categories.

Any Probabilities. What if probabilities are not limited to the values 1/2, 1/4, 1/8, etc.? In this case, it will, in general, not be possible to make divisions into exactly equiprobable groups. We would suspect that in this case the coding will be less than optimally efficient; accordingly, the average length of a code word will be somewhat higher than $-\sum_i p(i) \log_2 p(i)$. The approximation is usually not bad. This is illustrated in the following example, which shows the construction of a binary code for the letters of the English alphabet, taking into account their relative frequencies. As expected, it turns out that each category, i, is represented by a code word of approximately $-\log_2 p(i)$ digits; accordingly, its contribution to the weighted average is not far from the ideal value of $-p(i) \log_2 p(i)$, and the mean code length is only very slightly greater than the limiting value of $-\sum_i p(i) \log_2 p(i)$.

Table I. Fano Code for English Letters

Column 1: letter, i
Column 2: p(i)
Column 3: code
Column 4: no. of digits in code word
Column 5: -log2 p(i)
Column 6: contribution to weighted average (col. 2 x col. 4)
Column 7: -p(i) x log2 p(i) (col. 2 x col. 5)

  1     2       3            4     5       6       7
  E   .132   111             3   2.92   .393   .384139
  T   .105   110             3   3.25   .315   .341411
  A   .086   101             3   3.54   .258   .304398
  O   .080   1001            4   3.64   .320   .291508
  N   .071   1000            4   3.82   .284   .270938
  R   .068   0111            4   3.88   .272   .263725
  I   .063   0110            4   3.99   .252   .251275
  S   .061   0101            4   4.04   .244   .246137
  H   .053   0100            4   4.24   .212   .224606
  D   .038   00111           5   4.72   .190   .179278
  L   .034   00110           5   4.88   .170   .165862
  F   .029   00101           5   5.11   .145   .148126
  C   .028   00100           5   5.16   .140   .144436
  M   .025   000111          6   5.32   .150   .133048
  U   .020   000110          6   5.64   .120   .112877
  G   .020   000101          6   5.64   .120   .112877
  Y   .020   000100          6   5.64   .120   .112877
  P   .020   000011          6   5.64   .120   .112877
  W   .015   000010          6   6.06   .090   .090883
  B   .014   000001          6   6.16   .084   .086218
  V   .009   0000001         7   6.80   .063   .061162
  K   .004   00000001        8   7.97   .032   .031863
  X   .002   0000000011     10   8.97   .020   .017931
  J   .001   0000000010     10   9.97   .010   .009965
  Q   .001   0000000001     10   9.97   .010   .009965
  Z   .001   0000000000     10   9.97   .010   .009965
     1.000                              4.144  4.118347

We have already met a situation where a binary code was less than optimally efficient (in the sense of minimum length of code words); that was the case of r equiprobable categories, when r was not an integral power of 2. In this instance, it was possible to approximate optimal efficiency by symbolizing groups of events instead of single events. The same principle works in the case of probabilities which are not integral powers of (1/2). We will illustrate the method in the case of a situation with two alternatives.

Example: Let there be two categories of events, 'A' and 'B', with associated probabilities, p(A) and p(B):

p(A) = .7     p(B) = .3

The limiting value of symbols per event is:

\[ -\sum_i p(i) \log_2 p(i) = -(0.7 \log_2 0.7 + 0.3 \log_2 0.3) = 0.881291 \ldots \]

If this situation is to be represented on the basis of single events, then one needs one binary digit per event.
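The arithmetic of Table I and the limiting value just quoted can be checked mechanically. What follows is a minimal sketch in Python, which is of course no part of the original text; the names `FANO`, `entropy_bits` and `is_prefix_free` are chosen here purely for illustration. The probabilities and code words are transcribed from Table I as printed, so the recomputed sums may differ from the printed totals in the last digits.

```python
from math import log2

# Letter -> (probability, code word), transcribed from Table I as printed.
FANO = {
    "E": (0.132, "111"),        "T": (0.105, "110"),
    "A": (0.086, "101"),        "O": (0.080, "1001"),
    "N": (0.071, "1000"),       "R": (0.068, "0111"),
    "I": (0.063, "0110"),       "S": (0.061, "0101"),
    "H": (0.053, "0100"),       "D": (0.038, "00111"),
    "L": (0.034, "00110"),      "F": (0.029, "00101"),
    "C": (0.028, "00100"),      "M": (0.025, "000111"),
    "U": (0.020, "000110"),     "G": (0.020, "000101"),
    "Y": (0.020, "000100"),     "P": (0.020, "000011"),
    "W": (0.015, "000010"),     "B": (0.014, "000001"),
    "V": (0.009, "0000001"),    "K": (0.004, "00000001"),
    "X": (0.002, "0000000011"), "J": (0.001, "0000000010"),
    "Q": (0.001, "0000000001"), "Z": (0.001, "0000000000"),
}


def entropy_bits(probs):
    """Return -sum p * log2(p); categories with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def is_prefix_free(codes):
    """True if no code word is the beginning of another code word.

    After sorting, a word that is a prefix of any other word is also a
    prefix of its immediate successor, so neighbours suffice.
    """
    ordered = sorted(codes)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


# Weighted average length of a code word (the sum of column 6).
mean_length = sum(p * len(code) for p, code in FANO.values())
# Limiting value for the same probabilities (the sum of column 7).
limit = entropy_bits(p for p, _ in FANO.values())
# Limiting value for the two-alternative example, p(A) = .7, p(B) = .3.
limit_two = entropy_bits([0.7, 0.3])

print(f"prefix-free: {is_prefix_free(code for _, code in FANO.values())}")
print(f"mean length: {mean_length:.3f} digits, limit: {limit:.3f} bits")
print(f"two alternatives: {limit_two:.6f} bits per event")
```

The prefix check is included because a code of unequal word lengths can be read off without separators only if no word begins another; this is an assumption about what makes the tabulated code usable, stated here rather than in the text.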
Event   Probability   Representation
A       0.7           1
B       0.3           0

Average number: 1.0 symbol per event; excess 12 per cent.

The following two-event clusters are possible: AA, AB, BA, BB. If the two events are independent, then the probability that both occur is the product of their individual probabilities: p(AA) = p(A) · p(A), p(BA) = p(B) · p(A), etc. Setting up a Fano code, we get:

Event   Probability   Representation
AA      .49           1
AB      .21           01
BA      .21           001
BB      .09           000

Average 1.81, or 0.905 symbols per event; excess 3 per cent.

If we can encode groups of three real events, then we get still closer to optimum economy:

Event   Probability   Representation
AAA     .343          [illegible]
AAB     .147          [illegible]
ABA     .147          [illegible]
BAA     .147          [illegible]
ABB     .063          [illegible]
BAB     .063          [illegible]
BBA     .063          [illegible]
BBB     .027          [illegible]

Average: 2.686, or 0.895 digit per event; excess 1½ per cent.

Even with more pronounced unbalance of frequencies, the minimum value of binary digits per word is soon approximated. For p(A) = .89 and p(B) = .11, the limiting value is .50. In single-event code, one needs one digit per event; for two-event sequences, .66 digits; for three-event sequences, .55; and for four-event sequences, .52.

We have begun our discussion of binary representation with the case of 2, 4, 8, 16, ..., equiprobable categories. We then generalized to cases with any number of categories, and proceeded from the representation of single events to clusters of events. Next, we introduced unequal probabilities, of value 1/2, 1/4, 1/8, ... . Finally, we dropped all restrictions. We can now state, with full generality:

If a real situation is categorized into r categories, with associated probabilities p(i), (where i = 1, 2, ..., r), then it is possible to represent each event with an average of no more than $-\sum_{i=1}^{r} p(i) \log_2 p(i)$ binary symbols.

Representation Theorem. In general, the closer we want to approximate the minimum bulk of representation, the larger the groups of sequences which must be encoded. This entails the following penalties:

1. There will be a delay in waiting for a whole group of events to occur or to be registered, and
2. The encoding and decoding procedures, and the code book itself, will become the more elaborate the larger the groups coded.

It is obvious that the code which is most economical in terms of bulk of representation is not necessarily optimum in over-all performance. There will be cases where it might be worthwhile to sacrifice economy in word length for ease in decoding. If the reader will work through exercise 4, then he surely will appreciate this possibility. Whether or not minimum bulk of coding is favorable, in a given case, cannot be derived from informational analysis. What information theory does is to establish a limiting value of the number of symbols, of a given kind, which are needed to represent the information in a given factual situation; in some cases, like those here discussed, information theory will also show how such coding economy can be achieved; but it can never prescribe that this is what should be done.

It would be quite legitimate to inquire, at this point, why we have gone to so much trouble to find out how to achieve binary representation with minimum bulk. Is not the result of doubtful value, in view of the fact that a tolerable approximation to minimum bulk can usually be achieved with the simplest means, and that a close approximation often entails prohibitive costs in encoding and decoding? The answer is this: by establishing the minimum length of code words in standard binary representation, we have implicitly established a general condition of representability:

If an event can be represented by (on the average) n binary digits, then it can symbolically represent, or be represented by, any other event that can also be coded into n binary digits.

This can be immediately generalized to groups of events: Let $S_x$ and $S_y$ be the number of real and symbolic events in a group, and $n_x$ and $n_y$ the average binary representation per event and per symbol. Then, the general condition of representability can be stated as follows:

\[ S_y \cdot n_y \ge S_x \cdot n_x \]

EXERCISES

1. A weakness of the Paul Revere code is that there is no positive signal for "peace and quiet". Hence, the colonists could not be sure whether the absence of a warning signal meant "peace and quiet" or a disturbance in the communication system. Show how two lights could be used to indicate the four situations by positive signals.

2. Any integer can be written as a sum of powers of 2 (1, 2, 4, 8, 16, ...). For instance:

\[ 27 = 16 + 8 + 2 + 1 = 2^4 + 2^3 + 2^1 + 2^0 \]

In binary notation, one indicates the power by position, and writes a '1' in the appropriate position if this power does enter the sum, a '0' if it does not. Thus, '27' becomes 11011.
(a) Write the following numbers in binary notation: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 1955.
(b) Write the following binary numbers in decimal notation: 1001, 1011, 10010011, 100000.
Any proper fraction can be written as a sum of powers of 1/2 (1/2, $(1/2)^2 = 1/4$, $(1/2)^3 = 1/8$, etc.). For instance: .75 = 1/2 + 1/4, or, in binary notation, .11.
(c) Translate into decimal notation: .001, .1001001

3. (a) Encode the message 'ABCDE' in code (a) of the five-word codes described earlier. (b) Decode the message '000001011011' in code (b).

4. This assignment is coded in the Fano code for English letters given earlier.
001001001100001011100111000110001001101010010010011000001010001
100101011010011000001010111111111001001001001
11111000110010101101000000101001
01011000000011110000010110100010101110001000011
10110000101011011001010010110010111111
1010010001000011011111011011101111011000001110010010010
0011100001110101111111100100111000011111011
100101100101110001111011000
00100111100100101110010001100101001001001001
11111000010011011001001100
100101110100100101110010011110100000
1100100000011110000010001001111000
00100100100111011010000001011011000001110011111
1001001001001110110100000010110100011111
1010101011010001011110011001100000000111111001000110010110011000111

(This assignment is very tedious but it is good practice.)

5. Given a real situation with three categories and probabilities p(A) = .8, p(B) = .15, p(C) = .05. Construct a binary code which comes within 10 per cent of the minimum bulk.

6. A protein is thought to be a linear arrangement of amino acids of which there are (about) twenty kinds in each cell. The specificity of a protein depends mostly on the sequence of amino acids, i.e. a protein can be considered as a 'message' written in a twenty-letter alphabet. It is known that, in the living cell, protein specificity is determined by nucleic acids. These are linear arrangements of nucleotides, of which there are four different kinds. Question: what is the minimum number of nucleotides needed, on average, to specify each amino acid? Assume all amino acids to be equiprobable.

III. THE MEASURE OF INFORMATION OR UNCERTAINTY

It seems reasonable to equate the amount of information acquired, as a result of an event, to the amount of uncertainty which its occurrence has abolished*.
The prior uncertainty does not depend on the event that has actually happened, but, rather, on the whole set of events which could have happened at this particular occasion. For instance, if one wishes to compute how much information is acquired, on the average, by a glance at the speedometer, one proceeds to estimate how uncertain a motorist is before he glances. The amount of this uncertainty must depend on the number of needle positions which the motorist thinks he can distinguish. Suppose his speedometer scale reaches from zero to one hundred and he can read the position to the nearest mile per hour; then, he will be able to distinguish 101 positions, and the amount of his uncertainty will be somehow related to this number. However, it wouldn't be realistic to relate his uncertainty only to this number, 101. For suppose his speedometer scale ranges up to 150 instead of 100 miles per hour; when he is driving along the highway at a moderate speed, this extra portion of scale does not contribute in any way to his uncertainty; he will be quite sure that his needle will not be in this interval. In fact, he will expect to find his needle somewhere within a range of about 10 m.p.h., and he will be almost certain to find it within a somewhat larger range of, say, 20 m.p.h. Thus, to describe his uncertainty realistically, we must not only state every possible result of his reading, but will have to qualify each by a statement of expectation or probability.

The Amount of Uncertainty

As before, we turn to a binary situation to obtain a simple perspective of the problem. Suppose somebody has made a record of 100 tosses of a coin; he has registered only whether the coin fell 'head up' or 'tail up', but neglected all other features such as on what spot the coin came down, which direction the head faced, etc. What is the average amount of information in the record of any one toss? In other words, what is the amount of uncertainty before the record is seen?
The uncertainty must be a function of 'two', the number of alternatives; it must be modified by their relative frequencies. If it is known that the record is that of a coin so thoroughly biassed that 'head' always turns up, then there will be no uncertainty at all; if the coin is moderately biassed, then the outcome of a toss will be uncertain but not quite as much as with an unbiassed coin. If we don't know the bias of a particular coin, then we do not know exactly how uncertain we should feel about the outcome of a toss. If we know that the record contains 60 'heads' and 40 'tails', then a record of 'head' will show up with a probability of .60, a record of 'tail' with a probability of .40. The uncertainty can be described by a statement of these probabilities:

Probability of head up     0.60
Probability of tail up     0.40

In the same way we can describe any number of binary uncertainties with a 60-40 choice between any class 'A' and its complement 'non-A', where 'A' and 'non-A' may be males and females, hits and misses, friends and foes.

* At some time there was some discussion whether uncertainty and information should be given opposite signs. Present usage prescribes the same sign for both.

These uncertainties differ in any number of respects from each other. They will be of interest in very different situations; the kind of information needed to produce certainty is not the same; neither is the usefulness of this information, and so on. However, there is something in common between all uncertainties which can be characterized by the probabilities:

Probability of 'A'         .60
Probability of 'non-A'     1 - .60

One aspect of this 'something-in-common' is that an arrangement of any 60 A's and 40 non-A's can be coded to represent any other 60 A's and 40 non-A's: heads or tails, males or females, hits or misses, friends or foes.
Once such representation has been established, then the uncertainty concerning one event will be abolished by information concerning the other. We have previously equated the amount of information with the amount of uncertainty it removes. Accordingly, it can be said that the amounts of uncertainty and information must be equal in all situations characterized by a binary alternative with probabilities .60 and .40.

The foregoing consideration exposes the fundamental features of the measure of information:

(1) Information is a measurable abstract quantity; its value does not depend on what the information is about, just as length, or weight, or temperature have values which do not depend on the nature of the thing which is long, heavy, or hot;

(2) Information is related to the ensemble of possible outcomes of an event; its value depends on the probabilities associated with these outcomes, but not on their causes, and not on their consequences.

What remains is the development of a measure which complies with this concept of 'amount of information'; this is merely a technical problem.

An obvious generalization states that whenever two events have the same number of possible outcomes, and identical sets of probabilities are associated with the two ensembles of possible outcomes, then these two events have identical information contents. However, we wish to be able to compare events with quite different probability sets; for instance, we wish to be able to say which uncertainty is greater, that associated with a situation with three equiprobable alternatives, or that where there are four possibilities with probabilities .8, .1, .05 and .05. To answer such questions, we have to derive a measure which is a single number, whatever the number of possible categories and their associated probabilities. Such a measure is readily derived from the equivalence of uncertainty with the information which removes it.
We may represent the information content of an uncertainty-removing piece of intelligence in any manner we wish. We stipulate that this information should be represented in a standard fashion, namely, by using a binary alphabet. In addition we stipulate that the binary representation be coded in such a manner that the expected number of symbols is minimized. We thus obtain a unique number; namely, the minimum average number of binary symbols needed to abolish the uncertainty associated with a given situation. This number will be called the amount of uncertainty or information of this situation.

The function here needed has already been derived as the condition of representability. If two situations can be made to represent each other, then information on one can abolish uncertainty concerning the other. Thus, mutual representability implies equal information content, and representation in the standard binary system yields a general measure of information content. This measure is the 'amount of selective information' as defined by Shannon and Wiener (4, 5). It is expressed as follows:

Let x be a classification with categories i and associated probabilities p(i); then the information content of x is designated H(x) and given by*:

\[ H(x) = -\sum_i p(i) \log_2 p(i) \]

The units of this function are the binary digits needed for representation of a given event, and are called bits. It must be remembered that the 'bit' is a technical unit of amount of information and not a small piece of information. A single chunk of information may contain many bits or a fraction of a bit.

Some Properties of the Shannon-Wiener Information Function

The Shannon-Wiener information function has been derived (admittedly, in a loose fashion) from a consideration of standard representation of information. We will now consider a number of its properties and see that they correspond closely to the behavior which one would intuitively expect from a good measure of information.
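The definition of H(x) translates almost word for word into a short computation. The sketch below is an illustration only, written in Python; the function name `information_content` and the example classifications are assumptions introduced here and do not occur in the text. It shows in particular that a single event may carry a fraction of a bit as well as several bits.

```python
from math import log2


def information_content(probabilities):
    """H(x) = -sum_i p(i) * log2 p(i), in bits.

    `probabilities` maps each category i to its probability p(i); the
    values must be non-negative and sum to 1 (within rounding).
    """
    values = list(probabilities.values())
    if any(p < 0 for p in values) or abs(sum(values) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # Categories with p = 0 contribute nothing to the sum.
    return -sum(p * log2(p) for p in values if p > 0)


# A 50-50 alternative.
fair_coin = information_content({"head": 0.5, "tail": 0.5})
# Eight equiprobable categories.
eight_way = information_content({k: 1 / 8 for k in range(8)})
# A binary alternative in which one outcome is nearly certain.
near_certain = information_content({"A": 0.99, "non-A": 0.01})

print(fair_coin, eight_way, round(near_certain, 4))
```

Passing the categories as a mapping rather than a bare list is merely a design choice made for this sketch; it keeps each probability attached to the name of its category, as in the tables of the text.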
* The information function looks (except for a scale factor) like Boltzmann's entropy function; this is not a mere coincidence. The physical entropy is the amount of uncertainty associated with a state of a system, provided all states which are physically distinguishable are considered as different, that is, if the categorization is taken with the finest grain possible. In most situations dealt with in information theory, large numbers of states which are physically distinguishable are lumped into equivalent classes. The category "one light on the steeple" is a good example; an enormous number of physically distinct states are compatible with this definition, but they are all lumped into one class. The distinctions upon which categorizations are based are usually a very small percentage of the distinctions one could make. Thus, physical entropy is an upper bound of the information functions which can be associated with a given situation, but it is a very high upper bound, usually very far from the actual value. For this reason, I prefer not to use the word 'entropy' as synonymous with 'information'. A very thorough discussion of the relation between information and entropy has been given by Brillouin (9).

(1) Independence. Let i be one of the possible categories of an event x, p(i) the associated probability, and F(i) the contribution of the ith category to the uncertainty. It is desirable that F(i) be a function of and only of p(i). The function

\[ F(i) = -p(i) \log_2 p(i) \]

fulfills this requirement.

(2) Continuity. A small change of p(i) should result in a small change in F(i); in other words, F(i) should be a continuous function of p(i). The function $p(i) \log_2 p(i)$ is continuous.

(3) Additivity. It is desirable that the total information derived from two independent sources should be the sum of the individual information; in other words, the uncertainty concerning independent events should be the sum of the individual uncertainties.
Let y be an event with categories j and associated probabilities p(j). Let p(i,j) be the probability of the event pair that x falls into category i and y into category j. Then, the function

\[ H(x,y) = -\sum_{i,j} p(i,j) \log_2 p(i,j) \]

will measure the uncertainty associated with the event pair. If x and y are independent events, then

\[ p(i,j) = p(i) \cdot p(j) \]

As a matter of fact, this relation is often used to define independence. In this case, we have

\[ H(x,y) = -\sum_{i,j} p(i,j) \log_2 [p(i) \cdot p(j)] = -\sum_{i,j} p(i,j) \log_2 p(i) - \sum_{i,j} p(i,j) \log_2 p(j) \]

It is known that

\[ \sum_j p(i,j) = p(i), \qquad \sum_i p(i,j) = p(j) \]

Substituting these expressions, we obtain

\[ H(x,y) = -\sum_i p(i) \log_2 p(i) - \sum_j p(j) \log_2 p(j) = H(x) + H(y) \]

Thus, the Shannon-Wiener function fulfills the postulate of additivity.

(4) Natural Scale. The prototype of uncertainty is that associated with a 50-50 choice. So, the unit of uncertainty should be the uncertainty associated with this situation. In this case, both p's have the value 1/2, and

\[ H(x) = -\left(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\right) = 1 \]

Thus, the Shannon-Wiener function is seen to have an appropriate scale factor.

We have derived the information function from the postulate of efficient binary representation, and have found that the function so defined has the desirable properties of independence, continuity, additivity, and natural scale. We could have started differently, setting up these four properties as postulates. It can be shown that these four postulates (or other sets of four similar postulates) define uniquely the Shannon-Wiener function. Working it this way, we would have derived the fact that the function so defined has the desirable property of efficient binary representation.

The function F(p) is plotted against p in Fig. 1. The graph shows a curve which originates and terminates at F = 0, and has a flat top with a maximum of F = 0.53 for p = 0.37.
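The position and height of the maximum just quoted can be confirmed by elementary calculus. The short derivation below is supplied as a check and is not part of the original argument; it assumes only the definition $F(p) = -p \log_2 p$ and writes ln for the natural logarithm.

```latex
\[
F(p) = -p \log_2 p = -\frac{p \ln p}{\ln 2},
\qquad
\frac{dF}{dp} = -\frac{\ln p + 1}{\ln 2}.
\]
Setting the derivative equal to zero gives $\ln p = -1$, that is,
\[
p = \frac{1}{e} \approx 0.37,
\qquad
F\!\left(\frac{1}{e}\right) = \frac{1}{e \ln 2} = \frac{\log_2 e}{e} \approx 0.53 .
\]
Since $\dfrac{d^2F}{dp^2} = -\dfrac{1}{p \ln 2} < 0$ for $0 < p \le 1$,
this stationary point is the maximum of $F$.
```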
Inspection of the graph reveals some more important properties of the function F(p):

Fig. 1. Graph of $F(p) = -p \log_2 p$ as a function of p

(5) F(0) = 0: When a particular class of events is certain not to occur (p = 0), then it does not contribute to the measure of uncertainty.

(6) F(1) = 0: When a particular class of events is certain to occur (p = 1), i.e. excludes all other classes, then there is no uncertainty about the outcome.

(7) Effect of Averaging:

\[ F\left(\frac{p_1 + p_2}{2}\right) \ge \frac{1}{2}\left[F(p_1) + F(p_2)\right] \]

The function of the average is greater than, or at least as large as, the average of the function. When the probabilities associated with two disjoint categories are averaged, then the uncertainty becomes larger. Figure 2 is a graphical demonstration of this effect. The extreme case of averaging occurs if all r categories in a classification are considered equiprobable. Then,

\[ p(i) = \frac{1}{r}, \qquad \max H(x) = -\sum_{i=1}^{r} \frac{1}{r} \log_2 \frac{1}{r} = -r \cdot \frac{1}{r} \log_2 \frac{1}{r} = \log_2 r \]

In particular, in a binary classification, r = 2 and

\[ \max H(x) = 1 \]

Thus, the maximum uncertainty associated with two alternatives is one bit; it occurs if both alternatives are equally probable (this is the case of the unbiassed coin!).

Fig. 2. Graphical demonstration of the effect of averaging

(8) Effect of Pooling:

\[ F(p_1 + p_2) < F(p_1) + F(p_2) \]

The function of the sum is smaller than the sum of the functions. That is, pooling of two classes in one equivalence class reduces uncertainty (exactly by that uncertainty which is associated with the distinction between the two pooled classes). Extreme pooling results in a single category with probability 1; this means uncertainty 0. Figure 3 demonstrates the effect of pooling.

Fig. 3. Graphical demonstration of the effect of pooling. The function of the sum is on the intersection between the curve and the ordinate over the sum; the sum of the functions is on the intersection of the same ordinate with a straight line through the origin and the midpoint of the straight line which connects the intersections of the curve with the ordinates over $p_1$ and $p_2$; hence: $F(p_1 + p_2) < F(p_1) + F(p_2)$

The function $F(p) = -p \log_2 p$ has been tabulated. The reader is advised to use Fig. 1 to obtain approximate values for use in working the exercises below. For more precise values, one of the existing tables may be consulted (10, 11).

EXERCISES

7. Compute the uncertainty associated with: p(A) = .60, p(non-A) = .40

8. Compute H(x) for two alternatives, and plot the value against p(A).

9. Answer the question posed previously: which uncertainty is greater, that associated with a situation (x) with three equiprobable alternatives, or that (y) where there are 4 possibilities with probabilities .8, .1, .05 and .05.

10. Estimate the uncertainty of a motorist like the one described at the beginning of this section.

11. Certain languages have considerably fewer letters than English (that is, about 18 to 20), yet the information content per letter is nearly the same. How is this possible?

12. A situation has an unlimited number of alternatives, with probabilities of 1/2, 1/4, 1/8, 1/16, etc. in geometric progression. What is the measure of uncertainty?

IV. INFORMATION MEASUREMENTS PERTAINING TO TWO RELATED VARIABLES

In the two preceding sections we have discussed how to represent information, and how to measure amounts of information. Both procedures become important if information is to be manipulated. The manipulation most commonly used is communication.

In information theory, we use the word 'communication' in a wider sense than usual, just as the word 'information' is used in a wider sense than usual. We understand by 'communication' any relation between variables, accomplished by any means whatsoever, conscious or otherwise, provided that it results in a mutual reduction of uncertainty.
For instance: if one watches one of two tennis players, without looking at the other, he derives a considerable amount of information about the unseen player's action. Thus, the seen player transmits information about the unseen player, although in this case the transmission of information is incidental and not normally utilized, as one ordinarily looks at both players.

An Example of Two Related Variables

The following example is purposely selected to represent an instance of unintentional communication. The table below is based on Pearson and Lee's measurements of heights on 1376 father-daughter pairs. To simplify the analysis, we have grouped the data in coarse intervals of 3 in. each, and converted all frequencies into percentages.

Table II. Heights of Fathers and Daughters; Probabilities and Information Measures

Joint probabilities of heights, p(i,j) (Pearson and Lee's data, 1376 father-daughter pairs)

              j* = 59.5   62.5   65.5   68.5   71.5   74.5     p(i)   -p log2 p
i†‡ = 53.5         .001     -      -      -      -      -      .001      .01
      56.5         .001   .007   .006   .001     -      -      .015      .09
      59.5         .005   .022   .060   .027   .005     -      .119      .37
      62.5         .004   .042   .156   .152   .039   .001     .394      .53
      65.5           -    .009   .075   .175   .095   .010     .364      .53
      68.5           -    .001   .011   .035   .039   .010     .096      .32
      71.5           -      -      -    .003   .006   .002     .011      .07
p(j)               .010   .082   .308   .393   .184   .023    1.000     1.92
-p log2 p          .07    .30    .52    .53    .45    .13               2.00

* height of fathers, in 3 in. intervals
† height of daughters, in 3 in. intervals
‡ center of intervals

Information Functions:

\[ H(x) = -\sum_i p(i) \log_2 p(i) = 1.92 \text{ bits} \]
\[ H(y) = -\sum_j p(j) \log_2 p(j) = 2.00 \text{ bits} \]
\[ H(x) + H(y) = 3.92 \text{ bits} \]
\[ H(x,y) = -\sum_{i,j} p(i,j) \log_2 p(i,j) = 3.70 \text{ bits} \]
\[ T(x;y) = H(x) + H(y) - H(x,y) = 0.22 \text{ bits} \]

From the marginal sums, the uncertainties concerning the height of daughters, H(x), and of fathers, H(y), are computed as described in the preceding section. The uncertainty concerning both heights in a father-daughter pair is computed in similar fashion from the joint probabilities, p(i,j).
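The computation just described can be retraced from the cells of Table II. The following is a minimal sketch in Python, not the author's procedure; the names `JOINT` and `uncertainty` are illustrative assumptions, and each dash in the table is taken as a probability of zero. Because the cells are rounded to three decimals, the recomputed values should be expected to lie close to the printed ones without agreeing with them exactly.

```python
from math import log2

# Joint probabilities p(i, j) transcribed from Table II; a dash is taken as 0.
# Rows: height classes of daughters (i); columns: height classes of fathers (j).
JOINT = [
    [0.001, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.001, 0.007, 0.006, 0.001, 0.000, 0.000],
    [0.005, 0.022, 0.060, 0.027, 0.005, 0.000],
    [0.004, 0.042, 0.156, 0.152, 0.039, 0.001],
    [0.000, 0.009, 0.075, 0.175, 0.095, 0.010],
    [0.000, 0.001, 0.011, 0.035, 0.039, 0.010],
    [0.000, 0.000, 0.000, 0.003, 0.006, 0.002],
]


def uncertainty(probs):
    """Return -sum p * log2(p) in bits; empty cells (p = 0) are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Marginal sums: p(i) over rows (daughters), p(j) over columns (fathers).
p_i = [sum(row) for row in JOINT]
p_j = [sum(col) for col in zip(*JOINT)]

H_x = uncertainty(p_i)                               # daughters
H_y = uncertainty(p_j)                               # fathers
H_xy = uncertainty(p for row in JOINT for p in row)  # father-daughter pairs
T = H_x + H_y - H_xy                                 # deficit of the joint value

# For comparison: the same marginals combined as if the heights were
# independent, p(i, j) = p(i) * p(j); the deficit should then vanish.
independent = [pi * pj for pi in p_i for pj in p_j]
T_independent = H_x + H_y - uncertainty(independent)

print(f"H(x) = {H_x:.2f}, H(y) = {H_y:.2f}, H(x,y) = {H_xy:.2f}, T = {T:.2f}")
print(f"deficit for independent heights: {T_independent:.6f}")
```

The comparison with the product of the marginals is added only to make the additivity property visible on the same data; it is a check on the arithmetic, not a statement about the measurements.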
The uncertainty concerning both heights is properly called the joint uncertainty, or uncertainty of the two-part system; its symbol is H(x,y). It is compared to the sum of the two individual uncertainties. If the two heights were completely independent of each other, then the joint uncertainty would be equal to the sum of the individual uncertainties. In our case, it is smaller by 0.22 bits. The deficit is a measure of the internal constraints in the system, which lead to an association between heights of fathers and daughters. The function is designated by the symbol T(x;y). Its defining equation is:

$T(x;y) = H(x) + H(y) - H(x,y)$

This information function is germane to other statistics which measure the relatedness of two variables, such as the coefficients of correlation and of contingency. The T-measure is of very general applicability; the values of the variables do not have to be quantitative, not even ordered; they must only be distinguishable. For instance, one can compute a T-measure for a relation between color and shape.

The two functions, H and T, differ in the way in which they are affected by change of scale. Let us consider what would have happened if we had chosen one-inch intervals instead of three-inch intervals. It could be the case that only one one-inch interval out of any group of three is occupied at all. Then, the information that a certain height falls into a given three-inch interval would automatically locate it in some one-inch interval; hence, the uncertainty is not increased by the subdivision of intervals. However, this is an extremely unlikely situation. It is much more likely that the three one-inch intervals are populated with approximately equal frequencies. In this case, additional information of log2 3 = 1.58 bits is needed to specify the proper one-inch interval.
Then, the uncertainty concerning the height of fathers with regard to a one-inch scale will be 2.00 + 1.58 = 3.58 bits, and the uncertainty concerning the height of daughters 1.92 + 1.58 = 3.50 bits. The joint uncertainty will be increased by log2 9 = 3.17 bits, because each cell in the table will be replaced by nine cells as one goes from three-inch intervals to one-inch intervals. If one uses a still finer grain, going from inches to millimetres, then the individual uncertainties can be increased by another 4.7 bits, the joint uncertainty by 9.3 bits. This is quite the expected behavior. The more categories are recognized, the greater the uncertainty of classification. The uncertainty can become infinite for a continuous function. However, it will always remain finite for any set of real observations.

T, on the other hand, depends very little on the scale interval used. With very coarse grouping, T tends to be less. In the extreme case where all heights are pooled into one single class, all individual and joint uncertainties vanish, and with them their differences. In the other extreme case, where measurements are taken and registered to so many digits that no two results are alike, we must get H(x) = H(y) = H(x,y) = T(x;y) = log2 1376. But, between these unreasonable extremes, the measure of constraints is characteristic of the system and not of the scale which is used in measuring it.

Two-part Systems in General

We proceed to a general treatment of a two-part system x, y. Let i and j be the categories of x and y, respectively, and p(i) and p(j) the associated probabilities. Further, let p(i,j) be the probability of the joint occurrence [(x = i) and (y = j)].
Then:

$H(x) = -\sum_i p(i) \log_2 p(i)$
$H(y) = -\sum_j p(j) \log_2 p(j)$
$H(x,y) = -\sum_{i,j} p(i,j) \log_2 p(i,j)$

We introduce the conditional probabilities,

$p_i(j)$ .... Prob{y = j if x = i}
$p_j(i)$ .... Prob{x = i if y = j}

When x = i then y must have some value j with certainty (or probability 1.0), that is

$\sum_j p_i(j) = 1$

Equally,

$\sum_i p_j(i) = 1$

Furthermore, the probability of the joint occurrence [x = i and y = j] can be factored into the product of the probability that x equals i, times the conditional probability that y = j if x = i; equally, it can be factored into the product of p(j) times $p_j(i)$. So:

$p(i,j) = p(i) \cdot p_i(j) = p(j) \cdot p_j(i)$

The conditional probabilities yield naturally conditional uncertainties. For instance, the uncertainty of y, if it is known that x = i, will be

$H_i(y) = -\sum_j p_i(j) \log_2 p_i(j)$

The average uncertainty of y, under the condition that x is known, is designated by $H_x(y)$. It is obtained as the weighted average of the $H_i(y)$'s:

$H_x(y) = \sum_i p(i) H_i(y)$

Substituting the value of $H_i(y)$, we get

$H_x(y) = -\sum_i p(i) \sum_j p_i(j) \log_2 p_i(j)$

and remembering that

$p_i(j) = \frac{p(i,j)}{p(i)}$

we get

$H_x(y) = -\sum_{i,j} p(i,j) \log_2 \frac{p(i,j)}{p(i)}$

Expanding the logarithm gives

$H_x(y) = -\sum_{i,j} p(i,j) \log_2 p(i,j) + \sum_{i,j} p(i,j) \log_2 p(i)$

Noting that

$\sum_j p(i,j) = p(i)$

we get

$H_x(y) = -\sum_{i,j} p(i,j) \log_2 p(i,j) + \sum_i p(i) \log_2 p(i)$

We have seen that the first term on the right side is H(x,y) and the second -H(x). So:

$H_x(y) = H(x,y) - H(x)$

and

$H(x,y) = H(x) + H_x(y)$

A parallel development shows that

$H(x,y) = H(y) + H_y(x)$

This relation is quite obvious if put into words: the joint uncertainty concerning two variables is equal to the sum of the uncertainty concerning either one variable plus the conditional uncertainty concerning the second variable if the first one is given.

[Figure 4: bar diagram with segments labelled H(x), H_y(x), T(x;y), H_x(y), H(y) and H(x,y).]

Fig. 4.
The relation between information functions shown graphically

The difference in uncertainty concerning y, depending on whether or not x is known, $H(y) - H_x(y)$, is the gain in certainty about y derived from observing x. Substituting for $H_x(y)$, we get:

$H(y) - H_x(y) = H(y) + H(x) - H(x,y)$

The expression on the right side is the defining equation for T(x;y):

$H(y) + H(x) - H(x,y) = T(x;y)$

It follows from this derivation that T is a symmetrical function:

$T(x;y) = T(y;x) = H(x) - H_y(x) = H(y) - H_x(y)$

and it becomes clear why T is a measure of the mutual reduction of uncertainty.

The relations between the six information functions, H(x), H(y), H(x,y), $H_x(y)$, $H_y(x)$ and T(x;y), can be demonstrated graphically as in Fig. 4.

In normal code representation, i.e. reduced to efficient binary operations, the information functions have the following meaning:

H(x) .... number of operations which specify x
$H_y(x)$ .... no. of operations which specify x if y is given
T(x;y) .... no. of operations which apply to the specification of both x and y
H(x,y) .... no. of operations which specify the whole system.

Inspection of the graph shows that:

$H(x) \geq H_y(x)$
$H(y) \geq H_x(y)$

that is, the conditional uncertainty cannot be greater than the unconditional uncertainty.*

Communication Systems

When a system not only transmits information but exists primarily for that purpose, then it is called a communication system. No class of two-part systems has received as much attention as that of the communication system. In a simple communication system, the two parts are called the source and the destination of information. The distinction between source and destination must be based on external grounds; the informational relations between the two are perfectly symmetrical. The relevant states of the source are called the inputs, or signals sent, and the relevant states of the destination are the outputs, or signals received.
A single state is called a symbol, and a higher unit composed of several symbols, a message. The conditional probabilities for each pair of signals sent and received form a matrix called the channel.

Note that the word 'channel' is again used in a sense wider than customary. A 'channel' may but does not have to be a means of physically conveying information. For instance, if two variables x and y do not affect each other but are both affected by a third variable r, then knowledge of the state of x is likely to reduce the uncertainty concerning the state of y, and vice versa; hence, information is transmitted between the two variables, and they are connected by a 'channel' in the sense of information theory, although they do not communicate with each other directly.

* However, this is true only for an average conditional uncertainty, and does not apply to every particular condition. The following example will help to fix the ideas: Consider a diagnostic test for a certain disease; suppose the nature of the test and the occurrence of the disease are such that in 98 per cent of the patients the test is negative; that of the positive tests, 50 per cent are spurious; and that virtually every case of the disease will give a positive test. Then, if the test is not performed at all, the diagnostician's uncertainty as to the presence of the disease in any given patient is -(.99 log2 0.99 + .01 log2 0.01) = .081 bits/patient. If the test was negative then the uncertainty is zero. But, if the test is positive, the chances are equal that it is or is not spurious; hence, the uncertainty is 1.0 bit, and the diagnostician is more in doubt than he was before. However, the average uncertainty, conditional upon his performing the test, is reduced to .98 × 0 + .02 × 1.0 = 0.020 bits/patient.

The information functions in a communication system are designated as follows:

H(x) .... uncertainty of source
H(y) .... uncertainty of destination
$H_x(y)$ .... ambiguity
$H_y(x)$ .... equivocation
T(x;y) .... information transmitted, or communicated

Amounts of information transmitted must be referred to some unit of action. In particular, it is customary to compute transmissions per symbol or per unit time.

A channel which associates one and only one output with each input, and no output with more than one input, is called a noise-free channel or transducer; in this case,

$H(x) = H(y) = H(x,y) = T(x;y); \quad H_x(y) = H_y(x) = 0.$

We can think of a noise-free channel as a means by which information at the source is represented at the destination. Physically, this involves two acts of representation: first, states of the channel are selected so as to represent the inputs, according to some agreed-upon code; this is called encoding. Next, the states of the channel are translated into meaningful states at the destination; this is called decoding. All we have stated about representation, representability and amounts of information could now be restated in terms of encoding and decoding operations. In this sense, the relation which we introduced as the 'condition of representability' is also known as the Theorem of the Noise-free Channel; and all the examples and exercises of representing information could be re-interpreted as coding operations.

Noise. Few real channels are noise-free; in general, more than one output can follow a particular input. For instance, the 'channel' which links a daughter's height to her father's is far from noise-free; the following table gives the conditional probabilities:

Table III. Data of Table II in Form of a Communication Channel

Conditional probabilities, p_j(i)

            i = 53.5   56.5   59.5   62.5   65.5   68.5   71.5    H_j(x)
j = 59.5        -      .10    .50    .40     -      -      -      1.36
    62.5       .01     .09    .27    .51    .11    .01     -      1.80
    65.5        -      .02    .19    .51    .24    .04     -      1.74
    68.5        -       -     .07    .39    .45    .09    .01     1.70
    71.5        -       -     .03    .21    .52    .21    .03     1.74
    74.5        -       -      -     .04    .45    .43    .09     1.55

The last column, $H_j(x)$, is the uncertainty concerning the height of the daughter if the height of the father is known; it is not too surprising to find this uncertainty smallest in the extreme cases, and always smaller than the unconditional uncertainty of 1.92 bits.

The father's height 'communicates' some information about the daughter's height; the amount communicated is 0.22 bits. It is not more than that for a number of reasons. Some of the deficit in information about the daughter's height is undoubtedly due to ignorance, and could be reduced by taking proper account of various concomitant factors. Some of the uncertainty may be irreducible, due to a truly random process, possibly the selection of the particular chromosomes which go into determining the daughter's height. In the strict sense, the term 'noise' is reserved for the effects of random disturbances, and not for the effects of ignorance. However, the problem of the final distinction between uncertainty due to randomness and uncertainty due to ignorance is an extremely delicate one; the practical information analyst will usually be satisfied to treat any uncertainty as due to noise, which results in the greatest reduction of certainty. This interpretation will be subject to revision in the light of additional knowledge.

The two-part system 'father's height-daughter's height' is not a communication system, and this is one reason why so little information is transmitted. Suppose the numbers which define the 'father's heights' categories were not observed in a given population but could be chosen arbitrarily; for instance, they might be input voltages applied to a system. Accordingly, the 'daughters' heights' might be output voltages, and the table of conditional probabilities becomes a statement of the transfer function of the system. It is obvious that this system can be made to transmit more than 0.22 bits per symbol.
For instance, using only j = 59.5 and j = 74.5, with equal frequencies, one would transmit about .90 bits per signal. In general: for each channel, $p_i(j)$, there exists a set of input probabilities, p(i), which maximizes the transmission rate. The rate so obtained is called the channel capacity.

Even with best utilization of the possibilities of a channel, it can do no more than transmit all the input information, and in general it will not transmit quite all of it. This leads to an important generalization: Manipulation of information cannot increase its amount; it can at best preserve it, and it is likely to reduce it. This important statement will be clarified by the discussion of an apparent exception. Suppose A wishes to send a message to B over the channel C; conditions being very good, B picks up not only almost perfectly the message sent by A but acquires, in the course of doing so, a considerable amount of information about conditions in the channel. His total information received might be more than that contained in A's message; still, he has lost some of the information contained in the message. In general: as a result of manipulating information, there can be more output information than there was input information, but the contribution of the input information to the total cannot be more than the amount of input information.

Error Detection and Correction

A codebook states which output should be associated with any given input. A noise-free channel fulfills these requirements perfectly. In a noisy channel other outputs than the required ones appear; in other words, a noisy channel produces errors. Errors lead to loss of information, and a reduction in the rate of transmission; in a noisy channel, $T(x;y) < H(x)$. This loss is unavoidable. However, it is at least possible to spot and correct the errors which have occurred. It is one of the main endeavours of information theory to devise methods to do this efficiently.
An error in a message can never be found unless the message contains some extra information which can be used for this purpose. For instance, if the message consists of a string of four digits chosen without any constraint:

5 3 8 7,

one has absolutely no possibility of knowing whether or not it contains any errors. If it has been agreed upon that the message will be repeated, then one can detect errors:

5 4 8 7   5 3 7 7,

and if the message is repeated several times, these errors can be detected and corrected, with arbitrary certainty if the number of replications can be made sufficiently large:

5 3 8 7   5 3 7 7   5 3 8 7   5 4 8 7   5 3 8 1.

In the second case, the possibility of error detection was bought at the price of making two digits do the work of one; the message is said to be 50 per cent redundant. In the last case, the price of error correction is the use of five digits to transmit a single one, or a redundancy of 80 per cent.

Introducing redundant information in the form of a simple replication is straightforward and effective, but not very economical. Error detection could be achieved more efficiently by simply adding the sum of the digits to the message:

5 3 8 7 2 3.

Here, the redundant information is only one-third of the total. In fact, giving only the last digit of the sum as 'signature' is almost as effective, and requires only 1 digit in 5, or 20 per cent redundant information. The signature check illustrates a general principle: a given amount of redundant information in a message can be used for error checking the more effectively the more evenly it is related to all parts of the message.

It is always possible to achieve reliability, in the presence of noise, by the use of redundant information; in fact, one can approach perfect reliability arbitrarily closely if one is willing to provide enough redundant information.
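The three devices just described (repetition for detection, repeated replication for correction, and a one-digit signature) can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a scheme taken from the text: the function names are invented here, messages are strings of decimal digits, and correction is done by a per-position majority vote, which is one plausible reading of "repeated several times".

```python
from collections import Counter


def detect_by_repetition(first, second):
    """Positions at which two copies of the same message disagree."""
    return [k for k, (a, b) in enumerate(zip(first, second)) if a != b]


def majority_decode(copies):
    """Correct errors by keeping, at each position, the most frequent digit."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


def add_signature(message):
    """Append the last digit of the digit sum as a one-digit 'signature'."""
    return message + str(sum(int(d) for d in message) % 10)


def signature_ok(received):
    """Check a received message whose final digit is the signature."""
    return add_signature(received[:-1]) == received


def redundancy(message_digits, total_digits):
    """Fraction of the transmitted digits that carry no new information."""
    return (total_digits - message_digits) / total_digits


copies = ["5387", "5377", "5387", "5487", "5381"]   # the five copies in the text
print(detect_by_repetition("5487", "5377"))          # disagreeing positions
print(majority_decode(copies))                       # majority vote per position
print(add_signature("5387"), signature_ok("54873"))  # signed message, failed check
print(redundancy(4, 8), redundancy(4, 20), redundancy(4, 5))
```

Note that the one-digit signature only detects an error; unlike the fivefold replication, it gives no indication of which digit is wrong.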
The amount of redundant information needed, for a given noise level and a given desired reliability, will depend on the efficiency of coding. The ideal relation between noise level and redundant information needed is formulated in Shannon's fundamental Theorem of the Noisy Channel. This theorem can be stated as follows: if a certain amount of information is to be transmitted with perfect reliability in the presence of noise, then it is necessary to provide at least as much redundant information as the amount of equivocation introduced by the noise; furthermore, this amount will be sufficient if the coding is maximally efficient. There exist several proofs of this theorem; none of them is easy to follow, and all are existence proofs, that is, they prove that an error-checking code exists which will fulfill the requirements, but they do not say how to construct it. In fact, perfectly efficient error-checking codes seem to be realizable only in a few special cases; however, close approximations to ideal efficiency are easily obtained if it is permissible to use message blocks of great length (12).

The economics of error-checking are dominated by three factors:

(I) the frequency and costliness of errors
(II) the cost of adding redundant information
(III) the availability and costliness of checking procedure (encoding and decoding).

The work of Shannon and his followers has dealt with one particular situation: encoding and decoding procedures are supposed to be reliable and gratis, the error frequency is to be reduced to almost zero, and redundant information is supposed to be used as sparingly as possible. As long as the theory is not completed even for this case, one cannot expect to develop a more general theory. Some qualitative notions of what it will entail can be gathered from a consideration of a much-used, and presumably well developed communication system, namely, printed language.
Symbols are gathered into various checking units (words, sentences, paragraphs, chapters); on each level, there operate constraints which will help to locate and correct errors. For instance, this sentence will be read corretly even though one letter has been onitted and one word misspelled. It seems that the redundancy per letter, in a coherent English text, is about 60 per cent. Paragraphs are constructed in such a way that the sense can be grasped even if whole words or even sentences are missing or perturbed, and the essence of a whole chapter is, in general, understandable even if a whole paragraph should be left out.

Actual Communications System

So far we have dealt with two-part systems in a purely abstract way. 'Sources' and 'destinations' are defined simply by the states which they can assume. 'Channels' are tables of conditional probabilities; in the simplest case, the channel is a kind of telephone book which associates every input to some particular output. If the association is not unequivocal, then the channel is said to be noisy. 'Noise' is defined as a random perturbation of the input-output link.

Those are nice, clean concepts, not to be confused with realities. The 'channel' exists on paper only, and is not the same as the mechanism which links two parts of a system. The informational relation between heights of fathers and daughters does not reveal the nature of the mechanisms involved; whether fathers affect their daughters' heights by means of their genes, or of the food they provide, or of the mother they select for them, cannot be decided on grounds of informational relations. Indeed, I believe that Buddhist tradition would explain the correlation on the grounds that daughters select their fathers; as far as information theory is concerned, this is perfectly acceptable.

The scheme shown in Fig. 5 is a somewhat closer approximation to reality:

[Figure 5: block diagram. SOURCE, message, ENCODER-TRANSMITTER, signals, CHANNEL (acted on by NOISE), signals, RECEIVER-DECODER, message, DESTINATION.]

Fig. 5. A diagrammatic representation of a communication system

It is customary to treat all links but the channel as noise-free. If need be, one can introduce noise into the other links of the model by some straightforward adaptations.

If signals and channels are physical entities, then it is relevant to investigate their physical capacity of carrying information. Suppose the nature of a unit of action and the physical constraints are such that the channel can assume any one of m states during one unit of action; then, these states can be made to represent log2 m bits of information. It is the function of the encoder-transmitter system to match the diversity of messages generated by the source to the diversity of states which can be assumed by the channel; those, in turn, are matched to the diversity of messages intelligible at the destination by the receiver-decoder system.

As long as the demands on the channel are light, the matching process is not much of a problem. However, it may become very difficult if the channel is to be driven at capacity, and if the various states of the channel are not of equal value; some may be more subject to noise effects than others, some may need more time than others, some may necessitate more effort than others. In general, one will tend to favor the safest, shortest, and easiest states. However, this must not go too far; if one goes to the extreme of using only the very 'best' state, then the channel does not transmit any information at all. To find optimum compromises between informational needs of source and destination and physical capacities of the channel, between amount of information used to carry messages and amount of information needed for noise reduction, is one of the fundamental problems of the theory of information and communication.

EXERCISES

13.
The following table gives the number of times the four possible combinations of two flower colors with two pollen shapes were found:

                  Flower color:
Pollen shape      Purple    Red
Long                296      27
Round                19      85

Is there information transmission between these two characters?

14. Define the following functions, and derive their values (in terms of H-functions):

T(x,y; z)
T(x; y,z)
T(x; y; z)

15. A diagnostic test gives the following results:

true negatives ..... 85%
false negatives ..... 5%
true positives ..... 3%
false positives ..... 7%

What is the informational value of the test? What is the maximum informational value that any test could give in this situation?

16. A teletype machine sends 2.3 groups of five binary symbols per second. What is the maximum possible rate of information transmission?

17. Same machine as in Exercise (16). All code groups are equiprobable. Error probabilities are as follows: symbols nos. 1 and 4 are always received correctly, nos. 2 and 3 are wrong 11 per cent of the time, no. 5 is wrong 1 per cent of the time. All errors are equiprobable. Compute equivocation and amount of information transmitted.

18. You are to send 2-bit messages through a channel which has the property that one in five binary symbols is bound to be in error. Construct four sequences of five binary messages which will allow the reconstruction of the original message. What is the efficiency of the code?

V. ORGANIZATION

Systems, Structures, Pattern

A system is an organized whole made up of interrelated parts. Organization is based upon the interrelations between parts. The parts may be strongly or weakly coupled; their effect on each other may be quantitative or qualitative.

[Figure 6: two nodes joined by one channel.]

Fig. 6. A simple communication network

If two parts are coupled in any fashion, then knowledge of the state of one must imply some information about the state of the other. Accordingly, any interrelation can be technically represented as a channel.
So, two components of a system can be symbolically represented by a simple communication network of two parts, referred to as two nodes and one channel (Fig. 6).

Let H(x) be the amount of information needed to know what state x is in. If y is known, some of this information becomes unnecessary, or redundant. This amount, T(x;y), is an index of the degree of coherence, constraint, integration, or organization which prevails in the system.

Consider the pair of words 'green valley'. These two words form a small system, a whole made up of interrelated parts. The whole has a meaning which neither part alone has. The price for this feature is elimination of many other possible connotations of 'green' and 'valley'. As a result, the information content of the word combination is smaller than the combined information contents of the two words. The difference must show up as redundant information. The presence of redundancy implies that each word contains some information about the other. This is best demonstrated by successful error checking. The errors 'preen' for 'green', and 'volley' for 'valley' would not be found in isolated words, but can be spotted in the pair.

System Analysis. There seem to be three general viewpoints under which relations within a system are assessed: (a) the amount of information transmitted, on the technical, semantic and pragmatic level; (b) the degree of control or cause-effect relations, dominance; and (c) the utility, or value, of the relation to one or both of the related parts. Information theory deals only with the first viewpoint. It does not concern cause-effect relations, or what causes the information to flow, and it is not concerned either with the utility of the flow of information.

Informational analysis of a system will be of interest if and only if the informational challenge is serious, that is, when a system has to process information at a rate which crowds its capabilities.
The informational challenge is the result of:

(1) The diversity which is characteristic of the tasks; this can be expressed as H/task. A system which is faced with the same task all the time or most of the time may be working very hard but the difficulty is not an informational one.

(2) The precision which is required; this can be expressed as the ratio T/H. That is, the diversity of tasks is informationally challenging only insofar as it is expressed in a diversity of responses. A system with a small response repertoire may be working very hard, but not in the informational domain.

(3) The time which is allotted for the fulfillment of each task. A system with very modest informational equipment can solve many tasks if given ample time. For instance, the extremely simple logical machine devised by Turing (13) will solve any solvable problem if given very much time.

The time rate of informational challenge of the system is the product

$\frac{H}{\text{task}} \times \frac{T}{H} \times \frac{\text{tasks}}{\text{unit time}} = \frac{T}{\text{unit time}}$

The informational output of the system will be measured in H-measures but the effective output, or informational performance, in terms of T-measures, as T per task or T per unit time.

The limits of the informational performance of a system can be found by systematically varying the informational challenge and observing the resulting performance. In such studies it is important to make sure that the system's performance is limited informationally, and not by difficulties of sensing inputs or generating outputs.

It is possible to vary the informational challenge in a number of modes; e.g. one can vary the number of sources of information, or the amount of information per source. Challenging in various modes reveals whether or not there exist several modes of limitation.
It seems that the informational performance which a system can produce in single tasks may be limited by the following factors, singly or in conjunction:

(1) the amount of information which can be processed effectively in a single task,
(2) the number of independent information-carrying components which can be involved in a single act of information-processing,
(3) the informational contribution from each independent component,
(4) all information-carrying components must be assembled within a certain length of time;
(5) in addition, there seem to be two general limitations on time rates: there is a minimum time for each act of information processing, and
(6) the over-all rate of information-processing is limited (only this last limitation has the character of a channel capacity).

This list of limitations is based on psychological experiments (14) but is believed to apply to all types of systems.

Multi-part Systems. The informational system analysis is not restricted to two-part systems. A system of three components can be represented as a three-node network with a connecting channel:

Fig. 7. A simple three-node network

Again, it is merely a matter of convenience which node, or set of nodes, one treats as the input, or independent variate.

The treatment can be extended to any number of components. Thus, a nine-node network is equivalent to one man receiving information from eight sources, or feeding information into eight sinks; or, to four men watching two sources, communicating with each other, and feeding information into three sinks; to a sentence of nine words; to a decision based upon eight factors.

The more parts there are to a system, the more difficult becomes the informational analysis (15, 16). This is territory that has been but recently opened, and we are still largely concerned with the formulation and highly tentative application of concepts.
It will be helpful to consider a parallel effort, namely, the study of organization by game theory (17). One result of this study is that each time a new player is added, the organization (the 'game') acquires a new qualitative feature. One-person games deal with problems of maximum; the addition of a second person introduces competition; of a third person, coalition; of a fourth person, an asymmetric role of one player in relation to the group of the other three. Von Neumann (17) points out that it is at this junction that the most remarkable problems begin to appear; also at this junction, there occurs a change from a rigorous and complete exposition to a heuristic and incomplete one.

The situation is similar in the study of organization by information theory. Each time a new part is added to a system, a qualitatively new information function appears. As long as one deals with a single variable, the problem is one of efficient use of existing variations. A two-part system introduces relations between parts; a three-part system, relations between relations; a four-part system, relations between a part and a complex of relations.

Unitization. It is an empirical fact that when a system is complex enough to require very many components, the phenomenon of unitization occurs. That is, some components get organized in such a way that they interact strongly among each other, and act as a unit with respect to the remainder of the system and the external world. Unitization seems to be a necessary evil; it might be an important key for the study of complex organization and complex mental activities. The phenomenon has never been really explained; it is possible that a quantitative treatment will be made possible through the use of information theory (18).

Unitization is always coupled with the phenomenon of limited span. Any real part has a limited information content.
In any single act of communication, the capacity of a part for non-redundant transmission is limited by its own information content. This amount must somehow be partitioned into interaction with the external world and interaction with the other members of the unit. If each of these interactions is to be of significant size, then only a limited number is possible. The interaction of a unit with the outside may be only a fraction of the information traffic within the unit. Hence, several units can be organized into a secondary structure of greater versatility, and this process can be repeated on successive levels of organization.

There appears, thus, a possibility that information theory can be helpful in formulating both the causes and the effects of unitization, and in establishing rational interpretations of the size of the units. This would be a very important contribution to any theory of organization.

Conclusion. We have proceeded from simple processes of representation to discussions of communication and, finally, organization. We have attempted to treat in a heuristic and perspicuous manner the basic principles of Information Theory: there exists a generalized concept of 'information' which includes communication and organization and is so general that every real event or structure has its informational aspects; this general concept is related to a measurable quantity; the operation of taking a measurement of this quantity is done by means of symbolization in a standard language. The functions as defined obey two fundamental theorems: the Representation Theorem, and the Theorem of the Noisy Channel. Both theorems impose a limit on the amount of information which can be effectively processed in a given situation; both also state that it is possible to reach this limit.

APPENDIX I
THE EVALUATION OF INFORMATION CONTENT

The examples and exercises should have familiarized the reader with the techniques of taking information measurements.
However, the investigator who wishes to use this knowledge in his field is bound to run into some difficulties. A typical difficulty is that a natural situation does not present itself neatly classified with a complete set of categories and probability measures. It often takes considerable ingenuity to supply the missing components of the picture. Wherever ingenuity must be used, the result will not be unequivocal. Hence it becomes important to estimate not individual information measures but rather whole ranges compatible with reasonable assumptions.

The Relativity of Information Measures

'Information content' is a measurable quantity, just as length is; and, just as length, it is a function and not a property of a particular set of events. The theory of relativity asserts that the measured length of an object depends on certain relations between the object and the measuring system. However, under everyday conditions these relations will not produce any significant effect and, most of the time, lengths behave as if they were properties of objects. The information content of an event depends on the manner in which this event is related to the frame of reference of the evaluating system. Unlike the case of length, these relations are not fixed under everyday conditions. Therefore, information content behaves only rarely as if it were a property of an event.

The amount of information, H(x), associated with an event, x, is defined as the expectation of the negative logarithm of the probability that x will fall into some category, i. Thus, the measure of information depends on three decisions: (1) the choice of a unit event, (2) the establishment of categories, (3) the selection of a set of probability measures.

In general, each of these decisions involves a degree of arbitrariness. Accordingly, a considerable range of information measures will be compatible with a given real situation.
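As a compact restatement of the definition just given, the following is a sketch in standard notation. The index i over categories and the symbol p_i for the probability of category i are notational assumptions; the base-2 logarithm is assumed because the measures in this primer are quoted in bits.

```latex
H(x) \;=\; -\sum_{i} p_i \log_2 p_i ,
\qquad p_i \ge 0 , \qquad \sum_{i} p_i = 1 .
```

Read against the three decisions listed above: the unit event fixes what x is, the categories fix the range of the index i, and the probability measures fix the values p_i. Changing any of the three changes H(x).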
The question of an appropriate selection of a unit event cannot be solved by mechanical application of hard and fast rules. There is a lower limit to the size of elements, imposed by limits of observability. In general, selection of these lower limits will force one to take cognizance of a tremendous amount of detail, most of which is bound to be irrelevant. Thus, one will try to select a unit event broad enough that all irrelevant details are submerged in its internal structure, yet narrow enough that no relevant relations get lost within the unit event. In practice, one has to make a guess, subject to revision by later experience. This difficulty occurs with all kinds of analyses, and is not specific to informational analysis.

The situation is quite similar with respect to categories. There, too, a bound exists, imposed by the capabilities of discrimination. In general a large number of discriminations can be made which are irrelevant to the problem at hand. For instance, if one deals with the semantic content of a printed message, it will be quite irrelevant to categorize by shapes of letters, quality of paper, type of printing ink, etc. The decision is not always so easy. For instance, in categorizing the atoms found in living matter it will, by and large, not be necessary to distinguish between isotopes; in the overwhelming majority of occasions, differences between isotopes will have no effect. Occasionally, of course, a particular isotope located in a sensitive spot and decaying at a critical moment can have very large effects. In a case like this, the selection of a set of categories becomes a matter of compromise.

The probabilities, finally, are never actually known. We have to estimate them, on more or less sound bases. In many situations where generalized information theory is used, the bases for estimating probabilities are rather uncertain.
Therefore, it becomes important to assess the dependence of information functions on fluctuations of probabilities.

The contingent nature of information measures has not always been obvious. All early applications of information theory dealt with telecommunication systems. In all of these, all informational characteristics are perfectly well defined. In Morse code, all we have to know is whether a particular information-carrying element is a blackness or a whiteness, and whether it is long or short. In pulse code modulation, the only thing that counts is presence or absence of a pulse within a stated interval of time. In pulse amplitude modulation, all information is vested in the amplitude of pulses. In all these cases, there is no question about the informational characteristics of the process under consideration.

The situation is radically different in the larger domain of applied information theory. For instance, take the case of two people transmitting information to each other by talking. The information-carrying element is a clause; to simplify our analysis, let us consider just words (remembering that the information content of a clause cannot be greater than that of its constituent words). Now, each person culls his words from a reservoir which is known to be large, but whose actual size is not exactly known. The information content of a single word depends on the probability of its use, and these probabilities are not exactly known either. Furthermore, they will hardly be the same for both persons involved in a conversation. Also, each word can have several meanings, one of which may be more or less determined by the context. The relations between words, meanings, and context, again, are not the same for any two people. This is not all. Information is conveyed not only by the choice of words but also by inflection of voice, loudness, timing, and accompanying gestures.
In such a situation we obviously have no hope of ever obtaining a precise, unequivocal, and incontestable measure of information content.

We are thus confronted with two alternatives: not to use information theory, or to try to devise ways of producing usable approximate estimates. Obviously, our choice is the latter alternative (19).

Approximation Methods

It appears that the approximation methods to estimate information functions are based on the following rules:

1. Averaging increases uncertainty;
2. Pooling decreases uncertainty;
3. Disregarding constraints increases uncertainty;
4. Rare events have small effects on uncertainty measures;
5. Small variations in probability have small effects on uncertainty measures;
6. In systems, information functions can be estimated in different ways, and care should be taken to select the most appropriate one;
7. If it is not possible to measure the actual information functions desired, then one can try to substitute closely related measurable quantities.

In the following paragraphs, these rules will be amplified and illustrated.

1. Averaging Increases Uncertainty. The fact was demonstrated in Section III. It suggests a simple bracketing procedure: obtain a lower and an upper bound of uncertainty by using probabilities which are certainly more and less unbalanced than they actually are. In particular, if the number of categories is known but their respective probabilities are not, then one can follow Laplace's procedure and set all probabilities equal, which maximizes uncertainty.

2. Pooling Decreases Uncertainty. This, too, has been proven in the third section. It is equally of value in bracketing procedures: using only categories actually discriminated puts a lower bound on uncertainty; assuming more categories than could be of interest establishes an upper bound.

3.
Disregarding Constraints Increases Uncertainty. Let x and y be different events, where y may differ from x only in time or place of occurrence or in any other respects. If H(x) is the uncertainty of x, and H_y(x) the uncertainty of x if y is known, then: H_y(x) ≤ H(x). That is, knowing some other event, y, cannot increase the average uncertainty concerning x; it will leave it unchanged if there is no association between x and y; it will reduce it if constraints exist which are manifested in a statistical association between x and y.

Rule 3 can be used for a bracketing procedure. Disregarding constraints yields an overestimate of H(x); introducing constraints known to be too strong, an underestimate. Constraints have to be very marked to cause large changes in H(x). For instance, the large inequalities of letter frequency in English texts reduce H from a possible maximum of 4.7 bits per letter to 4.1 bits; the strong constraints between successive letters and words result in an additional reduction to 1.5-2.0 bits per letter. Formally, rule 3 is a special case of rule 1.

4. Small Effects of Rare Events. The information functional is a sum of terms of the form (-p log p). This function rises steeply between zero and .10; hence, small probabilities contribute little to the total sum. For instance, ten equiprobable alternatives correspond to an H of 3.32. If one of these alternatives is replaced by ten separate sub-categories, each of probability .01, then the resulting H is 3.65. If instead of ten, one introduces 100 equiprobable sub-categories, each with probability .001, the resulting H is 3.99, or equivalent to sixteen equiprobable categories.

A good example turned up in a study by A. A. Blank. He calculated the information content of single English words. For particular reasons, the sample was restricted to four-letter words. Thorndike's list contains 1550 such words. H, based on the observed frequency of these words, is 8.13 bits per word.
Of these words, 119 occur with the greatest frequencies. Computing H on the basis of these words alone gives a value of 6.34 bits per word. Thus, taking into consideration only about one tenth of all categories already yields about four-fifths of the final information function.

This means that information functions can be estimated successfully as soon as the more common occurrences are categorized. The remaining infrequent occurrences will not contribute very much, and that contribution can be easily bracketed between values based on numbers of categories which are certainly too small and too large.

5. Small Effects of Small Variations in Probability. The curve of the function F(p) = -p log p has a flat top. Small changes in probability in this region have small effects. Consider the simplest case, of two categories. If their probabilities are equal, then H = 1. If the ratio of the probabilities is 1:2, then H = .92. If the ratio is 1:3, a very considerable deviation from equality, H is still .81. For a larger number of categories, the insensitivity of H against probability distortion is still more pronounced. If one replaces equiprobable alternatives by probabilities staggered arithmetically or geometrically, stipulating only that the span between the extreme values should be not more than one order of magnitude, then the resulting changes in H are quite small.

This implies that the assumption of equiprobability, which gives an upper bound as stated in rule 1, will not go very far from the true value unless probabilities are radically unbalanced. The stretch bracketed between an upper bound based on equiprobability, and a lower bound based on a distortion undoubtedly stronger than the real one, will not be very large.

6. Alternative Ways of Estimating Information Functions. In systems with several nodes, the compound information functions can always be estimated in several ways.
For instance, in a two-node communication system, the quantity which is the function of greatest interest, the amount of information transmitted, T(x;y), can be computed in three alternative ways: as the difference between input uncertainty and equivocation, as the difference between output uncertainty and ambiguity, or as the difference between the sum of the uncertainties of input and output and the uncertainty of their union. It usually is worthwhile to inspect the data very carefully to establish which of the set of functions can be most easily and most accurately computed. In many cases, the quantities most readily computed are not those which result directly from the plan of observation or experimentation. For instance, in most experiments it would be natural to measure output uncertainty and ambiguity, but it is easier to measure input uncertainty and equivocation.

7. Substitution of Related Quantities. In many cases where it is not practical to compute the proper information measures, one can compute information measures associated with related quantities. Take the case of estimating the amount of information which an individual can transmit after a single glance at a display. This quantity is very difficult to determine; but it is fairly easy to determine the amount of information which can be elicited from an individual by a short interrogation procedure after he has had a glance at the display. This function is not quite the one we want, but it is presumably closely related to it. Another example: in the case of mental arithmetic, we have no way of estimating the actual amount of information processed, but we can readily estimate the amount of information which must be processed if computations are done in the way in which the subject claims he computes. In cases of this kind one will use the measurable quantity instead of the desired one. Of course, results so obtained have to be used with a certain amount of restraint.
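The numerical illustrations quoted under rules 4 and 5 can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic: the helper name `entropy_bits` and the variable names are ours, not the author's, and base-2 logarithms are assumed because the text quotes its values in bits.

```python
import math


def entropy_bits(probs):
    """Uncertainty H = -sum(p * log2 p) in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Rule 4, starting point: ten equiprobable alternatives.
h_ten = entropy_bits([0.1] * 10)

# Rule 4: one of the ten alternatives replaced by ten sub-categories of
# probability .01, or by one hundred sub-categories of probability .001.
h_split_10 = entropy_bits([0.1] * 9 + [0.01] * 10)
h_split_100 = entropy_bits([0.1] * 9 + [0.001] * 100)

# Rule 5: two categories whose probabilities stand in the ratio 1:1, 1:2, 1:3.
h_ratio = {r: entropy_bits([1 / (1 + r), r / (1 + r)]) for r in (1, 2, 3)}

print(f"{h_ten:.2f} {h_split_10:.2f} {h_split_100:.2f}")            # 3.32 3.65 3.99
print(f"{h_ratio[1]:.2f} {h_ratio[2]:.2f} {h_ratio[3]:.2f}")        # 1.00 0.92 0.81
```

Splitting one alternative a hundredfold raises the number of categories from 10 to 109, yet H rises by only about two-thirds of a bit, which is the point of rule 4.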
Example: Rate of Information Transmission in Conversation. The working of the approximation methods can be shown by two examples. The first example is that which we used to illustrate the need for approximation methods; namely, that of estimating the amount of information in conversation.

We consider first the information carried in words. To establish an upper bound, we ask how much information must be transmitted so that the receiver can recognize every single word spoken. This upper bound, in bits per second, is the product of the rate of words per second times bits per word. A rate of 2.1 words per second is typical for lively discussions. The number of bits per word in English context has been estimated as 6.5 bits (±25 per cent). This yields 11 to 17 bits per second. Words are not the only method of communication between two persons conversing face to face. It can be shown, however, that all other means of communication add little to the total transmission rate.

We will now try to establish a lower bound. Of course, no general lower bound exists; it is easy to find examples where information is transmitted at the rate of 1 millibit per second, or less. What we want is an 'upper lower bound': a lower bound of the amount of information transmitted between people who try to communicate at some speed, and under reasonably favorable conditions. Such a bound is obtained by analysis of pragmatic communication. We look at situations where the verbal messages elicit or control actions. We make an informational analysis of the relations between actions and verbal messages. This will yield an amount of information demonstrably transmitted, and it certainly represents a lower bound to the amount of information communicated. At this time, we have a single case where pragmatic communication has been evaluated accurately in informational terms.
Felton, Fritz and Grier (20) measured the amount of pragmatic communication between an airplane pilot coming in for a landing and the control tower operator. They found an average rate of 2 bits per second, computed in terms of actual effects of the messages. Both pilot and control tower operator have every interest in communicating as fast as they can. On the other hand, they do so in the presence of a very high level of noise which reduces verbal communication to probably about one third of its optimum rate.

We conclude, thus, that information transmitted through verbal communication is certainly not less than 2 bits per second nor more than 17 bits per second, and very likely within the range between 6 and 12 bits per second. This estimate is rough but not at all unrealistic.

Example: Information Content per Printed Letter. A very elegant way of computing an information measure under unfavorable conditions was used by Shannon in his analysis of the 'entropy' of printed English (21). The information content of a single letter is easily determined as a function of relative letter frequencies. However, constraints between neighboring letters lead to a reduction of information content, and in order to estimate this reduction exactly one would have to investigate the probability distributions for long sequences of letters. This is manifestly impossible. Shannon, therefore, proceeded to estimate a related quantity; namely, the amount of information concerning language constraints which can be elicited from a person familiar with printed English by a carefully planned interrogation. The subject is given a text which is truncated at some point; he is asked to guess the next letter. If he is successful, then he is told to go on; if not, he is told to try again. Records are taken of the number of times a letter is correctly identified at the first, second, third, . . . statement.
In this setup, the experimenter acts as a source of auxiliary information, emitting sequences of the type 'wrong . . . wrong right', with an 'alphabet' of twenty-six different sequences (if repetitions are excluded, the letter must be identified after no more than twenty-five wrong guesses). The informational output of the auxiliary source depends on the relative probabilities of the various sequences. These probabilities are very unequally distributed. In a large percentage of the cases, the first statement is correct; the most frequent message from the auxiliary source is 'right'. The next highest probability is for the sequence 'wrong-right'. Messages with up to three 'wrongs' make up the vast majority of cases; the remaining categories, with from 4 to 25 'wrongs', have low probabilities. As was pointed out before, they contribute little to the estimated value of H. This means that we arrive at an estimate of the information furnished by the auxiliary source essentially as a function of two to four probabilities.

The amount of information per single letter is known to be about 4.1 bits (on the basis of relative frequency of letters in English texts). This is the amount of information per letter which the subject needs to reconstruct the whole text. Of this amount of information, a certain measurable fraction is furnished by the auxiliary source. The remainder must come out of the subject's head, and is based on his knowledge of language constraints. The amount of information so elicited will not be quite as high as the information content of language constraints, but it is a closely related quantity. By the ingenious trick of effectively reducing the size of the alphabet, this quantity has been made easily measurable.

APPENDIX II
ANSWERS TO EXERCISES

1.
One light: peace and quiet
two lights, vertically: enemy approaches by land
two lights, horizontally: enemy approaches by sea
two lights, diagonally: enemy approaches by land and sea
(This is not the only possible solution.)

2. (a) 0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010, 1100, 10000, 11110100011
(b) 9, 11, 147, 32
(c) .125, .6703125

3. (a) 10110100010000
(b) EDCBA

4. 'Construct a confusion-free code using five binary digits for each letter and compare the performance of this code with that of the above by encoding and decoding a message like this one'. Use part of the 32 code words made up of 5 binary digits, such as: 11111, 11110, 11101, 11100, etc. The message will be, on average, 21 per cent longer than with the most efficient code (5 is 121 per cent of 4.14), but it is much easier to decode. Some of the unused code words can be used for punctuation, etc. The teletype works on this principle.

5. Limiting value: -(.8 log2 .8 + .15 log2 .15 + .05 log2 .05) = .883

Single-event code:

Event   Code length (digits)   Prob. x digits
A       1                      .8
B       2                      .3
C       2                      .1
                               1.20

Average length is 1.20 digits per event; excess is (1.20 - 0.883)/0.883 = 36 per cent.

Two-event code:

Event pair   Prob.    Code length (digits)   Prob. x digits
AA           .64      1                      .64
AB           .12      3                      .72 (AB and BA together)
BA           .12      3
AC           .04      4                      .32 (AC and CA together)
CA           .04      4
BB           .0225    4                      .09
BC           .0075    5                      .0375
CB           .0075    6                      .06 (CB and CC together)
CC           .0025    6
             1.0000                          1.8675

This gives .934 digits per event; excess = (.934 - .883)/.883 = 5¾ per cent < 10 per cent.

6. Let x designate amino acids, and y nucleotides.
H(x) = log2 20 = 4.322, S_x = 1
H(y) = log2 4 = 2.0, S_y ≥ 4.322/2.0 = 2.161

7.
p      -p log2 p
.60    .44
.40    .53
H(x) = .97

8. The curve looks similar to F(p), but has a flatter top and is symmetrical, with a maximum of 1.0 at p(1) = .50.

9. H(x) = log2 3 = 1.58

y: p   -p log2 p
.8     .26
.1     .33
.05    .22
.05    .22
H(y) = 1.03

10. A realistic description of his uncertainty might be:
prob (55-64) = .95
prob (50-54) = .02
prob (65-70) = .02
prob (any other speed) = .01
Within each range, all speeds are considered equiprobable.
We will derive the answer in two steps, obtaining first the uncertainty as to the speed range:

Range             p      -p log2 p
55-64             .95    .07
50-54             .02    .11
65-70             .02    .11
any other speed   .01    .07
                         .36 bits

Next, we observe that the range from 55 to 64 miles per hour contains ten speeds (determined to the nearest mile) which are equiprobable. The uncertainty measure for ten equiprobable categories has been found to be log2 10 = 3.32. This uncertainty will arise 95 times out of 100; its expected contribution to the total uncertainty is 3.32 × 0.95 = 3.15. The other ranges are treated in the same way:

Range       No. of sub-classes (r)   log2 r   p · log2 r
55-64       10                       3.32     3.15
50-54       5                        2.32     .05
65-70       5                        2.32     .06
all other   81                       6.35     .06
                                              3.31 bits

We thus need (on average) .36 bits to determine the range of speeds, and an additional 3.31 bits (on average) to identify the speed to the nearest mile, within the range. The total uncertainty is 0.36 + 3.31 = 3.67 bits. Of course, different expectations would yield different uncertainties.

11. The letters occur with more nearly equal frequencies.

12. Two bits.

13.
$$H(\text{shape}) = -\left(\tfrac{323}{427}\log_2\tfrac{323}{427} + \tfrac{104}{427}\log_2\tfrac{104}{427}\right) = .80 \text{ bits}$$
$$H(\text{color}) = -\left(\tfrac{315}{427}\log_2\tfrac{315}{427} + \tfrac{112}{427}\log_2\tfrac{112}{427}\right) = .83 \text{ bits}$$
$$H(\text{color, shape}) = -\left(\tfrac{296}{427}\log_2\tfrac{296}{427} + \tfrac{27}{427}\log_2\tfrac{27}{427} + \tfrac{19}{427}\log_2\tfrac{19}{427} + \tfrac{85}{427}\log_2\tfrac{85}{427}\right) = 1.28 \text{ bits}$$
$$T(\text{color; shape}) = .80 + .83 - 1.28 = .35 \text{ bits}$$

14.
T(x, y; z) = mutual reduction of uncertainty between x and y on one hand, z on the other
$$T(x, y; z) = H(x, y) + H(z) - H(x, y, z)$$
$$T(x; y, z) = H(x) + H(y, z) - H(x, y, z)$$
T(x; y; z) = total constraint in a tri-variate system
$$T(x; y; z) = H(x) + H(y) + H(z) - H(x, y, z)$$

15.
              Actual
Test     pos    neg    Total
pos      3      7      10
neg      5      85     90
Total    8      92

H(y) = .40
H(x) = .47
H(x,y) = .84
T(x;y) = .03

The informational value of the test is .03 bits. Its maximum possible informational value equals the amount of uncertainty before the test, viz. .40 bits.
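The arithmetic of answer 15 can be checked with a short Python sketch, which also illustrates rule 6 of Appendix I: the transmitted information comes out the same whichever route is used. The function and variable names are ours. We take x to be the test outcome and y the actual condition, an assumption that matches the printed H(x) = .47 and H(y) = .40.

```python
import math


def entropy_bits(counts):
    """Uncertainty in bits of the distribution defined by non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# Rows: test result (pos, neg). Columns: actual condition (pos, neg).
table = [[3, 7],
         [5, 85]]

h_test = entropy_bits([sum(row) for row in table])          # H(x), row totals 10 and 90
h_actual = entropy_bits([sum(col) for col in zip(*table)])  # H(y), column totals 8 and 92
h_joint = entropy_bits([c for row in table for c in row])   # H(x, y)

# Route 1: sum of the two uncertainties minus the uncertainty of their union.
t_joint = h_test + h_actual - h_joint
# Route 2: H(y) minus the uncertainty of y that remains once x is known.
t_via_actual = h_actual - (h_joint - h_test)
# Route 3: H(x) minus the uncertainty of x that remains once y is known.
t_via_test = h_test - (h_joint - h_actual)

print(f"H(x)={h_test:.2f}  H(y)={h_actual:.2f}  H(x,y)={h_joint:.2f}  T={t_joint:.3f}")
```

Carried to three decimals, T is about .035 bits; the printed .03 is what results from subtracting the rounded two-decimal values, .40 + .47 - .84.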
16. 2.3 × 5 × 60 = 690 bits/minute

17. Begin by computing the output uncertainty. The probabilities of receiving each signal are obtained as the sum of receiving it correctly (0.2 for Nos. 1 and 4, .178 for 2 and 3, .198 for 5) plus the addition due to errors (1/4 of the errors, for each erroneous transmission). This procedure yields H(out) = 2.32 bits. Next, compute the ambiguities. These are zero for symbols no. 1 and 4. For 2 and 3, the ambiguity can be computed as the sum of the information needed to ascertain that an error has occurred (-0.11 log2 0.11 - 0.89 log2 0.89) plus the information needed to find out which of the possible and equiprobable four errors has occurred, which is 0.11 × 2.0 bits/symbol. Symbol no. 5 is treated similarly. The average of the ambiguities is 0.31 bits, hence T equals 2.32 - 0.31 or 2.01 bits, a loss of about one-sixth of the input information.

18. One solution is the following:

11000
10101
01110
00011

A single error will result in the reception of a word which is not in the code book. If one follows the rule of substituting that message in the code book which differs from the received one by one digit only, then every error (provided there is only one!) will be corrected. A five-digit binary message can carry five bits of information. If it is known that one error has occurred somewhere in a group of five symbols, then the information needed to locate the error is log2 5 = 2.33 bits. With maximum efficiency, one should use only 2.33/5 or 46.5 per cent of redundant information (which could be achieved by coding large sequences of five-digit words!). In our case, the redundant information is 3/5 or 60 per cent, and we transmit with an efficiency of 40/53.5 = 75 per cent. (Observe that there is less uncertainty if it is known that there is one error in every five-symbol word, than when it is only known that the error rate is 20 per cent!)

REFERENCES

1. L.
Szilard: Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen. Z. Phys. 53, 840-856 (1929).
2. R. A. Fisher: On the mathematical foundations of theoretical statistics. Phil. Trans. (A) 222, 309-368 (1922).
3. R. V. L. Hartley: Transmission of information. Bell Syst. Tech. J. 7, 535-563 (1928).
4. N. Wiener: Cybernetics, J. Wiley and Sons, New York (1948).
5. C. E. Shannon: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379-423, 623-656 (1948).
6. C. E. Shannon and W. Weaver: The Mathematical Theory of Communication, University of Illinois Press, Urbana (1949).
7. L. N. Ridenour: Computer memories. Sci. Amer. 192, 92-100 (1955).
8. R. M. Fano: The transmission of information. Tech. Rep. Mass. Inst. Tech. Res. Lab. Electron., no. 65 (1949).
9. L. Brillouin: Science and Information Theory, Academic Press, New York (1956).
10. L. Dolansky and M. Dolansky: Tables of log2 1/p, etc. Tech. Rep. Mass. Inst. Tech. Res. Lab. Electron., no. 227 (1952).
11. E. Klemmer: Tables for computing informational measures. Tech. Rep. A. F. Cambridge Research Center, ARDC.
12. Articles by J. E. Golay, P. Elias, I. S. Reed, R. A. Silverman, and M. Balser in: Transactions of the I.R.E. Professional Group on Information Theory (1954).
13. A. M. Turing: On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 42, 230-265 (1937).
14. H. Quastler: Studies of human channel capacity. In: Information Theory, ed. by C. Cherry, Academic Press, New York (1956).
15. Wm. McGill and H. Quastler: Standardized nomenclature: an attempt. In: Information Theory in Psychology, ed. by H. Quastler, Free Press, Glencoe, Ill. (1955).
16. H. Quastler: Information theory terms and their psychological correlates, ibid.
17. J. von Neumann and O. Morgenstern: Theory of Games and Economic Behavior, Princeton University Press, Princeton (1947).
18. H. Quastler, H. H. Chase, W. Montagna, M. V. Edds, Jr., P. F.
Fenton, and P. B. Weisz: Essays on biological unitization. Rep. Control Systems Laboratory, Univ. Ill., no. R-52 (1953).
19. A. A. Blank and H. Quastler: Notes on the estimation of information measures. Rep. Control Systems Laboratory, Univ. Ill., no. R-56 (1954).
20. F. Fritz and G. W. Grier, Jr.: Pragmatic communication. In: Information Theory in Psychology, ed. by H. Quastler, 232-243, Free Press, Glencoe, Ill. (1955).
21. C. E. Shannon: Prediction and entropy of printed English. Bell Syst. Tech. J. 30, 50-64 (1951).

SOME INTRODUCTORY IDEAS CONCERNING THE APPLICATION OF INFORMATION THEORY IN BIOLOGY

Hubert P. Yockey
Oak Ridge National Laboratory, Oak Ridge, Tennessee

Abstract. The model of protein synthesis in the cell which has been built up as the result of the work of many researchers has been used as a basis for applying the principles of information theory in biology. The main line of the argument has been the role of noise in the genome. The discussion has been kept as independent as possible of special models. It was shown that in a real organism noise must exist in the genome and that an ensemble of organisms may be represented by a probability distribution in H, p(H, λ). Individuality is thus incorporated in a very natural way. Dancoff's principle requires that there be a lower limit for viability for this distribution, H_a.

The action of a deleterious agent which induces errors in the genome by acting on nucleotide pairs is assumed to be represented by an equation of the first order:

^ = -j(X)p,(j) + ija)

where f(λ) measures the effectiveness of the deleterious agent, of which λ is a measure, in producing defects. A differential equation for H(λ) is derived and it is shown that dH/dλ, as a function of λ, behaves like f(λ).

I. INTRODUCTION

Information theory finds its place in biological thought through its ability to deal quantitatively with organization and specificity.
The importance of these concepts has long been recognized in biology, but this realization is rather sterile unless a quantitative form of expression can be found. One is reminded of a quotation from Lord Kelvin, 'When you can measure what you are speaking about and express it in numbers, you know something about it, but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.' The need for expressing biological quantities in numbers is clear, but solving the problem of how to do it is very much like belling the cat. Biology doesn't seem to have any problems both really simple and terribly important such as some which occur in the physical sciences. Perhaps for this reason, the application of first principles has come much more slowly in biology. That ideas of great general application do exist in biology is exemplified by Mendel's laws and by the theory of evolution.

One of the purposes of this article, and indeed one of the purposes of this book, is to explore the practical and theoretical consequences that may be found in the discovery that biochemical specificity of proteins is carried, largely at least, by the exact order of twenty amino-acid residues. The suggestion of Watson and Crick (1) that genetical information is carried by the exact order of four kinds of nucleotide pairs provides a molecular vehicle for the genetic control of protein specificity. Gamow (2) was the first to see that this control implied the existence of a four-letter to twenty-letter code. Thus by following the logical consequences of purely biological, or perhaps biochemical, problems one is led directly to a problem purely mathematical in character.
This notion of the role of order, which is basic to information theory, is worth pursuing in biology since it provides a way of measuring what we are speaking about and expressing it in numbers. Furthermore, from the results of applying the theory to specific problems, we may obtain an experimental check on the validity of these ideas as first principles.

In this article we shall apply these considerations to the storage and transfer of biochemical specificity. We shall explore, in particular, the role of noise in the genetical message. In my article in Part V the theory is applied to the practical problem of calculating and understanding survivorship curves. The present status of the means of storage and transfer of specificity is given by Gamow, by Ycas and by Augenstine in their respective articles in this volume. The question of the exact way in which information is destroyed by read-off error, radiation damage, aging, thermal fluctuations, biochemical side reactions, and so forth, is of equal importance. This problem is also discussed in this volume but no final and detailed account can be given at this writing. Nevertheless, since there is virtue in attempt, we shall attempt the development of a mathematical formalism which is information theoretic in character.

Most animals and plants exist at one time, at least, in the form of a single cell; we can consider that cell to contain a substantial part of the directions for the development of the organism. Since information is conserved unless lost due to noise, it shall be assumed that the mature organism is characterized by substantially the same information content as the fertilized egg or seed. In order to fix the idea we shall develop the formalism on the basis of Watson and Crick's suggestion concerning the role of DNA.
It should be remembered that the central ideas of this paper are independent of much of the detail embodied in Watson and Crick's papers and are dependent only on the possibility of genetical endowment being conveyed by a series of structures composing an information bearing molecule.

Suppose we imagine the symbols A, B, C, D (Gamow's predilection is to the less prosaic spades, clubs, hearts, diamonds!) arranged in one-to-one correspondence with the nucleotide pairs of the DNA found in a particular given cell. The cell will have been selected from a number of similar but not identical cells in a colony under study. This colony may be thought of as being indefinitely large, so that in principle we may consider the ensemble of all possible organisms identifiable as being members of the colony. Since the number of nucleotides in DNA is finite, the number of elements in this ensemble is also finite. Because of this one-to-one correspondence it will be seen that the set of symbol sequences, which is the mathematical model of the ensemble of organisms, will contain the informational or specificity properties of the ensemble of organisms.

The importance or value of a theory lies, among other things, in its capability of treating a wide variety of phenomena from a single point of view. It is well to think, at the start, of the field of validity this theory may have and, if it should fail, the significance of its failure. If it should be discovered that Watson and Crick's suggestion has very little bearing or applicability then this development, while negative, is still a valuable result. One would then perforce search for another explanation for the great detail and specificity characteristic of any biological phenomenon. At present it is the most detailed proposal based specifically on molecular chemistry.
The theory here developed is essentially statistical and may be expected to express its results in the form of expectation values, probability distributions, and their functions. The statistical character of the theory is directly in the line of thinking of both modern biology and modern physics. It should be kept clearly in mind that information theory deals with organizational problems and so some aspects of organisms will be outside its scope. In this sense it may be that the role information theory will play in biology will parallel that played by thermodynamics in physics and chemistry.

II. NOISE IN THE GENETICAL INFORMATION

The Instability of a Perfect System

Let us consider an ensemble of organisms and discuss the communication of information from the DNA to protein. There is evidence discussed by Gamow and by Ycas in this volume that the code which translates information from the four-symbol DNA code via RNA to the twenty-symbol protein code is based on triads of nucleotide pairs. Indeed it can be seen that it must be at least the triads since a twenty-symbol alphabet carries 4.32 bits per symbol whereas the pairs in a four-symbol alphabet carry exactly four bits per symbol, assuming no intersymbol constraints. The triads carry six bits per symbol and so this represents some inherent redundance. It would be desirable to express this formalism in terms of the DNA triads of nucleotide pairs; however, this requires a knowledge of the DNA to protein code. These data are missing. Our objective is to develop the mathematical formalism in as simple a way as possible so it appears more appropriate to consider the communication of specificity from DNA to RNA. Here we are dealing with a coding between two four-symbol alphabets. Suppose we are considering an ensemble of organisms which is isogenic, and further that this means that each organism is characterized by exactly the same order of nucleotides in the DNA of its nucleus.
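The bit counts quoted in the triad argument above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the text does, equally likely symbols with no intersymbol constraints; the function name `bits_per_group` is introduced here purely for illustration.

```python
import math


def bits_per_group(alphabet_size: int, group_length: int = 1) -> float:
    """Maximum information, in bits, carried by a group of `group_length`
    symbols from an alphabet of `alphabet_size` equally likely, independent
    symbols (the no-intersymbol-constraint assumption of the text)."""
    return group_length * math.log2(alphabet_size)


amino_acid_bits = bits_per_group(20)   # one residue of the twenty-symbol protein alphabet
doublet_bits = bits_per_group(4, 2)    # two symbols of the four-symbol alphabet
triad_bits = bits_per_group(4, 3)      # three symbols of the four-symbol alphabet

print(f"amino-acid residue: {amino_acid_bits:.2f} bits")
print(f"doublet:            {doublet_bits:.2f} bits")
print(f"triad:              {triad_bits:.2f} bits")
print(f"redundance of triad code: {triad_bits - amino_acid_bits:.2f} bits per residue")
```

A doublet falls short of the information needed to name one of twenty residues, while a triad exceeds it; the surplus is the inherent redundance mentioned in the text.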
We shall now show that this situation is unstable and that therefore a real ensemble of organisms will be represented by an ensemble of messages recorded in its DNA. From this it will follow that there is a distribution in the message entropy, characteristic of any ensemble of organisms, even one which is isogenic. The message entropy is

$$H = H_x - H_n \tag{1}$$

where $H_x$ is the message entropy of the genetical information and $H_n$ is the loss of information due to noise. That is, $H_n$ is the loss of information from some fault either in the duplication process in the germ line or the somatic line or from incorrect read-off of any kind. $H_n$ may be expressed in terms of the read-off or transition probabilities (3) of a letter of kind $i$ to a letter of kind $j$, $p_i(j)$. The probability of letter $i$ is $p(i)$.

$$H = H_x + \sum_{i,j} p(i)\, p_i(j) \log_2 p_i(j) \tag{2}$$

Consider the case where these probabilities are a function of some variable $\lambda$. In the application of these considerations $\lambda$ is the measure of some deleterious influence such as dose of ionizing radiation. Form the derivative $dH/d\lambda$:

$$\frac{dH}{d\lambda} = \log_2 e \sum_{i,j} \left[ p(i)\, \frac{d}{d\lambda} p_i(j) + p(i) \log_e p_i(j)\, \frac{d}{d\lambda} p_i(j) + p_i(j) \log_e p_i(j)\, \frac{d}{d\lambda} p(i) \right] \tag{3}$$

The absolute value of $dH/d\lambda$ will become indefinitely large because of the second term in equation (3) as any $p_i(j)$ approaches zero if $p(i) \neq 0$ and $(d/d\lambda)\, p_i(j) \neq 0$. This may happen, in particular, if any $p_i(j)$ approaches one, for then all $p_i(k)$ $(j \neq k)$ approach zero. This situation ($p_i(j) = 1$) corresponds to the assumption that there is always a correct reproduction in the DNA duplication or in the RNA read-off. Under these circumstances the first term is finite and the third term is zero. Watson and Crick regard a mutation as being reflected by a change in order of the nucleotide bases in DNA. This is apparently always possible; they have suggested a biochemical scheme by which this can be effected.
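The instability argument can be illustrated numerically. The sketch below is not the author's calculation: it assumes a symmetric four-letter channel in which each letter is read correctly with probability 1 - error and is misread as each of the three other letters with probability error/3, with all letters equally probable. The names `equivocation`, `symmetric_channel` and `slope` are hypothetical helpers. It evaluates the noise term of equation (2) and shows that its slope with respect to the error probability grows without bound as the system approaches perfection.

```python
import math


def equivocation(letter_probs, transition):
    """Noise term H_n = -sum_{i,j} p(i) p_i(j) log2 p_i(j), as in equation (2)."""
    total = 0.0
    for i, p_i in enumerate(letter_probs):
        for p_ij in transition[i]:
            if p_ij > 0.0:  # the limit of x*log(x) as x -> 0 is zero
                total -= p_i * p_ij * math.log2(p_ij)
    return total


def symmetric_channel(error, n=4):
    """Assumed model: correct read-off with probability 1 - error,
    each of the n - 1 wrong letters with probability error / (n - 1)."""
    off = error / (n - 1)
    return [[1.0 - error if i == j else off for j in range(n)] for i in range(n)]


UNIFORM = [0.25] * 4  # assumed equiprobable letters


def slope(error, step_fraction=1e-3):
    """Central-difference estimate of dH_n/d(error)."""
    h = error * step_fraction
    upper = equivocation(UNIFORM, symmetric_channel(error + h))
    lower = equivocation(UNIFORM, symmetric_channel(error - h))
    return (upper - lower) / (2.0 * h)


slopes = [slope(10.0 ** -k) for k in (2, 4, 6, 8)]
for k, s in zip((2, 4, 6, 8), slopes):
    print(f"error = 1e-{k}: dH_n/d(error) is about {s:.2f} bits")
```

Under this assumed model the slope equals log2(3(1 - error)/error), which increases indefinitely as the error probability goes to zero; a nearly perfect system therefore loses information very rapidly per unit of newly introduced error, in line with the behavior of the second term of equation (3).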
This means that in a real biological system $p(i) \neq 0$ and $(d/d\lambda)\, p_i(j) \neq 0$. A real ensemble of organisms will be represented by an ensemble of genetic messages. This will be true even if the ensemble is isogenic. Some noise must exist in the genetical information; if the noise is less than equilibrium it is quickly introduced.

There is some experimental evidence in support of this conclusion. Burdette (4) prepared populations of isogenic Drosophila. One strain had the same low incidence of tumors in both sexes (about 4 per cent) and the other had a high incidence (about 60 to 80 per cent) even greater in males than in females. The tumor incidence of the isogenic strains was initially much lower in each case than the stock from which it originated. But in each case, by the twelfth generation, the tumor incidence of the isogenic strain had returned to about the same rate as that of the original stock. Tumor incidence is a morphological malfunction and, as shown in this and other experiments, is under genetic control. The fact that all flies were not tumor bearing and the gradual return of the isogenic strains to the tumor incidence of the strains from which they were selected, reflects the accumulation of errors in the genome. The results of the experiment are in accord with the proposition proved above.

Representation of the Ensemble of Organisms by a Probability Distribution in H: $p(H, \lambda)$

If we grant that perfect systems do not exist, the other side of the coin is, how imperfect may they be? This question was first discussed by Dancoff and Quastler (5) and their conclusion, which is known as Dancoff's principle, states that the amount of redundance is just that required to reduce the error rate to a tolerable level. According to this principle, we may expect that errors will continue to accumulate in the genome of a given organism until at some point serious difficulty including death will occur.
This will be reflected by some value of $H$, which we call $H_a$, limited by viability. An argument for a lower limit $H_a$ has been given previously (6). Errors will accumulate in the genome but at the same time there is a favorable selection for those members of the ensemble which have low equivocation. This represents a certain reserve capacity to withstand the insults of existence. It may therefore be expected in general that the message entropy of the ensemble of organisms will be described by a probability distribution. This distribution can, perhaps, be calculated from first principles, at least for simple cases, when more is known about the storage and transfer of genetical information.

Death of an organism is defined in different ways in various fields of biology. Permanent loss of reproductive power is the definition of death usually expressed or implied in bacteriology (7). This is the definition chosen in spite of the fact that there are many intermediate stages between the active living cell and the dead cell. It is known that yeast cells which have lost the power to multiply may still be able to ferment (8). Zelle and Hollaender (7) have recently pointed out that attempts to explain the bactericidal effects of irradiation on the basis of one mechanism are unrealistic. In the case of animals the cessation of metabolism, not the loss of fertility, is the criterion of death. These criteria of death are not really different or antagonistic. Since loss of function is implied by loss of information content any experimentally convenient definition of lethality may be used to suit the problem at hand. The lower end of the distribution in message entropy will therefore be determined by the specificity required by the environment.

A communications analogy may clarify the notion further. Suppose we have a message, with redundance, which is sent through a communication channel with a small but finite noise level.
The message contains instructions to perform some necessary task. A recording is made and the message is sent through again, and so forth. Eventually, depending on the noise level of the channel and the redundance in the message, it will be just barely intelligible. No further recordings can be made without loss of part of the required information content. The ensemble of recordings is analogous to the ensemble of organisms. It will be seen in either case that there is a distribution of information content among the elements of the ensemble.

Individuality finds a place in the theory developed here in a very natural way. This feature corresponds more to reality (9) than theories which must explain non-uniform response as fluctuations. Besides the experiments of Burdette mentioned above it will suffice to note one other example of biological individuality. Consider the experiments of Schott (10, 11), Hetzer (12), Lambert (13), Gowen (14), discussed by Gowen (15), on Salmonella typhimurium in mice and Salmonella gallinarum in fowl. The host population is exposed to the pathogen and the survivors are chosen for further breeding. The case for mice is typical. The survival ratio improved from 18 per cent to 93 per cent in six generations, but remained nearly constant after that. One hundred per cent survival was not achieved. The survival ratio is characteristic of the ensemble not of the individual. Gowen (15) also prepared six strains of mice by sibling matings for twenty or more generations. When survival was tested the survival ratios were 1, 14, 34, 63, 64, 83 and 88 per cent. These results again stress the importance of individuality as Gowen pointed out.*

Point Mutations and Chromosome Aberrations

We have now arrived, via our discussion, at territory familiar to the radiation biologist.
This is the controversy over the role played by point mutations and chromosomal aberrations induced by deleterious agents such as x-rays. This subject has been ably discussed recently by Muller, Kaufmann, Giles, Carlson, Swanson and Stadler, and by Kimball (16). The point of view of these authors varies. Kimball takes the stand with Lea (17) that the death of cells is due to chromosome aberrations which become effective at cell division. Swanson and Stadler point out that the two effects occur together and that a clear cut separation has not yet been accomplished. Muller points out some difficulties with the mutation by breakage interpretation. Russell (18) states that gross chromosomal aberrations, although they cause early death of embryos, are probably not an important radiation hazard to man.

From the point of view of this article each of these effects is a way of introducing disorganization in the genome. The point mutation mechanism is the biological analogue of the 'white noise' of the communications engineer. The other extreme is not found in communication engineering but involves a strong correlation between errors and is reflected as a loss of whole paragraphs or other gross mutilation of the message. Each of these extreme cases will be important in applications of information theory in biology. Unfortunately, the second case has not been studied mathematically and so it is not known how to calculate the equivocation it introduces. It is therefore necessary to proceed with the calculation of only the part of the equivocation which corresponds to point mutations. Since one of our objectives is to develop a fundamental theoretical treatment of radiation hazard to man, Russell's comment encourages one to think that this procedure is worthwhile. It should be remembered that equivocation from these two extreme conditions may have the same dependence on the deleterious influence. This is a point which requires further mathematical study.
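The two extreme kinds of damage contrasted above can be made concrete with a toy simulation. This is only an illustrative sketch: the four-letter alphabet A, B, C, D follows the text, but the message, the error probability, the block position, and the helper names `point_mutations` and `block_deletion` are all assumptions chosen for the example.

```python
import random

ALPHABET = "ABCD"  # the four symbols used in the text for nucleotide pairs


def point_mutations(message, error_prob, rng):
    """Independent letter-by-letter errors (the 'white noise' case):
    each letter is switched to a different letter with probability error_prob."""
    out = []
    for letter in message:
        if rng.random() < error_prob:
            letter = rng.choice([c for c in ALPHABET if c != letter])
        out.append(letter)
    return "".join(out)


def block_deletion(message, start, length):
    """Strongly correlated damage: a whole contiguous stretch is lost."""
    return message[:start] + message[start + length:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
message = "".join(rng.choice(ALPHABET) for _ in range(200))

noisy = point_mutations(message, 0.05, rng)
mutilated = block_deletion(message, 50, 40)

changed = sum(a != b for a, b in zip(message, noisy))
print(f"point mutations: length {len(noisy)}, {changed} letters switched")
print(f"block deletion:  length {len(mutilated)}, 40 consecutive letters lost")
```

In the first case every letter is still read off, correctly or incorrectly, so the damage can be described by transition probabilities between letters; in the second the message itself is shortened, which is why it falls outside the equivocation calculation pursued in the text.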
* Individuality as an integral feature in biology has been emphasized recently by Roger J. Williams in Biochemical Individuality, J. Wiley and Sons, New York; Chapman & Hall, London (1956).

The Interaction of the Deleterious Agent with DNA and the Decay of H

According to the Watson and Crick model of DNA there seems to be no biochemical reason why there should be an interaction between nucleotide pairs. The biological requirements for protein specificity do not seem to demand an intersymbol influence (19). The matter is not closed, but the evidence favors regarding the interaction of a deleterious agent with a nucleotide pair to be of the first order. We have previously suggested that the action of ionizing radiation or other deleterious agent may be such that the nucleotide pair is altered in such a way that it mimes another symbol as far as protein synthesis is concerned (6). It may be thrown into an excited tautomeric form from which it recovers by relaxation. Possibly one can account for biological recovery by such a mechanism. The consideration of recovery is omitted from this paper for simplicity and we shall need only the notion expressed in the first sentence of this paragraph.

In view of the above remarks we may write the following equation for the rate of change of $p_i(j)$ with $\lambda$:

$$\frac{d}{d\lambda}\, p_i(j) = -J_{ij}(\lambda)\, p_i(j) + c_{ij}(\lambda) \tag{4}$$

The first term represents the loss in nucleotides responsible for the $(i, j)$ transition. The second term is due to the gain in nucleotides engaging in the $(i, j)$ transition coming from other nucleotides altered by the deleterious agent. This can be brought into sharper focus by thinking of the binary case. Suppose $q$ is the correct and $p$ is the incorrect read-off probability. We are calculating the equivocation, or damage to the message, resulting from point errors. This means that, accordingly, a letter is not deleted but is read off either correctly or incorrectly.
This letter switching process may continue until half the letters are correct and half are incorrect; at that point $p = 1/2$ and $q = 1/2$. The information content vanishes. In the case of a four letter alphabet a letter which is acted upon and which may therefore change or may retain its original read-off character has an a priori probability of 1/4 to remain or to become a correct letter. Thus the second term is required by the normalization condition.

Equation (4) describes the effect of the interaction of the deleterious agent, say the x-ray dose, with the information bearing molecules in the cell. It corresponds to current views of reaction kinetics. Should it be discovered that some effect, for example, inter-symbol influence, should be taken into account then equation (4) may be altered suitably. The following argument would then still be cogent except that the new form of equation (4) would be used. Present experimental evidence substantiates equation (4) and we have no present justification for greater complication. In fact the $J_{ij}(\lambda)$ and $c_{ij}(\lambda)$ represent more detail than is available. Sum equation (4) over all $j$:

$$\sum_j \frac{d}{d\lambda}\, p_i(j) = -\sum_j J_{ij}(\lambda)\, p_i(j) + \sum_j c_{ij}(\lambda) \tag{5}$$

Since

$$\sum_j p_i(j) = 1; \qquad \sum_j \frac{d}{d\lambda}\, p_i(j) = 0 \tag{6}$$

$$0 = -\sum_j J_{ij}(\lambda)\, p_i(j) + \sum_j c_{ij}(\lambda) \tag{7}$$

If the $J_{ij}(\lambda)$ and the $c_{ij}(\lambda)$ may be replaced by an average value $J(\lambda)$ and $c(\lambda)$, equation (7) becomes, for a four-letter alphabet:

$$0 = -J(\lambda) + 4\, c(\lambda) \tag{8}$$

$$c(\lambda) = +\tfrac{1}{4} J(\lambda) \tag{9}$$

Equation (4) may be written as follows:

$$\frac{d}{d\lambda}\, p_i(j) = -J(\lambda)\, p_i(j) + \tfrac{1}{4} J(\lambda) \tag{10}$$

Given $(d/d\lambda)\, p_i(j)$ as some function of $\lambda$, equation (3) may be regarded as a differential equation for $H(\lambda)$. This equation has a simple form if the $J_{ij}(\lambda)$ and the $c_{ij}(\lambda)$ may be replaced by their averages $J(\lambda)$ and $\tfrac{1}{4} J(\lambda)$.
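The relaxation described by equation (10) can be sketched numerically. The example below makes several assumptions that are not in the text: J is taken as a constant, so that equation (10) has the closed-form solution p(λ) = 1/n + (p(0) - 1/n) exp(-Jλ); the same form with 1/2 in place of 1/4 is used for the binary case; all letters are taken as equally probable; and the starting probability of correct read-off, 0.999, is arbitrary. The function names are hypothetical.

```python
import math


def relaxed_probability(p0, lam, n=4, rate=1.0):
    """Closed-form solution of dp/dlam = -J p + J/n for an assumed constant
    J = rate, starting from p0 at lam = 0 (n = 4 gives equation (10))."""
    return 1.0 / n + (p0 - 1.0 / n) * math.exp(-rate * lam)


def information_content(lam, n=4, q0=0.999, rate=1.0):
    """Equation (2) for n equiprobable letters: H = log2(n) + sum p_i(j) log2 p_i(j),
    with one correct and n - 1 equally likely incorrect read-offs per letter."""
    correct = relaxed_probability(q0, lam, n, rate)
    wrong = relaxed_probability((1.0 - q0) / (n - 1), lam, n, rate)
    return math.log2(n) + correct * math.log2(correct) + (n - 1) * wrong * math.log2(wrong)


for lam in (0.0, 0.5, 1.0, 2.0, 5.0, 50.0):
    q = relaxed_probability(0.999, lam)
    print(f"lambda = {lam:5.1f}  correct read-off = {q:.4f}  "
          f"H (4 letters) = {information_content(lam):.4f}  "
          f"H (binary) = {information_content(lam, n=2):.4f}")
```

With these assumptions the correct read-off probability decays toward 1/4 for four letters (1/2 for the binary case), the probabilities stay normalized throughout, and the information content falls to zero, which is the end state of the letter switching process described in the text.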
Substituting equation (10) in equation (3) gives

$$\frac{dH}{d\lambda} = \log_2 e \sum_{i,j} \left\{ p(i)\, J(\lambda) \left[ -p_i(j) + \tfrac{1}{4} \right] + p(i)\, J(\lambda) \left[ -p_i(j) + \tfrac{1}{4} \right] \log_e p_i(j) + p_i(j) \log_e p_i(j)\, \frac{d}{d\lambda} p(i) \right\} \tag{11}$$

For a four-letter alphabet $\sum_j \left[ -p_i(j) + \tfrac{1}{4} \right] = 0$, so the first term vanishes and

$$\frac{dH}{d\lambda} = -J(\lambda) \log_2 e \sum_{i,j} p(i)\, p_i(j) \log_e p_i(j) + \tfrac{1}{4} J(\lambda) \log_2 e \sum_{i,j} p(i) \log_e p_i(j) + \log_2 e \sum_{i,j} p_i(j) \log_e p_i(j)\, \frac{d}{d\lambda} p(i) \tag{12}$$

Substituting equation (2) in equation (12) and rearranging we have

$$\frac{dH}{d\lambda} + J(\lambda)\, H = J(\lambda)\, H_x + \tfrac{1}{4} J(\lambda) \sum_{i,j} p(i) \log_2 p_i(j) + \sum_{i,j} p_i(j) \log_2 p_i(j)\, \frac{d}{d\lambda} p(i) \tag{13}$$

The third term on the right of equation (13) is negligible for biological systems. To show this we must discuss first the method of calculating the $(d/d\lambda)\, p(i)$. By definition (3) the following relation holds:

$$p(i) = \sum_j p(j)\, p_j(i) \tag{14}$$

Form the derivative with respect to $\lambda$ and substitute equation (4):

$$\frac{d}{d\lambda}\, p(i) = \sum_j \left[ p(j)\, \frac{d}{d\lambda} p_j(i) + p_j(i)\, \frac{d}{d\lambda} p(j) \right] \tag{15}$$

$$\frac{d}{d\lambda}\, p(i) = -\sum_j J_{ji}\, p_j(i)\, p(j) + \sum_j c_{ji}\, p(j) + \sum_j p_j(i)\, \frac{d}{d\lambda} p(j) \tag{16}$$

The equations (16) are a set of differential equations for the $p(i)$. They may be rearranged in the usual form:

$$\frac{d}{d\lambda}\, p(i) - \sum_j p_j(i)\, \frac{d}{d\lambda} p(j) = -\sum_j J_{ji}\, p_j(i)\, p(j) + \sum_j c_{ji}\, p(j) \tag{17}$$

We are interested in the conditions when the $(d/d\lambda)\, p(i)$ vanish. The condition is of course that the terms on the right of the equations (17) are all equal and that the determinant of the coefficients of the $(d/d\lambda)\, p(i)$ be different from zero. Among the circumstances in which this will occur are those where all $p_i(i) = q$ and all $p_i(k) = p$ $(i \neq k)$. That is, all letters are equally probable and one kind of error is as likely as the other. In my paper in Part V the behavior of $dH/d\lambda$ under the much stronger conditions that the $J_{ij}$ and $c_{ij}$ vanish at $\lambda = 0$ will be needed. Then, of course, providing that the determinant of the coefficients of the $(d/d\lambda)\, p(i)$ be different from zero, all $(d/d\lambda)\, p(i) = 0$. It may therefore be expected that except under most exceptional and special conditions the $(d/d\lambda)\, p(i)$ will be very small or will vanish. It can be further shown that for a nearly perfect system the coefficients of the $(d/d\lambda)\, p(i)$ in equation (13) are small compared to one.
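As a consistency check on equation (13), the sketch below compares its right-hand side with a numerical derivative of H computed directly from equation (2). It is only an illustration under stated assumptions: a symmetric four-letter channel with equiprobable letters, so that the $(d/d\lambda)\, p(i)$ vanish and the third term of equation (13) is exactly zero; a constant J; a source entropy of 2 bits per letter; and an arbitrary initial probability of correct read-off. None of these values come from the text.

```python
import math

N_LETTERS = 4   # four-letter alphabet, as in the text
J = 1.0         # assumed constant value of J(lambda)
H_X = 2.0       # assumed source entropy: four equiprobable, independent letters
Q0 = 0.999      # assumed initial probability of correct read-off


def transition_probs(lam):
    """Correct and incorrect read-off probabilities from equation (10) with constant J."""
    decay = math.exp(-J * lam)
    correct = 0.25 + (Q0 - 0.25) * decay
    wrong = 0.25 + ((1.0 - Q0) / (N_LETTERS - 1) - 0.25) * decay
    return correct, wrong


def message_entropy(lam):
    """Equation (2) with p(i) = 1/4 for every letter."""
    correct, wrong = transition_probs(lam)
    return H_X + correct * math.log2(correct) + 3 * wrong * math.log2(wrong)


def rhs_equation_13(lam):
    """dH/dlam = J [H_x - H + (1/4) sum_{i,j} p(i) log2 p_i(j)], third term dropped."""
    correct, wrong = transition_probs(lam)
    cross_sum = math.log2(correct) + 3 * math.log2(wrong)  # sum over i, j with p(i) = 1/4
    return J * (H_X - message_entropy(lam) + 0.25 * cross_sum)


def numerical_slope(lam, h=1e-6):
    """Central-difference derivative of H computed directly from equation (2)."""
    return (message_entropy(lam + h) - message_entropy(lam - h)) / (2.0 * h)


for lam in (0.1, 0.5, 2.0):
    print(f"lambda = {lam}: equation (13) gives {rhs_equation_13(lam):.6f}, "
          f"finite difference gives {numerical_slope(lam):.6f}")
```

The two columns agree to within the accuracy of the finite difference, and the slope is negative throughout, so that H decays as the deleterious influence increases.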
Dancoff and Quastler (5) have estimated the error rate per cell per generation to be some $10^{-1}$ to $10^{-2}$ times the spontaneous mutation rate per cell generation (10^* to 10~i^). Taking this to mean that $p_i(i) = q = (1 - p)$ and $p_i(j) = p \approx 10^{-6}$ $(i \neq j)$ we see that

$$p_i(i) \log_2 p_i(i) = +\log_2 (1 - p) \approx -p = -10^{-6}$$
$$p_i(j) \log_2 p_i(j) = -6 \times 10^{-6} \log_2 10 \approx -10^{-5} \tag{18}$$

Because of the discussion given above this term in equation (13) may be neglected. Equation (13) gives the value of $(dH/d\lambda)$ at the values of $p_i(j)$ corresponding to $H_a$. Let these values be $p'_i(j)$.

$$\left( \frac{dH}{d\lambda} \right)_{H_a} = J(\lambda) \left[ H_x - H_a + \tfrac{1}{4} \sum_{i,j} p(i) \log_2 p'_i(j) \right] \tag{19}$$

The coefficient of $J(\lambda)$ will be a constant so that $(dH/d\lambda)_{H_a}$ will behave as a function of $\lambda$ like $J(\lambda)$. This result will be needed in my article in Part V.

REFERENCES

1. J. D. Watson and F. H. C. Crick: Genetical implications of the structure of deoxyribose nucleic acid. Nature, Lond. 171, 964-967 (1953). J. D. Watson and F. H. C. Crick: The structure of DNA. Cold Spr. Harb. Symp. Quant. Biol. 18, 123-131 (1953). J. D. Watson and F. H. C. Crick: Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid. Nature, Lond. 171, 737-738 (1953).
2. G. Gamow: Possible mathematical relation between deoxyribonucleic acids and proteins. Biol. Medd., Kbh. 22, (3), 1-13 (1954).
3. C. Shannon and W. Weaver: The Mathematical Theory of Communication, University of Illinois Press, Urbana (1949).
4. W. J. Burdette: Incidence of tumors in isogenic strains. J. Nat. Cancer Inst. 12, 709-714 (1952).
5. S. M. Dancoff and H. Quastler: The information content and error rate of living things. In: Information Theory in Biology, ed. by Henry Quastler, 263-373, University of Illinois Press, Urbana (1953).
6. H. P. Yockey: An application of information theory to the physics of tissue damage. Radiat. Res. 5, 146-155 (1956).
7. M. R. Zelle and A. Hollaender: Effects of radiation on bacteria. In: Radiation Biology, ed. by A. Hollaender, Vol. 2, chap.
10, McGraw-Hill, New York (1955).
8. O. Rahn: The order of death of organisms larger than bacteria. J. Gen. Physiol. 14, 315-337 (1930).
9. R. Pearl: On the distribution of differences in vitality among individuals. Amer. Nat. 61, 113-131 (1927).
10. R. G. Schott: The inheritance of resistance to Salmonella aertrycke in various strains of mice. Thesis, Iowa State College Library, 1-59 (1931).
11. R. G. Schott: The inheritance of resistance to Salmonella aertrycke in various strains of mice. Genetics 17, 203-229 (1932).
12. H. O. Hetzer: The genetic basis for resistance and susceptibility to Salmonella aertrycke in mice. Genetics 22, 264-283 (1937).
13. W. V. Lambert: Genetic investigations of resistance and susceptibility to disease in laboratory animals. Rep. Agric. Res., Iowa Agric. Exp. Sta., 89-90 (1931); 91-92 (1932); 115 (1933); 142-143 (1934); 158-159 (1935); 147-148 (1936).
14. J. W. Gowen: Genetic investigations of resistance and susceptibility to disease in laboratory animals. Rep. Agric. Res., Iowa St. Coll. Agric. Exp. Sta., 158-159 (1937); 151-153 (1938); 156-160 (1939); 192-194 (1940); 171-172 (1941); 189-190 (1942); 178-182 (1943); 204-210 (1944); 278-283 (1945); 257-260 (1946); 230-232 (1947).
15. J. W. Gowen: Significance and utilization of animal individuality in disease research. J. Nat. Cancer Inst. 15, 555-570 (1954).
16. A. Hollaender (ed.): Radiation Biology, McGraw-Hill, New York (1955).
17. D. E. Lea: Actions of Radiations on Living Cells, Cambridge University Press, Cambridge, England (1955).
18. W. L. Russell: Genetic effects in mammals. In: Radiation Biology, ed. by A. Hollaender, Chap. 12, McGraw-Hill, New York (1955).
19. M. Ycas: The protein text. This volume.
PART II
STORAGE AND TRANSFER OF INFORMATION

A central issue in modern biology, which touches in some degree all branches of that science, is the problem of species specificity and its relation to protein specificity and synthesis. The subject can be approached from many points of view but the one adopted by the authors of the papers in Part II is to seek the solution in terms of the properties of a communication system. The justification for considering, from this point of view, a phenomenon which looks, at first sight, to be purely biochemical lies in the recent discovery that protein specificity is expressed as an exact order of amino acid residues. If this is even substantially the case then it is germane to discuss such problems in these terms. In fact, a number of current papers on protein synthesis and specificity have recourse, at one point or another, to the language of information theory.

Since the specificity of proteins is thought to be coded in the exact order of pairs of nucleotide bases in DNA, the relationship of DNA, RNA, and proteins can be considered from aspects which are mathematical rather than purely biochemical. Gamow was the first to notice these mathematical aspects. He and Ycas pursue in this part some of the issues which they reveal. The influence one hopes these considerations will have on the experimentalist is clear. Additional data on the amino-acid residue sequences and other structural data for a large number of proteins can be put to immediate practical use in solving for the protein code, and therefore in understanding more about protein synthesis. Unfortunately, mainly due to the lack of sufficient protein text, few definite answers can be given. But it is possible to eliminate some past errors and to phrase the question in a sharper fashion than before.
The notion that an abstract quantity such as information is stored in the genetic material and is transferred to proteins during their synthesis raises immediate questions as to how this is done, how much is transferred, and how this quantity is affected by changing experimental conditions. These questions are attacked from different analytical and experimental points of view by the papers by Augenstine, by Mahler, Walter, Bulbenko and Allmann, and by Koch and by Glinos.

The information theoretic properties of communication systems of particular concern to the papers in this part are the coding problem, the representation theorem, and redundance. Each paper deals with issues of its own but in terms of these ideas to a greater or lesser degree. It is in this way, among others, that information theory may grow to be as useful to the biologist as thermodynamics is to the chemist, whether his subject is clearly one in communication as is that of Frishkopf and Rosenblith or somewhat less clearly that of protein specificity.

H. P. Y.

THE CRYPTOGRAPHIC APPROACH TO THE PROBLEM OF PROTEIN SYNTHESIS

George Gamow and Martynas Ycas
University of Colorado and University of New York

Abstract: The Watson and Crick suggestion concerning the role of DNA in replication, mutation, and protein synthesis requires a coding between the four-letter DNA alphabet and the twenty-letter protein alphabet. An attempt has been made to discover this code by cryptographic methods. Various schemes have been worked out but no success obtained at this writing. There is hope that as the number of protein sequences increases this problem will be solved.

Speaking about information storage and transfer in a living cell, one always likes to compare the cell with a large factory. The cell nucleus is the manager's office, directing the work of the factory, and the chromosomes are the file cabinets in which all blue prints and production plans are stored.
The cytoplasm is the plant itself with the workers and machinery carrying out the actual production; those are, of course, the enzymes catalyzing various biochemical reactions. If something goes wrong with the information stored in the chromosome, the corresponding enzyme will also do a wrong thing. Consider, for example, an enzyme which produces the pigment necessary for color vision. If the particular section of chromosome carrying the directions for producing that pigment is defective, the enzyme will not get the correct instructions, and will not produce the right type of pigment. As a result, the individual will be color blind.

The materials of chromosomes and of enzymes are chemically different, except that in both cases we deal with long molecular chains formed by the repetition of a comparatively small number of different units. DNA (deoxyribonucleic acid), forming the chromosomes, is a sequence of four different units or 'bases': namely, adenine, thymine, guanine, and cytosine. For sake of picturesque presentation, we may associate them with four suits of cards: spades, clubs, diamonds and hearts. Each DNA molecule is equivalent to a sequence of cards many thousand units long, and the way in which different suits follow each other contains, in code form, the instructions to the original cell (fertilized ovum) and its descendants to develop into a rosebush, a skunk, or a man.

The first question is this. How is information which is carried by DNA molecules of the chromosomes duplicated when the cell goes through the process of division? An answer can be given on the basis of the model of DNA proposed about three years ago by J. Watson and F. Crick (1). They started with the fact, first noticed by E. Chargaff (2), that the number of adenines in any given DNA molecule is always equal to the number of thymines, while the number of guanines is always equal to the number of cytosines (3).
In the playing card analogy there are as many spades as there are clubs, and as many diamonds as hearts. This suggests that we deal here with a double-stranded sequence in which red and the black cards are paired together. A heart is always paired with a diamond (and vice versa), while a spade is always paired with a club (and vice versa). The fact that DNA molecules also contain one sugar (ribose) and one phosphate for each 'base' suggests a molecular model similar to a rope ladder. The vertical ropes on both sides are formed by 'sugar-phosphate-sugar-phosphate-' sequences, while the paired bases form rigid horizontal steps attached to sugars on both sides. The reason why the above-mentioned pairing of bases takes place is two-fold. Cytosine and thymine (hearts and clubs) are 'pyrimidines', being formed by a single C-N ring with different atomic groups attached to them. Adenine and guanine (spades and diamonds) are 'purines', and contain in their structure two connected rings, one with six atoms, and the other with five.

[Fig. 1. Figure not reproduced.]

The chain shown in Fig. 1 is a sequence of sugars and phosphates. To each sugar is attached a 'base', and in this section of the molecule you see four different bases. Two of them (hearts and clubs) are short, and two others (spades and diamonds) are long. Now, in order to run the second strand beside it in the parallel way, we should attach short bases to long ones, and long bases to short ones. Of course, in the playing card analogy again, one could also join a heart to a spade and a club to a diamond.
But this is excluded because in these cases hydrogen atoms will be in the wrong places to form proper hydrogen bonds between these two bases. The evidence supplied by an x-ray diffraction pattern indicates in addition that the DNA molecule has a helical shape, being twisted around its central axis by 36° each step. Thus, it makes a complete turn every 10 steps.

The Watson and Crick (4) theory of duplication of DNA molecules proceeds as follows. When the cell is ready to divide, there appears a large number of free nucleotides in the nucleoplasm surrounding the chromosomes. A nucleotide is defined as one of the four bases with a sugar and a phosphate attached to it. At that time the double-stranded DNA molecule splits into two single strands along its main axis, and each strand is regenerated by catching the corresponding free nucleotides from the surrounding medium. Thus, each heart separated by splitting from its diamond gets another diamond from the solution, and each diamond gets another heart. As a result, we get two new double-stranded DNA molecules, each identical with the original one. Once in a while a mistake may be made in this duplication process, and we call it a mutation.

So much for the structure and functioning of DNA molecules. Now we come to the problem of information transfer from the chromosomes to the enzymes. How does the sequence of bases (card suits) in DNA determine the structure of the enzyme? Enzymes are proteins, and are formed by long sequences of twenty different chemical groups known as amino acids. It is well known that there are as many as twenty-four or twenty-five amino acids, but, as Dr. Ycas tells us in more detail in the next paper, one can show that the extra ones are modifications of the original twenty, which take place after the protein molecule is synthesized.
Thus, for example, 'proline' is an original amino acid used in protein synthesis, whereas 'hydroxyproline' is its postsynthetic modification. Since we symbolized the four bases of the nucleic acid molecule by four playing card suits, it is reasonable to symbolize the twenty basic amino acids, which have complicated chemical names, by twenty letters of a (reduced) English alphabet. Thus, one protein molecule may look like:

...arreducesugarreducesug...

and another like:

...akeacoloruisionpigmentma...

To give an example of how the sequence of amino acids in protein molecules may affect their biochemical activity, consider two closely related hormones: oxytocin and vasopressin. Both are formed by a sequence of only nine amino acids:

Oxytocin:    Cys-Tyr-Ileu-Glun-Aspn-Cys-Pro-Leu-Gly
Vasopressin: Cys-Tyr-Phe-Glun-Aspn-Cys-Pro-Arg-Gly

The two sequences are identical except for the substitutions in the third and eighth place. However, their functions are rather different. Oxytocin has the property of causing the contraction of the uterus in the process of childbirth. If you inject it into the blood of a cow, even if the cow is not pregnant it will go through all the motions it would go through if a calf were to be born. Vasopressin, on the other hand, has rather different properties: it contracts the blood vessels and causes increased blood pressure. Thus, simply by changing two amino acids out of nine, the action of the hormone is completely changed.

Whereas replacement of some amino acids in a protein may completely change its biological function, there also exist replacements which distinguish the same protein taken from different species of animals. Thus, for example, insulin A, which is formed by a sequence of amino acids with twenty-one members, differs for cattle and swine in the eighth and tenth place.
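The comparison of the two hormones can be sketched in a few lines of Python. This is only an illustration, not part of the original argument: the helper name `differing_positions` is hypothetical, and the sequences are read with Ileu in the third place of oxytocin and Arg in the eighth place of vasopressin, written in the lower-case three-letter abbreviations used elsewhere in this volume.

```python
# Residue sequences of the two hormones, as quoted in the text
# (assumed reading: ileu at position 3 of oxytocin, arg at position 8 of vasopressin).
OXYTOCIN = "cys-tyr-ileu-glun-aspn-cys-pro-leu-gly".split("-")
VASOPRESSIN = "cys-tyr-phe-glun-aspn-cys-pro-arg-gly".split("-")


def differing_positions(seq_a, seq_b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


positions = differing_positions(OXYTOCIN, VASOPRESSIN)
print(len(OXYTOCIN), positions)  # 9 [3, 8]
```

The same helper would apply to any pair of homologous sequences of equal length, such as insulin A from two species.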
Human insulin, which has not yet been analyzed, possibly differs slightly from that extracted from cattle and swine. Nevertheless, the latter are successfully used on human patients.

Since there must exist a definite relation between the sequence of bases in nucleic acid and the sequence of amino acids in proteins, we can ask ourselves what this relation is. Here we have to return to our analogy of a factory. The workers from the factory do not walk into the manager's office to find out what to do, and the manager also does not go to the plant to instruct workers personally. There are people, called foremen, who get the information from the manager's office and tell the workers. In the cell the role of foreman is carried out by RNA molecules (ribonucleic acid) which are, presumably, very similar to the molecules of DNA. They are different only in that one oxygen atom is missing in each sugar of DNA, and there is a slight change in one of the four bases, which in RNA is called uracil instead of thymine. RNA is presumably synthesized by DNA inside the nucleus and receives the set of instructions carried by DNA. Then it passes out into the cytoplasm, and is incorporated into the so-called microsomes, i.e. foremen's offices, where the synthesis of proteins takes place.

We do not yet have a model of the RNA molecule. It seems, however, that in this case the pairing rules of adenine to thymine (uracil), and guanine to cytosine do not hold, which suggests that RNA molecules are single-stranded.

Since RNA serves as an intermediary between DNA and proteins, we have here two problems. First, how is RNA formed by DNA? Second, how are proteins synthesized by RNA? The first problem may turn out not to be very difficult because of the close similarity between the two molecules. For example, RNA may be a non-regenerated half of DNA with small changes in sugars and in one of the bases.
It may be that the extra oxygen atom in RNA's sugar is responsible for the failure to form a double-stranded configuration. However, we still do not know the answer to this question.

The second problem, concerning the synthesis of proteins by RNA molecules, presents more challenge to the imagination. How can a sequence formed by four different units (four bases) be translated in a unique way into a sequence formed by twenty units (twenty amino acids)? Here is a possibility which seems to us to be very likely. Suppose one plays a game of poker in which only three cards are dealt, and pays attention only to the suit of the card. How many different hands will one have? Well, one can have a 'flush', i.e. three cards of the same suit. There are four different flushes: three hearts, three spades, etc. Then one can have a 'pair', i.e. two cards of the same kind, and one different. How many of those are there? One has four choices for the suit of the pair, and three choices for the third card. Thus, there are altogether twelve possibilities. The poorest hand will be a 'bust', i.e. three different suits. There are four different busts: no hearts, no diamonds, etc. We have altogether twenty different possibilities. This 'magic number' 20 is just the number of amino acids participating in the primary process of protein synthesis.

We may imagine that each amino acid in the synthesized protein is determined by a triplet of bases in the RNA template. Since the distances between neighboring amino acids in the extended polypeptide chain are equal to the distances of neighboring bases in the polynucleotide chain (both being equal to 3.7 Å), it was at first natural to suppose that the correlation between the two chains looks as shown in Fig. 2, where individual bases are shown by circles and the amino acids by triangles.

[Fig. 2. RNA-Template]

This represents the so-called overlapping code in which the neighboring amino acids have in common two bases in the RNA template.
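The count of twenty hands, and the way an overlapping code shares bases between neighbors, can be checked with a short Python sketch. It is an illustration under stated assumptions rather than anything taken from the original paper: the function names `classify` and `overlapping_triplets` are hypothetical, and the letters A, G, C, U in the sample template are an arbitrary labeling of the four bases.

```python
from collections import Counter
from itertools import combinations_with_replacement

# The four suits stand for the four bases, as in the card analogy.
SUITS = ("spades", "clubs", "diamonds", "hearts")


def classify(hand):
    """Name a three-card hand by the number of distinct suits it contains."""
    return {1: "flush", 2: "pair", 3: "bust"}[len(set(hand))]


# Unordered three-card hands in which only the suit matters.
hands = list(combinations_with_replacement(SUITS, 3))
tally = Counter(classify(hand) for hand in hands)
print(len(hands), dict(tally))


def overlapping_triplets(template):
    """Read a template one base at a time, so neighboring triplets share two bases."""
    return [template[i:i + 3] for i in range(len(template) - 2)]


# Hypothetical five-base template, used only to show the overlap.
print(overlapping_triplets("AGCAU"))
```

The tally gives 4 flushes, 12 pairs and 4 busts, twenty hands in all, and each triplet of the overlapping reading shares its last two bases with the first two of the next.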
If the transfer of information from nucleic acid to protein is carried out according to such an overlapping code, there must exist a definite inter-symbol correlation between the amino acids constituting protein molecules. Thus, for example, if a certain amino acid is determined by two adenines and some other base, its neighbors will preferentially be amino acids which also contain adenine in their template transcript. In order to see whether or not such a correlation between the neighbors really exists in the known protein sequences, it is necessary to test all possible assignments between the twenty amino acids and the twenty possible base triplets. The number of all possible assignments of that type is 20! ≈ 2.4 × 10^18. Since this is roughly ten times the age of our universe (5 billion years) expressed in seconds, about 1.6 × 10^17, the straightforward test of that kind would require quite a considerable time even if we could test one assignment each second!

However, as often happens in cryptographic problems, one can sometimes find parts of the message which reduce quite considerably the amount of necessary work. Thus the code messages sent by German spies during the war were likely to contain the combinations of letters corresponding to various possible ports of embarkation of American expeditionary troops. The same happens in protein sequences. For example, the adrenocorticotropin molecule contains the sequence:

-Lys-Lys-Arg-Arg-Pro-Val-Lys-Val-

In this sequence there are two identical amino acids in succession followed by another pair of identical ones. In the English language there are not many words having such a property. (Tennessee is one of the rare examples!) Then lys repeats again three steps later, and has identical neighbors (val) on both sides.
These facts simplify the problem to such an extent that, instead of spending five billion years, it was possible to find a single assignment between the amino acids in the above sequence and the base triplets in the course of an afternoon. At first it was thought the problem had been solved, but, when one tried to extend these assignments to the other parts of the ACTH molecule and to the other known protein sequences, one was led to direct contradictions. In the course of subsequent decoding work, other examples leading to similar contradictions were found, and it became clear that the thing just will not work. In fact, as Dr. Ycas discusses in the following article, it seems that there is no correlation between the neighboring amino acids whatsoever.

This negative result can only mean that the original hypothesis represented in Fig. 2 was incorrect, and that in the process of protein synthesis the nucleic acid molecule is not present in its extended form. If, as seems to be true, we deal here with a "non-overlapping code" in which each amino acid is determined by an individual base triplet of its own (Fig. 3), we are forced to assume that the RNA molecule is shrunk by a factor of three. We can imagine, for example, that during the process of protein synthesis the RNA molecule has the shape of a spiral as shown in Fig. 4.

[Fig. 3. RNA-Template]

[Fig. 4. Spiral form of the RNA molecule; labeled spacing 3.6 Å]

Closely connected with the problem of a non-overlapping code is the problem of "punctuation". Indeed, a sequence of bases can be broken into a set of non-overlapping triplets in three different ways depending upon the base with which we start. The three different readings of the same template can be described mathematically as 3n, 3n+1, and 3n+2 (3n+3 being the same as 3n). As was suggested by Dr Barbara Law, three possible readings of the same RNA template may explain an interesting regularity first noticed by Dr Martynas Ycas.
He observed about two years ago that, in the case of seven proteins for which the sequences of amino acids were known, the total number of amino acids in the protein molecule was a multiple of three: nine amino acids in oxytocin and vasopressin, twenty-one in insulin A, thirty in insulin B, thirty-nine in ACTH, 126 in ribonuclease, etc. This could be explained if one assumes that each RNA template synthesizes the proteins in all three possible ways, and that these three different readings are afterwards united in one linear sequence. If this were true, there must exist a cryptographic correlation between the first, second, and third "thirds" of each protein molecule. One thinks of how such a correlation could be checked, but it seems to be very difficult indeed. Recently, though, the existence of such a correlation became rather doubtful, since two protein sequences published recently contain 29 and 124 amino acids.

In summing up, we should say that the problem of finding the nature of the correlation between the polynucleotide chains of nucleic acids and the polypeptide chains of the proteins is still unsolved, although various methods for establishing such a correlation have been worked out. We may hope, however, that with the increased number of known protein sequences, this problem will be solved in one way or another.

REFERENCES

1. J. D. Watson and F. H. C. Crick: Molecular structure of nucleic acids. Nature, Lond. 171, 737-738 (1953).
2. E. Chargaff: for reference see S. Zamenhof, G. Brawerman, and E. Chargaff: On the deoxypentose nucleic acids from several micro-organisms. Biochim. Biophys. Acta 9, 402-405 (1952).
3. G. R. Wyatt: Nucleic acids of some insect viruses. J. Gen. Physiol. 36, 201-205 (1952).
4. J. D. Watson and F. H. C. Crick: Genetical implications of the structure of deoxyribose nucleic acid. Nature, Lond. 171, 964-967 (1953).
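As a supplementary illustration of the "punctuation" problem discussed in this paper, the three readings of a template and the multiple-of-three regularity can be sketched in Python. The function name `reading_frames` and the nine-base sample template are hypothetical; the chain lengths are the ones quoted in the text.

```python
def reading_frames(template):
    """Split a template into non-overlapping triplets, starting at base 0, 1 or 2.

    Bases left over at the end of a reading (fewer than three) are dropped.
    """
    frames = {}
    for offset in range(3):
        usable = template[offset:]
        count = len(usable) // 3
        frames[offset] = [usable[3 * i:3 * i + 3] for i in range(count)]
    return frames


# Hypothetical nine-base template, used only to show the three readings.
for offset, triplets in reading_frames("AGCUAGGCU").items():
    print(offset, triplets)

# Chain lengths quoted in the text: the earlier sequences are multiples of three,
# the two more recent ones (29 and 124) are not.
earlier = [9, 21, 30, 39, 126]
recent = [29, 124]
print([n % 3 for n in earlier], [n % 3 for n in recent])
```

The remainders for 29 and 124 are 2 and 1, which is the observation that made the proposed correlation doubtful.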
6

THE PROTEIN TEXT

Martynas Ycas
Department of Microbiology, State University of New York Upstate Medical Center, Syracuse, New York

And strange to tell, among that Earthen Lot
Some could articulate, while others not:
And suddenly one more impatient cried:
'Who is the Potter, pray, and who the Pot?'
The Book of Pots

Abstract. The sequence of residues in proteins, regarded as a text written in a twenty-symbol alphabet, is examined. The following tentative conclusions are drawn:

1. Twenty amino acids are distinguished by the protein-forming mechanism. Supernumerary amino acids arise from the regular twenty by secondary modification of protein-bound residues.
2. Each residue in the protein has a separate genetic representation.
3. There is no intersymbol correlation between adjacent residues.
4. Natural selection is not the only factor determining the frequency of occurrence of the various kinds of residues. It is suggested that the method of encoding protein sequence information in nucleic acid imposes differences in frequency of occurrence on the different kinds of residues.
5. Peptide chains are not multiples of some fixed number of residues.

The encoding and transfer of genetic (DNA) information to RNA and protein is discussed, as well as the problem of the independent reproduction of RNA viruses. While the data set certain limits on the possible ways of encoding and transferring information, they are not sufficient for a unique solution of these problems.

Ribonucleic acid of Tobacco Mosaic Virus (TMV) has been shown to determine the sequence of amino acid residues in the protein of the virus (1, 2, 3).
It seems logical therefore to believe that the sequence of other proteins is also determined by RNA.* Since RNA is essentially a linear sequence of four kinds of nucleotides, while proteins are linear sequences of about twenty kinds of amino acid residues, the RNA molecule can be regarded as a text, written in a four-symbol alphabet, which encodes another text, the protein, written with about twenty symbols.

* The following abbreviations will be employed. RNA = ribonucleic acid; DNA = deoxyribonucleic acid; Ad = adenylic acid; Gu = guanylic acid; Cy = cytidylic acid; Ur = uridylic acid; ala = alanine; arg = arginine; asp = aspartic acid; aspn = asparagine; asx = aspartic acid or asparagine; cys = cysteine; glu = glutamic acid; glun = glutamine; glx = glutamic acid or glutamine; gly = glycine; his = histidine; ileu = isoleucine; leu = leucine; lys = lysine; met = methionine; phe = phenylalanine; pro = proline; ser = serine; thr = threonine; try = tryptophan; tyr = tyrosine; val = valine; Hlys = hydroxylysine; Hpro = hydroxyproline; serP = phosphoserine. Peptides are written with the amino group to the left, the symbols being connected by a dash (-). The sign (*) signifies a terminal residue. Sequences considered uncertain are in parentheses ( ). Symbols in parentheses, with commas between (ala, gly), mean that the sequence is not known.

Several attempts, none completely convincing, have been made to determine the coding system employed (4, 5, 6, 7). Cryptography must be based on a study of texts, and I shall therefore attempt an examination of protein molecules from this point of view. The following aspects of protein structure will be examined:

1. The number of kinds of amino acids which occur in proteins.
2. The effect of mutations on amino acid sequence.
3. Whether intersymbol correlations exist between adjacent residues.
4. The frequency of occurrence of the various amino acid residues.
5. Whether any restrictions exist on the length of peptide chains.

After considering the empirical evidence, I shall indicate its bearing on the problem of encoding protein sequence information into the RNA molecule.

I. THE NUMBER OF AMINO ACIDS OCCURRING IN PROTEINS

In previous studies (6, 7) it has been assumed that proteins are composed of exactly twenty different kinds of residues. Since in fact more than twenty kinds of residues occur in proteins, the assumption requires some justification.

All organisms, from viruses to mammals, use the same building blocks for their proteins. With minor qualifications this is also true of the nucleic acids, but not true of the third major class of biologically-occurring high polymers, the polysaccharides. The amino acids which invariably occur in all organisms and virtually all proteins are the following: ala, arg, asp, aspn, cys, glu, glun, gly, his, ileu, leu, lys, met, phe, pro, ser, thr, try, tyr, val. The number in this list is exactly twenty.

It will be noted that I omit cystine from this list. Because of its structure, cystine corresponds to two residues. The structure of insulin (8) shows that one cystinyl residue can occupy non-adjacent positions in a peptide chain or even participate in two different chains. Cystine is best regarded as an oxidation product of cysteine, formed after incorporation of the cysteinyl residue into the peptide chain. This view is supported by the recent discovery of an enzyme which reversibly catalyzes the reaction 2 cysteinyl ⇌ cystinyl when these residues are protein bound (9). Another example of such a reaction may be the cyclic oxidation and reduction of protein SH groups during the various stages of cell division (10).

In addition to the above twenty, other alpha amino acids occur in nature. Some of these, such as homocysteine, citrulline and ornithine, are well known biochemical intermediates but do not occur in proteins.
It is clear that the number of amino acids which occur in proteins is limited by an inability to incorporate, rather than make, amino acids.

Hydroxyglutamic acid and norleucine, previously believed to be protein constituents, have been shown not to exist as natural products (11). Alpha amino-adipic acid has been isolated from an impure protein hydrolyzate, but it has not been demonstrated that it is a protein constituent in the same way as other amino acids (12). Diamino pimelic acid, commonly occurring in bacteria, appears to be associated with the polysaccharide material of the cell wall (13, 14).

Nevertheless, there are amino acids, other than the twenty enumerated, which certainly occur in proteins. These include hydroxylysine and hydroxyproline (in collagen), phosphoserine (in a number of different proteins (15)), thyroxine (in thyroglobulin) and tyrosine-O-sulphate (in fibrinogen) (16). The distribution of these amino acids is different from that of the regular twenty. Whereas the twenty amino acids occur in virtually all proteins, the supernumerary ones have an erratic distribution, being confined to one or to a few.

The suggestion was first made by Crick that the supernumerary amino acids are the result of modifications of some of the regularly occurring amino acids after these have been incorporated into a peptide chain. The biochemical evidence for this is as follows. When one of the twenty regularly occurring amino acids is presented labeled to an organism, it is rapidly incorporated into protein and most of the label is found in the corresponding residue. It should be noted that glutamine and glutamic acid are separately incorporated and do not arise one from another by addition or subtraction of amide groups after incorporation (17). (A similar demonstration for the analogous case of asparagine and aspartic acid is still lacking.) Clearly, therefore, these amino acids are the precursors of the corresponding protein-bound residues.
The supernumerary amino acids behave differently. Thus lysine is the precursor of hydroxylysine (18), but C14 or tritium-labeled hydroxylysine is not incorporated into collagen (19). Similarly, proline is the precursor of hydroxyproline, but proline is a much better precursor of the hydroxyprolyl of collagen than is hydroxyproline itself (20, 21). These amino acids, then, are not incorporated as such, but presumably are formed by oxidation of protein-bound proline and lysine. Phosphoserine likewise is formed by phosphorylation of protein-bound serine (22). Thyroxine is apparently formed from the tyrosine residues of thyroglobulin (23). There is no information at present on the metabolism of tyrosine-O-sulfate.

Since not all appropriate residues are secondarily modified, this interpretation implies that the enzymes catalyzing such conversions show specificity for sequence in the protein. At least one enzyme is known which shows such specificity. Prostatic phosphatase dephosphorylates phosphoserine in the sequence asx-serP-glx-ileu-ala, but not in glx-serP-ala (24). It is therefore suggestive of some enzyme specificity that hydroxyproline in collagen occurs mainly, if not exclusively, before glycine (25) (Table IV). Other amino acids, as shown later, show no such neighbor preferences. The region determining whether proline is to be oxidized or not probably includes more than three residues, as indicated by the isolation from collagen of the tripeptides ala-pro-gly; ala-Hpro-gly and ser-pro-gly; ser-Hpro-gly (Table IV).

The biochemical evidence thus appears to indicate that the protein-forming mechanism selects exactly twenty different kinds of amino acids, and that the supernumerary ones arise by secondary modification of protein-bound residues.

A possible cause for error in this conclusion should be noted. It is virtually certain that amino acids are not incorporated as such, but in the form of some sort of activated derivative.
If the same amino acid were to form more than one derivative, the number of items to be selected would of course exceed twenty. There is no evidence for this at present, and only further advances in biochemistry can decide whether this is the case.

II. GENETIC EFFECTS ON PROTEINS

There is an increasing body of evidence indicating that the details of protein structure are genetically determined. A study of the effect of mutations on proteins should therefore tell us something both about the nature of mutations and the protein-forming mechanism. Known cases of genetic effects on proteins are listed below.

1. In man hemoglobin occurs in several electrophoretically distinguishable forms, the presence of each being apparently controlled by alleles of a single gene (26). Hemoglobin C differs significantly in amino acid composition from hemoglobin A (27). Hemoglobin A and S have been degraded in a controlled fashion with trypsin and the resulting peptides separated. The difference between these hemoglobins is apparently confined to a short section of the molecule (28).
2. Two electrophoretically different hemoglobins occur in sheep. Their presence is determined by alleles of a single gene (29).
3. Two forms of lactoglobulin occur in cow's milk, and like the hemoglobins are determined by different alleles of one gene. Crystallographic investigations indicate unit cells of the same size, but there are very slight differences in the diffraction pattern, which the investigators attribute, possibly, to the substitution of a few amino acid residues by others (30).
4. Mutants of Neurospora and Escherichia coli produce abnormally heat-labile forms of tyrosinase (31) and a pantothenic acid synthesizing enzyme (32), respectively. It is clear that a change in the proteins has occurred, but unfortunately there is no further information on its physico-chemical nature.
The genetic evidence indicates that there is no interaction between alleles controlling the synthesis of different variants of one protein. If both alleles are present, both types of protein are formed. A possible exception should be noted. The N-terminal groups of wheat gliadin are reported to be phe, of rye gliadin phe and glx, but unexpectedly the gliadin of wheat × rye hybrids was found to have no amino or carboxyl terminal ends, indicating, possibly, a cyclic protein (33). This case obviously needs further study.*

* There is considerable confusion as to the N-terminal residues of wheat gliadin. Fraenkel-Conrat (51) misquotes Deich and Soreni (33) as stating that the N-terminal residues are phenylalanine and histidine, apparently because of a misunderstanding in Chemical Abstracts (138). Koros, whose paper I was able to consult only in abstract (139), reports histidine as N-terminal. Ramachandran and McConnell (140), working with wheat gliadin but failing to specify the species, also find histidine. Deutsch (the same as Deich quoted above, the difference in spelling being due to transliteration from the Cyrillic) reports that gliadin from Triticum durum and Triticum vulgare has N-terminal phenylalanine (141). This is misquoted as tyrosine, and tyrosine and glutamic acid, respectively, by Ramachandran and McConnell (140). The original paper of Deutsch (141) was also unavailable to me.

The evidence cited above shows that the properties of proteins are gene-determined, but it does not indicate clearly what these properties are. More detailed information is available on this point from a comparison of homologous proteins of related species, if it is assumed, as is usually done, that species differences are the result of gene mutations.

Available evidence on amino acid sequence of homologous proteins is collected in Table I.
Mutations (as inferred from differences between homologous proteins) do not produce a general scrambling of protein sequence, but a replacement of one or more residues, leaving the rest of the sequence unchanged. Since homologous proteins can differ by a one-residue replacement, it is clear that individual residues, rather than groups of residues, are represented in the genetic material.

Table I. Sequences in Homologous Proteins from Different Species

Protein                          Sequence                        Species

Insulin (34)                     ... cys-thr-ser-ileu-cys ...    Pig
                                 ... cys-ala-ser-val-cys ...     Cattle
                                 ... cys-ala-gly-val-cys ...     Sheep
                                 ... cys-thr-gly-ileu-cys ...    Horse
                                 ... cys-thr-ser-ileu-cys ...    Whale

Myoglobin (35)                   *val ...                        Finback whale
                                 *val ...                        Sperm whale
                                 *gly ...                        Horse
                                 *gly ...                        Seal (Phoca vitulina)

Protamine (36)                   gly ser ala val ileu            Salmo irideus
(Composition, not sequence;      gly ser ala val ileu            Salmo trutta
subscripts illegible)

Serum albumin (37, 38)           *asp-ala ... leu*               Man
                                 *asp-thr ... ala*               Cattle

Cytochrome c (39)                ... cys-ala-glun ...            Horse, Cattle, Pig, Salmon
                                 ... cys-ser-glun ...            Chicken

Vasopressin (40)                 ... pro-arg-gly-NH2*            Cattle
                                 ... pro-lys-gly-NH2*            Pig

Hemoglobin (41)                  (sequences and species are listed in source order; the pairing is not recoverable)
  Sequences:                     *val-leu; *val-gly; *val-glun; *val-leu; *val-gly; *val-asx; *val-leu; *met-gly; *val-leu; *val-ser; *val-asx; *val-leu; *val-gly; *val-leu
  Species:                       Horse, Pig; Dog; Cattle, Goat, Sheep; Guinea pig; Rabbit, Snake; Chicken

Gliadin (33)                     *phe ...; *phe ...              Wheat
                                 *phe ...; *glx ...              Rye

Fibrinogen (42)                  *tyr ...; *ala ...              Man
                                 *tyr ...; *glx ...              Cattle

ACTH (43, 44, 45)                ... pro-ala-gly-glu ...         Sheep
                                 ... pro-gly-ala-glu ...         Pig
                                 ... glu-ala-ser-glu ...         Sheep
                                 ... glu-leu-ala-glu ...         Pig

Hypertensive peptide (46, 47)    ... val ...                     Cattle
                                 ... ileu ...                    Horse

Virus (48)                       ... thr-ser-gly-pro-ala-thr*    TMV (M, YA strains)
                                 ... thr(thr,ala)pro-ala-thr*    TMV (HR strains)

It is possible that a mutation may suppress an amino acid determining site altogether.
This is indicated by the tentative finding of Akabori (quoted in (41)), that the 'B' chain of fish insulin has the sequence ... pro-lys*, as compared with the sequence ... pro-lys-ala* in cattle.

In some cases (ACTH, TMV), two adjacent replacements differentiate one homologous protein from another. It is not probable that this is due to two independent but adjacent mutations, but rather that a single mutational event has affected two residue-determining sites. Such a view is made plausible by the work of Benzer (49). He has shown that mutations in bacteriophage involve small sections of DNA, of molecular dimensions, but that these sections can be of different lengths. Presumably the length of the mutated section determines the number of residues changed in the protein. It is perhaps not too sanguine to hope that eventually it may become possible to measure crossover values in terms of distance in residues along a protein chain, and thus obtain an estimate of the number of bases in DNA determining a single residue-selecting site. The present difficulties of such an approach are of course obvious (50).

It would be of interest to determine if there are any restrictions on the replacement process. Restrictions might be expected on the following grounds. More than one nucleotide must determine an amino acid site. If the process of mutation were predominantly to change some, but not all, nucleotides determining a site, then obviously not all sites would be interconvertible in one step. A study of any such restrictions would be of great value, since their nature would depend on the coding principle and could be used to infer the latter.

Table II. Replacements Inferred from Table I and their Frequency of Occurrence

Replacement        Occurrence
val <-> ileu       3
ala <-> thr        2
ala <-> ser        2
ala <-> gly        2
ala <-> leu        2
ser <-> gly        2
ala <-> glx        1
val <-> gly        1
val <-> met        1
phe <-> glx        1
glun <-> asx       1
arg <-> lys        1

Known replacements in homologous proteins are collected in Table II. In the small sample we have (nineteen replacements), half recur twice or more, suggesting strongly that the process, as observed, is not a random one. Unfortunately, the sample is not unbiased. Certain replacements are lethal or semi-lethal (hemoglobin S, for example), and are, without doubt, selected against. What we actually observe has therefore passed through the sieve of selection. The direct genetic approach to this problem is tedious, because of the difficulty of determining the phenotype (the amino acid sequence), and rapid progress is scarcely to be expected. A much larger body of data on homologous proteins may, however, enable us to reach a decision on whether the replacement process is intrinsically restricted or not.

Table III. Terminal Residues of Proteins having more than one Peptide Chain
(The exact number of chains is not indicated.)

Protein                          N-terminal    C-terminal    Reference
Cytochrome c                     his, his                    (51)
Growth hormone                   ala, phe, phe (column assignment uncertain)   (51)
Triosephosphate-dehydrogenase    val, val      met, met      (51)
Collagen                         gly, ala (column assignment uncertain)        (51)
Gliadin (wheat)                  phe                         (33)
Gliadin (rye)                    phe, glx                    (33)
β-Lactoglobulin                  leu, leu      ileu, ileu    (51)
Fibrinogen (man)                 tyr, ala                    (51)
Fibrinogen (cattle)              tyr, glx                    (51)
Hemoglobin (horse)               val, val                    (41)
Hemoglobin (cattle)              val, met                    (41)

An additional point emerges from a consideration of such protein molecules as consist of more than one chain (Table III). It will be noted that there is a strong tendency for the terminal residues of such proteins to be identical.
This is certainly not due to the chains being identical in all cases, since the hemoglobins, for example, do differ in the penultimate positions (Table I). Rather it appears to indicate that multi-chain proteins arise by reduplication of genetic material, so that the several chains start out by being identical, but gradually diverge in the course of evolution in the same way as homologous proteins of different species. This hypothesis, as applied to the hemoglobins and insulin, has been previously discussed (6). Determinations of the residue sequence along different chains of one protein may therefore throw additional light on the replacement process.

Table I shows that the process by which replacements become established is very slow. Elucidation of the sequence of homologous proteins may therefore make it possible to determine phylogenetic relations between large groups such as phyla, which cannot now be certainly determined from morphological and embryological evidence.

III. CORRELATIONS BETWEEN ADJACENT RESIDUES

Are there any forbidden combinations of adjacent residues? An examination of the sequence of residues in proteins (Table IV) could provide an answer to this question.

Fig. 1. Dipeptide sequences now known to occur in proteins, compiled from Table IV. The N-terminal amino acids are plotted in the rows, the C-terminal in the columns. (Grid rows: ALA, ARG, ASP, ASPN, CYS, GLU, GLUN, GLY, HIS, ILEU, LEU, LYS, MET, PHE, PRO, SER, THR, TRY, TYR, VAL.)

There are of course 400 possible pairs of the twenty amino acids. The known protein sequences in Table IV have been broken down in the following way.
A sequence, say, of ala-arg-gly is broken down into the dipeptides ala-arg, arg-gly, and the appropriate cells in Fig. 1 are then filled, the N-terminal residues being represented by the rows, the C-terminal by the columns. Using all the data available in Table IV, Fig. 1 shows that somewhat more than half of all possible dipeptide combinations are known to occur. The question remains whether any of the blank cells represent forbidden combinations, or whether they are merely the result of accidents of sampling.

Table IV. List of Known Sequences in Proteins

Actin (52)
. . . his-ileu-phe*

Adrenocorticotropin (45)
*ser-tyr-ser-met-glu-his-phe-arg-try-gly-lys-pro-val-gly-lys-lys-arg-arg-pro-val-lys-val-tyr-pro-ala-gly-glu-asp-asp-glu-ala-ser-glu-ala-phe-pro-leu-glu-phe*

Carboxypeptidase (53)
*aspn-ser; ser-thr

α Casein (54, 55)
serP-glx; *lys-leu-val-ala-glx-asx

Chymotrypsinogen (56, 57)
leu-ser-arg-ileu-val; aspn-ser-gly-(glun-ala)

Clupein (58, 59)
*pro-ser-arg; ser-ala-arg-arg*; arg-arg-arg-arg;

Collagen (60, 61)
ala-Hpro-gly; ala-pro-gly; glx-arg; glx-Hpro-gly; gly-asx-gly; gly-glx; gly-pro-ala; gly-pro-glx; gly-pro-gly; gly-pro-Hpro; ser-Hpro-gly; ser-pro-gly; ala-gly-ala; gly-gly; ser-gly; thr-gly; ala-asx; asx-asx; asx-glx; asx-gly; glx-ala; glx-glx; glx-gly; glx-gly-gly; glx-met; glx-phe; ser-asx; val-glx; ala-arg; arg-gly-gly; arg-val-gly; ser-arg; val-arg; ala-lys; asx-arg; lys-gly; pro-ser; pro-thr; ser-ala; thr-ala; lys-pro-gly; leu-ala; ala-ala-gly;

Cytochrome c (39)
. . . val-glun-lys-cys-ala-glun-cys-his-thr-val-glu

γ Globulin (rabbit) (62)
*ala-leu-val-asx . . .

Glucagon (63)
*his-ser-glun-gly-thr-phe-thr-ser-asp-tyr-ser-lys-tyr-leu-asp-ser-arg-arg-ala-glun-asp-phe-val-glun-try-leu-met-aspn-thr*

Hemoglobin (41)
*val-glun-leu; *val-leu; (horse). *val-gly; *met-gly; *val-ser; *val-glx; *val-asx; (various species, see Table I.)

Hypertensive peptide (46)
*asp-arg-val-tyr-val-his-pro-phe-his-leu*

Insulin (cattle) (8)
'A' chain: *gly-ileu-val-glu-glun-cys-cys-ala-ser-val-cys-ser-leu-tyr-glun-leu-glu-aspn-tyr-cys-aspn*
'B' chain: *phe-val-aspn-glun-his-leu-cys-gly-ser-his-leu-val-glu-ala-leu-tyr-leu-val-cys-gly-glu-arg-gly-phe-phe-tyr-thr-pro-lys-ala*

β Lactoglobulin (64)
his-ileu*

Lysozyme (65, 66, 67, 68)
thr-asx-val-glx-ala; ileu-glx-leu-ala-leu; asx-glx-ala; leu-thr-ala; glx-asx-ileu; thr-glx-ala-gly; ser-asx-gly-met-asx; asx-ala-met-lys-cys-arg; val-thr-pro-gly-ala; ser-asx-arg; lys-phe-glx-gly; arg-cys-glx-ala; ser-phe-asx-glx; thr-asx-arg-arg; thr-gly-asx-val; ser-val-cys-ala-lys-gly; gly-cys-asx; leu-gly-ala-val; asx-ileu-pro-cys; arg-cys-lys-gly; ser-val-asx-cys-ala; asx-leu-cys-asx; arg-asx-cys-ileu; ser-arg-leu; ser-asx-cys-arg-leu; arg-asx; arg-gly; asx-asx; gly-leu; ileu-arg; ileu-asx; ileu-val; leu-leu; ser-ala; ser-leu; val-ala; *lys-val-phe-gly-arg; arg-his-lys; asx-gly-ala-asx-leu*; glx-ser-phe-asx; ala-lys-phe-glx; asx-tyr-arg-gly; arg-gly-tyr-ileu-leu; asx-ala-tyr-gly-ser-leu-asx; leu-pro; ala-ala-met;

Melanophore expanding hormone (69, 70)
*asp-glu-gly-pro-tyr-lys-met-glu-his-phe-arg-try-gly-ser-pro-pro-lys-asp*

Myoglobin (71)
*gly-leu

Ovalbumin (72, 15, 73, 74, 64)
val-ser-pro*; asx-serP-glx-ileu-ala; glx-serP-ala; ala-gly-val-asx-ala-ala; cys-ala; cys-val; cys-gly; cys-phe; thr-cys; ser-cys; cys-glx; glx-cys; phe-cys; asx-cys; val-cys;

Oxytocin (75)
*cys-tyr-ileu-glun-aspn-cys-pro-leu-gly-NH2*

Papain (76)
*ileu-pro-glu

Pepsin (77, 15)
*leu-gly-asx-asx-his-glx; thr-serP-glx;

Prolactin (78)
*thr-pro-val

Ribonuclease (79)
*lys-glu-thr-ala-ala-ala-lys-phe-glun-arg; lys-ser-arg-aspn-leu-thr-lys-asp-arg; lys-aspn; tyr-glun-ser-tyr; tyr-lys; lys-his; asp-ala-ser-val*

Salmine (80, 81)
*pro-arg-arg; arg-pro-val-arg-arg; pro-ileu-arg; val-gly; arg-val-ser-arg; arg-ileu-arg; arg-ala-ser-arg; arg-gly-gly-arg; arg-ser-ser-arg; val-gly;

Serum albumin (37)
*asp-ala (man); *asp-thr (cattle);

Silk fibroin (Bombyx) (82, 83, 84)
gly-ala-gly-ala-gly-[ser-gly-(ala-gly)n]8-ser-gly-ala-ala-gly-tyr (n usually 2, mean value always 2); gly-val-gly; tyr-gly; phe-gly; gly-ser-pro-tyr-pro; tyr-pro-ser-tyr

Tobacco mosaic virus (48)
thr-ser-gly-pro-ala-thr*

Tropomyosin (52)
ala-ileu-met-thr-ser-ileu*

Trypsinogen (85)
*val-asp-asp-asp-asp-lys-ileu

Vasopressin (40)
*cys-tyr-phe-glun-aspn-cys-pro-arg-gly-NH2*

Wool (86)
ser-cys; gly-cys; thr-cys; ala-cys; leu-cys; cys-gly; cys-thr; cys-ala; cys-val; cys-leu; cys-phe;

To answer this question statistically, the frequencies of occurrence of various combinations have been plotted in Fig. 2. There are more blank cells here than in Fig. 1, as a portion of the data has been discarded to avoid obvious sources of bias. Thus the sequences of silk, collagen, wool and protamine have been omitted, since these proteins have an obviously aberrant structure. Likewise, sequences of less than three residues have not been used, since the ease of isolation of various dipeptides varies, making it possible that the frequencies of some peptides have been systematically over- or underestimated.

Fig. 2. Frequencies of occurrence of dipeptide sequences in proteins, plotted as in Fig. 1. The sequences of clupein, collagen, salmine, silk fibroin and wool have not been used. Sequences of less than three residues, as well as those where the acid and amide forms of glx and asx are not differentiated, were also not used. On the basis of the study of Ohno (68), glx and asx in lysozyme are assigned to glun and aspn, respectively. The seven-residue sequence common to ACTH and MEH was counted only once. a: marginal totals of rows; b: marginal totals of columns; c: marginal totals of rows and columns. (Only the marginal totals are given here; the individual cell entries of the grid are not reproduced.)

Residue    a     b     c
ALA       26    32    58
ARG       18    20    38
ASP       13    11    24
ASPN      24    21    45
CYS       19    19    38
GLU       14    13    27
GLUN      18    22    40
GLY       21    27    48
HIS        8     6    14
ILEU      10    11    21
LEU       20    23    43
LYS       20    16    36
MET        5     6    11
PHE       14    15    29
PRO       16    16    32
SER       28    20    48
THR       16    11    27
TRY        2     2     4
TYR       16    15    31
VAL       22    24    46
Total    330   330   660

Figure 2 can now be treated as a contingency table with 761 degrees of freedom, and the null hypothesis, that there is no correlation between adjacent residues, tested. The deviation $X$ from the expected distribution in Fig. 2 is calculated as:

$$X = \sum_{i,j} \frac{\left[a_{ij} - \dfrac{(a_{i.} + a_{.i})(a_{j.} + a_{.j})}{4n}\right]^2}{\dfrac{(a_{i.} + a_{.i})(a_{j.} + a_{.j})}{4n}} \qquad (1)$$

where $n$ is the sum of the marginal totals (330), $a_{ij}$ the value of a cell in column $i$ and row $j$, $a_{i.}$ and $a_{.i}$ the marginal totals in column and row respectively of the residue defining the column, $a_{j.}$ and $a_{.j}$ the analogous values for the residue defining the row. For computational purposes (1) reduces to:

$$X = n\left(4 \sum_{i,j} \frac{a_{ij}^2}{(a_{i.} + a_{.i})(a_{j.} + a_{.j})} - 1\right) \qquad (2)$$

From Fig. 2, $X = 392$.
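The procedure just described (breaking sequences into dipeptides, tallying them in a grid, and comparing the grid with the expectation from pooled marginal totals) can be sketched in Python. This is only an illustrative sketch, not the author's calculation: it uses three complete sequences copied from Table IV (the insulin 'A' and 'B' chains and glucagon), so it cannot reproduce the value obtained from the full data of Fig. 2, and all function names are assumptions introduced here. It evaluates the deviation both directly, as the sum of (observed - expected)^2 / expected with expected = c_i c_j / 4n, and in the shortened computational form n(4 Σ a_ij² / (c_i c_j) - 1), where c is the pooled row-plus-column total of a residue.

```python
from collections import Counter

# Three complete sequences copied from Table IV
# (insulin 'A' chain, insulin 'B' chain, glucagon).
SEQUENCES = [
    "gly-ileu-val-glu-glun-cys-cys-ala-ser-val-cys-ser-leu-tyr-glun-leu-glu-aspn-tyr-cys-aspn",
    "phe-val-aspn-glun-his-leu-cys-gly-ser-his-leu-val-glu-ala-leu-tyr-leu-val-cys-gly-"
    "glu-arg-gly-phe-phe-tyr-thr-pro-lys-ala",
    "his-ser-glun-gly-thr-phe-thr-ser-asp-tyr-ser-lys-tyr-leu-asp-ser-arg-arg-ala-glun-"
    "asp-phe-val-glun-try-leu-met-aspn-thr",
]


def dipeptides(sequence):
    """Break 'ala-arg-gly' into [('ala', 'arg'), ('arg', 'gly')]."""
    residues = sequence.split("-")
    return list(zip(residues, residues[1:]))


def pair_counts(sequences):
    """Tally every (N-terminal, C-terminal) neighbor pair, i.e. the cells of the grid."""
    counts = Counter()
    for sequence in sequences:
        counts.update(dipeptides(sequence))
    return counts


def pooled_totals(counts):
    """Row total plus column total for each residue (the 'c' totals)."""
    totals = Counter()
    for (first, second), a in counts.items():
        totals[first] += a
        totals[second] += a
    return totals


def deviation_direct(counts):
    """Sum over all cells of (observed - expected)^2 / expected."""
    n = sum(counts.values())
    c = pooled_totals(counts)
    x = 0.0
    for first in c:
        for second in c:
            expected = c[first] * c[second] / (4 * n)
            observed = counts.get((first, second), 0)
            x += (observed - expected) ** 2 / expected
    return x


def deviation_short(counts):
    """Computational form: n * (4 * sum(a^2 / (c_i * c_j)) - 1)."""
    n = sum(counts.values())
    c = pooled_totals(counts)
    s = sum(a * a / (c[f] * c[g]) for (f, g), a in counts.items())
    return n * (4 * s - 1)


if __name__ == "__main__":
    grid = pair_counts(SEQUENCES)
    print("dipeptides counted:", sum(grid.values()))
    print("deviation, direct form:", deviation_direct(grid))
    print("deviation, short form: ", deviation_short(grid))
```

The two forms agree because the expected values sum to n when the pooled totals are used for both rows and columns; only the non-empty cells need to be visited in the short form.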
The value of $t$, which is calculated from equation (3), is 0.414, which is less than 1.645, the 5 per cent confidence limit.

It may therefore be concluded that there is no evidence for any intersymbol correlation between nearest neighbors. Inspection of sequences reveals likewise no obvious correlations of residues more than one removed from each other, but to decide this question definitely will require more knowledge of longer sequences than is now available.

Gamow, Rich and Ycas (6) have previously studied this question of intersymbol correlation. They examined a grid diagram, similar to Fig. 2 but embodying fewer data, to see whether the frequencies of entries follow the Poisson distribution. This method is invalid, since it does not take into account the fact that different amino acids occur with very different frequencies. I am glad to avail myself of this opportunity to correct these authors.

IV. FREQUENCY OF OCCURRENCE OF DIFFERENT AMINO ACIDS

Amino acids occur with different frequencies in proteins. Some, like leucine, are consistently abundant, others, like methionine, consistently rare. The frequency of occurrence of the various amino acids in the bulk protein of a whole organism, Escherichia coli, is shown in Fig. 3.

Fig. 3. Composition of bulk protein of Escherichia coli (87), amino acids arranged in order of abundance (plotted against rank, 1 to 20; amino acids and triplets shown as separate series). The values for glu, glun and asp, aspn arbitrarily taken as half of glx and asx, respectively. The value of cysteine taken from Roberts and Cowie (88). 'Triplets' refers to the frequencies of triplets of nucleotides, calculated according to the hypothesis of Gamow and Ycas (7) from the composition of E. coli RNA (89).

Data on the composition of twenty-three proteins are summarized in Table V. This table shows that the composition of individual proteins is not too different from that of bulk protein.
The most abundant amino acid usually has a frequency of about 0.10 to 0.12, the least 0.005 to 0.01.

Table V suggests the possibility that the differences in composition of various proteins may be merely the result of chance fluctuations from a mean, and not importantly related to biological function. This notion may not be as far-fetched as might appear at first sight. The most important function of proteins is catalysis, and the enzymatically active site probably involves only a few amino acids. In addition, proteins of a given organism appear to have an important mutually complementary relation to each other which enables them to be retained by the cells. This is shown by experiments with injected catalase. Homologous catalase injected into guinea pigs is absorbed by the tissues, but heterologous catalase is rejected (108). Similarly, homologous antibodies readily pass the fetal barriers in rabbits, heterologous pass much less readily (109). This phenomenon is probably connected with the antigenicity of proteins. The antigenically active sites of proteins are probably also small, and therefore the exact sequence and composition of the major part of the protein may be irrelevant to function.

It might be expected, then, that the exact structure of small parts of a protein molecule would be rigidly determined, and any mutation affecting this portion would be eliminated by selection. Mutations affecting the 'irrelevant' portions may not affect the viability of the organism, and the same protein in different species may therefore diverge by a process of 'evolutionary drift.' That this process is real is strongly suggested by the facts known about cytochrome c. This enzyme serves the same function and has the same prosthetic group in both yeast and mammalian tissues, but the two cytochromes have very different elution volumes from ion exchange resin columns (110), almost certainly indicating a large difference in amino acid composition.
If for each kind of residue there is a characteristic rate of replacement by mutation, the proteins should approach a definite equilibrium composition, if selection is a minor factor. More definitely, each protein will constitute a 'random grab' from a universe of amino acids, the frequencies of the amino acids in this universe being determined by the equilibrium condition.

Qualitative considerations suggest that there is something other than selection which tends to make a given amino acid occur with a certain frequency. Certain amino acids (alanine, leucine, isoleucine and valine) have aliphatic side chains lacking any obvious reactive functional group. The data on replacements (Table II) indicate, apparently, that one is as good as another, as far as their function in a protein is concerned. Yet leucine is systematically more abundant than isoleucine. These two amino acids are so similar that it is difficult to separate them by paper chromatography. Each of the other aliphatic amino acids likewise has its own characteristic frequency.

Quantitatively, if a sample of $n$ items is drawn at random from a population where an item of type A occurs with frequency $p$, the distribution of A in a large series of samples is given by the binomial $(p + q)^n$, where $q = 1 - p$. In particular, the variance $\sigma^2$ of the distribution of A is given by

$$\sigma^2 = npq \qquad (4)$$

If the hypothesis of a 'random grab' is correct, then in a collection of proteins the variances of amino acids should be related to the mean value of their frequencies and to the size of the proteins, expressed as the number of residues per molecule. An immediate difficulty is that the sizes of the proteins listed in Table V are not known, and these certainly differ one from another. It should be particularly noted that the relevant size is not necessarily that obtained from physical measurements of diffusion, osmotic pressure and sedimentation.
This is because there is ample evidence that physical molecules can be the result of aggregation of smaller, chemically identical units. Furthermore, from the evidence presented in Table III, the several peptide chains constituting some proteins may not be identical, but are nevertheless quite similar. The statistically relevant size of hemoglobin would then be somewhere between 600, the approximate number of residues in the whole molecule, and 150, the average size of the four subunits.

Disregarding this difficulty, I have plotted the variance of each amino acid, calculated from Table V, against $pq$ (Fig. 4). All points (except glx) fall within (or very close to) one standard error of the line for $n = 100$. The fact that the sizes of the proteins are not identical tends to scatter the points, making agreement with the hypothesis somewhat more significant. The large deviation of glx is due to its abundance in two proteins, γ casein and tropomyosin. If these are omitted the agreement is good.

Fig. 4. Plot of variances of amino acids against $pq$, where $p$ = mean frequency of occurrence of amino acid, $q = 1 - p$. Line $n = 100$: calculated variance for sample size (protein) of 100 residues. (glx) is the plot with the values from tropomyosin and γ casein omitted. (Labeled points include GLX, PRO, GLY, LEU, CYS and ARG; both axes are marked at 0.05 and 0.10.)

The evidence therefore permits (but of course does not prove) the hypothesis that the composition of proteins is mainly determined not by selection, but rather approximates to a 'random grab' from a single universe of amino acids.
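The binomial relation behind the 'random grab' argument can be checked with a small simulation. The sketch below is an illustration only and uses assumed values throughout (a residue type with frequency p = 0.10, proteins of n = 100 residues, 2000 simulated proteins); none of these figures is taken from Table V. It compares the variance of the simulated counts with the value npq of equation (4).

```python
import random
from statistics import pvariance


def binomial_variance(n, p):
    """Variance of the count of type A in a sample of n items: n * p * q."""
    return n * p * (1 - p)


def simulate_counts(n, p, samples, seed=0):
    """Count type-A residues in each of `samples` random proteins of n residues.

    Every residue is drawn independently and is of type A with probability p
    (the 'random grab' assumption).
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(samples)]


if __name__ == "__main__":
    n, p = 100, 0.10  # assumed protein size and residue frequency
    counts = simulate_counts(n, p, samples=2000)
    print("variance expected from npq:", binomial_variance(n, p))
    print("variance of simulated counts:", pvariance(counts))
```

With proteins of unequal size the simulated points would scatter more widely about the line for any single n, which is the effect the text mentions.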
There is of course no question that selection can produce proteins of very unusual composition. This occurs mainly in cases where the mechanical properties of protein fibers are important, as in keratin, collagen and silk. These have been omitted from Table V. The most extreme case known to me is the silk of the Congolese moth Anaphe moloneyi, where glycine and alanine together constitute 94 per cent of the entire protein (111).

Table V. Amino acid composition of twenty-three proteins. (The table was printed sideways and its entries are not legible in this copy.)
Fox and Homeyer (112) have also noted the general similarity of composition of various proteins, but have interpreted it in a quite novel manner. Their suggestion is that proteins are similar because the time that has elapsed since the origin of life has been too short to allow more differences to develop between the various proteins, all of which are presumed to be descendants of a single molecule. I believe the composition of silk tends to indicate that there has been ample time for any conceivable differentiation.

V. LENGTH OF PEPTIDE CHAINS

I have previously called attention to the apparent fact that the number of residues in naturally occurring peptide chains is an exact multiple of three (113). Since then, a more exact determination of the composition of ribonuclease (79) and the elucidation of the structure of glucagon (63) have shown that this statement is incorrect (Table VI).

Table VI. Length of Protein and Peptide Chains in Number of Residues
(Note: Cystine counted as two cysteine residues.)

Protein or peptide                         Number of residues   Reference
Oxytocin                                     9                  (75)
Vasopressin                                  9                  (40)
Melanophore expanding hormone I (hog)       18                  (69, 70)
Insulin 'A' chain                           21                  (8)
Glucagon                                    29                  (63)
Insulin 'B' chain                           30                  (8)
Melanophore expanding hormone II (hog)      30                  (114)
Melanophore expanding hormone (ox)          48                  (114)
Ribonuclease                               124                  (79)

In view of the predominance of chain lengths that are multiples of three, it might perhaps be suspected that the exceptions are due to secondary removal of residues, as occurs, for example, in the activation of pepsinogen, trypsinogen, chymotrypsinogen and fibrinogen. The tentative finding of Akabori (quoted in (41)), that the B chain of fish insulin has twenty-nine residues, rather than the thirty found in cattle insulin, makes it doubtful that secondary removal of residues is the explanation.
Since twenty-nine (the number of residues in glucagon) is a prime number, and not a factor in the chain lengths of other peptides, it seems reasonable to conclude that peptide chains are not multiples of some fixed number of residues.

VI. THE CODING PROBLEM

Having examined the protein text, we can now discuss what conclusions we may draw as to the storage, transfer and replication of the information contained in the protein molecule.

The gene, and by inference DNA, is thought to contain the information which eventually appears as a sequence of amino acid residues in the corresponding protein. As shown by a study both of the replacement process and of the amino acid sequences, each residue has an independent genetic representation. These representations are presumably aligned in linear order on the DNA molecule. There is in fact no evidence at present that the gene is anything other than a linear sequence of amino acid determining sites, although the possibility that it may also determine the structure of immunopolysaccharides in an analogous fashion cannot yet be dismissed.

Recent biochemical evidence (which I shall not discuss here) indicates that it is RNA, not DNA, which is directly involved in the process of protein formation. Transfer of information therefore involves at least two steps: DNA to RNA, and RNA to protein. The straightforward inference would thus be that DNA serves as a template for the formation of RNA. Absence of cytoplasmic inheritance supports the view that RNA is not a self-replicating structure. This is also supported by four lines of biochemical evidence:

1. The initial rate of incorporation of labeled precursors into nuclear RNA is much greater than into cytoplasmic RNA (115).
2. In Amoeba depleted of RNA, RNA only regenerates if a nucleus is present (116).
3. A one-way flow of RNA from nucleus to cytoplasm can be demonstrated (117).
4. The rate of RNA formation is minimal at the time DNA is replicating (118).
Unfortunately, this conclusion may be an oversimplification. There is no lack of biochemical evidence pointing in the opposite direction:

1. The composition of nuclear and cytoplasmic RNA is not identical (119).
2. The time curves of precursor incorporation into RNA do not indicate that the nuclear fraction is the precursor of the cytoplasmic (115).
3. Radioactive precursor is incorporated into the RNA of enucleated Acetabularia plants (120).
4. Different strains of RNA viruses are self-replicating. This is difficult to explain if RNA is the product of a DNA template.

The problem is to reconcile these apparently discordant facts.

Consider first the determination of RNA structure by DNA. Since both DNA and RNA are texts written in a four symbol alphabet, it is natural to suppose that the coding problem is very simple. It is sufficient to assume that one nucleotide of DNA determines one nucleotide of RNA (121). Recent evidence indicates, however, that this is incorrect. It is possible to suppress protein synthesis in susceptible bacteria with chloramphenicol. When this is done using amino acid-requiring strains, it can be demonstrated that amino acids are required for RNA synthesis, even though no protein synthesis is taking place (122, 123, 124). The natural inference, supported by several converging lines of evidence, is that it is not the nucleotides themselves which are the precursors of RNA, but rather compounds containing both a nucleotide and an amino acid. This leads to a unitary picture of the synthesis of RNA and of protein. When such precursors are lined up on a protein-synthesizing template (RNA), the amino acids polymerize to form protein; when lined up on DNA, the nucleotide portions polymerize to form RNA (Fig. 5).

If this is correct, an obvious conclusion follows. Since omission of a single amino acid stops RNA synthesis, the RNA-forming mechanism must distinguish not four, but a minimum of twenty different kinds of items.
But since the product contains only four, the RNA in general must contain less information than the template that made it. Several nucleotides in DNA must be involved in selecting a single nucleotide of RNA. Since the template must contain more information than the product, RNA cannot be the template for itself; i.e. it cannot be self-replicating.

Fig. 5. Schematic representation of the synthesis of RNA and protein from common precursors (see text). The nature of the template is presumed to determine whether the aligned precursors polymerize to produce protein or RNA. (Diagram labels: amino acids, nucleotides, template.)

There is an important exception to this statement. If the information in the template is reduced below a certain level, it is possible to obtain a product identical to the template itself. The formalization is as follows. While in process of formation, the RNA molecule can be visualized as a sequence of nucleotides to which amino acids are attached (Fig. 5). Before removal of amino acids on polymerization the informational content of the 'proto-RNA,' of length $n$, is $n \log_2 20$. After removal of the amino acids the information content is reduced to $n \log_2 4$. If restrictions of some kind exist on the number of combinations allowed, the information content possible for 'proto-RNA' will be reduced to $b\,[n \log_2 20]$ $(b < 1)$. Such restrictions on 'proto-RNA' will result in less severe restrictions on the RNA itself, since in general one configuration of RNA can correspond to numerous different configurations of 'proto-RNA'. Therefore, if there are $20^{bn}$ possible configurations of 'proto-RNA', RNA itself has $4^{cn}$ possible configurations available $(1 > c > b)$. The information content of RNA will equal that of 'proto-RNA',

$$b\,n \log_2 20 = c\,n \log_2 4 \qquad (5)$$

when $1 > c = 2.161\,b$. Since the information content of 'proto-RNA' is now the same as that of RNA, an RNA template could, formally, be self-replicating.

It is now possible to reconcile the genetic and biochemical facts outlined above.
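As a numerical check on the information balance of equation (5), the short Python sketch below evaluates both sides for an assumed chain length and an assumed restriction factor b. The function names and the example values (n = 300, b = 0.40) are illustrative assumptions, not figures from the text. The ratio c/b at which the two information contents are equal is log2 20 / log2 4, about 2.161, so the condition 1 > c requires b to be below about 0.463.

```python
import math


def proto_rna_bits(n, b=1.0):
    """Information content of 'proto-RNA' of length n under restriction factor b."""
    return b * n * math.log2(20)


def rna_bits(n, c=1.0):
    """Information content of RNA of length n under restriction factor c."""
    return c * n * math.log2(4)


# Ratio c / b at which the two information contents are equal.
RATIO = math.log2(20) / math.log2(4)

if __name__ == "__main__":
    n = 300   # assumed chain length, for illustration only
    b = 0.40  # assumed restriction on 'proto-RNA'
    c = RATIO * b
    print("c / b =", round(RATIO, 3))
    print("largest b that keeps c below 1:", round(1 / RATIO, 3))
    print("proto-RNA bits:", proto_rna_bits(n, b))
    print("RNA bits:      ", rna_bits(n, c))
```

Without any restriction (b = c = 1) the 'proto-RNA' carries about 4.32 bits per position against 2 bits for the RNA, which is the loss of information the text appeals to.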
Assume that the synthesis of RNA proceeds in two steps. At the first step, a strand of RNA is synthesized using a DNA template. Information is thus transferred from DNA to RNA. The next step is supposed to occur in the cytoplasm. RNA material is added to the nuclear-synthesized RNA, but in a manner which does not add to the informational content. A model for this process could be the building up of a complementary strand of DNA, as in the Watson and Crick scheme for DNA reproduction (125).*

Normally, the process stops at this stage, since the RNA molecule has insufficient information to act as a template for itself. In the case of viruses, however, the cytoplasmic process of adding new material to the original RNA results in the production of material identical to the template itself.

Table VII. The Composition of the Protein and RNA of Viruses
Composition of protein in moles per cent, of RNA as fractions of 1. † value assumed. It should be noted the influenza virus contains lipid, and the protein analysed may in part be of host provenance.

           Tobacco    Tomato        Turnip     Southern      Influenza
           Mosaic     Bushy Stunt   Yellows    Bean Mosaic   A
Protein    (126)      (127)         (128)      (129)         (130)
ala         9.6        8.5           6.6        7.5           5.9
arg         6.4        5.3           1.6        6.4           6.0
asx        10.3       11.1           4.2        7.3          11.7
cys         0.7        0.8           2.3        0.9           -
glx         8.7        5.7           7.1        6.8           7.0
gly         3.9        8.6           4.2        8.9           7.0
his         0.0        1.2           1.5        1.3           1.9
ileu        5.2        3.3           9.0        6.2           8.3
leu         7.1       10.9           8.6        8.3           8.5
lys         1.1        3.4           8.0        3.0           5.2
met         0.0        0.8           2.1        2.6           3.2
phe         5.7        3.6           2.5        3.5           4.7
pro         5.5        3.9          10.2        5.3           4.7
ser        10.0        8.6           8.4        8.7           4.4
thr        11.6       11.0          13.9       11.5           6.5
try         1.1        0.5           0.6        1.0†          1.1
tyr         2.4        2.8           1.5        4.1           3.6
val        10.8       10.0           7.9        6.7           6.1
amide      12.7       11.4           8.0        -             -

RNA        (131)      (127)         (132)      (131)         (133)
Ad          0.30       0.26          0.22       0.26          0.23
Gu          0.25       0.29          0.18       0.26          0.20
Cy          0.19       0.21          0.38       0.23          0.24
Ur          0.27       0.26          0.22       0.25          0.33
From this point of view, an RNA virus can be regarded as a specialized RNA molecule, which because of restrictions on the sequence of 'proto-RNA' can act as its own template, utilizing the normal RNA-synthesizing mechanism of its host. The composition of the RNA of viruses lends some support to these ideas. Normally the number of 6-keto (Gu + Ur) and 6-amino (Ad + Cy) groups in RNA is equal (89). Virus RNA does not necessarily obey this rule, indicating that it differs in this respect, at least, from all the others (Table VII).

* It is obvious that until more is known about RNA structure the question of its replication can be discussed only in general terms. If RNA is a double-stranded structure, the nucleotide composition shows that bases in the two chains cannot be uniquely paired as in DNA, but each base must pair with one of two others, as shown by the equality of 6-keto and 6-amino groups (89). In attempting to elucidate the details of RNA reproduction, information on the number of strands, whether each strand contains all the information of the whole structure, and where the complementary strand is synthesized, is of crucial importance.

This hypothetical scheme is presented to show that the apparent contradictions of the genetic and biochemical evidence do not make it logically necessary to abandon a unitary view of RNA reproduction.

The coding of protein information into RNA has attracted considerable attention, but cannot as yet be considered as solved. Study of the protein text indicates that any solution will have to meet several requirements.

Firstly, since exactly twenty amino acids are incorporated into protein, it is clear that at least three nucleotides are needed to determine an amino acid. Gamow (134) has proposed that 20 is a 'magic' number, which is the result of the existence of twenty possible sites of three nucleotides each.
Four kinds of items, taken three at a time, give twenty different combinations, if order is disregarded. Crick, Griffith and Orgel (135) point out, however, that there is at least one other way of deriving a 'magic' number 20. They start by considering the problem of what it is that delimits one amino acid-determining site from another, the 'punctuation mark problem'. Assuming that three bases determine a site, it is a problem why the $3n + 1$, $3n + 2$, $3n + 3$ bases represent a site, while $3n + 2$, $3n + 3$, $3n + 4$ do not. They solve this problem by assuming that only certain triplets of nucleotides correspond to an amino acid (sense sites), while others do not (non-sense sites). The criterion separating these two types of sites is the following. The set of sense sites are all triplets which, when placed next to each other in any possible combination, give sense sites only at positions $3n + 1$, $3n + 2$, $3n + 3$, but not otherwise. For example, the triplet AAA is a non-sense site, since when placed next to itself it gives the sequence AAAAAA. The site is not unambiguously defined, as AAA occurs both at the 1-3 position and at the 2-4 position. They find that there are exactly twenty triplets (out of sixty-four) which satisfy the criterion of sense sites, as follows:

ABA   BCA   ADC   BDD
ABB   BCB   ADD   CDA
ACA   BCC   BDA   CDB
ACB   ADA   BDB   CDC
ACC   ADB   BDC   CDD

Other ways of selecting twenty sense sites are also possible. The sense sites, these authors suggest, may correspond to amino acid-selecting sites of RNA. The 'punctuation mark problem' could, of course, also be solved if amino acids were selected in a sequential manner starting from one end of the template.

Secondly, besides the requirement that at least three nucleotides are required to determine an amino acid site, the study of proteins indicates that these amino acid determining sites are independent and share no nucleotides with their neighbors.
This conclusion follows from the absence of any intersymbol correlations in the protein text, and also from the fact that a mutation (as inferred from a study of homologous proteins) can result in a change at one site only, leaving the rest of the sequence unchanged. The number of nucleotides 92 Martynas Ycas in the template must therefore exceed the number of residues in the correspond- ing protein by a factor of at least three. Absence of intersymbol correlation shows that the 'overlapping' codes discussed by Gamow, Rich and Ycas (6) do not correspond to reality. The third requirement is somewhat more hypothetical. From the evidence presented above, it would appear that selection is not the sole factor determining the frequency of occurrence of the various amino acids. This is strongly suggested by the different frequencies of amino acids with aliphatic side chains, and particularly by the characteristic preponderance of leucine over isoleucine. It is therefore reasonable to believe that the coding principle itself imposes certain differences in frequency on the various amino acids. If only one configuration of nucleotides corresponds to each amino acid, the coding per se cannot make some amino acids frequent and others rare. This can be done, however, if some amino acids have more than one configura- tion of nucleotides to which they correspond. For this reason I am inclined to believe that the type of coding proposed by Crick, Griffith and Orgel (135) does not correspond to reality. Gamow and Ycas (7) have proposed a code that formally meets these three requirements. An amino acid is presumed to be determined by three nucleotides, taken without regard to order. In addition, the number of nucleotides in the RNA is assumed to be three times the number of amino acid residues in the corresponding protein. This has the following consequences: 1 . There are twenty such triplets, the same as the number of amino acids. 2. 
Neighboring triplets share no nucleotides between them. Any sequence of amino acids is thus permitted.
3. The frequencies of various amino acids, calculated on the assumption that the sequence in RNA is random, are unequal. This is because the expected frequency of any triplet is given by the product of the frequencies of the component nucleotides and the number of configurations for the given composition. Thus there are six ordered triplets (all presumed to determine the same amino acid) of the type ABC, three of AAB and one of AAA.

The pattern of frequency distribution of the various triplets, calculated in this manner, corresponds very closely to the amino acid distribution, as shown, for example, in Fig. 3 for the case of E. coli. I believe that this type of coding, even if not itself the one which actually occurs, is similar to the one that corresponds to reality. The most striking defect is that it provides no explanation for, and in fact contradicts, the requirement that in RNA the number of 6-keto groups should equal the number of 6-amino groups. H. A. Simon (136) has proposed a modification to take care of this difficulty. If RNA is a paired structure, somewhat similar to DNA, and 6-keto bases pair with 6-amino ones, then the following four pairs of nucleotides exist (again disregarding order): Ad-Gu; Ad-Ur; Cy-Gu; Cy-Ur. If one takes these pairs, rather than the individual nucleotides, as units, one can maintain an hypothesis of determination by sextuplets, analogous to determination by triplets. The frequency distribution of sextuplets, calculated for a random RNA sequence, is very similar to that obtained for the triplet distribution. This suggests that a whole series of codes of this type may exist, all having similar general properties. At present the major difficulty is not to produce a coding principle that explains the known facts, but rather to make a choice between the many that are possible.
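The triplet frequencies expected under the Gamow-Ycas code for a random RNA sequence can be sketched as follows. This is an illustrative calculation only: the function name `unordered_triplet_frequencies` is an assumption, and the base composition used is the rat liver RNA composition quoted with Fig. 6 of this paper (moles per cent). Each unordered triplet receives the product of its base frequencies multiplied by the number of distinct orderings (6, 3 or 1).

```python
from itertools import combinations_with_replacement, permutations

# Rat liver RNA composition in moles per cent, as quoted with Fig. 6 of the text.
LIVER_RNA = {"Ad": 18.4, "Gu": 33.1, "Cy": 30.5, "Ur": 18.0}

def unordered_triplet_frequencies(composition):
    """Expected frequency of each of the 20 unordered triplets,
    assuming a random nucleotide sequence."""
    total = sum(composition.values())
    p = {base: value / total for base, value in composition.items()}
    freqs = {}
    for triplet in combinations_with_replacement(sorted(p), 3):
        # Number of distinct orderings: 6 for ABC, 3 for AAB, 1 for AAA.
        n_orders = len(set(permutations(triplet)))
        prob = float(n_orders)
        for base in triplet:
            prob *= p[base]
        freqs[triplet] = prob
    return freqs

freqs = unordered_triplet_frequencies(LIVER_RNA)
# List the triplets from most to least frequent, in per cent.
for triplet, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print("-".join(triplet), round(100 * f, 2))
```

Because the twenty frequencies come out unequal, a code of this kind can by itself make some amino acids common and others rare, which is the third requirement discussed above.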
The correctness of a coding principle can, in general, be ascertained from a consistency of correspondence of the RNA and protein texts. Unfortunately, such a direct approach is not at present possible. Except perhaps in the case of RNA viruses, it is not possible to isolate a pure RNA corresponding to a pure protein, and were this possible, the sequence of nucleotides could not be determined by any method currently available. If the composition only of a series of RNA's and the corresponding proteins is known, it is theoretically possible to check some coding schemes as follows: If the coding scheme is correct, the various configurations of nucleotides can be assigned to the amino acids in such a manner as to give, when summed over the protein, the experimentally determined RNA composition, and this consistently for all RNA-protein pairs. No assumption need be made that the RNA sequence is random. Actual application of this method requires a large number of RNA-protein pairs of accurately determined composition, obviously differing as much as possible from each other, and the facilities of an electronic computer. The electronic computer is much the easier of the two to provide. At present the data are hopelessly inadequate, although analyses of the proteins and RNA's of viruses may eventually make such an approach possible. However, in attempting a correlation of viral RNA and protein (Table VII), it should be remembered that some viral RNA's do not show the equality Ad + Cy = Gu + Ur characteristic of non-viral RNA (89). This suggests that normal RNA may be multi-stranded, while viral RNA may not be. It is therefore not impossible that viral RNA may contain all the information, but not all the material, of a protein-determining structure, and hence differ in composition from it. An additional difficulty is that it is not certain that all viral RNA is concerned in the determination of the protein which eventually appears in the virus particle.
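The composition check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the assignment of triplets to amino acids and the protein composition below are invented toy values, not data or assignments from this paper. Given a trial assignment, the predicted RNA composition is obtained by summing the bases of each assigned triplet over all residues of the protein; it would then be compared with the measured composition for every RNA-protein pair.

```python
from collections import Counter

def predicted_rna_composition(protein_counts, assignment):
    """Sum the bases of the triplet assigned to each amino acid over the
    whole protein and return the composition in moles per cent."""
    bases = Counter()
    for amino_acid, n_residues in protein_counts.items():
        for base in assignment[amino_acid]:
            bases[base] += n_residues
    total = sum(bases.values())
    return {base: 100.0 * n / total for base, n in bases.items()}

def max_discrepancy(predicted, measured):
    """Largest absolute difference (moles per cent) between the predicted
    and the measured composition."""
    return max(abs(predicted.get(base, 0.0) - measured[base]) for base in measured)

# Hypothetical toy data, for illustration only.
toy_assignment = {
    "Gly": ("Gu", "Gu", "Cy"),
    "Ala": ("Ad", "Gu", "Cy"),
    "Leu": ("Cy", "Ur", "Ur"),
}
toy_protein = {"Gly": 10, "Ala": 5, "Leu": 5}   # residues per molecule
toy_measured = {"Ad": 10.0, "Gu": 40.0, "Cy": 35.0, "Ur": 15.0}

predicted = predicted_rna_composition(toy_protein, toy_assignment)
print(predicted)
print(max_discrepancy(predicted, toy_measured))
```

A search over all possible assignments, repeated for many RNA-protein pairs, is the part of the task for which the text calls on an electronic computer; the sketch shows only the evaluation of a single trial assignment.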
In lieu of anything better, I have attempted to make consistent assignments of triplets to amino acids on the assumption that the sequence in RNA is random. The random frequencies of triplets were calculated for liver (Fig. 5), Tobacco Mosaic and Turnip Yellow virus. I then tried to assign each triplet to an amino acid in such a manner that each member of the pair would have approximately the same frequency in the three cases. No satisfactorily consistent assignments could be obtained by this method. Assuming that the RNA's and proteins actually correspond, failure indicates one or more of the following:

1. The coding principle used is false.
2. The RNA is not a random sequence.
3. The proteins of viruses are so small that relatively large deviations from expected frequencies may be found. The molecular weight of TMV protein is about 17000 (48, 137), that of Southern Bean mosaic about 26000 (129). Several of the amino acids occur as only a few residues per molecule, so that a difference of one or two residues from the statistically expected value produces very large relative deviations.

Since the frequency of occurrence of an individual amino acid is small, even a larger protein such as hemoglobin may be too small to be a statistically valid sample for the purpose of calculating frequencies on the basis of a random RNA sequence. The following case is of interest. The RNA's of liver and of reticulocytes are virtually identical in composition, and therefore the proteins (bulk liver protein and hemoglobin) would be expected to have a very similar composition. Actually, this is not the case (Fig. 6).

[Fig. 6. The composition of bulk liver protein (142) and hemoglobin (93). The RNA composition of liver is from (89), that of reticulocytes from (143). All in moles per cent. Plot not reproduced; horizontal axis: liver protein. Inset table of RNA composition:

      Rat liver   Reticulocytes
Ad      18.4         17.5
Gu      33.1         34.7
Cy      30.5         29.9
Ur      18.0         17.9]

Considerable differences
exist, as can be seen from the deviations of the points from the line of slope 1. It would be better to use for this purpose the bulk RNA's and proteins of whole organisms and organs, were it not for the fact that bulk protein and RNA from various sources are so similar that no strong check on the coding principle is possible.

The method of assignments from the assumption of a random RNA sequence fails, then, either strongly to confirm or to deny any proposed coding principle. It is possible that as more information becomes available some light may be thrown on the coding problem from a study of replacements of residues in homologous proteins, if replacements prove to be nonrandom.

The reader will not fail to notice that the inadequacy of the data renders most of my conclusions tentative. More information of the type considered here will, of course, become available in the future and will not fail to clarify matters. I have attempted to organize and analyse such data as exist, in the hope that the value of this sort of information might become clearer, and in order to facilitate their examination as more become available. Obviously, data on composition and sequence are not the only possible sources of information bearing on coding. Strong hints will eventually be obtained from a study of RNA structure and sequence, as well as from other, more conventional, biochemical approaches. The solution of these problems will surely not be long delayed.

Acknowledgment: It is a pleasure to acknowledge the collaboration of George Gamow. I have profited from discussions of various aspects of these problems with Drs F. H. C. Crick, Beatrice S. Magdoff and Herbert A. Simon. Dr Louis J. Cote has given valuable assistance with statistical problems. The errors, of course, remain my own.

REFERENCES

1. H. Fraenkel-Conrat: The role of the nucleic acid in the reconstitution of active tobacco mosaic virus. J. Amer. Chem. Soc. 78, 882-883 (1956).
2. A. Gierer and G.
Schramm: Infectivity of ribonucleic acid from tobacco mosaic virus. Nature, Lond. 177, 702-703 (1956).
3. A. Gierer and G. Schramm: Die Infektiosität der Nucleinsäure aus Tabakmosaikvirus. Z. Naturf. 11b, 138-142 (1956).
4. A. L. Dounce: Duplicating mechanism for peptide chain and nucleic acid synthesis. Enzymologia 15, 251-258 (1952).
5. D. Schwartz: Speculations on gene action and protein specificity. Proc. Nat. Acad. Sci., Wash. 41, 300-307 (1955).
6. G. Gamow, A. Rich, and M. Ycas: The problem of information transfer from the nucleic acids to proteins. Advances in Biological and Medical Physics 4, 23-68, Academic Press, New York (1956).
7. G. Gamow and M. Ycas: Statistical correlation of protein and ribonucleic acid composition. Proc. Nat. Acad. Sci., Wash. 41, 1011-1019 (1955).
8. A. P. Ryle, F. Sanger, L. F. Smith, and R. Kitai: The disulfide bonds of insulin. Biochem. J. 60, 541-556 (1955).
9. W. J. Nickerson and G. Falcone: Identification of protein disulfide reductase as a cellular division enzyme in yeasts. Science 124, 722-723 (1956).
10. D. Mazia: Materials for the biophysical and biochemical study of cell division. Advances in Biological and Medical Physics 4, 69-118, Academic Press, New York (1956).
11. E. E. Howe: Properties of amino acids. In Amino Acids and Proteins, ed. by D. M. Greenberg, pp. 13-55, Charles C. Thomas, Springfield, Ill. (1951).
12. E. Windsor: α-Amino adipic acid as a constituent of a corn protein. J. Biol. Chem. 192, 595-606 (1951).
13. E. Work and D. L. Dewey: The distribution of α,ε-diaminopimelic acid among various micro-organisms. J. Gen. Microbiol. 9, 394-409 (1953).
14. H. Smith, R. E. Strange, and H. T. Zwartouw: α,ε-Diaminopimelic acid in the peptide moiety of the cell wall polysaccharide of Bacillus anthracis. Nature, Lond. 178, 856-866 (1956).
15. M. Flavin: The linkage of phosphate to protein in pepsin and ovalbumin. J. Biol. Chem. 210, 771-784 (1954).
16. F. R.
Bettelheim: Tyrosine-O-sulfate in a peptide from fibrinogen. J. Amer. Chem. Soc. 76, 2838-2839 (1954).
17. J. M. Barry: Use of glutamine by the mammary gland for the synthesis of casein. Nature, Lond. 174, 315-316 (1954).
18. F. M. Sinex and D. D. Van Slyke: The source and state of the hydroxylysine of collagen. J. Biol. Chem. 216, 245-250 (1955).
19. F. M. Sinex: personal communication.
20. M. R. Stetten: Some aspects of the metabolism of hydroxyproline, studied with the aid of isotopic nitrogen. J. Biol. Chem. 181, 31-37 (1949).
21. G. Wolf, W. W. Heck, and J. C. Leak: The metabolism of hydroxyproline-a-C" in the intact rat. Radioactivity in amino acids from proteins. J. Biol. Chem. 223, 95-105 (1956).
22. G. Burnett and E. P. Kennedy: The enzymatic phosphorylation of proteins. J. Biol. Chem. 211, 969-979 (1954).
23. J. Roche and R. Michel: Thyroid hormones and iodine metabolism. Annu. Rev. Biochem. 23, 481-500 (1954).
24. M. Flavin: The linkage of phosphate to protein in pepsin and ovalbumin. J. Biol. Chem. 210, 771-784 (1954).
25. A. Rich and F. H. C. Crick: The structure of collagen. Nature, Lond. 176, 915-916 (1955).
26. H. A. Itano: The hemoglobins. Annu. Rev. Biochem. 25, 331-348 (1956).
27. T. H. J. Huisman, J. H. P. Jonxis, and P. C. Van Der Schaaf: Amino acid composition of four different kinds of human haemoglobin. Nature, Lond. 175, 902-903 (1955).
28. V. M. Ingram: A specific chemical difference between the globins of normal human and sickle-cell anaemia haemoglobin. Nature, Lond. 178, 792-794 (1956).
29. J. V. Evans, J. W. B. King, B. L. Cohen, H. Harris, and F. L. Warren: Genetics of haemoglobin and blood potassium differences in sheep. Nature, Lond. 178, 849-850 (1956).
30. D. W. Green, A. C. T. North, and R. Aschaffenburg: Crystallography of the β-lactoglobulins of cow's milk. Biochim. Biophys. Acta 21, 583-585 (1956).
31. N. H. Horowitz and M. Fling: The role of the genes in the synthesis of enzymes.
In Enzymes: Units of Biological Structure and Function, ed. by O. H. Gaebler, pp. 139-145, Academic Press, New York (1956).
32. W. K. Maas and B. D. Davis: Production of an altered pantothenate-synthesizing enzyme by a temperature-sensitive mutant of Escherichia coli. Proc. Nat. Acad. Sci., Wash. 38, 785-797 (1952).
33. T. L. Deich and E. T. Soreni: Aminokontsevie gruppi gliadinov i ich izmenenie pod vlianiem mezhrodovoi gibridizatsii. C. R. Acad. Sci. U.R.S.S. 98, 623-626 (1954).
34. J. I. Harris, F. Sanger, and M. A. Naughton: Species differences in insulin. Arch. Biochem. Biophys. 65, 427-438 (1956).
35. J. C. Kendrew, R. G. Parrish, J. C. Marack, and E. S. Orlans: The species specificity of myoglobin. Nature, Lond. 174, 946-949 (1954).
36. K. Felix: Zur Chemie des Zellkerns. Experientia 8, 312-317 (1952).
37. E. O. P. Thompson: The N-terminal sequence of serum albumins; observations on the thiohydantoin method. J. Biol. Chem. 208, 565-572 (1954).
38. W. F. White, J. Shields, and K. C. Robbins: C-terminal sequence of crystalline bovine and human serum albumins: relationship of C-terminus to antigenic determinants of bovine serum albumin. J. Amer. Chem. Soc. 77, 1267-1269 (1955).
39. H. Tuppy and S. Paleus: Study of a peptic degradation product of cytochrome c. I. Purification and chemical composition. Acta Chem. Scand. 9, 353-364 (1955).
40. V. Du Vigneaud, H. C. Lawler, and E. A. Popenoe: Enzymatic cleavage of glycinamide from vasopressin and a proposed structure for this pressor-antidiuretic hormone of the posterior pituitary. J. Amer. Chem. Soc. 75, 1880-1881 (1953).
41. H. Ozawa and K. Satake: On the species difference of N-terminal amino acid sequence in hemoglobin I. J. Biochem. 42, 641-648 (1955).
42. L. Lorand and W. R. Middlebrook: Species specificity of fibrinogen as revealed by end-group studies. Science 118, 515-516 (1953).
43. P. H. Bell: Purification and structure of β-corticotropin. J. Amer. Chem. Soc. 76, 5565-5567 (1954).
44.
W. F. White and W. A. Landmann: Studies on adrenocorticotropin. XI. A preliminary comparison of corticotropin-A with β-corticotropin. J. Amer. Chem. Soc. 77, 1711-1712 (1955).
45. C. H. Li, I. I. Geschwind, I. D. Raacke, J. I. Harris, and J. J. Dixon: Amino acid sequence of alpha-corticotropin. Nature, Lond. 176, 687-689 (1955).
46. L. T. Skeggs, W. H. Marsh, J. R. Kahn, and N. P. Shumway: Amino acid composition and electrophoretic properties of hypertensin I. J. Exp. Med. 102, 435-440 (1955).
47. W. S. Peart: Composition of a hypertensin peptide. Nature, Lond. 177, 132 (1956).
48. C. I. Niu and H. Fraenkel-Conrat: C-terminal amino acid sequences of four strains of tobacco mosaic virus. Arch. Biochem. Biophys. 59, 538-540 (1955).
49. S. Benzer: Fine structure of a genetic region in bacteriophage. Proc. Nat. Acad. Sci., Wash. 41, 344-354 (1955).
50. G. Pontecorvo and J. H. Roper: Resolving power of genetic analysis. Nature, Lond. 178, 83-84 (1956).
51. H. Fraenkel-Conrat: The chemistry of proteins and peptides. Annu. Rev. Biochem. 25, 291-330 (1956).
52. R. H. Locker: C-terminal groups in myosin, tropomyosin and actin. Biochim. Biophys. Acta 14, 533-542 (1954).
53. E. O. P. Thompson: The N-terminal sequence of carboxypeptidase. Biochim. Biophys. Acta 10, 633-634 (1953).
54. T. Posternak and H. Pollaczek: De la protection contre l'hydrolyse enzymatique exercée par les groupes phosphorylés. Étude de la dégradation enzymatique d'un peptide et d'un polyose phosphorylés. Helv. Chim. Acta 24, 921-930 (1941).
55. N. Seno, K. Murai, and K. Shimura: Studies on the N-terminal lysylpeptides of α-casein. J. Biochem. 42, 699-704 (1955).
56. H. Neurath and W. J. Dreyer: The activation of chymotrypsinogen. Isolation and identification of a peptide liberated during activation. J. Biol. Chem. 217, 527-539 (1955).
57. F. Turba and G. Gundlach: Aminosäuresequenz in der Umgebung des reaktiven Serinrestes im Chymotrypsin-Molekül. Biochem. Z. 327, 186-188 (1955).
58.
(This article consists of paragraphs by different authors) Erweitertes makromolekulares Kolloquium. Angew. Chem. 65, 349-352 (1953).
59. K. Felix, R. Hirohata, and K. Dirr: Über Clupein. Hoppe-Seyl. Z. 218, 269-279 (1933).
60. W. A. Schroeder, L. M. Kay, J. LeGette, L. Honnen, and F. C. Green: The constitution of gelatin. Separation and estimation of peptides in partial hydrolysates. J. Amer. Chem. Soc. 76, 3556-3564 (1954).
61. T. D. Kroner, W. Tabroff, and J. J. McGarr: Peptides isolated from a partial hydrolysate of steer hide collagen. II. Evidence for the prolylhydroxyproline linkage of collagen. J. Amer. Chem. Soc. 77, 3356-3359 (1955).
62. R. R. Porter: A chemical study of rabbit antiovalbumin. Biochem. J. 46, 473-478 (1950).
63. W. W. Bromer, L. G. Sinn, A. Staub, and O. K. Behrens: The amino acid sequence of glucagon. J. Amer. Chem. Soc. 78, 3858-3860 (1956).
64. C. I. Niu and H. Fraenkel-Conrat: Determination of C-terminal amino acids and peptides by hydrazinolysis. J. Amer. Chem. Soc. 77, 5882-5885 (1955).
65. A. R. Thompson: Amino acid sequence in lysozyme. 2. Elution chromatography of peptides on ion-exchange resins. Biochem. J. 61, 253-263 (1955).
66. R. Acher, U. R. Laurila, and C. Fromageot: Contribution à l'étude de la structure du lysozyme d'oeuf de poule. Peptide aromatique obtenue par hydrolyse enzymatique. Biochim. Biophys. Acta 19, 97-109 (1956).
67. K. Ohno: On the structure of lysozyme. III. On the carboxyl-terminal peptide. J. Biochem. 42, 615-625 (1955).
68. K. Ohno: On the structure of lysozyme. II. Characterization of aspartyl, asparaginyl, and glutaminyl residues in lysozyme. J. Biochem. 41, 345-350 (1954).
69. I. I. Geschwind, C. H. Li, and L. Barnafi: Isolation and structure of melanocyte-stimulating hormone from porcine pituitary glands. J. Amer. Chem. Soc. 78, 4494-4495 (1956).
70. J. I. Harris and P. Roos: Amino-acid sequence of a melanophore stimulating peptide. Nature, Lond. 178, 90 (1956).
71. V. M.
Ingram: The application of Edman's peptide degradation method to horse myoglobin and haemoglobin. Biochim. Biophys. Acta 16, 599-600 (1955).
72. M. Flavin and C. B. Anfinsen: The isolation and characterization of cysteic acid peptides in studies on ovalbumin synthesis. J. Biol. Chem. 211, 375-390 (1954).
73. M. Ottesen and A. Wollenberg: Stepwise degradation of the peptides liberated in the transformation of ovalbumin to plakalbumin. C. R. Lab. Carlsberg (Chim.) 28, 463-475 (1953).
74. M. Flavin: Cysteine and phosphoserine containing peptide sequences of ovalbumin. Nature, Lond. 173, 214 (1954).
75. V. Du Vigneaud, C. Ressler, and S. Trippett: The sequence of amino acids in oxytocin, with a proposal for the structure of oxytocin. J. Biol. Chem. 205, 949-957 (1953).
76. E. O. P. Thompson: Crystalline papain. IV. Free amino groups and N-terminal sequence. J. Biol. Chem. 207, 563-574 (1954).
77. M. B. Williamson and J. M. Passman: The amino acid sequence at the N terminus of pepsin. J. Biol. Chem. 222, 151-157 (1956).
78. D. Cole and C. H. Li: N-terminal sequence of prolactin. Fed. Proc. 14, 195 (1955).
79. C. H. W. Hirs, W. H. Stein, and S. Moore: Peptides obtained by chymotryptic hydrolysis of performic acid-oxidized ribonuclease. A partial structural formula for the oxidized protein. J. Biol. Chem. 221, 151-169 (1956).
80. R. Monier and M. Jutisz: Contribution à l'étude de la structure de la salmine d'Oncorhynchus. Biochim. Biophys. Acta 14, 551-558 (1954).
81. R. Monier and M. Jutisz: Contribution à l'étude de la structure de la salmine d'Oncorhynchus. II. Étude de quelques peptides résultant de l'hydrolyse trypsique. Biochim. Biophys. Acta 15, 62-68 (1955).
82. F. Lucas, J. T. B. Shaw, and S. G. Smith: Amino-acid sequence in a fraction of Bombyx silk fibroin. Nature, Lond. 178, 861 (1956).
83. L. M. Kay and W. A. Schroeder: The chromatographic separation and identification of some peptides in partial hydrolysates of silk fibroin. J. Amer. Chem. Soc.
76, 3564-3568 (1954).
84. E. Abderhalden and A. Bahn: Isolierung von Tyrosyl-seryl-prolyl-tyrosin beim stufenweisen Abbau von Seidenfibroin (Bombyx mori). Hoppe-Seyl. Z. 219, 72-81 (1933).
85. E. W. Davie and H. Neurath: Identification of a peptide released during autocatalytic activation of trypsinogen. J. Biol. Chem. 212, 515-529 (1955).
86. R. Consden and A. H. Gordon: A study of the peptides of cystine in partial hydrolysates of wool. Biochem. J. 46, 8-20 (1950).
87. A. Polson: Quantitative partition chromatography and the composition of E. coli. Biochim. Biophys. Acta 2, 575-581 (1948).
88. R. B. Roberts, D. B. Cowie, P. H. Abelson, E. T. Bolton, and R. J. Britten: Studies of biosynthesis in Escherichia coli. Carnegie Institution of Washington Publication 607, p. 28, Washington, D.C. (1955).
89. D. Elson and E. Chargaff: Evidence of common regularities in the composition of pentose nucleic acids. Biochim. Biophys. Acta 17, 367-376 (1955).
90. F. Friedberg: The amino acid composition of adenosine triphosphate-creatine transphosphorylase. Arch. Biochem. Biophys. 61, 263-266 (1956).
91. K. Laki, D. R. Komintz, P. Symonds, L. Lorand, and W. H. Seegers: The amino acid composition of bovine prothrombin. Arch. Biochem. Biophys. 49, 276-282 (1954).
92. D. R. Komintz, A. Hough, P. Symonds, and K. Laki: The amino acid composition of actin, myosin, tropomyosin and the meromyosins. Arch. Biochem. Biophys. 50, 148-159 (1954).
93. A. Rossi-Fanelli, D. Cavallini, and L. De Marco: Amino acid composition of human crystallized myoglobin and hemoglobin. Biochim. Biophys. Acta 17, 377-381 (1955).
94. G. Jolles and C. Fromageot: La protéine lysante II de la rate du lapin. II. Composition en acides aminés. Biochim. Biophys. Acta 14, 219-227 (1954).
95. W. G. Gordon and J. Ziegler: Amino acid composition of crystalline α-lactalbumin. Arch. Biochem. Biophys. 57, 80-86 (1955).
96. K.
Lange: Aminosäurezusammensetzung kristallisierter Alkohol-dehydrogenase aus Bäckerhefe. Hoppe-Seyl. Z. 303, 272-275 (1956).
97. J. Muus: The amino acid composition of human salivary amylase. J. Amer. Chem. Soc. 76, 5163-5165 (1954).
98. W. G. Gordon, W. F. Semmett, and M. Bender: Amino acid composition of γ-casein. J. Amer. Chem. Soc. 75, 1678-1679 (1953).
99. E. L. Smith, A. Stockell, and J. R. Kimmel: Crystalline papain. III. Amino acid composition. J. Biol. Chem. 207, 551-561 (1954).
100. E. L. Smith, M. L. McFadden, A. Stockell, and V. Buettner-Janusch: Amino acid composition of four rabbit antibodies. J. Biol. Chem. 214, 197-207 (1955).
101. E. L. Smith, J. R. Kimmel, D. M. Brown, and E. O. P. Thompson: Isolation and properties of a crystalline mercury derivative of a lysozyme from papaya latex. J. Biol. Chem. 215, 67-89 (1955).
102. E. L. Smith and A. Stockell: Amino acid composition of crystalline carboxypeptidase. J. Biol. Chem. 207, 501-514 (1954).
103. S. F. Velick and E. Ronzoni: The amino acid composition of aldolase and d-glyceraldehyde phosphate dehydrogenase. J. Biol. Chem. 173, 627-639 (1948).
104. B. A. Koechlin and H. D. Parish: The amino acid composition of a protein isolated from lobster nerve. J. Biol. Chem. 205, 597-604 (1953).
105. E. L. Smith, D. M. Brown, M. L. McFadden, V. Buettner-Janusch, and B. U. Jager: Physical, chemical and immunological studies on globulin from multiple myeloma. J. Biol. Chem. 216, 601-620 (1955).
106. W. H. Stein and S. Moore: Amino acid composition of lactoglobulin and bovine serum albumin. J. Biol. Chem. 178, 79-91 (1949).
107. S. F. Velick and L. F. Wicks: The amino acid composition of phosphorylase. J. Biol. Chem. 190, 741-751 (1951).
108. R. N. Feinstein, M. Hampton, and G. J. Cotter: Species specificity of catalase. Enzymologia 16, 219-225 (1953).
109. I. Batty, F. W. R. Brambell, W. A. Hemmings, and C. L. Oakley: Selection of antitoxins by the foetal membranes of rabbits. Proc. Roy. Soc.
B 142, 452-471 (1954).
110. B. Hagihara, T. Horio, M. Nozaki, I. Sekuzu, J. Yamashita, and K. Okunuki: Comparison of properties of crystalline cytochrome c from yeast, beef heart and pig heart. Nature, Lond. 178, 631-632 (1956).
111. F. Lucas, J. T. B. Shaw, and S. G. Smith: The chemical constitution of some silk fibroins and its bearing on their physical properties. Shirley Institute Memoirs 28, 77-89 (1955).
112. S. W. Fox and P. G. Homeyer: A statistical evaluation of the kinship of protein molecules. Amer. Nat. 89, 163-168 (1955).
113. M. Ycas: Numerology of peptide chains. Naturwissenschaften 43, 197-198 (1956).
114. B. J. Benfrey and J. L. Purvis: Purification and amino acid analysis of melanophore-expanding hormone from hog and ox pituitary glands. Biochem. J. 62, 588-593 (1956).
115. R. M. S. Smellie: The metabolism of the nucleic acids. The Nucleic Acids, ed. by E. Chargaff and J. N. Davidson, vol. 2, pp. 393-434, Academic Press, New York (1955).
116. J. Brachet: Action of ribonuclease and ribonucleic acid on living amoebae. Nature, Lond. 175, 851-853 (1955).
117. L. Goldstein and W. Plaut: Direct evidence for nuclear synthesis of cytoplasmic ribose nucleic acid. Proc. Nat. Acad. Sci., Wash. 41, 874-880 (1955).
118. K. Lark and O. Maaloe: Nucleic acid synthesis and the division cycle of Salmonella typhimurium. Biochim. Biophys. Acta 21, 448-458 (1956).
119. B. Magasanik: Isolation and composition of the pentose nucleic acids and the corresponding nucleo-proteins. The Nucleic Acids, ed. by E. Chargaff and J. N. Davidson, vol. 1, pp. 373-407, Academic Press, New York (1955).
120. J. Brachet and D. Szafarz: L'incorporation d'acide orotique radioactif dans des fragments nucléés et anucléés d'Acetabularia mediterranea. Biochim. Biophys. Acta 12, 588-589 (1953).
121. L. S. Lockingen and A. G. DeBusk: A model for intracellular transfer of DNA (gene) specificity. Proc. Nat. Acad. Sci., Wash. 41, 925-934 (1955).
122. A. B. Pardee and L. S.
Prestidge: The dependence of nucleic acid synthesis on the presence of amino acids in Escherichia coli. J. Bact. 71, 677-683 (1956).
123. F. Gros and F. Gros: Rôle des aminoacides dans la synthèse des acides nucléiques chez Escherichia coli. Biochim. Biophys. Acta 22, 200-201 (1956).
124. M. Ycas and G. Brawerman: Interrelations between nucleic acid and protein biosynthesis in micro-organisms. Arch. Biochem. Biophys. 68, 118-129 (1957).
125. J. D. Watson and F. H. C. Crick: Genetical implications of the structure of deoxyribose nucleic acid. Nature, Lond. 171, 964-967 (1953).
126. F. L. Black and C. A. Knight: A comparison of some mutants of tobacco mosaic virus. J. Biol. Chem. 202, 51-57 (1953).
127. D. De Fremery and C. A. Knight: A chemical comparison of three strains of tomato bushy stunt virus. J. Biol. Chem. 214, 559-566 (1955).
128. E. Roberts and G. B. Ramasarma: Amino acids of turnip yellow virus. Proc. Soc. Exp. Biol. Med. 80, 101-103 (1952).
129. B. S. Magdoff, R. J. Block, and D. B. Montie: Amino acid composition of southern bean mosaic virus. Contr. Boyce Thompson Inst. 18, 371-375 (1956).
130. C. A. Knight: Amino acid composition of highly purified viral particles of influenza A and B. J. Exp. Med. 86, 125-129 (1947).
131. R. W. Dorner and C. A. Knight: The preparation and properties of some plant virus nucleic acids. J. Biol. Chem. 205, 959-967 (1953).
132. R. Markham and J. D. Smith: Chromatographic studies of nucleic acids. 4. The nucleic acid of the turnip yellow mosaic virus, including a note on the nucleic acid of the tomato bushy stunt virus. Biochem. J. 49, 401-406 (1951).
133. G. L. Ada and B. T. Perry: Specific differences in the nucleic acids from A and B strains of influenza virus. Nature, Lond. 175, 854 (1955).
134. G. Gamow: Possible mathematical relation between deoxyribonucleic acid and protein. Dansk. Biol. Medd. 22, No. 3 (1954).
135. F. H. C. Crick, J. S. Griffith, and L. E. Orgel: Codes without commas. Proc. Nat. Acad. Sci., Wash.
43, 416-421 (1957).
136. H. A. Simon: personal communication.
137. I. Harris and C. A. Knight: Studies on the action of carboxypeptidase on tobacco mosaic virus. J. Biol. Chem. 214, 215-230 (1955).
138. T. L. Deich and E. T. Szorenyi: Chem. Abstr. 49, 1882 (1955).
139. Zoltan Koros: Free amino groups of gliadin. Magyar Kém. Folyóirat 56, 131-136 (1950).
140. L. K. Ramachandran and W. B. McConnell: The terminal amino acids of wheat gliadin. Canad. J. Chem. 33, 1463-1466 (1955).
141. T. Deutsch: N-terminal amino acids of gliadins from wheat and rye. Acta Physiol. Acad. Sci. Hung. 6, 209-224 (1954).
142. B. S. Schweigert, B. T. Guthneck, J. M. Price, J. A. Miller, and E. C. Miller: Amino acid composition of morphological fractions of rat livers and induced liver tumors. Proc. Soc. Exp. Biol. Med. 72, 495-501 (1949).
143. G. Rost: Zusammensetzung der Ribonucleinsäure der Reticulozyten. Naturwissenschaften 43, 499 (1956).

DISCUSSION

Koch: I should like to comment on the result of some recent tracer experiments that have been conducted in Dr Swick's laboratory at the Argonne National Laboratory (1, 2, 3). What we have tried to do is to ask ourselves something about the total balance of the turnover of RNA, DNA, and protein in the tissue which is most often studied by the biochemist; namely, rat liver. The interesting thing that comes out of this is that when suitable tracer experiments are done, you can make the definite statement that in a single cell DNA is synthesized when the cell is produced and stays as a cell compound until the death of the cell, whereas on the other hand it is very easy to show that all of the RNA in the cell is turned over, and it is turned over essentially with about the same half-life that all of the proteins are turned over in the cell; that is, there are no special classes of proteins that are not turned over, or special classes of RNA that are not turned over, in this tissue.
The immediate conclusion from this is that, inasmuch as the amount of protein is many times more than the amount of RNA, on a molar or other basis, there can be no one-to-one hand-off of this kind. In other words, you cannot take the DNA and make the RNA from it without using it over and over again in a different way than has been suggested here.

Ycas: While it may be true that there is turnover of RNA in rat liver, I believe, on the basis of work with micro-organisms, that there is no obligatory turnover of RNA associated with protein synthesis. The RNA, which is part of the protein-forming mechanism, is a passive template, and apparent coupling or dissociation of protein and RNA turnover is adequately explained, I think, by the assumption that both have common precursors.

Koch: I would just like to add that in the case of micro-organisms it is fairly clear that protein turnover does not occur (4). It is also pretty well established that DNA and RNA turnover do not occur in an actively growing culture. So the concept of turnover in the micro-organism is not a relevant one. But what it does mean is that you cannot accept some of the proposals that have been described that inherently require the obligatory breakdown of something (RNA), concomitant with the synthesis of another type of molecule (protein).

Morowitz: I would like to introduce some evidence for an alternative approach to the problem of intersymbol influence. In some work recently published by Sidney Fox (5) analyses are reported on the total protein of soybean, corn, wheat, and rye. These analyses indicate that a very high proportion of the protein molecules have lysine in an N-terminal position and arginine in the next position. This approach to statistical constraints involves an experimental analysis of a population of proteins from a single source as contrasted to Dr Ycas' theoretical analysis of a population of unrelated proteins.
We have attempted to determine if any constraints are to be found in E. coli protein. The preliminary results indicate that methionine is found in N-terminal positions in a proportion consistent with a chance distribution. Cystine and cysteine in N-terminal positions may show a considerably greater constraint.

Ycas: I think that the method used by Fox and yourself introduces an obvious source of bias, if what you are trying to do is look for intersymbol correlations. The abundances of different species of protein in a cell are not equal, and more abundant proteins contribute more end groups. You have to examine the proteins one by one, giving the same statistical weight to each. A similarity in end groups of proteins from related species indicates not an effect of intersymbol correlation, but rather descent from a common ancestor. As can be seen from the data I summarized, proteins change only slowly in evolution.

Branson: There is one question which has been opened up by Dr Gamow's and Dr Ycas' comments; namely, the whole problem of redundancy in protein molecules. The evidence is fairly conclusive, I believe, that so far as the antigenic action of a protein is concerned, the active region is approximately 15 Å on a side. If the same is true of other biological functions, a great deal of surface area in a protein is passive. At least it is passive for a given specific function. Thus it is reasonable to inquire how much of a protein molecule you can whittle away and keep a given biological property. There is a fairly convincing teleological explanation for this redundancy. In the early history of living systems, the membranes containing the living material might have been rather leaky. Thus to retain the small biologically-active components within the cell, they had to be associated with a large but inactive structure which would not pass out through the large spaces.
In the evolutionary scheme, then, there remain many large units where really the functional part is relatively small. So that when one amino acid is taken out and another put in, the substitution does not make much difference so long as it is not in the essential small functioning unit of the protein molecule.

Ycas: I am also of the opinion that mere size of an enzyme may be quite important for the totality of its biological functions, even if it seems to make no difference to the catalytic function as measured in a test tube. Which part of a protein is significant and which is not is a matter of what function we are measuring. I doubt that at present we know all the functions of a protein from the point of view of the organism itself.

REFERENCES

1. R. W. Swick and D. T. Handa: The distribution of fixed carbon in amino acids. J. Biol. Chem. 218, 557 (1956).
2. R. W. Swick, A. L. Koch, and D. T. Handa: The measurement of nucleic acid turnover in rat liver. Arch. Biochem. Biophys. 63, 226-242 (1956).
3. R. W. Swick and A. L. Koch: The measurement of nucleic acid phosphorus turnover in rat liver by the constant exposure technique. Arch. Biochem. Biophys. 67, 59-73 (1957).
4. A. L. Koch and H. R. Levy: Protein turnover in growing cultures of Escherichia coli. J. Biol. Chem. 217, 947-951 (1955).
5. S. Fox: Evolution of protein molecules and thermal synthesis of biochemical substances. Amer. Scient. 44, 347-359 (1956).

PROTEIN STRUCTURE AND INFORMATION CONTENT*

L. G. AUGENSTINE
Brookhaven National Laboratory, Upton, New York

I. INTRODUCTION

In stating that a given system has an information content of a certain number of bits, care must be taken not only to specify the context within which this number has been derived but also to give meaning and utility to this measure. Specifying the context is particularly important since for most systems there are many levels at which the information content can be derived.
For example, the information content of a cell is very low if one is concerned only with whether it is living or dead, but it is very large if one is interested in specifying the parameters of each of its individual elementary particles. In this article, estimates will be made of the information content of given proteins by taking into account that they are a sequence of amino acids which can assume only a discrete number of configurations. An attempt will be made to study some of the factors which affect the information content and the types of constraints which must operate in the elaboration of proteins. Some idea of the magnitude and types of the constraints pertinent to proteins can be obtained from parallel studies on proteins and printed English (for which the constraints are known). Finally, the information content based upon structure will be compared with estimates of information content obtained within the context of protein function.

Although the fact has not always been fully appreciated, information measures are usually more effective in selecting among alternative hypotheses than in suggesting new ones. This particular trait arises from the fact that information estimates, which depend only upon the probabilities associated with a class of experimental outcomes, will often describe the degree to which a number of variables interact but indicate little or nothing about the behavior of the individual variables. As a result no novel synthetic procedures or selection principles are advanced here to explain the manner in which polypeptide sequences and/or configurations are determined. Rather, in this paper information theory considerations have been used to evaluate alternative explanations of some aspects of protein construction.

II.
ESTIMATION OF STRUCTURAL INFORMATION CONTENT AND CONSTRAINTS

At the structural level the total information content ($I_t$) of a protein will be treated as the sum of two terms; one ($I_s$) depends upon the amino acid sequence and the other ($I_c$) upon the configurations of the polypeptide chain in the native molecule. Treating sequence and configuration independently should lead to overestimates of $I_t$, since the permissible configurations will depend upon the sequence. However, care has been taken to reduce the interaction of the two terms as much as possible, so that for the purposes of this paper no significant discrepancies should occur.

* Research carried out at Brookhaven National Laboratory under the auspices of the U.S. Atomic Energy Commission.

Fig. 1. Values of $I_s/(I_s)_{max}$ as a function of the number of symbols, $N$, in proteins and paragraphs. [Scatter plot; axes: $N$ and $I_s/(I_s)_{max}$ (0.50 to 1.00). Legend: dots, values calculated from proteins; crosses, values calculated from English paragraphs. Labeled points include myosin, tropomyosin, insulin (48,000), silk fibroin, edestin, ovalbumin, zein, growth hormone, pepsin, chymotrypsinogen, β-lactoglobulin, gliadin, bovine serum albumin, lactogenic hormone, horse myoglobin, ACTH, ribonuclease, insulin (12,000) and salmine.]

Fig. 2. Distribution of the normalized frequency, $n_i/(N/m)$, of letters and amino acids in the language and protein samples. See the text for further discussion.

Sequence: There are twenty amino acids which are most commonly incorporated into proteins. Therefore the maximum value of $I_s$ is 4.32 bits ($\log_2 20$) per amino acid residue.* It would occur when the twenty amino acids occur equiprobably. Values less than the maximum would occur due to any constraints upon the amino acid sequence.
Branson (1) calculated $I_s$ of twenty-six proteins for which the frequency of occurrence of the twenty amino acids had been determined (disregarding possible sequential dependencies). He found that those which formed part of a living structure of an organism had an $I_s$ which was greater than 0.70 of the maximum value. His analysis is shown by the dots in Fig. 1. The crosses show the result of a similar analysis on language samples. The language study was based on ten paragraphs chosen from diverse sources such as want ads, newspaper articles, textbooks, and magazines, and differs from that usually used in analysis of language in that it is based on the paragraph rather than on large continuous samples.† In this case, letters have been treated like amino acids and paragraphs like proteins. Except for the single value of 0.99 the values from proteins and paragraphs agree quite well.

Similarities between the distribution of amino acid frequencies and letters can be seen further in Fig. 2. There the ordinate indicates the number of times that a particular normalized frequency occurs; the normalized frequency is the number of times, $n_i$, that the $i$th symbol (either amino acid or letter) occurs, divided by $N/m$, the expected number of times that each type of symbol should occur if all $m$ different kinds of symbols had equiprobable occurrence in the sample of $N$ symbols. As can be seen in Fig. 2, the distributions of the normalized frequencies $n_i/(N/m)$ for the letters (solid line) and the amino acids (shaded area) are almost identical except for the higher incidence of rarely-used letters in language. This small difference might not have occurred if some of the rarer amino acids, for which assays are difficult, had been included in the data.
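The single-symbol estimate of $I_s$ and the normalized frequency $n_i/(N/m)$ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the sample sentence is an arbitrary stand-in for one of the ten paragraphs, and sequential dependencies are ignored just as in the composition-based analysis.

```python
import math
from collections import Counter

AMINO_ACID_COUNT = 20                      # alphabet size m for proteins
LETTERS = "abcdefghijklmnopqrstuvwxyz"     # alphabet for the language samples

def information_per_symbol(symbols):
    """Bits per symbol computed from single-symbol frequencies only."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def relative_information(symbols, alphabet_size):
    """I_s divided by its maximum, log2(m), reached when all symbols are equiprobable."""
    return information_per_symbol(symbols) / math.log2(alphabet_size)

def normalized_frequencies(symbols, alphabet):
    """n_i / (N / m) for each symbol; 1.0 means 'exactly as often as expected'."""
    counts = Counter(symbols)
    expected = len(symbols) / len(alphabet)
    return {a: counts[a] / expected for a in alphabet}

# Hypothetical sample text, treated as one "paragraph" of letters.
sample = "In this case, letters have been treated like amino acids and paragraphs like proteins."
letters = [ch for ch in sample.lower() if ch in LETTERS]

print(round(math.log2(AMINO_ACID_COUNT), 2))   # maximum I_s per residue: 4.32
print(round(relative_information(letters, len(LETTERS)), 2))
```

The same two functions apply unchanged to an amino acid composition if a protein is supplied as a list of residue symbols and the alphabet size is set to 20.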
* This value disregards any influence of residue 'complexions'. However, it is difficult to see how factors other than the identity of the residues can be very important, when one considers the freedom of rotation of the R-groups with respect to the polypeptide chain.

† It was felt that such a small-sample statistics study was preferable to one based upon large samples (such as a determination of confidence intervals for $I_s$ as a function of the paragraph size), since by essentially duplicating the analyses applied to proteins, insight as to the limitations of that procedure could be obtained.

Constraints: The fact that the distribution of amino acids in non-structural proteins deviates from equiprobability about the same as (or possibly a little less than) the letters in written English indicates that the constraints producing such unequal frequencies should be of the same order of magnitude as (or slightly less than) those governing English texts. However, this tells nothing about the nature of the constraints or the manner in which they arise. The obvious question arises: is the unequal distribution due to unequal availability of the amino acids or is it due to constraints imposed in the processes of synthesis, i.e. by 'intersymbol influence'?* Is the make-up of the pool of amino acids available to the protein-synthesizing centers indicative of the nature of the processes involved in amino acid synthesis or have these processes become adapted to the peculiar demands of the proteins being synthesized? This is essentially the same as looking at a collection of printer's type and asking the question, did the printer select his supply of type because this particular distribution of letters was all that was available to him or did he purposely purchase his particular assortment because he had found that it satisfied his needs?
The possibility that the unequal availability of amino acids in the cellular pool may produce the unequal distribution does not seem likely. The experiments of Roberts, Cowie et al. (2, 3) at the Carnegie Institution indicate that it requires a five to thirty-fold excess of exogenous amino acids, such as valine, leucine and isoleucine, before the incorporation of these amino acids into protein is seriously affected in E. coli. In fact, once a substance has been incorporated into the amino acid pool of yeast, 1000 times the normal concentration of exogenous amino acid does not affect its incorporation into protein (Cowie). Although these are excellent experiments they do suffer from problems of cell membrane permeability, intracellular diffusion, etc.; however, they, along with numerous experiments involving amino acid deficient mutants, suggest that as long as the minimum required amount of each amino acid is present the frequency distribution of the amino acids in the pool has a relatively small influence on the distribution of amino acids incorporated into protein.

Two methods have been utilized in searching for intersymbol influence in proteins. In the first (reported previously (4)), the behavior of the normalized amino acid frequencies $n_i/(N/m)$ was studied in individual proteins. The average normalized frequency of the individual amino acids for the twenty-six proteins was tabulated. Comparing the normalized frequency for the individual amino acids in particular proteins with the corresponding average value from the 26 proteins indicated large deviations in many cases. The gross deviations were examined for correlations between pairs of amino acids, both for positive and negative effects. Examination of the 26 proteins indicated that although there are some correlations between the frequencies of individual amino acids combined in single proteins, none was strong enough to be measurable with any degree of confidence for a sample as small as 26 proteins.
* 'Intersymbol influence' is a term commonly used to designate sequential dependencies, i.e. influences upon the identity of a particular element by neighbouring elements, which are not the only types of constraints which might be imposed by a synthesizing center. It is easy to imagine the possibility of unequal 'acceptability' for different symbols at individual sites on a template in which the factors affecting the specifications of each location are independent of the neighbors.

The normalized letter frequencies in paragraphs were similarly examined for significant deviations of pairs or groups of letters. Although strong intersymbol influences are known to exist between letters (e.g. between q and u) no significant results were detected. Thus it can be concluded that such analyses do not exclude intersymbol influences of the same type or order of magnitude as those in language.*

Gamow, Rich, and Ycas (5) have made a more exacting study of possible intersymbol influences affecting amino acids. They treated the known amino acid sequences as a series of dipeptides which they tallied into a 20 × 20 matrix similar to the 26 × 26 digram matrices common in language analyses. The distribution for nonstructural proteins in such a 20 × 20 matrix followed quite closely a Poisson distribution. This they state is compatible with the assumption that the occurrence of a given amino acid does not affect the identity of its nearest neighbor. Their comparable analysis for English language gave a distribution which deviated from a Poisson. The Poisson distribution associated with the amino acid dipeptide analysis is not too significant since the sample of experimentally determined sequences is not necessarily a reliable representation of the bulk of amino acid sequences in nature.
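The kind of digram tally described above can be sketched as follows. This is a minimal illustration, not the procedure of Gamow, Rich, and Ycas: the helper names and the toy word list are hypothetical, and the comparison simply sets the number of matrix cells holding 0, 1, 2, ... digrams beside the numbers expected from a Poisson distribution with the same mean cell density.

```python
import math
from collections import Counter

def digram_cell_counts(sequences, alphabet):
    """Tally adjacent symbol pairs; keys are (first, second) cells of an m x m matrix."""
    allowed = set(alphabet)
    cells = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a in allowed and b in allowed:
                cells[(a, b)] += 1
    return cells

def occupancy_histogram(cells, alphabet):
    """Number of matrix cells that contain exactly k digrams, for each observed k."""
    m = len(alphabet)
    hist = Counter(cells.values())
    hist[0] = m * m - len(cells)          # cells never seen hold zero digrams
    return dict(sorted(hist.items()))

def poisson_expected(mean, n_cells, k_max):
    """Expected number of cells holding k digrams (k = 0..k_max) under a Poisson law."""
    return [n_cells * math.exp(-mean) * mean ** k / math.factorial(k)
            for k in range(k_max + 1)]

# Hypothetical toy sample of English words (an assumption for illustration only).
words = ["queen", "quiet", "equal", "theme", "there", "ether"]
alphabet = "abcdefghijklmnopqrstuvwxyz"

cells = digram_cell_counts(words, alphabet)
n_cells = len(alphabet) ** 2
mean_density = sum(cells.values()) / n_cells
observed = occupancy_histogram(cells, alphabet)
expected = poisson_expected(mean_density, n_cells, max(observed))

for k, expected_k in enumerate(expected):
    print(k, observed.get(k, 0), round(expected_k, 1))
```

A formal goodness-of-fit test is deliberately left out; with a sample this small the table only shows what is being compared, namely observed cell occupancies against the Poisson expectation.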
As Gamow, Rich, and Ycas point out, their available sample is strongly affected by the composition of ACTH, lysozyme and insulin, for which the complete sequences have been determined, and the shorter sequences from other proteins are biased due to differential bond labilities within the protein, which give rise preferentially to certain amino acids occurring as terminal peptides in the sequences isolated.

It was felt that a possible explanation of the difference noted between digram analysis of letters and amino acids was that amino acids were also grouped into word-like structures but that the average number of symbols per 'word' was different from that found in English. Therefore, separate digram analyses were performed on English words having two to five letters, six to nine letters and those having ten or more. All the samples were selected so that the average cell density in the 26 × 26 matrix was 0.44, the same as that of Gamow, Rich, and Ycas, and these also all showed significant deviations from a Poisson distribution.

Morowitz (6) and some of the Biophysics group at Yale have been investigating the possibility that a polypeptide chain is a segment selected from either a single or a small number of repeating sequences which are invariant for a given chromosomal complement. The particular segments chosen and the unique fashion in which they are combined and folded would then account for the highly specific properties of the individual proteins. The possibility also exists that there was an initial long, or at least restricted, set of sequences from which present day polypeptide sequences have evolved in a manner similar to that by which organisms have evolved. Gamow, Rich, and Ycas (5) have pointed out the most striking evidence for a "phylogenetically common ancestral sequence" in their comparison of the A and B chains of insulin, where the same amino acids occur in equivalent positions in both chains four times.
* See the discussion by Dr Platt at the end of this paper.

The known sequences containing five amino acids or more (from Table I, ref. 5) were examined for repeating or matching sequences. (This was done by superposing the sequences in all possible permutations.) These data indicate that for proteins from a given species any single repeating sequence must be at least forty amino acid residues long. Comparing the sequences of different types of proteins indicated that (a) there is not a master sequence operating among species, or (b) evolution, i.e. amino acid substitution, has been so extensive as to make it undetectable, or (c) the master sequence is 200 residues or longer. The additional sequences (for hormones of sub-protein size) cited by Ycas (7) show that short polypeptide sequences with only minor amino acid differences do occur in cells of different species. Thus, the occurrence of repeating or a restricted number of amino acid sequences may be an explanation of the unequal amino acid frequencies observed.

This possible restriction provides a basis for estimating the minimum value of $I_s$. A single, long, completely-determined sequence would provide a situation of minimum information content for polypeptides selected from it. To select $N$ residues from a sequence of $S$ amino acids would require $< \log_2 S$ bits to find $N$ and $< \log_2 (S - N)$ bits to determine the starting point; or by another selection procedure, $< \log_2 (S - 1)$ bits to find the starting point and roughly $\log_2 (S/2)$ bits to determine the end point. Either of these methods of selection gives an estimate of the minimum of $I_s$ which is of the order of $2 \log_2 S$ bits. This is a very low minimum since according to the best present estimate (which is obviously too low) $S \approx 200$ and thus $2 \log_2 S \approx 15$. Therefore, the minimum of $I_s$ is of the order of 0.1 bit/residue since $N > 100$ for proteins. Even if $S$ is found to be $10^6$, $(2 \log_2 S)/N$ will still only be ~ 0.4 bits/residue.
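The arithmetic behind this minimum can be checked with a short Python sketch. It assumes only the order-of-magnitude rule stated above, a total of about $2 \log_2 S$ bits to pick a segment out of a master sequence of $S$ residues; the function name is ours, $N = 100$ is taken as the smallest protein size mentioned, and the larger master length of $10^6$ is an assumed value chosen because it reproduces the quoted figure of about 0.4 bits/residue.

```python
import math

def minimum_sequence_information(master_length, residues):
    """Return (total bits, bits per residue) for a segment cut from a master sequence.

    Uses the rough rule of 2 * log2(S) bits: about log2(S) for the starting
    point and about log2(S) for the length or end point.
    """
    total_bits = 2 * math.log2(master_length)
    return total_bits, total_bits / residues

# S = 200 is the estimate quoted in the text; S = 10**6 is an assumed larger value.
for master_length in (200, 10**6):
    total, per_residue = minimum_sequence_information(master_length, residues=100)
    print(master_length, round(total, 1), round(per_residue, 2))
```

For $S = 200$ this gives roughly 15 bits in total, or about 0.15 bit per residue, which matches the "order of 0.1 bit/residue" statement.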
Thus, the search (6) for long master sequences of amino acids is of considerable interest with respect to information content considerations.

Summarizing for $I_s$, we can say that for nonstructural proteins the potential information due to the amino acid sequence should be of the order of 0.85-0.95 of the possible maximum value. Although the constraints necessary to produce such an effect should be of the same order of magnitude as those in printed English, tests comparing language and the available proteins for which amino acid composition or sequences are known indicate that the constraints operating in the elaboration of proteins are probably different from those associated with language. Further, it seems unlikely that the unequal frequency of amino acids in proteins is due to unequal availability of the amino acids in the cellular pool. The possibility that polypeptide chains are segments selected from a single or restricted number of repeating sequences may be an explanation of the unequal frequencies, in which case $I_s$/residue would be close to zero.

Configuration: With the present state of knowledge the factors affecting $I_c$ are much more difficult to assess. The number of states available to a polypeptide chain whose bonds retained all of the lability they had as uncombined amino acids would be essentially innumerable. In fact, about the only configurations ruled out would be those resulting in closure of the chain upon itself. However, the D- and L- forms do not both exist in nature and, as has been pointed out by Pauling, Corey and Branson (8), the α-C, N and O group in the backbone of the polypeptide chain is essentially the planar, resonance structure -C(=O)-N-. Other than these primary restrictions the polypeptide chain, in the absence of intramolecular or secondary bonding structures, is essentially a random structure.
Kauzmann (9) has given an excellent discussion of the known types of intramolecular bonds which are responsible for protein folding and which should therefore affect $I_c$. The most common type is the H-bond, especially those formed between the carboxyl O and the amide H. These are essentially non-specific bonds which can form between any pair of amino acid residues in which the C=O and N-H bonds are oriented at the proper angle. A stronger, more specific, but less common H-bond can form between the phenolic OH groups of tyrosine and the carboxyl group of glutamic or aspartic acid (9, 10). Another common type of bond stems from the van der Waals forces, which can exist between the atoms in different portions of the same or neighboring chains. The third type discussed by Kauzmann is the so-called hydrophobic bond, which is distinct from the more commonly discussed van der Waals bonds. This results from the tendency of the more hydrophobic amino acid residues to avoid the aqueous phase and adhere together to form a sort of intramolecular micelle. These bonds, although they possess a low order of specificity, may contribute a good deal of stability since they arise as a result of the fact that the more hydrophobic amino acids cannot participate in the strong H-bonding with the solvent water molecules. Salt bridges, which are the ionic bonds formed between the negatively charged (glutamic and aspartic) and positively charged (lysine and arginine) residues, are another type. However, Jacobsen and Linderstrøm-Lang (11) have presented evidence which indicates that these bonds are of negligible importance as intramolecular protein bonds. One of the most important types of intramolecular bond (at least according to current theories (12)) is the highly specific S-S bond formed between cysteine residues in different portions of the same or neighboring chains.
The formation of disulfide bonds as well as the 'strong' H-bonds greatly reduces the number of physical states available to the molecule since they can only be formed at a very few sites in the molecule. Since these two types of bond are the most specific of the intramolecular bonds, they are undoubtedly the most effective in determining variations in structure between different kinds of proteins.

Repetitious Structures: Intramolecular bonds formed in such a fashion as to produce repetitious structures reduce $I_c$ tremendously. In the helical or pleated sheet structures proposed by Pauling, Corey and Branson (8) (and illustrated in (13)) the number of free parameters necessary to describe the configuration completely is extremely low and therefore the information content, $I_c$, is also very low. In the helices it is only necessary to specify the length (that is, the total number of residues $R$), the pitch (3.7 or 5.1 residues per turn) and the exact orientation of the helix with respect to a reference point in the protein. An estimate of the lower bound of $I_c$ can be obtained from these factors as follows:

1) To find the exact number of residues, $R$, in a helix requires about $2 \log_2 R$ bits.*

2) The pitch requires 1 bit (3.7 or 5.1 residues/turn of the helix).

* It is rather interesting that the determination of the value of any integer, either + or − (other than zero), requires exactly $2n$ bits, where $2^{n-1} < |R| \le 2^n$ (which is close to $2 \log_2 R$): $n$ bits are necessary to find that $|R|$ is in the range indicated, $n - 1$ bits to find $|R|$ and 1 bit to determine $R$, i.e. the sign. For example, let $R = -48$: six questions which can be answered by yes or no will show that $|R|$ is 33-64; five more questions will determine that of the 32 possible values $|R| = 48$ and one yes or no question determines $R = -48$. Thus, $2^5 < |R| \le 2^6$ and $2n = 12$ bits.

3) A reasonable value for the number associated with specifying the interhelical bonds would seem to be $R/2$ bits.
This arises by assuming $R/4$ interhelical bonds, i.e. one bond per turn of the helix, and the previous discussion of intramolecular bonding indicates that the identity of each interhelical bond requires about 2 bits of information. Another reasonable value for this factor is $R/4$; this would occur for 1 one-bit interhelical bond or 1 two-bit bond every other turn, which attempts to take into account that disulfide and "strong" H-bonds are probably the most important interhelical bonds. Actually this factor could be zero since it may not be possible to specify interhelical bonds independent of the sequence.

4) The information necessary to specify the orientation of each helix with respect to some reference point in the protein is the most difficult factor to estimate. It may be almost zero, since the interhelical bonds may unequivocally determine the orientation of the helix. On the other hand, it should not be larger than ($\log_2 R$ + 30) bits, where $\log_2 R$ bits is sufficient to determine a specific residue and 30 bits to specify its orientation. The 30 bits would be assigned to the six parameters associated with the two vectors necessary to specify orientation. An average 'grain' of 1:32 is undoubtedly too coarse for specifying the orientation of a single isolated helix, but is probably adequate for specifying a helix which is oriented in relation to others in the same molecule.

The $R/2$ and 30 and the zero terms have been combined to give 'high' and 'low' values for the estimation of the minimum of $I_c$. These are calculated as $I_c$/residue (in bits) by

$$I_c/\text{residue} = \frac{3 \log_2 R}{R} \qquad \text{(LOW)} \tag{1}$$

and

$$I_c/\text{residue} = 0.50 + \frac{30 + 3 \log_2 R}{R} \qquad \text{(HIGH)} \tag{2}$$

The results as a function of $R$ are shown in Fig. 3. Pauling, Corey and Branson (8) cite examples of helical polypeptides for which $R$ is 11, 18 and 36. The corresponding region of Fig. 3 has been shaded. From these considerations it would appear that the minimum value of $I_c$ should be about 1 to 4 bits/residue depending upon $R$.
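The 'low' and 'high' estimates can be evaluated numerically for the helix sizes cited. The sketch below assumes the forms $3 \log_2 R / R$ and $0.50 + (30 + 3 \log_2 R)/R$, which is our reading of Eqs. (1) and (2) from the factors listed above and from the labels of Fig. 3; the function name is hypothetical.

```python
import math

def helix_information_per_residue(r):
    """Return (low, high) estimates of the minimum I_c per residue, in bits.

    low  : 3 * log2(R) / R            (interhelical-bond and orientation terms set to zero)
    high : 0.50 + (30 + 3 * log2(R)) / R   (R/2 bits for bonds, 30 bits for orientation)
    """
    shared = 3 * math.log2(r)      # 2*log2(R) for the length + log2(R) for the reference residue
    low = shared / r
    high = 0.50 + (30 + shared) / r
    return low, high

# Helix sizes cited from Pauling, Corey and Branson (8).
for r in (11, 18, 36):
    low, high = helix_information_per_residue(r)
    print(r, round(low, 2), round(high, 2))
```

Under these assumed forms the estimates for $R$ = 11 come out near 0.9 and 4.2 bits/residue, and for $R$ = 36 near 0.4 and 1.8 bits/residue, in line with the stated range of about 1 to 4 bits/residue.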
Although many proteins appear to be helical in nature, there are others, such as ribonuclease (RNase), which from the available evidence would seem not to be. In RNase the structural specificity appears to be determined predominantly by the S-S bonds, with the other intramolecular bonds adding stability to the structure. A further discussion of the relative importance of the specific and non-specific intramolecular bonds in maintaining structure will be presented later.

It is obvious that an upper limit cannot be assigned to $I_c$ as readily as to $I_s$. However, since the structures proposed by Pauling, Corey and Branson probably represent polypeptide configurations for which $I_c$ is near minimum, it would appear that one bit/residue is a reasonable lower limit for $I_c$.

From the estimates of $I_s$ and $I_c$ presented here, it appears that for the proteins of general interest $I_t$ should have a value in excess of 4.5 bits/residue, although if it is found that polypeptides are chosen from a single, long master sequence the value could be as low as 1.0 bits/residue. Estimates of 4.5 bits per residue or greater at the structural level give a total information content, $I_t$, for the non-structural proteins in excess of 500 bits (or in excess of 100 bits if the minimum estimate turns out to be the true one). Such an estimate is in sharp contrast to the estimates of 10 bits or less obtained by Quastler and his co-workers (14) as the amount of information which must be transmitted for the proper functioning of most protein-controlled systems (e.g. enzymes, immune bodies).

Fig. 3. Limits for estimates of the minimum of $I_c$ as a function of the number of residues per helix. The shaded area indicates helical polypeptide sizes reported by Pauling, Corey and Branson (8). [Plot; curves labeled $I_c$/residue = 0.50 + (30 + 3 log₂ R)/R and $I_c$/residue = (3 log₂ R)/R; axis: $I_c$/residue (in bits).]

III.
ESTIMATION OF STRUCTURAL INFORMATION CONTENT NECESSARY FOR FUNCTION

A disparity of one order of magnitude or more in passing from one context or level of organization to another is of considerable interest. The ten-fold difference indicates that only a small part of the information potential is actually utilized in information transmission.

Does this indicate that information transmission in such systems is very noisy and therefore organisms obtain good transmission by utilizing a very high degree of redundancy? Dancoff (15) proposed a principle of maximum error in which he postulated that an organism (or for instance a protein-controlled system) will commit as many errors as are consistent with normal function, but that the inherent error rate, which is probably quite high for such reactions, is maintained at a tolerable level by the use of redundancy.

Resorting again to the language analogy: a protein corresponds to a paragraph in complexity and its function may correspond to the thought which is conveyed by a paragraph. Does the difference in information content between the two contexts mean that in the process of evolution the organisms found that particular polypeptide configurations contained structures which could perform useful functions, but that these polypeptide permutations contained a large amount of excess and useless information which has been perpetuated along with the small amount of information associated with the necessary structure? Does it indicate that much of the protein structure is involved in secondary features of information transmission (e.g. the acquisition, concentration, and transport of energy) and only a small part of the total information content of the protein is intimately engaged in the process of information transmission?
Or does it indicate that each enzyme or protein is capable of mediating many reactions and our experimental ingenuity has not been able to determine more than just a few of them? (This is analogous to attempting to measure the information transmitted by a source which is transmitting through many channels, by monitoring only a single channel.)

The discussion which follows will attempt to throw some light on these questions. However, two important considerations must always be borne in mind when one is dealing with proteins. They are first and foremost colloidal in nature and therefore much of their activity falls in the realm of surface reactions. In the globular proteins it is quite likely that much of the total structural information content is in the interior of the molecules and therefore is unavailable to participate in information transfer occurring at their surface and can only participate in secondary operations similar to those mentioned above.

The second consideration involves the question, just what is required for the transmission of one bit of information by a protein system? It seems very likely that one bit of potential structural information will not always transmit the same amount of information; rather, the efficiency of transmission will depend upon the context within which the performance is measured. For example, it is probably much simpler to attach either a hydroxyl or methyl group to a benzene molecule (which would involve one bit of determination) than it is to construct either a 3.7 or 5.1 helix (which also involves one bit of determination). This is somewhat analogous to the relative difficulties of determining whether a symbol is 0 or 1, or of determining whether one should get married or not!

$I_s$ necessary: It appears in some cases that a fairly large fraction of the potential surface information due to the amino acids present is superfluous.
For instance, it has been found in insulin that a large fraction of the residues cannot be critical for function. Iodination, sulfonation and chelation, each of which can mask surface R-groups, have been found not to affect insulin activity. Those residues which are species-specific can also be ruled out as being critical for function. Unfortunately, it is difficult to determine the exact degree to which a particular type of residue is masked by a given treatment, so that it is impossible to state exactly the fraction of surface residues which are not critical. In a similar manner, it is possible to mask the lysine and arginine residues on the surface of trypsin without destroying its activity (16). In fact, acetyltrypsin is available commercially (17) and has the ideal feature that with its lysine and arginine R-groups masked, its ability to act as a substrate for other molecules of trypsin is decreased. Haurowitz (18) has also pointed out that some of the antigenic properties of proteins are in many cases not affected by iodination or sulfonation of receptive surface groups.

The work of Raacke (19) has shown that a certain amount of surface heterogeneity (as demonstrated by electrophoretic behavior) is still compatible with a fully active protein. Her results plus the uncertainty found in the analyses of amino acid compositions indicate that an uncertainty of the order of 3 to 10 per cent can occur in the amino acid complement without loss in characteristic function. The results of Roberts and Cowie (mentioned previously) involving competition in the amino acid pool also indicate that about 3 to 20 per cent variability in amino acid incorporation can occur. However, it should be borne in mind that each position in the polypeptide sequence may not have a 3 to 10 per cent tolerance associated with it; rather, those residues which participate in active sites likely have a zero tolerance.
/j necessary: Kalnitsky and Rogers (20) have reported that approximately 15 per cent of the ribonuclease molecule can be digested off with carboxypeptidase before activity is lost. Work reported by Anfinsen (10, 21) indicates that this estimate may be a little high. Rather, he reports that the carboxy-terminal three amino acids (valine, serine, alanine) can be removed with no loss in activity, but that digestion with pepsin, which splits off these three plus their neighbor, aspartic acid, and also ruptures a "strong" hydrogen bond in the vicinity, produces loss in activity. Partial digestion by subtilisin (10, 22), which apparently digests central portions of the polypeptide chain, leaves the activity of the RNase intact as long as the digested portion is not oxidized. It is also known that fragments obtained either by hydrolysis or by partial enzymatic degradation from myosin (23-25), trypsin (26), chymotrypsin (27, 28), lysozyme (29), papain (30) and pepsin (31, 32) retain their activity in certain situations.

The results with pepsin and papain are particularly striking. Hill and Smith report no loss in the molar activity of papain (toward a synthetic substrate) after an average of 120 of its 180 residues had been removed by leucine aminopeptidase (an N-terminal type enzyme). Perlmann has reported that some of the dialyzable fragments (which represent 20 per cent of the total original protein) resulting from pepsin auto-digestion retained 1 to 5 per cent of the original activity toward hemoglobin, but about 75 per cent of the activity of the intact pepsin when tested against the synthetic substrate acetyl L-phenylalanyl diiodotyrosine. These latter results indicate strongly that pepsin, at least, has more than one active site, and that the site specific for peptide linkages adjacent to an aromatic amino acid depends upon the integrity of only a small portion of the molecule.
I_c necessary: Of parallel interest to the above considerations is the question of how much configurational information, I_c, is necessary for function. The work of Anfinsen and others (10, 33) indicates that the configuration of RNase can be considerably disrupted without loss in activity. They found that reversible denaturation in 8 M urea did not cause permanent loss in activity; in fact the RNase was still active in 8 M urea, in which its specific viscosity was 8.9 as compared with 3.3 in aqueous solution. This large increase in specific viscosity indicates that the so-called native configuration can be opened considerably without destruction of activity. However, Anfinsen reports that oxidation with performic acid, which disrupts the disulfide bonds, causes irreversible inactivation and an increase in specific viscosity to 11.6. The phenomenon of complete loss in activity upon the appearance of the full sulfhydryl titer has been observed in most proteins.

It has also been known for a number of years that different degrees of loss in characteristic activity can occur. A number of workers (34, 35) have studied reversible inactivation of enzymes, in which it has been observed that a partial unfolding of the molecule can occur, with a rise in specific viscosity, change in the optical rotation of the protein solutions, changes in solubility, etc., which upon the proper treatment can be reversed. The thermodynamics for reversible denaturation shown in Fig. 4 indicate that quite likely the first step is common from protein to protein, since ΔF* is remarkably constant for all proteins. Reversible denaturation invariably shows an increase in entropy. However, ΔS* is not constant from protein to protein but varies by a large amount, as shown by the unhatched areas to the right in Fig. 4.
The author has proposed (12, 36) and discussed elsewhere in this volume (37) a hypothesis involving three steps, which attempts to explain this phenomenon by ascribing the constant ΔF* to the initial opening of a disulfide bond. This first step is followed by the rupture of a number of neighboring intramolecular bonds (step 2), with a resulting opening of the molecule indicated by the increase in entropy. According to the proposal, this opening of the molecule is sufficient to disrupt the spatial arrangement of critical amino acids, causing loss in activity, but enough stability and configuration is retained so that under the proper conditions the original native structure, or at least a structure compatible with activity, can restitute. In this hypothesis the rupture of a second disulfide bond (step 3) allows irreversible inactivation to proceed with essentially complete destruction of the characteristic protein structure.

A conversion (using an equivalence derived in reference (38)) has been made in Fig. 4 from ΔS* to ΔI_c. By assuming an average amino acid residue weight of 120, ΔI_c/residue is given in Table I for those proteins in Fig. 4 for which the molecular weights are available.

Table I

Protein            M.W. x 10^-3    M.W. ÷ 120    ΔI_c (bits)    ΔI_c/N (bits/residue)
Pepsin                 36             300            78              0.26
Trypsin                20             167            30              0.18
Emulsin                38             317            48              0.15
Amylase                59.5           496            36              0.07
Hemoglobin             67             558           110              0.20
Egg albumin            40             333           226              0.68
Lactoperoxidase        93             775           340              0.44
Insulin                12             103            18              0.18

Thus ΔI_c for the loss in specific activity is of the order of 0.25 bits/residue (the 0.68 value for egg albumin does not correspond to a loss in specific activity). This indicates that destruction of the right 5 to 25 per cent of I_c (assuming I_c is close to our minimum estimate of 1 to 4 bits/residue) causes loss of function, which may be reversible or irreversible depending upon which intramolecular bonds are disrupted.
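The per-residue column of Table I can be recomputed from the other columns. The following is a minimal sketch, assuming only the residue weight of 120 stated in the text; the names `TABLE_I` and `bits_per_residue` are illustrative and do not come from the paper. The residue count is recomputed from the molecular weight rather than copied, since the printed count for insulin (103) differs slightly from 12,000/120 = 100.

```python
RESIDUE_WEIGHT = 120  # average amino acid residue weight assumed in the text

# (protein, molecular weight, delta I_c in bits), values as listed in Table I
TABLE_I = [
    ("Pepsin", 36_000, 78),
    ("Trypsin", 20_000, 30),
    ("Emulsin", 38_000, 48),
    ("Amylase", 59_500, 36),
    ("Hemoglobin", 67_000, 110),
    ("Egg albumin", 40_000, 226),
    ("Lactoperoxidase", 93_000, 340),
    ("Insulin", 12_000, 18),
]


def bits_per_residue(mol_weight, delta_i_bits, residue_weight=RESIDUE_WEIGHT):
    """Return delta I_c per residue, taking N = mol_weight / residue_weight."""
    n_residues = mol_weight / residue_weight
    return delta_i_bits / n_residues


for name, mw, delta_i in TABLE_I:
    n = mw / RESIDUE_WEIGHT
    print(f"{name:16s} N = {n:5.0f}   {bits_per_residue(mw, delta_i):.2f} bits/residue")
```

Rounded to two decimals, the recomputed values agree with the last column of Table I, which suggests that the column was obtained by exactly this division.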
[Fig. 4 plot. Proteins shown, at temperatures from 298° K to 353° K: pepsin, proteinase, trypsin (kinase), trypsin, invertase, vibriolysin, tetanolysin, hemolysin (goat), rennin, leucosin, invertase (yeast), emulsin (wet), amylase (malt), solanin, hemoglobin, egg albumin, peroxidase (milk), insulin.]

Fig. 4. The equivalence between ΔS* and ΔI_c for thermal inactivation. The shaded areas to the left represent ΔF* and the clear areas to the right ΔS*. (Adapted from Fig. 1, ref. 12, by courtesy of University of Illinois Press.)

Summary

The above discussions indicate that redundancy considerations are not the explanation of the large excess of structural information content; rather, that only a small fraction of the potential information on the surface of the molecule is actively utilized in information transfer. Haurowitz (18), for instance, has pointed out that experiments with substituted antigens indicate that the antigenic specificity resides in an area on the surface of the protein which is approximately 10 to 15 Å in diameter. Results cited here suggest that the four or so amino acid residues which would occupy such a surface area (13) may occur as neighbors on the same chain (30-32). Other results mentioned previously (20, 21, 33) suggest that the critical amino acids do not occur in sequence in a single polypeptide chain. This follows from the consideration that digestion should be able to consume an average of about 50 per cent of the protein molecule before an active site composed of four or five adjacent amino acids would be encountered; whereas one of four or five amino acids making up an active site should be encountered, on the average, after about a 20 to 25 per cent digestion of the molecule if the amino acids are distributed roughly at random.
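The digestion estimate above can be checked with a short calculation. The sketch below is an idealization that is not part of the original argument: it assumes digestion proceeds residue by residue from one end of a chain of n residues, that a clustered site is a block of k adjacent residues placed uniformly at random, and that scattered critical residues are k distinct positions chosen uniformly at random. The function names and the chain length of 300 are illustrative.

```python
from fractions import Fraction
from math import comb


def mean_fraction_clustered(n: int, k: int) -> Fraction:
    """Mean fraction digested before reaching a block of k adjacent residues."""
    starts = range(1, n - k + 2)          # possible first positions of the block
    removed = sum(s - 1 for s in starts)  # residues digested before the block
    return Fraction(removed, len(starts) * n)


def mean_fraction_scattered(n: int, k: int) -> Fraction:
    """Mean fraction digested before reaching the first of k random residues."""
    total = comb(n, k)
    # comb(n - m, k - 1) / total is the chance that the first critical residue is at m
    mean_removed = sum(
        Fraction((m - 1) * comb(n - m, k - 1), total)
        for m in range(1, n - k + 2)
    )
    return mean_removed / n


for k in (4, 5):
    clustered = float(mean_fraction_clustered(300, k))
    scattered = float(mean_fraction_scattered(300, k))
    print(f"k = {k}: clustered {clustered:.3f}, scattered {scattered:.3f}")
```

Under these assumptions a clustered site is reached after about 49 per cent digestion, and the first of four or five scattered residues after roughly 16 to 20 per cent, which is of the same order as the 50 per cent and 20 to 25 per cent figures quoted.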
In addition, Kennedy and Koshland (39) have found that phosphoglucomutase, when placed in 6 M urea, loses its activity but recovers it upon dilution, which also indicates separated locations for the critical amino acids. Therefore it may not be possible to state a general rule concerning the relationship between the loci of critical amino acids within polypeptide chains.

It seems that the role of intramolecular bonds is to insure that the amino acids which are critical for function are maintained in the proper spatial relationship to each other so that function can occur. Here again it is impossible to state a general rule as to how many of these intramolecular bonds can be disrupted before loss of function occurs, since apparently all of the hydrogen bonds can be broken in RNase without loss in function but not so in phosphoglucomutase. However, the integrity of the more specific secondary bonds (such as S-S) seems to be much more critical for the maintenance of function. The digestion experiments with pepsin and papain indicate further that it is important where in the molecule the bonds are destroyed.

Other than ruling out redundancy as a possible reason for the discrepancy between the large potential information and the measured performance, it is difficult to choose among the other possibilities mentioned. The results with pepsin and papain, which have been mentioned, suggest strongly that much of the information content may be unnecessary for function, but has been perpetuated along with the critical content. However, the results with pepsin indicating that multiple sites do exist make it impossible to assign a certain fraction of the information content as 'garbage'. How much of the polypeptide chain is involved in secondary features of information transmission, and the structural complexity necessary for transmitting one bit of information, are factors which are now being actively investigated by a number of workers. The various estimates of I_total, I_sequence and I_configuration are tallied in Table II.

Table II

                                        I_total  =  I_sequence  +  I_configuration
Maximum                                    -           4.32             -
Plausible                                 >4.5         3.5             >1.0
Minimum                                    1.0         15/N             1.0
Necessary for performing a
single specific function                 10-90%        25%            35-90%

IV. CONJECTURES

Some of the results considered in preparing this paper lead to rather interesting speculation. The repetitious minimum entropy polypeptide structures proposed by Pauling, Corey and Branson (8) have already been mentioned. Such configurations may be generally applicable to macromolecules, since helical structures have also been proposed for desoxyribonucleic acid (DNA) polymers (40) and some viruses (41). Crane (42) states that helical configurations occur in linear (uni-dimensional) crystals, i.e. structures where progression from each sub-unit to its essentially identical neighbor is by a repeated process of translation and rotation. Lumry and Eyring (43) predict that once hydrogen-bonded secondary structures are formed, the characteristic protein 'conformation' is determined by tertiary folding such that the free energy is minimized. However, this does not explain why crystallization should initially occur and be maintained in solution; and to the author's knowledge no one has advanced arguments which provide a complete basis to account for the apparent prevalence of minimum entropy biostructures, although there have been discussions of how living organisms produce 'order from disorder' or 'order as a result of order' (44). Considering the innumerable configurations available to biological polymers, the question arises: 'Are there criteria which determine that the seemingly improbable, highly ordered structures occur spontaneously?' or 'Are these structures imposed at some specific stage in biosynthesis?'
Studies on the reversible denaturation of proteins (34, 35) suggest that the latter possibility is more probable: that is, mild mistreatment can be reversed, whereas, once a certain molecular disarray or instability occurs, an unfolded state results from which the characteristic, native structure does not reconstitute. Neurath et al. (35) make the interesting point that even if denaturation is complete enough so that physical properties such as solubility, crystallizing ability, or diffusion constants are seriously affected, some of the molecules may subsequently revert to a biologically active form, whereas others will tend to reverse the molecular disarray by forming a more condensed state but without successfully restoring the native biological properties. This suggests that, although polypeptide chains have an inherent tendency to form semi-condensed configurations, the highly ordered, biologically active structures are probably not only imposed during biosynthesis, but represent quasi-stable structures with built-in constraints which tend to cause small fluctuations to revert, i.e. a limited amount of disorder can be restrained without the inexorable Second Law prevailing. Neurath (35) has also reported that the amount of disarray compatible with reversibility depends upon the type of denaturation. Further, denaturation is not reversible under all conditions but may await a change in pH or temperature. However, it is interesting that although an entropy increase is invariably associated with denaturation, removal of the denaturing agents can cause a decrease, which appears to contradict the Second Law; we will later resolve this apparent contradiction.

The quasi-stability of native configurations is suggestive of the situation in diatomic molecules, where stability conditions are readily depicted as a local 'well' (relative to the surroundings) or null area in a two-dimensional energy-configuration plot. However, since two dimensions would allow only a very gross specification of the myriad degrees of freedom of macromolecules, some form of multi-dimensional space will be necessary to represent their stability conditions. The biologically significant portion of such a macromolecular space will also be a 'well', but in a multi-dimensional surface rather than a line plot, and will be centered near the locus of native structures in configuration space. A fraction of the well will represent conditions consistent with an active macromolecule and the remainder, conditions characteristic of reversible inactivation. Anything outside the well will correspond to states inconsistent with the restitution of a native configuration.

The multi-dimensional space can be of sufficient dimensionality so that all configurations differing by a 'single step' are neighbors. In such a 'fine-grain' specification each microstate and its probability density (as a function of energy, for example) can be represented. However, such a scheme has drawbacks: first, it has little novelty, since any situation can be completely described by a sufficient number of parameters; second, a model dealing only with microstates would be extremely difficult to test experimentally; and third, the excessive dimensionality makes it useless as an aid in envisioning possible mechanisms of macromolecular rearrangements.

Thus, a 'coarse-grain' specification, which requires reducing the dimensionality by transforming the microstates into a more useful set of macrostates, is desirable. This general operation can be schematized by the use of the following contingency table:

Table III

Macrostate    Microstate    <------------- Molecular Energy ------------->
digits        digits         E_1      E_2     ...    E_k     ...    E_n
M_1           m_11           α_111    α_112   ...    α_11k   ...    α_11n
M_1           m_12           α_121    α_122   ...    α_12k   ...    α_12n
...
M_1           m_1j           α_1j1    α_1j2   ...    α_1jk   ...    α_1jn
M_2           m_21           α_211    α_212   ...    α_21k   ...    α_21n
...
M_i           m_ij           α_ij1    α_ij2   ...    α_ijk   ...    α_ijn

A plausible specification for a multi-dimensional space is given in Table III, where a sufficient number of binary digits is used so that each microstate can be unequivocally identified, e.g.
the two atoms involved in each bond as well as the bond length and angle could be identified. Each α_ijk represents the probability density of a given microstate for molecular energy state E_k, where the ranges of i, j and k can be essentially infinite.

A transformation to a 'coarse-grain' scheme which seems worth consideration is as follows. Each macrostate, M_i (depicted by the leftmost column of digits in Table III), designates only which bonds exist in the macromolecule, e.g. sulfur atom no. 7 is hooked to carbon no. 179 and sulfur no. 11, C-563 to C-564 and N-201, etc. Mechanistically, all microstates, m_ij, contained in a given macrostate, M_i, are grouped together by ordering the digits (or analogously ordering the axes in space). To complete the transformation, other bond properties, e.g. length and orientation (the other column of digits in Table III), and their associated probabilities (the right hand portion of Table III) are lumped into two gross categories to provide an intuitively manageable representation. This 'lumped fine structure' for each macrostate, M_i, can be represented on an 'energy-deviation' (ED_i) plane at the locus (in transformed configuration space) corresponding to M_i: 'deviation' is a measure of instability, i.e. the extent to which individual microstates, m_ij, deviate from the configuration m_is corresponding to maximum stability for macrostate M_i. An example of a method for constructing such values is: (a) find the set of digits m_is in the middle column of Table III which represents maximum stability for macrostate M_i and (b) determine how many of the corresponding digits of m_is and m_ij differ. This number provides an excellent measure of 'deviation' because each microstate has a unique D-value and 'neighboring' microstates have adjacent D-values. Assigning probabilities to pairs of 'energy' and 'deviation' values completes the 'fine' to 'coarse-grain' transformation.
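Steps (a) and (b) and the lumping of probabilities on to an energy-deviation plane can be sketched as follows. This is an illustrative reading of the procedure and not the author's own implementation: microstates are represented as equal-length strings of binary digits, the deviation D is taken as the count of differing digit positions, and the toy probabilities, the function names and the dictionary layout are all hypothetical.

```python
from collections import defaultdict


def deviation(m_stable: str, m: str) -> int:
    """Step (b): number of digit positions in which m differs from m_stable."""
    if len(m_stable) != len(m):
        raise ValueError("digit strings must have equal length")
    return sum(a != b for a, b in zip(m_stable, m))


def coarse_grain(alpha: dict, m_stable: str) -> dict:
    """Sum microstate probabilities alpha[(m, k)] into cells keyed by (k, D)."""
    cells = defaultdict(float)
    for (m, k), p in alpha.items():
        cells[(k, deviation(m_stable, m))] += p
    return dict(cells)


# Hypothetical macrostate: three-digit microstates at two energy levels.
# Step (a) is assumed done: "000" is taken as the most stable digit set.
alpha = {
    ("000", 1): 0.40, ("001", 1): 0.10, ("010", 1): 0.10,
    ("011", 2): 0.15, ("101", 2): 0.15, ("111", 2): 0.10,
}
ed_plane = coarse_grain(alpha, m_stable="000")
for (k, d), p in sorted(ed_plane.items()):
    print(f"E_{k}, D = {d}: probability {p:.2f}")
```

Each cell of `ed_plane` is one point on the ED plane of this macrostate; contours of equal probability would be drawn through cells of equal value.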
The assignment requires summing the probabilities, α_ijk, of those microstates associated with a particular D-value. The probability densities for E and D values can be arranged into contours of equal probability to avoid further complications of adding a third coordinate to the ED plane. These contours will possibly be quite irregular in shape and may well be discontinuous, since the only obvious restriction on their form is that they be non-intersecting.

It should be noted that 'lumping' on to 'energy-entropy' planes would have provided a simpler transformation than that to the 'energy-deviation' planes. The microstates corresponding to a given 'deviation' can be equated to an entropy value by the usual -Σ p_i log p_i procedure, where the p_i's are the probabilities (properly normalized) associated with the microstates. Such a scheme was considered, but was found to be intuitively less useful than the ED transformation.

The 'energy-deviation' scheme is of considerable interest when one considers possible mechanisms of both protein inactivation and enzymatic activity. Suppose, for instance, that the energy of a molecule in a native configuration is slowly raised, e.g. by external heat: the point representing 'molecular state' will be driven to new loci in multi-dimensional space. Undoubtedly a trajectory is followed such that the locus resides, 'statistically', on the contour which has the maximum probability permissible or consistent with its energy content and macrostate at any instant. This means that the locus first progresses over the ED_i plane of the particular native configuration, M_i. Eventually a locus will be reached where the probability contour occupied is lower than the corresponding contour on an adjacent ED plane. The molecular state will then jump to that adjacent macrostate by some form of bond rearrangement.* Even without an immediate change in molecular energy due to external heat, the jump will likely be followed by an instantaneous migration of the molecular state locus on the new ED plane. This would be anticipated since the new locus might not be the position of maximum probability for that instantaneous molecular energy. A sufficient increase in temperature would eventually drive the trajectory out of the fraction of the null region corresponding to an active molecule: with sufficient mistreatment the locus would be driven completely out of the null region into the portion of configuration space representing irreversibly inactivated molecules.

Molecular energy will decrease when external heat is removed, and the molecular rearrangements will be reversed or not depending upon the symmetry of the multi-dimensional surface of the well. Where denaturation is reversed merely by reversing the denaturing conditions, apparently the inactivation trajectory is retraced or else the null region is a smooth 'well' with no intervening metastable positions in the reversal trajectory. Thus, for reactivation the two trajectories would not have to be identical but need only form a closed loop. Asymmetry in the probability contours of even one of the ED plots traversed could cause the inactivation and reversal trajectories to diverge sufficiently so that metastable, non-active configurations would result. Such situations have been observed experimentally; for instance, thermal denaturation at alkaline pH is not reversed upon cooling until the pH is adjusted to acidic conditions (35). Since a change in pH should alter the ED contours, it is easy to envision how it could make the reversal of denaturation more likely by changing the transition probabilities between macrostates and thus alter the reversal trajectory.
Such an alteration would resolve the apparent contradiction of the Second Law: a changed pH would act as a 'Maxwell Demon guiding the footsteps of the reversal trajectory'.

Considering its likely statistical nature, it is probable that much of the trajectory of the locus of molecular states proceeds along essentially negligible probability gradients, not only with respect to transitions from one macrostate to another but more particularly with respect to instantaneous displacement from the locus of arrival on a new ED plane. Such transitions should be readily reversible and in general of limited consequence except as they lead to regions of larger gradients. However, a 'low-gradient' region would allow considerable leeway in trajectories. This would permit multiple pathways, which would account for the spectrum of effects often observed following physical denaturation. In those transitions involving bonds which latch large segments of the molecule together (12) (e.g. interhelical bonds), gross molecular rearrangements could occur so that the trajectory would pass through regions of large probability gradients. Such transitions would not be instantaneously reversible and would therefore be relatively important in driving the trajectory away from the 'active' portion of, or even out of, the 'well'.

* Somewhat more rigorous discussions of factors affecting the trajectory of the locus of molecular state in similar multi-dimensional plots have been given by Teller (45) and Lumry and Eyring (46).

My proposed inactivation hypothesis discussed later (37) attempts to specify the identity and sequence of high-gradient transitions. On this basis energy from an absorbed quantum, ionization or thermal process would migrate through the molecule in a fashion represented mainly by a 'low gradient' trajectory.
However, once the energy or charge becomes localized in a bond of low ionization potential involved in latching large segments of the molecule together, a 'high-gradient' transition, not readily reversible, would occur. The inactivation efficiency of absorbed energy will thus be a function both of the locus of the molecular state at the time energy is absorbed as well as its resulting trajectory; where the trajectory depends upon the amount of energy introduced, the point of absorption and any external factors which affect the contours on the ED planes. For instance, the quantum efficiency of UV varies considerably with pH for a number of enzymes (47). The interdependence of energy, configuration and probability proposed here provides a formalism for depicting enzyme action. It is fairly typical of enzyme, as well as other types of catalysis, that reactions proceed which are normally not feasible because of steric or energetic hindrances. It is entirely possible that because of their large size, enzymes act as large energy reservoirs whose function is to "deliver" a quantity of energy to a particular site or com- plex in an irreversible fashion. Another possibility is that energy may not be delivered per se but as a change in configuration of the enzyme with a corresponding alteration in the spatial relationship between reactants complexed to the enzyme. Within these proposals the formation of the enzyme-substrate complex could have an important function. It could act as an external agent affecting the ED contours so as to cause a directed alteration in trajectory, leading finally to a completed enzyme catalysis. Effective, i.e. 
rapid and essentially irreversible, enzyme catalysis will likely depend upon (1) an E — S complex formation which involves a high-gradient transition, so as to enhance a drastic alteration in the trajectory of molecular state, and (2) the directed trajectory passing through a high-gradient region, preferably just before completion of catalysis, in order to make reversibility unlikely. REFERENCES 1. H. Branson: Information theory and the structure of proteins. In: Information Theory in Biology, ed. by H. Quastler, 84-104, University of Illinois Press, Urbana (1953). 2. R. Roberts: Carnegie Institution Yearbook:, No. 55, 110-148 (1956). 3. D. Cowie: Carnegie Institution Yearbook, No. 55, 110-148 (1956). 4. L. AuGENSTiNE, H. BRANSON, and E. Carver: A search for intersymbol influences in protein structure. In: Information Theory in Biology, ed. by H. Quastler, 105-118, University of Illinois Press, Urbana (1953). 5. G. Gamow, a. Rich, and M. Ycas: The problem of information transfer from the nucleic acids to proteins. In: Advances in Biological and Medical Physics 4, ed. by J. H. Lawrence and C. Tobias, 23-68 (1956). 6. H. MoROwiTZ, et al.: personal communication. 7. M. Ycas: (preceding paper in this volume). 8. L. Pauling, R. Corey, and H. Branson: The structure of proteins. Proc. Nat. Acad. Sci., Wash. 37, 205-211, 235-285 (1951). 9. W. Kauzmann: In: The Mechanism of Enzyme Action, ed. by W. McElroy and B. Glass, Johns Hopkins Press, Baltimore (1954). 122 L. G. AUGENSTINE 10. C. Anfinsen: Informal lecture at the Fourth Buena Vista Conference on Protein Synthesis (1956). C. Anfinsen: Advances in Protein Chemistry 11, 1-100 (1956). D. Steinberg, M. Vaughan, and C. Anfinsen: Kinetic aspects of assembly and degrada- tion of proteins. Science 124, 389-395 (1956). 11. C. F. Jacobsen and K. Linderstr0M-Lang : Salt hnkages in proteins. Nature, Loud. 164, 411^12(1949). 12. L. Augenstine: Structural interpretations of denaturation data. In: Information Theory in Biology, ed. 
by H. Quastler, 119-124, University of Illinois Press, Urbana (1953). "Trypsin monolayers at the air-water interface," PhD thesis at the University of Illinois (1956). 13. L. Augenstine: Remarks on Pauling's protein models. In: Information Theory in Biology, ed. by H. Quastler, 75-83, University of Illinois Press, Urbana (1953). 14. H. Quastler: The specificity of elementary biological functions. In: Information Theory in Biology, ed. by H. Quastler, 170-190, University of Illinois Press, Urbana (1953). 15. S. Dancoff and H. Quastler: The information content and error rate of living things. In: Information Theory in Biology, ed. by H. Quastler, 263-273, University of Illinois Press, Urbana (1953). 16. J. Sri Ram, L. Terminiello, M. Bier, and F. Nord: On the mechanism of enzyme action. LVIII. Acetyltrypsin, a stable trypsin derivative. Arch. Biochem. Biophys. 52, 464-477 (1954). 17. Worthington Biochemical Corp.: Descriptive manual 8, Freehold, N.J. 18. F. Haurowitz : Protein synthesis and immunochemistry. In : Information Theory in Biology, ed. by H. Quastler, 125-146, University of lUinois Press, Urbana (1953). 19. I. Raacke: Heterogeneity studies on several proteins by means of zone electrophoresis on starch. Arch. Biochem. Biophys. 62, 184-195 (1956). 20. G. Kalnitsky and W. Rogers: The activity of ribonuclease after digestion with carboxypeptidase. Biochim. Biophys. Acta 20, 378-386 (1956). 21. C. Anfinsen: The inactivation of ribonuclease by restricted pepsin digestion. Biochim. Biophys. Acta 17, 593-594 (1955). 22. S. Kalman, K. Linderstrom-Lang, M. Ottesen, and F. Richards: Degradation of ribonuclease by subtilisin. Biochim. Biophys. Acta 16, 297 (1955). 23. J. Gergeley: Studies on myosin-adenosinetriphosphatase. /. Biol. Chem. 200, 543-550 (1953). 24. A. Szent-Gyorgyi: Meromyosins, the subunits of myosin. Arch. Biochem. Biophys. 42, 305-320 (1953). 25. E. MiLHALYi: Trypsin digestion of muscle proteins. II. The kinetics of the digestion. J. Biol. Chem. 
201, 197-209 (1953). E. MiLHALYi and A. Szent-Gyorgyi: Trypsin digestion of muscle proteins. I. Ultra- centrifugal analysis of the process. /. Biol. Chem. 201, 189-196 (1953). 26. F. Nord, M. Bier, and L. Terminiello: On the mechanism of enzyme action. LXI. The self digestion of trypsin, Ca-trypsin and acetyltrypsin. Arch. Biochem. Biophys. 65, 120-131 (1956). 27. M. Kunitz: Formation of new crystalline enzymes from chymotrypsin. /. Gen. Physiol. 22, 207-237 (1938). 28. J. Gladner and H. Neurath: Carboxyl terminal groups of proteolytic enzymes. II. Chymotrypsins. J. Biol. Chem. 206, 911-929 (1954). 29. J. Harris, C. Li, P. Condliffe, and N. Pon: Action of carboxypeptidase on hypophyseal growth hormone. J. Biol. Chem. 209, 133-143 (1954). 30. R. Hill and E. Smith: Crystalline papain. Biochim. Biophys. Acta 19, 376-377 (1956). 31. G. Perlmann: Formation of enzymatically active dialysable fragments during auto- digestion of pepsin. Nature, Lond. 173, 406 (1954). Protein Structure and Information Content 123 32. G. Perlmann: Discussion of paper by D. Koshland in J. Cell. Camp. Physiology, 47, Supplement 1, 217-234 (1956). 33. C. Anfinsen, W. F. Harrington, Aa. Hvidt, K. Linderstrom-Lang, M. Otteson, J. ScHELLMAN : Studies on the structural basis of ribonuclease activity. Biochim. Biophys. Acta 17, 141-142 (1955). 34. A. Stearn: Kinetics of biological reactions with special reference to enzymic processes. Advances in Enzymology 9, 25-75 (1949). 35. H. Neurath, G. Cooper, and J. Erickson: The denaturation of proteins and its apparent reversal. /. Phys. Chem. 46, 203-211 (1942). H. Neurath, J. Greenstein, F. Putnam, and J. Erickson: The chemistry of protein denaturation. Cliem. Rev. 34, 157-265 (1944). 36. L. AuGENSTiNEand R. Ray: Trypsin monolayers at the air-water interface. III. Structural postulates on inactivation. J. Phys. Chem. 61, 1385-1388 (1957). 37. L. Augenstine: Discussion of a proposed mechanism of protein inactivation. (Part IV of this volume.) 38. L. 
Augenstine: Information and thermodynamic entropy. In: Information Theory in Biology, ed. by H. Quastler, 16-20, University of Illinois Press, Urbana (1953). 39. E. Kennedy and D. Koshland: Properties of the phosphorylated active site of phos- phoglucomutase. J. Biol. Chem. 228, 419-433 (1957). 40. J. D. Watson and F. H. C. Crick : Molecular structure of nucleic acids. Nature, Land. Ill, 737-738 (1953). 41. H. Fraenkel-Conrat: Rebuilding a virus. Sci. Amer. 194 (6), 42^7 (1956). 42. H. Crane: Principles and problems of biological growth. Sci. Monthly 70, 376-389 (1950). 43. R. LuMRY and H. Eyring: Conformation changes of proteins. /. Phys. Chem. 58, 110-120 (1954). 44. E. ScHRODiNGER : What is Life? Cambridge University Press, Cambridge, England (1946). 45. E. Teller: The crossing of potential surfaces. /. Phys. Chem. 41, 109-116 (1937). 46. R. Lumry and H. Eyring: Energy exchange in photoreactions. In: Radiation Biology III, ed. by A. HoUaender, 1-70, McGraw-Hill Book Co. (1956). 47. P. Finkelstehm and A. McLaren: Photochemistry of proteins VI. pH dependence of quantum yield and UV absorption spectrum of chymotrypsin. /. Poly. Sci. 4, 573-582 (1953). 48. H. Simon: On a class of skew distribution functions. Biometrika 42, 425-440 (1955). DISCUSSION Platt: Simon (48) has shown that skewed distributions (Yule distributions), such as those in Fig. 2, can be obtained from models based on probabiHty assumptions much weaker than those we were looking for. Thus our inability to determine constraints from a study of the distribution of amino acid and letter frequencies in proteins and words is not surprising. However (in agreement with our summarizing statement for that section), Simon points out that the occurrence of a Yule distribution does not obviate more stringent constraints as the underlying probability mechanism. SPECIFIC MECHANISMS OF PROTEIN SYNTHESIS AND INFORMATION TRANSFER IN THE DEVELOPING CHICK EMBRYO* H. R. Mahler, H. Walter, A. Bulbenko and D. W. 
Allmann
Department of Chemistry, Indiana University, Bloomington, Indiana

Abstract: Some preliminary data on precursors and pathways of protein biosynthesis in chick embryos are presented. The tentative conclusions stated are:

1. Egg white proteins are not utilized for the synthesis of embryonic proteins up to and including the ninth day. Soluble proteins added to the yolk are incorporated effectively, and preferentially to some of the yolk proteins proper.
2. Proteins, peptides and amino acids injected into the yolk sac are incorporated at approximately equal rates. Considering the relative available pool sizes of the various precursors present in the egg, added proteins have to be regarded as the preferred amino acid source of embryonic proteins.
3. A common precursor formed efficiently from proteins and relatively slowly from added amino acids and peptides is considered a likely intermediate in the process.
4. Homogenates of adult organs injected into embryos can be used to elicit a response previously reported for organ transplants, i.e. the apparently specific transfer of labeled material from donor organs to the corresponding organ in the embryonic host. The supernatant fraction of the cytoplasm appears to be, at least in part, responsible for the results observed.

I. INTRODUCTION

It is the purpose of this contribution to describe, in brief, some preliminary experiments on a controlled biosynthetic activity, namely, the precursors and pathways of protein formation. It differs from most of the papers in this symposium in dealing with phenomena rather than with concepts and in the absence of any attempt to establish a functional correlation between these biological phenomena and information-theoretical abstractions. It shares with other papers in this volume the properties of being highly tentative, and of presenting data and comments on a subject to which it is felt information theory should eventually make significant contributions.
With the hope that arrival of that time might be hastened and that thought and discussion might be stimulated, our data are presented for consideration. Some of the results are derived from single experiments only and thus lack further confirmation. All of the approaches and conclusions reported are still under active investigation and thus subject to revision and modification.

* The investigations reported have been supported by grants-in-aid of the National Heart Institute, National Institutes of Health, U.S. Public Health Service (Grant No. H 2177) and of the National Science Foundation. This article is contribution No. 746 from the Department of Chemistry, Indiana University.

Embryos were chosen for the experiments since their cells exhibit two fundamental and related properties, both apparently controlled by the nuclear machinery, which set them apart from other cells of higher organisms. These are: the capacity for replication, that is, rapid yet controlled growth; the capacity for differentiation, that is, continuous yet controlled change and evolution (1). Therefore, one might consider this the system of choice for attempts at discovering how the information content of the hereditary material, the genetic potentialities, are translated into progressive biochemical capabilities and thus into physiological and morphological realities (2). The experiments were done with chick embryos in ovo because of the ease of handling and the essentially closed and self-contained nature of the experimental system. Furthermore, there is a relative paucity of reliable, modern information available about their metabolism and that of embryos of higher vertebrates in general, as contrasted to the large body of knowledge derived from experimental embryology.
Our eventual aim is to study the initiation, the mode, and the control of synthesis of highly specific, respiratory enzymes as an indicator of controlled biosynthetic events; however, our initial investigations deal with the more modest aim of defining the parameters of embryonic protein synthesis (3).

For any protein formed de novo, as has been pointed out by Spiegelman (4), essentially three different mechanisms may be envisaged:

1. The rearrangement of pre-existing protein molecules; namely, the urprotein hypothesis of Northrop (5), with suitable modifications.
2. The accretion of amino acids on to pre-existing proteins or peptides.
3. De novo synthesis from amino acids.

In the special case of the formation of induced enzymes in rapidly dividing bacterial cells and cell-free systems derived therefrom, the evidence is overwhelmingly in favor of the third alternative (4, 6). The situation is not nearly as straightforward in the vertebrate systems studied. On the one hand, for example, Work and collaborators investigated the synthesis of milk proteins (7), Velick, Simpson and co-workers the synthesis of several specific enzyme proteins for muscle (8, 9), and Loftfield and Harris the synthesis of liver ferritin (10). All this work was in vivo and by different experimental techniques, but all these authors presented strong evidence for the last alternative and against the first two. On the other hand Anfinsen and his co-workers, working with hen's oviduct in vitro, have demonstrated that in short term incubations incorporation of amino acids into freshly formed ovalbumin is non-uniform, which is suggestive of the second alternative, but that after longer periods there is a redistribution towards uniformity (11). Similar results have also been obtained for ribonuclease and insulin synthesis by pancreas slices.
In the case of the proteins of the chick embryo proper, Francis and Winnick have presented data on the incorporation of labeled amino acids in free and protein-bound form as possible precursors of cardiac muscle protein grown in tissue culture (12). The amino acids of the proteins did not exchange with large pools of the corresponding unlabeled acid in the medium, and from this and from experiments with doubly-labeled proteins it was concluded that proteins could be transferred from a nutrient embryo extract medium to heart muscle protein without release of free amino acids. Tracer experiments of this sort, as will be discussed later, do not, however, prove the direct transfer of protein, but solely suggest that there may not be free equilibration between the free added amino acid pool and amino acids formed and utilized metabolically during precursor protein breakdown and product protein formation respectively.

Another potentially very fruitful line of investigation is provided by some experiments of Ebert's, the results of which tentatively suggest the incorporation of organ specific adult proteins into those of embryos subsequent to chorioallantoic grafts of the donor organs (3, 13). These researches were the outgrowth of findings by Murphy (14) and by Danchakoff (15), made some forty years ago, that such transplants of adult chicken spleen lead to a specific enlargement of the host organs. A systematic re-investigation of the phenomenon by Weiss led to the conclusion that transplants of kidney and liver, as well as injections of organ breis of six-day old chick embryos into four-day old hosts, could lead to similar effects (2). Weiss correctly pointed out that experiments of this sort did not permit a choice between a 'template' or a 'specific precursor' type of mechanism.
Ebert's investigations are designed to shed some light on this question as well as on the more general ones of protein synthesis and organ specific growth control in embryonic development.

In our own investigations we have made use of S³⁵-labeled organ homogenates, isolated proteins, peptides, and amino acids to gain some insight into the pattern of embryonic protein biosynthesis. In this work we have been interested not only in the immediate but also in the original precursors, which in this case must consist of all or part of the egg white and yolk proteins. Preliminary accounts of some aspects of this work have appeared (16).

II. METHODS AND RESULTS

1. Preparation of Labeled Precursors

In the experiments to be reported in this and subsequent sections S³⁵-labeled proteins, peptides, and a mixture of amino acids were prepared biosynthetically as follows: Torulopsis utilis was grown on S³⁵-sulfate (obtained from Oak Ridge National Laboratory), according to Wood and Perkinson (17). After extraction with organic solvents (18) the yeast protein was hydrolysed with a 1:1 mixture of 6N HCl and 90 per cent formic acid. Humin was removed by centrifugation and a portion of the neutralized hydrolysate, which also served as source of amino acids in the experiments to be reported, corresponding to 50 mc of the original S³⁵, was injected intraperitoneally into a laying White Leghorn hen in two doses, about five hours apart. Eight hours after the second injection the blood was withdrawn by heart puncture, allowed to clot, and serum albumin and serum globulin prepared (19). The oviduct was removed from the hen, and ovalbumin prepared essentially as described by Steinberg and Anfinsen (11). All proteins were treated with cysteine at a pH of 8.0 to 8.5 to assure removal of exchangeable S³⁵, and then dialysed. Peptides were prepared by peptic hydrolysis of the proteins.
Aliquots of the radioactive amino acids, peptides, and proteins were prepared by standard methods and counted.

In the tracer experiments, 0.05 to 0.1 ml aliquots of the radioactive precursor solutions, containing 0.3 to 1.8 mg and 6000 to 25,000 counts per minute each, were injected into the yolk or the albuminous portion of some two to three dozen unincubated, embryonated White Rock eggs. The punctures were sealed with paraffin wax and the eggs then incubated at 38° C under conditions of controlled humidity. Starting with the fifth and ending with the ninth day after the injection, embryos were harvested and a number pooled. The mixture was homogenized for about three minutes in a Potter-Elvehjem homogenizer in Ringer's isotonic saline solution, made up to 10 ml (fifth and sixth days) or 20 ml (seventh through ninth days), and precipitated with trichloracetic acid (final concentration, 8 per cent). Dry protein powders were then prepared and counted (20).

2. Is There Evidence for Selective Utilization of Egg-white or Yolk Proteins?

In the first set of experiments, chicken serum albumin injected into yolk or egg-white was used as a protein tracer. Table I shows the results of two series of experiments.

Table I. Injection of Chicken Serum Albumin into Embryonated Eggs

                       Egg white                            Egg yolk
Day after    % of injected     Protein wt       % of injected     Protein wt
injection    activity found    of embryo        activity found    of embryo
             per embryo        (mg)             per embryo        (mg)
    5        .006    .008      5.5    7         0.79    1.12      5     6.5
    6        .012    .100      13     16        1.34    0.31      11    19
    7        .015    .029      28     29        2.84    1.58      17    27
    8        .016              45               4.04    3.35      43    48
    9        .088    .133      72     79        2.86    7.28      53    87

The spread of the data is indicative of the precision, reliability, and reproducibility usually obtained in experiments of this sort. Let us now make the following assumptions: (a) that the injected protein is a true tracer for egg-white and yolk protein respectively, i.e.
that no permeability or other pool barriers exist for its equilibration with the corresponding unlabeled egg proteins; and (b) that there is no selectivity in the uptake mechanism of the embryo either for or against a serum albumin tracer as a typical precursor protein. Now we can calculate the data shown in Table II and compare the observed mean of the amount of protein actually formed with that expected on the basis of the above assumptions. The latter value is calculated by multiplying the weight of total yolk or egg-white protein, about 3000 mg each, by the per cent of the injected activity incorporated per embryo (from Table I). There are profound discrepancies between the calculated and the observed values. The values calculated for the egg white are only a small fraction of those observed, while those calculated for the yolk are uniformly about two-fold greater. It is thus apparent that at least one of the assumptions cited cannot be valid. The simplest modification would be to postulate that assumption (b) is not true, and that over the time-period studied egg white proteins are not precursors of embryonic proteins. Soluble proteins injected into the yolk can be utilized for this purpose, and may be more efficient than some of the yolk proteins proper.

Table II. Amounts of Embryonic Protein Formed Compared to that Calculated from Tracer Data

                     Protein (mg/embryo)
Day after                        Calculated*
injection    Observed      Egg-white      Yolk
    5            6            0.21         28.8
    6           15            1.68         24.9
    7           29            0.66         66.3
    8           45            0.48        111.0
    9           76            3.30        152.0

* From injected albumin tracer.

3. Is There Evidence for Selective Utilization of Amino Acids, Peptides or Proteins?

In the next series of experiments we compared serum albumin, albumin peptides and amino acids all injected into the yolk, with the same precursors injected into egg white. The design of the experiment was the same as before and the results of one run are summarized in Table III.

Table III. Incorporation of Protein Precursors into Chick Embryos*

              Precursors injected into yolk       Precursors injected into egg-white
Day after                 albumin     amino                   albumin     amino
injection     albumin     peptides    acids       albumin     peptides    acids
    5          0.75        0.44       0.34        0.0063       0.35       1.19
    6          1.30        0.90       1.53        0.013        0.56       3.03
    7          2.80        1.70       3.86        0.015        1.59       3.48
    8          4.05        4.72       5.15        0.016        2.32       4.94
    9          2.85        8.52       9.18        0.088        5.94       5.65

* Expressed as per cent of injected activity recovered per embryo.

We see that except for albumin injected into egg-white, which has already been discussed, all the precursors tested appear to be utilized with approximately equal efficiency regardless of whether they are injected into the yolk or the egg white. This is not limited to serum albumin, but holds true equally well for serum globulin and ovalbumin and their peptides, as is shown in Table IV.

Table IV. Incorporation into Embryos of Proteins and Peptides Injected into the Yolk*

Day after     S.          S.          Oval-       S. albumin   S. globulin   Ovalbumin
injection     albumin     globulin    bumin       peptides     peptides      peptides
    5          0.75        1.10       0.45         0.44         0.20          0.95
    6          1.30        1.75       0.80         0.90         0.55          1.65
    7          2.80        2.35       0.40         1.70         1.15          2.20
    8          4.05        2.55       1.45         4.72         2.20           -
    9          2.85        4.50       2.95         8.52         4.50          6.60

* Expressed as per cent of injected activity recovered per embryo.

4. Is There Evidence for Organ-specific Transfer?

In order to test the hypothesis of organ-specific transfer advanced by Ebert we have attempted to extend investigations of this sort to the use of S³⁵-labeled adult chicken liver and heart homogenates. These were prepared from deep-frozen organs of a White Leghorn hen injected with a mixture of S³⁵-amino acids, and treated as described above. After several months the tissues were thawed and homogenized in a tris-(hydroxymethyl)-aminomethane buffer solution at pH 7.4 containing 0.9 per cent KCl, first in a Waring blender and then in a Potter-Elvehjem homogenizer.
The liver and heart homogenates, made up to 10 per cent (weight/volume) with the same buffer solution, were then treated with cysteine at a pH of 8.0 to 8.5 to assure removal of all exchangeable S³⁵. After dialysis, some undissolved material was removed by low-speed centrifugation, and the relatively clear supernatant fluid was used for intravenous injection into 9-day-old chick embryos.

Embryonated White Rock eggs were incubated at 38° C under controlled humidity conditions for a period of 9 days. They were then candled, and the location of the blood vessels was marked on the shell of each egg. An area of about 1 cm² of the shell above the vessel was carefully cut out by means of a dental drill and burr without injuring the membrane, and the small square was removed with a razor blade. A drop of mineral oil was placed on the membrane to render it transparent, and 0.1 ml of the liver or heart homogenate was intravenously injected in the direction of blood flow. The eggs were reincubated for 24 hours and the embryos were excised. Hearts and livers were removed, the organs were pooled, and homogenized; dry protein powders were prepared for counting as described before. Similarly aliquots of the homogenates used for injection were prepared and counted.

The results of these experiments are given in Table V. In all, two series of experiments make up the Table. In the first series, twenty-four embryos each were injected with heart and liver homogenates; of these, twenty-two and eleven respectively survived. In the second series, forty-four out of forty-seven embryos injected with the heart preparation survived, while the number of survivors was twenty-two out of twenty-eight for the liver homogenate. Thus the table summarizes data obtained on 99 survivors out of 123 embryos that were injected: 66/71 for heart; 33/52 for liver.

It can be seen that the relative specific activity of hearts is higher than that of livers when chicken heart homogenate is injected, whereas the relative specific activity of the livers is higher than that of hearts when chicken-liver homogenate is injected.

Table V. Incorporation of Activity from Adult-Tissue Homogenates into Nine-Day Embryos after Twenty-four-hour Incubation

                                                                Injection
Item                                       Chicken heart homogenate          Chicken liver homogenate
Count/min per embryo injected                  398             398             2780            2780
mg injected per egg                            0.1             0.1              0.1             0.1
Organs investigated                        Hearts  Livers  Hearts  Livers  Livers  Hearts  Livers  Hearts
No. of organs cut out                        22      11      22      11      11      11      11      11
Dry protein wt of organs obtained (mg)      38.2    72.0    38.8    70.0    84.7    20.9    77.6    22.4
Wt counted (mg)                             18.3    29.8    23.4    30.0    30.1    11.6    30.2    12.6
Count/min observed*                          21      24      22      19     366     173     389     214
Corrected count/min per 30 mg                28      24      25      19     365     286     386     340
Relative specific activity                  1.00    0.86    1.00    0.76    1.00    0.78    1.00    0.87

* Counts per minute are within 5 per cent standard deviation.

III. CONCLUSIONS

The experiments on soluble protein tracers added to yolk and egg-white demonstrate quite clearly that proteins added to the egg-white or, probably, egg-white proteins themselves are incorporated with such low efficiency as to rule out any important contribution from this source to the protein of the developing embryo, at least up to and including the ninth day. Incorporation of protein from the yolk is rapid, and soluble proteins injected into this source may be utilized preferentially to some of the yolk proteins themselves. This utilization of yolk rather than egg-white proteins as a source of embryonic protein during this period is in accord with other investigations, notably the quantitative protein depletion studies of Rupe and Farmer (21).
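
The tracer argument behind Table II can be retraced numerically. What follows is only a minimal sketch, assuming that the paired values of Table I are averaged before being multiplied by the roughly 3000 mg of total egg-white or yolk protein; the function and variable names are illustrative and do not come from the original work.

```python
# Illustrative recalculation of the "calculated" columns of Table II.
# Assumption: the two series of Table I are averaged before scaling.

TOTAL_POOL_MG = 3000.0  # approximate total egg-white or yolk protein, per the text

# Per cent of injected activity found per embryo (Table I), keyed by day after injection.
EGG_WHITE_PCT = {5: [0.006, 0.008], 6: [0.012, 0.100], 7: [0.015, 0.029],
                 8: [0.016], 9: [0.088, 0.133]}
YOLK_PCT = {5: [0.79, 1.12], 6: [1.34, 0.31], 7: [2.84, 1.58],
            8: [4.04, 3.35], 9: [2.86, 7.28]}
OBSERVED_MG = {5: 6, 6: 15, 7: 29, 8: 45, 9: 76}  # observed protein, Table II


def expected_protein_mg(percent_values, pool_mg=TOTAL_POOL_MG):
    """Protein expected per embryo if the tracer mixed freely with the whole pool."""
    mean_pct = sum(percent_values) / len(percent_values)
    return pool_mg * mean_pct / 100.0


if __name__ == "__main__":
    for day in sorted(OBSERVED_MG):
        white = expected_protein_mg(EGG_WHITE_PCT[day])
        yolk = expected_protein_mg(YOLK_PCT[day])
        print(f"day {day}: observed {OBSERVED_MG[day]} mg, "
              f"from egg white {white:.2f} mg, from yolk {yolk:.1f} mg")
```

Under these assumptions the egg-white figures fall far short of the observed protein while the yolk figures exceed it, in agreement with Table II to within rounding.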
For the intervals studied, amino acids, peptides and proteins, even those of relatively 'foreign' origin such as the serum proteins, all apparently provide an equally acceptable source of S³⁵ for embryonic protein synthesis (within an order of magnitude or so), provided they are injected into the yolk. Now the protein tracer must be diluted by at least a portion of the 3.0 g or so of yolk protein; an estimate of approximately 50 per cent would appear reasonable in view of the results reported above. On the other hand, amino acids or peptides cannot be diluted to any appreciable extent since the pools of these substances in the egg are vanishingly small (22). From this one might conclude that proteins themselves or substances easily formed from them must be the preferred precursors of embryonic proteins. Since the egg protein ovalbumin is used no more efficiently than the more 'foreign' serum proteins, the pathways of assimilation for these precursors, available to the embryo, must have at least some intermediates in common. The data on peptides may find a similar interpretation.

These intermediates are not free amino acids, as evidenced by their relatively low incorporation rates. They may be small peptides or activated forms of amino acids, formed readily and reversibly from protein precursors, but not identical and not in equilibrium with the pool of added low-molecular weight precursors. This view would be in accord with the findings of Francis and Winnick (12), although not with their interpretation. The occurrence of pools of modified amino acids, incapable of equilibrating with those in the medium, has been demonstrated in micro-organisms. Thus Gale, working with Staphylococcus aureus, found that added glutamic acid could be so transformed, and the modified form used for protein synthesis (24).
Similarly Cowie and Walton (25) have presented evidence that the pools of amino acids formed metabolically in Torulopsis utilis and utilized as effective precursors in protein synthesis, are present in some modified form, possibly as complexes adsorbed onto macromolecules, and do not equilibrate freely with added amino acids in the medium. In all the cases presented, this metabolically active form of the amino acids may be formed by a variety of pathways as indicated below.

                        Proteins
                           |
                           v
                 [Peptide Intermediates]
                           |
                           v
    Free peptides --> 'Amino Acids' <-- Free amino acids
                       (modified)

Recent investigations, especially by Zamecnik and his collaborators (26), have disclosed that free amino acids are first 'activated' by enzymes in the soluble portion of the cytoplasm (27), probably through mixed anhydride formation with adenylic acid (27, 29, 30) prior to their incorporation into a protein-bound form (30, 31), which takes place in RNA-rich granules associated with the microsomal fraction of homogenates (32, 33, 34). Whether or not the metabolically active form of amino acids alluded to above can be equated with these aminoacyl adenylates has not yet been established.

An alternative explanation, which has been invoked to account for apparent preferential utilization of proteins over amino acid precursors in the formation of specific proteins, postulates proteolysis and protein synthesis sites in such close spatial juxtaposition as to permit ready transfer of intermediates from breakdown to synthesis site at the expense of penetration of the latter by added amino acids. This has been suggested by Loftfield and Harris (10) as the mechanism operative in ferritin synthesis, and by Walter et al. (20) in the transformation of serum into organ proteins.
Purely spatial factors of this sort are probably not the determining ones in the present instance, since it can be demonstrated that the bulk of the proteolytic activity is centred in the yolk (23), and thus remote from the synthetic activity which is, presumably, occurring in the embryo itself. It is hoped that critical experiments now in progress will permit a choice to be made between the various alternatives suggested.

We have shown that the organ-specific localization phenomenon, previously observed with chorio-allantoic transplants, can be duplicated by the injection of homogenates of adult tissue. Similarly Tumanishvili et al. (35) found almost simultaneously that host organ enlargement could also be elicited by the same technique. This demonstration of the essential similarity of two approaches clears the way for an investigation of the problem by means of relatively straightforward biochemical and enzymological techniques rather than the more demanding ones of experimental embryology. Obviously only a bare beginning has been made. The findings will have to be confirmed and extended and several relatively trivial explanations excluded. Among such explanations are, for instance, the transfer of whole cells on the one hand, and differential composition and/or incorporation rates with respect to cystine and methionine in the two tissues studied, on the other. Ebert claims to have eliminated both these alternatives in his transplantation experiments; in the light of the available information, they are not very likely in the present case. Nevertheless they will have to be rigorously excluded. Our tentative interpretation of the preliminary results described is identical with that advanced by Ebert: that we are dealing with a specific transfer of rather large units from the donor preparation to the embryonic organ.
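
The organ-specific effect is expressed in Table V as a relative specific activity. The following is a minimal sketch only, assuming that this quantity is the corrected count rate of each embryonic organ divided by that of the organ homologous to the injected homogenate; the data layout and function name are illustrative, and the correction to 30 mg is taken from the table as printed rather than recomputed.

```python
# Illustrative recalculation of the last row of Table V.
# Assumption: relative specific activity = corrected count rate of an organ
# divided by that of the organ homologous to the injected homogenate.

# Corrected count/min per 30 mg of dry protein (Table V), one entry per experiment.
EXPERIMENTS = [
    {"donor": "heart", "heart": 28, "liver": 24},
    {"donor": "heart", "heart": 25, "liver": 19},
    {"donor": "liver", "heart": 286, "liver": 365},
    {"donor": "liver", "heart": 340, "liver": 386},
]


def relative_specific_activity(experiment):
    """Return each organ's count rate relative to the organ matching the donor tissue."""
    reference = experiment[experiment["donor"]]
    return {organ: experiment[organ] / reference for organ in ("heart", "liver")}


if __name__ == "__main__":
    for exp in EXPERIMENTS:
        rel = relative_specific_activity(exp)
        print(f"{exp['donor']} homogenate: "
              f"heart {rel['heart']:.2f}, liver {rel['liver']:.2f}")
```

With these inputs the heterologous organ falls below 1.00 in every experiment, with ratios of roughly 0.76 to 0.88, close to the values printed in Table V.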
Preliminary experiments indicate that the injection of either heart or liver (donor) homogenates leads to an increase in specific activity in the liver as compared to the heart. The effect in this case is therefore non-specific and possibly related to the higher mitotic and synthetic activity of liver relative to heart, i.e. to fuller differentiation.

Another line of approach which promises to be of some interest is to determine the cell fraction or fractions, if any, responsible for eliciting the effect both with respect to the donor and the acceptor organ. Impetus is added to this approach by the recent experiments which have focussed attention on the soluble and microsomal fractions as being involved in the initial phases of protein synthesis. In preliminary experiments with fractionated, dialysed heart homogenates the data of Table VI were obtained.

Table VI. Transfer of Label from Donor Heart Fractions into Organs of Recipient Embryos

Fraction         Relative specific activity of embryonic organs (heart/liver)
Homogenate       1.17, 1.32, 1.23
Nuclei           0.65, 0.74
Mitochondria     0.22 (?)
Microsomes       2.56
Soluble          1.85, 2.50, 1.49

The number of data in each row corresponds to the number of experiments actually performed. Thus the results for the microsomal and mitochondrial fractions must be regarded as exceedingly tentative. With this proviso, components of the soluble fraction of the cytoplasm might be regarded as responsible for the phenomenon observed with whole heart homogenates. A similar observation has been reported by Kutsky, who found the supernatant fraction of embryo extract to be most active in stimulating the growth of heart fibroblasts in vitro (36).

REFERENCES

1. For a very stimulating and up to date review the reader is referred to: B. Ephrussi: Enzymes in Cellular Differentiation, in: O. H.
Gaebler: Enzymes: Units of Biological Structure and Function, Academic Press, New York City, 29-40 (1956).
2. For an intriguing hypothesis involving control of organ development by diffusible components see the following articles:
    P. Weiss, in A. K. Parpart: The Chemistry and Physiology of Growth, Princeton University Press, 135-186 (1949).
    P. Weiss: Self regulation of organ growth by its own products. Science 115, 487-488 (1952).
    P. Weiss: Some introductory remarks on the cellular basis of differentiation. J. Embryol. Exp. Morph. 1, 181-211 (1953).
3. The present state of knowledge, with special emphasis on protein biosynthesis, is admirably reviewed in the following: J. D. Ebert: Some aspects of protein biosynthesis in development, in: D. Rudnick: Aspects of Synthesis and Order in Growth, Princeton University Press, 69-112 (1956).
4. S. Spiegelman: On the nature of the enzyme-forming system, in: O. H. Gaebler: Enzymes: Units of Biological Structure and Function, Academic Press, New York City, 67-89 (1956).
5. R. S. Alcock: The synthesis of proteins in vivo. Physiol. Rev. 16, 1-18 (1936).
    J. H. Northrop, M. Kunitz, and R. M. Herriott: Crystalline Enzymes, 2nd ed., Columbia University Press, New York (1948).
    S. B. Koritz and H. Chantrenne: The relationship of ribonucleic acid to the in vitro incorporation of radioactive glycine into the proteins of reticulocytes. Biochim. Biophys. Acta 13, 209-215 (1954).
6. D. S. Hogness, M. Cohn, and J. Monod: Studies on the induced synthesis of β-D-galactosidase in Escherichia coli. The kinetics and mechanism of sulfur incorporation. Biochim. Biophys. Acta 16, 99-116 (1955).
7. P. N. Campbell and T. S. Work: The biosynthesis of protein. Uptake of glycine, serine, valine and lysine by the mammary gland of the rabbit. Biochem. J. 52, 217-227 (1952).
    B. A. Askonas, P. N. Campbell, and T. S. Work: The biosynthesis of proteins. Synthesis of milk proteins by the goat. Biochem. J. 58, 326-331 (1954).
    B. A. Askonas, P. N.
Campbell, C. Godin, and T. S. Work: Biosynthesis of Proteins. Precursors in the synthesis of casein and β-lactoglobulin. Biochem. J. 61, 105-115 (1955).
    C. Godin and T. S. Work: Biosynthesis of proteins. The effect of intravenous peptides on casein synthesis in a lactating goat. Biochem. J. 63, 69-71 (1956).
8. M. V. Simpson and S. F. Velick: The synthesis of aldolase and glyceraldehyde-3-phosphate dehydrogenase in the rabbit. J. Biol. Chem. 208, 61-71 (1954).
    M. V. Simpson: Further studies on the biosynthesis of aldolase and glyceraldehyde-3-phosphate dehydrogenase. J. Biol. Chem. 216, 179-183 (1955).
9. M. Heimberg and S. F. Velick: The synthesis of aldolase and phosphorylase in rabbits. J. Biol. Chem. 208, 725-730 (1954).
    S. F. Velick: The metabolism of myosin, the meromyosins, actin and tropomyosin in the rabbit. Biochim. Biophys. Acta 20, 228-236 (1956).
10. R. B. Loftfield and A. Harris: Participation of free amino acids in protein synthesis. J. Biol. Chem. 219, 151-159 (1956).
11. D. Steinberg and C. B. Anfinsen: Evidence for intermediates in ovalbumin synthesis. J. Biol. Chem. 199, 25-42 (1952).
    C. B. Anfinsen and D. Steinberg: Studies on the biosynthesis of ovalbumin. J. Biol. Chem. 189, 739-744 (1951).
    M. Vaughan and C. B. Anfinsen: Nonuniform labeling of insulin and ribonuclease synthesized in vitro. J. Biol. Chem. 211, 367-374 (1954).
    M. Flavin and C. B. Anfinsen: The isolation and characterization of cysteic acid peptides in studies of ovalbumin synthesis. J. Biol. Chem. 211, 375-390 (1954).
    D. Steinberg, M. Vaughan, and C. B. Anfinsen: Kinetic aspects of assembly and degradation of proteins. Science 124, 389-395 (1956).
12. M. D. Francis and T. Winnick: Studies on the pathway of protein synthesis in tissue culture. J. Biol. Chem. 202, 273-289 (1953).
13. J. D. Ebert: The effects of chorioallantoic transplants of adult chicken tissues on homologous tissues of the host chick embryo. Proc.
Nat. Acad. Sci., Wash. 40, 337-347 (1954).
14. J. B. Murphy: The effect of adult chicken organ grafts on the chick embryo. J. Exp. Med. 24, 1-6 (1916).
15. V. Danchakoff: Equivalence of different hematopoietic anlagen (by method of stimulation of their stem cells). II. Grafts of spleen on the allantois and response of allantoic tissues. Amer. J. Anat. 24, 127-189 (1918).
16. H. Walter, A. Bulbenko, and H. R. Mahler: Precursors of embryonic chick proteins. Nature, Lond. 178, 1176-1177 (1956).
    H. Walter, D. W. Allmann, and H. R. Mahler: Influence of adult tissue homogenates on formation of similar embryonic proteins. Science 124, 1251-1252 (1956).
17. J. L. Wood and J. D. Perkinson Jr.: Yeast biosynthesis of radioactive sulfur compounds. J. Amer. Chem. Soc. 74, 2444-2445 (1952).
18. R. B. Williams and R. M. C. Dawson: The biosynthesis of L-cystine and L-methionine labeled with radioactive sulphur S³⁵. Biochem. J. 52, 314-317 (1952).
19. W. Friedberg, H. Walter, and F. Haurowitz: The fate in rats of internally and externally labeled heterologous proteins. J. Immunol. 75, 315-320 (1955).
20. H. Walter, F. Haurowitz, S. Fleischer, A. Lietze, H. F. Cheng, J. E. Turner, and W. Friedberg: Metabolic fate of injected homologous serum proteins in rabbits. J. Biol. Chem. 224, 107-119 (1957).
21. C. O. Rupe and C. J. Farmer: Amino acid studies in the transformation of proteins of the hen's egg to tissue proteins during incubation. J. Biol. Chem. 213, 899-906 (1955).
22. A. L. Romanoff and A. J. Romanoff: The Avian Egg. J. Wiley and Sons, New York City (1949).
23. A. L. Romanoff: Membrane growth and function. Ann. N.Y. Acad. Sci. 55, 288-301 (1952).
24. E. F. Gale: Assimilation of amino acids by gram-positive bacteria and some actions of antibiotics thereon. Advances in Protein Chemistry 8, 285-391 (1953).
25. D. B. Cowie and B. P. Walton: Kinetics of formation and utilization of metabolic pools in the biosynthesis of protein and nucleic acid. Biochim. Biophys.
Acta 21, 21 1-226 (1956). 26. M. D. Hoagland, E. B. Keller, and P. C. Zamecnik: Enzymatic carboxyl activation of amino acids. /. Biol. Chem. 218, 345-358 (1956). 27. P. SiEKEViTz: Uptake of radioactive alanine in vitro into the proteins of rat liver fractions. /. Biol. Chem. 195, 549-565 (1952). 28. J. A. DeMoss and G. D. Novelli: An amino acid dependent exchange between inorganic pyrophosphate and ATP in microbial extracts. Biochim. Biophys. Acta 18, 592-593 (1955). 29. P. Berg and G. Newton: Acyl adenylates : the interaction of adenosinetriphosphate and L-methionine. /. Biol. Chem. 222, 1025-1034 (1956). 30. E. B. Keller and P. C. Zamecnik: The effect of guanosine diphosphate and triphosphate on the incorporation of labeled amino acids into proteins. /. Biol. Chem. 221, 45-59 (1956). 31. H. Sachs and H. Waelsch: The effect of pyrophosphate on amino acid incorporation into rat liver microsomes. Biochim. Biophys. Acta 21, 188-189 (1956). Specific Mechanisms of Protein Synthesis in the Developing Chick Embryo 135 32. J. W. LiTTLEFiELD, E. B. Keller, J. Gross, and P. C. Zamecnik: Studies on cytoplasmic ribonucleoprotein particles from the liver of the rat. J. Biol. Chem. 217, 111-123 (1955). 33. M. V. Simpson and J. R. McLean: The incorporation of labeled amino acids into the cytoplasmic particles of rat muscle. Biocliim. Biophys. Acta 18, 573-575 (1955). 34. G. C. Webster and M. P. Johnson: Effect of ribonucleic acid on amino acid incorpora- tion by a particulate preparation from pea seedlings. /. Biol. Chem. 217, 641-649 (1955). 35. S. TuMANisHviLi, K. M. Djandier, and I. K. Skanidze: Specific stimulation of the growth of the chicken embryo by the effects of tissue extracts. C.R. Acad. Sci., U.R.S.S. 106, 1107-1109(1956). 36. R. J. Kutsky: Growth stimulating effects by nucleoprotein and cell fractions on chick heart fibroblasts in vitro. U.S. Atom. Energ. Comm., U.C.R.L. No. 2270 (1953). 
DISCUSSION

Quastler: It is useful to compare the informational requirements of various alternative methods of protein synthesis. If the whole protein is synthesized directly from amino acids, then each locus on the template must carry sufficient information to specify a single amino acid, or approximately four bits; this is well within the informational capacities of chemical reactions. If the incorporation occurs in two steps, as has been suggested, then each step might have to specify no more than two bits.

If the protein is synthesized from peptide chains, then the informational requirements are much more stringent. Consider the linking of two peptide chains of, say, five amino acids each. If each of the ten amino acids can be any one of the whole set of amino acids, then the linking operation must, in some way, identify ten amino acids, for a total of about forty bits, which is a very large amount of information to be processed in a single act. The requirements are greater (in fact, almost certainly too great) if two chains of ten amino acids are to be linked.

The following possibilities exist which allow the use of large fragments without imposing high informational requirements: (a) the terminal amino acid in a chain identifies automatically the other members; this would imply very strong sequential dependencies within peptide chains, and consequently a low informational capacity of the whole amino acid sequence; (b) linkages are formed without reference to the nature of residues remote from the locus of linkage, and the resulting proteins are torn down again if not functional; in this case, the probability of producing functional sequences by chance is small, and the efficiency of protein synthesis is low; or (c) the protein studied is such that the exact sequence of residues is irrelevant.

THE MECHANISM OF ACTION OF METHYL XANTHINES IN MUTAGENESIS

Arthur L. Koch

Department of Biochemistry, College of Medicine, J.
Hillis Miller Health Center, University of Florida, Gainesville, Florida

Abstract: The biochemical findings relating to the action of methyl xanthines on bacteria and bacterial extracts have been reviewed. These observations, together with those of Novick and Szilard on the mutagenic activity of these substances, have suggested that the biological action results from an inhibition of enzymes of nucleic acid biosynthesis. Consequences of this hypothesis have been discussed relative to the regulation of growth of cell constituents. Alternative hypotheses are enumerated.

I. INTRODUCTION

A number of agents, both chemical substances and radiations, cause mutations. One particular class appears to be potentially most fruitful in an attempt to understand the genetic replication process. This class includes purines and related compounds. Particularly important are the plant alkaloids responsible for the pharmacological effects of coffee, tea and cocoa. If these substances are added to a continuously growing culture of bacteria, the mutation rate increases markedly (1, 2).

If we compare the structures (Fig. 1) of the alkaloids caffeine, theobromine and theophylline with the purine bases normally present in the nucleic acids of all species, adenine and guanine, the similarity is readily apparent. The former are methyl derivatives of xanthine, the latter amino and deoxy derivatives of xanthine. It is tacitly assumed that these agents are mutagens because of this similarity.

Fig. 1. The structure of purine derivatives (caffeine, theobromine, theophylline, xanthine, adenine, guanine).

II. TRACER STUDIES

The first possibility to test was that these compounds, or products derived from them, are utilized for the synthesis of the nucleic acid of the host (3).
To do this we prepared these substances, as well as some others, labeled with carbon 14 in the 8-position of the heterocyclic nucleus. These were then added to growing cultures of Escherichia coli under conditions similar to those employed by Novick and Szilard (1, 2) in their studies. In Table I the data so obtained are presented. Adenine and guanine as well as the deaminated derivatives are very well incorporated into the nucleic acids of both the RNA and DNA type, whereas all methylated substances are incorporated only to a very small extent, if at all. On the other hand, the correlation of mutagenesis is the reverse.

Table I. Incorporation and Mutagenicity of Various Purines

Purine          RSA* of DNA purines    Mutagenicity
Adenine         0.3                    +
Guanine         0.20                   ±
Hypoxanthine    0.30
Xanthine        0.20                   -
Theobromine     0.00002                +++
Caffeine        0.00001                +++
Theophylline    0.00001                +++

*RSA = relative specific activity = ratio of the specific activity of the purine isolated from the bacteria to that of the growth medium.

A mutation is a very rare event, and though these agents, when present in quite high concentration, may raise the mutation rate by a factor of fifteen or so, this still only corresponds to one event in 10'^ duplications. The small amount of radioactivity that is found associated with the DNA from cells grown in the presence of radioactive mutagens is probably experimental contamination. However, although these experiments are technically excellent, they cannot begin to exclude the possibility that a methylxanthine molecule is incorporated into the DNA molecule in the process of the rare mutational event itself, since the resultant incorporation for one locus would be many orders of magnitude below the trace amount observed here. Consideration of the structures of these substances, however, makes this possibility rather unlikely.
In the formation of the normal 9-N-riboside or 9-N-deoxyriboside linkages, the single replaceable hydrogen, which may be in either the 7- or 9-position, is replaced by the glycosyl residue. In the case of caffeine or theobromine, which are 7-methyl derivatives, this is not possible because of the prior replacement of the hydrogen by the methyl group. Thus even though the methyl group is attached to the 7-position it prevents bond formation at the 9-position. Consequently, the methyl group must be removed if the molecule is to be incorporated into the nucleic acids. The isotopic data, as well as other information, are adequate to demonstrate that there is not a single molecule of enzyme present in these bacteria capable of removing this methyl group (3). Therefore it would appear that certain of the mutagenic materials are not and cannot be converted into a form in which they can be linked covalently to cell materials, at least not by the 9-N-glycosyl bond which has been universally found in biological materials.

III. PURINE METABOLISM IN ESCHERICHIA COLI

The next possibility we investigated was that the mutagens act by interfering with nucleic acid biosynthesis. First, however, it is necessary to discuss the metabolism of the organism under study. Fig. 2 summarizes, from the available tracer data, the pathways of purine synthesis in growing cultures of the test organism (4, 5, 6, 7, 8).

Fig. 2. The purine metabolism of Escherichia coli.

C14-labeled CO2 (4), glycine (8), and serine or formate (unpublished) lead to the formation of RNA adenine, DNA adenine, RNA guanine, and DNA guanine, all of equal specific activity.
The activity in the purines derived from CO2 and glycine is such as to indicate that the well-accepted scheme for purine biosynthesis is the major pathway in this organism (4). C14-labeled adenine and hypoxanthine and their derivatives yield adenine samples of equal, but lower, specific activity in both RNA and DNA. From these facts it is inferred that there are three pools at which purine metabolism branches, namely, a 'purine' pool which is common to all cellular purines, and an 'adenine' and a 'guanine' pool which are precursors of the corresponding purine in both types of nucleic acid. So far, attempts to find a precursor which enters purine metabolism at some point beyond the 'adenine' or 'guanine' pool have failed. Even when the intracellular adenine-C14 ribonucleotides were specifically labelled (5), the incorporation into the purines of the ribose nucleic acid was equal to that in the deoxyribose nucleic acids.

It should be mentioned that in organisms under conditions of rapid growth, the soluble intermediate pool concentrations relevant to this scheme are small (5). It was impossible to demonstrate guanosine, adenine deoxyriboside, guanine deoxyriboside or phosphorylated derivatives.

Although the tracer data delineate the pathways, they do not define the intermediates. It is, however, possible to conclude from available enzyme data that 'adenine' and 'guanine' pools are made up at least in part of the free bases themselves. This follows from the fact that the known enzymes of purine metabolism which might be involved in the conversion of the hypothetical 'purine' precursor to the two types of nucleic acids catalyze reactions involving the free purine base. The purine nucleoside hydrolases, purine nucleoside phosphorylases, purine N-trans-glycosidases, and purine nucleotide pyrophosphorylases yield the free purine base.
These enzymes and the postulated pathway of direct reduction of the riboside to the deoxyriboside constitute the only pathways of interconversion of ribose and deoxyribose purine compounds that can be imagined at present. Since the reductive pathway is known not to occur in E. coli (9) (although the interesting work from Volkin's laboratory may be relevant (10)), it appears quite likely that the free purine base is involved in the 'adenine' and 'guanine' pools.

In addition to these general considerations, the specific observation of Lampen and Manson (11) that purine deoxyriboside phosphorylase is inhibited by adenine led us to investigate the inhibition of phosphorylases of E. coli by methyl xanthines.

IV. ENZYMATIC INHIBITION STUDIES

The main conclusion from these studies (12, 13) was that the organism possesses enzymes, particularly nucleoside phosphorylases of both types (ribose and deoxyribose), that are inhibited by purines in general and by the mutagenic substances in particular. It was also found that even in the presence of large amounts of inhibitors enzyme action was not completely repressed (Fig. 3). In all cases this suggested the presence of more than one enzyme catalyzing the reaction under study. Studies of the effect of pH and the separation of the bacteria into several chemical fractions supported this notion.

Fig. 3. The inhibition of purine nucleoside phosphorylase. The effect of caffeine concentration (mM) on the arsenolysis of adenine riboside is shown at the left, and on adenine deoxyriboside on the right. The systems contain arsenate to prevent the complication of back reaction.

The activity in various fractions was differently affected by caffeine, and this effect was different in acid, at neutrality and at alkaline reaction (see Table II). This finding explains the relatively low toxicity and bacteriostatic power of the plant alkaloids.

Table II.
Inhibition of Inosine Arsenolysis by Caffeine

Inhibition (per cent) produced by 10 µmoles caffeine per ml at three pH values, and distribution of activity (per cent, measured at pH 7):

Enzyme preparation* No.     pH 5.0    pH 7.0    pH 9.0    Distribution
6-1 (soluble)               29        59        35        67
6-2 (particulate)           64        97        46        17
6-3 (phosphate extract)     78        78         6        16
TOTAL                                                     100

* Enzyme preparation 6-1 was most active at pH 5, preparation 6-2 at pH 9, and preparation 6-3 at pH 7.

In more recent work (13) three new enzymes have been demonstrated in extracts of this organism: an inosine hydrolase, a purine-pyrimidine transribosidase, and a purine-purine transribosidase. All are inhibited to some degree by various purines. The results of the enzymatic studies are summarized in Table III.

Table III. Enzymes of Nucleic Acid Metabolism

Type Specificity Inhibition by methyl purines Adenosine deaminase Ribose Cytidine deaminase Deoxyribose Ribose Purine phosphorylases Pyrimidine phosphorylase Deoxyribose Ribose Deoxyribose Ribose some some Inosine hydrolase Purine-pyrimidine trans- Deoxyribose Ribose Ribose + glycosidase Purine-purine trans- Ribose + glycosidase

V. WORKING HYPOTHESIS

The mutagenic agents do inhibit enzymes that appear to be directly linked to the path of nucleic acid synthesis, but how can such an interference affect the mutation probability? We have proposed (12) that this may result from a change in the steady-state concentrations of the intermediates that are to be assembled together to form the macromolecular DNA. This must happen without any change in the flow of intermediates, in accord with the experimental fact that the growth rate of the bacteria is not affected significantly by the mutagens when present at concentrations that give rise to large changes in the mutation rate (1).
Let us first consider the consequences of lowering of the concentration of whatever adenine deoxyriboside or guanine deoxyriboside derivative is involved in the polymerization reaction leading to macromolecular DNA. The Watson-Crick model for DNA assumes that the specificity lies in the formation of two or three hydrogen bonds between specific pairs of nucleotides: adenine and thymine, and guanine and cytosine. It has been suggested by Watson and Crick (14) that the mutational event is the entry of a heterocyclic base which is not complementary. This would yield a double helix which is energetically less stable. Upon subsequent duplication this yields two stable molecules, one of the parental type and one of a new mutant type.

Fig. 4. Guanine-thymine pair.

It is to be recognized that the mutational event is an improbable one, and therefore quite improbable structures may be involved. Two options for the unfavorable pairing are available. First, two pyrimidines or two purines may become situated opposite each other. This gives structures that should be capable of forming hydrogen bonds, but are either too long or too short. Alternatively, a purine and a pyrimidine may pair, but the purine may occur in the uncommon tautomeric form and consequently pairing will occur abnormally. Watson and Crick (14) suggested adenine in the lactim form binding with cytosine; more probable is the pairing of guanine with thymine (Fig. 4). This pair has the proper dimensions; there are no steric difficulties. In this structure guanine is written with the oxygen in the 6-position in an enol form. X-ray-diffraction workers have concluded that guanine is ordinarily found in the keto form, but the evidence is not strong that the keto form is even dominant (15), and considerations of the resonance possibilities indicate a considerable stabilization of the enol form because the latter allows aromaticity of the heterocyclic ring. Thus, guanine-thymine pairing might well be of likely occurrence.
With this in mind, we have attempted in our enzyme studies to find differences of the effects of mutagens on the inhibition of reactions of the adenine compounds, as opposed to the guanine ones, that would be implied if this structure were to account for the mutational activity of these methylated purines. So far we have been unable to detect any such differences. We may have been examining the wrong systems.

For the present we shall tentatively suggest the pair thymine-cytosine (Fig. 5) as the culprit. This pair is shorter than the conventional structures. In the very interesting paper by Donohue (16) a large number of possible pairings are suggested. For our purposes most of these are unsatisfactory because they give rise to helices possessing a two-fold axis parallel to the helical axis, whereas in the Watson-Crick structure this two-fold axis is perpendicular to the helical axis, and thus consistent helices formed by substitution between the two types cannot occur. One structure (Donohue's no. 22) would fit into the symmetry of the Watson-Crick model and it is the pairing suggested in Fig. 5.

Fig. 5. Thymine-cytosine pair.

VI. STEADY-STATE CONSIDERATIONS

Whatever may be the critical or quantitatively most significant substitution in this type of mutational change, the hypothesis we have proposed requires that the concentration of terminal pools be altered. The experimental data that we have obtained have been primarily with purine ribonucleoside phosphorylase, which catalyzes a step which is clearly non-terminal in DNA synthesis, and very likely the reaction catalyzed by purine deoxyriboside phosphorylase is also not the transformation of the last small-molecular-weight intermediate into DNA. Although it may be that the terminal processes are inhibited, let us examine some possible situations that might lead to an alteration of the steady-state concentration of the penultimate substance without influencing the steady-state flux of DNA synthesis.
To do this, the question of bacterial growth itself must be raised. Bacteria grow autocatalytically. Hinshelwood (17) as well as others have pointed out that this results from an interaction of catalytic units. Thus, if the amount of one component, P (protein), controls the rate of synthesis of another component, N (nucleic acid), and N in turn controls the rate of synthesis of P, then

$$\frac{dP}{dt} = k_1 N, \qquad \frac{dN}{dt} = k_2 P \tag{1}$$

where $k_1$ and $k_2$ are characteristic constants. The steady-state solution of this pair of equations is

$$P = P_0\, e^{\sqrt{k_1 k_2}\, t}, \qquad N = N_0\, e^{\sqrt{k_1 k_2}\, t} \tag{2}$$

where $P_0$ and $N_0$ depend on the initial conditions and the constants $k_1$ and $k_2$. Thus both P and N increase exponentially at the same rate and each therefore appears to be 'autocatalytic'. Clearly, processes of this kind are responsible for the maintenance of constant growth rates and constant composition of cells during the exponential growth of bacteria. However, the control of the system by this type of interaction cannot explain the regulation of synthesis of intermediates for the biosynthesis of either P or N. Additional regulatory processes must be considered.

From equation (2) it is evident that for any constituent of the cell (intermediate or enzymatic catalyst) the steady concentration increases autocatalytically. If expressed as amount per unit number of bacteria or per unit bacterial mass, any cell constituent may be considered constant. Thus, if such a transformation is made, we can consider a system with time-invariant concentrations of intermediates and catalysts and also time-invariant fluxes. Thus, the steady-state treatment of reaction rates is immediately applicable to our problem. The most general formulation is that of Christiansen and has been well described by Hearon (18, 19). In essence the rate expression for each step of a concatenated reaction scheme, in which a substance is produced in one step and utilized in the next, is written down.
Each of the terms in these expressions is the product of the intermediate with a rate constant and also with either unity or with the concentration(s) of the other chemical reactant(s). If the product of the two latter factors is set equal to a quantity W, bearing suitable subscripts to identify the term, and if the usual steady-state assumptions are made, then the solutions for both the flux of the system or the over-all reaction rate v and the concentration of each intermediate $[X_i]$ may be computed. If the very last reaction is irreversible, equations (3) and (4) are obtained.

$$\frac{1}{v} = \frac{1}{W_1} + \frac{W_{-1}}{W_1 W_2} + \cdots + \frac{W_{-1} W_{-2} \cdots W_{-(j-1)}}{W_1 W_2 W_3 \cdots W_j} \tag{3}$$

$$[X_i] = v \left( \frac{1}{W_{i+1}} + \frac{W_{-(i+1)}}{W_{i+1} W_{i+2}} + \cdots + \frac{W_{-(i+1)} \cdots W_{-(j-1)}}{W_{i+1} \cdots W_j} \right) \tag{4}$$

The assumption of the irreversibility of the last step is made necessary by the well-known metabolic stability of DNA. Recent experiments (20) demonstrate the extreme irreversibility in the normal adult rat. The evidence for growing cultures of E. coli is less stringent (21, 22) but does permit this assumption in comparison with the tremendous synthetic rate in these organisms.

Now if in addition we assume that some step is either rapid in the direction of synthesis or irreversible, then it may easily be seen that the reaction velocity, v, is completely independent of subsequent steps. Thus, the synthetic rate can be made to depend on the level of a few catalysts or other reactants involved earlier in the sequence. Consequently, increased protein synthesis would cause increased synthesis of a very few enzymes critical for nucleic acid biosynthesis, and this would lead smoothly to increased DNA synthesis without requiring exact synchronization in the increase of each enzyme on the biosynthetic pathway. The concentration of the last intermediate $X_{j-1}$ can be seen from equation (4) to be $v/W_j$, and thus is completely independent of any step that has no effect on the reaction velocity, v. This case does not therefore satisfy the requirements suggested above to explain the mutagenic effects of the plant alkaloids.
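The steady-state argument above can be checked numerically. What follows is a minimal sketch under stated assumptions, not the author's calculation: it assumes a linear chain X_0 -> X_1 -> ... -> product with pseudo-first-order terms W, a source X_0 held at fixed concentration, and an irreversible last step. The function name `chain_steady_state` and all numerical constants are hypothetical, chosen only for illustration.

```python
def chain_steady_state(forward, reverse, source=1.0):
    """Steady state of a linear chain X_0 -> X_1 -> ... -> X_(j-1) -> product.

    forward[k] is the pseudo-first-order term W for X_k -> X_(k+1);
    reverse[k] is the term for the back reaction X_(k+1) -> X_k.
    The last step must be irreversible (reverse[-1] == 0).
    X_0 is held at the fixed concentration `source` (an assumption).
    Returns (flux, concentrations of X_0 .. X_(j-1)).
    """
    if len(forward) != len(reverse) or reverse[-1] != 0.0:
        raise ValueError("need equal-length lists and an irreversible last step")
    per_flux = [0.0] * len(forward)  # per_flux[k] = [X_k] / v
    downstream = 0.0                 # [X_(k+1)] / v; the product never reacts back
    for k in range(len(forward) - 1, -1, -1):
        # Net flux through step k: v = forward[k]*[X_k] - reverse[k]*[X_(k+1)]
        per_flux[k] = (1.0 + reverse[k] * downstream) / forward[k]
        downstream = per_flux[k]
    flux = source / per_flux[0]
    return flux, [flux * r for r in per_flux]


# Hypothetical constants; the second step (index 1) is irreversible.
forward = [1.0, 2.0, 5.0, 3.0]
reverse = [0.5, 0.0, 1.0, 0.0]
v, conc = chain_steady_state(forward, reverse)

# Inhibit a non-terminal step that follows the irreversible one.
v_mid, conc_mid = chain_steady_state([1.0, 2.0, 0.5, 3.0], reverse)

# Inhibit the terminal step itself.
v_last, conc_last = chain_steady_state([1.0, 2.0, 5.0, 0.3], reverse)

print(v, conc[-1])            # reference flux and last intermediate
print(v_mid, conc_mid[-1])    # both unchanged
print(v_last, conc_last[-1])  # flux unchanged, last intermediate raised
```

In this sketch, inhibiting the step between the irreversible reaction and the terminal one leaves both the flux and the last intermediate untouched, while only a change in the terminal step alters the concentration of the immediate precursor.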
The independence of growth rate in the presence of caffeine could be explained simply by assuming that the inhibition occurs after some fast or irreversible reaction; but the action of the inhibitor on any but the final step has no effect on the concentration of the immediate precursor of the macromolecule, and thus cannot affect the probability of mutation.

The scheme considered above has two desirable features: it permits a reciprocal control of nucleic acid by the level of protein synthesis, and it prevents the accumulation of large amounts of intermediates. Let us now turn to a possible mechanism that will do these two things but also will fulfill the conditions imposed by our ideas of the mutation event. Such a mechanism occurs in systems showing product inhibition. Here the rate of production of the final product will depend on the level of some enzyme catalyzing a step late in the reaction sequence, but at the same time, the inhibition prevents the unlimited synthesis of earlier intermediates.

Product inhibition is of common occurrence. It has been suggested as having metabolic significance in two cases (23, 24) in which the product of a reaction sequence inhibits some earlier reaction than its own formation. In the present case it has been shown that adenine deoxyriboside, as well as the purine bases, is an inhibitor of the phosphorylase (12). Let us assume that all of these agents are competitive inhibitors of enzyme action, although this remains to be demonstrated conclusively. Under such conditions the reaction velocity is given by the well-known expression for competitive inhibition (see, for example, (25))

$$v = \frac{K_s V (S)}{1 + K_i (I) + K_s (S)}$$

where $V$ is the maximal velocity obtainable, $K_s$ is the Michaelis-Menten constant for the substrate S, and $K_i$ is the constant for the binding of the enzyme with the inhibitor, I. If $K_i (I)$ is the dominant term in the denominator, this expression simplifies to give:

$$v = \frac{K_s V (S)}{K_i (I)}$$
In the present case, adenine deoxyriboside is the inhibitor which is formed from the substrate adenine and deoxyribose-1-PO4. Now, if the net rate of removal of adenine deoxyriboside is to be maintained constant and determined solely by the process of removal, then a steady state will quickly ensue in which (S) ∝ (I), and in which the rate of formation of I is dependent only on the rate of utilization. The concentration of I will become adjusted to establish such a condition. In the presence of the mutagen, the total inhibitor is effectively derived from three sources: deoxyribosides, free normal bases, and the mutagen. While maintaining constant synthesis of DNA, the effect of the mutagen will then be to decrease the level of the normal reaction product, adenine deoxyriboside. Similar relations will hold for guanine deoxyriboside.

It should be noted that in this case, although not in the case considered above, any number of intermediates may occur between the step under consideration and the polymerization step, if these reactions are rapidly reversible. Then a change in adenine deoxyriboside concentration will lead to a proportional change in the precursor immediately used for the formation of the macromolecule.

This model can then accommodate both the enzymatic findings and the biological facts. There is, however, one additional fact that should be introduced, viz. certain specific substances, the purine ribosides (26), are anti-mutagens. That is, these substances will prevent the action of caffeine and related compounds in causing mutations. Moreover, they will decrease the so-called 'spontaneous mutation' rate. This can be tentatively explained on the basis that these substances are substrates or immediate precursors of the substrates of the key step, and that their increase simply affects the system so as to cause an increase in the concentration of purine deoxyribosides and thus a decrease in the mutation rate.
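The product-inhibition argument of this section can be illustrated with a short numerical sketch. It rests on assumptions that are not in the text: the mutagen is treated as one more competitive inhibitor with its own binding constant `k_m`, the substrate concentration is held fixed, and the flux is fixed by the demand for DNA synthesis. The function names and all constants are hypothetical.

```python
def enzyme_rate(substrate, product, mutagen,
                v_max=100.0, k_s=1.0, k_i=10.0, k_m=2.0):
    """Competitive-inhibition rate with binding constants k_s, k_i, k_m.

    `product` is the inhibitory reaction product (e.g. the deoxyriboside);
    `mutagen` is an additional competitive inhibitor (assumed behaviour).
    """
    return v_max * k_s * substrate / (
        1.0 + k_i * product + k_m * mutagen + k_s * substrate)


def product_level(flux, substrate, mutagen,
                  v_max=100.0, k_s=1.0, k_i=10.0, k_m=2.0):
    """Product concentration at which enzyme_rate(...) equals the fixed flux."""
    denominator = v_max * k_s * substrate / flux  # value the denominator must take
    level = (denominator - 1.0 - k_s * substrate - k_m * mutagen) / k_i
    if level < 0.0:
        raise ValueError("the fixed flux cannot be met at this mutagen level")
    return level


flux, substrate = 1.0, 1.0  # hypothetical fixed flux and substrate level
for mutagen in (0.0, 10.0, 20.0):
    level = product_level(flux, substrate, mutagen)
    # The rate stays at the fixed flux while the product level falls.
    print(mutagen, level, enzyme_rate(substrate, level, mutagen))
```

Under these assumptions the flux is unchanged at every mutagen concentration, while the steady-state level of the normal product falls as the mutagen takes over part of the inhibition, which is the qualitative behaviour the hypothesis requires.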
VII. ALTERNATIVE HYPOTHESES

In concluding, I should like to list various hypotheses that one should consider in this type of chemical mutagenesis. They will be considered in order of the intimacy of the mutagen with the duplication process.

1. The mutagen is incorporated into the nucleic acid. This is tentatively rejected, as indicated above, on the basis of the tracer evidence and the argument that methylation in the imidazole ring prevents N-glycoside formation. It should be noted that production of a self-duplicating 'methylated gene' can be rejected because the mutants cannot metabolize methyl purines and certainly do not require them (3).

2. The mutagens inhibit enzymes of nucleic acid biosynthesis, and this causes a change in the concentration of intermediates. This latter effect changes the probability of mutation. This is the hypothesis we favor, but it is clear that a great deal of work will be required to establish it or some variant thereof. It is also clear from what has been said above that special circumstances must occur in order that the proposed mechanism can work.

3. The mutagen causes some change in the general metabolism of the organism and this leads to a change in the mutation probability. It is certainly true that the mutation probability is dependent on a great many factors. Kihlman (27, 28), working with plants, has suggested such a mechanism to explain chromosome breakage induced with caffeine derivatives. He proposes that ATP is necessary for the aberrations produced by the compound 8-ethoxy caffeine. However, there appear to be considerable differences between the two systems; with the bacteria the process involved is thought to be one of 'point mutation', and certain clear-cut differences are evident between the two types of material with regard to the interaction of oxygen tension and ionizing radiations (compare (2) and (27)).

4.
The mutagen causes the organism to 'adapt' to its presence, and thus causes widespread alterations in the amount of enzymes and intermediates. This could lead to a change in mutation rate. This may be in fact the explanation of the effect of adenine (12). This substance inhibits the growth of bacteria which have previously been grown in its absence. Growth resumes when the organism has 'adaptively' produced an 'adenine deaminase' activity which is not demonstrable in bacteria grown in its absence. This shift in metabolism can then be envisioned to lead to changes in the mutation rate.

This list is probably inclusive enough to contain the right answer, if there is only one; at least the research needed to test these possibilities, both with test tubes and with pencil and paper, is feasible.

REFERENCES

1. A. Novick and L. Szilard: Experiments on spontaneous and chemically induced mutations of bacteria growing in the chemostat. Cold Spr. Harb. Symp. Quant. Biol. 16, 337-343 (1951).
2. A. Novick: Mutagens and anti-mutagens. Brookhaven Symp. Biol. No. 8, 201-214 (1956).
3. A. L. Koch: The metabolism of methyl purines by Escherichia coli. I. Tracer studies. J. Biol. Chem. 219, 181-188 (1956).
4. A. L. Koch, F. W. Putnam, and E. A. Evans Jr.: The purine metabolism of Escherichia coli. J. Biol. Chem. 197, 105-112 (1952).
5. A. L. Koch: Biochemical studies of virus reproduction. XI. Acid soluble purine metabolism. J. Biol. Chem. 203, 227-37 (1953).
6. M. E. Balis, C. T. Lark, and D. Luzzati: Nucleotide utilization by Escherichia coli. J. Biol. Chem. 212, 641-645 (1955).
7. E. Bolton: Biosynthesis of nucleic acid in E. coli. Proc. Nat. Acad. Sci., Wash. 40, 764-772 (1954).
8. A. L. Koch: The kinetics of glycine incorporation by Escherichia coli. J. Biol. Chem. 217, 931-945 (1955).
9. I. A. Rose and B. S. Schweigert: Incorporation of C14 totally labeled nucleosides into nucleic acids. J. Biol. Chem. 202, 635-644 (1953).
10. E. Volkin and L.
Astrachan: Phosphorus incorporation in Escherichia coli ribonucleic acid after infection with bacteriophage T2. Virology 2, 149-161 (1956).
11. J. O. Lampen: Symposium on Phosphorus Metabolism, vol. II, ed. by W. D. McElroy and B. Glass, Johns Hopkins Press, Baltimore, 363-380 (1952).
12. A. L. Koch and W. A. Lamont: The metabolism of methyl purines by Escherichia coli. II. Enzymatic studies. J. Biol. Chem. 219, 189-201 (1956).
13. A. L. Koch: Some enzymes of nucleoside metabolism of Escherichia coli. J. Biol. Chem. 223, 535-549 (1956).
14. J. D. Watson and F. H. C. Crick: The structure of DNA. Cold Spr. Harb. Symp. Quant. Biol. 18, 123-131 (1953).
15. D. O. Jordan: The physical properties of nucleic acids. In: The Nucleic Acids, ed. by E. Chargaff and J. N. Davidson. Vol. I, 447-492, Academic Press, New York (1955).
16. J. Donohue: Hydrogen-bonded helical configurations of polynucleotides. Proc. Nat. Acad. Sci., Wash. 42, 60-65 (1956).
17. C. N. Hinshelwood: The Chemical Kinetics of the Bacterial Cell, Clarendon Press, Oxford (1946).
18. J. Z. Hearon: The steady-state kinetics of some biological systems. I. Bull. Math. Biophys. 11, 29-50 (1949).
19. J. Z. Hearon: Rate behavior of metabolic systems. Physiol. Rev. 32, 499-523 (1952).
20. R. W. Swick, A. L. Koch, and D. T. Handa: The measurement of nucleic acid turnover in rat liver. Arch. Biochem. Biophys. 63, 226-242 (1956).
21. A. D. Hershey: Conservation of nucleic acids during bacterial growth. J. Gen. Physiol. 38, 145-148 (1954).
22. A. L. Koch and H. R. Levy: Protein turnover in growing cultures of Escherichia coli. J. Biol. Chem. 217, 947-957 (1955).
23. R. A. Yates and A. B. Pardee: Control of pyrimidine biosynthesis in Escherichia coli by a feedback mechanism. J. Biol. Chem. 221, 757-770 (1956).
24. H. F. Umbarger: Evidence for a negative-feedback mechanism in the biosynthesis of isoleucine. Science 123, 848 (1956).
25. P. W.
Wilson: Kinetics and mechanism of enzyme reactions, in Respiratory Enzymes, ed. by H. A. Lardy, 22-67, Burgess Publishing Co., Minn. (1949). 26. A. NoviCK and L. Szilard: Anti-mutagens. Nature, Lond. 170, 926-927 (1952). 27. B. Kihlman: Chromosome breakage in Allium by 8-ethoxy-caffeine and X-rays. Exp. Cell Res. 8, 345-368 (1955). 28. B. A. Kihlman: Oxygen and the production of chromosome aberrations by chemicals and X-rays. Hereditas 41, 384-404 (1955). EVIDENCE FOR A NEGATIVE FEEDBACK SYSTEM CONTROLLING LIVER REGENERATION Andre D. Glinos Growth Physiology Laboratory, Walter Reed Army Institute of Research, Washington, D.C. Abstract — Cell division was induced in the resting liver of the rat by lowering the concentration of serum constituents through plasmapheresis, and was inhibited in the regenerating liver by increasing the concentration of the serum by fluid intake restriction. Electrophoretic analysis of serum proteins and histochemical investigation of the organiza- tion of cytoplasmic ribonucleoprotein of the liver cells during regeneration suggest that plasma proteins may participate as information-carrying agents in a negative feedback system controlling the growth of liver cells. Liver is an excellent tissue for investigating mechanisms of growth control because it regenerates very rapidly. In the rat, removal of up to two-thirds of the total mass of the liver is followed by active cell division leading to complete restoration of the organ within two weeks. As early as 1923 Akamatsu (1) reported that tissue cultures of rabbit hver grew better in plasma from partially hepatectomized animals than in normal control plasma, and more recently it was shown that cell division can be induced in the resting liver of a parabiotic rat by a partial hepatectomy performed on its partner (2, 3, 4). These findings were considered to indicate the presence or the increase of growth-stimulating factors in the plasma of partially hepatec- tomized animals. 
In our own studies on the possible participation of the humoral system of communication in the control of this growth, blood serum from animals undergoing liver regeneration was assayed in tissue culture (5). These cultures showed a comparable outgrowth in a high concentration of serum of partially hepatectomized rats and in a low concentration of normal serum. A high concentration of normal serum showed inhibitory effects.

Based on these findings a hypothesis was formulated with regard to the induction of the regenerative process in the liver which follows partial hepatectomy. According to this hypothesis, certain constituents of normal blood serum exert a growth-inhibitory action at their normal concentration. Partial hepatectomy would be expected to result in a decrease of the serum concentration of these constituents. This in turn initiates regenerative growth. During regeneration, as the number of liver cells increases, the concentration of these constituents will also increase. When the original equilibrium between a given number of liver cells and a given concentration of the serum constituents is restored, further growth is expected to cease.

The evidence for a negative feedback system of this type should satisfy the following two conditions:

1. Induction of growth in the resting tissue by plasma dilution.
2. Inhibition of growth in the regenerating tissue by plasma concentration.

Figure 1 illustrates the application of the classical method for plasma dilution, plasmapheresis, and the results obtained. Normal adult male rats were used. Blood was withdrawn every twelve hours, corresponding to 31 to 38 per cent of the initial total blood volume of the animals in the first group and 39 to 46 per cent in the second. The bleedings were followed by re-injections of the blood cells suspended in an equal volume of saline. Under such conditions cell division was induced in the resting liver of adult rats and was intensified with increasing dilution of the plasma. In this experiment, then, the evidence obtained satisfies the first condition for a negative feedback system.

RATE      TIME IN HOURS
          36-48           60              72-96           MEAN
0%        <.002   .003    .002    <.002   .004    .002    .002
31-38%    <.002   .022    .002    .048    .009    .002    .014
39-46%    .168    .002    .019    .054    .441    .197    .147
MEAN      .048            .031            .163

Fig. 1. Induction of cell division in the resting liver by plasmapheresis. A total of eighteen adult male rats was used. The rate of plasmapheresis is expressed as the percentage of the initial total blood volume of the animal replaced by saline per 12 hours. In the control group (rate 0%) blood was merely withdrawn and re-injected, the animals being submitted to the same stressful conditions of restriction, anesthesia, venipuncture, etc. as the experimental groups. The mitotic activity was obtained by counting 50,000 cells, and is expressed as the per cent mitotic index. When no mitosis was found, the mitotic index was recorded as <0.002.

With respect to the second condition, the method used to achieve plasma concentration was restriction of fluid intake, as illustrated in Fig. 2. Two experimental groups were used, differing with regard to the weight of the animals and the extent of the partial hepatectomy. All animals were partially hepatectomized and tube-fed an identical isocaloric fluid diet containing 3 per cent water. The controls were given drinking water ad libitum, but the experimental animals were deprived of water for the duration of the experiment, which was sixty-four hours, starting sixteen hours prior to the operation and continuing for forty-eight hours postoperatively, at which time the animals were sacrificed. A measure of total body-water loss obtained by this regimen is given by the difference in weight change between experimental and control animals in each group.
A measure of the plasma concentration achieved is given by the difference in total protein change. In both the experimental groups an effective inhibition of cell division in the liver was obtained; this inhibition became greater with increasing concentration of the serum. On the other hand mitosis in the intestinal epithelium was not affected. The evidence obtained in this experiment, then, satisfies the second condition for a negative feedback system.

The smaller extent of total body-water loss and plasma concentration in the first group can be ascribed to the greater initial weight of the animals in this group. It is well known that when dehydration proceeds slowly the maintenance of plasma volume at the expense of extravascular fluid may be quite successful. This is significant since the extravascular fluid of the liver must participate in the transmission of information to the liver cells.

The serum albumin fraction in this experiment was found to be low when liver cell division was present and normal or slightly increased when liver cell division was absent. In the framework of the present discussion this feature is somewhat suggestive when it is considered that albumin is synthesized in the liver.

DEGREE OF            TREATMENT           NO. OF   WEIGHT   WEIGHT     SERUM PROTEIN   SERUM ALBUMIN   MITOSES
HEPATECTOMY                              RATS              % CHANGE   % CHANGE        % CHANGE        /50,000 CELLS
30% (MEDIAN LOBE)    CONTROLS            10       457      -7.9       -4.7            -16.0           81.4
                     FLUID RESTRICTION   9        458      -12.9      +5.0            -14.0           18.0
10% (CAUDATE LOBE)   CONTROLS            8        331      -9.1       +1.8            -26.2           16.5
                     FLUID RESTRICTION   8        313      -17.6      +28.9           +8.1            0.1

Fig. 2. Inhibition of cell division in the regenerating liver by fluid restriction. The experimental variables are defined in the text. The percentage changes refer to differences in weight, total serum protein, and serum albumin between the values obtained before treatment and those obtained at sacrifice.
In view of these facts it was thought that an investigation of changes in protein metabolism of the liver cells, early after partial hepatectomy, could help in elucidating the possible role of the serum proteins and the extravascular fluid of the liver in the transmission of information to the liver cells. In this, we took advantage of the many observations showing histochemically detectable changes in the organization of cytoplasmic ribonucleoprotein with increasing demands on the protein synthetic mechanism of the cells (6, 7). Briefly stated, these changes consist of the disappearance from the cytoplasm of discrete basophilic bodies which are associated with ribonucleoprotein; the cytoplasm then stains uniformly with basic dyes.

Rats were sacrificed at frequent intervals after partial hepatectomy and their livers fixed and stained with gallocyanin chrome alum. Within thirty minutes after partial hepatectomy the ribonucleoprotein-associated basophilic bodies started disappearing from the cells in the periportal area. This change proceeded gradually toward the center, so that eight hours after the operation all cells, even those in the centrolobular area, were affected. After this time reconstruction of the basophilic bodies proceeded in the opposite direction, from the center of the lobule towards the periphery. At 24 hours cells in the centrolobular area had completed the cycle, showing well organized basophilic bodies, whereas cells in the middle and the periphery of the lobules remained altered (Fig. 3). Confirming the earlier data of Harkness (8), we found that cell division begins between 16 and 24 hours postoperatively in the periportal area. This is significant because cells in this area remained altered for the longest time.

Fig. 3. Regenerating liver twenty-four hours after partial hepatectomy. Central vein at lower left corner. Adjacent centrolobular zone with cells containing ribonucleoprotein-associated basophilic bodies in their cytoplasm. Middle and periportal zones with mostly altered cells having a uniformly basophilic cytoplasm. Two mitotic figures in the middle zone among altered cells.

The changes in cytoplasmic ribonucleoprotein organization indicate an activation of the protein synthesizing mechanism of the liver cells after partial hepatectomy, proceeding in a topographical pattern related to the direction of the intralobular blood flow. According to the Law of Mass Action these changes would be expected to appear with decreased protein concentration in the immediate environment of these protein-secreting cells. The cells in the periphery of the lobules would be expected to react faster and longer, since the ones more centrally located are in an environment richer in protein produced by the more peripheral cells. This interpretation was, in part, verified experimentally by showing that changes in the cytoplasmic ribonucleoprotein of the cells in the periportal area appear rapidly after a sudden decrease of the serum protein concentration (Fig. 4).

TREATMENT     FLUID     SERUM PROTEIN   LIVER RIBONUCLEOPROTEIN
                        CHANGE          CHANGE
ADDITION      SALINE    -11.8
              DEXTRAN   -31.2           +
              SERUM     +7.9
REPLACEMENT   SALINE    -19.2           +
              DEXTRAN   -37.8           +
              SERUM     -8.7

Fig. 4. Induction of cytoplasmic ribonucleoprotein changes in the liver by plasma dilution. A total of six male adult rats was used. Addition refers to a single intravenous injection of 5.5 ml of fluid. Replacement refers to a 5.5 ml single plasmapheresis treatment. All animals were sacrificed two hours after treatment. Serum protein change refers to the percentage difference between the values obtained before treatment and those obtained at sacrifice. Liver ribonucleoprotein change refers to the disappearance of the basophilic bodies from the cytoplasm of the cells in the periportal area.
After partial hepatectomy, however, these histochemical changes occur, as we have seen, within thirty minutes, before any appreciable changes in the plasma proteins. The relationships between increased pressure in the portal system following partial hepatectomy and regeneration have been demonstrated by Grindlay and Bollman (9). It is conceivable that, under conditions of increased pressure immediately following partial hepatectomy, the transfer of protein and water from the intravascular to the extravascular space is altered and results in a rapid lowering of the protein concentration of the interstitial fluid of the liver. This leads within a short period to increased protein production in the liver cells and sometime later to cell division.

REFERENCES

1. N. Akamatsu: Über Gewebskulturen von Lebergewebe. Virch. Arch. 240, 308-311 (1923).
2. B. G. Christensen and E. Jacobsen: Studies on liver regeneration. Acta Med. Scand. 234, Suppl., 103-108 (1949).
3. N. L. R. Bucher, J. F. Scott, and J. C. Aub: Regeneration of the liver in parabiotic rats. Cancer Res. 11, 457-465 (1951).
4. A. S. Wenneker and N. Susman: Regeneration of liver tissue following partial hepatectomy in parabiotic rats. Proc. Soc. Exp. Biol. Med. 76, 683-686 (1951).
5. A. D. Glinos and G. O. Gey: Humoral factors involved in the induction of liver regeneration in the rat. Proc. Soc. Exp. Biol. Med. 80, 421-425 (1952).
6. S. Lagerstedt: Cytological studies on the protein metabolism of the liver in the rat. Acta Anat., VII, suppl. 9, 1-116 (1949).
7. A. F. Howatson and A. W. Ham: Electron microscope study of sections of two rat liver tumors. Cancer Res. 15, 62-69 (1955).
8. R. D. Harkness: The spatial distribution of dividing cells in the liver of the rat after partial hepatectomy. J. Physiol. 116, 373-379 (1952).
9. J. H. Grindlay and J. L. Bollman: Regeneration of the liver in the dog after partial hepatectomy. Role of the venous circulation. Surg. Gynec. Obst. 94, 491-496 (1952).

FLUCTUATIONS IN NEURAL THRESHOLDS*

Lawrence S. Frishkopf and Walter A. Rosenblith†
Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts

Abstract: Over the past twenty-five years several independent investigations of the responsivity of nerve tissue have led to the conclusion that the threshold of a resting neuron fluctuates in time. The conclusion is based on the study of sensory and motor fibers, of monosynaptic arcs and neuromuscular junctions. A number of these studies have been reviewed and compared. The degree of threshold correlation among neurons of a given 'pool' or population has been considered for several systems. A number of possible sources of threshold fluctuation, giving rise to correlated and uncorrelated threshold variations, have been distinguished. A mathematical model based on the concept of fluctuating thresholds has been described and applied to the problem of ensemble response from the peripheral auditory nervous system. The results of three experiments have been described and compared with the predictions of the model.

I. THE CONCEPT OF A FLUCTUATING THRESHOLD

The threshold of a nerve fiber is defined as the minimum stimulus intensity that will cause an action potential to propagate. If the threshold of a nerve fiber were a fixed parameter, not changing in time, its value could be determined by presenting stimuli of increasing intensity. The fiber would fail to respond to all stimuli less than some value S_T, and would respond to all stimuli greater than S_T; S_T would then be the threshold of the fiber.
However, careful experiments on a number of specific neural systems (sensory and motor, peripheral and central) have shown that such a unique value S_T does not exist; instead, there is a range of stimulus values, S_1 to S_2, such that a stimulus S lying within that range, when repeatedly presented at a rate well below that which would involve the refractory period of the fiber, sometimes evokes and sometimes fails to evoke a response. We find that the fiber responds in a fraction p of all trials and that p(S) is a monotonic function that rises from zero to one as the stimulus increases from S_1 to S_2. Stimuli less than S_1 never evoke a response; stimuli greater than S_2 always evoke a response. We conclude that the threshold of a neuron which exhibits this behavior is a time-varying parameter. The value p approximates the fraction of the time that the threshold is somewhere below the stimulus value S. An equivalent statement is that p approximates (and for large sample size, approaches) the probability of finding the threshold of a fiber below the value S.

* This work was supported in part by the U.S. Army (Signal Corps), the U.S. Air Force (Office of Scientific Research, Air Research and Development Command) and the U.S. Navy (Office of Naval Research).
† Also in the Department of Electrical Engineering, M.I.T.

II. SUMMARY OF STUDIES OF OTHER WORKERS

The class of phenomena that we have been discussing was first observed by Blair and Erlanger (1). They reported that an electric stimulus, repeatedly presented to a single sciatic nerve fiber of the frog, will for most stimulus values either always produce or always fail to produce a response. The transition between these two situations, however, is not sharp. Upon raising the shock intensity, a value is reached at which the fiber sometimes responds and sometimes fails to respond to repeated stimulation.
In order to obtain a response every time it is necessary to raise the shock intensity an additional two per cent, far in excess of the uncontrollable variation in the stimulus. Moreover, Blair and Erlanger were able, on occasion, to record simultaneously from two fibers whose potentials could be distinguished by their difference in latencies. On repeated testing with a near-threshold stimulus, sometimes both would respond, sometimes one, sometimes the other, and sometimes neither. Such a result cannot be accounted for on the basis of stimulus instability alone.

The most complete study of this kind that has been published to date was made by Charles Pecher (2) in 1939. Using a technique similar to that of Blair and Erlanger, he also found a stimulus range in which a fiber sometimes responded and sometimes failed to respond to a constant stimulus. Some of his data appear in Fig. 1. In the column on the left we see the responses to successive identical stimuli, of which some produce a response and some fail to do so. In the second column the intensity was raised four per cent.

Fig. 1. Left: ink tracings of recordings from single units of frog sciatic nerve, showing occurrence and failure of response to repeated presentations of identical shock stimuli. Right: same, with amplitude of pulse producing the shock raised 4 per cent. Each series shown is part of a longer sequence of 100 presentations. Thirty-five responses were obtained with the weaker stimulus (left); 85 responses were obtained with the stronger stimulus (right). After Pecher (2).

In Fig. 2 the percentage of responses of a fiber is plotted as a function of stimulus intensity. Again each point is based on 100 stimulus presentations. The total range of threshold variation is, on the basis of these data, about seven per cent. The function shown in Fig. 2 approximates the threshold probability function p(S) that was discussed earlier.
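The threshold probability function p(S) can be illustrated with a short simulation. The sketch below is not Pecher's procedure. It assumes, purely for illustration, that the threshold of a unit fluctuates from trial to trial as a Gaussian variable, and it counts how often a fixed stimulus exceeds that threshold. The function name `response_fraction`, the parameters `mean_threshold` and `spread`, and all numerical values are hypothetical and are not fitted to the data of Fig. 2.

```python
import random


def response_fraction(stimulus, trials=2000, mean_threshold=100.0,
                      spread=1.2, seed=0):
    """Estimate p(S): the fraction of trials in which the unit responds.

    Assumption (not from the source): on every trial the threshold is
    drawn independently from a Gaussian with the given mean and spread,
    in the same arbitrary units as the stimulus.
    """
    rng = random.Random(seed)  # fixed seed: the same threshold sequence each call
    fired = 0
    for _ in range(trials):
        threshold = rng.gauss(mean_threshold, spread)
        if stimulus > threshold:  # the unit responds only above threshold
            fired += 1
    return fired / trials


if __name__ == "__main__":
    # Sweep the stimulus across the transition region.
    for s in (96, 98, 99, 100, 101, 102, 104):
        print(s, response_fraction(s))
```

With these assumed values the estimated fraction rises monotonically from near zero to near one over a range of a few per cent around the mean threshold, which is the qualitative behavior of the curve in Fig. 2.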
Fig. 2. Relation between stimulus intensity (abscissa) and the number of responses obtained in 100 presentations at a fixed intensity from a single unit of frog sciatic nerve (see Fig. 1). The interpolated solid line approximates the threshold probability function of a unit. From Pecher (2).

Fig. 3. Left: ink tracings of simultaneous recordings from two units of frog sciatic nerve to repeated presentations of identical shock stimuli. Units A and B are identified by their latencies. Right: same, but recording from two other units, identified by their amplitudes. After Pecher (2).

In the left column of Fig. 3 the responses of two different fibers were simultaneously recorded from a single electrode; the responses are distinguishable by their latencies. At a fixed level of stimulation all possible combinations of response occur: fiber A responds alone, fiber B responds alone, both respond, neither responds. On the right we see the responses from two other fibers; here the responses are distinguished by their amplitudes. Again, all possible combinations occur. Such a result can be explained only by spontaneous variation in fiber threshold. If threshold were fixed and the stimulus unstable, then only three of the four combinations could occur. That combination would be excluded in which the fiber with higher threshold fires alone.

When responses from two fibers can be distinguished, an opportunity is offered to test the degree of correlation of threshold fluctuation among different fibers. If fluctuations occur independently in two fibers, the probability of both firing to a single stimulus would be the product of their probabilities of firing separately. Any correlation in threshold variations would alter the probability of joint firing.
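The independence check described above amounts to a short calculation. The following sketch computes the number of joint firings expected if two fibers fluctuate independently and sets it beside the observed count, using the counts reported for Pecher's nine fiber pairs in Table I. The function name `expected_joint` and the layout of `TABLE_I` are illustrative choices, not part of the original study.

```python
# Counts from Table I (Pecher): number of stimuli, responses of fiber A,
# responses of fiber B, observed number of simultaneous responses.
TABLE_I = [
    (100, 78, 25, 19),
    (188, 129, 26, 18),
    (285, 205, 33, 18),
    (222, 150, 79, 56),
    (370, 214, 93, 50),
    (194, 113, 34, 19),
    (155, 110, 62, 40),
    (218, 168, 87, 59),
    (236, 152, 24, 17),
]


def expected_joint(n_stimuli, n_a, n_b):
    """Expected number of joint responses if A and B fire independently.

    P(both) = P(A) * P(B), with each probability approximated by the
    observed fraction of stimuli to which that fiber responded.
    """
    p_a = n_a / n_stimuli
    p_b = n_b / n_stimuli
    return n_stimuli * p_a * p_b


if __name__ == "__main__":
    for n, n_a, n_b, observed in TABLE_I:
        print(n, n_a, n_b, round(expected_joint(n, n_a, n_b), 1), observed)
```

A systematic excess of observed over expected joint responses would point to positively correlated threshold fluctuations in the two fibers.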
The probabilities of separate and joint firing can be approximated by counting the number of times that fiber A fires, that fiber B fires, and that both fire, and dividing each by N, the number of stimuli. In the table below the results of such measurements by Pecher are given for nine different fiber-pairs.

Table I

Number of   Responses    Responses    Calculated simultaneous   Observed simultaneous
stimuli     of fiber A   of fiber B   responses (independence   responses
                                      assumed)
100         78           25           19.5                      19
188         129          26           17.8                      18
285         205          33           23.7                      18
222         150          79           53.4                      56
370         214          93           53.8                      50
194         113          34           19.8                      19
155         110          62           44.0                      40
218         168          87           67.0                      59
236         152          24           15.5                      17

In all of these instances the computed and observed frequencies of joint occurrence are in good agreement. The hypothesis of independent fluctuations is thus supported by this experiment.

Pecher tried to determine whether or not for a single fiber the 'response no-response' pattern to a sequence of periodic stimuli can be accounted for by the hypothesis that successive responses occur with equal and independent probability p. He chose a criterion of independence that relates the variables r and n_r, where n_r is the number of times that a sequence of r successive responses (bounded at each end by the absence of a response) occurs in a sample of length N (r [...]

Fig. 5. Histogram showing the number of spinal motoneurons (triceps surae) within each firing index interval; responses were obtained by delivering repeated shocks to the gastrocnemius nerve. The firing index of a unit is the percentage of total stimulus presentations to which the unit responds. Units with firing indices of zero and 100 are not included in this diagram. From Lloyd and McIntyre (5).

[...] same amount; thus some units with a firing index of zero will be shifted into the intermediate range; some with intermediate firing indices will be shifted into the range of firing index 100.
But because the units are uniformly distributed the same number will move into the intermediate range as move out of it, and the distribution of intermediate firing indices will remain unchanged.

Fig. 6. Idealized relation between the threshold probability distribution of a motoneuron and the levels of synaptic drive to different motoneurons of a population (see text).

The particular choice of a bell-shaped probability distribution will lead to the U-shaped histogram of Fig. 5. For it is clear that if we divide the abscissa in such a way that equal areas under the distribution are subtended, those intervals will be largest near the tails of the distribution (firing indices near 0 and 100) and smallest at the center of the distribution (firing index near 50). Since the density of units along the abscissa is uniform, this means that many more motoneurons will have firing indices between 0 and 10 than between 45 and 55.

As in the study by Pecher, the degree of correlation of threshold variation for members of the same pool of motoneurons was investigated. The extent of correlated and uncorrelated fluctuations is a measure of the relative importance in producing fluctuations of events extrinsic and intrinsic to the fiber. In the spinal cord there is reason to believe that threshold fluctuation is, at least in part, the effect of background activity in other fibers. Such activity would presumably affect many fibers in a neighborhood; the threshold fluctuations of these fibers would therefore show definite correlations.

To determine the extent of correlated variation Rall and Hunt (6) recorded the response of a ventral root together with the response of a single motoneuron belonging to an adjacent root; an example of such a recording is shown in Fig. 7. Fig. 8 shows the results of an experiment based on a thousand such responses. The population response amplitudes were divided into class intervals, and the number of responses within each class interval was plotted. For each population response within a class interval, the occurrence or failure of a unit response was noted and the number of unit responses plotted (shaded area). The unit responded a total of 697 times out of 1000. If the population response and the unit response were not correlated, the firing index of the unit would be about the same in each class interval. This is clearly not the case. Instead, firing occurs infrequently when the population response is small, and more often as the population response grows. The probability of unit firing when the population response amplitude is in a given class interval (that is, the ratio of shaded to unshaded amplitude) is plotted in the lower part of the figure. If unit response and population amplitudes were uncorrelated this function would be a horizontal line at about 0.7.

Fig. 7. Simultaneous recording of the responses of a single motoneuron (horizontal deflection) and of an adjacent ventral root (vertical deflection) upon repeated stimulation of the gastrocnemius nerve with identical shock stimuli. From Rall and Hunt (6).

However, it is also clear that correlation of unit and population response is not complete. In other words, the thresholds of the units within the population vary with respect to one another, in addition to their collective (that is, correlated) fluctuation. If this were not so, a particular unit would respond only after all units of lower threshold had responded; therefore its probability of response would be zero if the population response were smaller than a certain value, and would be one if the population response were larger than that value. The lower curve would therefore be a step function.

III. POSSIBLE SOURCES OF THRESHOLD VARIATIONS

Fatt and Katz (7) have found that at motor endplates miniature end-plate potentials occur more or less randomly even though no stimulus is present. They regard these potentials as being the result of spontaneous firings in the fine terminal branches of a motor nerve. The occurrence of an impulse in the nerve causes simultaneous firing in about a hundred such terminals, giving rise to the normal end-plate potential. Spontaneous firing implies the existence of a local source of varying excitation. Fatt and Katz compute that for fibers with a diameter of 0.1 μ thermal fluctuations in ionic concentrations could cause variations of resting potential of 1 mV to 2 mV. Though probably insufficient to produce excitation, such a variation would cause threshold fluctuations and contribute to spontaneous firing.

Fig. 8. Top: the upper curve is a histogram of population response amplitudes obtained as in Fig. 7 from triceps surae motoneurons by delivering repeated identical shock stimuli to the gastrocnemius nerve. The lower curve (shaded) was obtained from single-unit recordings like those shown in Fig. 7; the number of single-unit responses associated with population responses in each amplitude interval is plotted. Bottom: for a given population amplitude interval the number of single-unit responses is divided by the total number of trials in that interval, and the ratio plotted as a function of population amplitude. The interpolated solid curve is a sigmoid fit to the data points and approximates the probability of unit response as a function of population amplitude. Note that when the population amplitude is large, the probability of unit response is large, and when the population response is small, the single-unit probability is small, thus signifying a high degree of correlation among the thresholds of different units of the population. From Rall and Hunt (6).

Both Pecher (2) and Hunt (8) have discussed possible sources of threshold fluctuation. Pecher considers in detail the apparent threshold variation that would result from statistical variations in the number of ions traversing the axon membrane when a constant potential is applied across it. Assuming that the excitatory current that he uses is uniformly distributed over a cross section of the nerve trunk, he concludes that at threshold about a million ions traverse a single nerve fiber. The statistical variation in this number of ions is given by its square root, leading to a variation of about 0.1 per cent. This is several orders of magnitude below the range of threshold variation that he observed. However, he points out that the number of ions actually effecting excitation is probably considerably less than the value mentioned above and the resultant variability correspondingly greater. Pecher also considers as a possible source of threshold fluctuations local statistical variations of membrane potential, of the sort discussed by Fatt and Katz.

Hunt discusses two classes of possible sources of threshold fluctuation for spinal motoneurons: (a) sources with a local origin such as we have mentioned above, which give rise to an independent component of threshold variation, and (b) sources whose effect is felt by many fibers and which therefore produce at least partially correlated variations in threshold. In the latter category are included the effects of activity of spinal interneurons.
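Pecher's square-root estimate, mentioned above, can be checked with a few lines of arithmetic. The sketch below assumes that the number of ions behaves like a Poisson count, so that its standard deviation equals the square root of its mean; this is our reading of the phrase 'given by its square root', and the function names are hypothetical.

```python
import math


def relative_fluctuation(n_ions):
    """Relative statistical variation, sqrt(n) / n, of a count of n ions.

    Assumption: Poisson-like counting statistics, so that the standard
    deviation of the count is the square root of its mean.
    """
    return math.sqrt(n_ions) / n_ions


def ions_for_fluctuation(fraction):
    """Number of ions whose relative fluctuation equals `fraction`."""
    return 1.0 / fraction ** 2


if __name__ == "__main__":
    print(relative_fluctuation(1_000_000))  # about a million ions at threshold
    print(ions_for_fluctuation(0.01))       # ions needed for a 1 per cent variation
```

A million ions give a relative variation of 0.1 per cent, as stated in the text. A variation of 1 per cent would correspond to only ten thousand ions, which illustrates Pecher's remark that a smaller effective number of ions implies a correspondingly greater variability.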
By using a drug (myanesin), in doses that block transmission through polysynaptic paths without reducing monosynaptic reflex responses, a considerable reduction in the range of variation of population response amplitudes was obtained. On the basis of this result it appears likely that internuncial activity is important in producing correlated threshold changes in spinal motoneurons.

IV. A MATHEMATICAL MODEL

Let us consider a mathematical model which is based on the concept of fluctuating thresholds, and which attempts to derive the ensemble behavior of large numbers of neural elements from assumed properties of neural units in a specific area of the nervous system (9, 10, 11). This model is based on data obtained from the peripheral auditory system of the cat. When an electrode is placed near the round window of the cochlea, responses to clicks can be detected; such responses contain a component that represents the summated activity of peripheral auditory neurons. Fig. 9 shows such population responses at a number of intensities. In Fig. 10 the average peak-to-peak amplitude of such responses has been plotted as a function of stimulus intensity. The resultant 'intensity function' relates the number of units firing and the intensity of the stimulus.

The present version of the model (11) postulates the existence of several independent populations of neural units; within a population all units are identical. The threshold of a unit is a fluctuating parameter which can be described by a probability distribution; threshold variations in different units occur independently. At a rate of stimulation slower than one per second the 'response no-response' sequence obtained from a single unit is assumed to consist of a series of independent events. Thus we postulate units whose statistical properties resemble those found by Pecher in the frog's sciatic nerve.
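The postulates of the model can be made concrete with a small simulation. The sketch below treats a single population of identical, independent units, whereas the model itself allows several populations. It assumes, for illustration only, that the threshold of each unit follows a Gaussian distribution on the decibel scale. The names `firing_probability` and `population_response` and the parameter values (`mean_db`, `sd_db`, `n_units`) are hypothetical and are not taken from the cat data.

```python
import random
from statistics import NormalDist


def firing_probability(stimulus_db, mean_db=-60.0, sd_db=10.0):
    """Probability that one unit's threshold lies below the stimulus.

    Assumption: a Gaussian threshold distribution in dB; the source only
    states that the threshold follows some probability distribution.
    """
    return NormalDist(mean_db, sd_db).cdf(stimulus_db)


def population_response(stimulus_db, n_units=500, seed=0):
    """Number of units firing to one click in a single population.

    Every unit fires independently with the same probability, as the
    model postulates for identical units within a population.
    """
    rng = random.Random(seed)
    p = firing_probability(stimulus_db)
    return sum(1 for _ in range(n_units) if rng.random() < p)


if __name__ == "__main__":
    # A simulated 'intensity function': units firing against click intensity.
    for db in range(-100, 1, 10):
        print(db, population_response(db))
```

Because the same seed is reused at every intensity, the simulated intensity function is monotonic. Drawing a fresh seed for each presentation would add the trial-to-trial variability of response amplitude that the second group of experiments examines.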
The experiments used to test the model fall into three classes: two-click experiments (9, 10), measurements of variability of response amplitude (11), and studies of masking of click responses by noise.

Fig. 9. Ink tracings of responses obtained from an anesthetized cat to clicks over a 90-dB range. The electrode was located near the round window. Note that the voltage gain of the recording equipment was reduced by 12 dB (factor of 1/4) for stimulus intensities above -40 dB. The first peak represents the summated activity of first-order auditory neurons. With this calibration, click threshold for humans (verbal report) is about -95 dB. [Traces are labeled by stimulus intensity in dB re 1.29 volts; time scale 3 msec.]

When two clicks are delivered at an interval of less than approximately 100 msec the population response to the second click is smaller than it would be if the first click had not occurred. This effect is more pronounced the stronger the first click and the smaller the interclick interval, as illustrated in Fig. 11. Consider the ratio of the response amplitude ${}_1R_2$ to a second click and the response amplitude $R_2$ to the same click presented alone. In Fig. 12 this ratio is plotted, for a fixed second-click intensity, as a function of the intensity of the first click. The parameter is the interval between clicks, $\Delta t$. If we assume a one-population model, we obtain the result that the ratio ${}_1R_2/R_2$ is linearly related to the intensity function for the first click, provided that the second-click intensity ($S_2$) and $\Delta t$ are held constant. Specifically, we obtain

$$\frac{{}_1R_2}{R_2} = 1 - R_1\,[1 - g(S_2, \Delta t)] \tag{1}$$

Determination of a single intensity function therefore permits us to predict the dependence of this ratio on $S_1$ for any value of $S_2$ and of $\Delta t$.
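A short numerical sketch may make eq. (1) more concrete. It assumes that $R_1$ is expressed as a fraction of the maximum of the intensity function, so that it runs from 0 to 1; this normalization is an assumption of the sketch and is not stated explicitly in the text. The value g = 0.27 is the one listed in the legend of Fig. 12 for an interclick interval of 12.2 msec; the sample values of $R_1$ and the helper names `two_click_ratio` and `fit_g` are hypothetical.

```python
def two_click_ratio(r1, g):
    """Eq. (1): ratio of the second-click response after a first click
    to the response to the same second click presented alone.

    r1 : first-click intensity function, as a fraction of its maximum (assumed)
    g  : the constant g(S2, delta t) for the chosen second click and interval
    """
    return 1.0 - r1 * (1.0 - g)


def fit_g(r1_values, ratios):
    """Least-squares estimate of g from measured ratios.

    Eq. (1) can be written (1 - ratio) = (1 - g) * r1, a line through the
    origin, so the slope is sum(r1 * (1 - ratio)) / sum(r1 ** 2).
    """
    numerator = sum(r * (1.0 - y) for r, y in zip(r1_values, ratios))
    denominator = sum(r * r for r in r1_values)
    return 1.0 - numerator / denominator


G_12_2_MSEC = 0.27                          # from the legend of Fig. 12
r1_levels = [0.0, 0.25, 0.5, 0.75, 1.0]     # hypothetical sample points
predicted = [two_click_ratio(r, G_12_2_MSEC) for r in r1_levels]

print(predicted)
print(fit_g(r1_levels, predicted))
```

The ratio falls linearly from 1, when the first click excites no units, to g, when it excites all of them, which is why a single constant per curve is enough once the intensity function is known.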
We may in each case choose one constant, $g(S_2, \Delta t)$. Fig. 12 shows a number of fits to the data points which were obtained in this way; $S_2$ is constant and each curve corresponds to a different value of $\Delta t$.

Fig. 10. Intensity function (open circles): amplitude and latency of $N_1$ as a function of the intensity of the click stimulus (click intensity in dB re 1.29 V across phone). $N_1$ is the first diphasic response component seen in the traces of Fig. 9. The amplitude measurement is made between the positive and negative peaks of $N_1$. Each plotted point is the median of about ten such measurements.

Fig. 11. Two-click paradigm: the responses shown are to a constant intensity (-45 dB) second click. The vertical set shows the effect of varying the intensity of the first click; the horizontal set shows the effect of varying the interval between clicks. Upper right: response to a -45 dB click presented alone. From McGill (10).

In a second group of experiments the standard deviation of a hundred response amplitudes was computed at each stimulus intensity, and the result was plotted as a function of stimulus intensity. It is readily shown that $N$ independent units, each with a probability $p$ of firing, will have a standard deviation of total response proportional to $\sqrt{Np(1-p)}$. As a function of $p$ this quantity has minima at zero and one and has a maximum at $p = \tfrac{1}{2}$. The value of $p$ at any stimulus intensity can be obtained from the intensity function.
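The dependence of the variability on the firing probability can be checked with a few lines of code. This is a minimal sketch of the binomial expression only; it leaves out any stimulus-independent component of variability, and the population size of 100 units and the function name `response_sd` are hypothetical.

```python
import math


def response_sd(n_units, p):
    """Standard deviation of the number of firing units among n_units
    independent units, each firing with probability p (binomial result)."""
    return math.sqrt(n_units * p * (1.0 - p))


# Scan p from 0 to 1 in steps of 0.01 and locate the maximum.
grid = [i / 100 for i in range(101)]
peak_p = max(grid, key=lambda p: response_sd(100, p))

print(peak_p)
print(response_sd(100, 0.0), response_sd(100, 0.5), response_sd(100, 1.0))
```

Because p is read from the intensity function, the peak of the variability curve is expected at the stimulus level where the corresponding component of the intensity function reaches half its maximum amplitude.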
Fig. 12. ${}_1R_2/R_2$ (see text) as a function of first click intensity (in dB below reference level). In each block this ratio is plotted for a different interclick interval, as indicated at the lower right. The intensity of the second click was -45 dB throughout. The curves are obtained from the first click intensity function and eq. (1); the parameter $g(\Delta t)$, whose values are given at the lower right, is chosen in each case to give the best fit to the data. After McGill (10).

    S2 = -45 dB
    Block   Δt (msec)   g(Δt)
    A         6.4       0.05
    B         9.1       0.15
    C        12.2       0.27
    D        21.5       0.46
    E        30         0.54
    F        61         0.70
    G       102         0.83

Fig. 13. Intensity function (upper) and the corresponding amplitude variance function predicted by the model: (a) for one population; (b) for two disjoint populations. $\sigma_0$ was chosen arbitrarily. Note that a peak of the variability function occurs at the stimulus value at which an intensity function component reaches half its maximum amplitude.

Fig. 13 shows the kind of variability function obtained by assuming one and two disjoint populations; $\sigma_0$ is the stimulus-independent component of variability arising from biological and non-biological sources. We have shown (11) that instability in stimulus intensity, which would also lead to a peaked variability function, can account for at most three per cent of the observed variability.

A detailed study of the shape of the intensity function led us to postulate two populations of neural units, one consisting of 'sensitive' units and one of 'insensitive' units. In the three animals tested, variability measurements over the sensitive range are in good agreement with the theory stated above. One case is shown in Fig. 14. The intensity function and the probabilities obtained from it are shown with the derived standard deviation function.
Here, $\sigma_0$ is determined from measurements of baseline variability in the absence of a stimulus; $N$ is chosen to give the best fit to the data. Over the sensitive range (-100 dB to -60 dB) the data fall within the indicated confidence interval approximately seventy per cent of the time, as they should if the model is correct. Over the insensitive range of the intensity function (above -60 dB), the standard deviation shows a complex behavior which cannot be simply reconciled with the idea of a single population over that interval.

Fig. 14. Comparison of the theoretical variability function (with 70 per cent confidence limits) and the measured values of $\sigma$, over the range of initial growth of the intensity function. Each point represented by a solid circle is based on 100 responses; the open circles are based on the first fifty of these responses. The corresponding intensity function, and the probabilities obtained from it, are also shown. [Abscissa: click intensity in dB re 1.29 V across phone.]

The third aspect of this study concerns the masking of the neural responses to clicks by a background noise. Fig. 15 shows the effect of a constant noise level on response amplitude at several stimulus values. In Fig. 16 we have plotted these masked and unmasked intensity functions. The observation was made that a very weak level of continuous noise was sufficient to reduce almost to zero the $N_1$ response to a fairly intense click. A fixed threshold model would predict masking of only the units whose thresholds are below the noise level. If the threshold fluctuates, however, and does so rapidly, nearly all units of a given population will drop below the noise level and fire in a short interval preceding the click; this assumes, of course, that the noise level lies within the range of threshold fluctuations of the unit.

Fig. 15. Ink tracings of responses obtained from an anesthetized cat to clicks over a 60-dB range, with and without background noise; noise level, -82 dB. Note that the voltage gain of the recording equipment was reduced by 12 dB (factor of 1/4) at a click intensity of -30 dB. [Panels: response to click alone; click response in presence of noise (-82 dB re 1.29 volts). Traces are labeled by click intensity in dB re 1.29 volts; time scale 3 msec.]

By a quantitative treatment based on these qualitative notions we have been able to show (a) that the hypothesis of a fixed threshold does not account for the observed data and (b) that over the sensitive range of the intensity function a single population of units making threshold 'jumps' at a rate of about 2000 times per second can account for the data. In addition, it is observed that low level noise has little effect on the intensity function over the insensitive range, except to reduce it by the constant contribution of the sensitive population. The need for a division of units into at least two populations is thus confirmed.

Fig. 16. Intensity functions for clicks, with and without noise background; noise levels -92, -82 and -67 dB. Each point of the masked functions represents the average $N_1$ amplitude of ten responses to identical stimuli. The upper curve was obtained by averaging the three unmasked functions which correspond to the masked functions shown; thus each point represents the average $N_1$ amplitude of thirty responses to identical stimuli. Typical data on which these curves are based are shown in Fig. 15. [Abscissa: click intensity in dB re 1.29 V across phone; noise background, unfiltered, in dB re 1.29 V across phone.]
When the noise level is raised into the insensitive range the observed effect is not nearly so marked, implying either that more than one population is involved in that interval or that the rate of threshold fluctuation is considerably slower than for the sensitive units. It is noteworthy that population analyses based on two very different experiments, variability and masking, have a great deal in common.

REFERENCES

1. E. A. Blair and J. Erlanger: A comparison of the characteristics of axons through their individual electrical responses. Amer. J. Physiol. 106, 524-564 (1933).
2. C. Pecher: La fluctuation d'excitabilité de la fibre nerveuse. Arch. Int. Physiol. 49, 129-152 (1939).
3. A. Hald: Statistical Theory with Engineering Applications, p. 344 (eq. 13.3.6), J. Wiley and Sons, New York (1952).
4. W. A. Rosenblith: Some electrical responses from the auditory nervous system. Proceedings of the Symposium on Information Networks, Polytechnic Institute of Brooklyn, 223-247 (1954).
5. D. P. C. Lloyd and A. K. McIntyre: Monosynaptic reflex responses of individual motoneurons. J. Gen. Physiol. 38, 771-787 (1955).
6. W. Rall and C. C. Hunt: Analysis of reflex variability in terms of partially correlated excitability fluctuation in a population of motoneurons. J. Gen. Physiol. 39, 397-422 (1956).
7. P. Fatt and B. Katz: Spontaneous subthreshold activity at motor nerve endings. J. Physiol. 117, 109-128 (1952).
8. C. C. Hunt: Temporal fluctuation in excitability of spinal motoneurons and its influence on monosynaptic reflex response. J. Gen. Physiol. 38, 801-811 (1955).
9. W. J. McGill and W. A. Rosenblith: Electrical responses to two clicks: a simple statistical interpretation. Bull. Math. Biophys. 13, 69-77 (1951).
10. W. J. McGill: A statistical description of neural responses to clicks recorded at the round window of the cat. Ph.D. Thesis, Harvard University (1952).
11. L. S.
Frishkopf: A probability approach to certain neuroelectric phenomena. Research Laboratory of Electronics, Massachusetts Institute of Technology, Tech. Rep. No. 307 (1956).

PART III

DETERMINATION OF INFORMATION MEASURES

It is possible (as shown by several papers in this volume) to apply information theory to biology without introducing any actual information measures. Indeed, if one considers that it is very difficult to estimate information measures for living systems, and that the resulting measures are of an irreducibly relative nature, one might wonder whether it is worthwhile to take such measures at all. However, it is difficult if not impossible to validate firmly the application of information theory without critical tests based on quantitative measurements; moreover, one hopes to discover lawful relations in the results of the measurements themselves. So, attempts are being made to estimate information contents associated with various biological structures and functions. All the papers in this part are chiefly concerned with such estimations; some from a general point of view, some with regard to particular systems, ranging in complexity all the way from simple molecules to whole men.

H. Q.

CHEMISTRY AND BIOCHEMISTRY AT LOW TEMPERATURES AND DISCRIMINATION OF STATES AND REACTIVITIES*

Simon Freed
Chemistry Department, Brookhaven National Laboratory, Upton, New York

Abstract. In order to apply information theory to biochemistry and biology at the molecular level it is advantageous to reduce the number of classifications and specifications involved by reducing the temperature of the system. In this way the number of species and states with their reactivities is reduced. At the same time the chemical noise level falls and in consequence a resolution may be obtained between components whose properties are practically indistinguishable at ordinary temperatures.
Weakly bonded systems and intermediates become more easily detectable not only because of an increase in their concentration, that is, an increase in their signal, but in addition because the noise level is weaker at the lower temperature. Illustrations are given from chemistry where reactions in solutions proceed at temperatures approaching that of liquid nitrogen. The information content of irreversible reactions at room temperature may be thought of as being stored in intermediates that participate in reversible reactions at the low temperatures. Once the properties of the more stable states have been understood, the way is clear for investigating the system in its thermally active states since allowance can be made for the presence of the former. In this way, an ordering of experimentation according to temperature will bring into activity successive components of the system. Examples have been selected mainly from work on the preservation of biological systems at low temperatures which indicate that biochemical and biological processes may likewise be investigated and that the finer discriminations and specificities associated with lower temperatures may be brought to light in these fields also.

If we wish to measure a physical property, such as electrical conductivity or viscosity, with an instrument which we have no intention of modifying, there is little point in seeking the information content of the instrument. On the other hand, if we wish to employ chemical substances as probes for uncovering structures of enzymes by means of enzyme-substrate reactions, we are at once confronted by the need of the structural and functional information of our probes. In fact we are discussing properties at the molecular level. Pure substances at this level are mixtures composed of molecules in various energy states with their characteristic configurations, motions, and reactivities.
The application of information theory to biology at the molecular level requires therefore a great expansion in the number of categories and specifications. It is to reduce this number in a systematic manner and make these categories more precise that I wish to draw upon the relation that has been recognized between information and entropy which asserts that the amount of information required to specify the system will be less at lower temperatures. The system will redistribute itself from higher to lower energy levels so that only the more basic ones remain appreciably occupied. Fewer chemical species are now present and also active. There has been, in a sense, a reduction in chemical noise differing in its frequency spectrum from the continuum characteristic of an electrical conductor. Chemical noise reflects the structural properties of molecules and may consist of dominant discrete frequencies associated with virtual continua of modulations. Usually these represent coupling of the electronic system of the molecule in a given atomic configuration with its own vibrations, restricted rotations, etc. If the molecules are complex, fluctuations between different atomic configurations may contribute to the noise. In addition, coupling of the molecule in each of its states with the molecules of its environment in different configurations leads to more and more densely spaced energy levels which I referred to as the continua.

* Research performed under the auspices of the U.S. Atomic Energy Commission.

Fig. 1. The variation of absorption spectrum of praseodymium chloride with temperature. Line drawings of visible absorption spectra of crystals of anhydrous praseodymium chloride (PrCl3) at room temperature, at that of liquid nitrogen, and that of liquid helium. Sharper spectra, improved resolution, and fewer lines are evident at lower temperatures. The fewer lines correspond to fewer energy states which are occupied by the praseodymium ions. At room temperature the blocks of diffuse spectra are actually not uniform in intensity but are more intense as a rule in those regions where the spectrum of the crystal at 77°K possessed its most intense line spectrum. The greater diffuseness of the lines and their increased numbers at the higher temperature may be regarded as chemical noise associated with the spectroscopic signals from the more stable states at the lowest temperature.

A reduction in temperature removes thermal energy required to activate some motions and effect changes in configurations, and reduces the number of perturbations of a given configuration. Not only are fewer species present but each species is more sharply defined; thus, less information is required for specifying the system than at higher temperature. Clearly, the system is now more specific in its reactions than at higher temperature and its specificity can be related to more sharply defined geometric configurations. The chemical system has become a more precise probe.

The following illustrations have been selected for the simplicity of their phenomena rather than for their direct relevance to biology. The sharp absorption spectrum of a crystal of a rare earth salt (Fig. 1) shows very clearly that at the lower temperature fewer lines are present; they are sharper and more clearly resolved and the general diffuse background prominent at the higher temperature (not shown in the line drawing of the figure) becomes decidedly weaker. There are then fewer kinds of absorption centers at the lower temperature and, because the stable states are exposed to more sharply defined environmental fields, there are fewer kinds of perturbations.
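The redistribution toward lower energy levels, and the accompanying drop in the information needed to specify the state, can be illustrated with a two-level Boltzmann calculation. This is a sketch under stated assumptions and is not taken from the paper: a single molecule is given just two non-degenerate levels, the energy gap of 200 cm^-1 is a hypothetical value, and the three temperatures stand for room temperature, liquid nitrogen and liquid helium as in Fig. 1.

```python
import math

# Boltzmann constant expressed in wavenumbers per kelvin (k_B / hc).
K_B_CM = 0.695


def upper_level_fraction(gap_cm, temperature_k):
    """Equilibrium fraction of molecules in the upper of two
    non-degenerate levels separated by gap_cm (in cm^-1)."""
    x = math.exp(-gap_cm / (K_B_CM * temperature_k))
    return x / (1.0 + x)


def specification_bits(gap_cm, temperature_k):
    """Shannon entropy, in bits, of the two-level occupation:
    the average information needed to say which level is occupied."""
    p = upper_level_fraction(gap_cm, temperature_k)
    bits = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            bits -= q * math.log2(q)
    return bits


GAP_CM = 200.0  # hypothetical energy gap, for illustration only

for temperature in (300.0, 77.0, 4.0):
    fraction = upper_level_fraction(GAP_CM, temperature)
    bits = specification_bits(GAP_CM, temperature)
    print(f"{temperature:6.1f} K  upper-level fraction {fraction:.3e}  bits {bits:.3e}")
```

Under these assumptions the upper level is appreciably occupied at room temperature and essentially empty at the temperature of liquid helium, so the entropy per molecule falls toward zero as the temperature is reduced.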
An especially vivid example of a solution showing somewhat similar phenomena is given by the fluorescence spectrum of solutions of europium chloride in ethanol at various temperatures (1). The spectra were taken to discover the discrete number of lines in the three separate sets which may furnish the point group symmetry of the electrical fields about europium ion in the solution. It is clear that at room temperature the continuous noise is so great as to make enumeration impossible. As the temperature is lowered a few discrete lines can be resolved with such definiteness that they serve to eliminate some of the possible point group symmetries. At the temperature of liquid nitrogen and even at the temperature of dry ice adequate resolution is clearly achieved and the number of possible symmetries of the environmental fields is reduced to one only.

Fig. 2. Absorption spectra of carotene (90% alpha and 10% beta). A: in heptane at room temperature; B: in equal volumes of liquid propane and propene at 77°K.

Figure 2 gives the absorption spectrum of a substance of some biological interest, β-carotene, and illustrates the increased contrast between absorption and transmission at the lower temperature, that is, the increased signal to noise ratio.

Figure 3 is presented to illustrate the resolution into components of what is apparently a single species at room temperature. The figure reproduces the absorption spectra of chlorophyll b in ethyl ether and methanol (2). Our first inclination is to ascribe the differences in the spectra to the perturbations produced on the structure of the chlorophyll molecules by the two types of solvent molecules. Figure 3b is a magnification of the Soret band in the blue region and shows that a solution of chlorophyll b in ether is really a mixture of two species (etherates) in equilibrium with each other in roughly equal amounts and clearly resolved at 180°K. A study of the dependence on temperature of the absorption spectrum of chlorophyll b in methanol reveals that in this solvent, chlorophyll b also exists as a mixture of solvates which are about equal in concentration at room temperature and together they yield the composite spectrum. However the spectrum of each alcoholate differs very little in shape from that of each etherate.

Fig. 3a. Absorption spectra of chlorophyll b at room temperature. The thin-lined curve with maxima at shorter wavelengths represents a solution of chlorophyll in ethyl ether; the thick-lined curve gives the spectrum when the solvent is methanol. [Abscissa: wavelength in mμ.]

Fig. 3b. The dependence of the absorption spectra of chlorophyll b on temperature. Only the Soret band in the blue is shown. Enlarged scale of wavelengths. At 300°K the solvent is 20% propyl ether, 80% hexane. At the lower temperature it is 20% propyl ether, 40% propane, and 40% propene. The hexane was substituted at 300°K for the hydrocarbons propane-propene since they are normally gases at room temperature.

Fig. 4 illustrates a form stable at a lower temperature reacting to produce reversibly a stable intermediate but at still higher temperature ending in an irreversible reaction. The following specific observations may prove worthwhile in illustrating what is probably a rather common phenomenon. Chlorophyll b dissolved in ether is deposited as a green powder by pumping off the ether at room temperature. When the temperature of the powder is reduced to that of dry ice (about 193°K) and propylamine is condensed upon it at this temperature, it dissolves quickly, forming a red solution. Note in Fig. 4 the new absorption between 5000 Å and 6000 Å. A rise in temperature transforms the color into the green of chlorophyll with its characteristic spectrum which reverts back reversibly to the red substance when the temperature is reduced. However, if the temperature is kept any length of time at about 235°K or higher, an irreversible reaction sets in. For example, at room temperature the red color lasts only a fraction of a second. This evanescent red color is produced in the well known phase test for chlorophyll.

Fig. 4. Chlorophyll b in 15% mono-i-propyl amine in 1 : 1 propane-propene, at 230°K and 193°K. To show the presence of the red-brown intermediate stable at 193°K which is in equilibrium with the original chlorophyll. At temperatures higher than about 235°K, an irreversible reaction occurs.

Figure 5 represents a chemical reaction which appears rapid even between 167°K and 75°K. Chlorophyll b dissolved in di-iso-propylamine is undergoing transformation probably in an acid-base reaction. The quick readjustment to equilibrium is shown by the interchange in relative intensities of the bands in the red region. The band furthest towards the red grows in as the temperature is reduced, at the expense of the band near it toward shorter wavelengths.

Fig. 5. Chlorophyll b in 15% dipropylamine diluted with equal proportions of propane and propene. A chemical readjustment toward equilibrium occurs between 170°K and 75°K. [Abscissa: wavelength in Ångstroms.]

That these reactions occur rapidly at such temperatures is not very surprising since little heat of activation is required for this type of reaction. Figure 6 depicts a type of oxidation-reduction at low temperatures. When iodine is finely divided it rapidly dissolves in isoprene at the temperature of dry-ice, 193°K.
A brown solution forms at the solid-liquid interface but it decolorizes very quickly, becoming colorless a little distance from the iodine surface. In the light of other investigations it was surmised that the solution is brown because of the presence of a 1:1 (molecular iodine-hydrocarbon molecule) addition compound which possesses a characteristic absorption band in the ultraviolet region. To build up any appreciable concentration of this compound it would evidently be necessary to make solutions of iodine in isoprene below 193°K. When a solution of isoprene in propane (to which propene had been added to increase the solubility of isoprene) at the temperature of liquid nitrogen (77°K) is mixed with a solution of iodine in propane and propene, the new band anticipated in the ultraviolet does not appear within a day or two. Figure 6 indicates what happens when such a solution is warmed. At 146°K the absorption band shown is due to the iodine-propene molecular addition compound which has been identified in a previous experiment. At 150°K appears the anticipated new band arising from the compound iodine-isoprene. At 154°K, this band quickly disappears irreversibly and at the same time decoloration of the solution occurs. The molecular iodine has been removed, presumably by the halogenation of the double-bond system of isoprene, just as had occurred when solid iodine reacted with isoprene at the temperature of dry ice.

Fig. 6. Isoprene dissolved in 1 : 1 propane-propene to which iodine dissolved in 1 : 1 propane-propene has been added. The new absorption band which appears at 150°K is due to a 1 : 1 molecule addition compound of the iodine to isoprene. Its disappearance at 154°K is due to an irreversible reaction, probably halogenation across the double bond.
This oxidation appears to require the prior formation of the intermediate molecular addition compound stable at about 150°K at the concentrations employed.

By investigating the properties and reactions from the lowest practicable temperature upward we would observe the appearance of new thermally activated states and their subsequent reactions.

In analogy with the phenomena illustrated we would expect that a knowledge of biochemical and even biological processes of considerable value may be gained by investigations at low temperature. Support for these expectations comes mainly from recent investigations directed toward the preservation of cells, tissues, and entire organisms. Even more cogent for our purposes are the instances of partial preservation at low temperatures which becomes more effective at still lower temperatures. Unless explicit references are given, the following examples are drawn from the excellent review by Audrey U. Smith (3).

For example, H. F. Smart found that twenty-one species of bacteria, yeasts, and molds continued to multiply in frozen media at 264.1°K. Sizer and Josephson found that lipase was active at 248.5°K, tryptic digestion proceeded at 258°K, and that invertase continued to hydrolyze sucrose at 255°K. At 203°K, however, they could detect no hydrolysis during several weeks. In the preservation of red blood cells, about ten per cent deterioration occurs per year at dry ice temperature, 193°K, but scarcely any loss is incurred when they are kept at the temperature of liquid air, 80°K. Ovarian tissue failed to survive nine days at 193°K but survived more than a year at 80°K under otherwise similar conditions (4). Revival of rats after cooling to 273.5°K was reported by Andjus (5, 6).

Irreversible reactions are then clearly progressing at low temperatures, in red blood cells and ovarian tissue at 193°K and at somewhat higher temperatures in the enzymatic reactions.
If the simple reactions such as those of isoprene and iodine, chlorophyll and propylamine serve as models, the irreversible reactions are preceded in their first and intermediate stages by reversible reactions at still lower temperatures.*

* Lovelock (7) ascribes the deterioration of red cells to a physical mechanism rather than to a chemical process, namely, that the dissolution of lipoprotein and other components of the cell membrane proceeds more rapidly than the biochemical processes can repair them at the low temperature. Since the lipoprotein etc. is presumably bound as an integral part of molecules composing the membrane material, the physical process may also be initiated by reversible chemical transformations.

Becquerel found that rotifers, spores of bacteria, non-sporing bacteria, algae, lichens, mosses, and seeds of higher plants, after having been dried in a vacuum of 10^^ mm Hg over barium oxide, could be successfully kept at the temperature of liquid helium (4°K). Parkes showed that human spermatozoa survived exposure and storage at 80°K. Ovarian, testicular, pituitary, and adrenal tissue have given functional grafts after storage at 80°K, especially if glycerine was added. Luyet established that vinegar eels, spermatozoa, muscle fibres of frogs, and hearts of embryonic chicks could be revived after sudden cooling to the temperature of liquid air (80°K). It is then not surprising that enzymes have been cooled to such temperatures without loss of subsequent potency.

It would seem then that a number of biochemical and biological processes are available for study at low temperatures. I shall consider both homogeneous and heterogeneous solutions. The first implies that solvents must maintain all the reactants in solutions fluid at low temperatures. It would seem well worthwhile to employ conventional solutions at as low temperatures as possible, and aqueous systems near zero degrees or under supercooled conditions. It has been shown (8) that proteins such as enzymes are soluble in some non-aqueous solvents and that a few enzymes can be recovered with virtually their full potency. Since some of the solvents have melting points below that of water they can be utilized for investigations of solutions of proteins at relatively low temperatures. It appears entirely possible that had the solution process been carried out at lower temperature a larger fraction of the enzymes would have been recovered without deterioration. Indeed it may prove fruitful to undertake studies at low temperatures of the first stages of reactions which are toxic at ordinary temperatures since the toxic substances may be removed at temperatures so low that little permanent injury is done to the enzyme or organism.

In analogy with the dissolution of finely divided chlorophyll and iodine by solvents at low temperatures it is to be expected that at low temperatures heterogeneous reactions are also possible between substances in solution and biological materials having high specific areas. Ready-made for such reactions with solutions seem sections of tissue with water removed by freeze-drying. Likewise Becquerel's procedure of removing water by pumping at room temperature would prepare material for reaction at low temperature. Some of the reactions with the surfaces constitute a generalized staining. Many staining processes are acid-base reactions and would be expected to be rather rapid at low temperatures. As has been remarked, molecular steric factors are as a rule more specific at the lower temperatures in general; hence finer discriminations between structures within the surfaces are to be anticipated.

REFERENCES

1. E. V. Sayre, D. G. Miller, and S. Freed: Symmetries of electric fields about ions in solutions. Absorption and fluorescence spectra of europic chloride in water, methanol, and ethanol. J. Chem. Phys. 26, 109-113 (1957).
2. D. G. Harris and F. P.
Zscheile: Effects of solvent upon absorption spectra of chlorophylls A and B. Bot. Gaz. 104, 515-527 (1943).
3. A. U. Smith: Effects of low temperatures on living cells and tissues. In: Biological Applications of Freezing and Drying, ed. by R. J. Harris, 1-62, Academic Press, New York (1954).
4. A. S. Parkes and A. U. Smith: Regeneration of rat ovarian tissue grafted after exposure to low temperatures. Proc. Roy. Soc. (B) 140, 455-470, London (1953).
5. R. K. Andjus: Sur la possibilite de ranimer le rat adulte refroidi jusqu'a proximite du point de congelation. C.R. Acad. Sci., Paris 232, 1591-1593 (1951).
6. R. K. Andjus and A. U. Smith: Revival of hypothermic rats after arrest of circulation and respiration. J. Physiol. 123, 66-67 (1954).
7. J. E. Lovelock: Physical instability and thermal shock in red cells. Nature, Lond. 173, 659-666 (1954).
8. M. J. Loiseleur: Sur quelques proprietes des proteides en solutions organique. Bull. Soc. Chim. Biol. 14, 1088-1100 (1932).
   J. J. Katz: Anhydrous hydrogen fluoride as a solvent for proteins and some other biologically important substances. Arch. Biochem. Biophys. 51, 293-305 (1954).
   E. D. Rees and S. J. Singer: A preliminary study of the properties of proteins in some nonaqueous solvents. Arch. Biochem. Biophys. 63, 144-159 (1956).

DISCUSSION

Mahler: I can see where this might be useful in the study of the rate of formation of enzyme-substrate complexes. This is a reaction which proceeds much too rapidly to be measured by most ordinary techniques. It is only with very rare and very stable enzyme complexes and by using very interesting and very sensitive experimental devices that Chance*, for instance, has been able to study this at ordinary temperatures.
But if one can find the right kind of solvent for both substrate and enzyme (there is no reason to assume that some of these solvents might not work), one might be able spectroscopically to study the rate of formation of enzyme-substrate complexes at low temperatures.

* B. Chance and G. R. Williams: The respiratory chain and oxidative phosphorylation. In: Advances in Enzymology, ed. by F. F. Nord, 17, 65-134. Interscience, New York (1956).

INFORMATION CONTENT OF TRACER DATA WITH RESPECT TO STEADY-STATE SYSTEMS*

Mones Berman and Robert L. Schoenfeld
Division of Biophysics, Sloan-Kettering Institute, New York

Abstract: A method for the quantification of information in data from tracer experiments on steady-state systems is presented. It is shown that if the system is represented by $n$ compartments a point in an $n^2$ dimensional space can serve to represent a specific model. Furthermore, uncertainty about the system due to statistical fluctuations and incomplete data can be represented by regions in the $n^2$ dimensional hyperspace. A unit of information for such a system is defined and serves as a measure of the amount of information necessary to determine the system to within a desired accuracy. In order to express the data in terms of the generalized $n^2$ dimensional space, a set of invariants is defined for the data. A concise matrix relation is shown to exist between the invariants of the data and the parameters that characterize the compartmental system. The matrix relation allows mappings between the data and the system. The method presented is applicable to any compartmentalized system that shows linear kinetics.

I. INTRODUCTION

This paper is concerned with the quantification of information contained in data from tracer experiments performed on steady-state biological systems. In general, the same set of data may be analysed in terms of different systems of various degrees of complexity.
To define the information content of the data, therefore, it is necessary to specify the system in terms of which the data are to be analysed. It can be assumed for many tracer experiments that the system† consists of a discrete number of compartments (or pools) each representing a localization or chemical state of the labeled material, with exchange of molecules between compartments. The rate of exchange of the unlabeled molecules between compartments is in general a non-linear function of the amounts of material in the compartments. If, however, the system is in a steady state and the amount of the tracer is sufficiently small compared to its unlabeled isotope, the rate of exchange of the tracer may be treated as a linear function of the amounts of labeled material in the compartments (1).

* This work was supported in part by the U.S. Atomic Energy Commission Grant AT(30-1)-910.
† For this paper, the word 'system' will be used to mean a specific number of compartments independently of how they are interconnected. The word 'model' will refer to a specific configuration of the system.

The problems that arise in treating the data of tracer experiments are: first, to define the information content in the data, and second, to translate the information in the data into values of the system parameters (the turn-over rates of the compartments). In addition, it is desirable to have a measure of uncertainty in the values determined for the system parameters. The uncertainty in these values arises from the fact that the collected data may not be sufficient to define the system completely and that the collected data have associated fluctuations.

A method for the quantification of the information in data and the systematic formulation of models consistent with it is presented here.
The information content in the data is expressed by a set of invariants, and a concise matrix relation is shown to exist between the invariants of the data and the system parameters. Uncertainties in the data due to incompleteness or fluctuations are mapped into a generalized co-ordinate space which also represents the degrees of freedom of the system parameters and their uncertainty. The uncertainties in the data are expressed in terms of regions in the generalized co-ordinate space in such a way as to suggest a criterion for their quantification with respect to the system.

II. DATA INVARIANTS AND SYSTEM PARAMETERS

The response of the system to a tracer injected into any one compartment can be expressed in terms of the amounts of tracer in the various compartments as a function of time. If we define the probability per unit time for a transition from any compartment $i$ to compartment $j$ as $\lambda_{ji}$, then the kinetics of the tracer in the $i$th compartment of an $n$ compartmental system can be represented by the following set of differential equations:

$$\frac{dq_i}{dt} = -\lambda_{ii}\, q_i(t) + \sum_{j \ne i} \lambda_{ij}\, q_j(t) \qquad (i = 1, 2, \ldots, n) \tag{1}$$

where $q_i(t)$ is the amount of tracer material in the $i$th compartment at time $t$ and

$$\lambda_{ii} \ge \sum_{j \ne i} \lambda_{ji} \tag{2}$$

is the probability per unit time that any molecule in compartment $i$ will leave that compartment. The inequality sign expresses the possibility that a molecule may leave the entire system from compartment $i$ as in the case for open systems.

The solution of the set of differential equations (1) is:

$$q_i(t) = \sum_{j=1}^{n} A_{ij}\, e^{-\alpha_j t} \tag{3}$$

In a recent paper (2) we have pointed out that data expressed in the form of equation (3) have the following properties:

(a) There are at most $n$ $\alpha_j$ in the data and these are invariants of the system and independent of the initial conditions or site of measurements.
(b) The $A_{ij}$ represent $n^2$ independent variables in the data. Specification of the initial conditions reduces the $A_{ij}$ to $(n^2 - n)$ independent variables which are a function of the system parameters only.
The $A_{ij}$ thus represent $(n^2 - n)$ invariants of the system parameters.
(c) The $n$ $\alpha_j$ and $n^2$ $A_{ij}$ comprise a necessary and sufficient set of data to define uniquely the parameters of the system.
(d) A simple matrix relation (3) exists between the $A_{ij}$ and $\alpha_j$ of the data and the $\lambda_{ij}$ of the system. This relation can be written:

$$|\lambda|\,|A| = |A|\,|\alpha| \tag{4}$$

or

$$|\lambda| = |A\,\alpha\,A^{-1}| \tag{5}$$

where (written out for $n = 3$)

$$|\lambda| = \begin{vmatrix} \lambda_{11} & -\lambda_{12} & -\lambda_{13} \\ -\lambda_{21} & \lambda_{22} & -\lambda_{23} \\ -\lambda_{31} & -\lambda_{32} & \lambda_{33} \end{vmatrix}, \qquad |\alpha| = \begin{vmatrix} \alpha_1 & 0 & 0 \\ 0 & \alpha_2 & 0 \\ 0 & 0 & \alpha_3 \end{vmatrix}, \qquad |A| = \begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix}$$

Equation (5) expresses the system parameters in terms of the invariants in the data. If these invariants are known, the fractional turnover rates, $\lambda_{ij}$, can all be determined. However, in most cases the experimental data are incomplete in that certain of the $A_{ij}$ and $\alpha_j$ are not known. For these cases, an infinity of models mathematically consistent with the data can be obtained from equation (5) by inserting arbitrary values for the unknown $A_{ij}$ and $\alpha_j$, preserving the initial conditions and other constraints in the data. Most of these arbitrary models, however, will be physically meaningless because some of the fractional turnover rates will be negative. Consequently, it is necessary to investigate what range of values of the unknown $A_{ij}$ and $\alpha_j$ correspond to physically meaningful models. This can be done by relating variations in $A_{ij}$ and $\alpha_j$ to variations in the $\lambda_{ij}$.

One may define (2) a matrix $|P|$ in such a way that the product $|PA|$ will preserve the known $A_{ij}$. The number of variables in $|P|$ will be equal to the degrees of freedom in the $A_{ij}$. If both sides of equation (4) are premultiplied by the matrix $|P|$ this equation can be rewritten:

$$|P \lambda P^{-1}|\,|PA| = |PA|\,|\alpha| \tag{6}$$

which is of the form

$$|\lambda'|\,|A'| = |A'|\,|\alpha| \tag{7}$$

where

$$|A'| = |PA| \tag{8}$$

$$|\lambda'| = |P \lambda P^{-1}| \tag{9}$$

Equation (9) expresses a mapping of the matrix $|\lambda|$ corresponding to variations in the unknown $A_{ij}$ only.
It also represents a general solution of all models mathematically consistent with the data in terms of a minimum number of variables. This solution is expressed in terms of an arbitrary model represented by the matrix $|\lambda|$.

Similarly, we can define a matrix $|D|$ so that the product $|\alpha D|$ will preserve all the known $\alpha_j$. Incorporating this into equation (4), we get

$$|\lambda A D A^{-1}|\,|A| = |A|\,|\alpha D| \tag{10}$$

which is of the form

$$|\lambda'|\,|A| = |A|\,|\alpha'| \tag{11}$$

where

$$|\alpha'| = |\alpha D|, \qquad |\lambda'| = |\lambda A D A^{-1}| \tag{12}$$

Equation (12) represents a mapping of the matrix $|\lambda|$ in terms of the variations in the unknown $\alpha_j$ only.

By applying the restriction that every fractional turnover rate must be positive,

$$\lambda'_{ji} \ge 0, \qquad \lambda'_{ii} \ge \sum_{j \ne i} \lambda'_{ji} \tag{13}$$

equations (9) and (12) limit the range of values of the variables in the matrices $|P|$ and $|D|$. Since these variables are all independent, they represent a co-ordinate space of dimension equal to their number. Every point in this space specifies a set of values for the variables in the matrices $|P|$ and $|D|$ and, thus, defines a model through equations (9) and (12). The restrictions on the range of values of the variables as expressed by equation (13) correspond to a region in the co-ordinate space in which all physically meaningful models must lie.

The choice of the starting point for the transformations indicated above is completely arbitrary and does not affect the final result. Any mathematically consistent model leads to a region in the mapping space corresponding to proper physical models.

III. UNCERTAINTY MAPPINGS IN GENERALIZED SPACE

We now wish to examine the problem from a somewhat different point of view. The system is represented by $n^2$ $\lambda_{ij}$, generally independent of each other. We can, therefore, consider the $\lambda_{ij}$ to represent an $n^2$ dimensional space, and any point in that space as a specific model of the system.
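As a concrete illustration of equations (3) to (5) and of a model as a point in the $n^2$ dimensional space, the following is a minimal sketch for $n = 2$. The numerical values of $A_{ij}$ and $\alpha_j$ are invented for the purpose, and the helper names (`model_from_invariants`, `is_physical`) are hypothetical; nothing here is taken from the authors' own computations.

```python
# Sketch of equation (5) for n = 2: recover the model |lambda| from the
# data invariants |A| and |alpha|. All numbers are invented for illustration.

def matmul2(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse2(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def model_from_invariants(a, alphas):
    """Equation (5): |lambda| = |A alpha A^-1|, with alpha diagonal."""
    alpha = [[alphas[0], 0.0], [0.0, alphas[1]]]
    return matmul2(matmul2(a, alpha), inverse2(a))

def is_physical(lam, tol=1e-9):
    """Check the constraints of equations (2) and (13).

    Off-diagonal entries of |lambda| hold -lambda_ij, so the rate from
    compartment i to compartment j is -lam[j][i].
    """
    for i in range(2):
        leaving_to_others = 0.0
        for j in range(2):
            if j != i:
                rate = -lam[j][i]
                if rate < -tol:          # a negative turnover rate
                    return False
                leaving_to_others += rate
        if lam[i][i] < leaving_to_others - tol:
            return False
    return True

# Hypothetical data: q_i(t) = sum_j A[i][j] * exp(-alpha_j * t),
# tracer injected into compartment 1, so q(0) = (1, 0).
A_DATA = [[0.6, 0.4], [-0.6, 0.6]]
ALPHA_DATA = [0.7, 0.2]

model = model_from_invariants(A_DATA, ALPHA_DATA)
# model is approximately [[0.5, -0.2], [-0.3, 0.4]]

# Treating alpha_2 as unknown and guessing it arbitrarily gives a model that
# satisfies equation (4) but has negative turnover rates.
guessed = model_from_invariants(A_DATA, [0.7, 0.9])

print(model, is_physical(model))
print(guessed, is_physical(guessed))
```

In this sketch the complete data map to the single point $(\lambda_{11}, \lambda_{12}, \lambda_{21}, \lambda_{22}) = (0.5, 0.2, 0.3, 0.4)$ in the four dimensional space, while the arbitrary guess for $\alpha_2$ falls outside the region of physically meaningful models.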
It was also indicated earlier that the data could be represented by a set of invariants composed of $n$ $\alpha_j$ and $(n^2 - n)$ $A_{ij}$ or a total of $n^2$ invariants. Hence, the transformation from the data space to the $\lambda_{ij}$ space is dimensionally consistent and unique. This means that a complete set of $A_{ij}$ and $\alpha_j$ corresponds to a point in the $|\lambda|$ space, and vice versa.

By definition, however, the values of the $\lambda_{ij}$ must all be positive. Consequently all the models must lie in a restricted region of the $|\lambda|$ hyperspace. This restriction carries over to the data space, limiting the region in which the $A_{ij}$ and $\alpha_j$ may lie.

Any specified $A_{ij}$ or $\alpha_j$ implies a one dimensional constraint in the data space. This carries over as a one dimensional constraint in the $|\lambda|$ space, and restricts all models to a surface in the hyperspace. If, however, the value of $A_{ij}$ or $\alpha_j$ is known only within a certain range, the surface has correspondingly a certain thickness. When several $A_{ij}$ or $\alpha_j$ are known, the dimensions of the space in which all models must lie is reduced by a corresponding number. Statistical uncertainties for any of the known values correspond to similar uncertainties along the appropriate co-ordinates in the hyperspace.

Thus, if all $A_{ij}$ and $\alpha_j$ are known exactly, a point in the hyperspace of $n^2$ dimensions specifies the model. If all the data are known to within a certain statistical precision, the most likely model is estimated as a point in the $n^2$ dimensional space surrounded by a region that corresponds to the statistical uncertainty. If some $A_{ij}$ or $\alpha_j$ are unknown, the corresponding dimensions in the $n^2$ dimensional hyperspace extend to the limits imposed by the relation that all $\lambda_{ij}$ are positive.

IV. UNIT OF UNCERTAINTY

Based on the point of view presented, we can define a unit of uncertainty to be a certain volume of the hyperspace.
The size of the volume so defined is arbitrary; it may correspond to a volume that is equivalent to the actual standard deviation in the data, or to some convenient standard deviation that may serve as a reference. The information necessary to define the system can then be expressed as the number of binary choices, or bits of information, necessary to reduce the total uncertainty space to the size of a defined unit.

V. CONCLUSION

The treatment presented provides a framework in which information in data from tracer experiments on steady-state systems can be quantified in terms of a compartmental system and its parameters. Before the information can be quantified, however, a number of compartments has to be chosen for the system. Unless this is known from independent sources, the method in choosing the number of compartments is based on the minimum number of exponential terms that 'reasonably' describe the data. This, at present, is by no means a unique procedure.

It was shown in this treatment that a model representing the system can be expressed as a point in a generalized co-ordinate space, and that any uncertainty in the system can be represented by a certain region in that space. The nature of the uncertainty (whether incomplete data or statistical fluctuations in the data) did not matter in the treatment.

There is, however, one difference in the regions of the hyperspace corresponding to these two sources of uncertainty. The difference is in the probability that any model in the region represents the true system. In the case of incomplete data, the probability density over the entire region is assumed constant; that is, every model in the region is considered equally probable. In the case of statistical fluctuations, however, a certain point or unit volume represents the most likely model, and the rest of the points or unit volumes decrease in probability in a manner governed by the statistics of the data.
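The unit of uncertainty of Section IV lends itself to a small numerical sketch. Assuming, purely for illustration, that the region of admissible models is a rectangular box and that the unit of uncertainty is a smaller box, the number of binary choices is the base-2 logarithm of the ratio of the two volumes. The box shapes, the side lengths and the function names are assumptions introduced here, not part of the original treatment; the count is meaningful as stated only when every unit volume is equally probable (the incomplete-data case).

```python
import math

def box_volume(sides):
    """Volume of a rectangular region given the length of each side."""
    volume = 1.0
    for side in sides:
        if side <= 0:
            raise ValueError("side lengths must be positive")
        volume *= side
    return volume

def bits_to_resolve(total_volume, unit_volume):
    """Binary choices needed to shrink total_volume down to one unit volume."""
    if total_volume <= 0 or unit_volume <= 0:
        raise ValueError("volumes must be positive")
    # A region already smaller than the unit needs no further choices.
    return max(0.0, math.log2(total_volume / unit_volume))

# Hypothetical two dimensional subspace: admissible region 1.0 x 0.5,
# unit of uncertainty 1/16 x 1/16 in the same co-ordinates.
region = box_volume([1.0, 0.5])
unit = box_volume([0.0625, 0.0625])
print(bits_to_resolve(region, unit))  # 128 unit cells, i.e. 7.0 bits
```

Halving the admissible region along any one co-ordinate lowers the result by exactly one bit, which is the sense in which each binary choice reduces the uncertainty space.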
The region in the $|\lambda|$ hyperspace can serve to define the information content in the data of the system as a whole or of each parameter of the system, namely the turn-over rates, separately. The latter can be obtained by investigating their values over the bounded region.

One need not necessarily deal with all the dimensions of the hyperspace. One can express the uncertainties in terms of a subspace whose dimensions are equal to the degrees of freedom of the system, as implied by equations (9) and (12). In this case, however, the statistical variations of the collected data cannot be represented since their dimensions are omitted. Any new data to be collected, however, can be represented in this subspace. The significance of any new data can also be evaluated by the relative reduction in the size of the region in the subspace. A unit of uncertainty may be defined for this subspace as was done for the hyperspace.

In references (1) and (2) it was shown how information about the system from steady-state measurements and thermodynamic considerations can be combined with tracer data to form a unified methodology in reducing the uncertainty about the system. The treatment presented here can be extended to include such additional information.

Whereas the concepts presented here are relatively simple, the application to specific problems involves considerable work. One can handle two or three compartmental systems with few degrees of freedom fairly easily using a desk calculator. The handling of more complex systems becomes quite time consuming. It is hoped that a programming of this on digital computers can be worked out for routine applications.

REFERENCES

1. M. Berman: The formulation of biological models from tracer and steady-state data. Ph.D. Thesis, Polytechnic Institute of Brooklyn (unpublished) (1957).
2. M. Berman and R. Schoenfeld: Invariants in experimental data on linear kinetics and the formulation of models. J. Appl.
Phys., 27, 1361-1370 (1956).
3. H. Margenau and G. M. Murphy: The Mathematics of Physics and Chemistry, chap. 10, Van Nostrand, New York (1943).

THE DOMAIN OF INFORMATION THEORY IN BIOLOGY*

Henry Quastler
Brookhaven National Laboratory, Upton, New York

In the proper course of events, a theory is introduced to account for a specific body of facts; then nobody will presume to expatiate upon the domain of the theory. With information theory and biology, the situation is less simple. The modern development of the theory stems largely from C. E. Shannon's concern with certain problems of communication engineering (1). I have heard Shannon say that he was somewhat dubious about the extension of his results to remote fields, and that he felt that people working in other disciplines might do better to develop their own theories. This is not what happened. Shannon's theory has been taken up with enthusiasm by psychologists, linguists, historians, planners, librarians, sociologists, and by biologists with a wide variety of interests. Motives for such generalizations were supplied by Wiener, who pointed out that all control (in the animal and in the machine) depended on communication, and that all communication involved measurable quantities of information (2); and by Weaver, who emphasized the great generality of the information concepts in a searching study (1).

It appeared then that information theory was a tool made to order to deal with a vast variety of problems. This variety, however, is not limitless. Therefore, a discourse on the domain of information theory is indicated. One part of this discourse will deal with the negative domain, or with some of the limitations of the theory. The other part will be concerned with positive applications; it is largely an attempt to give clearer definition to the somewhat vague hopes most people have when proposing to apply information theory.
It is curious that applied information theory produces rather violent reactions, some of them negative. Certainly, it is entirely possible that every biologist who works with information theory, or any other systems theory, is wasting his time. But this, of course, applies to anybody who works with a new theory. It is difficult to see how applying information theory should irritate people, unless the cause should be the very pleasure of gently playing with the theory. Every scientist is aware that there is a 'difference between the labor of thought, and the sport of musing', and knows well the danger inherent in the latter. To go on with Dr Johnson: 'There is nothing more fatal to a man whose business is to think, than to have learned the art of regaling his mind with those airy gratifications .... This is a formidable and obstinate disease of the intellect, of which, when it has once become radicated in time, the remedy is one of the hardest tasks of reason and of virtue. Its slightest attacks, therefore, should be watchfully opposed' (from The Rambler). Is this why so many scientists do not mind too much having collected a lot of useless data but dread to be found working with a useless theory?

* Research carried out at Brookhaven National Laboratory under the auspices of the U.S. Atomic Energy Commission.

I. APPLICATIONS

Every kind of structure and every kind of process has its informational aspect and can be associated with information functions. In this sense, the domain of information theory is universal; that is, information analysis can be applied to absolutely anything. The question is only what applications are useful.

1. Use of Basic Concepts

The basic concepts of information theory (measures of information, of noise, of constraint, of redundancy) establish the possibility of associating precise (although relative!) measures with things like form, specificity, lawfulness, structure, degree of organization.
This alluring promise has introduced the information concepts into the thinking of many biologists. The results of conceptual applications range from harmless modernisms of language to very serious reasoning. In particular, the information concepts seem to lend themselves readily to dealing with the problems of emergence and destruction of order in complicated systems. The problem of emergence of order is usually treated in terms of Darwinian machines, large more or less random assemblies of parts which can both function and, in some manner, register the results of their functioning. The resulting feedback loop produces some order amazingly fast (3, 4). The theory of random networks is a very active field, and some very competent men expect that the main contribution of information theory to biology (and to other fields concerned with very complicated systems) will come from this endeavour. Closely related is the problem of destruction of orderliness. In biology, this is the problem of aging and decay; it is the topic of a major fraction of this conference (5, 6, 7).

2. The Representation Theorem

The use of the basic concepts of information theory becomes more powerful if one considers that the behavior of information measures follows certain rules; these rules are the theorems of information theory. There are two basic theorems which I like to call the 'representation theorem' and the 'noise-and-redundancy theorem'. The first has to do with the possibility of representing one kind of information by another kind of information. There are absolutely no qualitative limitations as to how information can be represented; but there is a quantitative limitation: any physical entity can assume only a limited number of distinguishable states, and this limits the degree to which it can represent information. This degree is further modified by the rules of selecting successive states.
The applicability of the representation theorem depends to a high degree on knowing the process by which states are selected.

The representation theorem applies every time information is transferred, because the transfer does involve representation of the information existing in the transmitter, in the medium and, finally, in the receiver. It can thus be stated as follows: A source cannot transmit more information than it has, a receiver cannot register more information than it can display. This sounds trivial, but the point is that information contents can be precisely estimated in ways which are not trivial.

The representation theorem implies that it is possible to establish an upper bound of the flow of information simply by investigating the terminals. It is, thus, a one-sided conservation principle; being one-sided, it is not as strong as the two-sided conservation principles which are so commonly used in physics. It becomes stronger in situations where one may assume that the inequality approaches an equality.

There are two conditions which are conducive to the establishment of full conservation of information: one, that information is a valuable and critical commodity, and two, that noise can be minimized. The concept that information is the most precious commodity for living things has been formulated strikingly by Schroedinger in his assertion that 'living things feed on orderliness': that they feed because they need fresh supplies of orderliness, not of energy or matter (8). The need for fresh supplies of orderliness presupposes that orderliness is somewhere lost, that is, that noise is present. This, however, does not mean that noise is present everywhere. Some processes may occur in 'clockwork fashion', without loss of information. That is the case which Schroedinger classifies as 'generation of order from order'.
He suspects that each individual act of transmission of genetic information from parent to offspring occurs without serious loss of information. This idea agrees with the current (Watson-Crick) model of DNA duplication; it recurs in Gamow's and Ycas' models of information transmission from genetic to somatic material (9).

3. The Noise-and-Redundancy Theorem

Information transfer from one body of information to another is not often with clockwork regularity. As a rule, interferences occur which will more or less affect the process of information interaction. Interference can be of many kinds: the worst kind of interference is one the results of which are not predictable in detail. In this case, some information will be irretrievably lost. However, in general some but not all order is lost. It is one of the most significant results of information theory to have shown that order and disorder can be measured by a common yardstick. Hence, it is possible to investigate the quantitative relations between total information, noise, and remaining orderliness.

The second basic theorem of information theory states that the amount of information effectively transmitted is exactly the amount of information transmitted minus the amount of information lost because of noise. This implies that a source can transmit a certain amount of information reliably in the presence of noise provided it transmits more than the desired amount of information. This surplus must be distributed over the whole activity because it is never known which portions of the total activity will be interfered with by noise; necessarily, the surplus takes the form of redundant information. Thus, the second fundamental theorem states precisely the relation between amount of information to be transmitted, amount of information which will be lost through noise, and amount of redundant information needed to make up the loss.
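The bookkeeping stated by the second theorem can be sketched for the simplest noisy situation. The example assumes a binary symmetric channel fed by equiprobable binary symbols, a textbook model chosen here for illustration and not discussed in the text; the function names are likewise hypothetical. For that model the information lost per symbol equals the binary entropy of the error probability.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary choice with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def effective_bits_per_symbol(p_error):
    """Information transmitted minus information lost because of noise."""
    sent = 1.0                      # one bit per equiprobable binary symbol
    lost = binary_entropy(p_error)  # loss per symbol on this channel model
    return sent - lost

def minimum_symbols(desired_bits, p_error):
    """Fewest symbols that could carry desired_bits (an idealized limit)."""
    rate = effective_bits_per_symbol(p_error)
    if rate <= 0.0:
        raise ValueError("no information survives this noise level")
    return math.ceil(desired_bits / rate)

for p in (0.0, 0.01, 0.11, 0.5):
    print(p, round(effective_bits_per_symbol(p), 3))
```

With an error probability near 0.11 about half of each transmitted bit is lost, so roughly twice the desired amount must be sent; the surplus is the redundant information referred to above. The figure returned by `minimum_symbols` is a limiting value that practical codes approach but need not reach.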
Like the first fundamental theorem, it is a one-sided conservation principle; it limits the amount of order which can prevail in an 'order-from-disorder' situation. Again, the one-sided conservation principle will become more powerful if it can be assumed to approximate a two-sided conservation. However, very stringent conditions must be fulfilled if one expects to use the second theorem. There is some reason to believe that these conditions are at least approximated in some biological situations; this is stated in Dancoff's principle (10).

Dancoff's principle deals with the economics of information. In 'noisy' situations, information is lost and errors will occur unless they are checked by redundant information. Now, errors may be costly, but so is redundant information; accordingly, the optimum amount of redundant information will be not that which makes all errors vanish, but that which minimizes the sum of the cost of errors plus the cost of redundant information, plus the cost (in information units) of error checking. Dancoff's principle asserts that any organism or organization which has gone through competitive evolution has approximated such an optimum; that is, it will commit as many errors as it can get away with, and use the minimum of redundant information needed to hold errors to this level.

It follows from Dancoff's principle that the amount of redundant information in a system is bound to be limited, even if it is a system of enormous information content like a living thing. This is of great interest particularly in radiobiology, because what radiation does very effectively is to destroy information.

4. The Estimation of Information Measures and the Search for Invariants

It may well turn out that the qualitative and semi-qualitative applications of information concepts are going to be the most important contribution of information theory to biology.
But, even successful qualitative applications have very little power in excluding the possibility that other sets of concepts could have been used just as successfully; besides, all scientists like to take measures. Thus, the problem arises of estimating information measures associated with biological structures and functions.

One fundamental difficulty appears immediately: information measures are relative and not absolute; hence, any information measure associated with a given set of biological objects will depend on the set itself and on the scientist who does the estimating. To be sure, one can establish objective bounds. Thus, if a certain genetic locus is known to be capable of having thirty-two distinct allelic states, which are transmitted to the offspring with equal probability given the proper conditions, then the information stored in this locus cannot be less than five bits. If it is also known that the region containing the locus under consideration comprises no more than, say, 20,000 atoms, then the total information stored cannot be more than about 60,000 bits (10). These brackets are safe, but they are too wide to be of interest. They can be very much reduced if one introduces specific assumptions. For instance, if the locus is known to contain no more than, say, 2 X 50 nucleic acid residues, and if one assumes that the genetic information is completely coded in the sequence of the residues on one strand of a double helix, with the information carried by each residue corresponding to unconstrained selection from four possibilities, then the upper bound is reduced to 100 bits, but its validity is less absolute.

Because of the relative nature of information measures, it will always be up to the ingenuity of the biologist to find ensembles which result in useful measures.
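The bounds quoted for the genetic locus can be retraced with a few lines of arithmetic. The sketch below only restates the figures given in the text; the function names are hypothetical, and the per-atom figure is inferred by dividing the two quoted numbers rather than taken from the source.

```python
import math

def bits_for_equiprobable_states(n_states):
    """Information needed to select one of n equally probable states."""
    return math.log2(n_states)

def bits_for_sequence(n_residues, alphabet_size):
    """Upper bound for a sequence with unconstrained choice at each position."""
    return n_residues * math.log2(alphabet_size)

# Lower bound: thirty-two distinct, equally probable allelic states.
lower_bound = bits_for_equiprobable_states(32)   # 5.0 bits

# Upper bound under the coding assumption: 50 residues on one strand,
# each an unconstrained selection from four possibilities.
coded_upper_bound = bits_for_sequence(50, 4)     # 100.0 bits

# The wider bracket of about 60,000 bits for 20,000 atoms corresponds to
# roughly three bits per atom (inferred from the two quoted figures).
bits_per_atom = 60000 / 20000                    # 3.0

print(lower_bound, coded_upper_bound, bits_per_atom)
```

The gap between 5 bits and 100 bits, or between 5 bits and 60,000 bits, is what makes the choice of ensemble decisive for any estimate.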
In many cases, even the estimation of a limit is of interest: as in Ehret's demonstration that a few bits could be sufficient to specify the nature of cytoplasmic structures (11), or the result easily derived from D'Arcy Thompson's work (12) that apparently considerable differences in form could be coded in, say, a few nucleic acid residues.

The relativism of information measures is a basic difficulty in estimation; besides, the biologist will encounter a number of technical difficulties arising from the fact that 'message sets' and 'selection rules' are not perfectly known. A number of approximation methods for such situations have been worked out (13).

The relative nature of information measures and the technical difficulties of their estimation cast some doubts on the usefulness of actual information measures in biology. Only experience will show whether these doubts are justified or not. Measures will be valuable if they lead to the discovery of invariants. In psychology, some invariants seem to be crystallizing out of a number of measurements: there seem to be invariant upper limits for the channel capacity for single activities; for the range of classes distinguishable in a single act, etc. (14). In biology, independent estimates of information transfer associated with three elementary biological functions (allelic, antigenic, enzymatic specificity) have yielded closely similar values (15). Much more material will be needed before we can draw definite conclusions.

The analysis which underlies the estimation of information measures presents certain novel features. Consider, for instance, the informational analysis of a hormonal control system. The traditional approach consists in isolating one hormonal function and one hormone after the other. In principle, this quest never ends, although physiologists might hope that some day they will run out of undiscovered hormones. The information theorist attacks the problem from the opposite end.
He will argue that each hormone molecule constitutes a message from a control organ to a target organ, a message which is diffusely broadcast through the blood stream. In general each message must contain two parts, an address and an order. Actually, one or the other part can be omitted. We can imagine a hormonal control system in which only the addresses are specified: the 'order' may be completely determined in the target organ, and be executed automatically upon receipt of the only kind of hormone molecule with the proper address; or, the address may be unspecific, but the order such that only the right target organ can execute it. One would expect that the natural systems be somewhere between these two extremes. For the sake of simplicity we will consider a system in which only addresses are specified; the formal results have complete generality. Thus, each hormone will be represented only by the address of the target organ.
In the interest of detailed and accurate control, it is desirable to have a maximum number of different addresses. Any duplication of addresses will lead to concomitant responses in other organs. On the other hand, the 'reading' of every single address involves distinguishing it from all other addresses; the greater the variety of addresses, the greater the labor in every single act of recognition. A compromise is indicated between the demand for a great variety of addresses and the contradictory demand to keep each address simple. For any kind of system, there will be an optimum number of different hormones; the actual number will depend on the relative strength of the two competing needs. By Dancoff's principle, we expect that the actual number will not be too far from the optimum number.
We can add another line of considerations on the number of possible addresses.
In order to fulfill its function, the hormone molecule has to enter into some kind of relation with the target organ; most likely, it has to form a complex. Now, the total surface area of any molecule that can enter into a specific complexing process is rather limited, and so is the number of molecular configurations available to living organisms; hence, a limited space accommodates only a limited number of significantly different configurations, and this limits the number of different hormones possible (and, incidentally, the number of distinct antigens and antibodies, enzymes and co-enzymes).
The example illustrates the concern with the whole system which is characteristic of many applications of information theory. It also illustrates a rather profound difference between the information theorist and many of his scientific colleagues. The information theorist will remain fairly cool at the news that another enzyme, or hormone, or vitamin has been isolated; his basic question is: 'How many more are there to be discovered?'
II. LIMITATIONS
Information theory could not possibly apply to a wide variety of situations if it were sensitive to every detail in every situation. Like thermodynamics (to which information theory is related) it has a vast domain of application, and as in thermodynamics, the vastness of