A comprehensive statistical analysis of the relationship between two variables: investment fund performance and investment fund management fee.
The project aims at drawing statistical conclusions about the relation (if any) between the performance of the investment funds (in terms of price) and the money spent as the cost of managing the respective funds, at assessing the significance of that relation, and thereby at providing some support in decision making for higher-level management regarding this. Please note that the project only attempts to find whether change in one variable is accompanied by change in the other; this in no way establishes a cause-and-effect relationship, i.e. we cannot conclude, at least by this exercise, whether an increase in spending on management fees causes an increase in fund performance. (A cause-and-effect relationship cannot be found by statistical techniques alone; it also requires an exhaustive study of the nature of the funds and the management involved.) If a correlation exists between the above two factors, it could mean that one of them influences the other, that both of them influence each other consistently or intermittently, that both of them are influenced by some third factor, or that the correlation arose out of pure chance (i.e. because of the choice of a wrong sample).
We come across similar situations in reality where we need to find whether two or more entities are related to one another, and the nature and degree of their relationships. Examples include the ages of husband and wife, the price of a commodity and the amount demanded, and an increase in rainfall up to a point and the production of rice. (Examples involving only two entities are considered here because they are easier to discuss, the given project also involves only two entities, and the same ideas can be extended to cases involving more than two entities.) Since the entities assume different values depending upon time, place or person, they are referred to in statistics as variables. If a measurable relationship exists between two variables, the variables are said to be correlated. The measure of the correlation, called the correlation coefficient or the correlation index, gives the degree and the nature of the correlation (i.e. whether an increase in one variable corresponds to an increase or a decrease in the other variable) in terms of a number. Correlation analysis deals with the techniques used in measuring the closeness of the relationship between the variables.
We give some important definitions in this regard:
- Correlation analysis deals with the association between two or more variables. (Simpson and Kafka)
- If two or more quantities vary in sympathy, so that movement in one tends to be accompanied by a corresponding movement in the other(s), then they are said to be correlated. (L. R. Connor)
- When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation. (Croxton and Cowden)
There are several ways of classifying correlation. Three of the most important are:
- Positive (direct) or negative (inverse) correlation:- if, when one variable increases, the other variable also increases on average, and when one decreases the other decreases on average, then the two are said to be positively correlated. If, when one variable increases, the other decreases on average, and when one decreases the other increases on average, the two are said to be negatively or inversely correlated.
- Simple, partial and multiple correlation:- when only two variables are studied it is a problem of simple correlation. When three or more variables are studied it is a problem of either partial or multiple correlation. In multiple correlation three or more variables are studied simultaneously. For example, when we study the relationship between the amount of fee paid to a plastic surgeon, the complexity of the operation and the quality of their work (in terms of results etc.), it is a problem of multiple correlation. If we consider only two of the variables, say the quality of work and the fee paid, to be influencing each other while the effect of the other influencing variable is kept constant, then it is a problem of partial correlation.
- Linear and non-linear (curvilinear) correlation:- if the amount of change in one variable tends to bear a constant ratio to the amount of change in the other, the correlation is said to be linear. If we draw a graph with one variable on the X-axis and the other on the Y-axis, almost all the points will fall approximately on a straight line. If the amount of change in one variable does not bear a constant ratio to the amount of change in the other, the correlation is said to be non-linear. In most practical situations we find a non-linear relationship between the variables, but the techniques for measuring non-linear correlation are far more complicated than those for linear correlation. Therefore, we generally make the assumption that the relation between the variables is of the linear type.
The various methods of ascertaining whether two variables are correlated or not are:-
- Scatter diagram method:- This is the simplest method. Here we take one variable on the X-axis and the other on the Y-axis and plot the points. The greater the scatter of the plotted points, the lesser the relationship between the two variables; the more closely the points come to a straight line, the higher the degree of linear relationship. The correlation is positive or negative depending upon the sign of the slope of this line. The merits of the method are that it is simple and easy to understand, and a rough idea can easily be formed as to whether or not the variables are related. It is not influenced by the size of extreme items, whereas most of the mathematical methods of finding correlation are influenced by extreme items. While investigating correlation we usually first draw the scatter diagram. Its drawback is that the exact degree of correlation cannot be established, as it can be with mathematical methods.
- Graphic method:- In this method we obtain two curves, for the X and Y variables respectively. By examining the direction and closeness of the two curves we can infer whether or not the two variables are related. Its merits and demerits are the same as those of the scatter diagram.
- Karl Pearson's correlation coefficient:- Also called the Pearsonian correlation coefficient and denoted (universally) by r, it is given by r = Σxy / (N·σx·σy), where x and y are the deviations of the variable values from their respective means, N is the number of pairs of observations, and σx, σy are the standard deviations of the variables X and Y respectively. The above formula can also be written in the simplified form r = Σxy / √(Σx²·Σy²). Note that this method is to be applied only where the deviations x and y are taken from the actual means and not from assumed means. The correlation coefficient can also be obtained directly, without taking deviations of the items from either actual or assumed means, by the formula r = (NΣXY − ΣX·ΣY) / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}, where X and Y are the values of the variables themselves and not deviations from the means as in the earlier formulas. When the deviations are taken from assumed means (for example, if the values of X and Y are integral but the means involve fractions, then to make calculations simple we take deviations from some integers near the actual means, which are called assumed means), the formula is identical to the one given immediately before, with the only difference that the actual values of X and Y are replaced by the deviations from the assumed means. The Pearsonian correlation coefficient is based on the assumptions that i) there is a linear relationship between the variables, ii) the two variables follow a normal distribution, and iii) there is a cause-and-effect relationship between the variables. The chief limitations of this method are that i) a linear relationship between the variables is assumed, ii) the coefficient is prone to misinterpretation, iii) the coefficient is unduly affected by extreme items, and iv) it is comparatively more time consuming. (A worked sketch of these formulas follows this list.)
- Rank correlation coefficient:- This was developed by the British psychologist Charles Edward Spearman and is hence named after him. The method does not assume anything about the parameters of the population or the shape of the distribution. It is especially useful when quantitative measures of certain factors cannot be fixed but the members of the group can be ranked. Spearman's rank correlation coefficient is defined as rs = 1 − (6ΣD²) / [N(N² − 1)], where D denotes the difference of ranks between paired items and N is the number of pairs. The advantages of this method are i) it is simpler to understand and easier to apply, and if all items are different (no ties) the coefficient is the same as the Pearsonian coefficient, ii) it is advantageous for data of a qualitative nature; for example, surgeons in two countries can be ranked in order of professionalism and the degree of correlation can be established by applying this method, iii) it is the only method that can be used when the actual data are not given but only the ranks are given, and iv) it can be applied even where actual data are given. Its limitations are that it cannot be used for finding correlation in a grouped frequency distribution and that it cannot be used when the number of items exceeds 30. (A computational sketch follows this list.)
- Concurrent deviation method:- This is the simplest of all the methods. The formula is rc = ±√[±(2C − N)/N], where C stands for the number of concurrent deviations and N is the number of pairs of observations less one. It may be used to form a quick idea of the degree of relationship before making use of more complicated methods. Its limitation is that it does not differentiate between small and big changes: for example, if X changes from 100 to 101 the sign will be plus, and if Y changes from 100 to 160 the sign will also be plus. The results obtained from this method are only a rough indicator of the presence or absence of correlation. (A computational sketch follows this list.)
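As an illustration of Karl Pearson's formulas above, here is a minimal sketch in Python using small, made-up figures (the data are purely illustrative and are not taken from the project's Excel sheet); all three forms give the same value of r:

```python
import math

# Illustrative (made-up) paired observations, e.g. fund performance (X)
# and management fee (Y); not real data.
X = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
Y = [1.2, 1.5, 0.9, 1.9, 1.4, 1.1]
N = len(X)

# Method 1: deviations from the actual means, r = Σxy / (N·σx·σy)
mean_x, mean_y = sum(X) / N, sum(Y) / N
x = [xi - mean_x for xi in X]          # deviations of X from its mean
y = [yi - mean_y for yi in Y]          # deviations of Y from its mean
sigma_x = math.sqrt(sum(d * d for d in x) / N)
sigma_y = math.sqrt(sum(d * d for d in y) / N)
r1 = sum(a * b for a, b in zip(x, y)) / (N * sigma_x * sigma_y)

# Method 2: simplified form, r = Σxy / √(Σx²·Σy²)
r2 = sum(a * b for a, b in zip(x, y)) / math.sqrt(
    sum(a * a for a in x) * sum(b * b for b in y)
)

# Method 3: direct formula on the raw values,
# r = (NΣXY − ΣX·ΣY) / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(a * b for a, b in zip(X, Y))
sum_x2, sum_y2 = sum(a * a for a in X), sum(b * b for b in Y)
r3 = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)
)

print(r1, r2, r3)  # all three agree (up to rounding)
```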
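Similarly, a minimal sketch of Spearman's rank correlation coefficient, assuming no tied values (the example ranks are invented for illustration):

```python
def spearman_rank_correlation(X, Y):
    """Spearman's rank correlation, rs = 1 - 6*ΣD² / [N(N² - 1)].

    Assumes no tied values (ties would require average ranks)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(X), ranks(Y)
    N = len(X)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (N * (N ** 2 - 1))

# Illustrative ranks-only situation: two judges ranking the same five funds.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
print(spearman_rank_correlation(judge_a, judge_b))  # 0.8
```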
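And a sketch of the concurrent deviation method, again on invented numbers; only the direction of each successive change matters, which is exactly why the method cannot distinguish small changes from big ones:

```python
import math

def concurrent_deviation(X, Y):
    """Coefficient of concurrent deviations, rc = ±√[±(2C - N)/N].

    N is the number of pairs of observations less one, and C counts the
    cases where X and Y deviate (rise or fall) in the same direction."""
    # Direction of change (+1, -1 or 0) between successive observations.
    dx = [1 if b > a else -1 if b < a else 0 for a, b in zip(X, X[1:])]
    dy = [1 if b > a else -1 if b < a else 0 for a, b in zip(Y, Y[1:])]
    N = len(dx)                                   # pairs of observations less one
    C = sum(1 for a, b in zip(dx, dy) if a == b)  # concurrent deviations
    value = (2 * C - N) / N
    # The sign under the root and in front of it follows the sign of (2C - N).
    return math.copysign(math.sqrt(abs(value)), value)

# Illustrative series: every change in X is matched in direction by Y.
X = [100, 101, 99, 105, 104, 108]
Y = [100, 160, 150, 170, 168, 180]
print(concurrent_deviation(X, Y))  # 1.0
```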
Interpretation of the correlation coefficient:- the correlation coefficient is often likely to be misinterpreted, and a large amount of experience is required to interpret it properly. The general rules of interpretation are:
- r = +1 means there is a perfect positive relationship between the two variables.
- r = -1 means there is a perfect negative relationship between the two variables.
- r = 0 means there is no relationship between the variables.
- The closer the value of r is to +1 or -1, the closer the relationship between the variables; the closer r is to 0, the less close the relationship. When estimating the value of one variable from that of the other, the higher the value of r, the better the estimate.
- The closeness of the relationship is not proportional to r. If the value of r is 0.8 it does not indicate a relationship twice as close as one of 0.4; it is in fact very much closer.
The probable error of the correlation coefficient is defined as P.E.r = 0.6745 × (1 − r²)/√N, where r is the correlation coefficient and N is the number of pairs of observations. If the value of r is less than the probable error, there is no evidence of correlation, i.e. the value of r is not at all significant. If r > 6 P.E.r, the value of r is significant. If ρ is the correlation coefficient of the population, then r − P.E.r < ρ < r + P.E.r.
Note: the standard error of r is defined as S.E.r = (1 − r²)/√N.
The probable error can be used only when the data approximately follow a normal distribution and the sample is unbiased.
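A small sketch of how the probable error and the 6 × P.E. significance rule described above could be applied (the values of r and N are illustrative only, not results from the project data):

```python
import math

def probable_error(r, n):
    """Probable error of r: P.E. = 0.6745 * (1 - r²) / √N."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

# Illustrative figures.
r, n = 0.8, 64
pe = probable_error(r, n)
print("P.E. =", round(pe, 4))                        # 0.0304
print("significant:", r > 6 * pe)                    # True: r exceeds 6 * P.E.
print("limits for population coefficient:", (r - pe, r + pe))  # about (0.7696, 0.8304)
```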
The coefficient of determination, denoted by r², is defined as the ratio of the explained variance to the total variance. It should not be misinterpreted as meaning that the variable X is in a determining or causal relationship with Y, as statistical evidence never establishes this kind of causality; statistical evidence only determines covariation.
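To make the "explained variance over total variance" reading concrete, here is a minimal sketch on made-up data: fitting a least-squares line and comparing the variation explained by that line with the total variation in Y, which for a linear fit equals r².

```python
# Coefficient of determination as explained variation over total variation,
# sketched with a simple least-squares line on invented data.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Slope and intercept of the least-squares line of Y on X.
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx
fitted = [a + b * x for x in X]

explained = sum((f - my) ** 2 for f in fitted)  # variation explained by the line
total = sum((y - my) ** 2 for y in Y)           # total variation in Y
print(explained / total)                        # equals r² for a linear fit
```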
Some of the properties of the correlation coefficient:-
The correlation coefficient r lies between -1 and +1. It is independent of the scale and origin of the variables X and Y. It is equal to the geometric mean of the two regression coefficients. (A short numerical check of these properties is sketched below.)
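A minimal numerical check of the last two properties, using the same kind of invented data as before (the figures are illustrative only):

```python
import math

def pearson_r(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    sxx = sum((x - mx) ** 2 for x in X)
    syy = sum((y - my) ** 2 for y in Y)
    return sxy / math.sqrt(sxx * syy)

X = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
Y = [1.2, 1.5, 0.9, 1.9, 1.4, 1.1]
r = pearson_r(X, Y)

# Independence of origin and scale: shifting and rescaling a variable
# leaves r unchanged.
X_rescaled = [10 * x + 3 for x in X]
print(abs(pearson_r(X_rescaled, Y) - r) < 1e-12)   # True

# r equals the geometric mean of the two regression coefficients b_yx and b_xy.
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
b_yx = sxy / sum((x - mx) ** 2 for x in X)   # slope of the regression of Y on X
b_xy = sxy / sum((y - my) ** 2 for y in Y)   # slope of the regression of X on Y
print(abs(math.sqrt(b_yx * b_xy) - abs(r)) < 1e-12)  # True
```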
The correlation analysis is included in the adjoining Excel sheet. The idea of probable error is used to test the significance of the correlation coefficient.