Computing Statistics - Single Statistics

Single statistics are statistical functions used to perform preliminary descriptive analyses.

Properties

Properties	Description

Properties	Description
General parameters
Statistics on integer variables are continuous	If selected, statistics will be displayed as continuous values (e.g. 12,34 instead of a rounded off 12).
Sample size
Number of total valid samples	Number of rows with valid values, not including missing values.
Concentration measures
Gini concentration index	Computes the Gini concentration index, which is obtained from the following equation: where: N is the number of valid values, represents the cumulative percentage of the attribute X and is the corresponding expected value under the hypothesis of equal distribution.
Dispersion and heterogeneity measures
Number of mode elements	Shows the number of cases where the value corresponds to the mode. In a statistical distribution the mode represents the most frequently observed value, and the number of mode elements corresponds to the number of occurrences of tied values.
Entropy	A heterogeneity measure, which is obtained, for a categorical variable, from the following equation: where: i indicates each category, p_i is the corresponding frequency k denotes the number of categories.
Normalized entropy	Value obtained by dividing the entropy value by its maximum theoretical value (i.e. log(n)).
Gini coefficient	Coefficient commonly applied to qualitative ordinal variables to evaluate the dispersion of values across different categories. It is estimated by the following equation: where: i indicates each category p_i is the corresponding frequency. G takes the minimum value of zero when there is only one category (maximum concentration), while if there is only one value in each stratum (maximum dispersion) .
Normalized Gini coefficient	Coefficient with the same meaning as the non-normalized statistic, but it is forced to vary between 0 and 1, simply dividing the G value by its maximum theoretical value
Range of values	The difference between the highest and the lowest values observed for the attribute X.
Interquartile range	A dispersion measure often applied to continuous variables when their distribution is non Gaussian. It represents the difference between the 75^thand the 25^thpercentile, corresponding to the 75% and the 25% value of the cumulative distribution of X, respectively.
Standard error of mean	A dispersion measure usually used to evaluate the precision of a mean estimate. It is obtained by the ratio between the standard deviation and the square root of the number of subjects under study.
Standard deviation	The square root of the variance (defined below). It is often used instead of the Variance as it has the same unit of measure as X.
Standard error of standard deviation	Provides a measure of the uncertainty of the Standard deviation estimate. It is seldom used.
Variance	It is among the most commonly used dispersion measures. It is estimated by the following equation (sample variance):
Standard error of variance	A measure of the precision of the variance estimate. It is seldom used.
Coefficient of variation	The ratio between the standard error and the mean of an observed distribution. It represents a dimensionless measure of dispersion and it should be calculated only for continuous attributes, which take positive values.
Mean absolute deviation	The mean of the absolute values of the difference between each value and the average of the attribute X. It is seldom used in favor of the Variance measure as it has better statistical properties.
Median absolute deviation	The median value of the absolute values of the difference between each value and the average of the attribute X. It is a very rarely used as a measure of dispersion.
Descriptive, location and central tendency measures
Number of distinct values	The number of non coincident (tied) values.
Number of missing values	The number of missing values.
Mean value	The simple arithmetic (unweighted) mean of the observed X values:
Absolute mean value	The mean value obtained from the absolute value of each observation.
Geometric mean value	A central tendency measure that should be calculated for positive values only: It is often applied to evaluate the average mean time needed to perform a specific task or as a measure of average speed.
Geometric absolute mean value	The geometric mean value obtained from the absolute value of each observation. It is seldom used.
Harmonic mean value	The harmonic mean value obtained from the value of each observation.
Harmonic absolute mean value	The harmonic mean value obtained from the absolute value of each observation. It is seldom used.
Median value	The value that makes it possible to split the X distribution into two equally sized samples (or almost equally sized, in the presence of an odd number of distinct values), corresponding respectively to the values ≤ median and > median. If there is an odd number of distinct X values, the median belongs to the X observed distribution. For example, if there are 5 valid items of data, the third ordered value corresponds to the median. Otherwise the median value is estimated by a simple interpolation, for example if there are 6 distinct values, the average between the 3^rdand the 4^thordered value is used.
Mode value	The most frequent observation.
Index of the mode element	The position of the most frequently observed value in the original (unsorted) X attribute.
Minimum value	The valid lowest value observed.
Index of the minimum element	The position of the lowest observed value in the original (unsorted) X attribute.
Maximum value	The valid highest value observed.
Index of the maximum element	The position of the highest observed value in the original (unsorted) X attribute.
Sum value	The sum of all valid observations.
Absolute sum value	The sum of the absolute value of each observation.
Product value	The product of all valid observations.
Absolute product value	The absolute value of the product between each observed value.
Lower quartile	The value that separates the quarter of ordered data with the lowest values from the 75% of the remaining observations. It is also called “the 25^thpercentile” and is used together with the upper quartile and the median value to obtain a box plot. For further information on boxplots see Plotting Data - Box Plots.
Upper quartile	The value that separates the quarter of ordered data with the highest values from the 75% of the remaining observations. It is also called “the 75^thpercentile” and is used together with the lower quartile and the median value to obtain a box plot.
Lower whisker for boxplot	The value corresponding to three times the standard deviation below the mean. It is used as a threshold to detect outliers detection and is part of the box plot.
Upper whisker for boxplot	The value corresponding to three times the standard deviation above the mean. It is used as a threshold to detect outliers and is part of the box plot.
Symmetry and shape measures
Skewness value	A measure of asymmetry. Symmetric distributions have zero skewness, whereas positive values are associated with right asymmetry. It is obtained by the following equation:
Standard error of skewness	An estimate of the precision of the skewness value estimate.
Kustosis value	A measure of how data are peaked. For example, kurtosis of a standard Gaussian distribution is 3.0. Distribution with a higher kurtosis is said to be leptokurtic while distributions with a lower value are called platykurtic. The Kurtosis value is obtained by the following equation:
Standard error of kurtosis	A measure of the precision of the Kurtosis estimate.