Kartable: Lessons and exercises for High School

Introduction to sets of data

Quantitative & qualitative data

Quantitative data

A quantitative data set is a data set which is measured using numbers.

The following data set is the collection of the heights of students in a class.

Student Name	Height
Amy	180 cm
John	172 cm
Louis	193 cm
Pierre	201 cm
Marcus	162 cm
Emmy	160 cm

Heights are recorded using numbers. Therefore, the above data set is a quantitative data set.

Qualitative data

A data set is qualitative if its data characterizes a property of members of a population without using numbers.

The following data set is the collection of the hair colors of students in a class.

Student Name	Hair color
Amy	Brown
John	Brown
Louis	Blonde
Pierre	Black
Marcus	Red
Emmy	Brown

Hair colors are not recorded using numbers. Therefore, the above data set is a qualitative data set.

Discrete or continuous data, levels of measurement

Discrete data

A discrete data set is a data set where the number of possible values that a member of the population can have is finite.

The following data set is a record of students' responses to the question: "Do you enjoy studying math?"

Student Name	Enjoys math
Amy	Yes
John	No
Louis	Yes
Pierre	Yes
Marcus	No
Emmy	Yes

There are only two possible responses to the question "Do you enjoy studying math?" Therefore, the above data set is a discrete data set.

Continuous data

A continuous data set is a data set where the number of possible values that a member of the population can have is infinite, because any value within an interval is possible.

The following data set is the recorded speed of cars traveling along a highway.

Car number	Speed in kph
1	100
2	95
3	98.25
4	78.5
5	79
6	110.75
7	105.25
8	103.5
9	100
10	100.5

The set of possible values for the speed of a car is infinite. Therefore, the above data set is a continuous data set.

If we measured speeds of cars but rounded the speed to the nearest unit, then the data set would actually be discrete.

Levels of measurement

There are four different levels of measurement.

Nominal level of measurement

A nominal level measurement is a measurement where attributes cannot be ordered.

The following data set is the collection of the hair colors of students in a class.

Student Name	Hair Color
Amy	Brown
John	Brown
Louis	Blonde
Pierre	Black
Marcus	Red
Emmy	Brown

Hair color categorizes members of the population in a way which cannot be ordered. Therefore, recording people's hair color is a nominal level of measurement.

Ordinal level of measurement

An ordinal level measurement is a measurement where attributes can be ordered.

In a customer survey, a grocery store asked customers how satisfied they were with their shopping experience by asking them to select the value on the following scale which best represents their experience:

5 - Very satisfied
4 - Satisfied
3 - Okay
2 - Unsatisfied
1 - Very unsatisfied

Customer number	Survey response
1	4
2	3
3	2
4	5
5	5
6	3
7	3
8	5
9	4
10	1

The measurements of the customers' satisfaction can be ordered. Therefore, the measurement of customers' satisfaction is an ordinal level of measurement.

Interval level of measurement

An interval level measurement is an ordinal level of measurement where we know the exact differences between measurements.

The following data set is the average recorded outside temperature, in degrees Celsius, of a city over a period of ten days.

Day number	Temperature
1	25
2	24
3	20
4	17
5	20
6	21
7	15
8	13
9	12
10	12

The measurements of temperature can be ordered, but we also know how to measure the difference between two different measurements. Therefore, the measurement of temperature is an interval level of measurement.

Ratio level of measurement

A ratio level of measurement is an interval level of measurement where there is a true zero: a value below which no measurement can fall, and which represents the absence of the quantity being measured.

The following data set is the recorded speed of cars traveling along a highway.

Car number	Speed in kph
1	100
2	95
3	98.25
4	78.5
5	79
6	110.75
7	105.25
8	103.5
9	100
10	100.5

The measurements of the speed of cars can be ordered, and we also know how to measure the difference between two measurements. Therefore, the measurement of speed is at least an interval level of measurement. On top of this, there is a true zero for speed, namely 0 kph. Therefore, the measurement of the speed of cars on a highway is a ratio level of measurement.

Different forms of graphs

Graphs are used to represent data sets.

Histograms, bar graphs, circle graphs and line graphs

Bar Graphs

A bar graph is a pictorial representation of data which compares values assigned to different categories. Each bar represents only one numeric value of data.

The following bar graph represents the number of pages in various books.

Histogram

A histogram is a special kind of bar graph used to represent the frequency of numerical data. The data is grouped into intervals of equal width, and there are no gaps between the bars.

The following histogram represents the number of registered students in each academic year at a certain high school.

Circle Graphs

Circle graphs, also known as "pie charts", are circular graphs which are divided into different sections. Each of the sections denotes a percentage or a value out of the total.

The following circle graph represents the relative size of each grade level of students in a high school as numbers and as percentages.

Line Graphs

A line graph is used to display continuous data. It displays information as a series of data points connected by straight line segments.

The following line graph represents the evolution of temperatures during a week in degrees Fahrenheit.

Stem-and-leaf plots

Stem-and-leaf plot

In a stem-and-leaf plot, the data is organized from least to greatest. It is a special table where each data value is split into:

A "stem", the first digit or digits of the value.
A "leaf", which is usually the last digit of the value.

Consider the following dataset:

21, 37, 15, 18, 32, 28, 11

It has 7 data values:

21 can be split into a stem of "2" and a leaf of "1".
37 can be split into a stem of "3" and a leaf of "7".
15 is split into a stem of "1" and a leaf of "5" and so on...

In the end, the stems will be the values 1, 2, and 3.

For a stem of 1, there is a leaf of 1 (for 11), a leaf of 5 (for 15), and a leaf of 8 (for 18).

This gives the following stem-and-leaf plot:

Stem	Leaf
1	1,5,8
2	1,8
3	2,7
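
The splitting described above can be sketched in Python. This is only an illustration: the function name stem_and_leaf is hypothetical, and the sketch assumes non-negative whole numbers whose last digit is the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by stem (leading digits) and leaf (last digit)."""
    plot = defaultdict(list)
    # Sorting first keeps the leaves of each stem in increasing order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


data = [21, 37, 15, 18, 32, 28, 11]
plot = stem_and_leaf(data)

# Print one row per stem, in the same layout as the table above.
for stem in sorted(plot):
    leaves = ",".join(str(leaf) for leaf in plot[stem])
    print(f"{stem}\t{leaves}")
```

Running it on the data set above prints the same three rows as the table.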

Box-and-whisker plots

Box-and-whisker plot

A box-and-whisker plot is a diagram that summarizes data by dividing it into four equal parts (quartiles) using five numbers:

The smallest number.
The three quartiles (as defined later in this lesson).
The biggest number.

The box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median. The whiskers go from each quartile to the minimum or maximum.

Consider the following data set:

2, 3, 6, 7, 8, 9, 11

Assume that:

The median is 7.
The lower quartile is 3.
The upper quartile is 9.

The box-and-whisker plot for the above data set is:

Key concepts in statistics

Indicators of central tendency

Mean

The mean of a numerical data set is the average value of the data. To calculate the mean, add up all the numbers (sum) and divide the result by how many numbers there are (count):

\text{Mean} = \dfrac{\text{sum}}{\text{count}}

Consider the following data set:

2, 3, 5, 6, 4

To find its mean, first add all the terms to find the sum:

2 + 3 + 5 + 6 + 4 = 20

The total number of terms (or the count) is 5.

Now calculate the mean:

\dfrac{20}{5}=4
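
The same calculation can be written as a short Python sketch. The variable names are illustrative assumptions, not part of the lesson.

```python
data = [2, 3, 5, 6, 4]

total = sum(data)     # sum of all the terms
count = len(data)     # number of terms
mean = total / count  # mean = sum / count

print(total, count, mean)  # 20 5 4.0
```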

Median

The median of an ordered data set is the middle value of the data set:

If there is an odd number of terms in a list, then there is one unique value that is in the middle of the sorted list.
If there is an even number of terms in a list, then there will be two values in the middle. In this case, we take the average of these two values in order to calculate the median.

Consider the following data set:

10, 15, 5, 25, 20

In order to find the median, the set first needs to be sorted in order:

5, 10, 15, 20, 25

There are 5 terms in the set. The middle value is the value that is found in the third place, which is 15.

Consider finding the median for the following data set:

20, 15, 5, 10, 25, 30

First, the set needs to be sorted in order:

5, 10, 15, 20, 25, 30

The total number of terms in this data set is 6 (which is an even number). There are two numbers in the middle (15 and 20), which are the third and fourth numbers in the set. To find the median, calculate the average of these two values:

\text{Median} = \dfrac{15+20}{2}= \dfrac{35}{2}= 17.5

The median is 17.5.
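
Both cases (odd and even number of terms) can be sketched in one small Python function. The function name median is chosen here for illustration only.

```python
def median(values):
    """Return the middle value of the sorted data (average of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[middle]
    # Even count: average the two values in the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_median = median([10, 15, 5, 25, 20])
even_median = median([20, 15, 5, 10, 25, 30])

print(odd_median)   # 15
print(even_median)  # 17.5
```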

The mean and the median of a data set can be different.

Consider the following data set:

1, 1, 10

The mean of the data set is:

\dfrac{1+1+10}{3}=\dfrac{12}{3}=4

However, the median is 1.

Mode

The mode is the number that appears the most in the list. To find the mode, it is best to sort the list in order and then count how many times each number appears. The number that appears most often is the mode.

Consider finding the mode for the following list:

12, 11, 43, 17, 12, 11, 34, 38, 43, 12, 72

First, arrange the given data set in order:

11, 11, 12, 12, 12, 17, 34, 38, 43, 43, 72

Now we can see that 12 occurs more times than any other number in this data set. Therefore, the mode is 12.
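
Counting how many times each number appears can also be sketched in Python with the standard collections.Counter class. The variable names are illustrative assumptions.

```python
from collections import Counter

data = [12, 11, 43, 17, 12, 11, 34, 38, 43, 12, 72]

# Count how many times each value appears in the list.
counts = Counter(data)

# most_common(1) returns a list holding the (value, count) pair with the highest count.
mode, frequency = counts.most_common(1)[0]

print(mode, frequency)  # 12 3
```

If several values were tied for the highest count, this sketch would report only one of them.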

Outlier

An outlier is a value that is drastically larger or smaller than the rest of the values in a list.

Consider the following list of numbers:

2, 3, 5, 9, 28

28 is drastically larger than the other values and is an outlier.

Outliers can be subjective or the result of a mistake in experimentation. They are often removed from datasets to better understand tendencies.

Indicators of dispersion

Range

The range of a data set is the difference between the largest and smallest value in the data set.

The following data set contains the ages of 7 students:

15, 12, 9, 13, 17, 10, 11

The highest age is 17.
The lowest age is 9.

The range of the data set is:

17 - 9 = 8
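
As a quick check, the same subtraction can be sketched in Python. The variable names are illustrative assumptions.

```python
ages = [15, 12, 9, 13, 17, 10, 11]

highest = max(ages)  # largest value in the data set
lowest = min(ages)   # smallest value in the data set
data_range = highest - lowest

print(highest, lowest, data_range)  # 17 9 8
```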

Standard deviation

Standard deviation measures how spread out numbers are from the mean value of a data set. Standard deviation is usually denoted by SD or \sigma .

\sigma =\sqrt {\dfrac {\sum\left(x_i-\overline{x}\right)^2}{N}}

x_i represents each data point.
\overline{x} represents the mean of the dataset.
N is the size of the dataset.

Consider finding the standard deviation and variance of the following data set:

22, 25, 24, 27, 28, 24

First, the mean needs to be calculated:

Mean =\overline{x} = \dfrac{22+25+24+27+28+24}{6}=\dfrac{150}{6}=25

The next step is to create a table:

Data point	Difference from the mean	Squared
x_i	x_i - \overline{x}	\left(x_i-\overline{x}\right)^2
22	-3	9
25	0	0
24	-1	1
27	2	4
28	3	9
24	-1	1

Next, the average of the squared differences needs to be calculated. It represents the variance, or \sigma^2:

\sigma^2 = \dfrac{9+0+1+4+9+1}{6}=\dfrac{24}{6}=4

Finally, the standard deviation is the square root of the variance:

\sigma=\sqrt{4}=2

Standard deviation is used to quantify dispersion in a data set or to simply find out how much the numbers in a data set differ from the mean value.
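
The steps of the worked example can be sketched in Python as follows. The variable names are illustrative assumptions. The sketch divides by N, as in the formula above, and the last lines cross-check the result with the standard library function statistics.pstdev, which uses the same convention.

```python
import math
import statistics

data = [22, 25, 24, 27, 28, 24]

# Step 1: the mean of the data set.
mean = sum(data) / len(data)

# Step 2: the squared difference from the mean for each data point.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance is the average of the squared differences.
variance = sum(squared_diffs) / len(data)

# Step 4: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 25.0 4.0 2.0

# Cross-check with the standard library (population standard deviation).
check = statistics.pstdev(data)
print(check)
```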

Variance

Variance is the square of the standard deviation:

V=\sigma^2

First and Third Quartile

The first quartile, or Q_1, is the median of the lower half of the data set. It is also called the lower quartile.
The third quartile, or Q_3, is the median of the upper half of the data set. It is also called the upper quartile.

Consider finding the first and third quartiles for the following data set:

3, 9, 13, 8, 4, 5, 5, 10, 7

First, the set needs to be sorted in order:

3, 4, 5, 5, 7, 8, 9, 10, 13

As there are 9 elements in this set, the median is found in the middle (the fifth element in the ordered set). The median is 7.

The first quartile is the median of the lower half of the set, that is, of the values that come before the median: 3, 4, 5, 5. It is the average of the two elements in the middle:

Q_1=\dfrac{4+5}{2}=4.5

The third quartile is the median of the upper half of the set, that is, of the values that come after the median: 8, 9, 10, 13. It is the average of the two elements in the middle:

Q_3=\dfrac{9+10}{2}=9.5

The median of a set is also known as the second quartile or Q_2.

The first quartile (Q_1), the median (Q_2), and the third quartile (Q_3) divide a set into four equal parts.

Interquartile Range

The interquartile range is the difference between the third and first quartile:

\text{Interquartile range}=Q_3-Q_1

In the previous example, the interquartile range is:

Q_3-Q_1=9.5-4.5 =5
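
The quartiles and the interquartile range can be sketched in Python. The function names are hypothetical, and the sketch follows the convention used in this lesson: when the number of values is odd, the median itself is left out of both halves. Statistical software may use other conventions and give slightly different quartiles.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3), leaving the median out of the halves when the count is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle element so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([3, 9, 13, 8, 4, 5, 5, 10, 7])
iqr = q3 - q1

print(q1, q2, q3)  # 4.5 7 9.5
print(iqr)         # 5.0
```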

Indicators of correlation

Correlation coefficient

The correlation coefficient indicates the extent to which two variables fluctuate together.

A positive correlation indicates the extent to which those variables increase or decrease in the same direction.
A negative correlation indicates the extent to which one variable increases as the other decreases.

Consider the following data:

X	Y
12	20
10	18
9	15
7	14
7	12
5	12
3	10

Before knowing how to calculate the exact correlation coefficient, it is possible to figure out whether it is likely to exist and whether it is a positive or a negative one. The two variables X and Y fluctuate together and as X decreases so does Y. Since they fluctuate in the same direction, the correlation is likely to exist and be a positive one.

The correlation coefficient ranges between -1 and 1. Specifically:

1 is a perfect positive correlation.
0 is no linear correlation (the values don't seem linked at all).
-1 is a perfect negative correlation.

The correlation coefficient is denoted by r and can be calculated for two variables x and y using this formula:

r_{xy}=\dfrac{\sum_{i=1}^{n}\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i-\overline{x}\right)^2\sum_{i=1}^{n}\left(y_i-\overline{y}\right)^2}}

Consider finding the correlation coefficient between the following datasets:

x = \left[ 16, 1, 8, 8, 15 \right]
y = \left[ 5, 13, 1, 4, 4 \right]

First, calculate the mean of each dataset:

\bar{x} = \dfrac{16+1+8+8+15}{5} = 9.6

\bar{y} = \dfrac{5+13+1+4+4}{5} = 5.4

Then, construct and fill in the following table:

x	y	x_i-\bar{x}	y_i-\bar{y}	\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)	\left(x_i-\bar{x}\right)^2	\left(y_i-\bar{y}\right)^2
16	5	6.4	-0.4	-2.56	40.96	0.16
1	13	-8.6	7.6	-65.36	73.96	57.76
8	1	-1.6	-4.4	7.04	2.56	19.36
8	4	-1.6	-1.4	2.24	2.56	1.96
15	4	5.4	-1.4	-7.56	29.16	1.96

Then, calculate the sum of the elements in the fifth column:

\sum_{i=1}^{n} \left(x_i-\bar{x}\right)\left( y_i-\bar{y}\right) =-66.2

Calculate the sum of the elements in the sixth column:

\sum_{i=1}^{n} \left(x_i-\bar{x}\right)^2 = 149.2

Calculate the sum of the elements in the seventh column:

\sum_{i=1}^{n} \left(y_i-\bar{y}\right)^2 = 81.2

Finally, put these values in the formula for the correlation coefficient:
r = \dfrac{-66.2}{\sqrt{149.2\times 81.2}} = -0.601
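
The table and the three sums can be reproduced with a short Python sketch. The variable names are illustrative assumptions, and the code follows the formula for r given above term by term.

```python
import math

x = [16, 1, 8, 8, 15]
y = [5, 13, 1, 4, 4]

# Means of each dataset.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of the fifth, sixth and seventh columns of the table.
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
sum_yy = sum((yi - mean_y) ** 2 for yi in y)

# Correlation coefficient.
r = sum_xy / math.sqrt(sum_xx * sum_yy)

print(round(sum_xy, 2), round(sum_xx, 2), round(sum_yy, 2))  # -66.2 149.2 81.2
print(round(r, 3))  # -0.601
```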

Correlation does not always mean causation. Two values may be correlated without one causing the other one.

The sales of hot beverages and the sales of umbrellas are likely to be positively correlated, but one does not cause the other. They are both caused by another element (the weather).

Scatter plots

Definition of a scatter plot

Scatter plot

A scatter plot is a set of points plotted on horizontal and vertical axes. It shows the relationship between two sets of data.

The following scatter plot shows age on the horizontal axis and height on the vertical axis:

Scatter plots and regression line

Regression line

A regression line is a line drawn on a scatter plot that best describes the behavior of a set of data. This line is as close to all the data points as possible and as such represents the best fit for the trend of a given set of data. It can be used to predict values that we do not have.

Here is a regression line drawn on a scatter plot that represents age and height:

Using this regression line, we can predict that a 16-year-old person will have a height of 75 inches.

There are formulas for finding regression lines of data sets, but they are lengthy and typically computed using computer software.

Certain bamboo plants have been observed to grow at surprisingly fast rates. The following table is the measured height of a bamboo plant at 5:00 pm each day over the course of a week.

Day of the week	Height of bamboo
Sunday	10cm
Monday	14cm
Tuesday	19cm
Wednesday	26cm
Thursday	31cm
Friday	36cm
Saturday	40cm

We then plot the data points on a graph with the x-axis being the days of the week and the y-axis being the height of the bamboo plant.

We then use computer software to find the line of best fit. Software will even provide the equation of the line of best fit which can be used to make predictions about future measurements.
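
As an illustration of what such software might compute, here is a minimal Python sketch for the bamboo data. It rests on two assumptions that are not stated in the lesson: the days are numbered from 0 (Sunday) to 6 (Saturday), and the line of best fit is taken to be the least-squares line, which is a common choice. The helper predict is hypothetical.

```python
# Assumption: days are numbered 0 (Sunday) to 6 (Saturday).
days = [0, 1, 2, 3, 4, 5, 6]
heights_cm = [10, 14, 19, 26, 31, 36, 40]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(heights_cm) / n

# Least-squares slope and intercept (an assumed definition of "best fit").
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, heights_cm))
s_xx = sum((x - mean_x) ** 2 for x in days)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(day):
    """Predicted height in cm for a given day number (hypothetical helper)."""
    return slope * day + intercept


print(f"height = {slope:.3f} * day + {intercept:.3f}")
print(f"predicted height on day 7: {predict(7):.1f} cm")
```

Under these assumptions the line is approximately height = 5.214 × day + 9.5, which predicts a height of about 46 cm for the following Sunday (day 7).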

A prediction made using a regression line may not be accurate. It is the best guess we can make given the previous data, but the actual value might vary.

If the regression line from the age and height example above is used to predict the height of a 40-year-old person, the result would be 160 inches. This is obviously impossible, but since the data stopped at age 11, the best prediction using the data would be that the height continues to increase at the same rate. Predictions far outside the range of the data are therefore unreliable.