Introduction
A dataset is a collection of related data that is organised and stored together. To ease understanding and description, the data is generally arranged in the form of a table where each column represents a distinct variable. You have probably come across datasets numerous times in your daily routine. For instance, the attendance register of a class is an example of a dataset. Each row in the table corresponds to an entry in the dataset, and the values of its different attributes are listed under the corresponding columns. A complete dataset contains the value of each attribute for each of its members.
In this article, we will look at the definition of a dataset, the different types of datasets, and their properties, and work through solved examples to aid understanding.
What is a Dataset?
As mentioned, a dataset is a collection of data or observations from experiments, measurements, calculations, etc., that is organised in a specific manner. This data can be of any kind, such as names, numbers, figures, or descriptions. Further, it is not necessary to present the data in tabular format; one can also present it via charts and graphs.
Generally, a single dataset groups together related values and objects. The easiest example would be the list of students in a class and their attendance record. This dataset would be organised in the form of rows and columns, where each row corresponds to a student and each column represents that student’s attendance status on a given date.
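The row-and-column layout described above can be sketched in Python. This is only an illustration: the student names, dates, and the helper days_present are hypothetical and chosen for the example.

```python
# Hypothetical attendance register: one row per student,
# one column per date holding that student's attendance status.
attendance = [
    {"student": "Asha", "2024-01-08": "present", "2024-01-09": "absent"},
    {"student": "Ben", "2024-01-08": "present", "2024-01-09": "present"},
]


def days_present(row):
    """Count the date columns marked 'present' in one student's row."""
    return sum(
        1
        for column, status in row.items()
        if column != "student" and status == "present"
    )


# Map each student to the number of days they attended.
summary = {row["student"]: days_present(row) for row in attendance}
print(summary)
```

Each dictionary plays the role of a row, and each key other than "student" plays the role of a date column.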
Types of Datasets
Since the type of data that we wish to present can vary in nature, datasets are classified into various types as follows:
- Numerical Dataset
- Bivariate Dataset
- Multivariate Dataset
- Categorical Dataset
- Correlation Dataset
We will discuss each of these types with examples.
Numerical Dataset
This is the simplest form of dataset and contains only numerical data. There are no words or pictures in a numerical dataset, so one can also say that numerical datasets contain quantitative data. Since all entries in a numerical dataset are numbers, we can easily apply arithmetic operations to any entry in the dataset. Some common examples of such a dataset would be:
- Ages of a set of people
- Number of balls played
- Number of shoes in the shoe rack
Bivariate Dataset
The word bivariate is a combination of bi, meaning two, and variate, referring to variables. That is, a bivariate dataset contains two variables, which generally have some sort of relationship with each other. Often, the value of the second variable depends on the value of the first one.
If you have ever seen a table listing the number of calories you would burn against the time you work out, that is precisely what a bivariate dataset is. Naturally, the number of calories burnt increases with the workout time.
Multivariate Dataset
A multivariate dataset is akin to a bivariate one, except that it contains more than two variables. Generally, the variables in such a dataset are functions of one or more variables and thus, each column is related to some other one.
For example, a dataset that lists the price of different dishes across different restaurants of your city is an excellent example of a multivariate dataset. Not only does the cost of dishes vary with restaurants, but if the dish in question (say pizza) contains topping, its cost would depend on the cost of that topping in that particular restaurant, leading to a complex, intricately linked dataset.
Categorical Dataset
Generally, categorical datasets contain qualitative data such as the attributes or characteristics of a person or object. If a variable in a dataset can take only one of two values, it is said to be dichotomous. On the other hand, if a variable can take one of more than two values, it is said to be polytomous. Some examples of a categorical dataset would be datasets storing a person’s hair length (short or long) or the transmission type of different cars (automatic or manual).
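As a small sketch, a categorical variable can be summarised by counting how often each category occurs. The sample values below are made up for illustration, and the helper classify is hypothetical; note that it judges the variable only by the categories that actually appear in the sample.

```python
from collections import Counter

# Hypothetical sample of car transmission types.
transmission = ["automatic", "manual", "manual", "automatic", "manual"]

# Count how many times each category appears.
counts = Counter(transmission)


def classify(values):
    """Label a categorical variable by the number of distinct observed categories."""
    distinct = len(set(values))
    if distinct == 2:
        return "dichotomous"
    if distinct > 2:
        return "polytomous"
    return "single category"


print(counts)
print(classify(transmission))
```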
Correlation Dataset
Correlation datasets contain data whose variables have a relationship with each other and thus, are interdependent. These relationships may be of the following nature:
- Positive correlation: The variation in the variables occurs in the same direction. That is, if one of the variables increases, the other one also increases and vice versa. For instance, a jogger’s distance covered can only increase with time, not decrease.
- Negative correlation: Negatively correlated variables vary in opposite directions. If one of them increases, the other one decreases. For example, the time taken to cover a distance of 10 miles is sure to decrease if the speed of the car increases.
- No correlation: It is also possible for variables to be unrelated to each other. For instance, the number of flowers in a park generally has nothing to do with the number of benches in it.
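One common way to put a number on these relationships is the Pearson correlation coefficient, which lies between -1 and 1. The article does not name a specific measure, so treat this as one possible choice. The function pearson and all sample values below are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical jogger: distance grows with time (positive correlation).
minutes = [10, 20, 30, 40]
distance_km = [1.5, 3.0, 4.5, 6.0]

# Time needed to cover 10 miles falls as speed rises (negative correlation).
speed_mph = [20, 40, 50, 100]
hours_for_10_miles = [10 / speed for speed in speed_mph]

r_positive = pearson(minutes, distance_km)
r_negative = pearson(speed_mph, hours_for_10_miles)
print(r_positive, r_negative)
```

A value near 1 indicates positive correlation, a value near -1 indicates negative correlation, and a value near 0 suggests no linear relationship.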
Mean, Median, Mode, and Range
Mean, median, mode, and range are quantities used to investigate the nature, variation, and properties of a dataset. Mean, median, and mode are often referred to as measures of central tendency, which means that they describe where the centre of a collection of values lies. We will discuss these topics here.
- Mean: Mean refers to the average value of a variable. Mathematically, the mean is calculated by dividing the sum of all the values of a variable by the number of observations. For example, if the students in a class scored {40, 50, 60, 40, 60} marks out of 100, then the average score of the class would be calculated as follows:
Mean = sum of observations / number of observations
Mean = (40+50+60+40+60) / 5 = 250/5 = 50
- Median: Median refers to the central value in a collection of values that has been sorted in ascending or descending order. The data must be sorted first, or the median calculated would be incorrect. For example, if we are given the values {1, 6, 5, 7, 2}, then in ascending order, we have {1, 2, 5, 6, 7}. Here, 5 is the value that lies in the centre of the dataset and thus, is the median.
- Mode: The entry that occurs the most frequently in a collection of values is known as the mode of that collection. For example, out of {1, 2, 6, 5, 3, 3, 4, 5, 2, 3, 7, 8, 6, 3, 6, 9, 3}, 3 occurs the most frequently and thus, is the mode.
- Range: The range is indicative of the spread of a variable. It is mathematically calculated by subtracting the smallest value from the largest one in a collection. Thus, in {1, 2, 4, 5, 6}, the range is given by R = 6-1 = 5.
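The four quantities above can be sketched as small Python functions. The function names are chosen for this example only. Two choices go beyond the text and are common conventions rather than part of the article: for an even number of values the median is taken as the average of the two middle values, and modes returns every value tied for the highest frequency.

```python
from collections import Counter


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Central value of the sorted data."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    # Assumed convention for an even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


# The examples used in the list above.
print(mean([40, 50, 60, 40, 60]))
print(median([1, 6, 5, 7, 2]))
print(modes([1, 2, 6, 5, 3, 3, 4, 5, 2, 3, 7, 8, 6, 3, 6, 9, 3]))
print(value_range([1, 2, 4, 5, 6]))
```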
Properties
There are a number of properties related to data analysis that can help us understand the dataset in question. Depending on these properties, we can choose the best method of statistical analysis to be applied. These properties are analysed via a process known as exploratory data analysis (EDA) and some of them are listed here:
- The centre of the data.
- The skewness of data.
- Distance between data members.
- Presence or absence of outliers.
- Correlation between variables.
- The probability distribution type of the variable.
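A couple of these properties can be checked with a short sketch. The sample data is hypothetical. The outlier test is the 1.5 × IQR rule, and the skew hint is a comparison of mean and median. Both are common rules of thumb assumed here for illustration, not methods prescribed by the article.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 3, 3, 7, 9, 11, 15, 60]

# Centre of the data.
centre_mean = statistics.mean(data)
centre_median = statistics.median(data)

# Rough skew hint: a mean well above the median suggests a long right tail.
skew_hint = "right" if centre_mean > centre_median else "left or none"

# Outliers by the 1.5 * IQR rule of thumb.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(centre_mean, centre_median, skew_hint, outliers)
```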
Examples
Determine the mean, median, mode, and range of {1, 11, 7, 3, 9, 3, 15}.
It is generally good practice to sort the data in ascending or descending order from the outset. Thus, we have {1, 3, 3, 7, 9, 11, 15}. Now, we have
Mean = sum of all observations / number of observations.
Mean = (1+3+3+7+9+11+15) / 7 = 49/7 = 7.
Median = 4th of the 7 sorted values = 7.
Mode = most frequent value = 3.
Range = highest value - lowest value = 15 - 1 = 14.
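The worked example can be cross-checked with Python's standard statistics module. This is just a quick verification sketch; the variable names are chosen for illustration.

```python
import statistics

data = [1, 11, 7, 3, 9, 3, 15]

result = {
    "mean": statistics.mean(data),      # 49 / 7
    "median": statistics.median(data),  # middle of the sorted values
    "mode": statistics.mode(data),      # most frequent value
    "range": max(data) - min(data),     # 15 - 1
}
print(result)
```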
Summary
A dataset is a collection of data organised in a specific manner, and it is classified into various types such as numerical, bivariate, multivariate, categorical, and correlation datasets. Datasets are studied via quantities such as mean, median, mode, and range.
Mean refers to the average value of a dataset. Median is the central value after the dataset has been arranged in ascending or descending order. Mode is the most frequently occurring value, while range describes the spread of the data, measured as the difference between the highest and lowest values.
Frequently Asked Questions
1. What do you mean by a dataset?
A dataset is simply a collection of data organised in a specific manner, for example in a table.
2. Is the range also a measure of central tendency?
No. Mean, median, and mode are the true measures of central tendency. Range is a useful value, but it isn’t classified as a central tendency measure.
3. How would you find the mode of a dataset which contains no distinct values, i.e., all values are the same?
For such a dataset, the mode would be this repeating value itself. Note that if no value in the dataset repeats, then the mode would be undefined.