How to Calculate Descriptive Statistics in Pandas?


To calculate descriptive statistics in Pandas, you can use various functions provided by the library. Here are some commonly used functions:

  1. Mean: You can calculate the mean of a column using the mean() function. It computes the average of the values in the column.
  2. Median: The median can be calculated using the median() function. It gives the middle value of a dataset when ordered.
  3. Mode: To find the mode, use the mode() function. It returns the most common value(s) in a dataset.
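  4. Standard Deviation: The std() function calculates the standard deviation, which measures the amount of variation or dispersion in the data. By default, Pandas computes the sample standard deviation (ddof=1).
  5. Variance: You can compute the variance using the var() function. It is the square of the standard deviation and indicates how spread out the dataset is. Like std(), it uses ddof=1 by default.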
  6. Minimum and Maximum: To find the minimum and maximum values in a column, use the min() and max() functions, respectively.
  7. Quartiles: The quantile() function can be used to calculate quartiles. It gives the values that separate the data into quarters (25th percentile, median, and 75th percentile).
  8. Count: To count the number of non-null values in a column, you can use the count() function.
  9. Sum: The sum() function calculates the sum of the values in a column.
  10. Range: Range can be derived by subtracting the minimum value from the maximum value.


These functions can be applied to individual columns or entire dataframes. By utilizing them, you can gain valuable insights into your data and understand its central tendency, dispersion, and distribution.
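As a minimal sketch of the functions listed above, the following applies each one to a single column. The DataFrame and its "score" column are hypothetical sample data chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"score": [2, 4, 4, 5, 7, 9, 11]})
col = df["score"]

mean_value = col.mean()                        # arithmetic average
median_value = col.median()                    # middle value of the sorted data
mode_values = col.mode().tolist()              # most common value(s), as a list
std_value = col.std()                          # sample standard deviation (ddof=1)
var_value = col.var()                          # sample variance (ddof=1)
quartiles = col.quantile([0.25, 0.5, 0.75]).tolist()
count_value = col.count()                      # number of non-null values
sum_value = col.sum()
range_value = col.max() - col.min()            # range = max - min

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_values)
print("std:", std_value)
print("variance:", var_value)
print("quartiles:", quartiles)
print("count:", count_value)
print("sum:", sum_value)
print("range:", range_value)
```

Calling df.describe() returns several of these statistics (count, mean, std, min, quartiles, max) in a single table, which can be a convenient first look at a dataset.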


How to calculate the sum in Pandas?

In Pandas, you can use the sum() method to calculate the sum of values in a particular column or across the entire DataFrame. Here are some examples:

  1. Sum of values in a single column:

df['column_name'].sum()

  2. Sum of each column (one result per column):

df.sum()

  3. Sum of values in each row:

df.sum(axis=1)


Note that sum() does not convert non-numeric values to numbers. A column of strings is concatenated rather than added, and a column that mixes strings and numbers raises a TypeError. To avoid this, select only the columns you need or pass numeric_only=True so that only numeric columns are summed.


You can also control the axis along which the sum is calculated by changing the axis parameter. By default, axis=0, which sums values vertically (column-wise). Setting axis=1 sums values horizontally (row-wise).


Missing (NaN) values are controlled by the skipna parameter. By default, skipna=True, so NaNs are ignored. If you set skipna=False, any NaN in a column (or row) makes the corresponding sum NaN.
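The effect of skipna, and of restricting a sum to numeric columns, can be sketched as follows. The column names and values are hypothetical; numeric_only is a parameter of DataFrame.sum() that limits the calculation to numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column with a missing value, one text column
df = pd.DataFrame({
    "qty": [1.0, np.nan, 3.0],
    "price": [10.0, 20.0, 30.0],
    "label": ["a", "b", "c"],
})

total_skip = df["qty"].sum()                # NaN is skipped by default
total_strict = df["qty"].sum(skipna=False)  # NaN propagates into the result
numeric_sums = df.sum(numeric_only=True)    # the text column "label" is left out

print("skipna=True: ", total_skip)
print("skipna=False:", total_strict)
print(numeric_sums)
```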


Here's an example usage:

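import pandas as pd

# Build a small DataFrame with two numeric columns
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Sum of a single column (returns a scalar)
column_sum = df['A'].sum()
print("Sum of column 'A':", column_sum)

# Sum of each row (axis=1 adds across the columns)
row_sum = df.sum(axis=1)
print("Sum of each row:\n", row_sum)

# Sum of each column (the default, axis=0)
whole_df_sum = df.sum()
print("Sum across all columns:\n", whole_df_sum)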


Output:

Sum of column 'A': 15
Sum of each row:
 0    11
1    22
2    33
3    44
4    55
dtype: int64
Sum across all columns:
 A     15
B    150
dtype: int64



How to calculate the covariance in Pandas?

To calculate the covariance in pandas, you can use the cov() function. This function is available for a pandas DataFrame or Series object.


Here's an example using a DataFrame:

import pandas as pd

# Create a DataFrame
data = {'x': [1, 2, 3, 4, 5],
        'y': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Calculate covariance
covariance = df['x'].cov(df['y'])
print("Covariance:", covariance)


Output:

Covariance: -2.5


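In this example, we calculate the covariance between the 'x' and 'y' columns of the DataFrame df. Series.cov() returns the sample covariance between two Series, normalized by N-1. The result is negative because 'y' decreases as 'x' increases. Calling cov() on a DataFrame instead returns the covariance matrix of all pairs of numeric columns.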
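As a short sketch using the same data, DataFrame.cov() returns the pairwise covariance matrix, and the value can be cross-checked by hand with the sample covariance formula (division by N-1). The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [5, 4, 3, 2, 1]})

# Pairwise covariance matrix of all numeric columns
cov_matrix = df.cov()
print(cov_matrix)

# Manual check: sum of products of deviations, divided by N - 1
n = len(df)
dev_x = df["x"] - df["x"].mean()
dev_y = df["y"] - df["y"].mean()
manual_cov = (dev_x * dev_y).sum() / (n - 1)
print("Manual covariance:", manual_cov)
```

The diagonal of the matrix holds each column's variance, and the off-diagonal entries hold the covariance between the two columns.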


What is the difference between mean and median in descriptive statistics?

The mean and median are both measures of central tendency in descriptive statistics, but they differ in how they represent the data.

  1. Mean: The mean is the average of a set of numbers. It is calculated by summing up all the values in the dataset and dividing it by the total number of observations. The mean is sensitive to extreme values or outliers in the dataset, as it takes into account every value in the calculation. It is often used to represent a typical or average value in a dataset.
  2. Median: The median is the middle value in a sorted dataset, separating the upper half from the lower half. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values. The median is robust to extreme values or outliers, because it depends on the position of values within the sorted data rather than on their magnitude. It is often used when the dataset is skewed or contains outliers.


In summary, the mean represents the average value of the dataset and is pulled toward outliers, while the median represents the middle value and is largely unaffected by them.
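A small sketch makes the difference concrete. The two Series below are hypothetical and differ only in their last value, which is replaced by an outlier.

```python
import pandas as pd

# Hypothetical values; the second Series swaps one value for an outlier
values = pd.Series([10, 12, 11, 13, 14])
with_outlier = pd.Series([10, 12, 11, 13, 1000])

mean_clean = values.mean()
median_clean = values.median()
mean_outlier = with_outlier.mean()
median_outlier = with_outlier.median()

print("Without outlier: mean =", mean_clean, "median =", median_clean)
print("With outlier:    mean =", mean_outlier, "median =", median_outlier)
```

With these values, the single outlier moves the mean from 12.0 to 209.2, while the median stays at 12.0.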


How to calculate the mean in Pandas?

In Pandas, you can calculate the mean of a DataFrame or Series using the mean() function. Here are the steps to calculate the mean:

  1. Import the pandas library:

import pandas as pd

  2. Create a DataFrame or Series:

data = {'Col1': [1, 2, 3, 4, 5], 'Col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

  3. Use the mean() function to calculate the mean:

mean_value = df.mean()

  4. Print the mean value:

print(mean_value)


When called on a DataFrame, mean() returns the mean of each column. To calculate the mean of a specific column, select that column first and then call mean() on the resulting Series:

mean_value = df['Col1'].mean()
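Putting the steps together, here is a minimal runnable sketch using the same data. The row-wise mean via axis=1 is an addition not covered in the steps above, included to show that mean() accepts the same axis parameter as sum().

```python
import pandas as pd

data = {'Col1': [1, 2, 3, 4, 5], 'Col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

column_means = df.mean()        # one mean per column
col1_mean = df['Col1'].mean()   # mean of a single column (a scalar)
row_means = df.mean(axis=1)     # one mean per row

print(column_means)
print("Mean of Col1:", col1_mean)
print(row_means)
```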

