Tutorial: Interactive Chi-Squared Test Using Python

Daniele Calixto Barros
Mar 14, 2024
5 min read

A chi-square test is a statistical test used to compare observed results with expected results. There are two commonly used chi-squared tests:

The chi-square goodness of fit test is used to test whether the frequency distribution of a categorical variable is different from your expectations.
The chi-square test of independence is used to test whether two categorical variables are related to each other.
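
As a quick contrast with the test of independence used in the rest of this article, here is a minimal sketch of a goodness of fit test with SciPy's chisquare function. The counts are hypothetical, and the expected distribution is assumed to be uniform (SciPy's default when no expected frequencies are passed).

```python
from scipy.stats import chisquare

# Hypothetical counts of Yes/No answers from 100 respondents.
observed = [45, 55]

# With no f_exp argument, chisquare assumes all categories are equally likely.
stat, p = chisquare(observed)
print(f"stat={stat:.2f}, p={p:.4f}")
```

Here a single variable is compared against an expected distribution, while the test of independence compares two variables against each other.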

In this article, we will perform Pearson's chi-square test of independence using Python's SciPy library. This is a complete tutorial that also covers setting up an interactive visual interface with Streamlit, a Python framework for data apps.

Dataset

You can choose any dataset you want, but we'll use an example: the public Kaggle dataset of Turkey Political Opinions. Let's explore its features, with the questions and their possible answers (where they are defined), in Turkish and English:

Cinsiyet: Sex Feature
- Erkek = Male
- Kadın = Female
Yas : Age Feature
Bolge : Areas inhabited in Turkey
Egitim : Education Level
- Egitim = Education
- İlkokul = Primary School
- OrtaOkul = Junior High School
- Lise = High School
- Lisans = University
- Lisans Üstü = MA
Soru1/Question1: Do you think our Economic Status is good?
- Evet = Yes
- Hayır = No
Soru2/Question2: Need Reform in Education?
- Evet = Yes
- Hayır = No
Soru3/Question3: Resolve Privatization Are You?
- Evet = Yes
- Hayır = No
Soru4/Question4: Should the state use a penalty like death penalty for certain crimes?
- Evet = Yes
- Hayır = No
Soru5/Question5: Do you find our journalists neutral enough?
- Evet = Yes
- Hayır = No
Soru6/Question6: From 22:00 am Then Are You Supporting the Prohibition to Buy Drinks?
- Evet = Yes
- Hayır = No
Soru7/Question7: Do You Want to Live in a Secular State?
- Evet = Yes
- Hayır = No
Soru8/Question8: Are you supporting the abortion ban?
- Evet = Yes
- Hayır = No
Soru9/Question9: Do you think that the extraordinary state (Ohal) restricts Freedoms?
- Evet = Yes
- Hayır = No
Soru10/Question10: Would you like a new part of the parliament to enter?
- Evet = Yes
- Hayır = No
Parti : Political View

We're only going to use qualitative variables in this case. The only quantitative variable is age, which could be grouped into bands for the analysis, but we won't go into that in this tutorial. The aim is to analyze the independence of several pairs of variables, which can be chosen by the user.

Chi-Squared Test

The point of this chi-square test is to conclude whether or not two variables are related to each other. We will identify these two variables in the code as v1 and v2.

We start by defining our null hypothesis, which states that there is no relationship between the variables. The alternative hypothesis states that there is a significant relationship between the two variables. The result will be interpreted using the p-value.

We will set a significance level, or alpha value, of 0.05. This alpha value is the probability of erroneously rejecting the null hypothesis when it is true. If the p-value of the test is less than or equal to alpha, we reject the null hypothesis. If it is strictly greater than alpha, we fail to reject the null hypothesis, which is not the same as proving that the variables are independent.
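
To make the decision rule concrete, here is a minimal sketch that computes Pearson's statistic by hand with the standard library only. The 2x2 table of counts is hypothetical, and the closed-form p-value used here is valid only for 1 degree of freedom.

```python
import math

# Hypothetical 2x2 table of observed counts (rows: groups, columns: Yes/No).
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's statistic: sum of (observed - expected)^2 / expected over all cells.
stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For 1 degree of freedom only, the p-value has a closed form via erfc.
p = math.erfc(math.sqrt(stat / 2))

alpha = 0.05
print(f"stat={stat:.2f}, dof={dof}, p={p:.4f}")
print("reject H0" if p <= alpha else "fail to reject H0")
```

For this table every expected count is 25, the statistic is 4.0, and the p-value is about 0.0455, so the null hypothesis is rejected at alpha = 0.05. Note that this sketch applies no continuity correction, so SciPy's default result for a 2x2 table will differ slightly.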

If you want to understand more of the chi-squared test, read this article. Doing this with Python will make it simple and fast.

Let's start with SciPy. First, install:

pip install scipy

Then, install Pandas. We're going to use it for data manipulation.

pip install pandas

Now you need to make the csv file available in the code directory. So, let's code!

Import dependencies:

import pandas as pd
from scipy.stats import chi2_contingency

Transform the csv into a pandas Data Frame. Remember to insert the path of your dataset file:

file_path = 'your_file_path.csv'
df = pd.read_csv(file_path)

Use Pandas to build a frequency table between two variables. When we get to the visual interface, the user will pick the variables. For now, we set them to empty strings as placeholders (replace them with real column names if you want to run this step on its own):

v1 = ''
v2 = ''
frequency_table = df.groupby(v1)[v2].value_counts()

Transform the frequency table into a 2D array, to enable us to use SciPy:

array_2d = frequency_table.unstack().fillna(0).values
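
The two steps above can be tried on a tiny table first. This is a sketch on hypothetical rows (not the Kaggle data) that shows why fillna(0) is needed: a combination that never occurs becomes NaN after unstack. It also compares the result with pd.crosstab, which is an alternative way to build the same table and not the approach used in this article.

```python
import pandas as pd

# Hypothetical miniature survey; "İlkokul" never answers "Evet".
toy = pd.DataFrame({
    "Egitim": ["Lise", "Lise", "Lisans", "Lisans", "İlkokul"],
    "soru1":  ["Evet", "Hayır", "Evet", "Hayır", "Hayır"],
})

# Same steps as in the article: counts per (Egitim, soru1) pair...
frequency_table = toy.groupby("Egitim")["soru1"].value_counts()

# ...pivoted so that rows are Egitim and columns are soru1.
# The missing (İlkokul, Evet) pair is NaN until fillna(0) replaces it.
table = frequency_table.unstack().fillna(0)
array_2d = table.values

# Alternative: crosstab builds the contingency table in one call.
crosstab = pd.crosstab(toy["Egitim"], toy["soru1"])

print(table)
print(crosstab)
```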

Use chi2_contingency of SciPy:

stat, p, dof, expected = chi2_contingency(array_2d)
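
The four returned values are the test statistic, the p-value, the degrees of freedom, and the table of expected frequencies. One detail worth knowing, since most questions in this dataset are Yes/No: for a 2x2 table (1 degree of freedom), chi2_contingency applies Yates' continuity correction by default. The sketch below uses hypothetical counts to show the effect of the correction parameter.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts.
observed = [[30, 20],
            [20, 30]]

# Default behaviour: Yates' continuity correction is applied when dof == 1.
stat, p, dof, expected = chi2_contingency(observed)

# Same test without the correction (plain Pearson statistic).
stat_nc, p_nc, _, _ = chi2_contingency(observed, correction=False)

print(f"with correction:    stat={stat:.2f}, p={p:.4f}, dof={dof}")
print(f"without correction: stat={stat_nc:.2f}, p={p_nc:.4f}")
print(expected)
```

With these counts the corrected p-value is above 0.05 and the uncorrected one is below it, so the choice can change the conclusion for borderline tables.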

Now, we set the alpha value and compare it with the p-value:

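alpha = 0.05
if p <= alpha:
    # Dependent: reject the null hypothesis
    print("Dependent")
else:
    # Independent: fail to reject the null hypothesis
    print("Independent")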
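
Before moving to the interface, it can help to bundle the steps into one function. This is only a sketch: the function name run_independence_test and the toy data are hypothetical, and the body simply repeats the steps shown above.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def run_independence_test(df, v1, v2, alpha=0.05):
    """Hypothetical helper: chi-squared test of independence for two columns."""
    frequency_table = df.groupby(v1)[v2].value_counts()
    array_2d = frequency_table.unstack().fillna(0).values
    stat, p, dof, expected = chi2_contingency(array_2d)
    return {"stat": stat, "p": p, "dof": dof, "dependent": bool(p <= alpha)}


# Hypothetical data: 50 respondents per group with opposite answer patterns.
toy = pd.DataFrame({
    "Cinsiyet": ["Erkek"] * 50 + ["Kadın"] * 50,
    "soru1": ["Evet"] * 40 + ["Hayır"] * 10 + ["Evet"] * 10 + ["Hayır"] * 40,
})

result = run_independence_test(toy, "Cinsiyet", "soru1")
print(result)
```

Keeping the test in a function like this means the interface code only has to pass in the two chosen column names.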

Interactive Interface

The chi-squared test is ready, so now we can build the data app. This example uses the Turkey Political Opinions dataset.

First, install Streamlit:

pip install streamlit

Create a file app.py in your directory and import Streamlit:

import streamlit as st

Now copy and paste your chi-squared code into app.py. Then write instructions for your users: they need to understand the dataset to decide which variables to relate. I'll give an example, but you can do this any way you want.

st.title('Turkey Political Opinions Chi-Squared Test')

st.write('The dataset used in this application comes from a survey that reveals the point of view of Turkish people towards political events. These are the features:')
features_data = {
    "Cinsiyet/Gender": {"Erkek": "Male", "Kadın": "Female"},
    "Egitim/Education Level": {"İlkokul": "Primary School", "OrtaOkul": "Junior High School", "Lise": "High School", "Lisans": "University", "Lisans Üstü": "MA"},
    "Soru1/Question1: Do you think our Economic Status is good?": {"Evet": "Yes", "Hayır": "No"},
    "Soru2/Question2: Need Reform in Education?": {"Evet": "Yes", "Hayır": "No"},
    "Soru3/Question3: Resolve Privatization Are You?": {"Evet": "Yes", "Hayır": "No"},
    "Soru4/Question4: Should the state use a penalty like death penalty for certain crimes?": {"Evet": "Yes", "Hayır": "No"},
    "Soru5/Question5: Do you find our journalists neutral enough?": {"Evet": "Yes", "Hayır": "No"},
    "Soru6/Question6: From 22:00 am Then Are You Supporting the Prohibition to Buy Drinks?": {"Evet": "Yes", "Hayır": "No"},
    "Soru7/Question7: Do You Want to Live in a Secular State?": {"Evet": "Yes", "Hayır": "No"},
    "Soru8/Question8: Are you supporting the abortion ban?": {"Evet": "Yes", "Hayır": "No"},
    "Soru9/Question9: Do you think that the extraordinary state (Ohal) restricts Freedoms?": {"Evet": "Yes", "Hayır": "No"},
    "Soru10/Question10: Would you like a new part of the parliament to enter?": {"Evet": "Yes", "Hayır": "No"},
}

def show_list_items(features_data):
    for key, value in features_data.items():
        if isinstance(value, dict):
            st.subheader(key)
            for k, v in value.items():
                st.write(f"- {k}: {v}")
        else:
            st.write(f"{key}: {value}")

show_list_items(features_data)

Define the options for v1 and v2, create a select box for each, and store the selections in variables.

st.subheader("Chi-Squared Test")
options = ['Cinsiyet', 'Egitim', 'soru1', 'soru2', 'soru3', 'soru4', 'soru5', 'soru6', 'soru7', 'soru8', 'soru9', 'soru10']
st.write("Choose two different features to conclude whether or not they are related to each other.")
v1 = st.selectbox("Choose the first feature:", options)
v2 = st.selectbox("Choose the second feature:", options)

Make sure that the chosen variables are different, create a button and insert chi-squared test code:

if v1 == v2:
    st.write("Choose different variables.")
else:
    if st.button('TEST'):
        # chi-squared test code goes here

Remove these placeholder assignments from the pasted code, since v1 and v2 now come from the select boxes:

v1 = ''
v2 = ''

Show the p-value to your user and come back to the comparison code to insert the interpretation:

        st.write("p-value result:")
        st.write(p)

        alpha = 0.05
        print("p value is " + str(p))  # debug output in the console
        if p <= alpha:
            st.write("It means that ", v1, " is **DEPENDENT** on ", v2, " and the null hypothesis is rejected.")
        else:
            st.write("It means that ", v1, " is **INDEPENDENT** of ", v2, " and we fail to reject the null hypothesis.")
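
One optional design choice is to keep the interpretation in a plain function so it can be checked without running the app. The sketch below is an assumption about how that could look: interpret_result is a hypothetical name, and the returned string would be passed to st.write in app.py.

```python
def interpret_result(p, v1, v2, alpha=0.05):
    """Hypothetical helper: turn a p-value into the message shown to the user."""
    if p <= alpha:
        return f"{v1} is DEPENDENT on {v2}: the null hypothesis is rejected."
    return f"{v1} is INDEPENDENT of {v2}: we fail to reject the null hypothesis."


# Example calls with made-up p-values.
print(interpret_result(0.01, "Cinsiyet", "soru1"))
print(interpret_result(0.40, "Cinsiyet", "soru1"))
```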

Final Interface

Once you've created your app.py, the easiest way to run it is from your console:

streamlit run app.py

And these are some excerpts from the result:

Suggestions

You can access the complete code on GitHub. Feel free to clone it and make improvements. Here are two suggestions:

Adapt the code for age ranges;
Adapt the code for any dataset upload.
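
As a starting point for the first suggestion, here is a minimal sketch of grouping ages into bands with pd.cut so that age can be used as a categorical variable. The bin edges, the labels, and the Yas_band column name are all assumptions made for illustration, and the ages are hypothetical.

```python
import pandas as pd

# Hypothetical ages; in the real dataset this would be the Yas column.
toy = pd.DataFrame({"Yas": [19, 25, 34, 47, 52, 68]})

# Assumed bands: (17, 30], (30, 50], (50, 120]. Intervals are closed on the right.
bins = [17, 30, 50, 120]
labels = ["18-30", "31-50", "51+"]

# Yas_band is a hypothetical new column holding the age band of each row.
toy["Yas_band"] = pd.cut(toy["Yas"], bins=bins, labels=labels)
print(toy)
```

The new column could then be added to the list of options in the select boxes like any other categorical feature.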

I am Daniele Calixto, Data Scientist from Brazil. I hope it was useful. See you next time!

For more, see my portfolio.