The dataset comes from the National Center for Education Statistics. In this article, I will demonstrate some data cleaning and dataframe manipulation techniques. Hope this helps!
For everything you may need, visit:
Source Data
nces-ed-attainment.csv
National Center for Education.py

The original dataset is titled: Percentage of persons 25 to 29 years old with selected levels of educational attainment, by race/ethnicity and sex: Selected years, 1920 through 2018. The cleaned version has columns for Year, Sex, Educational Attainment, and race/ethnicity categories considered in the dataset. Note that not all columns will have data starting at 1920.

Exploring the data

# packages needed
import pandas as pd
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('nces-ed-attainment.csv')
df.head()
Year Sex Min degree Total White Black Hispanic Asian Pacific Islander American Indian/Alaska Native Two or more races
0 1920 A high school --- 22.0 6.3 --- --- --- --- ---
1 1940 A high school 38.1 41.2 12.3 --- --- --- --- ---
2 1950 A high school 52.8 56.3 23.6 --- --- --- --- ---
3 1960 A high school 60.7 63.7 38.6 --- --- --- --- ---
4 1970 A high school 75.4 77.8 58.4 --- --- --- --- ---
There are so many "---" in dataset. We should convert it into null values for further manipulation.
df = df.replace('---', np.nan)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           214 non-null    int64  
 1   Sex                            214 non-null    object 
 2   Min degree                     214 non-null    object 
 3   Total                          214 non-null    object 
 4   White                          214 non-null    float64
 5   Black                          214 non-null    float64
 6   Hispanic                       214 non-null    object 
 7   Asian                          214 non-null    object 
 8   Pacific Islander               214 non-null    object 
 9   American Indian/Alaska Native  214 non-null    object 
 10  Two or more races              214 non-null    object 
dtypes: float64(2), int64(1), object(8)
memory usage: 18.5+ KB

Several columns have wrong type.

race = [
    "Total",
    "Hispanic",
    "Asian",
    "Pacific Islander",
    "American Indian/Alaska Native",
    "Two or more races",
]
df[race] = df[race].apply(pd.to_numeric)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           214 non-null    int64  
 1   Sex                            214 non-null    object 
 2   Min degree                     214 non-null    object 
 3   Total                          212 non-null    float64
 4   White                          214 non-null    float64
 5   Black                          214 non-null    float64
 6   Hispanic                       204 non-null    float64
 7   Asian                          168 non-null    float64
 8   Pacific Islander               93 non-null     float64
 9   American Indian/Alaska Native  131 non-null    float64
 10  Two or more races              157 non-null    float64
dtypes: float64(8), int64(1), object(2)
memory usage: 18.5+ KB

Have a look at some indicators of each column.

df.describe(include="all").T
count unique top freq mean std min 25% 50% 75% max
Year 214.0 NaN NaN NaN 2005.48 15.6 1920.0 2005.0 2010.0 2014.0 2018.0
Sex 214 3 A 78 NaN NaN NaN NaN NaN NaN NaN
Min degree 214 4 high school 59 NaN NaN NaN NaN NaN NaN NaN
Total 212.0 NaN NaN NaN 42.77 30.15 4.1 22.12 36.1 84.4 94.0
White 214.0 NaN NaN NaN 47.21 31.32 4.5 22.3 42.65 85.9 96.4
Black 214.0 NaN NaN NaN 35.19 32.36 1.1 10.12 23.2 70.62 93.5
Hispanic 204.0 NaN NaN NaN 27.68 26.85 0.6 7.6 16.6 57.02 87.2
Asian 168.0 NaN NaN NaN 62.02 27.03 15.0 46.55 66.0 81.05 98.5
Pacific Islander 93.0 NaN NaN NaN 50.79 33.24 10.0 24.0 34.0 90.7 100.0
American Indian/Alaska Native 131.0 NaN NaN NaN 42.67 32.1 2.1 16.55 24.4 83.05 95.1
Two or more races 157.0 NaN NaN NaN 44.46 31.58 2.9 25.7 34.7 87.9 98.2

Visulization

Correlation Matrix Heatmap

cormat = df[race].corr().round(2)
sns.heatmap(cormat, annot=True)

Enrollment Date Distribution

f, axes = plt.subplots(4, 2, sharey=False, figsize=(15, 12))
for ind, val in enumerate(race):
    sns.histplot(
        data=df,
        x=val,
        hue="Min degree",
        label="100% Equities",
        kde=True,
        stat="density",
        linewidth=0,
        ax=axes[ind // 2, ind % 2],
        bins=200,
    ).set(title=val)
f.tight_layout(pad=3.0)
plt.show()


For Black, Hispanic, Pacific Islander people and people with two or more races, the enrollment rate is low for most of the past time. For Asian and White people, the overall enrollment rate is high for most of the past time.

f, axes = plt.subplots(4, 2, sharey=False, figsize=(14, 14))
for ind, val in enumerate(race):
    sns.lineplot(
        data=df, x="Year", y=val, hue="Min degree", ci=None, ax=axes[ind // 2, ind % 2]
    )


The Black and White people has low enrollment rate at the 1920s to 1970s, but the enrollments increased dramatically. For Black, White and Hispanic people, the enrollment rates for the four degrees has been increasing overtime. The enrollment rates of the four degrees keeps quite static since 2006 for Asian, Pacific Islander, American Indian/Alaska Native and people with two or more races.

Questions

Q1

What are the percent of different degrees completed for a given year range and sex? Parameter arguments are as follows: two year arguments, and a value for sex (’A’, ’F’, or ’M’). Function should return all rows of the data which match the given sex, and have data between the given years (inclusive for the start, exclusive for the end). If no data is found for the parameters, return Python keyword None.

def completion_bet_years(dataframe, year1, year2, sex):
    """ Return percent of different degrees completed between year1 and year2 for Sex==sex.

    Args:
    dataframe (DataFrame): A dataframe containing the needed data
    year1     (int)      : Year number: the  earlier one
    year2     (int)      : Year number: the  later one
    sex       (str)      : Gender

    Returns:
    DataFrame: The percent of degrees completed between year1 and year2 for Sex==sex.
    """
    result = dataframe.loc[
        (dataframe["Sex"] == sex)
        & (dataframe["Year"] >= year1)
        & (dataframe["Year"] < year2)
    ]
    if result.shape[0] == 0:
        return None
    else:
        return result

print(completion_bet_years(df, 1920, 1941, "A"))
    Year Sex   Min degree  Total  White  Black  Hispanic  Asian  \
0   1920   A  high school    NaN   22.0    6.3       NaN    NaN   
1   1940   A  high school   38.1   41.2   12.3       NaN    NaN   
39  1920   A   bachelor's    NaN    4.5    1.2       NaN    NaN   
40  1940   A   bachelor's    5.9    6.4    1.6       NaN    NaN   

    Pacific Islander  American Indian/Alaska Native  Two or more races  
0                NaN                            NaN                NaN  
1                NaN                            NaN                NaN  
39               NaN                            NaN                NaN  
40               NaN                            NaN                NaN  

Q2

What were the percentages for women vs men having earned a Bachelor’s Degree in a given year? Parameter list argument is the year in question and return the percentages as a tuple: (% for men, % for women)

def compare_bachelors_in_year(dataframe, year):
    """ Return the percentages for women vs men having earned a Bachelor’s Degree in Year==year

    Args:
    df   (Dataframe): A dataframe containing the needed data
    year (int)      : The year in which you want to know the info

    Returns:
    tuple: A tuple returns the percentage of Bachelor’s Degree male and femal in Year==year.
    """
    women = dataframe[
        (dataframe["Sex"] == "F")
        & (dataframe["Min degree"] == "bachelor's")
        & (dataframe["Year"] == year)
    ].iloc[0]["Total"]
    men = dataframe[
        (dataframe["Sex"] == "M")
        & (dataframe["Min degree"] == "bachelor's")
        & (dataframe["Year"] == year)
    ].iloc[0]["Total"]
    return (f"{men} % for men", f"{women} % for women")

compare_bachelors_in_year(df, 2010)
('27.8 % for men', '35.7 % for women')

Q3

What were the two most commonly awarded levels of educational attainment awarded between 2000-2010 (inclusive)? Use the mean percent over the years to compare the education levels. Return a list as follows: [#1 level, mean % of #1 level, #2 level, mean % of #2 level].

def top_2_2000s(dataframe):
    """Return the two most common educational attainment between 2000-2010.

    Args:
    DataFrame (DataFrame): A dataframe containing the needed data

    Returns:
    list: A list returns the two most common educational attainment between 2000-2010.
    """
    df_2000s = dataframe.loc[
        (dataframe["Year"] >= 2000)
        & (dataframe["Year"] <= 2010)
        & (dataframe["Sex"] == "A")
    ][["Total", "Min degree"]]
    mean_percent_edu = (
        df_2000s.groupby("Min degree").mean().sort_values(by="Total", ascending=False)
    )
    level1 = mean_percent_edu.iloc[0]["Total"].round(2)
    level2 = mean_percent_edu.iloc[1]["Total"].round(2)
    edu1 = mean_percent_edu.index[0]
    edu2 = mean_percent_edu.index[1]
    return [f"#1 level, {level1} % of {edu1}", f"#2 level, {level2} % of {edu2}"]

top_2_2000s(df)

['#1 level, 87.557 % of high school', "#2 level, 38.757 % of associate's"]