Python Scripts For Beginners: Data Handling


A collection of 100 beginner-friendly Python scripts for data handling, each built around a fundamental task with a short, practical example.

100 Python Scripts For Beginners: Data Handling

1. Reading a CSV File

import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())

Reads a CSV file into a DataFrame and prints the first few rows.


2. Writing Data to a CSV File

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)

Creates a DataFrame from a dictionary and writes it to a CSV file.


3. Filtering Rows in a DataFrame

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[df['Age'] > 30]
print(filtered_df)

Filters rows where the ‘Age’ column is greater than 30.


4. Sorting Data in a DataFrame

import pandas as pd

df = pd.read_csv('data.csv')
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

Sorts the DataFrame by the ‘Age’ column in descending order.


5. Aggregating Data

import pandas as pd

df = pd.read_csv('data.csv')
mean_age = df['Age'].mean()
print(f"Mean Age: {mean_age}")

Calculates the mean age from the ‘Age’ column.


6. Merging Two DataFrames

import pandas as pd

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)

Merges two DataFrames on the ‘ID’ column.


7. Concatenating DataFrames

import pandas as pd

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)

Concatenates two DataFrames vertically.


8. Handling Missing Data

import pandas as pd

df = pd.read_csv('data.csv')
df = df.ffill()  # forward-fill missing values (fillna(method='ffill') is deprecated)
print(df)

Fills missing values using the forward-fill method.


9. Dropping Missing Data

import pandas as pd

df = pd.read_csv('data.csv')
df = df.dropna()
print(df)

Drops rows with any missing values.


10. Renaming Columns

import pandas as pd

df = pd.read_csv('data.csv')
df = df.rename(columns={'old_name': 'new_name'})
print(df)

Renames columns in the DataFrame.


11. Changing Data Types

import pandas as pd

df = pd.read_csv('data.csv')
df['Age'] = df['Age'].astype(int)
print(df.dtypes)

Changes the data type of the ‘Age’ column to integer.


12. Removing Duplicates

import pandas as pd

df = pd.read_csv('data.csv')
df = df.drop_duplicates()
print(df)

Removes duplicate rows from the DataFrame.


13. Data Aggregation with GroupBy

import pandas as pd

df = pd.read_csv('data.csv')
grouped_df = df.groupby('Department').agg({'Salary': 'mean'})
print(grouped_df)

Groups data by ‘Department’ and calculates the mean ‘Salary’ for each group.


14. Plotting Data

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df['Age'].plot(kind='hist')
plt.show()

Plots a histogram of the ‘Age’ column.


15. Pivot Table Creation

import pandas as pd

df = pd.read_csv('data.csv')
pivot_table = df.pivot_table(values='Salary', index='Department', aggfunc='mean')
print(pivot_table)

Creates a pivot table to show the average ‘Salary’ by ‘Department’.


16. Reading Excel Files

import pandas as pd

df = pd.read_excel('data.xlsx')
print(df.head())

Reads data from an Excel file into a DataFrame.


17. Writing Data to an Excel File

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_excel('output.xlsx', index=False)

Writes a DataFrame to an Excel file.


18. Parsing JSON Data

import pandas as pd

df = pd.read_json('data.json')
print(df.head())

Reads JSON data into a DataFrame.


19. Writing Data to JSON

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_json('output.json', orient='records')

Writes a DataFrame to a JSON file.


20. Reading SQL Data into DataFrame

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
print(df.head())

Reads SQL data into a DataFrame.


21. Writing DataFrame to SQL

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_sql('table_name', conn, if_exists='replace', index=False)

Writes a DataFrame to an SQL table.


22. Filtering Data Based on Condition

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[df['Salary'] > 50000]
print(filtered_df)

Filters rows where ‘Salary’ is greater than 50,000.


23. Extracting Substrings

import pandas as pd

df = pd.read_csv('data.csv')
df['First_Name'] = df['Full_Name'].str.split().str[0]
print(df)

Extracts the first name from a ‘Full_Name’ column.


24. Handling Dates

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['Date'])
print(df['Date'].dt.year)

Converts a column to datetime and extracts the year.
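
If the date strings are messy, parse_dates may silently leave the column as plain text; a hedged alternative is to convert explicitly with pd.to_datetime and coerce bad values to NaT (a minimal sketch, assuming the same ‘Date’ column):

import pandas as pd

df = pd.read_csv('data.csv')
# Invalid or missing date strings become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print(df['Date'].dt.year)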


25. Data Normalization

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')
scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
print(df)

Normalizes the ‘Age’ column to a range of [0, 1].


26. Data Standardization

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('data.csv')
scaler = StandardScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
print(df)

Standardizes the ‘Age’ column to have a mean of 0 and standard deviation of 1.


27. Handling Categorical Data

import pandas as pd

df = pd.read_csv('data.csv')
df = pd.get_dummies(df, columns=['Department'])
print(df)

Converts categorical variables into dummy/indicator variables.
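
If you want to avoid one redundant indicator column per category (often preferred for regression-style models), get_dummies also accepts drop_first=True; a small variation on the example above:

import pandas as pd

df = pd.read_csv('data.csv')
# Drop the first level of each encoded column to avoid a redundant indicator
df = pd.get_dummies(df, columns=['Department'], drop_first=True)
print(df)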


28. Combining Multiple CSV Files

import pandas as pd
import glob

files = glob.glob('data/*.csv')
dfs = [pd.read_csv(file) for file in files]
combined_df = pd.concat(dfs, ignore_index=True)
print(combined_df)

Combines multiple CSV files into a single DataFrame.


29. Creating a DataFrame from a Dictionary

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
print(df)

Creates a DataFrame from a dictionary.


30. Saving DataFrame to a Pickle File

import pandas as pd

df = pd.read_csv('data.csv')
df.to_pickle('data.pkl')

Saves a DataFrame to a pickle file.


31. Loading DataFrame from a Pickle File

import pandas as pd

df = pd.read_pickle('data.pkl')
print(df)

Loads a DataFrame from a pickle file.


32. Converting DataFrame to Dictionary

import pandas as pd

df = pd.read_csv('data.csv')
data_dict = df.to_dict()
print(data_dict)

Converts a DataFrame to a dictionary.

34. Creating a DataFrame with MultiIndex

import pandas as pd

arrays = [['A', 'A', 'B', 'B'], ['x', 'y', 'x', 'y']]
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))
df = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=index)
print(df)

Creates a DataFrame with multi-level indexing.


35. Calculating Summary Statistics

import pandas as pd

df = pd.read_csv('data.csv')
summary = df.describe()
print(summary)

Calculates and prints summary statistics for numerical columns.


36. Handling Text Data

import pandas as pd

df = pd.read_csv('data.csv')
df['Text_Length'] = df['Text_Column'].apply(len)
print(df)

Calculates the length of text in a column.


37. Merging DataFrames on Multiple Columns

import pandas as pd

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, on=['ID', 'Date'])
print(merged_df)

Merges DataFrames on multiple columns.


38. Applying Functions to DataFrame Columns

import pandas as pd

df = pd.read_csv('data.csv')
df['Age_In_5_Years'] = df['Age'].apply(lambda x: x + 5)
print(df)

Applies a function to each element of a column.


39. Creating Pivot Tables

import pandas as pd

df = pd.read_csv('data.csv')
pivot_table = df.pivot_table(values='Sales', index='Region', columns='Month', aggfunc='sum')
print(pivot_table)

Creates a pivot table to aggregate data based on regions and months.


40. Converting DataFrame to a List of Dictionaries

import pandas as pd

df = pd.read_csv('data.csv')
list_of_dicts = df.to_dict(orient='records')
print(list_of_dicts)

Converts a DataFrame to a list of dictionaries.


41. Normalizing Data

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('data.csv')
scaler = MinMaxScaler()
df[['Value']] = scaler.fit_transform(df[['Value']])
print(df)

Normalizes a column to a [0, 1] range.


42. Standardizing Data

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('data.csv')
scaler = StandardScaler()
df[['Value']] = scaler.fit_transform(df[['Value']])
print(df)

Standardizes a column to have mean 0 and standard deviation 1.


43. Converting Data Types

import pandas as pd

df = pd.read_csv('data.csv')
df['Age'] = df['Age'].astype(float)
print(df.dtypes)

Changes the data type of a column.


44. Handling Outliers

import pandas as pd

df = pd.read_csv('data.csv')
q1 = df['Value'].quantile(0.25)
q3 = df['Value'].quantile(0.75)
iqr = q3 - q1
filtered_df = df[(df['Value'] >= (q1 - 1.5 * iqr)) & (df['Value'] <= (q3 + 1.5 * iqr))]
print(filtered_df)

Filters outliers from a DataFrame using the IQR method.


45. Aggregating Data with GroupBy

import pandas as pd

df = pd.read_csv('data.csv')
grouped_df = df.groupby('Category').agg({'Value': 'sum'})
print(grouped_df)

Groups data by ‘Category’ and calculates the sum of ‘Value’.


46. Splitting Data into Training and Testing Sets

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv')
train, test = train_test_split(df, test_size=0.2)
print(train.head())
print(test.head())

Splits data into training and testing sets.
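
If you need a reproducible split, or want to keep features and a target column separate, a variation like the following can be used (the ‘Target’ column name here is just an assumed example):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv')
X = df.drop(columns=['Target'])  # hypothetical feature columns
y = df['Target']                 # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)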


47. Reading Data from a JSON File

import pandas as pd

df = pd.read_json('data.json')
print(df.head())

Reads JSON data into a DataFrame.


48. Writing DataFrame to JSON File

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_json('output.json', orient='records')

Writes a DataFrame to a JSON file.


49. Handling Time Series Data

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)
print(df.resample('M').sum())

Handles time series data and aggregates it by month.


50. Adding Calculated Columns

import pandas as pd

df = pd.read_csv('data.csv')
df['Discounted_Price'] = df['Price'] * 0.9
print(df)

Adds a new column with a calculated value.


51. Combining DataFrames Horizontally

import pandas as pd

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)

Combines two DataFrames side-by-side (horizontally).


52. Extracting Specific Columns

import pandas as pd

df = pd.read_csv('data.csv')
df_subset = df[['Name', 'Age']]
print(df_subset)

Extracts specific columns from a DataFrame.


53. Removing Unnecessary Columns

import pandas as pd

df = pd.read_csv('data.csv')
df = df.drop(columns=['Unnecessary_Column'])
print(df)

Removes unnecessary columns from a DataFrame.


54. Handling Categorical Data with Encoding

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('data.csv')
encoder = LabelEncoder()
df['Category'] = encoder.fit_transform(df['Category'])
print(df)

Encodes categorical data into numerical format.
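
If you later need the original labels back, the same fitted encoder can reverse the mapping; a minimal self-contained sketch (the ‘Category_Code’ and ‘Category_Label’ column names are just illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('data.csv')
encoder = LabelEncoder()
df['Category_Code'] = encoder.fit_transform(df['Category'])
# Decode the numeric codes back into the original category labels
df['Category_Label'] = encoder.inverse_transform(df['Category_Code'])
print(df[['Category', 'Category_Code', 'Category_Label']].head())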


55. Saving DataFrame to a Database

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_csv('data.csv')
df.to_sql('table_name', conn, if_exists='replace', index=False)

Saves a DataFrame to an SQL database.


56. Reading Data from a Database

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
print(df.head())

Reads data from a database into a DataFrame.


57. Filling Missing Values with Interpolation

import pandas as pd

df = pd.read_csv('data.csv')
df = df.interpolate()
print(df)

Fills missing values using interpolation.


58. Exporting Data to Excel

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_excel('output.xlsx', index=False)

Exports a DataFrame to an Excel file.


59. Reading Excel Sheets

import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())

Reads a specific sheet from an Excel file.


60. Appending Data to an Existing CSV

import pandas as pd

new_data = {'Name': ['Tom'], 'Age': [30]}
new_df = pd.DataFrame(new_data)
new_df.to_csv('data.csv', mode='a', header=False, index=False)

Appends new data to an existing CSV file.


61. Deleting Rows Based on Condition

import pandas as pd

df = pd.read_csv('data.csv')
df = df[df['Age'] > 25]
print(df)

Deletes rows where ‘Age’ is less than or equal to 25.


62. Sampling Data

import pandas as pd

df = pd.read_csv('data.csv')
sample_df = df.sample(n=10)
print(sample_df)

Samples 10 random rows from a DataFrame.


63. Resampling Time Series Data

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)
monthly_data = df.resample('M').mean()
print(monthly_data)

Resamples time series data to monthly frequency and calculates the mean.

64. Handling Large Datasets with Chunks

import pandas as pd

chunk_size = 10000
chunks = pd.read_csv('large_data.csv', chunksize=chunk_size)

for chunk in chunks:
    process(chunk)  # Replace with your processing function

Reads and processes large datasets in chunks to manage memory usage.
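
As a concrete example of per-chunk processing, you could accumulate a running total without ever loading the full file; a minimal sketch, assuming large_data.csv has a numeric ‘Value’ column:

import pandas as pd

total = 0
row_count = 0
for chunk in pd.read_csv('large_data.csv', chunksize=10000):
    total += chunk['Value'].sum()   # aggregate each chunk on its own
    row_count += len(chunk)

print(f"Overall mean Value: {total / row_count}")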


65. Removing Extra Whitespace

import pandas as pd

df = pd.read_csv('data.csv')
df.columns = df.columns.str.strip()
df['Name'] = df['Name'].str.strip()
print(df)

Removes extra whitespace from column names and data.


66. Checking for Missing Data

import pandas as pd

df = pd.read_csv('data.csv')
missing_data = df.isnull().sum()
print(missing_data)

Checks for and counts missing data in each column.


67. Filling Missing Data with a Value

import pandas as pd

df = pd.read_csv('data.csv')
df = df.fillna(value={'Age': 30})
print(df)

Fills missing values in the ‘Age’ column with 30.


68. Converting Column to Categorical Data

import pandas as pd

df = pd.read_csv('data.csv')
df['Category'] = df['Category'].astype('category')
print(df.dtypes)

Converts a column to categorical data type.


69. Creating a DataFrame from a List of Tuples

import pandas as pd

data = [('John', 28), ('Anna', 24)]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Creates a DataFrame from a list of tuples.


70. Creating a DataFrame with Custom Index

import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data, index=['Person1', 'Person2'])
print(df)

Creates a DataFrame with a custom index.


71. Applying Multiple Functions to Columns

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
df['Age_Square'] = df['Age'].apply(lambda x: x ** 2)
df['Age_Log'] = df['Age'].apply(lambda x: np.log(x))
print(df)

Applies multiple functions to a column and creates new columns.


72. Generating Random Data

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': np.random.choice(['John', 'Anna', 'Tom'], size=100),
    'Age': np.random.randint(20, 40, size=100)
})
print(df)

Generates a DataFrame with random data.


73. Removing Outliers Using Z-Score

import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('data.csv')
df = df[(np.abs(stats.zscore(df['Value'])) < 3)]
print(df)

Removes outliers based on the Z-score.


74. Creating a Histogram from Data

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df['Age'].hist()
plt.show()

Creates a histogram of the ‘Age’ column.


75. Creating a DataFrame from JSON

import pandas as pd

df = pd.read_json('data.json')
print(df)

Creates a DataFrame from a JSON file.


76. Saving DataFrame to a Pickle File

import pandas as pd

df = pd.read_csv('data.csv')
df.to_pickle('data.pkl')

Saves a DataFrame to a pickle file.


77. Reading Data from a Pickle File

import pandas as pd

df = pd.read_pickle('data.pkl')
print(df)

Loads a DataFrame from a pickle file.


78. Merging DataFrames with Different Column Names

import pandas as pd

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, left_on='ID', right_on='Identifier')
print(merged_df)

Merges DataFrames on columns with different names.


79. Pivoting Data

import pandas as pd

df = pd.read_csv('data.csv')
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print(pivot_df)

Creates a pivot table from a DataFrame.


80. Creating a MultiIndex DataFrame

import pandas as pd

arrays = [['A', 'A', 'B', 'B'], ['x', 'y', 'x', 'y']]
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))
df = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=index)
print(df)

Creates a DataFrame with multi-level indexing.


81. Filtering Data with Regular Expressions

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[df['Email'].str.contains(r'@gmail\.com', na=False)]
print(filtered_df)

Filters data using regular expressions to match email domains.


82. Extracting Columns from a DataFrame

import pandas as pd

df = pd.read_csv('data.csv')
columns = df[['Name', 'Age']]
print(columns)

Extracts specific columns from a DataFrame.


83. Removing Duplicates from Data

import pandas as pd

df = pd.read_csv('data.csv')
df = df.drop_duplicates()
print(df)

Removes duplicate rows from a DataFrame.


84. Filling Missing Data with Mean

import pandas as pd

df = pd.read_csv('data.csv')
mean_value = df['Value'].mean()
df['Value'] = df['Value'].fillna(mean_value)
print(df)

Fills missing values in a column with the mean value.


85. Creating DataFrame from SQL Query

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
print(df)

Reads data from a SQL database into a DataFrame.


86. Aggregating Data with Multiple Functions

import pandas as pd

df = pd.read_csv('data.csv')
agg_df = df.groupby('Category').agg({'Value': ['mean', 'sum', 'max']})
print(agg_df)

Aggregates data using multiple functions.


87. Handling Dates and Times

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
print(df)

Extracts year and month from date columns.


88. Exporting Data to Multiple Sheets in Excel

import pandas as pd

with pd.ExcelWriter('output.xlsx') as writer:
    df1 = pd.read_csv('data1.csv')
    df2 = pd.read_csv('data2.csv')
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

Exports multiple DataFrames to different sheets in an Excel file.


89. Using DataFrame with SQLAlchemy

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///database.db')
df = pd.read_sql('SELECT * FROM table_name', engine)
print(df)

Reads data from a SQL database using SQLAlchemy.


90. Handling Large Datasets with Dask

import dask.dataframe as dd

df = dd.read_csv('large_data.csv')
df = df.compute() # Convert to pandas DataFrame
print(df.head())

Handles large datasets using Dask.
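
Calling .compute() on the whole DataFrame pulls everything into memory; for data that truly does not fit, it is usually better to build the computation lazily and only compute the final, small result. A minimal sketch, assuming hypothetical ‘Category’ and ‘Value’ columns:

import dask.dataframe as dd

df = dd.read_csv('large_data.csv')
# The groupby/mean is built lazily; only the small summary is pulled into memory
mean_per_category = df.groupby('Category')['Value'].mean().compute()
print(mean_per_category)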


91. Cleaning Text Data

import pandas as pd

df = pd.read_csv('data.csv')
df['Text_Cleaned'] = df['Text_Column'].str.replace(r'[^a-zA-Z]', '', regex=True)
print(df)

Removes non-alphabetic characters from text data.


92. Resampling Data by Week

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)
weekly_data = df.resample('W').sum()
print(weekly_data)

Resamples data to weekly frequency and calculates the sum.


93. Handling Large JSON Files

import pandas as pd

df = pd.read_json('large_data.json', lines=True)
print(df.head())

Reads large JSON files into a DataFrame.

94. Converting DataFrame to Excel with Formatting

import pandas as pd
from openpyxl.styles import Font

df = pd.read_csv('data.csv')
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    worksheet = writer.sheets['Sheet1']

    # Bold the header row and hide gridlines
    for cell in worksheet[1]:
        cell.font = Font(bold=True)
    worksheet.sheet_view.showGridLines = False

Writes a DataFrame to an Excel file and applies basic formatting.


95. Adding Conditional Formatting in Excel

import pandas as pd
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill

df = pd.read_csv('data.csv')
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    worksheet = writer.sheets['Sheet1']

    # Highlight cells greater than 100 in yellow
    yellow = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
    rule = CellIsRule(operator='greaterThan', formula=['100'], fill=yellow)
    worksheet.conditional_formatting.add('B2:B100', rule)  # adjust range as needed

Applies conditional formatting to an Excel sheet.


96. Using SQL to Filter Data

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM table_name WHERE Age > 30"
df = pd.read_sql_query(query, conn)
print(df)

Filters data using an SQL query.
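
When the filter value comes from a variable or user input, it is safer to pass it as a query parameter rather than formatting it into the SQL string; a minimal sketch using sqlite3-style ‘?’ placeholders:

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
min_age = 30
# The parameter is passed separately, avoiding manual string formatting
df = pd.read_sql_query('SELECT * FROM table_name WHERE Age > ?', conn, params=(min_age,))
print(df)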


97. Combining DataFrames with Concatenation

import pandas as pd

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
combined_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(combined_df)

Concatenates two DataFrames vertically.


98. Aggregating Data Using GroupBy and Multiple Functions

import pandas as pd

df = pd.read_csv('data.csv')
agg_df = df.groupby('Category').agg({
    'Value': ['mean', 'sum', 'max'],
    'Age': 'min'
})
print(agg_df)

Aggregates data using multiple functions for different columns.


99. Filtering Data Based on Multiple Conditions

import pandas as pd

df = pd.read_csv('data.csv')
filtered_df = df[(df['Age'] > 25) & (df['Category'] == 'A')]
print(filtered_df)

Filters rows based on multiple conditions.


100. Creating a DataFrame from a Dictionary of Series

import pandas as pd

data = {
    'Name': pd.Series(['John', 'Anna']),
    'Age': pd.Series([28, 24])
}
df = pd.DataFrame(data)
print(df)

Creates a DataFrame from a dictionary of pandas Series.


This collection of 100 Python scripts for beginners is designed to strengthen your understanding of data handling through practical examples.
