Python Scripts For Beginners: Data Handling
A collection of 100 beginner-friendly Python scripts for data handling, each designed to help with a fundamental task and provide a useful example.
1. Reading a CSV File
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Reads a CSV file into a DataFrame and prints the first few rows.
2. Writing Data to a CSV File
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)
Creates a DataFrame from a dictionary and writes it to a CSV file.
3. Filtering Rows in a DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
filtered_df = df[df['Age'] > 30]
print(filtered_df)
Filters rows where the ‘Age’ column is greater than 30.
4. Sorting Data in a DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
Sorts the DataFrame by the ‘Age’ column in descending order.
5. Aggregating Data
import pandas as pd
df = pd.read_csv('data.csv')
mean_age = df['Age'].mean()
print(f"Mean Age: {mean_age}")
Calculates the mean age from the ‘Age’ column.
6. Merging Two DataFrames
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
Merges two DataFrames on the ‘ID’ column.
7. Concatenating DataFrames
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)
Concatenates two DataFrames vertically.
8. Handling Missing Data
import pandas as pd
df = pd.read_csv('data.csv')
df = df.ffill()  # forward-fill missing values (fillna(method='ffill') is deprecated)
print(df)
Fills missing values using the forward-fill method.
9. Dropping Missing Data
import pandas as pd
df = pd.read_csv('data.csv')
df = df.dropna()
print(df)
Drops rows with any missing values.
10. Renaming Columns
import pandas as pd
df = pd.read_csv('data.csv')
df = df.rename(columns={'old_name': 'new_name'})
print(df)
Renames columns in the DataFrame.
11. Changing Data Types
import pandas as pd
df = pd.read_csv('data.csv')
df['Age'] = df['Age'].astype(int)
print(df.dtypes)
Changes the data type of the ‘Age’ column to integer.
12. Removing Duplicates
import pandas as pd
df = pd.read_csv('data.csv')
df = df.drop_duplicates()
print(df)
Removes duplicate rows from the DataFrame.
13. Data Aggregation with GroupBy
import pandas as pd
df = pd.read_csv('data.csv')
grouped_df = df.groupby('Department').agg({'Salary': 'mean'})
print(grouped_df)
Groups data by ‘Department’ and calculates the mean ‘Salary’ for each group.
14. Plotting Data
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df['Age'].plot(kind='hist')
plt.show()
Plots a histogram of the ‘Age’ column.
15. Pivot Table Creation
import pandas as pd
df = pd.read_csv('data.csv')
pivot_table = df.pivot_table(values='Salary', index='Department', aggfunc='mean')
print(pivot_table)
Creates a pivot table to show the average ‘Salary’ by ‘Department’.
16. Reading Excel Files
import pandas as pd
df = pd.read_excel('data.xlsx')
print(df.head())
Reads data from an Excel file into a DataFrame.
17. Writing Data to an Excel File
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_excel('output.xlsx', index=False)
Writes a DataFrame to an Excel file.
18. Parsing JSON Data
import pandas as pd
df = pd.read_json('data.json')
print(df.head())
Reads JSON data into a DataFrame.
19. Writing Data to JSON
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_json('output.json', orient='records')
Writes a DataFrame to a JSON file.
20. Reading SQL Data into DataFrame
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
print(df.head())
Reads SQL data into a DataFrame.
21. Writing DataFrame to SQL
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_sql('table_name', conn, if_exists='replace', index=False)
Writes a DataFrame to an SQL table.
22. Filtering Data Based on Condition
import pandas as pd
df = pd.read_csv('data.csv')
filtered_df = df[df['Salary'] > 50000]
print(filtered_df)
Filters rows where ‘Salary’ is greater than 50,000.
23. Extracting Substrings
import pandas as pd
df = pd.read_csv('data.csv')
df['First_Name'] = df['Full_Name'].str.split().str[0]
print(df)
Extracts the first name from a ‘Full_Name’ column.
24. Handling Dates
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['Date'])
print(df['Date'].dt.year)
Converts a column to datetime and extracts the year.
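If the date column was not parsed at read time, pd.to_datetime can convert it afterwards. A minimal sketch, assuming the same ‘Date’ column holds date strings:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # unparseable values become NaT instead of raising
print(df['Date'].dt.year)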
25. Data Normalization
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv('data.csv')
scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
print(df)
Normalizes the ‘Age’ column to a range of [0, 1].
26. Data Standardization
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('data.csv')
scaler = StandardScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
print(df)
Standardizes the ‘Age’ column to have a mean of 0 and standard deviation of 1.
27. Handling Categorical Data
import pandas as pd
df = pd.read_csv('data.csv')
df = pd.get_dummies(df, columns=['Department'])
print(df)
Converts categorical variables into dummy/indicator variables.
28. Combining Multiple CSV Files
import pandas as pd
import glob
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(file) for file in files]
combined_df = pd.concat(dfs, ignore_index=True)
print(combined_df)
Combines multiple CSV files into a single DataFrame.
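When combining files it is often handy to know which file each row came from. A hedged variant of the snippet above, adding a hypothetical ‘Source_File’ column before concatenating:
import glob
import pandas as pd
frames = []
for file in glob.glob('data/*.csv'):
    frame = pd.read_csv(file)
    frame['Source_File'] = file  # hypothetical column recording which file each row came from
    frames.append(frame)
combined_df = pd.concat(frames, ignore_index=True)
print(combined_df)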
29. Creating a DataFrame from a Dictionary
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
print(df)
Creates a DataFrame from a dictionary.
30. Saving DataFrame to a Pickle File
import pandas as pd
df = pd.read_csv('data.csv')
df.to_pickle('data.pkl')
Saves a DataFrame to a pickle file.
31. Loading DataFrame from a Pickle File
import pandas as pd
df = pd.read_pickle('data.pkl')
print(df)
Loads a DataFrame from a pickle file.
32. Converting DataFrame to Dictionary
import pandas as pd
df = pd.read_csv('data.csv')
data_dict = df.to_dict()
print(data_dict)
Converts a DataFrame to a dictionary.
34. Creating a DataFrame with MultiIndex
import pandas as pd
arrays = [['A', 'A', 'B', 'B'], ['x', 'y', 'x', 'y']]
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))
df = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=index)
print(df)
Creates a DataFrame with multi-level indexing.
35. Calculating Summary Statistics
import pandas as pd
df = pd.read_csv('data.csv')
summary = df.describe()
print(summary)
Calculates and prints summary statistics for numerical columns.
36. Handling Text Data
import pandas as pd
df = pd.read_csv('data.csv')
df['Text_Length'] = df['Text_Column'].apply(len)
print(df)
Calculates the length of text in a column.
37. Merging DataFrames on Multiple Columns
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, on=['ID', 'Date'])
print(merged_df)
Merges DataFrames on multiple columns.
38. Applying Functions to DataFrame Columns
import pandas as pd
df = pd.read_csv('data.csv')
df['Age_In_5_Years'] = df['Age'].apply(lambda x: x + 5)
print(df)
Applies a function to each element of a column.
39. Creating Pivot Tables
import pandas as pd
df = pd.read_csv('data.csv')
pivot_table = df.pivot_table(values='Sales', index='Region', columns='Month', aggfunc='sum')
print(pivot_table)
Creates a pivot table to aggregate data based on regions and months.
40. Converting DataFrame to a List of Dictionaries
import pandas as pd
df = pd.read_csv('data.csv')
list_of_dicts = df.to_dict(orient='records')
print(list_of_dicts)
Converts a DataFrame to a list of dictionaries.
41. Normalizing Data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv('data.csv')
scaler = MinMaxScaler()
df[['Value']] = scaler.fit_transform(df[['Value']])
print(df)
Normalizes a column to a [0, 1] range.
42. Standardizing Data
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('data.csv')
scaler = StandardScaler()
df[['Value']] = scaler.fit_transform(df[['Value']])
print(df)
Standardizes a column to have mean 0 and standard deviation 1.
43. Converting Data Types
import pandas as pd
df = pd.read_csv('data.csv')
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
Changes the data type of a column.
44. Handling Outliers
import pandas as pd
df = pd.read_csv('data.csv')
q1 = df['Value'].quantile(0.25)
q3 = df['Value'].quantile(0.75)
iqr = q3 - q1
filtered_df = df[(df['Value'] >= (q1 - 1.5 * iqr)) & (df['Value'] <= (q3 + 1.5 * iqr))]
print(filtered_df)
Filters outliers from a DataFrame using the IQR method.
45. Aggregating Data with GroupBy
import pandas as pd
df = pd.read_csv('data.csv')
grouped_df = df.groupby('Category').agg({'Value': 'sum'})
print(grouped_df)
Groups data by ‘Category’ and calculates the sum of ‘Value’.
46. Splitting Data into Training and Testing Sets
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('data.csv')
train, test = train_test_split(df, test_size=0.2)
print(train.head())
print(test.head())
Splits data into training and testing sets.
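By default the split is shuffled differently on every run; passing random_state makes it reproducible. A minimal sketch, assuming the same data.csv:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('data.csv')
# random_state fixes the shuffle so the same rows land in each split on every run
train, test = train_test_split(df, test_size=0.2, random_state=42)
print(train.shape, test.shape)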
47. Reading Data from a JSON File
import pandas as pd
df = pd.read_json('data.json')
print(df.head())
Reads JSON data into a DataFrame.
48. Writing DataFrame to JSON File
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_json('output.json', orient='records')
Writes a DataFrame to a JSON file.
49. Handling Time Series Data
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)
print(df.resample('M').sum())
Handles time series data and aggregates it by month.
50. Adding Calculated Columns
import pandas as pd
df = pd.read_csv('data.csv')
df['Discounted_Price'] = df['Price'] * 0.9
print(df)
Adds a new column with a calculated value.
51. Combining DataFrames Horizontally
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
Combines two DataFrames side-by-side (horizontally).
52. Extracting Specific Columns
import pandas as pd
df = pd.read_csv('data.csv')
df_subset = df[['Name', 'Age']]
print(df_subset)
Extracts specific columns from a DataFrame.
53. Removing Unnecessary Columns
import pandas as pd
df = pd.read_csv('data.csv')
df = df.drop(columns=['Unnecessary_Column'])
print(df)
Removes unnecessary columns from a DataFrame.
54. Handling Categorical Data with Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('data.csv')
encoder = LabelEncoder()
df['Category'] = encoder.fit_transform(df['Category'])
print(df)
Encodes categorical data into numerical format.
55. Saving DataFrame to a Database
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_csv('data.csv')
df.to_sql('table_name', conn, if_exists='replace', index=False)
Saves a DataFrame to an SQL database.
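The SQLite snippets here leave the connection open; in a longer script it is good practice to close it explicitly. A minimal sketch, assuming the same database.db and table_name:
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
try:
    df = pd.read_csv('data.csv')
    df.to_sql('table_name', conn, if_exists='replace', index=False)
finally:
    conn.close()  # release the connection even if the write fails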
56. Reading Data from a Database
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
print(df.head())
Reads data from a database into a DataFrame.
57. Filling Missing Values with Interpolation
import pandas as pd
df = pd.read_csv('data.csv')
df = df.interpolate()
print(df)
Fills missing values using interpolation.
58. Exporting Data to Excel
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data)
df.to_excel('output.xlsx', index=False)
Exports a DataFrame to an Excel file.
59. Reading Excel Sheets
import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
Reads a specific sheet from an Excel file.
60. Appending Data to an Existing CSV
import pandas as pd
new_data = {'Name': ['Tom'], 'Age': [30]}
new_df = pd.DataFrame(new_data)
new_df.to_csv('data.csv', mode='a', header=False, index=False)
Appends new data to an existing CSV file.
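The snippet above assumes data.csv already exists with a header row. If the file might not exist yet, a hedged variant writes the header only on first creation:
import os
import pandas as pd
new_df = pd.DataFrame({'Name': ['Tom'], 'Age': [30]})
write_header = not os.path.exists('data.csv')  # write the header only when creating the file
new_df.to_csv('data.csv', mode='a', header=write_header, index=False)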
61. Deleting Rows Based on Condition
import pandas as pd
df = pd.read_csv('data.csv')
df = df[df['Age'] > 25]
print(df)
Deletes rows where ‘Age’ is less than or equal to 25.
62. Sampling Data
import pandas as pd
df = pd.read_csv('data.csv')
sample_df = df.sample(n=10)
print(sample_df)
Samples 10 random rows from a DataFrame.
63. Resampling Time Series Data
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)
monthly_data = df.resample('M').mean()
print(monthly_data)
Resamples time series data to monthly frequency and calculates the mean.
64. Handling Large Datasets with Chunks
import pandas as pd
chunk_size = 10000
chunks = pd.read_csv('large_data.csv', chunksize=chunk_size)
for chunk in chunks:
    process(chunk)  # replace with your processing function
Reads and processes large datasets in chunks to manage memory usage.
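process() above is just a placeholder. As one concrete, hypothetical example, a running total of a numeric ‘Value’ column can be accumulated chunk by chunk without loading the whole file into memory:
import pandas as pd
total = 0
for chunk in pd.read_csv('large_data.csv', chunksize=10000):
    total += chunk['Value'].sum()  # assumes a numeric 'Value' column exists
print(f"Total Value: {total}")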
65. Removing Extra Whitespace
import pandas as pd
df = pd.read_csv('data.csv')
df.columns = df.columns.str.strip()
df['Name'] = df['Name'].str.strip()
print(df)
Removes extra whitespace from column names and data.
66. Checking for Missing Data
import pandas as pd
df = pd.read_csv('data.csv')
missing_data = df.isnull().sum()
print(missing_data)
Checks for and counts missing data in each column.
67. Filling Missing Data with a Value
import pandas as pd
df = pd.read_csv('data.csv')
df = df.fillna(value={'Age': 30})
print(df)
Fills missing values in the ‘Age’ column with 30.
68. Converting Column to Categorical Data
import pandas as pd
df = pd.read_csv('data.csv')
df['Category'] = df['Category'].astype('category')
print(df.dtypes)
Converts a column to categorical data type.
69. Creating a DataFrame from a List of Tuples
import pandas as pd
data = [('John', 28), ('Anna', 24)]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Creates a DataFrame from a list of tuples.
70. Creating a DataFrame with Custom Index
import pandas as pd
data = {'Name': ['John', 'Anna'], 'Age': [28, 24]}
df = pd.DataFrame(data, index=['Person1', 'Person2'])
print(df)
Creates a DataFrame with a custom index.
71. Applying Multiple Functions to Columns
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df['Age_Square'] = df['Age'].apply(lambda x: x ** 2)
df['Age_Log'] = df['Age'].apply(lambda x: np.log(x))
print(df)
Applies multiple functions to a column and creates new columns.
72. Generating Random Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': np.random.choice(['John', 'Anna', 'Tom'], size=100),
'Age': np.random.randint(20, 40, size=100)
})
print(df)
Generates a DataFrame with random data.
73. Removing Outliers Using Z-Score
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv('data.csv')
df = df[(np.abs(stats.zscore(df['Value'])) < 3)]
print(df)
Removes outliers based on the Z-score.
74. Creating a Histogram from Data
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df['Age'].hist()
plt.show()
Creates a histogram of the ‘Age’ column.
75. Creating a DataFrame from JSON
import pandas as pd
df = pd.read_json('data.json')
print(df)
Creates a DataFrame from a JSON file.
76. Saving DataFrame to a Pickle File
import pandas as pd
df = pd.read_csv('data.csv')
df.to_pickle('data.pkl')
Saves a DataFrame to a pickle file.
77. Reading Data from a Pickle File
import pandas as pd
df = pd.read_pickle('data.pkl')
print(df)
Loads a DataFrame from a pickle file.
78. Merging DataFrames with Different Column Names
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df1, df2, left_on='ID', right_on='Identifier')
print(merged_df)
Merges DataFrames on columns with different names.
79. Pivoting Data
import pandas as pd
df = pd.read_csv('data.csv')
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print(pivot_df)
Reshapes the DataFrame so that each ‘Category’ value becomes a column, indexed by ‘Date’.
80. Creating a MultiIndex DataFrame
import pandas as pd
arrays = [['A', 'A', 'B', 'B'], ['x', 'y', 'x', 'y']]
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))
df = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=index)
print(df)
Creates a DataFrame with multi-level indexing.
81. Filtering Data with Regular Expressions
import pandas as pd
df = pd.read_csv('data.csv')
filtered_df = df[df['Email'].str.contains(r'@gmail\.com', na=False)]  # escape the dot; na=False skips missing emails
print(filtered_df)
Filters data using regular expressions to match email domains.
82. Extracting Columns from a DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
columns = df[['Name', 'Age']]
print(columns)
Extracts specific columns from a DataFrame.
83. Removing Duplicates from Data
import pandas as pd
df = pd.read_csv('data.csv')
df = df.drop_duplicates()
print(df)
Removes duplicate rows from a DataFrame.
84. Filling Missing Data with Mean
import pandas as pd
df = pd.read_csv('data.csv')
mean_value = df['Value'].mean()
df['Value'] = df['Value'].fillna(mean_value)
print(df)
Fills missing values in a column with the mean value.
85. Creating DataFrame from SQL Query
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
print(df)
Reads data from a SQL database into a DataFrame.
86. Aggregating Data with Multiple Functions
import pandas as pd
df = pd.read_csv('data.csv')
agg_df = df.groupby('Category').agg({'Value': ['mean', 'sum', 'max']})
print(agg_df)
Aggregates data using multiple functions.
87. Handling Dates and Times
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
print(df)
Extracts year and month from date columns.
88. Exporting Data to Multiple Sheets in Excel
import pandas as pd
with pd.ExcelWriter('output.xlsx') as writer:
    df1 = pd.read_csv('data1.csv')
    df2 = pd.read_csv('data2.csv')
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
Exports multiple DataFrames to different sheets in an Excel file.
89. Using DataFrame with SQLAlchemy
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = pd.read_sql('SELECT * FROM table_name', engine)
print(df)
Reads data from a SQL database using SQLAlchemy.
90. Handling Large Datasets with Dask
import dask.dataframe as dd
df = dd.read_csv('large_data.csv')
df = df.compute() # Convert to pandas DataFrame
print(df.head())
Handles large datasets using Dask.
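Calling .compute() on the full dataset pulls everything into memory, which defeats the purpose for truly large files; usually you filter or aggregate lazily and compute only the small result. A minimal sketch, assuming hypothetical ‘Category’ and ‘Value’ columns:
import dask.dataframe as dd
df = dd.read_csv('large_data.csv')
# the groupby is built lazily; only the aggregated result is materialized
mean_per_category = df.groupby('Category')['Value'].mean().compute()
print(mean_per_category)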
91. Cleaning Text Data
import pandas as pd
df = pd.read_csv('data.csv')
df['Text_Cleaned'] = df['Text_Column'].str.replace('[^a-zA-Z]', '', regex=True)
print(df)
Removes non-alphabetic characters from text data.
92. Resampling Data by Week
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)
weekly_data = df.resample('W').sum()
print(weekly_data)
Resamples data to weekly frequency and calculates the sum.
93. Handling Large JSON Files
import pandas as pd
df = pd.read_json('large_data.json', lines=True)
print(df.head())
Reads a large line-delimited JSON (JSON Lines) file into a DataFrame.
94. Converting DataFrame to Excel with Formatting
import pandas as pd
from openpyxl.styles import Font
df = pd.read_csv('data.csv')
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    worksheet = writer.sheets['Sheet1']
    # Bold the header row (row 1 holds the column names)
    for cell in worksheet[1]:
        cell.font = Font(bold=True)
    worksheet.sheet_view.showGridLines = False
Writes a DataFrame to an Excel file and applies basic formatting.
95. Adding Conditional Formatting in Excel
import pandas as pd
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill
df = pd.read_csv('data.csv')
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False, sheet_name='Sheet1')
    worksheet = writer.sheets['Sheet1']
    # Highlight values greater than 100 in yellow
    yellow_fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
    rule = CellIsRule(operator='greaterThan', formula=['100'], fill=yellow_fill)
    worksheet.conditional_formatting.add('B2:B100', rule)  # adjust range as needed
Applies conditional formatting to an Excel sheet.
96. Using SQL to Filter Data
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
query = "SELECT * FROM table_name WHERE Age > 30"
df = pd.read_sql_query(query, conn)
print(df)
Filters data using an SQL query.
97. Combining DataFrames with Concatenation
import pandas as pd
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
combined_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(combined_df)
Concatenates two DataFrames vertically.
98. Aggregating Data Using GroupBy and Multiple Functions
import pandas as pd
df = pd.read_csv('data.csv')
agg_df = df.groupby('Category').agg({
'Value': ['mean', 'sum', 'max'],
'Age': 'min'
})
print(agg_df)
Aggregates data using multiple functions for different columns.
99. Filtering Data Based on Multiple Conditions
import pandas as pd
df = pd.read_csv('data.csv')
filtered_df = df[(df['Age'] > 25) & (df['Category'] == 'A')]
print(filtered_df)
Filters rows based on multiple conditions.
100. Creating a DataFrame from a Dictionary of Series
import pandas as pd
data = {
'Name': pd.Series(['John', 'Anna']),
'Age': pd.Series([28, 24])
}
df = pd.DataFrame(data)
print(df)
Creates a DataFrame from a dictionary of pandas Series.
This collection of 100 Python scripts for beginners is designed to enhance your understanding of data handling with practical examples.