We are surrounded by highly capable data science and machine learning models, but the raw datasets used to train and build them are rarely clean and ready to use out of the box. They often contain missing values, irrelevant columns, or noisy entries that can significantly affect data analysis and machine learning model training. This is why data manipulation, also called data wrangling, is necessary. In this article, we will discuss how data science professionals use Pandas, a powerful Python library, to clean, transform, and prepare data into a structured form.
Understanding Pandas in Python
Pandas is a widely used open-source Python library built on top of NumPy, designed specifically for data manipulation and analysis. It consists of two primary data structures:
· Series: a one-dimensional labeled array (like a single column in an Excel sheet)
· DataFrame: a two-dimensional tabular data structure with labeled rows and columns, similar to a table or spreadsheet
These structures help load, explore, and manipulate real-world datasets efficiently.
Installing and Importing Pandas
You need to install Pandas before you use it. Here is how to do it:
pip install pandas
In your Python code, typically you import it with the alias pd:
import pandas as pd
This is a standard practice widely used in the data science industry.
Creating Pandas DataFrame in Python
The next step in data manipulation is creating a Pandas DataFrame in Python. Let’s create a DataFrame for a company’s monthly sales records; you can use the following code:
import pandas as pd
sales_data = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April'],
    'Product': ['Laptop', 'Laptop', 'Mobile', 'Tablet'],
    'Units_Sold': [120, 85, 200, 150],
    'Unit_Price': [800, 820, 500, 300]
})
print(sales_data)
This dataset contains the month of sale, product type, units sold, and unit price of each product.
Output
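The printed DataFrame should look roughly like this (the exact column spacing may vary slightly across Pandas versions):
      Month Product  Units_Sold  Unit_Price
0   January  Laptop         120         800
1  February  Laptop          85         820
2     March  Mobile         200         500
3     April  Tablet         150         300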
Adding and Removing Data
1. Adding a New Row
Suppose the company sells 100 smartwatches at $250 each in May:
new_entry = {
    'Month': 'May',
    'Product': 'Smartwatch',
    'Units_Sold': 100,
    'Unit_Price': 250
}
sales_data = pd.concat([sales_data, pd.DataFrame([new_entry])], ignore_index=True)
Output
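Printing sales_data again should show the May entry appended as row 4 (spacing approximate):
print(sales_data)
      Month     Product  Units_Sold  Unit_Price
0   January      Laptop         120         800
1  February      Laptop          85         820
2     March      Mobile         200         500
3     April      Tablet         150         300
4       May  Smartwatch         100         250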
2. Adding a New Column
If we want to calculate the Total Revenue for each row:
sales_data['Revenue'] = sales_data['Units_Sold'] * sales_data['Unit_Price']
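Because this multiplication is vectorized, every row is computed at once. With the data above, the new Revenue column holds 120 × 800 = 96,000 for January, 85 × 820 = 69,700 for February, 200 × 500 = 100,000 for March, 150 × 300 = 45,000 for April, and 100 × 250 = 25,000 for May.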
3. Removing Rows or Columns
Deleting the Unit Price column:
sales_data = sales_data.drop('Unit_Price', axis=1)
print(sales_data)
Removing the March entry (the row whose index label is 2):
sales_data = sales_data.drop(2, axis=0)
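Like most Pandas methods, drop() returns a new DataFrame rather than modifying the original. If you want the remaining rows renumbered 0, 1, 2, ... after a deletion, you can reset the index afterwards:
sales_data = sales_data.reset_index(drop=True)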
Exploring Your Data: Shape, Info, and Statistics
Before transforming your data, you must understand its structure and basic statistics:
1. Shape of Data
sales_data.shape
This returns a tuple of (number of rows, number of columns).
2. Information Summary
sales_data.info()
This shows column types, non-null values, and memory usage.
3. Statistical Summary
sales_data.describe()
This gives mean, standard deviation, min, max, and other statistics for numeric columns like units sold and revenue.
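By default, describe() only summarizes numeric columns. If you also want counts and the most frequent values for text columns such as Month and Product, you can pass include='all':
sales_data.describe(include='all')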
Handling Missing Data
Handling missing data is one of the most common tasks in the data wrangling process, and Pandas makes it easy with a few simple methods:
1. Dropping Missing Values
sales_data = sales_data.dropna()
2. Filling Missing Values
sales_data = sales_data.fillna(0)
Or use more intelligent strategies like forward fill:
sales_data = sales_data.fillna(method='ffill')
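Note that recent Pandas releases (2.1 and later) deprecate the method argument of fillna(); the dedicated ffill() method does the same thing:
sales_data = sales_data.ffill()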
Selecting and Filtering Data
1. Using .loc (Label-Based)
Get all rows where product is “Laptop”
sales_data.loc[sales_data['Product'] == 'Laptop']
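.loc also accepts a column selection alongside the row filter. As an illustrative query (not part of the original example), you could select only the month and units sold for rows with more than 100 units:
sales_data.loc[sales_data['Units_Sold'] > 100, ['Month', 'Units_Sold']]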
2. Using .iloc (Position-Based)
Get first two rows and first three columns
sales_data.iloc[:2, :3]
Applying Functions and Column Transformations
If you want to convert all month names to uppercase, then use the following code:
sales_data['Month'] = sales_data['Month'].apply(lambda x: x.upper())
And if you want to increase Revenue by 10% (for example, to add a tax), then use:
sales_data['Revenue'] = sales_data['Revenue'].apply(lambda x: x * 1.10)
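Both of these transformations can also be written as vectorized operations, which are usually faster than apply() on large datasets:
sales_data['Month'] = sales_data['Month'].str.upper()
sales_data['Revenue'] = sales_data['Revenue'] * 1.10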
Other Important Data Wrangling Methods
Beyond these basics, Pandas offers several other data wrangling methods:
1. Grouping and Aggregation
Total units sold by product
sales_data.groupby('Product')['Units_Sold'].sum()
Total revenue by month
sales_data.groupby('Month')['Revenue'].sum()
Multiple aggregations
sales_data.groupby('Product').agg({
'Units_Sold': 'sum',
'Revenue': 'mean'
})
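groupby() returns the grouping keys as the index of the result. If you prefer a flat DataFrame with Product as a regular column, pass as_index=False (or call reset_index() on the result):
sales_data.groupby('Product', as_index=False).agg({
    'Units_Sold': 'sum',
    'Revenue': 'mean'
})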
2. Merging, Joining, Concatenating
Merging discount data (for example)
discounts = pd.DataFrame({
    'Product': ['Laptop', 'Mobile', 'Tablet', 'Smartwatch'],
    'Discount_Percent': [10, 5, 7, 3]
})
merged_data = pd.merge(sales_data, discounts, on='Product')
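By default, merge() performs an inner join, keeping only products present in both DataFrames. If you want to keep every sales row even when no discount is defined for its product, specify how='left':
merged_data = pd.merge(sales_data, discounts, on='Product', how='left')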
Concatenating two dataframes
more_sales = pd.DataFrame({...})
combined_data = pd.concat([sales_data, more_sales], ignore_index=True)
3. Reshaping Data
Pivot Table — Revenue by Month and Product
sales_data.pivot_table(values='Revenue', index='Month', columns='Product', aggfunc='sum')
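Month–Product combinations with no sales appear as NaN in the pivot table; you can pass fill_value=0 to show zeros instead:
sales_data.pivot_table(values='Revenue', index='Month', columns='Product', aggfunc='sum', fill_value=0)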
Melting Data — Wide to Long Format
pd.melt(sales_data, id_vars=['Month'], var_name='Category', value_name='Value')
Best Practices for Data Manipulation in Python
Here are some pro tips to keep your data wrangling with Pandas efficient:
1. Avoid loops when possible: Use vectorized operations instead of Python loops; Pandas and NumPy are optimized for whole-array operations
2. Be cautious with apply(): It can be slower than built-in vectorized functions, so reserve it for custom logic that has no vectorized equivalent
3. Chain operations: You can combine multiple Pandas methods into a single pipeline to keep data wrangling readable and efficient (see the sketch after this list)
4. Always inspect your data: Use info(), describe(), and other exploratory methods early to understand your dataset
5. Document transformations: When preprocessing data for a machine learning model, keep a clear log of all transformations so you can reproduce your results later when needed
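As a minimal sketch of points 1 and 3, the pipeline below chains vectorized operations to rank products by total revenue without any explicit Python loop. It assumes sales_data still has its original Units_Sold and Unit_Price columns:
top_products = (
    sales_data
    .assign(Revenue=lambda df: df['Units_Sold'] * df['Unit_Price'])  # vectorized revenue per row
    .groupby('Product', as_index=False)['Revenue'].sum()             # total revenue per product
    .sort_values('Revenue', ascending=False)                         # highest revenue first
)
print(top_products)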
Final Thoughts!
Data manipulation in Python using Pandas is a core skill for anyone working in data science, analytics, or machine learning. With Pandas, you can easily load raw data, explore its structure, address issues such as missing or incorrect values, transform fields, and reshape data for analysis.
This Python library has expressive syntax and offers powerful methods that help you turn messy, real-world data into structured, meaningful information. If you want to learn these essential skills, consider enrolling in USDSI® data science certifications, which offer a comprehensive curriculum aligned with the latest industry requirements and cover current tools and technologies.
Empowering yourself with these important data wrangling skills and earning recognized credentials from USDSI® will open doors to wonderful data science career opportunities in the years to come.
