We are surrounded by highly capable data science and machine learning models, but the raw datasets used to train and build them are rarely clean and ready to use out of the box. They often contain missing values, irrelevant columns, or noisy entries that can significantly affect data analysis and machine learning model training. This is why data manipulation, also called data wrangling, is necessary. In this article, we will discuss how data science professionals use Pandas, a powerful Python library, to clean, transform, and prepare data into a structured form.
Understanding Pandas in Python
Pandas is a widely used open-source Python library built on top of NumPy, designed specifically for data manipulation and analysis. It consists of two primary data structures:
· Series: a one-dimensional labeled array (like a single column in an Excel sheet)
· DataFrame: a two-dimensional tabular data structure with labeled rows and columns, similar to a table or spreadsheet
These structures help load, explore, and manipulate real-world datasets efficiently.
Installing and Importing Pandas
You need to install Pandas before you use it. Here is how to do it:
pip install pandas
In your Python code, typically you import it with the alias pd:
import pandas as pd
This is a standard practice widely used in the data science industry.
Creating Pandas DataFrame in Python
The next step in data manipulation is creating a Pandas DataFrame in Python. Let’s create a DataFrame for a company’s monthly sales records; you can use the following code:
import pandas as pd
sales_data = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April'],
    'Product': ['Laptop', 'Laptop', 'Mobile', 'Tablet'],
    'Units_Sold': [120, 85, 200, 150],
    'Unit_Price': [800, 820, 500, 300]
})
print(sales_data)
This dataset contains the month of sale, product type, units sold, and unit price of each product.
Output
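The printed DataFrame should look roughly like this (the exact column spacing may vary slightly across Pandas versions):
      Month Product  Units_Sold  Unit_Price
0   January  Laptop         120         800
1  February  Laptop          85         820
2     March  Mobile         200         500
3     April  Tablet         150         300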
Adding and Removing Data
1. Adding a New Row
Suppose the company sells 100 smartwatches at $250 each in May:
new_entry = {
    'Month': 'May',
    'Product': 'Smartwatch',
    'Units_Sold': 100,
    'Unit_Price': 250
}
sales_data = pd.concat([sales_data, pd.DataFrame([new_entry])], ignore_index=True)
Output
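Printing sales_data again should show the May entry appended as row 4 (spacing approximate):
print(sales_data)
      Month     Product  Units_Sold  Unit_Price
0   January      Laptop         120         800
1  February      Laptop          85         820
2     March      Mobile         200         500
3     April      Tablet         150         300
4       May  Smartwatch         100         250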
2. Adding a New Column
If we want to calculate the Total Revenue for each row:
sales_data['Revenue'] = sales_data['Units_Sold'] * sales_data['Unit_Price']
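Because this multiplication is vectorized, every row is computed at once. With the data above, the new Revenue column holds 120 × 800 = 96,000 for January, 85 × 820 = 69,700 for February, 200 × 500 = 100,000 for March, 150 × 300 = 45,000 for April, and 100 × 250 = 25,000 for May.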
3. Removing Rows or Columns
Deleting the Unit Price column:
sales_data = sales_data.drop('Unit_Price', axis=1)
print(sales_data)
Removing the March entry (the row whose index label is 2):
sales_data = sales_data.drop(2, axis=0)
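Like most Pandas methods, drop() returns a new DataFrame rather than modifying the original. If you want the remaining rows renumbered 0, 1, 2, ... after a deletion, you can reset the index afterwards:
sales_data = sales_data.reset_index(drop=True)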
Exploring Your Data: Shape, Info, and Statistics
Before transforming your data, you must understand its structure and basic statistics:
1. Shape of Data
sales_data.shape
This returns a tuple of (number of rows, number of columns).
2. Information Summary
sales_data.info()
This shows column types, non-null values, and memory usage.
3. Statistical Summary
sales_data.describe()
This gives mean, standard deviation, min, max, and other statistics for numeric columns like units sold and revenue.
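By default, describe() only summarizes numeric columns. If you also want counts and the most frequent values for text columns such as Month and Product, you can pass include='all':
sales_data.describe(include='all')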
Handling Missing Data
Handling missing data is one of the most common tasks in the data wrangling process, and Pandas makes it easy with a few simple methods:
1. Dropping Missing Values
sales_data = sales_data.dropna()
2. Filling Missing Values
sales_data = sales_data.fillna(0)
Or use more intelligent strategies like forward fill:
sales_data = sales_data.fillna(method='ffill')
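Note that recent Pandas releases (2.1 and later) deprecate the method argument of fillna(); the dedicated ffill() method does the same thing:
sales_data = sales_data.ffill()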
Selecting and Filtering Data
1. Using .loc (Label-Based)
Get all rows where product is “Laptop”
sales_data.loc[sales_data['Product'] == 'Laptop']
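.loc also accepts a column selection alongside the row filter. As an illustrative query (not part of the original example), you could select only the month and units sold for rows with more than 100 units:
sales_data.loc[sales_data['Units_Sold'] > 100, ['Month', 'Units_Sold']]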
2. Using .iloc (Position-Based)
Get first two rows and first three columns
sales_data.iloc[:2, :3]
Applying Functions and Column Transformations
If you want to convert all month names to uppercase, then use the following code:
sales_data['Month'] = sales_data['Month'].apply(lambda x: x.upper())
And if you want to increase Revenue by 10% (for example, to add a tax), then use:
sales_data['Revenue'] = sales_data['Revenue'].apply(lambda x: x * 1.10)
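Both of these transformations can also be written as vectorized operations, which are usually faster than apply() on large datasets:
sales_data['Month'] = sales_data['Month'].str.upper()
sales_data['Revenue'] = sales_data['Revenue'] * 1.10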
Other Important Data Wrangling Methods
Beyond these basics, Pandas offers several other data wrangling methods:
1. Grouping and Aggregation
Total units sold by product
sales_data.groupby('Product')['Units_Sold'].sum()
Total revenue by month
sales_data.groupby('Month')['Revenue'].sum()
Multiple aggregations
sales_data.groupby('Product').agg({
'Units_Sold': 'sum',
'Revenue': 'mean'
})
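groupby() returns the grouping keys as the index of the result. If you prefer a flat DataFrame with Product as a regular column, pass as_index=False (or call reset_index() on the result):
sales_data.groupby('Product', as_index=False).agg({
    'Units_Sold': 'sum',
    'Revenue': 'mean'
})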
2. Merging, Joining, Concatenating
Merging discount data (for example)
discounts = pd.DataFrame({
    'Product': ['Laptop', 'Mobile', 'Tablet', 'Smartwatch'],
    'Discount_Percent': [10, 5, 7, 3]
})
merged_data = pd.merge(sales_data, discounts, on='Product')
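By default, merge() performs an inner join, keeping only products present in both DataFrames. If you want to keep every sales row even when no discount is defined for its product, specify how='left':
merged_data = pd.merge(sales_data, discounts, on='Product', how='left')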
Concatenating two dataframes
more_sales = pd.DataFrame({...})
combined_data = pd.concat([sales_data, more_sales], ignore_index=True)
3. Reshaping Data
Pivot Table — Revenue by Month and Product
sales_data.pivot_table(values='Revenue', index='Month', columns='Product', aggfunc='sum')
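Month–Product combinations with no sales appear as NaN in the pivot table; you can pass fill_value=0 to show zeros instead:
sales_data.pivot_table(values='Revenue', index='Month', columns='Product', aggfunc='sum', fill_value=0)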
Melting Data — Wide to Long Format
pd.melt(sales_data, id_vars=['Month'], var_name='Category', value_name='Value')
Best Practices for Data Manipulation in Python
Here are some pro tips to keep your data wrangling with Pandas efficient:
1. Avoid loops when possible: Use vectorized operations instead of Python loops; Pandas and NumPy are optimized for whole-array operations
2. Be cautious with apply(): It can be slower than built-in vectorized functions, so reserve it for custom logic that has no vectorized equivalent
3. Chain operations: You can combine multiple Pandas methods into a single pipeline to keep data wrangling readable and efficient (see the sketch after this list)
4. Always inspect your data: Use info(), describe(), and other exploratory methods early to understand your dataset
5. Document transformations: When preprocessing data for a machine learning model, keep a clear log of all transformations so you can reproduce your results later when needed
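As a minimal sketch of points 1 and 3, the pipeline below chains vectorized operations to rank products by total revenue without any explicit Python loop. It assumes sales_data still has its original Units_Sold and Unit_Price columns:
top_products = (
    sales_data
    .assign(Revenue=lambda df: df['Units_Sold'] * df['Unit_Price'])  # vectorized revenue per row
    .groupby('Product', as_index=False)['Revenue'].sum()             # total revenue per product
    .sort_values('Revenue', ascending=False)                         # highest revenue first
)
print(top_products)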
Final Thoughts!
Data manipulation in Python using Pandas is a core skill for anyone working in data science, analytics, or machine learning. With Pandas, you can easily load raw data, explore its structure, address issues such as missing or incorrect values, transform fields, and reshape data for analysis.
This Python library has expressive syntax and offers powerful methods that help you turn messy, real-world data into structured, meaningful information. If you want to learn these essential skills, consider enrolling in USDSI® data science certifications, which offer a comprehensive curriculum aligned with the latest industry requirements and cover current tools and technologies.
Empowering yourself with these important data wrangling skills and earning recognized credentials from USDSI® will open doors to wonderful data science career opportunities in the years to come.
