Tho Le

A Data Scientist. Looking for knowledge!

Python - Useful code snippets

09 Mar 2025 » python, example

Ref

General

  • Chain Transformation with Pipe
df = df.pipe(lambda d: d.rename(columns={'old_name': 'new_name'})).pipe(lambda d: d.query('new_name > 10'))
  • Pivot Data with Multiple Aggregation
import pandas as pd
import numpy as np

data = {
    'category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'sub_category': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'value': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)

pivot_df = df.pivot_table(index='category', columns='sub_category', values='value',
                            aggfunc={'value': [np.mean, np.sum]})
  • Time Series Resampling with Multiple Aggregation
    • Time series data resampling is a method of data summarization in a time-frequent manner we want, such as daily, weekly, monthly, etc.
df_resampled = df.set_index('timestamp').resample('D').agg({'value': ['mean', 'max'], 'count': 'sum'}).reset_index()
  • Conditional Selection For Assigning Values
import pandas as pd
import numpy as np

data = {
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'salary': [3500, 5000, 2500, 8000, 1000]
}
df = pd.DataFrame(data)

df['salary_level'] = np.select(
    [df['salary'] = 3000) & (df['salary']  6000],
    ['Low', 'Medium', 'High']
  • Conditional Replacement For Several Columns
df = df.pipe(lambda d: d.rename(columns={'old_name': 'new_name'})).pipe(lambda d: d.query('new_name > 10'))
  • Multiple Columns Combination
    • Not necessarily work for numerical data, more for text data.
df['combined'] = df[['col1', 'col2', 'col3']].astype(str).agg('_'.join, axis=1)
  • Column Splitting
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)
  • Outlier Identification and Removal
df['capped'] = df['value'].clip(lower=df['value'].quantile(0.05), upper=df['value'].quantile(0.95))
  • Merge Multiple DataFrame with Reduce
    • Without having to manually merge with intermediate dataframes. Not sure about speed improvement.
import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'key': [1, 2, 3], 'A': ['a1', 'a2', 'a3']})
df2 = pd.DataFrame({'key': [2, 3, 4], 'B': ['b2', 'b3', 'b4']})
df3 = pd.DataFrame({'key': [3, 4, 5], 'C': ['c3', 'c4', 'c5']})

list_of_dfs = [df1, df2, df3]

df_merged = reduce(lambda left, right: pd.merge(left, right, on='key', how='outer'), list_of_dfs)
  • DataFrame Query Optimization with Eval
    • Creating a new column based on a DataFrame calculation might take some time, especially if the data are large.
    • To optimize this process, we can use the Pandas eval() function. By using a procedure similar to the query() function, eval() can improve the execution time while reducing the need for an intermediate object.
df = df.eval("col3 = (col1 * 0.8 + col2 * 0.2) / col4", inplace=False)