Research Project Template

Key Features#

  • Modular Analysis Structure: Separate notebooks for individual analyses with dedicated data storage
  • Pipeline Framework: Orchestrated execution of data processing functions
  • Environment Management: Virtual environment setup with dependency tracking
  • Version Control Ready: Pre-configured .gitignore for data science projects
  • Scalable Architecture: Easy addition of new analyses and pipeline functions

Table of Contents#

  1. Project Structure
  2. Prerequisites
  3. Quick Start
  4. How to Use the Project
  5. Adding New Content
  6. Examples
  7. Troubleshooting
  8. Contributing

Project Structure#

project/
├── analysis/                    # Individual analysis notebooks
│   └── example_analysis.ipynb
├── data/
│   ├── analysis/               # Analysis-specific data
│   │   └── example_analysis/
│   ├── raw/                    # Initial, unprocessed data
│   ├── processed/              # Cleaned and processed data
│   └── [pipeline_outputs]/     # Additional folders created by pipeline
├── src/
│   ├── project_execution.ipynb # Main pipeline orchestration
│   └── project_functions.py    # Reusable pipeline functions
├── utils/                      # Shared utilities and helper functions
├── .gitignore                  # Excludes data/ and venv/ from version control
├── data_structure.txt          # Documentation of data folder structure
├── requirements.txt            # Project dependencies
└── README.md                   # This file

Description of Folders and Files#

  • analysis/:

    • Contains Jupyter notebooks for individual analyses. Each notebook has its own folder in data/analysis/ for the data it generates or consumes.
  • data/:

    • analysis/: Analysis-specific data, organized by notebook name.
    • raw/: Initial, unprocessed data files.
    • processed/: Cleaned data derived from the raw/ folder, typically generated by the first function in the project pipeline.
    • Additional folders may exist to store files generated by the project pipeline. These folders can be organized based on the type of output or the specific functions that create them.
  • src/:

    • project_execution.ipynb: Orchestrates and displays the project pipeline.
    • project_functions.py: Contains reusable functions used in the pipeline.
  • utils/:

    • Reusable Python modules for use in analyses or other projects.
  • .gitignore:

    • Excludes the data/ and venv/ folders from version control.
  • data_structure.txt:

    • Documents the structure of the data/ folder (generated by tree -asD data > data_structure.txt).
  • requirements.txt:

    • Lists all dependencies required for the project.

Prerequisites#

  • Python 3.8+
  • Git (for version control)
  • Jupyter Notebook/Lab (for running notebooks)
  • pip (Python package installer)

Quick Start#

# Clone the repository
git clone <repository-url>
cd <project-name>

# Create and activate virtual environment
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start Jupyter
jupyter lab

How to Use the Project#

1. Environment Setup#

Create a Virtual Environment:

python -m venv venv

Activate the Virtual Environment:

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

Install Dependencies:

pip install -r requirements.txt

2. Running the Project#

Execute the Project Pipeline:

  • Open src/project_execution.ipynb in Jupyter Lab/Notebook
  • This notebook orchestrates the various functions and processes defined in the project
  • Execute cells sequentially to run the complete workflow

Run Individual Analyses:

  • Navigate to the analysis/ folder
  • Open any Jupyter notebook (e.g., example_analysis.ipynb)
  • These notebooks are independent and can be executed separately from the main pipeline

3. Data Organization#

  • Place raw data files in data/raw/
  • Pipeline functions write processed data to data/processed/ (a sketch of this convention follows)
  • Analysis-specific data is stored in data/analysis/[notebook_name]/
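
A minimal sketch of this convention, with illustrative file names and paths relative to the project root:

from pathlib import Path
import pandas as pd

RAW = Path('data/raw')
PROCESSED = Path('data/processed')

# File names are illustrative; adjust to your dataset
df = pd.read_csv(RAW / 'dataset.csv')
PROCESSED.mkdir(parents=True, exist_ok=True)
df.to_csv(PROCESSED / 'dataset.csv', index=False)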

Adding New Content#

Adding a New Analysis#

  1. Create a New Jupyter Notebook:

    # Navigate to analysis folder
    cd analysis/
    # Create new notebook (or use Jupyter interface)
    touch new_analysis.ipynb
  2. Create a Corresponding Data Folder:

    # From the project root
    mkdir -p data/analysis/new_analysis
  3. Implement Your Analysis:

    • Write your analysis code in the new notebook
    • Save output data files in data/analysis/new_analysis/
    • Import utilities from the utils/ folder as needed
  4. Document Your Analysis:

    • Include markdown cells explaining your methodology
    • Document key findings and conclusions
    • Add comments to complex code sections

Adding a Function to the Project Pipeline#

  1. Test Your Function:

    • Develop and test your function in a separate Jupyter notebook first
    • Ensure it handles edge cases and errors appropriately
  2. Define Your Function in src/project_functions.py:

    def new_function(input_data, output_path, **kwargs):
        """
        Brief description of what the function does.
        
        Parameters:
        -----------
        input_data : str or pd.DataFrame
            Description of the input parameter
        output_path : str
            Path where output files will be saved
        **kwargs : dict
            Additional parameters for function customization
        
        Returns:
        --------
        bool or str
            Description of the return value (success status or output path)
        
        Raises:
        -------
        ValueError
            If input validation fails
        FileNotFoundError
            If required input files don't exist
        """
        try:
            # Function implementation here
            print(f"Processing {input_data}...")
            
            # Your logic here; process_data and save_results are
            # placeholders for your actual processing and I/O steps
            result = process_data(input_data)
            
            # Save results
            save_results(result, output_path)
            
            print(f"✅ Function completed successfully. Output saved to {output_path}")
            return output_path
            
        except Exception as e:
            print(f"❌ Error in new_function: {str(e)}")
            raise
  3. Integrate into the Pipeline:

    • Open src/project_execution.ipynb
    • Add your function to the execution list (a loop that runs these steps is sketched after the list):
    pipeline_steps = [
        {
            'function': pu.existing_function,
            'execute': True,
            'args': ['data/raw/input.csv', 'data/processed/'],
            'description': 'Processes raw data'
        },
        {
            'function': pu.new_function,
            'execute': True,
            'args': ['data/processed/input.csv', 'data/outputs/'],
            'kwargs': {'param1': 'value1'},
            'description': 'Your new function description'
        }
    ]
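
The template does not prescribe the loop that consumes this list, so here is a minimal sketch, assuming each step dict carries 'function', 'execute', 'args', 'description', and an optional 'kwargs' key as above:

# Minimal orchestration loop for src/project_execution.ipynb (a sketch)
for step in pipeline_steps:
    if not step['execute']:
        print(f"Skipping: {step['description']}")
        continue
    print(f"Running: {step['description']}")
    step['function'](*step['args'], **step.get('kwargs', {}))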

Examples#

Example: Adding a Data Cleaning Analysis#

  1. Create the analysis notebook:

    # In analysis/data_cleaning.ipynb
    
    import pandas as pd
    import sys
    sys.path.append('../utils')
    from data_helpers import load_data, save_clean_data  # hypothetical helpers (sketched below)
    
    # Load data
    df = load_data('../data/raw/dataset.csv')
    
    # Perform cleaning
    df_clean = df.dropna().reset_index(drop=True)
    
    # Save results
    save_clean_data(df_clean, '../data/analysis/data_cleaning/cleaned_dataset.csv')
  2. Data folder structure:

    data/analysis/data_cleaning/
    ├── cleaned_dataset.csv
    ├── cleaning_report.html
    └── data_quality_plots.png
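
The load_data and save_clean_data helpers are assumed rather than shipped with the template; a minimal utils/data_helpers.py that would satisfy those imports could look like this sketch:

# utils/data_helpers.py -- hypothetical module for the example above
import os

import pandas as pd

def load_data(path):
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(path)

def save_clean_data(df, path):
    """Write a cleaned DataFrame, creating the target folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)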

Example: Adding a Pipeline Function#

# In src/project_functions.py

import pandas as pd
from sklearn.preprocessing import StandardScaler

def feature_engineering(input_path, output_path, feature_config=None):
    """
    Creates engineered features from processed data.
    
    Parameters:
    -----------
    input_path : str
        Path to the processed data file
    output_path : str
        Path to save engineered features
    feature_config : dict, optional
        Configuration for feature creation
    """
    # Load data
    df = pd.read_csv(input_path)
    
    # Create features (column names are illustrative)
    df['feature_1'] = df['column_a'] * df['column_b']
    df['feature_2'] = df['column_c'].rolling(window=3).mean()
    
    # Scale features if requested
    if feature_config and feature_config.get('scale_features', False):
        scaler = StandardScaler()
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    # Save results
    df.to_csv(output_path, index=False)
    print(f"✅ Feature engineering completed. Features saved to {output_path}")
    
    return output_path
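
To run feature_engineering as part of the pipeline, add an entry for pu.feature_engineering to the pipeline_steps list in src/project_execution.ipynb, as shown in the previous section.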

Troubleshooting#

Common Issues#

Virtual Environment Issues:

# If activation fails, try:
python -m pip install virtualenv
python -m virtualenv venv

Jupyter Kernel Issues:

# Install ipykernel in your virtual environment
pip install ipykernel
python -m ipykernel install --user --name=venv

Import Errors:

  • Ensure your virtual environment is activated
  • Check that all dependencies are installed: pip install -r requirements.txt
  • Verify that utils/ modules are importable by adding sys.path.append('../utils') in notebooks
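
Because sys.path.append('../utils') depends on the notebook's working directory, a slightly more robust variant (a sketch) resolves the path explicitly:

import sys
from pathlib import Path

# Resolve utils/ relative to the notebook's folder, which Jupyter
# uses as the working directory by default
utils_dir = (Path.cwd().parent / 'utils').resolve()
sys.path.append(str(utils_dir))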

Data Path Issues:

  • Use relative paths from the notebook’s location
  • Ensure data folders exist before running functions
  • Check file permissions for read/write access

Performance Tips#

  • Use pandas.read_csv(path, chunksize=1000) to stream large datasets in chunks
  • Implement progress bars with tqdm for long-running processes
  • Use pickle or joblib to cache intermediate results
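
A short sketch combining these tips (file names are illustrative):

import joblib
import pandas as pd
from tqdm import tqdm

# Stream a large CSV in chunks instead of loading it all at once
chunks = pd.read_csv('data/raw/large_dataset.csv', chunksize=1000)
df = pd.concat((chunk.dropna() for chunk in tqdm(chunks)), ignore_index=True)

# Cache the intermediate result so later runs can skip the expensive step
joblib.dump(df, 'data/processed/large_dataset.joblib')
df = joblib.load('data/processed/large_dataset.joblib')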

Contributing#

Code Style#

  • Follow PEP 8 for Python code formatting
  • Use descriptive variable and function names
  • Include docstrings for all functions
  • Add type hints where appropriate

Testing#

  • Test new functions in isolation before adding to pipeline
  • Include error handling and input validation
  • Document expected input/output formats

Documentation#

  • Update this README when adding major features
  • Include inline comments for complex logic
  • Create examples for new functionality

Pull Request Process#

  1. Create a feature branch from main
  2. Test your changes thoroughly
  3. Update documentation as needed
  4. Submit a pull request with clear description

License#

This project is licensed under the MIT License - see the LICENSE file for details.

Contact#

For questions or suggestions, please open an issue.
