Research Project Template

Key Features#

  • Modular Analysis Structure: Separate notebooks for individual analyses with dedicated data storage
  • Pipeline Framework: Orchestrated execution of data processing functions
  • Environment Management: Virtual environment setup with dependency tracking
  • Version Control Ready: Pre-configured .gitignore for data science projects
  • Scalable Architecture: Easy addition of new analyses and pipeline functions

Table of Contents#

  1. Project Structure
  2. Prerequisites
  3. Quick Start
  4. How to Use the Project
  5. Adding New Content
  6. Examples
  7. Troubleshooting
  8. Contributing

Project Structure#

project/
├── analysis/                    # Individual analysis notebooks
│   └── example_analysis.ipynb
├── data/
│   ├── analysis/               # Analysis-specific data
│   │   └── example_analysis/
│   ├── raw/                    # Initial, unprocessed data
│   ├── processed/              # Cleaned and processed data
│   └── [pipeline_outputs]/     # Additional folders created by pipeline
├── src/
│   ├── project_execution.ipynb # Main pipeline orchestration
│   └── project_functions.py    # Reusable pipeline functions
├── utils/                      # Shared utilities and helper functions
├── .gitignore                  # Excludes data/ and venv/ from version control
├── data_structure.txt          # Documentation of data folder structure
├── requirements.txt            # Project dependencies
└── README.md                   # This file

Description of Folders and Files#

  • analysis/:

    • Contains Jupyter notebooks for individual analyses. Each notebook has its own folder in data/analysis/ for the data it generates or consumes.
  • data/:

    • analysis/: Analysis-specific data, organized by notebook name.
    • raw/: Initial, unprocessed data files.
    • processed/: Cleaned data derived from the raw/ folder, typically generated by the first function in the project pipeline.
    • Additional folders may exist to store files generated by the project pipeline. These folders can be organized based on the type of output or the specific functions that create them.
  • src/:

    • project_execution.ipynb: Orchestrates and displays the project pipeline.
    • project_functions.py: Contains reusable functions used in the pipeline.
  • utils/:

    • Reusable Python modules for use in analyses or other projects.
  • .gitignore:

    • Excludes the data/ and venv/ folders from version control.
  • data_structure.txt:

    • Documents the structure of the data/ folder (generated by tree -asD data > data_structure.txt).
  • requirements.txt:

    • Lists all dependencies required for the project.

Prerequisites#

  • Python 3.8+
  • Git (for version control)
  • Jupyter Notebook/Lab (for running notebooks)
  • pip (Python package installer)

Quick Start#

# Clone the repository
git clone <repository-url>
cd <project-name>

# Create and activate virtual environment
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Start Jupyter
jupyter lab

How to Use the Project#

1. Environment Setup#

Create a Virtual Environment:

python -m venv venv

Activate the Virtual Environment:

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

Install Dependencies:

pip install -r requirements.txt

2. Running the Project#

Execute the Project Pipeline:

  • Open src/project_execution.ipynb in Jupyter Lab/Notebook
  • This notebook orchestrates the various functions and processes defined in the project
  • Execute cells sequentially to run the complete workflow

Run Individual Analyses:

  • Navigate to the analysis/ folder
  • Open any Jupyter notebook (e.g., example_analysis.ipynb)
  • These notebooks are independent and can be executed separately from the main pipeline

3. Data Organization#

  • Place raw data files in data/raw/
  • Pipeline functions write processed data to data/processed/ (a sketch of this convention follows)
  • Analysis-specific data is stored in data/analysis/[notebook_name]/
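
A minimal sketch of this convention, with illustrative file names and paths relative to the project root:

from pathlib import Path
import pandas as pd

RAW = Path('data/raw')
PROCESSED = Path('data/processed')

# File names are illustrative; adjust to your dataset
df = pd.read_csv(RAW / 'dataset.csv')
PROCESSED.mkdir(parents=True, exist_ok=True)
df.to_csv(PROCESSED / 'dataset.csv', index=False)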

Adding New Content#

Adding a New Analysis#

  1. Create a New Jupyter Notebook:

    # Navigate to analysis folder
    cd analysis/
    # Create new notebook (or use Jupyter interface)
    touch new_analysis.ipynb
  2. Create a Corresponding Data Folder:

    # From the project root
    mkdir -p data/analysis/new_analysis
  3. Implement Your Analysis:

    • Write your analysis code in the new notebook
    • Save output data files in data/analysis/new_analysis/
    • Import utilities from the utils/ folder as needed
  4. Document Your Analysis:

    • Include markdown cells explaining your methodology
    • Document key findings and conclusions
    • Add comments to complex code sections

Adding a Function to the Project Pipeline#

  1. Test Your Function:

    • Develop and test your function in a separate Jupyter notebook first
    • Ensure it handles edge cases and errors appropriately
  2. Define Your Function in src/project_functions.py:

    def new_function(input_data, output_path, **kwargs):
        """
        Brief description of what the function does.
        
        Parameters:
        -----------
        input_data : str or pd.DataFrame
            Description of the input parameter
        output_path : str
            Path where output files will be saved
        **kwargs : dict
            Additional parameters for function customization
        
        Returns:
        --------
        bool or str
            Description of the return value (success status or output path)
        
        Raises:
        -------
        ValueError
            If input validation fails
        FileNotFoundError
            If required input files don't exist
        """
        try:
            # Function implementation here
            print(f"Processing {input_data}...")
            
            # Your logic here; process_data and save_results are
            # placeholders for your actual processing and I/O steps
            result = process_data(input_data)
            
            # Save results
            save_results(result, output_path)
            
            print(f"✅ Function completed successfully. Output saved to {output_path}")
            return output_path
            
        except Exception as e:
            print(f"❌ Error in new_function: {str(e)}")
            raise
  3. Integrate into the Pipeline:

    • Open src/project_execution.ipynb
    • Add your function to the execution list (a loop that runs these steps is sketched after the list):
    pipeline_steps = [
        {
            'function': pu.existing_function,
            'execute': True,
            'args': ['data/raw/input.csv', 'data/processed/'],
            'description': 'Processes raw data'
        },
        {
            'function': pu.new_function,
            'execute': True,
            'args': ['data/processed/input.csv', 'data/outputs/'],
            'kwargs': {'param1': 'value1'},
            'description': 'Your new function description'
        }
    ]
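
The template does not prescribe the loop that consumes this list, so here is a minimal sketch, assuming each step dict carries 'function', 'execute', 'args', 'description', and an optional 'kwargs' key as above:

# Minimal orchestration loop for src/project_execution.ipynb (a sketch)
for step in pipeline_steps:
    if not step['execute']:
        print(f"Skipping: {step['description']}")
        continue
    print(f"Running: {step['description']}")
    step['function'](*step['args'], **step.get('kwargs', {}))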

Examples#

Example: Adding a Data Cleaning Analysis#

  1. Create the analysis notebook:

    # In analysis/data_cleaning.ipynb
    
    import pandas as pd
    import sys
    sys.path.append('../utils')
    from data_helpers import load_data, save_clean_data  # hypothetical helpers (sketched below)
    
    # Load data
    df = load_data('../data/raw/dataset.csv')
    
    # Perform cleaning
    df_clean = df.dropna().reset_index(drop=True)
    
    # Save results
    save_clean_data(df_clean, '../data/analysis/data_cleaning/cleaned_dataset.csv')
  2. Data folder structure:

    data/analysis/data_cleaning/
    ├── cleaned_dataset.csv
    ├── cleaning_report.html
    └── data_quality_plots.png
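
The load_data and save_clean_data helpers are assumed rather than shipped with the template; a minimal utils/data_helpers.py that would satisfy those imports could look like this sketch:

# utils/data_helpers.py -- hypothetical module for the example above
import os

import pandas as pd

def load_data(path):
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(path)

def save_clean_data(df, path):
    """Write a cleaned DataFrame, creating the target folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)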

Example: Adding a Pipeline Function#

# In src/project_functions.py

import pandas as pd
from sklearn.preprocessing import StandardScaler

def feature_engineering(input_path, output_path, feature_config=None):
    """
    Creates engineered features from processed data.
    
    Parameters:
    -----------
    input_path : str
        Path to the processed data file
    output_path : str
        Path to save engineered features
    feature_config : dict, optional
        Configuration for feature creation
    """
    # Load data
    df = pd.read_csv(input_path)
    
    # Create features (column names are illustrative)
    df['feature_1'] = df['column_a'] * df['column_b']
    df['feature_2'] = df['column_c'].rolling(window=3).mean()
    
    # Scale features if requested
    if feature_config and feature_config.get('scale_features', False):
        scaler = StandardScaler()
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    
    # Save results
    df.to_csv(output_path, index=False)
    print(f"✅ Feature engineering completed. Features saved to {output_path}")
    
    return output_path
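
To run feature_engineering as part of the pipeline, add an entry for pu.feature_engineering to the pipeline_steps list in src/project_execution.ipynb, as shown in the previous section.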

Troubleshooting#

Common Issues#

Virtual Environment Issues:

# If activation fails, try:
python -m pip install virtualenv
python -m virtualenv venv

Jupyter Kernel Issues:

# Install ipykernel in your virtual environment
pip install ipykernel
python -m ipykernel install --user --name=venv

Import Errors:

  • Ensure your virtual environment is activated
  • Check that all dependencies are installed: pip install -r requirements.txt
  • Verify that utils/ modules are importable by adding sys.path.append('../utils') in notebooks
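
Because sys.path.append('../utils') depends on the notebook's working directory, a slightly more robust variant (a sketch) resolves the path explicitly:

import sys
from pathlib import Path

# Resolve utils/ relative to the notebook's folder, which Jupyter
# uses as the working directory by default
utils_dir = (Path.cwd().parent / 'utils').resolve()
sys.path.append(str(utils_dir))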

Data Path Issues:

  • Use relative paths from the notebook’s location
  • Ensure data folders exist before running functions
  • Check file permissions for read/write access

Performance Tips#

  • Use pandas.read_csv(path, chunksize=1000) to stream large datasets in chunks
  • Implement progress bars with tqdm for long-running processes
  • Use pickle or joblib to cache intermediate results
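
A short sketch combining these tips (file names are illustrative):

import joblib
import pandas as pd
from tqdm import tqdm

# Stream a large CSV in chunks instead of loading it all at once
chunks = pd.read_csv('data/raw/large_dataset.csv', chunksize=1000)
df = pd.concat((chunk.dropna() for chunk in tqdm(chunks)), ignore_index=True)

# Cache the intermediate result so later runs can skip the expensive step
joblib.dump(df, 'data/processed/large_dataset.joblib')
df = joblib.load('data/processed/large_dataset.joblib')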

Contributing#

Code Style#

  • Follow PEP 8 for Python code formatting
  • Use descriptive variable and function names
  • Include docstrings for all functions
  • Add type hints where appropriate

Testing#

  • Test new functions in isolation before adding to pipeline
  • Include error handling and input validation
  • Document expected input/output formats

Documentation#

  • Update this README when adding major features
  • Include inline comments for complex logic
  • Create examples for new functionality

Pull Request Process#

  1. Create a feature branch from main
  2. Test your changes thoroughly
  3. Update documentation as needed
  4. Submit a pull request with clear description

License#

This project is licensed under the MIT License - see the LICENSE file for details.

Contact#

For questions or suggestions, please open an issue.
