Research Project Template
Key Features
- Modular Analysis Structure: Separate notebooks for individual analyses with dedicated data storage
- Pipeline Framework: Orchestrated execution of data processing functions
- Environment Management: Virtual environment setup with dependency tracking
- Version Control Ready: Pre-configured .gitignore for data science projects
- Scalable Architecture: Easy addition of new analyses and pipeline functions
Project Structure
project/
├── analysis/                     # Individual analysis notebooks
│   └── example_analysis.ipynb
├── data/
│   ├── analysis/                 # Analysis-specific data
│   │   └── example_analysis/
│   ├── raw/                      # Initial, unprocessed data
│   ├── processed/                # Cleaned and processed data
│   └── [pipeline_outputs]/       # Additional folders created by pipeline
├── src/
│   ├── project_execution.ipynb   # Main pipeline orchestration
│   └── project_functions.py      # Reusable pipeline functions
├── utils/                        # Shared utilities and helper functions
├── .gitignore                    # Excludes data/ and venv/ from version control
├── data_structure.txt            # Documentation of data folder structure
├── requirements.txt              # Project dependencies
└── README.md                     # This file
Description of Folders and Files
analysis/:
- Contains Jupyter notebooks for individual analyses. Each notebook has its own folder in data/analysis/ to store generated or utilized data.
data/:
- analysis/: Analysis-specific data, organized by notebook name.
- raw/: Initial, unprocessed data files.
- processed/: Processed data derived from the raw/ folder. This folder is typically generated by the first function in the project pipeline.
- Additional folders may exist to store files generated by the project pipeline. These folders can be organized based on the type of output or the specific functions that create them.
src/:
- project_execution.ipynb: Orchestrates and displays the project pipeline.
- project_functions.py: Contains reusable functions used in the pipeline.
utils/:
- Useful and reusable Python modules for analyses or other projects.
.gitignore:
- Excludes the data/ and venv/ folders from version control.
data_structure.txt:
- Documents the structure of the data/ folder (generated by tree -asD data > data_structure.txt).
requirements.txt:
- Lists all dependencies required for the project.
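To keep requirements.txt in sync with the active virtual environment, one common approach is to regenerate it after installing new packages:
pip freeze > requirements.txt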
Prerequisites
- Python 3.8+
- Git (for version control)
- Jupyter Notebook/Lab (for running notebooks)
- pip (Python package installer)
Quick Start
# Clone the repository
git clone <repository-url>
cd <project-name>
# Create and activate virtual environment
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Start Jupyter
jupyter lab
How to Use the Project
1. Environment Setup
Create a Virtual Environment:
python -m venv venv
Activate the Virtual Environment:
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
Install Dependencies:
pip install -r requirements.txt
2. Running the Project
Execute the Project Pipeline:
- Open src/project_execution.ipynb in Jupyter Lab/Notebook
- This notebook orchestrates the various functions and processes defined in the project
- Execute cells sequentially to run the complete workflow
Run Individual Analyses:
- Navigate to the analysis/ folder
- Open any Jupyter notebook (e.g., example_analysis.ipynb)
- These notebooks are independent and can be executed separately from the main pipeline
3. Data Organization
- Place raw data files in data/raw/
- Processed data will be automatically saved to data/processed/
- Analysis-specific data is stored in data/analysis/[notebook_name]/ (see the setup sketch below)
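The troubleshooting section notes that data folders must exist before functions run; a small setup cell like the following (a sketch, not part of the template) creates the standard layout up front:
from pathlib import Path

# Create the standard data folders if they are missing (safe to re-run)
for folder in ["data/raw", "data/processed", "data/analysis"]:
    Path(folder).mkdir(parents=True, exist_ok=True)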
Adding New Content
Adding a New Analysis
Create a New Jupyter Notebook:
# Navigate to analysis folder
cd analysis/
# Create new notebook (or use Jupyter interface)
touch new_analysis.ipynb
Create a Corresponding Data Folder:
mkdir data/analysis/new_analysis
Implement Your Analysis:
- Write your analysis code in the new notebook
- Save output data files in data/analysis/new_analysis/
- Import utilities from the utils/ folder as needed
Document Your Analysis:
- Include markdown cells explaining your methodology
- Document key findings and conclusions
- Add comments to complex code sections
Adding a Function to the Project Pipeline
Test Your Function:
- Develop and test your function in a separate Jupyter notebook first
- Ensure it handles edge cases and errors appropriately
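For example, a quick smoke test in a scratch notebook might look like the sketch below; drop_missing_rows is a hypothetical pipeline step used only for illustration:
import pandas as pd

def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Toy pipeline step: remove rows that contain missing values."""
    return df.dropna().reset_index(drop=True)

# Use a tiny in-memory sample instead of reading from data/raw/
sample = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})
result = drop_missing_rows(sample)

# Edge cases: expected row count, no missing values left, empty input handled
assert len(result) == 1
assert not result.isna().any().any()
assert drop_missing_rows(sample.iloc[0:0]).empty
print("✅ Smoke tests passed")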
Define Your Function in src/project_functions.py:
def new_function(input_data, output_path, **kwargs):
    """
    Brief description of what the function does.

    Parameters:
    -----------
    input_data : str or pd.DataFrame
        Description of the input parameter
    output_path : str
        Path where output files will be saved
    **kwargs : dict
        Additional parameters for function customization

    Returns:
    --------
    bool or str
        Description of the return value (success status or output path)

    Raises:
    -------
    ValueError
        If input validation fails
    FileNotFoundError
        If required input files don't exist
    """
    try:
        # Function implementation here
        print(f"Processing {input_data}...")

        # Your logic here
        result = process_data(input_data)

        # Save results
        save_results(result, output_path)

        print(f"✅ Function completed successfully. Output saved to {output_path}")
        return output_path
    except Exception as e:
        print(f"❌ Error in new_function: {str(e)}")
        raise
Integrate into the Pipeline:
- Open src/project_execution.ipynb
- Add your function to the execution list (a sketch of a matching runner loop follows the example):
pipeline_steps = [
    {
        'function': pu.existing_function,
        'execute': True,
        'args': ['data/raw/input.csv', 'data/processed/'],
        'description': 'Processes raw data'
    },
    {
        'function': pu.new_function,
        'execute': True,
        'args': ['data/processed/input.csv', 'data/outputs/', {'param1': 'value1'}],
        'description': 'Your new function description'
    }
]
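The runner code in project_execution.ipynb is not shown here, so the exact mechanism may differ; as a rough sketch, a cell iterating over the list above could look like this (it assumes pipeline_steps from the previous cell and the alias import project_functions as pu, and treats a trailing dict in args as keyword arguments):
for step in pipeline_steps:
    if not step['execute']:
        print(f"Skipping: {step['description']}")
        continue
    print(f"Running: {step['description']}")
    args = list(step['args'])
    # Interpret a trailing dict as keyword arguments, matching the example list above
    kwargs = args.pop() if args and isinstance(args[-1], dict) else {}
    step['function'](*args, **kwargs)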
Examples
Example: Adding a Data Cleaning Analysis
Create the analysis notebook:
# In analysis/data_cleaning.ipynb
import pandas as pd
import sys
sys.path.append('../utils')
from data_helpers import load_data, save_clean_data

# Load data
df = load_data('../data/raw/dataset.csv')

# Perform cleaning
df_clean = df.dropna().reset_index(drop=True)

# Save results
save_clean_data(df_clean, '../data/analysis/data_cleaning/cleaned_dataset.csv')
Data folder structure:
data/analysis/data_cleaning/
├── cleaned_dataset.csv
├── cleaning_report.html
└── data_quality_plots.png
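The notebook above imports load_data and save_clean_data from utils/, but those helpers are not part of the template itself; a minimal sketch of what a utils/data_helpers.py module could look like:
# utils/data_helpers.py -- minimal sketch; adapt to your own project
from pathlib import Path
import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    """Read a CSV file into a DataFrame."""
    return pd.read_csv(path)

def save_clean_data(df: pd.DataFrame, path: str) -> str:
    """Save a cleaned DataFrame as CSV, creating the parent folder if needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path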
Example: Adding a Pipeline Function
# In src/project_functions.py
def feature_engineering(input_path, output_path, feature_config=None):
    """
    Creates engineered features from processed data.

    Parameters:
    -----------
    input_path : str
        Path to the processed data file
    output_path : str
        Path to save engineered features
    feature_config : dict, optional
        Configuration for feature creation
    """
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Load data
    df = pd.read_csv(input_path)

    # Create features
    df['feature_1'] = df['column_a'] * df['column_b']
    df['feature_2'] = df['column_c'].rolling(window=3).mean()

    # Scale features if requested
    if feature_config and feature_config.get('scale_features', False):
        scaler = StandardScaler()
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
        df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    # Save results
    df.to_csv(output_path, index=False)
    print(f"✅ Feature engineering completed. Features saved to {output_path}")
    return output_path
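A call with scaling enabled might then look like this (the paths and the expected columns column_a, column_b, column_c are placeholders):
# Hypothetical invocation; adjust paths and column names to your data
feature_engineering(
    'data/processed/input.csv',
    'data/outputs/engineered_features.csv',
    feature_config={'scale_features': True},
)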
Troubleshooting
Common Issues
Virtual Environment Issues:
# If activation fails, try:
python -m pip install virtualenv
python -m virtualenv venv
Jupyter Kernel Issues:
# Install ipykernel in your virtual environment
pip install ipykernel
python -m ipykernel install --user --name=venv
Import Errors:
- Ensure your virtual environment is activated
- Check that all dependencies are installed:
pip install -r requirements.txt
- Verify that utils/ modules are importable by adding sys.path.append('../utils') in notebooks
Data Path Issues:
- Use relative paths from the notebook’s location
- Ensure data folders exist before running functions
- Check file permissions for read/write access
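A quick sanity-check cell near the top of a notebook (a sketch; adjust the folder to whatever the notebook needs) makes path problems obvious early:
from pathlib import Path

# From a notebook in analysis/, the raw data folder is one level up
raw_dir = Path('..') / 'data' / 'raw'
if not raw_dir.exists():
    raise FileNotFoundError(f"Expected data folder not found: {raw_dir.resolve()}")
print(f"Using raw data from: {raw_dir.resolve()}")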
Performance Tips
- Use pandas.read_csv(..., chunksize=1000) for large datasets
- Implement progress bars with tqdm for long-running processes
- Use pickle or joblib to cache intermediate results (a combined sketch follows this list)
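As a rough illustration of the chunked-reading, progress-bar, and caching tips together (file paths are placeholders):
import pandas as pd
import joblib
from tqdm import tqdm

# Process a large CSV in chunks instead of loading it all at once
partial_sums = []
for chunk in tqdm(pd.read_csv('data/raw/large_dataset.csv', chunksize=1000),
                  desc="Reading chunks"):
    partial_sums.append(chunk.select_dtypes('number').sum())
column_totals = pd.concat(partial_sums, axis=1).sum(axis=1)

# Cache the intermediate result so later runs can skip the expensive step
joblib.dump(column_totals, 'data/processed/column_totals.joblib')
column_totals = joblib.load('data/processed/column_totals.joblib')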
Contributing
Code Style
- Follow PEP 8 for Python code formatting
- Use descriptive variable and function names
- Include docstrings for all functions
- Add type hints where appropriate
Testing
- Test new functions in isolation before adding to pipeline
- Include error handling and input validation
- Document expected input/output formats
Documentation
- Update this README when adding major features
- Include inline comments for complex logic
- Create examples for new functionality
Pull Request Process
- Create a feature branch from main
- Test your changes thoroughly
- Update documentation as needed
- Submit a pull request with clear description
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For questions or suggestions, please open an issue.