[tutorial_ml:01] Project Structure and Functionalities

Section 001

Dec 8, 2023


Project Structure

Our sample project, tutorial_ml, is a Python package with a well-defined structure that promotes readability, maintainability, and ease of use:

tutorial_ml/
├── README.md         # Provides a comprehensive introduction and user guide for the package.
├── src/              # The main source directory for the package.
│   ├── tutorial/     # The core package containing all the primary modules and sub-packages.
│   │   ├── schemas/  # Contains data models and schemas, defining the data structure.
│   │   │   ├── ad_unit.py    # Defines the AdUnit model used for representing ad data.
│   │   │   └── __init__.py   # Signifies that 'schemas' is a Python sub-package.
│   │   ├── wrangler/ # Sub-package with modules for data collection, cleaning, and analysis.
│   │   │   ├── analyser.py   # Module for analyzing the ad data (e.g., statistical analysis, ML models).
│   │   │   ├── cleaner.py    # Module for cleaning and preprocessing ad data.
│   │   │   ├── collector.py  # Module for collecting or simulating ad performance data.
│   │   │   └── __init__.py   # Marks 'wrangler' as a Python sub-package.
│   │   └── __init__.py       # Marks 'tutorial' as a Python package.
│   └── __init__.py   # Signifies that 'src' directory is a Python package.
├── tests/            # Contains unit tests for the package, ensuring code reliability and correctness.
├── docs/             # Documentation for the package, including usage examples, API documentation, etc.
├── pyproject.toml    # Modern configuration file for specifying build system and dependencies.
└── setup.cfg         # Configuration file for setuptools, used to define package metadata and behavior.

This structure is designed to encapsulate different aspects of the project, keeping the source code, documentation, and configurations distinct and organised.

Functionalities

Schemas

The use of Pydantic in our project is illustrated through the AdUnit model. Pydantic utilizes Python type annotations for data validation, ensuring that each data attribute adheres to the specified type, enhancing data integrity and error handling:

from pydantic import BaseModel

class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float

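This AdUnit class provides a clear, concise, and self-validating representation of an advertisement unit. Each instance of AdUnit is validated against the type annotations when it is created. For instance, if you try to create an AdUnit with a non-numeric string such as "abc" for the ad_id, Pydantic will raise a ValidationError. Note that in its default (lax) mode Pydantic coerces compatible values, so a numeric string such as "123" is accepted and converted to the integer 123.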
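As a quick illustration of this validation behaviour, the sketch below redefines the AdUnit model from above and constructs it with valid and invalid input. It assumes Pydantic is installed; the sample values are made up for the example.

```python
from pydantic import BaseModel, ValidationError


class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float


# Well-typed input passes validation unchanged.
unit = AdUnit(ad_id=1, views=500, clicks=25, engagement_rate=0.05)

# In Pydantic's default (lax) mode a numeric string is coerced to int.
coerced = AdUnit(ad_id="2", views=500, clicks=25, engagement_rate=0.05)

# A string that cannot be read as an integer is rejected.
try:
    AdUnit(ad_id="not-a-number", views=500, clicks=25, engagement_rate=0.05)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc)  # human-readable description of the failing field
```

Catching ValidationError at the point where data enters the system keeps malformed records from reaching the cleaning and analysis steps.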

Moreover, Pydantic’s validation is not limited to simple data types. It also supports more complex validations, like regex patterns, min/max values, and custom validators, making it incredibly versatile for ensuring data quality.
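A minimal sketch of the min/max style of validation mentioned above, using Pydantic's Field constraints. BoundedAdUnit and is_valid are hypothetical names introduced only for this example, and the chosen bounds (non-negative counts, engagement rate between 0 and 1) are assumptions about what sensible ad data looks like, not part of the original model.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedAdUnit(BaseModel):
    """Hypothetical variant of AdUnit with range constraints."""

    ad_id: int = Field(ge=0)
    views: int = Field(ge=0)
    clicks: int = Field(ge=0)
    # Assumed bounds: an engagement rate is treated as a fraction in [0, 1].
    engagement_rate: float = Field(ge=0.0, le=1.0)


def is_valid(**fields) -> bool:
    """Return True if the fields pass validation, False otherwise."""
    try:
        BoundedAdUnit(**fields)
    except ValidationError:
        return False
    return True


print(is_valid(ad_id=1, views=10, clicks=2, engagement_rate=0.2))
print(is_valid(ad_id=1, views=10, clicks=2, engagement_rate=1.5))
```

The first call satisfies every constraint; the second violates the upper bound on engagement_rate and is rejected.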

This model not only provides a clear schema for ad units but also simplifies data handling processes like parsing and serialisation, making it ideal for applications like web APIs or data processing tasks.
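The parsing and serialisation round trip can be sketched as follows. This example assumes Pydantic v2, where the relevant methods are model_validate_json, model_dump, and model_dump_json (Pydantic v1 uses different method names). The JSON payload is invented for illustration.

```python
from pydantic import BaseModel


class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float


# Illustrative payload, e.g. the body of a web API request.
raw = '{"ad_id": 7, "views": 1200, "clicks": 60, "engagement_rate": 0.05}'

# Parse and validate the JSON string in one step.
unit = AdUnit.model_validate_json(raw)

# Serialise back to a plain dict or to a JSON string.
as_dict = unit.model_dump()
as_json = unit.model_dump_json()

# Parsing the serialised form yields an equal model instance.
round_trip = AdUnit.model_validate_json(as_json)
print(as_dict)
```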

Wrangler module

The wrangler sub-package contains three modules:

collector.py: Generates simulated ad performance data, which is useful for testing and development without relying on real-world data.

import numpy as np
from typing import List

# Assumes schemas/__init__.py re-exports AdUnit from ad_unit.py
from ..schemas import AdUnit


def generate_ad_data(num_ads: int = 100) -> List[AdUnit]:
    """Generates random ad performance data."""
    data = [
        AdUnit(
            ad_id=i,
            # Cast NumPy scalars to built-in types before validation
            views=int(np.random.randint(100, 10000)),
            clicks=int(np.random.randint(0, 1000)),
            engagement_rate=float(np.random.uniform(0, 1)),
        )
        for i in range(num_ads)
    ]
    return data
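Because simulated data is mainly useful for tests, it often helps if the same data can be regenerated on demand. The following is a sketch of a seeded variant, not the package's own implementation: generate_seeded_ad_data and its seed parameter are hypothetical additions, and AdUnit is redefined here so the snippet runs on its own. It uses NumPy's default_rng generator.

```python
from typing import List, Optional

import numpy as np
from pydantic import BaseModel


class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float


def generate_seeded_ad_data(num_ads: int = 100, seed: Optional[int] = None) -> List[AdUnit]:
    """Generate random ad data; the same seed gives the same data."""
    rng = np.random.default_rng(seed)
    return [
        AdUnit(
            ad_id=i,
            # Same value ranges as generate_ad_data; upper bounds are exclusive.
            views=int(rng.integers(100, 10000)),
            clicks=int(rng.integers(0, 1000)),
            engagement_rate=float(rng.uniform(0, 1)),
        )
        for i in range(num_ads)
    ]


first = generate_seeded_ad_data(5, seed=0)
second = generate_seeded_ad_data(5, seed=0)
print(first == second)
```

Passing a fixed seed in unit tests makes failures reproducible, while leaving seed as None keeps the data random between runs.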

cleaner.py: Implements functions to clean and preprocess the data, ensuring it is in the right format and quality for analysis.

from typing import List

from ..schemas import AdUnit


def clean_data(data: List[AdUnit]) -> List[AdUnit]:
    """Performs basic data cleaning."""
    # Example: Fill missing values
    # TODO: implement your operation here
    return data


def preprocess_data(data: List[AdUnit]) -> List[AdUnit]:
    """Preprocesses the data for analysis."""
    # Example: adjust 'views' and 'clicks' (placeholder arithmetic, not a
    # true normalization). Note that the AdUnit instances are modified in place.
    for datum in data:
        datum.views -= 10
        # Floor division keeps 'clicks' an int, as the AdUnit schema requires
        datum.clicks //= 100
    return data
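The arithmetic in preprocess_data is only a placeholder. One common way to actually normalise a numeric column is min-max scaling, sketched below on plain lists. min_max_scale is a hypothetical helper, not part of the package; because the scaled values are floats, they would not fit the int fields of AdUnit and would need to be stored separately.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled_views = min_max_scale([100, 550, 1000])
print(scaled_views)  # [0.0, 0.5, 1.0]
```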

analyser.py: Contains a basic machine learning model (linear regression) to derive insights from the ad performance data, demonstrating the application of data science techniques in Python.

# analyser.py
from typing import List, Tuple
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from ..schemas import AdUnit

def analyze_ad_performance(data: List[AdUnit]) -> Tuple[LinearRegression, np.ndarray]:
    """Analyzes ad performance using a linear regression model."""
    X = np.array([[datum.views, datum.clicks] for datum in data])
    y = np.array([datum.engagement_rate for datum in data])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return model, predictions
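analyze_ad_performance returns the predictions but not the held-out targets, so the caller cannot score the model directly. The sketch below shows one way the evaluation step could look, using scikit-learn's mean_squared_error and r2_score. The data is a synthetic, noiseless stand-in invented for this example; it is not output of the package.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the two columns play the role of views and clicks.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
# Noiseless linear target, chosen only so the fit is easy to check.
y = 0.3 * X[:, 0] + 0.5 * X[:, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Compare predictions with the held-out targets.
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MSE: {mse:.6f}, R^2: {r2:.3f}")
```

On data from generate_ad_data the scores should be expected to be poor, since engagement_rate is drawn independently of views and clicks and there is no relationship for the model to learn.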

Introduction: Mastering Python Project Structure

Next Section 002: Python project setup, and installation