[tutorial_ml:01] Project Structure and Functionalities

Section 001

Rafa Felix
4 min read · Dec 8, 2023
Photo by Pixabay from Pexels: https://www.pexels.com/photo/blank-close-up-composition-data-372787/

Project Structure

Our sample project, tutorial_ml, is a Python package with a well-defined structure that promotes readability, maintainability, and ease of use:

tutorial_ml/
├── README.md # Provides a comprehensive introduction and user guide for the package.
├── src/ # The main source directory for the package.
│ ├── tutorial/ # The core package containing all the primary modules and sub-packages.
│ │ ├── schemas/ # Contains data models and schemas, defining the data structure.
│ │ │ ├── ad_unit.py # Defines the AdUnit model used for representing ad data.
│ │ │ └── __init__.py # Signifies that 'schemas' is a Python sub-package.
│ │ ├── wrangler/ # Sub-package with modules for data collection, cleaning, and analysis.
│ │ │ ├── analyser.py # Module for analyzing the ad data (e.g., statistical analysis, ML models).
│ │ │ ├── cleaner.py # Module for cleaning and preprocessing ad data.
│ │ │ ├── collector.py # Module for collecting or simulating ad performance data.
│ │ │ └── __init__.py # Marks 'wrangler' as a Python sub-package.
│ │ └── __init__.py # Marks 'tutorial' as a Python package.
│ └── __init__.py # Signifies that 'src' directory is a Python package.
├── tests/ # Contains unit tests for the package, ensuring code reliability and correctness.
├── docs/ # Documentation for the package, including usage examples, API documentation, etc.
├── pyproject.toml # Modern configuration file for specifying build system and dependencies.
└── setup.cfg # Configuration file for setuptools, used to define package metadata and behavior.

This structure is designed to encapsulate different aspects of the project, keeping the source code, documentation, and configurations distinct and organised.
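As a rough sketch, the layout above can be scaffolded with the standard library. The LAYOUT list is copied from the tree; the scaffold helper and the temporary directory are assumptions made for illustration and are not part of the tutorial_ml repository.

```python
import tempfile
from pathlib import Path
from typing import List

# Relative paths taken from the tree above; entries ending in "/" are directories.
LAYOUT = [
    "README.md",
    "pyproject.toml",
    "setup.cfg",
    "docs/",
    "tests/",
    "src/__init__.py",
    "src/tutorial/__init__.py",
    "src/tutorial/schemas/__init__.py",
    "src/tutorial/schemas/ad_unit.py",
    "src/tutorial/wrangler/__init__.py",
    "src/tutorial/wrangler/analyser.py",
    "src/tutorial/wrangler/cleaner.py",
    "src/tutorial/wrangler/collector.py",
]


def scaffold(root: Path) -> List[Path]:
    """Create empty files and directories for LAYOUT under `root` (hypothetical helper)."""
    created = []
    for entry in LAYOUT:
        path = root / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            # Make sure the parent package directory exists before touching the file.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
        created.append(path)
    return created


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "tutorial_ml"
        for path in scaffold(project):
            print(path.relative_to(project))
```

The files are created empty; the point is only to make the nesting of packages and sub-packages concrete.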

Functionalities

Schemas

Our project uses Pydantic for its data models, illustrated here by the AdUnit model. Pydantic validates data against Python type annotations, so each attribute must match (or be convertible to) its declared type, which improves data integrity and makes errors surface early:

from pydantic import BaseModel


class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float

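This AdUnit class provides a clear, concise, and self-validating representation of an advertisement unit. Each instance of AdUnit is validated against the type annotations when it is created. For instance, if you try to create an AdUnit with a non-numeric string such as "abc" for the ad_id, Pydantic will raise a ValidationError. Note that in Pydantic's default mode, values that convert cleanly (for example the string "123") are coerced to the declared type rather than rejected.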
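A minimal sketch of this behaviour, reusing the AdUnit definition from above. The field values are made up for illustration, and the sketch assumes Pydantic's default (non-strict) settings.

```python
from pydantic import BaseModel, ValidationError


class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float


# Well-typed input is accepted as-is.
ad = AdUnit(ad_id=1, views=2500, clicks=120, engagement_rate=0.048)

# A numeric string is converted to int under the default (non-strict) mode.
coerced = AdUnit(ad_id="7", views=100, clicks=5, engagement_rate=0.05)

# A string that cannot be read as an integer is rejected.
try:
    AdUnit(ad_id="not-a-number", views=100, clicks=5, engagement_rate=0.05)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc)  # the message names the offending field
```

If silent conversion is not what you want, Pydantic also offers stricter validation settings; those are outside the scope of this sketch.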

Moreover, Pydantic’s validation is not limited to simple data types. It also supports more complex validations, like regex patterns, min/max values, and custom validators, making it incredibly versatile for ensuring data quality.
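As a hedged illustration of min/max constraints, the sketch below declares numeric bounds with Pydantic's Field. BoundedAdUnit and is_valid are hypothetical names introduced here, and the chosen bounds are assumptions; neither appears in the tutorial_ml code.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedAdUnit(BaseModel):
    """Hypothetical variant of AdUnit with range constraints (not in the repo)."""

    ad_id: int = Field(ge=0)  # ge: greater than or equal to
    views: int = Field(ge=0)
    clicks: int = Field(ge=0)
    engagement_rate: float = Field(ge=0.0, le=1.0)  # le: less than or equal to


def is_valid(**fields) -> bool:
    """Return True if the keyword arguments pass BoundedAdUnit validation."""
    try:
        BoundedAdUnit(**fields)
        return True
    except ValidationError:
        return False


print(is_valid(ad_id=1, views=10, clicks=2, engagement_rate=0.2))   # within bounds
print(is_valid(ad_id=1, views=10, clicks=2, engagement_rate=1.5))   # rate above 1
print(is_valid(ad_id=1, views=-1, clicks=2, engagement_rate=0.2))   # negative views
```

Declaring the bounds on the model keeps the rule next to the field it protects, so every code path that builds the model gets the same check.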

This model not only provides a clear schema for ad units but also simplifies data handling processes like parsing and serialisation, making it ideal for applications like web APIs or data processing tasks.
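A small sketch of a parse and serialise round trip, assuming the AdUnit model from above and an invented JSON payload. Pydantic ships dedicated helper methods for this whose names differ between major versions, so the sketch deliberately relies only on the model constructor and dict().

```python
import json

from pydantic import BaseModel


class AdUnit(BaseModel):
    ad_id: int
    views: int
    clicks: int
    engagement_rate: float


# Incoming JSON, e.g. the body of a web request (made-up values).
raw = '{"ad_id": 3, "views": 4200, "clicks": 210, "engagement_rate": 0.05}'

# Parsing: decode the JSON, then let the model validate the fields.
ad = AdUnit(**json.loads(raw))

# Serialisation: iterating a model yields (field name, value) pairs,
# so dict() gives a plain dictionary that json.dumps can encode.
payload = dict(ad)
round_tripped = json.dumps(payload)
print(round_tripped)
```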

Wrangler module

The wrangler sub-package within our package contains three modules:

  • collector.py: Generates simulated ad performance data, which is useful for testing and development without relying on real-world data.

import numpy as np
from typing import List
from ..schemas import AdUnit


def generate_ad_data(num_ads: int = 100) -> List[AdUnit]:
    """Generates random ad performance data."""
    data = [
        AdUnit(
            ad_id=i,
            # Cast NumPy scalars to built-in types before validation.
            views=int(np.random.randint(100, 10000)),
            clicks=int(np.random.randint(0, 1000)),
            engagement_rate=float(np.random.uniform(0, 1)),
        )
        for i in range(num_ads)
    ]
    return data

  • cleaner.py: Implements functions to clean and preprocess the data, ensuring it is in the right format and quality for analysis.

from typing import List
from ..schemas import AdUnit


def clean_data(data: List[AdUnit]) -> List[AdUnit]:
    """Performs basic data cleaning."""
    # Example: Fill missing values
    # TODO: implement your operation here
    return data


def preprocess_data(data: List[AdUnit]) -> List[AdUnit]:
    """Preprocesses the data for analysis."""
    # Example: rescale 'views' and 'clicks' in place (placeholder arithmetic,
    # not a statistical normalisation).
    for datum in data:
        datum.views -= 10
        # Floor division keeps 'clicks' an int, as the AdUnit schema declares.
        datum.clicks //= 100
    return data

  • analyser.py: Contains a basic machine learning model (linear regression) to derive insights from the ad performance data, demonstrating the application of data science techniques in Python.

# analyser.py
from typing import List, Tuple
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from ..schemas import AdUnit


def analyze_ad_performance(data: List[AdUnit]) -> Tuple[LinearRegression, np.ndarray]:
    """Analyzes ad performance using a linear regression model."""
    X = np.array([[datum.views, datum.clicks] for datum in data])
    y = np.array([datum.engagement_rate for datum in data])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return model, predictions
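
The preprocessing step in cleaner.py mentions normalising views and clicks. One common way to do that is min-max scaling, sketched below with the standard library only. AdStats and min_max_scale are hypothetical names used so the example runs on its own; this is an illustration of the technique, not the repository's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdStats:
    """Hypothetical stand-in for AdUnit so the sketch runs without Pydantic."""

    ad_id: int
    views: int
    clicks: int


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ads = [AdStats(0, 100, 10), AdStats(1, 550, 40), AdStats(2, 1000, 70)]

scaled_views = min_max_scale([ad.views for ad in ads])
scaled_clicks = min_max_scale([ad.clicks for ad in ads])

print(scaled_views)   # [0.0, 0.5, 1.0]
print(scaled_clicks)  # [0.0, 0.5, 1.0]
```

The scaled values are floats, so they are kept in separate lists here rather than written back into integer fields.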

Introduction: Mastering Python Project Structure

Next Section 002: Python project setup, and installation

Photo by Pavel Danilyuk from Pexels: https://www.pexels.com/photo/a-robot-holding-a-wine-8439094/

Code base

Access the entire code on github.com/rfelixmg/tutorial_ml

Remember to fork and star if you like it!
