[tutorial_ml:01] Project Structure and Functionalities
Section 001
Project Structure
Our sample project, tutorial_ml
, is a Python package with a well-defined structure that promotes readability, maintainability, and ease of use:
tutorial_ml/
├── README.md # Provides a comprehensive introduction and user guide for the package.
├── src/ # The main source directory for the package.
│ ├── tutorial/ # The core package containing all the primary modules and sub-packages.
│ │ ├── schemas/ # Contains data models and schemas, defining the data structure.
│ │ │ ├── ad_unit.py # Defines the AdUnit model used for representing ad data.
│ │ │ └── __init__.py # Signifies that 'schemas' is a Python sub-package.
│ │ ├── wrangler/ # Sub-package with modules for data collection, cleaning, and analysis.
│ │ │ ├── analyser.py # Module for analyzing the ad data (e.g., statistical analysis, ML models).
│ │ │ ├── cleaner.py # Module for cleaning and preprocessing ad data.
│ │ │ ├── collector.py # Module for collecting or simulating ad performance data.
│ │ │ └── __init__.py # Marks 'wrangler' as a Python sub-package.
│ │ └── __init__.py # Marks 'tutorial' as a Python package.
│ └── __init__.py # Signifies that 'src' directory is a Python package.
├── tests/ # Contains unit tests for the package, ensuring code reliability and correctness.
├── docs/ # Documentation for the package, including usage examples, API documentation, etc.
├── pyproject.toml # Modern configuration file for specifying build system and dependencies.
└── setup.cfg # Configuration file for setuptools, used to define package metadata and behavior.
This structure is designed to encapsulate different aspects of the project, keeping the source code, documentation, and configurations distinct and organised.
Functionalities
Schemas
The use of Pydantic in our project is illustrated through the AdUnit
model. Pydantic utilizes Python type annotations for data validation, ensuring that each data attribute adheres to the specified type, enhancing data integrity and error handling:
from pydantic import BaseModel
class AdUnit(BaseModel):
ad_id: int
views: int
clicks: int
engagement_rate: float
This AdUnit
class provides a clear, concise, and self-validating representation of an advertisement unit. Each instance of AdUnit
will be automatically validated against the type annotations. For instance, if you try to create an AdUnit
with a string for the ad_id
, Pydantic will raise an error.
Moreover, Pydantic’s validation is not limited to simple data types. It also supports more complex validations, like regex patterns, min/max values, and custom validators, making it incredibly versatile for ensuring data quality.
This model not only provides a clear schema for ad units but also simplifies data handling processes like parsing and serialisation, making it ideal for applications like web APIs or data processing tasks.
Wrangler module
The wrangler
module within our package encompasses several scripts:
collector.py
: Generates simulated data for ad performance, crucial for testing and development without relying on real-world data
import numpy as np
from typing import List
from ..schemas import AdUnit
def generate_ad_data(num_ads: int = 100) -> List[AdUnit]:
"""Generates random ad performance data."""
data = [
AdUnit(
**{'ad_id': i,
'views': np.random.randint(100, 10000),
'clicks': np.random.randint(0, 1000),
'engagement_rate': np.random.uniform(0, 1)
})
for i in range(num_ads)] return data
cleaner.py
: Implements functions to clean and preprocess the data, ensuring it is in the right format and quality for analysis.
from typing import List, Dict
from ..schemas import AdUnit
def clean_data(data: List[AdUnit]) -> List[AdUnit]:
"""Performs basic data cleaning."""
# Example: Fill missing values
# TODO: implement your operation here
return data
def preprocess_data(data: List[AdUnit]) -> List[AdUnit]:
"""Preprocesses the data for analysis."""
# Example: Normalize the 'views' and 'clicks' for datum in data:
datum.views -= 10
datum.clicks /= 100 return data
analyser.py
: Contains a basic machine learning model (linear regression) to derive insights from the ad performance data, demonstrating the application of data science techniques in Python.
# analyser.py
from typing import List, Tuple
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from ..schemas import AdUnit
def analyze_ad_performance(data: List[AdUnit]) -> Tuple[LinearRegression, np.ndarray]:
"""Analyzes ad performance using a linear regression model."""
X = np.array([[datum.views, datum.clicks] for datum in data])
y = np.array([datum.engagement_rate for datum in data])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
return model, predictions
Introduction: Mastering Python Project Structure
Next Section 002: Python project setup, and installation
Code base
Access the entire code on github.com/rfelixmg/tutorial_ml
Remember to fork and star if you like it!