Why Some ML Teams End Up With a Messy S3 Bucket
May 28, 2026
Your data scientist just pinged you: "Hey, can you set up an S3 bucket for the new fraud detection project?"
Easy, right? You create a bucket, maybe toss in a data/ folder, and call it done. Except... then the questions start rolling in.
Where does raw data go versus cleaned data? What about model artifacts? Training outputs? Feature stores? Inference results? Logs? Notebooks? And wait — the customer churn team wants the same structure but for their project. Oh, and we need this in three environments. Across two regions.
Suddenly your "quick S3 setup" has turned into a week-long architecture exercise. And if you're being honest, you'll probably just copy-paste whatever folder structure you used last time and hope it still makes sense six months from now.
The Folder Structure Nobody Wants to Design
Here's the dirty secret of ML infrastructure: everyone needs roughly the same S3 layout, but everyone builds it from scratch every single time.
A proper ML storage structure needs to handle:
- Raw data landing (immutable, date-partitioned)
- Curated data (cleaned, validated, ready for feature extraction)
- Processed data (train/validation/test splits, feature stores)
- Inference data (batch inputs, predictions, real-time request/response logs)
- Model artifacts (experiments, training outputs, registry with prod/staging/dev)
- Notebooks, code, configs, reports, visualizations...
That's 130+ folders if you do it properly. Most teams get about 10 folders in and then wing it. Six months later, nobody knows where anything lives, and your "data lake" is really a data swamp.
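To make the list above concrete, here is a minimal Python sketch that expands a small layout into S3 key prefixes. The folder names are hypothetical stand-ins for the categories listed, not the actual 130+ folder tree. S3 has no real directories, so a "folder" is just a key prefix, conventionally materialized as a zero-byte object whose key ends in a slash.

```python
# Hypothetical layout: names illustrate the categories above,
# not the real 130+ folder tree.
LAYOUT = {
    "data/raw": ["landing"],
    "data/curated": ["cleaned", "validated"],
    "data/processed": ["train", "validation", "test", "features"],
    "inference": ["batch-inputs", "predictions", "realtime-logs"],
    "models": [
        "experiments",
        "training-outputs",
        "registry/prod",
        "registry/staging",
        "registry/dev",
    ],
}


def folder_keys(layout):
    """Return sorted S3 key prefixes, each ending in '/'."""
    keys = []
    for parent, children in layout.items():
        for child in children:
            keys.append(f"{parent}/{child}/")
    return sorted(keys)


if __name__ == "__main__":
    for key in folder_keys(LAYOUT):
        print(key)
```

Even this toy version yields 15 prefixes from five categories, which gives a feel for how quickly a full lifecycle layout grows.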
What If You Just Declared What You Needed?
Instead of designing folder hierarchies from scratch, imagine defining your entire S3 infrastructure in a config file shorter than a Slack message:
client:
  company_name: Acme Corp
  account_id: "123456789012"
environment:
  env: prod
  region: us-west-1
s3:
  versioning: true
  lifecycle_policy: ml-optimized
tags:
  Project: Customer Churn ML
  Owner: data-science-team
One command later, you've got a production-ready bucket with 130+ organized folders covering the entire ML lifecycle — from raw data ingestion all the way to model registry and inference outputs. No guessing, no copy-pasting from last year's project, no "I'll organize this later."
Clone It for Every New Project
Here's where it gets really useful. Your team just kicked off a fraud detection project alongside the existing customer churn work. Do you redesign the folder structure? Copy folders manually? Write another script?
Nope. You just run a command with these flags:
--action deploy-solution
--solution fraud-detection
Same battle-tested structure, instantly cloned. Works for any ML solution — NLP, computer vision, time series, recommendation engines, whatever. The folder layout is universal because the ML lifecycle is universal: you always ingest, clean, process, train, evaluate, and deploy.
Need it for a different subsidiary? Different region? Different environment? Change one line in the config and deploy again. The naming handles itself:
acme-prod-a001-us-west-1-s3
acme-dev-a001-us-west-2-s3
globex-prod-b002-eu-west-1-s3
No naming collisions, no manual tracking spreadsheets.
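The pattern in those names looks like company, environment, a short code, region, and a fixed "s3" suffix. Here is a rough Python sketch of that convention. The function name and the account_code parameter are assumptions made for illustration, since the post does not say how the "a001" part is derived or how "Acme Corp" becomes "acme". The validation covers only the basic S3 bucket naming rules (3 to 63 characters, lowercase letters, digits, and hyphens, starting and ending with a letter or digit).

```python
import re

# Basic S3 bucket-name shape: 3-63 chars, lowercase letters, digits,
# hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def bucket_name(company, env, account_code, region):
    """Join the parts into a name like 'acme-prod-a001-us-west-1-s3'.

    account_code is a hypothetical parameter standing in for the
    'a001'-style segment; its real origin is not described in the post.
    """
    parts = [company, env, account_code, region, "s3"]
    name = "-".join(part.strip().lower() for part in parts)
    if not _BUCKET_RE.fullmatch(name):
        raise ValueError(f"not a valid S3 bucket name: {name!r}")
    return name


if __name__ == "__main__":
    print(bucket_name("acme", "prod", "a001", "us-west-1"))
    print(bucket_name("globex", "prod", "b002", "eu-west-1"))
```

Because every varying part of the deployment is in the name, changing the region or environment in the config produces a different bucket name automatically.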
Your Storage Costs Are Probably Out of Control
Here's a number that might sting: if you're storing 100 TB of ML data without lifecycle policies, you're paying roughly $27,600 per year in storage alone.
The S3 Provisioner ships with four built-in lifecycle profiles:
- ml-optimized — transitions to cheaper storage after 30/90 days. Saves 63%.
- compliance — 7-year retention with Glacier archiving. Saves 74%.
- development — auto-deletes after 90 days. Saves basically everything.
- none — you manage it yourself (good luck).
That 100 TB drops from $27,600/year to $10,140 with ml-optimized. You don't configure this manually — it's one line in your YAML:
s3:
  lifecycle_policy: ml-optimized
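For a sense of what that one line stands in for, here is a sketch of a hand-written rule in the shape that boto3's put_bucket_lifecycle_configuration accepts. The 30 and 90 day thresholds come from the profile description above. The storage classes (STANDARD_IA, then GLACIER) and the rule ID are assumptions, since the post does not say which tiers the ml-optimized profile actually uses.

```python
def ml_optimized_rule(prefix=""):
    """Build one lifecycle rule with 30-day and 90-day transitions.

    Storage classes are assumed for illustration; the real profile
    may target different tiers.
    """
    return {
        "ID": "ml-optimized-sketch",      # hypothetical rule name
        "Filter": {"Prefix": prefix},     # empty prefix = whole bucket
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
    }


# This dict is what would be passed as LifecycleConfiguration=... to
# an S3 client's put_bucket_lifecycle_configuration call.
lifecycle_configuration = {"Rules": [ml_optimized_rule()]}

if __name__ == "__main__":
    print(lifecycle_configuration)
```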
Plus there's a built-in cost estimator that pulls live pricing from the AWS Pricing API. You know exactly what you'll pay before you deploy.
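The 100 TB figures quoted above can be sanity-checked with a few lines of Python. The flat rate of $0.023 per GB-month and the use of decimal terabytes are assumptions inferred from the $27,600 number, not something the post states. Real S3 pricing is tiered and varies by region, so treat this as back-of-the-envelope arithmetic rather than a substitute for the estimator.

```python
GB_PER_TB = 1_000            # decimal terabytes (assumed)
PRICE_PER_GB_MONTH = 0.023   # assumed flat USD rate, inferred from the post


def annual_cost(terabytes, price=PRICE_PER_GB_MONTH):
    """Yearly storage cost in USD with no lifecycle transitions."""
    return round(terabytes * GB_PER_TB * price * 12, 2)


baseline = annual_cost(100)   # no lifecycle policy
optimized = 10_140            # ml-optimized figure quoted in the post
savings_pct = round(100 * (1 - optimized / baseline))

if __name__ == "__main__":
    print(f"baseline:  ${baseline:,.0f}/year")
    print(f"optimized: ${optimized:,.0f}/year")
    print(f"savings:   {savings_pct}%")
```

Under those assumptions the baseline and the 63% savings figure both line up with the numbers in the post.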
Drift Happens. Catch It Before It Breaks Things.
You know the story. Someone logs into the AWS console at 2 AM to "quickly fix" a bucket policy. A script creates rogue folders. A lifecycle rule gets accidentally deleted.
Because everything deploys as a CloudFormation stack, you get:
- Drift detection: run check-drift and instantly see what's been changed outside of your code
- Change preview: see exactly what will change before you apply updates
- Automatic rollback: if a deployment fails, everything reverts cleanly
- Full audit trail: CloudFormation logs who changed what and when
No more "who touched the bucket?" mysteries.
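How check-drift works internally isn't described here, but the CloudFormation drift API it presumably builds on is public. The sketch below shows that flow in Python: start a detection, poll until it finishes, then list the resources that were modified or deleted. The function name is hypothetical, cfn is expected to be a boto3 CloudFormation client, and pagination of the drift results is left out for brevity.

```python
import time


def detect_drift(cfn, stack_name, poll_seconds=5, max_polls=60):
    """Run CloudFormation drift detection and summarize the result.

    cfn: a boto3 CloudFormation client, e.g. boto3.client("cloudformation").
    Returns (stack_drift_status, [(logical_id, drift_status), ...]).
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)[
        "StackDriftDetectionId"
    ]

    # Detection is asynchronous, so poll until it leaves IN_PROGRESS.
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection timed out for {stack_name}")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(
            status.get("DetectionStatusReason", "drift detection failed")
        )

    # Only the first page of results is read in this sketch.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]

    changed = [
        (d["LogicalResourceId"], d["StackResourceDriftStatus"]) for d in drifts
    ]
    return status["StackDriftStatus"], changed
```

With real credentials, a call might look like detect_drift(boto3.client("cloudformation"), "my-stack"), where "my-stack" is a placeholder stack name. A returned status of DRIFTED plus a non-empty list is the signal that someone changed the stack outside of your code.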
It Fits Into What You Already Have
The S3 Provisioner isn't a standalone island. It integrates with the VPC Provisioner for private S3 access via VPC endpoints — your training data never touches the public internet. And it works alongside the SEC Provisioner for IAM policies that actually follow least-privilege.
It runs as a Docker container, so it slots into any CI/CD pipeline, any environment, any workflow you already have.
Stop Reinventing the Bucket
Every ML team needs the same thing: organized storage, cost controls, and the ability to spin up new projects without a week of infrastructure work.
You can keep designing folder structures from scratch, manually configuring lifecycle policies, and hoping nobody breaks your bucket layout. Or you can define it once in YAML and let automation handle the rest.
If you're tired of messy buckets, check out our products page or find us on the AWS Marketplace.
Questions? Drop us a line at support@axontechlabs.com.