A Shadow Azure Function Wiped Every Gold Table at 2am
Traced a second HRIS outage to an Azure Function running a parallel ETL that truncated all staging and Gold tables when D365 OData returned zero rows, discovered by pulling the function's source code from a SquashFS deployment package on Azure Blob Storage.
ON THIS PAGE
A Shadow Azure Function Wiped Every Gold Table at 2am
Three hours after fixing the HRIS views (see: D365 Silently Dropped 20 OData Columns), every Insert-mode Gold table dropped to zero rows. fct_d365_workers: 0. dim_d365_PositionDetails: 0. dim_d365_EmploymentDetails: 0. Dashboards broke again.
Only three tables survived — the ones configured with upsert write behavior, which does not truncate before loading:
| Table | Rows | Write Mode |
|---|---|---|
dim_d365_jobscompensations | 501 | upsert |
fct_d365_compfixedempls | 2,613 | upsert |
fct_d365_ContractorsV2 | 13 | upsert |
Finding the Culprit
I checked ADF pipeline runs. None of the pipelines I knew about — ExtractDynamics365, TransformDynamics365 — had run recently. But ADF had a trigger called Trigger_Dynamics365 firing every hour. I had assumed it ran the Extract pipeline. It did not.
{
"name": "Trigger_Dynamics365",
"properties": {
"pipelines": [{
"pipelineReference": {
"referenceName": "TriggerAzureFunctionELT",
"type": "PipelineReference"
}
}],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1
}
}
}
}
It ran TriggerAzureFunctionELT — a pipeline with a single WebActivity that POSTed to an Azure Function:
https://<function-app>.azurewebsites.net/api/Orchestrate
A completely separate ETL system, running hourly, against the same D365 endpoints and the same SQL tables.
Pulling the Source Code
The function app was Python 3.12 on a Linux consumption plan. Kudu VFS returned 404 (limited on Linux consumption). I found the deployment package on the function’s storage account:
az storage blob list --account-name <storage-account> \
# REVIEW: redacted — confirm
--container-name scm-releases
Name Modified Size
scm-latest-<function-app>.zip 2025-12-22T09:13:16+00:00 51851264
Downloaded it. file reported SquashFS, not ZIP:
/tmp/eltd365.zip: Squashfs filesystem, little endian, version 4.0,
zlib compressed, 51850148 bytes
Extracted with unsquashfs:
unsquashfs -d /tmp/eltd365-code /tmp/eltd365.zip
What the Function Did
The Orchestrate endpoint spawned a background thread and returned HTTP 202 immediately:
def background_task():
elt_d365.main() # Extract
transform_d365.main() # Transform
transform_d365_pdd.main() # PositionDefaultDimensions
Three steps, sequentially, fire-and-forget.
Step 1: Extract (elt_d365.py) — fetched each D365 OData entity and loaded it into a staging table. The critical flaw was in load_to_sql:
def load_to_sql(df, table_name, conn):
conn.execute(text(f"TRUNCATE TABLE [{SQL_SCHEMA}].[{target_table}]"))
df.to_sql(name=target_table, con=conn, schema=SQL_SCHEMA,
if_exists='append', index=False)
Truncate first, load second. If D365 returned an empty DataFrame, the staging table was wiped and nothing was loaded. The function logged a warning — “No data returned” — and moved to the next table. No rollback.
Step 2: Transform (transform_d365.py) — read each staging table and loaded it into the corresponding Gold table. Same pattern:
truncate_sql = text(
f"IF OBJECT_ID('{target_table}', 'U') IS NOT NULL "
f"TRUNCATE TABLE {target_table}"
)
with engine.begin() as conn:
conn.execute(truncate_sql)
df.to_sql(name=target_table.split('.')[-1], con=engine,
schema=target_schema, if_exists="append", index=False)
Read from staging (now empty), truncate Gold, load zero rows from the empty DataFrame.
Step 3: PositionDefaultDimensions (transform_d365_pdd.py) — a hardcoded truncate-and-reload for a single table. Same vulnerability.
Why D365 Returned Zero Rows
The function authenticated to D365 using OAuth client credentials:
url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token"
payload = {
'grant_type': 'client_credentials',
'client_id': CLIENT_ID,
'client_secret': CLIENT_SECRET,
'resource': RESOURCE
}
The exact cause of the empty responses is unclear. Possibilities include an expired client secret, D365 rate limiting (HTTP 429) after the ADF pipelines had already queried the same endpoints, or a transient D365 service issue. The function had no retry logic on the OData calls — a single failed request would produce an empty DataFrame, and the truncate-then-load pattern would wipe the table.
When I ran ExtractDynamics365 through ADF shortly after, 14 of 40 entities initially failed with HTTP 429. ADF’s built-in retry (3 attempts, 120-second intervals) recovered all of them. The Azure Function had no such retry.
The Shadow ETL Problem
Two independent systems were running the same Extract and Transform workloads against the same D365 OData endpoints, writing to the same staging and Gold tables, on the same hourly schedule:
- ADF pipelines —
ExtractDynamics365andTransformDynamics365, with retry logic, structured logging, error handling, and status tracking inconfig_adf_pipeline_settings - Azure Function —
<function-app>, with no retries, truncate-before-validate, background threading, and separate logging toazure_func_run_log
They competed for D365 rate limits and raced on the same tables. The ADF pipelines were the designed system. The Azure Function was likely a prototype that was promoted to production and never decommissioned — the deployment package dated from December 2025, five months before this incident.
Immediate Response
Stopped the trigger to prevent the function from firing again:
az datafactory trigger stop \
--resource-group <resource-group> \
# REVIEW: redacted — confirm
--factory-name <adf-factory> \
# REVIEW: redacted — confirm
--name Trigger_Dynamics365
Re-ran the full E2E through ADF:
# Extract
az datafactory pipeline create-run --name ExtractDynamics365 \
--parameters '{"Environment":"DEV","SrcSystem":"D365",
"RunBacklog":0,"Pipelineid":-999}'
# Transform (after Extract succeeded)
az datafactory pipeline create-run --name TransformDynamics365 \
--parameters '{"Environment":"DEV","SrcSystem":"D365",
"RunBacklog":0,"PipelineId":-999}'
# hrisHistory (after Transform succeeded)
az datafactory pipeline create-run --name hrisHistory
All three succeeded. fct_d365_workers back to 252 rows. Dashboards restored.
What This Taught Me
Check what a trigger actually runs. I assumed Trigger_Dynamics365 ran ExtractDynamics365. It ran a WebActivity that called an Azure Function. The name was misleading, and I did not verify until after the second outage.
Truncate-before-validate is a data loss pattern. Any ETL that truncates the destination before confirming the source returned data will eventually wipe production. The fix is simple: check row count before truncating, or use a swap-table pattern.
Shadow ETL systems accumulate. A prototype function that “works fine” in dev gets a trigger in prod and runs alongside the real pipelines for months. Nobody notices because both write the same data. The moment one of them breaks, it overwrites the other’s work.
SquashFS, not ZIP. Azure Functions on Linux consumption plans package deployments as SquashFS images, not ZIP files. The blob name says .zip but file reveals the truth. Use unsquashfs to extract.
Discussion