Blog Field Notes A Shadow Azure Function Wiped Every Gold Table at 2am
Incident #azure-functions#azure-data-factory#dynamics-365#data-pipeline#incident-response#shadow-it

A Shadow Azure Function Wiped Every Gold Table at 2am

Traced a second HRIS outage to an Azure Function running a parallel ETL that truncated all staging and Gold tables when D365 OData returned zero rows, discovered by pulling the function's source code from a SquashFS deployment package on Azure Blob Storage.

· Gideon Warui
ON THIS PAGE

A Shadow Azure Function Wiped Every Gold Table at 2am

Three hours after fixing the HRIS views (see: D365 Silently Dropped 20 OData Columns), every Insert-mode Gold table dropped to zero rows. fct_d365_workers: 0. dim_d365_PositionDetails: 0. dim_d365_EmploymentDetails: 0. Dashboards broke again.

Only three tables survived — the ones configured with upsert write behavior, which does not truncate before loading:

TableRowsWrite Mode
dim_d365_jobscompensations501upsert
fct_d365_compfixedempls2,613upsert
fct_d365_ContractorsV213upsert

Finding the Culprit

I checked ADF pipeline runs. None of the pipelines I knew about — ExtractDynamics365, TransformDynamics365 — had run recently. But ADF had a trigger called Trigger_Dynamics365 firing every hour. I had assumed it ran the Extract pipeline. It did not.

{
  "name": "Trigger_Dynamics365",
  "properties": {
    "pipelines": [{
      "pipelineReference": {
        "referenceName": "TriggerAzureFunctionELT",
        "type": "PipelineReference"
      }
    }],
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1
      }
    }
  }
}

It ran TriggerAzureFunctionELT — a pipeline with a single WebActivity that POSTed to an Azure Function:

https://<function-app>.azurewebsites.net/api/Orchestrate

A completely separate ETL system, running hourly, against the same D365 endpoints and the same SQL tables.


Pulling the Source Code

The function app was Python 3.12 on a Linux consumption plan. Kudu VFS returned 404 (limited on Linux consumption). I found the deployment package on the function’s storage account:

az storage blob list --account-name <storage-account> \
# REVIEW: redacted — confirm
  --container-name scm-releases
Name                             Modified                   Size
scm-latest-<function-app>.zip   2025-12-22T09:13:16+00:00  51851264

Downloaded it. file reported SquashFS, not ZIP:

/tmp/eltd365.zip: Squashfs filesystem, little endian, version 4.0,
zlib compressed, 51850148 bytes

Extracted with unsquashfs:

unsquashfs -d /tmp/eltd365-code /tmp/eltd365.zip

What the Function Did

The Orchestrate endpoint spawned a background thread and returned HTTP 202 immediately:

def background_task():
    elt_d365.main()            # Extract
    transform_d365.main()      # Transform
    transform_d365_pdd.main()  # PositionDefaultDimensions

Three steps, sequentially, fire-and-forget.

Step 1: Extract (elt_d365.py) — fetched each D365 OData entity and loaded it into a staging table. The critical flaw was in load_to_sql:

def load_to_sql(df, table_name, conn):
    conn.execute(text(f"TRUNCATE TABLE [{SQL_SCHEMA}].[{target_table}]"))
    df.to_sql(name=target_table, con=conn, schema=SQL_SCHEMA,
              if_exists='append', index=False)

Truncate first, load second. If D365 returned an empty DataFrame, the staging table was wiped and nothing was loaded. The function logged a warning — “No data returned” — and moved to the next table. No rollback.

Step 2: Transform (transform_d365.py) — read each staging table and loaded it into the corresponding Gold table. Same pattern:

truncate_sql = text(
    f"IF OBJECT_ID('{target_table}', 'U') IS NOT NULL "
    f"TRUNCATE TABLE {target_table}"
)
with engine.begin() as conn:
    conn.execute(truncate_sql)

df.to_sql(name=target_table.split('.')[-1], con=engine,
          schema=target_schema, if_exists="append", index=False)

Read from staging (now empty), truncate Gold, load zero rows from the empty DataFrame.

Step 3: PositionDefaultDimensions (transform_d365_pdd.py) — a hardcoded truncate-and-reload for a single table. Same vulnerability.


Why D365 Returned Zero Rows

The function authenticated to D365 using OAuth client credentials:

url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token"
payload = {
    'grant_type': 'client_credentials',
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'resource': RESOURCE
}

The exact cause of the empty responses is unclear. Possibilities include an expired client secret, D365 rate limiting (HTTP 429) after the ADF pipelines had already queried the same endpoints, or a transient D365 service issue. The function had no retry logic on the OData calls — a single failed request would produce an empty DataFrame, and the truncate-then-load pattern would wipe the table.

When I ran ExtractDynamics365 through ADF shortly after, 14 of 40 entities initially failed with HTTP 429. ADF’s built-in retry (3 attempts, 120-second intervals) recovered all of them. The Azure Function had no such retry.


The Shadow ETL Problem

Two independent systems were running the same Extract and Transform workloads against the same D365 OData endpoints, writing to the same staging and Gold tables, on the same hourly schedule:

  1. ADF pipelinesExtractDynamics365 and TransformDynamics365, with retry logic, structured logging, error handling, and status tracking in config_adf_pipeline_settings
  2. Azure Function<function-app>, with no retries, truncate-before-validate, background threading, and separate logging to azure_func_run_log

They competed for D365 rate limits and raced on the same tables. The ADF pipelines were the designed system. The Azure Function was likely a prototype that was promoted to production and never decommissioned — the deployment package dated from December 2025, five months before this incident.


Immediate Response

Stopped the trigger to prevent the function from firing again:

az datafactory trigger stop \
  --resource-group <resource-group> \
# REVIEW: redacted — confirm
  --factory-name <adf-factory> \
# REVIEW: redacted — confirm
  --name Trigger_Dynamics365

Re-ran the full E2E through ADF:

# Extract
az datafactory pipeline create-run --name ExtractDynamics365 \
  --parameters '{"Environment":"DEV","SrcSystem":"D365",
                 "RunBacklog":0,"Pipelineid":-999}'

# Transform (after Extract succeeded)
az datafactory pipeline create-run --name TransformDynamics365 \
  --parameters '{"Environment":"DEV","SrcSystem":"D365",
                 "RunBacklog":0,"PipelineId":-999}'

# hrisHistory (after Transform succeeded)
az datafactory pipeline create-run --name hrisHistory

All three succeeded. fct_d365_workers back to 252 rows. Dashboards restored.


What This Taught Me

Check what a trigger actually runs. I assumed Trigger_Dynamics365 ran ExtractDynamics365. It ran a WebActivity that called an Azure Function. The name was misleading, and I did not verify until after the second outage.

Truncate-before-validate is a data loss pattern. Any ETL that truncates the destination before confirming the source returned data will eventually wipe production. The fix is simple: check row count before truncating, or use a swap-table pattern.

Shadow ETL systems accumulate. A prototype function that “works fine” in dev gets a trigger in prod and runs alongside the real pipelines for months. Nobody notices because both write the same data. The moment one of them breaks, it overwrites the other’s work.

SquashFS, not ZIP. Azure Functions on Linux consumption plans package deployments as SquashFS images, not ZIP files. The blob name says .zip but file reveals the truth. Use unsquashfs to extract.

#azure-functions#azure-data-factory#dynamics-365#data-pipeline#incident-response#shadow-it