How I Migrated 3.2 Million SharePoint Files to Azure Blob: Four Bugs That Almost Stopped It
Built and operated a distributed Python migration service on AKS that moved 3.26 million files and ~15TB from SharePoint Online to Azure Blob Storage across three drives, diagnosing and resolving four production bugs along the way.
The brief was written on February 17, 2026. Core infrastructure running by tomorrow. Source: SharePoint. Target: Azure Blob Storage. File count: approximately two million. Time available: about fourteen hours.
The driver was storage cost. A financial institution had accumulated years of archived claims documents — PDFs, mostly — in SharePoint Online. SharePoint is priced at enterprise rates per seat and per gigabyte. Azure Blob cold tier is not. With roughly 15TB sitting in SharePoint and a target of zero, the cost case was clear. Moving that much data reliably, safely, and without disrupting anything still accessing the documents was not.
This post covers what I built — a Python service running on Kubernetes, backed by PostgreSQL and Redis — the architecture decisions that held up under pressure, and the four bugs that didn’t show up until the system was running at scale.
What Needed to Happen
Moving files out of SharePoint is not a simple copy operation. Microsoft Graph API exposes content through a paginated, cursor-driven enumeration model — there is no flat directory listing. Each file requires a download URL resolved at request time. Uploads to Azure Blob Storage need to be chunked correctly for large-file reliability. And before anything is deleted from SharePoint, you need to be certain the blob landing was successful.
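To make the enumeration model concrete, here is a minimal sketch of a page walker. It assumes the Graph delta-style response shape, where each page carries a `value` array, intermediate pages carry `@odata.nextLink`, and the final page carries `@odata.deltaLink`. The `walk_delta` name and the injected `fetch_page` callable are illustrative stand-ins, not the service's actual code.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Dict[str, object]


def walk_delta(
    fetch_page: Callable[[str], Page], start_url: str
) -> Iterator[Tuple[List[dict], Optional[str]]]:
    """Yield (files_on_page, delta_link_or_None) for every page of results.

    fetch_page is a hypothetical callable that performs the authenticated
    GET and returns the decoded JSON body for the given URL.
    """
    url: Optional[str] = start_url
    while url:
        page = fetch_page(url)
        # Items with a "file" facet are files; folders carry "folder" instead.
        files = [item for item in page.get("value", []) if "file" in item]
        # Only the last page carries a delta link, the cursor for the next run.
        yield files, page.get("@odata.deltaLink")
        # Intermediate pages point at the next page; the last page does not.
        url = page.get("@odata.nextLink")
```

A page that contains only folders yields an empty list. In a deeply nested library that is normal and does not mean enumeration is finished; only the absence of a next link does.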
Beyond the mechanics, the system had to be resumable. A pod restart, a throttle burst, a transient network failure — none of these could result in a file being processed twice or silently dropped. The audit trail had to be immutable. And the delete operation had to be reversible if something went wrong: SharePoint’s soft-delete gives a 93-day recovery window, but only if the delete itself is deliberate and gated.
The design constraints were: enumerate, transfer, verify, delete — in that order, with no shortcuts.
Day 1 Architecture
I kept the initial design deliberately minimal. One worker type discovers files and queues work. Another pulls from the queue and moves the data. A state machine in PostgreSQL tracks every file from first discovery through final deletion. A message queue decouples the two workers so they can scale independently. The whole thing runs on Kubernetes — AKS specifically — with workers scaling automatically based on queue depth.
The most consequential structural decision on day one was building a single deployable unit whose role is selected by configuration at runtime. Adding a new pipeline stage — the delete phase, a new source library, a future variant of the migration — means adding a new role to the dispatch logic, not a new codebase, not a new build pipeline, not a new secrets configuration. One container image. Many deployment shapes.
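A minimal sketch of what role selection by configuration can look like, assuming the role arrives in an environment variable. The `WORKER_ROLE` variable, the role names, and the handler functions are all hypothetical; the post does not show the real dispatch code.

```python
import os
from typing import Callable, Dict, Mapping


def run_enumerator() -> str:
    return "enumerator started"


def run_transfer() -> str:
    return "transfer worker started"


def run_delete() -> str:
    return "delete worker started"


# Adding a pipeline stage means adding one entry here, not a new image.
ROLES: Dict[str, Callable[[], str]] = {
    "enumerator": run_enumerator,
    "transfer": run_transfer,
    "delete": run_delete,
}


def dispatch(env: Mapping[str, str] = os.environ) -> str:
    """Pick the entry point for this pod from configuration."""
    role = env.get("WORKER_ROLE", "")
    handler = ROLES.get(role)
    if handler is None:
        # Fail fast so a misconfigured deployment crashes visibly.
        raise SystemExit(f"unknown WORKER_ROLE {role!r}; expected one of {sorted(ROLES)}")
    return handler()
```

Each Kubernetes deployment then differs only in the value it sets for the role variable.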
The delete pipeline was added a few days later, extending the data flow:
Discover → Queue → Transfer to Blob
↓
Claim completed files → Queue → Verify blob → Soft-delete from SharePoint
Workers autoscaled based on queue depth. During peak transfer, the system ran up to ten parallel workers against the first library. Delete workers ran at their ceiling for several days clearing the backlog that had accumulated while a bug was in flight (more on that below).
The State Machine
Every file has a lifecycle. It is discovered, queued, picked up, transferred, and eventually deleted from the source. Any of those steps can fail transiently or permanently. A pod can die mid-transfer. A network call can time out after the blob write succeeded but before the database acknowledged it.
The state machine handles all of this by making each transition idempotent. Retries are safe. Uploads are safe to repeat. The database record for a file reflects what has actually happened, not what a worker last intended. A recovery process runs on a schedule and resets any file that has been in an intermediate state too long — these are almost always pods that died before completing their acknowledgement.
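The two ideas in that paragraph, idempotent transitions and scheduled recovery of stuck files, can be sketched in memory as follows. The state names, the `FileRecord` shape, and the reset targets are assumptions made for illustration; the real state machine lives in PostgreSQL and its schema is not shown in this post.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

# Hypothetical lifecycle: which state may follow which.
ALLOWED: Dict[str, Set[str]] = {
    "discovered": {"queued"},
    "queued": {"transferring"},
    "transferring": {"transferred"},
    "transferred": {"deleting"},
    "deleting": {"deleted"},
}
# Intermediate states and where a stuck file is sent back to.
RESET_TO: Dict[str, str] = {"transferring": "queued", "deleting": "transferred"}


@dataclass
class FileRecord:
    file_id: str
    state: str
    updated_at: float  # seconds since epoch


def advance(rec: FileRecord, new_state: str, now: float) -> bool:
    """Apply a transition; repeating a completed transition is a no-op."""
    if rec.state == new_state:
        return True  # retry of work that already landed
    if new_state not in ALLOWED.get(rec.state, set()):
        return False  # refuse to skip steps
    rec.state, rec.updated_at = new_state, now
    return True


def recover_stale(records: Iterable[FileRecord], now: float, max_age_s: float) -> List[str]:
    """Reset files that sat in an intermediate state for too long."""
    reset = []
    for rec in records:
        target = RESET_TO.get(rec.state)
        if target is not None and now - rec.updated_at > max_age_s:
            rec.state, rec.updated_at = target, now
            reset.append(rec.file_id)
    return reset
```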
The atomic work-claiming pattern, where multiple workers compete for the same pool of pending files without double-processing, was designed to hold with 20 to 50 concurrent workers. It held.
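A runnable sketch of the claiming idea, using SQLite from the standard library as a stand-in so it can be tried without a database server. The table layout, column names, and `claim_batch` function are hypothetical. On PostgreSQL the usual way to get the same effect under real concurrency is a subquery with `FOR UPDATE SKIP LOCKED`; the post does not show which query the service uses.

```python
import sqlite3
import uuid
from typing import List

# Hypothetical schema, reduced to what the claim needs.
SCHEMA = """
CREATE TABLE files (
    id TEXT PRIMARY KEY,
    state TEXT NOT NULL,
    claimed_by TEXT,
    claim_token TEXT
)
"""


def claim_batch(conn: sqlite3.Connection, worker_id: str, limit: int) -> List[str]:
    """Move up to `limit` pending files to 'claimed' and return their ids."""
    token = uuid.uuid4().hex  # unique per call, identifies the rows we won
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            """
            UPDATE files
               SET state = 'claimed', claimed_by = ?, claim_token = ?
             WHERE state = 'pending'
               AND id IN (SELECT id FROM files
                           WHERE state = 'pending'
                           ORDER BY id
                           LIMIT ?)
            """,
            (worker_id, token, limit),
        )
        rows = conn.execute(
            "SELECT id FROM files WHERE claim_token = ? ORDER BY id", (token,)
        ).fetchall()
    return [row[0] for row in rows]
```

The outer `state = 'pending'` condition is the part that matters: a row that another worker claimed first no longer matches, so it cannot be claimed twice.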
The Throttle Wall
About a week into the migration, transfer throughput dropped. Microsoft Graph API was returning rate-limit responses. The naive first response — reduce concurrency — worked, but left throughput lower than it needed to be. Dropping from 20 parallel requests to 8 meant the migration pace would miss the target.
The problem with a fixed concurrency limit is that it doesn’t adapt. If the API is comfortable with 15 concurrent requests today and 12 tomorrow, a hard cap of 8 is perpetually conservative.
I implemented an adaptive rate limiter borrowed conceptually from TCP congestion control — AIMD: additive increase, multiplicative decrease. The limiter starts conservative, increases allowed throughput gradually on success, and cuts sharply on a rate-limit signal. The decrease is aggressive (halving) because the cost of another 429 response is higher than the cost of a brief underutilization.
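The AIMD rule itself fits in a few lines. This is a minimal single-process sketch; the class name and the default numbers (start at 4, add 1 per success, halve on a throttle, clamp between 1 and 20) are illustrative assumptions, not the values used in production.

```python
class AIMDLimiter:
    """Additive increase, multiplicative decrease on an allowed-rate value."""

    def __init__(
        self,
        initial: float = 4.0,
        floor: float = 1.0,
        ceiling: float = 20.0,
        step: float = 1.0,
        backoff: float = 0.5,
    ) -> None:
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling
        self.step = step
        self.backoff = backoff

    def on_success(self) -> None:
        # Additive increase: probe upward slowly while the API is happy.
        self.limit = min(self.ceiling, self.limit + self.step)

    def on_throttle(self) -> None:
        # Multiplicative decrease: back off hard on a rate-limit response.
        self.limit = max(self.floor, self.limit * self.backoff)
```

With these defaults, three successes followed by one 429 take the limit from 4 to 7 and then down to 3.5.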
The first version ran per-pod. That broke immediately at scale: when ten transfer workers are each running their own independent limiter, they collectively have no idea what the system-level rate is. One pod’s successful increase was another pod’s throttle trigger.
The second version shared state across all pods through Redis. All workers read and write to the same adaptive limit. A single 429 anywhere cuts the shared ceiling for everyone. A recovery period applies globally. This gave the migration a stable equilibrium — throughput oscillated narrowly around what the API would actually sustain rather than swinging between overload and excessive caution.
Multi-Source Expansion and the Credential Problem
<library-1> was mid-flight when scope expanded. A second library — Claims, with approximately 940,000 files — needed to be added without stopping the first migration. The configuration model was extended to support multiple source libraries with distinct blob prefixes, and separate enumerator deployments were spun up per source, each scoped to its own library.
Around the same time, I discovered a quieter problem. The initial implementation used a static authentication token generated manually and loaded into a cluster secret. Tokens expire. When this one did, API requests started returning 401 — but silently, because there was no distinct handling for an authentication failure. It was counted as a generic error, and the error rate panel in Grafana showed elevated counts without indicating why.
The fix had two parts. First, switch to proper application-identity authentication — the service acquires and refreshes its own tokens automatically, with no manual rotation required. Second, make 401 responses visible as their own signal in the metrics dashboard so a credential failure is immediately distinguishable from a transient service error.
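The second part of that fix comes down to not folding every failure into one bucket. A minimal sketch, with hypothetical label names and a plain `Counter` standing in for whatever metrics client feeds the dashboard:

```python
from collections import Counter

# Stand-in for a real metrics client; labels are illustrative.
response_counts: Counter = Counter()


def classify(status: int) -> str:
    """Map an HTTP status code to a metric label."""
    if 200 <= status < 300:
        return "ok"
    if status == 401:
        return "auth_failure"  # expired or invalid credential
    if status == 404:
        return "not_found"
    if status == 429:
        return "throttled"
    if 500 <= status < 600:
        return "server_error"
    return "other_error"


def record(status: int) -> str:
    """Count the response under its own label and return the label."""
    label = classify(status)
    response_counts[label] += 1
    return label
```

With the 401 count on its own series, an expired credential shows up as a step change in one line instead of a vague rise in total errors.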
I also split the authentication identity between transfer and delete workers. Each phase of the migration runs as a separate application identity with its own API quota. This was a security decision as much as a capacity one: the principle of least privilege applies here — the worker that reads files doesn’t need the permissions held by the worker that deletes them.
Rate Limiter Interference
With two separate API identities in place, a subtle problem remained. The shared adaptive rate limiter treated all workers as equivalent. When delete workers hit rate limits against their identity, the shared ceiling dropped — and that ceiling was also respected by transfer workers running against a completely different identity with its own quota.
The fix was to give each pipeline phase its own independent rate limiter namespace in Redis. Transfer workers adapt to their identity’s quota. Delete workers adapt to theirs. A spike in delete-phase rate limiting no longer affects transfer throughput, and vice versa.
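A sketch of shared limiter state with one namespace per phase. A plain dictionary stands in for Redis so the example runs anywhere; the key format, class name, and default numbers are assumptions. Against a real Redis, the read-modify-write in each method would need to be atomic, for example through a server-side script or a transaction.

```python
from typing import MutableMapping


def limit_key(phase: str) -> str:
    """Hypothetical key layout: one limit per pipeline phase."""
    return f"ratelimit:{phase}:limit"


class SharedLimiter:
    """AIMD limit stored outside the process so every pod sees the same value."""

    def __init__(
        self,
        store: MutableMapping[str, float],
        phase: str,
        initial: float = 8.0,
        floor: float = 1.0,
        ceiling: float = 20.0,
    ) -> None:
        self.store = store
        self.key = limit_key(phase)
        self.floor = floor
        self.ceiling = ceiling
        # First pod in a phase seeds the value; later pods reuse it.
        self.store.setdefault(self.key, initial)

    @property
    def limit(self) -> float:
        return self.store[self.key]

    def on_success(self) -> None:
        self.store[self.key] = min(self.ceiling, self.limit + 1.0)

    def on_throttle(self) -> None:
        self.store[self.key] = max(self.floor, self.limit * 0.5)
```

Two transfer pods built on the same store share one ceiling, while a throttle recorded by a delete pod only touches the delete key.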
The Claims Bug
Claims was the hardest problem. After the multi-source rollout, the Claims enumerator pod ran, reported completion, and exited cleanly. Files queued: zero. Every restart produced the same result — a short run, a clean exit, nothing discovered.
The root cause was two independent bugs that compounded into one symptom.
The first was a data problem. During an early test run, the Claims enumerator had traversed the drive and saved a resumption cursor to the database. That run went through folder pages only — the library’s structure is deep with nested folders, and early pages contained no files. The cursor was saved as current.
Every subsequent run used this cursor as an incremental delta. The API correctly returned: no changes since that point. Zero new items. The 940,000 files weren’t new — they predated the cursor. The system had no record of having missed them.
The second was a logic problem. Each enumerator deployment is scoped to a single source library via a configuration filter. The function that decided whether enumeration was complete checked all configured sources in the database — not just the one this pod was responsible for. <library-1> had already completed and had a valid cursor. So the moment Claims got any cursor — even the stale zero-items one from the test run — the function saw both sources as complete and exited.
The fix required clearing the stale cursor in the database to force a full re-scan, and patching the completion check to respect the per-pod source filter.
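The patched completion check can be sketched as a pure function. The function name, the cursor map, and the source names are hypothetical; the point is only that the check looks at the sources this pod is scoped to and nothing else.

```python
from typing import Dict, List, Optional


def enumeration_complete(
    cursors: Dict[str, Optional[str]],
    configured_sources: List[str],
    source_filter: Optional[str] = None,
) -> bool:
    """Return True when every in-scope source has a saved cursor.

    cursors maps source name to its saved cursor, or None if it has none.
    source_filter is the single source this pod is responsible for;
    None means the pod is responsible for all configured sources.
    """
    in_scope = [
        s for s in configured_sources if source_filter is None or s == source_filter
    ]
    # An empty scope is treated as "not complete" so a bad filter is noticed.
    return bool(in_scope) and all(cursors.get(s) is not None for s in in_scope)
```

After the stale cursor is cleared, a pod scoped to Claims reports incomplete regardless of what any other library's cursor says, and a pod scoped to a finished library is not held back by an unfinished one.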
After the restart, the Claims enumerator began traversing from scratch. The library had well over 100,000 folder-only pages before reaching files — the logs showed nothing but empty-page signals for several minutes. Then the file count started climbing. Within 24 hours, 940,028 files were in the state machine and moving through the pipeline.
<library-3>: A Different Drive, a Different Edge Case
A third library — <library-3>, roughly 61,000 files and 400GB — was added as a separate migration into a dedicated blob container. It completed in under three hours, which validated the throughput design at a smaller scale.
The edge case here was naming. The <library-3> library had spaces in folder and file names. The other two libraries did not. The delete verification step reconstructed the blob path from a URL returned by the storage API. That URL had spaces encoded as %20. The path stored in the database was the raw string. The comparison failed for every file with a space in its name, blocking the entire delete phase for that library.
One-line fix, once found. Two hours to find it.
The lesson: test with names that contain spaces, hyphens, parentheses, and non-ASCII characters before go-live. Character encoding assumptions are invisible until the data violates them.
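A minimal sketch of the kind of comparison that avoids the encoding mismatch, together with a few adversarial names worth keeping in a test suite. The function names, the container name, the sample names, and the example host are all made up for illustration; the only real API used is `urllib.parse` from the standard library.

```python
from urllib.parse import quote, unquote, urlsplit

# Illustrative names covering spaces, parentheses, a literal percent sign,
# hyphens, and non-ASCII characters.
ADVERSARIAL_NAMES = [
    "Claims 2019/report (final).pdf",
    "audit-log/100% complete.pdf",
    "archiv/Schadensmeldung Müller.pdf",
]


def blob_path_from_url(url: str, container: str) -> str:
    """Recover the raw blob path from a percent-encoded blob URL."""
    path = unquote(urlsplit(url).path)  # "%20" becomes " " and so on
    prefix = f"/{container}/"
    if not path.startswith(prefix):
        raise ValueError(f"URL is not inside container {container!r}: {url}")
    return path[len(prefix):]


def matches_stored_path(url: str, container: str, stored_path: str) -> bool:
    """Compare like with like: decoded URL path against the raw stored path."""
    return blob_path_from_url(url, container) == stored_path
```

A quick check builds an encoded URL for each name with `quote` and asserts that `matches_stored_path` returns `True`:

```python
for name in ADVERSARIAL_NAMES:
    url = "https://example.blob.core.windows.net/archive/" + quote(name)
    assert matches_stored_path(url, "archive", name)
```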
What the System Handled in Production
Beyond the core bugs, a few operational issues during the run:
Grafana volume deadlock. The monitoring stack uses a Kubernetes volume that only supports a single mount at a time. Rolling updates — where the new pod schedules before the old one releases the volume — cause the new pod to hang in a pending state indefinitely. Fix: manually delete the old pod after the new one is scheduled. Forced release, clean mount. Known constraint of that volume type in Kubernetes; requires a manual step in the upgrade procedure.
Dead-letter queue messages with stale routing. When the <library-3> source was added with a dedicated container, some older messages already sitting in the dead-letter queue from earlier failures had been published against the previous global default. Reprocessing those messages sent them to the wrong container, where blob verification failed. Fix: reset the claim state for the affected files in the database, purge the queue, and let the system re-publish with the correct routing. The procedure took about ten minutes.
The Final Numbers
| Drive | Files | Data | Status |
|---|---|---|---|
| <library-1> | 2,258,670 | ~12TB | Complete (transferred + soft-deleted) |
| Claims | 940,028 | ~2.5TB | Complete (transferred + soft-deleted) |
| <library-3> | ~61,000 | ~400GB | Complete (under 3 hours) |
| Total | ~3,260,000 | ~15TB | |
724 files failed permanently across all three drives. Every one of those was a not-found response on the transfer attempt — the file had been deleted from SharePoint by a user before the migration reached it. They are recorded in the audit trail and were not retried.
Transfer workers peaked at ten replicas during the Claims drive. Delete workers ran at their ceiling for several days clearing the backlog that accumulated while the Claims bug was in flight.
What I’d Do Differently
Wire proper application-identity authentication from day one. The static token was a shortcut that cost more time to clean up than it would have taken to do correctly upfront. Authentication failures are never obvious from generic error counts alone.
Apply source scoping to every function that queries across sources. The Claims bug traced back to one function that checked all sources instead of the filtered subset. Any query that crosses source boundaries in a multi-tenant or multi-source system is a potential latent bug for the next expansion.
Test with adversarial filenames before go-live. Spaces in filenames are not exotic. They showed up in the third library. They should have been in the test suite for the first one.
What’s Next
The pattern holds. The next phase is a larger migration — different data, different requirements, a different interaction model between what lands in blob storage and what stays visible in SharePoint. The infrastructure carries forward. The lessons carry forward. The scope does not.
15TB was the proof of concept. What comes next is considerably larger.