Removing Models from Google App Engine at minimum cost

There come times when the data schema changes and you have to remove models by the bunch. AFAIK there is currently no "TRUNCATE TABLE MyKind" or "DROP TABLE MyKind" equivalent in the Google App Engine datastore, so you'll have to write an ordinary request handler to do the job, or use MapReduce.

I wrote some utility code for background processing on the very first versions of GAE, when there was no MapReduce available, and I'm happy with it, but I would surely like to compare my own solution with a MapReduce-based one; maybe some other time…

Regardless of what type of solution you use, you need to remember about indexes and how the write cost is calculated, because deleting counts as writing. Every write updates the indexes, and that counts toward the datastore quotas, where you pay for each write (meaning each delete): 2 writes + 2 writes per indexed property value + 1 write per composite index value.
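For example, by that formula, deleting an entity with three indexed properties that also appears in one composite index costs 2 + 2 × 3 + 1 = 9 write operations; the same delete with no composite index and no indexed properties costs just the flat 2 writes.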

So it's a good idea to remove all composite indexes on the model before you start deleting in bulk. The removal is not immediate, so check the Indexes section in the App Engine console before you begin. Another thing you can do is disable indexing on all the model's properties. I'm not sure, though, whether that affects the index updates on delete; it probably will not (I need to check that some day), but there is no harm in disabling those indexes either.
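For illustration, here is a minimal sketch of what that looks like on the model side (the AuditLog properties are made up for this example); indexed=False tells the datastore not to maintain per-property index rows for new writes:

from google.appengine.ext import ndb

class AuditLog(ndb.Model):
    # indexed=False: no per-property index rows are written for
    # these fields, so each put touches fewer index entries.
    actor = ndb.StringProperty(indexed=False)
    action = ndb.TextProperty()  # TextProperty is never indexed
    created = ndb.DateTimeProperty(auto_now_add=True, indexed=False)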

I use a simple push task queue handler:

from django.core.urlresolvers import reverse
from django.http import HttpResponse
from google.appengine.ext import ndb

def purge_audit_logs(request):
    url = reverse('purge_audit_logs')
    q = AuditLog.query()
    # Keys-only fetch: we only need the keys to delete, not the entities.
    keys, cursor, more = fetch_page(request, q, page_size=50, keys_only=True)
    ndb.delete_multi(keys)
    if more:
        schedule_next(url, cursor=cursor, queue_name="cleanup")
    return HttpResponse("OK")

This does the purging serially, which minimizes the cost and lets you do things slowly. If you want a fast solution, use fan-out and a faster queue, scheduling the next batch before the current one is deleted (a sketch of the queue definitions follows the handler below):

def purge_audit_logs(request):
    url = reverse('purge_audit_logs')
    q = AuditLog.query()
    keys, cursor, more = fetch_page(request, q, page_size=50, keys_only=True)
    # Schedule the next batch before deleting, so it can start running
    # while this one is still being removed.
    if more:
        schedule_next(url, cursor=cursor, queue_name="cleanup")
    ndb.delete_multi(keys)
    return HttpResponse("OK")
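The queue definitions live in queue.yaml; something along these lines (the names and rates here are just an illustration, not what I actually run) gives you a slow lane for the serial version and a faster one for fan-out:

queue:
- name: cleanup
  rate: 1/s
  bucket_size: 1
- name: cleanup-fast
  rate: 20/s
  bucket_size: 20
  max_concurrent_requests: 10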

And here are my utility functions for task queue handlers:

from google.appengine.api.taskqueue import Queue, Task
from google.appengine.datastore.datastore_query import Cursor

def fetch_page(request, query, page_size=30, **query_options):
    # Resume from the cursor POSTed by the previous task, if any.
    cursor = request.POST.get("cursor", None)
    if cursor:
        cursor = Cursor(urlsafe=cursor)
    col, cursor, more = query.fetch_page(page_size, start_cursor=cursor, **query_options)
    if cursor:
        cursor = cursor.urlsafe()
    return col, cursor, more

def schedule_next(url, queue_name='deferred', cursor=None):
    # Pass the cursor along only when there is one to resume from.
    params = {'cursor': cursor} if cursor else {}
    task = Task(countdown=0, url=url, params=params)
    Queue(queue_name).add(task)

These are simplified; in reality they do much more, like task naming to prevent scheduling duplicate tasks for the same data. Tasks can occasionally run twice, so you'll need to make them idempotent.
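As a rough sketch of the naming trick (the name format and the schedule_next_named helper are my own invention for this example): deriving the task name from the cursor means the queue rejects a second add of the same page instead of running it twice.

import hashlib
from google.appengine.api.taskqueue import (
    Queue, Task, TaskAlreadyExistsError, TombstonedTaskError)

def schedule_next_named(url, queue_name='deferred', cursor=None):
    # A stable name per cursor: adding the same page twice fails
    # instead of creating a duplicate task.
    name = 'purge-%s' % hashlib.sha1(cursor or 'start').hexdigest()
    params = {'cursor': cursor} if cursor else {}
    try:
        Queue(queue_name).add(Task(name=name, url=url, params=params))
    except (TaskAlreadyExistsError, TombstonedTaskError):
        pass  # already scheduled (or recently ran) for this cursor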
