When switching a Python codebase to use the dataclasses feature, added in Python 3.7, I ran into an unexpected issue when running code with Django: dates were not updating.
It was actually a little more general than that: entire objects were not updating. Some object variables might have contents from the current day’s run, while others might have values that were set on the day the current instance of Django was started. When running on the command line, there were no issues!
My first suspect was Celery; I knew that Celery did do some object caching (or, more accurately, function return caching). I was indeed using django-celery-results (it was no longer needed, and I ended up disabling it entirely, although this did not fix the problem).
The first step was to scatter @app.task(ignore_result=True) decorators around:
@app.task(ignore_result=True)
def run_process():
    pass
However, this did nothing to fix the issue. This was not a surprise, since I had spent some time investigating the results cache using celery shell, and found it uniformly empty.
The next step was to look at how the stale variables were created, and this led to more success. None of the variables at issue were using field() with a default_factory!
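One way to hunt for such fields is to inspect the dataclass itself. The following is only a sketch: the helper name eager_defaults and the class Example are hypothetical, and the helper flags every plain default, including harmless immutable ones like strings, so its output still needs a human eye.

```python
import dataclasses
import datetime


def eager_defaults(cls):
    """Return names of fields whose default was computed at class definition."""
    return [
        f.name
        for f in dataclasses.fields(cls)
        if f.default is not dataclasses.MISSING
    ]


@dataclasses.dataclass
class Example:  # hypothetical class, for illustration only
    name: str = "job"
    # Plain default: evaluated once, when the class body runs.
    started: datetime.datetime = datetime.datetime.now()
    # Factory default: called again for every new instance.
    finished: datetime.datetime = dataclasses.field(
        default_factory=datetime.datetime.now
    )


flagged = eager_defaults(Example)
print(flagged)
```

Here started is flagged and finished is not; name is flagged too, although a constant string is not a problem.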
The reason field() exists is that a default written directly in the class body is evaluated only once, when the class is defined; every instance that does not supply its own value then receives that same object. (This is described in the Mutable default values section of the dataclass documentation: “That is, two instances of class D that do not specify a value for x when creating a class instance will share the same copy of x”.)
This is a result of Python’s design: default expressions are evaluated when the class or function is defined, not each time an instance is created or a call is made.
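A plain function default behaves the same way, which may make the cause easier to see: the default expression is evaluated once, when the def statement runs, and the resulting object is reused on every call. A minimal sketch (the function name stamp is hypothetical):

```python
import datetime
import time


def stamp(when=datetime.datetime.now()):
    # The default expression ran once, when the `def` statement executed;
    # the resulting object is stored on the function and reused.
    return when


first = stamp()
time.sleep(0.05)
second = stamp()

print(first is second)                 # True: both calls return the stored default
print(stamp.__defaults__[0] is first)  # True: the default lives on the function object
```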
Default factories are the correct way around this problem. In the following example, if the code worked (instead of producing ValueError: mutable default <class 'list'> for field mylist is not allowed: use default_factory), c2.mylist would have the contents [1, 2, 3].
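from dataclasses import dataclass

@dataclass
class C:
    # Raises ValueError when the class is defined
    mylist: list = []

c1 = C()
# Mutating in place would change the one list shared by all instances
c1.mylist.extend([1, 2, 3])
c2 = C()
print(c2.mylist)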
We can see this in a less contrived example:
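from dataclasses import dataclass
import datetime, time

@dataclass
class C:
    # Evaluated once, when the class body runs
    mydate: datetime.datetime = datetime.datetime.now(datetime.UTC)

c1 = C()
time.sleep(5)
c2 = C()
print(c1.mydate == c2.mydate)  # True: both instances share the same timestamp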
Here, we would expect c2.mydate to have a time 5 seconds later than c1.mydate, but in fact, the time is identical.
In my case, when I ran my code on the command line, it was creating one instance of the object, and then exiting. However, in Django, the state would stick around, and when the same function was run, creating the same object, it would be initialised with the values of the first run.
So, easy fix, right?
@dataclass
class C:
    mydate: datetime.datetime = field(default_factory=datetime.datetime.now)
… wait, not so fast, what happened to the tz argument? It turns out that default_factory “must be a zero-argument callable”, according to the docs. This is annoying, but lambdas to the rescue:
@dataclass
class C:
    mydate: datetime.datetime = field(default_factory=lambda: datetime.datetime.now(datetime.UTC))
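A named helper function does the same job as the lambda and can be reused across several fields. This is only a sketch: the names utc_now and Event are hypothetical, and it uses datetime.timezone.utc, for which datetime.UTC is an alias added in Python 3.11.

```python
import datetime
import time
from dataclasses import dataclass, field


def utc_now() -> datetime.datetime:
    # Zero-argument callable, as default_factory requires.
    return datetime.datetime.now(datetime.timezone.utc)


@dataclass
class Event:
    # The factory is called again for every new instance.
    created: datetime.datetime = field(default_factory=utc_now)


e1 = Event()
time.sleep(0.05)
e2 = Event()
print(e1.created < e2.created)  # True: each instance gets its own timestamp
```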
Finally, a complete solution? Well, no, there’s one more wrinkle, and I don’t think it’s in the documentation. Possibly because I’m doing things I shouldn’t. The remaining problem lies with dataclass’s __post_init__():
@dataclass
class C:
    mydate: datetime.datetime = field(init=False)
    def __post_init__(self, mydate: datetime.datetime = datetime.datetime.now()) -> None:
        print(mydate)
        self.mydate = mydate
c1 = C()
time.sleep(5)
c2 = C()
print(c1.mydate==c2.mydate)
… It turns out, perhaps unsurprisingly, that __post_init__() has the same behaviour. Let’s reach for field():
@dataclass
class C:
    mydate: datetime.datetime = field(init=False)
    def __post_init__(self, mydate: datetime.datetime = field(default_factory=datetime.datetime.now)) -> None:
        print(mydate)
        self.mydate = mydate
c1 = C()
print(c1.mydate)
Oh dear, this doesn’t work. We get:
Field(name=None,type=None,default=<dataclasses._MISSING_TYPE object at 0x76f9d4783260>,default_factory=<built-in method now of type object at 0xa4fd20>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=<dataclasses._MISSING_TYPE object at 0x76f9d4783260>,_field_type=None)
This is not a datetime.datetime… it’s a Field. So field() doesn’t appear to be usable as a function argument: the @dataclass decorator only interprets Field objects that are assigned to class attributes.
Let’s reach for lambda again:
from dataclasses import dataclass, field
import datetime, time
@dataclass
class C:
    mydate: datetime.datetime = field(init=False)
    def __post_init__(self, mydate: datetime.datetime = lambda: datetime.datetime.now()) -> None:
        print(mydate)
        self.mydate = mydate
c1 = C()
print(c1.mydate)
This doesn’t work either! We now get:
<function C.<lambda> at 0x74bd298cf240>
Sadly, the best we can do here is to default the function argument to None and compute the real value inside the function body:
from dataclasses import dataclass, field
import datetime, time
@dataclass
class C:
    mydate: datetime.datetime = field(init=False)
    def __post_init__(self, mydate: datetime.datetime | None = None) -> None:
        # None is a sentinel: the current time is computed on every call
        if mydate is None:
            mydate = datetime.datetime.now()
        self.mydate = mydate
c1 = C()
print(c1.mydate)
This finally works!
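If the field only ever needs a fresh timestamp, combining init=False with default_factory may avoid __post_init__() altogether: the dataclasses documentation states that the factory is then always called from the generated __init__(). A minimal sketch, assuming no other post-initialisation logic is needed (the class name Stamped is hypothetical):

```python
import datetime
import time
from dataclasses import dataclass, field


@dataclass
class Stamped:
    # Not an __init__ parameter, but still populated afresh per instance.
    mydate: datetime.datetime = field(
        init=False, default_factory=datetime.datetime.now
    )


s1 = Stamped()
time.sleep(0.05)
s2 = Stamped()
print(s1.mydate < s2.mydate)  # True: the factory ran once per instance
```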
Hopefully this helps if you’re having trouble with stale data in your Python objects when using dataclasses, even if you’re not using long-lived processes with Django and Celery!