How to Use Python Data Classes (A Beginner’s Guide)
In Python, a data class is a class designed primarily to hold data values. Data classes are regular classes in every other respect, but they usually define few or no methods beyond the ones needed to store and display their data. They are typically used to store information that will be passed between different parts of a program or a system.
However, when creating classes that work only as data containers, writing the __init__ method over and over is tedious and error-prone.
The dataclasses module, introduced in Python 3.7, provides a simpler way to create data classes without writing these boilerplate methods by hand.
In this article, we'll see how to use this module to quickly create classes that come with __init__ and several other methods already implemented, all in just a few lines of code.
We expect you to have some intermediate Python experience, including an understanding of how to create classes and of object-oriented programming in general.
Using the dataclasses Module
As a starting example, let's say we're implementing a class to store data about a certain group of people. For each person, we'll have attributes such as name, age, height, and email address. This is what a regular class looks like:
class Person():
    def __init__(self, name, age, height, email):
        self.name = name
        self.age = age
        self.height = height
        self.email = email
If we use the dataclasses module instead, we import dataclass and apply it as a decorator to the class we're creating. When we do that, we no longer need to write the __init__ method; we only specify the attributes of the class and their types. Here's the same Person class, implemented in this way:
from dataclasses import dataclass

@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
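As a minimal sketch of what the decorator gives us, the example below rebuilds the same Person class and inspects it. The email address is a placeholder chosen for illustration. It shows that the generated __init__ accepts the fields in declaration order, positionally or by keyword, and that the type annotations are hints only: dataclass does not validate them at runtime.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int
    height: float
    email: str


# The generated __init__ takes the fields in declaration order,
# either positionally or by keyword.
joe = Person('Joe', 25, height=1.85, email='joe@example.com')  # placeholder email

# fields() returns the fields that the decorator collected from the annotations.
field_names = [f.name for f in fields(Person)]
print(field_names)

# Annotations are not enforced: passing a string for age raises no error.
odd = Person('Joe', 'twenty-five', 1.85, 'joe@example.com')
print(odd.age)
```

If type validation matters, it has to be added separately, for example in a __post_init__ method or with a static type checker.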
We can also set default values to the class attributes:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = '[email protected]'
print(Person())
Person(name='Joe', age=30, height=1.85, email='[email protected]')
As a reminder, Python doesn't accept a parameter without a default after one with a default, and the same rule applies to data class attributes, so this would throw an error:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5540/741473360.py in <module>
1 @dataclass
----> 2 class Person():
3 name: str = 'Joe'
4 age: int = 30
5 height: float = 1.85
~\anaconda3\lib\dataclasses.py in dataclass(cls, init, repr, eq, order, unsafe_hash, frozen)
1019
1020 # We're called as @dataclass without parens.
-> 1021 return wrap(cls)
1022
1023
~\anaconda3\lib\dataclasses.py in wrap(cls)
1011
1012 def wrap(cls):
-> 1013 return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
1014
1015 # See if we're being called as @dataclass or @dataclass().
~\anaconda3\lib\dataclasses.py in _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
925 if f._field_type in (_FIELD, _FIELD_INITVAR)]
926 _set_new_attribute(cls, '__init__',
--> 927 _init_fn(flds,
928 frozen,
929 has_post_init,
~\anaconda3\lib\dataclasses.py in _init_fn(fields, frozen, has_post_init, self_name, globals)
502 seen_default = True
503 elif seen_default:
--> 504 raise TypeError(f'non-default argument {f.name!r} '
505 'follows default argument')
506
TypeError: non-default argument 'email' follows default argument
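One way to avoid this error, sketched below, is to declare the attributes without defaults first and the ones with defaults last. The reordered class and the placeholder email are assumptions made for illustration, not part of the original example.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Attributes without defaults come first...
    name: str
    email: str
    # ...followed by the attributes that have defaults.
    age: int = 30
    height: float = 1.85


person = Person('Joe', 'joe@example.com')  # placeholder email
print(person)
```

Keep in mind that reordering the attributes also changes the order of the positional arguments in __init__.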
Once the class is defined, it's easy to instantiate a new object and access its attributes, just like with a standard class:
person = Person('Joe', 25, 1.85, '[email protected]')
print(person.name)
Joe
So far we've used regular data types like string, integer, and float; we can also combine dataclass with the typing module to annotate attributes with more complex types. For instance, let's add a house_coordinates attribute to the Person:
from typing import Tuple

@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
    house_coordinates: Tuple
print(Person('Joe', 25, 1.85, '[email protected]', (40.748441, -73.985664)))
Person(name='Joe', age=25, height=1.85, email='[email protected]', house_coordinates=(40.748441, -73.985664))
Following the same logic, we can create a data class to hold multiple instances of the Person class:
from typing import List

@dataclass
class People():
    people: List[Person]
Notice that the people attribute in the People class is defined as a list of instances of the Person class. For example, we could instantiate an object of People like this:
joe = Person('Joe', 25, 1.85, '[email protected]', (40.748441, -73.985664))
mary = Person('Mary', 43, 1.67, '[email protected]', (-73.985664, 40.748441))
print(People([joe, mary]))
People(people=[Person(name='Joe', age=25, height=1.85, email='[email protected]', house_coordinates=(40.748441, -73.985664)), Person(name='Mary', age=43, height=1.67, email='[email protected]', house_coordinates=(-73.985664, 40.748441))])
This allows us to define an attribute not only as any single type we want, but also as a combination of data types.
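A related point comes up when a list attribute needs a default value. The sketch below assumes we want People to start out empty; the simplified Person class and the variable names are illustrative. A plain `= []` default is rejected by dataclass with a ValueError, so the field function's default_factory parameter is used to build a fresh list for each instance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Person:
    name: str
    age: int


@dataclass
class People:
    # A bare `= []` default would raise ValueError when the class is defined;
    # default_factory creates a new list for every instance instead.
    people: List[Person] = field(default_factory=list)


team_a = People()
team_b = People()
team_a.people.append(Person('Joe', 25))

# Each instance owns its own list, so team_b is still empty.
print(len(team_a.people), len(team_b.people))
```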
Representation and Comparisons
As we mentioned earlier, dataclass implements not only the __init__ method, but several others, including the __repr__ method. In a regular class, we use this method to display a representation of an object in the class.
For instance, we'd define the method as in the example below, and Python would use it when we print the object:
class Person():
    def __init__(self, name, age, height, email):
        self.name = name
        self.age = age
        self.height = height
        self.email = email

    def __repr__(self):
        return (f'{self.__class__.__name__}(name={self.name}, age={self.age}, height={self.height}, email={self.email})')
person = Person('Joe', 25, 1.85, '[email protected]')
print(person)
Person(name=Joe, age=25, height=1.85, email=[email protected])
When using dataclass, however, there's no need to write any of that:
@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
person = Person('Joe', 25, 1.85, '[email protected]')
print(person)
Person(name='Joe', age=25, height=1.85, email='[email protected]')
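Notice that without all that code, the output is nearly identical to the one from the standard Python class. The only difference is that the generated __repr__ uses the repr of each attribute, so the strings appear in quotes.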
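To look at the generated representation a little more closely, here is a small sketch with a shortened, illustrative Person class. It suggests two things: string attributes are shown in quotes, and print falls back to __repr__ because dataclass does not generate a separate __str__ method.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


person = Person('Joe', 25)

# The generated __repr__ shows each attribute as name=value.
text = repr(person)
print(text)

# String values appear in quotes in the generated representation.
print("name='Joe'" in text)

# No __str__ is generated, so str() and print() use __repr__.
print(str(person) == text)
```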
We can always override it if we want to customize the representation of our class:
@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str

    def __repr__(self):
        return (f'''This is a {self.__class__.__name__} called {self.name}.''')
person = Person('Joe', 25, 1.85, '[email protected]')
print(person)
This is a Person called Joe.
Notice that the output of the representation is customized.
When it comes to comparisons, the dataclasses module makes our lives easier. For example, we can directly compare two instances of a class just like this:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = '[email protected]'
print(Person() == Person())
True
Notice that we used default attributes to make the example shorter.
In this case, the comparison works because dataclass creates an __eq__ method behind the scenes, which compares the attributes of the two objects. Without the decorator, we'd have to create this method ourselves.
The same comparison gives a different result with a standard Python class, even though the two instances hold exactly the same values:
class Person():
    def __init__(self, name='Joe', age=30, height=1.85, email='[email protected]'):
        self.name = name
        self.age = age
        self.height = height
        self.email = email
print(Person() == Person())
False
Without the use of the dataclass decorator, that class doesn't test whether two instances are equal. So, by default, Python will use the object's id to make the comparison, and, as we see below, they are different:
first, second = Person(), Person()  # keep both objects alive so their ids can't be reused
print(id(first))
print(id(second))
1734438049008
1734438050976
All this means that we'd have to write an __eq__ method that makes this comparison:
class Person():
    def __init__(self, name='Joe', age=30, height=1.85, email='[email protected]'):
        self.name = name
        self.age = age
        self.height = height
        self.email = email

    def __eq__(self, other):
        if isinstance(other, Person):
            return (self.name, self.age,
                    self.height, self.email) == (other.name, other.age,
                                                 other.height, other.email)
        return NotImplemented
print(Person() == Person())
True
Now we see the two objects are equal to each other, but we had to write more code to get this result.
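One detail of the generated __eq__ is worth a quick sketch: it only compares attribute values when both objects are instances of exactly the same class. The Robot class below is hypothetical and exists only to illustrate this; the shortened Person class is also an assumption.

```python
from dataclasses import dataclass, astuple


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Robot:  # hypothetical class with the same fields as Person
    name: str
    age: int


# Same class and same values: the attributes are compared and found equal.
same_class = Person('Joe', 30) == Person('Joe', 30)

# Same values but different classes: the generated __eq__ declines to compare.
different_class = Person('Joe', 30) == Robot('Joe', 30)

# The underlying tuples of values are nevertheless identical.
same_values = astuple(Person('Joe', 30)) == astuple(Robot('Joe', 30))

print(same_class, different_class, same_values)
```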
The @dataclass Parameters
As we saw above, when using the dataclass decorator, the __init__, __repr__, and __eq__ methods are implemented for us. The creation of these methods is controlled by the init, repr, and eq parameters of dataclass, which are all True by default. If the class already defines one of these methods itself, the corresponding parameter is ignored.
However, we have other parameters of dataclass that we should look at before moving on:
- order: enables sorting of the class, as we'll see in the next section. The default is False.
- frozen: when True, the values inside an instance of the class can't be modified after it's created. The default is False.
There are a few other parameters that you can check in the documentation.
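To illustrate how these parameters switch individual methods on and off, the sketch below turns off eq on a shortened, illustrative Person class. With eq=False, no __eq__ is generated, so instances are compared by identity as in a regular class, while __init__ and __repr__ are still created.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Person:
    name: str = 'Joe'
    age: int = 30


first = Person()
second = Person()

# No __eq__ was generated, so two distinct objects are not equal,
# even though they hold the same values.
print(first == second)

# An object is still equal to itself.
print(first == first)

# __repr__ is unaffected by eq=False.
print(first)
```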
Sorting
When working with data, we often need to sort values. In our scenario, we may want to sort our different people based on some attribute. For that, we'll use the order parameter of the dataclass decorator mentioned above which enables sorting in the class:
@dataclass(order=True)
class Person():
    name: str
    age: int
    height: float
    email: str
When the order parameter is set to True, it automatically generates the __lt__ (less than), __le__ (less or equal), __gt__ (greater than), and __ge__ (greater or equal) methods used for sorting.
Let's instantiate our joe and mary objects to see if one is greater than the other:
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
print(joe > mary)
False
Python tells us that joe is not greater than mary, but based on what criteria? The class compares the objects as tuples containing their attributes, like this:
print(('Joe', 25, 1.85, '[email protected]') > ('Mary', 43, 1.67, '[email protected]'))
False
As the letter "J" comes before "M", Python concludes that joe < mary. If the names were the same, it would move on to the next element in each tuple. As it is, it's comparing the objects alphabetically by name. Although that can make sense depending on the problem we're dealing with, we want to be able to control how the objects will be sorted.
To achieve that, we'll take advantage of two other features of the dataclasses module.
The first is the field function. This function is used to customize one attribute of a data class individually, which allows us to define new attributes that will depend on another attribute and will only be created after the object is instantiated.
In our sorting problem, we'll use field to create a sort_index attribute in our class. This attribute can only be set after the object is instantiated, and because it's the first field in the class, it's the first value that dataclasses compares when sorting:
from dataclasses import dataclass, field

@dataclass(order=True)
class Person():
    sort_index: int = field(init=False, repr=False)
    name: str
    age: int
    height: float
    email: str
The two arguments that we passed as False state that this attribute isn't in the __init__ and that it shouldn't be displayed when we call __repr__. There are other parameters in the field function that you can check in the documentation.
After we've declared this new attribute, we'll use the second new tool: the __post_init__ method. As the name suggests, this method is executed right after the __init__ method. We'll use __post_init__ to define the sort_index right after the creation of the object. As an example, let's say we want to compare people based on their age. Here's how:
@dataclass(order=True)
class Person():
    sort_index: int = field(init=False, repr=False)
    name: str
    age: int
    height: float
    email: str

    def __post_init__(self):
        self.sort_index = self.age
If we make the same comparison, we know that Joe is younger than Mary:
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
print(joe > mary)
False
If we wanted to sort people by height, we'd use this code:
@dataclass(order=True)
class Person():
    sort_index: float = field(init=False, repr=False)
    name: str
    age: int
    height: float
    email: str

    def __post_init__(self):
        self.sort_index = self.height
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
print(joe > mary)
True
Joe is taller than Mary. Notice that we set sort_index as a float.
We were able to implement sorting in our data class without the need to write multiple methods.
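The comparisons above carry over to sorting whole collections. As a sketch, the example below sorts a list of Person objects with the built-in sorted function. The email attribute is left out for brevity and the third person, Ana, is made up for illustration. Because sort_index is the first field, it decides the order, and the remaining fields only break ties.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Person:
    sort_index: int = field(init=False, repr=False)
    name: str
    age: int
    height: float

    def __post_init__(self):
        # Sort people by age.
        self.sort_index = self.age


people = [
    Person('Mary', 43, 1.67),
    Person('Joe', 25, 1.85),
    Person('Ana', 31, 1.70),  # hypothetical extra person
]

# sorted() relies on the generated __lt__ method, so it orders by sort_index.
by_age = [p.name for p in sorted(people)]
print(by_age)

# For a one-off ordering, a key function works without changing the class.
by_height = [p.name for p in sorted(people, key=lambda p: p.height)]
print(by_height)
```

The key function is a reasonable alternative when different parts of a program need different orderings of the same class.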
Working with Immutable Data Classes
Another parameter of @dataclass that we mentioned above is frozen. When set to True, frozen doesn't allow us to modify the attributes of an object after it's created.
With frozen=False, we can easily perform such modification:
@dataclass()
class Person():
    name: str
    age: int
    height: float
    email: str
joe = Person('Joe', 25, 1.85, '[email protected]')
joe.age = 35
print(joe)
Person(name='Joe', age=35, height=1.85, email='[email protected]')
We created a Person object and then modified the age attribute without any problems.
However, when set to True, any attempt to modify the object throws an error:
@dataclass(frozen=True)
class Person():
    name: str
    age: int
    height: float
    email: str
joe = Person('Joe', 25, 1.85, '[email protected]')
joe.age = 35
print(joe)
---------------------------------------------------------------------------
FrozenInstanceError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5540/2036839054.py in <module>
8 joe = Person('Joe', 25, 1.85, '[email protected]')
9
---> 10 joe.age = 35
11 print(joe)
<string> in __setattr__(self, name, value)
FrozenInstanceError: cannot assign to field 'age'
Notice that the error message states FrozenInstanceError.
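When we do need a changed version of a frozen object, one option is to build a new object instead of modifying the existing one. The sketch below uses the replace function from the dataclasses module for that, and also shows how FrozenInstanceError can be caught. The shortened Person class and the variable names are assumptions for illustration.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Person:
    name: str
    age: int


joe = Person('Joe', 25)

# Assigning to an attribute of a frozen instance raises FrozenInstanceError.
try:
    joe.age = 35
except FrozenInstanceError as error:
    print(f'Cannot modify: {error}')

# replace() returns a new object with the given attributes changed;
# the original object is left untouched.
older_joe = replace(joe, age=35)
print(joe.age, older_joe.age)
```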
There's a catch that lets us change the contents of an immutable data class. If our class contains a mutable attribute, that attribute can still change even though the class is frozen. This may seem like it doesn't make sense, but let's look at an example.
Let's recall the People class that we created earlier in this article, but now let's make it immutable:
@dataclass(frozen=True)
class Person():
    name: str
    age: int
    height: float
    email: str

@dataclass(frozen=True)
class People():
    people: List[Person]
We then create two instances of the Person class and use them to create an instance of People that we'll name two_people:
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
two_people = People([joe, mary])
print(two_people)
People(people=[Person(name='Joe', age=25, height=1.85, email='[email protected]'), Person(name='Mary', age=43, height=1.67, email='[email protected]')])
The people attribute in the People class is a list. We can easily access the values in this list in the two_people object:
print(two_people.people[0])
Person(name='Joe', age=25, height=1.85, email='[email protected]')
So, even though both Person and People classes are immutable, the list is not, which means we can change the values in it:
two_people.people[0] = Person('Joe', 35, 1.85, '[email protected]')
print(two_people.people[0])
Person(name='Joe', age=35, height=1.85, email='[email protected]')
Notice that the age is now 35.
We didn't change the attributes of any object of the immutable classes, but we replaced the first element of the list with a different one, and the list is mutable.
Keep in mind that all the attributes of the class should also be immutable in order to safely work with immutable data classes.
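As a sketch of that advice, the example below stores the people in a tuple instead of a list. This is one possible approach rather than the only one, and the shortened Person class is an assumption. Since tuples don't support item assignment, the replacement attempted earlier now fails with a TypeError.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Person:
    name: str
    age: int


@dataclass(frozen=True)
class People:
    # A tuple is immutable, unlike a list.
    people: Tuple[Person, ...]


two_people = People((Person('Joe', 25), Person('Mary', 43)))

# Replacing an element of the tuple is not allowed.
try:
    two_people.people[0] = Person('Joe', 35)
    replaced = True
except TypeError:
    replaced = False

print(replaced)
print(two_people.people[0].age)
```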
Inheritance with dataclasses
The dataclasses module also supports inheritance, which means we can create a data class that uses the attributes of another data class. Still using our Person class, we'll create a new Employee class that inherits all the attributes from Person.
So we have Person:
@dataclass(order=True)
class Person():
    name: str
    age: int
    height: float
    email: str
And the new Employee class:
@dataclass(order=True)
class Employee(Person):
    salary: int
    department: str
Now we can create an object of the Employee class using all the attributes of the Person class:
print(Employee('Joe', 25, 1.85, '[email protected]', 100000, 'Marketing'))
Employee(name='Joe', age=25, height=1.85, email='[email protected]', salary=100000, department='Marketing')
From now on we can use everything we saw in this article in the Employee class as well.
Take note of the default attributes. Let's say we have default attributes in Person, but not in Employee. This scenario, as in the code below, raises an error:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = '[email protected]'

@dataclass(order=True)
class Employee(Person):
    salary: int
    department: str

print(Employee('Joe', 25, 1.85, '[email protected]', 100000, 'Marketing'))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5540/1937366284.py in <module>
9
10 @dataclass(order=True)
---> 11 class Employee(Person):
12 salary: int
13 department: str
~\anaconda3\lib\dataclasses.py in wrap(cls)
1011
1012 def wrap(cls):
-> 1013 return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
1014
1015 # See if we're being called as @dataclass or @dataclass().
~\anaconda3\lib\dataclasses.py in _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
925 if f._field_type in (_FIELD, _FIELD_INITVAR)]
926 _set_new_attribute(cls, '__init__',
--> 927 _init_fn(flds,
928 frozen,
929 has_post_init,
~\anaconda3\lib\dataclasses.py in _init_fn(fields, frozen, has_post_init, self_name, globals)
502 seen_default = True
503 elif seen_default:
--> 504 raise TypeError(f'non-default argument {f.name!r} '
505 'follows default argument')
506
TypeError: non-default argument 'salary' follows default argument
If the base class has default attributes, all the attributes in the class derived from it must have default values too.
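A minimal sketch of a version that works is shown below. The default values chosen for salary and department are placeholders, and the Person class is shortened for brevity.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str = 'Joe'
    age: int = 30


@dataclass
class Employee(Person):
    # Every attribute added by the subclass also has a default,
    # so the generated __init__ is valid.
    salary: int = 0  # placeholder default
    department: str = 'Unassigned'  # placeholder default


employee = Employee('Mary', 43, 100000, 'Marketing')
default_employee = Employee()

print(employee)
print(default_employee)
```

In Python 3.10 and later, the kw_only parameter of dataclass is another way around this restriction, since keyword-only attributes are not subject to the ordering rule.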
Conclusion
In this article, we saw how the dataclasses module is a very powerful tool to create data classes in a quick, intuitive way. Although we've seen a lot in this article, the module contains many more tools, and there's always more to learn about it.
So far, we've learned how to:
- Define a class using dataclasses
- Use default attributes and their rules
- Create a representation method
- Compare data classes
- Sort data classes
- Use inheritance with data classes
- Work with immutable data classes