The Trick
In ordinary 2D, translation is x' = x + tx, which is not linear in (x, y), so we cannot write it as a 2x2 matrix multiplication. In homogeneous coordinates we represent (x, y) as (x, y, 1) and use 3x3 matrices. Now translation, rotation, and scaling all become a single matrix multiplication, and they can be composed by multiplication.
Matrix Forms (column vector convention)
Translation: T(tx,ty) = [ 1 0 tx ] [ 0 1 ty ] [ 0 0 1 ]
Rotation about origin: R(theta) = [ cos -sin 0 ] [ sin cos 0 ] [ 0 0 1 ]
Scaling about origin: S(sx,sy) = [ sx 0 0 ] [ 0 sy 0 ] [ 0 0 1 ]
Reflection across X-axis: [ 1 0 0 ] [ 0 -1 0 ] [ 0 0 1 ]
X-shear: [ 1 shx 0 ] [ 0 1 0 ] [ 0 0 1 ]
Apply
A point P = [x, y, 1]^T is transformed as P' = M * P. The third row of M is [0 0 1] for affine maps; perspective transforms use a non-trivial third row, which is why projective geometry uses homogeneous coordinates too.
Why w?
Two homogeneous points (x, y, w) and (kx, ky, k*w) for any k != 0 represent the same 2D point. Setting w = 0 represents a point at infinity (a direction vector), useful for parallel lines and directional lights.
Example: Combining
Rotate 30 deg about (4, 5): M = T(4, 5) R(30) T(-4, -5). Multiplying by hand:
- T(-4,-5): subtract origin.
- R(30): standard rotation.
- T(4,5): put back.
The product is a single 3x3 matrix you can apply to thousands of vertices.
GPU Reality
Vertex shaders multiply each vertex by a 4x4 matrix (Model View Projection). The 4x4 form simply extends the same idea to 3D plus perspective.