4.1 getsockopt / setsockopt
int getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen);
int setsockopt(int fd, int level, int optname, const void *optval, socklen_t optlen);
level selects the protocol layer: SOL_SOCKET (generic), IPPROTO_IP (IPv4), IPPROTO_IPV6, IPPROTO_ICMPV6, IPPROTO_TCP. Most options are integer flags or values; some are structs (linger, timeval).
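A minimal sketch of the calling pattern for the common integer-valued case. The helper names set_int_opt and get_int_opt are hypothetical (not part of any API); the point to notice is that optlen is a plain value for setsockopt but a value-result pointer for getsockopt.

```c
#include <sys/socket.h>

/* Hypothetical helper: set an integer-valued option.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int set_int_opt(int fd, int level, int optname, int value)
{
    return setsockopt(fd, level, optname, &value, sizeof value);
}

/* Hypothetical helper: read an integer-valued option into *value.
 * len is value-result: buffer size going in, bytes written coming out. */
int get_int_opt(int fd, int level, int optname, int *value)
{
    socklen_t len = sizeof *value;
    return getsockopt(fd, level, optname, value, &len);
}
```

Typical use: set_int_opt(fd, SOL_SOCKET, SO_KEEPALIVE, 1), then get_int_opt to confirm. Struct-valued options (linger, timeval) follow the same shape with the struct in place of the int.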
4.2 Generic (SOL_SOCKET) options — the ones with stories
| Option | Effect | The story |
|---|---|---|
| SO_REUSEADDR | allow bind to a port in TIME_WAIT | every TCP server should set it — otherwise a restart within 2·MSL fails with EADDRINUSE |
| SO_KEEPALIVE | probe an idle peer (~2 h default) | detects the Unit-2 "host crashed" silence; server-side housekeeping |
| SO_LINGER | control close() behaviour | struct: off = default (close returns at once, kernel delivers in background); on+0 = RST, data discarded; on+N = close blocks ≤ N sec for delivery |
| SO_RCVBUF / SO_SNDBUF | socket buffer sizes | the receive buffer bounds TCP's advertised window; for high bandwidth×delay paths set it before connect (client) or listen (server), because window scaling is negotiated in the SYN |
| SO_RCVLOWAT / SO_SNDLOWAT | low-water marks for select readiness | tune when select says "ready" |
| SO_RCVTIMEO / SO_SNDTIMEO | I/O timeouts (struct timeval) | timeout method #3 below |
| SO_BROADCAST | permit sending to broadcast addresses | required before any broadcast (Unit 3) |
| SO_ERROR | fetch & clear pending error | how nonblocking connect reports success/failure |
| SO_REUSEPORT | multiple sockets on one port (load balancing) | modern multi-process accept |
Socket states caveat: some options must be set at the right moment — buffer sizes before the connection exists; options on a listening socket are inherited by accepted sockets, so set SO_KEEPALIVE etc. on listenfd.
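The timing rule can be sketched as a setup step that runs before listen(). The function name prepare_listener and its parameter are hypothetical; the sketch assumes the caller has already created the socket and will call bind/listen afterwards.

```c
#include <sys/socket.h>

/* Hypothetical helper: apply options that must be in place BEFORE
 * listen()/connect(). Returns 0 on success, -1 on error. */
int prepare_listener(int listenfd, int rcvbuf_bytes)
{
    int on = 1;

    /* Buffer size first: the window-scale factor is fixed during the
     * SYN exchange, so a later change cannot raise it. */
    if (setsockopt(listenfd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof rcvbuf_bytes) < 0)
        return -1;

    /* Set on the listener so that accepted sockets start with it,
     * following the inheritance rule stated above. */
    if (setsockopt(listenfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;

    return 0;
}
```

The kernel may clamp or round the requested size; Linux, for instance, reports double the requested value from getsockopt, so read it back rather than assume.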
4.3 IPv4 / IPv6 / TCP level options
| Level | Option | Use |
|---|---|---|
| IPPROTO_IP | IP_TTL | set TTL — traceroute's whole trick (Unit 4) |
| IPPROTO_IP | IP_HDRINCL | "I build the IP header myself" — raw sockets |
| IPPROTO_IP | IP_MULTICAST_TTL / IP_ADD_MEMBERSHIP ... | multicast controls (Unit 3) |
| IPPROTO_IPV6 | IPV6_V6ONLY, IPV6_UNICAST_HOPS | dual-stack & hop limit |
| IPPROTO_ICMPV6 | ICMP6_FILTER | choose which ICMPv6 types a raw socket receives |
| IPPROTO_TCP | TCP_NODELAY | disable the Nagle algorithm |
| IPPROTO_TCP | TCP_MAXSEG | read/set MSS |
Nagle in one paragraph (perennial viva): Nagle's algorithm delays small segments while an ACK is outstanding, coalescing keystroke-sized writes — great for telnet over WAN, deadly for latency-sensitive request/response (especially interacting with delayed ACKs: the infamous 40 ms stalls). Interactive/real-time protocols set TCP_NODELAY; bulk transfer leaves it alone.
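Disabling Nagle is a one-line setsockopt at the IPPROTO_TCP level. A minimal sketch, with hypothetical helper names disable_nagle and nagle_disabled:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn Nagle off (TCP_NODELAY on). 0 or -1. */
int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}

/* Hypothetical helper: 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int nagle_disabled(int fd)
{
    int v = 0;
    socklen_t len = sizeof v;

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &v, &len) < 0)
        return -1;
    return v != 0;
}
```

The option is per-socket and can be set at any time on a TCP socket; it changes how small writes are sent, not whether they are delivered.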
4.4 Socket timeouts — the three techniques
- SIGALRM around the blocking call — classic, but signal-global and racy.
- select with a timeout before read/write — portable and precise.
- SO_RCVTIMEO / SO_SNDTIMEO — set once, applies to all subsequent operations.
4.5 recv / send and the flags
ssize_t recv(int fd, void *buf, size_t n, int flags);
ssize_t send(int fd, const void *buf, size_t n, int flags);
| Flag | Meaning |
|---|---|
| MSG_DONTWAIT | this call only: non-blocking |
| MSG_PEEK | look at the data without consuming it |
| MSG_WAITALL | don't return until the full n bytes have arrived (can still return fewer on a signal, error, or EOF) |
| MSG_OOB | send/receive out-of-band (urgent) byte |
| MSG_DONTROUTE | bypass routing (direct LAN) |
4.6 Scatter/gather: readv & writev
struct iovec { void *iov_base; size_t iov_len; };
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t readv (int fd, const struct iovec *iov, int iovcnt);
One atomic call gathers from / scatters to multiple buffers — e.g. write a header struct + payload without copying them together and without two writes (which Nagle + delayed-ACK would punish). writev is the clean cure for the header/body problem.
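The header/body case can be sketched as follows. The 4-byte length-prefix framing, the struct name msg_header, and the function send_framed are all assumptions made for illustration, not a format defined in these notes.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical wire header: payload length in network byte order. */
struct msg_header {
    uint32_t length;
};

/* Hypothetical helper: send header + payload with ONE writev call,
 * no intermediate copy. Returns bytes written or -1. On a stream
 * socket the count can be short; a real sender would loop. */
ssize_t send_framed(int fd, const void *payload, uint32_t len)
{
    struct msg_header hdr;
    struct iovec iov[2];

    hdr.length = htonl(len);

    iov[0].iov_base = &hdr;              /* first buffer: the header  */
    iov[0].iov_len  = sizeof hdr;
    iov[1].iov_base = (void *)payload;   /* second buffer: the body   */
    iov[1].iov_len  = len;

    return writev(fd, iov, 2);
}
```

The cast on payload only drops const to satisfy the iovec type; writev does not modify the buffers it gathers from.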
4.7 recvmsg / sendmsg & ancillary data — the most general I/O
struct msghdr {
void *msg_name; /* address (like sendto/recvfrom) */
socklen_t msg_namelen;
struct iovec *msg_iov; /* scatter/gather (like readv) */
int msg_iovlen;
void *msg_control; /* ANCILLARY DATA */
socklen_t msg_controllen;
int msg_flags; /* returned flags */
};
These two subsume all the other I/O calls (read/write/readv/writev/recv/send/recvfrom/sendto are special cases). Ancillary (control) data — cmsghdr records in msg_control — carries the exotic payloads:
- descriptor passing (SCM_RIGHTS): send an open file descriptor to another process over a Unix-domain socket — how preforked servers hand connections around;
- credentials (SCM_CREDENTIALS);
- packet metadata: receiving interface & destination address of a UDP datagram (IP_RECVDSTADDR / IP_PKTINFO — needed by multihomed UDP servers), TTL, IPv6 hop limit.
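Descriptor passing is the ancillary-data case most worth seeing in code. The sketch below assumes a connected Unix-domain socket; send_fd and recv_fd are hypothetical helper names. One byte of ordinary data travels with the control message, since an empty message may not be delivered on a stream socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer sized for exactly one int, aligned for cmsghdr. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send fd_to_send over Unix-domain socket sock. */
int send_fd(int sock, int fd_to_send)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor; returns it or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof fd);
    return fd;
}
```

The number the receiver gets is generally different from the number the sender passed: the kernel installs a new descriptor in the receiving process that refers to the same open file.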
4.8 How much data is queued? — and sockets vs stdio
To learn how much is readable without reading: MSG_PEEK (with MSG_DONTWAIT), or ioctl(fd, FIONREAD, &n). And a hard-won warning: don't mix stdio (fprintf/fgets) with sockets — stdio's own buffering (line-buffered terminal vs fully-buffered elsewhere) interleaves unpredictably with the socket stream; classic deadlocks result. Use read/write/readn/writen on sockets, full stop.
4.9 SO_LINGER, all three settings traced
The linger struct controls what close() means — three behaviours, one struct:
struct linger {
int l_onoff; /* 0 = off, nonzero = on */
int l_linger; /* seconds, when on */
};
| Setting | What close() does | What goes on the wire | When you'd want it |
|---|---|---|---|
| off (default) | returns immediately; kernel keeps trying to deliver buffered data, then FIN | data... FIN | almost always |
| on, l_linger = 0 | returns immediately; connection aborted | RST; buffered data in both directions discarded; no TIME_WAIT! | deliberately killing misbehaving peers; load-test tools avoiding TIME_WAIT exhaustion |
| on, l_linger = N | blocks up to N seconds until data is delivered and ACKed (or times out → EWOULDBLOCK) | data... FIN, but the application waits for it | when the app must know delivery happened before proceeding |
Two examiner-grade footnotes: even the lingering close only confirms the peer's TCP ACKed the data — not that the peer application read it (only an application-level acknowledgement can promise that — the deep reason application protocols have their own confirmations); and skipping TIME_WAIT via the RST trick sacrifices exactly the protections TIME_WAIT exists for (lesson 1's two reasons) — name that trade-off whenever you mention it.
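All three rows of the table come down to how the struct is filled in before setsockopt. A minimal sketch; set_linger and abortive_close are hypothetical helper names.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: onoff = 0 restores the default;
 * onoff = 1, seconds = 0 selects the RST close;
 * onoff = 1, seconds = N selects the blocking close. */
int set_linger(int fd, int onoff, int seconds)
{
    struct linger lg;

    lg.l_onoff = onoff;
    lg.l_linger = seconds;
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}

/* Hypothetical helper: the "on, l_linger = 0" row. On a connected TCP
 * socket this discards unsent data and skips TIME_WAIT. */
int abortive_close(int fd)
{
    if (set_linger(fd, 1, 0) < 0)
        return -1;
    return close(fd);
}
```

For the blocking row, call set_linger(fd, 1, N) and then check the return value of close; the default row needs no call at all.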
4.10 SO_REUSEADDR — the restart scenario, step by step
The story behind the "every TCP server" rule, traced:
- Server listens on port 9877; a client connects; the server (or its child) closes first — the server side enters TIME_WAIT for 2·MSL.
- The administrator restarts the server seconds later.
- bind(9877) fails with EADDRINUSE, not because anything is listening, but because a TIME_WAIT connection still references the port.
- With SO_REUSEADDR set before bind, the bind succeeds: the option means "binding is allowed even if connections in TIME_WAIT exist on this port".
What it does not allow (common misconception, common trap question): two sockets simultaneously bound to the same (IP, port) both in LISTEN — that needs SO_REUSEPORT. SO_REUSEADDR also enables binding specific IPs on a port whose wildcard is taken (one process per virtual-host IP, old-style web hosting).
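The ordering that matters (socket, then SO_REUSEADDR, then bind) can be sketched as a listener setup function. The name tcp_listen_reuse, the wildcard address, and the backlog of 16 are illustrative assumptions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a listening TCP socket on the wildcard
 * address. Port 0 lets the kernel pick a free port.
 * Returns the listening descriptor, or -1 on error. */
int tcp_listen_reuse(uint16_t port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Must come BEFORE bind: bind is the call that would otherwise
     * fail with EADDRINUSE while old connections sit in TIME_WAIT. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0)
        goto fail;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto fail;
    if (listen(fd, 16) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

For the scenario traced above the call would be tcp_listen_reuse(9877); setting the option after bind has no effect on the bind that already failed.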
4.11 Nonblocking sockets and nonblocking connect
The flag lives on the descriptor, set with fcntl:
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK); /* read-modify-write — never just SET */
Behaviour change per call: read/recvfrom with no data → EWOULDBLOCK instead of sleeping; write with a full send buffer → partial write or EWOULDBLOCK; accept with no connection → EWOULDBLOCK; and most interestingly connect returns immediately with EINPROGRESS while the handshake proceeds in the background. The completion protocol (a small algorithm worth memorising):
- connect → EINPROGRESS (if it returns 0, the connect finished at once — localhost);
- select for writability with your chosen timeout;
- on writability, fetch SO_ERROR with getsockopt: 0 = connected, else the errno of failure (writable-on-error is §3.7's rule in action);
- timeout expired → close the socket, report your own timeout.
This is how real clients impose a 3-second connect timeout instead of waiting out the kernel's default (75 seconds on classic BSD stacks, roughly two minutes on Linux with default settings): the payoff promised back in the connect() lesson.
Exam pointers
- "Explain SO_LINGER with the structure" — the three-row table is the answer skeleton; the RST/no-TIME_WAIT row is where marks hide.
- "Why must every TCP server set SO_REUSEADDR?" — the four-step restart trace; explicitly say the conflict is with a TIME_WAIT connection, not another listener.
- "What is the Nagle algorithm? When would you disable it?" — coalesce small segments while an ACK is outstanding; interacts with delayed ACK (the 40 ms anecdote); TCP_NODELAY for interactive traffic; add that writev solves the self-inflicted version (header+body in two writes).
- "Differentiate readv/writev from recvmsg/sendmsg" — scatter/gather only vs scatter/gather + address + ancillary data + flags; "most general I/O functions" is the expected phrase.
Check yourself
- A server sets SO_RCVBUF after accept and wonders why the window scale didn't change. What went wrong, and on which socket should it have set the option?
- close() returns 0 instantly — what do you actually know about your last 100 KB of sent data, under each of the three linger settings?
- Which option + which mechanism detects a peer host that crashed while the connection sat idle? How long does it take by default, and which Unit-2 scenario does it cure?
- Why is descriptor passing (SCM_RIGHTS) impossible over a TCP socket between two machines? (What is a descriptor, really?)
- Write the four-step nonblocking-connect-with-timeout recipe from memory, naming the select set and the socket option used.