Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Unit 2: Socket Options & Advanced I/O Functions

Lesson 8 of 15 in the free Network Programming notes on Siksha Sarovar, written by Rohit Jangra.

4.1 getsockopt / setsockopt

int getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen);
int setsockopt(int fd, int level, int optname, const void *optval, socklen_t optlen);

level selects the protocol layer: SOL_SOCKET (generic), IPPROTO_IP (IPv4), IPPROTO_IPV6, IPPROTO_ICMPV6, IPPROTO_TCP. Most options are integer flags or values; some are structs (linger, timeval).
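Here is a minimal sketch of the calling convention, assuming a POSIX system. The helpers get_int_opt and set_int_opt are hypothetical names, not library functions. For getsockopt, optlen is a value-result argument: you pass in the buffer size and the kernel writes back how many bytes it filled.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: read an int-valued option into *value.
   len is value-result: in = buffer size, out = bytes the kernel filled. */
int get_int_opt(int fd, int level, int optname, int *value)
{
    socklen_t len = sizeof(*value);
    if (getsockopt(fd, level, optname, value, &len) < 0) {
        perror("getsockopt");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: set an int-valued option (flags: 0 = off, nonzero = on). */
int set_int_opt(int fd, int level, int optname, int value)
{
    if (setsockopt(fd, level, optname, &value, sizeof(value)) < 0) {
        perror("setsockopt");
        return -1;
    }
    return 0;
}
```

For example, `get_int_opt(fd, SOL_SOCKET, SO_TYPE, &v)` on a TCP socket leaves `SOCK_STREAM` in `v`.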

4.2 Generic (SOL_SOCKET) options — the ones with stories

Option | Effect | The story
SO_REUSEADDR | allow bind to a port in TIME_WAIT | every TCP server should set it; otherwise a restart within 2·MSL fails with EADDRINUSE
SO_KEEPALIVE | probe an idle peer (~2 h default) | detects the Unit-2 "host crashed" silence; server-side housekeeping
SO_LINGER | control close() behaviour | struct linger: off = default (close returns at once, kernel delivers in background); on + 0 = RST, data discarded; on + N = close blocks ≤ N sec for delivery
SO_RCVBUF / SO_SNDBUF | socket buffer sizes | receive buffer = TCP's advertised window; for high bandwidth×delay paths set it before connect/listen (window scale is negotiated in the SYN)
SO_RCVLOWAT / SO_SNDLOWAT | low-water marks for select readiness | tune when select says "ready"
SO_RCVTIMEO / SO_SNDTIMEO | I/O timeouts (struct timeval) | timeout method #3 below
SO_BROADCAST | permit sending to broadcast addresses | required before any broadcast (Unit 3)
SO_ERROR | fetch & clear pending error | how nonblocking connect reports success/failure
SO_REUSEPORT | multiple sockets on one port (load balancing) | modern multi-process accept

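Socket-state caveat: some options only take effect if they are set at the right moment. Buffer sizes must be set before the connection exists: before connect() on a client, and on the listening socket before listen() on a server. Options set on a listening socket are inherited by the sockets that accept() returns, so set SO_KEEPALIVE etc. on listenfd.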
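The sketch below shows the server-side ordering, assuming IPv4 on the loopback interface. make_listener is a hypothetical helper, and the 256 KB buffer size is an arbitrary example value. Both options go on listenfd before listen(), so every socket that accept() returns starts life with them.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: loopback listener with options set on listenfd.
   Stores the kernel-chosen port in *port; returns listenfd or -1. */
int make_listener(unsigned short *port)
{
    int on = 1, rcvbuf = 256 * 1024;         /* illustrative size only */
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    if (listenfd < 0)
        return -1;

    /* Set on listenfd so accepted sockets inherit them. SO_RCVBUF must
       precede listen(): the window scale goes out in the SYN/ACK,
       before accept() ever returns. */
    if (setsockopt(listenfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
        setsockopt(listenfd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        goto fail;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */
    if (bind(listenfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listenfd, 16) < 0 ||
        getsockname(listenfd, (struct sockaddr *)&addr, &len) < 0)
        goto fail;

    *port = ntohs(addr.sin_port);
    return listenfd;
fail:
    perror("make_listener");
    close(listenfd);
    return -1;
}
```

If you read SO_RCVBUF back on Linux, expect about double the value you set. socket(7) documents that the kernel doubles it to allow for bookkeeping overhead.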

4.3 IPv4 / IPv6 / TCP level options

Level | Option | Use
IPPROTO_IP | IP_TTL | set TTL: traceroute's whole trick (Unit 4)
IPPROTO_IP | IP_HDRINCL | "I build the IP header myself": raw sockets
IPPROTO_IP | IP_MULTICAST_TTL / IP_ADD_MEMBERSHIP ... | multicast controls (Unit 3)
IPPROTO_IPV6 | IPV6_V6ONLY, IPV6_UNICAST_HOPS | dual-stack & hop limit
IPPROTO_ICMPV6 | ICMP6_FILTER | choose which ICMPv6 types a raw socket receives
IPPROTO_TCP | TCP_NODELAY | disable the Nagle algorithm
IPPROTO_TCP | TCP_MAXSEG | read/set MSS

Nagle in one paragraph (perennial viva): Nagle's algorithm delays small segments while an ACK is outstanding, coalescing keystroke-sized writes — great for telnet over WAN, deadly for latency-sensitive request/response (especially interacting with delayed ACKs: the infamous 40 ms stalls). Interactive/real-time protocols set TCP_NODELAY; bulk transfer leaves it alone.
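A minimal sketch of the opt-out follows, assuming fd is a TCP socket; disable_nagle is a hypothetical helper name. The option does not change what data arrives. It only changes when small segments leave the host.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: turn off Nagle coalescing on a TCP socket,
   as an interactive or request/response client might. */
int disable_nagle(int fd)
{
    int on = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    return 0;
}
```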

4.4 Socket timeouts — the three techniques

  1. SIGALRM around the blocking call — classic, but signal-global and racy.
  2. select with a timeout before read/write — portable and precise.
  3. SO_RCVTIMEO / SO_SNDTIMEO — set once, applies to all subsequent operations.

4.5 recv / send and the flags

ssize_t recv(int fd, void *buf, size_t n, int flags);
ssize_t send(int fd, const void *buf, size_t n, int flags);
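Flag | Meaning
MSG_DONTWAIT | this call only: non-blocking
MSG_PEEK | look at the data without consuming it
MSG_WAITALL | block until the full n bytes have arrived (a signal, EOF or error can still cut the call short)
MSG_OOB | send/receive out-of-band (urgent) byte
MSG_DONTROUTE | bypass the routing table; the destination must be on a directly attached network (direct LAN)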
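Here are three of these flags in use, as a sketch assuming Linux or a BSD, since MSG_DONTWAIT is not available everywhere. peek_nowait and recv_exact are hypothetical helpers. recv_exact keeps looping even with MSG_WAITALL, because the flag does not rule out a short return.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Look at up to cap queued bytes without consuming them or blocking.
   Returns bytes peeked, 0 if the peer has closed, or -1 with errno
   EAGAIN/EWOULDBLOCK when nothing is queued yet. The descriptor itself
   stays in blocking mode; only this call is nonblocking. */
ssize_t peek_nowait(int fd, char *buf, size_t cap)
{
    return recv(fd, buf, cap, MSG_PEEK | MSG_DONTWAIT);
}

/* Read exactly n bytes. MSG_WAITALL asks the kernel to wait for all of
   them, but we still loop. Returns n, a short count on EOF, or -1. */
ssize_t recv_exact(int fd, void *buf, size_t n)
{
    size_t got = 0;

    while (got < n) {
        ssize_t r = recv(fd, (char *)buf + got, n - got, MSG_WAITALL);
        if (r < 0) {
            if (errno == EINTR)
                continue;                    /* interrupted: retry */
            return -1;
        }
        if (r == 0)
            break;                           /* peer closed: short count */
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```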

4.6 Scatter/gather: readv & writev

struct iovec { void *iov_base; size_t iov_len; };
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t readv (int fd, const struct iovec *iov, int iovcnt);

One atomic call gathers from / scatters to multiple buffers — e.g. write a header struct + payload without copying them together and without two writes (which Nagle + delayed-ACK would punish). writev is the clean cure for the header/body problem.
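Below is a sketch of the header/body case. It assumes a hypothetical wire format: a 4-byte big-endian length followed by the payload. send_framed and recv_framed are illustrative names, not a standard API. The receiving side assumes the whole frame is already queued, which a robust reader cannot assume.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Gather header + payload into ONE system call, so TCP never sees a
   lone 4-byte header write that Nagle would hold back. */
ssize_t send_framed(int fd, const void *payload, uint32_t len)
{
    uint32_t hdr = htonl(len);
    struct iovec iov[2];

    iov[0].iov_base = &hdr;
    iov[0].iov_len = sizeof(hdr);
    iov[1].iov_base = (void *)payload;       /* iov_base is not const */
    iov[1].iov_len = len;
    return writev(fd, iov, 2);
}

/* Scatter the frame into a header variable and a payload buffer.
   Sketch only: assumes one whole frame is queued; a real reader must
   read the header first and loop on short reads. */
ssize_t recv_framed(int fd, uint32_t *len, void *payload, size_t cap)
{
    uint32_t hdr;
    struct iovec iov[2];

    iov[0].iov_base = &hdr;
    iov[0].iov_len = sizeof(hdr);
    iov[1].iov_base = payload;
    iov[1].iov_len = cap;
    ssize_t n = readv(fd, iov, 2);
    if (n >= (ssize_t)sizeof(hdr))
        *len = ntohl(hdr);
    return n;
}
```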

4.7 recvmsg / sendmsg & ancillary data — the most general I/O

struct msghdr {
    void *msg_name;            /* address (like sendto/recvfrom)   */
    socklen_t msg_namelen;
    struct iovec *msg_iov;     /* scatter/gather (like readv)      */
    int msg_iovlen;
    void *msg_control;         /* ANCILLARY DATA                   */
    socklen_t msg_controllen;
    int msg_flags;             /* returned flags                   */
};

These two subsume all the other I/O calls (read/write/readv/writev/recv/send/recvfrom/sendto are special cases). Ancillary (control) data — cmsghdr records in msg_control — carries the exotic payloads:

  • descriptor passing (SCM_RIGHTS): send an open file descriptor to another process over a Unix-domain socket — how preforked servers hand connections around;
  • credentials (SCM_CREDENTIALS);
  • packet metadata: receiving interface & destination address of a UDP datagram (IP_RECVDSTADDR / IP_PKTINFO — needed by multihomed UDP servers), TTL, IPv6 hop limit.
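The following sketches SCM_RIGHTS descriptor passing over a Unix-domain socket, following the control-buffer layout shown in the cmsg(3) manual page. send_fd and recv_fd are hypothetical names. At least one byte of ordinary data should travel with the control message. The union keeps the control buffer correctly aligned for struct cmsghdr.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send descriptor fd_to_send over Unix-domain socket sock. */
int send_fd(int sock, int fd_to_send)
{
    char byte = 'F';                     /* the one byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                              /* union => cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```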

4.8 How much data is queued? — and sockets vs stdio

To learn how much is readable without reading it, use MSG_PEEK (with MSG_DONTWAIT) or ioctl(fd, FIONREAD, &n).

A hard-won warning: don't mix stdio (fprintf/fgets) with sockets. stdio's own buffering (line-buffered on a terminal, fully buffered elsewhere) interleaves unpredictably with the socket stream, and classic deadlocks result. Use read/write/readn/writen on sockets, full stop.
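Here is the ioctl route as a sketch. bytes_queued is a hypothetical helper. FIONREAD on sockets is a widely supported extension (Linux, the BSDs), not something POSIX guarantees.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>

/* How many bytes could a read() return right now without blocking?
   Returns the count, or -1 on error. Nothing is consumed. */
int bytes_queued(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```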

4.9 SO_LINGER, all three settings traced

The linger struct controls what close() means — three behaviours, one struct:

struct linger {
    int l_onoff;    /* 0 = off, nonzero = on */
    int l_linger;   /* seconds, when on      */
};
Setting | What close() does | What goes on the wire | When you'd want it
off (default) | returns immediately; kernel keeps trying to deliver buffered data, then FIN | data... FIN | almost always
on, l_linger = 0 | returns immediately; connection aborted | RST; buffered data in both directions discarded; no TIME_WAIT! | deliberately killing misbehaving peers; load-test tools avoiding TIME_WAIT exhaustion
on, l_linger = N | blocks up to N seconds until data is delivered and ACKed (or times out → EWOULDBLOCK) | data... FIN, but the application waits for it | when the app must know delivery happened before proceeding

Two examiner-grade footnotes. First, even the lingering close only confirms that the peer's TCP ACKed the data, not that the peer application read it. Only an application-level acknowledgement can promise that, which is the deep reason application protocols have their own confirmations. Second, skipping TIME_WAIT via the RST trick sacrifices exactly the protections TIME_WAIT exists for (lesson 1's two reasons), so name that trade-off whenever you mention it.
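The table's settings are applied with setsockopt, sketched below. set_linger and abortive_close are hypothetical helpers. abortive_close implements the "on, l_linger = 0" row, so it carries every TIME_WAIT caveat just discussed.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill struct linger and apply it: onoff 0 = default behaviour,
   nonzero = linger for `seconds` (0 seconds => abortive close). */
int set_linger(int fd, int onoff, int seconds)
{
    struct linger lg;

    lg.l_onoff = onoff;
    lg.l_linger = seconds;
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/* Row two of the table: close() discards unsent data and sends RST,
   so this end never enters TIME_WAIT. Use deliberately. */
int abortive_close(int fd)
{
    if (set_linger(fd, 1, 0) < 0) {
        perror("setsockopt(SO_LINGER)");
        return -1;
    }
    return close(fd);
}
```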

4.10 SO_REUSEADDR — the restart scenario, step by step

The story behind the "every TCP server" rule, traced:

  1. Server listens on port 9877; a client connects; the server (or its child) closes first — the server side enters TIME_WAIT for 2·MSL.
  2. The administrator restarts the server seconds later.
  3. bind(9877) fails with EADDRINUSE — not because anything is listening, but because a TIME_WAIT connection still references the port.
  4. With SO_REUSEADDR set before bind, the bind succeeds: the option means "binding is allowed even if connections in TIME_WAIT exist on this port".
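Step 4 is sketched in code below, assuming an IPv4 wildcard listener. tcp_listen_reuse is a hypothetical helper. The only thing that matters is the order: setsockopt before bind.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Restart-safe TCP listener on `port` (host byte order).
   SO_REUSEADDR must come BEFORE bind(): bind() is the call that would
   otherwise fail with EADDRINUSE while old connections sit in TIME_WAIT. */
int tcp_listen_reuse(unsigned short port)
{
    int on = 1;
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0)
        goto fail;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0)
        goto fail;
    return fd;
fail:
    perror("tcp_listen_reuse");
    close(fd);
    return -1;
}
```

A server would call `tcp_listen_reuse(9877)` at start-up, and the restart in step 2 then binds despite the TIME_WAIT connection. A second live listener on the same port still fails with EADDRINUSE, which matches the misconception warned about next.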

What it does not allow (common misconception, common trap question): two sockets simultaneously bound to the same (IP, port) both in LISTEN — that needs SO_REUSEPORT. SO_REUSEADDR also enables binding specific IPs on a port whose wildcard is taken (one process per virtual-host IP, old-style web hosting).

4.11 Nonblocking sockets and nonblocking connect

The flag lives on the descriptor, set with fcntl:

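int flags = fcntl(fd, F_GETFL, 0);                   /* needs <fcntl.h> */
if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
    perror("fcntl");   /* read-modify-write: F_SETFL with O_NONBLOCK alone would clear the other flags */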

Behaviour change per call: read/recvfrom with no data → EWOULDBLOCK instead of sleeping; write with a full send buffer → partial write or EWOULDBLOCK; accept with no connection → EWOULDBLOCK; and most interestingly connect returns immediately with EINPROGRESS while the handshake proceeds in the background. The completion protocol (a small algorithm worth memorising):

  1. connect → EINPROGRESS (if it returns 0, the connection completed immediately, as can happen when the server is on the same host);
  2. select for writability with your chosen timeout;
  3. on writability, fetch SO_ERROR with getsockopt: 0 = connected, else the errno of failure (writable-on-error is §3.7's rule in action);
  4. timeout expired → close the socket, report your own timeout.

This is how real clients impose a 3-second connect timeout instead of waiting out the kernel's own limit (75 seconds in Berkeley-derived implementations; other stacks differ). It is the payoff promised back in the connect() lesson.
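The four steps fit into one function, sketched below for IPv4 on a POSIX system. connect_timeo and connect_loopback_timeo are hypothetical names. The structure resembles connect_nonb in Stevens' UNIX Network Programming, but this is not that code. On any failure the function closes the socket, as step 4 prescribes, and it reports its own timeout as ETIMEDOUT.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 0 when connected (fd back in blocking mode); on failure closes
   fd and returns -1 with errno = the connect error or ETIMEDOUT. */
int connect_timeo(int fd, const struct sockaddr *sa, socklen_t salen,
                  int timeout_sec)
{
    int flags, n, err = 0, saved;
    socklen_t len = sizeof(err);
    fd_set wset;
    struct timeval tv;

    if ((flags = fcntl(fd, F_GETFL, 0)) < 0 ||
        fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    /* Step 1: start the handshake; 0 means it finished at once. */
    if (connect(fd, sa, salen) == 0)
        goto done;
    if (errno != EINPROGRESS)
        goto fail;

    /* Step 2: wait for writability, for at most timeout_sec seconds. */
    FD_ZERO(&wset);
    FD_SET(fd, &wset);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    if ((n = select(fd + 1, NULL, &wset, NULL, &tv)) < 0)
        goto fail;
    if (n == 0) {                        /* Step 4: our own timeout */
        errno = ETIMEDOUT;
        goto fail;
    }

    /* Step 3: writable means "finished"; SO_ERROR says how. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        goto fail;
    if (err != 0) {
        errno = err;
        goto fail;
    }
done:
    fcntl(fd, F_SETFL, flags);           /* back to blocking mode */
    return 0;
fail:
    saved = errno;
    close(fd);                           /* give up on this socket */
    errno = saved;
    return -1;
}

/* Hypothetical convenience wrapper: connect to 127.0.0.1:port. */
int connect_loopback_timeo(unsigned short port, int timeout_sec)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (connect_timeo(fd, (struct sockaddr *)&sin, sizeof(sin), timeout_sec) < 0)
        return -1;                       /* fd already closed */
    return fd;
}
```

The descriptor is switched back to blocking mode on success. That lets the rest of the client keep using ordinary blocking read and write.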

Exam pointers

  • "Explain SO_LINGER with the structure" — the three-row table is the answer skeleton; the RST/no-TIME_WAIT row is where marks hide.
  • "Why must every TCP server set SO_REUSEADDR?" — the four-step restart trace; explicitly say the conflict is with a TIME_WAIT connection, not another listener.
  • "What is the Nagle algorithm? When would you disable it?" — coalesce small segments while an ACK is outstanding; interacts with delayed ACK (the 40 ms anecdote); TCP_NODELAY for interactive traffic; add that writev solves the self-inflicted version (header+body in two writes).
  • "Differentiate readv/writev from recvmsg/sendmsg" — scatter/gather only vs scatter/gather + address + ancillary data + flags; "most general I/O functions" is the expected phrase.

Check yourself

  1. A server sets SO_RCVBUF after accept and wonders why the window scale didn't change. What went wrong, and on which socket should it have set the option?
  2. close() returns 0 instantly — what do you actually know about your last 100 KB of sent data, under each of the three linger settings?
  3. Which option + which mechanism detects a peer host that crashed while the connection sat idle? How long does it take by default, and which Unit-2 scenario does it cure?
  4. Why is descriptor passing (SCM_RIGHTS) impossible over a TCP socket between two machines? (What is a descriptor, really?)
  5. Write the four-step nonblocking-connect-with-timeout recipe from memory, naming the select set and the socket option used.