A network diagnostic

This is a checklist to diagnose network errors. It starts by outlining the quickest tests, which find the most general kinds of errors. As you go further along, it lists more specific possibilities you can look into.

This is far from comprehensive, but it aims to provide a solid starting point.

Diagnosing specific network configuration errors was a focus. I covered all of the following:

  • static ARP entry needs fixing
  • (if the reverse direction could have been configured, then it should also be testable)
  • ip address conflict (router side covered by “check router status”)
  • bad router address (mentioned)
  • over-broad netmask on route table entry
  • one-way connectivity at L1
  • bad DNS
  • MTU / PMTU blackhole
  • HTTP proxy server

The document is in sections as follows:

1. Accept the initial report

Be brief, but start by recognizing the report, your impressions, and your expectations (known configuration etc).

Usually, you want to start by deciding where the problem falls –

  • local: a problem connecting to your network or router.
  • non-local: a problem connecting from the local router to the internet.

You probably already tried a web browser, and checked a standard homepage or search engine. If that failed, you’ll also have asked if it works on another computer on the same network.

If you want to find a root cause for a problem which can reoccur, then investigate carefully. You will need to avoid making changes until you’ve identified a specific problem. With practice you can run diagnostics quickly, without immediately resorting to the Windows troubleshooter.

2. Check for onscreen alerts

Look for any onscreen notifications or alerts. Inspect the onscreen network icon specifically.

Network icon shows as completely disconnected -> proceed to “6. Local failure”.

Note the above is qualified. I know two potential alert icons which differ from the disconnected icon:

  1. Modern Windows tries to test non-local connectivity as well. It shows a different mark on the icon when this fails. The mark won’t always be up to date; you can’t expect it to be re-tested continuously.
  2. Historical Windows had a default fallback for router-less communication. It became most significant as a trap to confuse the unwary troubleshooter. Thankfully, a specific mark was added to the network icon.

It’s very useful to become familiar with the network icon. That said it’s best considered as a hint, which you can verify as you proceed.

It’s almost equally informative when you see the network icon is not completely disconnected. If the standard auto-configuration succeeded, that means some packets were successfully exchanged with the local network (e.g. DHCP or RA). This leaves a number of possibilities which we can proceed to test.

  1. problem is between router and internet (non-local)
  2. packets are dropped intermittently, e.g. lossy wireless
  3. local problem started later, after the auto-configuration
  4. network connection was actually configured without packet exchange (“static IP”)
  5. failure is more specific, e.g. dropping of larger packets

3. Confirm problem with internet access

Expert troubleshooters might skip this and try to ping the router first, which is step 4. Otherwise, people tend to start with the command below. This combines several tests in the form of one elegantly simple command:

$ ping google.com

Experts will notice the following pitfalls: Some websites block ping. On older *nix systems, the ping command only tests IPv4 (IPv6 needs ping6). DNS results may come from an old cache entry, e.g. on Windows.

High latency -> over 100-200ms is notably undesirable. It could be the bufferbloat bug, if a network link is 100% utilized. Mobile data or satellite can tend to have higher delays.

Packet loss -> 10% (consistently) is considered untenably high. The specific effects of packet loss can vary depending on other factors.
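As a rough illustration of those two thresholds, here is a minimal sketch that reads the summary lines of a ping run and flags them. The function name classify_ping is my own invention, and the parsing assumes the summary format printed by Linux iputils (or BSD) ping; other implementations may need adjusting.

```shell
#!/bin/sh
# Classify the summary printed by `ping -c N host` (read on stdin).
# Thresholds follow the text: 10% packet loss, 200 ms average latency.
classify_ping() {
    awk '
        / packet loss/ {
            # e.g. "4 packets transmitted, 4 received, 0% packet loss, time 3004ms"
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
            seen = 1
        }
        /min\/avg\/max/ {
            # e.g. "rtt min/avg/max/mdev = 23.1/24.5/26.0/1.1 ms"
            split($4, rtt, "/"); avg = rtt[2]
        }
        END {
            if (!seen)                print "no summary found"
            else if (loss + 0 >= 100) print "no replies"
            else if (loss + 0 >= 10)  print "high packet loss: " loss "%"
            else if (avg + 0 > 200)   print "high latency: " avg " ms"
            else                      print "ok"
        }'
}

# Usage: ping -c 10 google.com | classify_ping
```

A single run of ten packets is a small sample, so treat the result as a hint and repeat the test before drawing conclusions.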

Wireless – if packets are being dropped, and there’s no router response in the form of an ICMP error -> first ensure good line of sight (and range, maybe 5 metres).

Wireless – you have no response at all despite having good line of sight -> test if you can re-connect.

3.1 If ping succeeds, but web browser has problems

-> interesting

So it could be a browser issue. Particularly check proxy setting (and connectivity).

Maybe SSL issue – try something without SSL?

Maybe the problem only becomes visible with full-size packets – see “9. Test path MTU”.

4. First hop router

4.1. Identify first hop router

You can try to provoke a response from the router (in theory it may be blocked or spoofed), e.g.

$ traceroute -n 8.8.8.8

C:\> tracert -d 8.8.8.8

Otherwise, just read out the routing table. The command is not as well-known, but it’s very important. E.g.

$ route

C:\> route print

Check the route that’s being used. One example this could show, is if you have the wrong netmask on a local route entry.
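To make the netmask check concrete, here is a small sketch in POSIX sh. The helper names (default_gateway, ip_to_int, in_prefix) are hypothetical, and default_gateway assumes the output format of the Linux command ip route. The idea is that a remote address should not match your local route; if it does, the local route's netmask is too broad and packets for that address will never be sent to the router.

```shell
#!/bin/sh
# Print the default gateway from `ip route` output read on stdin.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) if ($i == "via") { print $(i + 1); exit }
    }'
}

# Convert a dotted-quad IPv4 address (no leading zeros) to an integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside network $2 with prefix length $3.
in_prefix() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Usage:
#   ip route | default_gateway
#   in_prefix 192.168.200.5 192.168.0.0 16 && echo "matches the local route"
```

With a correct local route of 192.168.7.0/24, the address 192.168.200.5 does not match and goes via the router. With an over-broad 192.168.0.0/16 it does match, so the host would try to reach it directly on the local link.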

4.2. Contact first hop router

$ ping $ROUTER # (may theoretically be blocked)

See step 3 for ping analysis. If it appears that ping might be blocked, you can try checking e.g.

$ arping -b $ROUTER  # IPv4, Linux only

$ arp -a  # check arp table

ARP (or ipv6 ND) cannot be blocked on its own. IP communication would also be blocked.

You may see a static ARP entry. This has been configured instead of discovered dynamically – could be wrong or out of date.
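On Linux, static entries show up with the state PERMANENT in the output of ip neigh show. The sketch below filters for them; the function name static_neighbours is made up for this example, and it assumes the usual "ADDRESS dev IFACE lladdr MAC STATE" line layout. (On Windows, arp -a has a Type column showing static or dynamic.)

```shell
#!/bin/sh
# List static (PERMANENT) neighbour entries from `ip neigh show` output on stdin.
# Prints "address mac" for each one.
static_neighbours() {
    awk '$NF == "PERMANENT" {
        mac = "?"
        for (i = 1; i < NF; i++) if ($i == "lladdr") mac = $(i + 1)
        print $1, mac
    }'
}

# Usage: ip neigh show | static_neighbours
```

If the router appears in that list, compare the MAC address against the label on the router or its status page before deciding whether the entry is stale.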

Router ping/ARP/ND ok -> “7. Test global connectivity”.

Note ARP/ND protocols use small packets and retries. It’s not quite the same as the auto-configuration packet exchange, but in theory it’s exercising rather similar connectivity.

NOTE: the remainder of this document is a bit rough :NOTE

ARP works but not ping, traceroute, or admin interface -> maybe the host has a duplicate IP address. Ideally inspect the ARP table on the router. Otherwise check with arping -D. Windows will probably alert onscreen and in the event log. A business router might log an event.

5. ARP/ND failure

This is a solid indication. It contradicts good results from the network icon. Re-connect to confirm.

Icon re-connects -> re-run the previous tests to confirm (4.1 & 4.2). Hopefully, if stale ARP entries lasted longer than expected, this would also flush them out.

First-hop still failing after re-connection -> Check for static configuration.

If the other side could have been configured, then you also want to test the opposite direction. (Hosts may well block ping, but we’re interested in the effect on the arp table).

Static configuration -> network icon may only tell you if you’re actually physically disconnected (technically, pure layer 2 or below). See “6. Local failure”.

At the same time (FIXME: how?), consider mis-configuration. Both for static- and auto- configuration. Do you have the correct address for the first-hop router?

If the problem is that you can’t find any router address, then try running a packet sniffer. 1) as you cycle the connection, and 2) as you power-cycle the router. You should really be able to see the MAC address, and quite likely an IP address. With “Wireshark”, listen on “any”. If you listen on a specific interface, Wireshark will stop when the interface goes down, and it won’t re-start automatically.

One way connectivity at L1? -> might be detected by packet capture, at either end. Node will not receive autoconfig or ARP responses and will send repeated queries. (May want to provoke other traffic to test receive… if you’re not sure if any nodes are broadcasting you can power cycle router as above and it really should emit a few packets).

Otherwise this sounds bad, maybe wireless dies immediately after connecting. To analyze any further, would probably want clever packet sniffing or log file analysis.

6. Local failure

Trace along physical path to test transition points:

Wireless – shows no network -> First ensure good line of sight (and range, maybe 5 metres). Re-scan / try to re-connect. Check for a disabled radio. The radio is often disabled by a physical switch, or toggled from a special key on a laptop keyboard. Also check for a disabled radio on the wireless access point; you can probably see a status light for the radio.

Consider layer 2 related configuration – wireless (or wired auth).

Is network/router turned off or disconnected from computer?

Notice lights blinking constantly without pauses – possible packet flood.

Mis-configuration is probably less common. Most networks are simple and rarely re-configured.

The main reason to look out for mis-configuration is where you have multiple networking devices which could conflict (or be connected incorrectly). People sometimes use an old router as an extra wireless point. They may either be bridged (lan port to lan port) or routed (lan port to wan port). When bridged lan port to lan port, hosts should get configured with the real router address.

Bridging multiple routers creates a mis-configuration risk of re-using the same IP address, or the secondary router advertising itself over DHCP. This might show itself when you access status/configuration details and they don’t match the right device. If it’s not immediately apparent it’s a really annoying problem. It’s easy to detect with a packet sniffer or a dedicated test, or even by mucking around with MAC addresses or disconnecting wires. The problem is it could easily be mistaken for a global connectivity problem. I.e. the router WAN interface is down, because you’re looking at the router with nothing connected to the WAN port.

If you start looking at the router, note the router’s status lights and model. If they don’t match the router status fetched over the network, you could be talking to the wrong device. If you don’t notice then it tends to be very frustrating!

I.e. look out for this if you check for disconnected internet cable or admin password. There should be a green/yellow light for WAN/internet. (You might not notice this if you’re colour-blind. If you want to double-check it, there’s an app for that. Seriously – try the original DanKam, or HueView).

7. Test global connectivity

Router status information.

Router considers internet connection ok -> “8. Test DNS”. Particularly if you’ve been reconfiguring it. Even without that, if your ISP accepts connections but internet seems down, then a problem with DNS servers sounds quite plausible.

DNS ok -> traceroute -n google.com

8. Test DNS

What DNS server is configured? Does it respond to queries? Do you get useful answers? You probably use a resolver on your local router, check what it’s using as upstream DNS. (Router may have DNS test page). Does the upstream DNS respond to queries? traceroute to upstream DNS.

DNS servers respond to ping but not DNS – maybe large responses (DNSSEC) are getting dropped -> “9. Test path MTU”.

DNS note – your computer could have a local cache (Windows default, Linux with nscd, unbound or systemd-resolved).
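For the first question (which DNS server is configured?), a minimal sketch on Linux is to read the nameserver lines from /etc/resolv.conf. The function name configured_nameservers is hypothetical. Note that an answer of 127.0.0.53 usually means the systemd-resolved stub is in use, in which case resolvectl status shows the real upstream servers.

```shell
#!/bin/sh
# Print the nameserver addresses from a resolv.conf-style file.
# $1 is the file to read; defaults to /etc/resolv.conf.
configured_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage: query each configured server directly, bypassing any local cache.
#   for ns in $(configured_nameservers); do
#       dig +short +time=2 +tries=1 @"$ns" example.com
#   done
```

Querying each server by address with dig separates "the server does not respond" from "the local cache is serving stale answers".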

9. Test Path MTU

Test larger packets, making sure to set “Do not Fragment”, e.g. ping -M do -s 1500. (The MTU could be higher in some cases, so to be thorough you need to check that.) That command should cause at least one round of size / fragmentation errors. Retry with the size indicated, subtracting 28 bytes for the IPv4 (20) and ICMP (8) headers; this is how Path MTU Discovery works. An example session is shown below.

The exact commands for Windows are different: ping -f -l 1500. But then you guys can always download mturoute.exe and feel superior :).

If full-size packets are dropped randomly -> it’s not a simple PMTU issue. There’s probably random interference on the path, like bit errors – longer packets are more likely to get hit by an error. ping will detect altered packets, but in most cases they will be dropped at an earlier point due to failing the checksum.

The normal problem with MTUs is that larger packets are being discarded, but your computer isn’t receiving the corresponding ICMP error packet, so it has no way to learn what size of packet it’s allowed to send.

It’s pretty easy if the error is in the transmit direction. If the ping reply is considered too large, the ICMP error will be sent to the target, and not to you. So the receive direction is harder to work out.

$ tracepath 8.8.8.8
1?: [LOCALHOST] pmtu 1500
1: gateway 3.486ms
1: gateway 3.500ms
2: gateway 3.365ms pmtu 1492

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=24.1 ms

$ ip link show dev wlp2s0
3: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000

$ ping -M do -s 1500 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1500(1528) bytes of data.
ping: local error: Message too long, mtu=1500

$ ping -M do -s $((1500-28)) 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
From 192.168.7.1 icmp_seq=1 Frag needed and DF set (mtu = 1492)
ping: local error: Message too long, mtu=1492

$ ping -M do -s $((1492-28)) 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1464(1492) bytes of data.
1472 bytes from 8.8.8.8: icmp_seq=1 ttl=55 time=56.5 ms
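The session above can be automated with a binary search over packet sizes. This is only a sketch: the names icmp_payload, find_pmtu, probe_ping and the TARGET variable are my own, and probe_ping assumes Linux iputils ping. It treats any failure as "too big", so it finds the working size even in the blackhole case where no ICMP error comes back, but it cannot tell you which of the two happened.

```shell
#!/bin/sh
# ICMP echo payload size that produces an IPv4 packet of the given total size:
# subtract 20 bytes of IPv4 header (no options) and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Binary-search the largest packet size in [$2, $3] for which "$1 SIZE" succeeds.
# $1 is the name of a probe command. Assumes the lower bound $2 itself passes.
find_pmtu() {
    probe=$1 lo=$2 hi=$3
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid"; then lo=$mid; else hi=$(( mid - 1 )); fi
    done
    echo "$lo"
}

# One possible probe: a single ping with "Do not Fragment" set.
# TARGET is a hypothetical variable naming the host under test.
probe_ping() {
    ping -M do -c 1 -W 2 -s "$(icmp_payload "$1")" "${TARGET:-8.8.8.8}" >/dev/null 2>&1
}

# Usage: TARGET=8.8.8.8 find_pmtu probe_ping 576 1500
```

Because each probe sends a single packet, random loss can make the result too low; re-run it if the answer looks suspicious.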

General outline

  1. http://google.com
  2. ping google.com

some non-local website works but others don’t -> TODO

can’t ping google -> find router and ping it
router doesn’t respond to ping -> ARP router
can’t ARP router -> trace L2 (inc wireless)
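The outline can also be written down as a tiny decision function. This is just a sketch of the flow in this document, with a made-up name (next_step); each argument is the result of one quick test, given as "ok" or "fail".

```shell
#!/bin/sh
# Map the results of the three quick tests to the next step in the outline.
# $1: ping to the internet, $2: ping to the router, $3: ARP to the router.
next_step() {
    if [ "$1" = ok ]; then
        echo "check browser, proxy, and path MTU"
    elif [ "$2" = ok ]; then
        echo "test global connectivity and DNS"
    elif [ "$3" = ok ]; then
        echo "check for duplicate IP address or blocked ping"
    else
        echo "trace layer 2 (including wireless)"
    fi
}

# Usage: next_step fail fail ok
```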

This document was edited with StackEdit. Thanks, StackEdit.
