BGPTrace – BGPtrace Tech Blog

Geo-Mapping ASN and Prefixes: Still Relevant?

I am currently revamping the entire BGPtrace system. Many ideas, but limited time.

One topic I keep thinking about is: how important is the country code associated with an ASN or prefix? We understand that it can affect localization applications or impose IP-level service restrictions.

Applications may attempt to force the client (browser or app) to use location data for verification (yes, we know about GPS spoofing tricks).

However, at the ISP or network service provider level, with multihoming and Anycast structures, prefixes are advertised globally without strict geographic boundaries. So, is it still useful to assign a country code to them? (Yes, we could use the country of company registration, but that’s a different issue.)

This ties into another topic: summarizing traffic by country code. For that, probes or NetFlow exporters should be geographically distributed. Only then can accurate data reports be produced. This is common practice in Internet traffic analysis.

If that’s the case, then using data from multiple RouteViews servers becomes essential for prefix analysis. Or perhaps combining tunneling with NetFlow/BGP route exports (an older method) is still the better approach?

Thoughts? Discussion welcome.

Can AI Replace Network Engineers? A Real-World Perspective from the Trenches

Today, I watched the latest YouTube video by NetworkChuck.
He discussed how AI is affecting network engineers. It was really interesting — and it motivated me to continue my journey as a network engineer.

Of course, I’m facing some difficulties right now.
If you have any open positions for network engineers, please feel free to message me anytime. I’d really appreciate it.

I’m currently based in the UK.
Yes, I could join Amazon — I mean, Amazon Flex.
No joking. I’m serious.

Right now, I’m especially interested in data center networking and hyperscale networking for AI.
Regardless of the use case — whether it’s business, research, or something more creative like cryptocurrency — I find these areas fascinating.

Also, Internet route checking (RPKI again? Yes, definitely) is something I want to explore further.

On the topic of device deployment — automation, zero-touch provisioning, scripting, etc.:

Yes, scripting in Bash or Perl (SSH into the device, wait for a prompt, automate the commands) has been used for a long time. But today, there are many modern approaches: agents, tailored remote access methods, tunneling deployments, and more.

How do they compare? Which method is best?
Ansible? Python scripting? YAML? Network programming?

And… you prepare every configuration manually? (No way — maybe for 10,000+ devices?!)

OK, let’s step back. Here’s how it usually goes:

Gather and analyze all requirements
Define the needed solutions
Prepare configurations
Plan deployment and coordinate across multiple parties

So the question is: what can AI help with?

I asked ChatGPT about step 3 — configuration preparation and the best automation method.
It gave me this:

Use Case                  | Best Tool
--------------------------|--------------------------
First-time provisioning   | 🟠 ZTP
Day-2 config, updates     | 🟢 Ansible
Dynamic runtime scripting | 🔵 Python

You don’t have to choose just one — combining them leads to full end-to-end automation.

Great! But… what about the rest of the steps to complete a full solution?
Can AI help us with the entire plan?

I think this is how you should approach it:

Analyze the requirements in detail
Identify what the solution needs
Prepare the configurations
Design the deployment plan

You’d need to provide AI with a comprehensive scope of work, and then ask it to generate a plan.

But can you trust it?

YOU need to verify all outputs from AI.
YOU need to plan the project timeline and human resources.
YOU need to review the configurations generated by AI carefully.

If you don’t know networking, project planning, or resource management — how would you even do that?

That’s why:
YOU = Network Engineer, Network Project Manager

We’re always learning new things. AI is becoming a great assistant, but right now, we shouldn’t fully trust it without verification.

As I said before:
Can AI operate the Internet?
Well… that depends on how the AI is designed to manage networks, and which parts of the network it’s capable of operating.

Discussion continues…

TCP Congestion Control – again

This topic keeps coming up again and again:

Which TCP congestion control algorithm has the best performance?
Which handles high latency, delays, and bottlenecks better?
And some might ask, what does it mean when the congestion window becomes zero? AH…

Yes, it’s true — no matter how much technology improves, how fast chipsets and processors become, or how advanced fast-forwarding gets — congestion still happens.

Today’s topic isn’t about local datacenter LAN-side congestion (which may be caused by high network usage in clustered, fabric-based structures — AI workloads, maybe?).

We’re talking about long-haul network connections — like between two distant endpoints. Take Starlink, for example: a satellite-based system with unpredictable latency and environmental effects.

So, how are these “OLD” TCP congestion control algorithms still being used today?

Can monitoring TCP congestion control behavior — like how the congestion window changes — be used to detect potential congestion and trigger a BGP or transit path switch?
Maybe — but it depends on the access technology. If the last mile is wireless, measurements might not be reliable enough.

So, which one is best suited for the modern era?

Here’s a list of Linux-supported TCP congestion control algorithms (summarized by ChatGPT):

highspeed: Designed for networks with large bandwidth-delay products.

reno: The traditional TCP congestion control algorithm.

cubic: Default in Linux since kernel 2.6.19; optimized for high-speed networks.

bbr: Developed by Google (since kernel 4.9); focuses on bottleneck bandwidth and round-trip time.

bic: Binary Increase Congestion Control; predecessor to CUBIC.

htcp: Hamilton TCP; designed for high-speed, long-distance networks.

vegas: A delay-based algorithm that adjusts the sending rate based on RTT variations.

westwood: Optimized for lossy networks; adjusts congestion window based on bandwidth estimation.

yeah: Yet Another Highspeed TCP; combines delay and loss-based congestion detection.

hybla: Improves performance over high-latency networks by compensating for RTT.

illinois: Dynamically adjusts the congestion window for high-speed, long-distance links.

lp: Low Priority; designed for background traffic to yield to more important flows.

veno: Hybrid of Reno and Vegas; optimized for wireless networks.

scalable: Increases congestion window more rapidly than traditional TCP.

nv: New Vegas; an improvement over TCP Vegas.

cdg: Congestion Distance; uses delay gradients to detect congestion.

dctcp: Data Center TCP; optimized for data center networks using ECN.

www.pingnetbox.com project

PingNetBox – https://www.pingnetbox.com is developed using Python and PHP, and it runs on a Raspberry Pi to act as a probe for network measurement.

The source code is available at: https://github.com/royleung01

It’s easy for me to share this small-scale measurement tool with a web interface and reporting features. Of course, compared to RIPE Atlas, this is a much smaller project—on a completely different scale.

The Raspberry Pi operates with low power consumption while delivering high performance. It fulfills basic network testing needs and can also function as an iPerf probe. Since it’s Linux-based, tools like Speedtest-CLI can also be used for measurements.

It seems promising! I’m thinking of developing PingNetBox version 2 with more embedded features for the Raspberry Pi.

Rasp-SDWAN, perhaps? IPSEC?

#raspberryPI #speedtest #RIPE #atlas #IP #network #measurement #Internet #github #IPSEC

Can AI Operate the Internet?

Skynet? No, no Terminator, please.

Just joking—don’t jump to the movie scene right away. Let’s consider the idea of AI operating a network, similar to how automation handles network device configuration and zero-touch provisioning. Is it possible?

Let’s start with automation. A device comes with an initial configuration that includes an IP address and gateway. Once it’s network-accessible, the central server recognizes the device and pushes the final configuration. That’s a simple flow, right?

But what happens next—who operates the device?
Today, it’s still a human network operator.

Could AI replace the network operator? It depends on how the AI model is trained, how it handles fault tolerance, and how it avoids operational errors. There’s no such thing as 100% guarantee.

However, using AI to operate parts of the network—such as traffic tuning, AS path changes, or metric modifications—seems realistic.

AI can assist with tasks like network measurements, route and traffic tuning, DDoS detection and auto-mitigation (which already exists), device fault detection and traffic redirection, and DNS updates. That all seems feasible. You could train an AI module specifically for this purpose.

But then again—do we need AI if existing technologies already meet these requirements?

Let’s consider a scenario and see what AI (specifically, ChatGPT) can do.

Scenario:

An Internet provider has 2 transit links:
- Link A: 1 Gbps
- Link B: 10 Gbps
The subnets 10.0.0.0/24 and 11.0.0.0/24 (AS 65000) are advertised to Link A.
The same subnets, but with AS path prepending (65000, 65000), plus 12.0.0.0/24 (65000), are advertised to Link B.
Link A is congested—11.0.0.0/24 is using nearly 900 Mbps.
Link B’s total usage is just 5 Gbps.
A customer on 10.0.0.0/24 complains they cannot access the Internet.

ChatGPT’s Suggested Operation:

Summary:

Remove or reduce AS path prepend on Link B.
Optionally add prepends on Link A.
Let traffic shift to Link B (10 Gbps).
This will relieve congestion on Link A.
It should also restore Internet access for users on 10.0.0.0/24.

Business Follow-Up:

“From a technical perspective, that works. But from a business point of view, Link A is now underutilized. If Link B has a higher cost per bps, this is inefficient.”

ChatGPT’s Response:

Aspect	Strategy
Technical QoS	Offload heavy prefixes like 11.0.0.0/24 to Link B
Cost Control	Keep lighter prefixes like 10.0.0.0/24 on Link A to avoid overusing B
Business Fit	Avoid paying premium for B unless capacity is needed
Scalable Option	Automate prepend adjustments using monitoring-based thresholds

Yes, that’s a great example of how AI can analyze your needs if you provide enough detail and criteria.

The remaining concern is AI’s operational cost and power consumption.

This topic definitely deserves more discussion—SD-WAN + AI?

#BGP #bgptrace #AI #chatGPT #AS #ASPrepend #path #link #transit #INTERNET #SDWAN

Test with ECMP on Linux – Part 2

Continuing from the previous article on ECMP on a Linux machine, the goal is to enhance overall performance, but it will always be limited by the session-wise characteristics of traffic.

Surprisingly, something interesting happens during a multiple-flow Speedtest. When a single test generates multiple traffic flows using different port numbers (Layer 4), the aggregated result shows a higher combined throughput—regardless of whether the traffic is routed through two different public IPs with NAT.

For example, consider two links:

150Mbps Download / 30Mbps Upload (DIA)
30Mbps Download / 30Mbps Upload (DIA via public WiFi)

A Linux gateway is configured to use ECMP with two next-hop routes pointing to these links. When the traffic test starts, the portal or app will display only one of the link’s public IP addresses. However, the test results show 170+ Mbps download and 40+ Mbps upload. WOOOO!!!

Of course, this is just a traffic test—similar to running iPerf with multiple flow tests and aggregating the results. So, yes, it’s possible!

I believe some vendors use similar techniques for load sharing. Since it’s Linux-based, that shouldn’t be an issue.

Any other ideas? I’m also thinking about UDP-based video streaming—should we be considering application-layer optimization?

#ECMP #internet #NAT #IP #loadsharing #BGP #DIA #Speedtest #IPERF #measurement #traffictest

Test with ECMP on Linux

I was reading an article about ECMP (Equal Cost Multipath) for traffic load sharing, and it brought back memories of my previous traffic engineering tests. It seems simple at first glance, but it’s actually more complex—especially when it comes to policy-based routing.

The challenge lies in determining traffic redirection and sharing in a session-wise connection, whether with or without NAT, across multiple links or circuits with different latencies. There’s also the complication of firewall interception with asymmetric return traffic. These factors make achieving ideal traffic load sharing quite difficult.

Of course, if tunneling is involved, things get simpler. It essentially blinds both endpoints and allows you to add two routes with the same metric in overlay routing. However, it doesn’t clearly explain why load-sharing performance behaves the way it does.

What about service enhancement? If the primary link becomes congested, should the secondary link pick up some of the traffic? That’s not exactly round-robin behavior—it would require active measurement and monitoring of the links. Maintaining session flow on the primary link while redirecting new flows to the secondary link sounds ideal, but it’s difficult to implement. For MPLS-TE, that’s straightforward—but what if you have two internet links, like one DIA (Direct Internet Access) and one mobile network? How would you handle that?

Well, just for fun, I haven’t done any serious measurements yet. But after setting up load sharing on my node, it seems to be working—though I haven’t really thought through the next steps. Running a Speedtest shows that the flows (by ports) are transmitting separately. Hmm… not ideal, but not bad either. But what about other applications? If they’re using two different IP addresses for outgoing traffic—ahhhh…

Let’s discuss this, bro.

Enable 2 Multipath load sharing
sudo ip route add default scope global \
nexthop via 192.168.X.X dev XXX weight 1 \
nexthop via 192.168.X.X dev XXX weight 1

For multipath routing, disabling connection tracking for locally-generated traffic helps
sudo sysctl -w net.netfilter.nf_conntrack_tcp_loose=0

Enable Layer 4 Hashing
sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1

Enable IP Forwarding
sudo sysctl -w net.ipv4.ip_forward=1

Force More Aggressive Flow-Based Balancing:
Set rp_filter to 0 (disable reverse path filtering) so the kernel won’t drop asymmetric traffic
sudo sysctl -w net.ipv4.conf.all.rp_filter=0

Flush all route cache
sudo ip route flush cache

#ECMP #Linux #Internet #Routing #IP #Firewall #Tunneling #MPLS #trafficEngineering #ChatGPT

Looking glass function provided by RIPE Atlas?

I performed some traceroute tests using the public looking glass of another organization/provider. I found that some test functions, like Ping and Traceroute, were launched using RIPE Atlas probes. It looks impressive and kind of funny.

In the previous year, the provider developed a web interface and API to launch commands from their own PE (Provider Edge) or Internet BG (Border Gateway) routers and return the results. The geographical router list allows users to select region-based tests.

This seems to be a new method using RIPE Atlas, where queries can be made via an API. The web interface lets users select which probe to use for the measurement, deducting the web provider’s “RIPE Atlas Credits” for each test.

However, I’m wondering — since looking glass aims to provide insights into a specific network provider’s or AS owner’s network — if we’re using this method, why not just go to the official RIPE Atlas website to launch the test?

Well, I guess the more user-friendly web portal makes it easier for users.

Pingnetbox – http://www.pingnetbox.com

#ripe #atlas #lookingglass #measurement #ping #traceroute #test #internet #AS #chatgpt #proofreading

Starlink Satellites’ Movement Proven by Periodical Measurement – Part 2

Tuning the measurement to 5 minutes each, the result portal summarizes the data in a single file by RIPE Atlas Probe ID.

The results show a predictable pattern of latency changes with increases and decreases, which may indicate satellite movement. We assume that the latency between the ground station, CDN server, and client site remains constant (unless under a DDoS attack… um…).

With the current resources available on RIPE Atlas, can we compare country-based latency and service levels of Starlink? Ah, that should probably be done by the Starlink NOC…

https://www.bgptrace.com/atlas/starlink

#starlink #CDN #cloudflare #satellites #ping #latency #movement #probe #RIPE #atlas

Starlink Satellites’ Movement Proven by Periodical Measurement

Using the Atlas RIPE probe (what a great network measurement platform!), we selected the probe, which uses Starlink to continuously measure connections to CDN servers.

We assume that, no matter which Starlink satellites are passing over the area, the network service connection will still be provided to the same region. For example, if the satellites are crossing the US regions, it doesn’t matter which satellite; it will send data back to a US-based station on the ground.

The test seems a bit funny, but the latency trend appears to follow a pattern. It shows a progression from high latency to low latency and then back to high latency over time. Assuming that the ground station link is a fixed connection to destination CDN server, the latency remains constant. Therefore, the movement of the satellites affects the latency. When the latency decreases, it suggests that another satellite has taken over that area, and the roaming process is complete.

You can think of this like when you are using a mobile device. As you move from one cell site (A) to another (B), roaming occurs, which registers your device from Cell Site A to Cell Site B. This is a similar process with satellites.

Now, back to the Starlink client probe: if its location doesn’t change, then as the satellites move through space, the distance between the satellites and the probe site will increase, and latency can indicate this. When the latency decreases, we may assume that another satellite has taken over the service coverage (similar to the roaming process). This is because the satellites do not move backward.

Moreover, does the change in latency over time affect the user experience?
For instance, during a video or voice call, latency may fluctuate—increasing or decreasing.

However, live gameplay presents a different scenario. Unlike calls, it often relies on a stable connection. A fixed connection typically doesn’t exhibit the same fluctuating physical characteristics, making latency more predictable in gaming environments.

Currently, measurements are taken every 15 minutes. If we shorten this test period, we may get more accurate insights into this operation.

https://www.bgptrace.com/atlas/starlink

#starlink #satellites #probe #RIPE #atlas #internet #measurement #roaming #cellsite #cell #mobile #ping #latency