Note: This English article was translated from the Chinese version by an LLM.
Note: The events described in this article took place in 2021, approximately five years before this post was written. Some factual inaccuracies or missing details may remain.
Note: I have no formal background in security or cryptography. At the time of these events, I had not received systematic computer science training; at the time of writing, I have only completed part of an undergraduate CS curriculum.
Introduction
A recent wave of major controversy around censorship-circumvention proxy software prompted me to revisit an old incident. After submitting the original report, I never wrote a postmortem because of academic pressure at the time. This article is a retrospective account of how I identified the issue.
Background
In late August 2021, I was about to enter my final year of high school. In a developer group chat, someone suggested that it might be possible to design a protocol by imitating XTLS. I became interested and started analyzing how XTLS worked.
A Difficult Analysis Process
From a software engineering perspective, the XTLS codebase was extremely difficult to read. It contained many uncommented magic numbers, so I had to annotate large portions of the code while cross-referencing the TLS RFC.
An Unexpected Finding
During analysis, I came across a branch in the code whose behavior caught my attention.
This branch runs when data is written while TLS records are being copied directly after the connection has started, and its effect is to strip something from the outgoing data.

Looking more closely, the code removes the last 31 bytes of the buffer if those bytes begin with 21 (0x15), the content type of a TLS Alert record. The next four bytes are 3 3 0 26, i.e., 0x0303001a in hex: 0x0303 is the TLS 1.2 version (the same legacy version field also appears in TLS 1.3 records), and 0x001a is the payload length of 26 bytes, which together with the 5-byte record header accounts for exactly the 31 bytes being removed.
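For readers who do not want to dig through the original source, here is a minimal sketch of the check described above, written by me for this retrospective rather than copied from XTLS:

```go
// Minimal sketch (not the verbatim XTLS source) of the filtering behavior
// described above: before a buffer of directly copied records is written out,
// drop a trailing 31-byte TLS 1.2 Alert record if one is present.
func stripTrailingAlert(buf []byte) []byte {
	const alertRecordLen = 31 // 5-byte record header + 26-byte encrypted alert
	if len(buf) < alertRecordLen {
		return buf
	}
	tail := buf[len(buf)-alertRecordLen:]
	// 0x15 = Alert, 0x0303 = TLS 1.2 legacy version, 0x001a = 26-byte payload
	if tail[0] == 0x15 && tail[1] == 0x03 && tail[2] == 0x03 &&
		tail[3] == 0x00 && tail[4] == 0x1a {
		return buf[:len(buf)-alertRecordLen]
	}
	return buf
}
```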
At that point it was clear that the code was filtering out an Alert record of one specific length. I suspected it was the TLS 1.2 close_notify alert: in TLS 1.2, close_notify is sent as a visible Alert record during an orderly shutdown, before the TCP connection is closed, so such a record can appear even when nothing abnormal has happened. In TLS 1.3 this design changed, and alerts, like other post-handshake messages, travel inside encrypted Application Data records.
I checked The Illustrated TLS 1.2 Connection and found that the close_notify example there is 35 bytes, not 31. That implied the filter might miss some variants and let them pass through unchanged.
After forming this hypothesis, I needed to work out what could cause the size discrepancy. Different cipher suites were obvious candidates, since explicit IV/nonce length, block and padding behavior, and authentication tag length all vary between them.
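As a rough sanity check, here is some back-of-the-envelope arithmetic (mine, added for this writeup) for an encrypted two-byte alert in TLS 1.2:

```go
package main

import "fmt"

func main() {
	const recordHeader = 5 // content type (1) + legacy version (2) + length (2)
	const alertBody = 2    // alert level (1) + alert description (1)

	// TLS 1.2 AES-GCM suites add an 8-byte explicit nonce and a 16-byte tag,
	// which yields the 31-byte record targeted by the filter above.
	fmt.Println("AES-GCM close_notify record:", recordHeader+8+alertBody+16, "bytes") // 31

	// TLS 1.2 ChaCha20-Poly1305 (RFC 7905) uses no explicit nonce, only a 16-byte tag,
	// so its close_notify record comes out at a different size.
	fmt.Println("ChaCha20-Poly1305 close_notify record:", recordHeader+alertBody+16, "bytes") // 23

	// CBC suites add an explicit IV, an HMAC, and block padding, giving yet other sizes.
}
```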
After several attempts, I found that with ECDHE-ECDSA-CHACHA20-POLY1305, an XTLS-proxied connection pretending to be TLS 1.3 emitted a visible TLS Alert record at termination. This supported the hypothesis: in Direct mode, XTLS could expose a passive DPI fingerprinting risk. The issue can be reproduced by forcing that cipher suite on the client side and capturing the proxied traffic as the connection closes.
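The exact commands from my original report are not reproduced here; the following is only a minimal sketch of the idea, assuming the client's traffic is routed through an XTLS Direct-mode outbound and a packet capture is running on the proxied path (the target address is a stand-in):

```go
package main

import (
	"crypto/tls"
	"log"
)

func main() {
	// example.com:443 stands in for any server offering an ECDSA certificate;
	// the connection is assumed to be routed through an XTLS Direct-mode
	// outbound, with a packet capture watching the proxied traffic.
	conf := &tls.Config{
		MinVersion:   tls.VersionTLS12,
		MaxVersion:   tls.VersionTLS12, // keep the inner connection on TLS 1.2
		CipherSuites: []uint16{tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305},
	}
	conn, err := tls.Dial("tcp", "example.com:443", conf)
	if err != nil {
		log.Fatal(err)
	}
	// Go sends a TLS 1.2 close_notify alert on Close. With this cipher suite the
	// encrypted alert record is not 31 bytes, so the filter misses it and a
	// visible Alert record shows up in the capture when the connection terminates.
	if err := conn.Close(); err != nil {
		log.Fatal(err)
	}
}
```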
I then opened a brief GitHub issue in poor English. Because I had classes that afternoon, the report was written hastily and had many grammatical errors.
Later that evening, with help from another developer studying overseas, I rewrote the report.
Further Findings
The issue above affected XTLS in Direct mode. XTLS in Origin mode did not show the same behavior. The fix for this specific problem was straightforward; I will return to that below.
Given the differences between TLS 1.2 and TLS 1.3, I considered whether Origin mode might still leak traits inconsistent with TLS 1.3. A quick look at Go's crypto/tls stack pointed to the record-size limits defined in common.go.
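Quoted here from a recent Go release rather than from my 2021 notes (the values should be the same, though the comment wording may differ), the relevant constants in crypto/tls/common.go look like this:

```go
const (
	maxPlaintext       = 16384        // maximum plaintext payload length
	maxCiphertext      = 16384 + 2048 // maximum ciphertext payload length
	maxCiphertextTLS13 = 16384 + 256  // maximum ciphertext length in TLS 1.3 records
)
```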
These constants show a concrete difference: for the same 16384-byte maximum plaintext, a TLS 1.2 record may expand by up to 2048 bytes, while a TLS 1.3 record may expand by at most 256. I suspected this was related to cipher-suite design differences across versions. In XTLS, this could potentially surface as directly copied Application Data records whose length exceeds what TLS 1.3 allows. I did not later validate this hypothesis experimentally.
Aftermath
A few days later, @DuckSoft contacted me, saying @yuhan6665 wanted to design a new protocol to address this class of issue and asked for my input. I proposed two options:
- Forward only TLS 1.3 records directly (a toy sketch of this idea appears after this discussion). This is the simplest approach, with fewer potential side effects.
- For Alert records, oversized Application Data records (beyond the TLS 1.3 limits), and similar cases, send the data through the original handshake TLS connection instead. On the receiving side, attempt decryption first; if decryption fails, forward the record directly. Since at most one block would need a trial decryption on the critical path, the performance impact should remain limited. The protocol and state-machine design, however, would be significantly more complex.
DuckSoft considered option 1 easy to implement, and option 2 feasible only with a robust state machine. Ultimately, yuhan6665 adopted option 1, which became XTLS Vision.
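For completeness, here is a toy illustration of the gist of option 1. It is my own sketch for this retrospective, not XTLS Vision's actual code, and the negotiatedVersion helper is hypothetical; the point is simply that direct copying is enabled only once the inner connection is known to have negotiated TLS 1.3.

```go
// Toy sketch of option 1 (not XTLS Vision's real implementation): switch a
// connection to direct record copying only after the inner handshake has been
// seen to negotiate TLS 1.3; otherwise keep relaying everything through the
// ordinary encrypted proxy tunnel.
type copyMode int

const (
	modeWatchingHandshake copyMode = iota // still observing the inner handshake
	modeDirectCopy                        // confirmed TLS 1.3: splice records through
	modeTunnelOnly                        // anything else: never switch to direct copy
)

// negotiatedVersion is a hypothetical helper: a real implementation would parse
// the ServerHello (including its supported_versions extension) and return the
// negotiated protocol version. It is stubbed out in this sketch.
func negotiatedVersion(serverHello []byte) uint16 {
	_ = serverHello
	return 0
}

func nextMode(cur copyMode, serverHello []byte) copyMode {
	if cur != modeWatchingHandshake {
		return cur
	}
	if negotiatedVersion(serverHello) == 0x0304 { // 0x0304 = TLS 1.3
		return modeDirectCopy
	}
	return modeTunnelOnly
}
```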
Back when this story began, I had intended to pursue option 2 myself. I did start implementing a new protocol along those lines, but the logic was difficult to debug, and the academic pressure of my final year of high school was severe, so I eventually abandoned the effort. The unfinished code still exists in a private GitHub repository.
Present-Day Reflection
Returning to the controversy mentioned at the beginning, issue disclosure has generally followed two approaches:
- Report privately to the relevant developers first, with delayed public disclosure. Academic organizations often follow this path because moving from draft to conference acceptance and publication usually takes substantial time.
- Publish technical details directly. Developers have more commonly followed this approach.
In the incident described above, I chose the second approach.
Reporting issues directly to commercial security companies that provide traffic-classification capabilities, without first notifying developers, appears relatively uncommon in research not sponsored by those companies. This is also a key reason the public controversy became so intense. As for the exact facts, I cannot provide reliable sources here.
The topic of “The Parrot is Dead” also appears to be difficult to avoid in censorship circumvention. Even so, with sound design, mitigating these problems remains a viable path for large-scale deployment. TLS fingerprinting is concentrated mostly in the handshake phase. TLSMirror in V2Ray may be a strong design in this regard: it forwards the handshake transparently, continuously generates normal-looking traffic for the duration of the connection, and inserts the data to be transmitted into that traffic by encapsulating it as TLS records. This also provides better resistance to active probing, at least to probes that do not disrupt existing connections on the network.