Data Coding Scheme in SMS: The Complete Guide

June 25, 2026
Posted By: TeleOSS

If you have ever sent an SMS that mysteriously split into two messages, or watched a bulk campaign burn through more credits than expected, you have already met the data coding scheme without knowing its name. It is a small setting buried inside every SMS, and it decides how a network reads your characters.

A data coding scheme tells the receiving handset and network whether your message uses plain English letters, an extended character set, or full Unicode for emojis and non-Latin scripts. Get this setting wrong and you lose characters, pay for extra segments, or watch delivery rates drop on certain handsets. Get it right and your messages move cleanly through every carrier, in every language you need to support.

This guide walks through what a data coding scheme actually does, how it lives inside the SMS PDU format, why some texts split into multiple parts, and how telecom operators, SMS aggregators, and enterprises should choose the right scheme for their traffic. By the end, you will know exactly why your character count changes when you add an emoji, and what to check before you blame your SMS gateway for delivery problems.

What Is a Data Coding Scheme in SMS?

A data coding scheme (DCS) is a one-byte field inside every SMS message that tells the network and the receiving handset which character set and bit encoding the message uses. It sits in the SMS PDU, alongside the addressing fields and the message body, and every system on the route relies on it to interpret your text.

Think of it as a label on a package. The label does not contain the contents, but it tells the handling system exactly how to open and process what is inside. Without that label, a phone would have no reliable way to know if your message is plain text, compressed data, or a string of Chinese characters.

The data coding scheme standard comes from GSM 03.38, later folded into 3GPP TS 23.038, which defines the GSM 7-bit default alphabet and the rules for switching to 8-bit data or UCS-2 Unicode. Every SMPP protocol implementation and every SMS gateway you have ever used relies on this same specification, whether the traffic originates in Lagos, London, or Los Angeles.

I have worked with operators who assumed their gateway handled encoding automatically and never looked at this setting again, right up until a client complained that Arabic text was arriving as garbled symbols. The data coding scheme was the first thing we checked, and it was the entire problem.

Key takeaway: the data coding scheme is the instruction set for how a handset reads your message, not a setting you can ignore once your platform is configured.

How the Data Coding Scheme Works Inside an SMS PDU

Every SMS sent over a mobile network travels inside a structure called a Protocol Data Unit, or PDU. The data coding scheme occupies a specific byte in that structure, and the bits inside that byte tell the receiving system three things: the character set, whether the message uses compression, and which message class it belongs to.

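In practical terms, for ordinary text messages (the general data coding group in 3GPP TS 23.038), the top two bits of the byte identify the coding group, one bit flags compression, two bits carry the optional message class, and a separate pair of bits (bits 3 and 2) points to one of three options: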

GSM 7-bit default alphabet
8-bit data
UCS-2 (16-bit Unicode)

A fourth, rarely used option is reserved. Most production traffic uses only the first and third.
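To make the bit layout concrete, here is a minimal Python sketch that decodes a DCS byte. It assumes the general data coding group of 3GPP TS 23.038 (top two bits set to 00) and rejects every other group rather than guessing. The names decode_general_dcs and DcsInfo are illustrative and do not come from any gateway product.

```python
from typing import NamedTuple, Optional

# Bits 3..2 of a general-group DCS byte select the alphabet.
ALPHABETS = {0b00: "GSM 7-bit", 0b01: "8-bit data", 0b10: "UCS-2", 0b11: "reserved"}


class DcsInfo(NamedTuple):
    alphabet: str
    compressed: bool
    message_class: Optional[int]  # None when the class bits carry no meaning


def decode_general_dcs(dcs: int) -> DcsInfo:
    """Decode a DCS byte from the general data coding group (bits 7..6 = 00)."""
    if not 0 <= dcs <= 0xFF:
        raise ValueError("DCS must fit in one byte")
    if dcs >> 6 != 0b00:
        raise ValueError("this sketch only handles the general data coding group")
    compressed = bool(dcs & 0x20)   # bit 5: text is compressed
    has_class = bool(dcs & 0x10)    # bit 4: bits 1..0 hold a message class
    alphabet = ALPHABETS[(dcs >> 2) & 0b11]
    message_class = (dcs & 0b11) if has_class else None
    return DcsInfo(alphabet, compressed, message_class)
```

Under these assumptions, 0x00 decodes as uncompressed GSM 7-bit with no class, 0x08 as UCS-2, and 0x04 as 8-bit data, which are the values most gateways deal with day to day.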

This matters for anyone building or buying SMS gateway software, because the gateway has to read and set this byte correctly for every message, every time, across every connection type including SMPP, HTTP API, and SS7 based signaling.

A wholesale SMS provider I consulted for was losing roughly 3 percent of messages to handset rendering errors on older Android devices in East Africa. The root cause traced back to a misconfigured DCS byte that flagged messages as 8-bit data instead of GSM 7-bit, even though the content was plain English. Once the gateway logic was corrected, the rendering errors disappeared within a day.

Key takeaway: the DCS byte is not cosmetic. A single bit flipped in the wrong place can break delivery on specific handset models even when the message content looks completely normal.

GSM 7-Bit vs UCS-2: The Two Encodings You Need to Know

Most real world SMS traffic runs on one of two encodings, and choosing between them is the most common decision tied to the data coding scheme.

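GSM 7-bit covers the standard Latin alphabet, numbers, common punctuation, and a handful of accented letters such as é, è, and ñ. It packs more characters into a single message because each character only needs 7 bits instead of 16. UCS-2 uses 16 bits per character and covers the Unicode Basic Multilingual Plane, which includes Arabic, Chinese, Cyrillic, and accented characters like ê or ç that fall outside the GSM 7-bit default alphabet. Emojis are sent under the same UCS-2 setting, although most of them take two 16-bit units each.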

Encoding	Bits per character	Single SMS limit	Multipart segment limit	Typical use case
GSM 7-bit	7	160 characters	153 characters	English and basic Latin text
8-bit data	8	140 octets	134 octets	Binary data, ringtones, OTA config
UCS-2	16	70 characters	67 characters	Emojis, non-Latin scripts, accented text


Here is the part many marketing teams miss. If your message contains even one character outside the GSM 7-bit alphabet, the entire message switches to UCS-2, not just that one character. A single emoji at the end of an otherwise plain English promotional text can cut your character allowance by more than half and silently double your segment count.
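A quick way to see which characters trigger the switch is to check a message against the GSM 7-bit alphabet before sending. The sketch below hard-codes the default alphabet and its extension table from 3GPP TS 23.038 and assumes no national language shift tables are in use. The function names are illustrative, not part of any gateway API.

```python
import unicodedata

# GSM 7-bit default alphabet (basic table, without the escape code).
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: each of these is sent as escape + character (two septets).
GSM_EXTENSION = set("^{}\\[~]|€\f")


def non_gsm_chars(text: str) -> list:
    """Return the distinct characters that would force the message into UCS-2."""
    text = unicodedata.normalize("NFC", text)  # fold "e" + combining accent into one char
    return sorted({ch for ch in text if ch not in GSM_BASIC and ch not in GSM_EXTENSION})


def pick_encoding(text: str) -> str:
    """Choose GSM 7-bit when every character fits, otherwise UCS-2."""
    return "UCS-2" if non_gsm_chars(text) else "GSM 7-bit"


def gsm_septets(text: str) -> int:
    """Count 7-bit positions used; extension characters cost two each."""
    text = unicodedata.normalize("NFC", text)
    if non_gsm_chars(text):
        raise ValueError("text does not fit the GSM 7-bit alphabet")
    return sum(2 if ch in GSM_EXTENSION else 1 for ch in text)
```

Running non_gsm_chars over a template before launch lists the exact characters to replace. Note that extension characters such as the euro sign stay in GSM 7-bit but count double against the 160 character limit.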

A US retail brand I reviewed traffic for had this exact issue. Their campaign team added a small checkmark emoji to confirmation texts. The single message limit dropped from 160 characters to 70, and their monthly SMS spend rose by close to 40 percent before anyone noticed why.

Key takeaway: know exactly which characters trigger UCS-2 before you finalize message templates, because the cost difference is real and immediate.

How Many Characters Fit in an SMS With UCS-2 Encoding?

A single SMS using UCS-2 encoding holds 70 characters. Once a message needs more than one segment, that limit drops to 67 characters per segment, because a small header is reserved to tell the handset how to reassemble the parts in order.

This is roughly half of what GSM 7-bit allows, which is why UCS-2 traffic costs more at scale even though the per-segment carrier fee is usually the same. You are not paying more per character, you are paying for more segments to carry the same amount of readable text.
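One detail worth checking in code: the 70 and 67 limits count 16-bit units, not visible characters. Most emojis sit outside the Basic Multilingual Plane and are sent as a surrogate pair, so each one uses two units. The sketch below is a rough illustration with made-up function names, and it assumes the payload is encoded as big-endian UTF-16, which is how handsets and gateways commonly carry emojis under the UCS-2 setting.

```python
UCS2_SINGLE_LIMIT = 70  # 16-bit units in a single, unsegmented SMS


def utf16_units(text: str) -> int:
    """Number of 16-bit units the text occupies on the wire."""
    return len(text.encode("utf-16-be")) // 2


def fits_single_ucs2(text: str) -> bool:
    """True when the text fits in one UCS-2 message without segmentation."""
    return utf16_units(text) <= UCS2_SINGLE_LIMIT
```

A message of 69 letters plus one grinning face emoji looks like 70 characters on screen but occupies 71 units, so it no longer fits in a single message.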

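If you are sending two-factor authentication codes with a special character, loyalty point balances with currency symbols, or marketing texts with emojis, check whether you are sending UCS-2 traffic and budget accordingly. Emojis always trigger it. Currency symbols depend on the symbol: $, £, and ¥ are in the GSM 7-bit default alphabet, € sits in its extension table and takes two character positions, and symbols such as ₹ force UCS-2.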

Key takeaway: when planning international campaigns, count on roughly half the character budget per segment if any part of your message requires Unicode.

Why SMS Messages Split Into Multiple Parts

A message splits into a concatenated SMS, also called multi-part SMS, the moment it exceeds the single message limit for its encoding. The text is not simply cut off. Instead, the sending handset or gateway breaks the message into segments and attaches a small User Data Header to each one, and the receiving handset reassembles them in the correct order.

That header takes up space. This is why the segment limit for concatenated SMS is lower than the single message limit. GSM 7-bit drops from 160 to 153 characters per part, and UCS-2 drops from 70 to 67, because a portion of each segment’s payload is now used for reassembly information rather than your actual text.
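The 153 and 67 figures fall out of simple arithmetic. A segment carries 140 octets of user data, and the most common concatenation header (the 8-bit reference variant described in 3GPP TS 23.040) takes 6 of them. The sketch below builds that header and derives the per-segment capacity for each encoding. Function names are illustrative, and the 16-bit reference variant, which uses 7 octets, is not covered.

```python
PAYLOAD_OCTETS = 140  # user data octets available in one SMS


def concat_udh(ref: int, total: int, seq: int) -> bytes:
    """Build the 6-octet concatenation header with an 8-bit reference number."""
    if not (0 <= ref <= 255 and 1 <= seq <= total <= 255):
        raise ValueError("ref must be 0..255 and 1 <= seq <= total <= 255")
    # header length (5), element id (0x00), element length (3), then the fields
    return bytes([0x05, 0x00, 0x03, ref, total, seq])


def per_segment_capacity(udh_octets: int) -> dict:
    """Characters (or octets) left per segment once the header is subtracted."""
    free_bits = (PAYLOAD_OCTETS - udh_octets) * 8
    return {
        "GSM 7-bit": free_bits // 7,
        "8-bit data": free_bits // 8,
        "UCS-2": free_bits // 16,
    }
```

With no header, per_segment_capacity(0) gives 160, 140, and 70. With the 6-octet header from concat_udh, it gives 153, 134, and 67, matching the table earlier in this guide.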

Handsets generally stitch these parts back together seamlessly, so the reader just sees one longer message. But on the carrier side, each part is billed and routed as its own message, which is why a 320 character GSM 7-bit SMS does not cost the price of one message. At 153 characters per part, it costs the price of three.
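The billing rule can be written as a small function. This is a sketch under the limits given in this guide, where units means characters for GSM 7-bit (with extension characters counted twice), octets for 8-bit data, and 16-bit units for UCS-2. The names are illustrative.

```python
# (single message limit, per-segment limit once the message is concatenated)
LIMITS = {
    "GSM 7-bit": (160, 153),
    "8-bit data": (140, 134),
    "UCS-2": (70, 67),
}


def segment_count(units: int, encoding: str) -> int:
    """Number of billable segments for a message of the given length."""
    if units < 0:
        raise ValueError("units must not be negative")
    single, multi = LIMITS[encoding]
    if units <= single:
        return 1
    return -(-units // multi)  # ceiling division without floats
```

For example, segment_count(320, "GSM 7-bit") returns 3, because two parts only carry 306 characters.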

I have seen enterprises write message templates in a word processor, copy them straight into a gateway, and get blindsided by hidden characters like smart quotes or em dashes carried over from autocorrect. Those characters often force GSM 7-bit content into UCS-2 without anyone realizing it, turning a one segment message into two or three.
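One hedged way to defuse word processor punctuation is to map it back to plain equivalents before the message reaches the gateway. The replacement table below is a small, illustrative starting set rather than a complete list, and whether a straight quote is an acceptable substitute is an editorial decision for your team.

```python
# Typographic characters that commonly sneak in from word processors,
# mapped to equivalents that exist in the GSM 7-bit default alphabet.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / curly apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}
_TABLE = str.maketrans(REPLACEMENTS)


def to_plain_punctuation(text: str) -> str:
    """Replace typographic punctuation with GSM 7-bit friendly characters."""
    return text.translate(_TABLE)
```

Keep in mind that the ellipsis expands from one character to three, so run the character count after this step, not before.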

Key takeaway: always test your exact final message string in a character counter before launch, not a draft version, since invisible formatting characters can quietly change your data coding scheme and segment count.

Does Data Coding Scheme Affect SMS Pricing?

Yes, the data coding scheme directly affects SMS pricing because pricing is based on the number of segments a message uses, and the encoding determines how many characters fit into each segment.

A 200 character GSM 7-bit message splits into two segments. The same 200 characters in UCS-2, triggered by a single emoji or a letter outside the default alphabet, split into three segments, since each part holds only 67 characters. If you are sending that message to a million recipients, the difference between two and three segments translates directly into your monthly invoice.
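To put a number on that invoice difference, here is a small worked calculation. The price of 0.01 per segment is a made-up placeholder, not a quoted carrier rate, and the function names are illustrative.

```python
from decimal import Decimal


def segments(units: int, single_limit: int, multi_limit: int) -> int:
    """Billable segments for a message of the given length."""
    return 1 if units <= single_limit else -(-units // multi_limit)


def campaign_cost(units: int, single_limit: int, multi_limit: int,
                  recipients: int, price_per_segment: Decimal) -> Decimal:
    """Total cost when every recipient receives the same message."""
    return segments(units, single_limit, multi_limit) * recipients * price_per_segment


PRICE = Decimal("0.01")  # hypothetical price per segment
gsm_cost = campaign_cost(200, 160, 153, 1_000_000, PRICE)   # 2 segments each
ucs2_cost = campaign_cost(200, 70, 67, 1_000_000, PRICE)    # 3 segments each
```

At that placeholder price, the GSM 7-bit campaign costs 20,000 and the UCS-2 version costs 30,000, a 50 percent increase caused by one character.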

This is one of the most overlooked cost levers in enterprise messaging. Marketing and customer experience teams often write copy without knowing it touches UCS-2, and finance teams only see the aggregate bill, not the character level cause. Connecting those two views early saves real money.

Key takeaway: audit your highest volume message templates for hidden Unicode characters before scaling a campaign, since pricing follows segment count, not character count alone.

Can Data Coding Scheme Cause SMS Delivery Failures?

Yes, an incorrect data coding scheme can cause delivery failures, garbled text, or messages that arrive as blank on certain handsets. This usually happens when a gateway sets the wrong DCS byte for the actual content being sent, or when a receiving network does not support the requested encoding properly.

Older feature phones and some legacy network elements in parts of Africa and South Asia still have inconsistent support for UCS-2 rendering, which is a known pain point for enterprises running awareness campaigns or OTP delivery in those regions. A message that looks perfect on an iPhone in London can render as a string of question marks on a basic handset in a rural delivery area if the local network handles the encoding differently.

The fix is not to avoid Unicode entirely, since that is often necessary for local language support. The fix is testing on representative handset samples for your actual target market before a full rollout, and working with an SMS gateway or SMS wholesale platform that gives you visibility into delivery receipts segmented by carrier and handset type.

Key takeaway: delivery failures tied to encoding are rarely random. They cluster around specific carriers and handset generations, and testing on real devices in your target market catches the problem before launch.

How Telecom Operators Choose a Data Coding Scheme

Telecom operators generally default to GSM 7-bit for standard Latin traffic because it is the most efficient and broadly supported encoding across legacy and modern handsets. Operators only push traffic into UCS-2 or 8-bit data when the message content requires it, such as local language scripts, binary configuration payloads, or messages containing characters outside the default alphabet.

In practice, this decision happens at the network element level, often inside the SMSC or SMS gateway, where rules inspect outgoing content and assign the correct DCS byte automatically. Well configured systems do this without any manual intervention from the sender. Poorly configured systems force every message through the same default encoding regardless of content, which is exactly the kind of misstep that caused the rendering issue I mentioned earlier in East Africa.

Operators serving multilingual markets, such as those across Europe and the Middle East, need encoding logic that can detect Arabic, Cyrillic, or Greek script automatically and switch to UCS-2 only when needed, rather than defaulting every message to Unicode and inflating segment counts unnecessarily.

Key takeaway: the best operators treat data coding scheme selection as an automated, content aware decision, not a static network setting.

Data Coding Scheme for SMS Aggregators and Wholesale Platforms

SMS aggregators sit in the middle of the messaging supply chain, routing traffic from enterprises and other aggregators into terminating networks around the world. This position makes data coding scheme handling especially important, because a single aggregator might carry GSM 7-bit traffic from one client and UCS-2 traffic from another, often within the same connection.

A wholesale SMS platform needs to preserve the original encoding instructions as traffic passes through, rather than re-encoding everything to a default scheme. Re-encoding, even when done with good intentions, can introduce character substitution errors, especially with accented letters or symbols that exist in one character set but not another.
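If a hop truly has to convert UCS-2 traffic to GSM 7-bit, the minimum safeguard is a check that the conversion loses nothing. The sketch below assumes the payload is big-endian UTF-16 and that no national language shift tables are in play. The function name is illustrative and not taken from any platform.

```python
# GSM 7-bit default alphabet plus its extension table (3GPP TS 23.038).
GSM_CHARS = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
    "^{}\\[~]|€\f"
)


def can_downgrade_to_gsm(ucs2_payload: bytes) -> bool:
    """True only if every character in the UCS-2 payload exists in GSM 7-bit."""
    text = ucs2_payload.decode("utf-16-be")
    return all(ch in GSM_CHARS for ch in text)
```

When the check returns False, the safe behavior is to forward the original DCS byte and payload unchanged rather than substitute characters.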

This is one of the areas where TeleOSS’s SMS gateway software focuses heavily, since aggregators routing high volumes of mixed encoding traffic need the platform to detect and preserve the correct data coding scheme automatically, segment by segment, without manual rule writing for every client.

I have seen aggregators lose client trust over encoding issues that had nothing to do with the actual message content. A reseller routing OTP traffic for a fintech client noticed Cyrillic characters in test messages were rendering as boxes on a portion of receiving handsets. The aggregator’s platform was silently converting UCS-2 to GSM 7-bit somewhere in the routing chain, stripping characters it could not represent. Fixing the pass through logic resolved it immediately.

Key takeaway: aggregators should test encoding preservation across their full route map, not just their direct upstream connection, since the failure point is often a downstream hop they do not directly control.

Choosing the Right Data Coding Scheme for International Enterprise SMS

Enterprises sending SMS across multiple countries should default to GSM 7-bit wherever the message content allows it, and reserve UCS-2 specifically for messages that require local language scripts, emojis, or special characters that genuinely improve the customer experience.

The decision should be driven by the target market’s language needs, not by what looks nice in a marketing mockup. A customer service confirmation in French Canadian markets may need accented characters such as ê or ç that push the message into UCS-2, while the same confirmation in US English markets does not.

For enterprises managing SMS volume through an SMS gateway selection process, ask your vendor directly how their platform handles automatic encoding detection, whether it gives you visibility into segment counts before sending, and whether it preserves original encoding when routing through multiple carrier hops. These three questions reveal more about platform quality than almost any other technical specification.

According to GSMA Intelligence research on global mobile messaging trends, SMS remains one of the highest open rate channels available to enterprises, often cited above 90 percent within minutes of delivery. That reach advantage disappears quickly if encoding mistakes cause even a small percentage of messages to render incorrectly or fail outright in specific markets.

Key takeaway: treat data coding scheme selection as part of your international SMS strategy, not an afterthought handled entirely by your technical team after campaigns are already designed.

Common Mistakes With Data Coding Scheme

Assuming all characters cost the same.

A single emoji, or an accented letter outside the GSM 7-bit default alphabet, can switch an entire message from GSM 7-bit to UCS-2, cutting available characters per segment roughly in half.

Copying text from word processors without checking for hidden characters.

Smart quotes, em dashes, and special punctuation from autocorrect often force unintended UCS-2 encoding.

Letting gateways re-encode traffic without preserving original DCS settings.

This is especially common among aggregators and can cause character substitution errors downstream.

Skipping handset testing in target markets.

Encoding support varies by device generation and region, and assuming universal support leads to silent delivery failures.

Ignoring segment count in campaign budgeting.

Marketing teams often plan based on character count alone, missing that encoding choice changes the real cost per message.

Best Practices for Managing Data Coding Scheme

Test final message copy in an actual character counter tool before launch, not a draft version, to catch hidden Unicode triggers early.
Default to GSM 7-bit wherever your audience and language allow it, and reserve UCS-2 for content that genuinely requires it.
Choose an SMS gateway or wholesale SMS platform that preserves original encoding through every routing hop rather than re-encoding by default.
Build automated content scanning into your sending workflow so encoding decisions happen at the system level, not through manual review.
Run periodic handset and carrier testing in your top target markets to catch rendering issues before they affect a full campaign.

Conclusion

The data coding scheme is one of those settings that stays invisible right up until it causes a problem, whether that is an inflated SMS bill, a garbled message on a customer’s phone, or a delivery failure that no one can explain at first glance. Once you understand how GSM 7-bit, 8-bit data, and UCS-2 work, and how each one changes your character limits and segment counts, you gain real control over both your messaging costs and your delivery quality.

Whether you are an operator routing traffic across multiple networks, an aggregator managing mixed encoding from dozens of clients, or an enterprise planning an international campaign, the data coding scheme deserves a place in your technical checklist, not just your gateway’s default configuration. Test your actual message content, watch for hidden Unicode triggers, and choose a platform that preserves encoding accurately across every hop.

If you want a gateway that handles data coding scheme detection automatically across mixed traffic without the guesswork, take a look at how TeleOSS’s SMS gateway software manages encoding and routing at scale, and reach out for a walkthrough built around your actual traffic mix.

FAQs

What is a data coding scheme in SMS?

A data coding scheme is a single byte inside the SMS PDU that tells the receiving network and handset which character encoding a message uses. It can indicate GSM 7-bit default alphabet, 8-bit data, or UCS-2 Unicode. This byte exists in every SMS, whether you notice it or not, and it directly controls how your text is interpreted on the other end. Getting it right means your message displays exactly as written, regardless of the recipient’s device or carrier.

How does data coding scheme affect SMS character limits?

The encoding indicated by the data coding scheme sets how many characters fit into a message. GSM 7-bit allows 160 characters in a single SMS, while UCS-2 allows only 70, because each character takes more bits to represent. Once a message needs multiple parts, those limits drop further, to 153 and 67 characters respectively, since part of each segment is reserved for reassembly instructions.

What’s the difference between GSM 7-bit and UCS-2 data coding schemes?

GSM 7-bit covers standard Latin letters, numbers, common punctuation, and a few accented letters using 7 bits per character, making it efficient for English and similar languages. UCS-2 uses 16 bits per character and covers scripts such as Arabic, Chinese, and Cyrillic, along with accented Latin characters that are missing from the GSM alphabet. Emojis also travel under this setting. The tradeoff is capacity. UCS-2 supports far more characters and scripts but fits roughly half as many characters into each message segment.

Why do some SMS messages get split into multiple parts?

Messages split into multiple parts, known as concatenated SMS, whenever the content exceeds the single message character limit for its encoding. The network breaks the text into segments, each carrying a small header that tells the receiving handset how to reassemble them in order. This is why multipart segments hold fewer characters than a single message, since some space goes to that header.

Which data coding scheme should enterprises use for international SMS?

Enterprises should default to GSM 7-bit wherever the target language and content allow it, since it is the most efficient and broadly supported option. UCS-2 should be reserved for messages that genuinely need local language scripts, emojis, or special characters that improve the customer experience. The right choice depends on the target market’s language, the handset generation typically used there, and the cost tolerance for the campaign.