Character encoding is a subject that can sometimes be a pain in the ass. Regardless of the technology, there will come a day when someone will give us CP-1250 and we would like UTF-8 (not to be confused with WTF, he he).
Once upon a time everything was in ASCII and there were no problems. Those were the days. Now there are no days.
Old IT Guy
At some point, someone came up with the idea that ASCII is not enough. That we need a character set that will cover all the scripts used in the world. And so Unicode was created.
It's a great invention in itself. Thanks to it we have emojis! 🙌
However, we must pay attention to how our files are encoded, our databases, and what character encoding we use in our programs.
And today I wanted to share my story.
I wanted to connect to MySQL in Python and load some data.
Like on a construction site, blocks into the wheelbarrow and from the wheelbarrow up.
How to connect to MySQL from Python? Well, I installed SQLAlchemy because it popped up on Google at the top. I prepared a piece of code. Its task was to connect to the MySQL database, load data from csv to DataFrame, and then throw DataFrame into the database (pandas has a ready to_sql()
method for this)
import pandas as pd import sqlalchemy def get_db_connection(): # connect to the database engine = sqlalchemy.create_engine( "mysql://user:passhardasfuck@server/database", encoding="utf-8" ) connection = engine.connect() return connection def ingest_data(db_conn): data_df = pd.read_csv("/data/file.csv", sep=",", encoding="utf-8") data_df.to_sql( "table", db_conn, if_exists="append", index=False ) if __name__ == "__main__": with get_db_connection() as db_connection: ingest_data(db_connection)
Cool. I test locally - it works. I wrap the script in a container, because it's supposed to be part of a bigger whole. I define a simple Dockerfile
FROM python:3.9 WORKDIR /app COPY requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt COPY ingest_data.py ./ ENTRYPOINT ./ingest_data.py
And we can slowly finish for today. Just the last tests. I type in the console
docker run my-fancy-ingestion
And zonk. Error, data not loading.
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
// Full stacktrace at the end
Something is wrong with the character encoding. We start looking on the Internet.
First I checked if I had LOCALE set correctly in the container. All according to UTF-8, which is how it should be.
Just in case, I set the PYTHONIOENCODING
variable to UTF-8 – it doesn't help.
I check with PYTHONUTF8
set to 1 but it doesn't help either.
Everything is set up. All file reading methods have encoding set to UTF-8 ( encoding="utf-8"
)
I'm slowly giving up. Everything seems to be set up correctly. And yet - it doesn't work. And only in Docker, because locally it worked...
Only somewhere in the depths of the internet did I find that you need to add ?charset=utf8mb4
to the connection string.
And this despite the fact that I was reading all files in UTF-8, and that I had encoding="utf-8"
set in the create_engine
method. Only adding the appropriate charset setting in the connection string helped. This is what the correct fragment responsible for connecting to the database looked like:
engine = sqlalchemy.create_engine( "mysql://user:passhardasfuck@server/database?charset=utf8mb4", encoding="utf-8" )
After this change, everything started working magically. And I could stop pulling my hair out and do the happy dance #thisfeelingwhenitworks.
Traceback (most recent call last): File "/app/./ingest_data.py", line 72, in <module> ingest_people_csv(db_connection) File "/app/./ingest_data.py", line 15, in ingest_people_csv load_df_to_database(db_conn, people_df, "person") File "/app/./ingest_data.py", line 46, in load_df_to_database df.to_sql( File "/usr/local/lib/python3.9/site-packages/pandas/core/generic.py", line 2779, in to_sql sql.to_sql( File "/usr/local/lib/python3.9/site-packages/pandas/io/sql.py", line 601, in to_sql pandas_sql.to_sql( File "/usr/local/lib/python3.9/site-packages/pandas/io/sql.py", line 1411, in to_sql table.insert(chunksize, method=method) File "/usr/local/lib/python3.9/site-packages/pandas/io/sql.py", line 845, in insert exec_insert(conn, keys, chunk_iter) File "/usr/local/lib/python3.9/site-packages/pandas/io/sql.py", line 762, in _execute_insert conn.execute(self.table.insert(), data) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1201, in execute return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 313, in _execute_on_connection return connection._execute_clauseelement( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1390, in _execute_clauseelement ret = self._execute_context( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1749, in _execute_context self._handle_dbapi_exception( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1934, in _handle_dbapi_exception util.raise_(exc_info[1], with_traceback=exc_info[2]) File "/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_ raise exception File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context self.dialect.do_execute( File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute cursor.execute(statement, parameters) File "/usr/local/lib/python3.9/site-packages/MySQLdb/cursors.py", line 199, in execute args = tuple(map(db.literal, args)) File "/usr/local/lib/python3.9/site-packages/MySQLdb/connections.py", line 280, in literal s = self.string_literal(o.encode(self.encoding)) File "/usr/local/lib/python3.9/encodings/cp1252.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_table) UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>