[{"data":1,"prerenderedAt":359},["ShallowReactive",2],{"content-query-qP7dI1rzs5":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"draft":6,"tags":11,"thumbnail":15,"alt_description":16,"slug":17,"body":18,"_type":353,"_id":354,"_source":355,"_file":356,"_stem":357,"_extension":358},"/posts/fastest-way-to-upload-data-into-postgresql","posts",false,"","The Fastest Way to Upload Data into PostgreSQL","Learn how to significantly speed up your PostgreSQL data uploads by switching from INSERT to COPY method","2024-10-27T00:00:00.000Z",[12,13,14],"data-engineering","postgresql","airflow","/img/the_fastest_way_to_upload_data.png","Uploading data into PostgreSQL using COPY method","fastest-way-to-upload-data-into-postgresql",{"type":19,"children":20,"toc":341},"root",[21,38,43,50,64,69,81,86,97,102,108,113,122,127,133,193,198,204,211,256,262,305,311,316,330],{"type":22,"tag":23,"props":24,"children":25},"element","blockquote",{},[26,33],{"type":22,"tag":27,"props":28,"children":29},"p",{},[30],{"type":31,"value":32},"text","According to the Pareto principle, 20% of your code does 80% of compilation",{"type":22,"tag":27,"props":34,"children":35},{},[36],{"type":31,"value":37},"                                                                            -- Christian Mayer, The Art of Clean Code",{"type":22,"tag":27,"props":39,"children":40},{},[41],{"type":31,"value":42},"Recently I changed jobs, from Data Analyst at a Big Tech company to Data Product Manager at an enterprise. And I was freaking out with FOMO, worried that an enterprise company would not challenge me enough to stay sharp on the technical side of creative solutions. 
I gotta say I was wrong: my learning curve is as steep as it should be for anyone who changes jobs to seize a learning opportunity.",{"type":22,"tag":44,"props":45,"children":47},"h2",{"id":46},"the-problem-with-pandas-default-insert",[48],{"type":31,"value":49},"The Problem with Pandas Default Insert",{"type":22,"tag":27,"props":51,"children":52},{},[53,55,62],{"type":31,"value":54},"Recently our Data Team faced an interesting challenge. Our Airflow DAG was taking forever to upload a DataFrame into PostgreSQL. The culprit? The default pandas ",{"type":22,"tag":56,"props":57,"children":59},"code",{"className":58},[],[60],{"type":31,"value":61},"to_sql()",{"type":31,"value":63}," method that uses INSERT statements.",{"type":22,"tag":27,"props":65,"children":66},{},[67],{"type":31,"value":68},"Here's what happens under the hood when you use the default INSERT approach:",{"type":22,"tag":70,"props":71,"children":76},"pre",{"className":72,"code":74,"language":75,"meta":7},[73],"language-python","df.to_sql('table_name', engine, if_exists='append')\n","python",[77],{"type":22,"tag":56,"props":78,"children":79},{"__ignoreMap":7},[80],{"type":31,"value":74},{"type":22,"tag":27,"props":82,"children":83},{},[84],{"type":31,"value":85},"This innocent-looking line generates something like this for EACH row:",{"type":22,"tag":70,"props":87,"children":92},{"className":88,"code":90,"language":91,"meta":7},[89],"language-sql","INSERT INTO table_name (col1, col2, col3) \nVALUES ('value1', 'value2', 'value3');\n","sql",[93],{"type":22,"tag":56,"props":94,"children":95},{"__ignoreMap":7},[96],{"type":31,"value":90},{"type":22,"tag":27,"props":98,"children":99},{},[100],{"type":31,"value":101},"Imagine doing this millions of times! Each INSERT statement requires a round trip to the database. It's like delivering packages one by one instead of using a container ship. 
No wonder our DAG was running slower than my previous employer's internet connection.",{"type":22,"tag":44,"props":103,"children":105},{"id":104},"enter-the-copy-method",[106],{"type":31,"value":107},"Enter the COPY Method",{"type":22,"tag":27,"props":109,"children":110},{},[111],{"type":31,"value":112},"One of the greatest minds on our team came up with the thought \"I may have realized the fastest way to upload a DataFrame into PostgreSQL\". Here's the core of the solution, a callable you pass to to_sql() through its method argument:",{"type":22,"tag":70,"props":114,"children":117},{"className":115,"code":116,"language":75,"meta":7},[73],"import csv\nfrom io import StringIO\n\n\ndef psql_insert_copy(table, conn, keys, data_iter):\n    # Callable for pandas to_sql(method=...); copy_expert assumes a psycopg2 connection\n    dbapi_conn = conn.connection\n    with dbapi_conn.cursor() as cur:\n        # Write the rows into an in-memory CSV buffer\n        s_buf = StringIO()\n        writer = csv.writer(s_buf)\n        writer.writerows(data_iter)\n        s_buf.seek(0)\n\n        columns = ', '.join('\"{}\"'.format(k) for k in keys)\n        # table.schema is None when to_sql() is called without schema=\n        if table.schema:\n            table_name = '{}.{}'.format(table.schema, table.name)\n        else:\n            table_name = table.name\n        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(\n            table_name, columns)\n        # Stream the whole buffer to PostgreSQL with one COPY command\n        cur.copy_expert(sql=sql, file=s_buf)\n",[118],{"type":22,"tag":56,"props":119,"children":120},{"__ignoreMap":7},[121],{"type":31,"value":116},{"type":22,"tag":27,"props":123,"children":124},{},[125],{"type":31,"value":126},"The idea is to use PostgreSQL's COPY command. This beast can handle bulk data loading like it's nothing. Instead of sending individual INSERT statements, COPY streams all the rows in a single command. 
It's like upgrading from a bicycle courier to a cargo plane!",{"type":22,"tag":44,"props":128,"children":130},{"id":129},"the-results-speak-for-themselves",[131],{"type":31,"value":132},"The Results Speak for Themselves",{"type":22,"tag":134,"props":135,"children":138},"div",{"className":136},[137],"table-container",[139],{"type":22,"tag":140,"props":141,"children":142},"table",{},[143,162],{"type":22,"tag":144,"props":145,"children":146},"thead",{},[147],{"type":22,"tag":148,"props":149,"children":150},"tr",{},[151,157],{"type":22,"tag":152,"props":153,"children":154},"th",{},[155],{"type":31,"value":156},"Method",{"type":22,"tag":152,"props":158,"children":159},{},[160],{"type":31,"value":161},"Time to Upload 1M Rows",{"type":22,"tag":163,"props":164,"children":165},"tbody",{},[166,180],{"type":22,"tag":148,"props":167,"children":168},{},[169,175],{"type":22,"tag":170,"props":171,"children":172},"td",{},[173],{"type":31,"value":174},"INSERT",{"type":22,"tag":170,"props":176,"children":177},{},[178],{"type":31,"value":179},"~20 minutes",{"type":22,"tag":148,"props":181,"children":182},{},[183,188],{"type":22,"tag":170,"props":184,"children":185},{},[186],{"type":31,"value":187},"COPY",{"type":22,"tag":170,"props":189,"children":190},{},[191],{"type":31,"value":192},"~20 seconds",{"type":22,"tag":27,"props":194,"children":195},{},[196],{"type":31,"value":197},"Yes, you read that right. What used to take about 20 minutes now completes in seconds. 
Our DBAs finally stopped giving us the evil eye during peak load times.",{"type":22,"tag":44,"props":199,"children":201},{"id":200},"lets-break-it-down",[202],{"type":31,"value":203},"Let's break it down",{"type":22,"tag":205,"props":206,"children":208},"h3",{"id":207},"insert-method",[209],{"type":31,"value":210},"INSERT Method",{"type":22,"tag":212,"props":213,"children":214},"ul",{},[215,221,226,231,236,241,246,251],{"type":22,"tag":216,"props":217,"children":218},"li",{},[219],{"type":31,"value":220},"Simple to implement",{"type":22,"tag":216,"props":222,"children":223},{},[224],{"type":31,"value":225},"Good for small datasets",{"type":22,"tag":216,"props":227,"children":228},{},[229],{"type":31,"value":230},"Better for real-time row-by-row updates",{"type":22,"tag":216,"props":232,"children":233},{},[234],{"type":31,"value":235},"Easier error handling per row",{"type":22,"tag":216,"props":237,"children":238},{},[239],{"type":31,"value":240},"Painfully slow for bulk uploads",{"type":22,"tag":216,"props":242,"children":243},{},[244],{"type":31,"value":245},"Creates heavy network traffic",{"type":22,"tag":216,"props":247,"children":248},{},[249],{"type":31,"value":250},"Causes database connection overhead",{"type":22,"tag":216,"props":252,"children":253},{},[254],{"type":31,"value":255},"Makes DBAs cry",{"type":22,"tag":205,"props":257,"children":259},{"id":258},"copy-method",[260],{"type":31,"value":261},"COPY Method",{"type":22,"tag":212,"props":263,"children":264},{},[265,270,275,280,285,290,295,300],{"type":22,"tag":216,"props":266,"children":267},{},[268],{"type":31,"value":269},"Blazing fast for bulk uploads",{"type":22,"tag":216,"props":271,"children":272},{},[273],{"type":31,"value":274},"Minimal network overhead",{"type":22,"tag":216,"props":276,"children":277},{},[278],{"type":31,"value":279},"Single transaction",{"type":22,"tag":216,"props":281,"children":282},{},[283],{"type":31,"value":284},"Makes DBAs 
smile",{"type":22,"tag":216,"props":286,"children":287},{},[288],{"type":31,"value":289},"More complex implementation",{"type":22,"tag":216,"props":291,"children":292},{},[293],{"type":31,"value":294},"All-or-nothing transaction",{"type":22,"tag":216,"props":296,"children":297},{},[298],{"type":31,"value":299},"Harder to handle individual row errors",{"type":22,"tag":216,"props":301,"children":302},{},[303],{"type":31,"value":304},"Not suitable for real-time updates",{"type":22,"tag":44,"props":306,"children":308},{"id":307},"conclusion",[309],{"type":31,"value":310},"Conclusion",{"type":22,"tag":27,"props":312,"children":313},{},[314],{"type":31,"value":315},"If you're dealing with bulk data uploads in PostgreSQL, switching from INSERT to COPY is like upgrading from a Honda Civic to a Ferrari (without the expensive maintenance). Just remember - with great power comes great responsibility. Make sure your data is clean before attempting the upload, as COPY is an all-or-nothing operation.",{"type":22,"tag":27,"props":317,"children":318},{},[319,321],{"type":31,"value":320},"The full implementation and comparison are available in Askin Tamanli's ",{"type":22,"tag":322,"props":323,"children":327},"a",{"href":324,"rel":325},"https://github.com/askintamanli/Fastest-Methods-to-Bulk-Insert-Pandas-Dataframe-into-PostgreSQL",[326],"nofollow",[328],{"type":31,"value":329},"repository",{"type":22,"tag":27,"props":331,"children":332},{},[333,335,339],{"type":31,"value":334},"Yours,",{"type":22,"tag":336,"props":337,"children":338},"br",{},[],{"type":31,"value":340},"\nBad 
Dog",{"title":7,"searchDepth":342,"depth":342,"links":343},2,[344,345,346,347,352],{"id":46,"depth":342,"text":49},{"id":104,"depth":342,"text":107},{"id":129,"depth":342,"text":132},{"id":200,"depth":342,"text":203,"children":348},[349,351],{"id":207,"depth":350,"text":210},3,{"id":258,"depth":350,"text":261},{"id":307,"depth":342,"text":310},"markdown","content:posts:fastest-way-to-upload-data-into-postgresql.md","content","posts/fastest-way-to-upload-data-into-postgresql.md","posts/fastest-way-to-upload-data-into-postgresql","md",1775831731497]