SAS – PySpark Equivalences

by Juan Vidal Gil
When implementing data science processes we can choose among several tools. Two of the most widely used are SAS and PySpark. Spark is an open source framework for distributed execution on a cluster, and PySpark is the library for working with it from Python. In a previous post we saw how to work with Python and SAS in an integrated way on the SAS Viya platform: https://online.datademy.es/videotutorial-integrar-sas-y-python-en-plataforma-sas-vita/
Below is a series of examples of data operations carried out with both tools. This list of examples is the first of a series we will be publishing.
Examples of dataset operations in SAS and their PySpark equivalents
Reading a file
We will work with the following file: CUSTOMER_CHURN_TELCO. An excerpt:
customerID;gender;tenure;MultipleLines;MonthlyCharges;StreamingTV;DeviceProtection;Contract;Churn
7590-VHVEG;Female;1;No phone service;29.85;No;No;Month-to-month;No
5575-GNVDE;Male;34;No;56.95;No;Yes;One year;No
3668-QPYBK;Male;2;No;53.85;No;No;Month-to-month;Yes
7795-CFOCW;Male;45;No phone service;;No;Yes;One year;No
9237-HQITU;Female;2;No;70.7;No;No;Month-to-month;Yes
Field descriptions:
- customerID: customer identifier
- gender: customer gender (Male/Female)
- tenure: customer tenure in months
- MultipleLines: whether the customer has multiple lines (Yes/No)
- MonthlyCharges: the customer's monthly charges (bill amount)
- StreamingTV: whether the customer has TV streaming services (Yes/No)
- DeviceProtection: whether the customer has a device protection service (Yes/No)
- Contract: contract type and duration (Two year, One year, Month-to-month)
- Churn: whether the customer has churned (Yes/No)
The read step in SAS is as follows:
DATA WORK.CUSTOMER1;
LENGTH customerID $10 gender $6 tenure 8
MultipleLines $20 MonthlyCharges 8 StreamingTV $20
DeviceProtection $20 Contract $20 Churn $3 ;
INFILE '/home/ficheros/CUSTOMER_CHURN_TELCO.txt'
FIRSTOBS=2 DLM=';' DSD ;
INPUT
customerID gender tenure MultipleLines MonthlyCharges
StreamingTV DeviceProtection Contract Churn ;
RUN;
We now build the PySpark equivalent. First we list the files available in DBFS:
%fs ls dbfs:/FileStore/tables/
path                                             name                      size
dbfs:/FileStore/tables/CUSTOMER_CHURN_TELCO.txt  CUSTOMER_CHURN_TELCO.txt  432871
dbfs:/FileStore/tables/SALES.csv                 SALES.csv                 60197
dbfs:/FileStore/tables/appl_stock.csv            appl_stock.csv            143130
dbfs:/FileStore/tables/people.csv                people.csv                34
dbfs:/FileStore/tables/ventas.txt                ventas.txt                50
dbfs:/FileStore/tables/ventas_ciudad.txt         ventas_ciudad.txt         57
dbfs:/FileStore/tables/ventas_ciudad2.txt        ventas_ciudad2.txt        92
Showing all 7 rows.
The read method:
df_customer1 = spark.read \
.format("csv") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("delimiter", ";") \
.load("dbfs:/FileStore/tables/CUSTOMER_CHURN_TELCO.txt")
df_customer1.show()
+----------+------+------+----------------+--------------+-------------------+-------------------+--------------+-----+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|        StreamingTV|   DeviceProtection|      Contract|Churn|
+----------+------+------+----------------+--------------+-------------------+-------------------+--------------+-----+
|7590-VHVEG|Female|     1|No phone service|         29.85|                 No|                 No|Month-to-month|   No|
|5575-GNVDE|  Male|    34|              No|         56.95|                 No|                Yes|      One year|   No|
|3668-QPYBK|  Male|     2|              No|         53.85|                 No|                 No|Month-to-month|  Yes|
|7795-CFOCW|  Male|    45|No phone service|          null|                 No|                Yes|      One year|   No|
|9237-HQITU|Female|     2|              No|          70.7|                 No|                 No|Month-to-month|  Yes|
|9305-CDSKC|Female|     8|             Yes|         99.65|                Yes|                Yes|Month-to-month|  Yes|
|9225-CDGGG|  Male|     1|             Yes|         52.65|                 No|                Yes|Month-to-month|  Yes|
|9625-CDAAG|Female|     2|             Yes|         32.75|                Yes|                Yes|Month-to-month|  Yes|
|9225-PPGGG|  Male|     1|             Yes|         31.65|                 No|                Yes|Month-to-month|  Yes|
|9275-CDGVV|Female|     2|             Yes|         34.65|                Yes|                Yes|Month-to-month|  Yes|
|9825-UUGGG|Female|     1|             Yes|         32.85|                Yes|                Yes|Month-to-month|  Yes|
|6225-CDGGT|  Male|     1|             Yes|         42.65|                Yes|                Yes|Month-to-month|  Yes|
|9385-AEUGG|  Male|    55|             Yes|         65.65|                Yes|                Yes|      Two year|   No|
|1452-KIOVK|  Male|    22|             Yes|          89.1|                Yes|                 No|Month-to-month|   No|
|6713-OKOMC|Female|    10|No phone service|         29.75|                 No|                 No|Month-to-month|   No|
|7892-POOKP|Female|    28|             Yes|         104.8|                Yes|                Yes|Month-to-month|  Yes|
|6388-TABGU|  Male|    62|              No|         56.15|                 No|                 No|      One year|   No|
|9763-GRSKD|  Male|    13|              No|         49.95|                 No|                 No|Month-to-month|   No|
|7469-LKBCI|  Male|    16|              No|         18.95|No internet service|No internet service|      Two year|   No|
|8091-TTVAX|  Male|    58|             Yes|        100.35|                Yes|                Yes|      One year|   No|
+----------+------+------+----------------+--------------+-------------------+-------------------+--------------+-----+
only showing top 20 rows
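Here inferSchema makes Spark scan the file and guess the column types. The equivalent of declaring types up front with the SAS LENGTH statement is an explicit schema. A minimal sketch (the StructType below is our own translation of the LENGTH declarations):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Declare the types explicitly, like LENGTH in the SAS step
schema = StructType([
    StructField("customerID", StringType()),
    StructField("gender", StringType()),
    StructField("tenure", IntegerType()),
    StructField("MultipleLines", StringType()),
    StructField("MonthlyCharges", DoubleType()),
    StructField("StreamingTV", StringType()),
    StructField("DeviceProtection", StringType()),
    StructField("Contract", StringType()),
    StructField("Churn", StringType())
])

df_customer1 = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .schema(schema) \
    .load("dbfs:/FileStore/tables/CUSTOMER_CHURN_TELCO.txt")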
Filters
We filter the dataset we just read.
The filter step in SAS, using a DATA step:
DATA CUST_FIL;
SET CUSTOMER1;
WHERE Churn = 'Yes';
RUN;
Using PROC SQL:
PROC SQL;
CREATE TABLE CUST_FIL AS
(SELECT * FROM CUSTOMER1 WHERE Churn = 'Yes');
QUIT;
The PySpark equivalent uses the where method:
df_cust_fil = df_customer1.where("Churn == 'Yes'")
df_cust_fil.show(5)
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|3668-QPYBK|  Male|     2|           No|         53.85|         No|              No|Month-to-month|  Yes|
|9237-HQITU|Female|     2|           No|          70.7|         No|              No|Month-to-month|  Yes|
|9305-CDSKC|Female|     8|          Yes|         99.65|        Yes|             Yes|Month-to-month|  Yes|
|9225-CDGGG|  Male|     1|          Yes|         52.65|         No|             Yes|Month-to-month|  Yes|
|9625-CDAAG|Female|     2|          Yes|         32.75|        Yes|             Yes|Month-to-month|  Yes|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows
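An equivalent spelling uses the filter method (an alias of where) with a column expression instead of a SQL string; a quick sketch:
from pyspark.sql.functions import col

df_cust_fil = df_customer1.filter(col("Churn") == "Yes")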
In PySpark the filtering can also be done with SQL.
First we register the DataFrame as a view with createOrReplaceTempView, then query it with spark.sql:
df_customer1.createOrReplaceTempView("t_customer1")
df_cust_fil = spark.sql("SELECT * FROM t_customer1 where Churn = 'Yes'")
df_cust_fil.show(5)
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|3668-QPYBK|  Male|     2|           No|         53.85|         No|              No|Month-to-month|  Yes|
|9237-HQITU|Female|     2|           No|          70.7|         No|              No|Month-to-month|  Yes|
|9305-CDSKC|Female|     8|          Yes|         99.65|        Yes|             Yes|Month-to-month|  Yes|
|9225-CDGGG|  Male|     1|          Yes|         52.65|         No|             Yes|Month-to-month|  Yes|
|9625-CDAAG|Female|     2|          Yes|         32.75|        Yes|             Yes|Month-to-month|  Yes|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows
Projection
Column selection
The projection step in SAS, using KEEP to list the columns to retain:
DATA CUST_SEL (KEEP= customerID gender tenure);
SET CUSTOMER1;
RUN;
Another option is the DROP statement, which lists the columns to remove:
DATA CUST_SEL (DROP= MultipleLines MonthlyCharges
StreamingTV DeviceProtection Contract Churn);
SET CUSTOMER1;
RUN;
Using PROC SQL:
PROC SQL;
CREATE TABLE CUST_SEL AS
(SELECT customerID, gender, tenure FROM CUSTOMER1 );
QUIT;
In PySpark we use the select method:
df_cust_sel = df_customer1.select("customerID", "gender", "tenure")
df_cust_sel.show(5)
+----------+------+------+
|customerID|gender|tenure|
+----------+------+------+
|7590-VHVEG|Female|     1|
|5575-GNVDE|  Male|    34|
|3668-QPYBK|  Male|     2|
|7795-CFOCW|  Male|    45|
|9237-HQITU|Female|     2|
+----------+------+------+
only showing top 5 rows
Another option is the drop method:
df_cust_sel = df_customer1.drop("MultipleLines","MonthlyCharges", "StreamingTV", "DeviceProtection", "Contract", "Churn")
df_cust_sel.show(5)
+----------+------+------+
|customerID|gender|tenure|
+----------+------+------+
|7590-VHVEG|Female|     1|
|5575-GNVDE|  Male|    34|
|3668-QPYBK|  Male|     2|
|7795-CFOCW|  Male|    45|
|9237-HQITU|Female|     2|
+----------+------+------+
only showing top 5 rows
Via SQL, a minimal sketch over the t_customer1 view registered earlier:
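df_cust_sel = spark.sql("SELECT customerID, gender, tenure FROM t_customer1")
df_cust_sel.show(5)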
Computed columns
In SAS we can compute new columns in a DATA step:
DATA CUSTOMER2;
SET CUSTOMER1;
TENURE_YEARS = TENURE/12;
RUN;
Or with PROC SQL:
PROC SQL;
CREATE TABLE CUSTOMER2 AS
(SELECT t.*, t.tenure/12 as tenure_years
FROM CUSTOMER1 t );
QUIT;
In PySpark we use the SQL functions together with the withColumn method:
from pyspark.sql.functions import col
df_customer2 = df_customer1.withColumn("tenure_years", col("tenure")/12 )
df_customer2.show(5)
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|       tenure_years|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|0.08333333333333333|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No| 2.8333333333333335|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|0.16666666666666666|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|               3.75|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|0.16666666666666666|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
only showing top 5 rows
Via SQL:
df_customer2 = spark.sql("SELECT *, tenure/12 as tenure_years FROM t_customer1 ")
df_customer2.show(5)
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|       tenure_years|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|0.08333333333333333|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No| 2.8333333333333335|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|0.16666666666666666|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|               3.75|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|0.16666666666666666|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-------------------+
only showing top 5 rows
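withColumn also accepts arbitrary SQL expressions through expr; for example, rounding the computed column to two decimals (a small sketch of our own, not part of the SAS example above):
from pyspark.sql.functions import expr

# round(tenure / 12, 2) is evaluated as a SQL expression
df_customer2 = df_customer1.withColumn("tenure_years", expr("round(tenure / 12, 2)"))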
Sorting
In SAS we sort with the PROC SORT procedure:
PROC SORT DATA=CUSTOMER1;
by customerID;
run;
In PySpark we use the sort method:
df_customer1.sort("customerID").show(5)
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
|0002-ORFBO|Female|     9|           No|          65.6|        Yes|              No|      One year|   No|
|0003-MKNFE|  Male|     9|          Yes|          59.9|         No|              No|Month-to-month|   No|
|0004-TLHLJ|  Male|     4|           No|          73.9|         No|             Yes|Month-to-month|  Yes|
|0011-IGKFF|  Male|    13|           No|          98.0|        Yes|             Yes|Month-to-month|  Yes|
|0013-EXCHZ|Female|     3|           No|          83.9|        Yes|              No|Month-to-month|  Yes|
+----------+------+------+-------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows
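The DESCENDING option of PROC SORT maps to a column expression in PySpark; a quick sketch:
from pyspark.sql.functions import col

df_customer1.sort(col("customerID").desc()).show(5)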
Concatenation. Stacking datasets by rows (append, concatenate)
In SAS we can concatenate datasets with a DATA step:
DATA CUST_TOT;
SET CUSTOMER1 CUST_FIL;
RUN;
Another option is the PROC APPEND procedure:
PROC APPEND BASE=CUSTOMER1 NEW=CUST_FIL;
RUN;
The concatenation step in PySpark uses the union method:
df_cust_tot = df_customer1.union(df_cust_fil)
df_cust_tot.show(5)
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+
only showing top 5 rows
df_cust_tot.count()
Out[11]: 9089
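Note that union matches columns by position, not by name (the SAS SET statement, by contrast, matches variables by name). If the column order of the two DataFrames might differ, unionByName (available since Spark 2.3) is the safer choice:
df_cust_tot = df_customer1.unionByName(df_cust_fil)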
Aggregations
In SAS we can aggregate with PROC SQL:
PROC SQL;
CREATE TABLE CUST_AG1 AS
(SELECT CHURN, COUNT(customerID) as numcust, avg(tenure) as avg_ten,
sum(MonthlyCharges) as sum_mc
from CUSTOMER1
GROUP BY CHURN);
QUIT;
Another option is the PROC MEANS procedure:
PROC MEANS DATA = CUSTOMER1 NWAY;
CLASS Churn;
VAR tenure MonthlyCharges;
output out=CUST_AG(drop=_type_ _freq_)
sum(MonthlyCharges) = sum_mc
mean(tenure) = avg_ten
N = num_cust;
RUN;
In PySpark we use the groupBy and agg methods:
import pyspark.sql.functions as f
df_cust_ag = df_customer1.groupBy("Churn").agg(
    f.count("customerID").alias("num_cust"),
    f.avg("tenure").alias("avg_tenu"),
    f.sum("MonthlyCharges").alias("su_mc"))
df_cust_ag.show(5)
+-----+--------+------------------+------------------+
|Churn|num_cust|          avg_tenu|             su_mc|
+-----+--------+------------------+------------------+
|   No|    5203|40.915433403805494|329405.54999999877|
|  Yes|    1943| 9.902213072568193|133832.49999999983|
+-----+--------+------------------+------------------+
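Via SQL, the PROC SQL version translates almost verbatim over the t_customer1 view; a sketch:
df_cust_ag = spark.sql("""
    SELECT Churn,
           count(customerID)   AS num_cust,
           avg(tenure)         AS avg_tenu,
           sum(MonthlyCharges) AS su_mc
    FROM t_customer1
    GROUP BY Churn
""")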
Joins
In SAS we can perform the join in a DATA step with MERGE, sorting both datasets first:
PROC SORT DATA=CUSTOMER1;
by customerID;
run;
PROC SORT DATA=CITY;
by customerID;
run;
DATA CUSTOMER3;
MERGE CUSTOMER1(in=a) CITY(in=b);
by customerID;
if a;
run;
In PySpark we use the join method. Since the "if a;" condition keeps every row of CUSTOMER1, the matching join type is a left join:
df_city = spark.createDataFrame([
("7590-VHVEG", "Chicago"),
("5575-GNVDE", "Boston"),
("3668-QPYBK", "New York"),
("7795-CFOCW", "Washington")
]) \
.toDF("id_customer", "city")
df_city.show()
+-----------+----------+
|id_customer|      city|
+-----------+----------+
| 7590-VHVEG|   Chicago|
| 5575-GNVDE|    Boston|
| 3668-QPYBK|  New York|
| 7795-CFOCW|Washington|
+-----------+----------+
df_customer3 = df_customer1.join(df_city, df_customer1["customerID"] == df_city["id_customer"], "left")
df_customer3.where("customerID in ('7590-VHVEG','5575-GNVDE')").show()
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-----------+-------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|id_customer|   city|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-----------+-------+
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No| 5575-GNVDE| Boston|
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No| 7590-VHVEG|Chicago|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+-----------+-------+
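Both key columns (customerID and id_customer) remain in the joined result. A common cleanup is to drop the duplicated key after the join; a sketch:
df_customer3 = df_customer1.join(
    df_city,
    df_customer1["customerID"] == df_city["id_customer"],
    "left"
).drop("id_customer")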
Writing to a file
In SAS we can write the file out in a DATA step:
data _null_;
set CUSTOMER1;
file "/home/ficheros/fcustomer.csv" dlm = ';';
put customerID gender tenure MultipleLines MonthlyCharges
StreamingTV DeviceProtection Contract Churn;
run;
Or with the PROC EXPORT procedure:
proc export data=CUSTOMER1
outfile="/home/ficheros/fcustomer.csv" dbms=CSV;
run;
In PySpark we use the write method:
df_customer1.write \
.format("csv") \
.mode("overwrite") \
.save("dbfs:/tmp/test/fcustomer")
Listing the output directory shows the files produced by the write:
%fs ls /tmp/test/fcustomer
path                                                                                                           name                                                                                   size
dbfs:/tmp/test/fcustomer/_SUCCESS                                                                              _SUCCESS                                                                               0
dbfs:/tmp/test/fcustomer/_committed_2652053655009358260                                                        _committed_2652053655009358260                                                         111
dbfs:/tmp/test/fcustomer/_started_2652053655009358260                                                          _started_2652053655009358260                                                           0
dbfs:/tmp/test/fcustomer/part-00000-tid-2652053655009358260-4c54284e-9ade-40ed-8f13-abbc3f814490-2-1-c000.csv  part-00000-tid-2652053655009358260-4c54284e-9ade-40ed-8f13-abbc3f814490-2-1-c000.csv  426420
Showing all 4 rows.
Working with dates
In SAS we can create dates with the DATE() function, use date functions such as MONTH(), and apply display formats with FORMAT:
data CUSTOMER4;
set CUSTOMER1;
fc_act = DATE();
month_act = MONTH(fc_act);
format fc_act yymmdd10.;
run;
In PySpark we use the current_date and month SQL functions:
from pyspark.sql.functions import current_date, month
df_customer4 = df_customer1.withColumn("fc_act", current_date() ).withColumn("month_act", month("fc_act"))
df_customer4.show(5)
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+----------+---------+
|customerID|gender|tenure|   MultipleLines|MonthlyCharges|StreamingTV|DeviceProtection|      Contract|Churn|    fc_act|month_act|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+----------+---------+
|7590-VHVEG|Female|     1|No phone service|         29.85|         No|              No|Month-to-month|   No|2021-01-26|        1|
|5575-GNVDE|  Male|    34|              No|         56.95|         No|             Yes|      One year|   No|2021-01-26|        1|
|3668-QPYBK|  Male|     2|              No|         53.85|         No|              No|Month-to-month|  Yes|2021-01-26|        1|
|7795-CFOCW|  Male|    45|No phone service|          null|         No|             Yes|      One year|   No|2021-01-26|        1|
|9237-HQITU|Female|     2|              No|          70.7|         No|              No|Month-to-month|  Yes|2021-01-26|        1|
+----------+------+------+----------------+--------------+-----------+----------------+--------------+-----+----------+---------+
only showing top 5 rows
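The SAS FORMAT statement (yymmdd10.) controls how the date is displayed. The closest PySpark equivalent is date_format, which renders the date as a formatted string. A small sketch (the fc_act_fmt column name is our own):
from pyspark.sql.functions import date_format

# render fc_act as a dd/MM/yyyy string in a new column
df_customer4.withColumn("fc_act_fmt", date_format("fc_act", "dd/MM/yyyy")).show(5)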
Database access (Hive)
In SAS we create a library pointing at Hive, and through that library DATA steps can read and write Hive tables:
libname hivesas hadoop server=hadoop srv user='user';
data hivesas.t_customer;
set CUSTOMER1;
run;
Tables in the Hive metastore:
spark.sql("show tables").show()
+--------+--------------+-----------+
|database|     tableName|isTemporary|
+--------+--------------+-----------+
| default|appl_stock_csv|      false|
| default|    people_csv|      false|
| default|     sales_csv|      false|
| default|    sales_csv1|      false|
| default|     t_ventas1|      false|
+--------+--------------+-----------+
We write the table to Hive:
df_customer1.write.saveAsTable("t_customer")
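By default saveAsTable fails if the table already exists; a write mode can be set, and the table can be read back with spark.table. A short sketch:
# overwrite the table if it already exists, then read it back
df_customer1.write.mode("overwrite").saveAsTable("t_customer")
spark.table("t_customer").count()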
We will keep publishing more equivalences. To learn more about the equivalences between these two tools, or for training, consulting, or advisory services, contact: formacion@datademy.es